Database Operations Runbook

Procedures for RDS failover, backup/restore, DynamoDB operations, and connection pool issues.

RDS Multi-AZ Failover

Symptoms

  • CloudWatch alarm: tradai-{env}-rds-failover
  • Brief connectivity loss to PostgreSQL
  • MLflow experiments temporarily unavailable
  • ECS services reconnecting

Automatic Failover Detection

RDS Multi-AZ failover is triggered automatically when:

  • The primary instance fails
  • The primary AZ becomes unavailable
  • The instance type is changed (planned maintenance)

Check if failover occurred:

aws rds describe-events \
  --source-identifier tradai-${ENVIRONMENT} \
  --source-type db-instance \
  --duration 1440 \
  --query "Events[?contains(Message, 'failover')]"

Post-Failover Verification

  1. Check RDS status:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,MultiAZ:MultiAZ}'
    

  2. Verify endpoint resolution:

    nslookup tradai-${ENVIRONMENT}.xxxxxx.${AWS_REGION}.rds.amazonaws.com
    

  3. Test connectivity from ECS:

    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container mlflow \
      --interactive \
      --command "psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -c 'SELECT 1'"
    

  4. Check service health:

    curl -f https://${ALB_DNS}/api/v1/health | jq '.dependencies.mlflow'
    
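Steps 2–4 often fail transiently for a minute or two while DNS converges after a failover. A generic retry wrapper (a sketch — `retry_until` is a hypothetical helper, not part of the runbook tooling) keeps the verification loop from flapping:

```shell
# retry_until ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds or ATTEMPTS are exhausted, sleeping DELAY
# seconds between tries. Returns CMD's final success/failure.
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the health endpoint for up to ~2 minutes after failover
# retry_until 24 5 curl -sf "https://${ALB_DNS}/api/v1/health"
```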

Manual Failover (Testing/Planned)

# Initiate failover (Multi-AZ only)
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --force-failover

Point-in-Time Recovery (PITR)

When to Use

  • Accidental data deletion
  • Corrupted data detected
  • Need to recover to specific timestamp

Prerequisites

  • Automated backups enabled (default: 7 days retention)
  • Know the target recovery timestamp

Procedure

  1. Check available recovery window:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{EarliestRestore:EarliestRestorableTime,LatestRestore:LatestRestorableTime}'
    

  2. Create restored instance:

    aws rds restore-db-instance-to-point-in-time \
      --source-db-instance-identifier tradai-${ENVIRONMENT} \
      --target-db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --restore-time "2024-01-15T10:30:00Z" \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait for restoration:

    aws rds wait db-instance-available \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored
    

  4. Verify restored data:

    # Connect to restored instance and verify
    psql -h $RESTORED_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT COUNT(*) FROM experiments"
    

  5. Promote restored instance (if replacing original):

    # Rename original (keep as backup)
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --new-db-instance-identifier tradai-${ENVIRONMENT}-old \
      --apply-immediately
    
    # Wait for the rename to complete (a rename can take a few minutes;
    # confirm with describe-db-instances before proceeding rather than
    # relying on the fixed sleep alone)
    sleep 60
    
    # Rename restored to original name
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --new-db-instance-identifier tradai-${ENVIRONMENT} \
      --apply-immediately
    

  6. Update services to use restored instance (if endpoint changed)
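Before running step 2, it is worth asserting that the requested --restore-time actually falls inside the window reported in step 1. ISO-8601 UTC timestamps of equal precision sort lexicographically, so a plain string comparison is enough; a sketch (the helper name `within_restore_window` is hypothetical):

```shell
# within_restore_window TS EARLIEST LATEST
# Succeeds only if TS lies strictly inside (EARLIEST, LATEST).
# Assumes all three timestamps are ISO-8601 UTC with the same precision,
# so lexicographic comparison matches chronological order.
within_restore_window() {
  [[ "$1" > "$2" && "$1" < "$3" ]]
}

# Example:
# within_restore_window "2024-01-15T10:30:00Z" \
#   "2024-01-10T00:00:00Z" "2024-01-15T12:00:00Z" && echo "restore time OK"
```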


Snapshot Restore

When to Use

  • Full database recovery
  • Creating test environment from production
  • Disaster recovery

Procedure

  1. List available snapshots:

    aws rds describe-db-snapshots \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBSnapshots[*].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}' \
      --output table
    

  2. Restore from snapshot:

    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier tradai-${ENVIRONMENT}-from-snapshot \
      --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-20240115 \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait and verify (same as PITR steps 3-4)
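When several snapshots exist, picking the newest can be scripted. Since ISO snapshot timestamps sort lexicographically, a sketch like this works (hypothetical helper; assumes tab-separated `ID<TAB>Time` lines, e.g. from running the step 1 query with `--output text` and a list projection):

```shell
# latest_snapshot
# Reads tab-separated "SnapshotID<TAB>CreateTime" lines on stdin and
# prints the ID of the most recent one (ISO timestamps sort as strings).
latest_snapshot() {
  sort -t $'\t' -k2 | tail -n 1 | cut -f1
}

# Example:
# aws rds describe-db-snapshots \
#   --db-instance-identifier tradai-${ENVIRONMENT} \
#   --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
#   --output text | latest_snapshot
```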

Create Manual Snapshot

aws rds create-db-snapshot \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)

Connection Pool Exhaustion

Symptoms

  • FATAL: too many connections for role errors
  • Services timing out on database operations
  • Increasing connection count in CloudWatch metrics

Diagnosis

  1. Check current connections:

    # Note: "-v-1H" is BSD/macOS date syntax; on Linux use:
    #   --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name DatabaseConnections \
      --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Maximum
    

  2. Check max connections setting:

    # t4g.micro has ~87 max_connections
    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SHOW max_connections"
    

  3. List active connections:

    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT pid, usename, application_name, client_addr, state, query_start
          FROM pg_stat_activity
          WHERE datname = 'mlflow'
          ORDER BY query_start"
    
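Exhaustion usually means the per-service pool sizes do not add up against max_connections. A quick back-of-envelope check (a sketch; `pool_size_per_service` is a hypothetical helper) shows the safe per-service ceiling:

```shell
# pool_size_per_service MAX_CONNECTIONS SERVICES RESERVED
# Even split of the connection budget across services, after reserving
# RESERVED slots for superuser/admin sessions (RDS keeps a few for itself).
pool_size_per_service() {
  echo $(( ($1 - $3) / $2 ))
}

# Example: t4g.micro (~87 max_connections), 3 services, 7 reserved:
# pool_size_per_service 87 3 7   # → 26
```

If each service's configured pool (including per-task copies when a service runs multiple tasks) exceeds this number, exhaustion is expected behavior, not an anomaly.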

Resolution

Option 1: Terminate idle connections

psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE datname = 'mlflow'
      AND state = 'idle'
      AND pid <> pg_backend_pid()
      AND state_change < NOW() - INTERVAL '5 minutes'"

Option 2: Restart ECS services (releases connection pools)

for SERVICE in backend-api strategy-service mlflow; do
  aws ecs update-service \
    --cluster tradai-${ENVIRONMENT} \
    --service tradai-${SERVICE}-${ENVIRONMENT} \
    --force-new-deployment
done

Option 3: Increase max_connections (requires instance restart)

# Modify the custom parameter group attached to the instance
# (note: each connection costs memory; on a t4g.micro prefer pooling or a
# larger instance over a big max_connections bump)
aws rds modify-db-parameter-group \
  --db-parameter-group-name tradai-${ENVIRONMENT}-params \
  --parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"

# Reboot instance
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}

Option 4: Scale to larger instance

aws rds modify-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-instance-class db.t4g.small \
  --apply-immediately


DynamoDB Operations

Check Table Status

# List all TradAI tables
aws dynamodb list-tables \
  --query "TableNames[?contains(@, 'tradai')]"

# Get table details
aws dynamodb describe-table \
  --table-name tradai-workflow-state-${ENVIRONMENT}

DynamoDB Backup

On-demand backup:

aws dynamodb create-backup \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --backup-name tradai-workflow-state-${ENVIRONMENT}-$(date +%Y%m%d)

List backups:

aws dynamodb list-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT}

DynamoDB Restore

aws dynamodb restore-table-from-backup \
  --target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
  --backup-arn arn:aws:dynamodb:${AWS_REGION}:${ACCOUNT_ID}:table/tradai-workflow-state-${ENVIRONMENT}/backup/...

Point-in-Time Recovery (DynamoDB)

Enable PITR:

aws dynamodb update-continuous-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

Restore to point in time:

aws dynamodb restore-table-to-point-in-time \
  --source-table-name tradai-workflow-state-${ENVIRONMENT} \
  --target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
  --restore-date-time "2024-01-15T10:30:00Z"


Config Versions Table

The tradai-config-versions-{ENV} table tracks configuration version history with two GSIs for efficient querying.

Check config version:

aws dynamodb get-item \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --key '{"strategy_name": {"S": "STRATEGY_NAME"}, "config_id": {"S": "CONFIG_ID"}}'

Query by config hash (detect duplicates):

aws dynamodb query \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --index-name config_hash-index \
  --key-condition-expression "config_hash = :hash" \
  --expression-attribute-values '{":hash": {"S": "HASH_VALUE"}}'
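To produce a `:hash` value for the duplicate-detection query above, something like the following can work. This is only a sketch: it assumes the stored config_hash is the SHA-256 hex digest of the byte-exact config payload, which should be confirmed against the service code (if the service canonicalizes JSON before hashing, this must match that canonicalization).

```shell
# config_hash_of FILE
# SHA-256 hex digest of a config file. ASSUMPTION: the service stores the
# digest of the raw payload bytes; adjust if it hashes a canonicalized form.
config_hash_of() {
  sha256sum "$1" | cut -d ' ' -f1
}

# Example:
# config_hash_of strategy_config.json
```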

Query by status:

aws dynamodb query \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --index-name status-index \
  --key-condition-expression "#status = :status" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":status": {"S": "active"}}'

GSI Verification Procedure

Verify that all DynamoDB tables have their expected Global Secondary Indexes. Run this after infrastructure deployments or when queries against GSIs fail.

Tables with GSIs

  Table                            GSI Name                            Partition Key  Sort Key
  tradai-workflow-state-{ENV}      status-created_at-index             status         created_at
  tradai-workflow-state-{ENV}      trace_id-index                      trace_id       -
  tradai-deployments-{ENV}         environment-created_at-index        environment    created_at
  tradai-shadow-test-state-{ENV}   model_name-status_created_at-index  model_name     status_created_at
  tradai-infra-drift-state-{ENV}   has_drift-last_check-index          has_drift      last_check
  tradai-config-versions-{ENV}     config_hash-index                   config_hash    -
  tradai-config-versions-{ENV}     status-index                        strategy_name  status

Verify GSIs for a table:

aws dynamodb describe-table \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus,Keys:KeySchema}'

Verify all GSIs at once:

for TABLE in workflow-state deployments shadow-test-state infra-drift-state config-versions; do
  echo "=== tradai-${TABLE}-${ENVIRONMENT} ==="
  aws dynamodb describe-table \
    --table-name tradai-${TABLE}-${ENVIRONMENT} \
    --query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus}' \
    --output table 2>/dev/null || echo "  Table not found or no GSIs"
done
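The loop above only prints what exists; to fail loudly when an expected GSI from the table is missing, a small membership check (a sketch — `require_gsi` is a hypothetical helper) can be layered on top of the describe-table output:

```shell
# require_gsi EXPECTED_NAME ACTUAL_NAMES
# Succeeds only if EXPECTED_NAME appears as a whole line in the
# newline-separated list ACTUAL_NAMES.
require_gsi() {
  printf '%s\n' "$2" | grep -qx -- "$1"
}

# Example (assumed query shape; --output text emits tab-separated names):
# ACTUAL=$(aws dynamodb describe-table \
#   --table-name tradai-workflow-state-${ENVIRONMENT} \
#   --query 'Table.GlobalSecondaryIndexes[].IndexName' \
#   --output text | tr '\t' '\n')
# require_gsi status-created_at-index "$ACTUAL" \
#   || echo "MISSING: status-created_at-index"
```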

Verification Checklist

After any database operation:

  • RDS instance status is "available"
  • Database connectivity from all ECS services verified
  • MLflow UI accessible and showing experiments
  • DynamoDB tables accessible
  • No connection errors in CloudWatch logs (last 15 minutes)
  • CloudWatch alarms cleared
  • Connection count within normal range (<50 for t4g.micro)
  • Backtest jobs can read/write workflow state