Database Operations Runbook

Procedures for RDS failover, backup/restore, DynamoDB operations, and connection pool issues.

RDS Multi-AZ Failover

Symptoms

  • CloudWatch alarm: tradai-{env}-rds-failover
  • Brief connectivity loss to PostgreSQL
  • MLflow experiments temporarily unavailable
  • ECS services reconnecting

Automatic Failover Detection

RDS Multi-AZ failover is triggered automatically when:

  • The primary instance fails
  • The primary AZ becomes unavailable
  • The instance type is changed (planned maintenance)

Check if failover occurred:

aws rds describe-events \
  --source-identifier tradai-${ENVIRONMENT} \
  --source-type db-instance \
  --duration 1440 \
  --query "Events[?contains(Message, 'failover')]"

Post-Failover Verification

  1. Check RDS status:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,MultiAZ:MultiAZ}'
    

  2. Verify endpoint resolution:

    nslookup tradai-${ENVIRONMENT}.xxxxxx.${AWS_REGION}.rds.amazonaws.com
    

  3. Test connectivity from ECS:

    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container mlflow \
      --interactive \
      --command "psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -c 'SELECT 1'"
    

  4. Check service health:

    curl -f https://${ALB_DNS}/api/v1/health | jq '.dependencies.mlflow'
    
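Steps 2–4 often fail transiently for a minute or two while DNS converges after a failover. A generic retry wrapper (a sketch — `retry_until` is a hypothetical helper, not part of the runbook tooling) keeps the verification loop from flapping:

```shell
# retry_until ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds or ATTEMPTS are exhausted, sleeping DELAY
# seconds between tries. Returns CMD's final success/failure.
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the health endpoint for up to ~2 minutes after failover
# retry_until 24 5 curl -sf "https://${ALB_DNS}/api/v1/health"
```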

Manual Failover (Testing/Planned)

# Initiate failover (Multi-AZ only)
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --force-failover

Point-in-Time Recovery (PITR)

When to Use

  • Accidental data deletion
  • Corrupted data detected
  • Need to recover to specific timestamp

Prerequisites

  • Automated backups enabled (default: 7 days retention)
  • Know the target recovery timestamp

Procedure

  1. Check available recovery window:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{EarliestRestore:EarliestRestorableTime,LatestRestore:LatestRestorableTime}'
    

  2. Create restored instance:

    aws rds restore-db-instance-to-point-in-time \
      --source-db-instance-identifier tradai-${ENVIRONMENT} \
      --target-db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --restore-time "2024-01-15T10:30:00Z" \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait for restoration:

    aws rds wait db-instance-available \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored
    

  4. Verify restored data:

    # Connect to restored instance and verify
    psql -h $RESTORED_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT COUNT(*) FROM experiments"
    

  5. Promote restored instance (if replacing original):

    # Rename original (keep as backup)
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --new-db-instance-identifier tradai-${ENVIRONMENT}-old \
      --apply-immediately
    
    # Wait for the rename to complete (a rename can take a few minutes;
    # confirm with describe-db-instances before proceeding rather than
    # relying on the fixed sleep alone)
    sleep 60
    
    # Rename restored to original name
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --new-db-instance-identifier tradai-${ENVIRONMENT} \
      --apply-immediately
    

  6. Update services to use restored instance (if endpoint changed)
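Before running step 2, it is worth asserting that the requested --restore-time actually falls inside the window reported in step 1. ISO-8601 UTC timestamps of equal precision sort lexicographically, so a plain string comparison is enough; a sketch (the helper name `within_restore_window` is hypothetical):

```shell
# within_restore_window TS EARLIEST LATEST
# Succeeds only if TS lies strictly inside (EARLIEST, LATEST).
# Assumes all three timestamps are ISO-8601 UTC with the same precision,
# so lexicographic comparison matches chronological order.
within_restore_window() {
  [[ "$1" > "$2" && "$1" < "$3" ]]
}

# Example:
# within_restore_window "2024-01-15T10:30:00Z" \
#   "2024-01-10T00:00:00Z" "2024-01-15T12:00:00Z" && echo "restore time OK"
```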


Snapshot Restore

When to Use

  • Full database recovery
  • Creating test environment from production
  • Disaster recovery

Procedure

  1. List available snapshots:

    aws rds describe-db-snapshots \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBSnapshots[*].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}' \
      --output table
    

  2. Restore from snapshot:

    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier tradai-${ENVIRONMENT}-from-snapshot \
      --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-20240115 \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait and verify (same as PITR steps 3-4)
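When several snapshots exist, picking the newest can be scripted. Since ISO snapshot timestamps sort lexicographically, a sketch like this works (hypothetical helper; assumes tab-separated `ID<TAB>Time` lines, e.g. from running the step 1 query with `--output text` and a list projection):

```shell
# latest_snapshot
# Reads tab-separated "SnapshotID<TAB>CreateTime" lines on stdin and
# prints the ID of the most recent one (ISO timestamps sort as strings).
latest_snapshot() {
  sort -t $'\t' -k2 | tail -n 1 | cut -f1
}

# Example:
# aws rds describe-db-snapshots \
#   --db-instance-identifier tradai-${ENVIRONMENT} \
#   --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
#   --output text | latest_snapshot
```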

Create Manual Snapshot

aws rds create-db-snapshot \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)

Connection Pool Exhaustion

Symptoms

  • FATAL: too many connections for role errors
  • Services timing out on database operations
  • Increasing connection count in CloudWatch metrics

Diagnosis

  1. Check current connections:

    # Note: "-v-1H" is BSD/macOS date syntax; on Linux use:
    #   --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name DatabaseConnections \
      --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Maximum
    

  2. Check max connections setting:

    # t4g.micro has ~87 max_connections
    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SHOW max_connections"
    

  3. List active connections:

    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT pid, usename, application_name, client_addr, state, query_start
          FROM pg_stat_activity
          WHERE datname = 'mlflow'
          ORDER BY query_start"
    
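Exhaustion usually means the per-service pool sizes do not add up against max_connections. A quick back-of-envelope check (a sketch; `pool_size_per_service` is a hypothetical helper) shows the safe per-service ceiling:

```shell
# pool_size_per_service MAX_CONNECTIONS SERVICES RESERVED
# Even split of the connection budget across services, after reserving
# RESERVED slots for superuser/admin sessions (RDS keeps a few for itself).
pool_size_per_service() {
  echo $(( ($1 - $3) / $2 ))
}

# Example: t4g.micro (~87 max_connections), 3 services, 7 reserved:
# pool_size_per_service 87 3 7   # → 26
```

If each service's configured pool (including per-task copies when a service runs multiple tasks) exceeds this number, exhaustion is expected behavior, not an anomaly.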

Resolution

Option 1: Terminate idle connections

psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE datname = 'mlflow'
      AND state = 'idle'
      AND pid <> pg_backend_pid()
      AND state_change < NOW() - INTERVAL '5 minutes'"

Option 2: Restart ECS services (releases connection pools)

for SERVICE in backend-api strategy-service mlflow; do
  aws ecs update-service \
    --cluster tradai-${ENVIRONMENT} \
    --service tradai-${SERVICE}-${ENVIRONMENT} \
    --force-new-deployment
done

Option 3: Increase max_connections (requires instance restart)

# Modify the custom parameter group attached to the instance
# (note: each connection costs memory; on a t4g.micro prefer pooling or a
# larger instance over a big max_connections bump)
aws rds modify-db-parameter-group \
  --db-parameter-group-name tradai-${ENVIRONMENT}-params \
  --parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"

# Reboot instance
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}

Option 4: Scale to larger instance

aws rds modify-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-instance-class db.t4g.small \
  --apply-immediately


DynamoDB Operations

Check Table Status

# List all TradAI tables
aws dynamodb list-tables \
  --query "TableNames[?contains(@, 'tradai')]"

# Get table details
aws dynamodb describe-table \
  --table-name tradai-workflow-state-${ENVIRONMENT}

DynamoDB Backup

On-demand backup:

aws dynamodb create-backup \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --backup-name tradai-workflow-state-${ENVIRONMENT}-$(date +%Y%m%d)

List backups:

aws dynamodb list-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT}

DynamoDB Restore

aws dynamodb restore-table-from-backup \
  --target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
  --backup-arn arn:aws:dynamodb:${AWS_REGION}:${ACCOUNT_ID}:table/tradai-workflow-state-${ENVIRONMENT}/backup/...

Point-in-Time Recovery (DynamoDB)

Enable PITR:

aws dynamodb update-continuous-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

Restore to point in time:

aws dynamodb restore-table-to-point-in-time \
  --source-table-name tradai-workflow-state-${ENVIRONMENT} \
  --target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
  --restore-date-time "2024-01-15T10:30:00Z"


Config Versions Table

The tradai-config-versions-{ENV} table tracks configuration version history with two GSIs for efficient querying.

Check config version:

aws dynamodb get-item \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --key '{"strategy_name": {"S": "STRATEGY_NAME"}, "config_id": {"S": "CONFIG_ID"}}'

Query by config hash (detect duplicates):

aws dynamodb query \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --index-name config_hash-index \
  --key-condition-expression "config_hash = :hash" \
  --expression-attribute-values '{":hash": {"S": "HASH_VALUE"}}'
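To produce a `:hash` value for the duplicate-detection query above, something like the following can work. This is only a sketch: it assumes the stored config_hash is the SHA-256 hex digest of the byte-exact config payload, which should be confirmed against the service code (if the service canonicalizes JSON before hashing, this must match that canonicalization).

```shell
# config_hash_of FILE
# SHA-256 hex digest of a config file. ASSUMPTION: the service stores the
# digest of the raw payload bytes; adjust if it hashes a canonicalized form.
config_hash_of() {
  sha256sum "$1" | cut -d ' ' -f1
}

# Example:
# config_hash_of strategy_config.json
```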

Query by status:

aws dynamodb query \
  --table-name tradai-config-versions-${ENVIRONMENT} \
  --index-name status-index \
  --key-condition-expression "#status = :status" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":status": {"S": "active"}}'

GSI Verification Procedure

Verify that all DynamoDB tables have their expected Global Secondary Indexes. Run this after infrastructure deployments or when queries against GSIs fail.

Tables with GSIs

  Table                            GSI Name                            Partition Key  Sort Key
  tradai-workflow-state-{ENV}      status-created_at-index             status         created_at
  tradai-workflow-state-{ENV}      trace_id-index                      trace_id       -
  tradai-deployments-{ENV}         environment-created_at-index        environment    created_at
  tradai-shadow-test-state-{ENV}   model_name-status_created_at-index  model_name     status_created_at
  tradai-infra-drift-state-{ENV}   has_drift-last_check-index          has_drift      last_check
  tradai-config-versions-{ENV}     config_hash-index                   config_hash    -
  tradai-config-versions-{ENV}     status-index                        strategy_name  status

Verify GSIs for a table:

aws dynamodb describe-table \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus,Keys:KeySchema}'

Verify all GSIs at once:

for TABLE in workflow-state deployments shadow-test-state infra-drift-state config-versions; do
  echo "=== tradai-${TABLE}-${ENVIRONMENT} ==="
  aws dynamodb describe-table \
    --table-name tradai-${TABLE}-${ENVIRONMENT} \
    --query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus}' \
    --output table 2>/dev/null || echo "  Table not found or no GSIs"
done
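The loop above only prints what exists; to fail loudly when an expected GSI from the table is missing, a small membership check (a sketch — `require_gsi` is a hypothetical helper) can be layered on top of the describe-table output:

```shell
# require_gsi EXPECTED_NAME ACTUAL_NAMES
# Succeeds only if EXPECTED_NAME appears as a whole line in the
# newline-separated list ACTUAL_NAMES.
require_gsi() {
  printf '%s\n' "$2" | grep -qx -- "$1"
}

# Example (assumed query shape; --output text emits tab-separated names):
# ACTUAL=$(aws dynamodb describe-table \
#   --table-name tradai-workflow-state-${ENVIRONMENT} \
#   --query 'Table.GlobalSecondaryIndexes[].IndexName' \
#   --output text | tr '\t' '\n')
# require_gsi status-created_at-index "$ACTUAL" \
#   || echo "MISSING: status-created_at-index"
```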

Verification Checklist

After any database operation:

  • RDS instance status is "available"
  • Database connectivity from all ECS services verified
  • MLflow UI accessible and showing experiments
  • DynamoDB tables accessible
  • No connection errors in CloudWatch logs (last 15 minutes)
  • CloudWatch alarms cleared
  • Connection count within normal range (<50 for t4g.micro)
  • Backtest jobs can read/write workflow state