Database Operations Runbook¶
Procedures for RDS failover, backup/restore, DynamoDB operations, and connection pool issues.
RDS Multi-AZ Failover¶
Symptoms¶
- CloudWatch alarm: tradai-{env}-rds-failover
- Brief connectivity loss to PostgreSQL
- MLflow experiments temporarily unavailable
- ECS services reconnecting
Automatic Failover Detection¶
RDS Multi-AZ failover is automatic when:
- Primary instance fails
- AZ becomes unavailable
- Instance type change (planned maintenance)
Check if failover occurred:
aws rds describe-events \
--source-identifier tradai-${ENVIRONMENT} \
--source-type db-instance \
--duration 1440 \
--query "Events[?contains(Message, 'failover')]"
Post-Failover Verification¶
1. Check RDS status:
2. Verify endpoint resolution:
3. Test connectivity from ECS:
4. Check service health:
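The four verification steps above might look like the following sketch. The endpoint lookup, the use of ECS Exec, and the `backend-api` service/container names are assumptions based on the rest of this runbook; adjust them to your environment.

```shell
# 1. Check RDS status (expect "available"; note the new primary AZ)
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone}'

# 2. Verify endpoint resolution (the CNAME should now resolve to the new primary)
RDS_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].Endpoint.Address' --output text)
dig +short "$RDS_ENDPOINT"

# 3. Test connectivity from ECS (assumes ECS Exec is enabled on the task)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ID \
  --container backend-api \
  --interactive \
  --command "psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -c 'SELECT 1'"

# 4. Check service health
aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-backend-api-${ENVIRONMENT} \
  --query 'services[0].{Running:runningCount,Desired:desiredCount}'
```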
Manual Failover (Testing/Planned)¶
# Initiate failover (Multi-AZ only)
aws rds reboot-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--force-failover
Point-in-Time Recovery (PITR)¶
When to Use¶
- Accidental data deletion
- Corrupted data detected
- Need to recover to specific timestamp
Prerequisites¶
- Automated backups enabled (default: 7 days retention)
- Know the target recovery timestamp
Procedure¶
1. Check available recovery window:
2. Create restored instance:

   aws rds restore-db-instance-to-point-in-time \
     --source-db-instance-identifier tradai-${ENVIRONMENT} \
     --target-db-instance-identifier tradai-${ENVIRONMENT}-restored \
     --restore-time "2024-01-15T10:30:00Z" \
     --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
     --vpc-security-group-ids $RDS_SG_ID \
     --db-instance-class db.t4g.micro \
     --no-multi-az
3. Wait for restoration:
4. Verify restored data:
5. Promote restored instance (if replacing original):

   # Rename original (keep as backup)
   aws rds modify-db-instance \
     --db-instance-identifier tradai-${ENVIRONMENT} \
     --new-db-instance-identifier tradai-${ENVIRONMENT}-old \
     --apply-immediately

   # Wait for rename
   sleep 60

   # Rename restored to original name
   aws rds modify-db-instance \
     --db-instance-identifier tradai-${ENVIRONMENT}-restored \
     --new-db-instance-identifier tradai-${ENVIRONMENT} \
     --apply-immediately

6. Update services to use restored instance (if endpoint changed)
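The PITR steps without inline commands (1, 3, and 4) could be sketched as below. The row-count query against MLflow's `experiments` table is a hypothetical sanity check; substitute whatever data you expect to have been recovered.

```shell
# 1. Check available recovery window (--restore-time must be <= LatestRestorableTime)
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].LatestRestorableTime'

# 3. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier tradai-${ENVIRONMENT}-restored

# 4. Verify restored data against the restored instance's endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT}-restored \
  --query 'DBInstances[0].Endpoint.Address' --output text)
psql -h "$RESTORED_ENDPOINT" -U $DB_USER -d mlflow \
  -c "SELECT count(*) FROM experiments"
```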
Snapshot Restore¶
When to Use¶
- Full database recovery
- Creating test environment from production
- Disaster recovery
Procedure¶
1. List available snapshots:
2. Restore from snapshot:

   aws rds restore-db-instance-from-db-snapshot \
     --db-instance-identifier tradai-${ENVIRONMENT}-from-snapshot \
     --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-20240115 \
     --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
     --vpc-security-group-ids $RDS_SG_ID \
     --db-instance-class db.t4g.micro \
     --no-multi-az

3. Wait and verify (same as PITR steps 3-4)
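Step 1 above might look like the following sketch, sorting snapshots newest first so the most recent candidate is at the top:

```shell
# 1. List available snapshots for the instance, newest first
aws rds describe-db-snapshots \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[].{Id:DBSnapshotIdentifier,Created:SnapshotCreateTime,Status:Status}' \
  --output table
```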
Create Manual Snapshot¶
aws rds create-db-snapshot \
--db-instance-identifier tradai-${ENVIRONMENT} \
--db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)
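Snapshot creation is asynchronous, so before relying on the snapshot (e.g. for a restore or before a risky change) you can block until it is ready:

```shell
# Block until the snapshot reaches "available" (creation is asynchronous)
aws rds wait db-snapshot-available \
  --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)
```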
Connection Pool Exhaustion¶
Symptoms¶
- FATAL: too many connections for role errors
- Services timing out on database operations
- Increasing connection count in CloudWatch metrics
Diagnosis¶
1. Check current connections:
2. Check max connections setting:
3. List active connections:
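The three diagnosis steps above are standard `pg_stat_activity` queries; a sketch, reusing the `$RDS_ENDPOINT` and `$DB_USER` variables from the Resolution section:

```shell
# 1. Current connection count
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SELECT count(*) FROM pg_stat_activity"

# 2. Max connections setting
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SHOW max_connections"

# 3. Active connections grouped by database, user, and state
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SELECT datname, usename, state, count(*)
      FROM pg_stat_activity
      GROUP BY datname, usename, state
      ORDER BY count(*) DESC"
```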
Resolution¶
Option 1: Terminate idle connections
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
-c "SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'mlflow'
AND state = 'idle'
AND query_start < NOW() - INTERVAL '5 minutes'"
Option 2: Restart ECS services (releases connection pools)
for SERVICE in backend-api strategy-service mlflow; do
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE}-${ENVIRONMENT} \
--force-new-deployment
done
Option 3: Increase max_connections (requires instance restart)
# Create/modify parameter group
aws rds modify-db-parameter-group \
--db-parameter-group-name tradai-${ENVIRONMENT}-params \
--parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"
# Reboot instance
aws rds reboot-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT}
Option 4: Scale to larger instance
aws rds modify-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--db-instance-class db.t4g.small \
--apply-immediately
DynamoDB Operations¶
Check Table Status¶
# List all TradAI tables
aws dynamodb list-tables \
--query "TableNames[?contains(@, 'tradai')]"
# Get table details
aws dynamodb describe-table \
--table-name tradai-workflow-state-${ENVIRONMENT}
DynamoDB Backup¶
On-demand backup:
aws dynamodb create-backup \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--backup-name tradai-workflow-state-${ENVIRONMENT}-$(date +%Y%m%d)
List backups:
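For example:

```shell
aws dynamodb list-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT}
```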
DynamoDB Restore¶
aws dynamodb restore-table-from-backup \
--target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
--backup-arn arn:aws:dynamodb:${AWS_REGION}:${ACCOUNT_ID}:table/tradai-workflow-state-${ENVIRONMENT}/backup/...
Point-in-Time Recovery (DynamoDB)¶
Enable PITR:
aws dynamodb update-continuous-backups \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
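To confirm PITR is actually enabled (and see the earliest/latest restorable times), you can inspect the table's continuous backups:

```shell
# Confirm PITR status and the recoverable window
aws dynamodb describe-continuous-backups \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription'
```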
Restore to point in time:
aws dynamodb restore-table-to-point-in-time \
--source-table-name tradai-workflow-state-${ENVIRONMENT} \
--target-table-name tradai-workflow-state-${ENVIRONMENT}-restored \
--restore-date-time "2024-01-15T10:30:00Z"
Config Versions Table¶
The tradai-config-versions-{ENV} table tracks configuration version history with two GSIs for efficient querying.
Check config version:¶
aws dynamodb get-item \
--table-name tradai-config-versions-${ENVIRONMENT} \
--key '{"strategy_name": {"S": "STRATEGY_NAME"}, "config_id": {"S": "CONFIG_ID"}}'
Query by config hash (detect duplicates):¶
aws dynamodb query \
--table-name tradai-config-versions-${ENVIRONMENT} \
--index-name config_hash-index \
--key-condition-expression "config_hash = :hash" \
--expression-attribute-values '{":hash": {"S": "HASH_VALUE"}}'
Query by status:¶
Per the GSI table below, status-index is keyed on strategy_name (partition) and status (sort), so the query must include the strategy name:
aws dynamodb query \
--table-name tradai-config-versions-${ENVIRONMENT} \
--index-name status-index \
--key-condition-expression "strategy_name = :name AND #status = :status" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{":name": {"S": "STRATEGY_NAME"}, ":status": {"S": "active"}}'
GSI Verification Procedure¶
Verify that all DynamoDB tables have their expected Global Secondary Indexes. Run this after infrastructure deployments or when queries against GSIs fail.
Tables with GSIs¶
| Table | GSI Name | Partition Key | Sort Key |
|---|---|---|---|
| tradai-workflow-state-{ENV} | status-created_at-index | status | created_at |
| tradai-workflow-state-{ENV} | trace_id-index | trace_id | - |
| tradai-deployments-{ENV} | environment-created_at-index | environment | created_at |
| tradai-shadow-test-state-{ENV} | model_name-status_created_at-index | model_name | status_created_at |
| tradai-infra-drift-state-{ENV} | has_drift-last_check-index | has_drift | last_check |
| tradai-config-versions-{ENV} | config_hash-index | config_hash | - |
| tradai-config-versions-{ENV} | status-index | strategy_name | status |
Verify GSIs for a table:¶
aws dynamodb describe-table \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus,Keys:KeySchema}'
Verify all GSIs at once:¶
for TABLE in workflow-state deployments shadow-test-state infra-drift-state config-versions; do
echo "=== tradai-${TABLE}-${ENVIRONMENT} ==="
aws dynamodb describe-table \
--table-name tradai-${TABLE}-${ENVIRONMENT} \
--query 'Table.GlobalSecondaryIndexes[].{Name:IndexName,Status:IndexStatus}' \
--output table 2>/dev/null || echo " Table not found or no GSIs"
done
Verification Checklist¶
After any database operation:
- RDS instance status is "available"
- Database connectivity from all ECS services verified
- MLflow UI accessible and showing experiments
- DynamoDB tables accessible
- No connection errors in CloudWatch logs (last 15 minutes)
- CloudWatch alarms cleared
- Connection count within normal range (<50 for t4g.micro)
- Backtest jobs can read/write workflow state
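Much of the checklist above can be scripted. A partial sketch (names and the connection-count query are assumptions from earlier sections; the psql step needs network access to the RDS endpoint):

```shell
echo "RDS status:"
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].DBInstanceStatus' --output text

echo "Connection count:"
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -t \
  -c "SELECT count(*) FROM pg_stat_activity"

echo "DynamoDB table status:"
for TABLE in workflow-state deployments shadow-test-state infra-drift-state config-versions; do
  aws dynamodb describe-table \
    --table-name tradai-${TABLE}-${ENVIRONMENT} \
    --query 'Table.TableStatus' --output text 2>/dev/null || echo "tradai-${TABLE}-${ENVIRONMENT}: missing"
done

echo "TradAI alarms still in ALARM:"
aws cloudwatch describe-alarms \
  --state-value ALARM \
  --query "MetricAlarms[?contains(AlarmName, 'tradai')].AlarmName"
```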