Database Operations Runbook¶
Procedures for RDS failover, backup/restore, DynamoDB operations, and connection pool issues.
RDS Multi-AZ Failover¶
Symptoms¶
- CloudWatch alarm: tradai-{env}-rds-failover
- Brief connectivity loss to PostgreSQL
- MLflow experiments temporarily unavailable
- ECS services reconnecting
Automatic Failover Detection¶
RDS Multi-AZ failover is automatic when:
- Primary instance fails
- AZ becomes unavailable
- Instance type change (planned maintenance)
Check if failover occurred:
aws rds describe-events \
--source-identifier tradai-${ENVIRONMENT} \
--source-type db-instance \
--duration 1440 \
--query "Events[?contains(Message, 'failover')]"
Post-Failover Verification¶
1. Check RDS status
2. Verify endpoint resolution
3. Test connectivity from ECS
4. Check service health

Example commands for these checks are sketched after this list.
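A minimal sketch of these checks; it reuses the $RDS_ENDPOINT and $DB_USER variables from the connection pool procedures below, and assumes psql is run from a host with network access to the database:
# 1. RDS status should be "available"
aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query "DBInstances[0].DBInstanceStatus"
# 2. Endpoint should resolve (the CNAME now points at the new primary)
dig +short $RDS_ENDPOINT
# 3. Basic connectivity check against the mlflow database
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -c "SELECT 1"
# 4. ECS services should be back at their desired counts
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-backend-${ENVIRONMENT} tradai-strategy-service-${ENVIRONMENT} tradai-mlflow-${ENVIRONMENT} \
--query "services[].[serviceName,runningCount,desiredCount]" \
--output table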
Manual Failover (Testing/Planned)¶
# Initiate failover (Multi-AZ only)
aws rds reboot-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--force-failover
Point-in-Time Recovery (PITR)¶
When to Use¶
- Accidental data deletion
- Corrupted data detected
- Need to recover to specific timestamp
Prerequisites¶
- Automated backups enabled (default: 7 days retention)
- Know the target recovery timestamp
Procedure¶
1. Check available recovery window

2. Create restored instance:

aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier tradai-${ENVIRONMENT} \
--target-db-instance-identifier tradai-${ENVIRONMENT}-restored \
--restore-time "2024-01-15T10:30:00Z" \
--db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
--vpc-security-group-ids $RDS_SG_ID \
--db-instance-class db.t4g.micro \
--no-multi-az

3. Wait for restoration

4. Verify restored data

5. Promote restored instance (if replacing original):

# Rename original (keep as backup)
aws rds modify-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--new-db-instance-identifier tradai-${ENVIRONMENT}-old \
--apply-immediately
# Wait for rename
sleep 60
# Rename restored to original name
aws rds modify-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT}-restored \
--new-db-instance-identifier tradai-${ENVIRONMENT} \
--apply-immediately

6. Update services to use restored instance (if endpoint changed)

Example commands for steps 1, 3, and 4 are sketched below.
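A minimal sketch for steps 1, 3, and 4; the spot-check query and table name are examples and should be adapted to your schema:
# Step 1: check the available recovery window
aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query "DBInstances[0].[BackupRetentionPeriod,LatestRestorableTime]"
# Step 3: wait for the restored instance to become available
aws rds wait db-instance-available \
--db-instance-identifier tradai-${ENVIRONMENT}-restored
# Step 4: resolve the restored endpoint and spot-check the data
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT}-restored \
--query "DBInstances[0].Endpoint.Address" --output text)
# "experiments" is an example table; use whatever data you need to validate
psql -h $RESTORED_ENDPOINT -U $DB_USER -d mlflow \
-c "SELECT count(*) FROM experiments"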
Snapshot Restore¶
When to Use¶
- Full database recovery
- Creating test environment from production
- Disaster recovery
Procedure¶
1. List available snapshots

2. Restore from snapshot:

aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier tradai-${ENVIRONMENT}-from-snapshot \
--db-snapshot-identifier tradai-${ENVIRONMENT}-manual-20240115 \
--db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
--vpc-security-group-ids $RDS_SG_ID \
--db-instance-class db.t4g.micro \
--no-multi-az

3. Wait and verify (same as PITR steps 3-4)

An example listing command for step 1 is sketched below.
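A minimal sketch for step 1, listing both manual and automated snapshots for the instance:
aws rds describe-db-snapshots \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query "DBSnapshots[].[DBSnapshotIdentifier,SnapshotCreateTime,Status]" \
--output table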
Create Manual Snapshot¶
aws rds create-db-snapshot \
--db-instance-identifier tradai-${ENVIRONMENT} \
--db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)
Connection Pool Exhaustion¶
Symptoms¶
- "FATAL: too many connections for role" errors
- Services timing out on database operations
- Increasing connection count in CloudWatch metrics
Diagnosis¶
1. Check current connections
2. Check max connections setting
3. List active connections

Example queries for these checks are sketched below.
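A minimal sketch of the diagnostic queries, using the same psql connection pattern as the resolution options below:
# 1. Current connection count
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
-c "SELECT count(*) FROM pg_stat_activity"
# 2. Configured connection limit
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
-c "SHOW max_connections"
# 3. Active connections by database, user, and state
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
-c "SELECT pid, datname, usename, state, query_start FROM pg_stat_activity ORDER BY query_start"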
Resolution¶
Option 1: Terminate idle connections
psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
-c "SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'mlflow'
AND state = 'idle'
AND query_start < NOW() - INTERVAL '5 minutes'"
Option 2: Restart ECS services (releases connection pools)
for SERVICE in backend strategy-service mlflow; do
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE}-${ENVIRONMENT} \
--force-new-deployment
done
Option 3: Increase max_connections (requires instance restart)
# Create/modify parameter group
aws rds modify-db-parameter-group \
--db-parameter-group-name tradai-${ENVIRONMENT}-params \
--parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"
# Reboot instance
aws rds reboot-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT}
Option 4: Scale to larger instance
aws rds modify-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--db-instance-class db.t4g.small \
--apply-immediately
DynamoDB Operations¶
Check Table Status¶
# List all TradAI tables
aws dynamodb list-tables \
--query "TableNames[?contains(@, 'tradai-${ENVIRONMENT}')]"
# Get table details
aws dynamodb describe-table \
--table-name tradai-${ENVIRONMENT}-workflow-state
DynamoDB Backup¶
On-demand backup:
aws dynamodb create-backup \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--backup-name tradai-${ENVIRONMENT}-workflow-state-$(date +%Y%m%d)
List backups:
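A minimal sketch, using the workflow-state table as an example:
aws dynamodb list-backups \
--table-name tradai-${ENVIRONMENT}-workflow-state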
DynamoDB Restore¶
aws dynamodb restore-table-from-backup \
--target-table-name tradai-${ENVIRONMENT}-workflow-state-restored \
--backup-arn arn:aws:dynamodb:${AWS_REGION}:${ACCOUNT_ID}:table/tradai-${ENVIRONMENT}-workflow-state/backup/...
Point-in-Time Recovery (DynamoDB)¶
Enable PITR:
aws dynamodb update-continuous-backups \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
Restore to point in time:
aws dynamodb restore-table-to-point-in-time \
--source-table-name tradai-${ENVIRONMENT}-workflow-state \
--target-table-name tradai-${ENVIRONMENT}-workflow-state-restored \
--restore-date-time "2024-01-15T10:30:00Z"
Verification Checklist¶
After any database operation:
- [ ] RDS instance status is "available"
- [ ] Database connectivity from all ECS services verified
- [ ] MLflow UI accessible and showing experiments
- [ ] DynamoDB tables accessible
- [ ] No connection errors in CloudWatch logs (last 15 minutes)
- [ ] CloudWatch alarms cleared
- [ ] Connection count within normal range (<50 for t4g.micro)
- [ ] Backtest jobs can read/write workflow state
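A rough sketch for the CloudWatch items on this checklist; the log group name is an assumption, and the date arithmetic assumes GNU date:
# Alarms still firing for this environment
aws cloudwatch describe-alarms \
--alarm-name-prefix tradai-${ENVIRONMENT} \
--state-value ALARM \
--query "MetricAlarms[].AlarmName"
# Connection count over the last 15 minutes
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Maximum
# Recent connection errors in service logs (log group name is an assumption)
aws logs filter-log-events \
--log-group-name /ecs/tradai-${ENVIRONMENT}-backend \
--start-time $(( ($(date +%s) - 900) * 1000 )) \
--filter-pattern "connection"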