Database Operations Runbook

Procedures for RDS failover, backup/restore, DynamoDB operations, and connection pool issues.

RDS Multi-AZ Failover

Symptoms

  • CloudWatch alarm: tradai-{env}-rds-failover
  • Brief connectivity loss to PostgreSQL
  • MLflow experiments temporarily unavailable
  • ECS services reconnecting
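
To confirm the alarm actually fired and see why, check its state (alarm name as listed in the symptoms above):

aws cloudwatch describe-alarms \
  --alarm-names tradai-${ENVIRONMENT}-rds-failover \
  --query 'MetricAlarms[0].{State:StateValue,Reason:StateReason}'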

Automatic Failover Detection

RDS Multi-AZ failover is automatic when:

  • The primary instance fails
  • The primary's Availability Zone becomes unavailable
  • The instance type is changed (planned maintenance)

Check if failover occurred:

aws rds describe-events \
  --source-identifier tradai-${ENVIRONMENT} \
  --source-type db-instance \
  --duration 1440 \
  --query "Events[?contains(Message, 'failover')]"

Post-Failover Verification

  1. Check RDS status:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,MultiAZ:MultiAZ}'
    

  2. Verify endpoint resolution:

    nslookup tradai-${ENVIRONMENT}.xxxxxx.${AWS_REGION}.rds.amazonaws.com
    

  3. Test connectivity from ECS:

    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container mlflow \
      --interactive \
      --command "psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow -c 'SELECT 1'"
    

  4. Check service health:

    curl -f https://${ALB_DNS}/api/v1/health | jq '.dependencies.mlflow'
    

Manual Failover (Testing/Planned)

# Initiate failover (Multi-AZ only)
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --force-failover
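
Failover usually completes within a few minutes; one way to watch it finish is to poll the instance status and Availability Zone until the AZ flips:

# Re-run (or wrap in watch) until Status is "available" and AZ has changed
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone}'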

Point-in-Time Recovery (PITR)

When to Use

  • Accidental data deletion
  • Corrupted data detected
  • Need to recover to specific timestamp

Prerequisites

  • Automated backups enabled (default: 7 days retention; verify with the check below)
  • Know the target recovery timestamp
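
To confirm automated backups are enabled and see the current retention window (standard describe-db-instances fields):

aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].{Retention:BackupRetentionPeriod,Window:PreferredBackupWindow}'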

Procedure

  1. Check available recovery window:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{EarliestRestore:EarliestRestorableTime,LatestRestore:LatestRestorableTime}'
    

  2. Create restored instance:

    aws rds restore-db-instance-to-point-in-time \
      --source-db-instance-identifier tradai-${ENVIRONMENT} \
      --target-db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --restore-time "2024-01-15T10:30:00Z" \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait for restoration:

    aws rds wait db-instance-available \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored
    

  4. Verify restored data:

    # Connect to restored instance and verify
    psql -h $RESTORED_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT COUNT(*) FROM experiments"
    

  5. Promote restored instance (if replacing original):

    # Rename original (keep as backup)
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --new-db-instance-identifier tradai-${ENVIRONMENT}-old \
      --apply-immediately
    
    # Wait for the rename to complete (renames can take a few minutes; confirm with describe-db-instances before proceeding)
    sleep 60
    
    # Rename restored to original name
    aws rds modify-db-instance \
      --db-instance-identifier tradai-${ENVIRONMENT}-restored \
      --new-db-instance-identifier tradai-${ENVIRONMENT} \
      --apply-immediately
    

  6. Update services to use the restored instance if the endpoint changed (see the sketch below)
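
A sketch of step 6, assuming the services read the RDS endpoint from an SSM parameter (the parameter path here is hypothetical; adjust to wherever the endpoint is actually stored), followed by a forced redeploy so connection pools reconnect (same loop as Option 2 under Connection Pool Exhaustion):

# Hypothetical parameter path; replace with the real config location
aws ssm put-parameter \
  --name /tradai/${ENVIRONMENT}/rds-endpoint \
  --value "$NEW_RDS_ENDPOINT" \
  --type String \
  --overwrite

for SERVICE in backend strategy-service mlflow; do
  aws ecs update-service \
    --cluster tradai-${ENVIRONMENT} \
    --service tradai-${SERVICE}-${ENVIRONMENT} \
    --force-new-deployment
done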


Snapshot Restore

When to Use

  • Full database recovery
  • Creating test environment from production
  • Disaster recovery

Procedure

  1. List available snapshots:

    aws rds describe-db-snapshots \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBSnapshots[*].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}' \
      --output table
    

  2. Restore from snapshot:

    aws rds restore-db-instance-from-db-snapshot \
      --db-instance-identifier tradai-${ENVIRONMENT}-from-snapshot \
      --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-20240115 \
      --db-subnet-group-name tradai-${ENVIRONMENT}-db-subnet \
      --vpc-security-group-ids $RDS_SG_ID \
      --db-instance-class db.t4g.micro \
      --no-multi-az
    

  3. Wait and verify (same as PITR steps 3-4)

Create Manual Snapshot

aws rds create-db-snapshot \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)
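
Manual snapshots can take several minutes on larger volumes; to block until the snapshot is usable:

aws rds wait db-snapshot-available \
  --db-snapshot-identifier tradai-${ENVIRONMENT}-manual-$(date +%Y%m%d)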

Connection Pool Exhaustion

Symptoms

  • FATAL: too many connections for role errors
  • Services timing out on database operations
  • Increasing connection count in CloudWatch metrics
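
To confirm the error is actually reaching the services, a sketch of a recent-log search (the log group name here is illustrative; substitute the real backend log group):

aws logs filter-log-events \
  --log-group-name /ecs/tradai-${ENVIRONMENT}-backend \
  --start-time $(( ($(date +%s) - 900) * 1000 )) \
  --filter-pattern '"too many connections"' \
  --query 'events[*].message' \
  --output text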

Diagnosis

  1. Check current connections:

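    # -v-1H below is macOS/BSD date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ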
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name DatabaseConnections \
      --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Maximum
    

  2. Check max connections setting:

    # t4g.micro has ~87 max_connections
    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SHOW max_connections"
    

  3. List active connections:

    psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
      -c "SELECT pid, usename, application_name, client_addr, state, query_start
          FROM pg_stat_activity
          WHERE datname = 'mlflow'
          ORDER BY query_start"
    

Resolution

Option 1: Terminate idle connections

psql -h $RDS_ENDPOINT -U $DB_USER -d mlflow \
  -c "SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE datname = 'mlflow'
      AND state = 'idle'
      AND query_start < NOW() - INTERVAL '5 minutes'"

Option 2: Restart ECS services (releases connection pools)

for SERVICE in backend strategy-service mlflow; do
  aws ecs update-service \
    --cluster tradai-${ENVIRONMENT} \
    --service tradai-${SERVICE}-${ENVIRONMENT} \
    --force-new-deployment
done

Option 3: Increase max_connections (requires instance restart)

# Create/modify parameter group
aws rds modify-db-parameter-group \
  --db-parameter-group-name tradai-${ENVIRONMENT}-params \
  --parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"

# Reboot instance
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}
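
The new value only takes effect if the instance is actually attached to this (non-default) parameter group; check the attachment and apply status, which should read "in-sync" after the reboot:

aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].DBParameterGroups[*].{Name:DBParameterGroupName,Status:ParameterApplyStatus}'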

Option 4: Scale to larger instance

aws rds modify-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-instance-class db.t4g.small \
  --apply-immediately


DynamoDB Operations

Check Table Status

# List all TradAI tables
aws dynamodb list-tables \
  --query "TableNames[?contains(@, 'tradai-${ENVIRONMENT}')]"

# Get table details
aws dynamodb describe-table \
  --table-name tradai-${ENVIRONMENT}-workflow-state

DynamoDB Backup

On-demand backup:

aws dynamodb create-backup \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --backup-name tradai-${ENVIRONMENT}-workflow-state-$(date +%Y%m%d)

List backups:

aws dynamodb list-backups \
  --table-name tradai-${ENVIRONMENT}-workflow-state

DynamoDB Restore

aws dynamodb restore-table-from-backup \
  --target-table-name tradai-${ENVIRONMENT}-workflow-state-restored \
  --backup-arn arn:aws:dynamodb:${AWS_REGION}:${ACCOUNT_ID}:table/tradai-${ENVIRONMENT}-workflow-state/backup/...
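
Rather than pasting the full backup ARN by hand, one option is to capture the most recent backup and pass it as --backup-arn (JMESPath over the standard ListBackups response):

BACKUP_ARN=$(aws dynamodb list-backups \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --query 'sort_by(BackupSummaries, &BackupCreationDateTime)[-1].BackupArn' \
  --output text)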

Point-in-Time Recovery (DynamoDB)

Enable PITR:

aws dynamodb update-continuous-backups \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
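
Verify PITR is enabled and see the available restore window:

aws dynamodb describe-continuous-backups \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription.{Status:PointInTimeRecoveryStatus,Earliest:EarliestRestorableDateTime,Latest:LatestRestorableDateTime}'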

Restore to point in time:

aws dynamodb restore-table-to-point-in-time \
  --source-table-name tradai-${ENVIRONMENT}-workflow-state \
  --target-table-name tradai-${ENVIRONMENT}-workflow-state-restored \
  --restore-date-time "2024-01-15T10:30:00Z"


Verification Checklist

After any database operation:

  • [ ] RDS instance status is "available"
  • [ ] Database connectivity from all ECS services verified
  • [ ] MLflow UI accessible and showing experiments
  • [ ] DynamoDB tables accessible
  • [ ] No connection errors in CloudWatch logs (last 15 minutes)
  • [ ] CloudWatch alarms cleared
  • [ ] Connection count within normal range (<50 for t4g.micro)
  • [ ] Backtest jobs can read/write workflow state
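
A minimal sketch that bundles the first few checks into one pass, reusing the commands above (assumes ENVIRONMENT and ALB_DNS are exported):

#!/usr/bin/env bash
set -euo pipefail

# RDS instance status (expect "available")
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].DBInstanceStatus' --output text

# DynamoDB workflow-state table status (expect "ACTIVE")
aws dynamodb describe-table \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --query 'Table.TableStatus' --output text

# Service health, including the MLflow dependency
curl -sf https://${ALB_DNS}/api/v1/health | jq '.dependencies.mlflow'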