Operational Runbooks

This directory contains runbooks for responding to alerts and operational issues in the TradAI platform.

Alert Response Matrix

| Alert | Severity | Runbook | First Response |
|-------|----------|---------|----------------|
| Service Unhealthy | High | service-recovery.md | Check CloudWatch logs, force redeploy if needed |
| Lambda Errors | Medium | service-recovery.md | Check Lambda logs, verify configuration |
| Model Drift Detected | Medium | model-drift-response.md | Assess PSI severity, consider pause |
| Significant Drift (PSI > 0.25) | High | model-drift-response.md | Pause trading, trigger retraining |
| Stale Heartbeat | High | trading-issues.md | Check ECS task, restart if needed |
| Position Stuck | Critical | trading-issues.md | Manual intervention, contact exchange |
| NAT Instance Down | High | infrastructure-issues.md | Check instance health, failover |
| RDS Connectivity | High | infrastructure-issues.md | Check security groups, VPC endpoints |
| Database Failover | High | database-operations.md | Check RDS events, verify connections |
| Connection Pool Exhausted | High | database-operations.md | Restart services, check max_connections |
| Credential Exposure | Critical | security-incidents.md | Rotate immediately, audit access |
| Unauthorized Access | Critical | security-incidents.md | Isolate resource, review CloudTrail |
| High API Latency | Medium | performance-degradation.md | Check metrics, identify bottleneck |
| ECS Memory High | Medium | performance-degradation.md | Scale service or increase task memory |
| Lambda Timeouts | Medium | performance-degradation.md | Increase timeout, check dependencies |
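
The two drift rows differ only in the PSI threshold, so the triage decision can be sketched as a small helper. This is an illustration, not part of the platform tooling: `psi_severity` is a hypothetical name, and it assumes a drift alert has already fired and that the PSI value is available as a plain decimal number.

```shell
# Hypothetical helper: map a PSI value from a drift alert to the severity
# listed in the matrix. Only the 0.25 threshold comes from the matrix.
psi_severity() {
  # awk handles the floating-point comparison that plain sh cannot.
  awk -v psi="$1" 'BEGIN {
    if (psi + 0 > 0.25) print "High"     # Significant Drift: pause trading, trigger retraining
    else                print "Medium"   # Model Drift Detected: assess severity, consider pause
  }'
}
```

For example, `psi_severity 0.31` prints `High`. In both cases the follow-up steps are in model-drift-response.md.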

Escalation Path

  1. L1 - On-call engineer: Initial response, follow runbook
  2. L2 - Senior engineer: Complex issues, architecture decisions
  3. L3 - Platform team: Infrastructure, AWS issues

Alert Channels

  • SNS Topics: tradai-{env}-alerts (critical), tradai-{env}-notifications (info)
  • CloudWatch Alarms: Dashboard at https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=tradai-{env}
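
The `{env}` placeholder in these names can be expanded with a few shell helpers. The sketch below is a convenience under stated assumptions: the function names are hypothetical, and the defaults (`tradai`, `us-east-1`) are taken from the names and URL listed above. `find_topic_arn` uses the standard `aws sns list-topics` command and requires AWS credentials, so it is defined here but not called.

```shell
# Hypothetical helpers that expand the naming conventions listed above.
# The argument is the environment name, e.g. prod, staging, or dev.
alert_topic() {
  printf '%s-%s-alerts\n' "${PROJECT_NAME:-tradai}" "$1"
}

notification_topic() {
  printf '%s-%s-notifications\n' "${PROJECT_NAME:-tradai}" "$1"
}

dashboard_url() {
  printf 'https://console.aws.amazon.com/cloudwatch/home?region=%s#dashboards:name=%s-%s\n' \
    "${AWS_REGION:-us-east-1}" "${PROJECT_NAME:-tradai}" "$1"
}

# Look up the full ARN of the critical-alerts topic for an environment.
# Needs the AWS CLI and valid credentials; prints nothing if no topic matches.
find_topic_arn() {
  topic="$(alert_topic "$1")"
  aws sns list-topics \
    --region "${AWS_REGION:-us-east-1}" \
    --query "Topics[?ends_with(TopicArn, ':${topic}')].TopicArn" \
    --output text
}
```

With the defaults, `alert_topic staging` prints `tradai-staging-alerts` and `dashboard_url prod` prints the dashboard link for the prod environment.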

Environment Variables

All commands assume these environment variables are set:

export AWS_REGION=us-east-1
export ENVIRONMENT=prod  # or staging, dev
export PROJECT_NAME=tradai
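
Because every runbook command depends on these variables, a quick preflight check before starting can catch a missing one. The following is a minimal sketch; `check_env` is a hypothetical helper name, not something the runbooks define.

```shell
# Hypothetical preflight check: report any required variable that is
# unset or empty, and return non-zero if at least one is missing.
check_env() {
  missing=0
  for name in AWS_REGION ENVIRONMENT PROJECT_NAME; do
    # Indirect lookup of the variable named in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_env && echo "environment OK"` prints the confirmation only when all three variables are set.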

Additional Resources