Operational Runbooks
This directory contains runbooks for responding to alerts and operational issues in the TradAI platform.
Alert Response Matrix
| Alert | Severity | Runbook | First Response |
|---|---|---|---|
| Service Unhealthy | High | Service Recovery | Check CloudWatch logs, force redeploy if needed |
| Lambda Errors | Medium | Service Recovery | Check Lambda logs, verify configuration |
| Model Drift Detected | Medium | Model Drift Response | Assess PSI severity, consider pause |
| Significant Drift (PSI > 0.25) | High | Model Drift Response | Pause trading, trigger retraining |
| Stale Heartbeat | High | Trading Issues | Check ECS task, restart if needed |
| Position Stuck | Critical | Trading Issues | Manual intervention, contact exchange |
| NAT Instance Down | High | Infrastructure Issues | Check instance health, failover |
| RDS Connectivity | High | Infrastructure Issues | Check security groups, VPC endpoints |
| Database Failover | High | Database Operations | Check RDS events, verify connections |
| Connection Pool Exhausted | High | Database Operations | Restart services, check max_connections |
| Credential Exposure | Critical | Security Incidents | Rotate immediately, audit access |
| Unauthorized Access | Critical | Security Incidents | Isolate resource, review CloudTrail |
| High API Latency | Medium | Performance Degradation | Check metrics, identify bottleneck |
| ECS Memory High | Medium | Performance Degradation | Scale service or increase task memory |
| Lambda Timeouts | Medium | Performance Degradation | Increase timeout, check dependencies |
| Workflow Stuck/Failed | High | Debug Workflows | Check Step Functions history, trace via DynamoDB |
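The two drift rows in the matrix differ only in PSI (Population Stability Index) severity. As a minimal sketch of how that split could be evaluated, the snippet below computes PSI with the standard formula over pre-binned proportions and maps the result to the first response listed above. The function names, the `eps` floor, and the sample distributions are illustrative assumptions, not the platform's actual drift code. Only the 0.25 threshold comes from the matrix.

```python
import math
from typing import Sequence

# Threshold taken from the "Significant Drift (PSI > 0.25)" row of the matrix.
SIGNIFICANT_PSI = 0.25


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions (each summing to roughly 1).
    `eps` is an assumed floor that avoids log(0) for empty bins.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def first_response(psi_value: float) -> str:
    """Map a PSI value to the first response in the matrix.

    Assumes a drift alert has already fired; the threshold that raises
    the Medium alert is not stated in this document.
    """
    if psi_value > SIGNIFICANT_PSI:
        return "Pause trading, trigger retraining"
    return "Assess PSI severity, consider pause"


if __name__ == "__main__":
    baseline = [0.5, 0.5]  # hypothetical training-time bin proportions
    live = [0.9, 0.1]      # hypothetical live bin proportions
    value = psi(baseline, live)
    print(f"PSI={value:.3f}: {first_response(value)}")
```

Identical distributions give a PSI of 0, and the value grows as the live distribution moves away from the baseline, so the same function covers both drift rows.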
Escalation Path
- L1 - On-call engineer: Initial response, follow runbook
- L2 - Senior engineer: Complex issues, architecture decisions
- L3 - Platform team: Infrastructure, AWS issues
Alert Channels
- SNS Topics: tradai-alerts-{env} (critical), tradai-registration-{env} (info)
- CloudWatch Alarms: Dashboard at https://console.aws.amazon.com/cloudwatch/home?region=eu-central-1#dashboards:name=tradai-{env}
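The `{env}` placeholder in the names above has to be filled in before a topic or dashboard can be looked up. A small sketch of that substitution follows. The helper name `alert_channels` and the example environment value `dev` are assumptions for illustration, and the templates are copied from the list above.

```python
from typing import Dict

# Templates copied from the Alert Channels list; {env} is the placeholder.
SNS_TOPIC_TEMPLATES = {
    "critical": "tradai-alerts-{env}",
    "info": "tradai-registration-{env}",
}
DASHBOARD_TEMPLATE = (
    "https://console.aws.amazon.com/cloudwatch/home"
    "?region=eu-central-1#dashboards:name=tradai-{env}"
)


def alert_channels(env: str) -> Dict[str, str]:
    """Return the SNS topic names and dashboard URL for one environment."""
    if not env:
        raise ValueError("env must be a non-empty environment name")
    channels = {
        level: template.format(env=env)
        for level, template in SNS_TOPIC_TEMPLATES.items()
    }
    channels["dashboard"] = DASHBOARD_TEMPLATE.format(env=env)
    return channels


if __name__ == "__main__":
    # "dev" is a hypothetical environment name.
    for name, value in alert_channels("dev").items():
        print(f"{name}: {value}")
```

This only builds names and URLs and makes no AWS calls, so it is safe to run anywhere.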
Quick Links
Environment Variables
All commands assume these environment variables are set:
Additional Resources
- Log Investigation Guide — CloudWatch Logs queries and correlation ID tracing
- Performance Degradation — Performance troubleshooting procedures
- Database Operations — Backup, restore, and failover procedures
- Security Incidents — Security incident response procedures
- Debug Workflows — Step Functions and DynamoDB workflow state tracing