Operational Runbooks
This directory contains runbooks for responding to alerts and operational issues in the TradAI platform.
Alert Response Matrix
| Alert | Severity | Runbook | First Response |
|---|---|---|---|
| Service Unhealthy | High | service-recovery.md | Check CloudWatch logs, force redeploy if needed |
| Lambda Errors | Medium | service-recovery.md | Check Lambda logs, verify configuration |
| Model Drift Detected | Medium | model-drift-response.md | Assess PSI (Population Stability Index) severity, consider pausing trading |
| Significant Drift (PSI > 0.25) | High | model-drift-response.md | Pause trading, trigger retraining |
| Stale Heartbeat | High | trading-issues.md | Check ECS task, restart if needed |
| Position Stuck | Critical | trading-issues.md | Manual intervention, contact exchange |
| NAT Instance Down | High | infrastructure-issues.md | Check instance health, fail over |
| RDS Connectivity | High | infrastructure-issues.md | Check security groups, VPC endpoints |
| Database Failover | High | database-operations.md | Check RDS events, verify connections |
| Connection Pool Exhausted | High | database-operations.md | Restart services, check max_connections |
| Credential Exposure | Critical | security-incidents.md | Rotate immediately, audit access |
| Unauthorized Access | Critical | security-incidents.md | Isolate resource, review CloudTrail |
| High API Latency | Medium | performance-degradation.md | Check metrics, identify bottleneck |
| ECS Memory High | Medium | performance-degradation.md | Scale service or increase task memory |
| Lambda Timeouts | Medium | performance-degradation.md | Increase timeout, check dependencies |
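The two drift rows in the matrix differ only in the PSI value, so it can help to see how that value is typically computed. The following is a minimal sketch, assuming PSI is calculated over aligned histogram bins of a baseline sample and a recent sample; the function names, the binning, and the `eps` clamp are illustrative assumptions, not the platform's actual drift detector. Only the 0.25 threshold comes from the matrix.

```python
import math

# Threshold taken from the "Significant Drift (PSI > 0.25)" row of the matrix.
SIGNIFICANT_PSI = 0.25


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over two aligned lists of bin counts.

    Hypothetical helper: the real detector may bin or smooth differently.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("bin counts must be aligned")
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    if exp_total <= 0 or act_total <= 0:
        raise ValueError("each sample needs at least one observation")

    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp proportions away from zero so the logarithm stays defined.
        p = max(e / exp_total, eps)
        q = max(a / act_total, eps)
        total += (q - p) * math.log(q / p)
    return total


def drift_severity(psi_value):
    """Severity for a drift alert that has already fired, per the matrix."""
    return "High" if psi_value > SIGNIFICANT_PSI else "Medium"


if __name__ == "__main__":
    baseline = [50, 50]   # illustrative bin counts, not real data
    recent = [10, 90]
    value = psi(baseline, recent)
    print(f"PSI={value:.3f} severity={drift_severity(value)}")
```

Identical distributions give a PSI of zero, and the value grows as the recent sample shifts away from the baseline, which is why a single threshold can separate the Medium and High rows.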
Escalation Path
- L1 - On-call engineer: Initial response, follow runbook
- L2 - Senior engineer: Complex issues, architecture decisions
- L3 - Platform team: Infrastructure, AWS issues
Alert Channels
- SNS Topics: tradai-{env}-alerts (critical), tradai-{env}-notifications (info)
- CloudWatch Alarms: Dashboard at https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=tradai-{env}
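The `{env}` placeholder in the channel names above is filled in per environment. As a rough sketch of that substitution, the helper below builds the two topic names and the dashboard URL from an environment name; `alert_channels`, the dictionary keys, and the example value `prod` are hypothetical and used only for illustration.

```python
# Region that appears in the dashboard URL listed above.
REGION = "us-east-1"


def alert_channels(env):
    """Return the SNS topic names and dashboard URL for one environment.

    Hypothetical helper: it only applies the naming pattern from this page.
    """
    if not env:
        raise ValueError("env must be a non-empty string")
    prefix = f"tradai-{env}"
    return {
        "critical_topic": f"{prefix}-alerts",
        "info_topic": f"{prefix}-notifications",
        "dashboard_url": (
            "https://console.aws.amazon.com/cloudwatch/home"
            f"?region={REGION}#dashboards:name={prefix}"
        ),
    }


if __name__ == "__main__":
    # "prod" is an assumed environment name, not one confirmed by this page.
    for name, value in alert_channels("prod").items():
        print(f"{name}: {value}")
```

Keeping the pattern in one place avoids typing topic names by hand during an incident, when a wrong environment suffix is easy to miss.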
Quick Links
Environment Variables
All commands assume these environment variables are set:
Additional Resources
- Log Investigation Guide - CloudWatch Logs queries and correlation ID tracing
- Performance Degradation - Performance troubleshooting procedures
- Database Operations - Backup, restore, and failover procedures
- Security Incidents - Security incident response procedures