Operational Runbooks

This directory contains runbooks for responding to alerts and operational issues in the TradAI platform.

Alert Response Matrix

| Alert | Severity | Runbook | First Response |
|---|---|---|---|
| Service Unhealthy | High | Service Recovery | Check CloudWatch logs, force redeploy if needed |
| Lambda Errors | Medium | Service Recovery | Check Lambda logs, verify configuration |
| Model Drift Detected | Medium | Model Drift Response | Assess PSI severity, consider pause |
| Significant Drift (PSI > 0.25) | High | Model Drift Response | Pause trading, trigger retraining |
| Stale Heartbeat | High | Trading Issues | Check ECS task, restart if needed |
| Position Stuck | Critical | Trading Issues | Manual intervention, contact exchange |
| NAT Instance Down | High | Infrastructure Issues | Check instance health, failover |
| RDS Connectivity | High | Infrastructure Issues | Check security groups, VPC endpoints |
| Database Failover | High | Database Operations | Check RDS events, verify connections |
| Connection Pool Exhausted | High | Database Operations | Restart services, check max_connections |
| Credential Exposure | Critical | Security Incidents | Rotate immediately, audit access |
| Unauthorized Access | Critical | Security Incidents | Isolate resource, review CloudTrail |
| High API Latency | Medium | Performance Degradation | Check metrics, identify bottleneck |
| ECS Memory High | Medium | Performance Degradation | Scale service or increase task memory |
| Lambda Timeouts | Medium | Performance Degradation | Increase timeout, check dependencies |
| Workflow Stuck/Failed | High | Debug Workflows | Check Step Functions history, trace via DynamoDB |
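
The two drift rows differ only in the PSI threshold, so the severity decision can be sketched as a small shell helper. This is an illustration, not part of the TradAI tooling: `psi_severity` is a hypothetical name, it assumes a drift alert has already fired, and only the 0.25 cutoff comes from the matrix.

```shell
# Sketch: map a PSI value to the severity listed in the matrix.
# psi_severity is a hypothetical helper; only the 0.25 threshold is from the matrix.
# Assumption: the function is called after a drift alert has already fired.
psi_severity() {
  awk -v psi="$1" 'BEGIN {
    if (psi + 0 > 0.25) {
      # Significant Drift: pause trading, trigger retraining
      print "High"
    } else {
      # Model Drift Detected: assess PSI severity, consider pause
      print "Medium"
    }
  }'
}

psi_severity 0.31
```

The comparison is strict, matching the "PSI > 0.25" wording: a value of exactly 0.25 stays at Medium.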

Escalation Path

  1. L1 - On-call engineer: Initial response, follow runbook
  2. L2 - Senior engineer: Complex issues, architecture decisions
  3. L3 - Platform team: Infrastructure, AWS issues

Alert Channels

  • SNS Topics: tradai-alerts-{env} (critical), tradai-registration-{env} (info)
  • CloudWatch Alarms: Dashboard at https://console.aws.amazon.com/cloudwatch/home?region=eu-central-1#dashboards:name=tradai-{env}
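
To make the `{env}` placeholder concrete, the sketch below expands the names above for one environment. It assumes `{env}` takes the same values as the `ENVIRONMENT` variable (prod, staging, dev); the function names are hypothetical and exist only for illustration.

```shell
# Sketch: expand the {env} placeholder for one environment.
# Assumption: {env} takes the same values as ENVIRONMENT (prod, staging, dev).
# These helpers are hypothetical; they print topic names, not full ARNs.
alerts_topic() { echo "tradai-alerts-$1"; }              # critical alerts
registration_topic() { echo "tradai-registration-$1"; }  # informational events
dashboard_url() {
  echo "https://console.aws.amazon.com/cloudwatch/home?region=eu-central-1#dashboards:name=tradai-$1"
}

# Fall back to dev when ENVIRONMENT is unset, so nothing points at prod by accident.
env_name="${ENVIRONMENT:-dev}"
alerts_topic "$env_name"
registration_topic "$env_name"
dashboard_url "$env_name"
```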

Environment Variables

All commands in these runbooks assume the following environment variables are set:

export AWS_REGION=eu-central-1
export ENVIRONMENT=prod  # or staging, dev
export PROJECT_NAME=tradai
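
Because every command depends on these variables, it may help to check them before starting an incident response. The following is a minimal sketch: `check_env` is a hypothetical helper, and it assumes prod, staging, and dev are the only valid `ENVIRONMENT` values.

```shell
# Sketch: fail early if a required variable is missing or empty.
# check_env is a hypothetical helper, not part of the TradAI tooling.
check_env() {
  missing=""
  for name in AWS_REGION ENVIRONMENT PROJECT_NAME; do
    # Indirect lookup of the variable named in $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
  # Assumption: only these three environments exist.
  case "$ENVIRONMENT" in
    prod|staging|dev) return 0 ;;
    *) echo "Unexpected ENVIRONMENT: $ENVIRONMENT" >&2; return 1 ;;
  esac
}
```

Running `check_env || exit 1` at the top of a script stops it before any AWS command runs against the wrong or an unset environment.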

Additional Resources