Service Recovery Runbook
Procedures for recovering ECS services and Lambda functions and for resolving container issues.
flowchart TD
Alert["Service Alert"] --> Type{"Symptom?"}
Type -->|"Health check fail"| HC["Check /api/v1/health"]
Type -->|"OOM Kill"| OOM["Check memory limits"]
Type -->|"Crash loop"| CL["Check container logs"]
HC --> Restart["Restart service<br/>just up"]
OOM --> Scale["Increase memory<br/>Update config.py"]
CL --> Logs["just logs service-name"]
Restart --> Verify["Verify health"]
Scale --> Verify
Logs --> Fix["Fix root cause"] --> Verify
Production Operations
These procedures affect production services. Always verify the environment (ENVIRONMENT variable) before executing commands.
Key resource patterns:
- ECS cluster: --cluster tradai-{ENV} (e.g., tradai-dev, tradai-prod)
- Rollback state tracking: the tradai-rollback-state-{ENV} DynamoDB table records service recovery state and rollback history
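The environment check and the naming patterns above could be wrapped in a few small helpers, sketched here. The function names and the default list of allowed environments (dev, prod) are assumptions made for illustration; only the tradai-{ENV} and tradai-rollback-state-{ENV} patterns come from this runbook.

```shell
# Hypothetical helper: fail fast unless ENVIRONMENT names a known environment.
# ALLOWED_ENVIRONMENTS is an assumed override; the default list is a guess.
require_environment() {
  local allowed="${ALLOWED_ENVIRONMENTS:-dev prod}"
  local env_name
  for env_name in $allowed; do
    if [ "${ENVIRONMENT:-}" = "$env_name" ]; then
      return 0
    fi
  done
  echo "ENVIRONMENT='${ENVIRONMENT:-}' is not one of: $allowed" >&2
  return 1
}

# Resource names derived from the patterns listed in this runbook.
cluster_name() { echo "tradai-${ENVIRONMENT}"; }
rollback_table_name() { echo "tradai-rollback-state-${ENVIRONMENT}"; }
```

Calling `require_environment || exit 1` at the top of a recovery script stops it before any command runs against an unset or mistyped environment.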
ECS Service Health Check Failure
Symptoms
- CloudWatch alarm: tradai-{env}-{service}-unhealthy
- Health endpoint returning non-200 status
- ECS console shows tasks in PENDING/STOPPED state
Diagnosis
- Check ECS service status
- Check running tasks
- Check task logs
- Check health endpoint directly
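The four diagnosis steps above might look like the following sketch. It assumes ENVIRONMENT and SERVICE_NAME are set as in the resolution commands. The log group name and the SERVICE_URL variable are hypothetical, since this runbook does not state them; only the /api/v1/health path comes from the flowchart.

```shell
# 1. Service status: task counts plus the five most recent service events.
ecs_service_status() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --query 'services[0].{status:status,desired:desiredCount,running:runningCount,pending:pendingCount,events:events[:5].message}'
}

# 2. Running tasks for the service.
ecs_running_tasks() {
  aws ecs list-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --service-name "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --desired-status RUNNING
}

# 3. Recent task logs. "aws logs tail" needs AWS CLI v2.
#    The default log group name is an assumption; set LOG_GROUP to override.
ecs_task_logs() {
  aws logs tail "${LOG_GROUP:-/ecs/tradai-${SERVICE_NAME}-${ENVIRONMENT}}" --since 15m
}

# 4. Health endpoint. SERVICE_URL is a hypothetical base URL for the service.
check_health() {
  curl --silent --show-error --fail --max-time 10 "${SERVICE_URL}/api/v1/health"
}
```

With `--fail`, curl exits non-zero on an HTTP error status, so `check_health` can be used directly in an `if` condition.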
Resolution
Option 1: Force new deployment
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--force-new-deployment
Option 2: Scale down and up
# Scale to 0
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--desired-count 0
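# Wait for tasks to stop (returns once the running count matches the desired count)
aws ecs wait services-stable \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${SERVICE_NAME}-${ENVIRONMENT}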
# Scale back up
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--desired-count 1
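Option 2 scales back up to a fixed count of 1, which may not match what the service was running before. A possible variant, sketched below, reads the current desired count first and restores it afterwards. The function name is hypothetical, and the sketch assumes the service is not managed by an autoscaling policy that would change the count in the meantime.

```shell
# Hypothetical helper: scale to 0 and back to the original desired count.
restart_by_scaling() {
  local cluster="tradai-${ENVIRONMENT}"
  local service="tradai-${SERVICE_NAME}-${ENVIRONMENT}"
  local original_count

  # Remember how many tasks the service was asked to run.
  original_count=$(aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].desiredCount' --output text) || return 1

  case "$original_count" in
    ''|None|0)
      echo "Could not read a non-zero desired count for $service" >&2
      return 1
      ;;
  esac

  # Scale to 0 and wait until no tasks are running.
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count 0 > /dev/null || return 1
  aws ecs wait services-stable --cluster "$cluster" --services "$service" || return 1

  # Scale back up and wait for the service to settle.
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count "$original_count" > /dev/null || return 1
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```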
Option 3: Rollback to previous task definition
Rollback Caution
Rollbacks may cause data inconsistencies if the new version included database migrations. Verify compatibility before rolling back.
# Get previous task definition.
# deployments[1] exists only while a rollout is still in progress;
# once the new deployment has completed, this query returns "None".
PREV_TASK_DEF=$(aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--query 'services[0].deployments[1].taskDefinition' \
--output text)
# Update service with previous version
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--task-definition "$PREV_TASK_DEF"
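When the failed rollout has already completed, the service reports a single deployment and the query above has nothing to return. One alternative, sketched here, looks up the second-newest active revision of the task definition family instead. It assumes the family is named tradai-${SERVICE_NAME}-${ENVIRONMENT} (as the OOM resolution in this runbook suggests), that the service currently runs the newest revision, and that no other family shares this prefix. The function names are hypothetical.

```shell
# Second-newest ACTIVE revision of the family, or "None" if there is only one.
previous_task_definition() {
  aws ecs list-task-definitions \
    --family-prefix "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --status ACTIVE \
    --sort DESC \
    --query 'taskDefinitionArns[1]' \
    --output text
}

# Roll back only if a previous revision was actually found.
rollback_to_previous() {
  local prev
  prev=$(previous_task_definition) || return 1
  if [ -z "$prev" ] || [ "$prev" = "None" ]; then
    echo "No previous task definition revision found" >&2
    return 1
  fi
  aws ecs update-service \
    --cluster "tradai-${ENVIRONMENT}" \
    --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --task-definition "$prev"
}
```

The guard matters because passing the literal string "None" to update-service would fail with a less obvious error.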
Container Restart Loop
Symptoms
- ECS tasks repeatedly starting and stopping
- CloudWatch logs show crash patterns
- ecs:StoppedReason indicates OOM or exit code != 0
Diagnosis
- Check stopped task reason
- Check for OOM issues:
  - Look for OutOfMemoryError in logs
  - Check container memory limits vs actual usage
- Check for dependency failures:
  - Database connectivity
  - External service availability
  - Secret/config loading
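The stopped-task check above can be sketched as follows, assuming ENVIRONMENT and SERVICE_NAME are set as in the resolution commands. The function name is hypothetical. ECS keeps stopped tasks visible only for a short time, so the list may be empty if the last crash was a while ago.

```shell
# Print the stop reason and per-container exit codes for recently stopped tasks.
stopped_task_reasons() {
  local cluster="tradai-${ENVIRONMENT}"
  local task_arns

  # Up to five recently stopped tasks, as whitespace-separated ARNs.
  task_arns=$(aws ecs list-tasks \
    --cluster "$cluster" \
    --service-name "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --desired-status STOPPED \
    --query 'taskArns[:5]' --output text) || return 1

  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "No recently stopped tasks found"
    return 0
  fi

  # $task_arns is left unquoted on purpose so each ARN becomes its own argument.
  aws ecs describe-tasks \
    --cluster "$cluster" \
    --tasks $task_arns \
    --query 'tasks[].{task:taskArn,stopped:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'
}
```

An exit code of 137 together with a container reason that mentions OutOfMemoryError usually points at the memory limit rather than an application bug.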
Resolution
For OOM issues:
# Update task definition with more memory
# Edit the task definition JSON and register new revision
aws ecs register-task-definition --cli-input-json file://updated-task-def.json
# Update service
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
--task-definition tradai-${SERVICE_NAME}-${ENVIRONMENT}:NEW_REVISION
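One way to produce updated-task-def.json without editing JSON by hand is sketched below. It assumes jq is installed and that memory is set at the task level (as on Fargate, where the value is a string in MiB); a task definition that sets memory per container would need a different jq expression. Both function names are hypothetical.

```shell
# Read a task definition on stdin, drop the read-only fields that
# register-task-definition rejects, and set a new task-level memory value.
with_task_memory() {
  jq --arg memory "$1" '
    del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
        .compatibilities, .registeredAt, .registeredBy)
    | .memory = $memory'
}

# Fetch the latest revision of the family and write the edited copy to disk.
write_updated_task_def() {
  aws ecs describe-task-definition \
    --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --query 'taskDefinition' \
    --output json | with_task_memory "$1" > updated-task-def.json
}
```

For example, `write_updated_task_def 2048` writes a copy with 2048 MiB that the register-task-definition command above can consume. On Fargate the memory value must also be a valid combination with the task's CPU setting.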
For dependency issues:
1. Verify RDS security groups allow ECS tasks
2. Check Secrets Manager permissions
3. Verify VPC endpoints are healthy
Lambda Function Failures
Symptoms
- CloudWatch alarm: tradai-{func-name}-errors
- Lambda invocation errors in CloudWatch metrics
- SNS alert received
Diagnosis
- Check Lambda logs
- Check recent invocations
- Check error metrics
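The Lambda diagnosis steps above might be run as in this sketch. It assumes the function uses the default /aws/lambda/<function-name> log group and the tradai-${FUNCTION_NAME}-${ENVIRONMENT} naming used in the resolution commands. The helper names and the one-hour window are illustrative choices.

```shell
# Recent log events for the function. "aws logs tail" needs AWS CLI v2.
lambda_logs() {
  aws logs tail "/aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT}" --since 1h
}

# Sum of a Lambda metric (e.g. Invocations or Errors) in 5-minute buckets
# over the last hour. Timestamps are passed as Unix epoch seconds.
lambda_metric_sum() {
  local metric="$1"
  local end_time start_time
  end_time=$(date +%s)
  start_time=$((end_time - 3600))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions "Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --period 300 \
    --statistics Sum
}
```

Comparing `lambda_metric_sum Invocations` with `lambda_metric_sum Errors` shows whether every invocation fails or only a fraction of them.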
Common Lambda Issues
| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long-running operation | Increase timeout or optimize |
| Runtime.ImportModuleError | Missing dependency | Check Lambda layer/packaging |
| AccessDenied | IAM permission issue | Update execution role |
| ResourceNotFoundException | Missing DynamoDB table/SNS topic | Verify resource exists |
Resolution
Invoke manually for testing:
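aws lambda invoke \
--function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
--payload '{}' \
--log-type Tail \
--query 'LogResult' \
--output text \
output.json | base64 -d
# The decoded log tail is printed to the terminal; the function's response is written to output.json.
# AWS CLI v2 expects a base64 payload by default; add --cli-binary-format raw-in-base64-out to pass raw JSON.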
Update environment variables:
# --environment replaces the whole variable set, so list every variable the function needs
aws lambda update-function-configuration \
--function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
--environment "Variables={KEY=value}"
Recovery Verification
After any recovery action:
- Verify health endpoint
- Check CloudWatch metrics for 5 minutes to ensure stability
- Run smoke tests (if available)
- Check rollback state in DynamoDB
- Update incident log with actions taken and outcome
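The health check and the rollback-state check above can be sketched as follows. SERVICE_URL is a hypothetical base URL for the service. The default of 30 checks at 10-second intervals is chosen to cover roughly the 5-minute window. The item layout of the rollback table is not described in this runbook, so the sketch only scans a few items instead of looking up a specific key.

```shell
# Poll the health endpoint until it returns success or the attempts run out.
# Usage: wait_for_healthy [attempts] [interval_seconds]
wait_for_healthy() {
  local attempts="${1:-30}"
  local interval="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --silent --fail --max-time 5 --output /dev/null \
        "${SERVICE_URL}/api/v1/health"; then
      echo "healthy after $i check(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts check(s)" >&2
  return 1
}

# Show a few items from the rollback state table for manual inspection.
rollback_state() {
  aws dynamodb scan \
    --table-name "tradai-rollback-state-${ENVIRONMENT}" \
    --max-items 10
}
```

A single successful response ends the wait, so it is still worth watching the CloudWatch metrics for the full window before closing the incident.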