Service Recovery Runbook
Procedures for recovering ECS services, Lambda functions, and container issues.
Production Operations
These procedures affect production services. Always verify the environment (ENVIRONMENT variable) before executing commands.
ECS Service Health Check Failure
Symptoms
- CloudWatch alarm: tradai-{env}-{service}-unhealthy
- Health endpoint returning non-200 status
- ECS console shows tasks in PENDING/STOPPED state
Diagnosis
- Check ECS service status
- Check running tasks
- Check task logs
- Check health endpoint directly
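The commands for these four checks are not listed here, so the following is a minimal sketch rather than the team's actual procedure. It reuses the cluster and service naming from the Resolution commands. The log group name (`/ecs/tradai-<service>-<env>`, overridable via `LOG_GROUP`) and the `HEALTH_URL` variable are assumptions and must be replaced with the real values. The function names are illustrative only.

```shell
# Sketch only: LOG_GROUP and HEALTH_URL are assumptions, not values from this runbook.

# 1. Service status: desired vs. running counts and the latest service events
ecs_service_status() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --query 'services[0].{status:status,desired:desiredCount,running:runningCount,pending:pendingCount,events:events[:5].message}' \
    --output json
}

# 2. Tasks the service is currently trying to keep running
ecs_running_tasks() {
  aws ecs list-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --service-name "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --desired-status RUNNING
}

# 3. Recent task logs (aws logs tail is available in AWS CLI v2)
ecs_task_logs() {
  aws logs tail "${LOG_GROUP:-/ecs/tradai-${SERVICE_NAME}-${ENVIRONMENT}}" --since 30m
}

# 4. Health endpoint: print only the HTTP status code
health_status() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "${HEALTH_URL}"
}
```

Define `ENVIRONMENT`, `SERVICE_NAME`, and `HEALTH_URL` first, then call the functions in order. A gap between desired and running counts in the first check usually points at the task logs in the third.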
Resolution
Option 1: Force new deployment
aws ecs update-service \
  --cluster "tradai-${ENVIRONMENT}" \
  --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
  --force-new-deployment
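Option 2: Scale down and up
# Scale to 0
aws ecs update-service \
  --cluster "tradai-${ENVIRONMENT}" \
  --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
  --desired-count 0
# Wait for tasks to stop (polls until the service is stable instead of a fixed sleep)
aws ecs wait services-stable \
  --cluster "tradai-${ENVIRONMENT}" \
  --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}"
# Scale back up (use the service's normal desired count; 1 is shown here)
aws ecs update-service \
  --cluster "tradai-${ENVIRONMENT}" \
  --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
  --desired-count 1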
Option 3: Rollback to previous task definition
Rollback Caution
Rollbacks may cause data inconsistencies if the new version included database migrations. Verify compatibility before rolling back.
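# Get previous task definition.
# deployments[1] exists only while a rollout is still in progress; once the
# service is steady, only deployments[0] remains and this query returns "None".
PREV_TASK_DEF=$(aws ecs describe-services \
  --cluster "tradai-${ENVIRONMENT}" \
  --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
  --query 'services[0].deployments[1].taskDefinition' \
  --output text)
# Update service with previous version (skipped if none was found)
if [ -n "$PREV_TASK_DEF" ] && [ "$PREV_TASK_DEF" != "None" ]; then
  aws ecs update-service \
    --cluster "tradai-${ENVIRONMENT}" \
    --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --task-definition "$PREV_TASK_DEF"
else
  echo "No previous deployment found; choose a task definition revision manually" >&2
fi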
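When the service has already settled on a single deployment, one fallback is to step back one revision from the task definition currently in use. The sketch below assumes revisions are registered sequentially and that the immediately preceding revision is the last known-good one. Neither is guaranteed, and the earlier revision may have been deregistered. `previous_revision` is a hypothetical helper name, not part of the AWS CLI.

```shell
# Hypothetical helper: given a task definition ARN or "family:revision",
# print "family:<revision - 1>". Fails if there is no earlier revision.
previous_revision() {
  local current="$1"
  local family_rev="${current##*/}"   # drop "arn:...:task-definition/" if present
  local family="${family_rev%:*}"
  local rev="${family_rev##*:}"
  if [ "$rev" -le 1 ] 2>/dev/null; then
    echo "No revision earlier than ${family_rev}" >&2
    return 1
  fi
  echo "${family}:$((rev - 1))"
}

# Usage (not run here):
#   CURRENT=$(aws ecs describe-services \
#     --cluster "tradai-${ENVIRONMENT}" \
#     --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     --query 'services[0].taskDefinition' --output text)
#   PREV_TASK_DEF=$(previous_revision "$CURRENT")
```

Before passing the result to `update-service`, confirm that it is the revision you intend, for example with `aws ecs describe-task-definition --task-definition "$PREV_TASK_DEF"`.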
Container Restart Loop
Symptoms
- ECS tasks repeatedly starting and stopping
- CloudWatch logs show crash patterns
- ecs:StoppedReason indicates OOM or exit code != 0
Diagnosis
- Check stopped task reason
- Check for OOM issues:
  - Look for OutOfMemoryError in logs
  - Check container memory limits vs actual usage
- Check for dependency failures:
  - Database connectivity
  - External service availability
  - Secret/config loading
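The stopped-task check can be sketched as below, assuming the same cluster and service naming used in the Resolution commands. The function name is illustrative. An OOM-killed container typically shows exit code 137 together with an OutOfMemoryError reason, while other non-zero exit codes usually point at an application or dependency failure.

```shell
# Sketch: print the stop reason and per-container exit codes for recently
# stopped tasks of the service. ECS only lists stopped tasks for a limited time.
ecs_stopped_task_reasons() {
  local cluster="tradai-${ENVIRONMENT}"
  local task_arns
  task_arns=$(aws ecs list-tasks \
    --cluster "$cluster" \
    --service-name "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --desired-status STOPPED \
    --query 'taskArns' \
    --output text)
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "No stopped tasks found" >&2
    return 1
  fi
  # $task_arns is intentionally unquoted: each ARN must be a separate argument
  aws ecs describe-tasks \
    --cluster "$cluster" \
    --tasks $task_arns \
    --query 'tasks[].{task:taskArn,stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}
```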
Resolution
For OOM issues:
# Update task definition with more memory:
# edit the task definition JSON, then register it as a new revision
aws ecs register-task-definition --cli-input-json file://updated-task-def.json
# Update service (replace NEW_REVISION with the revision number returned above)
aws ecs update-service \
  --cluster "tradai-${ENVIRONMENT}" \
  --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
  --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}:NEW_REVISION"
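One way to produce updated-task-def.json is to start from the current definition. The output of `describe-task-definition` contains read-only fields that `register-task-definition` rejects, so they have to be removed first. The sketch below assumes memory is set at the task level (as on Fargate, where the value is a string in MiB and must be a valid CPU/memory combination). If the service sets memory per container instead, adjust `containerDefinitions[].memory`. `prepare_task_def` is a hypothetical helper name.

```shell
# Sketch: read `aws ecs describe-task-definition` JSON on stdin and write a
# definition that can be registered, with task-level memory set to $1 (MiB).
prepare_task_def() {
  jq --arg mem "$1" '
    .taskDefinition
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy, .deregisteredAt)
    | .memory = $mem
  '
}

# Usage (not run here):
#   aws ecs describe-task-definition \
#     --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     | prepare_task_def 2048 > updated-task-def.json
```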
For dependency issues:
1. Verify RDS security groups allow inbound traffic from the ECS tasks
2. Check Secrets Manager permissions
3. Verify VPC endpoints are healthy
Lambda Function Failures
Symptoms
- CloudWatch alarm: tradai-{func-name}-errors
- Lambda invocation errors in CloudWatch metrics
- SNS alert received
Diagnosis
- Check Lambda logs
- Check recent invocations
- Check error metrics
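A minimal sketch of these three checks follows. It assumes the function name pattern used in the Resolution commands (`tradai-${FUNCTION_NAME}-${ENVIRONMENT}`) and Lambda's default log group, `/aws/lambda/<function-name>`. The helper names and the one-hour window are illustrative choices.

```shell
# Sketch: recent log events for the function (aws logs tail is AWS CLI v2)
lambda_recent_logs() {
  aws logs tail "/aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT}" --since 1h
}

# Sketch: sum of an AWS/Lambda metric in 5-minute buckets over the last hour.
# Pass "Invocations" for recent invocations or "Errors" for error counts.
lambda_metric_sum() {
  local metric="$1" now
  now=$(date +%s)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions "Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
    --start-time "$((now - 3600))" \
    --end-time "$now" \
    --period 300 \
    --statistics Sum
}
```

Comparing `lambda_metric_sum Invocations` with `lambda_metric_sum Errors` shows whether every call is failing or only a fraction, which helps narrow the cause using the table of common issues.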
Common Lambda Issues
| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long-running operation | Increase timeout or optimize |
| Runtime.ImportModuleError | Missing dependency | Check Lambda layer/packaging |
| AccessDenied | IAM permission issue | Update execution role |
| ResourceNotFoundException | Missing DynamoDB table/SNS topic | Verify resource exists |
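Resolution
Invoke manually for testing:
# --cli-binary-format lets AWS CLI v2 accept a raw JSON payload (omit it on CLI v1).
# With --log-type Tail the base64-encoded log tail is in the command response,
# not in output.json, which holds the function's return value.
aws lambda invoke \
  --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  --log-type Tail \
  --query 'LogResult' \
  --output text \
  output.json | base64 -d
# Inspect the function's return value
cat output.json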
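Update environment variables:
# Caution: --environment replaces the function's entire variable set,
# so include every existing variable along with the changed one.
aws lambda update-function-configuration \
  --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
  --environment "Variables={KEY=value}"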
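Because `--environment` overwrites all existing variables, changing a single key safely means reading the current set and merging into it. The following is one possible sketch using jq. `merge_env_vars` and `set_lambda_env_var` are hypothetical helper names, and the approach assumes the caller is allowed to read the function configuration.

```shell
# Sketch: merge KEY=VALUE into a JSON object of existing variables and print
# the {"Variables": {...}} document that --environment accepts.
merge_env_vars() {
  printf '%s' "$1" | jq -c --arg k "$2" --arg v "$3" \
    '{Variables: ((. // {}) + {($k): $v})}'
}

# Sketch: set one variable on a function while keeping the others.
set_lambda_env_var() {
  local fn="$1" key="$2" value="$3" current
  current=$(aws lambda get-function-configuration \
    --function-name "$fn" \
    --query 'Environment.Variables' \
    --output json) || return 1
  aws lambda update-function-configuration \
    --function-name "$fn" \
    --environment "$(merge_env_vars "$current" "$key" "$value")"
}

# Usage (not run here):
#   set_lambda_env_var "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" KEY value
```

A function with no variables returns `null` for `Environment.Variables`, which the `// {}` fallback treats as an empty set.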
Recovery Verification
After any recovery action:
- Verify the health endpoint returns 200
- Check CloudWatch metrics for 5 minutes to ensure stability
- Run smoke tests (if available)
- Update incident log with actions taken and outcome
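The health check and the five-minute observation window can be combined into a simple polling loop. This is a sketch under assumptions: the health URL is not given in this runbook and must be supplied by the operator, and `wait_for_healthy` is a hypothetical helper. The defaults of 10 attempts, 30 seconds apart, roughly match the five-minute window.

```shell
# Sketch: poll a health URL until it returns HTTP 200 or attempts run out.
# Arguments: URL [attempts=10] [delay_seconds=30]
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-30}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} check(s)"
      return 0
    fi
    echo "check ${i}/${attempts}: HTTP ${code:-none}" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}

# Usage (not run here; HEALTH_URL is an assumption):
#   wait_for_healthy "$HEALTH_URL" || echo "service still unhealthy" >&2
```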