Service Recovery Runbook

Procedures for recovering ECS services, Lambda functions, and container issues.

Production Operations

These procedures affect production services. Always verify the environment (ENVIRONMENT variable) before executing commands.
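A minimal sketch of a pre-flight guard for the commands in this runbook. The helper name `require_vars` is hypothetical, not part of any existing tooling; it only checks that the named variables are set and non-empty before anything touches AWS.

```shell
# Fail early if any of the named variables is unset or empty.
# Arguments must be valid shell variable names.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirectly read the value of the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Usage:
#   require_vars ENVIRONMENT SERVICE_NAME || exit 1
#   echo "Target cluster: tradai-${ENVIRONMENT}"
```

Echoing the resolved cluster name before running a destructive command gives one more chance to notice a wrong environment.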

ECS Service Health Check Failure

Symptoms

CloudWatch alarm: tradai-{env}-{service}-unhealthy
Health endpoint returning non-200 status
ECS console shows tasks in PENDING/STOPPED state

Diagnosis

Check ECS service status:

aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT}
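The full describe-services output is long. As a sketch, a `--query` filter can narrow it to the fields that usually matter during an incident. The function name `service_summary` and the choice of fields are assumptions; the field names themselves come from the describe-services response.

```shell
# Print a compact status summary for the service.
# Assumes ENVIRONMENT and SERVICE_NAME are set, as elsewhere in this runbook.
service_summary() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --query 'services[0].{status: status, desired: desiredCount, running: runningCount, pending: pendingCount, recentEvents: events[:5].message}' \
    --output json
}
```

The recent service events often state the failure directly, for example failed health checks or task placement errors.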

Check running tasks:

aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT}

Check task logs:

aws logs tail /ecs/tradai-${SERVICE_NAME}-${ENVIRONMENT} --follow

Check health endpoint directly:

# Get the ARN of the first task in the service
TASK_ARN=$(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT} --query 'taskArns[0]' --output text)

# Get task details (with awsvpc networking, these include the task's private IP)
aws ecs describe-tasks --cluster tradai-${ENVIRONMENT} --tasks "$TASK_ARN"
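To actually reach the health endpoint on a single task, the task's private IP has to be extracted from the describe-tasks output. The sketch below assumes awsvpc network mode (the default on Fargate) and a host that can reach the task's subnet. `task_private_ip` and `HEALTH_PORT` are hypothetical names; the health path is the one used under Recovery Verification.

```shell
# Print the private IPv4 address of a task (awsvpc network mode only).
# $1 is a task ARN, for example the TASK_ARN captured above.
task_private_ip() {
  aws ecs describe-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --tasks "$1" \
    --query 'tasks[0].containers[0].networkInterfaces[0].privateIpv4Address' \
    --output text
}

# Usage (HEALTH_PORT is an assumption; use the container's real port):
#   TASK_IP=$(task_private_ip "$TASK_ARN")
#   curl -f "http://${TASK_IP}:${HEALTH_PORT}/api/v1/health"
```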

Resolution

Option 1: Force new deployment

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --force-new-deployment

Option 2: Scale down and up

# Scale to 0
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 0

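# Wait for tasks to stop (blocks until the running count matches the desired count)
aws ecs wait services-stable \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT}

# Scale back up (use the service's normal desired count; 1 is shown here)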
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 1
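The scale-down-and-up steps above hard-code a desired count of 1, which is wrong for any service that normally runs more tasks. A sketch that records the current count first and restores it afterwards; the function name `restart_by_scaling` is hypothetical.

```shell
# Restart a service by scaling to zero and back to its ORIGINAL desired count.
# Assumes ENVIRONMENT and SERVICE_NAME are set.
restart_by_scaling() (
  cluster="tradai-${ENVIRONMENT}"
  service="tradai-${SERVICE_NAME}-${ENVIRONMENT}"

  # Remember the current desired count so it can be restored afterwards.
  original=$(aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].desiredCount' --output text) || exit 1
  case "$original" in
    ''|*[!0-9]*) echo "unexpected desired count: '$original'" >&2; exit 1 ;;
  esac

  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count 0 >/dev/null || exit 1
  # Blocks until the running count matches the desired count (zero here).
  aws ecs wait services-stable --cluster "$cluster" --services "$service" || exit 1

  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count "$original" >/dev/null || exit 1
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
)
```

Scaling to zero causes a full outage for the service, so this fits best when the service is already failing every health check.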

Option 3: Rollback to previous task definition

Rollback Caution

Rollbacks may cause data inconsistencies if the new version included database migrations. Verify compatibility before rolling back.

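# Get previous task definition
# (deployments[1] exists only while a new deployment is still rolling out;
# once the deployment has completed, this query returns None)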
PREV_TASK_DEF=$(aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --query 'services[0].deployments[1].taskDefinition' \
  --output text)

# Update service with previous version
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition "$PREV_TASK_DEF"
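When the failed deployment has already completed, the service lists only one deployment and the query above has nothing to return. One fallback, sketched here, is to derive the preceding revision number from the service's current task definition. This assumes the revision immediately before the current one is the version to return to and is still registered; verify both before using it. The helper name `previous_revision` is hypothetical.

```shell
# Print "<family>:<revision - 1>" for a task definition ARN or family:revision.
previous_revision() (
  current=${1##*/}      # drop the "arn:...:task-definition/" prefix if present
  family=${current%:*}
  revision=${current##*:}
  case "$revision" in
    ''|*[!0-9]*) echo "cannot parse revision from: $1" >&2; exit 1 ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no revision earlier than $current" >&2
    exit 1
  fi
  echo "${family}:$((revision - 1))"
)

# Usage:
#   CURRENT=$(aws ecs describe-services \
#     --cluster "tradai-${ENVIRONMENT}" \
#     --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     --query 'services[0].taskDefinition' --output text)
#   PREV_TASK_DEF=$(previous_revision "$CURRENT")
```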

Container Restart Loop

Symptoms

ECS tasks repeatedly starting and stopping
CloudWatch logs show crash patterns
The task's stoppedReason indicates an out-of-memory kill or a non-zero exit code

Diagnosis

Check stopped task reason:

aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks $(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --desired-status STOPPED --query 'taskArns[0]' --output text)
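The command above fails with an unhelpful error when no stopped task is listed, because the inner query then returns the literal text None. A sketch that guards against that case, scopes the lookup to one service, and prints only the stop reason and per-container exit codes. The function name `stopped_task_summary` is hypothetical.

```shell
# Show why the most recently listed stopped task of a service stopped.
# Assumes ENVIRONMENT and SERVICE_NAME are set.
stopped_task_summary() (
  cluster="tradai-${ENVIRONMENT}"
  service="tradai-${SERVICE_NAME}-${ENVIRONMENT}"

  task_arn=$(aws ecs list-tasks --cluster "$cluster" \
    --service-name "$service" --desired-status STOPPED \
    --query 'taskArns[0]' --output text) || exit 1

  # The CLI prints "None" when the list is empty.
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no stopped tasks found for $service" >&2
    exit 1
  fi

  aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].{stoppedReason: stoppedReason, containers: containers[].{name: name, exitCode: exitCode, reason: reason}}' \
    --output json
)
```

ECS keeps stopped tasks visible only for a limited time after they stop, so run this soon after the failure.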

Check for OOM issues:
- Look for OutOfMemoryError in logs
- Check container memory limits vs actual usage

Check for dependency failures:
- Database connectivity
- External service availability
- Secret/config loading

Resolution

For OOM issues:

# Update task definition with more memory
# Edit the task definition JSON and register new revision
aws ecs register-task-definition --cli-input-json file://updated-task-def.json

# Update service (replace NEW_REVISION with the revision number returned by register-task-definition)
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition tradai-${SERVICE_NAME}-${ENVIRONMENT}:NEW_REVISION
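One way to produce updated-task-def.json is to export the current task definition, drop the read-only fields that register-task-definition rejects, and raise the memory value. The sketch below assumes `jq` is installed and that memory is set at the task level, as on Fargate; for container-level limits the filter would need to target `containerDefinitions` instead. The function name `bump_task_memory` is hypothetical.

```shell
# Read a "taskDefinition" object on stdin and print a registrable copy
# with the task-level memory set to $1 (MiB, stored as a string).
bump_task_memory() {
  jq --arg mem "$1" '
    del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
        .compatibilities, .registeredAt, .registeredBy)
    | .memory = $mem'
}

# Usage:
#   aws ecs describe-task-definition \
#     --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     --query 'taskDefinition' --output json \
#     | bump_task_memory 2048 > updated-task-def.json
```

On Fargate, only certain CPU and memory combinations are valid, so the CPU value may need to change along with the memory.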

For dependency issues:

1. Verify RDS security groups allow ECS tasks
2. Check Secrets Manager permissions
3. Verify VPC endpoints are healthy

Lambda Function Failures

Symptoms

CloudWatch alarm: tradai-{func-name}-errors
Lambda invocation errors in CloudWatch metrics
SNS alert received

Diagnosis

Check Lambda logs:

aws logs tail /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} --follow

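Check function configuration and last update status (get-function does not list invocations; use the logs and metrics for those):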

aws lambda get-function \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT}

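Check error metrics for the last hour (the -v-1H flag is BSD/macOS date syntax; on GNU/Linux use -d '1 hour ago' instead):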

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum
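Because relative-date flags differ between BSD and GNU `date`, a portable start time can be computed from epoch seconds instead. This is a sketch; the helper name `utc_hours_ago` is hypothetical, and it assumes `date +%s` is available, which holds for GNU, BSD, and BusyBox.

```shell
# Print the UTC timestamp N hours ago in the format CloudWatch expects.
utc_hours_ago() (
  epoch=$(( $(date -u +%s) - $1 * 3600 ))
  fmt='+%Y-%m-%dT%H:%M:%SZ'
  # BSD/macOS date formats an epoch with -r; GNU and BusyBox use -d @epoch.
  date -u -r "$epoch" "$fmt" 2>/dev/null || date -u -d "@$epoch" "$fmt"
)

# Usage:
#   --start-time "$(utc_hours_ago 1)" \
#   --end-time "$(utc_hours_ago 0)" \
```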

Common Lambda Issues

| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long-running operation | Increase timeout or optimize |
| Runtime.ImportModuleError | Missing dependency | Check Lambda layer/packaging |
| AccessDenied | IAM permission issue | Update execution role |
| ResourceNotFoundException | Missing DynamoDB table/SNS topic | Verify resource exists |

Resolution

Invoke manually for testing:

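# --cli-binary-format lets AWS CLI v2 accept the JSON payload as raw text.
# The function's response is written to output.json; the base64-encoded log
# tail comes back in the command's own output as LogResult, so decode that.
aws lambda invoke \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  --log-type Tail \
  --query 'LogResult' \
  --output text \
  output.json | base64 -d

# Inspect the function's response
cat output.json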

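Update environment variables (this replaces the function's entire variable set, so include every variable the function needs, not only the one being changed):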

aws lambda update-function-configuration \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --environment "Variables={KEY=value}"
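To change a single variable without dropping the others, the current set can be fetched, merged, and written back. The sketch below assumes `jq` is installed; `merge_env_var` and the `LOG_LEVEL` example are hypothetical.

```shell
# Read a Lambda "Environment" object (or null) on stdin and print a
# {"Variables": {...}} object with key $1 set to value $2.
merge_env_var() {
  jq -c --arg k "$1" --arg v "$2" \
    '{Variables: ((.Variables // {}) + {($k): $v})}'
}

# Usage:
#   current=$(aws lambda get-function-configuration \
#     --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
#     --query 'Environment' --output json)
#   aws lambda update-function-configuration \
#     --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
#     --environment "$(printf '%s' "$current" | merge_env_var LOG_LEVEL debug)"
```

The fetched values may include secrets, so avoid echoing them into shared terminals or logs.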

Recovery Verification

After any recovery action:

Verify health endpoint:

curl -f https://${SERVICE_URL}/api/v1/health

Check CloudWatch metrics for 5 minutes to ensure stability
Run smoke tests (if available):
```
pytest tests/smoke/ -m smoke -v
```
Update incident log with actions taken and outcome
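A single successful health check does not show stability. As a sketch, the health endpoint can be probed repeatedly over the same five-minute window used for the metrics check. The function name `watch_health` and the default interval and attempt count are assumptions.

```shell
# Probe a health URL repeatedly; fail on the first unsuccessful response.
# $1 = URL, $2 = seconds between checks (default 30), $3 = attempts (default 10)
watch_health() (
  url=$1
  interval=${2:-30}
  attempts=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (status 400 and above).
    if ! curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "health check failed on attempt $i of $attempts" >&2
      exit 1
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "healthy for $attempts consecutive checks"
)

# Usage (defaults cover roughly five minutes):
#   watch_health "https://${SERVICE_URL}/api/v1/health"
```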