Service Recovery Runbook

Procedures for recovering ECS services and Lambda functions and for resolving container issues.

Production Operations

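These procedures affect production services. Always verify the target environment (the ENVIRONMENT variable) before executing commands. The commands below also expect SERVICE_NAME (for ECS) or FUNCTION_NAME (for Lambda) to be set in the shell.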
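One way to make the environment check hard to skip is a small guard function run at the start of a session. This is a minimal sketch: the function name `require_env` and the allowed values (dev, staging, prod) are assumptions, not names taken from this runbook, so adjust them to the environments that actually exist.

```shell
# Hypothetical guard: succeed only when ENVIRONMENT is set to a known value.
# The allowed names below are assumptions; replace them with your own.
require_env() {
  case "${ENVIRONMENT:-}" in
    dev|staging|prod)
      echo "Target environment: ${ENVIRONMENT}"
      ;;
    "")
      echo "ENVIRONMENT is not set" >&2
      return 1
      ;;
    *)
      echo "Unknown ENVIRONMENT: ${ENVIRONMENT}" >&2
      return 1
      ;;
  esac
}
```

Running `require_env || exit 1` before any recovery command stops a script early when the variable is missing or mistyped.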

ECS Service Health Check Failure

Symptoms

  • CloudWatch alarm: tradai-{env}-{service}-unhealthy
  • Health endpoint returning non-200 status
  • ECS console shows tasks in PENDING/STOPPED state

Diagnosis

  1. Check ECS service status:

    aws ecs describe-services \
      --cluster tradai-${ENVIRONMENT} \
      --services tradai-${SERVICE_NAME}-${ENVIRONMENT}
    

  2. Check running tasks:

    aws ecs list-tasks \
      --cluster tradai-${ENVIRONMENT} \
      --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT}
    

  3. Check task logs:

    aws logs tail /ecs/tradai-${SERVICE_NAME}-${ENVIRONMENT} --follow
    

  4. Check health endpoint directly:

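    # Get the ARN of the first task in the service
    TASK_ARN=$(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT} --query 'taskArns[0]' --output text)
    
    # Get the task's private IP (assumes awsvpc network mode)
    TASK_IP=$(aws ecs describe-tasks --cluster tradai-${ENVIRONMENT} --tasks "$TASK_ARN" \
      --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value | [0]' --output text)
    
    # Call the health endpoint from a host inside the VPC; CONTAINER_PORT is a placeholder for the service's port
    curl -f "http://${TASK_IP}:${CONTAINER_PORT}/api/v1/health"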
    

Resolution

Option 1: Force new deployment

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --force-new-deployment

Option 2: Scale down and up

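# Scale to 0
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 0

# Wait until all tasks have stopped (more reliable than a fixed sleep)
aws ecs wait services-stable \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT}

# Scale back up (use the service's normal desired count if it is not 1)
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 1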
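Option 2 scales back up to a fixed count of 1, which may not match what the service normally runs. A minimal sketch for recording the current value first is shown here. The helper name `current_desired_count` is hypothetical, and it assumes ENVIRONMENT and SERVICE_NAME are already set.

```shell
# Sketch: read the service's current desired count so it can be restored later.
current_desired_count() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
    --query 'services[0].desiredCount' \
    --output text
}

# Usage: capture the count before scaling to 0, then pass it when scaling back up.
#   ORIGINAL_COUNT=$(current_desired_count)
#   aws ecs update-service \
#     --cluster "tradai-${ENVIRONMENT}" \
#     --service "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     --desired-count "$ORIGINAL_COUNT"
```

If the service is managed by auto scaling, the recorded value is only a snapshot, so it is worth comparing against the configured minimum before restoring it.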

Option 3: Rollback to previous task definition

Rollback Caution

Rollbacks may cause data inconsistencies if the new version included database migrations. Verify compatibility before rolling back.

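# Get previous task definition
# (deployments[1] exists only while a rollout is still in progress;
# once the service has a single deployment this query returns "None")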
PREV_TASK_DEF=$(aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --query 'services[0].deployments[1].taskDefinition' \
  --output text)

# Update service with previous version
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition $PREV_TASK_DEF
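When the service has only one deployment left, the `deployments[1]` query has nothing to return. One possible fallback, sketched here under the assumption that the revision directly below the current one still exists and has not been deregistered, is to derive it from the current task definition ARN. The helper name `previous_revision` is hypothetical.

```shell
# Sketch: derive the previous revision from a task definition ARN or a
# family:revision string by subtracting 1 from the trailing number.
previous_revision() {
  local current="$1"
  local family="${current%:*}"     # everything before the last colon
  local revision="${current##*:}"  # number after the last colon
  if ! [ "$revision" -gt 1 ] 2>/dev/null; then
    echo "No earlier revision for ${current}" >&2
    return 1
  fi
  echo "${family}:$((revision - 1))"
}

# Usage:
#   CURRENT=$(aws ecs describe-services \
#     --cluster "tradai-${ENVIRONMENT}" \
#     --services "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     --query 'services[0].taskDefinition' --output text)
#   PREV_TASK_DEF=$(previous_revision "$CURRENT")
```

The previous revision number is not always the previously deployed one, so confirm it with `aws ecs describe-task-definition` before updating the service.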

Container Restart Loop

Symptoms

  • ECS tasks repeatedly starting and stopping
  • CloudWatch logs show crash patterns
  • The task's stoppedReason field (from describe-tasks) indicates OOM or a non-zero exit code

Diagnosis

  1. Check stopped task reason:

    aws ecs describe-tasks \
      --cluster tradai-${ENVIRONMENT} \
      --tasks $(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT} --desired-status STOPPED --query 'taskArns[0]' --output text)
    

  2. Check for OOM issues:

       • Look for OutOfMemoryError in logs
       • Check container memory limits vs actual usage

  3. Check for dependency failures:

       • Database connectivity
       • External service availability
       • Secret/config loading
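The full `describe-tasks` output is long, and the fields that matter for a restart loop are the stop reason and each container's exit code. The sketch below narrows the output to those fields. The function name `stopped_task_summary` is hypothetical, and it looks at the first stopped task in the cluster rather than a specific service.

```shell
# Sketch: print the stop reason and per-container exit codes of one stopped task.
stopped_task_summary() {
  local cluster="tradai-${ENVIRONMENT}"
  local task_arn
  task_arn=$(aws ecs list-tasks --cluster "$cluster" \
    --desired-status STOPPED --query 'taskArns[0]' --output text)
  # list-tasks prints "None" in text mode when there is no matching task
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "No stopped tasks found in ${cluster}" >&2
    return 1
  fi
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].{stoppedReason: stoppedReason, containers: containers[].{name: name, exitCode: exitCode, reason: reason}}' \
    --output json
}
```

A container killed for exceeding its memory limit typically reports exit code 137 together with an OutOfMemoryError reason, while other non-zero codes usually point at the application or its dependencies.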

Resolution

For OOM issues:

# Update task definition with more memory
# Edit the task definition JSON and register new revision
aws ecs register-task-definition --cli-input-json file://updated-task-def.json

# Update service
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition tradai-${SERVICE_NAME}-${ENVIRONMENT}:NEW_REVISION
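The `updated-task-def.json` file can be produced from the current revision rather than written by hand. The following is a sketch, assuming `jq` is installed and that memory is defined at the task level (as it is on Fargate). The helper name `bump_task_memory` is hypothetical. It strips the read-only fields that `register-task-definition` rejects and sets the new memory value.

```shell
# Sketch: convert `describe-task-definition` output (stdin) into input for
# `register-task-definition`, with the task-level memory set to $1 (MiB).
bump_task_memory() {
  local memory_mib="$1"
  jq --arg mem "$memory_mib" '
    .taskDefinition
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)
    | .memory = $mem
  '
}

# Usage (2048 is an example value, not a recommendation):
#   aws ecs describe-task-definition \
#     --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     | bump_task_memory 2048 > updated-task-def.json
```

On Fargate the memory value has to be one of the sizes allowed for the task's CPU setting, so the CPU value may need to change as well.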

For dependency issues:

  1. Verify RDS security groups allow ECS tasks
  2. Check Secrets Manager permissions
  3. Verify VPC endpoints are healthy


Lambda Function Failures

Symptoms

  • CloudWatch alarm: tradai-{func-name}-errors
  • Lambda invocation errors in CloudWatch metrics
  • SNS alert received

Diagnosis

  1. Check Lambda logs:

    aws logs tail /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} --follow
    

  2. Check function configuration and state (get-function does not list invocations; use the logs and metrics steps for those):

    aws lambda get-function \
      --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT}
    

  3. Check error metrics:

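    # -v-1H is BSD/macOS date syntax; with GNU date (Linux) use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
    aws cloudwatch get-metric-statistics \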
      --namespace AWS/Lambda \
      --metric-name Errors \
      --dimensions Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 300 \
      --statistics Sum
    

Common Lambda Issues

| Error                     | Cause                            | Fix                          |
|---------------------------|----------------------------------|------------------------------|
| Task timed out            | Long-running operation           | Increase timeout or optimize |
| Runtime.ImportModuleError | Missing dependency               | Check Lambda layer/packaging |
| AccessDenied              | IAM permission issue             | Update execution role        |
| ResourceNotFoundException | Missing DynamoDB table/SNS topic | Verify resource exists       |

Resolution

Invoke manually for testing:

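# --cli-binary-format is needed on AWS CLI v2 to pass a raw JSON payload (omit it on v1).
# LogResult is part of the CLI response, not output.json, so decode it from stdout.
aws lambda invoke \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --log-type Tail \
  --query 'LogResult' \
  --output text \
  output.json | base64 -d

# The function's response payload is written to output.json
cat output.json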

Update environment variables:

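# Note: --environment replaces the function's entire variable set, so list every variable the function needs
aws lambda update-function-configuration \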
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --environment "Variables={KEY=value}"
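Because `--environment` overwrites the whole variable map, changing one variable safely means sending the existing ones back as well. A minimal sketch of that merge follows, assuming `jq` is installed. The helper name `merge_lambda_env` and the example variable are hypothetical.

```shell
# Sketch: build an --environment argument that keeps existing variables and
# sets one more. Reads the current Environment.Variables JSON on stdin
# (null when the function has no variables); $1 is the key, $2 the value.
merge_lambda_env() {
  jq -c --arg key "$1" --arg value "$2" \
    '{Variables: ((. // {}) + {($key): $value})}'
}

# Usage (LOG_LEVEL=debug is an illustrative variable, not one this runbook defines):
#   NEW_ENV=$(aws lambda get-function-configuration \
#     --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
#     --query 'Environment.Variables' --output json \
#     | merge_lambda_env LOG_LEVEL debug)
#   aws lambda update-function-configuration \
#     --function-name "tradai-${FUNCTION_NAME}-${ENVIRONMENT}" \
#     --environment "$NEW_ENV"
```

The intermediate JSON contains every variable value in plain text, so avoid echoing it in shared terminals or logs if any of them are sensitive.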


Recovery Verification

After any recovery action:

  1. Verify health endpoint:

    curl -f https://${SERVICE_URL}/api/v1/health
    

  2. Watch CloudWatch metrics for 5 minutes to confirm the service stays stable

  3. Run smoke tests (if available):

    pytest tests/smoke/ -m smoke -v
    

  4. Update incident log with actions taken and outcome
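A single `curl` can fail while new tasks are still starting, so the health check in step 1 is often repeated for a short period. The loop below is one way to sketch that. The function name `wait_healthy` and the default attempt count and delay are assumptions rather than values defined by this runbook.

```shell
# Sketch: poll a health URL until curl succeeds or the attempts run out.
wait_healthy() {
  local url="$1"
  local attempts="${2:-10}"   # assumed default: 10 tries
  local delay="${3:-15}"      # assumed default: 15 seconds between tries
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (status 400 and above)
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    echo "attempt ${i}/${attempts} failed" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_healthy "https://${SERVICE_URL}/api/v1/health" 20 15` retries for roughly five minutes, which lines up with the monitoring window in step 2.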