
Service Recovery Runbook

Procedures for recovering ECS services and Lambda functions and for resolving container issues.

flowchart TD
    Alert["Service Alert"] --> Type{"Symptom?"}
    Type -->|"Health check fail"| HC["Check /api/v1/health"]
    Type -->|"OOM Kill"| OOM["Check memory limits"]
    Type -->|"Crash loop"| CL["Check container logs"]
    HC --> Restart["Restart service<br/>just up"]
    OOM --> Scale["Increase memory<br/>Update config.py"]
    CL --> Logs["just logs service-name"]
    Restart --> Verify["Verify health"]
    Scale --> Verify
    Logs --> Fix["Fix root cause"] --> Verify

Production Operations

These procedures affect production services. Always verify the environment (ENVIRONMENT variable) before executing commands.

Key resource patterns:

  • ECS cluster: --cluster tradai-{ENV} (e.g., tradai-dev, tradai-prod)
  • Rollback state tracking: tradai-rollback-state-{ENV} DynamoDB table records service recovery state and rollback history
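The environment check described above can be made mechanical with a small guard. This is a minimal sketch, not part of the existing tooling: `check_environment` and the `CONFIRM_PROD` opt-in variable are hypothetical names, and the environment name `prod` is assumed from the `tradai-prod` example.

```shell
# Sketch: refuse to continue unless ENVIRONMENT is set, and require an
# explicit opt-in before targeting production.
# check_environment and CONFIRM_PROD are hypothetical, not existing tooling.
check_environment() {
  if [ -z "${ENVIRONMENT:-}" ]; then
    echo "ENVIRONMENT is not set" >&2
    return 1
  fi
  if [ "$ENVIRONMENT" = "prod" ] && [ "${CONFIRM_PROD:-}" != "yes" ]; then
    echo "Refusing to target tradai-prod without CONFIRM_PROD=yes" >&2
    return 1
  fi
  # Print the cluster name the following commands will act on
  echo "tradai-${ENVIRONMENT}"
}
```

Running `check_environment || exit 1` at the top of a recovery session prints the target cluster name, which gives the operator one last chance to spot a wrong environment.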

ECS Service Health Check Failure

Symptoms

  • CloudWatch alarm: tradai-{env}-{service}-unhealthy
  • Health endpoint returning non-200 status
  • ECS console shows tasks in PENDING/STOPPED state

Diagnosis

  1. Check ECS service status:

    aws ecs describe-services \
      --cluster tradai-${ENVIRONMENT} \
      --services tradai-${SERVICE_NAME}-${ENVIRONMENT}
    

  2. Check running tasks:

    aws ecs list-tasks \
      --cluster tradai-${ENVIRONMENT} \
      --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT}
    

  3. Check task logs:

    aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow
    # Note: Filter by stream prefix per service (e.g., --log-stream-name-prefix ${SERVICE_NAME})
    

  4. Check health endpoint directly:

    # Get the ARN of the first task in the service
    TASK_ARN=$(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --service-name tradai-${SERVICE_NAME}-${ENVIRONMENT} --query 'taskArns[0]' --output text)
    
    # Get task details; for awsvpc tasks the private IP is listed under attachments[].details
    aws ecs describe-tasks --cluster tradai-${ENVIRONMENT} --tasks "$TASK_ARN"
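The diagnosis steps above return large JSON documents. The sketch below narrows them to the fields that usually matter during an incident. It assumes the tasks use awsvpc networking (for example Fargate), where the network interface details appear under `attachments[].details`. The function names are hypothetical.

```shell
# Sketch: summarize a service's state and look up a task's private IP.
# Assumes awsvpc networking; service_summary and task_private_ip are
# hypothetical helper names.
service_summary() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT:?}" \
    --services "tradai-${SERVICE_NAME:?}-${ENVIRONMENT}" \
    --query 'services[0].{status:status,desired:desiredCount,running:runningCount,pending:pendingCount,events:events[:5].message}' \
    --output json
}

task_private_ip() {
  # $1: task ARN, e.g. the TASK_ARN value captured in step 4
  aws ecs describe-tasks \
    --cluster "tradai-${ENVIRONMENT:?}" \
    --tasks "$1" \
    --query "tasks[0].attachments[0].details[?name=='privateIPv4Address'].value | [0]" \
    --output text
}
```

From a host inside the VPC, the health endpoint could then be probed with something like `curl -f "http://$(task_private_ip "$TASK_ARN"):${SERVICE_PORT}/api/v1/health"`, where `SERVICE_PORT` is a placeholder for the container port, which this runbook does not specify.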
    

Resolution

Option 1: Force new deployment

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --force-new-deployment

Option 2: Scale down and up

# Scale to 0
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 0

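# Wait for tasks to stop (blocks until the service reaches a steady state)
aws ecs wait services-stable \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT}

# Scale back up (set --desired-count to the service's normal task count)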
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --desired-count 1
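Scaling back to a fixed count of 1 can leave a service under-provisioned if it normally runs more tasks. A minimal sketch of the same scale-down-and-up sequence that records the current desired count first and restores it afterwards is shown here. `bounce_service` is a hypothetical name. The sketch assumes the service is not managed by an autoscaling policy that would change the count in the meantime.

```shell
# Sketch: restart a service by scaling to 0 and back to its previous count.
# bounce_service is a hypothetical helper, not existing tooling.
bounce_service() {
  cluster="tradai-${ENVIRONMENT:?}"
  service="tradai-${SERVICE_NAME:?}-${ENVIRONMENT}"

  # Remember the current desired count so it can be restored afterwards
  count=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].desiredCount' --output text) || return 1
  case "$count" in
    ''|*[!0-9]*) echo "unexpected desired count: $count" >&2; return 1 ;;
  esac

  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count 0 >/dev/null || return 1
  # Block until the service is steady, i.e. all tasks have stopped
  aws ecs wait services-stable --cluster "$cluster" --services "$service" || return 1

  aws ecs update-service --cluster "$cluster" --service "$service" \
    --desired-count "$count" >/dev/null || return 1
  # Block again until the restored tasks are running
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```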

Option 3: Rollback to previous task definition

Rollback Caution

Rollbacks may cause data inconsistencies if the new version included database migrations. Verify compatibility before rolling back.

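# Get the previous task definition. deployments[1] exists only while the old
# deployment is still draining; once the rollout completes this returns "None".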
PREV_TASK_DEF=$(aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --query 'services[0].deployments[1].taskDefinition' \
  --output text)

# Update service with previous version
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition $PREV_TASK_DEF
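When the failed rollout has already completed, the service lists only one deployment and the lookup above has nothing to return. One alternative, sketched here under assumptions, is to take the second-newest ACTIVE revision of the task definition family. It assumes the family is named `tradai-${SERVICE_NAME}-${ENVIRONMENT}`, as the OOM procedure later in this runbook suggests, and that the second-newest revision is the one that was previously deployed, which is worth confirming before rolling back. `previous_task_definition` is a hypothetical name.

```shell
# Sketch: find the second-newest ACTIVE task definition revision.
# Assumes the family name matches the service name; note that
# --family-prefix also matches other families sharing the same prefix.
previous_task_definition() {
  family="tradai-${SERVICE_NAME:?}-${ENVIRONMENT:?}"
  # With text output the CLI applies --query to each page of results,
  # so keep only the first line (the result for the newest page).
  arn=$(aws ecs list-task-definitions \
    --family-prefix "$family" \
    --status ACTIVE \
    --sort DESC \
    --query 'taskDefinitionArns[1]' \
    --output text | head -n 1)
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no earlier ACTIVE revision found for $family" >&2
    return 1
  fi
  echo "$arn"
}
```

The result can be passed to `aws ecs update-service --task-definition` in place of `$PREV_TASK_DEF`.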

Container Restart Loop

Symptoms

  • ECS tasks repeatedly starting and stopping
  • CloudWatch logs show crash patterns
  • The task's stoppedReason field indicates OOM or a non-zero exit code

Diagnosis

  1. Check stopped task reason:

    aws ecs describe-tasks \
      --cluster tradai-${ENVIRONMENT} \
      --tasks $(aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --desired-status STOPPED --query 'taskArns[0]' --output text)
    

  2. Check for OOM issues:

      • Look for OutOfMemoryError in logs
      • Check container memory limits vs actual usage

  3. Check for dependency failures:

      • Database connectivity
      • External service availability
      • Secret/config loading
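The diagnosis steps above can be narrowed to the few fields that distinguish an OOM kill from an application crash. The sketch below is a rough triage aid, not a definitive classifier: ECS typically reports memory kills in the container `reason` as an `OutOfMemoryError` message, and exit code 137 means the process received SIGKILL, which often but not always points to memory. Both function names are hypothetical.

```shell
# Sketch: show why the most recently listed stopped task of a service stopped.
# stopped_task_report and classify_stop are hypothetical helpers.
stopped_task_report() {
  cluster="tradai-${ENVIRONMENT:?}"
  task=$(aws ecs list-tasks --cluster "$cluster" \
    --service-name "tradai-${SERVICE_NAME:?}-${ENVIRONMENT}" \
    --desired-status STOPPED --query 'taskArns[0]' --output text) || return 1
  if [ -z "$task" ] || [ "$task" = "None" ]; then
    echo "no stopped tasks found" >&2
    return 1
  fi
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
    --query 'tasks[0].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}

# Rough triage of one container's result.
# $1: exit code (may be empty or "None"), $2: reason text
classify_stop() {
  case "$2" in
    *OutOfMemory*) echo "oom"; return 0 ;;
  esac
  case "$1" in
    0) echo "ok" ;;
    137) echo "sigkill" ;;           # killed: often OOM or a missed stop timeout
    ''|None) echo "not-started" ;;   # probably image pull, secrets, or networking
    *) echo "app-error" ;;           # application exited with a failure code
  esac
}
```

For example, `classify_stop 137 ""` prints `sigkill`, which suggests comparing the container memory limit against actual usage before looking at application logs.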

Resolution

For OOM issues:

# Update task definition with more memory
# Edit the task definition JSON and register new revision
aws ecs register-task-definition --cli-input-json file://updated-task-def.json

# Update service
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-${SERVICE_NAME}-${ENVIRONMENT} \
  --task-definition tradai-${SERVICE_NAME}-${ENVIRONMENT}:NEW_REVISION
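One way to produce `updated-task-def.json` is to start from the current definition, drop the read-only fields that `register-task-definition` rejects, and raise the memory value. This sketch assumes `jq` is installed and that memory is set at the task level (as on Fargate, where the value must also be a valid CPU/memory combination). Definitions that set memory per container would need the container entries edited instead. `bump_task_memory` is a hypothetical name.

```shell
# Sketch: turn `aws ecs describe-task-definition` output (stdin) into a
# registerable definition with task-level memory set to $1 (MiB).
# Assumes task-level memory; bump_task_memory is a hypothetical helper.
bump_task_memory() {
  jq --arg mem "$1" '
    .taskDefinition
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)
    | .memory = $mem
  '
}

# Usage (4096 is an example value):
#   aws ecs describe-task-definition \
#     --task-definition "tradai-${SERVICE_NAME}-${ENVIRONMENT}" \
#     | bump_task_memory 4096 > updated-task-def.json
```

The revision number printed by `register-task-definition` is the value to substitute for `NEW_REVISION`.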

For dependency issues:

  1. Verify RDS security groups allow ECS tasks
  2. Check Secrets Manager permissions
  3. Verify VPC endpoints are healthy


Lambda Function Failures

Symptoms

  • CloudWatch alarm: tradai-{func-name}-errors
  • Lambda invocation errors in CloudWatch metrics
  • SNS alert received

Diagnosis

  1. Check Lambda logs:

    aws logs tail /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} --follow
    

  2. Check function configuration and state (get-function returns configuration, not invocation history):

    aws lambda get-function \
      --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT}
    

  3. Check error metrics:

    # Note: -v-1H below is BSD/macOS date syntax; with GNU date (Linux) use -d '1 hour ago'
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Lambda \
      --metric-name Errors \
      --dimensions Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 300 \
      --statistics Sum
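The `date -v-1H` form used for `--start-time` works with BSD `date` (macOS) but not with GNU `date` (most Linux hosts). A small sketch of a helper that tries both forms follows. `utc_hours_ago` is a hypothetical name.

```shell
# Sketch: print the UTC timestamp N hours ago in the format CloudWatch expects.
# Tries the BSD/macOS form first, then falls back to the GNU form.
utc_hours_ago() {
  fmt='+%Y-%m-%dT%H:%M:%SZ'
  date -u -v-"$1"H "$fmt" 2>/dev/null || date -u -d "$1 hours ago" "$fmt"
}
```

With this in place, the metrics command can use `--start-time "$(utc_hours_ago 1)"` on either platform.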
    

Common Lambda Issues

Error                     | Cause                            | Fix
Task timed out            | Long-running operation           | Increase timeout or optimize
Runtime.ImportModuleError | Missing dependency               | Check Lambda layer/packaging
AccessDenied              | IAM permission issue             | Update execution role
ResourceNotFoundException | Missing DynamoDB table/SNS topic | Verify resource exists

Resolution

Invoke manually for testing:

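# --cli-binary-format lets AWS CLI v2 accept raw JSON as the payload (omit it on v1).
# LogResult is part of the CLI response, not output.json, so extract and decode it here.
aws lambda invoke \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  --log-type Tail \
  --query 'LogResult' \
  --output text \
  output.json | base64 -d

# Inspect the function's response payload
cat output.json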

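Update environment variables (note: --environment replaces the function's entire variable set, so include every variable the function needs):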

aws lambda update-function-configuration \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --environment "Variables={KEY=value}"
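Because `update-function-configuration` overwrites the whole variable map, changing a single value safely means reading the current variables first and merging the change into them. A minimal sketch, assuming `jq` is available. `merge_lambda_env` is a hypothetical name.

```shell
# Sketch: merge one key into the existing Lambda environment variables.
# stdin: JSON object of current variables (or null if none are set)
# $1: variable name, $2: new value
# Prints a JSON document suitable for the --environment option.
merge_lambda_env() {
  jq -c --arg k "$1" --arg v "$2" '{Variables: ((. // {}) + {($k): $v})}'
}

# Usage:
#   fn="tradai-${FUNCTION_NAME}-${ENVIRONMENT}"
#   current=$(aws lambda get-function-configuration --function-name "$fn" \
#     --query 'Environment.Variables' --output json)
#   aws lambda update-function-configuration --function-name "$fn" \
#     --environment "$(printf '%s' "$current" | merge_lambda_env KEY value)"
```

The merged JSON contains secrets in plain text if the function stores any in its environment, so avoid echoing it into shared terminals or logs.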


Recovery Verification

After any recovery action:

  1. Verify health endpoint:

    curl -f https://${SERVICE_URL}/api/v1/health
    

  2. Check CloudWatch metrics for 5 minutes to ensure stability

  3. Run smoke tests (if available):

    pytest tests/smoke/ -m smoke -v
    

  4. Check rollback state in DynamoDB:

    aws dynamodb scan \
      --table-name tradai-rollback-state-${ENVIRONMENT} \
      --filter-expression "model_name = :svc" \
      --expression-attribute-values "{\":svc\": {\"S\": \"tradai-${SERVICE_NAME}-${ENVIRONMENT}\"}}"
    

  5. Update incident log with actions taken and outcome
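A single successful health check does not show that the service is stable. The sketch below repeats a check at a fixed interval and fails on the first error, which approximates the five-minute observation window in step 2. `stable_for` is a hypothetical helper, and the interval and count in the usage line are example values.

```shell
# Sketch: run a check repeatedly and fail as soon as one run fails.
# stable_for CHECKS INTERVAL_SECONDS COMMAND [ARGS...]
stable_for() {
  checks=$1
  interval=$2
  shift 2
  n=1
  while [ "$n" -le "$checks" ]; do
    "$@" >/dev/null || { echo "check $n of $checks failed" >&2; return 1; }
    # Sleep between checks, but not after the last one
    if [ "$n" -lt "$checks" ]; then sleep "$interval"; fi
    n=$((n + 1))
  done
}

# Usage: 11 checks spaced 30 seconds apart cover about five minutes
#   stable_for 11 30 curl -fsS "https://${SERVICE_URL}/api/v1/health"
```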