Log Investigation Runbook

Procedures for investigating issues using CloudWatch Logs, including common queries, log locations, and correlation ID tracing.

Log Locations

Service Logs

| Service | Log Group | Description |
|---------|-----------|-------------|
| Backend API | /ecs/tradai-backend-{env} | Request handling, orchestration |
| Strategy Service | /ecs/tradai-strategy-service-{env} | Strategy config, MLflow integration |
| Data Collection | /ecs/tradai-data-collection-{env} | Market data fetching, ArcticDB |
| MLflow | /ecs/tradai-mlflow-{env} | Experiment tracking |
| Strategy Tasks | /ecs/tradai-strategy-{env} | Backtest/trading execution |

Lambda Logs

| Function | Log Group |
|----------|-----------|
| Health Check | /aws/lambda/tradai-health-check-{env} |
| Heartbeat Check | /aws/lambda/tradai-heartbeat-check-{env} |
| Orphan Scanner | /aws/lambda/tradai-orphan-scanner-{env} |
| Drift Monitor | /aws/lambda/tradai-drift-monitor-{env} |
| Retraining Scheduler | /aws/lambda/tradai-retraining-scheduler-{env} |
| SQS Consumer | /aws/lambda/tradai-sqs-consumer-{env} |
| Strategy Validator | /aws/lambda/tradai-validate-strategy-{env} |
| Data Proxy | /aws/lambda/tradai-data-proxy-{env} |

Infrastructure Logs

| Resource | Log Group |
|----------|-----------|
| API Gateway | /aws/api-gateway/tradai-{env} |
| VPC Flow Logs | /aws/vpc/flowlogs/tradai-{env} |
| CloudTrail | /aws/cloudtrail/tradai-{env} |
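
To confirm the exact group names in a given account, you can list them by prefix:

aws logs describe-log-groups \
  --log-group-name-prefix /ecs/tradai \
  --query 'logGroups[].logGroupName' \
  --output table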

CloudWatch Logs Insights Queries

Basic Queries

View recent errors across ECS services (run the query against the relevant log groups):

fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error|exception/
| sort @timestamp desc
| limit 100

Filter by correlation ID:

fields @timestamp, @message, @logStream
| filter @message like /CORRELATION_ID_HERE/
| sort @timestamp asc
| limit 500

Find specific error patterns:

fields @timestamp, @message
| filter @message like /ConnectionError|TimeoutError|ValidationError/
| stats count(*) as error_count by bin(5m)

Service-Specific Queries

Backend - Request tracing:

fields @timestamp, @message
| filter @message like /POST|GET|PUT|DELETE/
| filter @message like /api\/v1/
| parse @message /(?<method>POST|GET|PUT|DELETE)\s+(?<path>[^\s]+)\s+(?<status>\d{3})/
| stats count(*) by path, status
| sort count(*) desc
| limit 50

Strategy Service - Backtest status:

fields @timestamp, @message
| filter @message like /backtest|Backtest/
| filter @message like /started|completed|failed|error/
| sort @timestamp desc
| limit 100

Data Collection - Fetch errors:

fields @timestamp, @message
| filter @message like /fetch|OHLCV|exchange/
| filter @message like /error|failed|timeout/
| stats count(*) as errors by bin(15m)

Lambda Queries

Lambda errors with stack traces:

fields @timestamp, @message, @requestId
| filter @message like /ERROR|Exception|Traceback/
| sort @timestamp desc
| limit 50

Lambda cold starts:

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*) by bin(1h)

Lambda duration analysis:

fields @timestamp, @message
| filter @type = "REPORT"
| parse @message /Duration: (?<duration>[\d.]+) ms/
| stats avg(duration), p95(duration), max(duration) by bin(5m)

API Gateway Queries

Request latency analysis:

fields @timestamp, responseLatency, status, path
| stats avg(responseLatency) as avg_latency, p95(responseLatency) as p95_latency by path
| sort avg_latency desc

4xx/5xx errors:

fields @timestamp, status, path, errorMessage
| filter status >= 400
| stats count(*) by status, path
| sort count(*) desc


Correlation ID Tracing

How Correlation IDs Work

Every request to TradAI services gets a unique X-Correlation-ID header that propagates through:

1. API Gateway → Backend
2. Backend → Strategy Service / Data Collection
3. Services → Lambda functions
4. Services → DynamoDB operations
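
If you want the trace handle before the request even runs, you can mint the ID client-side; a minimal sketch, assuming the API honors a client-supplied X-Correlation-ID (if it does not, read the ID back from the response headers as shown below):

# Generate an ID, send it with the request, then search the log groups for it
CORRELATION_ID=$(uuidgen)
curl -s -H "X-Correlation-ID: ${CORRELATION_ID}" https://api.tradai.io/api/v1/health
echo "Search logs for: ${CORRELATION_ID}"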

Finding the Correlation ID

From API response headers:

curl -v https://api.tradai.io/api/v1/health 2>&1 | grep -i x-correlation-id

From CloudWatch logs:

fields @timestamp, @message
| filter @message like /correlation_id|X-Correlation-ID/
| parse @message /"correlation_id":\s*"(?<correlationId>[^"]+)"/
| limit 100

Trace a Request Across Services

Given a correlation ID, trace through all services:

Step 1: Backend logs:

# Log group: /ecs/tradai-backend-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 2: Strategy Service logs:

# Log group: /ecs/tradai-strategy-service-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 3: Lambda logs (if applicable):

# Log group: /aws/lambda/tradai-sqs-consumer-{env}
fields @timestamp, @message, @requestId
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
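
The three steps above can also be run as a single query across all three log groups with aws logs start-query. A sketch, assuming ENVIRONMENT and CID are set in the shell (note that start-query takes epoch seconds, not milliseconds):

QUERY_ID=$(aws logs start-query \
  --log-group-names \
    "/ecs/tradai-backend-${ENVIRONMENT}" \
    "/ecs/tradai-strategy-service-${ENVIRONMENT}" \
    "/aws/lambda/tradai-sqs-consumer-${ENVIRONMENT}" \
  --start-time $(date -u -v-1H +%s) \
  --end-time $(date -u +%s) \
  --query-string "fields @timestamp, @message, @log | filter @message like /${CID}/ | sort @timestamp asc" \
  --output text --query queryId)
sleep 5   # queries run asynchronously; re-run get-query-results until status is Complete
aws logs get-query-results --query-id "${QUERY_ID}"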


Common Error Patterns

Connection Errors

Pattern: ConnectionError, ConnectionRefusedError

fields @timestamp, @message, @logStream
| filter @message like /ConnectionError|ConnectionRefused|ECONNREFUSED/
| parse @message /connecting to (?<target>[^\s]+)/
| stats count(*) by target

Common causes:

- Service not running
- Security group blocking traffic
- VPC endpoint not available
- DNS resolution failure
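
A quick first check for the "service not running" cause (assumes ENVIRONMENT is set):

aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-backend-${ENVIRONMENT} \
  --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'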

Timeout Errors

Pattern: TimeoutError, ReadTimeout

fields @timestamp, @message
| filter @message like /TimeoutError|ReadTimeout|timeout|Timeout/
| parse @message /timeout after (?<seconds>[\d.]+)/
| stats count(*) by bin(5m)

Common causes:

- Slow downstream service
- NAT instance issues
- Database overloaded
- Network congestion
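
To spot-check the NAT instance cause, verify the instance is running; a sketch, where the Name tag value is an assumption about the tagging convention:

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-${ENVIRONMENT}-nat" \
  --query 'Reservations[].Instances[].{id:InstanceId,state:State.Name}' \
  --output table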

Validation Errors

Pattern: ValidationError, Invalid

fields @timestamp, @message
| filter @message like /ValidationError|Invalid|validation failed/
| sort @timestamp desc
| limit 100

Authentication Errors

Pattern: 401, 403, Unauthorized

fields @timestamp, @message
| filter @message like /401|403|Unauthorized|Forbidden|AuthenticationError/
| sort @timestamp desc
| limit 100


Log Tailing (Real-time)

AWS CLI

# Tail ECS service logs
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow

# Tail Lambda logs
aws logs tail /aws/lambda/tradai-health-check-${ENVIRONMENT} --follow

# Filter while tailing
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
  --filter-pattern "ERROR"

# Tail with a shorter output format (timestamp + message)
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
  --format short

Multi-Service Tailing

# Terminal 1
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow

# Terminal 2
aws logs tail /ecs/tradai-strategy-service-${ENVIRONMENT} --follow

# Terminal 3
aws logs tail /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} --follow
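
If three terminals are inconvenient, a rough single-terminal alternative is to background each tail; output from the groups will interleave:

# Ctrl-C stops all background tails via the trap
trap 'kill 0' SIGINT
for lg in /ecs/tradai-backend /ecs/tradai-strategy-service /aws/lambda/tradai-sqs-consumer; do
  aws logs tail "${lg}-${ENVIRONMENT}" --follow --format short &
done
wait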

Exporting Logs for Analysis

Export to S3

# Note: -v-24H is BSD/macOS date syntax (used throughout this runbook);
# with GNU date use e.g. $(date -u -d '24 hours ago' +%s)000
aws logs create-export-task \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --from $(date -u -v-24H +%s)000 \
  --to $(date -u +%s)000 \
  --destination tradai-${ENVIRONMENT}-logs \
  --destination-prefix exports/backend
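
create-export-task prints a taskId, and exports run asynchronously; poll until the status code is COMPLETED (the TASK_ID variable here is illustrative):

aws logs describe-export-tasks --task-id "${TASK_ID}" \
  --query 'exportTasks[0].status'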

Download specific time range

aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --end-time $(date -u +%s)000 \
  --output json > backend-logs-$(date +%Y%m%d-%H%M).json
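
The resulting JSON can then be sliced locally, for example with jq (assumes jq is installed):

# Pull out just the messages and grep for errors
jq -r '.events[].message' backend-logs-*.json | grep -i error | head -20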

Log Level Reference

| Level | Use Case | Investigation Priority |
|-------|----------|------------------------|
| DEBUG | Detailed diagnostics | Low - enable temporarily |
| INFO | Key operations, requests | Normal - standard monitoring |
| WARNING | Recoverable issues | Medium - investigate patterns |
| ERROR | Operation failures | High - immediate investigation |
| CRITICAL | System failures | Critical - immediate action |

Change Log Level (per service)

# Register a new task definition revision with LOG_LEVEL=DEBUG first,
# then roll the service onto it
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --task-definition tradai-backend-${ENVIRONMENT}:NEW_REVISION \
  --force-new-deployment
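
The NEW_REVISION has to exist before the update; a sketch of producing it from the current definition (the exact environment-variable layout depends on your task definition):

aws ecs describe-task-definition \
  --task-definition tradai-backend-${ENVIRONMENT} \
  --query taskDefinition > taskdef.json
# Edit taskdef.json: set LOG_LEVEL=DEBUG under containerDefinitions[].environment and
# delete the read-only fields (taskDefinitionArn, revision, status, requiresAttributes,
# compatibilities, registeredAt, registeredBy), then register the new revision:
aws ecs register-task-definition --cli-input-json file://taskdef.json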

Investigation Workflow

Step-by-Step Process

  1. Identify the symptom: an error message, alert, or user report.

  2. Establish the time range: when did it start, and is it ongoing or resolved?

  3. Find the affected service: decide which service logs to check first and check the health endpoints.

  4. Search for errors (set the query's time range to the incident window):

    fields @timestamp, @message
    | filter @message like /ERROR|Exception/
    | sort @timestamp asc
    | limit 100

  5. Get the correlation ID from the error log or the API response (see the sketch after this list for combining steps 4 and 5 from the CLI).

  6. Trace across services: follow the correlation ID through the log groups.

  7. Identify the root cause: a downstream failure, a configuration issue, or resource exhaustion?

  8. Resolve and verify: apply the fix, then confirm the logs are clean.
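
As mentioned above, steps 4 and 5 can be combined from the CLI; a sketch that pulls recent error messages and tallies any embedded correlation IDs (the "correlation_id" JSON key matches the parse pattern used earlier; adjust if your log format differs):

aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].message' \
  --output text \
  | grep -o '"correlation_id": *"[^"]*"' | sort | uniq -c | sort -rn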

Quick Reference Commands

# Recent errors (last 15 min)
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].message' \
  --output text

# Count errors by log stream
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].logStreamName' \
  --output text | sort | uniq -c | sort -rn

# Get log streams (to find specific task)
aws logs describe-log-streams \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --order-by LastEventTime \
  --descending \
  --limit 5

Verification Checklist

After log investigation:

- [ ] Root cause identified
- [ ] Affected time range documented
- [ ] All impacted requests/users identified
- [ ] Fix applied and verified
- [ ] No new errors in logs (last 15 minutes)
- [ ] Incident timeline documented
- [ ] If recurring, alert/monitoring added