Log Investigation Runbook
Procedures for investigating issues using CloudWatch Logs, including common queries, log locations, and correlation ID tracing.
Log Locations
Service Logs
| Service | Log Group | Description |
|---|---|---|
| Backend API | /ecs/tradai-backend-{env} | Request handling, orchestration |
| Strategy Service | /ecs/tradai-strategy-service-{env} | Strategy config, MLflow integration |
| Data Collection | /ecs/tradai-data-collection-{env} | Market data fetching, ArcticDB |
| MLflow | /ecs/tradai-mlflow-{env} | Experiment tracking |
| Strategy Tasks | /ecs/tradai-strategy-{env} | Backtest/trading execution |
Lambda Logs
| Function | Log Group |
|---|---|
| Health Check | /aws/lambda/tradai-health-check-{env} |
| Heartbeat Check | /aws/lambda/tradai-heartbeat-check-{env} |
| Orphan Scanner | /aws/lambda/tradai-orphan-scanner-{env} |
| Drift Monitor | /aws/lambda/tradai-drift-monitor-{env} |
| Retraining Scheduler | /aws/lambda/tradai-retraining-scheduler-{env} |
| SQS Consumer | /aws/lambda/tradai-sqs-consumer-{env} |
| Strategy Validator | /aws/lambda/tradai-validate-strategy-{env} |
| Data Proxy | /aws/lambda/tradai-data-proxy-{env} |
Infrastructure Logs
| Resource | Log Group |
|---|---|
| API Gateway | /aws/api-gateway/tradai-{env} |
| VPC Flow Logs | /aws/vpc/flowlogs/tradai-{env} |
| CloudTrail | /aws/cloudtrail/tradai-{env} |
CloudWatch Logs Insights Queries
Basic Queries
View recent errors across all ECS services:
fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error|exception/
| sort @timestamp desc
| limit 100
Filter by correlation ID:
fields @timestamp, @message, @logStream
| filter @message like /CORRELATION_ID_HERE/
| sort @timestamp asc
| limit 500
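The correlation ID query can also be run from a terminal. The following is a minimal sketch, assuming the AWS CLI is installed and configured; `correlation_query` and `run_insights_query` are hypothetical helper names, not part of the TradAI tooling. The ID is validated before it is spliced into the regex, on the assumption that IDs contain only letters, digits, and hyphens.

```shell
# Build the Logs Insights query for one correlation ID.
# Assumption: IDs contain only letters, digits, and hyphens; anything else is rejected.
correlation_query() {
  case "$1" in
    ''|*[!A-Za-z0-9-]*) echo "invalid correlation ID: $1" >&2; return 1 ;;
  esac
  printf 'fields @timestamp, @message, @logStream | filter @message like /%s/ | sort @timestamp asc | limit 500' "$1"
}

# Run a query against one log group over the last N minutes (default 60)
# and print the results as JSON once the query has finished.
run_insights_query() {
  log_group=$1 query=$2 minutes=${3:-60}
  end=$(date -u +%s)
  start=$((end - minutes * 60))
  # start-query takes epoch seconds, not milliseconds.
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" --end-time "$end" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1
  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done
  aws logs get-query-results --query-id "$query_id" --output json
}
```

A call might look like `run_insights_query "/ecs/tradai-backend-${ENVIRONMENT}" "$(correlation_query "$ID")" 120`, where `ID` holds the correlation ID under investigation.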
Find specific error patterns:
fields @timestamp, @message
| filter @message like /ConnectionError|TimeoutError|ValidationError/
| stats count(*) as error_count by bin(5m)
Service-Specific Queries
Backend - Request tracing:
fields @timestamp, @message
| filter @message like /POST|GET|PUT|DELETE/
| filter @message like /api\/v1/
| parse @message /(?<method>POST|GET|PUT|DELETE)\s+(?<path>[^\s]+)\s+(?<status>\d{3})/
| stats count(*) as request_count by path, status
| sort request_count desc
| limit 50
Strategy Service - Backtest status:
fields @timestamp, @message
| filter @message like /backtest|Backtest/
| filter @message like /started|completed|failed|error/
| sort @timestamp desc
| limit 100
Data Collection - Fetch errors:
fields @timestamp, @message
| filter @message like /fetch|OHLCV|exchange/
| filter @message like /error|failed|timeout/
| stats count(*) as errors by bin(15m)
Lambda Queries
Lambda errors with stack traces:
fields @timestamp, @message, @requestId
| filter @message like /ERROR|Exception|Traceback/
| sort @timestamp desc
| limit 50
Lambda cold starts:
fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*) by bin(1h)
Lambda duration analysis:
fields @timestamp, @message
| filter @type = "REPORT"
| parse @message /Duration: (?<duration>[\d.]+) ms/
| stats avg(duration), pct(duration, 95), max(duration) by bin(5m)
API Gateway Queries
Request latency analysis:
fields @timestamp, responseLatency, status, path
| stats avg(responseLatency) as avg_latency, pct(responseLatency, 95) as p95_latency by path
| sort avg_latency desc
4xx/5xx errors:
fields @timestamp, status, path, errorMessage
| filter status >= 400
| stats count(*) as error_count by status, path
| sort error_count desc
Correlation ID Tracing
How Correlation IDs Work
Every request to TradAI services gets a unique X-Correlation-ID header that propagates through:
1. API Gateway → Backend
2. Backend → Strategy Service / Data Collection
3. Services → Lambda functions
4. Services → DynamoDB operations
Finding the Correlation ID
From API response headers:
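One way to read the header is to dump the response headers with `curl` and pick out the `X-Correlation-ID` line. This is a sketch only: `extract_correlation_id` is a hypothetical helper and `API_URL` is a placeholder, not a documented TradAI endpoint.

```shell
# Print the X-Correlation-ID value from raw HTTP response headers on stdin.
# Header names are case-insensitive (HTTP/2 sends them lowercased), so the
# name is lowercased before comparing. The trailing carriage return is stripped.
extract_correlation_id() {
  awk -F': *' 'tolower($1) == "x-correlation-id" { sub(/\r$/, "", $2); print $2; exit }'
}

# Usage (API_URL is a placeholder):
#   curl -sS -D - -o /dev/null "$API_URL" | extract_correlation_id
```

`-D -` writes the response headers to stdout and `-o /dev/null` discards the body, so only the headers reach the helper.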
From CloudWatch logs:
fields @timestamp, @message
| filter @message like /correlation_id|X-Correlation-ID/
| parse @message /"correlation_id":\s*"(?<correlationId>[^"]+)"/
| limit 100
Trace a Request Across Services
Given a correlation ID, trace through all services:
Step 1: Backend logs:
# Log group: /ecs/tradai-backend-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
Step 2: Strategy Service logs:
# Log group: /ecs/tradai-strategy-service-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
Step 3: Lambda logs (if applicable):
# Log group: /aws/lambda/tradai-sqs-consumer-{env}
fields @timestamp, @message, @requestId
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
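The three steps above can be sketched as one loop over the log groups. This assumes a configured AWS CLI; `trace_log_groups` and `trace_correlation_id` are hypothetical helper names, and the list of groups simply mirrors the steps above rather than every service a request might touch.

```shell
# Log groups to check, in request order, for one environment.
trace_log_groups() {
  env_name=$1
  printf '%s\n' \
    "/ecs/tradai-backend-${env_name}" \
    "/ecs/tradai-strategy-service-${env_name}" \
    "/aws/lambda/tradai-sqs-consumer-${env_name}"
}

# Print events containing the correlation ID from each group
# for the last N minutes (default 60).
trace_correlation_id() {
  env_name=$1 correlation_id=$2 minutes=${3:-60}
  # filter-log-events takes epoch milliseconds.
  start_ms=$(( ($(date -u +%s) - minutes * 60) * 1000 ))
  trace_log_groups "$env_name" | while IFS= read -r group; do
    echo "== ${group}"
    # The ID is double-quoted inside the pattern because it contains hyphens.
    aws logs filter-log-events \
      --log-group-name "$group" \
      --start-time "$start_ms" \
      --filter-pattern "\"${correlation_id}\"" \
      --query 'events[*].[timestamp,message]' \
      --output text < /dev/null
  done
}
```

For example, `trace_correlation_id "$ENVIRONMENT" "$ID" 30` prints matches from the last 30 minutes, one group after another.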
Common Error Patterns
Connection Errors
Pattern: ConnectionError, ConnectionRefusedError
fields @timestamp, @message, @logStream
| filter @message like /ConnectionError|ConnectionRefused|ECONNREFUSED/
| parse @message /connecting to (?<target>[^\s]+)/
| stats count(*) by target
Common causes:
- Service not running
- Security group blocking traffic
- VPC endpoint not available
- DNS resolution failure
Timeout Errors
Pattern: TimeoutError, ReadTimeout
fields @timestamp, @message
| filter @message like /TimeoutError|ReadTimeout|timeout|Timeout/
| parse @message /timeout after (?<seconds>[\d.]+)/
| stats count(*) by bin(5m)
Common causes:
- Slow downstream service
- NAT instance issues
- Database overloaded
- Network congestion
Validation Errors
Pattern: ValidationError, Invalid
fields @timestamp, @message
| filter @message like /ValidationError|Invalid|validation failed/
| sort @timestamp desc
| limit 100
Authentication Errors
Pattern: 401, 403, Unauthorized
fields @timestamp, @message
| filter @message like /401|403|Unauthorized|Forbidden|AuthenticationError/
| sort @timestamp desc
| limit 100
Log Tailing (Real-time)
AWS CLI
# Tail ECS service logs
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow
# Tail Lambda logs
aws logs tail /aws/lambda/tradai-health-check-${ENVIRONMENT} --follow
# Filter while tailing
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
--filter-pattern "ERROR"
# Tail with the compact output format (shortened timestamps)
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
--format short
Multi-Service Tailing
# Terminal 1
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow
# Terminal 2
aws logs tail /ecs/tradai-strategy-service-${ENVIRONMENT} --follow
# Terminal 3
aws logs tail /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} --follow
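If separate terminals are inconvenient, the same tails can share one terminal with each line labelled by its source. This is a sketch under the assumption that interleaved output is acceptable; `prefix_lines` and `tail_many` are hypothetical helper names.

```shell
# Prefix every line on stdin with a label so interleaved output stays readable.
prefix_lines() {
  while IFS= read -r line; do
    printf '[%s] %s\n' "$1" "$line"
  done
}

# Tail several log groups at once. The function body runs in a subshell so
# that Ctrl-C can stop every background tail without touching the calling shell.
tail_many() (
  trap 'kill 0' INT TERM
  for group in "$@"; do
    # Label each stream with the last path segment of its log group name.
    aws logs tail "$group" --follow --format short | prefix_lines "${group##*/}" &
  done
  wait
)
```

For example: `tail_many "/ecs/tradai-backend-${ENVIRONMENT}" "/ecs/tradai-strategy-service-${ENVIRONMENT}"`.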
Exporting Logs for Analysis
Export to S3
# date -v is BSD/macOS syntax; with GNU date use: date -u -d '24 hours ago' +%s
aws logs create-export-task \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--from $(date -u -v-24H +%s)000 \
--to $(date -u +%s)000 \
--destination tradai-${ENVIRONMENT}-logs \
--destination-prefix exports/backend
Download Specific Time Range
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%s)000 \
--end-time $(date -u +%s)000 \
--output json > backend-logs-$(date +%Y%m%d-%H%M).json
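The `date -v` flag used in these commands exists only in BSD/macOS `date`. A minimal portable alternative computes the offset with shell arithmetic instead. `epoch_ms_ago` is a hypothetical helper name.

```shell
# Print the epoch time in milliseconds for N seconds ago (default: now).
# Relies only on `date +%s` and shell arithmetic, which behave the same
# with GNU date (Linux) and BSD date (macOS).
epoch_ms_ago() {
  seconds_ago=${1:-0}
  echo $(( ($(date -u +%s) - seconds_ago) * 1000 ))
}

# Usage:
#   aws logs filter-log-events \
#     --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
#     --start-time "$(epoch_ms_ago 3600)" \
#     --end-time "$(epoch_ms_ago)"
```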
Log Level Reference
| Level | Use Case | Investigation Priority |
|---|---|---|
| DEBUG | Detailed diagnostics | Low - enable temporarily |
| INFO | Key operations, requests | Normal - standard monitoring |
| WARNING | Recoverable issues | Medium - investigate patterns |
| ERROR | Operation failures | High - immediate investigation |
| CRITICAL | System failures | Critical - immediate action |
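Change Log Level (per service)
# Deploy a task definition revision that sets LOG_LEVEL=DEBUG.
# update-service does not edit environment variables itself; the revision must already exist.
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --force-new-deployment \
  --task-definition tradai-backend-${ENVIRONMENT}:NEW_REVISION
# NEW_REVISION is the revision number whose container environment has LOG_LEVEL=DEBUG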
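One way to produce that revision is to copy the current task definition and change only `LOG_LEVEL`. The sketch below assumes `jq` is installed and that setting the variable on every container in the task is acceptable. `with_log_level` and `register_debug_revision` are hypothetical helper names.

```shell
# Rewrite `aws ecs describe-task-definition` JSON (stdin) into input for
# `aws ecs register-task-definition`, setting LOG_LEVEL on every container.
with_log_level() {
  jq --arg level "$1" '
    .taskDefinition
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)
    | .containerDefinitions |= map(
        .environment = ((.environment // [])
                        | map(select(.name != "LOG_LEVEL"))
                        + [{name: "LOG_LEVEL", value: $level}]))
  '
}

# Register a DEBUG revision of the given task definition family
# and print the new revision number.
register_debug_revision() {
  family=$1
  tmp=$(mktemp) || return 1
  aws ecs describe-task-definition --task-definition "$family" --output json \
    | with_log_level DEBUG > "$tmp" || return 1
  aws ecs register-task-definition --cli-input-json "file://$tmp" \
    --query 'taskDefinition.revision' --output text
  rm -f "$tmp"
}
```

The number printed by `register_debug_revision "tradai-backend-${ENVIRONMENT}"` is the value to substitute for `NEW_REVISION`. Reverting means redeploying the previous revision.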
Investigation Workflow
Step-by-Step Process
1. Identify the symptom:
   - Error message, alert, user report
2. Get time range:
   - When did it start?
   - Is it ongoing or resolved?
3. Find affected service:
   - Which service logs to check first?
   - Check health endpoints
4. Search for errors:
   - Use the queries under Basic Queries or the Quick Reference Commands
5. Get correlation ID:
   - From error log or API response
6. Trace across services:
   - Follow correlation ID through log groups
7. Identify root cause:
   - Downstream failure?
   - Configuration issue?
   - Resource exhaustion?
8. Resolve and verify:
   - Apply fix
   - Verify logs are clean
Quick Reference Commands
# Recent errors (last 15 min); date -v is BSD/macOS syntax, with GNU date use: date -u -d '15 minutes ago' +%s
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%s)000 \
--filter-pattern "ERROR" \
--query 'events[*].message' \
--output text
# Count errors by log stream
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%s)000 \
--filter-pattern "ERROR" \
--query 'events[*].logStreamName' \
--output text | sort | uniq -c | sort -rn
# Get log streams (to find specific task)
aws logs describe-log-streams \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--order-by LastEventTime \
--descending \
--limit 5
Verification Checklist
After log investigation:
- [ ] Root cause identified
- [ ] Affected time range documented
- [ ] All impacted requests/users identified
- [ ] Fix applied and verified
- [ ] No new errors in logs (last 15 minutes)
- [ ] Incident timeline documented
- [ ] If recurring, alert/monitoring added