Log Investigation Runbook

Procedures for investigating issues using CloudWatch Logs, including common queries, log locations, and correlation ID tracing.

Log Locations

Service Logs

Service	Log Group	Description
Backend API	`/ecs/tradai-backend-{env}`	Request handling, orchestration
Strategy Service	`/ecs/tradai-strategy-service-{env}`	Strategy config, MLflow integration
Data Collection	`/ecs/tradai-data-collection-{env}`	Market data fetching, ArcticDB
MLflow	`/ecs/tradai-mlflow-{env}`	Experiment tracking
Strategy Tasks	`/ecs/tradai-strategy-{env}`	Backtest/trading execution

Lambda Logs

Function	Log Group
Health Check	`/aws/lambda/tradai-health-check-{env}`
Heartbeat Check	`/aws/lambda/tradai-heartbeat-check-{env}`
Orphan Scanner	`/aws/lambda/tradai-orphan-scanner-{env}`
Drift Monitor	`/aws/lambda/tradai-drift-monitor-{env}`
Retraining Scheduler	`/aws/lambda/tradai-retraining-scheduler-{env}`
SQS Consumer	`/aws/lambda/tradai-sqs-consumer-{env}`
Strategy Validator	`/aws/lambda/tradai-validate-strategy-{env}`
Data Proxy	`/aws/lambda/tradai-data-proxy-{env}`

Infrastructure Logs

Resource	Log Group
API Gateway	`/aws/api-gateway/tradai-{env}`
VPC Flow Logs	`/aws/vpc/flowlogs/tradai-{env}`
CloudTrail	`/aws/cloudtrail/tradai-{env}`

CloudWatch Logs Insights Queries

Basic Queries

View recent errors across all ECS services:

fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error|exception/
| sort @timestamp desc
| limit 100

Filter by correlation ID:

fields @timestamp, @message, @logStream
| filter @message like /CORRELATION_ID_HERE/
| sort @timestamp asc
| limit 500

Find specific error patterns:

fields @timestamp, @message
| filter @message like /ConnectionError|TimeoutError|ValidationError/
| stats count(*) as error_count by bin(5m)
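The Logs Insights queries in this runbook can also be run from a terminal. The following is a minimal sketch built on `aws logs start-query` and `aws logs get-query-results`; the function name `run_insights_query`, the one-hour default window, and the 2-second polling interval are illustrative assumptions, not part of the TradAI tooling.

```shell
# Hypothetical helper: run a Logs Insights query against one log group and
# print the raw JSON results once the query has finished.
# Usage: run_insights_query <log-group> <query-string> [lookback-seconds]
run_insights_query() {
  local log_group="$1" query="$2" lookback="${3:-3600}"
  local now start query_id status

  now=$(date -u +%s)              # start-query expects epoch seconds
  start=$(( now - lookback ))

  # --query-string carries the Insights query; --query is the CLI's own
  # JMESPath filter, used here to extract the query ID.
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" \
    --end-time "$now" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Complete) break ;;
      Failed|Cancelled|Timeout) echo "query ended: $status" >&2; return 1 ;;
      *) sleep 2 ;;
    esac
  done

  aws logs get-query-results --query-id "$query_id" --output json
}
```

For example, `run_insights_query "/ecs/tradai-backend-${ENVIRONMENT}" 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'` would return the most recent hour of matches.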

Service-Specific Queries

Backend - Request tracing:

fields @timestamp, @message
| filter @message like /POST|GET|PUT|DELETE/
| filter @message like /api\/v1/
| parse @message /(?<method>POST|GET|PUT|DELETE)\s+(?<path>[^\s]+)\s+(?<status>\d{3})/
| stats count(*) as request_count by path, status
| sort request_count desc
| limit 50

Strategy Service - Backtest status:

fields @timestamp, @message
| filter @message like /backtest|Backtest/
| filter @message like /started|completed|failed|error/
| sort @timestamp desc
| limit 100

Data Collection - Fetch errors:

fields @timestamp, @message
| filter @message like /fetch|OHLCV|exchange/
| filter @message like /error|failed|timeout/
| stats count(*) as errors by bin(15m)

Lambda Queries

Lambda errors with stack traces:

fields @timestamp, @message, @requestId
| filter @message like /ERROR|Exception|Traceback/
| sort @timestamp desc
| limit 50

Lambda cold starts:

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*) by bin(1h)

Lambda duration analysis:

fields @timestamp, @message
| filter @type = "REPORT"
| parse @message /Duration: (?<duration>[\d.]+) ms/
| stats avg(duration), pct(duration, 95), max(duration) by bin(5m)

API Gateway Queries

Request latency analysis:

fields @timestamp, responseLatency, status, path
| stats avg(responseLatency) as avg_latency, pct(responseLatency, 95) as p95_latency by path
| sort avg_latency desc

4xx/5xx errors:

fields @timestamp, status, path, errorMessage
| filter status >= 400
| stats count(*) as error_count by status, path
| sort error_count desc

Correlation ID Tracing

How Correlation IDs Work

Every request to TradAI services gets a unique X-Correlation-ID header that propagates through:

1. API Gateway → Backend
2. Backend → Strategy Service / Data Collection
3. Services → Lambda functions
4. Services → DynamoDB operations

Finding the Correlation ID

From API response headers:

curl -v https://api.tradai.io/api/v1/health 2>&1 | grep -i x-correlation-id
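To reuse the ID in later queries, it helps to capture only the header value instead of the full `grep` line. A minimal sketch follows; the function names `extract_correlation_id` and `get_correlation_id` are hypothetical, and it assumes the service returns the header on the response as described above.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of X-Correlation-ID. Header names are matched case-insensitively
# because HTTP/2 responses report them in lowercase.
extract_correlation_id() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-correlation-id" { print $2; exit }'
}

# Fetch only the response headers for a URL and print the correlation ID.
# Usage: get_correlation_id https://api.tradai.io/api/v1/health
get_correlation_id() {
  curl -sS -D - -o /dev/null "$1" | extract_correlation_id
}
```

The result can be stored in a variable, for example `CID=$(get_correlation_id https://api.tradai.io/api/v1/health)`, and substituted into the queries that follow.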

From CloudWatch logs:

fields @timestamp, @message
| filter @message like /correlation_id|X-Correlation-ID/
| parse @message /"correlation_id":\s*"(?<correlationId>[^"]+)"/
| limit 100

Trace a Request Across Services

Given a correlation ID, trace through all services:

Step 1: Backend logs:

# Log group: /ecs/tradai-backend-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 2: Strategy Service logs:

# Log group: /ecs/tradai-strategy-service-{env}
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 3: Lambda logs (if applicable):

# Log group: /aws/lambda/tradai-sqs-consumer-{env}
fields @timestamp, @message, @requestId
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
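The three steps above can be sketched as one loop over the same log groups using `aws logs filter-log-events`. The function name `trace_correlation_id` and the one-hour default lookback are assumptions made for illustration.

```shell
# Hypothetical helper: search the backend, strategy service, and SQS consumer
# log groups for a correlation ID and print the matching events per group.
# Usage: trace_correlation_id <env> <correlation-id> [lookback-seconds]
trace_correlation_id() {
  local environment="$1" cid="$2" lookback="${3:-3600}"
  local start_ms group

  # filter-log-events expects epoch milliseconds.
  start_ms=$(( ($(date -u +%s) - lookback) * 1000 ))

  for group in \
    "/ecs/tradai-backend-$environment" \
    "/ecs/tradai-strategy-service-$environment" \
    "/aws/lambda/tradai-sqs-consumer-$environment"
  do
    echo "== $group"
    # Filter-pattern terms that contain non-alphanumeric characters such as
    # hyphens must be wrapped in double quotes.
    aws logs filter-log-events \
      --log-group-name "$group" \
      --start-time "$start_ms" \
      --filter-pattern "\"$cid\"" \
      --query 'events[*].[timestamp,logStreamName,message]' \
      --output text
  done
}
```

Running the groups in request order (backend first, then downstream services) keeps the output close to the path the request took.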

Common Error Patterns

Connection Errors

Pattern: ConnectionError, ConnectionRefusedError

fields @timestamp, @message, @logStream
| filter @message like /ConnectionError|ConnectionRefused|ECONNREFUSED/
| parse @message /connecting to (?<target>[^\s]+)/
| stats count(*) by target

Common causes:

- Service not running
- Security group blocking traffic
- VPC endpoint not available
- DNS resolution failure

Timeout Errors

Pattern: TimeoutError, ReadTimeout

fields @timestamp, @message
| filter @message like /TimeoutError|ReadTimeout|timeout|Timeout/
| parse @message /timeout after (?<seconds>[\d.]+)/
| stats count(*) by bin(5m)

Common causes:

- Slow downstream service
- NAT instance issues
- Database overloaded
- Network congestion

Validation Errors

Pattern: ValidationError, Invalid

fields @timestamp, @message
| filter @message like /ValidationError|Invalid|validation failed/
| sort @timestamp desc
| limit 100

Authentication Errors

Pattern: 401, 403, Unauthorized

fields @timestamp, @message
| filter @message like /401|403|Unauthorized|Forbidden|AuthenticationError/
| sort @timestamp desc
| limit 100
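When working from downloaded log messages rather than the console, the four categories above can be applied locally. This is a rough sketch that reuses the same match strings as the queries; the function names `classify_error` and `summarize_errors` are hypothetical, and plain substring matching (for example on `401`) can produce false positives just as the queries can.

```shell
# Hypothetical helper: label one log line with an error category, checking
# the categories in the order they are documented above.
classify_error() {
  case "$1" in
    *ConnectionError*|*ConnectionRefused*|*ECONNREFUSED*)
      echo connection ;;
    *TimeoutError*|*ReadTimeout*|*timeout*|*Timeout*)
      echo timeout ;;
    *ValidationError*|*Invalid*|*"validation failed"*)
      echo validation ;;
    *401*|*403*|*Unauthorized*|*Forbidden*|*AuthenticationError*)
      echo authentication ;;
    *)
      echo other ;;
  esac
}

# Tally categories for log messages read from stdin, one message per line,
# most frequent category first.
summarize_errors() {
  while IFS= read -r line; do
    classify_error "$line"
  done | sort | uniq -c | sort -rn
}
```

A quick tally shows which category dominates an incident before drilling into the matching query.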

Log Tailing (Real-time)

AWS CLI

# Tail ECS service logs
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow

# Tail Lambda logs
aws logs tail /aws/lambda/tradai-health-check-${ENVIRONMENT} --follow

# Filter while tailing
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
  --filter-pattern "ERROR"

# Tail with timestamp format
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow \
  --format short

Multi-Service Tailing

# Terminal 1
aws logs tail /ecs/tradai-backend-${ENVIRONMENT} --follow

# Terminal 2
aws logs tail /ecs/tradai-strategy-service-${ENVIRONMENT} --follow

# Terminal 3
aws logs tail /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} --follow
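When a single merged view is more useful than three live terminals, a non-following snapshot of the same log groups can be combined into one timeline. The sketch below assumes that `--format short` starts each line with an ISO-style timestamp followed by a space; the function name `merged_recent_logs` is hypothetical, and continuation lines of multi-line messages such as tracebacks will not sort correctly.

```shell
# Hypothetical helper: print recent events from three log groups as one
# timeline, tagging each line with the log group it came from.
# Usage: merged_recent_logs <env> [since]   (since defaults to 10m)
merged_recent_logs() {
  local environment="$1" since="${2:-10m}" group label
  for group in \
    "/ecs/tradai-backend-$environment" \
    "/ecs/tradai-strategy-service-$environment" \
    "/aws/lambda/tradai-sqs-consumer-$environment"
  do
    label="${group##*/}"   # last path segment, e.g. tradai-backend-dev
    # Insert "[label]" after the leading timestamp so that sorting the
    # combined output still orders lines by time.
    aws logs tail "$group" --since "$since" --format short \
      | sed "s|^\([^ ]*\) |\1 [$label] |"
  done | LC_ALL=C sort
}
```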

Exporting Logs for Analysis

Export to S3

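# -v-24H is BSD/macOS date syntax; with GNU date (Linux) use: date -u -d '24 hours ago' +%s
# --destination is the name of the target S3 bucket
aws logs create-export-task \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --from $(date -u -v-24H +%s)000 \
  --to $(date -u +%s)000 \
  --destination tradai-${ENVIRONMENT}-logs \
  --destination-prefix exports/backend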
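Export tasks run asynchronously, so the files are not in S3 when the command returns. A minimal sketch for waiting on completion with `aws logs describe-export-tasks` follows; the function name `wait_for_export` and the 5-second default interval are assumptions, and the task ID is the `taskId` value returned by `create-export-task`.

```shell
# Hypothetical helper: poll an export task until it completes or fails.
# Usage: wait_for_export <task-id> [poll-interval-seconds]
wait_for_export() {
  local task_id="$1" interval="${2:-5}" code
  while :; do
    code=$(aws logs describe-export-tasks --task-id "$task_id" \
      --query 'exportTasks[0].status.code' --output text) || return 1
    case "$code" in
      COMPLETED)
        echo "export $task_id completed"
        return 0 ;;
      CANCELLED|FAILED)
        echo "export $task_id ended with status $code" >&2
        return 1 ;;
      *)
        sleep "$interval" ;;   # PENDING, RUNNING, and similar states
    esac
  done
}
```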

Download specific time range

# -v-1H is BSD/macOS date syntax; with GNU date (Linux) use: date -u -d '1 hour ago' +%s
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --end-time $(date -u +%s)000 \
  --output json > backend-logs-$(date +%Y%m%d-%H%M).json
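The `date -v` flag used in the commands above is specific to BSD/macOS `date`. One portable alternative is to compute the offset with shell arithmetic, which relies only on `date +%s` and behaves the same with GNU and BSD `date`. The helper name `epoch_ms_ago` is hypothetical.

```shell
# Hypothetical helper: print the epoch-millisecond timestamp for N seconds
# ago (0 means now), in the format filter-log-events and create-export-task
# expect for their time arguments.
epoch_ms_ago() {
  local seconds_ago="${1:-0}"
  echo $(( ($(date -u +%s) - seconds_ago) * 1000 ))
}

# Example: download the last hour of backend logs.
# aws logs filter-log-events \
#   --log-group-name "/ecs/tradai-backend-${ENVIRONMENT}" \
#   --start-time "$(epoch_ms_ago 3600)" \
#   --end-time "$(epoch_ms_ago 0)" \
#   --output json
```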

Log Level Reference

Level	Use Case	Investigation Priority
DEBUG	Detailed diagnostics	Low - enable temporarily
INFO	Key operations, requests	Normal - standard monitoring
WARNING	Recoverable issues	Medium - investigate patterns
ERROR	Operation failures	High - immediate investigation
CRITICAL	System failures	Critical - immediate action

Change Log Level (per service)

# Point the service at a task definition revision that sets LOG_LEVEL=DEBUG.
# Register that revision first, then replace NEW_REVISION with its number.
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --force-new-deployment \
  --task-definition tradai-backend-${ENVIRONMENT}:NEW_REVISION

Investigation Workflow

Step-by-Step Process

1. Identify the symptom:
   - Error message, alert, user report
2. Get time range:
   - When did it start?
   - Is it ongoing or resolved?
3. Find affected service:
   - Which service logs to check first?
   - Check health endpoints
4. Search for errors:

fields @timestamp, @message
| filter @message like /ERROR|Exception/
| filter toMillis(@timestamp) >= START_EPOCH_MS
| sort @timestamp asc
| limit 100

5. Get correlation ID:
   - From error log or API response
6. Trace across services:
   - Follow correlation ID through log groups
7. Identify root cause:
   - Downstream failure?
   - Configuration issue?
   - Resource exhaustion?
8. Resolve and verify:
   - Apply fix
   - Verify logs are clean

Quick Reference Commands

# Recent errors (last 15 min)
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].message' \
  --output text

# Count errors by log stream
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].logStreamName' \
  --output text | sort | uniq -c | sort -rn

# Get log streams (to find specific task)
aws logs describe-log-streams \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --order-by LastEventTime \
  --descending \
  --limit 5

Verification Checklist

After log investigation:

[ ] Root cause identified
[ ] Affected time range documented
[ ] All impacted requests/users identified
[ ] Fix applied and verified
[ ] No new errors in logs (last 15 minutes)
[ ] Incident timeline documented
[ ] If recurring, alert/monitoring added