
Log Investigation Runbook

Procedures for investigating issues using CloudWatch Logs, including common queries, log locations, and correlation ID tracing.

Log Locations

Service Logs

| Service | Log Group | Stream Prefix | Description |
| --- | --- | --- | --- |
| Backend API | /ecs/tradai/{env}/services | backend-api | Request handling, orchestration |
| Strategy Service | /ecs/tradai/{env}/services | strategy-service | Strategy config, MLflow integration |
| Data Collection | /ecs/tradai/{env}/services | data-collection | Market data fetching, ArcticDB |
| MLflow | /ecs/tradai/{env}/services | mlflow | Experiment tracking |
| Strategy Tasks | /ecs/tradai/{env} | strategy | Backtest/trading execution |

Lambda Logs

| Function | Log Group |
| --- | --- |
| Health Check | /aws/lambda/tradai-health-check-{env} |
| Trading Heartbeat Check | /aws/lambda/tradai-trading-heartbeat-check-{env} |
| Orphan Scanner | /aws/lambda/tradai-orphan-scanner-{env} |
| Drift Monitor | /aws/lambda/tradai-drift-monitor-{env} |
| Retraining Scheduler | /aws/lambda/tradai-retraining-scheduler-{env} |
| SQS Consumer | /aws/lambda/tradai-sqs-consumer-{env} |
| Strategy Validator | /aws/lambda/tradai-validate-strategy-{env} |
| Data Collection Proxy | /aws/lambda/tradai-data-collection-proxy-{env} |

Infrastructure Logs

| Resource | Log Group |
| --- | --- |
| API Gateway | /aws/api-gateway/tradai-{env} (not currently configured in infrastructure; API Gateway access logging is not enabled) |
| VPC Flow Logs | /aws/vpc/tradai-flow-logs |
| CloudTrail | /aws/cloudtrail/tradai |

Consolidated EC2 Instance Logs

Dev/staging environments use consolidated EC2

In dev and staging, services run via docker-compose on a single EC2 instance instead of ECS Fargate. Logs are not in CloudWatch ECS log groups. Instead:

  • SSH/SSM into the instance and use sudo docker logs <container> (e.g., sudo docker logs backend-api)
  • Docker-compose logs: cd /opt/tradai && sudo docker-compose logs -f
  • System logs: journalctl -u docker for Docker daemon issues
  • CloudWatch Agent (if installed) may forward to /ec2/tradai-consolidated-{env}
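When grepping container output on the consolidated instance, a small filter helper saves retyping the same patterns; a sketch (the function name and pattern list are ours, mirroring the Logs Insights filters used later in this runbook):

```shell
# filter_errors: keep only error-ish lines from whatever is piped in.
# Patterns mirror the Logs Insights filters used elsewhere in this runbook.
filter_errors() {
  grep -E 'ERROR|Exception|Traceback|CRITICAL'
}

# Typical use on the consolidated EC2 instance:
#   sudo docker logs backend-api 2>&1 | filter_errors
# Demo on sample lines:
printf '%s\n' \
  'INFO request handled in 12ms' \
  'ERROR ConnectionError: connecting to arcticdb' \
  'DEBUG cache hit' \
  | filter_errors
```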

See the infrastructure-issues runbook for full EC2 debugging procedures.


CloudWatch Logs Insights Queries

Basic Queries

View recent errors across all ECS services:

fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error|exception/
| sort @timestamp desc
| limit 100

Filter by correlation ID:

fields @timestamp, @message, @logStream
| filter @message like /CORRELATION_ID_HERE/
| sort @timestamp asc
| limit 500

Find specific error patterns:

fields @timestamp, @message
| filter @message like /ConnectionError|TimeoutError|ValidationError/
| stats count(*) as error_count by bin(5m)

Service-Specific Queries

Backend - Request tracing:

fields @timestamp, @message
| filter @message like /POST|GET|PUT|DELETE/
| filter @message like /api\/v1/
| parse @message /(?<method>POST|GET|PUT|DELETE)\s+(?<path>[^\s]+)\s+(?<status>\d{3})/
| stats count(*) by path, status
| sort count(*) desc
| limit 50

Strategy Service - Backtest status:

fields @timestamp, @message
| filter @message like /backtest|Backtest/
| filter @message like /started|completed|failed|error/
| sort @timestamp desc
| limit 100

Data Collection - Fetch errors:

fields @timestamp, @message
| filter @message like /fetch|OHLCV|exchange/
| filter @message like /error|failed|timeout/
| stats count(*) as errors by bin(15m)

Lambda Queries

Lambda errors with stack traces:

fields @timestamp, @message, @requestId
| filter @message like /ERROR|Exception|Traceback/
| sort @timestamp desc
| limit 50

Lambda cold starts:

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*) by bin(1h)

Lambda duration analysis:

fields @timestamp, @message
| filter @type = "REPORT"
| parse @message /Duration: (?<duration>[\d.]+) ms/
| stats avg(duration), p95(duration), max(duration) by bin(5m)

API Gateway Queries

Request latency analysis:

fields @timestamp, responseLatency, status, path
| stats avg(responseLatency) as avg_latency, p95(responseLatency) as p95_latency by path
| sort avg_latency desc

4xx/5xx errors:

fields @timestamp, status, path, errorMessage
| filter status >= 400
| stats count(*) by status, path
| sort count(*) desc
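Any of the queries above can be run from the CLI rather than the console via `aws logs start-query` / `get-query-results`. A sketch (the helper name and `DRY_RUN` guard are ours; with `DRY_RUN=1` it only prints the command, and real use should poll until the query status is `Complete`):

```shell
# run_insights_query GROUP QUERY START_EPOCH END_EPOCH
# Times are epoch seconds, as start-query expects.
run_insights_query() {
  group="$1"; query="$2"; start="$3"; end="$4"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws logs start-query --log-group-name $group --start-time $start --end-time $end --query-string '$query'"
    return 0
  fi
  qid=$(aws logs start-query --log-group-name "$group" \
          --start-time "$start" --end-time "$end" \
          --query-string "$query" \
          --output text --query queryId)
  sleep 5   # results are asynchronous; real use should poll until Complete
  aws logs get-query-results --query-id "$qid"
}

# Example (dry run, so nothing hits AWS):
DRY_RUN=1 run_insights_query "/ecs/tradai/dev/services" \
  'fields @timestamp, @message | limit 10' 1700000000 1700003600
```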


Correlation ID Tracing

How Correlation IDs Work

Every request to TradAI services gets a unique X-Correlation-ID header that propagates through:

  1. API Gateway → Backend
  2. Backend → Strategy Service / Data Collection
  3. Services → Lambda functions
  4. Services → DynamoDB operations
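When reproducing an issue it helps to set the header yourself, so the ID is known before the request is sent. A sketch (the helper name is ours; whether the backend honors a caller-supplied X-Correlation-ID is an assumption to verify against the service):

```shell
# new_correlation_id: 32 hex chars from /dev/urandom (Linux/macOS).
new_correlation_id() {
  tr -dc 'a-f0-9' < /dev/urandom | head -c 32
}

cid=$(new_correlation_id)
echo "correlation id: $cid"
# Pass it explicitly so the same ID shows up in every service's logs:
#   curl -H "X-Correlation-ID: $cid" https://api.tradai.io/api/v1/health
```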

Finding the Correlation ID

From API response headers:

curl -v https://api.tradai.io/api/v1/health 2>&1 | grep -i x-correlation-id

From CloudWatch logs:

fields @timestamp, @message
| filter @message like /correlation_id|X-Correlation-ID/
| parse @message /"correlation_id":\s*"(?<correlationId>[^"]+)"/
| limit 100
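The same extraction works locally on a captured log line with sed, mirroring the Insights `parse` pattern (the helper name is ours; it assumes the structured-log field is literally `correlation_id`):

```shell
# extract_correlation_id: pull the correlation_id value out of JSON log
# lines, mirroring the Insights parse pattern above.
extract_correlation_id() {
  sed -n 's/.*"correlation_id" *: *"\([^"]*\)".*/\1/p'
}

echo '{"level":"ERROR","correlation_id":"3f2a9c","msg":"backtest failed"}' \
  | extract_correlation_id
# prints: 3f2a9c
```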

Trace a Request Across Services

Given a correlation ID, trace through all services:

Step 1: Backend logs:

-- Log group: /ecs/tradai/{env}/services (filter by backend-api stream prefix)
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 2: Strategy Service logs:

-- Log group: /ecs/tradai/{env}/services (filter by strategy-service stream prefix)
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 3: Lambda logs (if applicable):

-- Log group: /aws/lambda/tradai-sqs-consumer-{env}
fields @timestamp, @message, @requestId
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
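The three steps above can be scripted as a loop over log groups. A sketch that only prints the commands to run (the function name is ours, and the group list assumes the dev environment; extend it with whichever Lambda groups apply):

```shell
# trace_correlation: print a filter-log-events command for each log group
# the correlation ID may have passed through (printed, not executed).
trace_correlation() {
  cid="$1"
  for group in \
      "/ecs/tradai/dev/services" \
      "/aws/lambda/tradai-sqs-consumer-dev"; do
    echo "aws logs filter-log-events --log-group-name $group --filter-pattern '$cid'"
  done
}

trace_correlation "3f2a9c"
```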


Common Error Patterns

Connection Errors

Pattern: ConnectionError, ConnectionRefusedError

fields @timestamp, @message, @logStream
| filter @message like /ConnectionError|ConnectionRefused|ECONNREFUSED/
| parse @message /connecting to (?<target>[^\s]+)/
| stats count(*) by target

Common causes:

  • Service not running
  • Security group blocking traffic
  • VPC endpoint not available
  • DNS resolution failure

Timeout Errors

Pattern: TimeoutError, ReadTimeout

fields @timestamp, @message
| filter @message like /TimeoutError|ReadTimeout|timeout|Timeout/
| parse @message /timeout after (?<seconds>[\d.]+)/
| stats count(*) by bin(5m)

Common causes:

  • Slow downstream service
  • NAT instance issues
  • Database overloaded
  • Network congestion

Validation Errors

Pattern: ValidationError, Invalid

fields @timestamp, @message
| filter @message like /ValidationError|Invalid|validation failed/
| sort @timestamp desc
| limit 100

Authentication Errors

Pattern: 401, 403, Unauthorized

fields @timestamp, @message
| filter @message like /401|403|Unauthorized|Forbidden|AuthenticationError/
| sort @timestamp desc
| limit 100


Log Tailing (Real-time)

AWS CLI

# Tail ECS service logs (all services share one log group)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow

# Tail Lambda logs
aws logs tail /aws/lambda/tradai-health-check-${ENVIRONMENT} --follow

# Filter while tailing (filter by service stream prefix)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --filter-pattern "ERROR" \
  --log-stream-name-prefix backend-api

# Tail with timestamp format
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --format short

Multi-Service Tailing

# Terminal 1 (backend logs)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --log-stream-name-prefix backend-api

# Terminal 2 (strategy service logs)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --log-stream-name-prefix strategy-service

# Terminal 3
aws logs tail /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} --follow
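The per-terminal commands above can be wrapped in a one-line helper (`tail_service` is our name, not part of the runbook; with `DRY_RUN=1` it prints the command instead of invoking the AWS CLI):

```shell
# tail_service SERVICE [ENV]: tail one ECS service's stream within the
# shared log group. ENV defaults to $ENVIRONMENT, then "dev".
tail_service() {
  svc="$1"; env="${2:-${ENVIRONMENT:-dev}}"
  cmd="aws logs tail /ecs/tradai/$env/services --follow --log-stream-name-prefix $svc"
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$cmd"; else $cmd; fi
}

# Usage: one terminal per service.
DRY_RUN=1 tail_service backend-api staging
```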

Exporting Logs for Analysis

Export to S3

# Note: 'date -v-24H' is BSD/macOS syntax; on GNU/Linux use: date -u -d '24 hours ago' +%s
aws logs create-export-task \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --from $(date -u -v-24H +%s)000 \
  --to $(date -u +%s)000 \
  --destination tradai-logs-${ENVIRONMENT} \
  --destination-prefix exports/services

Download specific time range

aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --log-stream-name-prefix backend-api \
  --start-time $(date -u -v-1H +%s)000 \
  --end-time $(date -u +%s)000 \
  --output json > backend-logs-$(date +%Y%m%d-%H%M).json

Log Level Reference

| Level | Use Case | Investigation Priority |
| --- | --- | --- |
| DEBUG | Detailed diagnostics | Low - enable temporarily |
| INFO | Key operations, requests | Normal - standard monitoring |
| WARNING | Recoverable issues | Medium - investigate patterns |
| ERROR | Operation failures | High - immediate investigation |
| CRITICAL | System failures | Critical - immediate action |

Change Log Level (per service)

# Roll the service over to a task definition revision that sets LOG_LEVEL=DEBUG
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-api-${ENVIRONMENT} \
  --task-definition tradai-backend-api-${ENVIRONMENT}:NEW_REVISION \
  --force-new-deployment
# NEW_REVISION must already be registered with LOG_LEVEL=DEBUG in its
# container environment

Investigation Workflow

Step-by-Step Process

  1. Identify the symptom:
     • Error message, alert, or user report

  2. Get the time range:
     • When did it start?
     • Is it ongoing or resolved?

  3. Find the affected service:
     • Which service logs to check first?
     • Check health endpoints

  4. Search for errors:

     fields @timestamp, @message
     | filter @message like /ERROR|Exception/
     | filter @timestamp >= "YYYY-MM-DDTHH:MM:SSZ"
     | sort @timestamp asc
     | limit 100

  5. Get the correlation ID:
     • From the error log or API response

  6. Trace across services:
     • Follow the correlation ID through the log groups listed above

  7. Identify the root cause:
     • Downstream failure?
     • Configuration issue?
     • Resource exhaustion?

  8. Resolve and verify:
     • Apply the fix
     • Verify logs are clean
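The time range from step 2 feeds directly into the `--start-time` arguments of the CLI commands below. A small epoch-milliseconds helper sidesteps the BSD/GNU `date` flag divergence (the function name is ours):

```shell
# minutes_ago_ms: epoch milliseconds for N minutes ago, using only
# 'date -u +%s' (portable across BSD and GNU date, unlike -v / -d).
minutes_ago_ms() {
  echo $(( ( $(date -u +%s) - $1 * 60 ) * 1000 ))
}

# Example: --start-time for "last 15 minutes"
start=$(minutes_ago_ms 15)
echo "start-time: $start"
```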

Quick Reference Commands

# Recent errors (last 15 min, all services)
aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --start-time $(date -u -v-15M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].message' \
  --output text

# Count errors by log stream (identifies which service)
aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].logStreamName' \
  --output text | sort | uniq -c | sort -rn

# Get log streams (to find specific task)
aws logs describe-log-streams \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --order-by LastEventTime \
  --descending \
  --limit 5

Verification Checklist

After log investigation:

  • Root cause identified
  • Affected time range documented
  • All impacted requests/users identified
  • Fix applied and verified
  • No new errors in logs (last 15 minutes)
  • Incident timeline documented
  • If recurring, alert/monitoring added