
Log Investigation Runbook

Procedures for investigating issues using CloudWatch Logs, including common queries, log locations, and correlation ID tracing.

Log Locations

Service Logs

| Service | Log Group | Stream Prefix | Description |
| --- | --- | --- | --- |
| Backend API | /ecs/tradai/{env}/services | backend-api | Request handling, orchestration |
| Strategy Service | /ecs/tradai/{env}/services | strategy-service | Strategy config, MLflow integration |
| Data Collection | /ecs/tradai/{env}/services | data-collection | Market data fetching, ArcticDB |
| MLflow | /ecs/tradai/{env}/services | mlflow | Experiment tracking |
| Strategy Tasks | /ecs/tradai/{env} | strategy | Backtest/trading execution |

Lambda Logs

| Function | Log Group |
| --- | --- |
| Health Check | /aws/lambda/tradai-health-check-{env} |
| Trading Heartbeat Check | /aws/lambda/tradai-trading-heartbeat-check-{env} |
| Orphan Scanner | /aws/lambda/tradai-orphan-scanner-{env} |
| Drift Monitor | /aws/lambda/tradai-drift-monitor-{env} |
| Retraining Scheduler | /aws/lambda/tradai-retraining-scheduler-{env} |
| SQS Consumer | /aws/lambda/tradai-sqs-consumer-{env} |
| Strategy Validator | /aws/lambda/tradai-validate-strategy-{env} |
| Data Collection Proxy | /aws/lambda/tradai-data-collection-proxy-{env} |

Infrastructure Logs

| Resource | Log Group |
| --- | --- |
| API Gateway | /aws/api-gateway/tradai-{env} (not currently configured in infrastructure; API Gateway access logging is not enabled) |
| VPC Flow Logs | /aws/vpc/tradai-flow-logs |
| CloudTrail | /aws/cloudtrail/tradai |

Consolidated EC2 Instance Logs

Dev/staging environments use consolidated EC2

In dev and staging, services run via docker-compose on a single EC2 instance instead of ECS Fargate. Logs are not in CloudWatch ECS log groups. Instead:

  • SSH/SSM into the instance and use sudo docker logs <container> (e.g., sudo docker logs backend-api)
  • Docker-compose logs: cd /opt/tradai && sudo docker-compose logs -f
  • System logs: journalctl -u docker for Docker daemon issues
  • CloudWatch Agent (if installed) may forward to /ec2/tradai-consolidated-{env}
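When grepping container output on the consolidated instance, a small filter helper saves retyping the same patterns; a sketch (the function name and pattern list are ours, mirroring the Logs Insights filters used later in this runbook):

```shell
# filter_errors: keep only error-ish lines from whatever is piped in.
# Patterns mirror the Logs Insights filters used elsewhere in this runbook.
filter_errors() {
  grep -E 'ERROR|Exception|Traceback|CRITICAL'
}

# Typical use on the consolidated EC2 instance:
#   sudo docker logs backend-api 2>&1 | filter_errors
# Demo on sample lines:
printf '%s\n' \
  'INFO request handled in 12ms' \
  'ERROR ConnectionError: connecting to arcticdb' \
  'DEBUG cache hit' \
  | filter_errors
```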

See the infrastructure-issues runbook for full EC2 debugging procedures.


CloudWatch Logs Insights Queries

Basic Queries

View recent errors across all ECS services:

fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception|error|exception/
| sort @timestamp desc
| limit 100

Filter by correlation ID:

fields @timestamp, @message, @logStream
| filter @message like /CORRELATION_ID_HERE/
| sort @timestamp asc
| limit 500

Find specific error patterns:

fields @timestamp, @message
| filter @message like /ConnectionError|TimeoutError|ValidationError/
| stats count(*) as error_count by bin(5m)

Service-Specific Queries

Backend - Request tracing:

fields @timestamp, @message
| filter @message like /POST|GET|PUT|DELETE/
| filter @message like /api\/v1/
| parse @message /(?<method>POST|GET|PUT|DELETE)\s+(?<path>[^\s]+)\s+(?<status>\d{3})/
| stats count(*) by path, status
| sort count(*) desc
| limit 50

Strategy Service - Backtest status:

fields @timestamp, @message
| filter @message like /backtest|Backtest/
| filter @message like /started|completed|failed|error/
| sort @timestamp desc
| limit 100

Data Collection - Fetch errors:

fields @timestamp, @message
| filter @message like /fetch|OHLCV|exchange/
| filter @message like /error|failed|timeout/
| stats count(*) as errors by bin(15m)

Lambda Queries

Lambda errors with stack traces:

fields @timestamp, @message, @requestId
| filter @message like /ERROR|Exception|Traceback/
| sort @timestamp desc
| limit 50

Lambda cold starts:

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*) by bin(1h)

Lambda duration analysis:

fields @timestamp, @message
| filter @type = "REPORT"
| parse @message /Duration: (?<duration>[\d.]+) ms/
| stats avg(duration), p95(duration), max(duration) by bin(5m)

API Gateway Queries

Request latency analysis:

fields @timestamp, responseLatency, status, path
| stats avg(responseLatency) as avg_latency, p95(responseLatency) as p95_latency by path
| sort avg_latency desc

4xx/5xx errors:

fields @timestamp, status, path, errorMessage
| filter status >= 400
| stats count(*) by status, path
| sort count(*) desc
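Any of the queries above can be run from the CLI rather than the console via `aws logs start-query` / `get-query-results`. A sketch (the helper name and `DRY_RUN` guard are ours; with `DRY_RUN=1` it only prints the command, and real use should poll until the query status is `Complete`):

```shell
# run_insights_query GROUP QUERY START_EPOCH END_EPOCH
# Times are epoch seconds, as start-query expects.
run_insights_query() {
  group="$1"; query="$2"; start="$3"; end="$4"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws logs start-query --log-group-name $group --start-time $start --end-time $end --query-string '$query'"
    return 0
  fi
  qid=$(aws logs start-query --log-group-name "$group" \
          --start-time "$start" --end-time "$end" \
          --query-string "$query" \
          --output text --query queryId)
  sleep 5   # results are asynchronous; real use should poll until Complete
  aws logs get-query-results --query-id "$qid"
}

# Example (dry run, so nothing hits AWS):
DRY_RUN=1 run_insights_query "/ecs/tradai/dev/services" \
  'fields @timestamp, @message | limit 10' 1700000000 1700003600
```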


Correlation ID Tracing

How Correlation IDs Work

Every request to TradAI services gets a unique X-Correlation-ID header that propagates through:

  1. API Gateway → Backend
  2. Backend → Strategy Service / Data Collection
  3. Services → Lambda functions
  4. Services → DynamoDB operations
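When reproducing an issue it helps to set the header yourself, so the ID is known before the request is sent. A sketch (the helper name is ours; whether the backend honors a caller-supplied X-Correlation-ID is an assumption to verify against the service):

```shell
# new_correlation_id: 32 hex chars from /dev/urandom (Linux/macOS).
new_correlation_id() {
  tr -dc 'a-f0-9' < /dev/urandom | head -c 32
}

cid=$(new_correlation_id)
echo "correlation id: $cid"
# Pass it explicitly so the same ID shows up in every service's logs:
#   curl -H "X-Correlation-ID: $cid" https://api.tradai.io/api/v1/health
```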

Finding the Correlation ID

From API response headers:

curl -v https://api.tradai.io/api/v1/health 2>&1 | grep -i x-correlation-id

From CloudWatch logs:

fields @timestamp, @message
| filter @message like /correlation_id|X-Correlation-ID/
| parse @message /"correlation_id":\s*"(?<correlationId>[^"]+)"/
| limit 100
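The same extraction works locally on a captured log line with sed, mirroring the Insights `parse` pattern (the helper name is ours; it assumes the structured-log field is literally `correlation_id`):

```shell
# extract_correlation_id: pull the correlation_id value out of JSON log
# lines, mirroring the Insights parse pattern above.
extract_correlation_id() {
  sed -n 's/.*"correlation_id" *: *"\([^"]*\)".*/\1/p'
}

echo '{"level":"ERROR","correlation_id":"3f2a9c","msg":"backtest failed"}' \
  | extract_correlation_id
# prints: 3f2a9c
```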

Trace a Request Across Services

Given a correlation ID, trace through all services:

Step 1: Backend logs:

-- Log group: /ecs/tradai/{env}/services (filter by backend-api stream prefix)
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 2: Strategy Service logs:

-- Log group: /ecs/tradai/{env}/services (filter by strategy-service stream prefix)
fields @timestamp, @message, @logStream
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc

Step 3: Lambda logs (if applicable):

-- Log group: /aws/lambda/tradai-sqs-consumer-{env}
fields @timestamp, @message, @requestId
| filter @message like /YOUR_CORRELATION_ID/
| sort @timestamp asc
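The three steps above can be scripted as a loop over log groups. A sketch that only prints the commands to run (the function name is ours, and the group list assumes the dev environment; extend it with whichever Lambda groups apply):

```shell
# trace_correlation: print a filter-log-events command for each log group
# the correlation ID may have passed through (printed, not executed).
trace_correlation() {
  cid="$1"
  for group in \
      "/ecs/tradai/dev/services" \
      "/aws/lambda/tradai-sqs-consumer-dev"; do
    echo "aws logs filter-log-events --log-group-name $group --filter-pattern '$cid'"
  done
}

trace_correlation "3f2a9c"
```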


Common Error Patterns

Connection Errors

Pattern: ConnectionError, ConnectionRefusedError

fields @timestamp, @message, @logStream
| filter @message like /ConnectionError|ConnectionRefused|ECONNREFUSED/
| parse @message /connecting to (?<target>[^\s]+)/
| stats count(*) by target

Common causes:

  • Service not running
  • Security group blocking traffic
  • VPC endpoint not available
  • DNS resolution failure

Timeout Errors

Pattern: TimeoutError, ReadTimeout

fields @timestamp, @message
| filter @message like /TimeoutError|ReadTimeout|timeout|Timeout/
| parse @message /timeout after (?<seconds>[\d.]+)/
| stats count(*) by bin(5m)

Common causes:

  • Slow downstream service
  • NAT instance issues
  • Database overloaded
  • Network congestion

Validation Errors

Pattern: ValidationError, Invalid

fields @timestamp, @message
| filter @message like /ValidationError|Invalid|validation failed/
| sort @timestamp desc
| limit 100

Authentication Errors

Pattern: 401, 403, Unauthorized

fields @timestamp, @message
| filter @message like /401|403|Unauthorized|Forbidden|AuthenticationError/
| sort @timestamp desc
| limit 100


Log Tailing (Real-time)

AWS CLI

# Tail ECS service logs (all services share one log group)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow

# Tail Lambda logs
aws logs tail /aws/lambda/tradai-health-check-${ENVIRONMENT} --follow

# Filter while tailing (filter by service stream prefix)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --filter-pattern "ERROR" \
  --log-stream-name-prefix backend-api

# Tail with timestamp format
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --format short

Multi-Service Tailing

# Terminal 1 (backend logs)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --log-stream-name-prefix backend-api

# Terminal 2 (strategy service logs)
aws logs tail /ecs/tradai/${ENVIRONMENT}/services --follow \
  --log-stream-name-prefix strategy-service

# Terminal 3
aws logs tail /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} --follow
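The per-terminal commands above can be wrapped in a one-line helper (`tail_service` is our name, not part of the runbook; with `DRY_RUN=1` it prints the command instead of invoking the AWS CLI):

```shell
# tail_service SERVICE [ENV]: tail one ECS service's stream within the
# shared log group. ENV defaults to $ENVIRONMENT, then "dev".
tail_service() {
  svc="$1"; env="${2:-${ENVIRONMENT:-dev}}"
  cmd="aws logs tail /ecs/tradai/$env/services --follow --log-stream-name-prefix $svc"
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$cmd"; else $cmd; fi
}

# Usage: one terminal per service.
DRY_RUN=1 tail_service backend-api staging
```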

Exporting Logs for Analysis

Export to S3

# Note: 'date -v-24H' is BSD/macOS syntax; on GNU/Linux use: date -u -d '24 hours ago' +%s
aws logs create-export-task \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --from $(date -u -v-24H +%s)000 \
  --to $(date -u +%s)000 \
  --destination tradai-logs-${ENVIRONMENT} \
  --destination-prefix exports/services

Download specific time range

aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --log-stream-name-prefix backend-api \
  --start-time $(date -u -v-1H +%s)000 \
  --end-time $(date -u +%s)000 \
  --output json > backend-logs-$(date +%Y%m%d-%H%M).json

Log Level Reference

| Level | Use Case | Investigation Priority |
| --- | --- | --- |
| DEBUG | Detailed diagnostics | Low - enable temporarily |
| INFO | Key operations, requests | Normal - standard monitoring |
| WARNING | Recoverable issues | Medium - investigate patterns |
| ERROR | Operation failures | High - immediate investigation |
| CRITICAL | System failures | Critical - immediate action |

Change Log Level (per service)

# Roll the service over to a task definition revision that sets LOG_LEVEL=DEBUG
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-api-${ENVIRONMENT} \
  --task-definition tradai-backend-api-${ENVIRONMENT}:NEW_REVISION \
  --force-new-deployment
# NEW_REVISION must already be registered with LOG_LEVEL=DEBUG in its
# container environment

Investigation Workflow

Step-by-Step Process

  1. Identify the symptom:
     • Error message, alert, or user report

  2. Get the time range:
     • When did it start?
     • Is it ongoing or resolved?

  3. Find the affected service:
     • Which service logs to check first?
     • Check health endpoints

  4. Search for errors:

     fields @timestamp, @message
     | filter @message like /ERROR|Exception/
     | filter @timestamp >= "YYYY-MM-DDTHH:MM:SSZ"
     | sort @timestamp asc
     | limit 100

  5. Get the correlation ID:
     • From the error log or API response

  6. Trace across services:
     • Follow the correlation ID through the log groups listed above

  7. Identify the root cause:
     • Downstream failure?
     • Configuration issue?
     • Resource exhaustion?

  8. Resolve and verify:
     • Apply the fix
     • Verify logs are clean
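The time range from step 2 feeds directly into the `--start-time` arguments of the CLI commands below. A small epoch-milliseconds helper sidesteps the BSD/GNU `date` flag divergence (the function name is ours):

```shell
# minutes_ago_ms: epoch milliseconds for N minutes ago, using only
# 'date -u +%s' (portable across BSD and GNU date, unlike -v / -d).
minutes_ago_ms() {
  echo $(( ( $(date -u +%s) - $1 * 60 ) * 1000 ))
}

# Example: --start-time for "last 15 minutes"
start=$(minutes_ago_ms 15)
echo "start-time: $start"
```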

Quick Reference Commands

# Recent errors (last 15 min, all services)
aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --start-time $(date -u -v-15M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].message' \
  --output text

# Count errors by log stream (identifies which service)
aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].logStreamName' \
  --output text | sort | uniq -c | sort -rn

# Get log streams (to find specific task)
aws logs describe-log-streams \
  --log-group-name /ecs/tradai/${ENVIRONMENT}/services \
  --order-by LastEventTime \
  --descending \
  --limit 5

Verification Checklist

After log investigation:

  • Root cause identified
  • Affected time range documented
  • All impacted requests/users identified
  • Fix applied and verified
  • No new errors in logs (last 15 minutes)
  • Incident timeline documented
  • If recurring, alert/monitoring added