Debug Workflows¶
Step-by-step debugging procedures for common scenarios.
General Debug Methodology¶
5-Step Debug Process¶
1. Reproduce - Can you consistently trigger the issue?
2. Isolate - Which component is failing?
3. Investigate - Check logs, metrics, and state
4. Resolve - Apply the fix
5. Verify - Confirm the fix works and nothing has regressed
Key Diagnostic Tools¶
# Health checks
just check-health # All services
curl http://localhost:8000/api/v1/health | jq # Single service
# Logs
docker compose logs -f backend # Local
aws logs tail /ecs/tradai-backend-dev --follow # AWS
# Metrics
aws cloudwatch get-metric-data ... # CloudWatch
Backtest Debugging¶
Backtest Lifecycle¶
1. Submit → Backend → SQS Queue
2. SQS → Lambda Consumer → ECS Task
3. ECS Task → Freqtrade → S3 Results
4. DynamoDB → Status Update → Backend
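If a job never makes it past submission, it is worth checking the queue between the backend and the Lambda consumer before chasing ECS. A minimal sketch; the queue name tradai-${ENVIRONMENT}-backtest-jobs is an assumption, substitute the real one:
# Check queue depth and in-flight messages (queue name is an assumption, adjust to yours)
QUEUE_URL=$(aws sqs get-queue-url --queue-name tradai-${ENVIRONMENT}-backtest-jobs --query 'QueueUrl' --output text)
aws sqs get-queue-attributes \
--queue-url $QUEUE_URL \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible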
Debug Flow¶
Step 1: Check job status
JOB_ID="bt-abc123"
# Get status from API
curl http://localhost:8000/api/v1/backtests/${JOB_ID} | jq
# Check DynamoDB directly
aws dynamodb get-item \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--key '{"job_id": {"S": "'${JOB_ID}'"}}' \
--query 'Item' | jq
Step 2: Find the ECS task
# List running tasks
aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --family tradai-strategy-${ENVIRONMENT}
# Get task details
aws ecs describe-tasks \
--cluster tradai-${ENVIRONMENT} \
--tasks $TASK_ARN
Step 3: Check task logs
# Get the container runtime ID (used here as the log stream prefix)
CONTAINER_ID=$(aws ecs describe-tasks \
--cluster tradai-${ENVIRONMENT} \
--tasks $TASK_ARN \
--query 'tasks[0].containers[0].runtimeId' \
--output text)
# Tail logs
aws logs tail /ecs/tradai-strategy-${ENVIRONMENT} \
--log-stream-name-prefix $CONTAINER_ID \
--follow
Step 4: Check for common failures
# OOM (out of memory); "?a ?b" matches events containing either term
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "?MemoryError ?OutOfMemory" \
--limit 20
# Data errors
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern '?"insufficient data" ?"no data"' \
--limit 20
# Strategy errors
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "?ERROR ?Exception" \
--limit 50
Step 5: Check S3 results
# List results for job
aws s3 ls s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/
# Download result
aws s3 cp s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/backtest-result.json .
Common Backtest Issues¶
| Symptom | Cause | Resolution |
|---|---|---|
| Task never starts | SQS consumer failed | Check Lambda logs |
| Task stops immediately | Container crash | Check OOM, missing config |
| Runs but no results | Freqtrade error | Check strategy logs |
| Stuck in running | Task hung | Force stop, investigate |
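For the "stuck in running" case, stop the task explicitly before investigating; the stop reason is recorded and shows up in describe-tasks, which keeps the audit trail clear:
# Force-stop a hung backtest task
aws ecs stop-task \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--reason "hung backtest ${JOB_ID}, stopped for investigation"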
Memory Issues¶
Diagnosing OOM¶
# Check ECS task memory
aws ecs describe-task-definition \
--task-definition tradai-strategy-${ENVIRONMENT} \
--query 'taskDefinition.containerDefinitions[0].memory'
# Check CloudWatch metrics
# (date -v-1H is BSD/macOS syntax; on Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Maximum
# Check for OOM killer
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "OOM" \
--limit 20
Common Memory Issues¶
| Service | Typical Usage | Increase To |
|---|---|---|
| Backend | 512 MB | 1024 MB |
| Strategy Task | 1024 MB | 2048 MB |
| Data Collection | 512 MB | 1024 MB |
Resolution¶
# Update task definition with more memory
# (Requires new task definition revision)
# Or reduce workload
# - Fewer pairs per backtest
# - Shorter date range
# - Simpler indicators
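A manual sketch of the first option; in practice the task definition is usually owned by the infrastructure code, so prefer raising the memory value there and redeploying. The taskdef.json file name is just an example:
# Export the current definition, raise containerDefinitions[0].memory, re-register
aws ecs describe-task-definition \
--task-definition tradai-strategy-${ENVIRONMENT} \
--query 'taskDefinition' > taskdef.json
# Edit taskdef.json, then remove read-only fields (taskDefinitionArn, revision, status,
# requiresAttributes, compatibilities, registeredAt, registeredBy) before re-registering
aws ecs register-task-definition --cli-input-json file://taskdef.json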
Request Tracing¶
Using Correlation IDs¶
Every request gets an X-Correlation-ID header that propagates through all services.
Step 1: Get correlation ID
# From response headers
curl -v http://localhost:8000/api/v1/backtests 2>&1 | grep -i correlation
# Example: X-Correlation-ID: abc-123-def-456
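If the backend accepts an inbound X-Correlation-ID (typical for this kind of middleware, but treat it as an assumption), setting your own value up front saves fishing it out of response headers:
# Tag the request with a known correlation ID so the log searches below are trivial
CORR_ID="debug-$(date +%s)"
curl -H "X-Correlation-ID: ${CORR_ID}" http://localhost:8000/api/v1/backtests | jq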
Step 2: Search logs
CORR_ID="abc-123-def-456"
# Backend logs
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
# Strategy service logs
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-service-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
# Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
Step 3: Build timeline
# CloudWatch Logs Insights query
fields @timestamp, @message, @logStream
| filter @message like /abc-123-def-456/
| sort @timestamp asc
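The same query can be run from the CLI instead of the console. A sketch against the backend and strategy-service log groups used above; start-query returns an ID that is polled with get-query-results:
# Run the Logs Insights query from the CLI (results are ready once the status is Complete)
QUERY_ID=$(aws logs start-query \
--log-group-names /ecs/tradai-backend-${ENVIRONMENT} /ecs/tradai-strategy-service-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message, @logStream | filter @message like /abc-123-def-456/ | sort @timestamp asc' \
--query 'queryId' --output text)
sleep 10
aws logs get-query-results --query-id $QUERY_ID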
Service Health Debugging¶
Health Check Flow¶
1. /api/v1/health → Backend
2. Backend → Strategy Service health
3. Backend → Data Collection health
4. Backend → MLflow health
5. Backend → DynamoDB connectivity
Debug Unhealthy Service¶
Step 1: Check health endpoint
# Full health response
curl http://localhost:8000/api/v1/health | jq
# Example response with failure
{
  "status": "unhealthy",
  "dependencies": {
    "strategy_service": "unhealthy",
    "data_collection": "healthy",
    "mlflow": "healthy"
  }
}
Step 2: Check failing dependency
# Direct check to strategy service
curl http://localhost:8003/api/v1/health | jq
# Check container
docker compose ps strategy-service
docker compose logs strategy-service --tail 100
Step 3: Common health failures
| Dependency | Failure Mode | Resolution |
|---|---|---|
| strategy_service | Connection refused | Restart service |
| data_collection | Timeout | Check ArcticDB |
| mlflow | 503 | Check RDS connectivity |
| dynamodb | Access denied | Check IAM permissions |
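For the restart resolutions, the quick fix depends on where the service is running:
# Local: restart only the failing container
docker compose restart strategy-service
# AWS: roll the ECS service (same pattern as the force-new-deployment command further down)
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-strategy-service-${ENVIRONMENT} \
--force-new-deployment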
Lambda Debugging¶
Lambda Debug Flow¶
Step 1: Check recent invocations
FUNCTION="tradai-health-check-${ENVIRONMENT}"
# Get recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/${FUNCTION} \
--filter-pattern "ERROR" \
--limit 20
Step 2: Check function config
aws lambda get-function-configuration \
--function-name ${FUNCTION} \
--query '{Timeout:Timeout,Memory:MemorySize,Env:Environment.Variables}'
Step 3: Test manually
# Invoke with an empty test payload
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
--function-name ${FUNCTION} \
--cli-binary-format raw-in-base64-out \
--payload '{}' \
--log-type Tail \
--query 'LogResult' --output text \
response.json | base64 -d
# The function's return value is written to response.json
cat response.json | jq
Step 4: Common Lambda issues
| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long operation | Increase timeout |
| Module not found | Missing dependency | Check container image |
| Access denied | IAM permissions | Update role policy |
| Connection timeout | VPC/NAT issues | Check security groups |
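For the timeout and memory rows, the function configuration can be raised directly; the values below are illustrative, not recommendations:
# Raise timeout (seconds) and memory (MB) for the function
aws lambda update-function-configuration \
--function-name ${FUNCTION} \
--timeout 300 \
--memory-size 1024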
Database Debugging¶
RDS Debug Flow¶
Step 1: Check connectivity
# From ECS container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container mlflow \
--command "nc -zv $RDS_ENDPOINT 5432"
Step 2: Check RDS status
aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint,AZ:AvailabilityZone}'
Step 3: Check connection count
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Maximum
DynamoDB Debug Flow¶
Step 1: Check table status
aws dynamodb describe-table \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--query 'Table.{Status:TableStatus,ItemCount:ItemCount}'
Step 2: Query specific item
aws dynamodb get-item \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--key '{"job_id": {"S": "bt-abc123"}}' \
--query 'Item'
Step 3: Check for throttling
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Sum
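If throttling shows up on a provisioned-capacity table, either raise the provisioned throughput or switch the table to on-demand billing; whether on-demand suits this table is a cost decision, so treat the command below as an option rather than the fix:
# Switch the table to on-demand capacity (alternative: raise provisioned RCU/WCU)
aws dynamodb update-table \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--billing-mode PAY_PER_REQUEST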
Quick Debug Commands¶
# === Local Development ===
# Restart everything
docker compose down && docker compose up -d
# Check all logs
docker compose logs -f
# Shell into container
docker compose exec backend bash
# === AWS ===
# Force new deployment
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-backend-${ENVIRONMENT} \
--force-new-deployment
# Shell into running task
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "/bin/bash"
# Quick status check: desired vs running task count per service
for svc in backend strategy-service data-collection mlflow; do
echo "=== $svc ==="
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${svc}-${ENVIRONMENT} \
--query 'services[0].{desired:desiredCount,running:runningCount}'
done