Debug Workflows

Step-by-step debugging procedures for common scenarios.

General Debug Methodology

5-Step Debug Process

  1. Reproduce - Can you consistently trigger the issue?
  2. Isolate - Which component is failing?
  3. Investigate - Check logs, metrics, state
  4. Resolve - Apply fix
  5. Verify - Confirm fix works, no regressions

Key Diagnostic Tools

# Health checks
just check-health                      # All services
curl http://localhost:8000/api/v1/health | jq  # Single service

# Logs
docker compose logs -f backend         # Local
aws logs tail /ecs/tradai-backend-dev --follow  # AWS

# Metrics
aws cloudwatch get-metric-data ...     # CloudWatch

Backtest Debugging

Backtest Lifecycle

1. Submit → Backend → SQS Queue
2. SQS → Lambda Consumer → ECS Task
3. ECS Task → Freqtrade → S3 Results
4. DynamoDB → Status Update → Backend

Debug Flow

Step 1: Check job status

JOB_ID="bt-abc123"
ENVIRONMENT="dev"   # environment suffix used in AWS resource names

# Get status from API
curl -s "http://localhost:8000/api/v1/backtests/${JOB_ID}" | jq

# Check DynamoDB directly
aws dynamodb get-item \
  --table-name "tradai-${ENVIRONMENT}-workflow-state" \
  --key "{\"job_id\": {\"S\": \"${JOB_ID}\"}}" \
  --query 'Item' | jq
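
Hand-assembling the key JSON with nested quotes is easy to get wrong. The following is a minimal sketch of a helper that builds it. The name ddb_string_key is hypothetical, and the sketch assumes the value contains no characters that need JSON escaping.

```shell
# ddb_string_key ATTR VALUE
# Prints a DynamoDB key document for a single string (S) attribute.
# Hypothetical helper; assumes VALUE needs no JSON escaping.
ddb_string_key() {
  printf '{"%s": {"S": "%s"}}\n' "$1" "$2"
}

# Usage (requires AWS credentials, so it is left commented out):
# aws dynamodb get-item \
#   --table-name "tradai-${ENVIRONMENT}-workflow-state" \
#   --key "$(ddb_string_key job_id "$JOB_ID")" \
#   --query 'Item' | jq
```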

Step 2: Find the ECS task

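# List running tasks (add --desired-status STOPPED to find tasks that already exited)
aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --family tradai-strategy-${ENVIRONMENT}

# Get task details (set TASK_ARN to one of the ARNs returned above)
aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks $TASK_ARN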

Step 3: Check task logs

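# Get the container runtime ID from the task
CONTAINER_ID=$(aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks $TASK_ARN \
  --query 'tasks[0].containers[0].runtimeId' \
  --output text)

# Tail logs (streams are usually named <prefix>/<container>/<task-id>;
# adjust the prefix filter if nothing matches)
aws logs tail /ecs/tradai-strategy-${ENVIRONMENT} \
  --log-stream-name-prefix $CONTAINER_ID \
  --follow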
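
With the awslogs log driver, ECS names each stream <stream-prefix>/<container-name>/<task-id>, so the task ID taken from the ARN is often a more reliable filter than the runtime ID. The following is a minimal sketch of that construction. The helper names are hypothetical, and the stream prefix "ecs" and container name "strategy" in the usage lines are assumptions; the real values are in the task definition's logConfiguration.

```shell
# task_id_from_arn ARN
# Prints the trailing task ID of an ECS task ARN.
task_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# log_stream_for_task PREFIX CONTAINER ARN
# Prints the awslogs stream name <prefix>/<container>/<task-id>.
log_stream_for_task() {
  printf '%s/%s/%s\n' "$1" "$2" "$(task_id_from_arn "$3")"
}

# Usage (prefix "ecs" and container "strategy" are assumed values):
# aws logs tail "/ecs/tradai-strategy-${ENVIRONMENT}" \
#   --log-stream-names "$(log_stream_for_task ecs strategy "$TASK_ARN")" \
#   --follow
```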

Step 4: Check for common failures

# OOM (out of memory)
# CloudWatch filter patterns match ANY term only when each term is prefixed with "?"
aws logs filter-log-events \
  --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
  --filter-pattern '?MemoryError ?OutOfMemory' \
  --limit 20

# Data errors (multi-word phrases must be double-quoted inside the pattern)
aws logs filter-log-events \
  --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
  --filter-pattern '?"insufficient data" ?"no data"' \
  --limit 20

# Strategy errors
aws logs filter-log-events \
  --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
  --filter-pattern '?ERROR ?Exception' \
  --limit 50
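
CloudWatch Logs treats space-separated terms in a filter pattern as AND, and matches any of several terms only when each one is prefixed with "?". Such patterns can be generated instead of typed by hand. The following is a small sketch; cw_any_of is a hypothetical name, every term is double-quoted so that phrases containing spaces also work, and terms are assumed to contain no double quotes.

```shell
# cw_any_of TERM...
# Prints a CloudWatch Logs filter pattern that matches events containing
# at least one of the given terms. The body runs in a subshell so the
# loop variables do not leak into the caller.
cw_any_of() (
  pattern=""
  for term in "$@"; do
    pattern="${pattern:+$pattern }?\"$term\""
  done
  printf '%s\n' "$pattern"
)

# Usage (requires AWS credentials):
# aws logs filter-log-events \
#   --log-group-name "/ecs/tradai-strategy-${ENVIRONMENT}" \
#   --filter-pattern "$(cw_any_of "insufficient data" "no data")" \
#   --limit 20
```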

Step 5: Check S3 results

# List results for job
aws s3 ls s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/

# Download result
aws s3 cp s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/backtest-result.json .

Common Backtest Issues

| Symptom | Cause | Resolution |
|---|---|---|
| Task never starts | SQS consumer failed | Check Lambda logs |
| Task stops immediately | Container crash | Check OOM, missing config |
| Runs but no results | Freqtrade error | Check strategy logs |
| Stuck in running | Task hung | Force stop, investigate |
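
For the "Stuck in running" case, a force stop might look like the sketch below. aws ecs stop-task is the real CLI call; the wrapper name force_stop_task and the default reason text are illustrative. Saving the output of describe-tasks before stopping helps preserve state for the follow-up investigation.

```shell
# force_stop_task TASK_ARN [REASON]
# Stops one ECS task in the tradai-${ENVIRONMENT} cluster.
# Hypothetical wrapper; ENVIRONMENT must be set by the caller.
force_stop_task() {
  aws ecs stop-task \
    --cluster "tradai-${ENVIRONMENT}" \
    --task "$1" \
    --reason "${2:-Manual stop: backtest hung}"
}

# Usage (requires AWS credentials):
# aws ecs describe-tasks --cluster "tradai-${ENVIRONMENT}" --tasks "$TASK_ARN" > task-before-stop.json
# force_stop_task "$TASK_ARN"
```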

Memory Issues

Diagnosing OOM

# Check ECS task memory
aws ecs describe-task-definition \
  --task-definition tradai-strategy-${ENVIRONMENT} \
  --query 'taskDefinition.containerDefinitions[0].memory'

# Check CloudWatch metrics
# (date -v is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Maximum

# Check for OOM killer
aws logs filter-log-events \
  --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
  --filter-pattern "OOM" \
  --limit 20
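
To judge headroom, the MemoryUtilized maximum can be compared with the memory configured in the task definition. The following is a minimal sketch of that arithmetic. The helper names and the 90% warning threshold are assumptions, and both inputs must be whole numbers of MB (round any fractional CloudWatch value first).

```shell
# mem_pct USED_MB LIMIT_MB
# Prints the integer percentage of the limit in use (rounded down).
mem_pct() {
  echo $(( $1 * 100 / $2 ))
}

# mem_verdict USED_MB LIMIT_MB [THRESHOLD_PCT]
# Prints "near-limit" when usage reaches the threshold (default 90), else "ok".
mem_verdict() {
  if [ "$(mem_pct "$1" "$2")" -ge "${3:-90}" ]; then
    echo "near-limit"
  else
    echo "ok"
  fi
}
```

A task that peaks near its limit is a candidate for the larger size listed in the table below, or for a smaller workload.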

Common Memory Issues

| Service | Typical Usage | Increase To |
|---|---|---|
| Backend | 512 MB | 1024 MB |
| Strategy Task | 1024 MB | 2048 MB |
| Data Collection | 512 MB | 1024 MB |

Resolution

# Update task definition with more memory
# (Requires new task definition revision)

# Or reduce workload
# - Fewer pairs per backtest
# - Shorter date range
# - Simpler indicators

Request Tracing

Using Correlation IDs

Every request gets an X-Correlation-ID header that propagates through all services.

Step 1: Get correlation ID

# From response headers
curl -v http://localhost:8000/api/v1/backtests 2>&1 | grep -i correlation

# Example: X-Correlation-ID: abc-123-def-456

Step 2: Search logs

CORR_ID="abc-123-def-456"

# Backend logs
# (the ID contains hyphens, so it must be double-quoted inside the filter pattern)
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --filter-pattern "\"${CORR_ID}\"" \
  --limit 100

# Strategy service logs
aws logs filter-log-events \
  --log-group-name /ecs/tradai-strategy-service-${ENVIRONMENT} \
  --filter-pattern "\"${CORR_ID}\"" \
  --limit 100

# Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} \
  --filter-pattern "\"${CORR_ID}\"" \
  --limit 100
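
Because the same search runs against three log groups, it can be wrapped in a loop. The following is a sketch with hypothetical function names; the log group names are the ones used in the commands above.

```shell
# corr_log_groups ENV
# Prints the log groups searched for a correlation ID, one per line.
corr_log_groups() {
  printf '%s\n' \
    "/ecs/tradai-backend-$1" \
    "/ecs/tradai-strategy-service-$1" \
    "/aws/lambda/tradai-sqs-consumer-$1"
}

# search_corr_id CORR_ID ENV
# Runs the same filter against every group (requires AWS credentials).
search_corr_id() {
  corr_log_groups "$2" | while IFS= read -r group; do
    echo "=== $group ==="
    aws logs filter-log-events \
      --log-group-name "$group" \
      --filter-pattern "\"$1\"" \
      --limit 100
  done
}

# Usage:
# search_corr_id "$CORR_ID" "$ENVIRONMENT"
```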

Step 3: Build timeline

# CloudWatch Logs Insights query
fields @timestamp, @message, @logStream
| filter @message like /abc-123-def-456/
| sort @timestamp asc
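
The same query can also be run from the CLI across several log groups at once. The following is a minimal sketch: insights_query is a hypothetical helper that fills in the correlation ID, and the one-hour window and the choice of log groups in the usage lines are assumptions.

```shell
# insights_query CORR_ID
# Prints a Logs Insights query that returns all events mentioning the ID,
# oldest first. Uses a quoted substring match instead of a regex.
insights_query() {
  printf 'fields @timestamp, @message, @logStream | filter @message like "%s" | sort @timestamp asc\n' "$1"
}

# Usage (requires AWS credentials; start and end times are epoch seconds):
# QUERY_ID=$(aws logs start-query \
#   --log-group-names "/ecs/tradai-backend-${ENVIRONMENT}" "/aws/lambda/tradai-sqs-consumer-${ENVIRONMENT}" \
#   --start-time "$(( $(date +%s) - 3600 ))" \
#   --end-time "$(date +%s)" \
#   --query-string "$(insights_query "$CORR_ID")" \
#   --query 'queryId' --output text)
# aws logs get-query-results --query-id "$QUERY_ID"
```

Queries run asynchronously, so get-query-results may need to be repeated until the reported status is Complete.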


Service Health Debugging

Health Check Flow

1. /api/v1/health → Backend
2. Backend → Strategy Service health
3. Backend → Data Collection health
4. Backend → MLflow health
5. Backend → DynamoDB connectivity

Debug Unhealthy Service

Step 1: Check health endpoint

# Full health response
curl http://localhost:8000/api/v1/health | jq

# Example response with failure
{
  "status": "unhealthy",
  "dependencies": {
    "strategy_service": "unhealthy",
    "data_collection": "healthy",
    "mlflow": "healthy"
  }
}
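
To pull out only the failing dependencies from a response shaped like the example above, a jq filter is enough. The following is a sketch that assumes the "dependencies" object maps each name to a status string, as in the example; failing_deps is a hypothetical name.

```shell
# failing_deps
# Reads a health response (JSON) on stdin and prints the name of every
# dependency whose status is anything other than "healthy".
failing_deps() {
  jq -r '.dependencies | to_entries[] | select(.value != "healthy") | .key'
}

# Usage:
# curl -s http://localhost:8000/api/v1/health | failing_deps
```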

Step 2: Check failing dependency

# Direct check to strategy service
curl http://localhost:8003/api/v1/health | jq

# Check container
docker compose ps strategy-service
docker compose logs strategy-service --tail 100

Step 3: Common health failures

| Dependency | Failure Mode | Resolution |
|---|---|---|
| strategy_service | Connection refused | Restart service |
| data_collection | Timeout | Check ArcticDB |
| mlflow | 503 | Check RDS connectivity |
| dynamodb | Access denied | Check IAM permissions |

Lambda Debugging

Lambda Debug Flow

Step 1: Check recent invocations

FUNCTION="tradai-health-check-${ENVIRONMENT}"

# Get recent errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/${FUNCTION} \
  --filter-pattern "ERROR" \
  --limit 20

Step 2: Check function config

aws lambda get-function-configuration \
  --function-name ${FUNCTION} \
  --query '{Timeout:Timeout,Memory:MemorySize,Env:Environment.Variables}'

Step 3: Test manually

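# Invoke with test payload and decode the log tail.
# LogResult is part of the CLI output, not response.json, so it is extracted
# with --query. AWS CLI v2 needs --cli-binary-format to accept raw JSON.
aws lambda invoke \
  --function-name ${FUNCTION} \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  --log-type Tail \
  --query 'LogResult' --output text \
  response.json | base64 -d

# Inspect the function's return value
jq . response.json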

Step 4: Common Lambda issues

| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long operation | Increase timeout |
| Module not found | Missing dependency | Check container image |
| Access denied | IAM permissions | Update role policy |
| Connection timeout | VPC/NAT issues | Check security groups |

Database Debugging

RDS Debug Flow

Step 1: Check connectivity

# From ECS container (ECS Exec only supports interactive sessions)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container mlflow \
  --interactive \
  --command "nc -zv $RDS_ENDPOINT 5432"

Step 2: Check RDS status

aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint,AZ:AvailabilityZone}'

Step 3: Check connection count

# date -v is BSD/macOS syntax; on GNU/Linux use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Maximum

DynamoDB Debug Flow

Step 1: Check table status

aws dynamodb describe-table \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --query 'Table.{Status:TableStatus,ItemCount:ItemCount}'

Step 2: Query specific item

aws dynamodb get-item \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --key '{"job_id": {"S": "bt-abc123"}}' \
  --query 'Item'

Step 3: Check for throttling

# date -v is BSD/macOS syntax; on GNU/Linux use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Sum
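
The -v flag in these --start-time arguments exists only in BSD/macOS date. A portable helper can hide the difference. The following is a sketch: minutes_ago_utc is a hypothetical name, and it detects GNU date by checking whether date accepts --version.

```shell
# minutes_ago_utc N
# Prints the UTC time N minutes ago as YYYY-MM-DDTHH:MM:SSZ.
minutes_ago_utc() {
  if date --version >/dev/null 2>&1; then
    date -u -d "$1 minutes ago" +%Y-%m-%dT%H:%M:%SZ   # GNU coreutils
  else
    date -u -v-"$1"M +%Y-%m-%dT%H:%M:%SZ              # BSD/macOS
  fi
}

# Usage (requires AWS credentials):
# aws cloudwatch get-metric-statistics \
#   --namespace AWS/DynamoDB \
#   --metric-name ThrottledRequests \
#   --dimensions Name=TableName,Value="tradai-${ENVIRONMENT}-workflow-state" \
#   --start-time "$(minutes_ago_utc 15)" \
#   --end-time "$(minutes_ago_utc 0)" \
#   --period 60 --statistics Sum
```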


Quick Debug Commands

# === Local Development ===
# Restart everything
docker compose down && docker compose up -d

# Check all logs
docker compose logs -f

# Shell into container
docker compose exec backend bash

# === AWS ===
# Force new deployment
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --force-new-deployment

# Shell into running task
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend \
  --interactive \
  --command "/bin/bash"

# Quick health check
for svc in backend strategy-service data-collection mlflow; do
  echo "=== $svc ==="
  aws ecs describe-services \
    --cluster tradai-${ENVIRONMENT} \
    --services tradai-${svc}-${ENVIRONMENT} \
    --query 'services[0].runningCount'
done
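
A running count alone does not show whether a service is short of tasks; comparing it with the desired count does. The following is a small sketch of that comparison. The function name and the status labels are hypothetical.

```shell
# svc_state RUNNING DESIRED
# Prints OK when at least the desired number of tasks is running,
# SCALED-TO-ZERO when no tasks are desired, and DEGRADED otherwise.
svc_state() {
  if [ "$2" -eq 0 ]; then
    echo "SCALED-TO-ZERO"
  elif [ "$1" -ge "$2" ]; then
    echo "OK"
  else
    echo "DEGRADED"
  fi
}

# Usage (requires AWS credentials; the query returns "running<TAB>desired"):
# set -- $(aws ecs describe-services \
#   --cluster "tradai-${ENVIRONMENT}" \
#   --services "tradai-backend-${ENVIRONMENT}" \
#   --query 'services[0].[runningCount,desiredCount]' --output text)
# svc_state "$1" "$2"
```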