Debug Workflows¶
Step-by-step debugging procedures for common scenarios.
General Debug Methodology¶
5-Step Debug Process¶
1. Reproduce - Can you consistently trigger the issue?
2. Isolate - Which component is failing?
3. Investigate - Check logs, metrics, and state
4. Resolve - Apply the fix
5. Verify - Confirm the fix works and nothing has regressed
Key Diagnostic Tools¶
# Health checks
just check-health # All services
curl http://localhost:8000/api/v1/health | jq # Single service
# Logs
docker compose logs -f backend # Local
aws logs tail /ecs/tradai-backend-dev --follow # AWS
# Metrics
aws cloudwatch get-metric-data ... # CloudWatch
Backtest Debugging¶
Backtest Lifecycle¶
1. Submit → Backend → SQS Queue
2. SQS → Lambda Consumer → ECS Task
3. ECS Task → Freqtrade → S3 Results
4. DynamoDB → Status Update → Backend
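If a job never makes it past submission, it is worth checking the queue between the backend and the Lambda consumer before chasing ECS. A minimal sketch; the queue name tradai-${ENVIRONMENT}-backtest-jobs is an assumption, substitute the real one:
# Check queue depth and in-flight messages (queue name is an assumption, adjust to yours)
QUEUE_URL=$(aws sqs get-queue-url --queue-name tradai-${ENVIRONMENT}-backtest-jobs --query 'QueueUrl' --output text)
aws sqs get-queue-attributes \
--queue-url $QUEUE_URL \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible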
Debug Flow¶
Step 1: Check job status
JOB_ID="bt-abc123"
# Get status from API
curl http://localhost:8000/api/v1/backtests/${JOB_ID} | jq
# Check DynamoDB directly
aws dynamodb get-item \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--key '{"job_id": {"S": "'${JOB_ID}'"}}' \
--query 'Item' | jq
Step 2: Find the ECS task
# List running tasks
aws ecs list-tasks --cluster tradai-${ENVIRONMENT} --family tradai-strategy-${ENVIRONMENT}
# Get task details
aws ecs describe-tasks \
--cluster tradai-${ENVIRONMENT} \
--tasks $TASK_ARN
Step 3: Check task logs
# Get the container runtime ID (used here as the log stream prefix)
CONTAINER_ID=$(aws ecs describe-tasks \
--cluster tradai-${ENVIRONMENT} \
--tasks $TASK_ARN \
--query 'tasks[0].containers[0].runtimeId' \
--output text)
# Tail logs
aws logs tail /ecs/tradai-strategy-${ENVIRONMENT} \
--log-stream-name-prefix $CONTAINER_ID \
--follow
Step 4: Check for common failures
# OOM (out of memory); "?a ?b" matches events containing either term
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "?MemoryError ?OutOfMemory" \
--limit 20
# Data errors
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern '?"insufficient data" ?"no data"' \
--limit 20
# Strategy errors
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "?ERROR ?Exception" \
--limit 50
Step 5: Check S3 results
# List results for job
aws s3 ls s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/
# Download result
aws s3 cp s3://tradai-${ENVIRONMENT}-results/backtests/${JOB_ID}/backtest-result.json .
Common Backtest Issues¶
| Symptom | Cause | Resolution |
|---|---|---|
| Task never starts | SQS consumer failed | Check Lambda logs |
| Task stops immediately | Container crash | Check OOM, missing config |
| Runs but no results | Freqtrade error | Check strategy logs |
| Stuck in running | Task hung | Force stop, investigate |
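For the "stuck in running" case, stop the task explicitly before investigating; the stop reason is recorded and shows up in describe-tasks, which keeps the audit trail clear:
# Force-stop a hung backtest task
aws ecs stop-task \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--reason "hung backtest ${JOB_ID}, stopped for investigation"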
Memory Issues¶
Diagnosing OOM¶
# Check ECS task memory
aws ecs describe-task-definition \
--task-definition tradai-strategy-${ENVIRONMENT} \
--query 'taskDefinition.containerDefinitions[0].memory'
# Check CloudWatch metrics
# (date -v-1H is BSD/macOS syntax; on Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Maximum
# Check for OOM killer
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
--filter-pattern "OOM" \
--limit 20
Common Memory Issues¶
| Service | Typical Usage | Increase To |
|---|---|---|
| Backend | 512 MB | 1024 MB |
| Strategy Task | 1024 MB | 2048 MB |
| Data Collection | 512 MB | 1024 MB |
Resolution¶
# Update task definition with more memory
# (Requires new task definition revision)
# Or reduce workload
# - Fewer pairs per backtest
# - Shorter date range
# - Simpler indicators
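A manual sketch of the first option; in practice the task definition is usually owned by the infrastructure code, so prefer raising the memory value there and redeploying. The taskdef.json file name is just an example:
# Export the current definition, raise containerDefinitions[0].memory, re-register
aws ecs describe-task-definition \
--task-definition tradai-strategy-${ENVIRONMENT} \
--query 'taskDefinition' > taskdef.json
# Edit taskdef.json, then remove read-only fields (taskDefinitionArn, revision, status,
# requiresAttributes, compatibilities, registeredAt, registeredBy) before re-registering
aws ecs register-task-definition --cli-input-json file://taskdef.json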
Request Tracing¶
Using Correlation IDs¶
Every request gets an X-Correlation-ID header that propagates through all services.
Step 1: Get correlation ID
# From response headers
curl -v http://localhost:8000/api/v1/backtests 2>&1 | grep -i correlation
# Example: X-Correlation-ID: abc-123-def-456
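If the backend accepts an inbound X-Correlation-ID (typical for this kind of middleware, but treat it as an assumption), setting your own value up front saves fishing it out of response headers:
# Tag the request with a known correlation ID so the log searches below are trivial
CORR_ID="debug-$(date +%s)"
curl -H "X-Correlation-ID: ${CORR_ID}" http://localhost:8000/api/v1/backtests | jq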
Step 2: Search logs
CORR_ID="abc-123-def-456"
# Backend logs
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
# Strategy service logs
aws logs filter-log-events \
--log-group-name /ecs/tradai-strategy-service-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
# Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/tradai-sqs-consumer-${ENVIRONMENT} \
--filter-pattern "${CORR_ID}" \
--limit 100
Step 3: Build timeline
# CloudWatch Logs Insights query
fields @timestamp, @message, @logStream
| filter @message like /abc-123-def-456/
| sort @timestamp asc
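The same query can be run from the CLI instead of the console. A sketch against the backend and strategy-service log groups used above; start-query returns an ID that is polled with get-query-results:
# Run the Logs Insights query from the CLI (results are ready once the status is Complete)
QUERY_ID=$(aws logs start-query \
--log-group-names /ecs/tradai-backend-${ENVIRONMENT} /ecs/tradai-strategy-service-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message, @logStream | filter @message like /abc-123-def-456/ | sort @timestamp asc' \
--query 'queryId' --output text)
sleep 10
aws logs get-query-results --query-id $QUERY_ID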
Service Health Debugging¶
Health Check Flow¶
1. /api/v1/health → Backend
2. Backend → Strategy Service health
3. Backend → Data Collection health
4. Backend → MLflow health
5. Backend → DynamoDB connectivity
Debug Unhealthy Service¶
Step 1: Check health endpoint
# Full health response
curl http://localhost:8000/api/v1/health | jq
# Example response with failure
{
  "status": "unhealthy",
  "dependencies": {
    "strategy_service": "unhealthy",
    "data_collection": "healthy",
    "mlflow": "healthy"
  }
}
Step 2: Check failing dependency
# Direct check to strategy service
curl http://localhost:8003/api/v1/health | jq
# Check container
docker compose ps strategy-service
docker compose logs strategy-service --tail 100
Step 3: Common health failures
| Dependency | Failure Mode | Resolution |
|---|---|---|
| strategy_service | Connection refused | Restart service |
| data_collection | Timeout | Check ArcticDB |
| mlflow | 503 | Check RDS connectivity |
| dynamodb | Access denied | Check IAM permissions |
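For the restart resolutions, the quick fix depends on where the service is running:
# Local: restart only the failing container
docker compose restart strategy-service
# AWS: roll the ECS service (same pattern as the force-new-deployment command further down)
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-strategy-service-${ENVIRONMENT} \
--force-new-deployment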
Lambda Debugging¶
Lambda Debug Flow¶
Step 1: Check recent invocations
FUNCTION="tradai-health-check-${ENVIRONMENT}"
# Get recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/${FUNCTION} \
--filter-pattern "ERROR" \
--limit 20
Step 2: Check function config
aws lambda get-function-configuration \
--function-name ${FUNCTION} \
--query '{Timeout:Timeout,Memory:MemorySize,Env:Environment.Variables}'
Step 3: Test manually
# Invoke with an empty test payload
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
--function-name ${FUNCTION} \
--cli-binary-format raw-in-base64-out \
--payload '{}' \
--log-type Tail \
--query 'LogResult' --output text \
response.json | base64 -d
# The function's return value is written to response.json
cat response.json | jq
Step 4: Common Lambda issues
| Error | Cause | Fix |
|---|---|---|
| Task timed out | Long operation | Increase timeout |
| Module not found | Missing dependency | Check container image |
| Access denied | IAM permissions | Update role policy |
| Connection timeout | VPC/NAT issues | Check security groups |
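For the timeout and memory rows, the function configuration can be raised directly; the values below are illustrative, not recommendations:
# Raise timeout (seconds) and memory (MB) for the function
aws lambda update-function-configuration \
--function-name ${FUNCTION} \
--timeout 300 \
--memory-size 1024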
Database Debugging¶
RDS Debug Flow¶
Step 1: Check connectivity
# From ECS container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container mlflow \
--command "nc -zv $RDS_ENDPOINT 5432"
Step 2: Check RDS status
aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint,AZ:AvailabilityZone}'
Step 3: Check connection count
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Maximum
DynamoDB Debug Flow¶
Step 1: Check table status
aws dynamodb describe-table \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--query 'Table.{Status:TableStatus,ItemCount:ItemCount}'
Step 2: Query specific item
aws dynamodb get-item \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--key '{"job_id": {"S": "bt-abc123"}}' \
--query 'Item'
Step 3: Check for throttling
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Sum
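If throttling shows up on a provisioned-capacity table, either raise the provisioned throughput or switch the table to on-demand billing; whether on-demand suits this table is a cost decision, so treat the command below as an option rather than the fix:
# Switch the table to on-demand capacity (alternative: raise provisioned RCU/WCU)
aws dynamodb update-table \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--billing-mode PAY_PER_REQUEST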
Quick Debug Commands¶
# === Local Development ===
# Restart everything
docker compose down && docker compose up -d
# Check all logs
docker compose logs -f
# Shell into container
docker compose exec backend bash
# === AWS ===
# Force new deployment
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-backend-${ENVIRONMENT} \
--force-new-deployment
# Shell into running task
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "/bin/bash"
# Quick status check: desired vs running task count per service
for svc in backend strategy-service data-collection mlflow; do
echo "=== $svc ==="
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${svc}-${ENVIRONMENT} \
--query 'services[0].{desired:desiredCount,running:runningCount}'
done