Service Interactions Troubleshooting
Guide for debugging communication issues between TradAI services.
Service Architecture
graph TD
GW["API Gateway"] --> BE["Backend (8000)<br/>Backtest orchestration<br/>Strategy catalog<br/>Service aggregation"]
BE --> SS["Strategy Service (8003)"]
BE --> DC["Data Collection (8002)"]
BE --> ML["MLflow (5000)"]
SS --> S3["S3 Configs"]
DC --> ADB["ArcticDB"]
ML --> RDS["RDS Postgres"]
Backend → Strategy Service
Communication Pattern
Backend proxies strategy-related requests to Strategy Service:
- GET /api/v1/strategies → Strategy Service
- POST /api/v1/strategies/{name}/stage → Strategy Service
- POST /api/v1/strategies/{name}/promote → Strategy Service
- GET/POST /api/v1/hyperopt/* → Strategy Service
Debugging Connection Issues
Step 1: Check Strategy Service health
# Direct health check
curl http://localhost:8003/api/v1/health | jq
# Or via ECS
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-strategy-service-${ENVIRONMENT}
Step 2: Check Backend → Strategy Service connection
# Check Backend logs for strategy service calls
docker compose logs backend | grep -i strategy
# Or CloudWatch
aws logs filter-log-events \
--log-group-name /ecs/tradai/${ENVIRONMENT} \
--filter-pattern "strategy service" \
--limit 20
Step 3: Check network connectivity
# Docker Compose
docker compose exec backend curl -v http://strategy-service:8003/api/v1/health
# AWS (service discovery)
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $BACKEND_TASK_ARN \
--container backend \
--interactive \
--command "curl -v http://strategy-service.tradai-${ENVIRONMENT}.local:8003/api/v1/health"
Common Issues
| Issue | Symptom | Resolution |
|---|---|---|
| Service not found | DNS resolution fails | Check service discovery |
| Connection refused | Service not running | Restart strategy-service |
| Timeout | Service overloaded | Check CPU/memory, scale |
| 503 errors | Unhealthy | Check strategy-service health |
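The symptoms in the table line up fairly directly with curl's exit codes (6 for a DNS failure, 7 for a failed connection, 28 for a timeout) and with the HTTP status. A minimal sketch of that mapping follows; the function name `diagnose_strategy_call` is hypothetical and not part of TradAI.

```shell
# Hypothetical helper: map a curl exit code and HTTP status to a row of the table above.
# Usage: diagnose_strategy_call <curl_exit_code> <http_status>
diagnose_strategy_call() {
  exit_code="$1"
  http_status="$2"
  case "$exit_code" in
    6)  echo "Service not found: check service discovery" ;;
    7)  echo "Connection refused: restart strategy-service" ;;
    28) echo "Timeout: check CPU/memory, scale" ;;
    0)
      if [ "$http_status" = "503" ]; then
        echo "503: check strategy-service health"
      else
        echo "No match: HTTP $http_status"
      fi
      ;;
    *)  echo "No match: curl exit code $exit_code" ;;
  esac
}

# Example invocation against the local service (not run here):
#   status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://localhost:8003/api/v1/health)
#   diagnose_strategy_call "$?" "$status"
```

The helper only names the most likely row; the resolution steps themselves are the ones listed in the table.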
Backend → Data Collection
Communication Pattern
Backend proxies data requests:
- GET /api/v1/data/symbols → Data Collection
- GET /api/v1/data/freshness → Data Collection
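When a Backend request fails, the first question is which upstream the path is proxied to. The lookup below is a rough sketch built only from the route prefixes listed in this guide; `upstream_for_path` is a hypothetical name, the hostnames are the Docker Compose service names, and any path not listed is assumed to be handled by Backend itself.

```shell
# Hypothetical lookup: which upstream does Backend forward a request path to?
# Prefixes come from the communication patterns in this guide.
upstream_for_path() {
  case "$1" in
    /api/v1/strategies|/api/v1/strategies/*|/api/v1/hyperopt/*)
      echo "strategy-service:8003" ;;
    /api/v1/data/*)
      echo "data-collection:8002" ;;
    *)
      # Assumption: everything else is served by Backend directly.
      echo "backend:8000" ;;
  esac
}
```

For example, `upstream_for_path /api/v1/data/freshness` prints `data-collection:8002`, which points at the Data Collection checks in this section.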
Debugging
Step 1: Check Data Collection health
Step 2: Check connectivity
# Docker
docker compose exec backend curl http://data-collection:8002/api/v1/health
# AWS
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $BACKEND_TASK_ARN \
--container backend \
--interactive \
--command "curl http://data-collection.tradai-${ENVIRONMENT}.local:8002/api/v1/health"
Step 3: Check ArcticDB connectivity
# Data Collection logs
docker compose logs data-collection | grep -i arcticdb
# ArcticDB uses S3 storage via DATA_COLLECTION_ARCTIC_S3_BUCKET.
# In local dev, this points to LocalStack S3.
Data Collection Issues
Data Collection → Exchange
Symptom: Data sync failing, stale data.
Step 1: Check exchange connectivity
# Direct exchange test
curl https://api.binance.com/api/v3/ping
# From container
docker compose exec data-collection curl https://api.binance.com/api/v3/ping
Step 2: Check for rate limiting
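One way to check for rate limiting is to look at the HTTP status the exchange returns. The sketch below assumes the exchange is Binance, as in the ping test above, which documents 429 for a broken rate limit and 418 for an IP ban after repeated 429s. The function name `classify_exchange_status` is hypothetical.

```shell
# Hypothetical helper: interpret the HTTP status of an exchange request.
classify_exchange_status() {
  case "$1" in
    2??) echo "ok" ;;
    429) echo "rate-limited: back off and honor Retry-After" ;;
    418) echo "ip-banned: stop requests until the ban expires" ;;
    *)   echo "other: HTTP $1" ;;
  esac
}

# Example (not run here): fetch the status, then classify it.
#   status=$(docker compose exec -T data-collection \
#     curl -s -o /dev/null -w '%{http_code}' https://api.binance.com/api/v3/ping)
#   classify_exchange_status "$status"
# The response headers also carry the used request weight:
#   curl -s -o /dev/null -D - https://api.binance.com/api/v3/ping | grep -i 'x-mbx-used-weight'
```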
Step 3: Check NAT (AWS)
# NAT instance health
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tradai-nat-*" \
--query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
# Test from ECS
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $DATA_TASK_ARN \
--container data-collection \
--interactive \
--command "curl -w '%{time_total}s' https://api.binance.com/api/v3/ping"
Data Collection → ArcticDB
Symptom: Write errors, read failures.
Step 1: Check ArcticDB library
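# List symbols (-T disables TTY allocation so the heredoc can be piped in)
docker compose exec -T data-collection python << 'EOF'
from arcticdb import Arctic
# ArcticDB S3 URIs take the form s3://<endpoint>:<bucket>?<options>.
# The "localstack" hostname and test credentials are assumptions; match your compose file.
ac = Arctic("s3://localstack:tradai-arcticdb-dev?port=4566&access=test&secret=test")
lib = ac.get_library("ohlcv", create_if_missing=True)
print(lib.list_symbols())
EOF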
Step 2: Check S3 storage
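A simple storage check is to list the objects in the ArcticDB bucket. The sketch below only prints the command to run, so it can be reviewed first. The bucket name and the LocalStack port are taken from the connection string above, `arctic_bucket_ls_cmd` is a hypothetical helper, and in AWS the bucket is whatever DATA_COLLECTION_ARCTIC_S3_BUCKET points to.

```shell
# Hypothetical helper: print the `aws s3 ls` command for an ArcticDB bucket.
# Pass an endpoint URL for LocalStack; omit it for real S3.
arctic_bucket_ls_cmd() {
  bucket="$1"
  endpoint="${2:-}"
  if [ -n "$endpoint" ]; then
    echo "aws --endpoint-url $endpoint s3 ls s3://$bucket/ --recursive --summarize"
  else
    echo "aws s3 ls s3://$bucket/ --recursive --summarize"
  fi
}

# Local development (LocalStack on port 4566):
#   arctic_bucket_ls_cmd tradai-arcticdb-dev http://localhost:4566
```

An empty listing for a bucket that should hold data suggests the service is writing somewhere else, so compare the bucket name with the service's environment.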
MLflow Issues
Services → MLflow
Multiple services connect to MLflow:
- Strategy Service: Model registry, experiment tracking
- Backend: Model version queries
- Strategy Tasks: Logging backtest results
Step 1: Check MLflow health
# Health check
curl http://localhost:5001/health
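# API test
curl http://localhost:5001/api/2.0/mlflow/experiments/list | jq
# MLflow 2.x removed experiments/list; on those versions use the search endpoint instead
curl -X POST -H "Content-Type: application/json" -d '{"max_results": 10}' http://localhost:5001/api/2.0/mlflow/experiments/search | jq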
Step 2: Check from other services
# From Backend
docker compose exec backend curl http://mlflow:5000/health
# From Strategy Service
docker compose exec strategy-service curl http://mlflow:5000/health
Step 3: Check RDS connectivity (MLflow backend)
# MLflow logs
docker compose logs mlflow | grep -i "database\|postgres"
# Test database connection
docker compose exec -T mlflow python << 'EOF'
import psycopg2
import os
# Raises KeyError if DATABASE_URL is not set in the container
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute('SELECT 1')
print(cur.fetchone())
conn.close()
EOF
Common MLflow Issues
| Issue | Symptom | Resolution |
|---|---|---|
| Can't connect | Connection refused | Start MLflow service |
| Database error | 500 errors | Check RDS connectivity |
| Model not found | 404 errors | Verify model name/version |
| Slow queries | Timeout | Check RDS performance |
Lambda → DynamoDB
Lambda Functions Using DynamoDB
- health-check: Reads/writes health state
- trading-heartbeat-check: Reads trading state
- sqs-consumer: Updates workflow state
- orphan-scanner: Queries workflow state
Debugging
Step 1: Check Lambda permissions
aws lambda get-function-configuration \
--function-name tradai-health-check-${ENVIRONMENT} \
--query 'Role'
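# Check role permissions (inline policies, then attached managed policies)
aws iam list-role-policies \
--role-name tradai-lambda-role-${ENVIRONMENT}
aws iam list-attached-role-policies \
--role-name tradai-lambda-role-${ENVIRONMENT}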
Step 2: Check DynamoDB access
# Test query
aws dynamodb scan \
--table-name tradai-health-state-${ENVIRONMENT} \
--limit 1
# Check for throttling (-v-15M is BSD/macOS date syntax; on GNU/Linux use -d '15 minutes ago')
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tradai-health-state-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Sum
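The `--start-time` expression above depends on which `date` implementation is installed. A portable alternative is to compute the epoch offset first and try both flag styles. This is a sketch under that assumption, and `minutes_ago_utc` is a hypothetical helper name.

```shell
# Print the UTC timestamp N minutes ago in ISO 8601 format.
minutes_ago_utc() {
  target=$(( $(date -u +%s) - $1 * 60 ))
  # BSD/macOS date reads epoch seconds with -r; GNU date treats -r as a file
  # reference and fails, so fall back to the GNU/BusyBox form -d @<epoch>.
  date -u -r "$target" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -d "@$target" +%Y-%m-%dT%H:%M:%SZ
}

# Usage in the throttling query:
#   --start-time "$(minutes_ago_utc 15)" \
#   --end-time "$(minutes_ago_utc 0)" \
```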
Step 3: Check VPC endpoint (for Lambda in VPC)
aws ec2 describe-vpc-endpoints \
--filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.dynamodb" \
--query 'VpcEndpoints[].State'
ECS → S3
Services Using S3
- Backend: Config reading, result storage
- Strategy Service: Config loading
- Strategy Tasks: Results upload
Debugging
Step 1: Check S3 access
# List buckets
aws s3 ls | grep tradai
# Test read
aws s3 ls s3://tradai-configs-${ENVIRONMENT}/strategies/
Step 2: Check from container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "aws s3 ls s3://tradai-configs-${ENVIRONMENT}/"
Step 3: Check VPC endpoint
aws ec2 describe-vpc-endpoints \
--filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.s3" \
--query 'VpcEndpoints[].State'
Service Discovery (AWS)
Cloud Map Configuration
Services are discoverable via *.tradai-{env}.local:
- backend-api.tradai-{env}.local:8000
- strategy-service.tradai-{env}.local:8003
- data-collection.tradai-{env}.local:8002
- mlflow.tradai-{env}.local:5000
Debugging DNS Resolution
Step 1: Check Cloud Map namespace
Step 2: Check service registration
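Steps 1 and 2 can be carried out with the `aws servicediscovery` commands. The wrappers below are a sketch: the function names are hypothetical, and the namespace name is derived from the `*.tradai-{env}.local` pattern described above.

```shell
# Namespace name for an environment, following the *.tradai-{env}.local pattern.
namespace_name() {
  echo "tradai-$1.local"
}

# Step 1: look up the namespace ID (empty output suggests the namespace is missing).
find_namespace_id() {
  aws servicediscovery list-namespaces \
    --query "Namespaces[?Name=='$(namespace_name "$1")'].Id" \
    --output text
}

# Step 2: list the services registered in a namespace ID.
list_namespace_services() {
  aws servicediscovery list-services \
    --filters "Name=NAMESPACE_ID,Values=$1,Condition=EQ" \
    --query 'Services[].{Name:Name,Id:Id}'
}

# Step 2 (continued): list the registered instances for one service ID.
list_service_instances() {
  aws servicediscovery list-instances --service-id "$1"
}
```

A typical sequence is `ns_id=$(find_namespace_id "$ENVIRONMENT")`, then `list_namespace_services "$ns_id"`, then `list_service_instances` with the ID of the service that fails to resolve. A service with no registered instances will not resolve even when the namespace is healthy.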
Step 3: Test DNS from container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend-api \
--interactive \
--command "nslookup strategy-service.tradai-${ENVIRONMENT}.local"
Step Functions → ECS (Backtest Execution)
Interaction Pattern
Step Functions orchestrates backtest execution by launching ECS tasks:
- API Gateway receives POST /api/v1/backtests and sends to SQS
- SQS Consumer Lambda creates a DynamoDB workflow state record and starts a Step Functions execution
- Step Functions launches an ECS task (strategy-container) for the backtest
- ECS task runs the backtest, logs results to MLflow, and uploads to S3
- Step Functions invokes the update-status Lambda to update DynamoDB workflow state
- The notify-completion Lambda sends an SNS notification
Debugging Step Functions → ECS
Step 1: Check Step Functions execution status
# List recent backtest executions
aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-backtest-workflow-${ENVIRONMENT} \
--max-results 5
# Get execution details
aws stepfunctions describe-execution \
--execution-arn $EXECUTION_ARN
Step 2: Check if ECS task was launched
# Find tasks started by Step Functions (running tasks only; add --desired-status STOPPED for tasks that already exited)
aws ecs list-tasks \
--cluster tradai-${ENVIRONMENT} \
--family tradai-strategy-container-${ENVIRONMENT}
# Check task status
aws ecs describe-tasks \
--cluster tradai-${ENVIRONMENT} \
--tasks $TASK_ARN
Step 3: Check DynamoDB workflow state
aws dynamodb get-item \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--key '{"job_id": {"S": "JOB_ID"}}'
Step 4: Check ECS task logs
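A sketch of one way to check the logs and the stop reason for a single task follows. It assumes the strategy tasks log to the same `/ecs/tradai/${ENVIRONMENT}` log group used earlier in this guide and that the log stream name contains the task ID, which is the usual awslogs layout. The function names are hypothetical, and `aws logs tail` requires AWS CLI v2.

```shell
# Extract the task ID (last path segment) from an ECS task ARN.
task_id_from_arn() {
  echo "${1##*/}"
}

# Show the last hour of log lines for one task.
# `aws logs tail` prints the stream name on each line, so grep can select the task.
tail_task_logs() {
  task_id="$(task_id_from_arn "$1")"
  aws logs tail "/ecs/tradai/${ENVIRONMENT}" --since 1h | grep "$task_id"
}

# Show why a task stopped, with the first container's exit code and reason.
task_stop_reason() {
  aws ecs describe-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --tasks "$1" \
    --query 'tasks[0].{StoppedReason:stoppedReason,ExitCode:containers[0].exitCode,Reason:containers[0].reason}'
}
```

Calling `task_stop_reason "$TASK_ARN"` first is often enough for tasks that exit immediately, since image pull and permission failures show up in the stop reason before any log line is written.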
Common Issues
| Issue | Symptom | Resolution |
|---|---|---|
| ECS task not launched | Step Functions shows "ECS RunTask failed" | Check IAM role, subnet, security group |
| Task exits immediately | Exit code 1, no output | Check container logs for startup errors |
| Task times out | Step Functions heartbeat timeout | Increase timeout or optimize backtest |
| Results not saved | Task succeeded but DynamoDB not updated | Check update-status Lambda logs |
| Wrong container image | Task starts but wrong strategy | Verify task definition image tag |
Timeout Matrix
Expected response times for service interactions:
| Source | Target | Normal | Warn | Critical |
|---|---|---|---|---|
| Backend | Strategy Service | <100ms | <500ms | >1s |
| Backend | Data Collection | <100ms | <500ms | >1s |
| Backend | MLflow | <200ms | <1s | >2s |
| Data Collection | Exchange | <500ms | <2s | >5s |
| Lambda | DynamoDB | <50ms | <200ms | >500ms |
| ECS | S3 | <100ms | <500ms | >1s |
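A measured latency can be checked against a row of the matrix with a small comparison. This is a sketch with a hypothetical function name. The table does not define the band between the Warn and Critical thresholds, so the helper reports values in that band as `unclassified` instead of guessing.

```shell
# Classify a latency in milliseconds against one row of the matrix.
# Usage: classify_latency <measured_ms> <normal_below_ms> <warn_below_ms> <critical_above_ms>
classify_latency() {
  if [ "$1" -lt "$2" ]; then
    echo "normal"
  elif [ "$1" -lt "$3" ]; then
    echo "warn"
  elif [ "$1" -gt "$4" ]; then
    echo "critical"
  else
    echo "unclassified"
  fi
}

# Example for the Backend -> Strategy Service row (not run here):
#   ms=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8003/api/v1/health \
#     | awk '{printf "%d", $1 * 1000}')
#   classify_latency "$ms" 100 500 1000
```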
Quick Connectivity Tests
# === Local Development ===
echo "Backend:"
curl -s http://localhost:8000/api/v1/health | jq '.status'
echo "Strategy Service:"
curl -s http://localhost:8003/api/v1/health | jq '.status'
echo "Data Collection:"
curl -s http://localhost:8002/api/v1/health | jq '.status'
echo "MLflow:"
curl -s http://localhost:5001/health
# === AWS ===
for svc in backend strategy-service data-collection mlflow; do
echo "=== $svc ==="
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${svc}-${ENVIRONMENT} \
--query 'services[0].{Status:status,Running:runningCount}'
done