Service Interactions Troubleshooting¶
Guide for debugging communication issues between TradAI services.
Service Architecture¶
┌─────────────────────────────────────────────────────────────┐
│                         API Gateway                         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                       Backend (8000)                        │
│  - Backtest orchestration                                   │
│  - Strategy catalog                                         │
│  - Service aggregation                                      │
└─────────┬───────────────┬────────────────┬──────────────────┘
          │               │                │
          ▼               ▼                ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ Strategy Service│ │ Data Collection │ │       MLflow        │
│     (8003)      │ │     (8002)      │ │       (5000)        │
└────────┬────────┘ └────────┬────────┘ └──────────┬──────────┘
         │                   │                     │
         ▼                   ▼                     ▼
    ┌──────────┐        ┌──────────┐          ┌──────────┐
    │    S3    │        │ ArcticDB │          │   RDS    │
    │ Configs  │        │          │          │ Postgres │
    └──────────┘        └──────────┘          └──────────┘
Backend → Strategy Service¶
Communication Pattern¶
Backend proxies strategy-related requests to Strategy Service:
- GET /strategies → Strategy Service
- POST /strategies/{name}/stage → Strategy Service
- POST /strategies/{name}/promote → Strategy Service
- GET/POST /hyperopt/* → Strategy Service
Debugging Connection Issues¶
Step 1: Check Strategy Service health
# Direct health check
curl http://localhost:8003/api/v1/health | jq
# Or via ECS
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-strategy-service-${ENVIRONMENT}
Step 2: Check Backend → Strategy Service connection
# Check Backend logs for strategy service calls
docker compose logs backend | grep -i strategy
# Or CloudWatch
aws logs filter-log-events \
--log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
--filter-pattern "strategy service" \
--limit 20
Step 3: Check network connectivity
# Docker Compose
docker compose exec backend curl -v http://strategy-service:8003/api/v1/health
# AWS (service discovery)
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $BACKEND_TASK_ARN \
--container backend \
--interactive \
--command "curl -v http://strategy-service.tradai.local:8003/api/v1/health"
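The AWS commands in this guide assume `$BACKEND_TASK_ARN` is already set. A minimal sketch of one way to look it up, assuming the service follows the `tradai-<service>-${ENVIRONMENT}` naming used elsewhere in this guide; `first_task_arn` is a hypothetical helper name, not part of TradAI.

```shell
# Hypothetical helper: print the ARN of the first running task for a service.
# Assumes services are named tradai-<service>-${ENVIRONMENT}.
first_task_arn() {
  local service="$1"
  aws ecs list-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --service-name "tradai-${service}-${ENVIRONMENT}" \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text
}

# Usage:
#   BACKEND_TASK_ARN=$(first_task_arn backend)
```

If the helper prints `None`, the service has no running tasks. Note that `aws ecs execute-command` also needs ECS Exec enabled on the task and the Session Manager plugin installed locally.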
Common Issues¶
| Issue | Symptom | Resolution |
|---|---|---|
| Service not found | DNS resolution fails | Check service discovery |
| Connection refused | Service not running | Restart strategy-service |
| Timeout | Service overloaded | Check CPU/memory, scale |
| 503 errors | Unhealthy | Check strategy-service health |
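The rows above line up with curl exit codes: 6 means the host could not be resolved, 7 means the connection failed, and 28 means the operation timed out. A minimal sketch of a classifier built on that mapping; `classify_failure` is a hypothetical helper, and the wording of its messages is illustrative only.

```shell
# Hypothetical helper: map a curl exit code and HTTP status to a row of the table.
# Arguments: curl_exit_code http_status
classify_failure() {
  local exit_code="$1" http_code="$2"
  case "$exit_code" in
    6)  echo "Service not found (DNS resolution failed)" ;;
    7)  echo "Connection refused (service not running)" ;;
    28) echo "Timeout (service overloaded or unreachable)" ;;
    0)
      case "$http_code" in
        2??) echo "OK (HTTP $http_code)" ;;
        503) echo "503 (service unhealthy)" ;;
        *)   echo "Unexpected HTTP status $http_code" ;;
      esac
      ;;
    *)  echo "Other curl error (exit $exit_code)" ;;
  esac
}

# Usage (from inside the backend container):
#   http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
#     http://strategy-service:8003/api/v1/health)
#   classify_failure "$?" "$http_code"
```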
Backend → Data Collection¶
Communication Pattern¶
Backend proxies data requests:
- GET /data/symbols → Data Collection
- GET /data/freshness → Data Collection
Debugging¶
Step 1: Check Data Collection health
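This step lists no command. A minimal sketch, assuming Data Collection serves the same `/api/v1/health` endpoint on port 8002 that the quick connectivity tests in this guide use; `health_status` is a hypothetical helper name.

```shell
# Hypothetical helper: print the HTTP status code returned by a health endpoint.
# curl prints 000 when the request fails before any HTTP response arrives.
health_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage:
#   health_status http://localhost:8002/api/v1/health
```

A `200` suggests the service is up. `000` points at a connection problem rather than an application error.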
Step 2: Check connectivity
# Docker
docker compose exec backend curl http://data-collection:8002/api/v1/health
# AWS
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $BACKEND_TASK_ARN \
--container backend \
--interactive \
--command "curl http://data-collection.tradai.local:8002/api/v1/health"
Step 3: Check ArcticDB connectivity
# Data Collection logs
docker compose logs data-collection | grep -i arcticdb
# Check ArcticDB storage
ls -la user_data/arcticdb/
Data Collection Issues¶
Data Collection → Exchange¶
Symptom: Data sync failing, stale data.
Step 1: Check exchange connectivity
# Direct exchange test
curl https://api.binance.com/api/v3/ping
# From container
docker compose exec data-collection curl https://api.binance.com/api/v3/ping
Step 2: Check for rate limiting
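This step lists no command. Binance signals rate limiting with HTTP 429 and an IP ban with HTTP 418, so one option is to scan the service logs for those codes. A minimal sketch, assuming the Data Collection logs include the HTTP status or the words "rate limit" (the actual log format is not shown in this guide); `count_rate_limit_lines` is a hypothetical helper.

```shell
# Hypothetical helper: count log lines on stdin that look like rate limiting.
# Matches standalone 429/418 status codes, "rate limit" variants, and
# "too many requests" (case-insensitive).
count_rate_limit_lines() {
  grep -ciE '(^|[^0-9])(429|418)([^0-9]|$)|rate.?limit|too many requests'
}

# Usage:
#   docker compose logs --since 1h data-collection | count_rate_limit_lines
```

A non-zero count is only a hint. Binance also reports the request weight already used in `X-MBX-USED-WEIGHT-*` response headers, which can confirm how close the service is to its limit.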
Step 3: Check NAT (AWS)
# NAT instance health
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tradai-nat-*" \
--query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
# Test from ECS
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $DATA_TASK_ARN \
--container data-collection \
--interactive \
--command "curl -w '%{time_total}s' https://api.binance.com/api/v3/ping"
Data Collection → ArcticDB¶
Symptom: Write errors, read failures.
Step 1: Check ArcticDB library
# List symbols
docker compose exec data-collection python << 'EOF'
from arcticdb import Arctic
ac = Arctic("lmdb://./user_data/arcticdb")
lib = ac.get_library("ohlcv", create_if_missing=True)
print(lib.list_symbols())
EOF
Step 2: Check disk space
# Local
df -h user_data/arcticdb
# S3 backend (if using)
aws s3 ls s3://tradai-${ENVIRONMENT}-arcticdb/ --summarize
MLflow Issues¶
Services → MLflow¶
Multiple services connect to MLflow:
- Strategy Service: Model registry, experiment tracking
- Backend: Model version queries
- Strategy Tasks: Logging backtest results
Step 1: Check MLflow health
# Health check
curl http://localhost:5000/health
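# API test (MLflow 2.x removed experiments/list; on MLflow 1.x use GET /api/2.0/mlflow/experiments/list)
curl -s -X POST http://localhost:5000/api/2.0/mlflow/experiments/search \
-H 'Content-Type: application/json' \
-d '{"max_results": 10}' | jq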
Step 2: Check from other services
# From Backend
docker compose exec backend curl http://mlflow:5000/health
# From Strategy Service
docker compose exec strategy-service curl http://mlflow:5000/health
Step 3: Check RDS connectivity (MLflow backend)
# MLflow logs
docker compose logs mlflow | grep -i "database\|postgres"
# Test database connection
docker compose exec mlflow python << 'EOF'
import psycopg2
import os
conn = psycopg2.connect(os.environ.get('DATABASE_URL'))
cur = conn.cursor()
cur.execute('SELECT 1')
print(cur.fetchone())
conn.close()
EOF
Common MLflow Issues¶
| Issue | Symptom | Resolution |
|---|---|---|
| Can't connect | Connection refused | Start MLflow service |
| Database error | 500 errors | Check RDS connectivity |
| Model not found | 404 errors | Verify model name/version |
| Slow queries | Timeout | Check RDS performance |
Lambda → DynamoDB¶
Lambda Functions Using DynamoDB¶
- health-check: Reads/writes health state
- heartbeat-check: Reads trading state
- sqs-consumer: Updates workflow state
- orphan-scanner: Queries workflow state
Debugging¶
Step 1: Check Lambda permissions
aws lambda get-function-configuration \
--function-name tradai-health-check-${ENVIRONMENT} \
--query 'Role'
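# Check role permissions (inline policies, then attached managed policies)
aws iam list-role-policies \
--role-name tradai-lambda-role-${ENVIRONMENT}
aws iam list-attached-role-policies \
--role-name tradai-lambda-role-${ENVIRONMENT}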
Step 2: Check DynamoDB access
# Test query
aws dynamodb scan \
--table-name tradai-${ENVIRONMENT}-health-state \
--limit 1
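# Check for throttling
# (date -v-15M is BSD/macOS syntax; with GNU date use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)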
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-health-state \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Sum
Step 3: Check VPC endpoint (for Lambda in VPC)
aws ec2 describe-vpc-endpoints \
--filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.dynamodb" \
--query 'VpcEndpoints[].State'
ECS → S3¶
Services Using S3¶
- Backend: Config reading, result storage
- Strategy Service: Config loading
- Strategy Tasks: Results upload
Debugging¶
Step 1: Check S3 access
# List buckets
aws s3 ls | grep tradai
# Test read
aws s3 ls s3://tradai-${ENVIRONMENT}-configs/strategies/
Step 2: Check from container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "aws s3 ls s3://tradai-${ENVIRONMENT}-configs/"
Step 3: Check VPC endpoint
aws ec2 describe-vpc-endpoints \
--filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.s3" \
--query 'VpcEndpoints[].State'
Service Discovery (AWS)¶
Cloud Map Configuration¶
Services are discoverable via *.tradai.local:
- backend.tradai.local:8000
- strategy-service.tradai.local:8003
- data-collection.tradai.local:8002
- mlflow.tradai.local:5000
Debugging DNS Resolution¶
Step 1: Check Cloud Map namespace
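This step lists no command. A minimal sketch using the `aws servicediscovery` CLI, assuming the namespace is named `tradai.local` as in the DNS names above; `namespace_id` is a hypothetical helper.

```shell
# Hypothetical helper: print the ID of a Cloud Map namespace, looked up by name.
# Prints "None" if no namespace with that name exists.
namespace_id() {
  local name="$1"
  aws servicediscovery list-namespaces \
    --query "Namespaces[?Name=='${name}'].Id | [0]" \
    --output text
}

# Usage:
#   namespace_id tradai.local
```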
Step 2: Check service registration
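This step lists no command. A minimal sketch that looks up a Cloud Map service and lists the IPs of its registered instances, assuming the Cloud Map service name matches the DNS label (for example `strategy-service`); `registered_instances` is a hypothetical helper.

```shell
# Hypothetical helper: list the IPv4 addresses registered for one Cloud Map service.
# Returns 1 if the service itself is not registered.
registered_instances() {
  local service_name="$1" service_id
  service_id=$(aws servicediscovery list-services \
    --query "Services[?Name=='${service_name}'].Id | [0]" \
    --output text)
  if [ -z "$service_id" ] || [ "$service_id" = "None" ]; then
    echo "Service '${service_name}' is not registered in Cloud Map" >&2
    return 1
  fi
  aws servicediscovery list-instances \
    --service-id "$service_id" \
    --query 'Instances[].Attributes.AWS_INSTANCE_IPV4' \
    --output text
}

# Usage:
#   registered_instances strategy-service
```

Empty output with a zero exit status means the service exists but has no registered instances, which usually points at ECS tasks failing to start or to pass health checks.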
Step 3: Test DNS from container
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "nslookup strategy-service.tradai.local"
Timeout Matrix¶
Expected response times for service interactions:
| Source | Target | Normal | Warn | Critical |
|---|---|---|---|---|
| Backend | Strategy Service | <100ms | <500ms | >1s |
| Backend | Data Collection | <100ms | <500ms | >1s |
| Backend | MLflow | <200ms | <1s | >2s |
| Data Collection | Exchange | <500ms | <2s | >5s |
| Lambda | DynamoDB | <50ms | <200ms | >500ms |
| ECS | S3 | <100ms | <500ms | >1s |
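A measured latency can be checked against a row of the matrix with a small helper. This is a minimal sketch: `classify_latency` and `seconds_to_ms` are hypothetical names, and because the table leaves the range between the Warn and Critical bounds undefined, the sketch assumes everything from the Normal bound up to the Critical bound counts as "warn".

```shell
# Hypothetical helper: classify a latency in whole milliseconds.
# Arguments: measured_ms normal_bound_ms critical_bound_ms
# Assumption: values from the Normal bound up to the Critical bound are "warn".
classify_latency() {
  local ms="$1" normal="$2" critical="$3"
  if [ "$ms" -lt "$normal" ]; then
    echo "normal"
  elif [ "$ms" -le "$critical" ]; then
    echo "warn"
  else
    echo "critical"
  fi
}

# Convert curl's time_total (seconds, e.g. 0.25) to whole milliseconds.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 }'
}

# Usage (Backend → Strategy Service row: Normal <100ms, Critical >1s):
#   t=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8003/api/v1/health)
#   classify_latency "$(seconds_to_ms "$t")" 100 1000
```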
Quick Connectivity Tests¶
# === Local Development ===
echo "Backend:"
curl -s http://localhost:8000/api/v1/health | jq '.status'
echo "Strategy Service:"
curl -s http://localhost:8003/api/v1/health | jq '.status'
echo "Data Collection:"
curl -s http://localhost:8002/api/v1/health | jq '.status'
echo "MLflow:"
curl -s http://localhost:5000/health # MLflow returns plain-text "OK" here, so no jq
# === AWS ===
for svc in backend strategy-service data-collection mlflow; do
echo "=== $svc ==="
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-${svc}-${ENVIRONMENT} \
--query 'services[0].{Status:status,Running:runningCount}'
done