Service Interactions Troubleshooting

Guide for debugging communication issues between TradAI services.

Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                     API Gateway                             │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   Backend (8000)                            │
│  - Backtest orchestration                                   │
│  - Strategy catalog                                         │
│  - Service aggregation                                      │
└─────────┬───────────────┬────────────────┬──────────────────┘
          │               │                │
          ▼               ▼                ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ Strategy Service│ │ Data Collection │ │      MLflow         │
│     (8003)      │ │     (8002)      │ │      (5000)         │
└────────┬────────┘ └────────┬────────┘ └──────────┬──────────┘
         │                   │                     │
         ▼                   ▼                     ▼
   ┌──────────┐       ┌──────────┐          ┌──────────┐
   │  S3      │       │ ArcticDB │          │   RDS    │
   │ Configs  │       │          │          │ Postgres │
   └──────────┘       └──────────┘          └──────────┘

Backend → Strategy Service

Communication Pattern

Backend proxies strategy-related requests to Strategy Service:

  • GET /strategies → Strategy Service
  • POST /strategies/{name}/stage → Strategy Service
  • POST /strategies/{name}/promote → Strategy Service
  • GET/POST /hyperopt/* → Strategy Service

Debugging Connection Issues

Step 1: Check Strategy Service health

# Direct health check (-sS hides the progress meter but still prints errors)
curl -sS http://localhost:8003/api/v1/health | jq

# Or via ECS
aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-strategy-service-${ENVIRONMENT}

Step 2: Check Backend → Strategy Service connection

# Check Backend logs for strategy service calls
docker compose logs backend | grep -i strategy

# Or CloudWatch
aws logs filter-log-events \
  --log-group-name /ecs/tradai-backend-${ENVIRONMENT} \
  --filter-pattern "strategy service" \
  --limit 20

Step 3: Check network connectivity

# Docker Compose
docker compose exec backend curl -v http://strategy-service:8003/api/v1/health

# AWS (service discovery; requires ECS Exec to be enabled on the task)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $BACKEND_TASK_ARN \
  --container backend \
  --interactive \
  --command "curl -v http://strategy-service.tradai.local:8003/api/v1/health"
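
The AWS command above expects $BACKEND_TASK_ARN to be set already. A minimal sketch of one way to populate it, assuming the ECS service names follow the tradai-<service>-${ENVIRONMENT} pattern used in Step 1. The helper name task_arn_for is hypothetical.

```shell
# Hypothetical helper: print the ARN of the first running task of a service.
# Assumes cluster "tradai-$ENVIRONMENT" and service "tradai-<name>-$ENVIRONMENT".
task_arn_for() {
  aws ecs list-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --service-name "tradai-$1-${ENVIRONMENT}" \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text
}

# Usage:
#   BACKEND_TASK_ARN=$(task_arn_for backend)
```

With --output text the CLI prints None when the service has no running tasks, so check the value before passing it to execute-command.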

Common Issues

Issue              | Symptom              | Resolution
-------------------|----------------------|------------------------------
Service not found  | DNS resolution fails | Check service discovery
Connection refused | Service not running  | Restart strategy-service
Timeout            | Service overloaded   | Check CPU/memory, scale
503 errors         | Unhealthy            | Check strategy-service health
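
The rows above can be matched against what curl reports. The sketch below maps curl's documented exit codes (6: could not resolve host, 7: failed to connect, 28: operation timed out) and the HTTP status to the table's resolutions. The function name diagnose is hypothetical, and the mapping is an illustration rather than an exhaustive classifier.

```shell
# Map a curl exit code and HTTP status to the issues in the table above.
# Arguments: <curl exit code> <HTTP status>
diagnose() {
  exit_code=$1
  http_status=$2
  case "$exit_code" in
    6)  echo "Service not found: check service discovery" ;;
    7)  echo "Connection refused: restart strategy-service" ;;
    28) echo "Timeout: check CPU/memory, scale" ;;
    0)
      if [ "$http_status" = "503" ]; then
        echo "503: check strategy-service health"
      else
        echo "OK (HTTP $http_status)"
      fi ;;
    *)  echo "Unclassified curl error $exit_code" ;;
  esac
}

# Usage (local):
#   status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://localhost:8003/api/v1/health)
#   diagnose "$?" "$status"
```

The -m 5 flag caps the request at five seconds so that an overloaded service surfaces as exit code 28 instead of hanging.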

Backend → Data Collection

Communication Pattern

Backend proxies data requests:

  • GET /data/symbols → Data Collection
  • GET /data/freshness → Data Collection

Debugging

Step 1: Check Data Collection health

curl -sS http://localhost:8002/api/v1/health | jq

Step 2: Check connectivity

# Docker
docker compose exec backend curl http://data-collection:8002/api/v1/health

# AWS (requires ECS Exec to be enabled on the task)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $BACKEND_TASK_ARN \
  --container backend \
  --interactive \
  --command "curl http://data-collection.tradai.local:8002/api/v1/health"

Step 3: Check ArcticDB connectivity

# Data Collection logs
docker compose logs data-collection | grep -i arcticdb

# Check ArcticDB storage
ls -la user_data/arcticdb/


Data Collection Issues

Data Collection → Exchange

Symptom: Data sync failing, stale data.

Step 1: Check exchange connectivity

# Direct exchange test
curl https://api.binance.com/api/v3/ping

# From container
docker compose exec data-collection curl https://api.binance.com/api/v3/ping

Step 2: Check for rate limiting

docker compose logs data-collection | grep -i "429\|rate limit"
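
To judge whether rate limiting is occasional or sustained, it can help to count the matching lines over a fixed window. A minimal sketch, using the same pattern as the grep above; the function name count_rate_limited is hypothetical, and a bare 429 can also match unrelated numbers such as timestamps or IDs.

```shell
# Count log lines on stdin that mention HTTP 429 or "rate limit".
count_rate_limited() {
  # grep -c prints the number of matching lines; it exits 1 when the count
  # is 0, so "|| true" keeps the helper usable under "set -e".
  grep -ciE '429|rate limit' || true
}

# Usage:
#   docker compose logs --since 15m data-collection | count_rate_limited
```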

Step 3: Check NAT (AWS)

# NAT instance health
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-nat-*" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'

# Test from ECS (prints only the total request time)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $DATA_TASK_ARN \
  --container data-collection \
  --interactive \
  --command "curl -s -o /dev/null -w '%{time_total}s\n' https://api.binance.com/api/v3/ping"

Data Collection → ArcticDB

Symptom: Write errors, read failures.

Step 1: Check ArcticDB library

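# List libraries and symbols (-T disables TTY allocation so the heredoc is read from stdin)
docker compose exec -T data-collection python << 'EOF'
from arcticdb import Arctic

# Path is relative to the container's working directory
ac = Arctic("lmdb://./user_data/arcticdb")
print(ac.list_libraries())
# Note: create_if_missing=True creates an empty "ohlcv" library if none exists
lib = ac.get_library("ohlcv", create_if_missing=True)
print(lib.list_symbols())
EOF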

Step 2: Check disk space

# Local
df -h user_data/arcticdb

# S3 backend (if using): total object count and size across all prefixes
aws s3 ls s3://tradai-${ENVIRONMENT}-arcticdb/ --recursive --human-readable --summarize | tail -n 2


MLflow Issues

Services → MLflow

Multiple services connect to MLflow:

  • Strategy Service: Model registry, experiment tracking
  • Backend: Model version queries
  • Strategy Tasks: Logging backtest results

Step 1: Check MLflow health

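# Health check
curl http://localhost:5000/health

# API test (MLflow 1.x)
curl -s http://localhost:5000/api/2.0/mlflow/experiments/list | jq

# MLflow 2.x removed experiments/list; use experiments/search instead
curl -s -X POST http://localhost:5000/api/2.0/mlflow/experiments/search \
  -H 'Content-Type: application/json' \
  -d '{"max_results": 10}' | jq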

Step 2: Check from other services

# From Backend
docker compose exec backend curl http://mlflow:5000/health

# From Strategy Service
docker compose exec strategy-service curl http://mlflow:5000/health

Step 3: Check RDS connectivity (MLflow backend)

# MLflow logs
docker compose logs mlflow | grep -i "database\|postgres"

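# Test database connection (assumes DATABASE_URL is set in the mlflow container)
docker compose exec -T mlflow python << 'EOF'
import os
import psycopg2

# os.environ[...] raises KeyError if DATABASE_URL is unset
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())  # (1,) on success
conn.close()
EOF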

Common MLflow Issues

Issue           | Symptom            | Resolution
----------------|--------------------|--------------------------
Can't connect   | Connection refused | Start MLflow service
Database error  | 500 errors         | Check RDS connectivity
Model not found | 404 errors         | Verify model name/version
Slow queries    | Timeout            | Check RDS performance

Lambda → DynamoDB

Lambda Functions Using DynamoDB

  • health-check: Reads/writes health state
  • heartbeat-check: Reads trading state
  • sqs-consumer: Updates workflow state
  • orphan-scanner: Queries workflow state

Debugging

Step 1: Check Lambda permissions

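# Find the execution role attached to the function
aws lambda get-function-configuration \
  --function-name tradai-health-check-${ENVIRONMENT} \
  --query 'Role'

# Check role permissions: inline policies, then attached managed policies
aws iam list-role-policies \
  --role-name tradai-lambda-role-${ENVIRONMENT}
aws iam list-attached-role-policies \
  --role-name tradai-lambda-role-${ENVIRONMENT}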

Step 2: Check DynamoDB access

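# Test read (scan limited to one item)
aws dynamodb scan \
  --table-name tradai-${ENVIRONMENT}-health-state \
  --limit 1

# Check for throttling over the last 15 minutes.
# -v-15M is BSD/macOS date syntax; on GNU/Linux use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ
# ThrottledRequests is published per operation. If this returns no datapoints,
# add Name=Operation,Value=GetItem or query ReadThrottleEvents / WriteThrottleEvents.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-health-state \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Sum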
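
The --start-time expression above uses the BSD/macOS form of date, which GNU date on Linux rejects. A small sketch of a wrapper that works on either, for use when the same runbook is run from laptops and CI hosts; the function name minutes_ago_utc is hypothetical.

```shell
# Print the UTC timestamp from N minutes ago in ISO 8601 format.
minutes_ago_utc() {
  fmt='+%Y-%m-%dT%H:%M:%SZ'
  # Try BSD/macOS syntax first; GNU date rejects -v, so fall back to -d.
  date -u -v-"$1"M "$fmt" 2>/dev/null || date -u -d "$1 minutes ago" "$fmt"
}

# Usage:
#   --start-time "$(minutes_ago_utc 15)" --end-time "$(minutes_ago_utc 0)"
```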

Step 3: Check VPC endpoint (for Lambda in VPC)

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.dynamodb" \
  --query 'VpcEndpoints[].State'


ECS → S3

Services Using S3

  • Backend: Config reading, result storage
  • Strategy Service: Config loading
  • Strategy Tasks: Results upload

Debugging

Step 1: Check S3 access

# List buckets
aws s3 ls | grep tradai

# Test read
aws s3 ls s3://tradai-${ENVIRONMENT}-configs/strategies/

Step 2: Check from container

aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend \
  --interactive \
  --command "aws s3 ls s3://tradai-${ENVIRONMENT}-configs/"

Step 3: Check VPC endpoint

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.s3" \
  --query 'VpcEndpoints[].State'
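
The same VPC endpoint query is used for DynamoDB and S3, so it can be wrapped once and turned into a pass/fail check. A minimal sketch, assuming AWS_REGION is exported; the function names vpc_endpoint_state and endpoint_ok are hypothetical.

```shell
# Print the state(s) of the VPC endpoint for an AWS service (e.g. s3, dynamodb).
vpc_endpoint_state() {
  aws ec2 describe-vpc-endpoints \
    --filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.$1" \
    --query 'VpcEndpoints[].State' \
    --output text
}

# Return 0 only when an endpoint exists and reports "available".
endpoint_ok() {
  state=$(vpc_endpoint_state "$1")
  case "$state" in
    *available*) echo "$1 endpoint: available" ;;
    "" | None)   echo "$1 endpoint: not found"; return 1 ;;
    *)           echo "$1 endpoint: $state"; return 1 ;;
  esac
}

# Usage:
#   endpoint_ok s3 && endpoint_ok dynamodb
```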


Service Discovery (AWS)

Cloud Map Configuration

Services are discoverable via *.tradai.local:

  • backend.tradai.local:8000
  • strategy-service.tradai.local:8003
  • data-collection.tradai.local:8002
  • mlflow.tradai.local:5000

Debugging DNS Resolution

Step 1: Check Cloud Map namespace

aws servicediscovery list-namespaces \
  --query "Namespaces[?Name=='tradai.local']"

Step 2: Check service registration

# NAMESPACE_ID is the Id field of the namespace returned in Step 1
aws servicediscovery list-services \
  --filters Name=NAMESPACE_ID,Values=$NAMESPACE_ID
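
Step 2 needs the namespace ID, which Step 1 returns as part of a larger JSON object. A minimal sketch of extracting just the ID with a JMESPath query; the function name namespace_id is hypothetical.

```shell
# Print the Cloud Map namespace ID for tradai.local.
namespace_id() {
  aws servicediscovery list-namespaces \
    --query "Namespaces[?Name=='tradai.local'].Id | [0]" \
    --output text
}

# Usage:
#   NAMESPACE_ID=$(namespace_id)
```

With --output text the CLI prints None if the namespace does not exist, which is itself a useful signal that service discovery was never provisioned in this environment.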

Step 3: Test DNS from container

aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend \
  --interactive \
  --command "nslookup strategy-service.tradai.local"


Timeout Matrix

Expected response times for service interactions:

Source          | Target           | Normal | Warn   | Critical
----------------|------------------|--------|--------|---------
Backend         | Strategy Service | <100ms | <500ms | >1s
Backend         | Data Collection  | <100ms | <500ms | >1s
Backend         | MLflow           | <200ms | <1s    | >2s
Data Collection | Exchange         | <500ms | <2s    | >5s
Lambda          | DynamoDB         | <50ms  | <200ms | >500ms
ECS             | S3               | <100ms | <500ms | >1s
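
A measured latency can be compared against a row of the matrix as sketched below. The table leaves a gap between the Warn bound and the Critical bound (for example 500ms to 1s); this sketch reports that range with the made-up label above-warn, which is an assumption and not part of the matrix. The function name classify_latency is also hypothetical.

```shell
# classify_latency <measured_ms> <normal_ms> <warn_ms> <critical_ms>
# All arguments are whole milliseconds.
classify_latency() {
  if [ "$1" -lt "$2" ]; then
    echo normal
  elif [ "$1" -lt "$3" ]; then
    echo warn
  elif [ "$1" -gt "$4" ]; then
    echo critical
  else
    echo above-warn   # assumption: label for the range the table leaves open
  fi
}

# Usage (Backend -> Strategy Service row: 100 / 500 / 1000 ms):
#   ms=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8003/api/v1/health \
#        | awk '{printf "%d", $1 * 1000}')
#   classify_latency "$ms" 100 500 1000
```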

Quick Connectivity Tests

# === Local Development ===
echo "Backend:"
curl -s http://localhost:8000/api/v1/health | jq '.status'

echo "Strategy Service:"
curl -s http://localhost:8003/api/v1/health | jq '.status'

echo "Data Collection:"
curl -s http://localhost:8002/api/v1/health | jq '.status'

echo "MLflow:"
curl -s http://localhost:5000/health   # plain-text response, so no jq

# === AWS ===
for svc in backend strategy-service data-collection mlflow; do
  echo "=== $svc ==="
  aws ecs describe-services \
    --cluster tradai-${ENVIRONMENT} \
    --services tradai-${svc}-${ENVIRONMENT} \
    --query 'services[0].{Status:status,Running:runningCount}'
done
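
For scripted use, the local checks above can be folded into a single pass/fail summary. A minimal sketch, assuming the same local ports and health paths as above; the function names check and check_all are hypothetical.

```shell
# check <name> <url>: print OK or FAIL for one health endpoint.
check() {
  # -f makes curl fail on HTTP errors (>= 400); -m 5 caps the request at 5s.
  if curl -sf -m 5 -o /dev/null "$2"; then
    echo "OK   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Run every local check; return non-zero if any service failed.
check_all() {
  failed=0
  check backend          http://localhost:8000/api/v1/health || failed=1
  check strategy-service http://localhost:8003/api/v1/health || failed=1
  check data-collection  http://localhost:8002/api/v1/health || failed=1
  check mlflow           http://localhost:5000/health        || failed=1
  return "$failed"
}

# Usage:
#   check_all || echo "one or more services are unhealthy"
```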