Service Interactions Troubleshooting

Guide for debugging communication issues between TradAI services.

Service Architecture

graph TD
    GW["API Gateway"] --> BE["Backend (8000)<br/>Backtest orchestration<br/>Strategy catalog<br/>Service aggregation"]
    BE --> SS["Strategy Service (8003)"]
    BE --> DC["Data Collection (8002)"]
    BE --> ML["MLflow (5000)"]
    SS --> S3["S3 Configs"]
    DC --> ADB["ArcticDB"]
    ML --> RDS["RDS Postgres"]

Backend → Strategy Service

Communication Pattern

Backend proxies strategy-related requests to Strategy Service:

  • GET /api/v1/strategies → Strategy Service
  • POST /api/v1/strategies/{name}/stage → Strategy Service
  • POST /api/v1/strategies/{name}/promote → Strategy Service
  • GET/POST /api/v1/hyperopt/* → Strategy Service

Debugging Connection Issues

Step 1: Check Strategy Service health

# Direct health check
curl http://localhost:8003/api/v1/health | jq

# Or via ECS
aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-strategy-service-${ENVIRONMENT}

Step 2: Check Backend → Strategy Service connection

# Check Backend logs for strategy service calls
docker compose logs backend | grep -i strategy

# Or CloudWatch
aws logs filter-log-events \
  --log-group-name /ecs/tradai/${ENVIRONMENT} \
  --filter-pattern "strategy service" \
  --limit 20

Step 3: Check network connectivity

# Docker Compose
docker compose exec backend curl -v http://strategy-service:8003/api/v1/health

# AWS (service discovery)
# execute-command supports only interactive sessions, so --interactive is required
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $BACKEND_TASK_ARN \
  --container backend \
  --interactive \
  --command "curl -v http://strategy-service.tradai-${ENVIRONMENT}.local:8003/api/v1/health"

Common Issues

| Issue | Symptom | Resolution |
|---|---|---|
| Service not found | DNS resolution fails | Check service discovery |
| Connection refused | Service not running | Restart strategy-service |
| Timeout | Service overloaded | Check CPU/memory, scale |
| 503 errors | Unhealthy | Check strategy-service health |
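
The first three rows of the table above can be told apart at the TCP level, before any HTTP request is made. The sketch below is one way to do that with the Python standard library; `diagnose` is a hypothetical helper name, and the messages simply mirror the table rather than any TradAI tooling.

```python
import socket


def diagnose(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and map the outcome to a row of the issues table."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # TCP works; a 503 at this point means the service is up but unhealthy.
            return "reachable: check the HTTP status of the health endpoint next"
    except socket.gaierror:
        # Name lookup failed before any connection was attempted.
        return "service not found: check service discovery"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "connection refused: service not running"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout: service overloaded or traffic blocked"
    except OSError as exc:
        return f"other network error: {exc}"
```

Run it from inside the Backend container, for example `diagnose("strategy-service", 8003)`, so that name resolution matches what Backend itself sees.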

Backend → Data Collection

Communication Pattern

Backend proxies data requests:

  • GET /api/v1/data/symbols → Data Collection
  • GET /api/v1/data/freshness → Data Collection
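
When a Backend route fails, the first question is which downstream service sits behind it. The sketch below encodes the proxied routes listed in this guide as a prefix lookup. `ROUTES` and `downstream_for` are hypothetical names for illustration and are not taken from the Backend codebase.

```python
from typing import Optional

# Path prefix -> downstream service, taken from the route lists in this guide.
ROUTES = {
    "/api/v1/strategies": "strategy-service",
    "/api/v1/hyperopt": "strategy-service",
    "/api/v1/data/symbols": "data-collection",
    "/api/v1/data/freshness": "data-collection",
}


def downstream_for(path: str) -> Optional[str]:
    """Return the service a Backend path is proxied to, or None if it is not proxied."""
    for prefix, service in ROUTES.items():
        # Match the prefix itself or any sub-path, but not look-alike paths.
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return None


print(downstream_for("/api/v1/strategies/example/stage"))
print(downstream_for("/api/v1/data/freshness"))
```

A `None` result means the route is handled by Backend itself, so the downstream health checks in this section do not apply.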

Debugging

Step 1: Check Data Collection health

curl http://localhost:8002/api/v1/health | jq

Step 2: Check connectivity

# Docker
docker compose exec backend curl http://data-collection:8002/api/v1/health

# AWS
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $BACKEND_TASK_ARN \
  --container backend \
  --interactive \
  --command "curl http://data-collection.tradai-${ENVIRONMENT}.local:8002/api/v1/health"

Step 3: Check ArcticDB connectivity

# Data Collection logs
docker compose logs data-collection | grep -i arcticdb

# ArcticDB uses S3 storage via DATA_COLLECTION_ARCTIC_S3_BUCKET.
# In local dev, this points to LocalStack S3.


Data Collection Issues

Data Collection → Exchange

Symptom: Data sync failing, stale data.

Step 1: Check exchange connectivity

# Direct exchange test
curl https://api.binance.com/api/v3/ping

# From container
docker compose exec data-collection curl https://api.binance.com/api/v3/ping

Step 2: Check for rate limiting

docker compose logs data-collection | grep -i "429\|rate limit"

Step 3: Check NAT (AWS)

# NAT instance health
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-nat-*" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'

# Test from ECS
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $DATA_TASK_ARN \
  --container data-collection \
  --interactive \
  --command "curl -w '%{time_total}s' https://api.binance.com/api/v3/ping"

Data Collection → ArcticDB

Symptom: Write errors, read failures.

Step 1: Check ArcticDB library

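# List symbols
# -T disables TTY allocation so the heredoc can be piped to python.
# Adjust the URI to the endpoint the service is configured with: inside the
# container, localhost refers to the container itself, not the host.
docker compose exec -T data-collection python << 'EOF'
from arcticdb import Arctic

ac = Arctic("s3s://localhost:4566/tradai-arcticdb-dev?aws_auth=false")
# create_if_missing=True creates the library when it is absent, so an empty
# list can mean "just created" rather than "no data written".
lib = ac.get_library("ohlcv", create_if_missing=True)
print(lib.list_symbols())
EOF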

Step 2: Check S3 storage

# Check ArcticDB S3 bucket
aws s3 ls s3://tradai-arcticdb-${ENVIRONMENT}/ --summarize


MLflow Issues

Services → MLflow

Multiple services connect to MLflow:

  • Strategy Service: Model registry, experiment tracking
  • Backend: Model version queries
  • Strategy Tasks: Logging backtest results

Step 1: Check MLflow health

# Health check
curl http://localhost:5001/health

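# API test
curl http://localhost:5001/api/2.0/mlflow/experiments/list | jq

# On MLflow 2.x, where experiments/list was removed, use experiments/search
curl -s -X POST http://localhost:5001/api/2.0/mlflow/experiments/search \
  -H 'Content-Type: application/json' \
  -d '{"max_results": 10}' | jq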

Step 2: Check from other services

# From Backend
docker compose exec backend curl http://mlflow:5000/health

# From Strategy Service
docker compose exec strategy-service curl http://mlflow:5000/health

Step 3: Check RDS connectivity (MLflow backend)

# MLflow logs
docker compose logs mlflow | grep -i "database\|postgres"

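# Test database connection
# -T disables TTY allocation so the heredoc can be piped to python.
docker compose exec -T mlflow python << 'EOF'
import os

import psycopg2

# os.environ[...] raises a clear KeyError if DATABASE_URL is unset,
# instead of passing None to connect().
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())
conn.close()
EOF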

Common MLflow Issues

| Issue | Symptom | Resolution |
|---|---|---|
| Can't connect | Connection refused | Start MLflow service |
| Database error | 500 errors | Check RDS connectivity |
| Model not found | 404 errors | Verify model name/version |
| Slow queries | Timeout | Check RDS performance |

Lambda → DynamoDB

Lambda Functions Using DynamoDB

  • health-check: Reads/writes health state
  • trading-heartbeat-check: Reads trading state
  • sqs-consumer: Updates workflow state
  • orphan-scanner: Queries workflow state

Debugging

Step 1: Check Lambda permissions

aws lambda get-function-configuration \
  --function-name tradai-health-check-${ENVIRONMENT} \
  --query 'Role'

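# Check role permissions: inline policies
aws iam list-role-policies \
  --role-name tradai-lambda-role-${ENVIRONMENT}

# Managed policies attached to the role (not shown by list-role-policies)
aws iam list-attached-role-policies \
  --role-name tradai-lambda-role-${ENVIRONMENT}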

Step 2: Check DynamoDB access

# Test read access (scan limited to one item)
aws dynamodb scan \
  --table-name tradai-health-state-${ENVIRONMENT} \
  --limit 1

# Check for throttling over the last 15 minutes
# (date -v-15M is BSD/macOS syntax; with GNU date use
#  date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ for --start-time)
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=tradai-health-state-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Sum
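
The throttling check returns one datapoint per period, so the number that matters is the total across the window. The sketch below computes the same 15-minute window in a platform-independent way and sums the `Sum` statistic from the CLI's JSON output. `time_window` and `total_throttled` are hypothetical helper names, and the code assumes the command was run with the default JSON output format.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def time_window(minutes: int = 15, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) UTC timestamps for --start-time and --end-time."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)


def total_throttled(cli_output: str) -> float:
    """Sum the 'Sum' statistic over all datapoints in get-metric-statistics JSON."""
    datapoints = json.loads(cli_output).get("Datapoints", [])
    return sum(point.get("Sum", 0.0) for point in datapoints)


start, end = time_window()
print(f"--start-time {start} --end-time {end}")
```

An empty `Datapoints` list sums to zero, which is the expected result when no requests were throttled in the window.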

Step 3: Check VPC endpoint (for Lambda in VPC)

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.dynamodb" \
  --query 'VpcEndpoints[].State'


ECS → S3

Services Using S3

  • Backend: Config reading, result storage
  • Strategy Service: Config loading
  • Strategy Tasks: Results upload

Debugging

Step 1: Check S3 access

# List buckets
aws s3 ls | grep tradai

# Test read
aws s3 ls s3://tradai-configs-${ENVIRONMENT}/strategies/

Step 2: Check from container

aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend \
  --interactive \
  --command "aws s3 ls s3://tradai-configs-${ENVIRONMENT}/"

Step 3: Check VPC endpoint

aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.${AWS_REGION}.s3" \
  --query 'VpcEndpoints[].State'


Service Discovery (AWS)

Cloud Map Configuration

Services are discoverable via *.tradai-{env}.local:

  • backend-api.tradai-{env}.local:8000
  • strategy-service.tradai-{env}.local:8003
  • data-collection.tradai-{env}.local:8002
  • mlflow.tradai-{env}.local:5000
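
The naming scheme above is regular enough to generate, which helps when checking every service in one pass. The sketch below builds the expected endpoint for each service and resolves a hostname with the standard library. `SERVICES`, `discovery_endpoints`, and `resolve` are hypothetical names, and the lookup only succeeds from inside the VPC where the private namespace is visible.

```python
import socket
from typing import Dict, Optional

# Service name -> port, as listed above.
SERVICES = {
    "backend-api": 8000,
    "strategy-service": 8003,
    "data-collection": 8002,
    "mlflow": 5000,
}


def discovery_endpoints(env: str) -> Dict[str, str]:
    """Build the host:port endpoint of each service for the given environment."""
    return {
        name: f"{name}.tradai-{env}.local:{port}"
        for name, port in SERVICES.items()
    }


def resolve(hostname: str) -> Optional[str]:
    """Return the first resolved IPv4 address, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return None


def report(env: str) -> None:
    """Print each endpoint with its resolved address, or 'unresolved'."""
    for name, endpoint in discovery_endpoints(env).items():
        host = endpoint.rsplit(":", 1)[0]
        print(f"{name}: {endpoint} -> {resolve(host) or 'unresolved'}")
```

Calling `report("dev")` from a container in the cluster prints one line per service. An `unresolved` entry points at a Cloud Map registration problem rather than at the service itself.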

Debugging DNS Resolution

Step 1: Check Cloud Map namespace

aws servicediscovery list-namespaces \
  --query "Namespaces[?Name=='tradai-${ENVIRONMENT}.local']"

Step 2: Check service registration

aws servicediscovery list-services \
  --filters Name=NAMESPACE_ID,Values=$NAMESPACE_ID

Step 3: Test DNS from container

aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend-api \
  --interactive \
  --command "nslookup strategy-service.tradai-${ENVIRONMENT}.local"


Step Functions → ECS (Backtest Execution)

Interaction Pattern

Step Functions orchestrates backtest execution by launching ECS tasks:

  1. API Gateway receives POST /api/v1/backtests and sends the request to SQS
  2. SQS Consumer Lambda creates a DynamoDB workflow state record and starts a Step Functions execution
  3. Step Functions launches an ECS task (strategy-container) for the backtest
  4. ECS task runs the backtest, logs results to MLflow, and uploads to S3
  5. Step Functions invokes the update-status Lambda to update DynamoDB workflow state
  6. The notify-completion Lambda sends an SNS notification

Debugging Step Functions → ECS

Step 1: Check Step Functions execution status

# List recent backtest executions
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-backtest-workflow-${ENVIRONMENT} \
  --max-results 5

# Get execution details
aws stepfunctions describe-execution \
  --execution-arn $EXECUTION_ARN

Step 2: Check if ECS task was launched

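# Find tasks started by Step Functions
# (list-tasks returns only RUNNING tasks by default; add
#  --desired-status STOPPED to find finished or failed tasks)
aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --family tradai-strategy-container-${ENVIRONMENT}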

# Check task status
aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks $TASK_ARN
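
For a task that has already stopped, the most useful fields in the describe-tasks output are the stop reason and the container exit codes. The sketch below extracts them, assuming the CLI's default JSON output has been captured as a string. `summarize_tasks` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List


def summarize_tasks(cli_output: str) -> List[Dict[str, Any]]:
    """Extract status, stop reason, and exit codes from describe-tasks JSON."""
    summaries = []
    for task in json.loads(cli_output).get("tasks", []):
        summaries.append({
            "task": task.get("taskArn"),
            "status": task.get("lastStatus"),
            # Present only once the task has stopped.
            "stopped_reason": task.get("stoppedReason"),
            # exitCode is absent while the container is still running.
            "exit_codes": {
                container.get("name"): container.get("exitCode")
                for container in task.get("containers", [])
            },
        })
    return summaries
```

A non-zero exit code with an empty stop reason usually means the container itself failed, so the task logs are the next place to look.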

Step 3: Check DynamoDB workflow state

# Replace JOB_ID with the job ID of the backtest being investigated
aws dynamodb get-item \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --key '{"job_id": {"S": "JOB_ID"}}'

Step 4: Check ECS task logs

aws logs tail /ecs/tradai/${ENVIRONMENT} --follow

Common Issues

| Issue | Symptom | Resolution |
|---|---|---|
| ECS task not launched | Step Functions shows "ECS RunTask failed" | Check IAM role, subnet, security group |
| Task exits immediately | Exit code 1, no output | Check container logs for startup errors |
| Task times out | Step Functions heartbeat timeout | Increase timeout or optimize backtest |
| Results not saved | Task succeeded but DynamoDB not updated | Check update-status Lambda logs |
| Wrong container image | Task starts but wrong strategy | Verify task definition image tag |

Timeout Matrix

Expected response times for service interactions:

| Source | Target | Normal | Warn | Critical |
|---|---|---|---|---|
| Backend | Strategy Service | <100ms | <500ms | >1s |
| Backend | Data Collection | <100ms | <500ms | >1s |
| Backend | MLflow | <200ms | <1s | >2s |
| Data Collection | Exchange | <500ms | <2s | >5s |
| Lambda | DynamoDB | <50ms | <200ms | >500ms |
| ECS | S3 | <100ms | <500ms | >1s |
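
The matrix can be applied mechanically to a measured latency. The sketch below is one possible encoding, with an explicit assumption: the table leaves the range between the Warn bound and the Critical bound unclassified (for example 500ms to 1s for Backend to Strategy Service), and this sketch treats that range as "warn". `THRESHOLDS_MS` and `classify` are hypothetical names.

```python
from typing import Dict, Tuple

# (normal upper bound, critical lower bound) in milliseconds, from the matrix.
THRESHOLDS_MS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("Backend", "Strategy Service"): (100, 1000),
    ("Backend", "Data Collection"): (100, 1000),
    ("Backend", "MLflow"): (200, 2000),
    ("Data Collection", "Exchange"): (500, 5000),
    ("Lambda", "DynamoDB"): (50, 500),
    ("ECS", "S3"): (100, 1000),
}


def classify(source: str, target: str, latency_ms: float) -> str:
    """Classify a latency as 'normal', 'warn', or 'critical' for one interaction."""
    normal_max, critical_min = THRESHOLDS_MS[(source, target)]
    if latency_ms < normal_max:
        return "normal"
    if latency_ms > critical_min:
        return "critical"
    # Assumption: everything in between counts as warn, including the gap
    # the matrix leaves between its Warn and Critical columns.
    return "warn"


print(classify("Backend", "MLflow", 150))
print(classify("Lambda", "DynamoDB", 600))
```

The `time_total` value that curl reports with `-w '%{time_total}s'` is in seconds, so multiply it by 1000 before passing it in.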

Quick Connectivity Tests

# === Local Development ===
echo "Backend:"
curl -s http://localhost:8000/api/v1/health | jq '.status'

echo "Strategy Service:"
curl -s http://localhost:8003/api/v1/health | jq '.status'

echo "Data Collection:"
curl -s http://localhost:8002/api/v1/health | jq '.status'

echo "MLflow:"
# MLflow's /health endpoint returns plain text ("OK"), not JSON, so no jq here
curl -s http://localhost:5001/health

# === AWS ===
for svc in backend strategy-service data-collection mlflow; do
  echo "=== $svc ==="
  aws ecs describe-services \
    --cluster tradai-${ENVIRONMENT} \
    --services tradai-${svc}-${ENVIRONMENT} \
    --query 'services[0].{Status:status,Running:runningCount}'
done
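
The local checks above can also be run as a single script that reports every service even when some are down. This is a minimal sketch using only the standard library. `ENDPOINTS`, `http_status`, and `check_all` are hypothetical names, and the URLs are the local development endpoints used in this guide.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Local health endpoints used in this guide.
ENDPOINTS = {
    "backend": "http://localhost:8000/api/v1/health",
    "strategy-service": "http://localhost:8003/api/v1/health",
    "data-collection": "http://localhost:8002/api/v1/health",
    "mlflow": "http://localhost:5001/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code of a GET request (raises on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_all(fetch: Callable[[str], int] = http_status) -> Dict[str, str]:
    """Check every endpoint and return a short status string per service."""
    results = {}
    for name, url in ENDPOINTS.items():
        try:
            status = fetch(url)
            results[name] = "ok" if status == 200 else f"HTTP {status}"
        except urllib.error.HTTPError as exc:
            # The service answered, but with an error status such as 503.
            results[name] = f"HTTP {exc.code}"
        except OSError as exc:
            # DNS failure, connection refused, or timeout.
            results[name] = f"unreachable ({exc})"
    return results
```

Printing the items of `check_all()` gives one line per service. The `fetch` parameter exists so the function can be exercised without running services.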