Common Issues and Solutions

FAQ-style guide for frequently encountered problems.

Quick Check

Most issues can be diagnosed with `just doctor services` and `docker compose logs -f <service>`.

Connection Issues

"Connection refused" - Local Development¶

Symptom: ConnectionRefusedError or ECONNREFUSED when running locally.

Cause: Service not running or wrong port.

Solution:

# Check what's running
docker compose ps

# Start all services
just up

# Check specific service logs
docker compose logs backend

# Verify ports
docker compose port backend 8000
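
The same check can be made from Python when a script needs to know whether anything is listening before it connects. This is a minimal sketch that assumes the backend is published on localhost:8000, as in the commands above; `port_open` is an illustrative helper, not part of the project.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address: the backend port used elsewhere in this guide.
    print("backend reachable:", port_open("localhost", 8000))
```

A `False` result here corresponds to the "Connection refused" symptom: either the service is not running or it is bound to a different port.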

"Connection refused" - Docker Compose¶

Symptom: Services can't communicate within Docker.

Cause: Network misconfiguration or service not ready.

Solution:

# Check network
docker network ls | grep tradai

# Restart with a fresh network
# (add -v to `down` only if you also want to delete named volumes and their data)
docker compose down
docker compose up -d

# Check service is listening
docker compose exec backend netstat -tlnp | grep 8000

"Connection refused" - AWS ECS¶

Symptom: ECS tasks can't reach each other or external services.

Cause: Security groups, VPC endpoints, or NAT issues.

AWS Credentials Required

AWS troubleshooting commands require configured credentials with appropriate permissions. Run `aws sts get-caller-identity` to verify.

Solution:

# Check security groups
aws ec2 describe-security-groups --group-ids $SG_ID

# Verify NAT instance
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-nat-*" \
  --query 'Reservations[].Instances[].State.Name'

# Check VPC endpoints
aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$VPC_ID"

# Test from container (requires ECS Exec enabled on the task and the Session Manager plugin)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container backend \
  --interactive \
  --command "curl -v https://api.binance.com/api/v3/ping"

Authentication Issues

"Token expired" / "Invalid token"¶

Symptom: 401 Unauthorized responses.

Cause: JWT token expired or malformed.

Solution:

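# Get new token
# (COGNITO_USERNAME and COGNITO_PASSWORD are placeholder names; avoid $USER, which is normally your OS login name)
aws cognito-idp initiate-auth \
  --client-id "$CLIENT_ID" \
  --auth-flow USER_PASSWORD_AUTH \
  --auth-parameters USERNAME="$COGNITO_USERNAME",PASSWORD="$COGNITO_PASSWORD"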

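# Verify token (decode the payload segment)
# JWTs use unpadded base64url; if base64 reports invalid input, append "=" until the length is a multiple of 4
echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+' | base64 -d | jq .

# Check expiration
echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+' | base64 -d | jq '.exp | todate'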
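
Because JWT segments are base64url-encoded without padding, a shell pipeline can fail on some tokens. The sketch below decodes the payload in Python and restores the padding first. It only inspects the claims and does not verify the signature; `decode_jwt_payload` and `is_expired` are illustrative names.

```python
import base64
import json
import time


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # Restore the padding that base64url-encoded JWT segments omit.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(claims: dict, now=None) -> bool:
    """Return True if the 'exp' claim (seconds since epoch) is in the past."""
    current = time.time() if now is None else now
    return claims["exp"] <= current
```

For example, `is_expired(decode_jwt_payload(token))` returning `True` points to the "Token expired" case, while a `ValueError` or JSON error points to a malformed token.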

"Cognito configuration error"¶

Symptom: Service fails to validate tokens.

Cause: Wrong Cognito settings.

Solution:

# Check environment variables
echo $BACKEND_COGNITO_USER_POOL_ID
echo $BACKEND_COGNITO_CLIENT_ID
echo $BACKEND_COGNITO_REGION

# Verify the configured user pool exists
aws cognito-idp describe-user-pool --user-pool-id "$BACKEND_COGNITO_USER_POOL_ID"

"M2M authentication failed"¶

Symptom: Service-to-service auth failing.

Cause: Wrong client credentials or missing scopes.

Solution:

# Get M2M token
curl -X POST "https://${COGNITO_DOMAIN}/oauth2/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=${M2M_CLIENT_ID}&client_secret=${M2M_CLIENT_SECRET}"

# Verify M2M client
aws cognito-idp describe-user-pool-client \
  --user-pool-id $USER_POOL_ID \
  --client-id $M2M_CLIENT_ID
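
Since missing scopes are one of the listed causes, it can help to see where a scope goes in a client-credentials request. The sketch below only builds the request object and sends nothing. It assumes the Cognito token endpoint accepts the client credentials as an HTTP Basic header; `build_m2m_token_request` and the scope names in the usage note are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_m2m_token_request(domain, client_id, client_secret, scopes=()):
    """Build (but do not send) an OAuth2 client_credentials token request."""
    body = {"grant_type": "client_credentials"}
    if scopes:
        # Scopes are sent as one space-separated value.
        body["scope"] = " ".join(scopes)
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        f"https://{domain}/oauth2/token",
        data=urllib.parse.urlencode(body).encode(),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {basic}",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; for instance, `build_m2m_token_request(domain, client_id, client_secret, scopes=["svc/read"])` uses a made-up scope name that must be replaced with one actually granted to the M2M client.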

Timeout Issues

Exchange API Timeouts

Symptom: TimeoutError when fetching market data.

Cause: Exchange rate limits, NAT issues, or network congestion.

Solution:

# Check exchange status
curl https://api.binance.com/api/v3/ping

# Check NAT instance health
aws ec2 describe-instance-status --instance-ids $NAT_INSTANCE_ID

# Check for rate limiting (CloudWatch matches any term prefixed with "?"; it has no OR keyword)
aws logs filter-log-events \
  --log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
  --filter-pattern '?"429" ?"rate limit"' \
  --limit 20

# Verify egress works (add --container <name> if the task runs more than one container)
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --interactive \
  --command "curl -w '%{time_total}' https://api.binance.com/api/v3/ping"
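
When logs have already been exported to a file or captured from `docker compose logs`, the same either-term rate-limit check can be run locally. This is a minimal sketch; the helper name and the sample lines are illustrative and not taken from the services' real log format.

```python
import re

# Match a standalone 429 status code or the phrase "rate limit" (any case).
RATE_LIMIT = re.compile(r"\b429\b|rate limit", re.IGNORECASE)


def rate_limited_lines(lines):
    """Return the log lines that mention either rate-limit marker."""
    return [line for line in lines if RATE_LIMIT.search(line)]


if __name__ == "__main__":
    sample = [  # made-up lines for illustration
        "GET /api/v3/klines -> 200",
        "GET /api/v3/klines -> 429 Too Many Requests",
        "Rate limit exceeded, backing off",
    ]
    for line in rate_limited_lines(sample):
        print(line)
```

The word boundaries around `429` keep the pattern from matching longer numbers such as IDs or timestamps that happen to contain those digits.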

Database Timeouts

Symptom: Database operations timing out.

Cause: Connection pool exhausted or RDS overloaded.

Solution:

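# Note: the --start-time values below use BSD/macOS date syntax (-v-15M).
# On GNU/Linux, use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ
# Check RDS connections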
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Maximum

# Check RDS CPU
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
  --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Average

# See database-operations runbook for resolution
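
The `date` flags for "15 minutes ago" differ between macOS and Linux, so the metric window can be computed in Python and passed to `--start-time` and `--end-time` instead. A minimal sketch; `metric_window` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone


def metric_window(minutes: int = 15, now=None):
    """Return (start, end) UTC timestamps in the format the commands above use."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


if __name__ == "__main__":
    start, end = metric_window()
    print(f"--start-time {start} --end-time {end}")
```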

Strategy Issues

"Strategy not found"¶

Symptom: Error loading strategy configuration.

Cause: Config file missing or wrong path.

Solution:

# List available configs (local)
ls configs/strategies/

# List available configs (S3)
aws s3 ls s3://tradai-${ENVIRONMENT}-configs/strategies/

# Verify config content
aws s3 cp s3://tradai-${ENVIRONMENT}-configs/strategies/my-strategy.json -

# Check strategy service can read
curl http://localhost:8003/api/v1/strategies | jq

"Invalid strategy configuration"¶

Symptom: Validation errors on strategy config.

Cause: Schema mismatch or missing required fields.

Solution:

# Validate config locally
tradai strategy validate configs/strategies/my-strategy.json

# Check required fields
# - strategy (class name)
# - pairs (list of symbols)
# - timeframe (1m, 5m, 1h, etc.)
# - stake_amount (number)

# Example valid config
cat << 'EOF'
{
  "strategy": "TrendFollowingStrategy",
  "pairs": ["BTC/USDT:USDT"],
  "timeframe": "1h",
  "stake_amount": 100.0,
  "dry_run_wallet": 10000.0
}
EOF
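
The required fields listed above can also be checked with a few lines of Python before a config is uploaded. This is a rough sketch, not the project's validator: it covers only the four fields named above and their basic types, and the real schema used by `tradai strategy validate` may be stricter.

```python
# Required fields and the basic Python types expected after json.load().
REQUIRED = {
    "strategy": str,
    "pairs": list,
    "timeframe": str,
    "stake_amount": (int, float),
}


def config_errors(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in config:
            errors.append(f"missing field: {field}")
            continue
        value = config[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {field}")
    return errors
```

Running `config_errors(json.load(open(path)))` on the example config above returns an empty list; a config with `"stake_amount": "100"` would be reported as a type error.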

"Strategy class not found"¶

Symptom: ModuleNotFoundError or class import error.

Cause: Strategy not installed or wrong class name.

Solution:

# Check strategy is installed
pip list | grep my-strategy

# Verify class name matches
grep -r "class.*Strategy" strategies/my-strategy/

# Check strategy registry
tradai strategy list

Backtest Issues

"Backtest never completes"¶

Symptom: Job stays in "running" status indefinitely.

Cause: ECS task failed silently or stuck.

Solution:

# Check job status
curl http://localhost:8000/api/v1/backtests/${JOB_ID} | jq

# Check DynamoDB state
aws dynamodb get-item \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --key "{\"job_id\": {\"S\": \"${JOB_ID}\"}}"

# Find ECS task
aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --family tradai-strategy-${ENVIRONMENT}

# Check task logs
aws logs tail /ecs/tradai-strategy-${ENVIRONMENT} --follow

# If orphaned, cancel manually
curl -X POST http://localhost:8000/api/v1/backtests/${JOB_ID}/cancel
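
To avoid watching a job by hand, the status check can be wrapped in a poll with a deadline, so a stuck job surfaces as an explicit timeout. In this sketch, `fetch_status` is a caller-supplied function (for example, one that calls the backtest status endpoint above); the only status name taken from this guide is "running", and everything else is treated as terminal.

```python
import time


def wait_for_status(fetch_status, timeout_s, interval_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it stops returning "running" or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status != "running":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still running after {timeout_s}s")
        sleep(interval_s)
```

A `TimeoutError` here is the cue to move on to the DynamoDB, ECS, and log checks above, and to cancel the job if it turns out to be orphaned.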

"Insufficient data for backtest"¶

Symptom: Backtest fails with data availability error.

Cause: Missing OHLCV data for requested period.

Solution:

# Check data freshness
curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq

# Sync missing data
tradai data sync --symbol BTC/USDT:USDT --start 2024-01-01 --end 2024-12-01

# Verify data is available
tradai data check --symbol BTC/USDT:USDT --timeframe 1h
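
If data exists but the backtest still reports missing candles, the gaps can be located from the candle timestamps themselves. A minimal sketch, assuming the timestamps are already loaded as a sorted list of `datetime` values; `find_gaps` is an illustrative helper, not a project API.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step):
    """Return (first_missing, last_missing) pairs between consecutive candles.

    timestamps must be sorted ascending; step is the timeframe as a timedelta.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > step:
            gaps.append((prev + step, cur - step))
    return gaps


if __name__ == "__main__":
    hours = [0, 1, 4, 5]  # candles at 02:00 and 03:00 are missing
    candles = [datetime(2024, 1, 1, h) for h in hours]
    for start, end in find_gaps(candles, timedelta(hours=1)):
        print("missing from", start, "to", end)
```

Each reported range can then be passed as the `--start` and `--end` dates of a sync.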

Data Issues

"Missing OHLCV data"¶

Symptom: Data queries return empty results.

Cause: Data not synced or ArcticDB issues.

Solution:

# Check symbol availability
tradai data symbols

# Check data range
tradai data info --symbol BTC/USDT:USDT

# Sync data
tradai data sync --symbol BTC/USDT:USDT --days 90

# Verify in ArcticDB
python << 'EOF'
from arcticdb import Arctic
ac = Arctic("lmdb://./user_data/arcticdb")
lib = ac["ohlcv"]
print(lib.list_symbols())
EOF

"Stale data"¶

Symptom: Data is older than expected.

Cause: Sync job failed or not running.

Solution:

# Check data collection service
curl http://localhost:8002/api/v1/health | jq

# Check last sync time
curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq

# Force sync
curl -X POST "http://localhost:8002/api/v1/sync" \
  -H "Content-Type: application/json" \
  -d '{"symbols": ["BTC/USDT:USDT"], "days": 1}'

# Check sync logs
docker compose logs data-collection | tail -100

MLflow Issues

"MLflow tracking server unavailable"¶

Symptom: Cannot log experiments or load models.

Cause: MLflow service down or network issues.

Solution:

# Check MLflow health
curl http://localhost:5001/health

# Check MLflow service (Docker)
docker compose logs mlflow

# Check MLflow service (ECS)
aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-mlflow-${ENVIRONMENT}

# Verify tracking URI
echo $MLFLOW_TRACKING_URI

# Test connection
python << 'EOF'
import mlflow
mlflow.set_tracking_uri("http://localhost:5001")
print(mlflow.search_experiments())
EOF

"Model not found"¶

Symptom: Cannot load registered model.

Cause: Model not registered or wrong name.

Solution:

# List registered models (the older registered-models/list endpoint was removed in MLflow 2.0)
curl http://localhost:5001/api/2.0/mlflow/registered-models/search | jq

# Get by exact name
curl "http://localhost:5001/api/2.0/mlflow/registered-models/get?name=MyStrategy" | jq

# Check model versions
curl "http://localhost:5001/api/2.0/mlflow/registered-models/get-latest-versions?name=MyStrategy" | jq

Environment Variable Issues

"Environment variable not set"¶

Symptom: KeyError or missing config errors.

Solution:

# Check local .env (skips comments and blank lines; output may contain secrets)
grep -Ev '^(#|$)' .env

# Check Docker environment
docker compose exec backend env | sort

# Check ECS task environment
aws ecs describe-task-definition \
  --task-definition tradai-backend-${ENVIRONMENT} \
  --query 'taskDefinition.containerDefinitions[0].environment'

# Required variables (example)
# DATABASE_URL
# MLFLOW_TRACKING_URI
# AWS_REGION
# S3_CONFIG_BUCKET
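
A service can fail fast with a clear message instead of a `KeyError` by checking its variables at startup. The sketch below uses the example variable names listed above; the real list depends on the service, and `missing_vars` is an illustrative helper.

```python
import os

# Example list from this guide; adjust per service.
REQUIRED_VARS = ["DATABASE_URL", "MLFLOW_TRACKING_URI", "AWS_REGION", "S3_CONFIG_BUCKET"]


def missing_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required variables are set")
```

Empty values are treated as missing, since an empty `DATABASE_URL` usually fails in the same way as an unset one.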

Quick Fix Reference

Issue	Quick Fix
Service not starting	`docker compose down && docker compose up -d`
Stale Docker state	`docker compose down -v && docker compose up -d` (also deletes volumes and their data)
Port conflict	`lsof -i :8000` and kill the process
Permission denied	Check file ownership and Docker socket
Out of disk	`docker system prune -a` (removes all unused images, not just dangling ones)
Network issues	`docker network prune`
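ECS task stuck	Force a new deployment: `aws ecs update-service --cluster tradai-${ENVIRONMENT} --service <service> --force-new-deployment`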
Lambda timeout	Increase timeout in config