Common Issues and Solutions
FAQ-style guide for frequently encountered problems.
Quick Check
Most issues can be diagnosed by running just check-health and then following the affected service's logs with docker compose logs -f <service>.
Connection Issues
"Connection refused" - Local Development
Symptom: ConnectionRefusedError or ECONNREFUSED when running locally.
Cause: Service not running or wrong port.
Solution:
# Check what's running
docker compose ps
# Start all services
just up
# Check specific service logs
docker compose logs backend
# Verify ports
docker compose port backend 8000
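If the container is listed as running but clients still see ECONNREFUSED, probing the published port from the host can narrow things down. The following is a minimal sketch, assuming the backend is published on localhost port 8000 as in the command above; `is_port_open` is a hypothetical helper name, not part of the project.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("127.0.0.1", 8000)` returns False when nothing is listening, which points at the service or its port mapping rather than at the client.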
"Connection refused" - Docker Compose¶
Symptom: Services can't communicate within Docker.
Cause: Network misconfiguration or service not ready.
Solution:
# Check network
docker network ls | grep tradai
# Restart with a fresh network (note: -v also deletes the project's named volumes and the data in them)
docker compose down -v
docker compose up -d
# Check service is listening
docker compose exec backend netstat -tlnp | grep 8000
"Connection refused" - AWS ECS
Symptom: ECS tasks can't reach each other or external services.
Cause: Security groups, VPC endpoints, or NAT issues.
AWS Credentials Required
AWS troubleshooting commands require configured credentials with appropriate permissions. Run aws sts get-caller-identity to verify.
Solution:
# Check security groups
aws ec2 describe-security-groups --group-ids $SG_ID
# Verify NAT instance
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=tradai-nat-*" \
--query 'Reservations[].Instances[].State.Name'
# Check VPC endpoints
aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$VPC_ID"
# Test from container (requires ECS Exec to be enabled on the task)
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container backend \
--interactive \
--command "curl -v https://api.binance.com/api/v3/ping"
Authentication Issues
"Token expired" / "Invalid token"
Symptom: 401 Unauthorized responses.
Cause: JWT token expired or malformed.
Solution:
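# Get new token
# ($USER is normally the shell's login name; COGNITO_USERNAME and COGNITO_PASSWORD are placeholder names)
aws cognito-idp initiate-auth \
--client-id $CLIENT_ID \
--auth-flow USER_PASSWORD_AUTH \
--auth-parameters USERNAME="$COGNITO_USERNAME",PASSWORD="$COGNITO_PASSWORD"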
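# Verify token (decode). JWT segments are unpadded base64url, so convert and pad before decoding
payload=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
echo "$payload" | base64 -d | jq .
# Check expiration
echo "$payload" | base64 -d | jq '.exp | todate'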
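The same inspection can be scripted when a shell pipeline is inconvenient. The sketch below decodes the payload segment and reports the time left before the `exp` claim. It does not verify the signature, so it is only suitable for debugging. The function names are illustrative, not taken from the project.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Return seconds left before the 'exp' claim; negative means expired."""
    claims = decode_jwt_payload(token)
    current = time.time() if now is None else now
    return claims["exp"] - current
```

A negative result from `seconds_until_expiry(token)` means the token has expired and a new one is needed. A `ValueError` suggests the token is malformed.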
"Cognito configuration error"¶
Symptom: Service fails to validate tokens.
Cause: Wrong Cognito settings.
Solution:
# Check environment variables
echo $BACKEND_COGNITO_USER_POOL_ID
echo $BACKEND_COGNITO_CLIENT_ID
echo $BACKEND_COGNITO_REGION
# Verify user pool exists
aws cognito-idp describe-user-pool --user-pool-id $USER_POOL_ID
"M2M authentication failed"
Symptom: Service-to-service auth failing.
Cause: Wrong client credentials or missing scopes.
Solution:
# Get M2M token
curl -X POST "https://${COGNITO_DOMAIN}/oauth2/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=client_credentials&client_id=${M2M_CLIENT_ID}&client_secret=${M2M_CLIENT_SECRET}"
# Verify M2M client
aws cognito-idp describe-user-pool-client \
--user-pool-id $USER_POOL_ID \
--client-id $M2M_CLIENT_ID
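Since missing scopes are one of the listed causes, it can help to build the token request explicitly and inspect it before sending. This sketch assumes the Cognito token endpoint's HTTP Basic client authentication and an optional `scope` form field. The helper name and the scope value in the comment are hypothetical.

```python
import base64
import urllib.parse
import urllib.request
from typing import Optional


def build_m2m_token_request(domain: str, client_id: str, client_secret: str,
                            scope: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a client_credentials request for the token endpoint."""
    form = {"grant_type": "client_credentials"}
    if scope:
        # Space-separated custom scopes, e.g. "api/read" (hypothetical name).
        form["scope"] = scope
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{domain}/oauth2/token",
        data=urllib.parse.urlencode(form).encode(),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. Printing `req.data` first shows which scopes are actually being requested.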
Timeout Issues
Exchange API Timeouts
Symptom: TimeoutError when fetching market data.
Cause: Exchange rate limits, NAT issues, or network congestion.
Solution:
# Check exchange status
curl https://api.binance.com/api/v3/ping
# Check NAT instance health
aws ec2 describe-instance-status --instance-ids $NAT_INSTANCE_ID
# Check for rate limiting
aws logs filter-log-events \
--log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
--filter-pattern '"429" OR "rate limit"' \
--limit 20
# Verify egress works (add --container <name> if the task runs more than one container)
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--interactive \
--command "curl -w '%{time_total}' https://api.binance.com/api/v3/ping"
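When timeouts come from transient congestion or rate limiting, client code usually copes better if it retries with growing delays instead of failing on the first attempt. The following is a generic sketch of exponential backoff around any callable that raises `TimeoutError`. It is not the project's actual retry logic, and it omits jitter and HTTP 429 handling for brevity.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0, retries: int = 5) -> Iterator[float]:
    """Yield exponentially growing delays, capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def call_with_retry(fn: Callable[[], T], retries: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TimeoutError with exponential backoff."""
    for delay in backoff_delays(retries=retries):
        try:
            return fn()
        except TimeoutError:
            sleep(delay)
    # Final attempt: let the exception propagate to the caller.
    return fn()
```

The `sleep` parameter is injectable so the behaviour can be exercised in tests without waiting.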
Database Timeouts
Symptom: Database operations timing out.
Cause: Connection pool exhausted or RDS overloaded.
Solution:
# Check RDS connections (date -v is BSD/macOS syntax; on GNU/Linux use: date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Maximum
# Check RDS CPU
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Average
# See database-operations runbook for resolution
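The time window and the interpretation of the result can also be handled in a platform-independent way. This sketch builds the `--start-time`/`--end-time` values and extracts the peak from the JSON that `get-metric-statistics` prints, assuming the standard `Datapoints` list in that output. Both helper names are illustrative.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def metric_window(minutes: int = 15, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) as UTC ISO-8601 strings for --start-time/--end-time."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


def peak_datapoint(cli_output: str, statistic: str = "Maximum") -> Optional[float]:
    """Return the highest value of statistic in get-metric-statistics JSON output."""
    datapoints = json.loads(cli_output).get("Datapoints", [])
    values = [dp[statistic] for dp in datapoints if statistic in dp]
    return max(values) if values else None
```

Comparing the peak connection count with the instance's configured connection limit indicates whether pool exhaustion is plausible. A result of `None` means no datapoints were returned for the window.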
Strategy Issues
"Strategy not found"
Symptom: Error loading strategy configuration.
Cause: Config file missing or wrong path.
Solution:
# List available configs (local)
ls configs/strategies/
# List available configs (S3)
aws s3 ls s3://tradai-${ENVIRONMENT}-configs/strategies/
# Verify config content
aws s3 cp s3://tradai-${ENVIRONMENT}-configs/strategies/my-strategy.json -
# Check strategy service can read
curl http://localhost:8003/api/v1/strategies | jq
"Invalid strategy configuration"
Symptom: Validation errors on strategy config.
Cause: Schema mismatch or missing required fields.
Solution:
# Validate config locally (path follows the configs/strategies/ layout used above)
tradai strategy validate configs/strategies/my-strategy.json
# Check required fields
# - strategy (class name)
# - pairs (list of symbols)
# - timeframe (1m, 5m, 1h, etc.)
# - stake_amount (number)
# Example valid config
cat << 'EOF'
{
"strategy": "TrendFollowingStrategy",
"pairs": ["BTC/USDT:USDT"],
"timeframe": "1h",
"stake_amount": 100.0,
"dry_run_wallet": 10000.0
}
EOF
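A quick pre-check of the four required fields listed above can catch the most common mistakes before the config reaches the real validator. This is a rough sketch only: the actual schema enforced by `tradai strategy validate` may be stricter, and the type expectations here are assumptions inferred from the example config.

```python
from typing import Any, Dict, List

# Required fields as listed above; the real schema may require more.
REQUIRED_FIELDS = {
    "strategy": str,
    "pairs": list,
    "timeframe": str,
    "stake_amount": (int, float),
}


def find_config_errors(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the basic checks passed."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"missing required field: {field}")
        elif isinstance(config[field], bool) or not isinstance(config[field], expected):
            # bool is rejected explicitly because it is a subclass of int.
            errors.append(f"wrong type for {field}: {type(config[field]).__name__}")
    if isinstance(config.get("pairs"), list) and not config["pairs"]:
        errors.append("pairs must contain at least one symbol")
    return errors
```

Loading the file with `json.load` and passing the result to `find_config_errors` gives a list that can be printed line by line.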
"Strategy class not found"¶
Symptom: ModuleNotFoundError or class import error.
Cause: Strategy not installed or wrong class name.
Solution:
# Check strategy is installed
pip list | grep my-strategy
# Verify class name matches
grep -r "class.*Strategy" strategies/my-strategy/
# Check strategy registry
tradai strategy list
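To tell a missing package apart from a mistyped class name, the module can be imported directly and its classes listed. This sketch uses only the standard library. The module name passed in and the `Strategy` suffix convention are assumptions based on the grep pattern above.

```python
import importlib
import inspect
from typing import List


def list_strategy_classes(module_name: str, suffix: str = "Strategy") -> List[str]:
    """Import module_name and return names of classes defined in it ending with suffix."""
    # Raises ModuleNotFoundError if the package is not installed in this environment.
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if name.endswith(suffix) and obj.__module__ == module.__name__
    )
```

A `ModuleNotFoundError` points at installation. An empty or unexpected list points at the class name in the config.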
Backtest Issues
"Backtest never completes"
Symptom: Job stays in "running" status indefinitely.
Cause: ECS task failed silently or stuck.
Solution:
# Check job status
curl http://localhost:8000/api/v1/backtests/${JOB_ID} | jq
# Check DynamoDB state (double quotes so that ${JOB_ID} is expanded)
aws dynamodb get-item \
--table-name tradai-${ENVIRONMENT}-workflow-state \
--key "{\"job_id\": {\"S\": \"${JOB_ID}\"}}"
# Find ECS task (add --desired-status STOPPED to list tasks that have already exited)
aws ecs list-tasks \
--cluster tradai-${ENVIRONMENT} \
--family tradai-strategy-${ENVIRONMENT}
# Check task logs
aws logs tail /ecs/tradai-strategy-${ENVIRONMENT} --follow
# If orphaned, cancel manually
curl -X POST http://localhost:8000/api/v1/backtests/${JOB_ID}/cancel
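Client scripts that wait on a backtest can avoid hanging forever by polling with a deadline. The sketch below is generic: `get_status` is any callable returning the job's status string, and the terminal status names are assumptions, since only "running" is documented here.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumed status names


def wait_for_job(get_status: Callable[[], str], timeout: float = 3600.0,
                 interval: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status until it is terminal; raise TimeoutError after timeout seconds."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout:.0f}s; check the ECS task")
        sleep(interval)
```

When the `TimeoutError` fires, the ECS task and log checks above are the next step, followed by a manual cancel if the job turns out to be orphaned.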
"Insufficient data for backtest"¶
Symptom: Backtest fails with data availability error.
Cause: Missing OHLCV data for requested period.
Solution:
# Check data freshness
curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq
# Sync missing data
tradai data sync --symbol BTC/USDT:USDT --start 2024-01-01 --end 2024-12-01
# Verify data is available
tradai data check --symbol BTC/USDT:USDT --timeframe 1h
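To judge whether a synced range is complete, the number of candles actually stored can be compared with the number the period should contain. This sketch computes the expected count for timeframe strings in the `1m`/`5m`/`1h` style, assuming a market that trades continuously. The helper names are illustrative.

```python
from datetime import datetime, timedelta

# Seconds per unit for timeframe strings such as 1m, 5m, 1h, 1d.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def timeframe_to_timedelta(timeframe: str) -> timedelta:
    """Convert a timeframe such as '5m' or '1h' to a timedelta."""
    count, unit = int(timeframe[:-1]), timeframe[-1]
    if count <= 0 or unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported timeframe: {timeframe}")
    return timedelta(seconds=count * _UNIT_SECONDS[unit])


def expected_candles(start: datetime, end: datetime, timeframe: str) -> int:
    """Number of full candles in the half-open interval [start, end)."""
    if end <= start:
        return 0
    return (end - start) // timeframe_to_timedelta(timeframe)
```

For the sync range used above (2024-01-01 to 2024-12-01) at `1h`, this gives 8040 candles. A stored count well below that suggests missing data.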
Data Issues
"Missing OHLCV data"
Symptom: Data queries return empty results.
Cause: Data not synced or ArcticDB issues.
Solution:
# Check symbol availability
tradai data symbols
# Check data range
tradai data info --symbol BTC/USDT:USDT
# Sync data
tradai data sync --symbol BTC/USDT:USDT --days 90
# Verify in ArcticDB
python << 'EOF'
from arcticdb import Arctic
ac = Arctic("lmdb://./user_data/arcticdb")
lib = ac["ohlcv"]
print(lib.list_symbols())
EOF
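A symbol can be listed and still have holes in its history. Given the candle timestamps for one symbol, the sketch below reports every place where consecutive candles are further apart than the timeframe. It works on plain `datetime` values. How the timestamps are obtained (for example from the index of the data read from the `ohlcv` library) is left as an assumption.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def find_gaps(timestamps: Sequence[datetime],
              step: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (last_seen, next_seen) pairs where consecutive candles are more than step apart."""
    ordered = sorted(timestamps)
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if curr - prev > step
    ]
```

Each returned pair marks the candles on either side of a gap, which gives a concrete range to pass to a sync.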
"Stale data"¶
Symptom: Data is older than expected.
Cause: Sync job failed or not running.
Solution:
# Check data collection service
curl http://localhost:8002/api/v1/health | jq
# Check last sync time
curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq
# Force sync
curl -X POST "http://localhost:8002/api/v1/sync" \
-H "Content-Type: application/json" \
-d '{"symbols": ["BTC/USDT:USDT"], "days": 1}'
# Check sync logs
docker compose logs data-collection | tail -100
MLflow Issues
"MLflow tracking server unavailable"
Symptom: Cannot log experiments or load models.
Cause: MLflow service down or network issues.
Solution:
# Check MLflow health
curl http://localhost:5000/health
# Check MLflow service (Docker)
docker compose logs mlflow
# Check MLflow service (ECS)
aws ecs describe-services \
--cluster tradai-${ENVIRONMENT} \
--services tradai-mlflow-${ENVIRONMENT}
# Verify tracking URI
echo $MLFLOW_TRACKING_URI
# Test connection
python << 'EOF'
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
print(mlflow.search_experiments())
EOF
"Model not found"
Symptom: Cannot load registered model.
Cause: Model not registered or wrong name.
Solution:
# List registered models (MLflow 2.x; older 1.x servers expose registered-models/list instead)
curl http://localhost:5000/api/2.0/mlflow/registered-models/search | jq
# Search by name
curl "http://localhost:5000/api/2.0/mlflow/registered-models/get?name=MyStrategy" | jq
# Check model versions
curl "http://localhost:5000/api/2.0/mlflow/registered-models/get-latest-versions?name=MyStrategy" | jq
Environment Variable Issues
"Environment variable not set"
Symptom: KeyError or missing config errors.
Cause: Variable missing from .env, the Docker Compose service, or the ECS task definition.
Solution:
# Check local .env
cat .env | grep -v '^#' | grep -v '^$'
# Check Docker environment
docker compose exec backend env | sort
# Check ECS task environment
aws ecs describe-task-definition \
--task-definition tradai-backend-${ENVIRONMENT} \
--query 'taskDefinition.containerDefinitions[0].environment'
# Required variables (example)
# DATABASE_URL
# MLFLOW_TRACKING_URI
# AWS_REGION
# S3_CONFIG_BUCKET
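A service can fail fast with a clear message instead of a late KeyError by checking its variables at startup. This is a minimal sketch using the example names listed above. The real set of required variables depends on the service, and `missing_env_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Example names from the list above; the real set depends on the service.
REQUIRED_VARS = ("DATABASE_URL", "MLFLOW_TRACKING_URI", "AWS_REGION", "S3_CONFIG_BUCKET")


def missing_env_vars(required: Iterable[str] = REQUIRED_VARS,
                     environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` with no arguments checks the current process environment. Raising an error that names every missing variable at once saves repeated restart cycles.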
Quick Fix Reference
| Issue | Quick Fix |
|---|---|
| Service not starting | docker compose down && docker compose up -d |
| Stale Docker state | docker compose down -v && docker compose up -d (also deletes named volumes) |
| Port conflict | lsof -i :8000 and kill the process |
| Permission denied | Check file ownership and Docker socket |
| Out of disk | docker system prune -a (removes all unused images) |
| Network issues | docker network prune |
| ECS task stuck | aws ecs update-service --cluster tradai-${ENVIRONMENT} --service <service> --force-new-deployment |
| Lambda timeout | Increase timeout in config |