TradAI Services Architecture¶
Version: 10.0.0 | Date: 2026-03-28 | Status: CURRENT
Source of truth: infra/shared/tradai_infra_shared/config.py (SERVICES dict, ECR_REPOS, LAMBDA_SCHEDULES)
TL;DR: 7 ECS service definitions (4 always-on, 1 on-demand backtest container, 2 trading containers) plus 18 Lambda functions (17 ECR-based containers + 1 inline handler). Dev/staging uses a consolidated EC2 instance (t3.small, ~$15/month) instead of per-service Fargate tasks. Backtest execution supports 4 modes via the Strategy pattern: Local, SQS, ECS, and Step Functions.
1. Service Overview¶
All service definitions are in config.py SERVICES dict. Ports, CPU, memory, and health check paths are consumed by both ECS Fargate task definitions and the consolidated EC2 mode.
| Service | Port | CPU | Memory | Health Check | Spot | Desired Count |
|---|---|---|---|---|---|---|
| backend-api | 8000 | 512 | 1024 | /api/v1/health | Yes (default) | 1 (dev/staging), 2 (prod) |
| data-collection | 8002 | 256 | 512 | /api/v1/health | Yes (default) | 1 |
| strategy-service | 8003 | 512 | 1024 | /api/v1/health | Yes (default) | 1 |
| mlflow | 5000 | 512 | 1024 | /mlflow/ | Yes (default) | 1 |
| strategy-container | None | 1024 | 2048 | None | Yes (default) | 0 (on-demand) |
| live-trading | 8004 | 1024 | 2048 | /api/v1/health | No (reliability) | 0 (per strategy) |
| dry-run-trading | 8005 | 1024 | 2048 | /api/v1/health | Yes | 0 (per strategy) |
Notes:
- strategy-container has no port -- it runs as an ECS task (backtest execution), not a long-running service.
- live-trading explicitly disables Spot (spot: False) for reliability. It also has start_period: 120 for warmup.
- dry-run-trading uses Spot for cost savings. Also has start_period: 120.
- Region: eu-central-1 (default, configurable via AWS_REGION env var or Pulumi config).
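The table above can be read as a sketch of the SERVICES dict. The real schema lives in infra/shared/tradai_infra_shared/config.py and may use different field names; the entries below are illustrative assumptions, not the actual definitions.

```python
# Hypothetical shape of two SERVICES entries -- field names are assumptions;
# see infra/shared/tradai_infra_shared/config.py for the real schema.
SERVICES = {
    "backend-api": {
        "port": 8000,
        "cpu": 512,
        "memory": 1024,
        "health_check_path": "/api/v1/health",
        "spot": True,          # default: FARGATE_SPOT with FARGATE fallback
        "desired_count": {"dev": 1, "staging": 1, "prod": 2},
    },
    "live-trading": {
        "port": 8004,
        "cpu": 1024,
        "memory": 2048,
        "health_check_path": "/api/v1/health",
        "spot": False,         # real-money reliability: on-demand Fargate only
        "start_period": 120,   # warmup for exchange connections
        "desired_count": {"dev": 0, "staging": 0, "prod": 0},
    },
}
```

Both ECS task definitions and the consolidated EC2 mode consume these same values, which is what keeps port and health-check configuration in one place.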
ECR Repository Mapping¶
Service names map to ECR repositories via ECR_SERVICE_MAPPING:
| Service Name | ECR Repository |
|---|---|
| backend-api | tradai/backend |
| data-collection | tradai/data-collection |
| strategy-service | tradai/strategy-service |
| mlflow | tradai/mlflow |
| live-trading | tradai/live-trading |
| dry-run-trading | tradai/dry-run-trading |
2. ECS Architecture¶
Source: infra/compute/modules/ecs_services.py (class EcsServices)
Cluster and Capacity Providers¶
- Single ECS cluster per environment: tradai-{env}
- Capacity providers: FARGATE + FARGATE_SPOT
- Services with spot: True (default) use FARGATE_SPOT with FARGATE fallback
- Services with spot: False (live-trading) use FARGATE only
Consolidated EC2 Mode¶
For dev/staging cost savings, services can run on a single EC2 instance instead of separate Fargate tasks. Controlled by CONSOLIDATED_MODE in config.py:
| Environment | Mode | Instance |
|---|---|---|
| dev | Consolidated EC2 | t3.small (2 vCPU, 2GB, ~$15/month) |
| staging | Consolidated EC2 | t3.small |
| prod | ECS Fargate | Per-service task definitions |
Consolidated services: backend-api, data-collection, mlflow, strategy-service. Custom AMI built with Packer (infra/ami/tradai-consolidated.pkr.hcl), configurable via pulumi config set custom_ami_id.
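The environment-to-mode mapping can be sketched as a small helper. CONSOLIDATED_MODE is a real flag in config.py, but its exact type and the helper below are assumptions for illustration.

```python
# Hypothetical mirror of the CONSOLIDATED_MODE switch in config.py;
# the real flag's shape and lookup may differ.
CONSOLIDATED_MODE = {"dev": True, "staging": True, "prod": False}

CONSOLIDATED_SERVICES = ("backend-api", "data-collection", "mlflow", "strategy-service")

def runs_on_consolidated_ec2(env: str, service: str) -> bool:
    """True when the service is served from the single t3.small instance
    rather than its own Fargate task."""
    return CONSOLIDATED_MODE.get(env, False) and service in CONSOLIDATED_SERVICES
```

Note that the trading containers and strategy-container are never consolidated -- only the four always-on services are.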
Service Discovery¶
- Private DNS namespace: tradai-{env}.local (e.g., tradai-dev.local)
- Created in _create_service_discovery_namespace() as aws.servicediscovery.PrivateDnsNamespace
- Each service registers a DNS A record with 60s TTL, MULTIVALUE routing
- Service addresses: {service-name}.tradai-{env}.local:{port}
- MLflow tracking URI: http://mlflow.tradai-{env}.local:5000/mlflow
- SD services are created for ALL services (including consolidated EC2) so DNS works regardless of mode
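The addressing convention above is mechanical enough to express as a one-liner (a sketch; the infra code builds these addresses inside Pulumi resources rather than via a standalone helper):

```python
def service_url(service: str, env: str, port: int, path: str = "") -> str:
    """Build a private-DNS service address per the
    {service-name}.tradai-{env}.local:{port} convention."""
    return f"http://{service}.tradai-{env}.local:{port}{path}"

# service_url("mlflow", "dev", 5000, "/mlflow")
#   -> "http://mlflow.tradai-dev.local:5000/mlflow"
```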
Task Definition Pattern¶
Each ECS Fargate service gets a task definition with:
- Container image from ECR: {account}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{service}
- awslogs log driver pointing to /ecs/tradai/{env}/services
- Log retention: 30 days (dev/staging), 90 days (prod)
- Environment variables built by EnvironmentBuilder + ServiceEnvironmentPresets
- Execution role (ECR pull, CloudWatch logs) and task role (S3, DynamoDB, SQS access)
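EnvironmentBuilder and ServiceEnvironmentPresets are real components, but their internals are not documented here. As a hedged sketch, the backend's inter-service environment might be composed like this (variable names taken from the Docker Compose section; the function itself is hypothetical):

```python
def build_backend_env(env: str) -> dict[str, str]:
    """Hypothetical sketch of the inter-service env vars injected into the
    backend task; the real EnvironmentBuilder likely derives these from the
    SERVICES dict rather than hard-coding ports."""
    domain = f"tradai-{env}.local"
    return {
        "BACKEND_STRATEGY_SERVICE_URL": f"http://strategy-service.{domain}:8003",
        "BACKEND_DATA_COLLECTION_URL": f"http://data-collection.{domain}:8002",
        "BACKEND_MLFLOW_TRACKING_URI": f"http://mlflow.{domain}:5000/mlflow",
    }
```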
Health Check Pattern¶
ECS health checks use Python httpx (not curl -- Alpine images don't include curl). The one exception is MLflow, which uses
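A hedged sketch of the container healthCheck block such a task definition might carry (the exact command string and thresholds in ecs_services.py may differ; the port is illustrative):

```python
# Hypothetical ECS container healthCheck definition using httpx.
# The real command and thresholds live in infra/compute/modules/ecs_services.py.
health_check = {
    "command": [
        "CMD-SHELL",
        "python -c \"import httpx; httpx.get('http://localhost:8000/api/v1/health').raise_for_status()\"",
    ],
    "interval": 30,    # seconds between checks
    "timeout": 10,
    "retries": 3,
    "startPeriod": 15,  # live/dry-run trading raise this to 120 for warmup
}
```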
urllib.request since it runs the third-party MLflow image.
3. Service Details¶
3.1 Backend API¶
- Purpose: API gateway and backtest orchestration. Accepts backtest requests, delegates execution to the configured executor (Local, SQS, ECS, or Step Functions).
- Source: services/backend/src/tradai/backend/ (3-layer: api/, core/, infrastructure/)
- Port: 8000
- Key dependencies: data-collection (port 8002), strategy-service (port 8003), MLflow (port 5000), DynamoDB (workflow-state table), SQS (backtest queue), S3 (configs + results buckets)
- Environment inter-service URLs (ECS): http://strategy-service.tradai-{env}.local:8003, http://data-collection.tradai-{env}.local:8002, http://mlflow.tradai-{env}.local:5000/mlflow
- Docs: services/backend/README.md, services/backend/BACKTEST_DESIGN.md
3.2 Data Collection¶
- Purpose: Market data fetching (OHLCV via CCXT/Binance), ArcticDB time-series storage, data freshness checks, metadata queries.
- Source: services/data-collection/src/tradai/data_collection/ (3-layer)
- Port: 8002
- Platform: linux/amd64 required (ArcticDB only has x86_64 wheels)
- Key dependencies: ArcticDB (S3-backed), LocalStack (dev), S3 (arcticdb bucket)
- Docs: services/data-collection/README.md
3.3 Strategy Service¶
- Purpose: Strategy execution and backtesting. Manages strategy configs, validates parameters, prepares Freqtrade configurations.
- Source: services/strategy-service/src/tradai/strategy_service/ (3-layer)
- Port: 8003
- Key dependencies: MLflow (tracking URI), S3 (config storage), user_data directory
- Docs: services/strategy-service/README.md
3.4 MLflow¶
- Purpose: ML experiment tracking and model registry. Third-party MLflow server with PostgreSQL backend store and S3 artifact storage.
- Source: services/mlflow/ (wraps MLflow Docker image, namespace tradai_mlflow)
- Port: 5000 (uses --static-prefix=/mlflow)
- Health check: /mlflow/ (root returns 200, no /health endpoint)
- Key dependencies: PostgreSQL (backend store), S3 (tradai-mlflow-{env} bucket for artifacts)
- Note: Does NOT follow the 3-layer pattern.
- Docs: services/mlflow/README.md
3.5 Live Trading¶
- Purpose: Production live trading with real exchange connections. Runs one instance per strategy.
- Port: 8004
- Spot: Disabled (spot: False) -- reliability required for real money operations
- Start period: 120s (warmup for exchange connections and model loading)
- Desired count: 0 (scaled up per strategy via deployment automation)
3.6 Dry-Run Trading¶
- Purpose: Paper trading for strategy validation before live deployment. Same architecture as live-trading but uses simulated order execution.
- Port: 8005
- Spot: Enabled (spot: True) -- cost savings acceptable for simulation
- Start period: 120s
- Desired count: 0 (scaled up per strategy)
3.7 Strategy Container¶
- Purpose: On-demand ECS task for backtest execution. Launched by backtest-consumer Lambda or directly by ECSBacktestJobSubmitter. No port, no health check.
- CPU/Memory: 1024/2048 (highest allocation for compute-heavy backtests)
- Desired count: 0 (purely on-demand)
4. Backtest Execution¶
Source: services/backend/src/tradai/backend/infrastructure/executor_factory.py
Protocol¶
The BacktestJobSubmitter protocol (defined in services/backend/src/tradai/backend/core/repositories.py) provides a single method:
The method returns an execution identifier (task ARN, SQS message ID, Step Functions execution ARN, or container ID).
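A minimal sketch of what that protocol could look like. The method name submit matches the sequence diagram below, but the exact signature in core/repositories.py is an assumption:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class BacktestJobSubmitter(Protocol):
    """Sketch of the protocol from backend/core/repositories.py;
    the real signature may carry a typed job model instead of a dict."""

    def submit(self, job: dict) -> str:
        """Submit a backtest job and return an execution identifier
        (task ARN, SQS message ID, SFN execution ARN, or container ID)."""
        ...
```

Because it is a Protocol, the four submitter implementations satisfy it structurally -- none of them need to inherit from it.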
Implementations¶
| Class | Module | Mode | Description |
|---|---|---|---|
| LocalBacktestJobSubmitter | backend.infrastructure.local | LOCAL | Runs via subprocess (freqtrade binary) or Docker SDK |
| SQSBacktestJobSubmitter | backend.infrastructure.sqs | SQS | Sends to SQS queue, consumed by backtest-consumer Lambda |
| ECSBacktestJobSubmitter | tradai.common.aws.ecs_executor | ECS | Direct ECS Fargate task launch |
| StepFunctionsExecutor | tradai.common.aws.step_functions | STEPFUNCTIONS | Starts Step Functions execution for orchestrated workflow |
ExecutorFactory (Strategy Pattern)¶
ExecutorFactory uses the Strategy pattern with four registered strategies:
| ExecutorMode | Strategy Class | Key Validation |
|---|---|---|
| ECS | ECSExecutorStrategy | Requires BACKEND_ECS_CLUSTER, BACKEND_ECS_SUBNETS, BACKEND_ECS_SECURITY_GROUPS |
| LOCAL | LocalExecutorStrategy | SUBPROCESS mode requires freqtrade binary in PATH; DOCKER mode requires Docker daemon |
| SQS | SQSExecutorStrategy | Requires BACKEND_BACKTEST_QUEUE_URL |
| STEPFUNCTIONS | StepFunctionsExecutorStrategy | Requires BACKEND_STEPFUNCTIONS_STATE_MACHINE_ARN |
The factory provides two creation methods:
- create() -- returns ExecutorCreationResult (success with executor or failure with error message)
- create_with_fallback() -- in dev mode (allow_none=True), returns None on failure instead of raising
Local execution sub-modes (LocalExecutionMode enum):
- SUBPROCESS -- calls freqtrade binary directly
- DOCKER -- uses Docker SDK to run strategy containers (default in docker-compose)
Executor Pattern (Strategy + Factory)
The ExecutorFactory uses the Strategy pattern to create the appropriate BacktestJobSubmitter implementation based on ExecutorMode. Each mode (LOCAL, SQS, ECS, STEPFUNCTIONS) has its own validation requirements. In dev, create_with_fallback() returns None on failure instead of raising, allowing graceful degradation.
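The validate-then-construct flow can be sketched as follows. The env var names come from the validation table above; everything else (function shapes, return types, error messages) is a simplified assumption, not the real ExecutorFactory code:

```python
from enum import Enum

class ExecutorMode(Enum):
    LOCAL = "local"
    SQS = "sqs"
    ECS = "ecs"
    STEPFUNCTIONS = "stepfunctions"

# Env vars each strategy validates before constructing an executor
# (from the table above; the real strategy classes check more, e.g.
# the freqtrade binary or a running Docker daemon for LOCAL).
REQUIRED_ENV = {
    ExecutorMode.ECS: ["BACKEND_ECS_CLUSTER", "BACKEND_ECS_SUBNETS",
                       "BACKEND_ECS_SECURITY_GROUPS"],
    ExecutorMode.SQS: ["BACKEND_BACKTEST_QUEUE_URL"],
    ExecutorMode.STEPFUNCTIONS: ["BACKEND_STEPFUNCTIONS_STATE_MACHINE_ARN"],
    ExecutorMode.LOCAL: [],
}

def create(mode: ExecutorMode, env: dict) -> tuple:
    """Sketch of create(): validate config, then build.
    Returns (executor, None) on success, (None, error_message) on failure."""
    missing = [key for key in REQUIRED_ENV[mode] if not env.get(key)]
    if missing:
        return None, f"{mode.value} executor missing config: {', '.join(missing)}"
    return object(), None  # stand-in for the real BacktestJobSubmitter instance

def create_with_fallback(mode: ExecutorMode, env: dict, allow_none: bool = True):
    """Dev-mode behavior: return None on failure instead of raising."""
    executor, error = create(mode, env)
    if executor is None and not allow_none:
        raise RuntimeError(error)
    return executor
```

The graceful-degradation behavior matters in dev: a missing queue URL then just disables SQS-mode backtests instead of crashing the backend at startup.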
Execution Flow¶
Mermaid Sequence Diagram¶
sequenceDiagram
participant Client
participant Backend as Backend API
participant Factory as ExecutorFactory
participant SQS as SQS FIFO Queue
participant Lambda as backtest-consumer Lambda
participant SFN as Step Functions
participant ECS as Strategy Container (ECS)
participant DDB as DynamoDB
participant MLflow as MLflow
Client->>Backend: POST /api/v1/backtest
Backend->>Factory: create(settings)
Factory-->>Backend: BacktestJobSubmitter
alt SQS Mode
Backend->>SQS: submit(job) -> message_id
Backend-->>Client: {ticket_id, status: "QUEUED"}
Note over Client,SQS: Response returned immediately after SQS send
SQS->>Lambda: trigger (async)
Lambda->>DDB: idempotency check
Lambda->>SFN: StartExecution
SFN->>ECS: RunTask (strategy container)
ECS->>MLflow: log metrics/results
ECS->>DDB: update status COMPLETED
SFN-->>Lambda: execution complete
else ECS Direct Mode
Backend->>ECS: RunTask API
Backend-->>Client: {ticket_id, status: "QUEUED"}
ECS->>MLflow: log metrics/results
else Local Mode
Backend->>Backend: subprocess / Docker SDK
Backend-->>Client: {ticket_id, status: "QUEUED"}
end
Client->>Backend: GET /api/v1/backtests/{job_id}
Backend->>DDB: query status
Backend-->>Client: {status, progress, result}

File: services/backend/src/tradai/backend/infrastructure/executor_factory.py
5. Lambda Functions¶
Source: infra/compute/modules/lambda_funcs.py (LAMBDA_CONFIGS dict)
Of the 18 Lambda functions, 17 use ECR-based container images (not ZIP); update-nat-routes alone uses an inline Python handler. The shared base image is built from lambdas/base/. Each Lambda's handler lives at lambdas/{name}/handler.py.
| Lambda | Memory | Timeout | VPC | Required | Scheduled | Description |
|---|---|---|---|---|---|---|
| backtest-consumer | 256 | 30s | Yes | Yes | No | Consumes SQS backtest queue, launches ECS tasks |
| sqs-consumer | 256 | 30s | Yes | No | No | Consumes retraining SQS queue, launches ECS tasks |
| orphan-scanner | 128 | 60s | Yes | Yes | rate(5 minutes) | Scans for orphaned ECS tasks |
| health-check | 256 | 60s | Yes | Yes | rate(2 minutes) | Periodic health checks via Service Discovery |
| trading-heartbeat-check | 256 | 60s | Yes | Yes | rate(5 minutes) | Checks trading container heartbeats in DynamoDB |
| drift-monitor | 512 | 120s | Yes | Yes | rate(12 hours) | Model drift detection using PSI |
| retraining-scheduler | 256 | 60s | Yes | Yes | rate(6 hours) | Schedules retraining based on drift/intervals |
| validate-strategy | 256 | 30s | Yes | No | No | Validates strategy config in Step Functions |
| data-collection-proxy | 256 | 180s | Yes | No | No | Proxies to data-collection for ensure-data ops |
| notify-completion | 128 | 30s | No | No | No | Sends SNS notifications on backtest completion |
| check-retraining-needed | 256 | 30s | Yes | No | No | Checks drift state to decide retraining |
| compare-models | 512 | 120s | Yes | No | No | Compares champion vs challenger models |
| promote-model | 256 | 60s | Yes | No | No | Promotes challenger to Production in MLflow |
| model-rollback | 256 | 60s | Yes | No | No | Rolls back model on performance degradation |
| cleanup-resources | 256 | 60s | Yes | No | No | Stops orphaned ECS tasks on SFN failure |
| update-status | 256 | 30s | Yes | No | No | Updates job status in DynamoDB workflow-state |
| pulumi-drift-detector | 512 | 300s | No | Yes | rate(6 hours) | Detects infrastructure drift via Pulumi preview |
| update-nat-routes | 128 | 120s | No | Yes | ASG Lifecycle Hook | Updates private route table when NAT instance is replaced |
ECR Repositories for Lambdas¶
Each Lambda has a dedicated ECR repository following the naming pattern tradai/lambda-{name}. Full list in config.py LAMBDA_ECR_REPOS. A shared base image is at tradai/lambda-base.
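The naming convention is simple enough to state in code (a sketch; LAMBDA_ECR_REPOS in config.py is the authoritative list):

```python
def lambda_ecr_repo(name: str) -> str:
    """tradai/lambda-{name} naming convention for per-Lambda ECR repos."""
    return f"tradai/lambda-{name}"

# Deriving repo names for a subset of the 18 Lambdas:
repos = {name: lambda_ecr_repo(name)
         for name in ("backtest-consumer", "orphan-scanner", "health-check")}
```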
Schedule Configuration¶
Schedules are in config.py LAMBDA_SCHEDULES dict. Each can be overridden via Pulumi config (e.g., pulumi config set orphan_scanner_schedule "rate(10 minutes)").
| Lambda | Default Schedule | Config Override Key |
|---|---|---|
| orphan-scanner | rate(5 minutes) | orphan_scanner_schedule |
| health-check | rate(2 minutes) | health_check_schedule |
| trading-heartbeat-check | rate(5 minutes) | trading_heartbeat_schedule |
| drift-monitor | rate(12 hours) | drift_monitor_schedule |
| retraining-scheduler | rate(6 hours) | retraining_scheduler_schedule |
| pulumi-drift-detector | rate(6 hours) | pulumi_drift_schedule |
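The default-plus-override lookup can be sketched as below. The real code reads pulumi.Config rather than a plain dict; the function and the dict-of-tuples shape are assumptions for illustration:

```python
# Defaults and override keys from the table above; the real
# LAMBDA_SCHEDULES shape in config.py may differ.
LAMBDA_SCHEDULES = {
    "orphan-scanner": ("rate(5 minutes)", "orphan_scanner_schedule"),
    "health-check": ("rate(2 minutes)", "health_check_schedule"),
    "drift-monitor": ("rate(12 hours)", "drift_monitor_schedule"),
}

def effective_schedule(lambda_name: str, pulumi_config: dict) -> str:
    """Return the Pulumi-config override if set, else the default schedule."""
    default, override_key = LAMBDA_SCHEDULES[lambda_name]
    return pulumi_config.get(override_key, default)
```

This mirrors what `pulumi config set orphan_scanner_schedule "rate(10 minutes)"` achieves: the override key wins, everything else keeps its default.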
6. Docker Compose (Development)¶
Source: docker-compose.yaml + docker-compose.override.yaml
Docker Compose Profiles
Profile services (MLflow, LocalStack, Redis) are not started by default. Use docker compose --profile mlflow --profile localstack up or the shortcut just up-full to start all profiles. The override file mounts the Docker socket into the backend container for local Docker-mode backtest execution.
Base Services (docker-compose.yaml)¶
| Service Name | Image | Container Name | Host Port | Container Port |
|---|---|---|---|---|
| backend | tradai/backend:${TAG} | tradai-backend | ${HOST_BACKEND_PORT:-8000} | 8000 |
| data-collection | tradai/data-collection:${TAG} | tradai-data-collection | ${HOST_DATA_COLLECTION_PORT:-8002} | 8002 |
| strategy-service | tradai/strategy-service:${TAG} | tradai-strategy-service | ${HOST_STRATEGY_SERVICE_PORT:-8003} | 8003 |
| postgres | postgres:15-alpine | tradai-postgres | (none in base) | 5432 |
Dependencies:
- backend depends on data-collection + strategy-service (condition: service_healthy)
- strategy-service depends on postgres (implicit via MLflow)
Inter-service URLs (Docker Compose uses container names, not Service Discovery):
- BACKEND_DATA_COLLECTION_URL: http://data-collection:8002
- BACKEND_STRATEGY_SERVICE_URL: http://strategy-service:8003
- BACKEND_MLFLOW_TRACKING_URI: http://mlflow:5000
Profile Services (docker-compose.override.yaml)¶
| Service | Profile | Image | Host Port | Container Port |
|---|---|---|---|---|
| mlflow | mlflow | tradai/mlflow:${TAG} | ${MLFLOW_PORT:-5001} | 5000 |
| localstack | localstack | localstack/localstack:3.8 | ${LOCALSTACK_PORT:-4566} | 4566 |
| redis | redis | redis:7-alpine | ${REDIS_PORT:-6379} | 6379 |
Redis is provisioned as an optional caching layer for future use as a distributed pub/sub backend (e.g., cross-worker WebSocket broadcast, hyperopt job state) -- not currently consumed by any service in production.
Start profile services with docker compose --profile mlflow --profile localstack up, or use just up-full (all profiles).
Development Overrides¶
The override file adds:
- backend: runs as root (user 0:0) for Docker socket access, mounts Docker socket + AWS creds + user_data + strategies, sets BACKEND_LOCAL_EXECUTION_MODE=docker
- data-collection: adds LocalStack endpoint config for ArcticDB S3 backend
- strategy-service: adds LocalStack endpoint, mounts user_data + strategies + AWS creds
- postgres: exposes port ${POSTGRES_PORT:-5433} on host (5433 to avoid conflicts)
Networks¶
| Network | Driver | Purpose |
|---|---|---|
| tradai-internal | bridge | Service-to-service communication |
| tradai-external | bridge | Services needing external access |
All application services join both networks. Infrastructure services (postgres) join only tradai-internal.
Health Checks¶
Common health check template (YAML anchor &app-healthcheck):
- Interval: 30s, timeout: 10s, retries: 3, start period: 15s
- Application services: python -c "import httpx; httpx.get(...)"
- Postgres: pg_isready
- LocalStack: curl -f http://localhost:4566/_localstack/health
- Redis: redis-cli ping
- MLflow: python -c "import urllib.request; urllib.request.urlopen(...)"
7. Orphan Detection¶
Source: lambdas/orphan-scanner/handler.py
The orphan-scanner Lambda runs on a rate(5 minutes) EventBridge schedule and identifies ECS tasks that:
1. Have been running longer than max_task_runtime_hours (default: 6 hours)
2. Have no corresponding DynamoDB state record
3. Failed to terminate properly
Configuration¶
| Setting | Default | Description |
|---|---|---|
| ecs_cluster | (from env) | ECS cluster to scan |
| max_task_runtime_hours | 6 | Threshold for long-running tasks |
| dry_run | true | Safety default -- log but don't stop tasks |
DynamoDB Tables Scanned¶
- tradai-workflow-state-{env} (backtest jobs)
- tradai-retraining-state-{env} (retraining tasks)
- tradai-trading-state-{env} (trading containers)
Behavior¶
- Defaults to dry-run mode for safety (log orphans without stopping them)
- Can be overridden per invocation via event payload: {"dry_run": false}
- Publishes CloudWatch metrics under the TradAI/OrphanScanning namespace
- Sends SNS alerts when orphans are detected
Handler: lambdas/orphan-scanner/handler.py
Key Configuration Files¶
| File | Purpose |
|---|---|
| infra/shared/tradai_infra_shared/config.py | All service, S3, DynamoDB, ECR, Lambda config |
| infra/compute/modules/ecs_services.py | ECS task definitions, Service Discovery, services |
| infra/compute/modules/lambda_funcs.py | Lambda function definitions (LAMBDA_CONFIGS) |
| docker-compose.yaml | Production-base Docker Compose |
| docker-compose.override.yaml | Development overrides + profile services |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor Strategy/Factory pattern |
| services/backend/src/tradai/backend/core/repositories.py | Protocol definitions (BacktestJobSubmitter, WorkflowStateRepository, ECSOperations) |
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 10.0.0 | 2026-03-28 | Regenerated from code. All 7 services, 18 Lambdas, Docker Compose verified |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
| infra/shared/tradai_infra_shared/config.py SERVICES dict | Service table (Section 1) |
| infra/compute/modules/lambda_funcs.py LAMBDA_CONFIGS | Lambda table (Section 5) |
| docker-compose.yaml or docker-compose.override.yaml | Docker Compose section (Section 6) |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor pattern (Section 4) |