
TradAI Services Architecture

Version: 10.0.0 | Date: 2026-03-28 | Status: CURRENT

Source of truth: infra/shared/tradai_infra_shared/config.py (SERVICES dict, ECR_REPOS, LAMBDA_SCHEDULES)

TL;DR: 7 ECS service definitions (4 always-on, 1 on-demand backtest container, 2 trading containers) plus 18 Lambda functions (17 ECR-based containers + 1 inline handler). Dev/staging uses a consolidated EC2 instance (t3.small, ~$15/month) instead of per-service Fargate tasks. Backtest execution supports 4 modes via the Strategy pattern: Local, SQS, ECS, and Step Functions.


1. Service Overview

All service definitions are in config.py SERVICES dict. Ports, CPU, memory, and health check paths are consumed by both ECS Fargate task definitions and the consolidated EC2 mode.

| Service | Port | CPU | Memory (MB) | Health Check | Spot | Desired Count |
|---|---|---|---|---|---|---|
| backend-api | 8000 | 512 | 1024 | /api/v1/health | Yes (default) | 1 (dev/staging), 2 (prod) |
| data-collection | 8002 | 256 | 512 | /api/v1/health | Yes (default) | 1 |
| strategy-service | 8003 | 512 | 1024 | /api/v1/health | Yes (default) | 1 |
| mlflow | 5000 | 512 | 1024 | /mlflow/ | Yes (default) | 1 |
| strategy-container | None | 1024 | 2048 | None | Yes (default) | 0 (on-demand) |
| live-trading | 8004 | 1024 | 2048 | /api/v1/health | No (reliability) | 0 (per strategy) |
| dry-run-trading | 8005 | 1024 | 2048 | /api/v1/health | Yes | 0 (per strategy) |

Notes:

  • strategy-container has no port -- it runs as an ECS task (backtest execution), not a long-running service.
  • live-trading explicitly disables Spot (spot: False) for reliability. It also has start_period: 120 for warmup.
  • dry-run-trading uses Spot for cost savings. It also has start_period: 120.
  • Region: eu-central-1 (default, configurable via AWS_REGION env var or Pulumi config).
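The table above can be pictured as entries in the SERVICES dict. The sketch below is illustrative only -- field names and nesting are assumptions, not copied from config.py; consult infra/shared/tradai_infra_shared/config.py for the real shape:

```python
# Hypothetical shape of two SERVICES entries; field names are illustrative.
SERVICES = {
    "backend-api": {
        "port": 8000, "cpu": 512, "memory": 1024,
        "health_check_path": "/api/v1/health",
        "spot": True,  # default: eligible for FARGATE_SPOT
        "desired_count": {"dev": 1, "staging": 1, "prod": 2},
    },
    "live-trading": {
        "port": 8004, "cpu": 1024, "memory": 2048,
        "health_check_path": "/api/v1/health",
        "spot": False,        # real-money reliability: on-demand Fargate only
        "start_period": 120,  # warmup seconds for exchange connections
        "desired_count": {"dev": 0, "staging": 0, "prod": 0},  # scaled per strategy
    },
}
```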

ECR Repository Mapping

Service names map to ECR repositories via ECR_SERVICE_MAPPING:

| Service Name | ECR Repository |
|---|---|
| backend-api | tradai/backend |
| data-collection | tradai/data-collection |
| strategy-service | tradai/strategy-service |
| mlflow | tradai/mlflow |
| live-trading | tradai/live-trading |
| dry-run-trading | tradai/dry-run-trading |
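Combining this mapping with the ECR URI pattern used by the task definitions gives the full image reference. A minimal sketch (the image_uri helper is hypothetical; the mapping mirrors the table above):

```python
# Mirrors ECR_SERVICE_MAPPING from config.py (values copied from the table above).
ECR_SERVICE_MAPPING = {
    "backend-api": "tradai/backend",
    "data-collection": "tradai/data-collection",
    "strategy-service": "tradai/strategy-service",
    "mlflow": "tradai/mlflow",
    "live-trading": "tradai/live-trading",
    "dry-run-trading": "tradai/dry-run-trading",
}

def image_uri(service: str, account: str, tag: str, region: str = "eu-central-1") -> str:
    """Build the full ECR image URI for a service (helper name is illustrative)."""
    repo = ECR_SERVICE_MAPPING[service]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
```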

2. ECS Architecture

Source: infra/compute/modules/ecs_services.py (class EcsServices)

Cluster and Capacity Providers

  • Single ECS cluster per environment: tradai-{env}
  • Capacity providers: FARGATE + FARGATE_SPOT
  • Services with spot: True (default) use FARGATE_SPOT with FARGATE fallback
  • Services with spot: False (live-trading) use FARGATE only
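The spot flag maps to an ECS capacity provider strategy. A sketch of that choice, assuming illustrative weights (the real values live in ecs_services.py):

```python
def capacity_provider_strategy(spot: bool) -> list[dict]:
    """Pick capacity providers for a service; weights here are assumptions,
    not the exact values used by EcsServices."""
    if not spot:  # e.g. live-trading: on-demand Fargate only
        return [{"capacityProvider": "FARGATE", "weight": 1}]
    # Spot-eligible services: prefer FARGATE_SPOT, keep FARGATE as fallback
    return [
        {"capacityProvider": "FARGATE_SPOT", "weight": 2},
        {"capacityProvider": "FARGATE", "weight": 1},
    ]
```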

Consolidated EC2 Mode

For dev/staging cost savings, services can run on a single EC2 instance instead of separate Fargate tasks. Controlled by CONSOLIDATED_MODE in config.py:

| Environment | Mode | Instance |
|---|---|---|
| dev | Consolidated EC2 | t3.small (2 vCPU, 2GB, ~$15/month) |
| staging | Consolidated EC2 | t3.small |
| prod | ECS Fargate | Per-service task definitions |

Consolidated services: backend-api, data-collection, mlflow, strategy-service. The custom AMI is built with Packer (infra/ami/tradai-consolidated.pkr.hcl) and can be overridden via pulumi config set custom_ami_id.

Service Discovery

  • Private DNS namespace: tradai-{env}.local (e.g., tradai-dev.local)
  • Created in _create_service_discovery_namespace() as aws.servicediscovery.PrivateDnsNamespace
  • Each service registers a DNS A record with 60s TTL, MULTIVALUE routing
  • Service addresses: {service-name}.tradai-{env}.local:{port}
  • MLflow tracking URI: http://mlflow.tradai-{env}.local:5000/mlflow
  • SD services are created for ALL services (including consolidated EC2) so DNS works regardless of mode
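The addressing scheme above is mechanical enough to sketch as a one-line helper (the function name is illustrative, not from the codebase):

```python
def service_url(service: str, env: str, port: int, path: str = "") -> str:
    """Compose a Service Discovery address per the {service}.tradai-{env}.local:{port} scheme."""
    return f"http://{service}.tradai-{env}.local:{port}{path}"
```

For example, service_url("mlflow", "dev", 5000, "/mlflow") yields the MLflow tracking URI shown above.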

Task Definition Pattern

Each ECS Fargate service gets a task definition with:

  • Container image from ECR: {account}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{service}
  • awslogs log driver pointing to /ecs/tradai/{env}/services
  • Log retention: 30 days (dev/staging), 90 days (prod)
  • Environment variables built by EnvironmentBuilder + ServiceEnvironmentPresets
  • Execution role (ECR pull, CloudWatch logs) and task role (S3, DynamoDB, SQS access)

Health Check Pattern

ECS health checks use Python httpx (not curl -- Alpine images don't include curl):

    python -c "import httpx; httpx.get('http://localhost:{port}/api/v1/health').raise_for_status()"

MLflow uses urllib.request instead, since it runs the third-party MLflow image.
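For context, the probe above is embedded in the task definition's container healthCheck block. A sketch of how that block might be assembled -- the interval/timeout/retries values here are assumptions, not copied from the real task definitions:

```python
def ecs_health_check(port: int, path: str = "/api/v1/health") -> dict:
    """Build a container healthCheck block around the httpx probe.
    Timing values are illustrative assumptions."""
    probe = f"import httpx; httpx.get('http://localhost:{port}{path}').raise_for_status()"
    return {
        "command": ["CMD-SHELL", f'python -c "{probe}"'],
        "interval": 30,
        "timeout": 10,
        "retries": 3,
    }
```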


3. Service Details

3.1 Backend API

  • Purpose: API gateway and backtest orchestration. Accepts backtest requests, delegates execution to the configured executor (Local, SQS, ECS, or Step Functions).
  • Source: services/backend/src/tradai/backend/ (3-layer: api/, core/, infrastructure/)
  • Port: 8000
  • Key dependencies: data-collection (port 8002), strategy-service (port 8003), MLflow (port 5000), DynamoDB (workflow-state table), SQS (backtest queue), S3 (configs + results buckets)
  • Environment inter-service URLs (ECS):
  • http://strategy-service.tradai-{env}.local:8003
  • http://data-collection.tradai-{env}.local:8002
  • http://mlflow.tradai-{env}.local:5000/mlflow
  • Docs: services/backend/README.md, services/backend/BACKTEST_DESIGN.md

3.2 Data Collection

  • Purpose: Market data fetching (OHLCV via CCXT/Binance), ArcticDB time-series storage, data freshness checks, metadata queries.
  • Source: services/data-collection/src/tradai/data_collection/ (3-layer)
  • Port: 8002
  • Platform: linux/amd64 required (ArcticDB only has x86_64 wheels)
  • Key dependencies: ArcticDB (S3-backed), LocalStack (dev), S3 (arcticdb bucket)
  • Docs: services/data-collection/README.md

3.3 Strategy Service

  • Purpose: Strategy execution and backtesting. Manages strategy configs, validates parameters, prepares Freqtrade configurations.
  • Source: services/strategy-service/src/tradai/strategy_service/ (3-layer)
  • Port: 8003
  • Key dependencies: MLflow (tracking URI), S3 (config storage), user_data directory
  • Docs: services/strategy-service/README.md

3.4 MLflow

  • Purpose: ML experiment tracking and model registry. Third-party MLflow server with PostgreSQL backend store and S3 artifact storage.
  • Source: services/mlflow/ (wraps MLflow Docker image, namespace tradai_mlflow)
  • Port: 5000 (uses --static-prefix=/mlflow)
  • Health check: /mlflow/ (root returns 200, no /health endpoint)
  • Key dependencies: PostgreSQL (backend store), S3 (tradai-mlflow-{env} bucket for artifacts)
  • Note: Does NOT follow the 3-layer pattern.
  • Docs: services/mlflow/README.md

3.5 Live Trading

  • Purpose: Production live trading with real exchange connections. Runs one instance per strategy.
  • Port: 8004
  • Spot: Disabled (spot: False) -- reliability required for real money operations
  • Start period: 120s (warmup for exchange connections and model loading)
  • Desired count: 0 (scaled up per strategy via deployment automation)

3.6 Dry-Run Trading

  • Purpose: Paper trading for strategy validation before live deployment. Same architecture as live-trading but uses simulated order execution.
  • Port: 8005
  • Spot: Enabled (spot: True) -- cost savings acceptable for simulation
  • Start period: 120s
  • Desired count: 0 (scaled up per strategy)

3.7 Strategy Container

  • Purpose: On-demand ECS task for backtest execution. Launched by backtest-consumer Lambda or directly by ECSBacktestJobSubmitter. No port, no health check.
  • CPU/Memory: 1024/2048 (highest allocation for compute-heavy backtests)
  • Desired count: 0 (purely on-demand)

4. Backtest Execution

Source: services/backend/src/tradai/backend/infrastructure/executor_factory.py

Protocol

The BacktestJobSubmitter protocol (defined in services/backend/src/tradai/backend/core/repositories.py) provides a single method:

class BacktestJobSubmitter(Protocol):
    def submit(self, job: BacktestJobMessage) -> str: ...

Returns an execution identifier (task ARN, message ID, SFN execution ARN, or container ID).
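Because BacktestJobSubmitter is a structural Protocol, any class with a matching submit method satisfies it -- handy for tests. A sketch with a hypothetical in-memory submitter (BacktestJobMessage fields here are illustrative, not the real payload):

```python
import uuid
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class BacktestJobMessage:  # illustrative payload; real fields live in core/
    strategy: str
    timerange: str


class BacktestJobSubmitter(Protocol):
    def submit(self, job: BacktestJobMessage) -> str: ...


@dataclass
class InMemorySubmitter:
    """Hypothetical test double; satisfies the protocol structurally."""
    submitted: list = field(default_factory=list)

    def submit(self, job: BacktestJobMessage) -> str:
        self.submitted.append(job)
        return str(uuid.uuid4())  # stands in for a task ARN / message ID


def run(submitter: BacktestJobSubmitter, job: BacktestJobMessage) -> str:
    return submitter.submit(job)
```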

Implementations

| Class | Module | Mode | Description |
|---|---|---|---|
| LocalBacktestJobSubmitter | backend.infrastructure.local | LOCAL | Runs via subprocess (freqtrade binary) or Docker SDK |
| SQSBacktestJobSubmitter | backend.infrastructure.sqs | SQS | Sends to SQS queue, consumed by backtest-consumer Lambda |
| ECSBacktestJobSubmitter | tradai.common.aws.ecs_executor | ECS | Direct ECS Fargate task launch |
| StepFunctionsExecutor | tradai.common.aws.step_functions | STEPFUNCTIONS | Starts Step Functions execution for orchestrated workflow |

ExecutorFactory (Strategy Pattern)

ExecutorFactory uses the Strategy pattern with four registered strategies:

| ExecutorMode | Strategy Class | Key Validation |
|---|---|---|
| ECS | ECSExecutorStrategy | Requires BACKEND_ECS_CLUSTER, BACKEND_ECS_SUBNETS, BACKEND_ECS_SECURITY_GROUPS |
| LOCAL | LocalExecutorStrategy | SUBPROCESS mode requires freqtrade binary in PATH; DOCKER mode requires Docker daemon |
| SQS | SQSExecutorStrategy | Requires BACKEND_BACKTEST_QUEUE_URL |
| STEPFUNCTIONS | StepFunctionsExecutorStrategy | Requires BACKEND_STEPFUNCTIONS_STATE_MACHINE_ARN |

The factory provides two creation methods:

  • create() -- returns ExecutorCreationResult (success with executor or failure with error message)
  • create_with_fallback() -- in dev mode (allow_none=True), returns None on failure instead of raising

Local execution sub-modes (LocalExecutionMode enum):

  • SUBPROCESS -- calls freqtrade binary directly
  • DOCKER -- uses Docker SDK to run strategy containers (default in docker-compose)
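The pieces above can be sketched as a compact Strategy + Factory. This is a shape sketch only -- the env-var check follows the "Key Validation" table, but internals (single-string errors, an env dict instead of settings, only the SQS strategy shown) are simplifications of executor_factory.py:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutorCreationResult:
    executor: Optional[object] = None
    error: Optional[str] = None

    @property
    def success(self) -> bool:
        return self.executor is not None


class SQSExecutorStrategy:
    """Validates SQS prerequisites before building a submitter."""

    def create(self, env: dict) -> ExecutorCreationResult:
        if "BACKEND_BACKTEST_QUEUE_URL" not in env:
            return ExecutorCreationResult(error="BACKEND_BACKTEST_QUEUE_URL not set")
        return ExecutorCreationResult(executor=object())  # stand-in for SQSBacktestJobSubmitter


_STRATEGIES = {"SQS": SQSExecutorStrategy()}  # the real factory registers all four modes


def create(mode: str, env: dict) -> ExecutorCreationResult:
    return _STRATEGIES[mode].create(env)


def create_with_fallback(mode: str, env: dict, allow_none: bool = False):
    result = create(mode, env)
    if result.success:
        return result.executor
    if allow_none:  # dev mode: degrade gracefully instead of raising
        return None
    raise RuntimeError(result.error)
```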


Execution Flow

Mermaid Sequence Diagram

sequenceDiagram
    participant Client
    participant Backend as Backend API
    participant Factory as ExecutorFactory
    participant SQS as SQS FIFO Queue
    participant Lambda as backtest-consumer Lambda
    participant SFN as Step Functions
    participant ECS as Strategy Container (ECS)
    participant DDB as DynamoDB
    participant MLflow as MLflow

    Client->>Backend: POST /api/v1/backtest
    Backend->>Factory: create(settings)
    Factory-->>Backend: BacktestJobSubmitter

    alt SQS Mode
        Backend->>SQS: submit(job) -> message_id
        Backend-->>Client: {ticket_id, status: "QUEUED"}
        Note over Client,SQS: Response returned immediately after SQS send
        SQS->>Lambda: trigger (async)
        Lambda->>DDB: idempotency check
        Lambda->>SFN: StartExecution
        SFN->>ECS: RunTask (strategy container)
        ECS->>MLflow: log metrics/results
        ECS->>DDB: update status COMPLETED
        SFN-->>Lambda: execution complete
    else ECS Direct Mode
        Backend->>ECS: RunTask API
        Backend-->>Client: {ticket_id, status: "QUEUED"}
        ECS->>MLflow: log metrics/results
    else Local Mode
        Backend->>Backend: subprocess / Docker SDK
        Backend-->>Client: {ticket_id, status: "QUEUED"}
    end

    Client->>Backend: GET /api/v1/backtests/{job_id}
    Backend->>DDB: query status
    Backend-->>Client: {status, progress, result}

File: services/backend/src/tradai/backend/infrastructure/executor_factory.py


5. Lambda Functions

Source: infra/compute/modules/lambda_funcs.py (LAMBDA_CONFIGS dict)

Of the 18 Lambda functions, 17 use ECR-based container images (not ZIP); update-nat-routes uses an inline Python handler. The base image is built from lambdas/base/, and each Lambda's handler lives at lambdas/{name}/handler.py.

| Lambda | Memory (MB) | Timeout | VPC | Required | Scheduled | Description |
|---|---|---|---|---|---|---|
| backtest-consumer | 256 | 30s | Yes | Yes | No | Consumes SQS backtest queue, launches ECS tasks |
| sqs-consumer | 256 | 30s | Yes | No | No | Consumes retraining SQS queue, launches ECS tasks |
| orphan-scanner | 128 | 60s | Yes | Yes | rate(5 minutes) | Scans for orphaned ECS tasks |
| health-check | 256 | 60s | Yes | Yes | rate(2 minutes) | Periodic health checks via Service Discovery |
| trading-heartbeat-check | 256 | 60s | Yes | Yes | rate(5 minutes) | Checks trading container heartbeats in DynamoDB |
| drift-monitor | 512 | 120s | Yes | Yes | rate(12 hours) | Model drift detection using PSI |
| retraining-scheduler | 256 | 60s | Yes | Yes | rate(6 hours) | Schedules retraining based on drift/intervals |
| validate-strategy | 256 | 30s | Yes | No | No | Validates strategy config in Step Functions |
| data-collection-proxy | 256 | 180s | Yes | No | No | Proxies to data-collection for ensure-data ops |
| notify-completion | 128 | 30s | No | No | No | Sends SNS notifications on backtest completion |
| check-retraining-needed | 256 | 30s | Yes | No | No | Checks drift state to decide retraining |
| compare-models | 512 | 120s | Yes | No | No | Compares champion vs challenger models |
| promote-model | 256 | 60s | Yes | No | No | Promotes challenger to Production in MLflow |
| model-rollback | 256 | 60s | Yes | No | No | Rolls back model on performance degradation |
| cleanup-resources | 256 | 60s | Yes | No | No | Stops orphaned ECS tasks on SFN failure |
| update-status | 256 | 30s | Yes | No | No | Updates job status in DynamoDB workflow-state |
| pulumi-drift-detector | 512 | 300s | No | Yes | rate(6 hours) | Detects infrastructure drift via Pulumi preview |
| update-nat-routes | 128 | 120s | No | Yes | ASG Lifecycle Hook | Updates private route table when NAT instance is replaced |

ECR Repositories for Lambdas

Each Lambda has a dedicated ECR repository following the naming pattern tradai/lambda-{name}. Full list in config.py LAMBDA_ECR_REPOS. A shared base image is at tradai/lambda-base.

Schedule Configuration

Schedules are in config.py LAMBDA_SCHEDULES dict. Each can be overridden via Pulumi config (e.g., pulumi config set orphan_scanner_schedule "rate(10 minutes)").

| Lambda | Default Schedule | Config Override Key |
|---|---|---|
| orphan-scanner | rate(5 minutes) | orphan_scanner_schedule |
| health-check | rate(2 minutes) | health_check_schedule |
| trading-heartbeat-check | rate(5 minutes) | trading_heartbeat_schedule |
| drift-monitor | rate(12 hours) | drift_monitor_schedule |
| retraining-scheduler | rate(6 hours) | retraining_scheduler_schedule |
| pulumi-drift-detector | rate(6 hours) | pulumi_drift_schedule |
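Note that the override keys do not follow a uniform derivation from the Lambda name (e.g., trading_heartbeat_schedule, pulumi_drift_schedule), so resolution needs an explicit mapping. A sketch, with a plain dict standing in for pulumi.Config lookups:

```python
# Mirrors the table above: lambda name -> (default schedule, override key).
LAMBDA_SCHEDULES = {
    "orphan-scanner": ("rate(5 minutes)", "orphan_scanner_schedule"),
    "health-check": ("rate(2 minutes)", "health_check_schedule"),
    "trading-heartbeat-check": ("rate(5 minutes)", "trading_heartbeat_schedule"),
    "drift-monitor": ("rate(12 hours)", "drift_monitor_schedule"),
    "retraining-scheduler": ("rate(6 hours)", "retraining_scheduler_schedule"),
    "pulumi-drift-detector": ("rate(6 hours)", "pulumi_drift_schedule"),
}

def resolve_schedule(name: str, pulumi_config: dict) -> str:
    """Return the configured schedule, falling back to the default.
    pulumi_config is a plain dict here; the real code reads Pulumi config."""
    default, override_key = LAMBDA_SCHEDULES[name]
    return pulumi_config.get(override_key, default)
```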

6. Docker Compose (Development)

Source: docker-compose.yaml + docker-compose.override.yaml

Docker Compose Profiles

Profile services (MLflow, LocalStack, Redis) are not started by default. Use docker compose --profile mlflow --profile localstack up or the shortcut just up-full to start all profiles. The override file mounts the Docker socket into the backend container for local Docker-mode backtest execution.

Base Services (docker-compose.yaml)

| Service Name | Image | Container Name | Host Port | Container Port |
|---|---|---|---|---|
| backend | tradai/backend:${TAG} | tradai-backend | ${HOST_BACKEND_PORT:-8000} | 8000 |
| data-collection | tradai/data-collection:${TAG} | tradai-data-collection | ${HOST_DATA_COLLECTION_PORT:-8002} | 8002 |
| strategy-service | tradai/strategy-service:${TAG} | tradai-strategy-service | ${HOST_STRATEGY_SERVICE_PORT:-8003} | 8003 |
| postgres | postgres:15-alpine | tradai-postgres | (none in base) | 5432 |

Dependencies:

  • backend depends on data-collection + strategy-service (condition: service_healthy)
  • strategy-service depends on postgres (implicit via MLflow)

Inter-service URLs (Docker Compose uses container names, not Service Discovery):

  • BACKEND_DATA_COLLECTION_URL: http://data-collection:8002
  • BACKEND_STRATEGY_SERVICE_URL: http://strategy-service:8003
  • BACKEND_MLFLOW_TRACKING_URI: http://mlflow:5000

Profile Services (docker-compose.override.yaml)

| Service | Profile | Image | Host Port | Container Port |
|---|---|---|---|---|
| mlflow | mlflow | tradai/mlflow:${TAG} | ${MLFLOW_PORT:-5001} | 5000 |
| localstack | localstack | localstack/localstack:3.8 | ${LOCALSTACK_PORT:-4566} | 4566 |
| redis | redis | redis:7-alpine | ${REDIS_PORT:-6379} | 6379 |

Redis is provisioned for optional future use as a caching layer or distributed pub/sub backend (e.g., cross-worker WebSocket broadcast, hyperopt job state); it is not currently consumed by any service in production.


Development Overrides

The override file adds:

  • backend: runs as root (user 0:0) for Docker socket access, mounts Docker socket + AWS creds + user_data + strategies, sets BACKEND_LOCAL_EXECUTION_MODE=docker
  • data-collection: adds LocalStack endpoint config for ArcticDB S3 backend
  • strategy-service: adds LocalStack endpoint, mounts user_data + strategies + AWS creds
  • postgres: exposes port ${POSTGRES_PORT:-5433} on host (5433 to avoid conflicts)

Networks

| Network | Driver | Purpose |
|---|---|---|
| tradai-internal | bridge | Service-to-service communication |
| tradai-external | bridge | Services needing external access |

All application services join both networks. Infrastructure services (postgres) join only tradai-internal.

Health Checks

Common health check template (YAML anchor &app-healthcheck):

  • Interval: 30s, timeout: 10s, retries: 3, start period: 15s
  • Application services: python -c "import httpx; httpx.get(...)"
  • Postgres: pg_isready
  • LocalStack: curl -f http://localhost:4566/_localstack/health
  • Redis: redis-cli ping
  • MLflow: python -c "import urllib.request; urllib.request.urlopen(...)"


7. Orphan Detection

Source: lambdas/orphan-scanner/handler.py

The orphan-scanner Lambda runs on a rate(5 minutes) EventBridge schedule and identifies ECS tasks that:

  1. Have been running longer than max_task_runtime_hours (default: 6 hours)
  2. Have no corresponding DynamoDB state record
  3. Failed to terminate properly

Configuration

| Setting | Default | Description |
|---|---|---|
| ecs_cluster | (from env) | ECS cluster to scan |
| max_task_runtime_hours | 6 | Threshold for long-running tasks |
| dry_run | true | Safety default -- log but don't stop tasks |

DynamoDB Tables Scanned

  • tradai-workflow-state-{env} (backtest jobs)
  • tradai-retraining-state-{env} (retraining tasks)
  • tradai-trading-state-{env} (trading containers)

Behavior

  • Defaults to dry-run mode for safety (log orphans without stopping them)
  • Can be overridden per invocation via event payload: {"dry_run": false}
  • Publishes CloudWatch metrics under TradAI/OrphanScanning namespace
  • Sends SNS alerts when orphans are detected
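The classification logic can be sketched as a pure function. The task shape and function name below are illustrative -- the real handler pulls tasks via boto3 ECS APIs and state records from the DynamoDB tables listed above:

```python
from datetime import datetime, timedelta, timezone

def find_orphans(tasks, known_job_ids, max_task_runtime_hours=6, now=None):
    """Flag tasks that exceed the runtime threshold or lack a state record.
    Each task dict here is a stand-in: {"arn": str, "started_at": datetime, "job_id": str | None}."""
    now = now or datetime.now(timezone.utc)
    limit = timedelta(hours=max_task_runtime_hours)
    orphans = []
    for task in tasks:
        too_long = now - task["started_at"] > limit          # criterion 1
        untracked = task.get("job_id") not in known_job_ids  # criterion 2
        if too_long or untracked:
            orphans.append(task["arn"])
    return orphans
```

In dry-run mode the handler would only log and publish metrics for these ARNs; with dry_run disabled it would stop the tasks.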



Key Configuration Files

| File | Purpose |
|---|---|
| infra/shared/tradai_infra_shared/config.py | All service, S3, DynamoDB, ECR, Lambda config |
| infra/compute/modules/ecs_services.py | ECS task definitions, Service Discovery, services |
| infra/compute/modules/lambda_funcs.py | Lambda function definitions (LAMBDA_CONFIGS) |
| docker-compose.yaml | Production-base Docker Compose |
| docker-compose.override.yaml | Development overrides + profile services |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor Strategy/Factory pattern |
| services/backend/src/tradai/backend/core/repositories.py | Protocol definitions (BacktestJobSubmitter, WorkflowStateRepository, ECSOperations) |

Changelog

| Version | Date | Changes |
|---|---|---|
| 10.0.0 | 2026-03-28 | Regenerated from code. All 7 services, 18 Lambdas, Docker Compose verified |

Dependencies

| If This Changes | Update This Doc |
|---|---|
| infra/shared/tradai_infra_shared/config.py SERVICES dict | Service table (Section 1) |
| infra/compute/modules/lambda_funcs.py LAMBDA_CONFIGS | Lambda table (Section 5) |
| docker-compose.yaml or docker-compose.override.yaml | Docker Compose section (Section 6) |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor pattern (Section 4) |