TradAI Services Architecture¶
Version: 10.0.0 | Date: 2026-03-28 | Status: CURRENT
Source of truth: infra/shared/tradai_infra_shared/config.py (SERVICES dict, ECR_REPOS, LAMBDA_SCHEDULES)
TL;DR: 7 ECS service definitions (4 always-on, 1 on-demand backtest container, 2 trading containers) plus 18 Lambda functions (17 ECR-based containers + 1 inline handler). Dev/staging uses a consolidated EC2 instance (t3.small, ~$15/month) instead of per-service Fargate tasks. Backtest execution supports 4 modes via the Strategy pattern: Local, SQS, ECS, and Step Functions.
1. Service Overview¶
All service definitions are in config.py SERVICES dict. Ports, CPU, memory, and health check paths are consumed by both ECS Fargate task definitions and the consolidated EC2 mode.
| Service | Port | CPU | Memory | Health Check | Spot | Desired Count |
|---|---|---|---|---|---|---|
| backend-api | 8000 | 512 | 1024 | /api/v1/health | Yes (default) | 1 (dev/staging), 2 (prod) |
| data-collection | 8002 | 256 | 512 | /api/v1/health | Yes (default) | 1 |
| strategy-service | 8003 | 512 | 1024 | /api/v1/health | Yes (default) | 1 |
| mlflow | 5000 | 512 | 1024 | /mlflow/ | Yes (default) | 1 |
| strategy-container | None | 1024 | 2048 | None | Yes (default) | 0 (on-demand) |
| live-trading | 8004 | 1024 | 2048 | /api/v1/health | No (reliability) | 0 (per strategy) |
| dry-run-trading | 8005 | 1024 | 2048 | /api/v1/health | Yes | 0 (per strategy) |
Notes:
- strategy-container has no port -- it runs as an ECS task (backtest execution), not a long-running service.
- live-trading explicitly disables Spot (spot: False) for reliability. It also has start_period: 120 for warmup.
- dry-run-trading uses Spot for cost savings. Also has start_period: 120.
- Region: eu-central-1 (default, configurable via AWS_REGION env var or Pulumi config).
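The table above can be read as a sketch of the SERVICES dict. The real schema lives in infra/shared/tradai_infra_shared/config.py and may use different field names; the entries below are illustrative assumptions, not the actual definitions.

```python
# Hypothetical shape of two SERVICES entries -- field names are assumptions;
# see infra/shared/tradai_infra_shared/config.py for the real schema.
SERVICES = {
    "backend-api": {
        "port": 8000,
        "cpu": 512,
        "memory": 1024,
        "health_check_path": "/api/v1/health",
        "spot": True,          # default: FARGATE_SPOT with FARGATE fallback
        "desired_count": {"dev": 1, "staging": 1, "prod": 2},
    },
    "live-trading": {
        "port": 8004,
        "cpu": 1024,
        "memory": 2048,
        "health_check_path": "/api/v1/health",
        "spot": False,         # real-money reliability: on-demand Fargate only
        "start_period": 120,   # warmup for exchange connections
        "desired_count": {"dev": 0, "staging": 0, "prod": 0},
    },
}
```

Both ECS task definitions and the consolidated EC2 mode consume these same values, which is what keeps port and health-check configuration in one place.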
ECR Repository Mapping¶
Service names map to ECR repositories via ECR_SERVICE_MAPPING:
| Service Name | ECR Repository |
|---|---|
| backend-api | tradai/backend |
| data-collection | tradai/data-collection |
| strategy-service | tradai/strategy-service |
| mlflow | tradai/mlflow |
| live-trading | tradai/live-trading |
| dry-run-trading | tradai/dry-run-trading |
2. ECS Architecture¶
Source: infra/compute/modules/ecs_services.py (class EcsServices)
Cluster and Capacity Providers¶
- Single ECS cluster per environment: tradai-{env}
- Capacity providers: FARGATE + FARGATE_SPOT
- Services with spot: True (default) use FARGATE_SPOT with FARGATE fallback
- Services with spot: False (live-trading) use FARGATE only
Consolidated EC2 Mode¶
For dev/staging cost savings, services can run on a single EC2 instance instead of separate Fargate tasks. Controlled by CONSOLIDATED_MODE in config.py:
| Environment | Mode | Instance |
|---|---|---|
| dev | Consolidated EC2 | t3.small (2 vCPU, 2GB, ~$15/month) |
| staging | Consolidated EC2 | t3.small |
| prod | ECS Fargate | Per-service task definitions |
Consolidated services: backend-api, data-collection, mlflow, strategy-service. Custom AMI built with Packer (infra/ami/tradai-consolidated.pkr.hcl), configurable via pulumi config set custom_ami_id.
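The environment-to-mode mapping can be sketched as a small helper. CONSOLIDATED_MODE is a real flag in config.py, but its exact type and the helper below are assumptions for illustration.

```python
# Hypothetical mirror of the CONSOLIDATED_MODE switch in config.py;
# the real flag's shape and lookup may differ.
CONSOLIDATED_MODE = {"dev": True, "staging": True, "prod": False}

CONSOLIDATED_SERVICES = ("backend-api", "data-collection", "mlflow", "strategy-service")

def runs_on_consolidated_ec2(env: str, service: str) -> bool:
    """True when the service is served from the single t3.small instance
    rather than its own Fargate task."""
    return CONSOLIDATED_MODE.get(env, False) and service in CONSOLIDATED_SERVICES
```

Note that the trading containers and strategy-container are never consolidated -- only the four always-on services are.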
Service Discovery¶
- Private DNS namespace: tradai-{env}.local (e.g., tradai-dev.local)
- Created in _create_service_discovery_namespace() as aws.servicediscovery.PrivateDnsNamespace
- Each service registers a DNS A record with 60s TTL, MULTIVALUE routing
- Service addresses: {service-name}.tradai-{env}.local:{port}
- MLflow tracking URI: http://mlflow.tradai-{env}.local:5000/mlflow
- SD services are created for ALL services (including consolidated EC2) so DNS works regardless of mode
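The addressing convention above is mechanical enough to express as a one-liner (a sketch; the infra code builds these addresses inside Pulumi resources rather than via a standalone helper):

```python
def service_url(service: str, env: str, port: int, path: str = "") -> str:
    """Build a private-DNS service address per the
    {service-name}.tradai-{env}.local:{port} convention."""
    return f"http://{service}.tradai-{env}.local:{port}{path}"

# service_url("mlflow", "dev", 5000, "/mlflow")
#   -> "http://mlflow.tradai-dev.local:5000/mlflow"
```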
Task Definition Pattern¶
Each ECS Fargate service gets a task definition with:
- Container image from ECR: {account}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{service}
- awslogs log driver pointing to /ecs/tradai/{env}/services
- Log retention: 30 days (dev/staging), 90 days (prod)
- Environment variables built by EnvironmentBuilder + ServiceEnvironmentPresets
- Execution role (ECR pull, CloudWatch logs) and task role (S3, DynamoDB, SQS access)
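EnvironmentBuilder and ServiceEnvironmentPresets are real components, but their internals are not documented here. As a hedged sketch, the backend's inter-service environment might be composed like this (variable names taken from the Docker Compose section; the function itself is hypothetical):

```python
def build_backend_env(env: str) -> dict[str, str]:
    """Hypothetical sketch of the inter-service env vars injected into the
    backend task; the real EnvironmentBuilder likely derives these from the
    SERVICES dict rather than hard-coding ports."""
    domain = f"tradai-{env}.local"
    return {
        "BACKEND_STRATEGY_SERVICE_URL": f"http://strategy-service.{domain}:8003",
        "BACKEND_DATA_COLLECTION_URL": f"http://data-collection.{domain}:8002",
        "BACKEND_MLFLOW_TRACKING_URI": f"http://mlflow.{domain}:5000/mlflow",
    }
```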
Health Check Pattern¶
ECS health checks use Python httpx (not curl -- Alpine images don't include curl). The one exception is MLflow, which uses
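A hedged sketch of the container healthCheck block such a task definition might carry (the exact command string and thresholds in ecs_services.py may differ; the port is illustrative):

```python
# Hypothetical ECS container healthCheck definition using httpx.
# The real command and thresholds live in infra/compute/modules/ecs_services.py.
health_check = {
    "command": [
        "CMD-SHELL",
        "python -c \"import httpx; httpx.get('http://localhost:8000/api/v1/health').raise_for_status()\"",
    ],
    "interval": 30,    # seconds between checks
    "timeout": 10,
    "retries": 3,
    "startPeriod": 15,  # live/dry-run trading raise this to 120 for warmup
}
```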
urllib.request since it runs the third-party MLflow image.
3. Service Details¶
3.1 Backend API¶
- Purpose: API gateway and backtest orchestration. Accepts backtest requests, delegates execution to the configured executor (Local, SQS, ECS, or Step Functions).
- Source: services/backend/src/tradai/backend/ (3-layer: api/, core/, infrastructure/)
- Port: 8000
- Key dependencies: data-collection (port 8002), strategy-service (port 8003), MLflow (port 5000), DynamoDB (workflow-state table), SQS (backtest queue), S3 (configs + results buckets)
- Environment inter-service URLs (ECS): http://strategy-service.tradai-{env}.local:8003, http://data-collection.tradai-{env}.local:8002, http://mlflow.tradai-{env}.local:5000/mlflow
- Docs: services/backend/README.md, services/backend/BACKTEST_DESIGN.md
3.2 Data Collection¶
- Purpose: Market data fetching (OHLCV via CCXT/Binance), ArcticDB time-series storage, data freshness checks, metadata queries.
- Source: services/data-collection/src/tradai/data_collection/ (3-layer)
- Port: 8002
- Platform: linux/amd64 required (ArcticDB only has x86_64 wheels)
- Key dependencies: ArcticDB (S3-backed), LocalStack (dev), S3 (arcticdb bucket)
- Docs: services/data-collection/README.md
3.3 Strategy Service¶
- Purpose: Strategy execution and backtesting. Manages strategy configs, validates parameters, prepares Freqtrade configurations.
- Source: services/strategy-service/src/tradai/strategy_service/ (3-layer)
- Port: 8003
- Key dependencies: MLflow (tracking URI), S3 (config storage), user_data directory
- Docs: services/strategy-service/README.md
3.4 MLflow¶
- Purpose: ML experiment tracking and model registry. Third-party MLflow server with PostgreSQL backend store and S3 artifact storage.
- Source: services/mlflow/ (wraps MLflow Docker image, namespace tradai_mlflow)
- Port: 5000 (uses --static-prefix=/mlflow)
- Health check: /mlflow/ (root returns 200, no /health endpoint)
- Key dependencies: PostgreSQL (backend store), S3 (tradai-mlflow-{env} bucket for artifacts)
- Note: Does NOT follow the 3-layer pattern.
- Docs: services/mlflow/README.md
3.5 Live Trading¶
- Purpose: Production live trading with real exchange connections. Runs one instance per strategy.
- Port: 8004
- Spot: Disabled (spot: False) -- reliability required for real money operations
- Start period: 120s (warmup for exchange connections and model loading)
- Desired count: 0 (scaled up per strategy via deployment automation)
3.6 Dry-Run Trading¶
- Purpose: Paper trading for strategy validation before live deployment. Same architecture as live-trading but uses simulated order execution.
- Port: 8005
- Spot: Enabled (spot: True) -- cost savings acceptable for simulation
- Start period: 120s
- Desired count: 0 (scaled up per strategy)
3.7 Strategy Container¶
- Purpose: On-demand ECS task for backtest execution. Launched by backtest-consumer Lambda or directly by ECSBacktestJobSubmitter. No port, no health check.
- CPU/Memory: 1024/2048 (highest allocation for compute-heavy backtests)
- Desired count: 0 (purely on-demand)
4. Backtest Execution¶
Source: services/backend/src/tradai/backend/infrastructure/executor_factory.py
Protocol¶
The BacktestJobSubmitter protocol (defined in services/backend/src/tradai/backend/core/repositories.py) provides a single method:
The method returns an execution identifier (task ARN, SQS message ID, Step Functions execution ARN, or container ID).
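A minimal sketch of what that protocol could look like. The method name submit matches the sequence diagram below, but the exact signature in core/repositories.py is an assumption:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class BacktestJobSubmitter(Protocol):
    """Sketch of the protocol from backend/core/repositories.py;
    the real signature may carry a typed job model instead of a dict."""

    def submit(self, job: dict) -> str:
        """Submit a backtest job and return an execution identifier
        (task ARN, SQS message ID, SFN execution ARN, or container ID)."""
        ...
```

Because it is a Protocol, the four submitter implementations satisfy it structurally -- none of them need to inherit from it.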
Implementations¶
| Class | Module | Mode | Description |
|---|---|---|---|
| LocalBacktestJobSubmitter | backend.infrastructure.local | LOCAL | Runs via subprocess (freqtrade binary) or Docker SDK |
| SQSBacktestJobSubmitter | backend.infrastructure.sqs | SQS | Sends to SQS queue, consumed by backtest-consumer Lambda |
| ECSBacktestJobSubmitter | tradai.common.aws.ecs_executor | ECS | Direct ECS Fargate task launch |
| StepFunctionsExecutor | tradai.common.aws.step_functions | STEPFUNCTIONS | Starts Step Functions execution for orchestrated workflow |
ExecutorFactory (Strategy Pattern)¶
ExecutorFactory uses the Strategy pattern with four registered strategies:
| ExecutorMode | Strategy Class | Key Validation |
|---|---|---|
| ECS | ECSExecutorStrategy | Requires BACKEND_ECS_CLUSTER, BACKEND_ECS_SUBNETS, BACKEND_ECS_SECURITY_GROUPS |
| LOCAL | LocalExecutorStrategy | SUBPROCESS mode requires freqtrade binary in PATH; DOCKER mode requires Docker daemon |
| SQS | SQSExecutorStrategy | Requires BACKEND_BACKTEST_QUEUE_URL |
| STEPFUNCTIONS | StepFunctionsExecutorStrategy | Requires BACKEND_STEPFUNCTIONS_STATE_MACHINE_ARN |
The factory provides two creation methods:
- create() -- returns ExecutorCreationResult (success with executor or failure with error message)
- create_with_fallback() -- in dev mode (allow_none=True), returns None on failure instead of raising
Local execution sub-modes (LocalExecutionMode enum):
- SUBPROCESS -- calls freqtrade binary directly
- DOCKER -- uses Docker SDK to run strategy containers (default in docker-compose)
Executor Pattern (Strategy + Factory)
The ExecutorFactory uses the Strategy pattern to create the appropriate BacktestJobSubmitter implementation based on ExecutorMode. Each mode (LOCAL, SQS, ECS, STEPFUNCTIONS) has its own validation requirements. In dev, create_with_fallback() returns None on failure instead of raising, allowing graceful degradation.
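The validate-then-construct flow can be sketched as follows. The env var names come from the validation table above; everything else (function shapes, return types, error messages) is a simplified assumption, not the real ExecutorFactory code:

```python
from enum import Enum

class ExecutorMode(Enum):
    LOCAL = "local"
    SQS = "sqs"
    ECS = "ecs"
    STEPFUNCTIONS = "stepfunctions"

# Env vars each strategy validates before constructing an executor
# (from the table above; the real strategy classes check more, e.g.
# the freqtrade binary or a running Docker daemon for LOCAL).
REQUIRED_ENV = {
    ExecutorMode.ECS: ["BACKEND_ECS_CLUSTER", "BACKEND_ECS_SUBNETS",
                       "BACKEND_ECS_SECURITY_GROUPS"],
    ExecutorMode.SQS: ["BACKEND_BACKTEST_QUEUE_URL"],
    ExecutorMode.STEPFUNCTIONS: ["BACKEND_STEPFUNCTIONS_STATE_MACHINE_ARN"],
    ExecutorMode.LOCAL: [],
}

def create(mode: ExecutorMode, env: dict) -> tuple:
    """Sketch of create(): validate config, then build.
    Returns (executor, None) on success, (None, error_message) on failure."""
    missing = [key for key in REQUIRED_ENV[mode] if not env.get(key)]
    if missing:
        return None, f"{mode.value} executor missing config: {', '.join(missing)}"
    return object(), None  # stand-in for the real BacktestJobSubmitter instance

def create_with_fallback(mode: ExecutorMode, env: dict, allow_none: bool = True):
    """Dev-mode behavior: return None on failure instead of raising."""
    executor, error = create(mode, env)
    if executor is None and not allow_none:
        raise RuntimeError(error)
    return executor
```

The graceful-degradation behavior matters in dev: a missing queue URL then just disables SQS-mode backtests instead of crashing the backend at startup.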
Execution Flow¶
Mermaid Sequence Diagram¶
sequenceDiagram
participant Client
participant Backend as Backend API
participant Factory as ExecutorFactory
participant SQS as SQS FIFO Queue
participant Lambda as backtest-consumer Lambda
participant SFN as Step Functions
participant ECS as Strategy Container (ECS)
participant DDB as DynamoDB
participant MLflow as MLflow
Client->>Backend: POST /api/v1/backtest
Backend->>Factory: create(settings)
Factory-->>Backend: BacktestJobSubmitter
alt SQS Mode
Backend->>SQS: submit(job) -> message_id
Backend-->>Client: {ticket_id, status: "QUEUED"}
Note over Client,SQS: Response returned immediately after SQS send
SQS->>Lambda: trigger (async)
Lambda->>DDB: idempotency check
Lambda->>SFN: StartExecution
SFN->>ECS: RunTask (strategy container)
ECS->>MLflow: log metrics/results
ECS->>DDB: update status COMPLETED
SFN-->>Lambda: execution complete
else ECS Direct Mode
Backend->>ECS: RunTask API
Backend-->>Client: {ticket_id, status: "QUEUED"}
ECS->>MLflow: log metrics/results
else Local Mode
Backend->>Backend: subprocess / Docker SDK
Backend-->>Client: {ticket_id, status: "QUEUED"}
end
Client->>Backend: GET /api/v1/backtests/{job_id}
Backend->>DDB: query status
Backend-->>Client: {status, progress, result}

File: services/backend/src/tradai/backend/infrastructure/executor_factory.py
5. Lambda Functions¶
Source: infra/compute/modules/lambda_funcs.py (LAMBDA_CONFIGS dict)
Of the 18 Lambda functions, 17 use ECR-based container images (not ZIP); update-nat-routes alone uses an inline Python handler. The shared base image is built from lambdas/base/. Each Lambda's handler lives at lambdas/{name}/handler.py.
| Lambda | Memory | Timeout | VPC | Required | Scheduled | Description |
|---|---|---|---|---|---|---|
| backtest-consumer | 256 | 30s | Yes | Yes | No | Consumes SQS backtest queue, launches ECS tasks |
| sqs-consumer | 256 | 30s | Yes | No | No | Consumes retraining SQS queue, launches ECS tasks |
| orphan-scanner | 128 | 60s | Yes | Yes | rate(5 minutes) | Scans for orphaned ECS tasks |
| health-check | 256 | 60s | Yes | Yes | rate(2 minutes) | Periodic health checks via Service Discovery |
| trading-heartbeat-check | 256 | 60s | Yes | Yes | rate(5 minutes) | Checks trading container heartbeats in DynamoDB |
| drift-monitor | 512 | 120s | Yes | Yes | rate(12 hours) | Model drift detection using PSI |
| retraining-scheduler | 256 | 60s | Yes | Yes | rate(6 hours) | Schedules retraining based on drift/intervals |
| validate-strategy | 256 | 30s | Yes | No | No | Validates strategy config in Step Functions |
| data-collection-proxy | 256 | 180s | Yes | No | No | Proxies to data-collection for ensure-data ops |
| notify-completion | 128 | 30s | No | No | No | Sends SNS notifications on backtest completion |
| check-retraining-needed | 256 | 30s | Yes | No | No | Checks drift state to decide retraining |
| compare-models | 512 | 120s | Yes | No | No | Compares champion vs challenger models |
| promote-model | 256 | 60s | Yes | No | No | Promotes challenger to Production in MLflow |
| model-rollback | 256 | 60s | Yes | No | No | Rolls back model on performance degradation |
| cleanup-resources | 256 | 60s | Yes | No | No | Stops orphaned ECS tasks on SFN failure |
| update-status | 256 | 30s | Yes | No | No | Updates job status in DynamoDB workflow-state |
| pulumi-drift-detector | 512 | 300s | No | Yes | rate(6 hours) | Detects infrastructure drift via Pulumi preview |
| update-nat-routes | 128 | 120s | No | Yes | ASG Lifecycle Hook | Updates private route table when NAT instance is replaced |
ECR Repositories for Lambdas¶
Each Lambda has a dedicated ECR repository following the naming pattern tradai/lambda-{name}. Full list in config.py LAMBDA_ECR_REPOS. A shared base image is at tradai/lambda-base.
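The naming convention is simple enough to state in code (a sketch; LAMBDA_ECR_REPOS in config.py is the authoritative list):

```python
def lambda_ecr_repo(name: str) -> str:
    """tradai/lambda-{name} naming convention for per-Lambda ECR repos."""
    return f"tradai/lambda-{name}"

# Deriving repo names for a subset of the 18 Lambdas:
repos = {name: lambda_ecr_repo(name)
         for name in ("backtest-consumer", "orphan-scanner", "health-check")}
```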
Schedule Configuration¶
Schedules are in config.py LAMBDA_SCHEDULES dict. Each can be overridden via Pulumi config (e.g., pulumi config set orphan_scanner_schedule "rate(10 minutes)").
| Lambda | Default Schedule | Config Override Key |
|---|---|---|
| orphan-scanner | rate(5 minutes) | orphan_scanner_schedule |
| health-check | rate(2 minutes) | health_check_schedule |
| trading-heartbeat-check | rate(5 minutes) | trading_heartbeat_schedule |
| drift-monitor | rate(12 hours) | drift_monitor_schedule |
| retraining-scheduler | rate(6 hours) | retraining_scheduler_schedule |
| pulumi-drift-detector | rate(6 hours) | pulumi_drift_schedule |
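The default-plus-override lookup can be sketched as below. The real code reads pulumi.Config rather than a plain dict; the function and the dict-of-tuples shape are assumptions for illustration:

```python
# Defaults and override keys from the table above; the real
# LAMBDA_SCHEDULES shape in config.py may differ.
LAMBDA_SCHEDULES = {
    "orphan-scanner": ("rate(5 minutes)", "orphan_scanner_schedule"),
    "health-check": ("rate(2 minutes)", "health_check_schedule"),
    "drift-monitor": ("rate(12 hours)", "drift_monitor_schedule"),
}

def effective_schedule(lambda_name: str, pulumi_config: dict) -> str:
    """Return the Pulumi-config override if set, else the default schedule."""
    default, override_key = LAMBDA_SCHEDULES[lambda_name]
    return pulumi_config.get(override_key, default)
```

This mirrors what `pulumi config set orphan_scanner_schedule "rate(10 minutes)"` achieves: the override key wins, everything else keeps its default.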
6. Docker Compose (Development)¶
Source: docker-compose.yaml + docker-compose.override.yaml
Docker Compose Profiles
Profile services (MLflow, LocalStack, Redis) are not started by default. Use docker compose --profile mlflow --profile localstack up or the shortcut just up-full to start all profiles. The override file mounts the Docker socket into the backend container for local Docker-mode backtest execution.
Base Services (docker-compose.yaml)¶
| Service Name | Image | Container Name | Host Port | Container Port |
|---|---|---|---|---|
| backend | tradai/backend:${TAG} | tradai-backend | ${HOST_BACKEND_PORT:-8000} | 8000 |
| data-collection | tradai/data-collection:${TAG} | tradai-data-collection | ${HOST_DATA_COLLECTION_PORT:-8002} | 8002 |
| strategy-service | tradai/strategy-service:${TAG} | tradai-strategy-service | ${HOST_STRATEGY_SERVICE_PORT:-8003} | 8003 |
| postgres | postgres:15-alpine | tradai-postgres | (none in base) | 5432 |
Dependencies:
- backend depends on data-collection + strategy-service (condition: service_healthy)
- strategy-service depends on postgres (implicit via MLflow)
Inter-service URLs (Docker Compose uses container names, not Service Discovery):
- BACKEND_DATA_COLLECTION_URL: http://data-collection:8002
- BACKEND_STRATEGY_SERVICE_URL: http://strategy-service:8003
- BACKEND_MLFLOW_TRACKING_URI: http://mlflow:5000
Profile Services (docker-compose.override.yaml)¶
| Service | Profile | Image | Host Port | Container Port |
|---|---|---|---|---|
| mlflow | mlflow | tradai/mlflow:${TAG} | ${MLFLOW_PORT:-5001} | 5000 |
| localstack | localstack | localstack/localstack:3.8 | ${LOCALSTACK_PORT:-4566} | 4566 |
| redis | redis | redis:7-alpine | ${REDIS_PORT:-6379} | 6379 |
Redis is provisioned as an optional caching layer for future use as a distributed pub/sub backend (e.g., cross-worker WebSocket broadcast, hyperopt job state) -- not currently consumed by any service in production.
Start profile services with docker compose --profile mlflow --profile localstack up, or use just up-full (all profiles).
Development Overrides¶
The override file adds:
- backend: runs as root (user 0:0) for Docker socket access, mounts Docker socket + AWS creds + user_data + strategies, sets BACKEND_LOCAL_EXECUTION_MODE=docker
- data-collection: adds LocalStack endpoint config for ArcticDB S3 backend
- strategy-service: adds LocalStack endpoint, mounts user_data + strategies + AWS creds
- postgres: exposes port ${POSTGRES_PORT:-5433} on host (5433 to avoid conflicts)
Networks¶
| Network | Driver | Purpose |
|---|---|---|
| tradai-internal | bridge | Service-to-service communication |
| tradai-external | bridge | Services needing external access |
All application services join both networks. Infrastructure services (postgres) join only tradai-internal.
Health Checks¶
Common health check template (YAML anchor &app-healthcheck):
- Interval: 30s, timeout: 10s, retries: 3, start period: 15s
- Application services: python -c "import httpx; httpx.get(...)"
- Postgres: pg_isready
- LocalStack: curl -f http://localhost:4566/_localstack/health
- Redis: redis-cli ping
- MLflow: python -c "import urllib.request; urllib.request.urlopen(...)"
7. Orphan Detection¶
Source: lambdas/orphan-scanner/handler.py
The orphan-scanner Lambda runs on a rate(5 minutes) EventBridge schedule and identifies ECS tasks that:
1. Have been running longer than max_task_runtime_hours (default: 6 hours)
2. Have no corresponding DynamoDB state record
3. Failed to terminate properly
Configuration¶
| Setting | Default | Description |
|---|---|---|
| ecs_cluster | (from env) | ECS cluster to scan |
| max_task_runtime_hours | 6 | Threshold for long-running tasks |
| dry_run | true | Safety default -- log but don't stop tasks |
DynamoDB Tables Scanned¶
- tradai-workflow-state-{env} (backtest jobs)
- tradai-retraining-state-{env} (retraining tasks)
- tradai-trading-state-{env} (trading containers)
Behavior¶
- Defaults to dry-run mode for safety (log orphans without stopping them)
- Can be overridden per invocation via event payload: {"dry_run": false}
- Publishes CloudWatch metrics under the TradAI/OrphanScanning namespace
- Sends SNS alerts when orphans are detected
Handler: lambdas/orphan-scanner/handler.py
Key Configuration Files¶
| File | Purpose |
|---|---|
| infra/shared/tradai_infra_shared/config.py | All service, S3, DynamoDB, ECR, Lambda config |
| infra/compute/modules/ecs_services.py | ECS task definitions, Service Discovery, services |
| infra/compute/modules/lambda_funcs.py | Lambda function definitions (LAMBDA_CONFIGS) |
| docker-compose.yaml | Production-base Docker Compose |
| docker-compose.override.yaml | Development overrides + profile services |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor Strategy/Factory pattern |
| services/backend/src/tradai/backend/core/repositories.py | Protocol definitions (BacktestJobSubmitter, WorkflowStateRepository, ECSOperations) |
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 10.0.0 | 2026-03-28 | Regenerated from code. All 7 services, 18 Lambdas, Docker Compose verified |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
| infra/shared/tradai_infra_shared/config.py SERVICES dict | Service table (Section 1) |
| infra/compute/modules/lambda_funcs.py LAMBDA_CONFIGS | Lambda table (Section 5) |
| docker-compose.yaml or docker-compose.override.yaml | Docker Compose section (Section 6) |
| services/backend/src/tradai/backend/infrastructure/executor_factory.py | Executor pattern (Section 4) |