
TradAI Final Architecture - Architecture Overview

Version: 9.2.1 | Date: 2025-12-09

TL;DR: TradAI runs 4 always-on ECS Fargate services (Backend API, Data Collection, Strategy Service, MLflow) plus on-demand strategy containers and 18 Lambda functions inside a single VPC across 2 AZs. API Gateway handles auth/routing, Step Functions orchestrates multi-step backtests, and SQS FIFO queues provide backpressure. Storage spans S3, ArcticDB, DynamoDB, and RDS PostgreSQL.


Key Architecture Principles

  • Serverless-first -- use managed services (API Gateway, Step Functions, SQS, DynamoDB) to minimize ops overhead.
  • Pay-per-use compute -- on-demand ECS tasks for backtests; Fargate Spot for cost savings.
  • Async by default -- backtest submissions return immediately with a ticket; results polled later.
  • Defense in depth -- WAF + Cognito JWT + security groups + NACLs + encryption at rest.

1. System Architecture Diagram

Mermaid Diagram

graph LR
    subgraph Users
        CLI[CLI / Web UI / API Clients]
    end

    subgraph Edge["Edge Layer"]
        APIGW[AWS API Gateway<br/>JWT + Rate Limit + WAF]
    end

    subgraph SQSLayer["Async Queuing"]
        BQ[Backtest Queue<br/>SQS FIFO]
        DLQ[Dead Letter Queue]
        BQ --> DLQ
    end

    subgraph VPC["VPC 10.0.0.0/16"]
        subgraph PublicSubnets["Public Subnets"]
            ALB[Application Load Balancer]
        end

        subgraph PrivateSubnets["Private Subnets"]
            Backend[Backend API<br/>0.5 vCPU · 1GB · :8000]
            DataCol[Data Collection<br/>0.25 vCPU · 512MB · :8002]
            MLflow[MLflow<br/>0.5 vCPU · 1GB · :5000]
            Lambdas[Lambda Functions<br/>18 functions]
        end

        subgraph DBSubnets["Database Subnets"]
            RDS[(RDS PostgreSQL<br/>db.t4g.micro)]
        end
    end

    subgraph Orchestration
        SFN[Step Functions<br/>Backtest + Retraining]
        ECSTask[Strategy Container<br/>1 vCPU · 2GB]
    end

    subgraph Storage["Storage Layer"]
        S3Configs[S3 Configs]
        S3Results[S3 Results]
        ArcticDB[ArcticDB<br/>S3-backed]
        ECR[ECR Images]
        Dynamo[(DynamoDB<br/>Workflow State)]
        Secrets[Secrets Manager]
    end

    CLI -->|HTTPS 443| APIGW
    APIGW -->|VPC Link| ALB
    APIGW -->|SQS Integration| BQ
    ALB --> Backend
    ALB --> MLflow
    Backend --> DataCol
    Backend --> MLflow
    MLflow --> RDS
    BQ -->|Lambda Trigger| Lambdas
    Lambdas -->|StartExecution| SFN
    SFN -->|RunTask| ECSTask
    SFN --> Lambdas
    Backend --> Dynamo
    ECSTask --> ArcticDB
    ECSTask --> S3Results
    DataCol --> ArcticDB
    Lambdas --> Dynamo

Lambda VPC Placement

The diagram shows all Lambdas in the private subnet for simplicity. In practice, notify-completion and pulumi-drift-detector run outside the VPC (no VPC configuration) since they only need access to SNS/Pulumi and not to internal services.

See source code for detailed ASCII reference.


2. Component Inventory

Long-Running Services (24/7)

| Service | Purpose | Resources | Port | Cost/Month |
|---|---|---|---|---|
| Backend API Service | API endpoints, orchestration triggers | 0.5 vCPU, 1GB | 8000 | $14.60 |
| Data Collection Service | Data freshness checks, metadata queries | 0.25 vCPU, 512MB | 8002 | $7.30 |
| Strategy Service | Strategy execution, config preparation, validation | 0.5 vCPU, 1GB | 8003 | $14.60 |
| MLflow Service | Experiment tracking, model registry | 0.5 vCPU, 1GB | 5000 | $14.60 |

On-Demand Tasks (Pay-per-use)

| Task | Purpose | Resources | Duration | Cost/Run |
|---|---|---|---|---|
| Strategy Service Task | Config preparation, validation | 0.5 vCPU, 1GB | 5-10 min | $0.03 |
| Strategy Container | Backtest execution (Freqtrade) | 1 vCPU, 2GB | 5-60 min | $0.02-0.20 |

Live Trading Services (v9.1)

| Service | Purpose | Resources | Mode | Cost/Month |
|---|---|---|---|---|
| Strategy Container (Live) | Production trading per strategy | 1 vCPU, 2GB | ECS Service (non-Spot) | $30.66 |
| Strategy Container (Dry-run) | Paper trading per strategy | 1 vCPU, 2GB | ECS Service (Spot) | ~$9.20 |

Note: Strategy containers are unified - same image handles backtest, hyperopt, dry-run, and live modes. Config loaded from MLflow tags + S3 at runtime (see 11-LIVE-TRADING.md).
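The runtime config assembly described in the note can be sketched as a simple merge. This is a minimal illustration, not the actual ConfigMergeService: the function name `build_runtime_config`, the precedence order (S3 config, then MLflow tags, then runtime overrides), and all example values are assumptions. The tag names are the ones listed under Pattern 4.

```python
from typing import Any, Mapping, Optional

# Tag names taken from Pattern 4; config_s3_path only locates the file
# and is deliberately not copied into the merged config.
TAG_KEYS = ("timeframe", "pairs", "warmup_days")


def build_runtime_config(
    s3_config: Mapping[str, Any],
    mlflow_tags: Mapping[str, Any],
    overrides: Optional[Mapping[str, Any]] = None,
) -> dict:
    """Shallow-merge config sources (precedence order is an assumption)."""
    config = dict(s3_config)  # start from the full Freqtrade config
    for key in TAG_KEYS:
        if key in mlflow_tags:
            config[key] = mlflow_tags[key]  # tags override the S3 values
    config.update(overrides or {})  # runtime overrides win last
    return config


# Hypothetical example values, for illustration only.
config = build_runtime_config(
    s3_config={"timeframe": "1h", "stake_currency": "USDT"},
    mlflow_tags={
        "timeframe": "5m",
        "warmup_days": 30,
        "config_s3_path": "s3://example-bucket/config.json",
    },
    overrides={"dry_run": True},
)
```

A shallow merge keeps the sketch short; nested Freqtrade sections would need a recursive merge.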

Lambda Functions

| Function | Purpose | Memory | Timeout | Trigger | Required |
|---|---|---|---|---|---|
| backtest-consumer | Consume backtest requests from SQS, launch ECS tasks | 256MB | 30s | SQS | Yes |
| sqs-consumer | Consume retraining messages, launch ECS tasks | 256MB | 30s | SQS | No |
| orphan-scanner | Scan for orphaned ECS tasks | 128MB | 60s | EventBridge (scheduled) | Yes |
| health-check | Periodic health checks on ECS services | 256MB | 60s | EventBridge (scheduled) | Yes |
| trading-heartbeat-check | Check live trading container heartbeats | 256MB | 60s | EventBridge (scheduled) | Yes |
| drift-monitor | Monitor model drift using PSI, alert on drift | 512MB | 120s | EventBridge (scheduled) | Yes |
| retraining-scheduler | Schedule model retraining based on drift or intervals | 256MB | 60s | EventBridge (scheduled) | Yes |
| pulumi-drift-detector | Detect infrastructure drift via Pulumi preview | 512MB | 300s | EventBridge (scheduled) | Yes |
| validate-strategy | Validate strategy config with MLflow version resolution | 256MB | 30s | Step Functions | No |
| data-collection-proxy | Proxy to Data Collection service for ensure-data ops | 256MB | 180s | Step Functions | No |
| update-status | Update job status in DynamoDB workflow-state table | 256MB | 30s | Step Functions | No |
| cleanup-resources | Stop orphaned ECS tasks on workflow failure | 256MB | 60s | Step Functions | No |
| notify-completion | Send notifications on pipeline completion | 128MB | 30s | Step Functions | No |
| check-retraining-needed | Evaluate drift state to determine if retraining needed | 256MB | 30s | Step Functions | No |
| compare-models | Compare champion vs challenger models | 512MB | 120s | Step Functions | No |
| promote-model | Promote challenger to Production in MLflow | 256MB | 60s | Step Functions | No |
| model-rollback | Roll back model to previous version on degradation | 256MB | 60s | Backend API | No |
| update-nat-routes | Update private route table when NAT instance replaced | 128MB | 120s | ASG Lifecycle Hook | Yes |

Managed Services

| Service | Purpose | Configuration | Cost/Month |
|---|---|---|---|
| AWS API Gateway | API management, auth, routing | HTTP API | $3.50 |
| Step Functions | Workflow orchestration | Standard type | $0.50 |
| SQS | Message queuing | FIFO + DLQ | $0.40 |
| DynamoDB | Workflow state | On-demand, TTL | $2.00 |
| Cognito | User authentication | User pool | $0 (free tier) |
| Secrets Manager | Credentials storage | 5 secrets | $2.00 |
| CloudTrail | Audit logging | Single trail | $2.00 |
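As a quick sanity check, the monthly figures from the Long-Running Services and Managed Services tables can be totalled. This is only arithmetic over the listed numbers. It leaves out everything for which this section gives no monthly cost (RDS, NAT, ALB, S3, Lambda, on-demand tasks, live trading), so it is not a full bill estimate.

```python
from decimal import Decimal

# Monthly costs copied from the Long-Running Services table.
LONG_RUNNING = {
    "Backend API Service": Decimal("14.60"),
    "Data Collection Service": Decimal("7.30"),
    "Strategy Service": Decimal("14.60"),
    "MLflow Service": Decimal("14.60"),
}

# Monthly costs copied from the Managed Services table.
MANAGED = {
    "AWS API Gateway": Decimal("3.50"),
    "Step Functions": Decimal("0.50"),
    "SQS": Decimal("0.40"),
    "DynamoDB": Decimal("2.00"),
    "Cognito": Decimal("0.00"),  # free tier
    "Secrets Manager": Decimal("2.00"),
    "CloudTrail": Decimal("2.00"),
}

long_running_total = sum(LONG_RUNNING.values())
managed_total = sum(MANAGED.values())
listed_total = long_running_total + managed_total

print(f"Long-running services: ${long_running_total}")
print(f"Managed services:      ${managed_total}")
print(f"Listed subtotal:       ${listed_total}")
```

Decimal is used instead of float so the cents add up exactly.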

3. Data Flow Patterns

Pattern 1: Simple Query (Synchronous)

User Request                    Response Time: <200ms
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
VPC Link ──► ALB ──► Backend Service
                           ├─► GET /api/v1/strategies ──► MLflow Service ──► RDS
                           └─► GET /api/v1/data/freshness ──► Data Collection ──► ArcticDB (S3)

Use Cases:

  • List strategies
  • Get strategy details
  • Check data freshness
  • Get workflow status

Pattern 2: Complex Operation (Asynchronous)

User Request                    Response Time: <100ms (queued)
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
SQS Integration ──► Backtest Queue (FIFO)
     │                    │
     │                    └─► Response: { ticket_id, status: "QUEUED" }
     ▼ (Async)
Lambda: backtest-consumer (SQS trigger)
     ├─► Check Idempotency (DynamoDB)
     ├─► Register Workflow (DynamoDB: status=RUNNING)
     └─► Start Step Functions Execution
         Backtest Workflow v11 (10-70 minutes)
              ├─► ValidateStrategy (Lambda)
              ├─► EnsureData (Lambda → Data Collection)
              ├─► UpdateStatusRunning (Lambda → DynamoDB)
              ├─► RunBacktest (ECS Task: Strategy Container)
              ├─► HandleSuccess (Parallel)
              │     ├─► UpdateStatusCompleted (Lambda)
              │     └─► NotifySuccess (Lambda → SNS)
              └─► On failure: CleanupResources (Lambda) → UpdateStatusFailed

Use Cases:

  • Run backtest
  • Deploy strategy
  • Full data sync
  • Register new strategy
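The consumer steps in the flow above (idempotency check, workflow registration, Step Functions start) can be sketched without AWS dependencies. This is a minimal sketch under stated assumptions: `WorkflowStore` is a hypothetical in-memory stand-in for the DynamoDB table, `start_execution` is an injected callable standing in for the Step Functions StartExecution call, and using the ticket ID as the execution name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class WorkflowStore:
    """In-memory stand-in for the workflow-state table (hypothetical)."""
    items: Dict[str, str] = field(default_factory=dict)

    def put_if_absent(self, ticket_id: str, status: str) -> bool:
        # Mimics a conditional write: succeeds only when the key is new.
        if ticket_id in self.items:
            return False
        self.items[ticket_id] = status
        return True


def handle_backtest_message(
    message: dict,
    store: WorkflowStore,
    start_execution: Callable[[str, dict], None],
) -> str:
    """Process one queue message; return 'STARTED' or 'DUPLICATE'."""
    ticket_id = message["ticket_id"]
    # Idempotency check and registration collapse into one conditional write.
    if not store.put_if_absent(ticket_id, "RUNNING"):
        return "DUPLICATE"
    # Assumption: the ticket ID doubles as the execution name.
    start_execution(ticket_id, message)
    return "STARTED"


# Demo: the same message delivered twice starts only one workflow.
started = []
store = WorkflowStore()
msg = {"ticket_id": "bt-001", "strategy": "example"}
first = handle_backtest_message(msg, store, lambda name, payload: started.append(name))
second = handle_backtest_message(msg, store, lambda name, payload: started.append(name))
```

Combining the check and the registration into a single conditional write avoids a race between two concurrent deliveries of the same message.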

Pattern 3: Status Polling

User: GET /api/v1/backtests/{job_id}
AWS API Gateway ──► Backend Service
                    DynamoDB Query
                          ├─► If COMPLETED: Include results from MLflow
                          └─► Return: { status, progress, result?, mlflow_url }
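
From the client side, the polling pattern above might look like the following sketch. The function name `poll_backtest`, the polling interval, the attempt limit, and the set of terminal statuses are assumptions. `fetch_status` is an injected callable standing in for the GET request, so the example runs without network access.

```python
import time
from typing import Callable, Iterator

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # assumed terminal states


def poll_backtest(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status."""
    for _ in range(max_attempts):
        body = fetch_status(job_id)
        if body["status"] in TERMINAL_STATUSES:
            return body
        sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} polls")


# Demo with canned responses instead of real HTTP calls.
_responses: Iterator[dict] = iter([
    {"status": "QUEUED"},
    {"status": "RUNNING", "progress": 40},
    {"status": "COMPLETED", "progress": 100},
])
waits = []
final = poll_backtest(lambda job_id: next(_responses), "bt-001", sleep=waits.append)
```

Injecting `sleep` keeps the loop testable; a real client would leave the default in place.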

Pattern 4: Live Trading (v9.1)

Container Startup (ECS Service - Always On)
     ├─► Read minimal env vars: STRATEGY_NAME, STRATEGY_STAGE, TRADING_MODE
MLflow Config Loading
     ├─► MLflowAdapter.get_model_version(name, stage)
     │       └─► Returns tags: timeframe, pairs, warmup_days, config_s3_path
     ├─► ConfigMergeService.load_config(config_s3_path)
     │       └─► Load full Freqtrade config from S3
     └─► Apply runtime overrides (if any)
Data Warmup
     ├─► ArcticDB direct access (S3) - NOT via Data Collection Service
     │       └─► Load warmup_days of historical OHLCV data
     └─► Initialize Freqtrade with warmed dataframes
Trading Loop (24/7)
     ├─► Freqtrade strategy execution
     │       └─► Exchange API (Binance) for orders
     ├─► Health Reporting (every 30s)
     │       ├─► DynamoDB heartbeat update
     │       └─► CloudWatch metrics (PnL, trades, latency)
     └─► EventBridge + Lambda monitors session health
              └─► SNS alert if heartbeat stale > 3 minutes

Use Cases:

  • Live production trading
  • Dry-run (paper trading)
  • Strategy hot-swap via MLflow stage change
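The stale-heartbeat rule in the trading loop above reduces to a timestamp comparison. The sketch below assumes the heartbeat is stored as a timezone-aware timestamp; the function name is hypothetical, and only the 3-minute threshold comes from the flow.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=3)  # threshold from the flow above


def is_heartbeat_stale(
    last_heartbeat: datetime,
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> bool:
    """Return True when the last heartbeat is older than the threshold."""
    return now - last_heartbeat > stale_after


# Fixed reference time so the example is deterministic.
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_heartbeat_stale(now - timedelta(seconds=30), now)
stale = is_heartbeat_stale(now - timedelta(minutes=4), now)
```

With heartbeats every 30 seconds, a 3-minute threshold tolerates several missed updates before an alert fires.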


4. Service Communication Matrix

| From | To | Protocol | Path | Purpose |
|---|---|---|---|---|
| API Gateway | Backend Service | HTTP/VPC Link | /api/v1/strategies/* | Strategy queries |
| API Gateway | Backend Service | HTTP/VPC Link | /api/v1/data/* | Data queries |
| API Gateway | MLflow Service | HTTP/VPC Link | /mlflow/* | MLflow proxy |
| API Gateway | SQS | AWS Integration | Queue | Async operations |
| Lambda | Data Collection | HTTP/VPC | data-collection.tradai-{env}.local:8002 | Data checks |
| Lambda | Step Functions | AWS SDK | StartExecution | Workflow trigger |
| Lambda | DynamoDB | AWS SDK | GetItem/PutItem | State management |
| Step Functions | Lambda | AWS Integration | Invoke | Utility functions |
| Step Functions | ECS | AWS Integration | RunTask.sync | Task execution |
| Step Functions | DynamoDB | AWS Integration | UpdateItem | State updates |
| Backend Service | MLflow Service | HTTP | :5000 | Experiment queries |
| Backend Service | Data Collection | HTTP | :8002 | Freshness checks |
| Data Collection | ArcticDB | S3 API | s3:// | Time series data |
| Strategy Container | ArcticDB | S3 API | s3:// | OHLCV data |
| Strategy Container | MLflow Service | HTTP | :5000 | Log metrics |
| Live Trading (v9.1) | | | | |
| Strategy Container (Live) | MLflow Service | HTTP | :5000 | Load config tags |
| Strategy Container (Live) | S3 | S3 API | s3:// | Load config JSON |
| Strategy Container (Live) | ArcticDB | S3 API | s3:// | Warmup data |
| Strategy Container (Live) | DynamoDB | AWS SDK | UpdateItem | Heartbeat |
| Strategy Container (Live) | Binance | HTTPS | api.binance.com | Orders |
| Lambda (health-check) | DynamoDB | AWS SDK | Query | Session monitoring |
| EventBridge | Lambda (health-check) | AWS Integration | rate(2 min) | Scheduled check |

5. Scaling Strategy

Horizontal Scaling

| Component | Min | Max | Trigger |
|---|---|---|---|
| Backend Service | 1 | 3 | CPU > 70% |
| Data Collection | 1 | 2 | CPU > 70% |
| MLflow Service | 1 | 2 | CPU > 70% |
| Strategy Tasks | 0 | 10 | Step Functions demand |
| Live Trading (v9.1) | | | |
| Strategy Container (Live) | 1 | 1 | N/A (1 per strategy) |
| Strategy Container (Dry-run) | 1 | 1 | N/A (1 per strategy) |
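
The CPU-triggered rows in the table above can be read as a bounded scale-out rule. The sketch below is illustrative only: the table does not say whether the real policy is target tracking or step scaling, so the one-task step and the absence of scale-in logic are assumptions, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingBounds:
    min_tasks: int
    max_tasks: int


# Min/max values copied from the Horizontal Scaling table.
BOUNDS = {
    "backend": ScalingBounds(1, 3),
    "data-collection": ScalingBounds(1, 2),
    "mlflow": ScalingBounds(1, 2),
}


def next_desired_count(
    current: int,
    cpu_percent: float,
    bounds: ScalingBounds,
    threshold: float = 70.0,
) -> int:
    """Add one task when CPU exceeds the threshold, clamped to the bounds."""
    target = current + 1 if cpu_percent > threshold else current
    return max(bounds.min_tasks, min(bounds.max_tasks, target))


scaled_out = next_desired_count(1, 85.0, BOUNDS["backend"])  # above threshold
at_ceiling = next_desired_count(3, 95.0, BOUNDS["backend"])  # already at max
unchanged = next_desired_count(2, 40.0, BOUNDS["backend"])   # below threshold
```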

Vertical Scaling (Future)

| Component | Current | Upgrade Path |
|---|---|---|
| Backend Service | 0.5 vCPU, 1GB | 1 vCPU, 2GB |
| Strategy Container | 1 vCPU, 2GB | 2 vCPU, 4GB |
| RDS | db.t4g.micro | db.t4g.small |

6. High Availability Design

Multi-AZ Components

| Component | AZ-1 (eu-central-1a) | AZ-2 (eu-central-1b) |
|---|---|---|
| Public Subnet | 10.0.1.0/24 | 10.0.2.0/24 |
| Private Subnet | 10.0.11.0/24 | 10.0.12.0/24 |
| Database Subnet | 10.0.21.0/24 | 10.0.22.0/24 |
| NAT Instance | Primary | Standby (ASG) |
| ALB | Active | Active |
| ECS Tasks | Distributed | Distributed |
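
The subnet layout above can be checked mechanically with Python's standard `ipaddress` module. The sketch confirms that every listed subnet falls inside the VPC range 10.0.0.0/16 shown in the system diagram and that no two subnets overlap. The dictionary keys are illustrative labels, not resource names.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")

# CIDR blocks copied from the Multi-AZ Components table.
SUBNETS = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "private-a": "10.0.11.0/24",
    "private-b": "10.0.12.0/24",
    "database-a": "10.0.21.0/24",
    "database-b": "10.0.22.0/24",
}

networks = [ipaddress.ip_network(cidr) for cidr in SUBNETS.values()]

# Every subnet must sit inside the VPC range.
all_inside_vpc = all(net.subnet_of(VPC) for net in networks)

# No pair of subnets may share addresses.
any_overlap = any(a.overlaps(b) for a, b in combinations(networks, 2))

print(f"all inside VPC: {all_inside_vpc}, overlaps: {any_overlap}")
```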

Failure Scenarios

| Failure | Impact | Recovery |
|---|---|---|
| Single ECS task | Request retry | ECS auto-restart (30s) |
| NAT Instance | No egress | ASG launches replacement (2 min) |
| Single AZ | 50% capacity | Traffic shifts to healthy AZ |
| RDS | Database unavailable | Promote read replica (future) |
| Step Functions execution | Workflow fails | DLQ + manual retry |

7. Technology Decisions Summary

| Layer | Technology | Alternatives Considered | Rationale |
|---|---|---|---|
| API Gateway | AWS API Gateway | Custom FastAPI | 98% cost savings, managed |
| Compute | ECS Fargate | EKS, EC2 | Serverless, simple |
| Orchestration | Step Functions | Airflow, Temporal | Native AWS, pay-per-use |
| Queue | SQS FIFO | EventBridge, SNS | Ordering, exactly-once |
| State | DynamoDB | Redis, PostgreSQL | Serverless, TTL |
| Auth | Cognito | Auth0, Keycloak | Native AWS, free tier |
| IaC | Pulumi | Terraform, CDK | Python native, type-safe |
| Monitoring | CloudWatch | Datadog, Prometheus | Native AWS, integrated |

8. Integration Points

External Systems

| System | Integration | Protocol | Purpose |
|---|---|---|---|
| Binance | CCXT Library | REST/WebSocket | Market data |
| MLflow | HTTP API | REST | Experiment tracking |
| Freqtrade | CLI | Process | Backtesting engine |

Internal APIs

| API | Base URL | Auth | Rate Limit |
|---|---|---|---|
| TradAI API | https://api.tradai.example.com | JWT (Cognito) | 100 req/5min |
| MLflow | https://mlflow.tradai.example.com | Basic Auth | None |

9. Deployment Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    CI/CD Pipeline (GitHub Actions)               │
│                                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ Build Stage │ ─► │ Test Stage  │ ─► │ Deploy Stage        │  │
│  │             │    │             │    │                     │  │
│  │ - just fmt  │    │ - just test │    │ - pulumi preview    │  │
│  │ - just lint │    │ - just check│    │ - pulumi up         │  │
│  │ - docker    │    │             │    │ - ecs deploy        │  │
│  │   build     │    │             │    │                     │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│                                                ▼                  │
│                                         ┌─────────────┐          │
│                                         │    ECR      │          │
│                                         │  (Images)   │          │
│                                         └─────────────┘          │
└─────────────────────────────────────────────────────────────────┘
                                    ┌─────────────────────┐
                                    │     AWS Account     │
                                    │                     │
                                    │  ECS ◄─ ECR Images  │
                                    │  Lambda ◄─ S3 Code  │
                                    │  Step Functions     │
                                    └─────────────────────┘

10. Next Steps

  1. Review 03-VPC-NETWORKING.md for detailed network design
  2. Review 04-SECURITY.md for security controls
  3. Review 05-SERVICES.md for service implementations

Changelog

| Version | Date | Changes |
|---|---|---|
| 9.2.1 | 2026-03-28 | Fixed Lambda table (18 actual incl. update-nat-routes), live trading resources (1vCPU/2GB), added Mermaid diagram |

Dependencies

| If This Changes | Update This Doc |
|---|---|
| infra/compute/modules/lambda_funcs.py | Lambda Functions table |
| infra/shared/tradai_infra_shared/config.py SERVICES | Service specs table |
| New service or Lambda added | Component inventory |