
TradAI Final Architecture - Architecture Overview

Version: 9.2.1 | Date: 2025-12-09


System Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                    USERS                                          │
│                           (CLI / Web UI / API Clients)                           │
└──────────────────────────────────────┬──────────────────────────────────────────┘
                                       │ HTTPS (443)
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            AWS API Gateway (HTTP API)                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐                 │
│  │ JWT Authorizer  │  │ Rate Limiting   │  │ Request         │                 │
│  │ (Cognito)       │  │ (100 req/5min)  │  │ Validation      │                 │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘                 │
│                                                                                   │
│  Routes:                                                                         │
│  ├─ GET  /strategies/*     → VPC Link → Backend Service                        │
│  ├─ GET  /data/*           → VPC Link → Backend Service                        │
│  ├─ POST /backtest         → SQS Integration → Backtest Queue                  │
│  ├─ POST /deploy           → SQS Integration → Deploy Queue                    │
│  ├─ GET  /backtest/{id}    → VPC Link → Backend Service (DynamoDB query)       │
│  └─ ANY  /mlflow/*         → VPC Link → MLflow Service                         │
└────────────────┬─────────────────────────┬──────────────────────────────────────┘
                 │                         │
    ┌────────────┘                         └────────────┐
    │ VPC Link                                          │ SQS Integration
    ▼                                                   ▼
┌─────────────────────────────────────────┐   ┌─────────────────────────────────┐
│          VPC (10.0.0.0/16)              │   │       SQS Queues (FIFO)         │
│                                          │   │                                 │
│  ┌────────────────────────────────────┐ │   │  ┌─────────────────────────┐   │
│  │      Application Load Balancer     │ │   │  │   Backtest Queue        │   │
│  │      (Public Subnets)              │ │   │  │   ├─ Visibility: 15min  │   │
│  └──────────────┬─────────────────────┘ │   │  │   ├─ Retention: 4 days  │   │
│                 │                        │   │  │   └─ Max Receives: 3    │   │
│  ┌──────────────┴─────────────────────┐ │   │  └─────────────┬───────────┘   │
│  │      Private Subnets               │ │   │                │               │
│  │                                    │ │   │  ┌─────────────┴───────────┐   │
│  │  ┌──────────────────────────────┐  │ │   │  │   Dead Letter Queue     │   │
│  │  │    Backend API Service       │  │ │   │  │   (Failed messages)     │   │
│  │  │    (ECS Fargate)             │  │ │   │  └─────────────────────────┘   │
│  │  │    ├─ Port 8000              │  │ │   └─────────────────────────────────┘
│  │  │    ├─ 0.5 vCPU, 1GB          │  │ │                 │
│  │  │    └─ FastAPI                │  │ │                 │ Lambda Trigger
│  │  └──────────────────────────────┘  │ │                 ▼
│  │                                    │ │   ┌─────────────────────────────────┐
│  │  ┌──────────────────────────────┐  │ │   │     Lambda: SQS Consumer        │
│  │  │    Data Collection Service   │  │ │   │     ├─ Processes messages       │
│  │  │    (ECS Fargate)             │  │ │   │     ├─ Idempotency check        │
│  │  │    ├─ Port 8002              │  │ │   │     └─ Starts Step Functions    │
│  │  │    ├─ 0.25 vCPU, 512MB       │  │ │   └─────────────┬───────────────────┘
│  │  │    └─ ArcticDB Client        │  │ │                 │
│  │  └──────────────────────────────┘  │ │                 ▼
│  │                                    │ │   ┌─────────────────────────────────┐
│  │  ┌──────────────────────────────┐  │ │   │   Step Functions (STANDARD)     │
│  │  │    MLflow Service            │  │ │   │                                 │
│  │  │    (ECS Fargate)             │  │ │   │   Backtest Workflow:            │
│  │  │    ├─ Port 5000              │  │ │   │   ┌─────────────────────────┐   │
│  │  │    ├─ 0.5 vCPU, 1GB          │  │ │   │   │ 1. ValidateStrategy     │   │
│  │  │    └─ RDS PostgreSQL         │  │ │   │   │ 2. CheckDataFreshness   │   │
│  │  └──────────────────────────────┘  │ │   │   │    (Parallel)           │   │
│  │                                    │ │   │   │ 3. PrepareConfig        │   │
│  │  ┌──────────────────────────────┐  │ │   │   │ 4. RunBacktest          │──┼──► ECS Task
│  │  │    Lambda Functions (VPC)    │  │ │   │   │ 5. TransformResults     │   │
│  │  │    ├─ validate-strategy      │  │ │   │   │ 6. UpdateState          │   │
│  │  │    ├─ data-collection-proxy  │  │ │   │   │ 7. Cleanup              │   │
│  │  │    ├─ transform-results      │  │ │   │   │ 8. Notify               │   │
│  │  │    ├─ cleanup-resources      │  │ │   │   └─────────────────────────┘   │
│  │  │    ├─ notify-completion      │  │ │   └─────────────────────────────────┘
│  │  │    └─ sqs-consumer           │  │ │
│  │  └──────────────────────────────┘  │ │
│  └────────────────────────────────────┘ │
│                                          │
│  ┌────────────────────────────────────┐ │
│  │      Database Subnets              │ │
│  │  ┌──────────────────────────────┐  │ │
│  │  │    RDS PostgreSQL            │  │ │
│  │  │    (MLflow Backend)          │  │ │
│  │  │    ├─ db.t4g.micro           │  │ │
│  │  │    └─ Single-AZ (dev)        │  │ │
│  │  └──────────────────────────────┘  │ │
│  └────────────────────────────────────┘ │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              STORAGE LAYER                                        │
│                                                                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐│
│  │   S3 Configs    │  │   S3 Results    │  │   ArcticDB      │  │   ECR       ││
│  │   (SSE-S3)      │  │   (SSE-S3)      │  │   (S3-backed)   │  │   Images    ││
│  │                 │  │                 │  │                 │  │             ││
│  │ strategies/     │  │ backtests/      │  │ futures/        │  │ strategies/ ││
│  │ base-configs/   │  │ experiments/    │  │ spot/           │  │ services/   ││
│  │ temp/           │  │ mlflow-artifacts│  │                 │  │             ││
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────┘│
│                                                                                   │
│  ┌─────────────────┐  ┌─────────────────┐                                       │
│  │   DynamoDB      │  │ Secrets Manager │                                       │
│  │   (Workflow     │  │                 │                                       │
│  │    State)       │  │ tradai/mlflow   │                                       │
│  │                 │  │ tradai/binance  │                                       │
│  │ PK: run_id      │  │ tradai/db       │                                       │
│  │ TTL: 7 days     │  │                 │                                       │
│  └─────────────────┘  └─────────────────┘                                       │
└─────────────────────────────────────────────────────────────────────────────────┘

Component Inventory

Long-Running Services (24/7)

| Service | Purpose | Resources | Port | Cost/Month |
|---|---|---|---|---|
| Backend API Service | API endpoints, orchestration triggers | 0.5 vCPU, 1GB | 8000 | $14.60 |
| Data Collection Service | Data freshness checks, metadata queries | 0.25 vCPU, 512MB | 8002 | $7.30 |
| MLflow Service | Experiment tracking, model registry | 0.5 vCPU, 1GB | 5000 | $14.60 |

On-Demand Tasks (Pay-per-use)

| Task | Purpose | Resources | Duration | Cost/Run |
|---|---|---|---|---|
| Strategy Service Task | Config preparation, validation | 0.5 vCPU, 1GB | 5-10 min | $0.03 |
| Strategy Container | Backtest execution (Freqtrade) | 1 vCPU, 2GB | 5-60 min | $0.02-0.20 |

Live Trading Services (v9.1)

| Service | Purpose | Resources | Mode | Cost/Month |
|---|---|---|---|---|
| Strategy Container (Live) | Production trading per strategy | 0.5 vCPU, 1GB | ECS Service (non-Spot) | $15.33 |
| Strategy Container (Dry-run) | Paper trading per strategy | 0.5 vCPU, 1GB | ECS Service (non-Spot) | $15.33 |

Note: Strategy containers are unified: the same image handles backtest, hyperopt, dry-run, and live modes. Config is loaded from MLflow tags and S3 at runtime (see 11-LIVE-TRADING.md).

Lambda Functions

| Function | Purpose | Memory | Timeout | Trigger |
|---|---|---|---|---|
| validate-strategy | Check ECR image + S3 config | 256MB | 30s | Step Functions |
| data-collection-proxy | Call Data Collection Service | 256MB | 60s | Step Functions |
| transform-results | Format backtest results | 512MB | 60s | Step Functions |
| cleanup-resources | Delete temp S3 files | 256MB | 30s | Step Functions |
| notify-completion | Send notifications | 256MB | 30s | Step Functions |
| sqs-consumer | Process backtest queue | 256MB | 60s | SQS |
| check-performance | Validate metrics | 256MB | 30s | Step Functions |
| audit-logger | Log operations | 256MB | 30s | EventBridge |
| orphan-scanner | Detect stuck RUNNING jobs (v9.2.1) | 128MB | 60s | EventBridge (15 min) |
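
The orphan-scanner row describes detecting jobs stuck in RUNNING. A minimal sketch of that check follows, assuming each workflow-state record carries `run_id`, `status`, and a `started_at` timestamp. The field names, the `find_orphans` helper, and the 90-minute cutoff are all hypothetical; the cutoff is chosen only because it leaves some slack over the 10-70 minute workflow duration cited in this overview.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a run counts as stuck once it has been RUNNING longer than this.
STUCK_AFTER = timedelta(minutes=90)


def find_orphans(runs, now, stuck_after=STUCK_AFTER):
    """Return run_ids that are still RUNNING and started before the cutoff."""
    cutoff = now - stuck_after
    return [
        run["run_id"]
        for run in runs
        if run["status"] == "RUNNING" and run["started_at"] < cutoff
    ]


# Illustrative records only; a real scanner would read them from DynamoDB.
now = datetime(2025, 12, 9, 12, 0, tzinfo=timezone.utc)
runs = [
    {"run_id": "bt-001", "status": "RUNNING", "started_at": now - timedelta(minutes=20)},
    {"run_id": "bt-002", "status": "RUNNING", "started_at": now - timedelta(hours=3)},
    {"run_id": "bt-003", "status": "COMPLETED", "started_at": now - timedelta(hours=5)},
]
orphans = find_orphans(runs, now)
print(orphans)
```

Only `bt-002` is reported here: `bt-001` is still within the cutoff and `bt-003` is no longer RUNNING.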

Managed Services

| Service | Purpose | Configuration | Cost/Month |
|---|---|---|---|
| AWS API Gateway | API management, auth, routing | HTTP API | $3.50 |
| Step Functions | Workflow orchestration | Standard type | $0.50 |
| SQS | Message queuing | FIFO + DLQ | $0.40 |
| DynamoDB | Workflow state | On-demand, TTL | $2.00 |
| Cognito | User authentication | User pool | $0 (free tier) |
| Secrets Manager | Credentials storage | 5 secrets | $2.00 |
| CloudTrail | Audit logging | Single trail | $2.00 |
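
To get a rough always-on baseline, the monthly figures from the Long-Running Services and Managed Services tables can be summed. The sketch below does only that arithmetic. It covers just the line items listed in those two tables, so anything not priced there (for example RDS, the ALB, NAT, on-demand tasks, and live trading containers) is outside the total.

```python
from decimal import Decimal

# Monthly figures copied from the Long-Running Services table.
long_running = {
    "Backend API Service": Decimal("14.60"),
    "Data Collection Service": Decimal("7.30"),
    "MLflow Service": Decimal("14.60"),
}

# Monthly figures copied from the Managed Services table.
managed = {
    "AWS API Gateway": Decimal("3.50"),
    "Step Functions": Decimal("0.50"),
    "SQS": Decimal("0.40"),
    "DynamoDB": Decimal("2.00"),
    "Cognito": Decimal("0.00"),
    "Secrets Manager": Decimal("2.00"),
    "CloudTrail": Decimal("2.00"),
}

long_running_total = sum(long_running.values(), Decimal("0"))
managed_total = sum(managed.values(), Decimal("0"))
baseline_total = long_running_total + managed_total

print(f"Long-running services: ${long_running_total}")
print(f"Managed services:      ${managed_total}")
print(f"Listed baseline:       ${baseline_total}")
```

`Decimal` is used so the currency amounts add up exactly rather than picking up floating-point rounding noise.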

Data Flow Patterns

Pattern 1: Simple Query (Synchronous)

User Request                    Response Time: <200ms
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
VPC Link ──► ALB ──► Backend Service
                          ├─► GET /strategies ──► MLflow Service ──► RDS
                          └─► GET /data/status ──► Data Collection ──► ArcticDB (S3)

Use Cases:
- List strategies
- Get strategy details
- Check data freshness
- Get workflow status

Pattern 2: Complex Operation (Asynchronous)

User Request                    Response Time: <100ms (queued)
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
SQS Integration ──► Backtest Queue (FIFO)
     │                    │
     │                    └─► Response: { ticket_id, status: "QUEUED" }
     ▼ (Async)
Lambda: SQS Consumer
     ├─► Check Idempotency (DynamoDB)
     ├─► Register Workflow (DynamoDB: status=RUNNING)
     └─► Start Step Functions Execution
         Backtest Workflow (10-70 minutes)
              ├─► ValidateStrategy (Lambda)
              ├─► CheckDataFreshness (Lambda → Data Collection)
              ├─► PrepareConfig (ECS Task)
              ├─► RunBacktest (ECS Task: Strategy Container)
              ├─► TransformResults (Lambda)
              ├─► UpdateState (DynamoDB: status=COMPLETED)
              ├─► Cleanup (Lambda)
              └─► Notify (Lambda → SNS/Email)

Use Cases:
- Run backtest
- Deploy strategy
- Full data sync
- Register new strategy
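
The SQS consumer steps above (idempotency check, workflow registration, Step Functions start) can be sketched as follows. This is an illustration under stated assumptions, not the project's consumer. `InMemoryStateStore`, `handle_message`, and the message shape are hypothetical, and the dict-backed store stands in for a DynamoDB conditional write. The sketch also folds the idempotency check and the registration into a single put-if-absent step, which is one common way to make the pair atomic.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # mirrors the 7-day TTL on the workflow-state table


class InMemoryStateStore:
    """Hypothetical stand-in for the DynamoDB workflow-state table (PK: run_id)."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, run_id, item):
        # DynamoDB could do this atomically with a conditional write;
        # a plain dict check is enough for a single-process sketch.
        if run_id in self.items:
            return False
        self.items[run_id] = item
        return True


def handle_message(message, store, start_execution, now=None):
    """Process one queue message; return True if a workflow was started."""
    now = int(time.time()) if now is None else now
    run_id = message["run_id"]
    item = {"status": "RUNNING", "started_at": now, "ttl": now + SEVEN_DAYS}
    if not store.put_if_absent(run_id, item):
        return False  # duplicate delivery: this run is already registered
    start_execution(run_id, message)
    return True


# Usage: the same message delivered twice starts the workflow only once.
started = []
store = InMemoryStateStore()
message = {"run_id": "bt-001", "strategy": "example-strategy"}
first = handle_message(message, store, lambda rid, msg: started.append(rid), now=1_765_281_600)
second = handle_message(message, store, lambda rid, msg: started.append(rid), now=1_765_281_660)
print(first, second, started)
```

If `start_execution` fails after the record is written, the run stays RUNNING with no execution behind it, which is the kind of case a stuck-job scanner would need to catch.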

Pattern 3: Status Polling

User: GET /backtest/{run_id}/status
AWS API Gateway ──► Backend Service
                    DynamoDB Query
                          ├─► If COMPLETED: Include results from MLflow
                          └─► Return: { status, progress, result?, mlflow_url }
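
The branching in this pattern can be sketched as a small response builder. The `build_status_response` function, the item field names, and the `fetch_results` callback (standing in for the MLflow lookup) are hypothetical. Only the response keys `status`, `progress`, `result`, and `mlflow_url` come from the diagram above.

```python
def build_status_response(item, fetch_results):
    """Shape a polling response from a workflow-state item.

    `result` is attached only once the run is COMPLETED, matching the
    optional `result?` field in the pattern above.
    """
    response = {
        "status": item["status"],
        "progress": item.get("progress"),
        "mlflow_url": item.get("mlflow_url"),
    }
    if item["status"] == "COMPLETED":
        response["result"] = fetch_results(item["run_id"])
    return response


def fake_fetch(run_id):
    # Placeholder payload; a real implementation would query MLflow.
    return {"run_id": run_id, "total_trades": 12}


running = build_status_response(
    {"run_id": "bt-001", "status": "RUNNING", "progress": 40}, fake_fetch
)
done = build_status_response(
    {"run_id": "bt-001", "status": "COMPLETED", "progress": 100}, fake_fetch
)
print(running)
print(done)
```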

Pattern 4: Live Trading (v9.1)

Container Startup (ECS Service - Always On)
     ├─► Read minimal env vars: STRATEGY_NAME, STRATEGY_STAGE, TRADING_MODE
MLflow Config Loading
     ├─► MLflowAdapter.get_model_version(name, stage)
     │       └─► Returns tags: timeframe, pairs, warmup_days, config_s3_path
     ├─► ConfigMergeService.load_config(config_s3_path)
     │       └─► Load full Freqtrade config from S3
     └─► Apply runtime overrides (if any)
Data Warmup
     ├─► ArcticDB direct access (S3) - NOT via Data Collection Service
     │       └─► Load warmup_days of historical OHLCV data
     └─► Initialize Freqtrade with warmed dataframes
Trading Loop (24/7)
     ├─► Freqtrade strategy execution
     │       └─► Exchange API (Binance) for orders
     ├─► Health Reporting (every 60s)
     │       ├─► DynamoDB heartbeat update
     │       └─► CloudWatch metrics (PnL, trades, latency)
     └─► EventBridge + Lambda monitors session health
              └─► SNS alert if heartbeat stale > 3 minutes

Use Cases:
- Live production trading
- Dry-run (paper trading)
- Strategy hot-swap via MLflow stage change
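
The config loading step layers three sources: the full Freqtrade config from S3, the MLflow tags, and any runtime overrides. A minimal sketch of that layering follows. It is not the project's `ConfigMergeService`: the `deep_merge` and `build_runtime_config` helpers are hypothetical, the comma-separated format of the `pairs` tag is an assumption, and so is the choice to let tags win over the S3 file. Runtime overrides are applied last, as in the flow above.

```python
import copy


def deep_merge(base, extra):
    """Return a copy of base with extra layered on top, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_runtime_config(s3_config, tags, overrides=None):
    """Layer MLflow tags, then runtime overrides, over the S3 config."""
    # Assumption: tag values are strings, so pairs arrive comma-separated.
    from_tags = {
        "timeframe": tags["timeframe"],
        "exchange": {"pair_whitelist": tags["pairs"].split(",")},
    }
    config = deep_merge(s3_config, from_tags)
    return deep_merge(config, overrides or {})


# Illustrative inputs only.
s3_config = {
    "timeframe": "1h",
    "dry_run": True,
    "exchange": {"name": "binance", "pair_whitelist": []},
}
tags = {"timeframe": "5m", "pairs": "BTC/USDT,ETH/USDT", "warmup_days": "30"}
config = build_runtime_config(s3_config, tags, overrides={"dry_run": False})
print(config)
```

Each merge works on a deep copy, so the S3 config object is left untouched and can be reused for another strategy.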


Service Communication Matrix

| From | To | Protocol | Path | Purpose |
|---|---|---|---|---|
| API Gateway | Backend Service | HTTP/VPC Link | /strategies/* | Strategy queries |
| API Gateway | Backend Service | HTTP/VPC Link | /data/* | Data queries |
| API Gateway | MLflow Service | HTTP/VPC Link | /mlflow/* | MLflow proxy |
| API Gateway | SQS | AWS Integration | Queue | Async operations |
| Lambda | Data Collection | HTTP/VPC | :8002 | Data checks |
| Lambda | Step Functions | AWS SDK | StartExecution | Workflow trigger |
| Lambda | DynamoDB | AWS SDK | GetItem/PutItem | State management |
| Step Functions | Lambda | AWS Integration | Invoke | Utility functions |
| Step Functions | ECS | AWS Integration | RunTask.sync | Task execution |
| Step Functions | DynamoDB | AWS Integration | UpdateItem | State updates |
| Backend Service | MLflow Service | HTTP | :5000 | Experiment queries |
| Backend Service | Data Collection | HTTP | :8002 | Freshness checks |
| Data Collection | ArcticDB | S3 API | s3:// | Time series data |
| Strategy Container | ArcticDB | S3 API | s3:// | OHLCV data |
| Strategy Container | MLflow Service | HTTP | :5000 | Log metrics |
| Live Trading (v9.1) | | | | |
| Strategy Container (Live) | MLflow Service | HTTP | :5000 | Load config tags |
| Strategy Container (Live) | S3 | S3 API | s3:// | Load config JSON |
| Strategy Container (Live) | ArcticDB | S3 API | s3:// | Warmup data |
| Strategy Container (Live) | DynamoDB | AWS SDK | UpdateItem | Heartbeat |
| Strategy Container (Live) | Binance | HTTPS | api.binance.com | Orders |
| Lambda (health-check) | DynamoDB | AWS SDK | Query | Session monitoring |
| EventBridge | Lambda (health-check) | AWS Integration | rate(5 min) | Scheduled check |

Scaling Strategy

Horizontal Scaling

| Component | Min | Max | Trigger |
|---|---|---|---|
| Backend Service | 1 | 3 | CPU > 70% |
| Data Collection | 1 | 2 | CPU > 70% |
| MLflow Service | 1 | 2 | CPU > 70% |
| Strategy Tasks | 0 | 10 | Step Functions demand |
| Live Trading (v9.1) | | | |
| Strategy Container (Live) | 1 | 1 | N/A (1 per strategy) |
| Strategy Container (Dry-run) | 1 | 1 | N/A (1 per strategy) |
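
To show how the Min/Max bounds interact with the 70% CPU trigger, here is a simplified proportional rule. It is an illustration only. The table does not say which ECS scaling policy type is used, and `desired_count` is a hypothetical helper rather than the algorithm AWS applies.

```python
import math


def desired_count(current, cpu_percent, minimum, maximum, target=70.0):
    """Scale the task count in proportion to CPU load, clamped to [minimum, maximum]."""
    proposed = math.ceil(current * cpu_percent / target)
    return max(minimum, min(maximum, proposed))


# Backend Service bounds from the table: min 1, max 3.
scale_out = desired_count(current=1, cpu_percent=95, minimum=1, maximum=3)
capped = desired_count(current=3, cpu_percent=100, minimum=1, maximum=3)
scale_in = desired_count(current=2, cpu_percent=20, minimum=1, maximum=3)
print(scale_out, capped, scale_in)
```

With these inputs, one task at 95% CPU grows to two, three saturated tasks stay capped at the maximum of three, and two lightly loaded tasks shrink back to the minimum of one.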

Vertical Scaling (Future)

| Component | Current | Upgrade Path |
|---|---|---|
| Backend Service | 0.5 vCPU, 1GB | 1 vCPU, 2GB |
| Strategy Container | 1 vCPU, 2GB | 2 vCPU, 4GB |
| RDS | db.t4g.micro | db.t4g.small |

High Availability Design

Multi-AZ Components

| Component | AZ-1 (us-east-1a) | AZ-2 (us-east-1b) |
|---|---|---|
| Public Subnet | 10.0.1.0/24 | 10.0.2.0/24 |
| Private Subnet | 10.0.11.0/24 | 10.0.12.0/24 |
| Database Subnet | 10.0.20.0/24 | 10.0.21.0/24 |
| NAT Instance | Primary | Standby (ASG) |
| ALB | Active | Active |
| ECS Tasks | Distributed | Distributed |
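
The subnet plan can be sanity-checked with the standard-library `ipaddress` module. The sketch below verifies that every subnet from the table sits inside the 10.0.0.0/16 VPC and that no two subnets overlap. The subnet labels and the `validate_subnets` helper are hypothetical names used for illustration.

```python
from ipaddress import ip_network
from itertools import combinations

VPC = ip_network("10.0.0.0/16")

# CIDRs copied from the Multi-AZ table; the keys are illustrative labels.
SUBNETS = {
    "public-1a": ip_network("10.0.1.0/24"),
    "public-1b": ip_network("10.0.2.0/24"),
    "private-1a": ip_network("10.0.11.0/24"),
    "private-1b": ip_network("10.0.12.0/24"),
    "database-1a": ip_network("10.0.20.0/24"),
    "database-1b": ip_network("10.0.21.0/24"),
}


def validate_subnets(vpc, subnets):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


problems = validate_subnets(VPC, SUBNETS)
print(problems or "subnet plan is consistent")
```

A check like this could run in CI before infrastructure changes are applied, so a mistyped CIDR fails early instead of at deploy time.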

Failure Scenarios

| Failure | Impact | Recovery |
|---|---|---|
| Single ECS task | Request retry | ECS auto-restart (30s) |
| NAT Instance | No egress | ASG launches replacement (2 min) |
| Single AZ | 50% capacity | Traffic shifts to healthy AZ |
| RDS | Database unavailable | Promote read replica (future) |
| Step Functions execution | Workflow fails | DLQ + manual retry |

Technology Decisions Summary

| Layer | Technology | Alternatives Considered | Rationale |
|---|---|---|---|
| API Gateway | AWS API Gateway | Custom FastAPI | 98% cost savings, managed |
| Compute | ECS Fargate | EKS, EC2 | Serverless, simple |
| Orchestration | Step Functions | Airflow, Temporal | Native AWS, pay-per-use |
| Queue | SQS FIFO | EventBridge, SNS | Ordering, exactly-once |
| State | DynamoDB | Redis, PostgreSQL | Serverless, TTL |
| Auth | Cognito | Auth0, Keycloak | Native AWS, free tier |
| IaC | Pulumi | Terraform, CDK | Python native, type-safe |
| Monitoring | CloudWatch | Datadog, Prometheus | Native AWS, integrated |

Integration Points

External Systems

| System | Integration | Protocol | Purpose |
|---|---|---|---|
| Binance | CCXT Library | REST/WebSocket | Market data |
| MLflow | HTTP API | REST | Experiment tracking |
| Freqtrade | CLI | Process | Backtesting engine |

Internal APIs

| API | Base URL | Auth | Rate Limit |
|---|---|---|---|
| TradAI API | https://api.tradai.smartml.me | JWT (Cognito) | 100 req/5min |
| MLflow | https://mlflow.tradai.smartml.me | Basic Auth | None |

Deployment Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Bitbucket Pipelines                           │
│                                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ Build Stage │ ─► │ Test Stage  │ ─► │ Deploy Stage        │  │
│  │             │    │             │    │                     │  │
│  │ - pants fmt │    │ - pants test│    │ - pulumi preview    │  │
│  │ - pants lint│    │ - pants     │    │ - pulumi up         │  │
│  │ - docker    │    │   check     │    │ - ecs deploy        │  │
│  │   build     │    │             │    │                     │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│                                                ▼                  │
│                                         ┌─────────────┐          │
│                                         │    ECR      │          │
│                                         │  (Images)   │          │
│                                         └─────────────┘          │
└─────────────────────────────────────────────────────────────────┘
                                    ┌─────────────────────┐
                                    │     AWS Account     │
                                    │                     │
                                    │  ECS ◄─ ECR Images  │
                                    │  Lambda ◄─ S3 Code  │
                                    │  Step Functions     │
                                    └─────────────────────┘

Next Steps

  1. Review 03-VPC-NETWORKING.md for detailed network design
  2. Review 04-SECURITY.md for security controls
  3. Review 05-SERVICES.md for service implementations