# TradAI Final Architecture - Architecture Overview

Version: 9.2.1 | Date: 2025-12-09
TL;DR: TradAI runs 4 always-on ECS Fargate services (Backend API, Data Collection, Strategy Service, MLflow) plus on-demand strategy containers and 18 Lambda functions inside a single VPC across 2 AZs. API Gateway handles auth/routing, Step Functions orchestrates multi-step backtests, and SQS FIFO queues provide backpressure. Storage spans S3, ArcticDB, DynamoDB, and RDS PostgreSQL.
## Key Architecture Principles

- **Serverless-first** -- use managed services (API Gateway, Step Functions, SQS, DynamoDB) to minimize ops overhead.
- **Pay-per-use compute** -- on-demand ECS tasks for backtests; Fargate Spot for cost savings.
- **Async by default** -- backtest submissions return immediately with a ticket; results are polled later.
- **Defense in depth** -- WAF + Cognito JWT + security groups + NACLs + encryption at rest.

## 1. System Architecture Diagram

```mermaid
graph LR
subgraph Users
CLI[CLI / Web UI / API Clients]
end
subgraph Edge["Edge Layer"]
APIGW[AWS API Gateway<br/>JWT + Rate Limit + WAF]
end
subgraph SQSLayer["Async Queuing"]
BQ[Backtest Queue<br/>SQS FIFO]
DLQ[Dead Letter Queue]
BQ --> DLQ
end
subgraph VPC["VPC 10.0.0.0/16"]
subgraph PublicSubnets["Public Subnets"]
ALB[Application Load Balancer]
end
subgraph PrivateSubnets["Private Subnets"]
Backend[Backend API<br/>0.5 vCPU · 1GB · :8000]
DataCol[Data Collection<br/>0.25 vCPU · 512MB · :8002]
MLflow[MLflow<br/>0.5 vCPU · 1GB · :5000]
StratSvc[Strategy Service<br/>0.5 vCPU · 1GB · :8003]
Lambdas[Lambda Functions<br/>18 functions]
end
subgraph DBSubnets["Database Subnets"]
RDS[(RDS PostgreSQL<br/>db.t4g.micro)]
end
end
subgraph Orchestration
SFN[Step Functions<br/>Backtest + Retraining]
ECSTask[Strategy Container<br/>1 vCPU · 2GB]
end
subgraph Storage["Storage Layer"]
S3Configs[S3 Configs]
S3Results[S3 Results]
ArcticDB[ArcticDB<br/>S3-backed]
ECR[ECR Images]
Dynamo[(DynamoDB<br/>Workflow State)]
Secrets[Secrets Manager]
end
CLI -->|HTTPS 443| APIGW
APIGW -->|VPC Link| ALB
APIGW -->|SQS Integration| BQ
ALB --> Backend
ALB --> MLflow
Backend --> DataCol
Backend --> MLflow
MLflow --> RDS
BQ -->|Lambda Trigger| Lambdas
Lambdas -->|StartExecution| SFN
SFN -->|RunTask| ECSTask
SFN --> Lambdas
Backend --> Dynamo
ECSTask --> ArcticDB
ECSTask --> S3Results
DataCol --> ArcticDB
Lambdas --> Dynamo
```

### Lambda VPC Placement
The diagram shows all Lambdas in the private subnets for simplicity. In practice, notify-completion and pulumi-drift-detector run outside the VPC (no VPC configuration), since they only need access to SNS/Pulumi, not to internal services.
See the repository source for the detailed ASCII version of this diagram.
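To make the two placements concrete, here is a minimal Pulumi sketch (Pulumi is the project's IaC tool, per section 7) contrasting an in-VPC function with a no-VPC function. The role ARN, archive paths, and subnet/security-group IDs are placeholders, not values from the repository.

```python
import pulumi
import pulumi_aws as aws

# Placeholders -- in the real stack these come from the IAM and VPC modules.
lambda_role_arn = "arn:aws:iam::123456789012:role/tradai-lambda-role"
private_subnet_ids = ["subnet-0aaa0000", "subnet-0bbb1111"]
lambda_sg_id = "sg-0ccc2222"

# In-VPC Lambda: subnet/SG wiring lets it reach internal services over VPC DNS.
backtest_consumer = aws.lambda_.Function(
    "backtest-consumer",
    runtime="python3.12",
    handler="handler.handler",
    role=lambda_role_arn,
    code=pulumi.FileArchive("./build/backtest_consumer.zip"),
    vpc_config=aws.lambda_.FunctionVpcConfigArgs(
        subnet_ids=private_subnet_ids,
        security_group_ids=[lambda_sg_id],
    ),
)

# Outside-VPC Lambda: vpc_config is simply omitted; it only talks to SNS.
notify_completion = aws.lambda_.Function(
    "notify-completion",
    runtime="python3.12",
    handler="handler.handler",
    role=lambda_role_arn,
    code=pulumi.FileArchive("./build/notify_completion.zip"),
)
```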
## 2. Component Inventory

### Long-Running Services (24/7)

| Service | Purpose | Resources | Port | Cost/Month |
|---|---|---|---|---|
| Backend API Service | API endpoints, orchestration triggers | 0.5 vCPU, 1GB | 8000 | $14.60 |
| Data Collection Service | Data freshness checks, metadata queries | 0.25 vCPU, 512MB | 8002 | $7.30 |
| Strategy Service | Strategy execution, config preparation, validation | 0.5 vCPU, 1GB | 8003 | $14.60 |
| MLflow Service | Experiment tracking, model registry | 0.5 vCPU, 1GB | 5000 | $14.60 |
### On-Demand Tasks (Pay-per-use)

| Task | Purpose | Resources | Duration | Cost/Run |
|---|---|---|---|---|
| Strategy Service Task | Config preparation, validation | 0.5 vCPU, 1GB | 5-10 min | $0.03 |
| Strategy Container | Backtest execution (Freqtrade) | 1 vCPU, 2GB | 5-60 min | $0.02-0.20 |
### Live Trading Services (v9.1)

| Service | Purpose | Resources | Mode | Cost/Month |
|---|---|---|---|---|
| Strategy Container (Live) | Production trading per strategy | 1 vCPU, 2GB | ECS Service (non-Spot) | $30.66 |
| Strategy Container (Dry-run) | Paper trading per strategy | 1 vCPU, 2GB | ECS Service (Spot) | ~$9.20 |
Note: Strategy containers are unified -- the same image handles backtest, hyperopt, dry-run, and live modes, with config loaded from MLflow tags + S3 at runtime (see 11-LIVE-TRADING.md). A mode-dispatch sketch follows.
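As a sketch of that unification, a single entrypoint can dispatch on TRADING_MODE. The mapping to Freqtrade subcommands below is an assumption (mode-specific flags such as --timerange or --hyperopt-loss are omitted); only the three env vars come from this doc.

```python
import os
import subprocess

# Assumed TRADING_MODE -> Freqtrade subcommand mapping; one image, four modes.
MODE_TO_COMMAND = {
    "backtest": "backtesting",
    "hyperopt": "hyperopt",
    "dry-run": "trade",   # dry_run: true is set in the merged config
    "live": "trade",
}

def main() -> None:
    mode = os.environ["TRADING_MODE"]
    strategy = os.environ["STRATEGY_NAME"]
    # Full config resolution (MLflow tags -> S3 JSON) is sketched under
    # Pattern 4 in section 3; assume it was already written to this path.
    config_path = "/tmp/strategy-config.json"
    subprocess.run(
        ["freqtrade", MODE_TO_COMMAND[mode],
         "--config", config_path, "--strategy", strategy],
        check=True,
    )

if __name__ == "__main__":
    main()
```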
### Lambda Functions

| Function | Purpose | Memory | Timeout | Trigger | Required |
|---|---|---|---|---|---|
| backtest-consumer | Consume backtest requests from SQS, launch ECS tasks | 256MB | 30s | SQS | Yes |
| sqs-consumer | Consume retraining messages, launch ECS tasks | 256MB | 30s | SQS | No |
| orphan-scanner | Scan for orphaned ECS tasks | 128MB | 60s | EventBridge (scheduled) | Yes |
| health-check | Periodic health checks on ECS services | 256MB | 60s | EventBridge (scheduled) | Yes |
| trading-heartbeat-check | Check live trading container heartbeats | 256MB | 60s | EventBridge (scheduled) | Yes |
| drift-monitor | Monitor model drift using PSI, alert on drift | 512MB | 120s | EventBridge (scheduled) | Yes |
| retraining-scheduler | Schedule model retraining based on drift or intervals | 256MB | 60s | EventBridge (scheduled) | Yes |
| pulumi-drift-detector | Detect infrastructure drift via Pulumi preview | 512MB | 300s | EventBridge (scheduled) | Yes |
| validate-strategy | Validate strategy config with MLflow version resolution | 256MB | 30s | Step Functions | No |
| data-collection-proxy | Proxy to Data Collection service for ensure-data ops | 256MB | 180s | Step Functions | No |
| update-status | Update job status in DynamoDB workflow-state table | 256MB | 30s | Step Functions | No |
| cleanup-resources | Stop orphaned ECS tasks on workflow failure | 256MB | 60s | Step Functions | No |
| notify-completion | Send notifications on pipeline completion | 128MB | 30s | Step Functions | No |
| check-retraining-needed | Evaluate drift state to determine if retraining is needed | 256MB | 30s | Step Functions | No |
| compare-models | Compare champion vs challenger models | 512MB | 120s | Step Functions | No |
| promote-model | Promote challenger to Production in MLflow | 256MB | 60s | Step Functions | No |
| model-rollback | Roll back model to previous version on degradation | 256MB | 60s | Backend API | No |
| update-nat-routes | Update private route table when NAT instance is replaced | 128MB | 120s | ASG Lifecycle Hook | Yes |
### Managed Services

| Service | Purpose | Configuration | Cost/Month |
|---|---|---|---|
| AWS API Gateway | API management, auth, routing | HTTP API | $3.50 |
| Step Functions | Workflow orchestration | Standard type | $0.50 |
| SQS | Message queuing | FIFO + DLQ | $0.40 |
| DynamoDB | Workflow state | On-demand, TTL | $2.00 |
| Cognito | User authentication | User pool | $0 (free tier) |
| Secrets Manager | Credentials storage | 5 secrets | $2.00 |
| CloudTrail | Audit logging | Single trail | $2.00 |
## 3. Data Flow Patterns

### Pattern 1: Simple Query (Synchronous)

Response Time: <200ms

```
User Request
│
▼
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
│
▼
VPC Link ──► ALB ──► Backend Service
│
├─► GET /api/v1/strategies ──► MLflow Service ──► RDS
│
└─► GET /api/v1/data/freshness ──► Data Collection ──► ArcticDB (S3)
```

**Use Cases:**
- List strategies
- Get strategy details
- Check data freshness
- Get workflow status
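As a concrete sketch of this synchronous path, the two Backend routes below fan out to MLflow and Data Collection over service-discovery hostnames in the style of section 4; the `dev` namespace, the use of MLflow's registered-models search endpoint, and the internal `/freshness` route are assumptions.

```python
import httpx
from fastapi import FastAPI

app = FastAPI()

# Cloud Map-style hostnames follow section 4's pattern; "dev" is illustrative.
MLFLOW_URL = "http://mlflow.tradai-dev.local:5000"
DATA_URL = "http://data-collection.tradai-dev.local:8002"

@app.get("/api/v1/strategies")
async def list_strategies() -> dict:
    # Strategies are backed by MLflow registered models (RDS behind MLflow).
    async with httpx.AsyncClient() as client:
        r = await client.get(f"{MLFLOW_URL}/api/2.0/mlflow/registered-models/search")
        r.raise_for_status()
        return r.json()

@app.get("/api/v1/data/freshness")
async def data_freshness() -> dict:
    # Delegate to the Data Collection service, which reads ArcticDB on S3.
    async with httpx.AsyncClient() as client:
        r = await client.get(f"{DATA_URL}/freshness")  # assumed internal route
        r.raise_for_status()
        return r.json()
```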
### Pattern 2: Complex Operation (Asynchronous)

Response Time: <100ms (queued)

```
User Request
│
▼
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
│
▼
SQS Integration ──► Backtest Queue (FIFO)
│ │
│ └─► Response: { ticket_id, status: "QUEUED" }
│
▼ (Async)
Lambda: SQS Consumer
│
├─► Check Idempotency (DynamoDB)
│
├─► Register Workflow (DynamoDB: status=RUNNING)
│
└─► Start Step Functions Execution
│
▼
Backtest Workflow v11 (10-70 minutes)
│
├─► ValidateStrategy (Lambda)
├─► EnsureData (Lambda → Data Collection)
├─► UpdateStatusRunning (Lambda → DynamoDB)
├─► RunBacktest (ECS Task: Strategy Container)
├─► HandleSuccess (Parallel)
│ ├─► UpdateStatusCompleted (Lambda)
│ └─► NotifySuccess (Lambda → SNS)
└─► On failure: CleanupResources (Lambda) → UpdateStatusFailed
```

**Use Cases:**
- Run backtest
- Deploy strategy
- Full data sync
- Register new strategy
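A minimal sketch of the backtest-consumer Lambda performing the three async steps above -- idempotency check, workflow registration, StartExecution. The environment-variable names and the message's ticket_id field follow the response shape shown above but are otherwise assumptions.

```python
import json
import os

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
sfn = boto3.client("stepfunctions")

TABLE = os.environ["WORKFLOW_STATE_TABLE"]                  # assumed env var
STATE_MACHINE_ARN = os.environ["BACKTEST_STATE_MACHINE_ARN"]  # assumed env var

def handler(event, context):
    for record in event["Records"]:          # SQS batch delivery
        body = json.loads(record["body"])
        ticket_id = body["ticket_id"]
        try:
            # Conditional put doubles as idempotency check + workflow
            # registration: a duplicate ticket is a no-op.
            dynamodb.put_item(
                TableName=TABLE,
                Item={"ticket_id": {"S": ticket_id}, "status": {"S": "RUNNING"}},
                ConditionExpression="attribute_not_exists(ticket_id)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # already processed (retry or missed FIFO dedup)
            raise
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=ticket_id,   # unique execution name doubles as a dedup key
            input=json.dumps(body),
        )
```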
### Pattern 3: Status Polling

```
User: GET /api/v1/backtests/{job_id}
│
▼
AWS API Gateway ──► Backend Service
│
▼
DynamoDB Query
│
├─► If COMPLETED: Include results from MLflow
│
└─► Return: { status, progress, result?, mlflow_url }
```
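Client-side, this pattern reduces to a poll loop against the status endpoint. A sketch, with the base URL, token handling, and intervals as assumptions; the 75-minute ceiling mirrors the 10-70 minute workflow duration from Pattern 2.

```python
import time

import requests

def wait_for_backtest(base_url: str, token: str, job_id: str,
                      interval_s: int = 15, timeout_s: int = 4500) -> dict:
    """Poll GET /api/v1/backtests/{job_id} until a terminal status."""
    deadline = time.monotonic() + timeout_s
    headers = {"Authorization": f"Bearer {token}"}
    while time.monotonic() < deadline:
        r = requests.get(f"{base_url}/api/v1/backtests/{job_id}", headers=headers)
        r.raise_for_status()
        job = r.json()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job            # COMPLETED includes result + mlflow_url
        time.sleep(interval_s)
    raise TimeoutError(f"backtest {job_id} still running after {timeout_s}s")
```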
### Pattern 4: Live Trading (v9.1)

```
Container Startup (ECS Service - Always On)
│
├─► Read minimal env vars: STRATEGY_NAME, STRATEGY_STAGE, TRADING_MODE
│
▼
MLflow Config Loading
│
├─► MLflowAdapter.get_model_version(name, stage)
│ └─► Returns tags: timeframe, pairs, warmup_days, config_s3_path
│
├─► ConfigMergeService.load_config(config_s3_path)
│ └─► Load full Freqtrade config from S3
│
└─► Apply runtime overrides (if any)
│
▼
Data Warmup
│
├─► ArcticDB direct access (S3) - NOT via Data Collection Service
│ └─► Load warmup_days of historical OHLCV data
│
└─► Initialize Freqtrade with warmed dataframes
│
▼
Trading Loop (24/7)
│
├─► Freqtrade strategy execution
│ └─► Exchange API (Binance) for orders
│
├─► Health Reporting (every 30s)
│ ├─► DynamoDB heartbeat update
│ └─► CloudWatch metrics (PnL, trades, latency)
│
└─► EventBridge + Lambda monitors session health
│
└─► SNS alert if heartbeat stale > 3 minutes
```

**Use Cases:**
- Live production trading
- Dry-run (paper trading)
- Strategy hot-swap via MLflow stage change
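A sketch of the startup config resolution shown above, calling the MLflow client and boto3 directly where the codebase wraps them in MLflowAdapter and ConfigMergeService; the dry-run override at the end is an assumed example of a runtime override.

```python
import json

import boto3
from mlflow.tracking import MlflowClient

def load_strategy_config(name: str, stage: str, mode: str) -> dict:
    """Resolve a strategy's full Freqtrade config from MLflow tags + S3."""
    client = MlflowClient()  # tracking URI taken from MLFLOW_TRACKING_URI
    version = client.get_latest_versions(name, stages=[stage])[0]
    tags = version.tags      # timeframe, pairs, warmup_days, config_s3_path

    # The config_s3_path tag points at the full Freqtrade config JSON in S3.
    bucket, _, key = tags["config_s3_path"].removeprefix("s3://").partition("/")
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    config = json.loads(body)

    # Runtime override example (assumed): non-live containers force paper trading.
    config["dry_run"] = mode != "live"
    return config

# Usage mirrors the container's minimal env vars:
# config = load_strategy_config(os.environ["STRATEGY_NAME"],
#                               os.environ["STRATEGY_STAGE"],
#                               os.environ["TRADING_MODE"])
```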
## 4. Service Communication Matrix

| From | To | Protocol | Path | Purpose |
|---|---|---|---|---|
| API Gateway | Backend Service | HTTP/VPC Link | /api/v1/strategies/* | Strategy queries |
| API Gateway | Backend Service | HTTP/VPC Link | /api/v1/data/* | Data queries |
| API Gateway | MLflow Service | HTTP/VPC Link | /mlflow/* | MLflow proxy |
| API Gateway | SQS | AWS Integration | Queue | Async operations |
| Lambda | Data Collection | HTTP/VPC | data-collection.tradai-{env}.local:8002 | Data checks |
| Lambda | Step Functions | AWS SDK | StartExecution | Workflow trigger |
| Lambda | DynamoDB | AWS SDK | GetItem/PutItem | State management |
| Step Functions | Lambda | AWS Integration | Invoke | Utility functions |
| Step Functions | ECS | AWS Integration | RunTask.sync | Task execution |
| Step Functions | DynamoDB | AWS Integration | UpdateItem | State updates |
| Backend Service | MLflow Service | HTTP | :5000 | Experiment queries |
| Backend Service | Data Collection | HTTP | :8002 | Freshness checks |
| Data Collection | ArcticDB | S3 API | s3:// | Time series data |
| Strategy Container | ArcticDB | S3 API | s3:// | OHLCV data |
| Strategy Container | MLflow Service | HTTP | :5000 | Log metrics |

Live Trading (v9.1):

| From | To | Protocol | Path | Purpose |
|---|---|---|---|---|
| Strategy Container (Live) | MLflow Service | HTTP | :5000 | Load config tags |
| Strategy Container (Live) | S3 | S3 API | s3:// | Load config JSON |
| Strategy Container (Live) | ArcticDB | S3 API | s3:// | Warmup data |
| Strategy Container (Live) | DynamoDB | AWS SDK | UpdateItem | Heartbeat |
| Strategy Container (Live) | Binance | HTTPS | api.binance.com | Orders |
| Lambda (health-check) | DynamoDB | AWS SDK | Query | Session monitoring |
| EventBridge | Lambda (health-check) | AWS Integration | rate(2 min) | Scheduled check |
## 5. Scaling Strategy

### Horizontal Scaling

| Component | Min | Max | Trigger |
|---|---|---|---|
| Backend Service | 1 | 3 | CPU > 70% |
| Data Collection | 1 | 2 | CPU > 70% |
| MLflow Service | 1 | 2 | CPU > 70% |
| Strategy Tasks | 0 | 10 | Step Functions demand |

Live Trading (v9.1):

| Component | Min | Max | Trigger |
|---|---|---|---|
| Strategy Container (Live) | 1 | 1 | N/A (1 per strategy) |
| Strategy Container (Dry-run) | 1 | 1 | N/A (1 per strategy) |
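The Backend Service row translates directly into a Pulumi target-tracking policy (min 1, max 3, scale on 70% average CPU); the cluster and service names below are placeholders.

```python
import pulumi_aws as aws

# Register the ECS service's desired count as a scalable target (1..3 tasks).
target = aws.appautoscaling.Target(
    "backend-scaling-target",
    service_namespace="ecs",
    scalable_dimension="ecs:service:DesiredCount",
    resource_id="service/tradai-dev/backend-api",  # service/<cluster>/<service>
    min_capacity=1,
    max_capacity=3,
)

# Target-tracking on average CPU keeps the service near 70% utilization.
aws.appautoscaling.Policy(
    "backend-cpu-policy",
    service_namespace=target.service_namespace,
    scalable_dimension=target.scalable_dimension,
    resource_id=target.resource_id,
    policy_type="TargetTrackingScaling",
    target_tracking_scaling_policy_configuration=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationArgs(
        target_value=70,
        predefined_metric_specification=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationPredefinedMetricSpecificationArgs(
            predefined_metric_type="ECSServiceAverageCPUUtilization",
        ),
    ),
)
```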
### Vertical Scaling (Future)

| Component | Current | Upgrade Path |
|---|---|---|
| Backend Service | 0.5 vCPU, 1GB | 1 vCPU, 2GB |
| Strategy Container | 1 vCPU, 2GB | 2 vCPU, 4GB |
| RDS | db.t4g.micro | db.t4g.small |
## 6. High Availability Design

### Multi-AZ Components

| Component | AZ-1 (eu-central-1a) | AZ-2 (eu-central-1b) |
|---|---|---|
| Public Subnet | 10.0.1.0/24 | 10.0.2.0/24 |
| Private Subnet | 10.0.11.0/24 | 10.0.12.0/24 |
| Database Subnet | 10.0.21.0/24 | 10.0.22.0/24 |
| NAT Instance | Primary | Standby (ASG) |
| ALB | Active | Active |
| ECS Tasks | Distributed | Distributed |
### Failure Scenarios

| Failure | Impact | Recovery |
|---|---|---|
| Single ECS task | Request retry | ECS auto-restart (30s) |
| NAT Instance | No egress | ASG launches replacement (2 min) |
| Single AZ | 50% capacity | Traffic shifts to healthy AZ |
| RDS | Database unavailable | Promote read replica (future) |
| Step Functions execution | Workflow fails | DLQ + manual retry |
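The NAT recovery row pairs with the update-nat-routes Lambda from section 2. A sketch, assuming the private route-table IDs arrive via an environment variable and the ASG lifecycle hook event is delivered through EventBridge:

```python
import os

import boto3

ec2 = boto3.client("ec2")
asg = boto3.client("autoscaling")

ROUTE_TABLE_IDS = os.environ["PRIVATE_ROUTE_TABLE_IDS"].split(",")  # assumed

def handler(event, context):
    detail = event["detail"]            # EC2 Instance-launch Lifecycle Action
    instance_id = detail["EC2InstanceId"]

    # NAT instances must forward traffic, so disable source/dest checking.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        SourceDestCheck={"Value": False},
    )
    # Repoint the private subnets' default route at the replacement instance.
    for rtb_id in ROUTE_TABLE_IDS:
        ec2.replace_route(
            RouteTableId=rtb_id,
            DestinationCidrBlock="0.0.0.0/0",
            InstanceId=instance_id,
        )
    # Signal the lifecycle hook so the ASG finishes the launch.
    asg.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )
```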
## 7. Technology Decisions Summary

| Layer | Technology | Alternatives Considered | Rationale |
|---|---|---|---|
| API Gateway | AWS API Gateway | Custom FastAPI | 98% cost savings, managed |
| Compute | ECS Fargate | EKS, EC2 | Serverless, simple |
| Orchestration | Step Functions | Airflow, Temporal | Native AWS, pay-per-use |
| Queue | SQS FIFO | EventBridge, SNS | Ordering, exactly-once |
| State | DynamoDB | Redis, PostgreSQL | Serverless, TTL |
| Auth | Cognito | Auth0, Keycloak | Native AWS, free tier |
| IaC | Pulumi | Terraform, CDK | Python native, type-safe |
| Monitoring | CloudWatch | Datadog, Prometheus | Native AWS, integrated |
## 8. Integration Points

### External Systems

| System | Integration | Protocol | Purpose |
|---|---|---|---|
| Binance | CCXT Library | REST/WebSocket | Market data |
| MLflow | HTTP API | REST | Experiment tracking |
| Freqtrade | CLI | Process | Backtesting engine |
### Internal APIs

| API | Base URL | Auth | Rate Limit |
|---|---|---|---|
| TradAI API | https://api.tradai.example.com | JWT (Cognito) | 100 req/5min |
| MLflow | https://mlflow.tradai.example.com | Basic Auth | None |
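A client sketch of authenticating against the TradAI API with a Cognito JWT; it assumes the user pool's app client has the USER_PASSWORD_AUTH flow enabled, and all identifiers and credentials are placeholders.

```python
import boto3
import requests

def get_id_token(client_id: str, username: str, password: str) -> str:
    """Exchange Cognito credentials for an ID token (JWT)."""
    cognito = boto3.client("cognito-idp")
    resp = cognito.initiate_auth(
        ClientId=client_id,
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    return resp["AuthenticationResult"]["IdToken"]

token = get_id_token("example-client-id", "trader", "s3cret")  # placeholders
r = requests.get(
    "https://api.tradai.example.com/api/v1/strategies",
    headers={"Authorization": f"Bearer {token}"},
)
r.raise_for_status()  # expect 429 past the 100 req / 5 min limit
print(r.json())
```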
## 9. Deployment Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│ CI/CD Pipeline (GitHub Actions) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Build Stage │ ─► │ Test Stage │ ─► │ Deploy Stage │ │
│ │ │ │ │ │ │ │
│ │ - just fmt │ │ - just test │ │ - pulumi preview │ │
│ │ - just lint │ │ - just check│ │ - pulumi up │ │
│ │ - docker │ │ │ │ - ecs deploy │ │
│ │ build │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ ECR │ │
│ │ (Images) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────┐
│ AWS Account │
│ │
│ ECS ◄─ ECR Images │
│ Lambda ◄─ S3 Code │
│ Step Functions │
└─────────────────────┘
```
## 10. Next Steps

- Review 03-VPC-NETWORKING.md for detailed network design
- Review 04-SECURITY.md for security controls
- Review 05-SERVICES.md for service implementations

## Changelog

| Version | Date | Changes |
|---|---|---|
| 9.2.1 | 2026-03-28 | Fixed Lambda table (18 actual incl. update-nat-routes), live trading resources (1 vCPU/2GB), added Mermaid diagram |
## Dependencies

| If This Changes | Update This Doc |
|---|---|
| infra/compute/modules/lambda_funcs.py | Lambda Functions table |
| infra/shared/tradai_infra_shared/config.py SERVICES | Service specs table |
| New service or Lambda added | Component inventory |