TradAI Final Architecture - Architecture Overview

Version: 9.2.1 | Date: 2025-12-09
System Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                      USERS                                      │
│                          (CLI / Web UI / API Clients)                           │
└──────────────────────────────────────┬──────────────────────────────────────────┘
                                       │ HTTPS (443)
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                           AWS API Gateway (HTTP API)                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐                  │
│  │ JWT Authorizer  │  │ Rate Limiting   │  │ Request         │                  │
│  │ (Cognito)       │  │ (100 req/5min)  │  │ Validation      │                  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘                  │
│                                                                                 │
│  Routes:                                                                        │
│  ├─ GET  /strategies/*   → VPC Link → Backend Service                           │
│  ├─ GET  /data/*         → VPC Link → Backend Service                           │
│  ├─ POST /backtest       → SQS Integration → Backtest Queue                     │
│  ├─ POST /deploy         → SQS Integration → Deploy Queue                       │
│  ├─ GET  /backtest/{id}  → VPC Link → Backend Service (DynamoDB query)          │
│  └─ ANY  /mlflow/*       → VPC Link → MLflow Service                            │
└────────────────┬─────────────────────────┬──────────────────────────────────────┘
                 │                         │
    ┌────────────┘                         └────────────┐
    │ VPC Link                                          │ SQS Integration
    ▼                                                   ▼
┌─────────────────────────────────────────┐     ┌─────────────────────────────────┐
│            VPC (10.0.0.0/16)            │     │        SQS Queues (FIFO)        │
│                                         │     │                                 │
│  ┌────────────────────────────────────┐ │     │   ┌─────────────────────────┐   │
│  │     Application Load Balancer      │ │     │   │ Backtest Queue          │   │
│  │          (Public Subnets)          │ │     │   │ ├─ Visibility: 15min    │   │
│  └──────────────┬─────────────────────┘ │     │   │ ├─ Retention: 4 days    │   │
│                 │                       │     │   │ └─ Max Receives: 3      │   │
│  ┌──────────────┴─────────────────────┐ │     │   └─────────────┬───────────┘   │
│  │          Private Subnets           │ │     │                 │               │
│  │                                    │ │     │   ┌─────────────┴───────────┐   │
│  │  ┌──────────────────────────────┐  │ │     │   │ Dead Letter Queue       │   │
│  │  │ Backend API Service          │  │ │     │   │ (Failed messages)       │   │
│  │  │ (ECS Fargate)                │  │ │     │   └─────────────────────────┘   │
│  │  │ ├─ Port 8000                 │  │ │     └─────────────────────────────────┘
│  │  │ ├─ 0.5 vCPU, 1GB             │  │ │                   │
│  │  │ └─ FastAPI                   │  │ │                   │ Lambda Trigger
│  │  └──────────────────────────────┘  │ │                   ▼
│  │                                    │ │     ┌─────────────────────────────────┐
│  │  ┌──────────────────────────────┐  │ │     │  Lambda: SQS Consumer           │
│  │  │ Data Collection Service      │  │ │     │  ├─ Processes messages          │
│  │  │ (ECS Fargate)                │  │ │     │  ├─ Idempotency check           │
│  │  │ ├─ Port 8002                 │  │ │     │  └─ Starts Step Functions       │
│  │  │ ├─ 0.25 vCPU, 512MB          │  │ │     └─────────────┬───────────────────┘
│  │  │ └─ ArcticDB Client           │  │ │                   │
│  │  └──────────────────────────────┘  │ │                   ▼
│  │                                    │ │     ┌─────────────────────────────────┐
│  │  ┌──────────────────────────────┐  │ │     │  Step Functions (STANDARD)      │
│  │  │ MLflow Service               │  │ │     │                                 │
│  │  │ (ECS Fargate)                │  │ │     │  Backtest Workflow:             │
│  │  │ ├─ Port 5000                 │  │ │     │    ┌─────────────────────────┐  │
│  │  │ ├─ 0.5 vCPU, 1GB             │  │ │     │    │ 1. ValidateStrategy     │  │
│  │  │ └─ RDS PostgreSQL            │  │ │     │    │ 2. CheckDataFreshness   │  │
│  │  └──────────────────────────────┘  │ │     │    │    (Parallel)           │  │
│  │                                    │ │     │    │ 3. PrepareConfig        │  │
│  │  ┌──────────────────────────────┐  │ │     │    │ 4. RunBacktest          │──┼──► ECS Task
│  │  │ Lambda Functions (VPC)       │  │ │     │    │ 5. TransformResults     │  │
│  │  │ ├─ validate-strategy         │  │ │     │    │ 6. UpdateState          │  │
│  │  │ ├─ data-collection-proxy     │  │ │     │    │ 7. Cleanup              │  │
│  │  │ ├─ transform-results         │  │ │     │    │ 8. Notify               │  │
│  │  │ ├─ cleanup-resources         │  │ │     │    └─────────────────────────┘  │
│  │  │ ├─ notify-completion         │  │ │     └─────────────────────────────────┘
│  │  │ └─ sqs-consumer              │  │ │
│  │  └──────────────────────────────┘  │ │
│  └────────────────────────────────────┘ │
│                                         │
│  ┌────────────────────────────────────┐ │
│  │          Database Subnets          │ │
│  │  ┌──────────────────────────────┐  │ │
│  │  │ RDS PostgreSQL               │  │ │
│  │  │ (MLflow Backend)             │  │ │
│  │  │ ├─ db.t4g.micro              │  │ │
│  │  │ └─ Single-AZ (dev)           │  │ │
│  │  └──────────────────────────────┘  │ │
│  └────────────────────────────────────┘ │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                  STORAGE LAYER                                  │
│                                                                                 │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐│
│   │ S3 Configs      │  │ S3 Results      │  │ ArcticDB        │  │ ECR         ││
│   │ (SSE-S3)        │  │ (SSE-S3)        │  │ (S3-backed)     │  │ Images      ││
│   │                 │  │                 │  │                 │  │             ││
│   │ strategies/     │  │ backtests/      │  │ futures/        │  │ strategies/ ││
│   │ base-configs/   │  │ experiments/    │  │ spot/           │  │ services/   ││
│   │ temp/           │  │ mlflow-artifacts│  │                 │  │             ││
│   └─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────┘│
│                                                                                 │
│   ┌─────────────────┐  ┌─────────────────┐                                      │
│   │ DynamoDB        │  │ Secrets Manager │                                      │
│   │ (Workflow       │  │                 │                                      │
│   │  State)         │  │ tradai/mlflow   │                                      │
│   │                 │  │ tradai/binance  │                                      │
│   │ PK: run_id      │  │ tradai/db       │                                      │
│   │ TTL: 7 days     │  │                 │                                      │
│   └─────────────────┘  └─────────────────┘                                      │
└─────────────────────────────────────────────────────────────────────────────────┘
Component Inventory

Long-Running Services (24/7)

| Service | Purpose | Resources | Port | Cost/Month |
|---|---|---|---|---|
| Backend API Service | API endpoints, orchestration triggers | 0.5 vCPU, 1GB | 8000 | $14.60 |
| Data Collection Service | Data freshness checks, metadata queries | 0.25 vCPU, 512MB | 8002 | $7.30 |
| MLflow Service | Experiment tracking, model registry | 0.5 vCPU, 1GB | 5000 | $14.60 |

On-Demand Tasks (Pay-per-use)

| Task | Purpose | Resources | Duration | Cost/Run |
|---|---|---|---|---|
| Strategy Service Task | Config preparation, validation | 0.5 vCPU, 1GB | 5-10 min | $0.03 |
| Strategy Container | Backtest execution (Freqtrade) | 1 vCPU, 2GB | 5-60 min | $0.02-0.20 |

Live Trading Services (v9.1)

| Service | Purpose | Resources | Mode | Cost/Month |
|---|---|---|---|---|
| Strategy Container (Live) | Production trading per strategy | 0.5 vCPU, 1GB | ECS Service (non-Spot) | $15.33 |
| Strategy Container (Dry-run) | Paper trading per strategy | 0.5 vCPU, 1GB | ECS Service (non-Spot) | $15.33 |

Note: Strategy containers are unified: the same image handles backtest, hyperopt, dry-run, and live modes. Configuration is loaded at runtime from MLflow tags and S3 (see 11-LIVE-TRADING.md).
Lambda Functions

| Function | Purpose | Memory | Timeout | Trigger |
|---|---|---|---|---|
| validate-strategy | Check ECR image + S3 config | 256MB | 30s | Step Functions |
| data-collection-proxy | Call Data Collection Service | 256MB | 60s | Step Functions |
| transform-results | Format backtest results | 512MB | 60s | Step Functions |
| cleanup-resources | Delete temp S3 files | 256MB | 30s | Step Functions |
| notify-completion | Send notifications | 256MB | 30s | Step Functions |
| sqs-consumer | Process backtest queue | 256MB | 60s | SQS |
| check-performance | Validate metrics | 256MB | 30s | Step Functions |
| audit-logger | Log operations | 256MB | 30s | EventBridge |
| orphan-scanner | Detect stuck RUNNING jobs (v9.2.1) | 128MB | 60s | EventBridge (15 min) |

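The orphan-scanner entry is the least self-explanatory function in this inventory, so here is a minimal sketch of what "detect stuck RUNNING jobs" could look like. The cutoff (`MAX_RUNTIME`), the `started_at` field, and the function name `find_orphans` are assumptions for illustration; the document only states that the scanner runs every 15 minutes and that backtest workflows take 10-70 minutes.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff: comfortably above the 10-70 minute workflow duration.
MAX_RUNTIME = timedelta(minutes=90)


def find_orphans(items, now, max_runtime=MAX_RUNTIME):
    """Return run_ids still marked RUNNING after max_runtime has elapsed.

    `items` stands in for rows of the workflow-state table; each row is a
    dict with `run_id`, `status`, and an assumed `started_at` datetime.
    """
    return [
        item["run_id"]
        for item in items
        if item["status"] == "RUNNING" and now - item["started_at"] > max_runtime
    ]
```

What happens to a detected orphan (marking it failed, alerting, or retrying) is not specified here and is left out of the sketch.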
Managed Services

| Service | Purpose | Configuration | Cost/Month |
|---|---|---|---|
| AWS API Gateway | API management, auth, routing | HTTP API | $3.50 |
| Step Functions | Workflow orchestration | Standard type | $0.50 |
| SQS | Message queuing | FIFO + DLQ | $0.40 |
| DynamoDB | Workflow state | On-demand, TTL | $2.00 |
| Cognito | User authentication | User pool | $0 (free tier) |
| Secrets Manager | Credentials storage | 5 secrets | $2.00 |
| CloudTrail | Audit logging | Single trail | $2.00 |

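The monthly figures listed in this inventory can be totalled into an always-on baseline. The sketch below only adds up the numbers stated above; the function name `monthly_baseline` and the idea of passing strategy counts are illustrative assumptions. It excludes per-run task costs and anything not listed in this inventory (for example RDS, NAT, or data transfer).

```python
from decimal import Decimal

# Monthly costs in USD, copied from the inventory above.
LONG_RUNNING = {
    "Backend API Service": Decimal("14.60"),
    "Data Collection Service": Decimal("7.30"),
    "MLflow Service": Decimal("14.60"),
}
MANAGED = {
    "AWS API Gateway": Decimal("3.50"),
    "Step Functions": Decimal("0.50"),
    "SQS": Decimal("0.40"),
    "DynamoDB": Decimal("2.00"),
    "Cognito": Decimal("0.00"),
    "Secrets Manager": Decimal("2.00"),
    "CloudTrail": Decimal("2.00"),
}
STRATEGY_CONTAINER = Decimal("15.33")  # per live or dry-run strategy


def monthly_baseline(live: int = 0, dry_run: int = 0) -> Decimal:
    """Sum the fixed monthly costs plus one container per running strategy."""
    fixed = sum(LONG_RUNNING.values(), Decimal("0")) + sum(MANAGED.values(), Decimal("0"))
    return fixed + STRATEGY_CONTAINER * (live + dry_run)
```

With no strategies running, the listed services come to $46.90 per month; one live plus one dry-run strategy brings that to $77.56.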
Data Flow Patterns

Pattern 1: Simple Query (Synchronous)

User Request                              Response Time: <200ms
     │
     ▼
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
     │
     ▼
VPC Link ──► ALB ──► Backend Service
                          │
                          ├─► GET /strategies ──► MLflow Service ──► RDS
                          │
                          └─► GET /data/status ──► Data Collection ──► ArcticDB (S3)

Use Cases:
- List strategies
- Get strategy details
- Check data freshness
- Get workflow status
Pattern 2: Complex Operation (Asynchronous)

User Request                              Response Time: <100ms (queued)
     │
     ▼
AWS API Gateway ──► JWT Validation ──► Rate Limit Check
     │
     ▼
SQS Integration ──► Backtest Queue (FIFO)
     │                   │
     │                   └─► Response: { ticket_id, status: "QUEUED" }
     │
     ▼ (Async)
Lambda: SQS Consumer
     │
     ├─► Check Idempotency (DynamoDB)
     │
     ├─► Register Workflow (DynamoDB: status=RUNNING)
     │
     └─► Start Step Functions Execution
              │
              ▼
         Backtest Workflow (10-70 minutes)
              │
              ├─► ValidateStrategy (Lambda)
              ├─► CheckDataFreshness (Lambda → Data Collection)
              ├─► PrepareConfig (ECS Task)
              ├─► RunBacktest (ECS Task: Strategy Container)
              ├─► TransformResults (Lambda)
              ├─► UpdateState (DynamoDB: status=COMPLETED)
              ├─► Cleanup (Lambda)
              └─► Notify (Lambda → SNS/Email)

Use Cases:
- Run backtest
- Deploy strategy
- Full data sync
- Register new strategy
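The consumer steps in Pattern 2 (idempotency check, workflow registration, execution start) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `InMemoryStateStore` is a hypothetical stand-in for the DynamoDB workflow-state table, the `run_id` message field is assumed, and `start_execution` is an injected callback standing in for the Step Functions call. The sketch also folds the check and the registration into one conditional write, which is a design choice rather than something the diagram prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InMemoryStateStore:
    """Hypothetical stand-in for the workflow-state table (PK: run_id)."""

    items: Dict[str, dict] = field(default_factory=dict)

    def put_if_absent(self, run_id: str, item: dict) -> bool:
        """Mimic a conditional put: return False if run_id already exists."""
        if run_id in self.items:
            return False
        self.items[run_id] = item
        return True


def handle_message(
    message: dict,
    store: InMemoryStateStore,
    start_execution: Callable[[str, dict], None],
) -> str:
    """Process one queue message at most once per run_id."""
    run_id = message["run_id"]  # assumed message field
    # Idempotency check and registration in a single conditional write.
    if not store.put_if_absent(run_id, {"run_id": run_id, "status": "RUNNING"}):
        return "DUPLICATE"
    start_execution(run_id, message)
    return "STARTED"
```

Against a real table, the same effect is usually obtained with a `PutItem` guarded by a condition such as `attribute_not_exists(run_id)`, so that a redelivered message cannot start a second execution.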
Pattern 3: Status Polling

User: GET /backtest/{run_id}/status
     │
     ▼
AWS API Gateway ──► Backend Service
                         │
                         ▼
                    DynamoDB Query
                         │
                         ├─► If COMPLETED: Include results from MLflow
                         │
                         └─► Return: { status, progress, result?, mlflow_url }
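The response assembly in Pattern 3 can be sketched as a small pure function. The field names follow the return shape shown in the diagram; the function name `build_status_response`, the `fetch_results` callback (standing in for the MLflow lookup), and the `NOT_FOUND` status for a missing item are assumptions for illustration.

```python
from typing import Callable, Optional


def build_status_response(
    item: Optional[dict],
    fetch_results: Callable[[str], dict],
) -> dict:
    """Build the polling response from a workflow-state item."""
    if item is None:
        # Assumed behaviour for unknown or expired run_ids.
        return {"status": "NOT_FOUND"}
    response = {
        "status": item["status"],
        "progress": item.get("progress"),
        "mlflow_url": item.get("mlflow_url"),
    }
    # Results are attached only once the workflow has completed.
    if item["status"] == "COMPLETED":
        response["result"] = fetch_results(item["run_id"])
    return response
```

Keeping the results lookup behind a callback means the MLflow call is skipped entirely while a run is still in progress.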
Pattern 4: Live Trading (v9.1)

Container Startup (ECS Service - Always On)
     │
     ├─► Read minimal env vars: STRATEGY_NAME, STRATEGY_STAGE, TRADING_MODE
     │
     ▼
MLflow Config Loading
     │
     ├─► MLflowAdapter.get_model_version(name, stage)
     │    └─► Returns tags: timeframe, pairs, warmup_days, config_s3_path
     │
     ├─► ConfigMergeService.load_config(config_s3_path)
     │    └─► Load full Freqtrade config from S3
     │
     └─► Apply runtime overrides (if any)
     │
     ▼
Data Warmup
     │
     ├─► ArcticDB direct access (S3) - NOT via Data Collection Service
     │    └─► Load warmup_days of historical OHLCV data
     │
     └─► Initialize Freqtrade with warmed dataframes
     │
     ▼
Trading Loop (24/7)
     │
     ├─► Freqtrade strategy execution
     │    └─► Exchange API (Binance) for orders
     │
     ├─► Health Reporting (every 60s)
     │    ├─► DynamoDB heartbeat update
     │    └─► CloudWatch metrics (PnL, trades, latency)
     │
     └─► EventBridge + Lambda monitors session health
          │
          └─► SNS alert if heartbeat stale > 3 minutes

Use Cases:
- Live production trading
- Dry-run (paper trading)
- Strategy hot-swap via MLflow stage change
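The health monitoring rule in Pattern 4 (heartbeat every 60 seconds, alert when stale for more than 3 minutes) reduces to a timestamp comparison. A minimal sketch, assuming heartbeats are stored as timezone-aware datetimes keyed by a session name; the function names and the session mapping are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the diagram: alert if the heartbeat is stale > 3 minutes.
STALE_AFTER = timedelta(minutes=3)


def is_heartbeat_stale(
    last_heartbeat: datetime,
    now: datetime,
    threshold: timedelta = STALE_AFTER,
) -> bool:
    """True when the last heartbeat is strictly older than the threshold."""
    return now - last_heartbeat > threshold


def stale_sessions(sessions: dict, now: datetime) -> list:
    """Return the names of sessions whose heartbeat is stale, sorted."""
    return sorted(
        name for name, last in sessions.items() if is_heartbeat_stale(last, now)
    )
```

With a 60 second reporting interval, a 3 minute threshold tolerates two missed heartbeats before an alert would fire.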
Service Communication Matrix

| From | To | Protocol | Path | Purpose |
|---|---|---|---|---|
| API Gateway | Backend Service | HTTP/VPC Link | /strategies/* | Strategy queries |
| API Gateway | Backend Service | HTTP/VPC Link | /data/* | Data queries |
| API Gateway | MLflow Service | HTTP/VPC Link | /mlflow/* | MLflow proxy |
| API Gateway | SQS | AWS Integration | Queue | Async operations |
| Lambda | Data Collection | HTTP/VPC | localhost:8002 | Data checks |
| Lambda | Step Functions | AWS SDK | StartExecution | Workflow trigger |
| Lambda | DynamoDB | AWS SDK | GetItem/PutItem | State management |
| Step Functions | Lambda | AWS Integration | Invoke | Utility functions |
| Step Functions | ECS | AWS Integration | RunTask.sync | Task execution |
| Step Functions | DynamoDB | AWS Integration | UpdateItem | State updates |
| Backend Service | MLflow Service | HTTP | :5000 | Experiment queries |
| Backend Service | Data Collection | HTTP | :8002 | Freshness checks |
| Data Collection | ArcticDB | S3 API | s3:// | Time series data |
| Strategy Container | ArcticDB | S3 API | s3:// | OHLCV data |
| Strategy Container | MLflow Service | HTTP | :5000 | Log metrics |
| **Live Trading (v9.1)** | | | | |
| Strategy Container (Live) | MLflow Service | HTTP | :5000 | Load config tags |
| Strategy Container (Live) | S3 | S3 API | s3:// | Load config JSON |
| Strategy Container (Live) | ArcticDB | S3 API | s3:// | Warmup data |
| Strategy Container (Live) | DynamoDB | AWS SDK | UpdateItem | Heartbeat |
| Strategy Container (Live) | Binance | HTTPS | api.binance.com | Orders |
| Lambda (health-check) | DynamoDB | AWS SDK | Query | Session monitoring |
| EventBridge | Lambda (health-check) | AWS Integration | rate(5 min) | Scheduled check |

Scaling Strategy

Horizontal Scaling

| Component | Min | Max | Trigger |
|---|---|---|---|
| Backend Service | 1 | 3 | CPU > 70% |
| Data Collection | 1 | 2 | CPU > 70% |
| MLflow Service | 1 | 2 | CPU > 70% |
| Strategy Tasks | 0 | 10 | Step Functions demand |
| **Live Trading (v9.1)** | | | |
| Strategy Container (Live) | 1 | 1 | N/A (1 per strategy) |
| Strategy Container (Dry-run) | 1 | 1 | N/A (1 per strategy) |

Vertical Scaling (Future)

| Component | Current | Upgrade Path |
|---|---|---|
| Backend Service | 0.5 vCPU, 1GB | 1 vCPU, 2GB |
| Strategy Container | 1 vCPU, 2GB | 2 vCPU, 4GB |
| RDS | db.t4g.micro | db.t4g.small |

High Availability Design

Multi-AZ Components

| Component | AZ-1 (us-east-1a) | AZ-2 (us-east-1b) |
|---|---|---|
| Public Subnet | 10.0.1.0/24 | 10.0.2.0/24 |
| Private Subnet | 10.0.11.0/24 | 10.0.12.0/24 |
| Database Subnet | 10.0.20.0/24 | 10.0.21.0/24 |
| NAT Instance | Primary | Standby (ASG) |
| ALB | Active | Active |
| ECS Tasks | Distributed | Distributed |

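The subnet plan above can be sanity-checked with Python's standard `ipaddress` module: every subnet should sit inside the VPC range 10.0.0.0/16 and no two subnets should overlap. The CIDRs are taken from this document; the subnet labels and the function name `validate_subnets` are illustrative.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")

# CIDRs from the Multi-AZ layout; the keys are illustrative labels.
SUBNETS = {
    "public-1a": "10.0.1.0/24",
    "public-1b": "10.0.2.0/24",
    "private-1a": "10.0.11.0/24",
    "private-1b": "10.0.12.0/24",
    "database-1a": "10.0.20.0/24",
    "database-1b": "10.0.21.0/24",
}


def validate_subnets(vpc, subnets) -> list:
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} is outside {vpc}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems
```

A check like this fits naturally into infrastructure tests, where a mistyped CIDR would otherwise only surface at deploy time.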
Failure Scenarios

| Failure | Impact | Recovery |
|---|---|---|
| Single ECS task | Request retry | ECS auto-restart (30s) |
| NAT Instance | No egress | ASG launches replacement (2 min) |
| Single AZ | 50% capacity | Traffic shifts to healthy AZ |
| RDS | Database unavailable | Promote read replica (future) |
| Step Functions execution | Workflow fails | DLQ + manual retry |

Technology Decisions Summary

| Layer | Technology | Alternatives Considered | Rationale |
|---|---|---|---|
| API Gateway | AWS API Gateway | Custom FastAPI | 98% cost savings, managed |
| Compute | ECS Fargate | EKS, EC2 | Serverless, simple |
| Orchestration | Step Functions | Airflow, Temporal | Native AWS, pay-per-use |
| Queue | SQS FIFO | EventBridge, SNS | Ordering, exactly-once |
| State | DynamoDB | Redis, PostgreSQL | Serverless, TTL |
| Auth | Cognito | Auth0, Keycloak | Native AWS, free tier |
| IaC | Pulumi | Terraform, CDK | Python native, type-safe |
| Monitoring | CloudWatch | Datadog, Prometheus | Native AWS, integrated |

Integration Points

External Systems

| System | Integration | Protocol | Purpose |
|---|---|---|---|
| Binance | CCXT Library | REST/WebSocket | Market data |
| MLflow | HTTP API | REST | Experiment tracking |
| Freqtrade | CLI | Process | Backtesting engine |

Internal APIs

| API | Base URL | Auth | Rate Limit |
|---|---|---|---|
| TradAI API | https://api.tradai.smartml.me | JWT (Cognito) | 100 req/5min |
| MLflow | https://mlflow.tradai.smartml.me | Basic Auth | None |

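A client that wants to stay under the 100 requests per 5 minutes limit on the TradAI API can throttle itself before sending. The sketch below uses a sliding window; how the gateway itself counts requests is not specified here, so the window semantics and the class name `SlidingWindowLimiter` are assumptions. Time is passed in explicitly to keep the logic easy to test.

```python
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: at most `limit` calls in any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 300.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = deque()  # timestamps of accepted calls, oldest first

    def allow(self, now: float) -> bool:
        """Record and permit the call if the window still has capacity."""
        # Drop timestamps that have left the window.
        while self._hits and now - self._hits[0] >= self.window:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True
```

In practice `now` would come from `time.monotonic()`, and a rejected call would be delayed or retried rather than dropped.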
Deployment Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Bitbucket Pipelines                       │
│                                                                 │
│  ┌─────────────┐    ┌──────────────┐    ┌─────────────────────┐ │
│  │ Build Stage │ ─► │ Test Stage   │ ─► │ Deploy Stage        │ │
│  │             │    │              │    │                     │ │
│  │ - pants fmt │    │ - pants test │    │ - pulumi preview    │ │
│  │ - pants lint│    │ - pants check│    │ - pulumi up         │ │
│  │ - docker    │    │              │    │ - ecs deploy        │ │
│  │   build     │    │              │    │                     │ │
│  └─────────────┘    └──────────────┘    └─────────────────────┘ │
│                                ▼                                │
│                         ┌─────────────┐                         │
│                         │     ECR     │                         │
│                         │  (Images)   │                         │
│                         └─────────────┘                         │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                      ┌─────────────────────┐
                      │     AWS Account     │
                      │                     │
                      │  ECS ◄─ ECR Images  │
                      │  Lambda ◄─ S3 Code  │
                      │  Step Functions     │
                      └─────────────────────┘
Next Steps

- Review 03-VPC-NETWORKING.md for detailed network design
- Review 04-SECURITY.md for security controls
- Review 05-SERVICES.md for service implementations