Architecture at a Glance

One page to understand the entire TradAI platform.

System Diagram

graph LR
    CLI["CLI / Users"] --> APIGW["API Gateway<br/>+ WAF + Cognito"]
    APIGW --> ALB["ALB"]

    ALB --> Backend["Backend API<br/>:8000"]
    ALB --> Strategy["Strategy Service<br/>:8003"]
    ALB --> DataCol["Data Collection<br/>:8002"]
    ALB --> MLflow["MLflow<br/>:5000"]

    Backend -->|orchestrate| Strategy
    Backend -->|data sync| DataCol
    Strategy -->|experiments| MLflow
    DataCol -->|CCXT| Exchange["Exchanges"]

    Backend -->|submit| SQS["SQS"] --> Lambda18["18 Lambda<br/>Functions"]
    Lambda18 --> SF["Step Functions<br/>Backtest + Retraining"]
    SF -->|run| ECS["ECS Tasks<br/>Freqtrade"]

    Backend --> DDB[("DynamoDB<br/>12 tables")]
    DataCol --> Arctic[("ArcticDB<br/>S3-backed")]
    MLflow --> S3[("S3<br/>Artifacts")]
    ECS --> DDB
    ECS --> MLflow

    Lambda18 --> CW["CloudWatch"] --> SNS["SNS Alerts"]

Component Quick Reference

| Component | What It Does | Learn More |
| --- | --- | --- |
| Backend | API Gateway, backtest orchestration, job management | Services, Data Flows |
| Strategy Service | Backtest execution, hyperopt, model promotion, A/B testing | Services, Step Functions |
| Data Collection | Exchange data fetching and ArcticDB storage | Services, Data Flows |
| MLflow | Experiment tracking, model registry, artifact storage | ML Lifecycle |
| Live Trading | Real-time strategy execution on ECS (partially implemented; see doc for status) | Live Trading |
| 18 Lambda Functions | Orchestration, monitoring, ML ops, deployment | Services |
| Step Functions | Backtest and retraining workflows (STANDARD type) | Step Functions |
| VPC + Security | Network isolation, WAF, Cognito JWT auth | VPC, Security |
| Pulumi IaC | 4-stack infrastructure (persistent, foundation, compute, edge) | Pulumi Code |
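To make the Step Functions row more concrete, here is a minimal sketch of what a STANDARD backtest workflow definition could look like in Amazon States Language, built as a Python dictionary. The state names, the ARNs passed in, and the retry numbers are hypothetical placeholders, not values taken from the TradAI codebase; only the one-hour timeout reflects the figure quoted on this page. Express workflows are capped at five minutes, which is why a task of this length needs the STANDARD type.

```python
import json


def backtest_state_machine(cluster_arn: str, task_def_arn: str) -> dict:
    """Return a hypothetical ASL definition for a single-task backtest workflow.

    All names and retry values are illustrative assumptions. Network
    configuration and container overrides are omitted for brevity.
    """
    return {
        "Comment": "Sketch of a STANDARD backtest workflow",
        "StartAt": "RunBacktest",
        "States": {
            "RunBacktest": {
                "Type": "Task",
                # .sync makes the state wait until the ECS task finishes.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_def_arn,
                    "LaunchType": "FARGATE",
                },
                # One hour, matching the "up to 1 hour" backtests on this page.
                "TimeoutSeconds": 3600,
                # Built-in retry: two more attempts with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                # Built-in error handling: any remaining error goes to MarkFailed.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "MarkFailed"}],
                "Next": "MarkSucceeded",
            },
            "MarkSucceeded": {"Type": "Succeed"},
            "MarkFailed": {"Type": "Fail", "Error": "BacktestFailed"},
        },
    }


if __name__ == "__main__":
    definition = backtest_state_machine("example-cluster-arn", "example-task-def-arn")
    print(json.dumps(definition, indent=2))
```

The real workflows are described in the Step Functions page; this sketch only shows where the timeout, retry, and catch settings live in a definition.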

Key Architecture Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| NAT Instance over NAT Gateway | t4g.nano NAT instance | ~$4/mo vs ~$32/mo; acceptable for dev/staging workloads |
| ECS Fargate over EC2 | Fargate launch type | Zero server management; pay-per-task; simpler scaling. Note: dev/staging use EC2 consolidated mode (CONSOLIDATED_MODE=True) for cost savings; production uses Fargate. |
| Step Functions STANDARD | STANDARD over EXPRESS | Long-running backtests (up to 1 hour); built-in retry and error handling |
| Protocol-based DI | Python Protocols, not ABCs | Structural subtyping; no inheritance coupling; better testability |
| 4-stack Pulumi | persistent / foundation / compute / edge | Independent lifecycle per layer; safe blast radius; parallel deployments |
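The Protocol-based DI decision is easier to picture with a small example. The sketch below assumes a hypothetical `JobStore` port and `BacktestService` consumer; neither name comes from the TradAI codebase. It shows the structural-subtyping point from the table: the in-memory test double satisfies the protocol without inheriting from it, so tests can swap it in for a real store.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class JobStore(Protocol):
    """Hypothetical port: any object with these two methods satisfies it."""

    def put(self, job_id: str, status: str) -> None: ...

    def get(self, job_id: str) -> Optional[str]: ...


@dataclass
class InMemoryJobStore:
    """Test double. It does not inherit from JobStore."""

    items: Dict[str, str] = field(default_factory=dict)

    def put(self, job_id: str, status: str) -> None:
        self.items[job_id] = status

    def get(self, job_id: str) -> Optional[str]:
        return self.items.get(job_id)


@dataclass
class BacktestService:
    """Depends only on the JobStore shape, injected through the constructor."""

    store: JobStore

    def submit(self, job_id: str) -> str:
        # Record the job as pending; a real service would also enqueue work.
        self.store.put(job_id, "PENDING")
        return job_id


if __name__ == "__main__":
    service = BacktestService(store=InMemoryJobStore())
    service.submit("job-1")
    print(service.store.get("job-1"))  # PENDING
```

With an ABC, `InMemoryJobStore` would have to subclass the interface; with a Protocol, a static type checker accepts any class whose methods match, which keeps test doubles and production adapters decoupled.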

Reading Paths

| If You Need | Read These (in order) |
| --- | --- |
| System understanding | This page, then System Design, then Data Flows |
| Deploy infrastructure | Pulumi Code, then Canonical Config, then Deployment |
| Debug an issue | State Machines, then Error Handling, then Observability |
| Understand costs | Cost Analysis |
| Security review | Security, then VPC |
| ML / model lifecycle | ML Lifecycle, then Step Functions |
| Live trading setup | Live Trading, then Configuration |
| CI/CD and releases | CI/CD Pipeline, then Deployment Pipeline |
| Testing approach | Testing Strategy |