TradAI Final Architecture - Canonical Configuration

Version: 9.2.1 | Date: 2025-12-09 | Status: AUTHORITATIVE
Purpose

This document is the single source of truth for all configuration values in the TradAI architecture. When any other document or code conflicts with this file, this file takes precedence.
1. Network Configuration

1.1 VPC

| Parameter | Value | Notes |
|---|---|---|
| VPC CIDR | 10.0.0.0/16 | 65,536 IPs |
| Region | us-east-1 | Primary region |
| Availability Zones | us-east-1a, us-east-1b | 2 AZs for HA |
| DNS Hostnames | true | Required for Service Discovery |
| DNS Support | true | Required |
1.2 Subnets

| Subnet Type | AZ | CIDR | Route Table | Purpose |
|---|---|---|---|---|
| Public-1 | us-east-1a | 10.0.1.0/24 | Public RT | ALB, NAT Instance |
| Public-2 | us-east-1b | 10.0.2.0/24 | Public RT | ALB (multi-AZ) |
| Private-1 | us-east-1a | 10.0.11.0/24 | Private RT | ECS Tasks, Lambda |
| Private-2 | us-east-1b | 10.0.12.0/24 | Private RT | ECS Tasks, Lambda |
| Database-1 | us-east-1a | 10.0.21.0/24 | Database RT | RDS Primary |
| Database-2 | us-east-1b | 10.0.22.0/24 | Database RT | RDS Standby |
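The first item of the validation checklist (subnet CIDRs match this document) can be checked mechanically. The following is a minimal sketch using only the Python standard library; the names `VPC_CIDR`, `SUBNETS`, and `validate_subnets` are hypothetical and not part of the TradAI codebase. It confirms that every subnet lies inside the VPC range and that no two subnets overlap.

```python
import ipaddress

# Values copied from sections 1.1 and 1.2 of this document.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

SUBNETS = {
    "Public-1": "10.0.1.0/24",
    "Public-2": "10.0.2.0/24",
    "Private-1": "10.0.11.0/24",
    "Private-2": "10.0.12.0/24",
    "Database-1": "10.0.21.0/24",
    "Database-2": "10.0.22.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must be fully contained in the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")

    # No two subnets may share addresses.
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, SUBNETS) or "subnet layout OK")
```

A check like this could run in CI against the values that Pulumi actually deploys, assuming those values are exposed as plain strings.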
1.3 Route Tables

| Route Table | Routes | Associated Subnets |
|---|---|---|
| Public RT | 0.0.0.0/0 -> IGW | Public-1, Public-2 |
| Private RT | 0.0.0.0/0 -> NAT Instance | Private-1, Private-2 |
| Database RT | Local only (no internet) | Database-1, Database-2 |
1.4 NAT Instance High Availability

A cost-effective NAT instance with automatic failover replaces the managed NAT Gateway, saving roughly $32/month.
| Component | Value | Notes |
|---|---|---|
| Instance Type | t4g.nano | ARM-based, ~$3/month |
| AMI | Amazon Linux 2023 ARM64 | Latest AMI, auto-selected |
| ASG | min=1, max=1, desired=1 | Single instance for cost optimization |
| EIP | Static allocation | Attached by user data script |
| Lambda | tradai-update-nat-routes | Updates route on instance replacement |
| Subnet | Public-1 (us-east-1a) | Single AZ deployment |
User Data Configuration:
- IP forwarding enabled (net.ipv4.ip_forward = 1)
- NAT masquerading via iptables
- Automatic EIP association
- Source/destination check disabled
Failover Flow:
1. ASG detects unhealthy instance (EC2 health check)
2. ASG terminates failed instance and launches replacement
3. ASG Lifecycle Hook triggers EventBridge rule
4. Lambda function tradai-update-nat-routes invoked
5. Lambda waits for instance running state
6. Lambda updates private route table (0.0.0.0/0 -> new instance)
7. Lambda completes lifecycle action
8. Traffic resumes (~2-3 minutes failover time)
IAM Permissions:
- NAT Instance Role: ec2:AssociateAddress, ec2:ModifyInstanceAttribute
- Lambda Role: ec2:CreateRoute, ec2:ReplaceRoute, autoscaling:CompleteLifecycleAction
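Steps 5 through 7 of the failover flow can be sketched as a single function. This is an illustrative outline, not the actual tradai-update-nat-routes source: the function name, its parameters, and the fallback from ReplaceRoute to CreateRoute are assumptions. The `ec2` and `autoscaling` arguments are expected to be boto3 clients (for example `boto3.client("ec2")`), and `event_detail` is assumed to be the `detail` object of the lifecycle event delivered through EventBridge.

```python
def update_nat_route(ec2, autoscaling, route_table_id, event_detail):
    """Point the private default route at the replacement NAT instance.

    ec2 / autoscaling: boto3-style clients supplied by the caller.
    route_table_id:    ID of the private route table (hypothetical parameter).
    event_detail:      `detail` payload of the ASG lifecycle event.
    """
    instance_id = event_detail["EC2InstanceId"]

    # Step 5: block until the replacement instance reaches the running state.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Step 6: repoint 0.0.0.0/0 at the new instance.
    route = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",
        "InstanceId": instance_id,
    }
    try:
        ec2.replace_route(**route)
    except Exception:
        # Assumption: ReplaceRoute failed because no default route exists yet.
        # A production handler would catch botocore's ClientError and inspect
        # the error code instead of catching every exception.
        ec2.create_route(**route)

    # Step 7: tell the ASG the lifecycle hook finished successfully.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=event_detail["LifecycleHookName"],
        AutoScalingGroupName=event_detail["AutoScalingGroupName"],
        LifecycleActionToken=event_detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return instance_id
```

Passing the clients in as arguments keeps the function testable with stand-in objects. The calls it makes line up with the Lambda role permissions listed above (ec2:ReplaceRoute, ec2:CreateRoute, autoscaling:CompleteLifecycleAction); the waiter additionally polls DescribeInstances, so a role used with this sketch would need that permission as well.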
2. Service Configuration

2.1 Service Ports

| Service | Port | Health Check Path | Protocol |
|---|---|---|---|
| Backend API | 8000 | /health | HTTP |
| Data Collection | 8002 | /health | HTTP |
| MLflow | 5000 | /health | HTTP |
| Strategy Service | 8003 | /health | HTTP |
2.2 ECS Task Definitions

| Service | CPU | Memory | Desired Count | Launch Type |
|---|---|---|---|---|
| backend-api | 512 | 1024 MB | 1 | FARGATE |
| data-collection | 256 | 512 MB | 1 | FARGATE |
| mlflow | 512 | 1024 MB | 1 | FARGATE |
| strategy-service | 512 | 1024 MB | 0 (on-demand) | FARGATE |
| strategy-container | 1024 | 2048 MB | 0 (on-demand) | FARGATE_SPOT |
2.3 Service Discovery

| Service | DNS Name | Namespace |
|---|---|---|
| Backend API | backend-api.tradai.local | tradai.local |
| Data Collection | data-collection.tradai.local | tradai.local |
| MLflow | mlflow.tradai.local | tradai.local |
2.4 Backtest Execution Modes (v9.2)

| Mode | Environment Variable | Description | Use Case |
|---|---|---|---|
| local | BACKTEST_MODE=local | Docker SDK local execution | Development, testing |
| ecs | BACKTEST_MODE=ecs | Direct ECS Fargate task launch | Simple production backtests |
| sqs | BACKTEST_MODE=sqs | SQS → Lambda → ECS | High-volume with backpressure |
| stepfunctions | BACKTEST_MODE=stepfunctions | SQS → Lambda → Step Functions | Complex multi-step workflows |
SQS Consumer Lambda Environment:
| Variable | Values | Default | Description |
|---|---|---|---|
| LAUNCH_MODE | ecs, stepfunctions | ecs | How SQS Lambda launches backtests |
| ECS_CLUSTER | cluster ARN | - | Target ECS cluster |
| ECS_TASK_PREFIX | string | strategy- | Task definition prefix |
| BACKTEST_WORKFLOW_ARN | Step Functions ARN | - | For stepfunctions mode |
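To illustrate how the mode and launch variables above might be read, here is a hedged sketch. The enum, the function names, and the decision to reject a missing BACKTEST_MODE are assumptions for illustration; only the variable names, allowed values, and defaults come from this section.

```python
from enum import Enum


class BacktestMode(str, Enum):
    LOCAL = "local"
    ECS = "ecs"
    SQS = "sqs"
    STEPFUNCTIONS = "stepfunctions"


def resolve_backtest_mode(env):
    """Parse BACKTEST_MODE from a mapping such as os.environ."""
    raw = env.get("BACKTEST_MODE")
    try:
        return BacktestMode(raw)
    except ValueError:
        valid = ", ".join(mode.value for mode in BacktestMode)
        raise ValueError(f"BACKTEST_MODE must be one of {valid}; got {raw!r}") from None


def sqs_consumer_settings(env):
    """Read the SQS consumer Lambda variables, applying the documented defaults."""
    launch_mode = env.get("LAUNCH_MODE", "ecs")
    if launch_mode not in ("ecs", "stepfunctions"):
        raise ValueError(f"LAUNCH_MODE must be ecs or stepfunctions; got {launch_mode!r}")

    settings = {
        "launch_mode": launch_mode,
        "ecs_cluster": env.get("ECS_CLUSTER"),
        "ecs_task_prefix": env.get("ECS_TASK_PREFIX", "strategy-"),
        "workflow_arn": env.get("BACKTEST_WORKFLOW_ARN"),
    }
    # Assumption: the workflow ARN is mandatory only in stepfunctions mode.
    if launch_mode == "stepfunctions" and not settings["workflow_arn"]:
        raise ValueError("BACKTEST_WORKFLOW_ARN is required when LAUNCH_MODE=stepfunctions")
    return settings
```

In a service, both functions would typically be called once at startup with `os.environ`, so a misconfigured task fails fast instead of at the first backtest request.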
2.5 Strategy Container Environment Variables

Required (Minimal Bootstrap):
| Variable | Description | Example |
|---|---|---|
| JOB_ID | Unique job identifier | uuid |
| STRATEGY | Strategy class name (or FREQTRADE_STRATEGY as fallback) | PascalStrategy |
| STRATEGY_ID | Unique strategy identifier (required for live/dry-run) | pascal-btc |
| TRADING_MODE | Execution mode | backtest, live, dry-run, train |
| MLFLOW_TRACKING_URI | MLflow endpoint | http://mlflow.tradai.local:5000 |
Note (Issue 10 Fix): Documentation previously used STRATEGY_NAME and STRATEGY_STAGE. The actual env vars are STRATEGY and STRATEGY_ID. Stage is determined by deployment config.
Optional (Infrastructure):
| Variable | Description | Default |
|---|---|---|
| DYNAMODB_TABLE | Workflow state table | tradai-workflow-state |
| ARCTICDB_S3_URI | ArcticDB S3 path | s3://tradai-arcticdb-{env} |
| RESULTS_DIR | Local results path | /freqtrade/user_data/backtest_results |
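The bootstrap rules in this section (STRATEGY with FREQTRADE_STRATEGY as fallback, STRATEGY_ID required for live and dry-run, documented defaults for optional variables) can be expressed as a small loader. This is a sketch under assumptions: `StrategyEnv` and `load_strategy_env` are hypothetical names, and ARCTICDB_S3_URI is left out because its default depends on an `{env}` placeholder that this document does not resolve.

```python
from dataclasses import dataclass
from typing import Optional

VALID_TRADING_MODES = {"backtest", "live", "dry-run", "train"}


@dataclass(frozen=True)
class StrategyEnv:
    job_id: str
    strategy: str
    strategy_id: Optional[str]
    trading_mode: str
    mlflow_tracking_uri: str
    dynamodb_table: str
    results_dir: str


def load_strategy_env(env):
    """Build a StrategyEnv from a mapping such as os.environ."""
    # STRATEGY wins; FREQTRADE_STRATEGY is only a fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    required = {
        "JOB_ID": env.get("JOB_ID"),
        "STRATEGY": strategy,
        "TRADING_MODE": env.get("TRADING_MODE"),
        "MLFLOW_TRACKING_URI": env.get("MLFLOW_TRACKING_URI"),
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")

    mode = required["TRADING_MODE"]
    if mode not in VALID_TRADING_MODES:
        raise ValueError(f"unsupported TRADING_MODE: {mode!r}")

    strategy_id = env.get("STRATEGY_ID")
    if mode in {"live", "dry-run"} and not strategy_id:
        raise ValueError(f"STRATEGY_ID is required when TRADING_MODE={mode}")

    return StrategyEnv(
        job_id=required["JOB_ID"],
        strategy=strategy,
        strategy_id=strategy_id,
        trading_mode=mode,
        mlflow_tracking_uri=required["MLFLOW_TRACKING_URI"],
        dynamodb_table=env.get("DYNAMODB_TABLE", "tradai-workflow-state"),
        results_dir=env.get("RESULTS_DIR", "/freqtrade/user_data/backtest_results"),
    )
```

Collecting all missing variables before raising gives one complete error message instead of a sequence of single-variable failures across container restarts.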
3. Storage Configuration

3.1 S3 Buckets

| Bucket Name | Purpose | Versioning | Encryption | Lifecycle |
|---|---|---|---|---|
| tradai-configs-{env} | Strategy configs, base configs | Enabled | SSE-S3 | None |
| tradai-results-{env} | Backtest results, reports | Enabled | SSE-S3 | Glacier after 30d |
| tradai-arcticdb-{env} | ArcticDB time-series data | Enabled | SSE-S3 | None |
| tradai-logs-{env} | ALB logs, CloudTrail, audit | Disabled | SSE-S3 | Delete after 90d |
| tradai-mlflow-{env} | MLflow artifacts | Enabled | SSE-S3 | None |
3.2 DynamoDB Tables

| Table Name | Partition Key | Sort Key | GSI | TTL |
|---|---|---|---|---|
| tradai-workflow-state | run_id (S) | - | status-created_at-index | ttl (7 days) |
3.3 RDS Configuration

| Parameter | Dev Value | Prod Value |
|---|---|---|
| Instance Class | db.t4g.micro | db.t4g.small |
| Engine | PostgreSQL 15.4 | PostgreSQL 15.4 |
| Storage | 20 GB gp3 | 50 GB gp3 |
| Multi-AZ | false | true |
| Backup Retention | 7 days | 14 days |
| SSL Enforcement | rds.force_ssl=1 | rds.force_ssl=1 |
| Publicly Accessible | false | false |
4. Security Configuration

4.1 Security Groups

ALB Security Group (tradai-alb-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
|---|---|---|---|---|
| Ingress | 80 | TCP | 0.0.0.0/0 | HTTP redirect |
| Ingress | 443 | TCP | 0.0.0.0/0 | HTTPS |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |
ECS Security Group (tradai-ecs-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
|---|---|---|---|---|
| Ingress | 8000 | TCP | ALB SG | Backend API |
| Ingress | 8002 | TCP | ALB SG | Data Collection |
| Ingress | 8003 | TCP | ALB SG | Strategy Service |
| Ingress | 5000 | TCP | ALB SG | MLflow |
| Ingress | 8000-8003 | TCP | Lambda SG | Internal calls |
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 5432 | TCP | RDS SG | Database |
Lambda Security Group (tradai-lambda-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
|---|---|---|---|---|
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |
RDS Security Group (tradai-rds-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
|---|---|---|---|---|
| Ingress | 5432 | TCP | ECS SG | From ECS |
| Ingress | 5432 | TCP | Lambda SG | From Lambda |
NAT Security Group (tradai-nat-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
|---|---|---|---|---|
| Ingress | 0-65535 | TCP | 10.0.11.0/24 | From Private-1 |
| Ingress | 0-65535 | TCP | 10.0.12.0/24 | From Private-2 |
| Egress | 0-65535 | TCP | 0.0.0.0/0 | To internet |
4.2 Cognito Configuration

| Parameter | Value |
|---|---|
| User Pool Name | tradai-users |
| MFA | Required (TOTP) |
| Password Min Length | 12 |
| Password Requirements | Upper, Lower, Number, Symbol |
| Account Recovery | Email only |
| Advanced Security | ENFORCED |
4.3 WAF Configuration

| Rule | Priority | Action | Rate Limit |
|---|---|---|---|
| RateLimitRule | 1 | Block | 100/5min per IP |
| AWSManagedRulesCommonRuleSet | 2 | Block | - |
| AWSManagedRulesKnownBadInputsRuleSet | 3 | Block | - |
| AWSManagedRulesSQLiRuleSet | 4 | Block | - |
5. API Gateway Configuration

5.1 Routes

| Method | Path | Integration | Auth Required |
|---|---|---|---|
| GET | /health | VPC Link -> ALB | No |
| GET | /strategies | VPC Link -> Backend API | Yes |
| GET | /strategies/{id} | VPC Link -> Backend API | Yes |
| POST | /strategies | VPC Link -> Backend API | Yes |
| POST | /backtest | SQS Direct Integration | Yes |
| GET | /backtest/{run_id} | VPC Link -> Backend API | Yes |
| GET | /data/symbols | VPC Link -> Data Collection | Yes |
| GET | /data/freshness | VPC Link -> Data Collection | Yes |
| POST | /data/sync | VPC Link -> Data Collection | Yes |
| ANY | /mlflow/* | VPC Link -> MLflow | Yes |
5.2 Throttling

| Parameter | Value |
|---|---|
| Default Rate Limit | 100 req/sec |
| Default Burst Limit | 200 req |
| Per-Route Override | /backtest: 10 req/sec |
6. SQS Configuration

6.1 Backtest Queue

| Parameter | Value |
|---|---|
| Queue Name | tradai-backtest-queue.fifo |
| FIFO | true |
| Content-Based Deduplication | true |
| Visibility Timeout | 900 seconds (15 minutes) |
| Message Retention | 345600 seconds (4 days) |
| Receive Wait Time | 20 seconds (long polling) |
6.2 Dead Letter Queue

| Parameter | Value |
|---|---|
| Queue Name | tradai-backtest-dlq.fifo |
| FIFO | true |
| Message Retention | 1209600 seconds (14 days) |
| Max Receive Count | 3 |
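The queue parameters above map onto SQS queue attributes, which the SQS API expects as strings. The sketch below shows one way the two attribute sets could be assembled; the function names are hypothetical, and it assumes Max Receive Count is applied as the `maxReceiveCount` of a redrive policy on the backtest queue that targets the dead letter queue. The resulting dictionaries are shaped for the `Attributes` argument of boto3's `create_queue`.

```python
import json


def dlq_attributes():
    """Attributes for tradai-backtest-dlq.fifo (section 6.2)."""
    return {
        "FifoQueue": "true",
        "MessageRetentionPeriod": str(14 * 24 * 3600),  # 1209600 s = 14 days
    }


def backtest_queue_attributes(dlq_arn):
    """Attributes for tradai-backtest-queue.fifo (section 6.1).

    dlq_arn is the ARN of the dead letter queue, supplied by the caller.
    """
    return {
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "VisibilityTimeout": str(15 * 60),              # 900 s = 15 minutes
        "MessageRetentionPeriod": str(4 * 24 * 3600),   # 345600 s = 4 days
        "ReceiveMessageWaitTimeSeconds": "20",          # long polling
        # Assumption: Max Receive Count = 3 belongs in the source queue's
        # redrive policy, pointing at the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 3}
        ),
    }
```

Writing the durations as products keeps the second counts verifiable against the human-readable values in the tables.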
7. Lambda Functions

| Function Name | Runtime | Memory | Timeout | VPC | Trigger |
|---|---|---|---|---|---|
| tradai-validate-strategy | Python 3.11 | 256 MB | 30s | Yes | Step Functions |
| tradai-validate-data | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-sqs-consumer | Python 3.11 | 256 MB | 30s | Yes | SQS |
| tradai-transform-results | Python 3.11 | 512 MB | 120s | Yes | Step Functions |
| tradai-cleanup-resources | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-notify-completion | Python 3.11 | 128 MB | 30s | No | Step Functions |
| tradai-update-nat-routes | Python 3.11 | 128 MB | 30s | No | ASG Lifecycle |
| tradai-orphan-scanner | Python 3.11 | 128 MB | 60s | Yes | EventBridge (15 min) |
8. Secrets Manager

| Secret Name | Contents | Rotation |
|---|---|---|
| tradai/database-url | PostgreSQL connection string | Manual (Phase 2: Auto) |
| tradai/mlflow-credentials | MLflow tracking auth | Manual |
| tradai/binance-api | Binance API key/secret | Manual |
| tradai/jwt-secret | JWT signing key | Manual |
9. Naming Conventions

9.1 Resource Naming Pattern

{project}-{component}-{environment}[-{suffix}]
Examples:
- tradai-backend-api-prod
- tradai-vpc-dev
- tradai-backtest-queue-staging.fifo
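A small helper can generate names that follow this pattern. The sketch below is illustrative only: `resource_name` and its parameters are hypothetical, and treating `.fifo` as a separate flag (rather than as the `-{suffix}` part) is an assumption made so that the third example above can be reproduced.

```python
ENVIRONMENTS = ("dev", "staging", "prod")


def resource_name(component, environment, suffix=None, project="tradai", fifo=False):
    """Build a name of the form {project}-{component}-{environment}[-{suffix}]."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")

    parts = [project, component, environment]
    if suffix:
        parts.append(suffix)
    name = "-".join(parts)

    # SQS FIFO queue names must end in ".fifo"; handled outside the pattern.
    return f"{name}.fifo" if fifo else name


if __name__ == "__main__":
    print(resource_name("backend-api", "prod"))
    print(resource_name("vpc", "dev"))
    print(resource_name("backtest-queue", "staging", fifo=True))
```

Centralizing name construction in one function keeps Pulumi resources from drifting away from the pattern one string literal at a time.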
9.2 Tag Requirements

All resources MUST have these tags:
| Tag Key | Required | Example Values |
|---|---|---|
| Application | Yes | tradai |
| Environment | Yes | dev, staging, prod |
| Service | If applicable | backend-api, mlflow |
| ManagedBy | Yes | pulumi |
| CostCenter | Yes | trading-platform |
| Owner | Yes | platform-team |
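The tag requirements lend themselves to an automated check. The following sketch uses hypothetical names (`REQUIRED_TAGS`, `tag_problems`) and assumes that the Environment examples dev, staging, and prod form the complete set of allowed values; Service is skipped because it is required only where applicable.

```python
REQUIRED_TAGS = ("Application", "Environment", "ManagedBy", "CostCenter", "Owner")

# Assumption: the example values for Environment are exhaustive.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def tag_problems(tags):
    """Return a list of tagging problems for one resource; empty means compliant."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]

    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems


if __name__ == "__main__":
    example = {
        "Application": "tradai",
        "Environment": "prod",
        "ManagedBy": "pulumi",
        "CostCenter": "trading-platform",
        "Owner": "platform-team",
    }
    print(tag_problems(example) or "tags OK")
```

Returning a list rather than raising on the first failure lets a caller report every untagged resource in a single pass.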
10. Environment Differences

| Component | Dev | Staging | Prod |
|---|---|---|---|
| RDS Multi-AZ | No | No | Yes |
| RDS Instance | db.t4g.micro | db.t4g.micro | db.t4g.small |
| ECS Desired Count | 1 | 1 | 2 |
| Log Retention | 7 days | 14 days | 30 days |
| Deletion Protection | No | No | Yes |
| WAF | Optional | Yes | Yes |
| CloudTrail | Optional | Yes | Yes |
11. Validation Checklist

Before deployment, verify:
- [ ] All subnet CIDRs match this document
- [ ] All service ports match this document
- [ ] All S3 bucket names follow the pattern
- [ ] Security groups reference correct CIDRs
- [ ] DynamoDB uses run_id as partition key
- [ ] All resources have required tags
- [ ] Secrets exist in Secrets Manager
- [ ] SSL enforcement enabled on RDS

Document Control:
- This document supersedes any conflicting values in other architecture documents
- Changes require review and version increment
- Pulumi config.py should import values from this spec