TradAI Final Architecture - Implementation Roadmap

Version: 9.2 | Date: 2025-12-09

Timeline Overview

Weeks 1-22, by phase:

- Phase 1: Infrastructure, Weeks 1-3
- Phase 2: Core Services, Weeks 4-7
- Phase 3: Orchestration, Weeks 8-10
- Phase 4: Testing, Weeks 11-12
- Phase 5: Production, Weeks 13-16
- Phase 6: Live Trading (v9.1), Weeks 17-22

| Phase | Duration | Focus | Deliverables |
|-------|----------|-------|--------------|
| Phase 1 | Weeks 1-3 | Infrastructure Foundation | VPC, Security, Storage |
| Phase 2 | Weeks 4-7 | Core Services | ECS, Lambda, API Gateway |
| Phase 3 | Weeks 8-10 | Orchestration | Step Functions, SQS, DynamoDB |
| Phase 4 | Weeks 11-12 | Testing & Hardening | Integration, Load, Security |
| Phase 5 | Weeks 13-16 | Production & Buffer | Deploy, Monitor, Iterate |
| Phase 6 | Weeks 17-22 | Live Trading | ECS Services, MLflow Config, Health Monitoring |

Phase 1: Infrastructure Foundation (Weeks 1-3)

Week 1: Networking

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | VPC creation | VPC with 10.0.0.0/16 CIDR | 2h |
| 1 | Subnets | 6 subnets across 2 AZs | 2h |
| 2 | Internet Gateway | IGW attached to VPC | 1h |
| 2 | Route Tables | Public, Private, Database routes | 2h |
| 3 | NAT Instance | t4g.nano with userdata | 3h |
| 3 | NAT ASG | Auto Scaling Group for HA | 2h |
| 4 | Security Groups | ALB, ECS, Lambda, RDS, NAT | 3h |
| 5 | NACLs | Public, Private, Database ACLs | 2h |
| 5 | VPC Endpoints | S3, DynamoDB gateway endpoints | 1h |

Week 1 Checklist:

- [ ] VPC created with DNS enabled
- [ ] 6 subnets created (2 public, 2 private, 2 database)
- [ ] Internet Gateway attached
- [ ] Route tables configured
- [ ] NAT Instance running with source/dest check disabled
- [ ] All security groups created
- [ ] NACLs applied to subnets
- [ ] Gateway endpoints for S3 and DynamoDB

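
The Week 1 subnet layout (a 10.0.0.0/16 VPC split into public, private and database subnets across 2 AZs) can be sanity-checked before any infrastructure code is written. The sketch below is illustrative only: the /24 subnet size, the AZ suffixes `a` and `b`, and the `plan_subnets` helper are assumptions, not values taken from the roadmap.

```python
import ipaddress

# VPC CIDR from the Week 1 table.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Tier names follow the checklist; AZ suffixes are hypothetical.
TIERS = ["public", "private", "database"]
AZS = ["a", "b"]


def plan_subnets(vpc=VPC_CIDR, new_prefix=24):
    """Assign consecutive blocks of the VPC to each tier/AZ pair.

    The /24 default is an assumption chosen for illustration.
    """
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of subnets
    plan = {}
    for tier in TIERS:
        for az in AZS:
            plan[f"{tier}-{az}"] = next(blocks)
    return plan


if __name__ == "__main__":
    for name, network in plan_subnets().items():
        print(f"{name:12s} {network}")
```

Allocating from a single generator guarantees the six blocks never overlap, which is the property worth checking before the CIDRs are committed to infrastructure code.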
Week 2: Security & Database

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Cognito User Pool | User pool with MFA required | 2h |
| 1 | Cognito App Client | API client configuration | 1h |
| 2 | IAM Roles | ECS execution, task, Lambda roles | 3h |
| 2 | IAM Policies | Least-privilege policies | 2h |
| 3 | Secrets Manager | 5 secrets (MLflow, DB, Binance) | 2h |
| 3 | RDS PostgreSQL | db.t4g.micro in database subnet | 3h |
| 4 | S3 Buckets | configs, results, arcticdb, logs | 2h |
| 4 | S3 Encryption | SSE-S3 on all buckets | 1h |
| 4 | S3 Lifecycle | Temp cleanup, Glacier archival | 1h |
| 5 | DynamoDB Table | workflow-state with GSI | 2h |
| 5 | ECR Repositories | 4 repos (backend, data, mlflow, strategy) | 1h |

Week 2 Checklist:

- [ ] Cognito user pool with MFA required
- [ ] All IAM roles and policies created
- [ ] Secrets stored in Secrets Manager
- [ ] RDS PostgreSQL running (verify connection)
- [ ] S3 buckets with encryption and lifecycle
- [ ] DynamoDB table with TTL enabled
- [ ] ECR repositories created

Week 3: Load Balancing & Audit

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | ACM Certificate | Request and validate SSL cert | 2h |
| 1 | ALB Creation | Internet-facing ALB in public subnets | 2h |
| 2 | Target Groups | Backend, Data Collection, MLflow | 2h |
| 2 | Listener Rules | HTTPS routing rules | 2h |
| 3 | HTTP Redirect | Port 80 → 443 redirect | 1h |
| 3 | WAF Web ACL | Rate limiting, managed rules | 3h |
| 4 | CloudTrail | Audit trail to S3 | 2h |
| 4 | VPC Flow Logs | Flow logs to CloudWatch | 1h |
| 5 | CloudWatch Log Groups | Create all required log groups | 1h |
| 5 | Phase 1 Testing | Verify all infrastructure | 3h |

Week 3 Checklist:

- [ ] SSL certificate validated
- [ ] ALB with HTTPS listener
- [ ] Target groups for all services
- [ ] HTTP redirect working
- [ ] WAF rules active
- [ ] CloudTrail logging to S3
- [ ] VPC Flow Logs enabled
- [ ] All log groups created
- [ ] Infrastructure smoke test passed

Phase 2: Core Services (Weeks 4-7)

Week 4: Backend API Service

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Docker Build | Backend API Dockerfile | 2h |
| 1 | ECR Push | Push image to ECR | 1h |
| 2 | Task Definition | ECS task definition | 2h |
| 2 | Service Definition | ECS service with ALB | 2h |
| 3 | Health Endpoints | /health and /ready | 2h |
| 3 | Circuit Breaker | Implement resilience pattern | 3h |
| 4 | API Routes | strategies, backtest, data routes | 4h |
| 5 | Integration Test | Test ALB → ECS → responds | 2h |
| 5 | Fix Issues | Address any deployment issues | 2h |

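
The Week 4 "Circuit Breaker" task names the resilience pattern but not its shape. As a minimal sketch of the generic pattern (not the project's actual implementation), a breaker counts consecutive failures, rejects calls while open, and allows a single trial call after a cool-down. The class name, thresholds and injected clock below are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Generic circuit breaker; thresholds are illustrative assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"       # cool-down elapsed: allow one trial
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if (self._failures >= self.failure_threshold
                    or self._opened_at is not None):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A service-to-service call would then be wrapped as `breaker.call(fetch, url)`, where `fetch` stands for whatever client function the service already uses.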
Week 5: Data Collection & MLflow

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Data Collection Docker | Dockerfile + ArcticDB client | 2h |
| 1 | Data Collection Deploy | ECS service deployment | 2h |
| 2 | MLflow Docker | MLflow server Dockerfile | 2h |
| 2 | MLflow Deploy | ECS service with RDS backend | 3h |
| 3 | Service Discovery | Cloud Map for internal DNS | 2h |
| 3 | Internal Communication | Verify service-to-service calls | 2h |
| 4 | Data Freshness API | Implement /data/freshness | 3h |
| 5 | MLflow Verification | Create test experiment | 2h |
| 5 | Integration Test | Full service communication test | 2h |

Week 6: Lambda Functions

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Lambda: validate-strategy | Implementation + deploy | 2h |
| 1 | Lambda: data-collection-proxy | Implementation + deploy | 2h |
| 2 | Lambda: sqs-consumer | Implementation + deploy | 3h |
| 2 | Lambda: transform-results | Implementation + deploy | 2h |
| 3 | Lambda: cleanup-resources | Implementation + deploy | 2h |
| 3 | Lambda: notify-completion | Implementation + deploy | 2h |
| 4 | VPC Configuration | Lambda VPC access | 2h |
| 4 | Lambda Testing | Test each function | 3h |
| 5 | Lambda Permissions | Verify IAM permissions | 2h |
| 5 | Integration Test | Lambda → ECS communication | 2h |

Week 7: AWS API Gateway

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | HTTP API Creation | Create API Gateway | 1h |
| 1 | JWT Authorizer | Cognito integration | 2h |
| 2 | VPC Link | Connect to ALB | 2h |
| 2 | Backend Routes | /strategies, /data, /health | 2h |
| 3 | SQS Integration | Direct integration for /backtest | 3h |
| 3 | MLflow Routes | /mlflow/* proxy | 1h |
| 4 | Rate Limiting | Configure throttling | 1h |
| 4 | CORS | Configure CORS headers | 1h |
| 4 | Custom Domain | api.tradai.smartml.me | 2h |
| 5 | API Testing | Test all routes | 3h |
| 5 | Documentation | OpenAPI spec | 2h |

Phase 2 Checklist:

- [ ] Backend API service running and healthy
- [ ] Data Collection service running
- [ ] MLflow service running with RDS backend
- [ ] All 6 Lambda functions deployed
- [ ] AWS API Gateway with all routes
- [ ] JWT authorization working
- [ ] Service-to-service communication verified

Phase 3: Orchestration (Weeks 8-10)

Week 8: SQS & Step Functions

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Backtest Queue | FIFO queue creation | 1h |
| 1 | Dead Letter Queue | DLQ configuration | 1h |
| 2 | DLQ Alarm | CloudWatch alarm for DLQ | 1h |
| 2 | SQS Lambda Trigger | Event source mapping | 2h |
| 3 | Backtest Workflow | Step Functions state machine | 4h |
| 4 | Workflow Testing | Test happy path | 3h |
| 5 | Error Handling | Implement catch/retry | 3h |
| 5 | DynamoDB Integration | State updates from workflow | 2h |

Week 9: Workflow Refinement

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Parallel Validation | Implement parallel states | 2h |
| 1 | Choice States | Validation decision logic | 2h |
| 2 | Data Sync Workflow | Create data sync state machine | 3h |
| 2 | Deploy Workflow | Create deploy state machine (stub) | 2h |
| 3 | Nested Workflows | Data sync from backtest | 2h |
| 3 | Timeout/Heartbeat | Configure for long tasks | 2h |
| 4 | End-to-End Test | Full backtest flow | 4h |
| 5 | Fix Issues | Address workflow issues | 4h |

Week 10: Strategy Tasks & Backtest Executors (v9.2)

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Container Entrypoint Script | Python script with DynamoDB/MLflow/S3 | 4h |
| 1 | Strategy Container Docker | Freqtrade container with entrypoint | 3h |
| 2 | ECSBacktestExecutor | Direct ECS task launch | 3h |
| 2 | LocalBacktestExecutor | Docker SDK implementation | 3h |
| 3 | SQS Lambda Handler | Lambda with LAUNCH_MODE support | 3h |
| 3 | StepFunctionsExecutor | Step Functions integration | 2h |
| 4 | Executor Factory | Dependency injection setup | 2h |
| 4 | ECS Task Definitions | Terraform/Pulumi for strategies | 3h |
| 5 | Full Backtest Test | End-to-end with all modes | 4h |
| 5 | Performance Tuning | Optimize task resources | 2h |

Phase 3 Checklist:

- [ ] SQS queues (main + DLQ) created
- [ ] DLQ alarm configured
- [ ] Backtest workflow deployed (simplified v9.2)
- [ ] Data sync workflow deployed
- [ ] Parallel validation working
- [ ] Container entrypoint script handles status/MLflow/S3
- [ ] ECSBacktestExecutor implemented
- [ ] LocalBacktestExecutor implemented
- [ ] SQS Lambda handler supports LAUNCH_MODE
- [ ] Executor factory wired in dependencies.py
- [ ] Fargate Spot enabled
- [ ] End-to-end backtest successful (all 4 modes)

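
The Week 10 executor factory selects a backtest executor according to `LAUNCH_MODE`. One possible shape is sketched below. The executor class names come from the roadmap, but everything else is an assumption made for illustration: the mode strings (`ecs`, `local`, `stepfunctions`), the `submit` method, the `local` default and the stub bodies. Only the three executors named in the roadmap are covered, and no AWS or Docker calls are made.

```python
import os
from typing import Protocol


class BacktestExecutor(Protocol):
    """Hypothetical common interface for all executors."""

    def submit(self, job_id: str) -> str: ...


class ECSBacktestExecutor:
    def submit(self, job_id: str) -> str:
        return f"ecs:{job_id}"            # stub: would launch an ECS task


class LocalBacktestExecutor:
    def submit(self, job_id: str) -> str:
        return f"local:{job_id}"          # stub: would use the Docker SDK


class StepFunctionsExecutor:
    def submit(self, job_id: str) -> str:
        return f"stepfunctions:{job_id}"  # stub: would start an execution


# Mode strings are assumptions; the roadmap does not list the values.
_EXECUTORS = {
    "ecs": ECSBacktestExecutor,
    "local": LocalBacktestExecutor,
    "stepfunctions": StepFunctionsExecutor,
}


def create_executor(mode=None) -> BacktestExecutor:
    """Build an executor from an explicit mode or the LAUNCH_MODE env var."""
    resolved = (mode or os.environ.get("LAUNCH_MODE", "local")).lower()
    try:
        return _EXECUTORS[resolved]()
    except KeyError:
        raise ValueError(f"unknown LAUNCH_MODE: {resolved!r}") from None
```

Keeping the mapping in one dictionary means a new launch mode is a one-line addition, and the same factory can be registered as a dependency provider in dependencies.py.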
Phase 4: Testing & Hardening (Weeks 11-12)

Week 11: Testing

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Unit Tests | Service unit tests | 4h |
| 1 | Integration Tests | Service integration tests | 4h |
| 2 | API Tests | API Gateway endpoint tests | 3h |
| 2 | Workflow Tests | Step Functions test cases | 3h |
| 3 | Load Testing | 10 concurrent backtests | 4h |
| 4 | Failure Testing | Test error scenarios | 4h |
| 5 | Security Testing | OWASP checks, IAM review | 4h |

Week 12: Hardening

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | CloudWatch Alarms | Error rate, latency alarms | 2h |
| 1 | CloudWatch Dashboard | Operational dashboard | 2h |
| 2 | Runbooks | Operational runbooks | 3h |
| 2 | Alert Routing | SNS topics, Slack integration | 2h |
| 3 | Backup Verification | RDS snapshots, S3 versioning | 2h |
| 3 | DR Testing | Simulate failures | 3h |
| 4 | Performance Baseline | Document baseline metrics | 2h |
| 4 | Documentation | Update all docs | 3h |
| 5 | Stakeholder Demo | Demo to team | 2h |
| 5 | Go/No-Go Review | Production readiness review | 2h |

Phase 4 Checklist:

- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Load test: 10 concurrent backtests successful
- [ ] Error scenarios handled gracefully
- [ ] Security review completed
- [ ] CloudWatch alarms configured
- [ ] Dashboard created
- [ ] Runbooks documented
- [ ] DR test successful
- [ ] Stakeholder approval

Phase 5: Production & Buffer (Weeks 13-16)

Week 13: Production Deployment

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Production VPC | Deploy to production account | 4h |
| 2 | Production Services | Deploy all services | 4h |
| 3 | DNS Cutover | Point api.tradai.smartml.me | 1h |
| 3 | Smoke Testing | Verify production | 3h |
| 4 | Monitoring | Verify all metrics flowing | 2h |
| 5 | User Acceptance | Real user testing | 4h |

Weeks 14-16: Stabilization & Buffer

| Focus | Tasks |
|-------|-------|
| Bug Fixes | Address issues found in production |
| Performance | Tune based on real usage patterns |
| Cost | Verify actual costs match estimates |
| Documentation | Update based on learnings |
| Training | Train team on operations |
| Handoff | Complete knowledge transfer |

Phase 6: Live Trading (Weeks 17-22) - v9.1

Prerequisite: Phases 1-5 complete, with a stable backtesting platform.

See 11-LIVE-TRADING.md for detailed architecture.
Week 17-18: Live Trading Foundation

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Strategy Container Dockerfile | Multi-mode Dockerfile | 4h |
| 1 | TRADING_MODE env handling | Mode switching logic | 2h |
| 2 | StrategyConfigLoader | MLflow + S3 config loading | 4h |
| 2 | Integration with existing patterns | Use MLflowAdapter, ConfigMergeService | 2h |
| 3 | ArcticDBWarmupLoader | Historical data warmup | 4h |
| 4 | DynamoDB state manager | Live session state handling | 3h |
| 4 | Health reporter component | Heartbeat + metrics | 3h |
| 5 | ECS Service definitions | Non-Spot service configs | 4h |
| 5 | Secrets Manager setup | Exchange API credentials | 2h |

Week 17-18 Checklist:

- [ ] Strategy container supports all modes via TRADING_MODE
- [ ] Config loaded from MLflow tags + S3 (minimal env vars)
- [ ] ArcticDB warmup working (direct S3 access)
- [ ] DynamoDB live session schema deployed
- [ ] Health reporter publishing to DynamoDB
- [ ] ECS Service definitions ready (non-Spot)
- [ ] Exchange credentials in Secrets Manager

Week 19-20: MLflow Model Lifecycle

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | Update StrategyMetadata | Add live trading tags | 2h |
| 1 | Update to_mlflow_tags() | Include pairs, warmup_days, etc. | 2h |
| 2 | MLflow model serving | Load models for inference | 4h |
| 2 | FreqAI inference mode | Disable retraining for live | 3h |
| 3 | Model promotion workflow | Staging → Production | 3h |
| 4 | Model versioning tests | Verify version pinning works | 3h |
| 5 | Integration testing | End-to-end config loading | 4h |

Week 19-20 Checklist:

- [ ] StrategyMetadata includes all live trading fields
- [ ] to_mlflow_tags() outputs complete tag set
- [ ] FreqAI models load from MLflow (no retraining)
- [ ] Model promotion workflow documented
- [ ] Config hot-reload tested (stage change)

Week 21-22: Monitoring & Go-Live

| Day | Task | Deliverable | Effort |
|-----|------|-------------|--------|
| 1 | EventBridge rules | 5-min health check schedule | 2h |
| 1 | Lambda health-check | Session health monitor | 3h |
| 2 | CloudWatch alarms | Heartbeat, exchange, PnL | 3h |
| 2 | SNS alerting | Alert channels configured | 2h |
| 3 | Operational runbooks | Start/stop/pause procedures | 4h |
| 4 | Dry-run deployment | Paper trading first strategy | 4h |
| 4 | Dry-run validation | Verify no real orders | 2h |
| 5 | Live deployment | First live strategy | 3h |
| 5 | Go-live monitoring | Active monitoring first 24h | 4h |

Week 21-22 Checklist:

- [ ] EventBridge + Lambda health monitoring deployed
- [ ] CloudWatch alarms firing correctly
- [ ] SNS alerts received
- [ ] Runbooks documented and tested
- [ ] Dry-run validation complete (no real orders)
- [ ] First live strategy deployed
- [ ] 24-hour monitoring successful

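
The health-check Lambda runs on a 5-minute EventBridge schedule and watches session heartbeats. The core decision, whether a session's last heartbeat is too old, can be sketched as a pure function. The record layout (`session_id`, `last_heartbeat`), the tolerance of two missed checks, and the function name are assumptions; the roadmap fixes only the 5-minute interval.

```python
from datetime import datetime, timedelta, timezone

CHECK_INTERVAL = timedelta(minutes=5)   # schedule from the Week 21-22 table
MISSED_CHECKS_ALLOWED = 2               # assumption: tolerate two missed beats


def stale_sessions(sessions, now,
                   max_age=CHECK_INTERVAL * MISSED_CHECKS_ALLOWED):
    """Return IDs of sessions whose last heartbeat is older than max_age.

    `sessions` is a list of dicts with hypothetical keys `session_id`
    and `last_heartbeat` (a timezone-aware datetime).
    """
    return [
        s["session_id"]
        for s in sessions
        if now - s["last_heartbeat"] > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        {"session_id": "fresh", "last_heartbeat": now - timedelta(minutes=2)},
        {"session_id": "silent", "last_heartbeat": now - timedelta(minutes=30)},
    ]
    print(stale_sessions(demo, now))
```

Passing `now` in explicitly keeps the function deterministic, so the alarm threshold can be unit-tested without waiting on a real clock.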
Phase 6 Cost Addition

| Per Strategy | Monthly |
|--------------|---------|
| ECS Fargate (24/7) | $15.33 |
| Monitoring overhead | $1.68 |
| Reserve (exchange fees) | $29.00 |
| Total per strategy | ~$46 |

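
The per-strategy total can be checked, and projected for several strategies, with a few lines of arithmetic. The line items are the figures from the cost table; the `monthly_cost` helper and the assumption that cost scales linearly with the number of strategies are illustrative only.

```python
from decimal import Decimal

# Monthly per-strategy line items from the Phase 6 cost table (USD).
PER_STRATEGY_MONTHLY = {
    "ECS Fargate (24/7)": Decimal("15.33"),
    "Monitoring overhead": Decimal("1.68"),
    "Reserve (exchange fees)": Decimal("29.00"),
}


def monthly_cost(strategies: int) -> Decimal:
    """Total added monthly cost, assuming linear scaling per strategy."""
    per_strategy = sum(PER_STRATEGY_MONTHLY.values(), Decimal("0"))
    return per_strategy * strategies


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} strategies: ${monthly_cost(n)}")
```

The exact sum of the three items is $46.01, which the table rounds to ~$46.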
Priority Matrix

P0 - Critical (Must Complete Week 1)

| # | Task | Impact | Effort |
|---|------|--------|--------|
| 1 | VPC with private subnets | Security foundation | 4h |
| 2 | NAT Instance | Enables internet access | 3h |
| 3 | Security Groups | Network security | 3h |
| 4 | RDS PostgreSQL | MLflow dependency | 3h |
| 5 | S3 with encryption | Storage foundation | 2h |

P1 - High Priority (Week 2-4)

| # | Task | Impact | Effort |
|---|------|--------|--------|
| 6 | Cognito + JWT | Authentication | 3h |
| 7 | Backend API Service | Core functionality | 8h |
| 8 | MLflow Service | Experiment tracking | 5h |
| 9 | CloudTrail | Audit compliance | 2h |
| 10 | Lambda functions | Workflow utilities | 12h |

P2 - Medium Priority (Week 5-8)

| # | Task | Impact | Effort |
|---|------|--------|--------|
| 11 | Data Collection Service | Data operations | 4h |
| 12 | AWS API Gateway | Managed API | 8h |
| 13 | Step Functions workflows | Orchestration | 12h |
| 14 | SQS + DLQ | Async processing | 4h |
| 15 | Strategy Tasks | Backtest execution | 8h |

P3 - Lower Priority (Week 9+)

| # | Task | Impact | Effort |
|---|------|--------|--------|
| 16 | WAF | Additional security | 3h |
| 17 | CloudWatch Dashboard | Observability | 2h |
| 18 | Fargate Spot | Cost optimization | 2h |
| 19 | Reserved capacity | Cost optimization | 1h |
| 20 | Performance tuning | Optimization | 4h |

Dependencies Graph

Week 1
├── VPC ───────────────────────┐
├── Subnets ──────────────────┤
├── NAT Instance ─────────────┤
└── Security Groups ──────────┤
│
Week 2 │
├── RDS ◄─────────────────────┤
├── S3 ◄──────────────────────┤
├── Secrets Manager ◄─────────┤
├── Cognito ─────────────────┐│
└── IAM Roles ───────────────┤│
││
Week 3 ││
├── ALB ◄────────────────────┼┤
├── CloudTrail ◄─────────────┘│
└── WAF ◄─────────────────────┘
│
Week 4-5 │
├── Backend API ◄─────────────┤
├── Data Collection ◄─────────┤
└── MLflow ◄──────────────────┤
│
Week 6 │
└── Lambda Functions ◄────────┤
│
Week 7 │
└── API Gateway ◄─────────────┤ (depends on Cognito, ALB)
│
Week 8-10 │
├── SQS ◄─────────────────────┤
├── Step Functions ◄──────────┤ (depends on Lambda, ECS)
└── Strategy Tasks ◄──────────┘
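
A dependency graph like the one above can be turned into a valid build order with a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+). Only two edges are stated explicitly in the graph (API Gateway depends on Cognito and ALB; Step Functions depends on Lambda and ECS); the remaining edges, the "ECS Services" node name and the `build_order` helper are assumptions inferred from the week ordering.

```python
from graphlib import TopologicalSorter

# Maps each component to the components it depends on.
DEPENDS_ON = {
    "Subnets": {"VPC"},                                      # assumption
    "Security Groups": {"VPC"},                              # assumption
    "ALB": {"Subnets", "Security Groups"},                   # assumption
    "ECS Services": {"ALB"},                                 # assumption
    "Lambda Functions": {"Subnets"},                         # assumption
    "API Gateway": {"Cognito", "ALB"},                       # stated in graph
    "Step Functions": {"Lambda Functions", "ECS Services"},  # stated in graph
}


def build_order(deps):
    """Return components in an order where dependencies come first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, component in enumerate(build_order(DEPENDS_ON), start=1):
        print(step, component)
```

Several orders can satisfy the same graph, so the output should be read as one valid sequence rather than the only one.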
Risk Mitigation

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| VPC configuration errors | Medium | High | Test with simple EC2 first |
| NAT Instance failure | Low | High | ASG + health checks |
| Step Functions complexity | Medium | Medium | Start with simple workflow |
| Integration issues | Medium | Medium | Test each connection |
| Timeline slip | Medium | Medium | 4-week buffer built in |

Success Criteria

| Metric | Target | Verification |
|--------|--------|--------------|
| Infrastructure deployed | 100% | All resources in AWS |
| Services healthy | All passing health checks | CloudWatch |
| API responds | < 500ms p99 | Load test |
| Backtest completes | > 95% success | Step Functions metrics |
| Cost within budget | < $100/month | Cost Explorer |
| Security score | > 90% | Security Hub |

Next Steps

1. Review 09-PULUMI-CODE.md for infrastructure code
2. Set up Pulumi project
3. Begin Phase 1 implementation
4. After Phases 1-5: Review 11-LIVE-TRADING.md for live trading
5. Begin Phase 6 implementation (live trading)