TradAI Final Architecture - Executive Summary
Version: 8.0 | Date: 2025-11-27 | Status: APPROVED FOR IMPLEMENTATION
Overview
TradAI is a quantitative trading platform consisting of three microservices for backtesting trading strategies using the Freqtrade framework. This document summarizes the final architecture after a comprehensive review by five specialist agents.
Assessment Summary
Overall Score: 8.4/10 - PRODUCTION READY
| Dimension | Before (v5.1) | After (v8.0) | Improvement |
|---|---|---|---|
| VPC/Networking | 4/10 | 8.5/10 | +113% |
| Security | 3/10 | 8.0/10 | +167% |
| Step Functions | 5/10 | 8.5/10 | +70% |
| Cost Optimization | 7/10 | 9.0/10 | +29% |
| Service Architecture | 6/10 | 8.0/10 | +33% |
| OVERALL | 5/10 | 8.4/10 | +68% |
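The Improvement column is the relative change from the v5.1 score to the v8.0 score. A minimal sketch of that calculation, with the scores copied from the table above; the function name `improvement_pct` is illustrative and not part of the project:

```python
import math


def improvement_pct(before: float, after: float) -> int:
    """Relative change from `before` to `after` as a whole percent.

    Rounds halves up (112.5 -> 113); Python's built-in round() would
    round 112.5 down to 112 because it rounds halves to even.
    """
    return math.floor((after - before) * 100 / before + 0.5)


# (v5.1 score, v8.0 score) pairs copied from the assessment table.
scores = {
    "VPC/Networking": (4, 8.5),
    "Security": (3, 8.0),
    "Step Functions": (5, 8.5),
    "Cost Optimization": (7, 9.0),
    "Service Architecture": (6, 8.0),
    "OVERALL": (5, 8.4),
}

for dimension, (before, after) in scores.items():
    print(f"{dimension}: +{improvement_pct(before, after)}%")
```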
Budget Impact
Original Estimate (v5.1): $109/month (incomplete, missing components)
Corrected Baseline: $220/month (full infrastructure)
Final Optimized (v8.0): $78/month (with all optimizations)
─────────────────────────────────────
Savings vs Baseline: $142/month (64% reduction)
Critical Issues Resolved
Blockers Fixed (Were Breaking Production)
| # | Issue | Impact | Resolution |
|---|---|---|---|
| 1 | Step Functions EXPRESS type | Backtests > 5 min fail | Changed to STANDARD |
| 2 | HTTP invoke from Step Functions | Cannot call internal services | Lambda proxy pattern |
| 3 | No VPC defined | Services exposed publicly | Full VPC with private subnets |
| 4 | No authentication | APIs unprotected | Cognito + JWT + API Gateway |
| 5 | Combined backend service | Cannot scale independently | Separated into 2 services |
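Blocker 2 is resolved with a Lambda proxy: Step Functions invokes a Lambda function attached to the VPC, and that function forwards the call to the internal service over HTTP. The sketch below shows one way this could look using only the standard library. The base URL, the event fields (`path`, `method`, `payload`), and the injectable `opener` argument are assumptions made for illustration, not the project's actual contract.

```python
import json
import os
import urllib.request

# Hypothetical internal endpoint; a real deployment would set this
# through the Lambda environment configuration.
INTERNAL_BASE_URL = os.environ.get("INTERNAL_BASE_URL", "http://backend.internal:8000")


def build_request(event: dict) -> urllib.request.Request:
    """Translate a Step Functions task input into an HTTP request."""
    payload = event.get("payload")
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=INTERNAL_BASE_URL.rstrip("/") + event.get("path", "/"),
        data=body,
        method=event.get("method", "POST"),
        headers={"Content-Type": "application/json"},
    )


def handler(event, context, opener=urllib.request.urlopen):
    """Lambda entry point: forward the request and return the parsed reply.

    `opener` is injectable so the function can be exercised without a network.
    """
    request = build_request(event)
    with opener(request, timeout=10) as response:
        raw = response.read()
        return {
            "statusCode": response.status,
            "body": json.loads(raw) if raw else {},
        }
```

An HTTP error status raises `urllib.error.HTTPError`, so the Lambda invocation fails and the state machine's own retry rules decide what happens next.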
Security Gaps Closed
| # | Gap | Risk Level | Resolution |
|---|---|---|---|
| 1 | No S3 encryption | HIGH | SSE-S3 enabled on all buckets |
| 2 | No CloudTrail | HIGH | Enabled for audit compliance |
| 3 | MFA optional | MEDIUM | MFA required for all users |
| 4 | HTTP:80 exposed | MEDIUM | Redirect-only, no direct traffic |
| 5 | Missing NACLs | MEDIUM | Stateless firewall rules added |
| 6 | No WAF | MEDIUM | Rate limiting + managed rules |
| 7 | Egress too permissive | LOW | Specific CIDR blocks only |
Cost Optimizations Applied
| # | Optimization | Before | After | Savings |
|---|---|---|---|---|
| 1 | NAT Instance vs Gateway | $64.80 | $6.13 | $58.67/mo |
| 2 | Remove VPC Endpoints | $28.00 | $0 | $28.00/mo |
| 3 | RDS Single-AZ (dev) | $36.00 | $18.00 | $18.00/mo |
| 4 | Fargate Spot for tasks | $7.67 | $1.92 | $5.75/mo |
| 5 | CloudWatch Logs 7d | $10.00 | $5.00 | $5.00/mo |
| 6 | S3 Lifecycle policies | $2.50 | $1.00 | $1.50/mo |
| 7 | Reserved RDS (1yr) | - | - | $6.00/mo |
| | TOTAL | | | $122.92/mo |
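The TOTAL row is the sum of the Savings column. A short sketch that reproduces it, using `Decimal` so the cents add up exactly; the values are copied from the table above:

```python
from decimal import Decimal

# Monthly savings per optimization, copied from the table above.
monthly_savings = {
    "NAT Instance vs Gateway": Decimal("58.67"),
    "Remove VPC Endpoints": Decimal("28.00"),
    "RDS Single-AZ (dev)": Decimal("18.00"),
    "Fargate Spot for tasks": Decimal("5.75"),
    "CloudWatch Logs 7d": Decimal("5.00"),
    "S3 Lifecycle policies": Decimal("1.50"),
    "Reserved RDS (1yr)": Decimal("6.00"),
}

total = sum(monthly_savings.values(), Decimal("0"))
print(f"Total savings: ${total}/mo")  # Total savings: $122.92/mo
```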
Architecture Pattern: Hybrid
Long-Running Services (24/7):
- Backend API Service (FastAPI) - $14.60/mo
- MLflow Service (Experiment Tracking) - $14.60/mo
On-Demand Tasks (Pay-per-use):
- Strategy Tasks (Fargate Spot) - $1.92/mo
- Backtest Containers - Included above
Serverless (Fully Managed):
- AWS API Gateway - $3.50/mo
- Lambda Functions (8) - $0 (free tier)
- Step Functions (Standard) - $0.50/mo
- SQS + DLQ - $0.40/mo
Why Hybrid?
- 64% cost savings vs all long-running
- Fast response for quick operations (no cold start)
- Pay-per-use for expensive backtest operations
- Clear workflow orchestration with Step Functions
Key Architectural Decisions
1. AWS API Gateway over Custom FastAPI
Decision: Use managed AWS API Gateway instead of custom FastAPI service
| Aspect | Custom FastAPI | AWS API Gateway |
|---|---|---|
| Cost | $14.60/mo | $0.27/mo |
| Maintenance | High (patching, updates) | Zero (fully managed) |
| Scaling | Manual | Automatic |
| Security | DIY (auth, rate limiting) | Built-in |
| Availability | Single-AZ | Multi-AZ |
Savings: $14.33/month (98% reduction)
2. NAT Instance over NAT Gateway
Decision: Use t4g.nano NAT Instance instead of NAT Gateway
| Aspect | NAT Gateway | NAT Instance |
|---|---|---|
| Cost | $64.80/mo | $6.13/mo |
| Bandwidth | Up to 100 Gbps | Up to 5 Gbps |
| Availability | Managed | Self-managed |
| Complexity | Low | Medium |
Savings: $58.67/month (90% reduction)
Acceptable for: Dev/staging environments with < 5 Gbps traffic
3. Step Functions STANDARD over EXPRESS
Decision: Use Standard workflow type for all state machines
| Aspect | EXPRESS | STANDARD |
|---|---|---|
| Max Duration | 5 minutes | 1 year |
| Backtest Support | NO | YES |
| State Persistence | No | 90 days |
| Pricing model | Per request plus duration | $0.025 per 1,000 state transitions |
Critical: Backtests routinely run 30-60 minutes. EXPRESS would fail.
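The decision above can be sketched as a small helper that picks the workflow type from the expected run time and builds an Amazon States Language definition with explicit timeout and heartbeat settings. The state name `RunBacktest`, the default timeout and heartbeat values, and the `task_resource_arn` parameter are placeholders chosen for illustration; the project's real state machine is not shown in this document.

```python
import json

EXPRESS_MAX_MINUTES = 5  # Express workflows are capped at five minutes.


def choose_workflow_type(expected_minutes: float) -> str:
    """Return the workflow type that can accommodate the expected duration."""
    return "EXPRESS" if expected_minutes <= EXPRESS_MAX_MINUTES else "STANDARD"


def backtest_definition(
    task_resource_arn: str,
    timeout_minutes: int = 90,
    heartbeat_seconds: int = 300,
) -> str:
    """Build a one-state definition for a long-running backtest task."""
    timeout_seconds = timeout_minutes * 60
    if heartbeat_seconds >= timeout_seconds:
        # The heartbeat interval must be shorter than the overall timeout.
        raise ValueError("heartbeat_seconds must be smaller than the task timeout")
    definition = {
        "Comment": "Illustrative backtest workflow (sketch, not the production definition)",
        "StartAt": "RunBacktest",
        "States": {
            "RunBacktest": {
                "Type": "Task",
                "Resource": task_resource_arn,
                "TimeoutSeconds": timeout_seconds,
                "HeartbeatSeconds": heartbeat_seconds,
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    }
    return json.dumps(definition, indent=2)


print(choose_workflow_type(45))  # STANDARD
```

The returned JSON string can be passed as the `definition` argument, together with `type="STANDARD"`, to boto3's `create_state_machine`. A heartbeat setting only helps if the worker actually reports heartbeats back to Step Functions.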
4. Separated Backend Services
Decision: Split the combined backend into separate ECS services for the API and Data Collection
| Aspect | Combined | Separated |
|---|---|---|
| Cost | $14.60/mo | $18.25/mo |
| Scaling | Coupled | Independent |
| Deployment | Atomic | Independent |
| Failure Isolation | None | Full |
Cost: +$3.65/month for significantly better architecture
5. SQS for Async Processing
Decision: Add SQS queue between API and Step Functions
Benefits:
- Instant API response (<100ms)
- No timeout issues (ALB 30s limit)
- Automatic retry with DLQ
- Decoupled architecture
- Cost: $0.42/month
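A rough sketch of the queue-in-the-middle flow: the API enqueues the request and returns immediately, and a Lambda consumer later starts one Step Functions execution per message. The function names, the message shape, and the `job_id` field are assumptions for illustration. The clients are passed in as arguments; in a real deployment they would be `boto3.client("sqs")` and `boto3.client("stepfunctions")`, whose `send_message` and `start_execution` calls take the keyword arguments used here.

```python
import json
import uuid


def enqueue_backtest(sqs_client, queue_url: str, request: dict) -> dict:
    """API side: put the job on the queue and answer right away."""
    job_id = str(uuid.uuid4())
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"job_id": job_id, **request}),
    )
    # The caller gets an acknowledgement without waiting for the backtest.
    return {"status": "accepted", "job_id": job_id}


def consume_handler(event: dict, sfn_client, state_machine_arn: str) -> int:
    """Consumer side: start one execution per SQS record in the Lambda event."""
    started = 0
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=message["job_id"],  # execution names must be unique per state machine
            input=record["body"],
        )
        started += 1
    return started
```

If the consumer raises, the message becomes visible on the queue again and is retried; once the queue's redrive policy limit is reached, it moves to the DLQ.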
Implementation Timeline
| Phase | Duration | Focus | Cost |
|---|---|---|---|
| Phase 1 | Weeks 1-3 | Infrastructure (VPC, Security, Storage) | $0 |
| Phase 2 | Weeks 4-7 | Core Services (API, MLflow, Tasks) | $0 |
| Phase 3 | Weeks 8-10 | Orchestration (Step Functions, SQS) | $0 |
| Phase 4 | Weeks 11-12 | Testing & Hardening | $0 |
| Phase 5 | Weeks 13-16 | Buffer & Production Deploy | $78/mo |
Total Timeline: 12-16 weeks
Total Implementation Effort: ~150 hours
Risk Assessment
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Step Functions timeout | Low | High | Heartbeat + proper timeout config |
| NAT Instance failure | Low | High | Auto Scaling Group + health checks |
| Cost overrun | Low | Medium | Budget alerts at $80/$100 thresholds |
| Security breach | Very Low | Critical | MFA + WAF + audit logging |
| Data Collection delays | Medium | Medium | Circuit breaker + retry logic |
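The "circuit breaker + retry logic" mitigation for Data Collection delays could look roughly like the sketch below. The class and function names, the thresholds, and the backoff values are illustrative assumptions; the document does not specify the actual implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow a single trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit again
        return result


def call_with_retry(breaker, func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry with exponential backoff, but stop at once if the circuit opens."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retries absorb short-lived failures, while the breaker stops the service from hammering a data source that keeps failing.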
Success Criteria
| Metric | Target | Current Baseline |
|---|---|---|
| Backtest success rate | > 95% | N/A (new system) |
| API latency (p99) | < 500ms | N/A |
| Monthly cost | < $100 | $0 |
| Uptime | > 99.5% | N/A |
| Security score | > 90% | N/A |
| Mean time to deploy | < 30 min | N/A |
Approval Checklist
Technical Approval
- [x] VPC design reviewed and approved
- [x] Security checklist completed
- [x] Step Functions workflow validated
- [x] Cost analysis verified
- [x] Implementation plan accepted
Business Approval
- [x] $78-100/month budget approved
- [x] 12-16 week timeline accepted
- [x] Risk assessment acknowledged
Sign-off
- [x] Architecture Owner
- [x] Security Lead
- [x] Infrastructure Lead
- [x] Development Lead
Next Steps
- Immediate (Day 1): Create Pulumi project structure
- Week 1: Deploy VPC, subnets, security groups
- Week 2: Deploy RDS, S3, DynamoDB, Secrets Manager
- Week 3: Deploy ALB, NAT Instance, CloudTrail
- Week 4+: Begin service deployment per roadmap
Document Status: FINAL APPROVED
Review Cycle: Quarterly or on major changes