TradAI Final Architecture - Canonical Configuration

Version: 9.2.1 | Date: 2025-12-09 | Status: AUTHORITATIVE


Purpose

This document is the single source of truth for all configuration values in the TradAI architecture. When any other document or code conflicts with this file, this file takes precedence.


1. Network Configuration

1.1 VPC

| Parameter | Value | Notes |
| --- | --- | --- |
| VPC CIDR | 10.0.0.0/16 | 65,536 IPs |
| Region | us-east-1 | Primary region |
| Availability Zones | us-east-1a, us-east-1b | 2 AZs for HA |
| DNS Hostnames | true | Required for Service Discovery |
| DNS Support | true | Required |

1.2 Subnets

| Subnet | AZ | CIDR | Route Table | Purpose |
| --- | --- | --- | --- | --- |
| Public-1 | us-east-1a | 10.0.1.0/24 | Public RT | ALB, NAT Instance |
| Public-2 | us-east-1b | 10.0.2.0/24 | Public RT | ALB (multi-AZ) |
| Private-1 | us-east-1a | 10.0.11.0/24 | Private RT | ECS Tasks, Lambda |
| Private-2 | us-east-1b | 10.0.12.0/24 | Private RT | ECS Tasks, Lambda |
| Database-1 | us-east-1a | 10.0.21.0/24 | Database RT | RDS Primary |
| Database-2 | us-east-1b | 10.0.22.0/24 | Database RT | RDS Standby |
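
The subnet layout can be checked mechanically before deployment. The following is a minimal sketch using only the Python standard library; the function name `validate_subnets` is illustrative and not part of the TradAI codebase. It verifies that every subnet CIDR falls inside the VPC CIDR and that no two subnets overlap.

```python
import ipaddress

# Values copied from sections 1.1 and 1.2 of this document.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "Public-1": "10.0.1.0/24",
    "Public-2": "10.0.2.0/24",
    "Private-1": "10.0.11.0/24",
    "Private-2": "10.0.12.0/24",
    "Database-1": "10.0.21.0/24",
    "Database-2": "10.0.22.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must sit inside the VPC address range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")

    # No two subnets may share addresses.
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, SUBNETS) or "subnet layout OK")
```

A check like this could back the "All subnet CIDRs match this document" item in the validation checklist, assuming the deployed values are fed in instead of the constants above.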

1.3 Route Tables

| Route Table | Routes | Associated Subnets |
| --- | --- | --- |
| Public RT | 0.0.0.0/0 -> IGW | Public-1, Public-2 |
| Private RT | 0.0.0.0/0 -> NAT Instance | Private-1, Private-2 |
| Database RT | Local only (no internet) | Database-1, Database-2 |

1.4 NAT Instance High Availability

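A single NAT instance is used instead of a managed NAT Gateway to reduce cost (~$32/month savings vs NAT Gateway). An Auto Scaling group replaces the instance automatically if it fails, and a Lambda function repoints the private route table at the replacement.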

| Component | Value | Notes |
| --- | --- | --- |
| Instance Type | t4g.nano | ARM-based, ~$3/month |
| AMI | Amazon Linux 2023 ARM64 | Latest AMI, auto-selected |
| ASG | min=1, max=1, desired=1 | Single instance for cost optimization |
| EIP | Static allocation | Attached by user data script |
| Lambda | tradai-update-nat-routes | Updates route on instance replacement |
| Subnet | Public-1 (us-east-1a) | Single AZ deployment |

User Data Configuration:
- IP forwarding enabled (net.ipv4.ip_forward = 1)
- NAT masquerading via iptables
- Automatic EIP association
- Source/destination check disabled

Failover Flow:
1. ASG detects unhealthy instance (EC2 health check)
2. ASG terminates failed instance and launches replacement
3. ASG Lifecycle Hook triggers EventBridge rule
4. Lambda function tradai-update-nat-routes invoked
5. Lambda waits for instance running state
6. Lambda updates private route table (0.0.0.0/0 -> new instance)
7. Lambda completes lifecycle action
8. Traffic resumes (~2-3 minutes failover time)

IAM Permissions:
- NAT Instance Role: ec2:AssociateAddress, ec2:ModifyInstanceAttribute
- Lambda Role: ec2:CreateRoute, ec2:ReplaceRoute, autoscaling:CompleteLifecycleAction
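
Steps 4 to 7 of the failover flow could look roughly like the sketch below. It is not the actual tradai-update-nat-routes source. It assumes the EventBridge event carries the standard Auto Scaling lifecycle fields in `detail`, and that the private route table ID is supplied through a hypothetical `PRIVATE_ROUTE_TABLE_ID` environment variable. The AWS clients are passed in as arguments so the core logic can be exercised without AWS access.

```python
import os


def handle_lifecycle_event(event, ec2, autoscaling, route_table_id):
    """Point the private default route at the replacement NAT instance."""
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Step 5: wait for the running state. Short polling (4 attempts, 5 s apart)
    # keeps the wait under the 30 s timeout listed for this function.
    ec2.get_waiter("instance_running").wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": 5, "MaxAttempts": 4},
    )

    # Step 6: replace the 0.0.0.0/0 route in the private route table.
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id,
    )

    # Step 7: tell the ASG that the lifecycle hook has finished.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return instance_id


def lambda_handler(event, context):
    # boto3 is imported lazily so the function above stays testable offline.
    import boto3

    return handle_lifecycle_event(
        event,
        boto3.client("ec2"),
        boto3.client("autoscaling"),
        os.environ["PRIVATE_ROUTE_TABLE_ID"],  # hypothetical variable name
    )
```

If the deployed function describes instances directly rather than relying on a waiter, the role would need the matching describe permission in addition to those listed above.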


2. Service Configuration

2.1 Service Ports

| Service | Port | Health Check Path | Protocol |
| --- | --- | --- | --- |
| Backend API | 8000 | /health | HTTP |
| Data Collection | 8002 | /health | HTTP |
| MLflow | 5000 | /health | HTTP |
| Strategy Service | 8003 | /health | HTTP |

2.2 ECS Task Definitions

| Service | CPU | Memory | Desired Count | Launch Type |
| --- | --- | --- | --- | --- |
| backend-api | 512 | 1024 MB | 1 | FARGATE |
| data-collection | 256 | 512 MB | 1 | FARGATE |
| mlflow | 512 | 1024 MB | 1 | FARGATE |
| strategy-service | 512 | 1024 MB | 0 (on-demand) | FARGATE |
| strategy-container | 1024 | 2048 MB | 0 (on-demand) | FARGATE_SPOT |

2.3 Service Discovery

| Service | DNS Name | Namespace |
| --- | --- | --- |
| Backend API | backend-api.tradai.local | tradai.local |
| Data Collection | data-collection.tradai.local | tradai.local |
| MLflow | mlflow.tradai.local | tradai.local |

2.4 Backtest Execution Modes (v9.2)

| Mode | Environment Variable | Description | Use Case |
| --- | --- | --- | --- |
| local | BACKTEST_MODE=local | Docker SDK local execution | Development, testing |
| ecs | BACKTEST_MODE=ecs | Direct ECS Fargate task launch | Simple production backtests |
| sqs | BACKTEST_MODE=sqs | SQS → Lambda → ECS | High-volume with backpressure |
| stepfunctions | BACKTEST_MODE=stepfunctions | SQS → Lambda → Step Functions | Complex multi-step workflows |

SQS Consumer Lambda Environment:

| Variable | Values | Default | Description |
| --- | --- | --- | --- |
| LAUNCH_MODE | ecs, stepfunctions | ecs | How SQS Lambda launches backtests |
| ECS_CLUSTER | cluster ARN | - | Target ECS cluster |
| ECS_TASK_PREFIX | string | strategy- | Task definition prefix |
| BACKTEST_WORKFLOW_ARN | Step Functions ARN | - | For stepfunctions mode |
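
As an illustration of how a service might read these settings, the sketch below parses BACKTEST_MODE and LAUNCH_MODE from an environment mapping. The function and class names are hypothetical. Two behaviours are assumptions rather than documented facts: an unset BACKTEST_MODE is rejected (no default is stated in this document), and BACKTEST_WORKFLOW_ARN is treated as mandatory when LAUNCH_MODE is stepfunctions.

```python
import os
from enum import Enum


class BacktestMode(str, Enum):
    LOCAL = "local"
    ECS = "ecs"
    SQS = "sqs"
    STEPFUNCTIONS = "stepfunctions"


def resolve_backtest_mode(env=None):
    """Parse BACKTEST_MODE; no default is assumed because none is documented."""
    env = os.environ if env is None else env
    raw = env.get("BACKTEST_MODE")
    if raw is None:
        raise ValueError("BACKTEST_MODE is not set")
    try:
        return BacktestMode(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(mode.value for mode in BacktestMode)
        raise ValueError(
            f"BACKTEST_MODE={raw!r} is invalid; expected one of: {allowed}"
        ) from None


def resolve_launch_mode(env=None):
    """Parse LAUNCH_MODE for the SQS consumer Lambda (default: ecs)."""
    env = os.environ if env is None else env
    mode = env.get("LAUNCH_MODE", "ecs")
    if mode not in ("ecs", "stepfunctions"):
        raise ValueError(f"LAUNCH_MODE={mode!r} is invalid")
    # Assumption: the workflow ARN must be present in stepfunctions mode.
    if mode == "stepfunctions" and not env.get("BACKTEST_WORKFLOW_ARN"):
        raise ValueError("BACKTEST_WORKFLOW_ARN is required for stepfunctions mode")
    return mode
```

Passing the environment as an argument keeps both functions easy to test; calling them without arguments reads the process environment.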

2.5 Strategy Container Environment Variables

Required (Minimal Bootstrap):

| Variable | Description | Example |
| --- | --- | --- |
| JOB_ID | Unique job identifier | uuid |
| STRATEGY | Strategy class name (or FREQTRADE_STRATEGY as fallback) | PascalStrategy |
| STRATEGY_ID | Unique strategy identifier (required for live/dry-run) | pascal-btc |
| TRADING_MODE | Execution mode | backtest, live, dry-run, train |
| MLFLOW_TRACKING_URI | MLflow endpoint | http://mlflow.tradai.local:5000 |

Note (Issue 10 Fix): Documentation previously used STRATEGY_NAME and STRATEGY_STAGE. The actual env vars are STRATEGY and STRATEGY_ID. Stage is determined by deployment config.

Optional (Infrastructure):

| Variable | Description | Default |
| --- | --- | --- |
| DYNAMODB_TABLE | Workflow state table | tradai-workflow-state |
| ARCTICDB_S3_URI | ArcticDB S3 path | s3://tradai-arcticdb-{env} |
| RESULTS_DIR | Local results path | /freqtrade/user_data/backtest_results |
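
The bootstrap rules above (required variables, the FREQTRADE_STRATEGY fallback, and STRATEGY_ID being mandatory only for live and dry-run) can be expressed as a small validation routine. This is a sketch under those stated rules, not the container's actual entrypoint; `load_bootstrap_config` is a hypothetical name, and ARCTICDB_S3_URI is left out because its default depends on the environment placeholder.

```python
TRADING_MODES = {"backtest", "live", "dry-run", "train"}


def load_bootstrap_config(env):
    """Validate the strategy container environment and apply documented defaults."""
    missing = [
        name
        for name in ("JOB_ID", "TRADING_MODE", "MLFLOW_TRACKING_URI")
        if not env.get(name)
    ]

    # STRATEGY takes precedence; FREQTRADE_STRATEGY is the documented fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    if not strategy:
        missing.append("STRATEGY")

    mode = env.get("TRADING_MODE")
    # STRATEGY_ID is only required for live and dry-run execution.
    if mode in ("live", "dry-run") and not env.get("STRATEGY_ID"):
        missing.append("STRATEGY_ID")

    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    if mode not in TRADING_MODES:
        raise ValueError(f"TRADING_MODE={mode!r} is not a supported mode")

    return {
        "job_id": env["JOB_ID"],
        "strategy": strategy,
        "strategy_id": env.get("STRATEGY_ID"),
        "trading_mode": mode,
        "mlflow_tracking_uri": env["MLFLOW_TRACKING_URI"],
        "dynamodb_table": env.get("DYNAMODB_TABLE", "tradai-workflow-state"),
        "results_dir": env.get(
            "RESULTS_DIR", "/freqtrade/user_data/backtest_results"
        ),
    }
```

In a container this would typically be called once at startup with `os.environ`, failing fast before any strategy code runs.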

3. Storage Configuration

3.1 S3 Buckets

| Bucket Name | Purpose | Versioning | Encryption | Lifecycle |
| --- | --- | --- | --- | --- |
| tradai-configs-{env} | Strategy configs, base configs | Enabled | SSE-S3 | None |
| tradai-results-{env} | Backtest results, reports | Enabled | SSE-S3 | Glacier after 30d |
| tradai-arcticdb-{env} | ArcticDB time-series data | Enabled | SSE-S3 | None |
| tradai-logs-{env} | ALB logs, CloudTrail, audit | Disabled | SSE-S3 | Delete after 90d |
| tradai-mlflow-{env} | MLflow artifacts | Enabled | SSE-S3 | None |

3.2 DynamoDB Tables

| Table Name | Partition Key | Sort Key | GSI | TTL |
| --- | --- | --- | --- | --- |
| tradai-workflow-state | run_id (S) | - | status-created_at-index | ttl (7 days) |
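
DynamoDB TTL expects the expiry as a Unix epoch timestamp in seconds stored in a Number attribute, so a 7-day TTL means writing `now + 604800` into `ttl`. The sketch below builds an item in the low-level DynamoDB attribute format. The `status` and `created_at` attributes are inferred from the GSI name, and storing `created_at` as an epoch number is an assumption.

```python
import time

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days = 604800 seconds


def workflow_item(run_id, status, now=None):
    """Build a tradai-workflow-state item that expires 7 days after creation."""
    now = int(time.time()) if now is None else int(now)
    return {
        "run_id": {"S": run_id},           # partition key
        "status": {"S": status},           # assumed GSI partition key
        "created_at": {"N": str(now)},     # assumed GSI sort key (epoch seconds)
        "ttl": {"N": str(now + TTL_SECONDS)},
    }
```

An item in this shape can be passed as the `Item` argument of the low-level `put_item` call.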

3.3 RDS Configuration

| Parameter | Dev Value | Prod Value |
| --- | --- | --- |
| Instance Class | db.t4g.micro | db.t4g.small |
| Engine | PostgreSQL 15.4 | PostgreSQL 15.4 |
| Storage | 20 GB gp3 | 50 GB gp3 |
| Multi-AZ | false | true |
| Backup Retention | 7 days | 14 days |
| SSL Enforcement | rds.force_ssl=1 | rds.force_ssl=1 |
| Publicly Accessible | false | false |

4. Security Configuration

4.1 Security Groups

ALB Security Group (tradai-alb-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 80 | TCP | 0.0.0.0/0 | HTTP redirect |
| Ingress | 443 | TCP | 0.0.0.0/0 | HTTPS |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |

ECS Security Group (tradai-ecs-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 8000 | TCP | ALB SG | Backend API |
| Ingress | 8002 | TCP | ALB SG | Data Collection |
| Ingress | 8003 | TCP | ALB SG | Strategy Service |
| Ingress | 5000 | TCP | ALB SG | MLflow |
| Ingress | 8000-8003 | TCP | Lambda SG | Internal calls |
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 5432 | TCP | RDS SG | Database |

Lambda Security Group (tradai-lambda-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |

RDS Security Group (tradai-rds-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 5432 | TCP | ECS SG | From ECS |
| Ingress | 5432 | TCP | Lambda SG | From Lambda |

NAT Security Group (tradai-nat-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 0-65535 | TCP | 10.0.11.0/24 | From Private-1 |
| Ingress | 0-65535 | TCP | 10.0.12.0/24 | From Private-2 |
| Egress | 0-65535 | TCP | 0.0.0.0/0 | To internet |

4.2 Cognito Configuration

| Parameter | Value |
| --- | --- |
| User Pool Name | tradai-users |
| MFA | Required (TOTP) |
| Password Min Length | 12 |
| Password Requirements | Upper, Lower, Number, Symbol |
| Account Recovery | Email only |
| Advanced Security | ENFORCED |

4.3 WAF Configuration

| Rule | Priority | Action | Rate Limit |
| --- | --- | --- | --- |
| RateLimitRule | 1 | Block | 100/5min per IP |
| AWSManagedRulesCommonRuleSet | 2 | Block | - |
| AWSManagedRulesKnownBadInputsRuleSet | 3 | Block | - |
| AWSManagedRulesSQLiRuleSet | 4 | Block | - |

5. API Gateway Configuration

5.1 Routes

| Method | Path | Integration | Auth Required |
| --- | --- | --- | --- |
| GET | /health | VPC Link -> ALB | No |
| GET | /strategies | VPC Link -> Backend API | Yes |
| GET | /strategies/{id} | VPC Link -> Backend API | Yes |
| POST | /strategies | VPC Link -> Backend API | Yes |
| POST | /backtest | SQS Direct Integration | Yes |
| GET | /backtest/{run_id} | VPC Link -> Backend API | Yes |
| GET | /data/symbols | VPC Link -> Data Collection | Yes |
| GET | /data/freshness | VPC Link -> Data Collection | Yes |
| POST | /data/sync | VPC Link -> Data Collection | Yes |
| ANY | /mlflow/* | VPC Link -> MLflow | Yes |

5.2 Throttling

| Parameter | Value |
| --- | --- |
| Default Rate Limit | 100 req/sec |
| Default Burst Limit | 200 req |
| Per-Route Override | /backtest: 10 req/sec |

6. SQS Configuration

6.1 Backtest Queue

| Parameter | Value |
| --- | --- |
| Queue Name | tradai-backtest-queue.fifo |
| FIFO | true |
| Content-Based Deduplication | true |
| Visibility Timeout | 900 seconds (15 minutes) |
| Message Retention | 345600 seconds (4 days) |
| Receive Wait Time | 20 seconds (long polling) |

6.2 Dead Letter Queue

| Parameter | Value |
| --- | --- |
| Queue Name | tradai-backtest-dlq.fifo |
| FIFO | true |
| Message Retention | 1209600 seconds (14 days) |
| Max Receive Count | 3 |
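
The queue settings map onto SQS queue attributes, which are passed as strings with durations in seconds. The sketch below derives the numeric values from the human-readable durations and builds the redrive policy that links the backtest queue to its dead letter queue. The helper names are illustrative, and the ARN in the usage line uses a placeholder account ID.

```python
import json

MINUTE = 60
DAY = 24 * 60 * 60


def backtest_queue_attributes(dlq_arn):
    """Attributes for tradai-backtest-queue.fifo (SQS expects string values)."""
    return {
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "VisibilityTimeout": str(15 * MINUTE),       # 900
        "MessageRetentionPeriod": str(4 * DAY),      # 345600
        "ReceiveMessageWaitTimeSeconds": "20",       # long polling
        # After 3 failed receives a message moves to the dead letter queue.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 3}
        ),
    }


def dlq_attributes():
    """Attributes for tradai-backtest-dlq.fifo."""
    return {
        "FifoQueue": "true",
        "MessageRetentionPeriod": str(14 * DAY),     # 1209600
    }


if __name__ == "__main__":
    # Placeholder account ID, for illustration only.
    arn = "arn:aws:sqs:us-east-1:123456789012:tradai-backtest-dlq.fifo"
    print(json.dumps(backtest_queue_attributes(arn), indent=2))
```

Note that the redrive policy lives on the source queue, not on the dead letter queue, even though Max Receive Count is listed in the DLQ table above.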

7. Lambda Functions

| Function Name | Runtime | Memory | Timeout | VPC | Trigger |
| --- | --- | --- | --- | --- | --- |
| tradai-validate-strategy | Python 3.11 | 256 MB | 30s | Yes | Step Functions |
| tradai-validate-data | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-sqs-consumer | Python 3.11 | 256 MB | 30s | Yes | SQS |
| tradai-transform-results | Python 3.11 | 512 MB | 120s | Yes | Step Functions |
| tradai-cleanup-resources | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-notify-completion | Python 3.11 | 128 MB | 30s | No | Step Functions |
| tradai-update-nat-routes | Python 3.11 | 128 MB | 30s | No | ASG Lifecycle |
| tradai-orphan-scanner | Python 3.11 | 128 MB | 60s | Yes | EventBridge (15 min) |

8. Secrets Manager

| Secret Name | Contents | Rotation |
| --- | --- | --- |
| tradai/database-url | PostgreSQL connection string | Manual (Phase 2: Auto) |
| tradai/mlflow-credentials | MLflow tracking auth | Manual |
| tradai/binance-api | Binance API key/secret | Manual |
| tradai/jwt-secret | JWT signing key | Manual |

9. Naming Conventions

9.1 Resource Naming Pattern

{project}-{component}-{environment}[-{suffix}]

Examples:
- tradai-backend-api-prod
- tradai-vpc-dev
- tradai-backtest-queue-staging.fifo
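
A small helper can keep generated names consistent with this pattern. The sketch below is hypothetical and reproduces the three examples above. It treats `.fifo` as a separate flag rather than as the `-{suffix}` part, since the FIFO example appends it with a dot.

```python
ENVIRONMENTS = ("dev", "staging", "prod")


def resource_name(component, environment, suffix=None, fifo=False, project="tradai"):
    """Build a name following {project}-{component}-{environment}[-{suffix}]."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    parts = [project, component, environment]
    if suffix:
        parts.append(suffix)
    name = "-".join(parts)
    # SQS FIFO queue names must end in ".fifo".
    return name + ".fifo" if fifo else name


if __name__ == "__main__":
    print(resource_name("backend-api", "prod"))
    print(resource_name("vpc", "dev"))
    print(resource_name("backtest-queue", "staging", fifo=True))
```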

9.2 Tag Requirements

All resources MUST have these tags:

| Tag Key | Required | Example Values |
| --- | --- | --- |
| Application | Yes | tradai |
| Environment | Yes | dev, staging, prod |
| Service | If applicable | backend-api, mlflow |
| ManagedBy | Yes | pulumi |
| CostCenter | Yes | trading-platform |
| Owner | Yes | platform-team |
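
The tag requirement is easy to enforce in code before resources are created. The following sketch checks a tag dictionary against the table above; `validate_tags` is an illustrative name, and restricting Environment to dev, staging, and prod is an assumption based on the example values.

```python
REQUIRED_TAGS = ("Application", "Environment", "ManagedBy", "CostCenter", "Owner")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def validate_tags(tags):
    """Return a list of problems; an empty list means the tags are compliant."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    # Assumption: only the three environments named in this document are valid.
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems


if __name__ == "__main__":
    example = {
        "Application": "tradai",
        "Environment": "prod",
        "Service": "backend-api",  # optional, "If applicable"
        "ManagedBy": "pulumi",
        "CostCenter": "trading-platform",
        "Owner": "platform-team",
    }
    print(validate_tags(example) or "tags OK")
```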

10. Environment Differences

| Component | Dev | Staging | Prod |
| --- | --- | --- | --- |
| RDS Multi-AZ | No | No | Yes |
| RDS Instance | db.t4g.micro | db.t4g.micro | db.t4g.small |
| ECS Desired Count | 1 | 1 | 2 |
| Log Retention | 7 days | 14 days | 30 days |
| Deletion Protection | No | No | Yes |
| WAF | Optional | Yes | Yes |
| CloudTrail | Optional | Yes | Yes |

11. Validation Checklist

Before deployment, verify:

- [ ] All subnet CIDRs match this document
- [ ] All service ports match this document
- [ ] All S3 bucket names follow the pattern
- [ ] Security groups reference correct CIDRs
- [ ] DynamoDB uses run_id as partition key
- [ ] All resources have required tags
- [ ] Secrets exist in Secrets Manager
- [ ] SSL enforcement enabled on RDS

Document Control:
- This document supersedes any conflicting values in other architecture documents
- Changes require review and version increment
- Pulumi config.py should import values from this spec