TradAI Final Architecture - Canonical Configuration

Version: 9.3.0 | Date: 2026-03-28 | Status: CURRENT

TL;DR: Single reference for all infrastructure configuration. Defines VPC CIDRs, service specs (CPU/memory/ports), DynamoDB tables, S3 buckets, Lambda configs, and environment-specific overrides. All values are defined in infra/shared/tradai_infra_shared/config.py, which takes precedence over this document.


Purpose

This document mirrors infra/shared/tradai_infra_shared/config.py, which is the actual source of truth deployed by Pulumi. Update config.py first, then update this document to match.

graph TD
    B["infra/shared/config.py<br/><b>Source of Truth (code)</b>"] --> A["10-CANONICAL-CONFIG.md<br/><b>Human-readable mirror</b>"]
    B --> F["persistent stack"]
    B --> G["foundation stack"]
    B --> H["compute stack"]
    B --> I["edge stack"]
    A -.->|"read by"| C["Service Environment Variables<br/>ECS task definitions"]
    A -.->|"read by"| D["Lambda Configuration<br/>Memory, timeout, VPC"]
    A -.->|"read by"| E["Security Configuration<br/>SGs, NACLs, WAF, Cognito"]

    style B fill:#d32f2f,color:#fff
    style A fill:#1565c0,color:#fff

Immutable Values

The following values must NEVER change after initial deployment without a full migration plan: VPC CIDR (10.0.0.0/16), subnet CIDRs, DynamoDB partition keys (per table, see Section 3.2), S3 bucket naming pattern (tradai-{component}-{env}), and Cognito User Pool ID. Changing these will cause data loss or service disruption.

Environment Overrides

Environment-specific values (RDS instance class, desired count, log retention, etc.) are controlled by the Pulumi stack name. The stack name IS the environment (dev, staging, prod). See Section 10 for the full environment differences matrix.


1. Network Configuration

1.1 VPC

| Parameter | Value | Notes |
| --- | --- | --- |
| VPC CIDR | 10.0.0.0/16 | 65,536 IPs |
| Region | eu-central-1 | Primary region |
| Availability Zones | eu-central-1a, eu-central-1b | 2 AZs for HA |
| DNS Hostnames | true | Required for Service Discovery |
| DNS Support | true | Required |

1.2 Subnets

| Subnet Type | AZ | CIDR | Route Table | Purpose |
| --- | --- | --- | --- | --- |
| Public-1 | eu-central-1a | 10.0.1.0/24 | Public RT | ALB, NAT Instance |
| Public-2 | eu-central-1b | 10.0.2.0/24 | Public RT | ALB (multi-AZ) |
| Private-1 | eu-central-1a | 10.0.11.0/24 | Private RT | ECS Tasks, Lambda |
| Private-2 | eu-central-1b | 10.0.12.0/24 | Private RT | ECS Tasks, Lambda |
| Database-1 | eu-central-1a | 10.0.21.0/24 | Database RT | RDS Primary |
| Database-2 | eu-central-1b | 10.0.22.0/24 | Database RT | RDS Standby |
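
The subnet layout lends itself to an automated check. The following is a minimal sketch, not code taken from config.py: it uses Python's standard `ipaddress` module to confirm that every subnet CIDR listed above sits inside the VPC CIDR and that no two subnets overlap. The names `SUBNETS` and `validate_subnets` are hypothetical.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet CIDRs copied from the subnet table in Section 1.2.
SUBNETS = {
    "Public-1": "10.0.1.0/24",
    "Public-2": "10.0.2.0/24",
    "Private-1": "10.0.11.0/24",
    "Private-2": "10.0.12.0/24",
    "Database-1": "10.0.21.0/24",
    "Database-2": "10.0.22.0/24",
}


def validate_subnets(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    # Every subnet must be contained in the VPC range.
    for name, net in networks.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No pair of subnets may share addresses.
    for (a, net_a), (b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, SUBNETS) or "subnet layout OK")
```

A check like this could run in CI before a deployment, since subnet CIDRs are listed among the immutable values.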

1.3 Route Tables

| Route Table | Routes | Associated Subnets |
| --- | --- | --- |
| Public RT | 0.0.0.0/0 -> IGW | Public-1, Public-2 |
| Private RT | 0.0.0.0/0 -> NAT Instance | Private-1, Private-2 |
| Database RT | Local only (no internet) | Database-1, Database-2 |

1.4 NAT Instance High Availability

Cost-effective NAT Instance with automatic failover (~$32/month savings vs NAT Gateway).

| Component | Value | Notes |
| --- | --- | --- |
| Instance Type | t4g.nano | ARM-based, ~$3/month |
| AMI | Amazon Linux 2023 ARM64 | Latest AMI, auto-selected |
| ASG | min=1, max=1, desired=1 | Single instance for cost optimization |
| EIP | Static allocation | Attached by user data script |
| Lambda | tradai-update-nat-routes | Updates route on instance replacement |
| Subnet | Public-1 (eu-central-1a) | Single AZ deployment |

User Data Configuration:
- IP forwarding enabled (net.ipv4.ip_forward = 1)
- NAT masquerading via iptables
- Automatic EIP association
- Source/destination check disabled

Failover Flow:
1. ASG detects unhealthy instance (EC2 health check)
2. ASG terminates failed instance and launches replacement
3. ASG Lifecycle Hook triggers EventBridge rule
4. Lambda function tradai-update-nat-routes invoked
5. Lambda waits for instance running state
6. Lambda updates private route table (0.0.0.0/0 -> new instance)
7. Lambda completes lifecycle action
8. Traffic resumes (~2-3 minutes failover time)

IAM Permissions:
- NAT Instance Role: ec2:AssociateAddress, ec2:ModifyInstanceAttribute
- Lambda Role: ec2:CreateRoute, ec2:ReplaceRoute, autoscaling:CompleteLifecycleAction
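
Steps 5 to 7 of the failover flow can be sketched as follows. This is an illustrative outline, not the actual tradai-update-nat-routes handler: the function names are hypothetical, and the `ec2` and `autoscaling` arguments are assumed to behave like boto3 clients (`get_waiter`, `describe_route_tables`, `replace_route`, `create_route`, `complete_lifecycle_action`). They are injected so the logic can be exercised without AWS access.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def point_default_route_at(ec2, route_table_id, instance_id):
    """Point the private route table's default route at the replacement NAT instance.

    Returns the name of the EC2 call that was used.
    """
    # Step 5: block until the replacement instance reports the running state.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Step 6: replace the default route if one exists, otherwise create it.
    tables = ec2.describe_route_tables(RouteTableIds=[route_table_id])["RouteTables"]
    has_default = any(
        route.get("DestinationCidrBlock") == DEFAULT_ROUTE
        for table in tables
        for route in table.get("Routes", [])
    )
    call = ec2.replace_route if has_default else ec2.create_route
    call(
        RouteTableId=route_table_id,
        DestinationCidrBlock=DEFAULT_ROUTE,
        InstanceId=instance_id,
    )
    return "replace_route" if has_default else "create_route"


def complete_hook(autoscaling, detail):
    """Step 7: let the ASG proceed, using fields from the lifecycle event detail."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )
```

Choosing between the create and replace calls is one plausible reason the Lambda role lists both ec2:CreateRoute and ec2:ReplaceRoute. Note that this particular variant would also need ec2:DescribeInstances and ec2:DescribeRouteTables, which are not in the permission list above.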


2. Service Configuration

2.1 Service Ports

| Service | Port | Health Check Path | Protocol |
| --- | --- | --- | --- |
| Backend API | 8000 | /api/v1/health | HTTP |
| Data Collection | 8002 | /api/v1/health | HTTP |
| MLflow | 5000 | /mlflow/ | HTTP |
| Strategy Service | 8003 | /api/v1/health | HTTP |

2.2 ECS Task Definitions

| Service | CPU (units) | Memory | Desired Count | Launch Type |
| --- | --- | --- | --- | --- |
| backend-api | 512 | 1024 MB | 1 | FARGATE |
| data-collection | 256 | 512 MB | 1 | FARGATE |
| mlflow | 512 | 1024 MB | 1 | FARGATE |
| strategy-service | 512 | 1024 MB | 1 | FARGATE |
| strategy-container | 1024 | 2048 MB | 0 (on-demand) | FARGATE_SPOT |
| live-trading | 1024 | 2048 MB | 0 (per strategy) | FARGATE |
| dry-run-trading | 1024 | 2048 MB | 0 (per strategy) | FARGATE_SPOT |
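
Fargate only accepts certain CPU/memory pairings, so the task sizes above can be linted before deployment. The sketch below is an assumption about how such a check might look, not code from config.py: `ServiceSpec`, `SERVICES`, and `invalid_sizes` are hypothetical names, and the lookup table covers only the CPU sizes this document uses (256, 512, and 1024 units).

```python
from dataclasses import dataclass

# Memory values (MB) that Fargate accepts for each CPU size used in this document.
FARGATE_MEMORY_MB = {
    256: {512, 1024, 2048},
    512: set(range(1024, 4096 + 1, 1024)),
    1024: set(range(2048, 8192 + 1, 1024)),
}


@dataclass(frozen=True)
class ServiceSpec:
    name: str
    cpu: int          # CPU units (1024 units = 1 vCPU)
    memory_mb: int
    desired_count: int
    capacity: str     # FARGATE or FARGATE_SPOT


# Values copied from the task definition table in Section 2.2.
SERVICES = [
    ServiceSpec("backend-api", 512, 1024, 1, "FARGATE"),
    ServiceSpec("data-collection", 256, 512, 1, "FARGATE"),
    ServiceSpec("mlflow", 512, 1024, 1, "FARGATE"),
    ServiceSpec("strategy-service", 512, 1024, 1, "FARGATE"),
    ServiceSpec("strategy-container", 1024, 2048, 0, "FARGATE_SPOT"),
    ServiceSpec("live-trading", 1024, 2048, 0, "FARGATE"),
    ServiceSpec("dry-run-trading", 1024, 2048, 0, "FARGATE_SPOT"),
]


def invalid_sizes(services):
    """Return the names of services whose CPU/memory pairing Fargate would reject."""
    return [
        s.name
        for s in services
        if s.memory_mb not in FARGATE_MEMORY_MB.get(s.cpu, set())
    ]


if __name__ == "__main__":
    print(invalid_sizes(SERVICES) or "all task sizes are valid Fargate combinations")
```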

2.3 Service Discovery

| Service | DNS Name | Port | Namespace |
| --- | --- | --- | --- |
| Backend API | backend-api.tradai-{env}.local | 8000 | tradai-{env}.local |
| Data Collection | data-collection.tradai-{env}.local | 8002 | tradai-{env}.local |
| MLflow | mlflow.tradai-{env}.local | 5000 | tradai-{env}.local |
| Strategy Service | strategy-service.tradai-{env}.local | 8003 | tradai-{env}.local |
| Live Trading | live-trading.tradai-{env}.local | 8004 | tradai-{env}.local |
| Dry-Run Trading | dry-run-trading.tradai-{env}.local | 8005 | tradai-{env}.local |

2.4 Backtest Execution Modes (v9.2)

| Mode | Environment Variable | Description | Use Case |
| --- | --- | --- | --- |
| local | BACKEND_EXECUTOR_MODE=local | Docker SDK local execution | Development, testing |
| ecs | BACKEND_EXECUTOR_MODE=ecs | Direct ECS Fargate task launch | Simple production backtests |
| sqs | BACKEND_EXECUTOR_MODE=sqs | SQS → Lambda → ECS | High-volume with backpressure |
| stepfunctions | BACKEND_EXECUTOR_MODE=stepfunctions | SQS → Lambda → Step Functions | Complex multi-step workflows |

SQS Consumer Lambda Environment:

| Variable | Values | Default | Description |
| --- | --- | --- | --- |
| LAUNCH_MODE | ecs, stepfunctions | ecs | SQS Lambda only: how Lambda launches backtests |
| ECS_CLUSTER | cluster ARN | - | Target ECS cluster |
| ECS_TASK_PREFIX | string | strategy- | Task definition prefix |
| STEP_FUNCTIONS_ARN | Step Functions ARN | - | For stepfunctions mode |
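
A minimal sketch of how these variables might be parsed follows. It is not the backend's actual code: `ExecutorMode`, `executor_mode`, and `launch_mode` are hypothetical names. Because no default for BACKEND_EXECUTOR_MODE is documented, the sketch assumes a missing value is an error, and it assumes STEP_FUNCTIONS_ARN must be set whenever LAUNCH_MODE is stepfunctions.

```python
import os
from enum import Enum


class ExecutorMode(str, Enum):
    LOCAL = "local"
    ECS = "ecs"
    SQS = "sqs"
    STEPFUNCTIONS = "stepfunctions"


def executor_mode(env=os.environ):
    """Parse BACKEND_EXECUTOR_MODE; a missing or unknown value raises ValueError."""
    raw = env.get("BACKEND_EXECUTOR_MODE", "")
    try:
        return ExecutorMode(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(m.value for m in ExecutorMode)
        raise ValueError(
            f"BACKEND_EXECUTOR_MODE={raw!r} is not one of: {allowed}"
        ) from None


def launch_mode(env=os.environ):
    """Parse the SQS consumer Lambda's LAUNCH_MODE (documented default: ecs)."""
    mode = env.get("LAUNCH_MODE", "ecs")
    if mode not in ("ecs", "stepfunctions"):
        raise ValueError(f"LAUNCH_MODE={mode!r} must be 'ecs' or 'stepfunctions'")
    # Assumption: the state machine ARN is mandatory in stepfunctions mode.
    if mode == "stepfunctions" and not env.get("STEP_FUNCTIONS_ARN"):
        raise ValueError("STEP_FUNCTIONS_ARN is required when LAUNCH_MODE=stepfunctions")
    return mode
```

Failing fast on an unknown mode keeps a misconfigured task from silently falling back to a different executor.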

2.5 Strategy Container Environment Variables

Required (Minimal Bootstrap):

| Variable | Description | Example |
| --- | --- | --- |
| JOB_ID | Unique job identifier | uuid |
| STRATEGY | Strategy class name (or FREQTRADE_STRATEGY as fallback) | PascalStrategy |
| STRATEGY_ID | Unique strategy identifier (required for live/dry-run) | pascal-btc |
| TRADING_MODE | Execution mode | backtest, live, dry-run, train |
| MLFLOW_TRACKING_URI | MLflow endpoint | http://mlflow.tradai-{env}.local:5000/mlflow |

Note (Issue 10 Fix): Documentation previously used STRATEGY_NAME and STRATEGY_STAGE. The actual env vars are STRATEGY and STRATEGY_ID. Stage is determined by deployment config.

Optional (Infrastructure):

| Variable | Description | Default |
| --- | --- | --- |
| CONFIG_PATH | Freqtrade config file path | /freqtrade/user_data/config.json |
| DYNAMODB_TABLE | Workflow state table (also accepts WORKFLOW_STATE_TABLE) | tradai-workflow-state |
| S3_RESULTS_BUCKET | S3 bucket for result uploads | - |
| ARCTICDB_S3_URI | ArcticDB S3 path | s3://tradai-arcticdb-{env} |
| RESULTS_DIR | Local results path | /freqtrade/user_data/backtest_results |
| TIMEFRAME | Backtest timeframe | - |
| TIMERANGE | Date range (YYYYMMDD-YYYYMMDD) | - |
| PAIRS | Comma-separated trading pairs | - |
| EXPERIMENT_NAME | MLflow experiment name | default |
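
The bootstrap rules in this section (the STRATEGY fallback, STRATEGY_ID being mandatory only for live and dry-run, and the alternate table variable) can be expressed as a small loader. This is a hedged sketch, not the strategy container's entrypoint: `Bootstrap` and `load_bootstrap` are hypothetical names, and the precedence between DYNAMODB_TABLE and WORKFLOW_STATE_TABLE is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

TRADING_MODES = {"backtest", "live", "dry-run", "train"}


@dataclass(frozen=True)
class Bootstrap:
    job_id: str
    strategy: str
    strategy_id: Optional[str]
    trading_mode: str
    mlflow_tracking_uri: str
    dynamodb_table: str


def load_bootstrap(env: Mapping[str, str]) -> Bootstrap:
    """Validate the minimal bootstrap variables and apply the documented fallbacks."""

    def require(name: str) -> str:
        value = env.get(name)
        if not value:
            raise KeyError(f"missing required variable: {name}")
        return value

    # STRATEGY wins; FREQTRADE_STRATEGY is the documented fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    if not strategy:
        raise KeyError("missing required variable: STRATEGY (or FREQTRADE_STRATEGY)")

    mode = require("TRADING_MODE")
    if mode not in TRADING_MODES:
        raise ValueError(f"unsupported TRADING_MODE: {mode!r}")

    # STRATEGY_ID is only mandatory for live and dry-run containers.
    strategy_id = env.get("STRATEGY_ID") or None
    if mode in {"live", "dry-run"} and strategy_id is None:
        raise KeyError("STRATEGY_ID is required for live and dry-run modes")

    return Bootstrap(
        job_id=require("JOB_ID"),
        strategy=strategy,
        strategy_id=strategy_id,
        trading_mode=mode,
        mlflow_tracking_uri=require("MLFLOW_TRACKING_URI"),
        # Assumed precedence: DYNAMODB_TABLE, then WORKFLOW_STATE_TABLE, then the default.
        dynamodb_table=env.get("DYNAMODB_TABLE")
        or env.get("WORKFLOW_STATE_TABLE")
        or "tradai-workflow-state",
    )
```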

3. Storage Configuration

3.1 S3 Buckets

| Bucket Name | Purpose | Versioning | Encryption | Lifecycle |
| --- | --- | --- | --- | --- |
| tradai-configs-{env} | Strategy configs, base configs | Enabled | SSE-S3 | None |
| tradai-results-{env} | Backtest results, reports | Enabled | SSE-S3 | Glacier after 30d |
| tradai-arcticdb-{env} | ArcticDB time-series data | Enabled | SSE-S3 | None |
| tradai-logs-{env} | ALB logs, CloudTrail, audit | Disabled | SSE-S3 | Delete after 90d |
| tradai-mlflow-{env} | MLflow artifacts | Enabled | SSE-S3 | None |

3.2 DynamoDB Tables

All 12 tables use PAY_PER_REQUEST billing and server-side encryption. Point-in-time recovery (PITR) is enabled on every table except tradai-idempotency-{env}. Prod tables also have deletion protection enabled.

| # | Table Name | Partition Key | Sort Key | GSIs | TTL Attr | Purpose |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | tradai-workflow-state-{env} | run_id (S) | -- | status-created_at-index (hash: status, range: created_at, ALL); trace_id-index (hash: trace_id, INCLUDE: job_id, status, created_at, execution_arn) | ttl | Backtest job tracking and E2E correlation |
| 2 | tradai-idempotency-{env} | idempotency_key (S) | -- | -- | expiration | Lambda idempotency deduplication (no PITR) |
| 3 | tradai-health-state-{env} | service_name (S) | -- | -- | expires_at | Service health check state |
| 4 | tradai-trading-state-{env} | strategy_id (S) | -- | -- | ttl | Live trading container state and heartbeats |
| 5 | tradai-deployments-{env} | strategy_name (S) | deployment_id (S) | environment-created_at-index (hash: environment, range: created_at, ALL) | ttl | Strategy deployment history and rollback info |
| 6 | tradai-drift-state-{env} | model_name (S) | -- | -- | expires_at | Model/data drift detection state |
| 7 | tradai-retraining-state-{env} | model_name (S) | -- | -- | expires_at | Model retraining scheduling state |
| 8 | tradai-rollback-state-{env} | model_name (S) | -- | -- | -- (none) | Model rollback tracking (no TTL for audit) |
| 9 | tradai-shadow-test-state-{env} | test_id (S) | -- | model_name-status_created_at-index (hash: model_name, range: status_created_at, ALL) | expires_at | Shadow trading test state and metrics |
| 10 | tradai-notifications-{env} | notification_id (S) | -- | -- | expire_at | Notification delivery tracking |
| 11 | tradai-infra-drift-state-{env} | stack_name (S) | -- | has_drift-last_check-index (hash: has_drift, range: last_check, ALL) | -- (none) | Pulumi infrastructure drift detection (no TTL for audit) |
| 12 | tradai-config-versions-{env} | strategy_name (S) | config_id (S) | config_hash-index (hash: config_hash, KEYS_ONLY); status-index (hash: strategy_name, range: status, ALL) | ttl | Immutable config version registry (content-addressable) |

Source code: infra/shared/tradai_infra_shared/config.py (DYNAMODB_TABLES dict) and infra/persistent/modules/dynamodb.py (DynamoDbTables class).
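
To make the key schema concrete, the sketch below builds the keyword arguments that a boto3 DynamoDB `create_table` call would take for one of these tables. It is illustrative only: the real tables are created by Pulumi from the DYNAMODB_TABLES dict, whose shape is not shown in this document, and `create_table_kwargs` is a hypothetical helper. GSIs are omitted, and PITR and TTL are left out because boto3 configures them through separate calls (`update_continuous_backups` and `update_time_to_live`).

```python
def create_table_kwargs(name, hash_key, range_key=None, env="dev"):
    """Build boto3-style create_table arguments for a string-keyed table."""
    keys = [(hash_key, "HASH")]
    if range_key:
        keys.append((range_key, "RANGE"))
    return {
        "TableName": f"{name}-{env}",
        "BillingMode": "PAY_PER_REQUEST",
        # Every partition and sort key in Section 3.2 is a string (S).
        "AttributeDefinitions": [
            {"AttributeName": attr, "AttributeType": "S"} for attr, _ in keys
        ],
        "KeySchema": [
            {"AttributeName": attr, "KeyType": key_type} for attr, key_type in keys
        ],
        # Deletion protection is enabled for prod tables only.
        "DeletionProtectionEnabled": env == "prod",
    }


if __name__ == "__main__":
    from pprint import pprint

    pprint(create_table_kwargs("tradai-deployments", "strategy_name", "deployment_id", env="prod"))
```

Because partition keys are listed among the immutable values, a change to `hash_key` here would correspond to a table replacement rather than an in-place update.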

3.3 RDS Configuration

| Parameter | Dev Value | Prod Value |
| --- | --- | --- |
| Instance Class | db.t4g.micro | db.t4g.small |
| Engine | PostgreSQL 15.13 | PostgreSQL 15.13 |
| Storage | 20 GB gp3 | 50 GB gp3 |
| Multi-AZ | false | true |
| Backup Retention | 7 days | 14 days |
| SSL Enforcement | rds.force_ssl=1 | rds.force_ssl=1 |
| Publicly Accessible | false | false |

4. Security Configuration

4.1 Security Groups

ALB Security Group (tradai-alb-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 80 | TCP | 0.0.0.0/0 | HTTP redirect |
| Ingress | 443 | TCP | 0.0.0.0/0 | HTTPS |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |

ECS Security Group (tradai-ecs-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 8000 | TCP | ALB SG | Backend API |
| Ingress | 8002 | TCP | ALB SG | Data Collection |
| Ingress | 8003 | TCP | ALB SG | Strategy Service |
| Ingress | 5000 | TCP | ALB SG | MLflow |
| Ingress | 8000-8003 | TCP | Lambda SG | Internal calls |
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 5432 | TCP | RDS SG | Database |

Lambda Security Group (tradai-lambda-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Egress | 443 | TCP | 0.0.0.0/0 | AWS APIs |
| Egress | 8000-8003 | TCP | ECS SG | To services |
| Egress | 5000 | TCP | ECS SG | To MLflow |

RDS Security Group (tradai-rds-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 5432 | TCP | ECS SG | From ECS |
| Ingress | 5432 | TCP | Lambda SG | From Lambda |

NAT Security Group (tradai-nat-sg)

| Direction | Port | Protocol | Source/Dest | Purpose |
| --- | --- | --- | --- | --- |
| Ingress | 0-65535 | TCP | 10.0.11.0/24 | From Private-1 |
| Ingress | 0-65535 | TCP | 10.0.12.0/24 | From Private-2 |
| Egress | 0-65535 | TCP | 0.0.0.0/0 | To internet |

Note: The NAT SG currently allows TCP only, while DNS (port 53) and NTP (port 123) use UDP. VPC-provided DNS handles name resolution today, so this is not a blocker. If private subnet workloads ever need custom DNS or NTP through the NAT instance, add UDP rules for ports 53 and 123 to both ingress and egress.

4.2 Cognito Configuration

| Parameter | Value |
| --- | --- |
| User Pool Name | tradai-users |
| MFA | Required (TOTP) |
| Password Min Length | 12 |
| Password Requirements | Upper, Lower, Number, Symbol |
| Account Recovery | Email only |
| Advanced Security | Not enabled (no user_pool_add_ons configured) |

4.3 WAF Configuration

| Rule | Priority | Action | Rate Limit |
| --- | --- | --- | --- |
| RateLimitRule | 1 | Block | 100/5min per IP |
| AWSManagedRulesCommonRuleSet | 2 | Block | - |
| AWSManagedRulesKnownBadInputsRuleSet | 3 | Block | - |
| AWSManagedRulesSQLiRuleSet | 4 | Block | - |

5. API Gateway Configuration

5.1 Routes

| Method | Path | Integration | Auth Required |
| --- | --- | --- | --- |
| GET | /api/v1/health | VPC Link -> ALB | No |
| GET | /api/v1/strategies | VPC Link -> Backend API | Yes |
| GET | /api/v1/strategies/{id} | VPC Link -> Backend API | Yes |
| POST | /api/v1/strategies | VPC Link -> Backend API | Yes |
| POST | /api/v1/strategies/{name}/stage | VPC Link -> Backend API | Yes |
| POST | /api/v1/strategies/{name}/promote | VPC Link -> Backend API | Yes |
| POST | /api/v1/backtests | SQS Direct Integration | Yes |
| GET | /api/v1/backtests | VPC Link -> Backend API | Yes |
| GET | /api/v1/backtests/{job_id} | VPC Link -> Backend API | Yes |
| POST | /api/v1/backtests/{job_id}/cancel | VPC Link -> Backend API | Yes |
| GET | /api/v1/backtests/{job_id}/equity | VPC Link -> Backend API | Yes |
| GET | /api/v1/backtests/{job_id}/report-data | VPC Link -> Backend API | Yes |
| GET | /api/v1/data/symbols | VPC Link -> Backend API | Yes |
| GET | /api/v1/data/freshness | VPC Link -> Backend API | Yes |
| POST | /api/v1/data/sync | VPC Link -> Backend API | Yes |
| GET | /api/v1/models/{name}/versions | VPC Link -> Backend API | Yes |
| POST | /api/v1/models/{name}/rollback | VPC Link -> Backend API | Yes |
| GET | /api/v1/catalog/strategies | VPC Link -> Backend API | Yes |
| GET | /api/v1/catalog/strategies/{name} | VPC Link -> Backend API | Yes |
| GET | /api/v1/catalog/strategies/{name}/compare | VPC Link -> Backend API | Yes |
| ANY | /mlflow/* | VPC Link -> MLflow | Yes |

5.2 Throttling

| Parameter | Value |
| --- | --- |
| Default Rate Limit | 100 req/sec |
| Default Burst Limit | 200 req |
| Per-Route Override | /api/v1/backtests: 10 req/sec |

6. SQS Configuration

6.1 Backtest Queue

| Parameter | Value |
| --- | --- |
| Queue Name | tradai-backtest-queue.fifo |
| FIFO | true |
| Content-Based Deduplication | true |
| Visibility Timeout | 900 s (15 minutes) |
| Message Retention | 345600 s (4 days) |
| Receive Wait Time | 20 s (long polling) |

6.2 Dead Letter Queue

| Parameter | Value |
| --- | --- |
| Queue Name | tradai-backtest-dlq.fifo |
| FIFO | true |
| Message Retention | 1209600 s (14 days) |
| Max Receive Count | 3 |
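
The two queues are linked by a redrive policy on the main queue. As a rough sketch, assuming the attribute names used by the SQS `CreateQueue` API (which expects every value as a string), the settings above would translate as follows. `backtest_queue_attributes` is a hypothetical helper and the ARN in the usage line is a placeholder, not a real resource.

```python
import json


def backtest_queue_attributes(dlq_arn):
    """Return SQS attributes for the backtest queue, pointing at the given DLQ."""
    return {
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "VisibilityTimeout": "900",               # 15 minutes
        "MessageRetentionPeriod": "345600",       # 4 days
        "ReceiveMessageWaitTimeSeconds": "20",    # long polling
        # After 3 failed receives a message moves to the dead letter queue.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 3}
        ),
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    arn = "arn:aws:sqs:eu-central-1:123456789012:tradai-backtest-dlq.fifo"
    print(json.dumps(backtest_queue_attributes(arn), indent=2))
```

The dead letter queue of a FIFO queue must itself be FIFO, which is consistent with both queue names ending in .fifo.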

7. Lambda Functions

| Function Name | Runtime | Memory | Timeout | VPC | Trigger |
| --- | --- | --- | --- | --- | --- |
| tradai-backtest-consumer | Python 3.11 | 256 MB | 30s | Yes | SQS |
| tradai-sqs-consumer | Python 3.11 | 256 MB | 30s | Yes | SQS |
| tradai-orphan-scanner | Python 3.11 | 128 MB | 60s | Yes | EventBridge (5 min) |
| tradai-health-check | Python 3.11 | 256 MB | 60s | Yes | EventBridge (2 min) |
| tradai-trading-heartbeat-check | Python 3.11 | 256 MB | 60s | Yes | EventBridge (5 min) |
| tradai-drift-monitor | Python 3.11 | 512 MB | 120s | Yes | EventBridge (12 hours) |
| tradai-retraining-scheduler | Python 3.11 | 256 MB | 60s | Yes | EventBridge (6 hours) |
| tradai-validate-strategy | Python 3.11 | 256 MB | 30s | Yes | Step Functions |
| tradai-data-collection-proxy | Python 3.11 | 256 MB | 180s | Yes | Step Functions |
| tradai-notify-completion | Python 3.11 | 128 MB | 30s | No | Step Functions |
| tradai-check-retraining-needed | Python 3.11 | 256 MB | 30s | Yes | Step Functions |
| tradai-compare-models | Python 3.11 | 512 MB | 120s | Yes | Step Functions |
| tradai-promote-model | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-model-rollback | Python 3.11 | 256 MB | 60s | Yes | Backend API |
| tradai-cleanup-resources | Python 3.11 | 256 MB | 60s | Yes | Step Functions |
| tradai-update-status | Python 3.11 | 256 MB | 30s | Yes | Step Functions |
| tradai-pulumi-drift-detector | Python 3.11 | 512 MB | 300s | No | EventBridge (6 hours) |
| tradai-update-nat-routes | Python 3.11 | 128 MB | 120s | No | ASG Lifecycle |
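
One cross-check between this table and Section 6 is worth automating. AWS's Lambda guidance for SQS event sources recommends a queue visibility timeout of at least six times the function timeout, so that a message is not redelivered while an invocation (and its retries) is still in flight. The sketch below applies that rule to the two SQS-triggered functions. The names are hypothetical, and it assumes both functions consume the backtest queue with its 900-second visibility timeout.

```python
# Timeouts (seconds) of the SQS-triggered functions from the table in Section 7.
SQS_LAMBDA_TIMEOUT_S = {
    "tradai-backtest-consumer": 30,
    "tradai-sqs-consumer": 30,
}

# Visibility timeout of the backtest queue (Section 6.1).
QUEUE_VISIBILITY_TIMEOUT_S = 900


def visibility_ok(function_timeout_s, visibility_timeout_s, factor=6):
    """True if the visibility timeout is at least `factor` times the function timeout."""
    return visibility_timeout_s >= factor * function_timeout_s


def undersized(functions, visibility_timeout_s):
    """Return the names of functions whose timeout is too long for the queue."""
    return [
        name
        for name, timeout in functions.items()
        if not visibility_ok(timeout, visibility_timeout_s)
    ]


if __name__ == "__main__":
    print(undersized(SQS_LAMBDA_TIMEOUT_S, QUEUE_VISIBILITY_TIMEOUT_S) or "visibility timeout OK")
```

With the documented values the check passes comfortably: 6 x 30 s = 180 s, well under the 900 s visibility timeout.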

8. Secrets Manager

| Secret Name | Contents | Rotation |
| --- | --- | --- |
| tradai/database-url | PostgreSQL connection string | Manual (Phase 2: Auto) |
| tradai/mlflow-credentials | MLflow tracking auth | Manual |
| tradai/binance-api | Binance API key/secret | Manual |
| tradai/jwt-secret | JWT signing key | Manual |

9. Naming Conventions

9.1 Resource Naming Pattern

{project}-{component}-{environment}[-{suffix}]

Examples:
- tradai-backend-api-prod
- tradai-vpc-dev
- tradai-backtest-queue-staging.fifo
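
The naming pattern can be checked mechanically. The following is one possible regular expression, offered as a sketch: it assumes lowercase alphanumeric components that may contain hyphens, the three environments named in this document, and (based on the third example) that a `.fifo` ending counts as a suffix alongside the `-{suffix}` form. `RESOURCE_NAME` and `parse_resource_name` are hypothetical names.

```python
import re

RESOURCE_NAME = re.compile(
    r"tradai-"
    r"(?P<component>[a-z0-9]+(?:-[a-z0-9]+)*)"   # e.g. backend-api
    r"-(?P<environment>dev|staging|prod)"
    r"(?P<suffix>(?:-[a-z0-9]+|\.fifo)?)"        # optional -suffix or .fifo
)


def parse_resource_name(name):
    """Return the name's parts as a dict, or None if it does not follow the pattern."""
    match = RESOURCE_NAME.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in (
        "tradai-backend-api-prod",
        "tradai-vpc-dev",
        "tradai-backtest-queue-staging.fifo",
    ):
        print(example, "->", parse_resource_name(example))
```

Names without an environment segment, such as the Cognito pool name tradai-users, do not match this expression and would need an explicit exemption in any real linter.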

9.2 Tag Requirements

All resources MUST carry every tag marked Yes below; the Service tag applies only to service-scoped resources:

| Tag Key | Required | Example Values |
| --- | --- | --- |
| Application | Yes | tradai |
| Environment | Yes | dev, staging, prod |
| Service | If applicable | backend-api, mlflow |
| ManagedBy | Yes | pulumi |
| CostCenter | Yes | trading-platform |
| Owner | Yes | platform-team |
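
A tag check along these lines could run against planned resources before deployment. This is a minimal sketch under stated assumptions: `tag_problems` is a hypothetical helper, the Service tag is treated as optional, and the Environment values are limited to the three environments named in this document.

```python
REQUIRED_TAGS = ("Application", "Environment", "ManagedBy", "CostCenter", "Owner")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def tag_problems(tags):
    """Return a list of tagging problems for one resource; empty means compliant."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment!r}")
    return problems


if __name__ == "__main__":
    example = {
        "Application": "tradai",
        "Environment": "prod",
        "Service": "backend-api",
        "ManagedBy": "pulumi",
        "CostCenter": "trading-platform",
        "Owner": "platform-team",
    }
    print(tag_problems(example) or "tags OK")
```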

10. Environment Differences

| Component | Dev | Staging | Prod |
| --- | --- | --- | --- |
| Compute Mode | Consolidated EC2 (t3.small) | Consolidated EC2 (t3.small) | ECS Fargate |
| RDS Multi-AZ | No | No | Yes |
| RDS Instance | db.t4g.micro | db.t4g.micro | db.t4g.small |
| ECS Desired Count | 1 | 1 | 2 |
| Log Retention | 30 days | 30 days | 90 days |
| Deletion Protection | No | No | Yes |
| WAF | Optional | Yes | Yes |
| CloudTrail | Optional | Yes | Yes |
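
Since the Pulumi stack name is the environment, the matrix above amounts to a lookup keyed by stack name. The sketch below shows one way to model that. It is not the structure used in config.py: `EnvOverrides`, `OVERRIDES`, and `overrides_for` are hypothetical names, and the WAF and CloudTrail rows are left out because "Optional" has no single value. In a Pulumi program the stack name would typically come from `pulumi.get_stack()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvOverrides:
    compute_mode: str
    rds_instance_class: str
    rds_multi_az: bool
    ecs_desired_count: int
    log_retention_days: int
    deletion_protection: bool


# Values copied from the environment differences matrix in Section 10.
OVERRIDES = {
    "dev": EnvOverrides("Consolidated EC2 (t3.small)", "db.t4g.micro", False, 1, 30, False),
    "staging": EnvOverrides("Consolidated EC2 (t3.small)", "db.t4g.micro", False, 1, 30, False),
    "prod": EnvOverrides("ECS Fargate", "db.t4g.small", True, 2, 90, True),
}


def overrides_for(stack_name):
    """Look up the overrides for a stack; unknown stack names are rejected."""
    try:
        return OVERRIDES[stack_name]
    except KeyError:
        raise ValueError(f"unknown stack/environment: {stack_name!r}") from None


if __name__ == "__main__":
    print(overrides_for("prod"))
```

Rejecting unknown stack names, rather than falling back to dev values, avoids deploying a mistyped stack with the wrong sizing.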

11. Validation Checklist

Before deployment, verify:

  • All subnet CIDRs match this document
  • All service ports match this document
  • All S3 bucket names follow the pattern
  • Security groups reference correct CIDRs
  • All 12 DynamoDB tables match Section 3.2 (names, keys, GSIs, TTL)
  • All resources have required tags
  • Secrets exist in Secrets Manager
  • SSL enforcement enabled on RDS

Document Control:
- This document mirrors infra/shared/tradai_infra_shared/config.py, which is the actual source of truth. Update config.py first, then update this document.
- Changes require review and a version increment.
- When this document conflicts with config.py, config.py takes precedence.


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 9.3.0 | 2026-03-28 | Added all 12 DynamoDB tables with keys/GSIs/TTL (was 1 of 12). Fixed ownership: config.py is source of truth, this doc mirrors it. Added Compute Mode row to env differences. Added UDP note to NAT SG. Fixed validation checklist. |
| 9.2.1 | 2026-03-28 | Fixed region, health check paths, RDS version, Lambda table, log retention |

Dependencies

| If This Changes | Update This Doc |
| --- | --- |
| infra/shared/tradai_infra_shared/config.py | ALL sections (this doc mirrors config.py) |
| New DynamoDB table added | Section 3.2 (DynamoDB Tables) |
| New Lambda function added | Section 7 (Lambda Functions) |
| New ECS service added | Section 2 (Service Configuration) |