TradAI Infrastructure as Code (Pulumi)
Version: 9.2.1 | Date: 2026-03-28 | Status: CURRENT
TL;DR: 4-stack Pulumi architecture in eu-central-1. Stacks are layered by lifecycle: persistent (data-bearing, never destroyed) -> foundation (VPC, RDS) -> compute (ECS, Lambda, Step Functions) -> edge (API Gateway, WAF, CloudWatch). All config is centralized in infra/shared/tradai_infra_shared/config.py.
1. Architecture Overview
TradAI uses a 4-stack Pulumi architecture deployed in eu-central-1. Stacks are layered by lifecycle and dependency: persistent data is separated from ephemeral compute, which is separated from edge routing and monitoring.
Stack Dependency Flow
graph LR
A["persistent<br/><small>S3, DynamoDB, ECR<br/>Cognito, CodeArtifact</small>"] --> B["foundation<br/><small>VPC, Subnets, RDS<br/>SQS, SNS, NAT</small>"]
B --> C["compute<br/><small>IAM, ALB, ECS<br/>Lambda, Step Functions</small>"]
C --> D["edge<br/><small>API Gateway, WAF<br/>CloudWatch, Alarms</small>"]
style A fill:#2d6a4f,color:#fff
style B fill:#40916c,color:#fff
style C fill:#52b788,color:#000
style D fill:#95d5b2,color:#000
Why 4 Stacks?
| Concern | Stack | Rationale |
|---|---|---|
| Data preservation | persistent | S3, DynamoDB, ECR, Cognito are never destroyed during dev cycles |
| Network stability | foundation | VPC, subnets, RDS rarely change; destroying them is disruptive |
| Rapid iteration | compute | ECS services, Lambda, Step Functions change frequently |
| Edge independence | edge | API Gateway, WAF, CloudWatch can be updated without touching compute |
Cross-Stack Communication
StackReference Pattern
Stacks share outputs via Pulumi StackReference. Each downstream stack declares references to its upstream stacks and reads exported values. The org value defaults to "organization" and stack names match environments (dev, staging, prod), so a reference looks like organization/tradai-persistent/dev. Never hardcode resource IDs across stacks; always use StackReference.get_output(), as in this example:
# Example from compute/__main__.py
import pulumi

# Assumption: org comes from Pulumi config and defaults to "organization"
org = pulumi.Config().get("org") or "organization"
persistent = pulumi.StackReference(f"{org}/tradai-persistent/{pulumi.get_stack()}")
foundation = pulumi.StackReference(f"{org}/tradai-foundation/{pulumi.get_stack()}")
vpc_id = foundation.get_output("vpc_id")
ecr_repository_urls = persistent.get_output("ecr_repository_urls")
Shared Library
All four stacks depend on infra/shared/tradai_infra_shared/, a local Python package providing:
- config.py -- Single source of truth for all resource names, sizes, and feature flags
- core/ -- Reusable builder patterns (PolicyBuilder, SecurityRuleBuilder, etc.)
- testing.py -- PulumiMocks and test fixtures shared across stack test suites
2. Project Structure
infra/
├── persistent/  # Stack 1: data-bearing resources
│   ├── __main__.py
│   └── modules/
│       ├── s3.py  # 5 S3 buckets
│       ├── dynamodb.py  # 12 DynamoDB tables
│       ├── ecr.py  # 24 ECR repositories (6 service + 18 Lambda)
│       ├── cognito.py  # User pool, app clients, M2M client
│       ├── codeartifact.py  # Python package registry
│       ├── cloudtrail.py  # Audit logging
│       └── pulumi_backend.py  # Self-hosted Pulumi state in S3
│
├── foundation/  # Stack 2: networking + databases
│   ├── __main__.py
│   └── modules/
│       ├── vpc.py  # VPC, subnets (public/private/database), IGW, routes
│       ├── security_groups.py  # ALB, ECS, Lambda, RDS, NAT, endpoint SGs
│       ├── nacl.py  # Network ACLs per subnet tier
│       ├── nat_instance.py  # t4g.nano NAT instance (~$3/month)
│       ├── vpc_endpoints.py  # S3, DynamoDB, ECR, CloudWatch, STS endpoints
│       ├── vpc_flow_logs.py  # VPC traffic logging
│       ├── rds.py  # PostgreSQL 15 (MLflow metadata store)
│       ├── sqs.py  # Backtest queue + DLQ
│       ├── sns.py  # Alerts + registration topics
│       └── secret_rotation.py  # RDS credential rotation
│
├── compute/  # Stack 3: compute resources
│   ├── __main__.py
│   ├── asl_templates/  # Step Functions workflow templates
│   │   ├── backtest_workflow.json.j2
│   │   └── retraining_workflow.json.j2
│   └── modules/
│       ├── iam.py  # Execution/task/CLI-CI/consolidated roles
│       ├── alb.py  # Application Load Balancer + target groups
│       ├── ecs.py  # ECS cluster + strategy task definition
│       ├── ecs_services.py  # 7 ECS services + Service Discovery
│       ├── lambda_funcs.py  # 18 Lambda functions + EventBridge schedules
│       ├── step_functions.py  # Backtest + retraining workflows
│       ├── ec2_consolidated.py  # Single EC2 instance for dev/staging
│       └── ec2_userdata.py  # Userdata script generation
│
├── edge/  # Stack 4: routing + monitoring
│   ├── __main__.py
│   └── modules/
│       ├── api_gateway.py  # HTTP API, VPC Link, JWT auth, SQS integration
│       ├── waf.py  # Rate limiting, geo-blocking, managed rule groups
│       ├── cloudwatch_alarms.py  # Composite alarm, RDS/Lambda/StepFunctions alarms
│       └── cloudwatch_dashboard.py  # Operational dashboard
│
├── shared/  # Shared library (all stacks depend on this)
│   └── tradai_infra_shared/
│       ├── config.py  # All configuration values
│       ├── testing.py  # PulumiMocks for unit tests
│       └── core/
│           ├── policy_builder.py
│           ├── security_rule_builder.py
│           ├── asl_loader.py
│           ├── environment_builder.py
│           ├── ami_lookup.py
│           └── shell_template_env.py
│
├── ami/  # Packer AMI for consolidated EC2
│   ├── tradai-consolidated.pkr.hcl
│   └── files/
│
├── .env  # AWS_PROFILE, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
├── README.md
├── TEARDOWN.md
└── pulumi-ci.sh
3. Shared Configuration
All resource names, sizes, feature flags, and routing rules are centralized in a single file: infra/shared/tradai_infra_shared/config.py
This is the canonical source of truth. No stack module hardcodes resource names -- everything is imported from config.
Configuration Groups
| Group | Type | Contents |
|---|---|---|
| SERVICES | dict[str, dict] | 7 ECS services: cpu, memory, port, desired_count, health_check_path, spot |
| S3_BUCKETS | dict[str, str] | 5 buckets: configs, results, arcticdb, logs, mlflow |
| S3_BUCKET_CONFIG | dict[str, dict] | Per-bucket versioning and lifecycle rules |
| DYNAMODB_TABLES | dict[str, str] | 12 tables: workflow_state, idempotency, health_state, trading_state, deployments, drift_state, retraining_state, rollback_state, shadow_test_state, notifications, infra_drift_state, config_versions |
| RDS_CONFIG | dict | PostgreSQL 15.13, db.t4g.micro (dev) / db.t4g.small (prod), 20-50 GB |
| ECR_REPOS | list[str] | 6 service repos: backend, data-collection, strategy-service, mlflow, live-trading, dry-run-trading |
| LAMBDA_ECR_REPOS | list[str] | 18 Lambda image repos |
| STRATEGY_ECR_REPOS | list[str] | Dynamic list from Pulumi config (CI/CD-created strategy repos) |
| COGNITO_CONFIG | CognitoConfigDict | Password min 12, MFA required, all character classes |
| API_ROUTES | list[dict] | 21 HTTP routes: method, path, target service, auth required |
| API_THROTTLING | dict | Default 100 req/s, backtest POST 10 req/s |
| ALB_PATH_PATTERNS | dict[str, list[str]] | Path-based routing per service |
| LAMBDA_SCHEDULES | dict[str, str] | 6 EventBridge schedules (overridable via Pulumi config) |
| CONSOLIDATED_MODE | dict[str, bool] | dev/staging: True (EC2), prod: False (Fargate) |
| EC2_CONSOLIDATED_CONFIG | EC2ConsolidatedConfigDict | t3.small, 30 GB, 4 services |
| CORS_ALLOWED_ORIGINS | list[str] | ["*"] in dev, [] otherwise |
Naming Conventions
All resource names follow the pattern tradai-{component}-{ENVIRONMENT}. The config module provides helper functions to generate these consistently:
| Function | Purpose | Example Return |
|---|---|---|
| get_resource_name(component) | Standard resource name | tradai-workflow-state-dev |
| get_ecr_repo_name(repo_key) | ECR repo with slash | tradai/backend |
| get_tags(service=None) | Standard resource tags | {"Application": "tradai", "Environment": "dev", ...} |
| get_sd_namespace() | Service Discovery namespace | tradai-dev.local |
| get_mlflow_tracking_uri() | MLflow URI via SD | http://mlflow.tradai-dev.local:5000/mlflow |
| get_env_short() | Short env code | d, s, or p |
| is_consolidated_mode() | EC2 vs Fargate check | True for dev/staging |
| is_skip_alb() | Skip ALB creation | False by default |
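The helpers in the table can be approximated with plain string formatting. The following is a minimal sketch, not the actual contents of config.py: the explicit `environment` parameter and the module-level `ENVIRONMENT` default are assumptions made so the example runs without a Pulumi runtime.

```python
# Hypothetical sketch of the naming helpers; the real config.py may differ.
ENVIRONMENT = "dev"  # assumption: the real value comes from pulumi.get_stack()


def get_resource_name(component: str, environment: str = ENVIRONMENT) -> str:
    """Standard resource name: tradai-{component}-{environment}."""
    return f"tradai-{component}-{environment}"


def get_ecr_repo_name(repo_key: str) -> str:
    """ECR repositories use a slash-separated name, e.g. tradai/backend."""
    return f"tradai/{repo_key}"


def get_sd_namespace(environment: str = ENVIRONMENT) -> str:
    """Service Discovery namespace, e.g. tradai-dev.local."""
    return f"tradai-{environment}.local"


def get_mlflow_tracking_uri(environment: str = ENVIRONMENT) -> str:
    """MLflow URI resolved through the Service Discovery namespace."""
    return f"http://mlflow.{get_sd_namespace(environment)}:5000/mlflow"


def get_env_short(environment: str = ENVIRONMENT) -> str:
    """Short environment code: first letter of dev, staging, or prod."""
    return environment[0]


print(get_resource_name("workflow-state"))
print(get_mlflow_tracking_uri())
```

Keeping every name behind a helper like these is what lets stack modules avoid hardcoded resource names.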
Environment-Specific Behavior
The ENVIRONMENT variable comes from pulumi.get_stack() (the stack name is the environment). Several config values change by environment:
| Config | dev / staging | prod |
|---|---|---|
| Service desired_count | 1 | 2 (backend-api) |
| RDS instance class | db.t4g.micro | db.t4g.small |
| RDS storage | 20 GB | 50 GB |
| RDS multi-AZ | No | Yes |
| RDS backup retention | 7 days | 14 days |
| Log retention | 30 days | 90 days |
| Compute mode | Consolidated EC2 | ECS Fargate |
| Live-trading spot | No (never) | No (never) |
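One way to express the RDS rows of this table is a per-environment lookup. This is only a sketch: the `RdsSettings` type, the `rds_settings` function, and the dict layout are hypothetical, and only the values are taken from the table.

```python
from typing import TypedDict


class RdsSettings(TypedDict):
    instance_class: str
    storage_gb: int
    multi_az: bool
    backup_retention_days: int


# Values copied from the table above; the structure is an assumption.
_NON_PROD: RdsSettings = {
    "instance_class": "db.t4g.micro",
    "storage_gb": 20,
    "multi_az": False,
    "backup_retention_days": 7,
}
_PROD: RdsSettings = {
    "instance_class": "db.t4g.small",
    "storage_gb": 50,
    "multi_az": True,
    "backup_retention_days": 14,
}


def rds_settings(environment: str) -> RdsSettings:
    """Return RDS sizing for dev, staging, or prod."""
    if environment not in ("dev", "staging", "prod"):
        raise ValueError(f"unknown environment: {environment}")
    return _PROD if environment == "prod" else _NON_PROD


print(rds_settings("dev")["instance_class"])
print(rds_settings("prod")["backup_retention_days"])
```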
4. Stack: Persistent
Purpose: Data-bearing resources that survive dev-cycle teardowns. Destroying this stack means losing all market data, backtest results, container images, and user accounts.
Deploy: just infra-up-persistent
Entry point: infra/persistent/__main__.py
Resources
| Module | Resources | Key Exports |
|---|---|---|
| s3.py | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | s3_bucket_ids, s3_bucket_arns, per-bucket IDs |
| dynamodb.py | 12 DynamoDB tables with PAY_PER_REQUEST billing | Per-table name and ARN (24 exports) |
| ecr.py | 24 ECR repos (6 service + 18 Lambda) + strategy repos | ecr_repository_urls, ecr_repository_arns |
| cognito.py | User pool, web client, M2M client, Cognito domain | cognito_user_pool_id, cognito_user_pool_client_id, cognito_m2m_client_id, cognito_user_pool_endpoint |
| codeartifact.py | Domain + repository for Python packages | codeartifact_domain_name, codeartifact_repository_endpoint |
| cloudtrail.py | Management event trail writing to logs bucket | -- |
| pulumi_backend.py | S3 bucket for Pulumi state files | pulumi_state_bucket_name |
Design Decisions
- Separate from foundation: VPC teardown should never risk deleting S3 or DynamoDB data.
- ECR here, not compute: Images must exist before the compute stack can create task definitions. just lambda-bootstrap pushes images to ECR between persistent and compute deployments.
- Cognito here, not edge: User pool is data-bearing (user accounts persist).
5. Stack: Foundation
Purpose: Networking infrastructure and VPC-bound databases. Rarely destroyed because VPC changes cascade to every resource with a subnet or security group dependency.
Deploy: just infra-up-foundation (after persistent)
Entry point: infra/foundation/__main__.py
Resources
| Module | Resources | Key Exports |
|---|---|---|
| vpc.py | VPC (10.0.0.0/16), 2 AZs, 6 subnets (2 public, 2 private, 2 database), IGW, route tables | vpc_id, public_subnet_ids, private_subnet_ids, database_subnet_ids |
| security_groups.py | 6 security groups: ALB, ECS, Lambda, RDS, NAT, VPC endpoints; optional consolidated SG | Per-SG IDs |
| nacl.py | Network ACLs for public, private, and database subnets | -- |
| nat_instance.py | t4g.nano NAT instance (replaces NAT Gateway for cost, ~$3/month) | -- |
| vpc_endpoints.py | Gateway endpoints (S3, DynamoDB) + interface endpoints (ECR, CloudWatch, STS) | -- |
| vpc_flow_logs.py | VPC traffic logging to CloudWatch | -- |
| rds.py | PostgreSQL 15.13 instance with subnet group, parameter group | rds_endpoint, rds_secret_arn, rds_database_name, rds_instance_identifier |
| sqs.py | Backtest request queue + dead letter queue | backtest_queue_url, backtest_queue_arn, backtest_dlq_arn |
| sns.py | Alerts topic (with email subscriptions) + registration topic | sns_alerts_topic_arn, sns_registration_topic_arn |
| secret_rotation.py | RDS credential rotation via Lambda + SNS notification | -- |
Design Decisions
- RDS in foundation, not persistent: RDS requires VPC subnet groups. Placing it with networking avoids circular cross-stack dependencies.
- SQS/SNS in foundation: Messaging resources are VPC-adjacent and consumed by both compute and edge stacks.
- NAT instance over NAT Gateway: t4g.nano costs ~$3/month vs ~$32/month for NAT Gateway. Acceptable for dev/staging traffic volumes.
6. Stack: Compute
Purpose: All compute workloads -- the most frequently updated stack. Contains IAM roles, ECS cluster and services, Lambda functions, Step Functions workflows, ALB, and the optional consolidated EC2 instance.
Deploy: just infra-up-compute (after foundation + Lambda images pushed to ECR)
Entry point: infra/compute/__main__.py
Stack References
Compute reads from both upstream stacks:
- persistent: ECR repo URLs, S3 bucket IDs, DynamoDB table names
- foundation: VPC ID, subnet IDs, security group IDs, RDS endpoint, SQS/SNS ARNs
Resources
| Module | Resources | Key Exports |
|---|---|---|
| iam.py | ECS execution role, ECS task role, CLI/CI role, consolidated instance role + profile | ecs_execution_role_arn, ecs_task_role_arn, cli_ci_role_arn |
| alb.py | ALB, HTTP/HTTPS listeners, target groups per service (skippable via config) | alb_arn, alb_dns_name, listener ARNs |
| ecs.py | ECS cluster (Fargate + Fargate Spot capacity), strategy task definition, log group | ecs_cluster_arn, strategy_task_definition_arn |
| ecs_services.py | 7 ECS services with Service Discovery namespace, per-service task definitions | ecs_service_arns, service_discovery_namespace_id |
| lambda_funcs.py | 18 Lambda functions, EventBridge schedules, SQS event source mapping | lambda_function_arns |
| step_functions.py | Backtest + retraining state machines (ASL from Jinja2 templates) | backtest_workflow_arn, retraining_workflow_name |
| ec2_consolidated.py | Single EC2 (t3.small) running Docker Compose for dev/staging | consolidated_asg_name |
| ec2_userdata.py | Generates cloud-init userdata scripts for consolidated instance | -- |
Consolidated EC2 Mode
In dev/staging, is_consolidated_mode() returns True. Instead of running 4 separate Fargate tasks (~$37/month), a single t3.small EC2 instance (~$15/month) runs backend-api, data-collection, mlflow, and strategy-service via Docker Compose. The instance:
- Registers with ALB target groups for each service
- Registers with Service Discovery for inter-service communication
- Pulls images from ECR using an instance profile role
- Receives per-service environment variables (S3 buckets, RDS endpoint, SQS URL, etc.)
In prod, is_consolidated_mode() returns False and services run as individual Fargate tasks for reliability and independent scaling.
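The mode switch described above could look roughly like the sketch below. The `CONSOLIDATED_MODE` values and the four service names come from this document; `plan_compute`, its return shape, and the fallback to Fargate for unknown environments are assumptions for illustration.

```python
# Values from the CONSOLIDATED_MODE config group described in section 3.
CONSOLIDATED_MODE: dict[str, bool] = {"dev": True, "staging": True, "prod": False}

# Services that share the single EC2 instance in consolidated mode.
CONSOLIDATED_SERVICES = ["backend-api", "data-collection", "mlflow", "strategy-service"]


def is_consolidated_mode(environment: str) -> bool:
    """True when services share one EC2 instance instead of Fargate tasks."""
    # Assumption: unknown environments fall back to Fargate (False).
    return CONSOLIDATED_MODE.get(environment, False)


def plan_compute(environment: str) -> dict[str, object]:
    """Hypothetical helper describing where the four services would run."""
    if is_consolidated_mode(environment):
        return {"mode": "ec2-docker-compose", "instances": 1, "services": CONSOLIDATED_SERVICES}
    return {"mode": "ecs-fargate", "instances": len(CONSOLIDATED_SERVICES), "services": CONSOLIDATED_SERVICES}


print(plan_compute("dev")["mode"])
print(plan_compute("prod")["mode"])
```

With the figures quoted above, consolidated mode saves roughly $37 - $15 = $22 per month per non-production environment.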
Lambda Deployment Model
Lambda functions use container images (not ZIP). Each Lambda has a dedicated ECR repo in the persistent stack. The deployment flow is:
1. just lambda-bootstrap builds and pushes all 18 Lambda images to ECR
2. just infra-up-compute creates Lambda functions referencing those image URIs
3. Some Lambdas are "deferred" -- created after Step Functions so they can receive the workflow ARN as an environment variable
Step Functions Workflows
Two state machines are defined using Jinja2 ASL templates in infra/compute/asl_templates/:
- backtest_workflow.json.j2 -- Orchestrates: validate strategy, run ECS task, consume results, update status
- retraining_workflow.json.j2 -- Orchestrates: check if retraining needed, run training, compare models, promote
Templates are rendered at deploy time with Lambda ARNs, ECS cluster ARN, subnet IDs, and security group IDs injected as template variables.
7. Stack: Edge
Purpose: External-facing routing and operational monitoring. Can be redeployed independently of compute resources.
Deploy: just infra-up-edge (after compute)
Entry point: infra/edge/__main__.py
Stack References
Edge reads from all three upstream stacks:
- persistent: Cognito user pool ID, client IDs, endpoint
- foundation: VPC ID, subnet IDs, ECS SG, SQS queue URL/ARN, SNS topic ARN, RDS identifier
- compute: ALB DNS name, listener ARNs, Step Functions workflow names
Resources
| Module | Resources | Key Exports |
|---|---|---|
| api_gateway.py | HTTP API, VPC Link to ALB, JWT authorizer (Cognito), SQS integration for backtest POST, route table (21 routes), custom domain (optional) | api_gateway_endpoint, api_gateway_id |
| waf.py | WebACL with rate limiting, geo-blocking, AWS managed rule groups; CloudWatch log group | waf_web_acl_id, waf_web_acl_arn |
| cloudwatch_alarms.py | Composite alarm, RDS CPU/storage alarms, Lambda error alarms, Step Functions failure alarms, stale heartbeat alarm | composite_alarm_arn, lambda_error_alarm_arns |
| cloudwatch_dashboard.py | Operational dashboard with ECS, Lambda, RDS, and API Gateway widgets | dashboard_url |
Design Decisions
- API Gateway uses VPC Link: Routes traffic to ALB in private subnets; no public ALB needed.
- Backtest POST routes to SQS: API Gateway has native SQS integration -- the request goes directly to the backtest queue without hitting any compute resource.
- WAF not associated with HTTP API: WAFv2 cannot parse the $default stage ARN of HTTP APIs. The WebACL is created but requires REST API or ALB association to be effective.
- JWT auth via Cognito: API Gateway validates JWTs against the Cognito user pool. Health check endpoint is unauthenticated.
8. Core Patterns
The shared library at infra/shared/tradai_infra_shared/core/ provides reusable builder patterns that eliminate duplication across stack modules.
PolicyBuilder
File: infra/shared/tradai_infra_shared/core/policy_builder.py
Fluent builder for IAM policies. Provides preset methods for common permission sets (DynamoDB CRUD, S3 read/write, Secrets Manager, CloudWatch, SNS, CodeArtifact) and composes them into a single policy document. Also provides AssumeRolePolicyBuilder with factory methods like for_ecs_tasks() and for_lambda().
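# Example: building a task role policy
# Import path inferred from the file location above
from tradai_infra_shared.core.policy_builder import PolicyBuilder

policy = (
    PolicyBuilder("tradai")
    .with_dynamodb_crud()
    .with_s3_readwrite()
    .with_secrets_read()
    .with_cloudwatch_metrics()
    .build()
)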
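To illustrate the fluent pattern itself, here is a self-contained sketch of how such a builder might accumulate statements and emit a policy document. It is not the real PolicyBuilder: the class name `PolicyBuilderSketch`, the interpretation of the constructor argument as a resource-name prefix, and the exact action lists and ARN patterns are all assumptions.

```python
import json


class PolicyBuilderSketch:
    """Illustrative fluent IAM policy builder (not the real implementation)."""

    def __init__(self, prefix: str) -> None:
        # Assumption: the constructor argument is a resource-name prefix.
        self._prefix = prefix
        self._statements: list[dict] = []

    def _allow(self, sid: str, actions: list[str], resources: list[str]) -> "PolicyBuilderSketch":
        # Each preset appends one Allow statement and returns self for chaining.
        self._statements.append(
            {"Sid": sid, "Effect": "Allow", "Action": actions, "Resource": resources}
        )
        return self

    def with_dynamodb_crud(self) -> "PolicyBuilderSketch":
        actions = [
            "dynamodb:GetItem",
            "dynamodb:PutItem",
            "dynamodb:UpdateItem",
            "dynamodb:DeleteItem",
            "dynamodb:Query",
        ]
        return self._allow("DynamoDbCrud", actions, [f"arn:aws:dynamodb:*:*:table/{self._prefix}-*"])

    def with_s3_readwrite(self) -> "PolicyBuilderSketch":
        actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        resources = [f"arn:aws:s3:::{self._prefix}-*", f"arn:aws:s3:::{self._prefix}-*/*"]
        return self._allow("S3ReadWrite", actions, resources)

    def build(self) -> str:
        """Compose all statements into a single JSON policy document."""
        return json.dumps({"Version": "2012-10-17", "Statement": self._statements})


policy_json = PolicyBuilderSketch("tradai").with_dynamodb_crud().with_s3_readwrite().build()
print(policy_json)
```

Returning `self` from every preset is what makes the chained call style in the example above possible.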
SecurityRuleBuilder
File: infra/shared/tradai_infra_shared/core/security_rule_builder.py
Fluent builder for security group rules. Provides CommonPorts with predefined port ranges for all services and PortRange for custom ranges. Methods like ingress_from_sg() and egress_https_to_internet() create rules with consistent naming and tagging.
AslTemplateLoader
File: infra/shared/tradai_infra_shared/core/asl_loader.py
Jinja2-based loader for Amazon States Language (ASL) workflow definitions. Renders .json.j2 templates from infra/compute/asl_templates/ with Lambda ARNs, ECS config, and subnet IDs injected as variables. Validates rendered output is valid JSON.
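The render-then-validate flow can be sketched with the standard library alone. Note the substitution: this example uses `string.Template` as a stand-in for Jinja2 so it runs without third-party packages, and the template content, the `render_asl` function, and the ARN are hypothetical.

```python
import json
from string import Template

# Hypothetical one-state workflow; real templates live in asl_templates/.
ASL_TEMPLATE = Template(
    """
{
  "StartAt": "ValidateStrategy",
  "States": {
    "ValidateStrategy": {
      "Type": "Task",
      "Resource": "$validate_lambda_arn",
      "End": true
    }
  }
}
"""
)


def render_asl(template: Template, variables: dict[str, str]) -> str:
    """Render the template and verify the result parses as JSON."""
    # substitute() raises KeyError when a variable is missing, which is
    # comparable in spirit to Jinja2's StrictUndefined.
    rendered = template.substitute(variables)
    json.loads(rendered)  # raises json.JSONDecodeError on invalid output
    return rendered


definition = render_asl(
    ASL_TEMPLATE,
    {"validate_lambda_arn": "arn:aws:lambda:eu-central-1:123456789012:function:example"},
)
print(json.loads(definition)["StartAt"])
```

Validating at render time surfaces template mistakes during `pulumi preview` rather than when the state machine is created.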
EnvironmentBuilder
File: infra/shared/tradai_infra_shared/core/environment_builder.py
Fluent builder for container environment variables that handles both static strings and Pulumi Output values. Methods include add() for static values, add_output() for Pulumi Outputs, and add_if() for conditional variables. Resolves all Outputs at build time using Output.all().apply().
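A simplified sketch of the static-value half of this builder follows. It deliberately omits add_output() and Output resolution, since those require a Pulumi runtime; the class name `EnvironmentBuilderSketch`, the add_if() argument order, and the variable names are assumptions. The output shape matches the name/value pairs that ECS container definitions expect.

```python
class EnvironmentBuilderSketch:
    """Illustrative builder for container environment variables."""

    def __init__(self) -> None:
        self._vars: dict[str, str] = {}

    def add(self, name: str, value: str) -> "EnvironmentBuilderSketch":
        """Add a static variable; later calls overwrite earlier ones."""
        self._vars[name] = value
        return self

    def add_if(self, condition: bool, name: str, value: str) -> "EnvironmentBuilderSketch":
        """Add the variable only when the condition holds."""
        if condition:
            self.add(name, value)
        return self

    def build(self) -> list[dict[str, str]]:
        """Return name/value pairs in insertion order."""
        return [{"name": name, "value": value} for name, value in self._vars.items()]


env = (
    EnvironmentBuilderSketch()
    .add("ENVIRONMENT", "dev")
    .add_if(False, "DEBUG", "1")  # skipped because the condition is False
    .add_if(True, "AWS_REGION", "eu-central-1")
    .build()
)
print(env)
```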
Additional Utilities
| Utility | File | Purpose |
|---|---|---|
| get_latest_al2023_ami() | core/ami_lookup.py | Looks up latest Amazon Linux 2023 AMI by architecture (arm64 or x86_64) |
| make_shell_template_env() | core/shell_template_env.py | Creates Jinja2 Environment for .sh.j2 userdata templates with StrictUndefined |
| PulumiMocks | testing.py | Mock Pulumi runtime for unit testing stack modules |
9. Deployment
Prerequisites
- AWS profile is configured via Pulumi config (pulumi config set aws:profile tradai), not as an environment variable.
- Configure infra/.env with the required variables (see infra/.env.example):
  - PULUMI_CONFIG_PASSPHRASE=<your passphrase>
  - S3_PULUMI_BACKEND_URL=s3://<pulumi-state-bucket>
- The justfile _infra-run helper sources infra/.env automatically before running Pulumi commands, then logs into the S3 backend and selects the target stack.
Full Bootstrap (First Deployment)
The full bootstrap runs the following sequence:
1. just infra-up-persistent dev -- S3, DynamoDB, ECR, Cognito, CodeArtifact
2. just infra-up-foundation dev -- VPC, RDS, SQS, SNS
3. just lambda-bootstrap -- Build and push all 18 Lambda images to ECR
4. just service-push-all -- Build and push all service images to ECR
5. just infra-up-compute dev -- IAM, ALB, ECS, Lambda, Step Functions, EC2
6. just infra-up-edge dev -- API Gateway, WAF, CloudWatch
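The stack ordering in this sequence follows from the dependency graph, and can be derived rather than memorized. The sketch below uses Python's `graphlib` for that; the `UPSTREAM` mapping is an assumption based on the stack references and deploy notes in this document, and the image-push steps are left out because they are not Pulumi stacks.

```python
from graphlib import TopologicalSorter

# Each stack maps to the upstream stacks that must be deployed before it.
UPSTREAM: dict[str, set[str]] = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"persistent", "foundation"},
    "edge": {"persistent", "foundation", "compute"},
}


def deploy_order() -> list[str]:
    """Upstream stacks first, so every StackReference resolves."""
    return list(TopologicalSorter(UPSTREAM).static_order())


def destroy_order() -> list[str]:
    """Teardown runs in reverse so nothing is removed while still referenced."""
    return list(reversed(deploy_order()))


print(deploy_order())
print(destroy_order())
```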
Individual Stack Commands
| Command | Purpose |
|---|---|
| just infra-up-persistent [stack] | Deploy persistent stack |
| just infra-up-foundation [stack] | Deploy foundation stack |
| just infra-up-compute [stack] | Deploy compute stack (pre-flight checks Lambda images in ECR) |
| just infra-up-edge [stack] | Deploy edge stack |
| just infra-preview [stack] | Preview all 4 stacks |
| just infra-preview-{layer} [stack] | Preview individual stack |
| just infra-down-soft [stack] | Destroy edge + compute only (data and networking preserved) |
| just infra-down-all [stack] | Destroy edge + compute + foundation (persistent preserved) |
| just infra-down-persistent [stack] | DANGER: destroy all data (requires confirmation prompt) |
Default stack is dev for all commands.
Compute Pre-Flight Check
just infra-up-compute verifies that all 18 Lambda images exist in ECR before deploying. If any image is missing, it fails with instructions to run just lambda-bootstrap first. This prevents partial deployments where Lambda functions reference non-existent images.
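The core of such a check is a set difference between the repositories that should contain an image and those that do. The sketch below shows only that logic: the function names and repository names are hypothetical, and the ECR lookup that would produce `repos_with_images` is out of scope.

```python
def find_missing_images(expected_repos: list[str], repos_with_images: set[str]) -> list[str]:
    """Return expected repositories that have no pushed image, sorted by name."""
    return sorted(repo for repo in expected_repos if repo not in repos_with_images)


def preflight(expected_repos: list[str], repos_with_images: set[str]) -> None:
    """Fail before deployment if any Lambda image is missing."""
    missing = find_missing_images(expected_repos, repos_with_images)
    if missing:
        raise RuntimeError(
            f"Missing Lambda images in ECR: {', '.join(missing)}. "
            "Run `just lambda-bootstrap` first."
        )


# Hypothetical repository names, for illustration only.
expected = ["tradai/lambda-a", "tradai/lambda-b", "tradai/lambda-c"]
present = {"tradai/lambda-a", "tradai/lambda-c"}
print(find_missing_images(expected, present))
```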
10. Cross-Stack References
How It Works
Each stack exports named outputs via pulumi.export(). Downstream stacks read these using pulumi.StackReference(f"{org}/tradai-{stack_name}/{environment}").
Key Exports Per Stack
| Stack | Key Exports | Consumed By |
|---|---|---|
| persistent | ecr_repository_urls, s3_*_bucket_id (x5), DynamoDB table names/ARNs (x24), cognito_user_pool_id, cognito_user_pool_client_id, cognito_m2m_client_id, cognito_user_pool_endpoint | compute, edge |
| foundation | vpc_id, public_subnet_ids, private_subnet_ids, security group IDs (x6), rds_endpoint, rds_secret_arn, rds_database_name, rds_instance_identifier, backtest_queue_url, backtest_queue_arn, backtest_dlq_arn, sns_alerts_topic_arn | compute, edge |
| compute | alb_dns_name, alb_https_listener_arn, alb_http_listener_arn, ecs_cluster_arn, backtest_workflow_arn, backtest_workflow_name, retraining_workflow_name, lambda_function_arns | edge |
| edge | api_gateway_endpoint, api_gateway_id, waf_web_acl_id, composite_alarm_arn, dashboard_url | (terminal -- consumed by users/CI) |
Reference Pattern
The org value defaults to "organization" and can be overridden via pulumi config set org <value>. Stack names match environments (dev, staging, prod), so a reference looks like: organization/tradai-persistent/dev.
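As a small illustration of this naming scheme, the fully qualified reference can be assembled by a helper like the one below. The function `stack_reference_name` and its validation are hypothetical; only the `{org}/tradai-{stack}/{environment}` format is taken from this document.

```python
STACKS = ("persistent", "foundation", "compute", "edge")


def stack_reference_name(stack: str, environment: str, org: str = "organization") -> str:
    """Build the fully qualified name passed to pulumi.StackReference."""
    if stack not in STACKS:
        raise ValueError(f"unknown stack: {stack}")
    return f"{org}/tradai-{stack}/{environment}"


print(stack_reference_name("persistent", "dev"))
print(stack_reference_name("foundation", "prod", org="my-org"))
```

Centralizing the format in one function keeps every downstream stack consistent if the org or project prefix ever changes.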
Changelog
| Version | Date | Changes |
|---|---|---|
| 9.2.1 | 2026-03-28 | Full regeneration. Corrected to 4-stack architecture, eu-central-1, tables over code |
Dependencies
| If This Changes | Update This Doc |
|---|---|
| New Pulumi stack or module added | Stack table, module listings |
| infra/shared/tradai_infra_shared/config.py | Config reference section |
| Stack deployment order changes | Dependency flow diagram |