TradAI Infrastructure as Code (Pulumi)

Version: 9.2.1 | Date: 2026-03-28 | Status: CURRENT

TL;DR: 4-stack Pulumi architecture in eu-central-1. Stacks are layered by lifecycle: persistent (data-bearing, preserved across dev-cycle teardowns) -> foundation (VPC, RDS) -> compute (ECS, Lambda, Step Functions) -> edge (API Gateway, WAF, CloudWatch). All config is centralized in infra/shared/tradai_infra_shared/config.py.

1. Architecture Overview

TradAI uses a 4-stack Pulumi architecture deployed in eu-central-1. Stacks are layered by lifecycle and dependency: persistent data is separated from ephemeral compute, which is separated from edge routing and monitoring.

Stack Dependency Flow

graph LR
    A["persistent<br/><small>S3, DynamoDB, ECR<br/>Cognito, CodeArtifact</small>"] --> B["foundation<br/><small>VPC, Subnets, RDS<br/>SQS, SNS, NAT</small>"]
    B --> C["compute<br/><small>IAM, ALB, ECS<br/>Lambda, Step Functions</small>"]
    C --> D["edge<br/><small>API Gateway, WAF<br/>CloudWatch, Alarms</small>"]

    style A fill:#2d6a4f,color:#fff
    style B fill:#40916c,color:#fff
    style C fill:#52b788,color:#000
    style D fill:#95d5b2,color:#000

Why 4 Stacks?

Concern	Stack	Rationale
Data preservation	persistent	S3, DynamoDB, ECR, Cognito are never destroyed during dev cycles
Network stability	foundation	VPC, subnets, RDS rarely change; destroying them is disruptive
Rapid iteration	compute	ECS services, Lambda, Step Functions change frequently
Edge independence	edge	API Gateway, WAF, CloudWatch can be updated without touching compute

Cross-Stack Communication

StackReference Pattern

Stacks share outputs via Pulumi StackReference. Each downstream stack declares references to its upstream stacks and reads exported values. The org value defaults to "organization" and stack names match environments (dev, staging, prod), so a reference looks like organization/tradai-persistent/dev. Never hardcode resource IDs across stacks -- always use StackReference.get_output().

The compute stack, for example, declares its references like this:

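# Example from compute/__main__.py
import pulumi

# Assumed lookup: org defaults to "organization", overridable via `pulumi config set org <value>`
org = pulumi.Config().get("org") or "organization"

persistent = pulumi.StackReference(f"{org}/tradai-persistent/{pulumi.get_stack()}")
foundation = pulumi.StackReference(f"{org}/tradai-foundation/{pulumi.get_stack()}")

vpc_id = foundation.get_output("vpc_id")
ecr_repository_urls = persistent.get_output("ecr_repository_urls")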

Shared Library

All four stacks depend on infra/shared/tradai_infra_shared/, a local Python package providing:

- config.py -- Single source of truth for all resource names, sizes, and feature flags
- core/ -- Reusable builder patterns (PolicyBuilder, SecurityRuleBuilder, etc.)
- testing.py -- PulumiMocks and test fixtures shared across stack test suites

2. Project Structure

infra/
├── persistent/                    # Stack 1: data-bearing resources
│   ├── __main__.py
│   └── modules/
│       ├── s3.py                  # 5 S3 buckets
│       ├── dynamodb.py            # 12 DynamoDB tables
│       ├── ecr.py                 # 22 ECR repositories (4 service + 18 Lambda)
│       ├── cognito.py             # User pool, app clients, M2M client
│       ├── codeartifact.py        # Python package registry
│       ├── cloudtrail.py          # Audit logging
│       └── pulumi_backend.py      # Self-hosted Pulumi state in S3
│
├── foundation/                    # Stack 2: networking + databases
│   ├── __main__.py
│   └── modules/
│       ├── vpc.py                 # VPC, subnets (public/private/database), IGW, routes
│       ├── security_groups.py     # ALB, ECS, Lambda, RDS, NAT, endpoint SGs
│       ├── nacl.py                # Network ACLs per subnet tier
│       ├── nat_instance.py        # t4g.nano NAT instance (~$3/month)
│       ├── vpc_endpoints.py       # S3, DynamoDB, ECR, CloudWatch, STS endpoints
│       ├── vpc_flow_logs.py       # VPC traffic logging
│       ├── rds.py                 # PostgreSQL 15 (MLflow metadata store)
│       ├── sqs.py                 # Backtest queue + DLQ
│       ├── sns.py                 # Alerts + registration topics
│       └── secret_rotation.py     # RDS credential rotation
│
├── compute/                       # Stack 3: compute resources
│   ├── __main__.py
│   ├── asl_templates/             # Step Functions workflow templates
│   │   ├── backtest_workflow.json.j2
│   │   └── retraining_workflow.json.j2
│   └── modules/
│       ├── iam.py                 # Execution/task/CLI-CI/consolidated roles
│       ├── alb.py                 # Application Load Balancer + target groups
│       ├── ecs.py                 # ECS cluster + strategy task definition
│       ├── ecs_services.py        # 5 ECS services + per-strategy services + Service Discovery
│       ├── lambda_funcs.py        # 18 Lambda functions + EventBridge schedules
│       ├── step_functions.py      # Backtest + retraining workflows
│       ├── ec2_consolidated.py    # Single EC2 instance for dev/staging
│       └── ec2_userdata.py        # Userdata script generation
│
├── edge/                          # Stack 4: routing + monitoring
│   ├── __main__.py
│   └── modules/
│       ├── api_gateway.py         # HTTP API, VPC Link, JWT auth, SQS integration
│       ├── waf.py                 # Rate limiting, geo-blocking, managed rule groups
│       ├── cloudwatch_alarms.py   # Composite alarm, RDS/Lambda/StepFunctions alarms
│       └── cloudwatch_dashboard.py # Operational dashboard
│
├── shared/                        # Shared library (all stacks depend on this)
│   └── tradai_infra_shared/
│       ├── config.py              # All configuration values
│       ├── testing.py             # PulumiMocks for unit tests
│       └── core/
│           ├── policy_builder.py
│           ├── security_rule_builder.py
│           ├── asl_loader.py
│           ├── environment_builder.py
│           ├── ami_lookup.py
│           └── shell_template_env.py
│
├── ami/                           # Packer AMI for consolidated EC2
│   ├── tradai-consolidated.pkr.hcl
│   └── files/
│
├── .env                           # PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
├── README.md
├── TEARDOWN.md
└── pulumi-ci.sh

3. Shared Configuration

All resource names, sizes, feature flags, and routing rules are centralized in a single file: infra/shared/tradai_infra_shared/config.py

This is the canonical source of truth. No stack module hardcodes resource names -- everything is imported from config.

Configuration Groups

Group	Type	Contents
`SERVICES`	`dict[str, dict]`	5 ECS services: cpu, memory, port, desired_count, health_check_path, spot (trading runs as per-strategy `tradai-strategy-{slug}` services, not in this dict)
`S3_BUCKETS`	`dict[str, str]`	5 buckets: configs, results, arcticdb, logs, mlflow
`S3_BUCKET_CONFIG`	`dict[str, dict]`	Per-bucket versioning and lifecycle rules
`DYNAMODB_TABLES`	`dict[str, str]`	12 tables: workflow_state, idempotency, health_state, trading_state, deployments, drift_state, retraining_state, rollback_state, shadow_test_state, notifications, infra_drift_state, config_versions
`RDS_CONFIG`	`dict`	PostgreSQL 15.13, db.t4g.micro (dev) / db.t4g.small (prod), 20-50 GB
`ECR_REPOS`	`list[str]`	4 service repos: backend, data-collection, strategy-service, mlflow
`LAMBDA_ECR_REPOS`	`list[str]`	18 Lambda image repos
`STRATEGY_ECR_REPOS`	`list[str]`	Dynamic list from Pulumi config (CI/CD-created strategy repos)
`COGNITO_CONFIG`	`CognitoConfigDict`	Password min 12, MFA required, all character classes
`API_ROUTES`	`list[dict]`	21 HTTP routes: method, path, target service, auth required
`API_THROTTLING`	`dict`	Default 100 req/s, backtest POST 10 req/s
`ALB_PATH_PATTERNS`	`dict[str, list[str]]`	Path-based routing per service
`LAMBDA_SCHEDULES`	`dict[str, str]`	6 EventBridge schedules (overridable via Pulumi config)
`CONSOLIDATED_MODE`	`dict[str, bool]`	dev/staging: True (EC2), prod: False (Fargate)
`EC2_CONSOLIDATED_CONFIG`	`EC2ConsolidatedConfigDict`	t3.small, 30 GB, 4 services
`CORS_ALLOWED_ORIGINS`	`list[str]`	`["*"]` in dev, `[]` otherwise

Naming Conventions

All resource names follow the pattern tradai-{component}-{ENVIRONMENT}. The config module provides helper functions to generate these consistently:

Function	Purpose	Example Return
`get_resource_name(component)`	Standard resource name	`tradai-workflow-state-dev`
`get_ecr_repo_name(repo_key)`	ECR repo with slash	`tradai/backend`
`get_tags(service=None)`	Standard resource tags	`{"Application": "tradai", "Environment": "dev", ...}`
`get_sd_namespace()`	Service Discovery namespace	`tradai-dev.local`
`get_mlflow_tracking_uri()`	MLflow URI via SD	`http://mlflow.tradai-dev.local:5000/mlflow`
`get_env_short()`	Short env code	`d`, `s`, or `p`
`is_consolidated_mode()`	EC2 vs Fargate check	`True` for dev/staging
`is_skip_alb()`	Skip ALB creation	`False` by default
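
The naming helpers in the table could be sketched roughly as follows. This is an illustration only, not the contents of config.py: the function bodies, the `PROJECT` constant, and the hardcoded `ENVIRONMENT` value are assumptions (the real module derives the environment from the Pulumi stack name).

```python
# Hypothetical sketch of the naming helpers; bodies are assumed, not copied from config.py.
PROJECT = "tradai"
ENVIRONMENT = "dev"  # assumption: the real module derives this from pulumi.get_stack()


def get_resource_name(component: str) -> str:
    """Return the standard name tradai-{component}-{ENVIRONMENT}."""
    return f"{PROJECT}-{component}-{ENVIRONMENT}"


def get_ecr_repo_name(repo_key: str) -> str:
    """Return the slash-separated ECR repository name."""
    return f"{PROJECT}/{repo_key}"


def get_sd_namespace() -> str:
    """Return the Service Discovery namespace for this environment."""
    return f"{PROJECT}-{ENVIRONMENT}.local"


def get_env_short() -> str:
    """Return the one-letter environment code (d, s, or p)."""
    return ENVIRONMENT[0]


print(get_resource_name("workflow-state"))
print(get_ecr_repo_name("backend"))
print(get_sd_namespace())
```

Routing every name through one function keeps the tradai-{component}-{ENVIRONMENT} pattern in a single place, so a change to the convention does not require touching stack modules.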

Environment-Specific Behavior

The ENVIRONMENT variable comes from pulumi.get_stack() (the stack name is the environment). Several config values change by environment:

Config	dev / staging	prod
Service desired_count	1	2 (backend-api)
RDS instance class	db.t4g.micro	db.t4g.small
RDS storage	20 GB	50 GB
RDS multi-AZ	No	Yes
RDS backup retention	7 days	14 days
Log retention	30 days	90 days
Compute mode	Consolidated EC2	ECS Fargate
Live-trading spot	No (never)	No (never)
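
A minimal sketch of how the environment switch in the table might be expressed in code. The function name `rds_settings`, its dictionary keys, and the explicit `environment` parameter are assumptions made so the example runs without Pulumi; the documented `is_consolidated_mode()` takes no arguments.

```python
# Hypothetical sketch: environment-dependent values taken from the table above.
CONSOLIDATED_MODE = {"dev": True, "staging": True, "prod": False}


def rds_settings(environment: str) -> dict:
    """Return RDS sizing for an environment (prod vs. everything else)."""
    is_prod = environment == "prod"
    return {
        "instance_class": "db.t4g.small" if is_prod else "db.t4g.micro",
        "allocated_storage_gb": 50 if is_prod else 20,
        "multi_az": is_prod,
        "backup_retention_days": 14 if is_prod else 7,
    }


def is_consolidated_mode(environment: str) -> bool:
    """True when services share one EC2 instance instead of Fargate tasks."""
    # Unknown environments fall back to Fargate (an assumption of this sketch).
    return CONSOLIDATED_MODE.get(environment, False)


print(rds_settings("dev"))
print(is_consolidated_mode("prod"))
```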

4. Stack: Persistent

Purpose: Data-bearing resources that survive dev-cycle teardowns. Destroying this stack means losing all market data, backtest results, container images, and user accounts.

Deploy: just infra-up-persistent
Entry point: infra/persistent/__main__.py

Resources

Module	Resources	Key Exports
`s3.py`	5 S3 buckets (configs, results, arcticdb, logs, mlflow)	`s3_bucket_ids`, `s3_bucket_arns`, per-bucket IDs
`dynamodb.py`	12 DynamoDB tables with PAY_PER_REQUEST billing	Per-table name and ARN (24 exports)
`ecr.py`	22 ECR repos (4 service + 18 Lambda) + strategy repos	`ecr_repository_urls`, `ecr_repository_arns`
`cognito.py`	User pool, web client, M2M client, Cognito domain	`cognito_user_pool_id`, `cognito_user_pool_client_id`, `cognito_m2m_client_id`, `cognito_user_pool_endpoint`
`codeartifact.py`	Domain + repository for Python packages	`codeartifact_domain_name`, `codeartifact_repository_endpoint`
`cloudtrail.py`	Management event trail writing to logs bucket	--
`pulumi_backend.py`	S3 bucket for Pulumi state files	`pulumi_state_bucket_name`

Design Decisions

Separate from foundation: VPC teardown should never risk deleting S3 or DynamoDB data.
ECR here, not compute: Images must exist before the compute stack can create task definitions. just lambda-bootstrap pushes images to ECR between the persistent and compute deployments.
Cognito here, not edge: User pool is data-bearing (user accounts persist).

5. Stack: Foundation

Purpose: Networking infrastructure and VPC-bound databases. Rarely destroyed because VPC changes cascade to every resource with a subnet or security group dependency.

Deploy: just infra-up-foundation (after persistent)
Entry point: infra/foundation/__main__.py

Resources

Module	Resources	Key Exports
`vpc.py`	VPC (10.0.0.0/16), 2 AZs, 6 subnets (2 public, 2 private, 2 database), IGW, route tables	`vpc_id`, `public_subnet_ids`, `private_subnet_ids`, `database_subnet_ids`
`security_groups.py`	6 security groups: ALB, ECS, Lambda, RDS, NAT, VPC endpoints; optional consolidated SG	Per-SG IDs
`nacl.py`	Network ACLs for public, private, and database subnets	--
`nat_instance.py`	t4g.nano NAT instance (replaces NAT Gateway for cost, ~$3/month)	--
`vpc_endpoints.py`	Gateway endpoints (S3, DynamoDB) + interface endpoints (ECR, CloudWatch, STS)	--
`vpc_flow_logs.py`	VPC traffic logging to CloudWatch	--
`rds.py`	PostgreSQL 15.13 instance with subnet group, parameter group	`rds_endpoint`, `rds_secret_arn`, `rds_database_name`, `rds_instance_identifier`
`sqs.py`	Backtest request queue + dead letter queue	`backtest_queue_url`, `backtest_queue_arn`, `backtest_dlq_arn`
`sns.py`	Alerts topic (with email subscriptions) + registration topic	`sns_alerts_topic_arn`, `sns_registration_topic_arn`
`secret_rotation.py`	RDS credential rotation via Lambda + SNS notification	--

Design Decisions

RDS in foundation, not persistent: RDS requires VPC subnet groups. Placing it with networking avoids circular cross-stack dependencies.
SQS/SNS in foundation: Messaging resources are VPC-adjacent and consumed by both compute and edge stacks.
NAT instance over NAT Gateway: t4g.nano costs ~$3/month vs ~$32/month for NAT Gateway. Acceptable for dev/staging traffic volumes.

6. Stack: Compute

Purpose: All compute workloads -- the most frequently updated stack. Contains IAM roles, ECS cluster and services, Lambda functions, Step Functions workflows, ALB, and the optional consolidated EC2 instance.

Deploy: just infra-up-compute (after foundation + Lambda images pushed to ECR)
Entry point: infra/compute/__main__.py

Stack References

Compute reads from both upstream stacks:

- persistent: ECR repo URLs, S3 bucket IDs, DynamoDB table names
- foundation: VPC ID, subnet IDs, security group IDs, RDS endpoint, SQS/SNS ARNs

Resources

Module	Resources	Key Exports
`iam.py`	ECS execution role, ECS task role, CLI/CI role, consolidated instance role + profile	`ecs_execution_role_arn`, `ecs_task_role_arn`, `cli_ci_role_arn`
`alb.py`	ALB, HTTP/HTTPS listeners, target groups per service (skippable via config)	`alb_arn`, `alb_dns_name`, listener ARNs
`ecs.py`	ECS cluster (Fargate + Fargate Spot capacity), strategy task definition, log group	`ecs_cluster_arn`, `strategy_task_definition_arn`
`ecs_services.py`	5 ECS services with Service Discovery namespace, per-service task definitions (plus per-strategy `tradai-strategy-{slug}` trading services)	`ecs_service_arns`, `service_discovery_namespace_id`
`lambda_funcs.py`	18 Lambda functions, EventBridge schedules, SQS event source mapping	`lambda_function_arns`
`step_functions.py`	Backtest + retraining state machines (ASL from Jinja2 templates)	`backtest_workflow_arn`, `retraining_workflow_name`
`ec2_consolidated.py`	Single EC2 (t3.small) running Docker Compose for dev/staging	`consolidated_asg_name`
`ec2_userdata.py`	Generates cloud-init userdata scripts for consolidated instance	--

Consolidated EC2 Mode

In dev/staging, is_consolidated_mode() returns True. Instead of running 4 separate Fargate tasks (~$37/month), a single t3.small EC2 instance (~$15/month) runs backend-api, data-collection, mlflow, and strategy-service via Docker Compose. The instance:

- Registers with ALB target groups for each service
- Registers with Service Discovery for inter-service communication
- Pulls images from ECR using an instance profile role
- Receives per-service environment variables (S3 buckets, RDS endpoint, SQS URL, etc.)

In prod, is_consolidated_mode() returns False and services run as individual Fargate tasks for reliability and independent scaling.

Lambda Deployment Model

Lambda functions use container images (not ZIP). Each Lambda has a dedicated ECR repo in the persistent stack. The deployment flow is:

just lambda-bootstrap builds and pushes all 18 Lambda images to ECR
just infra-up-compute creates Lambda functions referencing those image URIs
Some Lambdas are "deferred" -- created after Step Functions so they can receive the workflow ARN as an environment variable

Step Functions Workflows

Two state machines are defined using Jinja2 ASL templates in infra/compute/asl_templates/:

- backtest_workflow.json.j2 -- Orchestrates: validate strategy, run ECS task, consume results, update status
- retraining_workflow.json.j2 -- Orchestrates: check if retraining needed, run training, compare models, promote

Templates are rendered at deploy time with Lambda ARNs, ECS cluster ARN, subnet IDs, and security group IDs injected as template variables.

7. Stack: Edge

Purpose: External-facing routing and operational monitoring. Can be redeployed independently of compute resources.

Deploy: just infra-up-edge (after compute)
Entry point: infra/edge/__main__.py

Stack References

Edge reads from all three upstream stacks:

- persistent: Cognito user pool ID, client IDs, endpoint
- foundation: VPC ID, subnet IDs, ECS SG, SQS queue URL/ARN, SNS topic ARN, RDS identifier
- compute: ALB DNS name, listener ARNs, Step Functions workflow names

Resources

Module	Resources	Key Exports
`api_gateway.py`	HTTP API, VPC Link to ALB, JWT authorizer (Cognito), SQS integration for backtest POST, route table (21 routes), custom domain (optional)	`api_gateway_endpoint`, `api_gateway_id`
`waf.py`	WebACL with rate limiting, geo-blocking, AWS managed rule groups; CloudWatch log group	`waf_web_acl_id`, `waf_web_acl_arn`
`cloudwatch_alarms.py`	Composite alarm, RDS CPU/storage alarms, Lambda error alarms, Step Functions failure alarms, stale heartbeat alarm	`composite_alarm_arn`, `lambda_error_alarm_arns`
`cloudwatch_dashboard.py`	Operational dashboard with ECS, Lambda, RDS, and API Gateway widgets	`dashboard_url`

Design Decisions

API Gateway uses VPC Link: Routes traffic to ALB in private subnets; no public ALB needed.
Backtest POST routes to SQS: API Gateway has native SQS integration -- the request goes directly to the backtest queue without hitting any compute resource.
WAF not associated with HTTP API: AWS WAF (WAFv2) does not support API Gateway HTTP APIs as an association target, and it cannot parse the $default stage ARN of HTTP APIs. The WebACL is created, but it only takes effect once it is associated with a REST API or an ALB.
JWT auth via Cognito: API Gateway validates JWTs against the Cognito user pool. Health check endpoint is unauthenticated.

8. Core Patterns

The shared library at infra/shared/tradai_infra_shared/core/ provides reusable builder patterns that eliminate duplication across stack modules.

PolicyBuilder

File: infra/shared/tradai_infra_shared/core/policy_builder.py

Fluent builder for IAM policies. Provides preset methods for common permission sets (DynamoDB CRUD, S3 read/write, Secrets Manager, CloudWatch, SNS, CodeArtifact) and composes them into a single policy document. Also provides AssumeRolePolicyBuilder with factory methods like for_ecs_tasks() and for_lambda().

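# Example: building a task role policy
# Import path assumed from the file location above
from tradai_infra_shared.core.policy_builder import PolicyBuilder

# Each with_*() call adds one preset permission set; build() composes the final document
policy = (PolicyBuilder("tradai")
    .with_dynamodb_crud()
    .with_s3_readwrite()
    .with_secrets_read()
    .with_cloudwatch_metrics()
    .build())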
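
To illustrate how a fluent builder of this kind can compose preset permission sets into one policy document, here is a standalone sketch. It is not the real tradai_infra_shared class: the internal `_allow` helper, the chosen IAM actions, the resource ARN patterns, and the JSON string return type are all assumptions.

```python
import json


class PolicyBuilder:
    """Hypothetical fluent IAM policy builder; not the real tradai_infra_shared class."""

    def __init__(self, project: str) -> None:
        self._project = project
        self._statements: list[dict] = []

    def _allow(self, actions: list[str], resources: list[str]) -> "PolicyBuilder":
        self._statements.append({"Effect": "Allow", "Action": actions, "Resource": resources})
        return self  # returning self is what makes the calls chainable

    def with_dynamodb_crud(self) -> "PolicyBuilder":
        return self._allow(
            ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:UpdateItem",
             "dynamodb:DeleteItem", "dynamodb:Query"],
            [f"arn:aws:dynamodb:*:*:table/{self._project}-*"],
        )

    def with_s3_readwrite(self) -> "PolicyBuilder":
        return self._allow(
            ["s3:GetObject", "s3:PutObject"],
            [f"arn:aws:s3:::{self._project}-*/*"],
        )

    def build(self) -> str:
        """Serialize the accumulated statements into one policy document."""
        return json.dumps({"Version": "2012-10-17", "Statement": self._statements})


policy = PolicyBuilder("tradai").with_dynamodb_crud().with_s3_readwrite().build()
print(policy)
```

Each preset appends one statement, so the order of the chained calls is the order of the statements in the resulting document.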

SecurityRuleBuilder

File: infra/shared/tradai_infra_shared/core/security_rule_builder.py

Fluent builder for security group rules. Provides CommonPorts with predefined port ranges for all services and PortRange for custom ranges. Methods like ingress_from_sg() and egress_https_to_internet() create rules with consistent naming and tagging.
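
The pattern described above might look like the following sketch. It models rules as plain dictionaries instead of Pulumi resources. The class bodies, the two `CommonPorts` presets, the `label` parameter, the rule-naming scheme, and the placeholder security group ID are assumptions, not the library's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortRange:
    """Inclusive TCP port range."""
    from_port: int
    to_port: int


class CommonPorts:
    """Assumed presets; the real CommonPorts values are not shown in this document."""
    HTTPS = PortRange(443, 443)
    POSTGRES = PortRange(5432, 5432)


class SecurityRuleBuilder:
    """Hypothetical builder that collects rule dicts for one security group."""

    def __init__(self, group_name: str) -> None:
        self._group_name = group_name
        self._rules: list[dict] = []

    def ingress_from_sg(self, source_sg_id: str, ports: PortRange, label: str) -> "SecurityRuleBuilder":
        self._rules.append({
            "name": f"{self._group_name}-ingress-{label}",  # consistent naming
            "type": "ingress", "protocol": "tcp",
            "from_port": ports.from_port, "to_port": ports.to_port,
            "source_security_group_id": source_sg_id,
        })
        return self

    def egress_https_to_internet(self) -> "SecurityRuleBuilder":
        self._rules.append({
            "name": f"{self._group_name}-egress-https",
            "type": "egress", "protocol": "tcp",
            "from_port": CommonPorts.HTTPS.from_port, "to_port": CommonPorts.HTTPS.to_port,
            "cidr_blocks": ["0.0.0.0/0"],
        })
        return self

    def build(self) -> list[dict]:
        return list(self._rules)


rules = (SecurityRuleBuilder("tradai-rds-dev")
    .ingress_from_sg("sg-ecs-placeholder", CommonPorts.POSTGRES, "ecs")
    .egress_https_to_internet()
    .build())
print(rules)
```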

AslTemplateLoader

File: infra/shared/tradai_infra_shared/core/asl_loader.py

Jinja2-based loader for Amazon States Language (ASL) workflow definitions. Renders .json.j2 templates from infra/compute/asl_templates/ with Lambda ARNs, ECS config, and subnet IDs injected as variables. Validates rendered output is valid JSON.
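
The render-then-validate flow can be sketched without dependencies. Note the substitution here is a small regex-based stand-in for Jinja2, not Jinja2 itself; it only handles `{{ name }}` placeholders. The function name `render_asl`, the state name, and the Lambda ARN are hypothetical.

```python
import json
import re

# Matches {{ name }} placeholders (a stand-in for Jinja2 variable syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_asl(template: str, variables: dict[str, str]) -> dict:
    """Substitute {{ name }} placeholders, then parse the result as JSON."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Fail loudly on a missing variable, similar in spirit to StrictUndefined.
            raise KeyError(f"undefined template variable: {name}")
        return variables[name]

    rendered = _PLACEHOLDER.sub(substitute, template)
    return json.loads(rendered)  # raises json.JSONDecodeError if the output is not valid JSON


TEMPLATE = """
{
  "StartAt": "ValidateStrategy",
  "States": {
    "ValidateStrategy": {
      "Type": "Task",
      "Resource": "{{ validate_lambda_arn }}",
      "End": true
    }
  }
}
"""

definition = render_asl(
    TEMPLATE,
    {"validate_lambda_arn": "arn:aws:lambda:eu-central-1:123456789012:function:validate"},
)
print(definition["States"]["ValidateStrategy"]["Resource"])
```

Parsing the rendered text before handing it to Step Functions surfaces template mistakes at deploy time on the developer's machine instead of as a failed state machine update.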

EnvironmentBuilder

File: infra/shared/tradai_infra_shared/core/environment_builder.py

Fluent builder for container environment variables that handles both static strings and Pulumi Output values. Methods include add() for static values, add_output() for Pulumi Outputs, and add_if() for conditional variables. Resolves all Outputs at build time using Output.all().apply().
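
A simplified sketch of the builder, covering static values only. Pulumi Output handling (add_output() and the Output.all().apply() resolution) is deliberately left out so the example runs without Pulumi. The method signatures and the name/value list format returned by build() are assumptions.

```python
class EnvironmentBuilder:
    """Hypothetical sketch covering static values only (no Pulumi Outputs)."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}

    def add(self, name: str, value: str) -> "EnvironmentBuilder":
        self._values[name] = value
        return self

    def add_if(self, condition: bool, name: str, value: str) -> "EnvironmentBuilder":
        # Only record the variable when the condition holds.
        if condition:
            self._values[name] = value
        return self

    def build(self) -> list[dict[str, str]]:
        """Return name/value pairs in insertion order."""
        return [{"name": k, "value": v} for k, v in self._values.items()]


env = (EnvironmentBuilder()
    .add("ENVIRONMENT", "dev")
    .add_if(False, "DEBUG", "1")  # skipped because the condition is False
    .add("AWS_REGION", "eu-central-1")
    .build())
print(env)
```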

Additional Utilities

Utility	File	Purpose
`get_latest_al2023_ami()`	`core/ami_lookup.py`	Looks up latest Amazon Linux 2023 AMI by architecture (arm64 or x86_64)
`make_shell_template_env()`	`core/shell_template_env.py`	Creates Jinja2 Environment for `.sh.j2` userdata templates with StrictUndefined
`PulumiMocks`	`testing.py`	Mock Pulumi runtime for unit testing stack modules

9. Deployment

Prerequisites

AWS profile is configured via Pulumi config (pulumi config set aws:profile tradai), not as an environment variable.
Configure infra/.env with the required variables (see infra/.env.example):
PULUMI_CONFIG_PASSPHRASE=<your passphrase>
S3_PULUMI_BACKEND_URL=s3://<pulumi-state-bucket>
The justfile _infra-run helper sources infra/.env automatically before running Pulumi commands, then logs into the S3 backend and selects the target stack.

Full Bootstrap (First Deployment)

just infra-bootstrap dev

This runs the full sequence:

1. just infra-up-persistent dev -- S3, DynamoDB, ECR, Cognito, CodeArtifact
2. just infra-up-foundation dev -- VPC, RDS, SQS, SNS
3. just lambda-bootstrap -- Build and push all 18 Lambda images to ECR
4. just service-push-all -- Build and push all service images to ECR
5. just infra-up-compute dev -- IAM, ALB, ECS, Lambda, Step Functions, EC2
6. just infra-up-edge dev -- API Gateway, WAF, CloudWatch

Individual Stack Commands

Command	Purpose
`just infra-up-persistent [stack]`	Deploy persistent stack
`just infra-up-foundation [stack]`	Deploy foundation stack
`just infra-up-compute [stack]`	Deploy compute stack (pre-flight checks Lambda images in ECR)
`just infra-up-edge [stack]`	Deploy edge stack
`just infra-preview [stack]`	Preview all 4 stacks
`just infra-preview-{layer} [stack]`	Preview individual stack
`just infra-down-soft [stack]`	Destroy edge + compute only (data and networking preserved)
`just infra-down-all [stack]`	Destroy edge + compute + foundation (persistent preserved)
`just infra-down-persistent [stack]`	DANGER: destroy all data (requires confirmation prompt)

Default stack is dev for all commands.
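
The deploy and teardown commands follow the stack dependency order. As a sketch, that order can be derived from the dependency graph with Python's standard-library graphlib; the `STACK_DEPENDENCIES` mapping is an assumption reconstructed from the stack references described in this document, not a structure taken from the justfile.

```python
from graphlib import TopologicalSorter

# Each stack maps to the upstream stacks it depends on.
STACK_DEPENDENCIES = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"persistent", "foundation"},
    "edge": {"persistent", "foundation", "compute"},
}

# Upstream stacks come first when deploying; teardown runs in reverse.
deploy_order = list(TopologicalSorter(STACK_DEPENDENCIES).static_order())
teardown_order = list(reversed(deploy_order))

print(deploy_order)
print(teardown_order)
```

In this model, just infra-down-soft corresponds to the first two entries of the teardown order (edge, compute) and just infra-down-all to the first three.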

Compute Pre-Flight Check

just infra-up-compute verifies that all 18 Lambda images exist in ECR before deploying. If any image is missing, it fails with instructions to run just lambda-bootstrap first. This prevents partial deployments where Lambda functions reference non-existent images.
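
The logic of such a check can be sketched as a pure function. How the set of repositories with pushed images is obtained (for example by querying ECR) is outside this sketch, and the function names and repository names are hypothetical; the actual check lives in the justfile.

```python
def find_missing_images(expected_repos: list[str], repos_with_images: set[str]) -> list[str]:
    """Return expected repositories that have no pushed image, in input order."""
    return [repo for repo in expected_repos if repo not in repos_with_images]


def preflight_check(expected_repos: list[str], repos_with_images: set[str]) -> None:
    """Raise before deployment if any Lambda image is missing."""
    missing = find_missing_images(expected_repos, repos_with_images)
    if missing:
        raise RuntimeError(
            f"Missing Lambda images: {', '.join(missing)}. Run `just lambda-bootstrap` first."
        )


# Hypothetical repository names, for illustration only.
expected = ["tradai/lambda-a", "tradai/lambda-b", "tradai/lambda-c"]
present = {"tradai/lambda-a", "tradai/lambda-c"}
print(find_missing_images(expected, present))
```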

10. Cross-Stack References

How It Works

Each stack exports named outputs via pulumi.export(). Downstream stacks read these using pulumi.StackReference(f"{org}/tradai-{stack_name}/{environment}").

Key Exports Per Stack

Stack	Key Exports	Consumed By
persistent	`ecr_repository_urls`, `s3_*_bucket_id` (x5), DynamoDB table names/ARNs (x24), `cognito_user_pool_id`, `cognito_user_pool_client_id`, `cognito_m2m_client_id`, `cognito_user_pool_endpoint`	compute, edge
foundation	`vpc_id`, `public_subnet_ids`, `private_subnet_ids`, security group IDs (x6), `rds_endpoint`, `rds_secret_arn`, `rds_database_name`, `rds_instance_identifier`, `backtest_queue_url`, `backtest_queue_arn`, `backtest_dlq_arn`, `sns_alerts_topic_arn`	compute, edge
compute	`alb_dns_name`, `alb_https_listener_arn`, `alb_http_listener_arn`, `ecs_cluster_arn`, `backtest_workflow_arn`, `backtest_workflow_name`, `retraining_workflow_name`, `lambda_function_arns`	edge
edge	`api_gateway_endpoint`, `api_gateway_id`, `waf_web_acl_id`, `composite_alarm_arn`, `dashboard_url`	(terminal -- consumed by users/CI)

Reference Pattern

The org value defaults to "organization" and can be overridden via pulumi config set org <value>. Stack names match environments (dev, staging, prod), so a reference looks like: organization/tradai-persistent/dev.
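
As a small illustration of the naming rule, the fully qualified reference can be built by a helper like the one below. The function name `stack_reference_name` is hypothetical; only the resulting string format comes from this document.

```python
def stack_reference_name(layer: str, environment: str, org: str = "organization") -> str:
    """Build the fully qualified name {org}/tradai-{layer}/{environment}."""
    return f"{org}/tradai-{layer}/{environment}"


print(stack_reference_name("persistent", "dev"))
print(stack_reference_name("foundation", "prod", org="acme"))
```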

Changelog

Version	Date	Changes
9.2.1	2026-03-28	Full regeneration. Corrected to 4-stack architecture, eu-central-1, tables over code

Dependencies

If This Changes	Update This Doc
New Pulumi stack or module added	Stack table, module listings
`infra/shared/tradai_infra_shared/config.py`	Config reference section
Stack deployment order changes	Dependency flow diagram