TradAI Infrastructure as Code (Pulumi)

Version: 9.2.1 | Date: 2026-03-28 | Status: CURRENT

TL;DR: 4-stack Pulumi architecture in eu-central-1. Stacks are layered by lifecycle: persistent (data-bearing, never destroyed) -> foundation (VPC, RDS) -> compute (ECS, Lambda, Step Functions) -> edge (API Gateway, WAF, CloudWatch). All config centralized in infra/shared/tradai_infra_shared/config.py.


1. Architecture Overview

TradAI uses a 4-stack Pulumi architecture deployed in eu-central-1. Stacks are layered by lifecycle and dependency: persistent data is separated from ephemeral compute, which is separated from edge routing and monitoring.

Stack Dependency Flow

graph LR
    A["persistent<br/><small>S3, DynamoDB, ECR<br/>Cognito, CodeArtifact</small>"] --> B["foundation<br/><small>VPC, Subnets, RDS<br/>SQS, SNS, NAT</small>"]
    B --> C["compute<br/><small>IAM, ALB, ECS<br/>Lambda, Step Functions</small>"]
    C --> D["edge<br/><small>API Gateway, WAF<br/>CloudWatch, Alarms</small>"]

    style A fill:#2d6a4f,color:#fff
    style B fill:#40916c,color:#fff
    style C fill:#52b788,color:#000
    style D fill:#95d5b2,color:#000

Why 4 Stacks?

| Concern | Stack | Rationale |
| --- | --- | --- |
| Data preservation | persistent | S3, DynamoDB, ECR, Cognito are never destroyed during dev cycles |
| Network stability | foundation | VPC, subnets, RDS rarely change; destroying them is disruptive |
| Rapid iteration | compute | ECS services, Lambda, Step Functions change frequently |
| Edge independence | edge | API Gateway, WAF, CloudWatch can be updated without touching compute |

Cross-Stack Communication

StackReference Pattern

Stacks share outputs via Pulumi StackReference. Each downstream stack declares references to its upstream stacks and reads exported values. The org value defaults to "organization" and stack names match environments (dev, staging, prod), so a reference looks like organization/tradai-persistent/dev. Never hardcode resource IDs across stacks -- always use StackReference.get_output().

For example, the compute stack declares its upstream references and reads their exported values:

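# Example from compute/__main__.py
import pulumi

# Assumed lookup: org defaults to "organization" and can be overridden
# with `pulumi config set org <value>`.
org = pulumi.Config().get("org") or "organization"

persistent = pulumi.StackReference(f"{org}/tradai-persistent/{pulumi.get_stack()}")
foundation = pulumi.StackReference(f"{org}/tradai-foundation/{pulumi.get_stack()}")

vpc_id = foundation.get_output("vpc_id")
ecr_repository_urls = persistent.get_output("ecr_repository_urls")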

Shared Library

All four stacks depend on infra/shared/tradai_infra_shared/, a local Python package providing:

  • config.py -- Single source of truth for all resource names, sizes, and feature flags
  • core/ -- Reusable builder patterns (PolicyBuilder, SecurityRuleBuilder, etc.)
  • testing.py -- PulumiMocks and test fixtures shared across stack test suites


2. Project Structure

infra/
├── persistent/                    # Stack 1: data-bearing resources
│   ├── __main__.py
│   └── modules/
│       ├── s3.py                  # 5 S3 buckets
│       ├── dynamodb.py            # 12 DynamoDB tables
│       ├── ecr.py                 # 24 ECR repositories (6 service + 18 Lambda)
│       ├── cognito.py             # User pool, app clients, M2M client
│       ├── codeartifact.py        # Python package registry
│       ├── cloudtrail.py          # Audit logging
│       └── pulumi_backend.py      # Self-hosted Pulumi state in S3
├── foundation/                    # Stack 2: networking + databases
│   ├── __main__.py
│   └── modules/
│       ├── vpc.py                 # VPC, subnets (public/private/database), IGW, routes
│       ├── security_groups.py     # ALB, ECS, Lambda, RDS, NAT, endpoint SGs
│       ├── nacl.py                # Network ACLs per subnet tier
│       ├── nat_instance.py        # t4g.nano NAT instance (~$3/month)
│       ├── vpc_endpoints.py       # S3, DynamoDB, ECR, CloudWatch, STS endpoints
│       ├── vpc_flow_logs.py       # VPC traffic logging
│       ├── rds.py                 # PostgreSQL 15 (MLflow metadata store)
│       ├── sqs.py                 # Backtest queue + DLQ
│       ├── sns.py                 # Alerts + registration topics
│       └── secret_rotation.py     # RDS credential rotation
├── compute/                       # Stack 3: compute resources
│   ├── __main__.py
│   ├── asl_templates/             # Step Functions workflow templates
│   │   ├── backtest_workflow.json.j2
│   │   └── retraining_workflow.json.j2
│   └── modules/
│       ├── iam.py                 # Execution/task/CLI-CI/consolidated roles
│       ├── alb.py                 # Application Load Balancer + target groups
│       ├── ecs.py                 # ECS cluster + strategy task definition
│       ├── ecs_services.py        # 7 ECS services + Service Discovery
│       ├── lambda_funcs.py        # 18 Lambda functions + EventBridge schedules
│       ├── step_functions.py      # Backtest + retraining workflows
│       ├── ec2_consolidated.py    # Single EC2 instance for dev/staging
│       └── ec2_userdata.py        # Userdata script generation
├── edge/                          # Stack 4: routing + monitoring
│   ├── __main__.py
│   └── modules/
│       ├── api_gateway.py         # HTTP API, VPC Link, JWT auth, SQS integration
│       ├── waf.py                 # Rate limiting, geo-blocking, managed rule groups
│       ├── cloudwatch_alarms.py   # Composite alarm, RDS/Lambda/StepFunctions alarms
│       └── cloudwatch_dashboard.py # Operational dashboard
├── shared/                        # Shared library (all stacks depend on this)
│   └── tradai_infra_shared/
│       ├── config.py              # All configuration values
│       ├── testing.py             # PulumiMocks for unit tests
│       └── core/
│           ├── policy_builder.py
│           ├── security_rule_builder.py
│           ├── asl_loader.py
│           ├── environment_builder.py
│           ├── ami_lookup.py
│           └── shell_template_env.py
├── ami/                           # Packer AMI for consolidated EC2
│   ├── tradai-consolidated.pkr.hcl
│   └── files/
├── .env                           # AWS_PROFILE, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
├── README.md
├── TEARDOWN.md
└── pulumi-ci.sh

3. Shared Configuration

All resource names, sizes, feature flags, and routing rules are centralized in a single file: infra/shared/tradai_infra_shared/config.py

This is the canonical source of truth. No stack module hardcodes resource names -- everything is imported from config.

Configuration Groups

| Group | Type | Contents |
| --- | --- | --- |
| SERVICES | dict[str, dict] | 7 ECS services: cpu, memory, port, desired_count, health_check_path, spot |
| S3_BUCKETS | dict[str, str] | 5 buckets: configs, results, arcticdb, logs, mlflow |
| S3_BUCKET_CONFIG | dict[str, dict] | Per-bucket versioning and lifecycle rules |
| DYNAMODB_TABLES | dict[str, str] | 12 tables: workflow_state, idempotency, health_state, trading_state, deployments, drift_state, retraining_state, rollback_state, shadow_test_state, notifications, infra_drift_state, config_versions |
| RDS_CONFIG | dict | PostgreSQL 15.13, db.t4g.micro (dev) / db.t4g.small (prod), 20-50 GB |
| ECR_REPOS | list[str] | 6 service repos: backend, data-collection, strategy-service, mlflow, live-trading, dry-run-trading |
| LAMBDA_ECR_REPOS | list[str] | 18 Lambda image repos |
| STRATEGY_ECR_REPOS | list[str] | Dynamic list from Pulumi config (CI/CD-created strategy repos) |
| COGNITO_CONFIG | CognitoConfigDict | Password min 12, MFA required, all character classes |
| API_ROUTES | list[dict] | 21 HTTP routes: method, path, target service, auth required |
| API_THROTTLING | dict | Default 100 req/s, backtest POST 10 req/s |
| ALB_PATH_PATTERNS | dict[str, list[str]] | Path-based routing per service |
| LAMBDA_SCHEDULES | dict[str, str] | 6 EventBridge schedules (overridable via Pulumi config) |
| CONSOLIDATED_MODE | dict[str, bool] | dev/staging: True (EC2), prod: False (Fargate) |
| EC2_CONSOLIDATED_CONFIG | EC2ConsolidatedConfigDict | t3.small, 30 GB, 4 services |
| CORS_ALLOWED_ORIGINS | list[str] | ["*"] in dev, [] otherwise |

Naming Conventions

All resource names follow the pattern tradai-{component}-{ENVIRONMENT}. The config module provides helper functions to generate these consistently:

| Function | Purpose | Example Return |
| --- | --- | --- |
| get_resource_name(component) | Standard resource name | tradai-workflow-state-dev |
| get_ecr_repo_name(repo_key) | ECR repo with slash | tradai/backend |
| get_tags(service=None) | Standard resource tags | {"Application": "tradai", "Environment": "dev", ...} |
| get_sd_namespace() | Service Discovery namespace | tradai-dev.local |
| get_mlflow_tracking_uri() | MLflow URI via SD | http://mlflow.tradai-dev.local:5000/mlflow |
| get_env_short() | Short env code | d, s, or p |
| is_consolidated_mode() | EC2 vs Fargate check | True for dev/staging |
| is_skip_alb() | Skip ALB creation | False by default |
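
The naming helpers could be implemented along the following lines. This is a minimal sketch inferred from the example return values in the table, not the actual contents of config.py: the function bodies are assumptions, and ENVIRONMENT is hardcoded here whereas the real module derives it from pulumi.get_stack().

```python
# Hypothetical sketch of the naming helpers. Bodies are inferred from the
# documented example return values, not copied from config.py.
ENVIRONMENT = "dev"  # assumption: the real module reads pulumi.get_stack()


def get_resource_name(component: str) -> str:
    """Standard resource name: tradai-{component}-{ENVIRONMENT}."""
    return f"tradai-{component}-{ENVIRONMENT}"


def get_ecr_repo_name(repo_key: str) -> str:
    """ECR repositories use a slash-separated prefix instead of a dash."""
    return f"tradai/{repo_key}"


def get_sd_namespace() -> str:
    """Private DNS namespace used by Service Discovery."""
    return f"tradai-{ENVIRONMENT}.local"


def get_mlflow_tracking_uri() -> str:
    """MLflow URI resolved through the Service Discovery namespace."""
    return f"http://mlflow.{get_sd_namespace()}:5000/mlflow"


def get_env_short() -> str:
    """First letter of the environment name: d, s, or p."""
    return ENVIRONMENT[0]


table_name = get_resource_name("workflow-state")
tracking_uri = get_mlflow_tracking_uri()
```

Keeping every name behind a function like this means a change to the naming pattern touches one file rather than every stack module.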

Environment-Specific Behavior

The ENVIRONMENT variable comes from pulumi.get_stack() (the stack name is the environment). Several config values change by environment:

| Config | dev / staging | prod |
| --- | --- | --- |
| Service desired_count | 1 | 2 (backend-api) |
| RDS instance class | db.t4g.micro | db.t4g.small |
| RDS storage | 20 GB | 50 GB |
| RDS multi-AZ | No | Yes |
| RDS backup retention | 7 days | 14 days |
| Log retention | 30 days | 90 days |
| Compute mode | Consolidated EC2 | ECS Fargate |
| Live-trading spot | No (never) | No (never) |
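
As an illustration of how these per-environment values might be selected, here is a small sketch. The values are taken from the table; the function names, the dictionary keys, and the explicit environment parameter are assumptions (the documented is_consolidated_mode() takes no arguments and reads the environment from the stack name).

```python
# Hypothetical sketch: values come from the table above, but the functions
# and dictionary layout are assumptions, not the real config.py.
CONSOLIDATED_MODE = {"dev": True, "staging": True, "prod": False}


def rds_settings(environment: str) -> dict:
    """Return RDS sizing for an environment; prod gets the larger profile."""
    is_prod = environment == "prod"
    return {
        "instance_class": "db.t4g.small" if is_prod else "db.t4g.micro",
        "allocated_storage_gb": 50 if is_prod else 20,
        "multi_az": is_prod,
        "backup_retention_days": 14 if is_prod else 7,
    }


def log_retention_days(environment: str) -> int:
    """CloudWatch log retention: 90 days in prod, 30 days elsewhere."""
    return 90 if environment == "prod" else 30


def is_consolidated_mode(environment: str) -> bool:
    """True when services share one EC2 instance instead of Fargate tasks."""
    return CONSOLIDATED_MODE[environment]


dev_rds = rds_settings("dev")
prod_rds = rds_settings("prod")
```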

4. Stack: Persistent

Purpose: Data-bearing resources that survive dev-cycle teardowns. Destroying this stack means losing all market data, backtest results, container images, and user accounts.

Deploy: just infra-up-persistent

Entry point: infra/persistent/__main__.py

Resources

| Module | Resources | Key Exports |
| --- | --- | --- |
| s3.py | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | s3_bucket_ids, s3_bucket_arns, per-bucket IDs |
| dynamodb.py | 12 DynamoDB tables with PAY_PER_REQUEST billing | Per-table name and ARN (24 exports) |
| ecr.py | 24 ECR repos (6 service + 18 Lambda) + strategy repos | ecr_repository_urls, ecr_repository_arns |
| cognito.py | User pool, web client, M2M client, Cognito domain | cognito_user_pool_id, cognito_user_pool_client_id, cognito_m2m_client_id, cognito_user_pool_endpoint |
| codeartifact.py | Domain + repository for Python packages | codeartifact_domain_name, codeartifact_repository_endpoint |
| cloudtrail.py | Management event trail writing to logs bucket | -- |
| pulumi_backend.py | S3 bucket for Pulumi state files | pulumi_state_bucket_name |

Design Decisions

  • Separate from foundation: VPC teardown should never risk deleting S3 or DynamoDB data.
  • ECR here, not compute: Images must exist before compute stack can create task definitions. just lambda-bootstrap pushes images to ECR between persistent and compute deployments.
  • Cognito here, not edge: User pool is data-bearing (user accounts persist).

5. Stack: Foundation

Purpose: Networking infrastructure and VPC-bound databases. Rarely destroyed because VPC changes cascade to every resource with a subnet or security group dependency.

Deploy: just infra-up-foundation (after persistent)

Entry point: infra/foundation/__main__.py

Resources

| Module | Resources | Key Exports |
| --- | --- | --- |
| vpc.py | VPC (10.0.0.0/16), 2 AZs, 6 subnets (2 public, 2 private, 2 database), IGW, route tables | vpc_id, public_subnet_ids, private_subnet_ids, database_subnet_ids |
| security_groups.py | 6 security groups: ALB, ECS, Lambda, RDS, NAT, VPC endpoints; optional consolidated SG | Per-SG IDs |
| nacl.py | Network ACLs for public, private, and database subnets | -- |
| nat_instance.py | t4g.nano NAT instance (replaces NAT Gateway for cost, ~$3/month) | -- |
| vpc_endpoints.py | Gateway endpoints (S3, DynamoDB) + interface endpoints (ECR, CloudWatch, STS) | -- |
| vpc_flow_logs.py | VPC traffic logging to CloudWatch | -- |
| rds.py | PostgreSQL 15.13 instance with subnet group, parameter group | rds_endpoint, rds_secret_arn, rds_database_name, rds_instance_identifier |
| sqs.py | Backtest request queue + dead letter queue | backtest_queue_url, backtest_queue_arn, backtest_dlq_arn |
| sns.py | Alerts topic (with email subscriptions) + registration topic | sns_alerts_topic_arn, sns_registration_topic_arn |
| secret_rotation.py | RDS credential rotation via Lambda + SNS notification | -- |

Design Decisions

  • RDS in foundation, not persistent: RDS requires VPC subnet groups. Placing it with networking avoids circular cross-stack dependencies.
  • SQS/SNS in foundation: Messaging resources are VPC-adjacent and consumed by both compute and edge stacks.
  • NAT instance over NAT Gateway: t4g.nano costs ~$3/month vs ~$32/month for NAT Gateway. Acceptable for dev/staging traffic volumes.

6. Stack: Compute

Purpose: All compute workloads -- the most frequently updated stack. Contains IAM roles, ECS cluster and services, Lambda functions, Step Functions workflows, ALB, and the optional consolidated EC2 instance.

Deploy: just infra-up-compute (after foundation + Lambda images pushed to ECR)

Entry point: infra/compute/__main__.py

Stack References

Compute reads from both upstream stacks:

  • persistent: ECR repo URLs, S3 bucket IDs, DynamoDB table names
  • foundation: VPC ID, subnet IDs, security group IDs, RDS endpoint, SQS/SNS ARNs

Resources

| Module | Resources | Key Exports |
| --- | --- | --- |
| iam.py | ECS execution role, ECS task role, CLI/CI role, consolidated instance role + profile | ecs_execution_role_arn, ecs_task_role_arn, cli_ci_role_arn |
| alb.py | ALB, HTTP/HTTPS listeners, target groups per service (skippable via config) | alb_arn, alb_dns_name, listener ARNs |
| ecs.py | ECS cluster (Fargate + Fargate Spot capacity), strategy task definition, log group | ecs_cluster_arn, strategy_task_definition_arn |
| ecs_services.py | 7 ECS services with Service Discovery namespace, per-service task definitions | ecs_service_arns, service_discovery_namespace_id |
| lambda_funcs.py | 18 Lambda functions, EventBridge schedules, SQS event source mapping | lambda_function_arns |
| step_functions.py | Backtest + retraining state machines (ASL from Jinja2 templates) | backtest_workflow_arn, retraining_workflow_name |
| ec2_consolidated.py | Single EC2 (t3.small) running Docker Compose for dev/staging | consolidated_asg_name |
| ec2_userdata.py | Generates cloud-init userdata scripts for consolidated instance | -- |

Consolidated EC2 Mode

In dev/staging, is_consolidated_mode() returns True. Instead of running 4 separate Fargate tasks (~$37/month), a single t3.small EC2 instance (~$15/month) runs backend-api, data-collection, mlflow, and strategy-service via Docker Compose. The instance:

  • Registers with ALB target groups for each service
  • Registers with Service Discovery for inter-service communication
  • Pulls images from ECR using an instance profile role
  • Receives per-service environment variables (S3 buckets, RDS endpoint, SQS URL, etc.)

In prod, is_consolidated_mode() returns False and services run as individual Fargate tasks for reliability and independent scaling.

Lambda Deployment Model

Lambda functions use container images (not ZIP). Each Lambda has a dedicated ECR repo in the persistent stack. The deployment flow is:

  1. just lambda-bootstrap builds and pushes all 18 Lambda images to ECR
  2. just infra-up-compute creates Lambda functions referencing those image URIs
  3. Some Lambdas are "deferred" -- created after Step Functions so they can receive the workflow ARN as an environment variable

Step Functions Workflows

Two state machines are defined using Jinja2 ASL templates in infra/compute/asl_templates/:

  • backtest_workflow.json.j2 -- Orchestrates: validate strategy, run ECS task, consume results, update status
  • retraining_workflow.json.j2 -- Orchestrates: check if retraining needed, run training, compare models, promote

Templates are rendered at deploy time with Lambda ARNs, ECS cluster ARN, subnet IDs, and security group IDs injected as template variables.


7. Stack: Edge

Purpose: External-facing routing and operational monitoring. Can be redeployed independently of compute resources.

Deploy: just infra-up-edge (after compute)

Entry point: infra/edge/__main__.py

Stack References

Edge reads from all three upstream stacks:

  • persistent: Cognito user pool ID, client IDs, endpoint
  • foundation: VPC ID, subnet IDs, ECS SG, SQS queue URL/ARN, SNS topic ARN, RDS identifier
  • compute: ALB DNS name, listener ARNs, Step Functions workflow names

Resources

| Module | Resources | Key Exports |
| --- | --- | --- |
| api_gateway.py | HTTP API, VPC Link to ALB, JWT authorizer (Cognito), SQS integration for backtest POST, route table (21 routes), custom domain (optional) | api_gateway_endpoint, api_gateway_id |
| waf.py | WebACL with rate limiting, geo-blocking, AWS managed rule groups; CloudWatch log group | waf_web_acl_id, waf_web_acl_arn |
| cloudwatch_alarms.py | Composite alarm, RDS CPU/storage alarms, Lambda error alarms, Step Functions failure alarms, stale heartbeat alarm | composite_alarm_arn, lambda_error_alarm_arns |
| cloudwatch_dashboard.py | Operational dashboard with ECS, Lambda, RDS, and API Gateway widgets | dashboard_url |

Design Decisions

  • API Gateway uses VPC Link: Routes traffic to ALB in private subnets; no public ALB needed.
  • Backtest POST routes to SQS: API Gateway has native SQS integration -- the request goes directly to the backtest queue without hitting any compute resource.
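  • WAF not associated with HTTP API: WAFv2 does not support association with API Gateway HTTP APIs (it does not accept the $default stage ARN of an HTTP API). The WebACL is created but takes effect only once it is associated with a supported resource, such as a REST API stage or the ALB.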
  • JWT auth via Cognito: API Gateway validates JWTs against the Cognito user pool. Health check endpoint is unauthenticated.

8. Core Patterns

The shared library at infra/shared/tradai_infra_shared/core/ provides reusable builder patterns that eliminate duplication across stack modules.

PolicyBuilder

File: infra/shared/tradai_infra_shared/core/policy_builder.py

Fluent builder for IAM policies. Provides preset methods for common permission sets (DynamoDB CRUD, S3 read/write, Secrets Manager, CloudWatch, SNS, CodeArtifact) and composes them into a single policy document. Also provides AssumeRolePolicyBuilder with factory methods like for_ecs_tasks() and for_lambda().

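# Example: building a task role policy
from tradai_infra_shared.core.policy_builder import PolicyBuilder

# Each with_* call adds one preset permission set; build() composes them
# into a single policy document.
policy = (PolicyBuilder("tradai")
    .with_dynamodb_crud()
    .with_s3_readwrite()
    .with_secrets_read()
    .with_cloudwatch_metrics()
    .build())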
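
To make the fluent pattern concrete, here is a minimal stand-in that runs without Pulumi or AWS. SketchPolicyBuilder, its action lists, and its resource ARN patterns are all hypothetical; the real PolicyBuilder may scope permissions differently.

```python
import json


class SketchPolicyBuilder:
    """Hypothetical stand-in for PolicyBuilder; not the real implementation."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._statements: list = []

    def _allow(self, actions: list, resources: list) -> "SketchPolicyBuilder":
        self._statements.append(
            {"Effect": "Allow", "Action": actions, "Resource": resources}
        )
        return self  # returning self is what enables method chaining

    def with_dynamodb_crud(self) -> "SketchPolicyBuilder":
        # Assumed action list and resource scope, for illustration only.
        return self._allow(
            ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:UpdateItem",
             "dynamodb:DeleteItem", "dynamodb:Query"],
            [f"arn:aws:dynamodb:*:*:table/{self._prefix}-*"],
        )

    def with_s3_readwrite(self) -> "SketchPolicyBuilder":
        return self._allow(
            ["s3:GetObject", "s3:PutObject"],
            [f"arn:aws:s3:::{self._prefix}-*/*"],
        )

    def build(self) -> str:
        """Compose all collected statements into one IAM policy document."""
        return json.dumps({"Version": "2012-10-17", "Statement": self._statements})


policy_json = (
    SketchPolicyBuilder("tradai")
    .with_dynamodb_crud()
    .with_s3_readwrite()
    .build()
)
```

Each preset appends one statement, so the order of the chained calls determines the order of statements in the resulting document.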

SecurityRuleBuilder

File: infra/shared/tradai_infra_shared/core/security_rule_builder.py

Fluent builder for security group rules. Provides CommonPorts with predefined port ranges for all services and PortRange for custom ranges. Methods like ingress_from_sg() and egress_https_to_internet() create rules with consistent naming and tagging.

AslTemplateLoader

File: infra/shared/tradai_infra_shared/core/asl_loader.py

Jinja2-based loader for Amazon States Language (ASL) workflow definitions. Renders .json.j2 templates from infra/compute/asl_templates/ with Lambda ARNs, ECS config, and subnet IDs injected as variables. Validates rendered output is valid JSON.
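
The render-then-validate flow can be sketched as follows. To stay dependency-free, this example swaps Jinja2 for the standard library's string.Template; the template content, the variable name validate_lambda_arn, and the placeholder account ID in the ARN are all hypothetical.

```python
import json
from string import Template

# Hypothetical, heavily reduced workflow template. The real templates are
# Jinja2 files (.json.j2); string.Template is a stand-in used here only to
# avoid a third-party dependency.
ASL_TEMPLATE = Template("""
{
  "StartAt": "ValidateStrategy",
  "States": {
    "ValidateStrategy": {
      "Type": "Task",
      "Resource": "${validate_lambda_arn}",
      "End": true
    }
  }
}
""")


def render_asl(template: Template, variables: dict) -> str:
    """Render the template, then fail fast if the result is not valid JSON."""
    rendered = template.substitute(variables)  # KeyError if a variable is missing
    json.loads(rendered)  # json.JSONDecodeError if the output is malformed
    return rendered


definition = render_asl(
    ASL_TEMPLATE,
    {"validate_lambda_arn": "arn:aws:lambda:eu-central-1:123456789012:function:validate"},
)
```

Validating at render time surfaces a broken template during deployment rather than as a state machine creation error from AWS.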

EnvironmentBuilder

File: infra/shared/tradai_infra_shared/core/environment_builder.py

Fluent builder for container environment variables that handles both static strings and Pulumi Output values. Methods include add() for static values, add_output() for Pulumi Outputs, and add_if() for conditional variables. Resolves all Outputs at build time using Output.all().apply().
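
A static-only sketch of the builder is shown below. It omits add_output() because resolving Pulumi Outputs needs the Pulumi runtime; the class name, the add_if() parameter order, and the example variable names are assumptions.

```python
class SketchEnvironmentBuilder:
    """Hypothetical static-only stand-in for EnvironmentBuilder."""

    def __init__(self) -> None:
        self._values: dict = {}

    def add(self, name: str, value: str) -> "SketchEnvironmentBuilder":
        """Add a variable unconditionally."""
        self._values[name] = value
        return self

    def add_if(self, condition: bool, name: str, value: str) -> "SketchEnvironmentBuilder":
        """Add the variable only when the condition holds."""
        if condition:
            self._values[name] = value
        return self

    def build(self) -> list:
        """Return name/value pairs in the shape ECS container definitions use."""
        return [{"name": k, "value": v} for k, v in self._values.items()]


consolidated = False  # example flag; stands in for is_consolidated_mode()
environment = (
    SketchEnvironmentBuilder()
    .add("AWS_REGION", "eu-central-1")
    .add_if(consolidated, "RUN_MODE", "consolidated")
    .build()
)
```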

Additional Utilities

| Utility | File | Purpose |
| --- | --- | --- |
| get_latest_al2023_ami() | core/ami_lookup.py | Looks up latest Amazon Linux 2023 AMI by architecture (arm64 or x86_64) |
| make_shell_template_env() | core/shell_template_env.py | Creates Jinja2 Environment for .sh.j2 userdata templates with StrictUndefined |
| PulumiMocks | testing.py | Mock Pulumi runtime for unit testing stack modules |

9. Deployment

Prerequisites

  1. AWS profile is configured via Pulumi config (pulumi config set aws:profile tradai), not as an environment variable. Configure infra/.env with the required variables (see infra/.env.example):

     PULUMI_CONFIG_PASSPHRASE=<your passphrase>
     S3_PULUMI_BACKEND_URL=s3://<pulumi-state-bucket>

  2. The justfile _infra-run helper sources infra/.env automatically before running Pulumi commands, then logs into the S3 backend and selects the target stack.

Full Bootstrap (First Deployment)

just infra-bootstrap dev

This runs the full sequence:

  1. just infra-up-persistent dev -- S3, DynamoDB, ECR, Cognito, CodeArtifact
  2. just infra-up-foundation dev -- VPC, RDS, SQS, SNS
  3. just lambda-bootstrap -- Build and push all 18 Lambda images to ECR
  4. just service-push-all -- Build and push all service images to ECR
  5. just infra-up-compute dev -- IAM, ALB, ECS, Lambda, Step Functions, EC2
  6. just infra-up-edge dev -- API Gateway, WAF, CloudWatch
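
The stack ordering in this sequence follows from the cross-stack dependencies. As a sketch, the same order can be derived with a topological sort; the STACK_DEPENDENCIES mapping is written out here from the stack references described in this document and is not read from the justfile.

```python
from graphlib import TopologicalSorter

# Each stack maps to the upstream stacks it reads from or must follow.
# Assumption: this mapping is transcribed by hand for illustration.
STACK_DEPENDENCIES = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"persistent", "foundation"},
    "edge": {"persistent", "foundation", "compute"},
}

# static_order() yields every stack after all of its upstream stacks.
deploy_order = list(TopologicalSorter(STACK_DEPENDENCIES).static_order())

# Teardown walks the same graph in reverse, so dependents go first.
destroy_order = list(reversed(deploy_order))
```

Because each stack depends on all the ones before it, the order is unique: deployment runs persistent, foundation, compute, edge, and teardown runs the reverse.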

Individual Stack Commands

| Command | Purpose |
| --- | --- |
| just infra-up-persistent [stack] | Deploy persistent stack |
| just infra-up-foundation [stack] | Deploy foundation stack |
| just infra-up-compute [stack] | Deploy compute stack (pre-flight checks Lambda images in ECR) |
| just infra-up-edge [stack] | Deploy edge stack |
| just infra-preview [stack] | Preview all 4 stacks |
| just infra-preview-{layer} [stack] | Preview individual stack |
| just infra-down-soft [stack] | Destroy edge + compute only (data and networking preserved) |
| just infra-down-all [stack] | Destroy edge + compute + foundation (persistent preserved) |
| just infra-down-persistent [stack] | DANGER: destroy all data (requires confirmation prompt) |

Default stack is dev for all commands.

Compute Pre-Flight Check

just infra-up-compute verifies that all 18 Lambda images exist in ECR before deploying. If any image is missing, it fails with instructions to run just lambda-bootstrap first. This prevents partial deployments where Lambda functions reference non-existent images.
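
The core of such a check is a set difference between the required repositories and those that already hold an image. The sketch below shows only that logic; the function names and repository names are hypothetical, and the real check would first query ECR to build the set of repositories with images.

```python
def find_missing_images(required_repos: list, repos_with_images: set) -> list:
    """Return required repositories that have no pushed image, in input order."""
    return [repo for repo in required_repos if repo not in repos_with_images]


def preflight(required_repos: list, repos_with_images: set) -> None:
    """Abort before deployment if any required Lambda image is absent."""
    missing = find_missing_images(required_repos, repos_with_images)
    if missing:
        raise SystemExit(
            "Missing Lambda images: " + ", ".join(missing)
            + ". Run `just lambda-bootstrap` first."
        )


# Hypothetical repository names, for illustration only.
required = ["tradai/lambda-a", "tradai/lambda-b"]
missing = find_missing_images(required, {"tradai/lambda-a"})
```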


10. Cross-Stack References

How It Works

Each stack exports named outputs via pulumi.export(). Downstream stacks read these using pulumi.StackReference(f"{org}/tradai-{stack_name}/{environment}").

Key Exports Per Stack

| Stack | Key Exports | Consumed By |
| --- | --- | --- |
| persistent | ecr_repository_urls, s3_*_bucket_id (x5), DynamoDB table names/ARNs (x24), cognito_user_pool_id, cognito_user_pool_client_id, cognito_m2m_client_id, cognito_user_pool_endpoint | compute, edge |
| foundation | vpc_id, public_subnet_ids, private_subnet_ids, security group IDs (x6), rds_endpoint, rds_secret_arn, rds_database_name, rds_instance_identifier, backtest_queue_url, backtest_queue_arn, backtest_dlq_arn, sns_alerts_topic_arn | compute, edge |
| compute | alb_dns_name, alb_https_listener_arn, alb_http_listener_arn, ecs_cluster_arn, backtest_workflow_arn, backtest_workflow_name, retraining_workflow_name, lambda_function_arns | edge |
| edge | api_gateway_endpoint, api_gateway_id, waf_web_acl_id, composite_alarm_arn, dashboard_url | (terminal -- consumed by users/CI) |

Reference Pattern

The org value defaults to "organization" and can be overridden via pulumi config set org <value>. Stack names match environments (dev, staging, prod), so a reference looks like: organization/tradai-persistent/dev.
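
A small helper can keep reference names consistent across stacks. This is a sketch only; stack_reference_name and its layer validation are hypothetical and not part of the shared library described in this document.

```python
def stack_reference_name(layer: str, environment: str, org: str = "organization") -> str:
    """Build the fully qualified name: {org}/tradai-{layer}/{environment}."""
    allowed_layers = {"persistent", "foundation", "compute", "edge"}
    if layer not in allowed_layers:
        raise ValueError(f"unknown stack layer: {layer!r}")
    return f"{org}/tradai-{layer}/{environment}"


persistent_ref = stack_reference_name("persistent", "dev")
```

The returned string is what would be passed to pulumi.StackReference in a downstream stack.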


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 9.2.1 | 2026-03-28 | Full regeneration. Corrected to 4-stack architecture, eu-central-1, tables over code |

Dependencies

| If This Changes | Update This Doc |
| --- | --- |
| New Pulumi stack or module added | Stack table, module listings |
| infra/shared/tradai_infra_shared/config.py | Config reference section |
| Stack deployment order changes | Dependency flow diagram |