Pulumi Infrastructure Deployment Guide

Comprehensive guide for deploying and managing TradAI infrastructure using Pulumi with AWS.

Architecture Overview

TradAI infrastructure is split into 4 independent Pulumi stacks. The key design principle is separating data-bearing resources (never destroyed in dev cycles) from ephemeral compute (freely destroyed and redeployed):

persistent → foundation → (lambda-bootstrap) → compute → edge
| Stack | Resources | Destroy in dev? |
|---|---|---|
| persistent | S3×5, DynamoDB×12, ECR×24, Cognito, CodeArtifact, CloudTrail, Pulumi backend | Never |
| foundation | VPC, subnets, NAT, SGs, NACLs, VPC endpoints, RDS, SQS, SNS | Rarely |
| compute | IAM roles, ALB, ECS, Lambda, Step Functions, EC2 Consolidated | Freely |
| edge | API Gateway, WAF, CloudWatch Alarms/Dashboard | Freely |

Stacks share data via pulumi.StackReference:

- compute reads from persistent (ECR URLs, S3 IDs, DynamoDB names) and foundation (VPC, SGs, RDS, SQS)
- edge reads from persistent (Cognito), foundation (VPC, SGs, SQS), and compute (ALB endpoints)

Key Files

| File | Purpose |
|---|---|
| infra/shared/tradai_infra_shared/config.py | Canonical source of truth for all configuration |
| infra/shared/tradai_infra_shared/core/ | Shared builders (PolicyBuilder, SecurityRuleBuilder, etc.) |
| infra/persistent/__main__.py | Persistent stack entry point (data resources) |
| infra/foundation/__main__.py | Foundation stack entry point (networking) |
| infra/compute/__main__.py | Compute stack entry point (ECS, Lambda, IAM) |
| infra/edge/__main__.py | Edge stack entry point (API GW, WAF) |
| infra/{stack}/Pulumi.yaml | Per-stack Pulumi project definition |
| infra/{stack}/Pulumi.{env}.yaml | Per-stack environment configuration |
| infra/{stack}/modules/ | Per-stack infrastructure modules |

Prerequisites

# Required tools
python --version          # 3.11+
uv --version              # UV package manager
pulumi version            # Pulumi CLI
aws --version             # AWS CLI

# AWS credentials
aws configure list        # Verify credentials

Install Pulumi

# macOS
brew install pulumi

# Linux
curl -fsSL https://get.pulumi.com | sh

# Verify
pulumi version

Quick Start

For Existing Team Members

# 1. Setup environment
just infra-setup
# Edit infra/.env with S3_PULUMI_BACKEND_URL, PULUMI_CONFIG_PASSPHRASE, AWS_PROFILE, AWS_REGION

# 2. Preview all stacks
just infra-preview dev

# 3. Deploy all stacks in order
just infra-bootstrap dev

# Or deploy individually (persistent first):
just infra-up-persistent dev    # S3, DynamoDB, ECR, Cognito (no deps)
just infra-up-foundation dev    # VPC, RDS, SQS, SNS
just lambda-bootstrap            # Push Lambda images to ECR
just infra-up-compute dev        # IAM, ECS, ALB, Lambda (images exist now)
just infra-up-edge dev           # API Gateway, WAF, CloudWatch

First-Time Bootstrap

# 1. Setup environment
just infra-setup
# Edit infra/.env with AWS credentials and Pulumi passphrase

# 2. Full bootstrap (deploys foundation → pushes images → compute → edge)
just infra-bootstrap dev

# 3. Verify deployment
just infra-outputs persistent dev
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev

Stack Environments

| Stack | Description | Key Differences |
|---|---|---|
| dev | Development | Single-AZ, t4g.micro RDS, FARGATE_SPOT, 7-day logs |
| staging | Pre-production | Production-like, single-AZ, t4g.small RDS |
| prod | Production | Multi-AZ, full redundancy, 90-day logs, no Spot for live trading |
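
The per-environment differences above lend themselves to a small lookup table in config.py. Below is a minimal standalone sketch; the names (ENV_SETTINGS, settings_for) are hypothetical, not the actual config.py API, and staging's Spot/log-retention values and prod's RDS class are placeholders where the table does not state them.

```python
# Hypothetical sketch of per-environment settings keyed by the active
# Pulumi stack name. Values mirror the table above where stated; the rest
# are illustrative placeholders.
ENV_SETTINGS = {
    "dev":     {"multi_az": False, "rds_instance_class": "db.t4g.micro", "fargate_spot": True,  "log_retention_days": 7},
    "staging": {"multi_az": False, "rds_instance_class": "db.t4g.small", "fargate_spot": True,  "log_retention_days": 7},
    "prod":    {"multi_az": True,  "rds_instance_class": "db.t4g.small", "fargate_spot": False, "log_retention_days": 90},
}


def settings_for(environment: str) -> dict:
    """Look up an environment's settings, failing fast on unknown stack names."""
    if environment not in ENV_SETTINGS:
        raise ValueError(f"unknown environment: {environment!r}")
    return ENV_SETTINGS[environment]
```

In the real config.py the environment would come from pulumi.get_stack() rather than being passed in.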

Selecting a Stack

# List available stacks
pulumi stack ls

# Select a stack
pulumi stack select dev

# Create new stack
pulumi stack init staging

Stack Modules

Foundation Stack (infra/foundation/)

No container image dependencies. Can be deployed first.

| Module | Description | Key Resources |
|---|---|---|
| vpc.py | VPC (10.0.0.0/16), 6 subnets (2 AZs, 3 tiers) | VPC, subnets, route tables |
| vpc_endpoints.py | Gateway + Interface VPC Endpoints | S3, DynamoDB, ECR, STS, SSM, SQS endpoints |
| vpc_flow_logs.py | VPC Flow Logs to CloudWatch | Log group, IAM role |
| nacl.py | Network ACLs for all subnets | NACLs, rules |
| s3.py | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | Buckets, lifecycle rules |
| cloudtrail.py | CloudTrail audit logging | Trail, S3 policy |
| dynamodb.py | 8 DynamoDB tables | Tables, GSIs |
| sns.py | SNS topics (alerts, registration) | Topics, subscriptions |
| security_groups.py | 5 security groups | SGs, ingress/egress rules |
| nat_instance.py | NAT instance with ASG (cost-optimized) | ASG, launch template |
| rds.py | PostgreSQL for MLflow | RDS instance, subnet group |
| secret_rotation.py | RDS secret rotation (30-day) | Lambda, rotation schedule |
| ecr.py | 12 ECR repositories | Repos, lifecycle policies |
| iam.py | ECS execution, task, and consolidated roles | IAM roles, policies, instance profile |
| sqs.py | Backtest queue + DLQ | Queues, redrive policy |
| cognito.py | Cognito user pool, M2M client | User pool, app client |
| codeartifact.py | CodeArtifact for Python packages | Domain, repository |

Compute Stack (infra/compute/)

Needs ECR images. Deploy after just lambda-bootstrap.

| Module | Description | Key Resources |
|---|---|---|
| alb.py | Application Load Balancer | ALB, listeners, target groups |
| ecs.py | ECS Fargate cluster | Cluster, capacity providers |
| ecs_services.py | 4 ECS services (or consolidated EC2) | Services, task definitions |
| ec2_consolidated.py | Consolidated EC2 for dev/staging | ASG, launch template, Cloud Map |
| lambda_funcs.py | Lambda functions | Functions, event sources |
| step_functions.py | Backtest workflow state machine | State machine, IAM role |

Edge Stack (infra/edge/)

Needs outputs from the compute stack (ALB endpoints) and the persistent stack (Cognito).

| Module | Description | Key Resources |
|---|---|---|
| api_gateway.py | HTTP API Gateway with routes | API, routes, integrations |
| waf.py | WAF WebACL for API Gateway | WebACL, rules |
| cloudwatch_alarms.py | Composite alarm, heartbeat detection | Alarms, SNS actions |
| cloudwatch_dashboard.py | Trading platform dashboard | Dashboard widgets |

Configuration Management

config.py - Single Source of Truth

All infrastructure values are defined in infra/shared/tradai_infra_shared/config.py:

# Core configuration
PROJECT_NAME = "tradai"
ENVIRONMENT = pulumi.get_stack()  # dev, staging, prod
AWS_REGION = "eu-central-1"

# VPC CIDR
VPC_CIDR = "10.0.0.0/16"

# Services definition
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}

# DynamoDB tables
DYNAMODB_TABLES = {
    "workflow_state": "tradai-workflow-state",
    "health_state": "tradai-health-state",
    "trading_state": "tradai-trading-state",
    # ...
}

Stack Configuration Overrides

Override values per stack in Pulumi.{stack}.yaml:

config:
  aws:region: eu-central-1
  tradai-infrastructure:environment: dev
  tradai-infrastructure:certificate_arn: "arn:aws:acm:..."  # Optional for HTTPS
  tradai-infrastructure:api_domain: "api.tradai.io"        # Optional custom domain

Common Configuration Commands

# Set certificate for HTTPS
pulumi config set certificate_arn "arn:aws:acm:eu-central-1:123:certificate/abc"

# Set alert email
pulumi config set --path 'alert_emails[0]' 'alerts@example.com'

# Set CORS origins
pulumi config set --path 'cors_allowed_origins[0]' 'https://app.tradai.io'

# Configure CloudWatch alarm thresholds
pulumi config set alarm_latency_threshold 5000
pulumi config set alarm_min_strategies 1  # For prod

# Configure Lambda schedules
pulumi config set drift_monitor_schedule "rate(1 day)"
pulumi config set retraining_scheduler_schedule "rate(7 days)"

Helper Functions

config.py provides helper functions for consistent naming:

# Standard resource name
get_resource_name("backend")  # tradai-backend-dev

# ECR repository name
get_ecr_repo_name("backend")  # tradai/backend

# Standard tags
get_tags(service="backend")
# {
#   "Project": "tradai",
#   "Environment": "dev",
#   "Service": "backend",
#   "ManagedBy": "pulumi",
# }

# Environment short code (for name length limits)
get_env_short()  # "d" for dev, "s" for staging, "p" for prod
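
The helpers above can be sketched as plain functions. This standalone version hardcodes ENVIRONMENT for illustration; the real config.py derives it from pulumi.get_stack(), and the exact signatures there may differ.

```python
# Standalone sketch of the naming helpers documented above; the actual
# implementations live in infra/shared/tradai_infra_shared/config.py.
PROJECT_NAME = "tradai"
ENVIRONMENT = "dev"  # pulumi.get_stack() in the real config.py


def get_resource_name(component: str) -> str:
    """Standard resource name: tradai-<component>-<env>."""
    return f"{PROJECT_NAME}-{component}-{ENVIRONMENT}"


def get_ecr_repo_name(service: str) -> str:
    """ECR repositories are namespaced under the project: tradai/<service>."""
    return f"{PROJECT_NAME}/{service}"


def get_env_short() -> str:
    """First letter of the environment, for length-limited resource names."""
    return ENVIRONMENT[0]


def get_tags(service: str) -> dict[str, str]:
    """Standard tag set applied to every resource."""
    return {
        "Project": PROJECT_NAME,
        "Environment": ENVIRONMENT,
        "Service": service,
        "ManagedBy": "pulumi",
    }
```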

Daily Operations

Preview and Deploy

# Preview all stacks
just infra-preview dev

# Preview a single stack
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev

# Deploy all stacks in order
just infra-bootstrap dev

# Deploy a single stack
just infra-up-foundation dev
just infra-up-compute dev
just infra-up-edge dev

View Outputs

# Stack outputs (resource IDs, ARNs, endpoints)
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev

# Filter a specific output
just infra-outputs compute dev | jq -r '.alb_dns_name'

Service Operations

# ECS service status
just ecs-status

# Force ECS redeployment (after image push)
just ecs-force-deploy backend
just ecs-force-deploy-all

# ASG refresh (consolidated EC2 — picks up new launch template)
just asg-refresh
just asg-status

# Push service images to ECR
just service-ecr-login
just service-push backend
just service-push-all

# Check Lambda images
just lambda-check-images

State Management

# Export state backup (run from specific stack dir)
cd infra/foundation && pulumi stack export > state-backup-$(date +%Y%m%d).json

# Refresh state from AWS
cd infra/foundation && set -a && source ../.env && set +a && pulumi refresh

# View resource URNs
cd infra/foundation && set -a && source ../.env && set +a && pulumi stack --show-urns

Rollback

# Pulumi doesn't have native rollback. Options:

# 1. Redeploy previous code from git
git checkout HEAD~1 -- infra/
just infra-bootstrap dev

# 2. Import previous state backup
cd infra/foundation && pulumi stack import --file state-backup.json
just infra-up-foundation dev

# 3. Manual ECS service rollback
aws ecs update-service --cluster tradai-dev --service tradai-backend-dev \
  --task-definition tradai-backend-dev:PREVIOUS_REVISION \
  --profile tradai --region eu-central-1

Adding New Modules

Module Template

"""New Module Description.

Task ID: XX001
Dependencies: vpc, security_groups
"""

import pulumi
import pulumi_aws as aws
from tradai_infra_shared.config import (
    ENVIRONMENT,
    PROJECT_NAME,
    get_tags,
)


class NewModule:
    """Creates new infrastructure component."""

    def __init__(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ) -> None:
        """Initialize NewModule.

        Args:
            vpc_id: VPC ID for resource placement
            security_group_id: Security group to attach
        """
        self.resource = aws.service.Resource(
            f"{PROJECT_NAME}-component",
            vpc_id=vpc_id,
            security_groups=[security_group_id],
            tags=get_tags("component")
            | {"Name": f"{PROJECT_NAME}-component-{ENVIRONMENT}"},
        )

    @property
    def resource_id(self) -> pulumi.Output[str]:
        """Get resource ID."""
        return self.resource.id

Integration Checklist

  1. Create module file in the appropriate stack's modules/ directory
  2. Add import to the stack's __main__.py
  3. Choose the right stack: foundation (no image deps), compute (needs ECR images), edge (needs compute)
  4. Export outputs via pulumi.export() if other stacks need the values
  5. Update shared config if new canonical values needed (infra/shared/tradai_infra_shared/config.py)
  6. Add tests in the stack's tests/ directory
  7. Update documentation (this guide)

Example: Adding to a Stack

# In compute/__main__.py
from modules.new_module import NewModule

# Read from foundation stack via StackReference
foundation_ref = pulumi.StackReference("tradai-foundation/dev")
vpc_id = foundation_ref.get_output("vpc_id")

new_module = NewModule(
    vpc_id=vpc_id,
    security_group_id=foundation_ref.get_output("ecs_sg_id"),
)
pulumi.export("new_resource_id", new_module.resource_id)

Troubleshooting

Common Issues

| Error | Cause | Solution |
|---|---|---|
| error: no Pulumi.yaml | Wrong directory | cd infra/foundation/ (or compute/edge) |
| error: stack 'dev' not found | Stack not initialized | pulumi stack init dev |
| error: passphrase must be set | Missing passphrase | source infra/.env or set PULUMI_CONFIG_PASSPHRASE |
| error: unable to deserialize | State corruption | Restore from backup |
| error: resource already exists | Orphaned resource | Import or delete manually |
| SSM agent not registering | Custom AMI missing agent | Redeploy compute stack (installs from S3 RPM) |
| S3/dnf 403 errors on EC2 | VPC endpoint policy too restrictive | Check foundation vpc_endpoints.py AllowSystemPackageBuckets |

State Lock Issues

# Check for lock
aws s3api head-object --bucket $BUCKET --key .pulumi/locks/...

# Force unlock (use with caution)
pulumi cancel

Import Existing Resources

# Import existing resource
pulumi import aws:s3:Bucket my-bucket existing-bucket-name

# Import with specific URN
pulumi import aws:ecs:Service backend arn:aws:ecs:eu-central-1:123:service/cluster/service

Debug Mode

# Verbose logging
PULUMI_DEBUG_COMMANDS=1 pulumi up

# Trace logging
pulumi up --logtostderr -v=9

# Dry run with detailed output
pulumi preview --diff --debug

Accessing Services

Dev/Staging: Consolidated EC2

In dev/staging, all services run on a single EC2 instance via Docker Compose (cost optimization). Access via SSM Session Manager (no SSH keys needed):

# Find the consolidated instance
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-consolidated-dev" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text \
  --profile tradai --region eu-central-1

# Interactive shell
aws ssm start-session --target <instance-id> --profile tradai --region eu-central-1

# On the instance:
sudo docker ps                                   # Running containers
sudo docker logs backend-api --tail 50 -f        # Service logs
cd /opt/tradai && sudo docker-compose restart     # Restart services

Production: ECS Fargate

In production, services run on ECS Fargate. Use ECS Exec for debugging:

# List running tasks
aws ecs list-tasks --cluster tradai-prod --service-name tradai-backend-prod \
  --profile tradai --region eu-central-1

# Interactive shell into a container
aws ecs execute-command --cluster tradai-prod --task <task-arn> \
  --container backend --interactive --command "/bin/sh" \
  --profile tradai --region eu-central-1

Service Endpoints

All services are fronted by an Application Load Balancer:

| Service | Port | Path | Purpose |
|---|---|---|---|
| backend-api | 8000 | /api/* | API gateway, orchestration |
| data-collection | 8002 | /data/* | Market data fetching |
| strategy-service | 8003 | /strategy/* | Strategy execution, backtesting |
| mlflow | 5000 | /mlflow/* | ML experiment tracking |

# Get ALB DNS
just infra-outputs compute dev | jq -r '.alb_dns_name'

# Test health
curl -s http://<ALB_DNS>/api/health | jq .

Logs

# CloudWatch (consolidated EC2)
aws logs tail /tradai/consolidated/containers --since 30m \
  --profile tradai --region eu-central-1

# CloudWatch (follow in real time)
aws logs tail /tradai/consolidated/containers --follow \
  --profile tradai --region eu-central-1

CI/CD Integration

Bitbucket Pipelines

| Pipeline | Description |
|---|---|
| infra-preview | Preview changes (PR validation) |
| infra-deploy-dev | Deploy to dev |
| infra-deploy-staging | Deploy to staging |
| infra-deploy-prod | Deploy to production |

Required Variables

| Variable | Description | Secured |
|---|---|---|
| PULUMI_CONFIG_PASSPHRASE | Stack encryption | Yes |
| S3_PULUMI_BACKEND_URL | Backend URL | No |
| AWS_ACCESS_KEY_ID | AWS credentials | Yes |
| AWS_SECRET_ACCESS_KEY | AWS credentials | Yes |
| AWS_REGION | AWS region | No |

Pipeline Example

# bitbucket-pipelines.yml (simplified)
pipelines:
  custom:
    infra-deploy-dev:
      - step:
          name: Deploy Infrastructure (Dev)
          script:
            - just infra-bootstrap dev

Cost Optimization

| Optimization | Savings | Implementation |
|---|---|---|
| NAT Instance vs Gateway | ~$32/month | nat_instance.py uses t4g.nano |
| FARGATE_SPOT | ~70% on backtests | Enabled in ecs.py |
| Single-AZ (dev/staging) | ~50% on RDS | rds.py multi_az=False |
| db.t4g.micro | ~$12/month | Dev/staging RDS instance |
| On-demand DynamoDB | Pay per use | Default in dynamodb.py |
| S3 lifecycle rules | Auto-cleanup | s3.py lifecycle policies |

Security Best Practices

  1. Secrets: Never in config, always AWS Secrets Manager
  2. RDS: Private subnets only, security group restricted
  3. IAM: Least-privilege roles in iam.py
  4. Security Groups: Minimal ingress rules
  5. WAF: Rate limiting on API Gateway
  6. CloudTrail: Audit logging enabled
  7. VPC Flow Logs: Network traffic logging
  8. Encryption: S3, RDS, DynamoDB at rest
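
As a sketch of the least-privilege idea in point 3, a helper in the spirit of the PolicyBuilder mentioned under Key Files might assemble explicit statements instead of wildcards. The function name and shape here are assumptions for illustration, not the shared builder's real API.

```python
# Hypothetical least-privilege policy helper: every statement names its
# actions and resources explicitly; nothing defaults to "*".
import json


def build_policy(statements: list[tuple[list[str], list[str]]]) -> str:
    """Build an IAM policy document from (actions, resources) pairs."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": resources}
            for actions, resources in statements
        ],
    })


# Example: read-only access to a single S3 bucket's objects
policy = build_policy([
    (["s3:GetObject"], ["arn:aws:s3:::tradai-configs-dev/*"]),
])
```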