Pulumi Infrastructure Deployment Guide

Comprehensive guide for deploying and managing TradAI infrastructure using Pulumi with AWS.

Architecture Overview

TradAI infrastructure is organized into five deployment phases (Phase 0 through Phase 4):

Phase 0: State Management (S3 backend for Pulumi state)
Phase 1: Foundation (VPC, storage, security, networking)
Phase 2: Compute (ECS, Lambda, ALB, API Gateway)
Phase 3: Orchestration (Step Functions, SQS)
Phase 4: Monitoring (CloudWatch, WAF)

Key Files

| File | Purpose |
| --- | --- |
| infra/__main__.py | Main entry point - orchestrates all modules |
| infra/config.py | Canonical source of truth for all configuration |
| infra/Pulumi.yaml | Pulumi project definition |
| infra/Pulumi.{stack}.yaml | Stack-specific configuration |
| infra/modules/ | 28 infrastructure modules |

Prerequisites

# Required tools
python --version          # 3.11+
uv --version              # UV package manager
pulumi version            # Pulumi CLI
aws --version             # AWS CLI

# AWS credentials
aws configure list        # Verify credentials

Install Pulumi

# macOS
brew install pulumi

# Linux
curl -fsSL https://get.pulumi.com | sh

# Verify
pulumi version

Quick Start

For Existing Team Members

# 1. Setup environment
just infra-setup
# Edit infra/.env with S3_PULUMI_BACKEND_URL and PULUMI_CONFIG_PASSPHRASE

# 2. Login to S3 backend
just infra-login

# 3. Select stack
just infra-init dev

# 4. Preview and deploy
just infra-preview
just infra-up

First-Time Bootstrap

# 1. Setup environment
just infra-setup
# Edit infra/.env (leave S3_PULUMI_BACKEND_URL empty)

# 2. Bootstrap with local backend
just infra-bootstrap dev

# 3. Deploy infrastructure (creates S3 bucket)
just infra-up

# 4. Get bucket name for migration
pulumi stack output pulumi_state_bucket_name

# 5. Update .env with bucket URL
# S3_PULUMI_BACKEND_URL=s3://tradai-pulumi-state-abc123

# 6. Migrate state to S3
just infra-migrate

Stack Environments

| Stack | Description | Key Differences |
| --- | --- | --- |
| dev | Development | Single-AZ, t4g.micro RDS, FARGATE_SPOT, 7-day logs |
| staging | Pre-production | Production-like, single-AZ, t4g.small RDS |
| prod | Production | Multi-AZ, full redundancy, 90-day logs, no Spot for live trading |
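The per-stack differences above are easiest to reason about as data. The sketch below is hypothetical (the names `STACK_SETTINGS` and `get_stack_settings`, and any value marked as assumed, are illustrative, not the actual contents of infra/config.py):

```python
# Hypothetical per-stack settings derived from the table above.
STACK_SETTINGS = {
    "dev": {
        "multi_az": False,
        "rds_instance_class": "db.t4g.micro",
        "use_fargate_spot": True,
        "log_retention_days": 7,
    },
    "staging": {
        "multi_az": False,
        "rds_instance_class": "db.t4g.small",
        "use_fargate_spot": True,   # assumed; not stated in the table
        "log_retention_days": 7,    # assumed; not stated in the table
    },
    "prod": {
        "multi_az": True,
        "rds_instance_class": "db.t4g.small",  # assumed; not stated in the table
        "use_fargate_spot": False,  # no Spot for live trading
        "log_retention_days": 90,
    },
}


def get_stack_settings(stack: str) -> dict:
    """Look up settings for a stack, falling back to dev defaults."""
    return STACK_SETTINGS.get(stack, STACK_SETTINGS["dev"])
```

Centralizing the differences this way keeps module code free of `if stack == "prod"` branches.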

Selecting a Stack

# List available stacks
pulumi stack ls

# Select a stack
pulumi stack select dev

# Create new stack
pulumi stack init staging

Deployment Phases

Phase 0: Pulumi Backend

Self-hosted S3 state storage enabling team collaboration and CI/CD.

Module: pulumi_backend.py

Resources:

- S3 bucket for Pulumi state
- IAM roles for CI/CD and Pulumi Dashboard

Outputs:

pulumi stack output pulumi_state_bucket_name
pulumi stack output pulumi_backend_role_arn

Phase 1: Foundation Infrastructure

| Module | Task ID | Description | Dependencies |
| --- | --- | --- | --- |
| vpc.py | IF002 | VPC (10.0.0.0/16), 6 subnets (2 AZs, 3 tiers) | None |
| vpc_endpoints.py | SEC002 | Gateway endpoints (S3, DynamoDB) | VPC |
| vpc_flow_logs.py | SEC004 | VPC Flow Logs to CloudWatch | VPC |
| nacl.py | SEC005 | Network ACLs for all subnets | VPC |
| s3.py | IS001 | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | None |
| cloudtrail.py | SEC003 | CloudTrail audit logging | S3 |
| dynamodb.py | IS003 | 8 DynamoDB tables | None |
| sns.py | MN001 | SNS topics (alerts, registration) | None |
| security_groups.py | IF004 | 5 security groups | VPC |
| nat_instance.py | IF003 | NAT instance with ASG | VPC, SG |
| rds.py | IS002 | PostgreSQL for MLflow | VPC, SG |
| secret_rotation.py | SEC006 | RDS secret rotation (30-day) | RDS |
| ecr.py | IS004 | 12 ECR repositories | None |
| codeartifact.py | SR003 | CodeArtifact for Python packages | None |

Phase 2: Compute Infrastructure

| Module | Task ID | Description | Dependencies |
| --- | --- | --- | --- |
| iam.py | IC001 | ECS execution and task roles | None |
| ecs.py | IC001 | ECS Fargate cluster, strategy task def | IAM |
| alb.py | IC002 | Application Load Balancer | VPC, SG |
| sqs.py | IO001 | Backtest queue + DLQ | None |
| cognito.py | DK005 | Cognito user pool, M2M client | None |
| ecs_services.py | IC003 | 4 ECS services (backend, strategy, data, mlflow) | ECS, ALB, ECR |
| lambda_funcs.py | IC004 | 8 Lambda functions | VPC, SG |
| api_gateway.py | IC005 | HTTP API Gateway with 11 routes | ALB, Cognito |
| waf.py | SEC001 | WAF WebACL for API Gateway | API Gateway |

Phase 3: Orchestration

| Module | Task ID | Description | Dependencies |
| --- | --- | --- | --- |
| step_functions.py | IO002 | Backtest workflow state machine | Lambda, ECS |

Phase 4: Monitoring

| Module | Task ID | Description | Dependencies |
| --- | --- | --- | --- |
| cloudwatch_alarms.py | MN003 | Composite alarm, heartbeat detection | SNS |
| cloudwatch_dashboard.py | OB001 | Trading platform dashboard | None |

Configuration Management

config.py - Single Source of Truth

All infrastructure values are defined in infra/config.py:

# Core configuration
PROJECT_NAME = "tradai"
ENVIRONMENT = pulumi.get_stack()  # dev, staging, prod
AWS_REGION = "us-east-1"

# VPC CIDR
VPC_CIDR = "10.0.0.0/16"

# Services definition
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}

# DynamoDB tables
DYNAMODB_TABLES = {
    "workflow_state": "tradai-workflow-state",
    "health_state": "tradai-health-state",
    "trading_state": "tradai-trading-state",
    # ...
}
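Because config.py is the single source of truth, it is a natural place for a fail-fast sanity check. The snippet below is a hypothetical addition (`validate_services` is not in the repo); the memory check reflects the documented Fargate minimum of 2x the CPU units in MiB:

```python
# Hypothetical sanity check for the SERVICES mapping shown above.
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}


def validate_services(services: dict) -> None:
    """Fail fast on duplicate ports or invalid Fargate cpu/memory combos."""
    ports = [spec["port"] for spec in services.values()]
    if len(ports) != len(set(ports)):
        raise ValueError("duplicate service ports in config.py")
    for name, spec in services.items():
        # Fargate's minimum memory (MiB) is 2x the CPU units at every tier
        if spec["memory"] < 2 * spec["cpu"]:
            raise ValueError(f"{name}: memory too low for cpu setting")


validate_services(SERVICES)  # passes for the config above
```

Running the check at import time means a bad edit to config.py breaks `pulumi preview` immediately instead of failing mid-deploy.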

Stack Configuration Overrides

Override values per stack in Pulumi.{stack}.yaml:

config:
  aws:region: us-east-1
  tradai-infrastructure:environment: dev
  tradai-infrastructure:certificate_arn: "arn:aws:acm:..."  # Optional for HTTPS
  tradai-infrastructure:api_domain: "api.tradai.io"        # Optional custom domain
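The precedence rule is: stack-file values win over config.py defaults. Pulumi resolves this itself via `pulumi.Config()`; the sketch below (names `DEFAULTS` and `resolve_config` are hypothetical) just illustrates the merge order for optional values like the two above:

```python
# Illustrative only: shows how stack-file overrides shadow canonical defaults.
DEFAULTS = {
    "certificate_arn": None,  # optional; HTTPS is skipped when unset
    "api_domain": None,       # optional; default API endpoint when unset
}


def resolve_config(stack_overrides: dict) -> dict:
    """Merge Pulumi.{stack}.yaml overrides over config.py defaults."""
    return {**DEFAULTS, **stack_overrides}


resolved = resolve_config({"api_domain": "api.tradai.io"})
# api_domain comes from the stack file; certificate_arn keeps its default
```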

Common Configuration Commands

# Set certificate for HTTPS
pulumi config set certificate_arn "arn:aws:acm:us-east-1:123:certificate/abc"

# Set alert email
pulumi config set --path 'alert_emails[0]' 'alerts@example.com'

# Set CORS origins
pulumi config set --path 'cors_allowed_origins[0]' 'https://app.tradai.io'

# Configure CloudWatch alarm thresholds
pulumi config set alarm_latency_threshold 5000
pulumi config set alarm_min_strategies 1  # For prod

# Configure Lambda schedules
pulumi config set drift_monitor_schedule "rate(1 day)"
pulumi config set retraining_scheduler_schedule "rate(7 days)"

Helper Functions

config.py provides helper functions for consistent naming:

# Standard resource name
get_resource_name("backend")  # tradai-backend-dev

# ECR repository name
get_ecr_repo_name("backend")  # tradai/backend

# Standard tags
get_tags(service="backend")
# {
#   "Project": "tradai",
#   "Environment": "dev",
#   "Service": "backend",
#   "ManagedBy": "pulumi",
# }

# Environment short code (for name length limits)
get_env_short()  # "d" for dev, "s" for staging, "p" for prod
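Plausible implementations of these helpers, matching the outputs shown above (a sketch only; the real definitions live in infra/config.py and may differ):

```python
# Sketch of the config.py naming helpers; ENVIRONMENT is normally
# pulumi.get_stack(), hard-coded here so the example is self-contained.
PROJECT_NAME = "tradai"
ENVIRONMENT = "dev"


def get_resource_name(component: str) -> str:
    """Standard resource name: <project>-<component>-<stack>."""
    return f"{PROJECT_NAME}-{component}-{ENVIRONMENT}"


def get_ecr_repo_name(component: str) -> str:
    """ECR repository name: <project>/<component>."""
    return f"{PROJECT_NAME}/{component}"


def get_tags(service: str) -> dict:
    """Standard tag set applied to every resource."""
    return {
        "Project": PROJECT_NAME,
        "Environment": ENVIRONMENT,
        "Service": service,
        "ManagedBy": "pulumi",
    }


def get_env_short() -> str:
    """First letter of the stack name, for name-length-limited resources."""
    return ENVIRONMENT[0]  # "d", "s", or "p"
```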

Daily Operations

Preview Changes

# Preview all changes
pulumi preview

# Preview with diff
pulumi preview --diff

# Preview specific resource
pulumi preview --target 'urn:pulumi:dev::tradai-infrastructure::aws:ecs/service:Service::tradai-backend-dev'

Deploy

# Deploy with confirmation prompt
pulumi up

# Deploy with auto-approve (CI/CD)
pulumi up --yes

# Deploy specific resource
pulumi up --target 'urn:pulumi:dev::tradai-infrastructure::aws:ecs/service:Service::tradai-backend-dev'

# Skip preview
pulumi up --skip-preview --yes

View Outputs

# All outputs
pulumi stack output

# Specific output
pulumi stack output vpc_id
pulumi stack output ecs_cluster_name
pulumi stack output api_gateway_endpoint

# JSON format
pulumi stack output --json

State Management

# Export state backup
pulumi stack export > state-backup-$(date +%Y%m%d).json

# Import state
pulumi stack import --file state-backup.json

# Refresh state from AWS
pulumi refresh

# View resource URNs
pulumi stack --show-urns
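The export-before-change backup above is worth scripting. A hypothetical helper (names `backup_filename` and `export_state` are illustrative, not repo code; running it requires the Pulumi CLI and a selected stack):

```python
# Hypothetical wrapper around `pulumi stack export` for dated state backups.
import datetime
import subprocess


def backup_filename(day: datetime.date) -> str:
    """Match the state-backup-YYYYMMDD.json convention used above."""
    return f"state-backup-{day:%Y%m%d}.json"


def export_state(path: str) -> None:
    """Write the selected stack's state to `path` (requires pulumi CLI)."""
    with open(path, "w") as f:
        subprocess.run(["pulumi", "stack", "export"], stdout=f, check=True)


# Example (needs pulumi logged in and a stack selected):
# export_state(backup_filename(datetime.date.today()))
```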

Rollback

# Pulumi doesn't have native rollback
# Options:
# 1. Redeploy previous code from git
git checkout HEAD~1 -- infra/
pulumi up --yes

# 2. Import previous state backup
pulumi stack import --file state-backup-20240115.json
pulumi up --yes

# 3. Manual resource updates
aws ecs update-service --cluster tradai-dev --service tradai-backend-dev \
  --task-definition tradai-backend-dev:PREVIOUS_REVISION

Adding New Modules

Module Template

"""New Module Description.

Task ID: XX001
Dependencies: vpc, security_groups
"""

import pulumi
import pulumi_aws as aws

from config import (
    get_resource_name,
    get_tags,
)


class NewModule:
    """Creates new infrastructure component."""

    def __init__(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ):
        """Initialize NewModule.

        Args:
            vpc_id: VPC ID for resource placement
            security_group_id: Security group to attach
        """
        self._create_resources(vpc_id, security_group_id)

    def _create_resources(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ) -> None:
        """Create module resources."""
        self.resource = aws.service.Resource(
            get_resource_name("component"),
            vpc_id=vpc_id,
            security_groups=[security_group_id],
            tags=get_tags(service="component"),
        )

        # Export resource attributes
        self.resource_id = self.resource.id
        self.resource_arn = self.resource.arn

Integration Checklist

  1. Create module file in infra/modules/
  2. Add import to infra/__main__.py
  3. Add to correct phase based on dependencies
  4. Export outputs via pulumi.export()
  5. Update config.py if new canonical values needed
  6. Add tests in infra/tests/
  7. Update documentation (this guide, pulumi-modules.md)

Example: Adding to __main__.py

# In Phase 2 section of __main__.py
from modules.new_module import NewModule

# After dependencies are created
new_module = NewModule(
    vpc_id=vpc.vpc.id,
    security_group_id=security_groups.ecs_sg_id,
)
pulumi.export("new_resource_id", new_module.resource_id)
pulumi.export("new_resource_arn", new_module.resource_arn)

Troubleshooting

Common Issues

| Error | Cause | Solution |
| --- | --- | --- |
| error: no Pulumi.yaml | Wrong directory | cd infra/ |
| error: stack 'dev' not found | Stack not initialized | pulumi stack init dev |
| error: passphrase must be set | Missing passphrase | Set PULUMI_CONFIG_PASSPHRASE |
| error: unable to deserialize | State corruption | Restore from backup |
| error: resource already exists | Orphaned resource | Import or delete manually |

State Lock Issues

# Check for lock
aws s3api head-object --bucket $BUCKET --key .pulumi/locks/...

# Force unlock (use with caution)
pulumi cancel

Import Existing Resources

# Import an existing resource into state
pulumi import aws:s3/bucket:Bucket my-bucket existing-bucket-name

# Import an ECS service by its AWS identifier
pulumi import aws:ecs/service:Service backend arn:aws:ecs:us-east-1:123:service/cluster/service

Debug Mode

# Enable hidden debug commands
PULUMI_DEBUG_COMMANDS=1 pulumi up

# Trace logging
pulumi up --logtostderr -v=9

# Dry run with detailed output
pulumi preview --diff --debug

CI/CD Integration

Bitbucket Pipelines

| Pipeline | Description |
| --- | --- |
| infra-preview | Preview changes (PR validation) |
| infra-deploy-dev | Deploy to dev |
| infra-deploy-staging | Deploy to staging |
| infra-deploy-prod | Deploy to production |

Required Variables

| Variable | Description | Secured |
| --- | --- | --- |
| PULUMI_CONFIG_PASSPHRASE | Stack encryption | Yes |
| S3_PULUMI_BACKEND_URL | Backend URL | No |
| AWS_ACCESS_KEY_ID | AWS credentials | Yes |
| AWS_SECRET_ACCESS_KEY | AWS credentials | Yes |
| AWS_REGION | AWS region | No |

Pipeline Example

# bitbucket-pipelines.yml
pipelines:
  custom:
    infra-deploy-dev:
      - step:
          name: Deploy Infrastructure (Dev)
          script:
            - cd infra
            - pulumi login $S3_PULUMI_BACKEND_URL
            - pulumi stack select dev
            - pulumi up --yes

Cost Optimization

| Optimization | Savings | Implementation |
| --- | --- | --- |
| NAT Instance vs Gateway | ~$32/month | nat_instance.py uses t4g.nano |
| FARGATE_SPOT | ~70% on backtests | Enabled in ecs.py |
| Single-AZ (dev/staging) | ~50% on RDS | rds.py multi_az=False |
| db.t4g.micro | ~$12/month | Dev/staging RDS instance |
| On-demand DynamoDB | Pay per use | Default in dynamodb.py |
| S3 lifecycle rules | Auto-cleanup | s3.py lifecycle policies |

Security Best Practices

  1. Secrets: Never in config, always AWS Secrets Manager
  2. RDS: Private subnets only, security group restricted
  3. IAM: Least-privilege roles in iam.py
  4. Security Groups: Minimal ingress rules
  5. WAF: Rate limiting on API Gateway
  6. CloudTrail: Audit logging enabled
  7. VPC Flow Logs: Network traffic logging
  8. Encryption: S3, RDS, DynamoDB at rest