# Pulumi Infrastructure Deployment Guide

Comprehensive guide for deploying and managing TradAI infrastructure using Pulumi with AWS.
## Architecture Overview

TradAI infrastructure is organized into a state-management phase (Phase 0) followed by four deployment phases:

```
Phase 0: State Management (S3 backend for Pulumi state)
        ↓
Phase 1: Foundation (VPC, storage, security, networking)
        ↓
Phase 2: Compute (ECS, Lambda, ALB, API Gateway)
        ↓
Phase 3: Orchestration (Step Functions, SQS)
        ↓
Phase 4: Monitoring (CloudWatch, WAF)
```
## Key Files

| File | Purpose |
|------|---------|
| `infra/__main__.py` | Main entry point - orchestrates all modules |
| `infra/config.py` | Canonical source of truth for all configuration |
| `infra/Pulumi.yaml` | Pulumi project definition |
| `infra/Pulumi.{stack}.yaml` | Stack-specific configuration |
| `infra/modules/` | 28 infrastructure modules |
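For orientation, a minimal `Pulumi.yaml` for a Python project laid out like this might look as follows; the `description` text is illustrative, only the project name is taken from the stack configuration shown later in this guide:

```yaml
name: tradai-infrastructure
runtime: python
description: TradAI AWS infrastructure managed with Pulumi
```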
## Prerequisites

```bash
# Required tools
python --version     # 3.11+
uv --version         # UV package manager
pulumi version       # Pulumi CLI
aws --version        # AWS CLI

# AWS credentials
aws configure list   # Verify credentials
```
### Install Pulumi

```bash
# macOS
brew install pulumi

# Linux
curl -fsSL https://get.pulumi.com | sh

# Verify
pulumi version
```
## Quick Start

### For Existing Team Members

```bash
# 1. Setup environment
just infra-setup
# Edit infra/.env with S3_PULUMI_BACKEND_URL and PULUMI_CONFIG_PASSPHRASE

# 2. Login to S3 backend
just infra-login

# 3. Select stack
just infra-init dev

# 4. Preview and deploy
just infra-preview
just infra-up
```
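For reference, `infra/.env` ends up holding the two values mentioned above. The bucket name and passphrase here are placeholders, not real values:

```bash
# infra/.env (placeholder values)
S3_PULUMI_BACKEND_URL=s3://tradai-pulumi-state-abc123   # leave empty for first-time bootstrap
PULUMI_CONFIG_PASSPHRASE=change-me
```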
### First-Time Bootstrap

```bash
# 1. Setup environment
just infra-setup
# Edit infra/.env (leave S3_PULUMI_BACKEND_URL empty)

# 2. Bootstrap with local backend
just infra-bootstrap dev

# 3. Deploy infrastructure (creates S3 bucket)
just infra-up

# 4. Get bucket name for migration
pulumi stack output pulumi_state_bucket_name

# 5. Update .env with bucket URL
# S3_PULUMI_BACKEND_URL=s3://tradai-pulumi-state-abc123

# 6. Migrate state to S3
just infra-migrate
```
## Stack Environments

| Stack | Description | Key Differences |
|-------|-------------|-----------------|
| `dev` | Development | Single-AZ, t4g.micro RDS, FARGATE_SPOT, 7-day logs |
| `staging` | Pre-production | Production-like, single-AZ, t4g.small RDS |
| `prod` | Production | Multi-AZ, full redundancy, 90-day logs, no Spot for live trading |
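These differences are typically expressed by branching on the stack name inside `config.py`. The sketch below illustrates that pattern; the `IS_PROD`, `RDS_MULTI_AZ`, `USE_FARGATE_SPOT`, and `LOG_RETENTION_DAYS` names are hypothetical, not the file's actual contents:

```python
import pulumi

ENVIRONMENT = pulumi.get_stack()  # "dev", "staging", or "prod"
IS_PROD = ENVIRONMENT == "prod"

# Hypothetical per-stack switches mirroring the table above
RDS_MULTI_AZ = IS_PROD                      # Multi-AZ only in prod
USE_FARGATE_SPOT = not IS_PROD              # no Spot for live trading
LOG_RETENTION_DAYS = 90 if IS_PROD else 7   # 90-day logs in prod
```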
### Selecting a Stack

```bash
# List available stacks
pulumi stack ls

# Select a stack
pulumi stack select dev

# Create new stack
pulumi stack init staging
```
## Deployment Phases

### Phase 0: Pulumi Backend

Self-hosted S3 state storage enabling team collaboration and CI/CD.

Module: `pulumi_backend.py`

Resources:
- S3 bucket for Pulumi state
- IAM roles for CI/CD and Pulumi Dashboard

Outputs:

```bash
pulumi stack output pulumi_state_bucket_name
pulumi stack output pulumi_backend_role_arn
```
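A minimal sketch of what the state bucket in `pulumi_backend.py` could look like, assuming versioning and a public-access block are applied; the actual module may differ:

```python
import pulumi
import pulumi_aws as aws

# Versioned bucket for self-hosted Pulumi state (illustrative sketch)
state_bucket = aws.s3.Bucket(
    "pulumi-state",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),  # keep state history for recovery
)

# Block all public access to the state bucket
aws.s3.BucketPublicAccessBlock(
    "pulumi-state-public-access-block",
    bucket=state_bucket.id,
    block_public_acls=True,
    block_public_policy=True,
    ignore_public_acls=True,
    restrict_public_buckets=True,
)

pulumi.export("pulumi_state_bucket_name", state_bucket.id)
```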
### Phase 1: Foundation Infrastructure

| Module | Task ID | Description | Dependencies |
|--------|---------|-------------|--------------|
| `vpc.py` | IF002 | VPC (10.0.0.0/16), 6 subnets (2 AZs, 3 tiers) | None |
| `vpc_endpoints.py` | SEC002 | Gateway endpoints (S3, DynamoDB) | VPC |
| `vpc_flow_logs.py` | SEC004 | VPC Flow Logs to CloudWatch | VPC |
| `nacl.py` | SEC005 | Network ACLs for all subnets | VPC |
| `s3.py` | IS001 | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | None |
| `cloudtrail.py` | SEC003 | CloudTrail audit logging | S3 |
| `dynamodb.py` | IS003 | 8 DynamoDB tables | None |
| `sns.py` | MN001 | SNS topics (alerts, registration) | None |
| `security_groups.py` | IF004 | 5 security groups | VPC |
| `nat_instance.py` | IF003 | NAT instance with ASG | VPC, SG |
| `rds.py` | IS002 | PostgreSQL for MLflow | VPC, SG |
| `secret_rotation.py` | SEC006 | RDS secret rotation (30-day) | RDS |
| `ecr.py` | IS004 | 12 ECR repositories | None |
| `codeartifact.py` | SR003 | CodeArtifact for Python packages | None |
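As an illustration of the 2-AZ, 3-tier layout that `vpc.py` implements, a module of this shape could carve the /16 into subnets along the following lines. The CIDR choices, tier names, and AZ identifiers here are assumptions for illustration only:

```python
import pulumi_aws as aws

vpc = aws.ec2.Vpc("tradai-vpc", cidr_block="10.0.0.0/16", enable_dns_hostnames=True)

# Hypothetical tier/AZ layout: 3 tiers x 2 AZs = 6 subnets
subnet_layout = {
    ("public", "us-east-1a"): "10.0.0.0/24",
    ("public", "us-east-1b"): "10.0.1.0/24",
    ("private", "us-east-1a"): "10.0.10.0/24",
    ("private", "us-east-1b"): "10.0.11.0/24",
    ("data", "us-east-1a"): "10.0.20.0/24",
    ("data", "us-east-1b"): "10.0.21.0/24",
}

subnets = {
    (tier, az): aws.ec2.Subnet(
        f"tradai-{tier}-{az}",
        vpc_id=vpc.id,
        cidr_block=cidr,
        availability_zone=az,
    )
    for (tier, az), cidr in subnet_layout.items()
}
```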
### Phase 2: Compute Infrastructure

| Module | Task ID | Description | Dependencies |
|--------|---------|-------------|--------------|
| `iam.py` | IC001 | ECS execution and task roles | None |
| `ecs.py` | IC001 | ECS Fargate cluster, strategy task def | IAM |
| `alb.py` | IC002 | Application Load Balancer | VPC, SG |
| `sqs.py` | IO001 | Backtest queue + DLQ | None |
| `cognito.py` | DK005 | Cognito user pool, M2M client | None |
| `ecs_services.py` | IC003 | 4 ECS services (backend, strategy, data, mlflow) | ECS, ALB, ECR |
| `lambda_funcs.py` | IC004 | 8 Lambda functions | VPC, SG |
| `api_gateway.py` | IC005 | HTTP API Gateway with 11 routes | ALB, Cognito |
| `waf.py` | SEC001 | WAF WebACL for API Gateway | API Gateway |
### Phase 3: Orchestration

| Module | Task ID | Description | Dependencies |
|--------|---------|-------------|--------------|
| `step_functions.py` | IO002 | Backtest workflow state machine | Lambda, ECS |
### Phase 4: Monitoring

| Module | Task ID | Description | Dependencies |
|--------|---------|-------------|--------------|
| `cloudwatch_alarms.py` | MN003 | Composite alarm, heartbeat detection | SNS |
| `cloudwatch_dashboard.py` | OB001 | Trading platform dashboard | None |
## Configuration Management

### config.py - Single Source of Truth

All infrastructure values are defined in `infra/config.py`:

```python
# Core configuration
PROJECT_NAME = "tradai"
ENVIRONMENT = pulumi.get_stack()  # dev, staging, prod
AWS_REGION = "us-east-1"

# VPC CIDR
VPC_CIDR = "10.0.0.0/16"

# Services definition
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}

# DynamoDB tables
DYNAMODB_TABLES = {
    "workflow_state": "tradai-workflow-state",
    "health_state": "tradai-health-state",
    "trading_state": "tradai-trading-state",
    # ...
}
```
### Stack Configuration Overrides

Override values per stack in `Pulumi.{stack}.yaml`:

```yaml
config:
  aws:region: us-east-1
  tradai-infrastructure:environment: dev
  tradai-infrastructure:certificate_arn: "arn:aws:acm:..."  # Optional for HTTPS
  tradai-infrastructure:api_domain: "api.tradai.io"         # Optional custom domain
```
### Common Configuration Commands

```bash
# Set certificate for HTTPS
pulumi config set certificate_arn "arn:aws:acm:us-east-1:123:certificate/abc"

# Set alert email
pulumi config set --path 'alert_emails[0]' 'alerts@example.com'

# Set CORS origins
pulumi config set --path 'cors_allowed_origins[0]' 'https://app.tradai.io'

# Configure CloudWatch alarm thresholds
pulumi config set alarm_latency_threshold 5000
pulumi config set alarm_min_strategies 1  # For prod

# Configure Lambda schedules
pulumi config set drift_monitor_schedule "rate(1 day)"
pulumi config set retraining_scheduler_schedule "rate(7 days)"
```
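On the code side, these values are read back through the standard Pulumi config API. The snippet below is a sketch of that pattern; the variable names are illustrative:

```python
import pulumi

config = pulumi.Config()  # reads this project's config namespace for the selected stack

certificate_arn = config.get("certificate_arn")                    # None if not set
alert_emails = config.get_object("alert_emails") or []             # list built via --path
latency_threshold = config.get_int("alarm_latency_threshold") or 5000
```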
### Helper Functions

config.py provides helper functions for consistent naming:

```python
# Standard resource name
get_resource_name("backend")  # tradai-backend-dev

# ECR repository name
get_ecr_repo_name("backend")  # tradai/backend

# Standard tags
get_tags(service="backend")
# {
#     "Project": "tradai",
#     "Environment": "dev",
#     "Service": "backend",
#     "ManagedBy": "pulumi",
# }

# Environment short code (for name length limits)
get_env_short()  # "d" for dev, "s" for staging, "p" for prod
```
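Implementations consistent with the outputs above could be as simple as the following sketch; the real helpers in `config.py` may differ in detail:

```python
def get_resource_name(name: str) -> str:
    """Build the standard <project>-<name>-<environment> resource name."""
    return f"{PROJECT_NAME}-{name}-{ENVIRONMENT}"

def get_ecr_repo_name(name: str) -> str:
    """Build the <project>/<name> ECR repository name."""
    return f"{PROJECT_NAME}/{name}"

def get_tags(service: str) -> dict[str, str]:
    """Standard tag set applied to every resource."""
    return {
        "Project": PROJECT_NAME,
        "Environment": ENVIRONMENT,
        "Service": service,
        "ManagedBy": "pulumi",
    }

def get_env_short() -> str:
    """One-letter environment code for resources with name length limits."""
    return ENVIRONMENT[0]
```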
## Daily Operations

### Preview Changes

```bash
# Preview all changes
pulumi preview

# Preview with diff
pulumi preview --diff

# Preview specific resource
pulumi preview --target 'urn:pulumi:dev::tradai-infrastructure::aws:ecs:Service::tradai-backend-dev'
```
### Deploy

```bash
# Deploy with confirmation prompt
pulumi up

# Deploy with auto-approve (CI/CD)
pulumi up --yes

# Deploy specific resource
pulumi up --target 'urn:pulumi:dev::tradai-infrastructure::aws:ecs:Service::tradai-backend-dev'

# Skip preview
pulumi up --skip-preview --yes
```
### View Outputs

```bash
# All outputs
pulumi stack output

# Specific output
pulumi stack output vpc_id
pulumi stack output ecs_cluster_name
pulumi stack output api_gateway_endpoint

# JSON format
pulumi stack output --json
```
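Stack outputs are handy for smoke tests and scripts. For example, a post-deploy check could look roughly like this; the `/health` path is an assumption about the backend API, not a documented route:

```bash
# Grab the API endpoint from stack outputs and probe it (illustrative)
API_URL=$(pulumi stack output api_gateway_endpoint)
curl -fsS "${API_URL}/health" && echo "API is up"
```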
### State Management

```bash
# Export state backup
pulumi stack export > state-backup-$(date +%Y%m%d).json

# Import state
pulumi stack import --file state-backup.json

# Refresh state from AWS
pulumi refresh

# View resource URNs
pulumi stack --show-urns
```
### Rollback

```bash
# Pulumi doesn't have native rollback
# Options:

# 1. Redeploy previous code from git
git checkout HEAD~1 -- infra/
pulumi up --yes

# 2. Import previous state backup
pulumi stack import --file state-backup-20240115.json
pulumi up --yes

# 3. Manual resource updates
aws ecs update-service --cluster tradai-dev --service tradai-backend-dev \
  --task-definition tradai-backend-dev:PREVIOUS_REVISION
```
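Because rollback relies on state backups, it can be worth exporting state before every deploy. A minimal wrapper along these lines (illustrative, not an existing `just` recipe) covers that:

```bash
# Illustrative pre-deploy backup wrapper
set -euo pipefail
BACKUP="state-backup-$(date +%Y%m%d-%H%M%S).json"
pulumi stack export > "$BACKUP"
echo "State exported to $BACKUP"
pulumi up --yes
```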
## Adding New Modules

### Module Template

```python
"""New Module Description.

Task ID: XX001
Dependencies: vpc, security_groups
"""

import pulumi
import pulumi_aws as aws

from config import (
    ENVIRONMENT,
    PROJECT_NAME,
    get_resource_name,
    get_tags,
)


class NewModule:
    """Creates new infrastructure component."""

    def __init__(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ):
        """Initialize NewModule.

        Args:
            vpc_id: VPC ID for resource placement
            security_group_id: Security group to attach
        """
        self._create_resources(vpc_id, security_group_id)

    def _create_resources(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ) -> None:
        """Create module resources."""
        self.resource = aws.service.Resource(
            get_resource_name("component"),
            vpc_id=vpc_id,
            security_groups=[security_group_id],
            tags=get_tags(service="component"),
        )

        # Export resource attributes
        self.resource_id = self.resource.id
        self.resource_arn = self.resource.arn
```
### Integration Checklist

1. Create module file in `infra/modules/`
2. Add import to `infra/__main__.py`
3. Add to correct phase based on dependencies
4. Export outputs via `pulumi.export()`
5. Update `config.py` if new canonical values needed
6. Add tests in `infra/tests/`
7. Update documentation (this guide, pulumi-modules.md)

### Example: Adding to `__main__.py`

```python
# In Phase 2 section of __main__.py
from modules.new_module import NewModule

# After dependencies are created
new_module = NewModule(
    vpc_id=vpc.vpc.id,
    security_group_id=security_groups.ecs_sg_id,
)

pulumi.export("new_resource_id", new_module.resource_id)
pulumi.export("new_resource_arn", new_module.resource_arn)
```
## Troubleshooting

### Common Issues

| Error | Cause | Solution |
|-------|-------|----------|
| `error: no Pulumi.yaml` | Wrong directory | `cd infra/` |
| `error: stack 'dev' not found` | Stack not initialized | `pulumi stack init dev` |
| `error: passphrase must be set` | Missing passphrase | Set `PULUMI_CONFIG_PASSPHRASE` |
| `error: unable to deserialize` | State corruption | Restore from backup |
| `error: resource already exists` | Orphaned resource | Import or delete manually |
### State Lock Issues

```bash
# Check for lock
aws s3api head-object --bucket $BUCKET --key .pulumi/locks/...

# Force unlock (use with caution)
pulumi cancel
```
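If you don't know the exact lock key, listing the lock prefix in the backend bucket is a quick way to find it, assuming the default `.pulumi/locks/` layout of the self-managed S3 backend:

```bash
# List any active lock files in the self-managed backend (illustrative)
aws s3 ls "s3://$BUCKET/.pulumi/locks/" --recursive
```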
### Import Existing Resources

```bash
# Import an existing S3 bucket
pulumi import aws:s3/bucket:Bucket my-bucket existing-bucket-name

# Import an existing ECS service
pulumi import aws:ecs/service:Service backend arn:aws:ecs:us-east-1:123:service/cluster/service
```
### Debug Mode

```bash
# Verbose logging
PULUMI_DEBUG_COMMANDS=1 pulumi up

# Trace logging
pulumi up --logtostderr -v=9

# Dry run with detailed output
pulumi preview --diff --debug
```
## CI/CD Integration

### Bitbucket Pipelines

| Pipeline | Description |
|----------|-------------|
| `infra-preview` | Preview changes (PR validation) |
| `infra-deploy-dev` | Deploy to dev |
| `infra-deploy-staging` | Deploy to staging |
| `infra-deploy-prod` | Deploy to production |
### Required Variables

| Variable | Description | Secured |
|----------|-------------|---------|
| `PULUMI_CONFIG_PASSPHRASE` | Stack encryption | Yes |
| `S3_PULUMI_BACKEND_URL` | Backend URL | No |
| `AWS_ACCESS_KEY_ID` | AWS credentials | Yes |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials | Yes |
| `AWS_REGION` | AWS region | No |
### Pipeline Example

```yaml
# bitbucket-pipelines.yml
pipelines:
  custom:
    infra-deploy-dev:
      - step:
          name: Deploy Infrastructure (Dev)
          script:
            - cd infra
            - pulumi login $S3_PULUMI_BACKEND_URL
            - pulumi stack select dev
            - pulumi up --yes
```
## Cost Optimization

| Optimization | Savings | Implementation |
|--------------|---------|----------------|
| NAT Instance vs Gateway | ~$32/month | `nat_instance.py` uses t4g.nano |
| FARGATE_SPOT | ~70% on backtests | Enabled in `ecs.py` |
| Single-AZ (dev/staging) | ~50% on RDS | `rds.py` multi_az=False |
| db.t4g.micro | ~$12/month | Dev/staging RDS instance |
| On-demand DynamoDB | Pay per use | Default in `dynamodb.py` |
| S3 lifecycle rules | Auto-cleanup | `s3.py` lifecycle policies |
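As a sketch of how the FARGATE_SPOT saving in `ecs.py` is typically wired up, a capacity provider strategy like the one below prefers Spot for interruptible backtest work; the resource names and weights are assumptions, not the module's actual values:

```python
import pulumi_aws as aws

cluster = aws.ecs.Cluster("tradai-cluster")

# Prefer Spot capacity for interruptible backtest workloads (illustrative weights)
aws.ecs.ClusterCapacityProviders(
    "tradai-cluster-capacity",
    cluster_name=cluster.name,
    capacity_providers=["FARGATE", "FARGATE_SPOT"],
    default_capacity_provider_strategies=[
        aws.ecs.ClusterCapacityProvidersDefaultCapacityProviderStrategyArgs(
            capacity_provider="FARGATE_SPOT", weight=4,
        ),
        aws.ecs.ClusterCapacityProvidersDefaultCapacityProviderStrategyArgs(
            capacity_provider="FARGATE", weight=1, base=1,
        ),
    ],
)
```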
## Security Best Practices

- Secrets: Never in config, always AWS Secrets Manager
- RDS: Private subnets only, security group restricted
- IAM: Least-privilege roles in `iam.py`
- Security Groups: Minimal ingress rules
- WAF: Rate limiting on API Gateway
- CloudTrail: Audit logging enabled
- VPC Flow Logs: Network traffic logging
- Encryption: S3, RDS, DynamoDB at rest