Pulumi Infrastructure Deployment Guide¶
Comprehensive guide for deploying and managing TradAI infrastructure using Pulumi with AWS.
Architecture Overview¶
TradAI infrastructure is split into four Pulumi stacks that are deployed separately. The key design principle is to separate data-bearing resources (never destroyed in dev cycles) from ephemeral compute (freely destroyed and redeployed):
| Stack | Resources | Destroy in dev? |
|---|---|---|
| persistent | S3×5, DynamoDB×12, ECR×24, Cognito, CodeArtifact, CloudTrail, Pulumi backend | Never |
| foundation | VPC, subnets, NAT, SGs, NACLs, VPC endpoints, RDS, SQS, SNS | Rarely |
| compute | IAM roles, ALB, ECS, Lambda, Step Functions, EC2 Consolidated | Freely |
| edge | API Gateway, WAF, CloudWatch Alarms/Dashboard | Freely |
Stacks share data via pulumi.StackReference:
- compute reads from persistent (ECR URLs, S3 IDs, DynamoDB names) and foundation (VPC, SGs, RDS, SQS)
- edge reads from persistent (Cognito), foundation (VPC, SGs, SQS), and compute (ALB endpoints)
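The read relationships above imply a deployment order. A minimal sketch, assuming the dependency map below mirrors the StackReference reads just described (the `STACK_READS` name and the `deploy_order` helper are illustrative and not part of the repository), derives that order with a topological sort:

```python
from graphlib import TopologicalSorter

# Which stacks each stack reads from via StackReference (taken from the list above).
STACK_READS = {
    "persistent": set(),
    "foundation": set(),
    "compute": {"persistent", "foundation"},
    "edge": {"persistent", "foundation", "compute"},
}


def deploy_order(reads: dict[str, set[str]]) -> list[str]:
    """Return an order in which every stack comes after the stacks it reads from."""
    return list(TopologicalSorter(reads).static_order())


order = deploy_order(STACK_READS)
print(order)
```

Destroying stacks would follow the reverse of this order, so that no stack is removed while another still reads its outputs.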
Key Files¶
| File | Purpose |
|---|---|
| `infra/shared/tradai_infra_shared/config.py` | Canonical source of truth for all configuration |
| `infra/shared/tradai_infra_shared/core/` | Shared builders (PolicyBuilder, SecurityRuleBuilder, etc.) |
| `infra/persistent/__main__.py` | Persistent stack entry point (data resources) |
| `infra/foundation/__main__.py` | Foundation stack entry point (networking) |
| `infra/compute/__main__.py` | Compute stack entry point (ECS, Lambda, IAM) |
| `infra/edge/__main__.py` | Edge stack entry point (API GW, WAF) |
| `infra/{stack}/Pulumi.yaml` | Per-stack Pulumi project definition |
| `infra/{stack}/Pulumi.{env}.yaml` | Per-stack environment configuration |
| `infra/{stack}/modules/` | Per-stack infrastructure modules |
Prerequisites¶
# Required tools
python --version # 3.11+
uv --version # UV package manager
pulumi version # Pulumi CLI
aws --version # AWS CLI
# AWS credentials
aws configure list # Verify credentials
Install Pulumi¶
Quick Start¶
For Existing Team Members¶
# 1. Setup environment
just infra-setup
# Edit infra/.env with S3_PULUMI_BACKEND_URL, PULUMI_CONFIG_PASSPHRASE, AWS_PROFILE, AWS_REGION
# 2. Preview all stacks
just infra-preview dev
# 3. Deploy all stacks in order
just infra-bootstrap dev
# Or deploy individually (persistent first):
just infra-up-persistent dev # S3, DynamoDB, ECR, Cognito (no deps)
just infra-up-foundation dev # VPC, RDS, SQS, SNS
just lambda-bootstrap # Push Lambda images to ECR
just infra-up-compute dev # IAM, ECS, ALB, Lambda (images exist now)
just infra-up-edge dev # API Gateway, WAF, CloudWatch
First-Time Bootstrap¶
# 1. Setup environment
just infra-setup
# Edit infra/.env with AWS credentials and Pulumi passphrase
# 2. Full bootstrap (deploys persistent → foundation → pushes images → compute → edge)
just infra-bootstrap dev
# 3. Verify deployment
just infra-outputs persistent dev
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev
Stack Environments¶
| Stack | Description | Key Differences |
|---|---|---|
| dev | Development | Single-AZ, t4g.micro RDS, FARGATE_SPOT, 7-day logs |
| staging | Pre-production | Production-like, single-AZ, t4g.small RDS |
| prod | Production | Multi-AZ, full redundancy, 90-day logs, no Spot for live trading |
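The differences in the table can be captured as typed settings. The sketch below is illustrative only: the `EnvSettings` dataclass, its field names, and the `settings_for` helper are hypothetical, and values the table does not state (for example staging log retention or the prod RDS class) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvSettings:
    multi_az: bool
    rds_instance_class: Optional[str]  # None where the table gives no value
    log_retention_days: Optional[int]  # None where the table gives no value


ENV_SETTINGS = {
    "dev": EnvSettings(multi_az=False, rds_instance_class="db.t4g.micro", log_retention_days=7),
    "staging": EnvSettings(multi_az=False, rds_instance_class="db.t4g.small", log_retention_days=None),
    "prod": EnvSettings(multi_az=True, rds_instance_class=None, log_retention_days=90),
}


def settings_for(env: str) -> EnvSettings:
    """Look up settings for an environment, rejecting unknown names early."""
    try:
        return ENV_SETTINGS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
```

Failing on an unknown environment name keeps a typo such as `stagin` from silently falling back to defaults.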
Selecting a Stack¶
# List available stacks
pulumi stack ls
# Select a stack
pulumi stack select dev
# Create new stack
pulumi stack init staging
Stack Modules¶
Foundation Stack (infra/foundation/)¶
No container image dependencies. Can be deployed first.
| Module | Description | Key Resources |
|---|---|---|
| `vpc.py` | VPC (10.0.0.0/16), 6 subnets (2 AZs, 3 tiers) | VPC, subnets, route tables |
| `vpc_endpoints.py` | Gateway + Interface VPC Endpoints | S3, DynamoDB, ECR, STS, SSM, SQS endpoints |
| `vpc_flow_logs.py` | VPC Flow Logs to CloudWatch | Log group, IAM role |
| `nacl.py` | Network ACLs for all subnets | NACLs, rules |
| `s3.py` | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | Buckets, lifecycle rules |
| `cloudtrail.py` | CloudTrail audit logging | Trail, S3 policy |
| `dynamodb.py` | 8 DynamoDB tables | Tables, GSIs |
| `sns.py` | SNS topics (alerts, registration) | Topics, subscriptions |
| `security_groups.py` | 5 security groups | SGs, ingress/egress rules |
| `nat_instance.py` | NAT instance with ASG (cost-optimized) | ASG, launch template |
| `rds.py` | PostgreSQL for MLflow | RDS instance, subnet group |
| `secret_rotation.py` | RDS secret rotation (30-day) | Lambda, rotation schedule |
| `ecr.py` | 12 ECR repositories | Repos, lifecycle policies |
| `iam.py` | ECS execution, task, and consolidated roles | IAM roles, policies, instance profile |
| `sqs.py` | Backtest queue + DLQ | Queues, redrive policy |
| `cognito.py` | Cognito user pool, M2M client | User pool, app client |
| `codeartifact.py` | CodeArtifact for Python packages | Domain, repository |
Compute Stack (infra/compute/)¶
Needs ECR images. Deploy after just lambda-bootstrap.
| Module | Description | Key Resources |
|---|---|---|
| `alb.py` | Application Load Balancer | ALB, listeners, target groups |
| `ecs.py` | ECS Fargate cluster | Cluster, capacity providers |
| `ecs_services.py` | 4 ECS services (or consolidated EC2) | Services, task definitions |
| `ec2_consolidated.py` | Consolidated EC2 for dev/staging | ASG, launch template, Cloud Map |
| `lambda_funcs.py` | Lambda functions | Functions, event sources |
| `step_functions.py` | Backtest workflow state machine | State machine, IAM role |
Edge Stack (infra/edge/)¶
Needs the compute stack's ALB and the persistent stack's Cognito user pool.
| Module | Description | Key Resources |
|---|---|---|
| `api_gateway.py` | HTTP API Gateway with routes | API, routes, integrations |
| `waf.py` | WAF WebACL for API Gateway | WebACL, rules |
| `cloudwatch_alarms.py` | Composite alarm, heartbeat detection | Alarms, SNS actions |
| `cloudwatch_dashboard.py` | Trading platform dashboard | Dashboard widgets |
Configuration Management¶
config.py - Single Source of Truth¶
All infrastructure values are defined in infra/shared/tradai_infra_shared/config.py:
import pulumi

# Core configuration
PROJECT_NAME = "tradai"
ENVIRONMENT = pulumi.get_stack()  # dev, staging, prod
AWS_REGION = "eu-central-1"

# VPC CIDR
VPC_CIDR = "10.0.0.0/16"

# Services definition
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}

# DynamoDB tables
DYNAMODB_TABLES = {
    "workflow_state": "tradai-workflow-state",
    "health_state": "tradai-health-state",
    "trading_state": "tradai-trading-state",
    # ...
}
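Because every stack imports these values, a malformed entry propagates everywhere. As a hedged sketch (the `validate_services` helper is hypothetical and not part of `config.py`), a check like the following could catch duplicate ports or CPU/memory pairs that Fargate does not accept. The allowed pairs listed here cover only the 256 and 512 CPU sizes.

```python
# Fargate task sizes for the two smallest CPU values (CPU units -> allowed memory in MiB).
FARGATE_MEMORY_BY_CPU = {
    256: {512, 1024, 2048},
    512: {1024, 2048, 3072, 4096},
}

# Copied from the SERVICES map shown above.
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}


def validate_services(services: dict[str, dict[str, int]]) -> list[str]:
    """Return human-readable problems; an empty list means the map passed every check."""
    problems: list[str] = []
    seen_ports: dict[int, str] = {}
    for name, spec in services.items():
        port = spec["port"]
        if port in seen_ports:
            problems.append(f"{name}: port {port} already used by {seen_ports[port]}")
        else:
            seen_ports[port] = name
        allowed = FARGATE_MEMORY_BY_CPU.get(spec["cpu"])
        if allowed is None:
            problems.append(f"{name}: cpu {spec['cpu']} is not covered by this sketch")
        elif spec["memory"] not in allowed:
            problems.append(f"{name}: memory {spec['memory']} is invalid for cpu {spec['cpu']}")
    return problems


print(validate_services(SERVICES))
```

Returning a list of problems rather than raising on the first one lets a test or pre-deploy hook report everything that is wrong in a single run.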
Stack Configuration Overrides¶
Override values per stack in Pulumi.{stack}.yaml:
config:
  aws:region: eu-central-1
  tradai-infrastructure:environment: dev
  tradai-infrastructure:certificate_arn: "arn:aws:acm:..." # Optional for HTTPS
  tradai-infrastructure:api_domain: "api.tradai.io" # Optional custom domain
Common Configuration Commands¶
# Set certificate for HTTPS
pulumi config set certificate_arn "arn:aws:acm:eu-central-1:123:certificate/abc"
# Set alert email
pulumi config set --path 'alert_emails[0]' 'alerts@example.com'
# Set CORS origins
pulumi config set --path 'cors_allowed_origins[0]' 'https://app.tradai.io'
# Configure CloudWatch alarm thresholds
pulumi config set alarm_latency_threshold 5000
pulumi config set alarm_min_strategies 1 # For prod
# Configure Lambda schedules
pulumi config set drift_monitor_schedule "rate(1 day)"
pulumi config set retraining_scheduler_schedule "rate(7 days)"
Helper Functions¶
config.py provides helper functions for consistent naming:
# Standard resource name
get_resource_name("backend") # tradai-backend-dev
# ECR repository name
get_ecr_repo_name("backend") # tradai/backend
# Standard tags
get_tags(service="backend")
# {
# "Project": "tradai",
# "Environment": "dev",
# "Service": "backend",
# "ManagedBy": "pulumi",
# }
# Environment short code (for name length limits)
get_env_short() # "d" for dev, "s" for staging, "p" for prod
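To make the naming conventions concrete, here is one way such helpers could be written. This is a sketch, not the actual contents of `config.py`: the documented helpers take no environment argument (presumably reading `ENVIRONMENT`), whereas this version accepts an explicit `env` parameter so that it runs outside Pulumi, and the handling of unknown environments is an assumption.

```python
PROJECT_NAME = "tradai"

# Short codes for names with tight length limits.
_ENV_SHORT = {"dev": "d", "staging": "s", "prod": "p"}


def get_resource_name(name: str, env: str) -> str:
    """Build the standard <project>-<name>-<env> resource name."""
    return f"{PROJECT_NAME}-{name}-{env}"


def get_ecr_repo_name(service: str) -> str:
    """Build the <project>/<service> ECR repository name."""
    return f"{PROJECT_NAME}/{service}"


def get_tags(service: str, env: str) -> dict[str, str]:
    """Build the standard tag set applied to every resource."""
    return {
        "Project": PROJECT_NAME,
        "Environment": env,
        "Service": service,
        "ManagedBy": "pulumi",
    }


def get_env_short(env: str) -> str:
    """Return the one-letter environment code; unknown names raise KeyError."""
    return _ENV_SHORT[env]
```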
Daily Operations¶
Preview and Deploy¶
# Preview all stacks
just infra-preview dev
# Preview a single stack
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev
# Deploy all stacks in order
just infra-bootstrap dev
# Deploy a single stack
just infra-up-foundation dev
just infra-up-compute dev
just infra-up-edge dev
View Outputs¶
# Stack outputs (resource IDs, ARNs, endpoints)
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev
# Filter a specific output
just infra-outputs compute dev | jq -r '.alb_dns_name'
Service Operations¶
# ECS service status
just ecs-status
# Force ECS redeployment (after image push)
just ecs-force-deploy backend
just ecs-force-deploy-all
# ASG refresh (consolidated EC2 — picks up new launch template)
just asg-refresh
just asg-status
# Push service images to ECR
just service-ecr-login
just service-push backend
just service-push-all
# Check Lambda images
just lambda-check-images
State Management¶
# Export state backup (run from specific stack dir)
cd infra/foundation && pulumi stack export > state-backup-$(date +%Y%m%d).json
# Refresh state from AWS
cd infra/foundation && set -a && source ../.env && set +a && pulumi refresh
# View resource URNs
cd infra/foundation && set -a && source ../.env && set +a && pulumi stack --show-urns
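Before relying on a state backup, it can help to confirm that the file parses and contains what you expect. The sketch below assumes the checkpoint layout written by `pulumi stack export` (a top-level `deployment` object holding a `resources` list whose entries carry a `type`). The `summarize_state` helper and the sample data are illustrative, not taken from the repository.

```python
import json
from collections import Counter


def summarize_state(raw: str) -> Counter:
    """Count resources by type in the JSON text of an exported stack state."""
    state = json.loads(raw)
    resources = state.get("deployment", {}).get("resources") or []
    return Counter(resource["type"] for resource in resources)


# Tiny hand-written sample standing in for a real export file.
sample = json.dumps(
    {
        "version": 3,
        "deployment": {
            "resources": [
                {"type": "pulumi:pulumi:Stack"},
                {"type": "aws:ec2/vpc:Vpc"},
                {"type": "aws:ec2/subnet:Subnet"},
                {"type": "aws:ec2/subnet:Subnet"},
            ]
        },
    }
)

summary = summarize_state(sample)
print(summary.most_common())
```

For a real backup, pass the text of the exported JSON file instead of `sample`. A sudden drop in resource counts between two backups is a cheap signal that something was removed unexpectedly.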
Rollback¶
# Pulumi doesn't have native rollback. Options:
# 1. Redeploy previous code from git
git checkout HEAD~1 -- infra/
just infra-bootstrap dev
# 2. Import previous state backup
cd infra/foundation && pulumi stack import --file state-backup.json
just infra-up-foundation dev
# 3. Manual ECS service rollback
aws ecs update-service --cluster tradai-dev --service tradai-backend-dev \
  --task-definition tradai-backend-dev:PREVIOUS_REVISION \
  --profile tradai --region eu-central-1
Adding New Modules¶
Module Template¶
"""New Module Description.
Task ID: XX001
Dependencies: vpc, security_groups
"""

import pulumi
import pulumi_aws as aws
from tradai_infra_shared.config import (
    ENVIRONMENT,
    PROJECT_NAME,
    get_tags,
)


class NewModule:
    """Creates new infrastructure component."""

    def __init__(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ) -> None:
        """Initialize NewModule.

        Args:
            vpc_id: VPC ID for resource placement
            security_group_id: Security group to attach
        """
        # aws.service.Resource is a placeholder; substitute the real resource class.
        self.resource = aws.service.Resource(
            f"{PROJECT_NAME}-component",
            vpc_id=vpc_id,
            security_groups=[security_group_id],
            tags=get_tags("component")
            | {"Name": f"{PROJECT_NAME}-component-{ENVIRONMENT}"},
        )

    @property
    def resource_id(self) -> pulumi.Output[str]:
        """Get resource ID."""
        return self.resource.id
Integration Checklist¶
- Create module file in the appropriate stack's `modules/` directory
- Add import to the stack's `__main__.py`
- Choose the right stack: persistent (data-bearing resources), foundation (no image deps), compute (needs ECR images), edge (needs compute)
- Export outputs via `pulumi.export()` if other stacks need the values
- Update shared config if new canonical values are needed (`infra/shared/tradai_infra_shared/config.py`)
- Add tests in the stack's `tests/` directory
- Update documentation (this guide)
Example: Adding to a Stack¶
# In compute/__main__.py
import pulumi

from modules.new_module import NewModule

# Read from the foundation stack of the same environment via StackReference
# (the stack name is derived from the current stack instead of hardcoding "dev")
foundation_ref = pulumi.StackReference(f"tradai-foundation/{pulumi.get_stack()}")
vpc_id = foundation_ref.get_output("vpc_id")
new_module = NewModule(
    vpc_id=vpc_id,
    security_group_id=foundation_ref.get_output("ecs_sg_id"),
)
pulumi.export("new_resource_id", new_module.resource_id)
Troubleshooting¶
Common Issues¶
| Error | Cause | Solution |
|---|---|---|
| `error: no Pulumi.yaml` | Wrong directory | `cd infra/foundation/` (or persistent, compute, edge) |
| `error: stack 'dev' not found` | Stack not initialized | `pulumi stack init dev` |
| `error: passphrase must be set` | Missing passphrase | `source infra/.env` or set `PULUMI_CONFIG_PASSPHRASE` |
| `error: unable to deserialize` | State corruption | Restore from backup |
| `error: resource already exists` | Orphaned resource | Import or delete manually |
| SSM agent not registering | Custom AMI missing agent | Redeploy compute stack (installs from S3 RPM) |
| S3/dnf 403 errors on EC2 | VPC endpoint policy too restrictive | Check foundation vpc_endpoints.py AllowSystemPackageBuckets |
State Lock Issues¶
# Check for lock
aws s3api head-object --bucket $BUCKET --key .pulumi/locks/...
# Force unlock (use with caution)
pulumi cancel
Import Existing Resources¶
# Import an existing S3 bucket (the type token uses the full module path)
pulumi import aws:s3/bucket:Bucket my-bucket existing-bucket-name
# Import an ECS service (the import ID is cluster-name/service-name)
pulumi import aws:ecs/service:Service backend tradai-dev/tradai-backend-dev
Debug Mode¶
# Expose Pulumi's additional debugging commands
PULUMI_DEBUG_COMMANDS=1 pulumi up
# Trace logging
pulumi up --logtostderr -v=9
# Dry run with detailed output
pulumi preview --diff --debug
Accessing Services¶
Dev/Staging: Consolidated EC2¶
In dev/staging, all services run on a single EC2 instance via Docker Compose (cost optimization). Access via SSM Session Manager (no SSH keys needed):
# Find the consolidated instance
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-consolidated-dev" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text \
  --profile tradai --region eu-central-1
# Interactive shell
aws ssm start-session --target <instance-id> --profile tradai --region eu-central-1
# On the instance:
sudo docker ps # Running containers
sudo docker logs backend-api --tail 50 -f # Service logs
cd /opt/tradai && sudo docker-compose restart # Restart services
Production: ECS Fargate¶
In production, services run on ECS Fargate. Use ECS Exec for debugging:
# List running tasks
aws ecs list-tasks --cluster tradai-prod --service-name tradai-backend-prod \
  --profile tradai --region eu-central-1
# Interactive shell into a container
aws ecs execute-command --cluster tradai-prod --task <task-arn> \
  --container backend --interactive --command "/bin/sh" \
  --profile tradai --region eu-central-1
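The two access paths differ only by environment, so they can be wrapped in one helper. This is a sketch under assumptions: the `shell_command` function is hypothetical, the instance ID or task ARN must be looked up first with the commands shown above, and the profile, region, cluster naming and container name simply mirror the examples in this section.

```python
def shell_command(env: str, target: str, container: str = "backend") -> list[str]:
    """Build the AWS CLI argument list for an interactive shell in the given environment.

    target is an EC2 instance ID for dev/staging and an ECS task ARN for prod.
    """
    common = ["--profile", "tradai", "--region", "eu-central-1"]
    if env in ("dev", "staging"):
        # Consolidated EC2 instance, reached through SSM Session Manager.
        return ["aws", "ssm", "start-session", "--target", target, *common]
    if env == "prod":
        # ECS Fargate task, reached through ECS Exec.
        return [
            "aws", "ecs", "execute-command",
            "--cluster", f"tradai-{env}",
            "--task", target,
            "--container", container,
            "--interactive",
            "--command", "/bin/sh",
            *common,
        ]
    raise ValueError(f"unknown environment: {env!r}")
```

The returned list can be passed to `subprocess.run(..., check=True)`. Building an argument list instead of a shell string avoids quoting problems with task ARNs.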
Service Endpoints¶
All services are fronted by an Application Load Balancer:
| Service | Port | Path | Purpose |
|---|---|---|---|
| backend-api | 8000 | /api/* | API gateway, orchestration |
| data-collection | 8002 | /data/* | Market data fetching |
| strategy-service | 8003 | /strategy/* | Strategy execution, backtesting |
| mlflow | 5000 | /mlflow/* | ML experiment tracking |
# Get ALB DNS
just infra-outputs compute dev | jq -r '.alb_dns_name'
# Test health
curl -s http://<ALB_DNS>/api/health | jq .
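The path-based routing in the table can be modelled in a few lines. The sketch below is not the ALB configuration itself: `ROUTES` and `resolve` are illustrative names, and it assumes simple prefix matching on the path patterns listed.

```python
from typing import Optional

# Path prefix -> (service name, container port), copied from the table above.
ROUTES = {
    "/api/": ("backend-api", 8000),
    "/data/": ("data-collection", 8002),
    "/strategy/": ("strategy-service", 8003),
    "/mlflow/": ("mlflow", 5000),
}


def resolve(path: str) -> Optional[tuple[str, int]]:
    """Return the (service, port) that would receive a request path, or None if no rule matches."""
    for prefix, target in ROUTES.items():
        if path.startswith(prefix):
            return target
    return None


print(resolve("/api/health"))
```

A lookup like this is handy in smoke tests, where each service's health path can be checked against the port it is expected to reach.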
Logs¶
# CloudWatch (consolidated EC2)
aws logs tail /tradai/consolidated/containers --since 30m \
  --profile tradai --region eu-central-1
# CloudWatch (follow in real time)
aws logs tail /tradai/consolidated/containers --follow \
  --profile tradai --region eu-central-1
CI/CD Integration¶
Bitbucket Pipelines¶
| Pipeline | Description |
|---|---|
| `infra-preview` | Preview changes (PR validation) |
| `infra-deploy-dev` | Deploy to dev |
| `infra-deploy-staging` | Deploy to staging |
| `infra-deploy-prod` | Deploy to production |
Required Variables¶
| Variable | Description | Secured |
|---|---|---|
| `PULUMI_CONFIG_PASSPHRASE` | Stack encryption | Yes |
| `S3_PULUMI_BACKEND_URL` | Backend URL | No |
| `AWS_ACCESS_KEY_ID` | AWS credentials | Yes |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials | Yes |
| `AWS_REGION` | AWS region | No |
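A pipeline step can fail fast when one of these variables is missing. A minimal sketch, assuming the variable names from the table (the `missing_variables` helper is hypothetical and not part of the repository):

```python
import os
from collections.abc import Mapping
from typing import Optional

REQUIRED_VARIABLES = (
    "PULUMI_CONFIG_PASSPHRASE",
    "S3_PULUMI_BACKEND_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


def missing_variables(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

Calling `missing_variables()` with no argument inspects the real process environment; a wrapper script could exit non-zero and print the returned names before any Pulumi command runs.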
Pipeline Example¶
# bitbucket-pipelines.yml (simplified)
pipelines:
custom:
infra-deploy-dev:
- step:
name: Deploy Infrastructure (Dev)
script:
- just infra-bootstrap dev
Cost Optimization¶
| Optimization | Savings | Implementation |
|---|---|---|
| NAT Instance vs Gateway | ~$32/month | nat_instance.py uses t4g.nano |
| FARGATE_SPOT | ~70% on backtests | Enabled in ecs.py |
| Single-AZ (dev/staging) | ~50% on RDS | rds.py multi_az=False |
| db.t4g.micro | ~$12/month | Dev RDS instance (staging uses t4g.small) |
| On-demand DynamoDB | Pay per use | Default in dynamodb.py |
| S3 lifecycle rules | Auto-cleanup | s3.py lifecycle policies |
Security Best Practices¶
- Secrets: Never in config, always AWS Secrets Manager
- RDS: Private subnets only, security group restricted
- IAM: Least-privilege roles in `iam.py`
- Security Groups: Minimal ingress rules
- WAF: Rate limiting on API Gateway
- CloudTrail: Audit logging enabled
- VPC Flow Logs: Network traffic logging
- Encryption: S3, RDS, DynamoDB at rest
Related Documentation¶
- Pulumi Modules Reference - Full module inventory
- Infrastructure Issues Runbook - Operational procedures
- Database Operations Runbook - RDS failover, backup/restore
- Pulumi Code Architecture - Detailed code reference