Pulumi Infrastructure Deployment Guide

Comprehensive guide for deploying and managing TradAI infrastructure using Pulumi with AWS.

Architecture Overview

TradAI infrastructure is split into 4 independent Pulumi stacks. The key design principle is separating data-bearing resources (never destroyed in dev cycles) from ephemeral compute (freely destroyed and redeployed):

persistent → foundation → (lambda-bootstrap) → compute → edge
| Stack | Resources | Destroy in dev? |
|---|---|---|
| persistent | S3×5, DynamoDB×12, ECR×24, Cognito, CodeArtifact, CloudTrail, Pulumi backend | Never |
| foundation | VPC, subnets, NAT, SGs, NACLs, VPC endpoints, RDS, SQS, SNS | Rarely |
| compute | IAM roles, ALB, ECS, Lambda, Step Functions, EC2 Consolidated | Freely |
| edge | API Gateway, WAF, CloudWatch Alarms/Dashboard | Freely |

Stacks share data via pulumi.StackReference:

- compute reads from persistent (ECR URLs, S3 IDs, DynamoDB names) and foundation (VPC, SGs, RDS, SQS)
- edge reads from persistent (Cognito), foundation (VPC, SGs, SQS), and compute (ALB endpoints)

Key Files

| File | Purpose |
|---|---|
| infra/shared/tradai_infra_shared/config.py | Canonical source of truth for all configuration |
| infra/shared/tradai_infra_shared/core/ | Shared builders (PolicyBuilder, SecurityRuleBuilder, etc.) |
| infra/persistent/__main__.py | Persistent stack entry point (data resources) |
| infra/foundation/__main__.py | Foundation stack entry point (networking) |
| infra/compute/__main__.py | Compute stack entry point (ECS, Lambda, IAM) |
| infra/edge/__main__.py | Edge stack entry point (API GW, WAF) |
| infra/{stack}/Pulumi.yaml | Per-stack Pulumi project definition |
| infra/{stack}/Pulumi.{env}.yaml | Per-stack environment configuration |
| infra/{stack}/modules/ | Per-stack infrastructure modules |

Prerequisites

# Required tools
python --version          # 3.11+
uv --version              # UV package manager
pulumi version            # Pulumi CLI
aws --version             # AWS CLI

# AWS credentials
aws configure list        # Verify credentials

Install Pulumi

# macOS
brew install pulumi

# Linux
curl -fsSL https://get.pulumi.com | sh

# Verify
pulumi version

Quick Start

For Existing Team Members

# 1. Setup environment
just infra-setup
# Edit infra/.env with S3_PULUMI_BACKEND_URL, PULUMI_CONFIG_PASSPHRASE, AWS_PROFILE, AWS_REGION

# 2. Preview all stacks
just infra-preview dev

# 3. Deploy all stacks in order
just infra-bootstrap dev

# Or deploy individually (persistent first):
just infra-up-persistent dev    # S3, DynamoDB, ECR, Cognito (no deps)
just infra-up-foundation dev    # VPC, RDS, SQS, SNS
just lambda-bootstrap            # Push Lambda images to ECR
just infra-up-compute dev        # IAM, ECS, ALB, Lambda (images exist now)
just infra-up-edge dev           # API Gateway, WAF, CloudWatch

First-Time Bootstrap

# 1. Setup environment
just infra-setup
# Edit infra/.env with AWS credentials and Pulumi passphrase

# 2. Full bootstrap (deploys foundation → pushes images → compute → edge)
just infra-bootstrap dev

# 3. Verify deployment
just infra-outputs persistent dev
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev

Stack Environments

| Stack | Description | Key Differences |
|---|---|---|
| dev | Development | Single-AZ, t4g.micro RDS, FARGATE_SPOT, 7-day logs |
| staging | Pre-production | Production-like, single-AZ, t4g.small RDS |
| prod | Production | Multi-AZ, full redundancy, 90-day logs, no Spot for live trading |
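
The per-environment differences above lend themselves to a small lookup table in config.py. Below is a minimal standalone sketch; the names (ENV_SETTINGS, settings_for) are hypothetical, not the actual config.py API, and staging's Spot/log-retention values and prod's RDS class are placeholders where the table does not state them.

```python
# Hypothetical sketch of per-environment settings keyed by the active
# Pulumi stack name. Values mirror the table above where stated; the rest
# are illustrative placeholders.
ENV_SETTINGS = {
    "dev":     {"multi_az": False, "rds_instance_class": "db.t4g.micro", "fargate_spot": True,  "log_retention_days": 7},
    "staging": {"multi_az": False, "rds_instance_class": "db.t4g.small", "fargate_spot": True,  "log_retention_days": 7},
    "prod":    {"multi_az": True,  "rds_instance_class": "db.t4g.small", "fargate_spot": False, "log_retention_days": 90},
}


def settings_for(environment: str) -> dict:
    """Look up an environment's settings, failing fast on unknown stack names."""
    if environment not in ENV_SETTINGS:
        raise ValueError(f"unknown environment: {environment!r}")
    return ENV_SETTINGS[environment]
```

In the real config.py the environment would come from pulumi.get_stack() rather than being passed in.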

Selecting a Stack

# List available stacks
pulumi stack ls

# Select a stack
pulumi stack select dev

# Create new stack
pulumi stack init staging

Stack Modules

Foundation Stack (infra/foundation/)

No container image dependencies. Can be deployed first.

| Module | Description | Key Resources |
|---|---|---|
| vpc.py | VPC (10.0.0.0/16), 6 subnets (2 AZs, 3 tiers) | VPC, subnets, route tables |
| vpc_endpoints.py | Gateway + Interface VPC Endpoints | S3, DynamoDB, ECR, STS, SSM, SQS endpoints |
| vpc_flow_logs.py | VPC Flow Logs to CloudWatch | Log group, IAM role |
| nacl.py | Network ACLs for all subnets | NACLs, rules |
| s3.py | 5 S3 buckets (configs, results, arcticdb, logs, mlflow) | Buckets, lifecycle rules |
| cloudtrail.py | CloudTrail audit logging | Trail, S3 policy |
| dynamodb.py | 8 DynamoDB tables | Tables, GSIs |
| sns.py | SNS topics (alerts, registration) | Topics, subscriptions |
| security_groups.py | 5 security groups | SGs, ingress/egress rules |
| nat_instance.py | NAT instance with ASG (cost-optimized) | ASG, launch template |
| rds.py | PostgreSQL for MLflow | RDS instance, subnet group |
| secret_rotation.py | RDS secret rotation (30-day) | Lambda, rotation schedule |
| ecr.py | 12 ECR repositories | Repos, lifecycle policies |
| iam.py | ECS execution, task, and consolidated roles | IAM roles, policies, instance profile |
| sqs.py | Backtest queue + DLQ | Queues, redrive policy |
| cognito.py | Cognito user pool, M2M client | User pool, app client |
| codeartifact.py | CodeArtifact for Python packages | Domain, repository |

Compute Stack (infra/compute/)

Needs ECR images. Deploy after just lambda-bootstrap.

| Module | Description | Key Resources |
|---|---|---|
| alb.py | Application Load Balancer | ALB, listeners, target groups |
| ecs.py | ECS Fargate cluster | Cluster, capacity providers |
| ecs_services.py | 4 ECS services (or consolidated EC2) | Services, task definitions |
| ec2_consolidated.py | Consolidated EC2 for dev/staging | ASG, launch template, Cloud Map |
| lambda_funcs.py | Lambda functions | Functions, event sources |
| step_functions.py | Backtest workflow state machine | State machine, IAM role |

Edge Stack (infra/edge/)

Needs outputs from the compute stack (ALB endpoints) and the persistent stack (Cognito).

| Module | Description | Key Resources |
|---|---|---|
| api_gateway.py | HTTP API Gateway with routes | API, routes, integrations |
| waf.py | WAF WebACL for API Gateway | WebACL, rules |
| cloudwatch_alarms.py | Composite alarm, heartbeat detection | Alarms, SNS actions |
| cloudwatch_dashboard.py | Trading platform dashboard | Dashboard widgets |

Configuration Management

config.py - Single Source of Truth

All infrastructure values are defined in infra/shared/tradai_infra_shared/config.py:

# Core configuration
PROJECT_NAME = "tradai"
ENVIRONMENT = pulumi.get_stack()  # dev, staging, prod
AWS_REGION = "eu-central-1"

# VPC CIDR
VPC_CIDR = "10.0.0.0/16"

# Services definition
SERVICES = {
    "backend": {"port": 8000, "cpu": 256, "memory": 512},
    "strategy-service": {"port": 8003, "cpu": 256, "memory": 512},
    "data-collection": {"port": 8002, "cpu": 256, "memory": 512},
    "mlflow": {"port": 5000, "cpu": 256, "memory": 512},
}

# DynamoDB tables
DYNAMODB_TABLES = {
    "workflow_state": "tradai-workflow-state",
    "health_state": "tradai-health-state",
    "trading_state": "tradai-trading-state",
    # ...
}

Stack Configuration Overrides

Override values per stack in Pulumi.{stack}.yaml:

config:
  aws:region: eu-central-1
  tradai-infrastructure:environment: dev
  tradai-infrastructure:certificate_arn: "arn:aws:acm:..."  # Optional for HTTPS
  tradai-infrastructure:api_domain: "api.tradai.io"        # Optional custom domain

Common Configuration Commands

# Set certificate for HTTPS
pulumi config set certificate_arn "arn:aws:acm:eu-central-1:123:certificate/abc"

# Set alert email
pulumi config set --path 'alert_emails[0]' 'alerts@example.com'

# Set CORS origins
pulumi config set --path 'cors_allowed_origins[0]' 'https://app.tradai.io'

# Configure CloudWatch alarm thresholds
pulumi config set alarm_latency_threshold 5000
pulumi config set alarm_min_strategies 1  # For prod

# Configure Lambda schedules
pulumi config set drift_monitor_schedule "rate(1 day)"
pulumi config set retraining_scheduler_schedule "rate(7 days)"

Helper Functions

config.py provides helper functions for consistent naming:

# Standard resource name
get_resource_name("backend")  # tradai-backend-dev

# ECR repository name
get_ecr_repo_name("backend")  # tradai/backend

# Standard tags
get_tags(service="backend")
# {
#   "Project": "tradai",
#   "Environment": "dev",
#   "Service": "backend",
#   "ManagedBy": "pulumi",
# }

# Environment short code (for name length limits)
get_env_short()  # "d" for dev, "s" for staging, "p" for prod
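
The helpers above can be sketched as plain functions. This standalone version hardcodes ENVIRONMENT for illustration; the real config.py derives it from pulumi.get_stack(), and the exact signatures there may differ.

```python
# Standalone sketch of the naming helpers documented above; the actual
# implementations live in infra/shared/tradai_infra_shared/config.py.
PROJECT_NAME = "tradai"
ENVIRONMENT = "dev"  # pulumi.get_stack() in the real config.py


def get_resource_name(component: str) -> str:
    """Standard resource name: tradai-<component>-<env>."""
    return f"{PROJECT_NAME}-{component}-{ENVIRONMENT}"


def get_ecr_repo_name(service: str) -> str:
    """ECR repositories are namespaced under the project: tradai/<service>."""
    return f"{PROJECT_NAME}/{service}"


def get_env_short() -> str:
    """First letter of the environment, for length-limited resource names."""
    return ENVIRONMENT[0]


def get_tags(service: str) -> dict[str, str]:
    """Standard tag set applied to every resource."""
    return {
        "Project": PROJECT_NAME,
        "Environment": ENVIRONMENT,
        "Service": service,
        "ManagedBy": "pulumi",
    }
```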

Daily Operations

Preview and Deploy

# Preview all stacks
just infra-preview dev

# Preview a single stack
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev

# Deploy all stacks in order
just infra-bootstrap dev

# Deploy a single stack
just infra-up-foundation dev
just infra-up-compute dev
just infra-up-edge dev

View Outputs

# Stack outputs (resource IDs, ARNs, endpoints)
just infra-outputs foundation dev
just infra-outputs compute dev
just infra-outputs edge dev

# Filter a specific output
just infra-outputs compute dev | jq -r '.alb_dns_name'

Service Operations

# ECS service status
just ecs-status

# Force ECS redeployment (after image push)
just ecs-force-deploy backend
just ecs-force-deploy-all

# ASG refresh (consolidated EC2 — picks up new launch template)
just asg-refresh
just asg-status

# Push service images to ECR
just service-ecr-login
just service-push backend
just service-push-all

# Check Lambda images
just lambda-check-images

State Management

# Export state backup (run from specific stack dir)
cd infra/foundation && pulumi stack export > state-backup-$(date +%Y%m%d).json

# Refresh state from AWS
cd infra/foundation && set -a && source ../.env && set +a && pulumi refresh

# View resource URNs
cd infra/foundation && set -a && source ../.env && set +a && pulumi stack --show-urns

Rollback

# Pulumi doesn't have native rollback. Options:

# 1. Redeploy previous code from git
git checkout HEAD~1 -- infra/
just infra-bootstrap dev

# 2. Import previous state backup
cd infra/foundation && pulumi stack import --file state-backup.json
just infra-up-foundation dev

# 3. Manual ECS service rollback
aws ecs update-service --cluster tradai-dev --service tradai-backend-dev \
  --task-definition tradai-backend-dev:PREVIOUS_REVISION \
  --profile tradai --region eu-central-1

Adding New Modules

Module Template

"""New Module Description.

Task ID: XX001
Dependencies: vpc, security_groups
"""

import pulumi
import pulumi_aws as aws
from tradai_infra_shared.config import (
    ENVIRONMENT,
    PROJECT_NAME,
    get_tags,
)


class NewModule:
    """Creates new infrastructure component."""

    def __init__(
        self,
        vpc_id: pulumi.Input[str],
        security_group_id: pulumi.Input[str],
    ) -> None:
        """Initialize NewModule.

        Args:
            vpc_id: VPC ID for resource placement
            security_group_id: Security group to attach
        """
        self.resource = aws.service.Resource(
            f"{PROJECT_NAME}-component",
            vpc_id=vpc_id,
            security_groups=[security_group_id],
            tags=get_tags("component")
            | {"Name": f"{PROJECT_NAME}-component-{ENVIRONMENT}"},
        )

    @property
    def resource_id(self) -> pulumi.Output[str]:
        """Get resource ID."""
        return self.resource.id

Integration Checklist

  1. Create module file in the appropriate stack's modules/ directory
  2. Add import to the stack's __main__.py
  3. Choose the right stack: foundation (no image deps), compute (needs ECR images), edge (needs compute)
  4. Export outputs via pulumi.export() if other stacks need the values
  5. Update shared config if new canonical values needed (infra/shared/tradai_infra_shared/config.py)
  6. Add tests in the stack's tests/ directory
  7. Update documentation (this guide)

Example: Adding to a Stack

# In compute/__main__.py
from modules.new_module import NewModule

# Read from foundation stack via StackReference
foundation_ref = pulumi.StackReference("tradai-foundation/dev")
vpc_id = foundation_ref.get_output("vpc_id")

new_module = NewModule(
    vpc_id=vpc_id,
    security_group_id=foundation_ref.get_output("ecs_sg_id"),
)
pulumi.export("new_resource_id", new_module.resource_id)

Troubleshooting

Common Issues

| Error | Cause | Solution |
|---|---|---|
| error: no Pulumi.yaml | Wrong directory | cd infra/foundation/ (or compute/edge) |
| error: stack 'dev' not found | Stack not initialized | pulumi stack init dev |
| error: passphrase must be set | Missing passphrase | source infra/.env or set PULUMI_CONFIG_PASSPHRASE |
| error: unable to deserialize | State corruption | Restore from backup |
| error: resource already exists | Orphaned resource | Import or delete manually |
| SSM agent not registering | Custom AMI missing agent | Redeploy compute stack (installs from S3 RPM) |
| S3/dnf 403 errors on EC2 | VPC endpoint policy too restrictive | Check foundation vpc_endpoints.py AllowSystemPackageBuckets |

State Lock Issues

# Check for lock
aws s3api head-object --bucket $BUCKET --key .pulumi/locks/...

# Force unlock (use with caution)
pulumi cancel

Import Existing Resources

# Import existing resource
pulumi import aws:s3:Bucket my-bucket existing-bucket-name

# Import with specific URN
pulumi import aws:ecs:Service backend arn:aws:ecs:eu-central-1:123:service/cluster/service

Debug Mode

# Verbose logging
PULUMI_DEBUG_COMMANDS=1 pulumi up

# Trace logging
pulumi up --logtostderr -v=9

# Dry run with detailed output
pulumi preview --diff --debug

Accessing Services

Dev/Staging: Consolidated EC2

In dev/staging, all services run on a single EC2 instance via Docker Compose (cost optimization). Access via SSM Session Manager (no SSH keys needed):

# Find the consolidated instance
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-consolidated-dev" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text \
  --profile tradai --region eu-central-1

# Interactive shell
aws ssm start-session --target <instance-id> --profile tradai --region eu-central-1

# On the instance:
sudo docker ps                                   # Running containers
sudo docker logs backend-api --tail 50 -f        # Service logs
cd /opt/tradai && sudo docker-compose restart     # Restart services

Production: ECS Fargate

In production, services run on ECS Fargate. Use ECS Exec for debugging:

# List running tasks
aws ecs list-tasks --cluster tradai-prod --service-name tradai-backend-prod \
  --profile tradai --region eu-central-1

# Interactive shell into a container
aws ecs execute-command --cluster tradai-prod --task <task-arn> \
  --container backend --interactive --command "/bin/sh" \
  --profile tradai --region eu-central-1

Service Endpoints

All services are fronted by an Application Load Balancer:

| Service | Port | Path | Purpose |
|---|---|---|---|
| backend-api | 8000 | /api/* | API gateway, orchestration |
| data-collection | 8002 | /data/* | Market data fetching |
| strategy-service | 8003 | /strategy/* | Strategy execution, backtesting |
| mlflow | 5000 | /mlflow/* | ML experiment tracking |

# Get ALB DNS
just infra-outputs compute dev | jq -r '.alb_dns_name'

# Test health
curl -s http://<ALB_DNS>/api/health | jq .

Logs

# CloudWatch (consolidated EC2)
aws logs tail /tradai/consolidated/containers --since 30m \
  --profile tradai --region eu-central-1

# CloudWatch (follow in real time)
aws logs tail /tradai/consolidated/containers --follow \
  --profile tradai --region eu-central-1

CI/CD Integration

Bitbucket Pipelines

| Pipeline | Description |
|---|---|
| infra-preview | Preview changes (PR validation) |
| infra-deploy-dev | Deploy to dev |
| infra-deploy-staging | Deploy to staging |
| infra-deploy-prod | Deploy to production |

Required Variables

| Variable | Description | Secured |
|---|---|---|
| PULUMI_CONFIG_PASSPHRASE | Stack encryption | Yes |
| S3_PULUMI_BACKEND_URL | Backend URL | No |
| AWS_ACCESS_KEY_ID | AWS credentials | Yes |
| AWS_SECRET_ACCESS_KEY | AWS credentials | Yes |
| AWS_REGION | AWS region | No |

Pipeline Example

# bitbucket-pipelines.yml (simplified)
pipelines:
  custom:
    infra-deploy-dev:
      - step:
          name: Deploy Infrastructure (Dev)
          script:
            - just infra-bootstrap dev

Cost Optimization

| Optimization | Savings | Implementation |
|---|---|---|
| NAT Instance vs Gateway | ~$32/month | nat_instance.py uses t4g.nano |
| FARGATE_SPOT | ~70% on backtests | Enabled in ecs.py |
| Single-AZ (dev/staging) | ~50% on RDS | rds.py multi_az=False |
| db.t4g.micro | ~$12/month | Dev/staging RDS instance |
| On-demand DynamoDB | Pay per use | Default in dynamodb.py |
| S3 lifecycle rules | Auto-cleanup | s3.py lifecycle policies |

Security Best Practices

  1. Secrets: Never in config, always AWS Secrets Manager
  2. RDS: Private subnets only, security group restricted
  3. IAM: Least-privilege roles in iam.py
  4. Security Groups: Minimal ingress rules
  5. WAF: Rate limiting on API Gateway
  6. CloudTrail: Audit logging enabled
  7. VPC Flow Logs: Network traffic logging
  8. Encryption: S3, RDS, DynamoDB at rest
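
As a sketch of the least-privilege idea in point 3, a helper in the spirit of the PolicyBuilder mentioned under Key Files might assemble explicit statements instead of wildcards. The function name and shape here are assumptions for illustration, not the shared builder's real API.

```python
# Hypothetical least-privilege policy helper: every statement names its
# actions and resources explicitly; nothing defaults to "*".
import json


def build_policy(statements: list[tuple[list[str], list[str]]]) -> str:
    """Build an IAM policy document from (actions, resources) pairs."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": resources}
            for actions, resources in statements
        ],
    })


# Example: read-only access to a single S3 bucket's objects
policy = build_policy([
    (["s3:GetObject"], ["arn:aws:s3:::tradai-configs-dev/*"]),
])
```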