TradAI Deployment Pipeline

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT | Source: justfile, infra/, scripts/, .github/workflows/


1. TL;DR

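4-stage deployment: Lambda images → Pulumi infrastructure (4 stacks in order) → Service containers → ECS force-deploy. just infra-bootstrap <stack> runs the full bootstrap and interleaves these stages: it deploys the persistent and foundation stacks before building and pushing images (the ECR repositories are created by the persistent stack), then deploys compute and edge (see Section 3). GitHub Actions runs Pulumi through infra/pulumi-ci.sh.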


2. Deployment Architecture

flowchart TD
    CA[CodeArtifact Login] --> BW[Build Library Wheels]
    BW --> BLB[Build Lambda Base Image]
    BLB --> BLA[Build All Lambdas<br/><i>auto-discovers lambdas/*/Dockerfile</i>]
    BLA --> ECR_L[Push Lambda Images to ECR]

    ECR_L --> P1[Pulumi: persistent<br/><i>S3, DynamoDB, ECR, Cognito, CodeArtifact</i>]
    P1 --> P2[Pulumi: foundation<br/><i>VPC, RDS, SQS, SNS, Security Groups</i>]
    P2 --> P3[Pulumi: compute<br/><i>ALB, ECS, Lambda, Step Functions</i>]
    P3 --> P4[Pulumi: edge<br/><i>API Gateway, WAF, CloudWatch</i>]

    P4 --> DBS[Docker Build Services<br/><i>backend, data-collection,<br/>strategy-service, mlflow</i>]
    DBS --> ECR_S[Push Service Images to ECR]
    ECR_S --> ECS[ECS Force Deploy<br/><i>Rolling update</i>]

    style P1 fill:#f9f,stroke:#333
    style P2 fill:#f9f,stroke:#333
    style P3 fill:#f9f,stroke:#333
    style P4 fill:#f9f,stroke:#333
    style ECR_L fill:#bbf,stroke:#333
    style ECR_S fill:#bbf,stroke:#333
    style ECS fill:#bfb,stroke:#333

3. First-Time Bootstrap

Prerequisites

  • AWS CLI configured with tradai profile
  • Pulumi CLI installed
  • infra/.env populated with AWS_PROFILE, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
  • Docker daemon running

Step-by-step for a new environment:

| Step | Command | What It Does |
| --- | --- | --- |
| 1 | Configure infra/.env | Set AWS_PROFILE=tradai, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL |
| 2 | just codeartifact-login dev | Authenticate pip with CodeArtifact (12h token) |
| 3 | just infra-bootstrap dev | Full pipeline (see breakdown below) |
| 4 | just ecs-status | Verify ECS services are running |

What just infra-bootstrap Runs

The infra-bootstrap recipe executes the following sequence:

  1. just infra-up-persistent dev -- S3 buckets, DynamoDB, ECR repos, Cognito, CodeArtifact
  2. just infra-up-foundation dev -- VPC, subnets, NAT, RDS, SQS/SNS, security groups
  3. just lambda-bootstrap -- Build wheel + base image + all lambdas + push to ECR
  4. just service-push-all -- Build all 4 service Docker images + push to ECR
  5. just infra-up-compute dev -- ALB, ECS, Lambda functions, Step Functions
  6. just infra-up-edge dev -- API Gateway, WAF, CloudWatch dashboards and alarms
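
The sequence above can be sketched as a small fail-fast driver. This is a minimal illustration rather than the actual justfile recipe: `bootstrap_commands`, `run_bootstrap`, and `subprocess_runner` are hypothetical names, and the real recipe may pass additional arguments or set up the environment differently.

```python
import subprocess
from typing import Callable, List, Sequence


def bootstrap_commands(stack: str) -> List[List[str]]:
    """Return the documented infra-bootstrap steps, in order, for one stack."""
    return [
        ["just", "infra-up-persistent", stack],
        ["just", "infra-up-foundation", stack],
        ["just", "lambda-bootstrap"],
        ["just", "service-push-all"],
        ["just", "infra-up-compute", stack],
        ["just", "infra-up-edge", stack],
    ]


def run_bootstrap(stack: str, runner: Callable[[Sequence[str]], int]) -> int:
    """Run each step with `runner` and stop at the first non-zero exit code."""
    for cmd in bootstrap_commands(stack):
        code = runner(cmd)
        if code != 0:
            return code  # later steps depend on earlier ones, so do not continue
    return 0


def subprocess_runner(cmd: Sequence[str]) -> int:
    """Hypothetical runner that executes the command and returns its exit code."""
    return subprocess.run(list(cmd)).returncode
```

Calling `run_bootstrap("dev", subprocess_runner)` would execute the six steps for the dev stack. Passing the runner as a parameter keeps the ordering logic testable without invoking just.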

Compute Pre-Flight Check

infra-up-compute verifies all Lambda images exist in ECR before deploying. If any are missing, it aborts to prevent partial deployments.
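
The pre-flight logic amounts to a set difference followed by an abort. The sketch below is an assumption about the shape of that check, not the recipe's actual code: `missing_images` and `preflight` are hypothetical names, and how the set of images present in ECR is obtained is deliberately left out.

```python
from typing import Iterable, List, Set


def missing_images(expected: Iterable[str], present: Set[str]) -> List[str]:
    """Return the expected image names that are absent, sorted for stable output."""
    return sorted(set(expected) - set(present))


def preflight(expected: Iterable[str], present: Set[str]) -> None:
    """Abort with a non-zero exit if any expected Lambda image is missing."""
    missing = missing_images(expected, present)
    if missing:
        raise SystemExit(
            "Aborting compute deploy; missing Lambda images: " + ", ".join(missing)
        )
```

Failing before any Pulumi operation starts is what prevents the partial deployment described above: no compute resources are created if even one image is absent.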


4. Lambda Deployment Pipeline

4.1 Build Chain

sequenceDiagram
    participant W as Wheel Build
    participant B as Base Image
    participant L as Lambda Images
    participant E as ECR
    participant A as AWS Lambda

    W->>W: uv build libs/tradai-common --out-dir dist
    W->>B: tradai_common*.whl
    B->>B: docker build -f lambdas/base/Dockerfile
    B->>L: tradai/lambda-base:latest
    L->>L: docker build -f lambdas/<name>/Dockerfile<br/>(17 lambdas in parallel)
    L->>E: docker push tradai/lambda-<name>:tag
    E->>A: aws lambda update-function-code --image-uri

The base image (lambdas/base/Dockerfile) uses public.ecr.aws/lambda/python:3.11, installs gcc, g++, make, tar, gzip, and then installs the tradai-common wheel plus boto3, pydantic, pydantic-settings, and httpx. Each individual Lambda extends this base image with its own handler.py.

4.2 Commands

| Command | Description |
| --- | --- |
| just lambda-build-wheel | Build tradai-common wheel to dist/ |
| just lambda-build-base | Build base image (depends on wheel) |
| just lambda-build <name> | Build a single Lambda image |
| just lambda-build-all | Build base + all Lambda images |
| just lambda-ecr-login | Authenticate Docker with ECR |
| just lambda-push-base | Tag and push base image to ECR |
| just lambda-push <name> | Tag and push a single Lambda to ECR |
| just lambda-push-all | Push all Lambda images to ECR |
| just lambda-bootstrap | Full pipeline: wheel → base → all images → ECR push |
| just lambda-check-images | Verify all Lambda images exist in ECR |
| just lambda-list | List all Lambda functions with Dockerfiles |

4.3 Auto-Discovery

The Lambda build scans lambdas/ automatically: any directory containing a Dockerfile (excluding base/) is treated as a Lambda function.
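
As an illustration of this discovery rule, the sketch below lists every subdirectory that contains a Dockerfile and skips base. The function name `discover_lambdas` is hypothetical; the actual recipe may implement the same rule with a shell glob instead.

```python
from pathlib import Path
from typing import List


def discover_lambdas(root: Path) -> List[str]:
    """Return names of subdirectories of `root` that hold a Dockerfile, except base."""
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir()
        and entry.name != "base"  # the base image is built separately
        and (entry / "Dockerfile").is_file()
    )
```

Under this rule, adding a new Lambda only requires creating a directory with a Dockerfile; no build list has to be updated by hand.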

Current Lambda functions (17):

| Lambda | Purpose |
| --- | --- |
| backtest-consumer | SQS backtest result processing |
| check-retraining-needed | Evaluate retraining triggers |
| cleanup-resources | Resource cleanup automation |
| compare-models | Model comparison logic |
| data-collection-proxy | Data collection proxy |
| drift-monitor | Model/data drift detection |
| health-check | System health monitoring |
| model-rollback | Model rollback handler |
| notify-completion | Completion notifications |
| orphan-scanner | Orphaned resource detection |
| promote-model | Model promotion handler |
| pulumi-drift-detector | Infrastructure drift detection |
| retraining-scheduler | Retraining schedule management |
| sqs-consumer | SQS message consumer |
| trading-heartbeat-check | Trading system heartbeat |
| update-status | Job status updates in DynamoDB |
| validate-strategy | Strategy validation handler |

18th Lambda: update-nat-routes

The 18th Lambda function (update-nat-routes) is an inline Python handler deployed directly via Pulumi (no Dockerfile). It updates the private route table when the NAT instance is replaced by the ASG. Because it has no Dockerfile, it is not auto-discovered by lambda-bootstrap and is not included in the count above.


5. Infrastructure Deployment (Pulumi)

5.1 Stack Order

The 4 Pulumi stacks must be deployed in strict dependency order:

flowchart LR
    P[persistent] --> F[foundation] --> C[compute] --> E[edge]

    P:::persistent
    F:::foundation
    C:::compute
    E:::edge

    classDef persistent fill:#e1bee7,stroke:#333
    classDef foundation fill:#bbdefb,stroke:#333
    classDef compute fill:#c8e6c9,stroke:#333
    classDef edge fill:#ffe0b2,stroke:#333

| Stack | Resources | Depends On |
| --- | --- | --- |
| persistent | S3 buckets, DynamoDB, ECR repos, Cognito, CloudTrail, CodeArtifact | None (never destroyed in prod) |
| foundation | VPC, subnets, NAT Instance (t4g.nano), RDS, SQS/SNS, security groups | persistent (ECR, S3) |
| compute | ALB, ECS services, Lambda functions, Step Functions, Cloud Map | foundation (VPC, SGs, RDS) |
| edge | API Gateway, WAF, CloudWatch dashboards and alarms | compute (ALB, ECS, Lambda) |
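
The deployment order follows mechanically from the Depends On column. As a minimal sketch (assuming Python 3.9+ for `graphlib`; `DEPENDS_ON` and `deploy_order` are illustrative names, not part of the repository), a topological sort of the dependency table yields the same sequence:

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each stack maps to the stacks it depends on, as listed in the table above.
DEPENDS_ON: Dict[str, Set[str]] = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"foundation"},
    "edge": {"compute"},
}


def deploy_order() -> List[str]:
    """Return the stacks so that every stack comes after its dependencies."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Because the dependencies form a single chain, only one valid order exists, which is why the stacks cannot be deployed in parallel.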

5.2 Commands

| Command | Description |
| --- | --- |
| just infra-setup | Create infra/.env, sync all stack dependencies |
| just infra-bootstrap <stack> | Full deploy: all stacks + images (see Section 3) |
| just infra-up-persistent <stack> | Deploy persistent stack |
| just infra-up-foundation <stack> | Deploy foundation stack |
| just infra-up-compute <stack> | Deploy compute stack (pre-flight ECR check) |
| just infra-up-edge <stack> | Deploy edge stack |
| just infra-preview <stack> | Preview all 4 stacks |
| just infra-preview-<layer> <stack> | Preview a single stack |
| just infra-down-soft <stack> | Destroy edge + compute only (preserves data) |
| just infra-down-all <stack> | Destroy edge + compute + foundation (preserves persistent) |
| just infra-down-persistent <stack> | Destroy persistent (requires confirmation) |
| just infra-outputs <layer> <stack> | Show stack outputs as JSON |
| just infra-refresh <layer> <stack> | Refresh stack state from cloud |
| just infra-recover <layer> <stack> | Cancel pending ops, refresh, show drift |
| just infra-stack-init <layer> <stack> | Initialize a new Pulumi stack |
| just infra-verify-account | Verify AWS identity for the profile |
| just infra-test | Run Pulumi infrastructure unit tests |
| just infra-lint | Lint infrastructure Python code |
| just infra-check | Lint + test combined |
| just infra-typecheck | MyPy type check on all stacks |

5.3 Pulumi CI Script

infra/pulumi-ci.sh is used by GitHub Actions for automated deployments.

Usage:

./infra/pulumi-ci.sh <layer> <stack> <command> [s3-backend-url]

Arguments:

| Argument | Values | Description |
| --- | --- | --- |
| layer | persistent, foundation, compute, edge, all | Which stack(s) to operate on |
| stack | dev, staging, prod | Target environment |
| command | preview, up | Pulumi operation |
| s3-backend-url | (optional) | Defaults to $S3_PULUMI_BACKEND_URL |

Required environment variables:

  • PULUMI_CONFIG_PASSPHRASE -- Stack encryption passphrase
  • S3_PULUMI_BACKEND_URL -- S3 backend for Pulumi state
  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY -- AWS credentials
  • AWS_REGION -- (optional, defaults to eu-central-1)

When layer=all, the script runs all stacks in order: persistent → foundation → compute → edge.
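
The argument handling described above can be modeled as follows. This is a sketch of the documented behavior only; the real script is a shell script, and the names `plan`, `STACK_ORDER`, `ENVIRONMENTS`, and `COMMANDS` are hypothetical.

```python
from typing import List, Tuple

STACK_ORDER = ("persistent", "foundation", "compute", "edge")
ENVIRONMENTS = ("dev", "staging", "prod")
COMMANDS = ("preview", "up")


def plan(layer: str, stack: str, command: str) -> List[Tuple[str, str, str]]:
    """Expand the CLI arguments into ordered (layer, stack, command) operations."""
    if layer != "all" and layer not in STACK_ORDER:
        raise ValueError(f"unknown layer: {layer}")
    if stack not in ENVIRONMENTS:
        raise ValueError(f"unknown stack: {stack}")
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    # layer=all expands to every stack in dependency order.
    layers = STACK_ORDER if layer == "all" else (layer,)
    return [(name, stack, command) for name in layers]
```

For example, `plan("all", "dev", "preview")` yields four operations, one per stack, in the order persistent, foundation, compute, edge.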

5.4 Preview Before Deploy

# Preview all stacks for dev environment
just infra-preview dev

# Preview individual stacks
just infra-preview-persistent dev
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev

On pull requests that touch infra/**, the deploy-infra.yml workflow automatically runs pulumi preview across all stacks for dev, staging, and prod, posting a summary table as a PR comment.


6. Service Container Deployment

6.1 Docker Build

Four services are containerized for ECS deployment:

| Service | Dockerfile | ECR Image |
| --- | --- | --- |
| backend | services/backend/Dockerfile | tradai/backend |
| data-collection | services/data-collection/Dockerfile | tradai/data-collection |
| strategy-service | services/strategy-service/Dockerfile | tradai/strategy-service |
| mlflow | services/mlflow/Dockerfile | tradai/mlflow |

# Build all 4 service images (linux/amd64)
just docker-build

# Build a single service
just docker-build-service backend

6.2 ECR Push

# Login to ECR + build + push all services
just service-push-all

# Push a single service
just service-push backend

# Verify images exist in ECR
just service-check-images

6.3 ECS Redeployment

# Force rolling update on the ECS services targeted by ecs-force-deploy-all (trading services only)
just ecs-force-deploy-all

# Force rolling update on a specific service
just ecs-force-deploy strategy-service

# Check ECS service status
just ecs-status

# View recent ECS events
just ecs-events strategy-service

ECS services targeted by ecs-force-deploy-all: strategy-service, dry-run-trading, live-trading (trading services only).

CI vs Local Redeployment Targets

The docker-build.yml CI workflow redeploys the 4 services it builds: backend-api, data-collection, strategy-service, and mlflow. The just ecs-force-deploy-all command targets only the 3 trading services listed above. To redeploy a specific service locally, use just ecs-force-deploy <service>.
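
The difference between the two target sets is easy to overlook, so the sketch below spells it out with set arithmetic. The service names come from this section; the constant and function names are illustrative only.

```python
from typing import List

# Services redeployed by the docker-build.yml workflow.
CI_TARGETS = frozenset({"backend-api", "data-collection", "strategy-service", "mlflow"})
# Services redeployed by `just ecs-force-deploy-all`.
LOCAL_ALL_TARGETS = frozenset({"strategy-service", "dry-run-trading", "live-trading"})


def only_in_ci() -> List[str]:
    """Services the CI workflow redeploys but ecs-force-deploy-all does not."""
    return sorted(CI_TARGETS - LOCAL_ALL_TARGETS)


def only_in_local() -> List[str]:
    """Services ecs-force-deploy-all redeploys but the CI workflow does not."""
    return sorted(LOCAL_ALL_TARGETS - CI_TARGETS)


def manual_redeploy_commands() -> List[str]:
    """Per-service commands needed locally to cover the CI-only services."""
    return [f"just ecs-force-deploy {name}" for name in only_in_ci()]
```

Only strategy-service appears in both sets, so a local redeploy of backend-api, data-collection, or mlflow needs the per-service command.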

The docker-build.yml GitHub Actions workflow triggers on version tags (after CI passes), builds all 4 service images in parallel, pushes to ECR with both the version tag and latest, then forces a new ECS deployment for backend-api, data-collection, strategy-service, and mlflow.


7. Library Publishing

Libraries are published to AWS CodeArtifact for consumption by the separate tradai-strategies repository.

flowchart LR
    L[CodeArtifact Login<br/><code>just codeartifact-login dev</code>] --> B[Build Wheels<br/><code>just build-libs</code>]
    B --> P[Publish to CodeArtifact<br/><code>just publish-all-libs dev</code>]

    style L fill:#fff3e0
    style P fill:#c8e6c9

| Command | Description |
| --- | --- |
| just codeartifact-login <env> | Authenticate pip with CodeArtifact |
| just build-libs | Build wheels for tradai-common, tradai-data, tradai-strategy |
| just publish-strategy <env> | Publish tradai-strategy wheel to CodeArtifact |
| just publish-all-libs <env> | Publish all library wheels |

Token Expiry

CodeArtifact authorization tokens expire after 12 hours. Re-run just codeartifact-login before any publish or install operation if the token has expired.
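
A script that wraps publish or install steps could guard against an expired token with a check like the one below. This is a sketch under assumptions: the login time would have to be recorded by the caller (nothing in the source says the recipes do this), and the five-minute safety margin and the name `token_expired` are invented for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_LIFETIME = timedelta(hours=12)  # documented CodeArtifact token lifetime


def token_expired(
    issued_at: datetime,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=5),  # assumed safety margin
) -> bool:
    """Return True if the token is expired or will expire within `margin`."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= issued_at + TOKEN_LIFETIME - margin
```

Both datetimes should be timezone-aware; comparing an aware value with a naive one raises a TypeError in Python.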

The publish-libs.yml GitHub Actions workflow triggers on version tags (after CI passes) and publishes tradai-strategy to CodeArtifact automatically, with --skip-existing to handle re-runs.


8. Rollback Procedures

Service Rollback

ECS uses the :latest tag, so rolling back a service means rebuilding the image from a known-good commit, pushing it, and forcing a new ECS deployment:

# Rebuild from a known-good commit
git checkout <good-commit>
just docker-build-service backend
just service-push backend
just ecs-force-deploy backend

Lambda Rollback

Rebuild Lambda images from the previous commit and push:

git checkout <good-commit>
just lambda-bootstrap

The deploy-lambdas.yml workflow updates each Lambda function's code to the new image URI. AWS Lambda also supports published versions and aliases, which allow an instant rollback without rebuilding.

Infrastructure Rollback

# Recover from a failed deployment (cancel pending ops, refresh state)
just infra-recover <layer> <stack>

# Re-deploy with current code (Pulumi converges to desired state)
just infra-up-<layer> <stack>

Pulumi stores state in S3. Each pulumi up converges the actual cloud state to match the declared code. Rolling back infrastructure means running pulumi up with the previous code version.

Destructive Operations

  • just infra-down-persistent destroys all data (S3 buckets, DynamoDB tables, ECR repos, Cognito user pool). Requires typing destroy-all-data to confirm.
  • just infra-down-all destroys edge + compute + foundation but preserves persistent data.
  • just infra-down-soft destroys only edge + compute (safest for dev iteration).
  • In production, deletion_protection is enabled on DynamoDB, RDS, ALB, and Cognito. Teardown requires explicit removal of protection first.
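
The scope of each teardown recipe can be summarized in a small lookup. The sketch below encodes the layers named in the list above; the destroy order shown (reverse of the deploy order) is an assumption based on the stack dependencies, and all names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

DEPLOY_ORDER = ("persistent", "foundation", "compute", "edge")

# Layers destroyed by each recipe, listed in the assumed destroy order.
TEARDOWN_SCOPE: Dict[str, Tuple[str, ...]] = {
    "infra-down-soft": ("edge", "compute"),
    "infra-down-all": ("edge", "compute", "foundation"),
    "infra-down-persistent": ("persistent",),
}


def surviving_layers(recipe: str) -> List[str]:
    """Return the layers a recipe leaves in place."""
    destroyed = TEARDOWN_SCOPE[recipe]
    return [layer for layer in DEPLOY_ORDER if layer not in destroyed]


def is_reverse_deploy_order(layers: Sequence[str]) -> bool:
    """Check that dependents are destroyed before the stacks they depend on."""
    positions = [DEPLOY_ORDER.index(layer) for layer in layers]
    return positions == sorted(positions, reverse=True)
```

Here `surviving_layers("infra-down-soft")` returns persistent and foundation, which matches the note that this recipe preserves data.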

9. Environment Configuration

Local Development

The infra/.env file is sourced automatically by all just infra-* recipes via:

set -a && source ../.env && set +a

Required variables in infra/.env:

| Variable | Description |
| --- | --- |
| AWS_PROFILE | AWS CLI profile name (default: tradai) |
| PULUMI_CONFIG_PASSPHRASE | Passphrase for Pulumi stack encryption |
| S3_PULUMI_BACKEND_URL | S3 URL for Pulumi state backend |

CI/CD (GitHub Actions)

GitHub Actions workflows use repository secrets:

| Secret | Used By |
| --- | --- |
| AWS_ACCESS_KEY_ID | All deployment workflows |
| AWS_SECRET_ACCESS_KEY | All deployment workflows |
| AWS_REGION | All deployment workflows (default: eu-central-1) |
| PULUMI_CONFIG_PASSPHRASE | deploy-infra.yml |
| S3_PULUMI_BACKEND_URL | deploy-infra.yml |

CI/CD Workflow Triggers

| Workflow | Trigger | Purpose |
| --- | --- | --- |
| ci.yml | Push to main, version tags, PRs, weekly schedule | Lint, typecheck, test, security scan |
| deploy-infra.yml | PR touching infra/**, manual dispatch | Preview on PR, deploy on manual trigger |
| deploy-lambdas.yml | After CI on version tags, manual dispatch | Build + push Lambda images, update functions |
| docker-build.yml | After CI on version tags | Build + push service images, ECS redeploy |
| publish-libs.yml | After CI on version tags | Publish libraries to CodeArtifact |

10. Changelog

| Date | Version | Change |
| --- | --- | --- |
| 2026-03-28 | 1.0.0 | Initial document |

Dependencies

This document references:

| Document | Relationship |
| --- | --- |
| 02-ARCHITECTURE-OVERVIEW | Overall system architecture |
| 03-VPC-NETWORKING | Network topology deployed by foundation stack |
| 05-SERVICES | ECS service definitions |
| 09-PULUMI-CODE | Pulumi stack implementation details |
| 10-CANONICAL-CONFIG | Configuration constants shared across stacks |
| infra/TEARDOWN.md | Full teardown guide with manual steps |