TradAI Deployment Pipeline

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT | Source: justfile, infra/, scripts/, .github/workflows/

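1. TL;DR

4-stage deployment: Lambda images → Pulumi infrastructure (4 stacks in order) → Service containers → ECS force-deploy. For a first-time environment, just infra-bootstrap runs the whole pipeline, deploying the persistent and foundation stacks before the images are pushed and the compute and edge stacks afterwards (see Section 3). CI deployments run Pulumi via infra/pulumi-ci.sh.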

2. Deployment Architecture

flowchart TD
    CA[CodeArtifact Login] --> BW[Build Library Wheels]
    BW --> BLB[Build Lambda Base Image]
    BLB --> BLA[Build All Lambdas<br/><i>auto-discovers lambdas/*/Dockerfile</i>]
    BLA --> ECR_L[Push Lambda Images to ECR]

    ECR_L --> P1[Pulumi: persistent<br/><i>S3, DynamoDB, ECR, Cognito, CodeArtifact</i>]
    P1 --> P2[Pulumi: foundation<br/><i>VPC, RDS, SQS, SNS, Security Groups</i>]
    P2 --> P3[Pulumi: compute<br/><i>ALB, ECS, Lambda, Step Functions</i>]
    P3 --> P4[Pulumi: edge<br/><i>API Gateway, WAF, CloudWatch</i>]

    P4 --> DBS[Docker Build Services<br/><i>backend, data-collection,<br/>strategy-service, mlflow</i>]
    DBS --> ECR_S[Push Service Images to ECR]
    ECR_S --> ECS[ECS Force Deploy<br/><i>Rolling update</i>]

    style P1 fill:#f9f,stroke:#333
    style P2 fill:#f9f,stroke:#333
    style P3 fill:#f9f,stroke:#333
    style P4 fill:#f9f,stroke:#333
    style ECR_L fill:#bbf,stroke:#333
    style ECR_S fill:#bbf,stroke:#333
    style ECS fill:#bfb,stroke:#333

3. First-Time Bootstrap

Prerequisites

AWS CLI configured with tradai profile
Pulumi CLI installed
infra/.env populated with AWS_PROFILE, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
Docker daemon running

Step-by-step for a new environment:

| Step | Command | What It Does |
| --- | --- | --- |
| 1 | Configure `infra/.env` | Set `AWS_PROFILE=tradai`, `PULUMI_CONFIG_PASSPHRASE`, `S3_PULUMI_BACKEND_URL` |
| 2 | `just codeartifact-login dev` | Authenticate pip with CodeArtifact (12h token) |
| 3 | `just infra-bootstrap dev` | Full pipeline (see breakdown below) |
| 4 | `just ecs-status` | Verify ECS services are running |

What `just infra-bootstrap` Runs

The infra-bootstrap recipe executes the following sequence:

just infra-up-persistent dev -- S3 buckets, DynamoDB, ECR repos, Cognito, CodeArtifact
just infra-up-foundation dev -- VPC, subnets, NAT, RDS, SQS/SNS, security groups
just lambda-bootstrap -- Build wheel + base image + all lambdas + push to ECR
just service-push-all -- Build all 4 service Docker images + push to ECR
just infra-up-compute dev -- ALB, ECS, Lambda functions, Step Functions
just infra-up-edge dev -- API Gateway, WAF, CloudWatch dashboards and alarms
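
The sequence above can be sketched as an ordered command list. This is an illustrative Python sketch, not the actual justfile recipe: `bootstrap_commands` is a hypothetical helper, and it only builds the list rather than running anything.

```python
from typing import List


def bootstrap_commands(env: str) -> List[List[str]]:
    """Return the `just` invocations in the order listed above.

    Hypothetical helper for illustration; the real recipe lives in the justfile.
    """
    return [
        ["just", "infra-up-persistent", env],
        ["just", "infra-up-foundation", env],
        ["just", "lambda-bootstrap"],
        ["just", "service-push-all"],
        ["just", "infra-up-compute", env],
        ["just", "infra-up-edge", env],
    ]


# Print the plan for the dev environment without executing it.
for cmd in bootstrap_commands("dev"):
    print(" ".join(cmd))
```

The ordering matters: images are pushed after the persistent stack (which creates the ECR repos) and before the compute stack (which checks that the Lambda images exist).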

Compute Pre-Flight Check

infra-up-compute verifies all Lambda images exist in ECR before deploying. If any are missing, it aborts to prevent partial deployments.
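
A minimal sketch of what such a pre-flight check might look like, assuming the expected repository names and the set of images found in ECR have already been collected. `missing_images` and `preflight` are hypothetical names; the actual check is implemented in the justfile and may differ.

```python
from typing import Iterable, List, Set


def missing_images(expected: Iterable[str], in_ecr: Set[str]) -> List[str]:
    """Return the expected image repositories that are not present in ECR."""
    return sorted(name for name in expected if name not in in_ecr)


def preflight(expected: Iterable[str], in_ecr: Set[str]) -> None:
    """Abort before deploying if any expected image is missing."""
    missing = missing_images(expected, in_ecr)
    if missing:
        raise SystemExit("Missing Lambda images in ECR: " + ", ".join(missing))


# Example with two repositories, both present: the check passes silently.
expected = ["tradai/lambda-health-check", "tradai/lambda-update-status"]
preflight(expected, in_ecr=set(expected))
```

Failing fast here keeps the compute stack from being left with some Lambda functions created and others not.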

4. Lambda Deployment Pipeline

4.1 Build Chain

sequenceDiagram
    participant W as Wheel Build
    participant B as Base Image
    participant L as Lambda Images
    participant E as ECR
    participant A as AWS Lambda

    W->>W: uv build libs/tradai-common --out-dir dist
    W->>B: tradai_common*.whl
    B->>B: docker build -f lambdas/base/Dockerfile
    B->>L: tradai/lambda-base:latest
    L->>L: docker build -f lambdas/<name>/Dockerfile<br/>(17 lambdas in parallel)
    L->>E: docker push tradai/lambda-<name>:tag
    E->>A: aws lambda update-function-code --image-uri

The base image (lambdas/base/Dockerfile) is built from public.ecr.aws/lambda/python:3.11. It installs the system packages gcc, g++, make, tar, and gzip, then the tradai-common wheel together with boto3, pydantic, pydantic-settings, and httpx. Each Lambda image extends this base with its own handler.py.

4.2 Commands

| Command | Description |
| --- | --- |
| `just lambda-build-wheel` | Build `tradai-common` wheel to `dist/` |
| `just lambda-build-base` | Build base image (depends on wheel) |
| `just lambda-build <name>` | Build a single Lambda image |
| `just lambda-build-all` | Build base + all Lambda images |
| `just lambda-ecr-login` | Authenticate Docker with ECR |
| `just lambda-push-base` | Tag and push base image to ECR |
| `just lambda-push <name>` | Tag and push a single Lambda to ECR |
| `just lambda-push-all` | Push all Lambda images to ECR |
| `just lambda-bootstrap` | Full pipeline: wheel → base → all images → ECR push |
| `just lambda-check-images` | Verify all Lambda images exist in ECR |
| `just lambda-list` | List all Lambda functions with Dockerfiles |

4.3 Auto-Discovery

The Lambda build scans lambdas/ automatically: any subdirectory containing a Dockerfile (excluding base/) is treated as a Lambda function.
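
The discovery rule can be illustrated with a short Python sketch. The real discovery happens in the justfile; `discover_lambdas` is a hypothetical helper, and the directory layout below is created in a temporary folder purely for demonstration.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_lambdas(root: Path) -> List[str]:
    """Names of subdirectories of `root` that contain a Dockerfile, excluding base/."""
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and d.name != "base" and (d / "Dockerfile").is_file()
    )


# Build a throwaway layout: base/ and two Lambdas have a Dockerfile,
# notes/ (a made-up directory) does not and is therefore skipped.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("base", "health-check", "update-status"):
        (root / name).mkdir()
        (root / name / "Dockerfile").touch()
    (root / "notes").mkdir()
    found = discover_lambdas(root)

print(found)
```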

Current Lambda functions (17):

| Lambda | Purpose |
| --- | --- |
| `backtest-consumer` | SQS backtest result processing |
| `check-retraining-needed` | Evaluate retraining triggers |
| `cleanup-resources` | Resource cleanup automation |
| `compare-models` | Model comparison logic |
| `data-collection-proxy` | Data collection proxy |
| `drift-monitor` | Model/data drift detection |
| `health-check` | System health monitoring |
| `model-rollback` | Model rollback handler |
| `notify-completion` | Completion notifications |
| `orphan-scanner` | Orphaned resource detection |
| `promote-model` | Model promotion handler |
| `pulumi-drift-detector` | Infrastructure drift detection |
| `retraining-scheduler` | Retraining schedule management |
| `sqs-consumer` | SQS message consumer |
| `trading-heartbeat-check` | Trading system heartbeat |
| `update-status` | Job status updates in DynamoDB |
| `validate-strategy` | Strategy validation handler |

18th Lambda: update-nat-routes

The 18th Lambda function (update-nat-routes) is an inline Python handler deployed directly via Pulumi (no Dockerfile). It updates the private route table when the NAT instance is replaced by the ASG. Because it has no Dockerfile, it is not auto-discovered by lambda-bootstrap and is not included in the count above.

5. Infrastructure Deployment (Pulumi)

5.1 Stack Order

The 4 Pulumi stacks must be deployed in strict dependency order:

flowchart LR
    P[persistent] --> F[foundation] --> C[compute] --> E[edge]

    P:::persistent
    F:::foundation
    C:::compute
    E:::edge

    classDef persistent fill:#e1bee7,stroke:#333
    classDef foundation fill:#bbdefb,stroke:#333
    classDef compute fill:#c8e6c9,stroke:#333
    classDef edge fill:#ffe0b2,stroke:#333

| Stack | Resources | Depends On |
| --- | --- | --- |
| persistent | S3 buckets, DynamoDB, ECR repos, Cognito, CloudTrail, CodeArtifact | None (never destroyed in prod) |
| foundation | VPC, subnets, NAT Instance (t4g.nano), RDS, SQS/SNS, security groups | persistent (ECR, S3) |
| compute | ALB, ECS services, Lambda functions, Step Functions, Cloud Map | foundation (VPC, SGs, RDS) |
| edge | API Gateway, WAF, CloudWatch dashboards and alarms | compute (ALB, ECS, Lambda) |
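
The deployment order follows directly from the "Depends On" column. As a sketch, a topological sort over those dependencies (using Python's standard `graphlib`, available from Python 3.9) reproduces it. The `DEPENDS_ON` mapping is transcribed from the table; treating the reverse order as the teardown order is an assumption based on how the `infra-down-*` recipes are described.

```python
from graphlib import TopologicalSorter

# Stack -> stacks it depends on, taken from the "Depends On" column.
DEPENDS_ON = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"foundation"},
    "edge": {"compute"},
}

# static_order() yields each stack only after everything it depends on.
deploy_order = list(TopologicalSorter(DEPENDS_ON).static_order())

# Assumption: teardown walks the same chain in reverse.
destroy_order = list(reversed(deploy_order))

print(deploy_order)   # ['persistent', 'foundation', 'compute', 'edge']
print(destroy_order)  # ['edge', 'compute', 'foundation', 'persistent']
```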

5.2 Commands

| Command | Description |
| --- | --- |
| `just infra-setup` | Create `infra/.env`, sync all stack dependencies |
| `just infra-bootstrap <stack>` | Full deploy: all stacks + images (see Section 3) |
| `just infra-up-persistent <stack>` | Deploy persistent stack |
| `just infra-up-foundation <stack>` | Deploy foundation stack |
| `just infra-up-compute <stack>` | Deploy compute stack (pre-flight ECR check) |
| `just infra-up-edge <stack>` | Deploy edge stack |
| `just infra-preview <stack>` | Preview all 4 stacks |
| `just infra-preview-<layer> <stack>` | Preview a single stack |
| `just infra-down-soft <stack>` | Destroy edge + compute only (preserves data) |
| `just infra-down-all <stack>` | Destroy edge + compute + foundation (preserves persistent) |
| `just infra-down-persistent <stack>` | Destroy persistent (requires confirmation) |
| `just infra-outputs <layer> <stack>` | Show stack outputs as JSON |
| `just infra-refresh <layer> <stack>` | Refresh stack state from cloud |
| `just infra-recover <layer> <stack>` | Cancel pending ops, refresh, show drift |
| `just infra-stack-init <layer> <stack>` | Initialize a new Pulumi stack |
| `just infra-verify-account` | Verify AWS identity for the profile |
| `just infra-test` | Run Pulumi infrastructure unit tests |
| `just infra-lint` | Lint infrastructure Python code |
| `just infra-check` | Lint + test combined |
| `just infra-typecheck` | MyPy type check on all stacks |

5.3 Pulumi CI Script

infra/pulumi-ci.sh is used by GitHub Actions for automated deployments.

Usage:

./infra/pulumi-ci.sh <layer> <stack> <command> [s3-backend-url]

Arguments:

| Argument | Values | Description |
| --- | --- | --- |
| `layer` | `persistent`, `foundation`, `compute`, `edge`, `all` | Which stack(s) to operate on |
| `stack` | `dev`, `staging`, `prod` | Target environment |
| `command` | `preview`, `up` | Pulumi operation |
| `s3-backend-url` | (optional) | Defaults to `$S3_PULUMI_BACKEND_URL` |

Required environment variables:

- PULUMI_CONFIG_PASSPHRASE -- Stack encryption passphrase
- S3_PULUMI_BACKEND_URL -- S3 backend for Pulumi state
- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY -- AWS credentials
- AWS_REGION -- (optional, defaults to eu-central-1)

When layer=all, the script runs all stacks in order: persistent → foundation → compute → edge.
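
The argument handling described above can be sketched in Python. This is not the script itself (which is a shell script): `plan` is a hypothetical function that validates the three positional arguments against the documented values and expands `all` into the ordered stack list.

```python
from typing import List, Tuple

STACK_ORDER = ["persistent", "foundation", "compute", "edge"]
ENVIRONMENTS = {"dev", "staging", "prod"}
COMMANDS = {"preview", "up"}


def plan(layer: str, stack: str, command: str) -> List[Tuple[str, str, str]]:
    """Return the (layer, stack, command) operations to run, in order."""
    if stack not in ENVIRONMENTS:
        raise ValueError(f"unknown stack: {stack}")
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    if layer == "all":
        layers = STACK_ORDER
    elif layer in STACK_ORDER:
        layers = [layer]
    else:
        raise ValueError(f"unknown layer: {layer}")
    return [(name, stack, command) for name in layers]


# layer=all expands to every stack in dependency order.
for op in plan("all", "dev", "preview"):
    print(op)
```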

5.4 Preview Before Deploy

# Preview all stacks for dev environment
just infra-preview dev

# Preview individual stacks
just infra-preview-persistent dev
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev

On pull requests that touch infra/**, the deploy-infra.yml workflow automatically runs pulumi preview across all stacks for dev, staging, and prod, posting a summary table as a PR comment.

6. Service Container Deployment

6.1 Docker Build

Four services are containerized for ECS deployment:

| Service | Dockerfile | ECR Image |
| --- | --- | --- |
| `backend` | `services/backend/Dockerfile` | `tradai/backend` |
| `data-collection` | `services/data-collection/Dockerfile` | `tradai/data-collection` |
| `strategy-service` | `services/strategy-service/Dockerfile` | `tradai/strategy-service` |
| `mlflow` | `services/mlflow/Dockerfile` | `tradai/mlflow` |

# Build all 4 service images (linux/amd64)
just docker-build

# Build a single service
just docker-build-service backend

6.2 ECR Push

# Login to ECR + build + push all services
just service-push-all

# Push a single service
just service-push backend

# Verify images exist in ECR
just service-check-images

6.3 ECS Redeployment

# Force rolling update on all ECS services
just ecs-force-deploy-all

# Force rolling update on a specific service
just ecs-force-deploy strategy-service

# Check ECS service status
just ecs-status

# View recent ECS events
just ecs-events strategy-service

ECS services targeted by ecs-force-deploy-all: strategy-service, dry-run-trading, live-trading (trading services only).

CI vs Local Redeployment Targets

The docker-build.yml CI workflow redeploys the 4 services it builds: backend-api, data-collection, strategy-service, and mlflow. The just ecs-force-deploy-all command targets only the 3 trading services listed above. To redeploy a specific service locally, use just ecs-force-deploy <service>.

The docker-build.yml GitHub Actions workflow triggers on version tags (after CI passes), builds all 4 service images in parallel, pushes to ECR with both the version tag and latest, then forces a new ECS deployment for backend-api, data-collection, strategy-service, and mlflow.
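
To illustrate the dual tagging, the sketch below builds the two image references pushed for one service. The registry host uses the standard ECR naming scheme with a placeholder account ID, and `image_refs` and the version `v1.2.3` are hypothetical; the workflow's actual tagging step may be written differently.

```python
from typing import List

# Placeholder registry: account ID is made up, region is the documented default.
REGISTRY = "123456789012.dkr.ecr.eu-central-1.amazonaws.com"


def image_refs(repo: str, version: str, registry: str = REGISTRY) -> List[str]:
    """Return the version-tagged and latest-tagged references for one image."""
    return [f"{registry}/{repo}:{version}", f"{registry}/{repo}:latest"]


for ref in image_refs("tradai/backend", "v1.2.3"):
    print(ref)
```

Keeping the version tag alongside latest means a specific release stays addressable in ECR even after latest has moved on.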

7. Library Publishing

Libraries are published to AWS CodeArtifact for consumption by the separate tradai-strategies repository.

flowchart LR
    L[CodeArtifact Login<br/><code>just codeartifact-login dev</code>] --> B[Build Wheels<br/><code>just build-libs</code>]
    B --> P[Publish to CodeArtifact<br/><code>just publish-all-libs dev</code>]

    style L fill:#fff3e0
    style P fill:#c8e6c9

| Command | Description |
| --- | --- |
| `just codeartifact-login <env>` | Authenticate pip with CodeArtifact |
| `just build-libs` | Build wheels for tradai-common, tradai-data, tradai-strategy |
| `just publish-strategy <env>` | Publish tradai-strategy wheel to CodeArtifact |
| `just publish-all-libs <env>` | Publish all library wheels |

Token Expiry

CodeArtifact authorization tokens expire after 12 hours. Re-run just codeartifact-login before any publish or install operation if the token has expired.
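
A small sketch of the expiry rule, assuming the time of the last login is recorded somewhere. `token_expired` is a hypothetical helper and is not part of the justfile.

```python
from datetime import datetime, timedelta, timezone

# 12-hour lifetime, as stated above.
TOKEN_LIFETIME = timedelta(hours=12)


def token_expired(logged_in_at: datetime, now: datetime) -> bool:
    """True once the token obtained at `logged_in_at` is 12 hours old or more."""
    return now - logged_in_at >= TOKEN_LIFETIME


login = datetime(2026, 3, 28, 8, 0, tzinfo=timezone.utc)
print(token_expired(login, login + timedelta(hours=3)))   # False
print(token_expired(login, login + timedelta(hours=13)))  # True
```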

The publish-libs.yml GitHub Actions workflow triggers on version tags (after CI passes) and publishes tradai-strategy to CodeArtifact automatically, with --skip-existing to handle re-runs.

8. Rollback Procedures

Service Rollback

ECS services pull the :latest tag, so rolling back means rebuilding a known-good image, pushing it to ECR, and forcing a new ECS deployment:

# Rebuild from a known-good commit
git checkout <good-commit>
just docker-build-service backend
just service-push backend
just ecs-force-deploy backend

Lambda Rollback

Rebuild the Lambda images from a known-good commit and push them:

git checkout <good-commit>
just lambda-bootstrap

The deploy-lambdas.yml workflow updates each Lambda function's code to the new image URI. AWS Lambda also supports published versions and aliases, which allow an instant rollback without rebuilding.

Infrastructure Rollback

# Recover from a failed deployment (cancel pending ops, refresh state)
just infra-recover <layer> <stack>

# Re-deploy with current code (Pulumi converges to desired state)
just infra-up-<layer> <stack>

Pulumi stores state in S3. Each pulumi up converges the actual cloud state to match the declared code. Rolling back infrastructure means running pulumi up with the previous code version.

Destructive Operations

just infra-down-persistent destroys all data (S3 buckets, DynamoDB tables, ECR repos, Cognito user pool). Requires typing destroy-all-data to confirm.
just infra-down-all destroys edge + compute + foundation but preserves persistent data.
just infra-down-soft destroys only edge + compute (safest for dev iteration).
In production, deletion_protection is enabled on DynamoDB, RDS, ALB, and Cognito. Teardown requires explicit removal of protection first.
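
The scope of each teardown recipe can be summarised as data. The sketch below is illustrative only: `TEARDOWN_SCOPE`, `untouched_stacks`, and `confirmed` are hypothetical names, with the stack lists and the confirmation phrase taken from the descriptions above.

```python
from typing import List

STACK_ORDER = ["persistent", "foundation", "compute", "edge"]

# Stacks each recipe destroys, as described above.
TEARDOWN_SCOPE = {
    "infra-down-soft": ["edge", "compute"],
    "infra-down-all": ["edge", "compute", "foundation"],
    "infra-down-persistent": ["persistent"],
}


def untouched_stacks(recipe: str) -> List[str]:
    """Stacks that the given recipe itself does not destroy."""
    destroyed = set(TEARDOWN_SCOPE[recipe])
    return [s for s in STACK_ORDER if s not in destroyed]


def confirmed(typed: str) -> bool:
    """infra-down-persistent only proceeds on the exact confirmation phrase."""
    return typed == "destroy-all-data"


print(untouched_stacks("infra-down-soft"))  # ['persistent', 'foundation']
print(untouched_stacks("infra-down-all"))   # ['persistent']
print(confirmed("yes"))                     # False
```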

9. Environment Configuration

Local Development

The infra/.env file is sourced automatically by all just infra-* recipes via:

set -a && source ../.env && set +a

Required variables in infra/.env:

| Variable | Description |
| --- | --- |
| `AWS_PROFILE` | AWS CLI profile name (default: `tradai`) |
| `PULUMI_CONFIG_PASSPHRASE` | Passphrase for Pulumi stack encryption |
| `S3_PULUMI_BACKEND_URL` | S3 URL for Pulumi state backend |
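
As a sketch, the file can be checked for the required variables before running any recipe. The parser below is a deliberately simplified KEY=VALUE reader, not a full shell parser, so it only approximates what sourcing the file does; `parse_env` and `missing_vars` are hypothetical helpers and the sample bucket name is made up.

```python
from typing import Dict, List

REQUIRED = ("AWS_PROFILE", "PULUMI_CONFIG_PASSPHRASE", "S3_PULUMI_BACKEND_URL")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("\"'")
    return env


def missing_vars(env: Dict[str, str]) -> List[str]:
    """Required variables that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]


sample = """
# infra/.env (example values)
AWS_PROFILE=tradai
S3_PULUMI_BACKEND_URL="s3://example-pulumi-state"
"""
print(missing_vars(parse_env(sample)))  # ['PULUMI_CONFIG_PASSPHRASE']
```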

CI/CD (GitHub Actions)

GitHub Actions workflows use repository secrets:

| Secret | Used By |
| --- | --- |
| `AWS_ACCESS_KEY_ID` | All deployment workflows |
| `AWS_SECRET_ACCESS_KEY` | All deployment workflows |
| `AWS_REGION` | All deployment workflows (default: `eu-central-1`) |
| `PULUMI_CONFIG_PASSPHRASE` | `deploy-infra.yml` |
| `S3_PULUMI_BACKEND_URL` | `deploy-infra.yml` |

CI/CD Workflow Triggers

| Workflow | Trigger | Purpose |
| --- | --- | --- |
| `ci.yml` | Push to main, version tags, PRs, weekly schedule | Lint, typecheck, test, security scan |
| `deploy-infra.yml` | PR touching `infra/**`, manual dispatch | Preview on PR, deploy on manual trigger |
| `deploy-lambdas.yml` | After CI on version tags, manual dispatch | Build + push Lambda images, update functions |
| `docker-build.yml` | After CI on version tags | Build + push service images, ECS redeploy |
| `publish-libs.yml` | After CI on version tags | Publish libraries to CodeArtifact |

10. Changelog

| Date | Version | Change |
| --- | --- | --- |
| 2026-03-28 | 1.0.0 | Initial document |

Dependencies

This document references:

| Document | Relationship |
| --- | --- |
| 02-ARCHITECTURE-OVERVIEW | Overall system architecture |
| 03-VPC-NETWORKING | Network topology deployed by foundation stack |
| 05-SERVICES | ECS service definitions |
| 09-PULUMI-CODE | Pulumi stack implementation details |
| 10-CANONICAL-CONFIG | Configuration constants shared across stacks |
| `infra/TEARDOWN.md` | Full teardown guide with manual steps |