TradAI Deployment Pipeline
Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT | Source: justfile, infra/, scripts/, .github/workflows/
1. TL;DR
4-stage deployment: Lambda images → Pulumi infrastructure (4 stacks in order) → Service containers → ECS force-deploy. Full bootstrap via just infra-bootstrap. Pulumi CI via infra/pulumi-ci.sh.
2. Deployment Architecture
flowchart TD
CA[CodeArtifact Login] --> BW[Build Library Wheels]
BW --> BLB[Build Lambda Base Image]
BLB --> BLA[Build All Lambdas<br/><i>auto-discovers lambdas/*/Dockerfile</i>]
BLA --> ECR_L[Push Lambda Images to ECR]
ECR_L --> P1[Pulumi: persistent<br/><i>S3, DynamoDB, ECR, Cognito, CodeArtifact</i>]
P1 --> P2[Pulumi: foundation<br/><i>VPC, RDS, SQS, SNS, Security Groups</i>]
P2 --> P3[Pulumi: compute<br/><i>ALB, ECS, Lambda, Step Functions</i>]
P3 --> P4[Pulumi: edge<br/><i>API Gateway, WAF, CloudWatch</i>]
P4 --> DBS[Docker Build Services<br/><i>backend, data-collection,<br/>strategy-service, mlflow</i>]
DBS --> ECR_S[Push Service Images to ECR]
ECR_S --> ECS[ECS Force Deploy<br/><i>Rolling update</i>]
style P1 fill:#f9f,stroke:#333
style P2 fill:#f9f,stroke:#333
style P3 fill:#f9f,stroke:#333
style P4 fill:#f9f,stroke:#333
style ECR_L fill:#bbf,stroke:#333
style ECR_S fill:#bbf,stroke:#333
style ECS fill:#bfb,stroke:#333
3. First-Time Bootstrap
Prerequisites
- AWS CLI configured with the tradai profile
- Pulumi CLI installed
- infra/.env populated with AWS_PROFILE, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL
- Docker daemon running
Step-by-step for a new environment:
| Step | Command | What It Does |
|---|---|---|
| 1 | Configure infra/.env | Set AWS_PROFILE=tradai, PULUMI_CONFIG_PASSPHRASE, S3_PULUMI_BACKEND_URL |
| 2 | just codeartifact-login dev | Authenticate pip with CodeArtifact (12h token) |
| 3 | just infra-bootstrap dev | Full pipeline (see breakdown below) |
| 4 | just ecs-status | Verify ECS services are running |
What just infra-bootstrap Runs
The infra-bootstrap recipe executes the following sequence:
1. just infra-up-persistent dev -- S3 buckets, DynamoDB, ECR repos, Cognito, CodeArtifact
2. just infra-up-foundation dev -- VPC, subnets, NAT, RDS, SQS/SNS, security groups
3. just lambda-bootstrap -- Build wheel + base image + all lambdas + push to ECR
4. just service-push-all -- Build all 4 service Docker images + push to ECR
5. just infra-up-compute dev -- ALB, ECS, Lambda functions, Step Functions
6. just infra-up-edge dev -- API Gateway, WAF, CloudWatch dashboards and alarms
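The sequence above can be sketched as a small ordered runner. This is only an illustration of the documented order, not the actual justfile recipe: the function names and the stop-on-first-failure behaviour are assumptions.

```python
import subprocess
from typing import Callable


def bootstrap_steps(stack: str) -> list[list[str]]:
    """Return the documented recipe sequence for one environment."""
    return [
        ["just", "infra-up-persistent", stack],
        ["just", "infra-up-foundation", stack],
        ["just", "lambda-bootstrap"],
        ["just", "service-push-all"],
        ["just", "infra-up-compute", stack],
        ["just", "infra-up-edge", stack],
    ]


def run_command(argv: list[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(argv, check=False).returncode


def run_bootstrap(
    stack: str, runner: Callable[[list[str]], int] = run_command
) -> list[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the names of the recipes that completed successfully.
    Stopping on failure is an assumption of this sketch.
    """
    completed: list[str] = []
    for argv in bootstrap_steps(stack):
        if runner(argv) != 0:
            break
        completed.append(argv[1])
    return completed
```

Passing a custom runner makes it possible to dry-run the order without invoking just or touching AWS.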
Compute Pre-Flight Check
infra-up-compute verifies all Lambda images exist in ECR before deploying. If any are missing, it aborts to prevent partial deployments.
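A minimal sketch of such a pre-flight check follows. It assumes the repository naming tradai/lambda-<name> used in the build chain, and it takes the existence lookup as a parameter (hypothetical here; in practice it might wrap an ECR query) so the logic stays independent of AWS access.

```python
from typing import Callable, Iterable


def missing_lambda_images(
    lambda_names: Iterable[str],
    image_exists: Callable[[str], bool],
) -> list[str]:
    """Return the ECR repository names that have no pushed image."""
    repos = [f"tradai/lambda-{name}" for name in lambda_names]
    return [repo for repo in repos if not image_exists(repo)]


def preflight(
    lambda_names: Iterable[str],
    image_exists: Callable[[str], bool],
) -> None:
    """Abort (raise) before deploying if any Lambda image is missing."""
    missing = missing_lambda_images(lambda_names, image_exists)
    if missing:
        raise RuntimeError("Missing Lambda images in ECR: " + ", ".join(missing))
```

Collecting every missing image before aborting, rather than failing on the first one, gives a single complete error message.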
4. Lambda Deployment Pipeline
4.1 Build Chain
sequenceDiagram
participant W as Wheel Build
participant B as Base Image
participant L as Lambda Images
participant E as ECR
participant A as AWS Lambda
W->>W: uv build libs/tradai-common --out-dir dist
W->>B: tradai_common*.whl
B->>B: docker build -f lambdas/base/Dockerfile
B->>L: tradai/lambda-base:latest
L->>L: docker build -f lambdas/<name>/Dockerfile<br/>(17 lambdas in parallel)
L->>E: docker push tradai/lambda-<name>:tag
E->>A: aws lambda update-function-code --image-uri
The base image (lambdas/base/Dockerfile) uses public.ecr.aws/lambda/python:3.11, installs gcc, g++, make, tar, gzip, and then installs the tradai-common wheel plus boto3, pydantic, pydantic-settings, and httpx. Each individual Lambda extends this base image with its own handler.py.
4.2 Commands
| Command | Description |
|---|---|
| just lambda-build-wheel | Build tradai-common wheel to dist/ |
| just lambda-build-base | Build base image (depends on wheel) |
| just lambda-build <name> | Build a single Lambda image |
| just lambda-build-all | Build base + all Lambda images |
| just lambda-ecr-login | Authenticate Docker with ECR |
| just lambda-push-base | Tag and push base image to ECR |
| just lambda-push <name> | Tag and push a single Lambda to ECR |
| just lambda-push-all | Push all Lambda images to ECR |
| just lambda-bootstrap | Full pipeline: wheel → base → all images → ECR push |
| just lambda-check-images | Verify all Lambda images exist in ECR |
| just lambda-list | List all Lambda functions with Dockerfiles |
4.3 Auto-Discovery
Lambda build discovers all directories in lambdas/ automatically. Any directory containing a Dockerfile (excluding base/) is treated as a Lambda function.
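The discovery rule can be sketched in a few lines. The function name is hypothetical and the real recipe may be implemented in shell. The sketch only mirrors the stated rule: a subdirectory with a Dockerfile, other than base/, counts as a Lambda.

```python
from pathlib import Path


def discover_lambdas(lambdas_dir: Path) -> list[str]:
    """Return sorted names of subdirectories containing a Dockerfile, skipping base/."""
    return sorted(
        child.name
        for child in lambdas_dir.iterdir()
        if child.is_dir()
        and child.name != "base"
        and (child / "Dockerfile").is_file()
    )
```

Calling discover_lambdas(Path("lambdas")) from the repository root would list the directories that the build treats as Lambda functions.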
Current Lambda functions (17):
| Lambda | Purpose |
|---|---|
| backtest-consumer | SQS backtest result processing |
| check-retraining-needed | Evaluate retraining triggers |
| cleanup-resources | Resource cleanup automation |
| compare-models | Model comparison logic |
| data-collection-proxy | Data collection proxy |
| drift-monitor | Model/data drift detection |
| health-check | System health monitoring |
| model-rollback | Model rollback handler |
| notify-completion | Completion notifications |
| orphan-scanner | Orphaned resource detection |
| promote-model | Model promotion handler |
| pulumi-drift-detector | Infrastructure drift detection |
| retraining-scheduler | Retraining schedule management |
| sqs-consumer | SQS message consumer |
| trading-heartbeat-check | Trading system heartbeat |
| update-status | Job status updates in DynamoDB |
| validate-strategy | Strategy validation handler |
18th Lambda: update-nat-routes
The 18th Lambda function (update-nat-routes) is an inline Python handler deployed directly via Pulumi (no Dockerfile). It updates the private route table when the NAT instance is replaced by the ASG. Because it has no Dockerfile, it is not auto-discovered by lambda-bootstrap and is not included in the count above.
5. Infrastructure Deployment (Pulumi)
5.1 Stack Order
The 4 Pulumi stacks must be deployed in strict dependency order:
flowchart LR
P[persistent] --> F[foundation] --> C[compute] --> E[edge]
P:::persistent
F:::foundation
C:::compute
E:::edge
classDef persistent fill:#e1bee7,stroke:#333
classDef foundation fill:#bbdefb,stroke:#333
classDef compute fill:#c8e6c9,stroke:#333
classDef edge fill:#ffe0b2,stroke:#333
| Stack | Resources | Depends On |
|---|---|---|
| persistent | S3 buckets, DynamoDB, ECR repos, Cognito, CloudTrail, CodeArtifact | None (never destroyed in prod) |
| foundation | VPC, subnets, NAT Instance (t4g.nano), RDS, SQS/SNS, security groups | persistent (ECR, S3) |
| compute | ALB, ECS services, Lambda functions, Step Functions, Cloud Map | foundation (VPC, SGs, RDS) |
| edge | API Gateway, WAF, CloudWatch dashboards and alarms | compute (ALB, ECS, Lambda) |
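The deployment order follows directly from the Depends On column. As a sketch, the standard-library graphlib module can derive it from a dependency map. The reversed order for teardown is an assumption of this example, although it matches the way the infra-down recipes list their layers.

```python
from graphlib import TopologicalSorter

# Each stack maps to the set of stacks it depends on.
STACK_DEPENDENCIES: dict[str, set[str]] = {
    "persistent": set(),
    "foundation": {"persistent"},
    "compute": {"foundation"},
    "edge": {"compute"},
}


def deploy_order(dependencies: dict[str, set[str]] = STACK_DEPENDENCIES) -> list[str]:
    """Return stacks so that every stack comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def destroy_order(dependencies: dict[str, set[str]] = STACK_DEPENDENCIES) -> list[str]:
    """Return stacks in reverse deployment order (assumed teardown order)."""
    return list(reversed(deploy_order(dependencies)))
```

Because the four stacks form a single chain, the topological order is unique. A cyclic dependency map would make static_order raise graphlib.CycleError.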
5.2 Commands
| Command | Description |
|---|---|
| just infra-setup | Create infra/.env, sync all stack dependencies |
| just infra-bootstrap <stack> | Full deploy: all stacks + images (see Section 3) |
| just infra-up-persistent <stack> | Deploy persistent stack |
| just infra-up-foundation <stack> | Deploy foundation stack |
| just infra-up-compute <stack> | Deploy compute stack (pre-flight ECR check) |
| just infra-up-edge <stack> | Deploy edge stack |
| just infra-preview <stack> | Preview all 4 stacks |
| just infra-preview-<layer> <stack> | Preview a single stack |
| just infra-down-soft <stack> | Destroy edge + compute only (preserves data) |
| just infra-down-all <stack> | Destroy edge + compute + foundation (preserves persistent) |
| just infra-down-persistent <stack> | Destroy persistent (requires confirmation) |
| just infra-outputs <layer> <stack> | Show stack outputs as JSON |
| just infra-refresh <layer> <stack> | Refresh stack state from cloud |
| just infra-recover <layer> <stack> | Cancel pending ops, refresh, show drift |
| just infra-stack-init <layer> <stack> | Initialize a new Pulumi stack |
| just infra-verify-account | Verify AWS identity for the profile |
| just infra-test | Run Pulumi infrastructure unit tests |
| just infra-lint | Lint infrastructure Python code |
| just infra-check | Lint + test combined |
| just infra-typecheck | MyPy type check on all stacks |
5.3 Pulumi CI Script
infra/pulumi-ci.sh is used by GitHub Actions for automated deployments.
Usage:
Arguments:
| Argument | Values | Description |
|---|---|---|
| layer | persistent, foundation, compute, edge, all | Which stack(s) to operate on |
| stack | dev, staging, prod | Target environment |
| command | preview, up | Pulumi operation |
| s3-backend-url | (optional) | Defaults to $S3_PULUMI_BACKEND_URL |
Required environment variables:
- PULUMI_CONFIG_PASSPHRASE -- Stack encryption passphrase
- S3_PULUMI_BACKEND_URL -- S3 backend for Pulumi state
- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY -- AWS credentials
- AWS_REGION -- (optional, defaults to eu-central-1)
When layer=all, the script runs all stacks in order: persistent → foundation → compute → edge.
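The argument handling described above can be illustrated with a short validation sketch. It is not a translation of infra/pulumi-ci.sh: the function names are hypothetical, and only the allowed values and the layer=all expansion come from the argument table.

```python
STACK_ORDER = ("persistent", "foundation", "compute", "edge")
ENVIRONMENTS = ("dev", "staging", "prod")
COMMANDS = ("preview", "up")


def resolve_layers(layer: str) -> list[str]:
    """Expand 'all' into the ordered stack list; otherwise return the single layer."""
    if layer == "all":
        return list(STACK_ORDER)
    if layer in STACK_ORDER:
        return [layer]
    raise ValueError(f"unknown layer: {layer!r}")


def plan(layer: str, stack: str, command: str) -> list[tuple[str, str, str]]:
    """Return (layer, stack, command) operations in execution order."""
    if stack not in ENVIRONMENTS:
        raise ValueError(f"unknown stack: {stack!r}")
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return [(name, stack, command) for name in resolve_layers(layer)]
```

For example, plan("all", "dev", "preview") yields four preview operations, starting with persistent and ending with edge.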
5.4 Preview Before Deploy
# Preview all stacks for dev environment
just infra-preview dev
# Preview individual stacks
just infra-preview-persistent dev
just infra-preview-foundation dev
just infra-preview-compute dev
just infra-preview-edge dev
On pull requests that touch infra/**, the deploy-infra.yml workflow automatically runs pulumi preview across all stacks for dev, staging, and prod, posting a summary table as a PR comment.
6. Service Container Deployment
6.1 Docker Build
Four services are containerized for ECS deployment:
| Service | Dockerfile | ECR Image |
|---|---|---|
| backend | services/backend/Dockerfile | tradai/backend |
| data-collection | services/data-collection/Dockerfile | tradai/data-collection |
| strategy-service | services/strategy-service/Dockerfile | tradai/strategy-service |
| mlflow | services/mlflow/Dockerfile | tradai/mlflow |
# Build all 4 service images (linux/amd64)
just docker-build
# Build a single service
just docker-build-service backend
6.2 ECR Push
# Login to ECR + build + push all services
just service-push-all
# Push a single service
just service-push backend
# Verify images exist in ECR
just service-check-images
6.3 ECS Redeployment
# Force rolling update on all ECS services
just ecs-force-deploy-all
# Force rolling update on a specific service
just ecs-force-deploy strategy-service
# Check ECS service status
just ecs-status
# View recent ECS events
just ecs-events strategy-service
ECS services targeted by ecs-force-deploy-all: strategy-service, dry-run-trading, live-trading (trading services only).
CI vs Local Redeployment Targets
The docker-build.yml CI workflow redeploys the 4 services it builds: backend-api, data-collection, strategy-service, and mlflow. The just ecs-force-deploy-all command targets only the 3 trading services listed above. To redeploy a specific service locally, use just ecs-force-deploy <service>.
The docker-build.yml GitHub Actions workflow triggers on version tags (after CI passes), builds all 4 service images in parallel, pushes to ECR with both the version tag and latest, then forces a new ECS deployment for backend-api, data-collection, strategy-service, and mlflow.
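The difference between the two redeployment paths is easy to miss, so the following sketch spells it out with plain set arithmetic. The service names are taken from the text above. The constant names are illustrative only.

```python
# Services redeployed by the docker-build.yml workflow.
CI_REDEPLOY = frozenset({"backend-api", "data-collection", "strategy-service", "mlflow"})

# Services targeted by `just ecs-force-deploy-all`.
LOCAL_FORCE_DEPLOY_ALL = frozenset({"strategy-service", "dry-run-trading", "live-trading"})

only_ci = sorted(CI_REDEPLOY - LOCAL_FORCE_DEPLOY_ALL)
only_local = sorted(LOCAL_FORCE_DEPLOY_ALL - CI_REDEPLOY)
both = sorted(CI_REDEPLOY & LOCAL_FORCE_DEPLOY_ALL)

print("CI only:   ", only_ci)
print("Local only:", only_local)
print("Both:      ", both)
```

Only strategy-service is covered by both paths. The other three CI-built services need an explicit just ecs-force-deploy <service> when redeploying locally.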
7. Library Publishing
Libraries are published to AWS CodeArtifact for consumption by the separate tradai-strategies repository.
flowchart LR
L[CodeArtifact Login<br/><code>just codeartifact-login dev</code>] --> B[Build Wheels<br/><code>just build-libs</code>]
B --> P[Publish to CodeArtifact<br/><code>just publish-all-libs dev</code>]
style L fill:#fff3e0
style P fill:#c8e6c9
| Command | Description |
|---|---|
| just codeartifact-login <env> | Authenticate pip with CodeArtifact |
| just build-libs | Build wheels for tradai-common, tradai-data, tradai-strategy |
| just publish-strategy <env> | Publish tradai-strategy wheel to CodeArtifact |
| just publish-all-libs <env> | Publish all library wheels |
Token Expiry
CodeArtifact authorization tokens expire after 12 hours. Re-run just codeartifact-login before any publish or install operation if the token has expired.
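A wrapper script could guard publish or install steps with a simple expiry check. The sketch below assumes the login time is recorded somewhere by the caller, which is hypothetical and not something the recipes are documented to do. It treats a token as expired once 12 hours have elapsed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_LIFETIME = timedelta(hours=12)


def token_expired(logged_in_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once the 12-hour token lifetime has elapsed."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - logged_in_at >= TOKEN_LIFETIME
```

Both timestamps should be timezone-aware, since Python refuses to subtract a naive datetime from an aware one.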
The publish-libs.yml GitHub Actions workflow triggers on version tags (after CI passes) and publishes tradai-strategy to CodeArtifact automatically, with --skip-existing to handle re-runs.
8. Rollback Procedures
Service Rollback
ECS services run the :latest tag, so a rollback means pushing a known-good image under that tag and then forcing a new ECS deployment:
# Rebuild from a known-good commit
git checkout <good-commit>
just docker-build-service backend
just service-push backend
just ecs-force-deploy backend
Lambda Rollback
Rebuild Lambda images from the previous commit and push:
The deploy-lambdas.yml workflow updates each Lambda function's code to the new image URI. AWS Lambda also supports versions and aliases, which allow an instant rollback without rebuilding.
Infrastructure Rollback
# Recover from a failed deployment (cancel pending ops, refresh state)
just infra-recover <layer> <stack>
# Re-deploy with current code (Pulumi converges to desired state)
just infra-up-<layer> <stack>
Pulumi stores state in S3. Each pulumi up converges the actual cloud state to match the declared code. Rolling back infrastructure means running pulumi up with the previous code version.
Destructive Operations
- just infra-down-persistent destroys all data (S3 buckets, DynamoDB tables, ECR repos, Cognito user pool). Requires typing destroy-all-data to confirm.
- just infra-down-all destroys edge + compute + foundation but preserves persistent data.
- just infra-down-soft destroys only edge + compute (safest for dev iteration).
- In production, deletion_protection is enabled on DynamoDB, RDS, ALB, and Cognito. Teardown requires explicit removal of protection first.
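The scope of each teardown recipe and the confirmation gate can be summarised in a short sketch. The mapping and the phrase come from the notes above. The function is hypothetical, and treating the non-persistent recipes as needing no typed phrase is an assumption, since only the persistent teardown is documented as requiring one.

```python
# Layers destroyed by each teardown recipe, in the order listed in the docs.
TEARDOWN_SCOPE: dict[str, list[str]] = {
    "infra-down-soft": ["edge", "compute"],
    "infra-down-all": ["edge", "compute", "foundation"],
    "infra-down-persistent": ["persistent"],
}

CONFIRMATION_PHRASE = "destroy-all-data"


def may_proceed(recipe: str, typed: str = "") -> bool:
    """Return True if the teardown may run given what the operator typed."""
    if recipe not in TEARDOWN_SCOPE:
        raise ValueError(f"unknown teardown recipe: {recipe!r}")
    if recipe == "infra-down-persistent":
        # Exact match only: no trimming or case folding.
        return typed == CONFIRMATION_PHRASE
    return True
```

Requiring an exact phrase rather than a yes/no answer makes it harder to confirm a data-destroying operation by reflex.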
9. Environment Configuration
Local Development
The infra/.env file is sourced automatically by all just infra-* recipes via:
Required variables in infra/.env:
| Variable | Description |
|---|---|
| AWS_PROFILE | AWS CLI profile name (default: tradai) |
| PULUMI_CONFIG_PASSPHRASE | Passphrase for Pulumi stack encryption |
| S3_PULUMI_BACKEND_URL | S3 URL for Pulumi state backend |
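Before running any infra recipe it can help to confirm that infra/.env actually defines these variables. The following is a minimal sketch with hypothetical function names. The parser handles only simple KEY=value lines and does not cover export prefixes, multi-line values, or variable interpolation.

```python
REQUIRED_VARS = ("AWS_PROFILE", "PULUMI_CONFIG_PASSPHRASE", "S3_PULUMI_BACKEND_URL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(text: str) -> list[str]:
    """Return required variables that are absent or empty."""
    values = parse_env(text)
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

A caller could read the file with pathlib.Path("infra/.env").read_text() and abort if missing_vars returns a non-empty list.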
CI/CD (GitHub Actions)
GitHub Actions workflows use repository secrets:
| Secret | Used By |
|---|---|
| AWS_ACCESS_KEY_ID | All deployment workflows |
| AWS_SECRET_ACCESS_KEY | All deployment workflows |
| AWS_REGION | All deployment workflows (default: eu-central-1) |
| PULUMI_CONFIG_PASSPHRASE | deploy-infra.yml |
| S3_PULUMI_BACKEND_URL | deploy-infra.yml |
CI/CD Workflow Triggers
| Workflow | Trigger | Purpose |
|---|---|---|
| ci.yml | Push to main, version tags, PRs, weekly schedule | Lint, typecheck, test, security scan |
| deploy-infra.yml | PR touching infra/**, manual dispatch | Preview on PR, deploy on manual trigger |
| deploy-lambdas.yml | After CI on version tags, manual dispatch | Build + push Lambda images, update functions |
| docker-build.yml | After CI on version tags | Build + push service images, ECS redeploy |
| publish-libs.yml | After CI on version tags | Publish libraries to CodeArtifact |
10. Changelog
| Date | Version | Change |
|---|---|---|
| 2026-03-28 | 1.0.0 | Initial document |
Dependencies
This document references:
| Document | Relationship |
|---|---|
| 02-ARCHITECTURE-OVERVIEW | Overall system architecture |
| 03-VPC-NETWORKING | Network topology deployed by foundation stack |
| 05-SERVICES | ECS service definitions |
| 09-PULUMI-CODE | Pulumi stack implementation details |
| 10-CANONICAL-CONFIG | Configuration constants shared across stacks |
| infra/TEARDOWN.md | Full teardown guide with manual steps |