TradAI CI/CD Pipeline Architecture

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT | Source: .github/workflows/, justfile

TL;DR: GitHub Actions (sole CI/CD). 11 workflows with path-based change detection and dynamic test matrix. Tag-based deployment gates: v* tags trigger docker-build, Lambda deploy, and library publish -- all gated behind CI success via workflow_run. 4-stack Pulumi deployment via just infra-bootstrap. Manual workflow_dispatch available for emergency bypasses.

1. Pipeline Overview

flowchart LR
    subgraph Triggers
        PR[Pull Request]
        Push[Push to main]
        Tag["Tag v*"]
        Manual[workflow_dispatch]
        Schedule[Weekly / Sunday]
    end

    subgraph CI["CI Gate (ci.yml)"]
        Changes[Detect Changes]
        Lint[Lint & Format]
        Type[Type Check]
        Test[Test Matrix]
        Security[Security Scan]
        Perf[Performance Tests]
        Contract[Contract Tests]
    end

    subgraph Deploy["Deployment Workflows"]
        Docker[Docker Build & Push]
        Lambda[Deploy Lambdas]
        Publish[Publish Libraries]
        Infra[Deploy Infrastructure]
        Docs[Deploy Documentation]
    end

    subgraph Targets
        ECR[Amazon ECR]
        ECS[ECS Services]
        LambdaFn[Lambda Functions]
        CA[CodeArtifact]
        CF[Cloudflare Pages]
        Pulumi[Pulumi Stacks]
    end

    PR --> CI
    Push --> CI
    Tag --> CI
    Schedule --> CI

    CI -->|workflow_run + v* tag| Docker
    CI -->|workflow_run + v* tag| Lambda
    CI -->|workflow_run + v* tag| Publish
    Manual --> Infra
    Manual --> Lambda
    Push -->|docs paths| Docs

    Docker --> ECR --> ECS
    Lambda --> ECR --> LambdaFn
    Publish --> CA
    Infra --> Pulumi
    Docs --> CF

2. GitHub Actions Workflows

11 workflow files in .github/workflows/:

Workflow	File	Trigger	Purpose	Timeout
CI	`ci.yml`	push (main), PR, tag `v*`, weekly, dispatch	Orchestrator: change detection, lint, typecheck, test matrix, security, perf, contract	varies
Lint	`_lint.yml`	`workflow_call` (reusable)	Ruff check + format (called by CI)	5 min
Test Package	`_test.yml`	`workflow_call` (reusable)	Per-package pytest with coverage (called by CI matrix)	20 min
Docker Build & Push	`docker-build.yml`	`workflow_run` (CI success)	Build 4 service images, push to ECR, redeploy ECS	--
Deploy Lambdas	`deploy-lambdas.yml`	`workflow_run` (CI success), dispatch	5-stage: version, wheel, base image, individual lambdas, update functions	--
Publish Libraries	`publish-libs.yml`	`workflow_run` (CI success)	Build tradai-strategy wheel, publish to CodeArtifact	--
Deploy Infrastructure	`deploy-infra.yml`	PR (`infra/**`), dispatch	Validate, preview (PR), deploy (manual) via `pulumi-ci.sh`	60 min
Deploy Documentation	`docs.yml`	push (main, docs paths), dispatch	Build MkDocs, deploy to Cloudflare Pages	--
Devcontainer CI	`devcontainer-ci.yml`	weekly (Sunday 02:00), dispatch	Full test suite inside devcontainer image	30 min
Devcontainer Prebuild	`devcontainer-prebuild.yml`	push (main, `.devcontainer/**`), weekly	Build and push devcontainer image to GHCR	30 min
Docs Freshness	`docs-freshness.yml`	scheduled, dispatch	Check documentation freshness against codebase	--

3. CI Gate Structure

The CI workflow (ci.yml) is the quality gate. Deployment workflows only fire when CI passes on a v* tag.

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant CI as CI Workflow
    participant Deploy as Deployment Workflows

    Dev->>GH: Push PR
    GH->>CI: Trigger CI (PR)
    CI->>CI: Detect Changes (dorny/paths-filter)
    CI->>CI: Build Test Matrix
    par Parallel Jobs
        CI->>CI: Lint & Format (Ruff)
        CI->>CI: Type Check (MyPy, 7 packages)
        CI->>CI: Security Scan (pip-audit + Bandit)
        CI->>CI: Performance Tests (PR only)
        CI->>CI: Contract Tests (PR + schedule)
    end
    CI->>CI: Test Matrix (up to 8 packages, parallel)
    CI-->>Dev: Status checks on PR

    Dev->>GH: Merge + Tag v1.2.3
    GH->>CI: Trigger CI (tag)
    CI->>CI: Full matrix (all 8 packages)
    CI-->>GH: CI completed (success)

    GH->>Deploy: workflow_run event
    par Tag Deployments
        Deploy->>Deploy: Docker Build & Push (4 services)
        Deploy->>Deploy: Deploy Lambdas (17 Dockerfile-based functions)
        Deploy->>Deploy: Publish Libraries (CodeArtifact)
    end
    Deploy->>Deploy: Redeploy ECS Services

Tag-Based Gating

Deployment workflows use workflow_run with a condition: github.event.workflow_run.conclusion == 'success' && startsWith(github.event.workflow_run.head_branch, 'v'). This ensures only successful CI runs on version tags trigger deployments.
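As a minimal sketch, the gating condition above can be modeled as a plain predicate. The function name `should_deploy` is hypothetical and not part of the workflows. The lowercasing mirrors GitHub Actions expression semantics, where string comparison and `startsWith` ignore case. The sketch also makes one property of the condition visible: the prefix check accepts any ref name that begins with `v`, not only version tags.

```python
from typing import Optional


def should_deploy(conclusion: str, head_branch: Optional[str]) -> bool:
    """Model of: conclusion == 'success' && startsWith(head_branch, 'v').

    Hypothetical helper for illustration; the real check lives in the
    workflow `if:` expression.
    """
    if head_branch is None:
        return False
    # GitHub expressions compare strings case-insensitively.
    ci_passed = conclusion.lower() == "success"
    looks_like_tag = head_branch.lower().startswith("v")
    return ci_passed and looks_like_tag


# A successful CI run on a version tag triggers deployment.
print(should_deploy("success", "v1.2.3"))         # True
# A failed run, or a run on main, does not.
print(should_deploy("failure", "v1.2.3"))         # False
print(should_deploy("success", "main"))           # False
# The prefix check alone also matches a ref such as "vendor-update".
print(should_deploy("success", "vendor-update"))  # True
```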

4. Change Detection and Test Matrix

The CI workflow uses path-based change detection (dorny/paths-filter@v3) to avoid running all 8 package test suites on every PR. Each filter includes the package's own source plus the specific tradai-common submodules it imports. Heavy consumers (backend, strategy-service, cli) watch all of tradai-common/**; light consumers (data, strategy, data-collection) watch only their specific deps.

8 filters: deps (workspace config -- triggers all), common, data, strategy, backend, data-collection, strategy-service, cli.

On PRs, only affected packages run. On push/schedule/dispatch, all 8 matrix entries run: the 7 packages (coverage threshold 60%, except cli at 45%) plus integration tests (0% threshold). On PRs, integration tests run when common changes or when 2+ packages are affected.
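The selection rules above can be sketched as a small function. This is an illustration under stated assumptions, not the logic in ci.yml: `build_matrix` and `PACKAGES` are hypothetical names, the input is the set of filter names that fired, and a `deps` change is assumed to select the integration entry as well as every package.

```python
from typing import List, Set

# The 7 package filters named above (the 8th filter, "deps", is workspace config).
PACKAGES = [
    "common", "data", "strategy", "backend",
    "data-collection", "strategy-service", "cli",
]


def build_matrix(event: str, fired: Set[str]) -> List[str]:
    """Return the test-matrix entries to run for an event (hypothetical sketch)."""
    # Push, schedule, and dispatch run everything; so does a deps change.
    if event != "pull_request" or "deps" in fired:
        return PACKAGES + ["integration"]
    # On PRs, run only the packages whose filter fired.
    affected = [p for p in PACKAGES if p in fired]
    # Integration tests join when common changed or 2+ packages are affected.
    if "common" in fired or len(affected) >= 2:
        affected.append("integration")
    return affected


print(build_matrix("pull_request", {"data"}))
print(build_matrix("pull_request", {"data", "strategy"}))
print(len(build_matrix("push", set())))
```

In the real pipeline a change under tradai-common would also fire the filters of the packages that watch it, so the `fired` set would normally contain more than `common` alone.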

5. Deployment Flows

5.1 Lambda Deployment (5-Stage Pipeline)

The deploy-lambdas.yml workflow builds and deploys the 17 Dockerfile-based Lambda functions as container images.

flowchart TB
    subgraph Stage1["Stage 1: Version"]
        V[Calculate Version<br/>tag or manual-YYYYMMDDHHMMSS]
    end

    subgraph Stage2["Stage 2: Wheel"]
        W[Build tradai-common wheel<br/>uv build libs/tradai-common]
    end

    subgraph Stage3["Stage 3: Base Image"]
        B[Build lambda-base<br/>lambdas/base/Dockerfile]
    end

    subgraph Stage4["Stage 4: Individual Lambdas"]
        direction LR
        L1[backtest-consumer]
        L2[drift-monitor]
        L3[health-check]
        L4[update-status]
        LN["... 13 more"]
    end

    subgraph Stage5["Stage 5: Update Functions"]
        U[aws lambda update-function-code<br/>for each function]
    end

    V --> Stage2
    Stage2 --> Stage3
    Stage3 --> Stage4
    Stage4 --> Stage5

17 Lambda functions are auto-discovered from lambdas/*/Dockerfile (backtest-consumer, drift-monitor, health-check, update-status, sqs-consumer, validate-strategy, and 11 more), alongside the shared base image. The 18th Lambda (update-nat-routes) is an inline Python handler deployed directly via Pulumi (no Dockerfile), so it is not part of the lambda-bootstrap pipeline.
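A rough sketch of Stage 1 (version) and of the Dockerfile-based discovery follows. Both function names are hypothetical. The sketch assumes that the tag is used unchanged as the version, that the manual timestamp is taken from the caller's clock, and that `base` is excluded from the per-function list because it is built separately in Stage 3.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def compute_version(tag: Optional[str], now: datetime) -> str:
    """Return the tag for tag builds, else manual-YYYYMMDDHHMMSS (assumed format)."""
    if tag and tag.startswith("v"):
        return tag
    return "manual-" + now.strftime("%Y%m%d%H%M%S")


def discover_lambdas(repo_root: Path) -> List[str]:
    """List directories under lambdas/ that contain a Dockerfile, except base."""
    names = sorted(p.parent.name for p in repo_root.glob("lambdas/*/Dockerfile"))
    # The shared base image is built in its own stage, so it is skipped here.
    return [name for name in names if name != "base"]


print(compute_version("v1.2.3", datetime(2026, 3, 28, 12, 0, 5)))  # v1.2.3
print(compute_version(None, datetime(2026, 3, 28, 12, 0, 5)))      # manual-20260328120005
```

A directory without a Dockerfile, such as one holding only an inline handler, is never matched by the glob, which is consistent with update-nat-routes staying outside this pipeline.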

Local equivalent:

just lambda-bootstrap          # Full pipeline: wheel -> base -> all lambdas -> ECR push
just lambda-build-all          # Build only (no push)
just lambda-push-all           # Push pre-built images to ECR

5.2 Service Deployment (Docker Build & Push)

The docker-build.yml workflow builds 4 service images and redeploys ECS.

flowchart LR
    subgraph Build["Parallel Builds"]
        B1[backend<br/>services/backend/Dockerfile]
        B2[data-collection<br/>services/data-collection/Dockerfile]
        B3[strategy-service<br/>services/strategy-service/Dockerfile]
        B4[mlflow<br/>services/mlflow/Dockerfile]
    end

    subgraph Push["ECR Push"]
        ECR["ECR Registry<br/>:version + :latest tags"]
    end

    subgraph Redeploy["ECS Redeploy"]
        ECS["aws ecs update-service<br/>--force-new-deployment"]
    end

    Build --> Push --> Redeploy

ECS service names follow the pattern tradai-{service}-{env}. The backend image maps to the service name backend-api (not backend), matching infra/config.py.
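The naming rule can be sketched as a small lookup. The mapping table and function name below are hypothetical. The authoritative names live in infra/config.py, which is not reproduced here.

```python
# Image name -> ECS service segment, where the two differ (assumed from the text above).
IMAGE_TO_SERVICE = {"backend": "backend-api"}


def ecs_service_name(image: str, env: str) -> str:
    """Build tradai-{service}-{env}, applying the backend -> backend-api exception."""
    service = IMAGE_TO_SERVICE.get(image, image)
    return f"tradai-{service}-{env}"


print(ecs_service_name("backend", "dev"))           # tradai-backend-api-dev
print(ecs_service_name("strategy-service", "dev"))  # tradai-strategy-service-dev
```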

Local equivalent:

just docker-build              # Build all 4 service images (linux/amd64)
just service-push-all          # Build + push all to ECR
just ecs-force-deploy-all      # Force ECS redeployment (strategy-service, dry-run-trading, live-trading)

CI vs Local Redeployment Targets

The docker-build.yml CI workflow redeploys backend-api, data-collection, strategy-service, and mlflow (the 4 services it builds). The just ecs-force-deploy-all command targets only strategy-service, dry-run-trading, and live-trading (the trading services). To redeploy all services locally, use just ecs-force-deploy <service> for each service individually.

5.3 Infrastructure Deployment (4-Stack Pulumi)

The deploy-infra.yml workflow manages 4 Pulumi stacks deployed in strict order.

flowchart TB
    subgraph Validate
        V[Lint + Unit Tests<br/>all 4 stacks]
    end

    subgraph PR["PR: Preview"]
        P1["Preview dev"]
        P2["Preview staging"]
        P3["Preview prod"]
        PS["Post PR Summary<br/>(create/update/delete counts)"]
    end

    subgraph Manual["Manual: Deploy"]
        D1["persistent<br/>S3, DynamoDB, ECR, Cognito, CodeArtifact"]
        D2["foundation<br/>VPC, RDS, SQS, SNS"]
        D3["Lambda Bootstrap<br/>(just lambda-bootstrap)"]
        D4["compute<br/>ALB, ECS, Lambda, Step Functions"]
        D5["edge<br/>API Gateway, WAF, CloudWatch"]
    end

    Validate --> PR
    Validate --> Manual

    P1 --> PS
    P2 --> PS
    P3 --> PS

    D1 --> D2 --> D3 --> D4 --> D5

Deployment Order

The compute stack has a pre-flight check that verifies all Lambda images exist in ECR before deploying. If images are missing, it fails fast with instructions to run just lambda-bootstrap first.
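A minimal sketch of such a fail-fast check follows, assuming the caller already has the list of required image names and the list of images present in ECR. The names `DEPLOY_ORDER`, `MissingLambdaImages`, and `preflight_compute` are hypothetical. The actual check in the compute stack may be implemented differently.

```python
from typing import Iterable, List

# Manual deploy order from the diagram above; lambda-bootstrap is a step, not a stack.
DEPLOY_ORDER: List[str] = ["persistent", "foundation", "lambda-bootstrap", "compute", "edge"]


class MissingLambdaImages(RuntimeError):
    """Raised when the compute stack would reference images absent from ECR."""


def preflight_compute(required: Iterable[str], in_ecr: Iterable[str]) -> None:
    """Fail fast if any required Lambda image is missing (illustrative only)."""
    missing = sorted(set(required) - set(in_ecr))
    if missing:
        raise MissingLambdaImages(
            "Missing Lambda images in ECR: " + ", ".join(missing)
            + ". Run `just lambda-bootstrap` first."
        )


# Passes silently when every required image is present.
preflight_compute(["health-check"], ["health-check", "drift-monitor"])

try:
    preflight_compute(["health-check", "drift-monitor"], ["health-check"])
except MissingLambdaImages as exc:
    print(exc)
```

Running the check before any resource is created keeps a failed compute deployment from leaving a partially updated stack behind.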

Local equivalent:

just infra-bootstrap dev       # Full: persistent -> foundation -> lambda-bootstrap -> service-push -> compute -> edge
just infra-up-foundation dev   # Single stack
just infra-preview dev         # Preview all 4 stacks
just infra-recover foundation dev  # Recovery: cancel + refresh + preview drift

Deploy script: infra/pulumi-ci.sh handles layer iteration, backend login, stack selection, and pulumi preview/pulumi up for any combination of layers and environments.

5.4 Library Publishing

The publish-libs.yml workflow publishes tradai-strategy to AWS CodeArtifact for use by the separate tradai-strategies repository.

# CI pipeline steps:
uv build libs/tradai-strategy --out-dir dist       # Build wheel
twine upload --repository-url $CODEARTIFACT_URL ... # Publish to CodeArtifact

Local equivalent:

just codeartifact-login dev    # Configure pip/twine auth (12h expiry)
just publish-all-libs dev      # Build + publish tradai-strategy

5.5 Documentation Deployment

The docs.yml workflow is triggered by pushes to main that touch docs/**, mkdocs.yml, lib/service READMEs, or architecture reports. It builds MkDocs with the Material theme and deploys to Cloudflare Pages.

6. Emergency Procedures

Manual Dispatch Bypass

The workflow_dispatch trigger on deploy-lambdas.yml and deploy-infra.yml bypasses the CI gate. Use only for emergency hotfixes when CI is broken or blocking a critical deploy.

Emergency Lambda deploy: Actions > Deploy Lambdas > Run workflow (select env). Or locally: just lambda-bootstrap.
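Emergency ECS hotfix: just ecs-force-deploy-all (trading services only: strategy-service, dry-run-trading, live-trading), or a single service: just ecs-force-deploy strategy-service.
Rollback: Check out the previous tag, then rebuild, push, and redeploy: git checkout v1.2.2 && just docker-build && just service-push-all && just ecs-force-deploy-all. Because ecs-force-deploy-all covers only the trading services, redeploy any other service with just ecs-force-deploy <service>. For Lambdas: just lambda-bootstrap.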
Infrastructure recovery: just infra-recover foundation dev (cancels pending ops, refreshes state, previews drift). deploy-infra.yml also prints recovery instructions on failure.

7. Changelog

Date	Change	Author
2026-03-28	Initial document	Architecture team

8. Dependencies

This document relates to:

02-ARCHITECTURE-OVERVIEW.md -- System architecture and service topology
04-SECURITY.md -- Security controls and secrets management
05-SERVICES.md -- Service definitions (backend, data-collection, strategy-service, mlflow)
.github/workflows/ -- All 11 workflow definitions
justfile -- Local development and deployment commands
infra/pulumi-ci.sh -- Pulumi deployment script (4-stack orchestration)
.github/actions/setup-workspace/ -- Shared composite action for CI setup