Skip to content

TradAI Final Architecture - Step Functions Workflows

Version: 11.1.0 | Date: 2026-03-28 | Status: CURRENT

Last synced with: infra/compute/asl_templates/backtest_workflow.json.j2, infra/compute/asl_templates/retraining_workflow.json.j2

TL;DR: 2 Step Functions workflows: Backtest (12 states, 7200s timeout) and Retraining (10 states). Both use Jinja2 ASL templates with Lambda ARN injection. DLQ handling is at SQS level, not workflow level.


1. When to Use Step Functions

Step Functions is one of four execution modes for backtesting (see 05-SERVICES.md Section 4).

Mode Use Case Use Step Functions?
local Development/testing NO - Docker only
ecs Simple production backtests NO - Direct ECS launch
sqs High-volume with backpressure NO - SQS → Lambda → ECS
stepfunctions Complex multi-step workflows YES

Use Step Functions when you need: - Data freshness validation before backtest - Strategy validation (ECR/S3 checks) - Multi-step workflows (data sync → backtest → analyze) - Visual debugging and execution history - Complex retry/catch logic

Skip Step Functions when: - Simple backtest execution (use ECS direct or SQS mode) - Development/testing (use local mode) - Container handles its own status updates (v9.2 architecture)


2. Workflow Overview

Workflow Type States Timeout Trigger
Backtest Workflow (v11) STANDARD 12 7200s (2 hours) SQS -> Lambda
Retraining Workflow (MO005) STANDARD 10 10800s (3 hours) EventBridge / Manual

Backtest Workflow Timeout: 7200s (2 hours)

The backtest workflow has a hard timeout of 7200 seconds. Backtests exceeding this limit (e.g., multi-year multi-pair runs) will be caught by States.Timeout and routed to HandleTimeout -> CleanupResources -> UpdateStatusFailed. Monitor ExecutionDuration CloudWatch metrics to detect backtests approaching the limit. Consider increasing the timeout for ML-heavy strategies.

Critical: All workflows use STANDARD type (not EXPRESS) because: - Backtests can run 30-60+ minutes - EXPRESS has 5-minute maximum duration - STANDARD provides execution history for 90 days


3. Backtest Workflow (v11)

The v11 workflow replaced the earlier v8.0 "full" and v9.2 "simplified" workflows with a single, unified workflow. Key changes from earlier versions:

  • Removed PrepareConfig (ECS Task) — task definition resolved dynamically from $.task_definition
  • Removed TransformResults (Lambda) — container handles DynamoDB/MLflow/S3 directly
  • Renamed CheckDataFreshnessEnsureData — now does DB-first coverage check, fetches only gaps
  • Added UpdateStatusRunning — explicitly sets RUNNING before backtest via update-status Lambda
  • Split UpdateStatusUpdateStatusCompleted / UpdateStatusFailed / UpdateStatusValidationFailed
  • Split HandleErrorHandleTimeout (Pass) + CleanupResources (Lambda)
  • Added HandleSuccess as a Parallel state (UpdateStatus + Notify run concurrently)

Workflow Diagram

Mermaid Diagram

stateDiagram-v2
    [*] --> ValidateStrategy
    ValidateStrategy --> EvaluateValidation

    state EvaluateValidation <<choice>>
    EvaluateValidation --> EnsureData: Valid
    EvaluateValidation --> UpdateStatusValidationFailed: Invalid

    UpdateStatusValidationFailed --> FailValidation
    FailValidation --> [*]

    EnsureData --> UpdateStatusRunning
    UpdateStatusRunning --> RunBacktest

    state RunBacktest_result <<choice>>
    RunBacktest --> RunBacktest_result
    RunBacktest_result --> HandleSuccess: Success
    RunBacktest_result --> HandleTimeout: Timeout
    RunBacktest_result --> CleanupResources: TaskFailed

    HandleTimeout --> CleanupResources
    CleanupResources --> UpdateStatusFailed
    UpdateStatusFailed --> NotifyFailure
    NotifyFailure --> [*]

    state HandleSuccess {
        [*] --> UpdateStatusCompleted
        [*] --> NotifySuccess
    }
    HandleSuccess --> [*]

State Reference

State Type Purpose
ValidateStrategy Task (Lambda) Validate strategy exists and is deployable
EvaluateValidation Choice Route based on validation result
UpdateStatusValidationFailed Task (Lambda) Set job status to FAILED for validation errors
FailValidation Fail Terminal state for validation failures
EnsureData Task (Lambda) Check ArcticDB coverage, fetch gaps from exchange
UpdateStatusRunning Task (Lambda) Set job status to RUNNING before ECS launch
RunBacktest Task (ECS RunTask) Execute backtest in strategy container
HandleTimeout Pass Format timeout error for cleanup
CleanupResources Task (Lambda) Stop orphaned ECS tasks on failure
UpdateStatusFailed Task (Lambda) Set job status to FAILED
HandleSuccess Parallel Run UpdateStatusCompleted + NotifySuccess concurrently
NotifyFailure Task (Lambda) Send failure notification

State Machine Definition

See the canonical ASL template at infra/compute/asl_templates/backtest_workflow.json.j2.

Key configuration parameters (Jinja2 variables):

Variable Description
config.backtest_timeout_seconds Overall workflow timeout (7200s = 2 hours)
config.heartbeat_seconds ECS task heartbeat interval (900s = 15 min)

Two Distinct Heartbeats

This is the Step Functions ECS task heartbeat (900s = 15 min). It is distinct from the application-level DynamoDB heartbeat sent by HealthReporter every 30 seconds (see 16-OBSERVABILITY.md Section 4). The SFN heartbeat detects hung ECS tasks that stop communicating with Step Functions; the app heartbeat monitors live trading strategy health and publishes metrics to CloudWatch.

| lambda_arns.validate_strategy | ValidateStrategy Lambda ARN | | lambda_arns.data_collection_proxy | EnsureData Lambda ARN | | lambda_arns.update_status | UpdateStatus Lambda ARN | | lambda_arns.cleanup_resources | CleanupResources Lambda ARN | | lambda_arns.notify_completion | NotifyCompletion Lambda ARN | | ecs.cluster_arn | ECS cluster ARN | | ecs.subnets | Private subnet list | | ecs.security_group_id | ECS security group | | dynamodb_table_name | Workflow state table | | mlflow_tracking_uri | MLflow tracking URI | | arctic_bucket | S3 bucket for ArcticDB data | | arctic_library | ArcticDB library name (default: ohlcv) |

Key Design Decisions (v11)

  1. Sequential validation — ValidateStrategy runs first, then EnsureData. Earlier versions ran these in parallel, but v11 validates first to fail fast before expensive data operations.

  2. Dynamic task definitionTaskDefinition.$: "$.task_definition" is resolved from execution input, not hardcoded. This allows different strategy containers per execution.

  3. Explicit status updates via Lambda — Uses update-status Lambda instead of direct DynamoDB integration. This enforces state transition guards and publishes CloudWatch metrics.

  4. Separate timeout handlingHandleTimeout (Pass state) formats the error before routing to CleanupResources, keeping timeout and task-failure paths distinct.

  5. Parallel success handlingHandleSuccess runs UpdateStatusCompleted and NotifySuccess concurrently with independent catch-all states, so a notification failure doesn't block status updates.

Error Handling (RunBacktest)

RunBacktest uses Catch blocks only (no Retry) for ECS task errors:

Error Handler Next State
States.TaskFailed Catch CleanupResources
States.Timeout Catch HandleTimeout → CleanupResources
States.ALL Catch CleanupResources

Lambda states (ValidateStrategy, EnsureData) have Retry blocks for transient AWS errors:

Error Type Max Attempts Interval Backoff
Lambda.ServiceException 3 2s 2.0x
Lambda.TooManyRequestsException 3 2s 2.0x

DLQ Handling at SQS Level

Dead letter queue handling is at the SQS level, not within the Step Functions workflow. Failed messages in tradai-backtest-queue.fifo are routed to tradai-backtest-dlq.fifo after maxReceiveCount (3) attempts. Step Functions failures (e.g., Lambda errors, ECS task crashes) are handled by Catch blocks within the workflow, not by SQS retries.

Expected Flow (Happy Path)

1. ValidateStrategy
   └─ Output: {valid: true, resolved_version: "1.2.0"}

2. EvaluateValidation → EnsureData

3. EnsureData
   └─ Output: {coverage: {BTC/USDT: 100%}, gaps_filled: 0}

4. UpdateStatusRunning
   └─ Sets job status to RUNNING in DynamoDB

5. RunBacktest
   └─ Duration: 10-30 minutes
   └─ Container handles MLflow logging, S3 upload

6. HandleSuccess (Parallel)
   ├─ UpdateStatusCompleted → DynamoDB updated to COMPLETED
   └─ NotifySuccess → SNS notification sent

Version History

Version Key Changes
v8.0 Initial: ParallelValidation, PrepareConfig, TransformResults, direct DynamoDB
v9.2 Simplified: container handles status/MLflow/S3, removed PrepareConfig/TransformResults
v11 Current: sequential validation, EnsureData with DB-first check, explicit status Lambdas, parallel success handling

4. Retraining Workflow (MO005)

The retraining workflow orchestrates ML model retraining, comparing the challenger model against the champion, and optionally promoting the new version.

See the canonical ASL template at infra/compute/asl_templates/retraining_workflow.json.j2.

Workflow Diagram

stateDiagram-v2
    [*] --> CheckRetrainingNeeded

    CheckRetrainingNeeded --> EvaluateRetrainingNeed
    CheckRetrainingNeeded --> NotifyFailure: States.ALL (Catch)

    state EvaluateRetrainingNeed <<choice>>
    EvaluateRetrainingNeed --> RunRetraining: needs_retraining
    EvaluateRetrainingNeed --> SkipRetraining: otherwise (default)

    SkipRetraining --> [*]

    RunRetraining --> CompareModels
    RunRetraining --> NotifyFailure: States.ALL (Catch)

    CompareModels --> DecidePromotion
    CompareModels --> NotifyFailure: States.ALL (Catch)

    state DecidePromotion <<choice>>
    DecidePromotion --> PromoteModel: promote
    DecidePromotion --> KeepCurrentModel: otherwise (default)

    PromoteModel --> NotifyCompletion
    PromoteModel --> NotifyFailure: States.ALL (Catch)
    KeepCurrentModel --> NotifyCompletion

    NotifyCompletion --> [*]
    NotifyFailure --> [*]

    note right of CheckRetrainingNeeded: Lambda&#58; check drift state and schedule triggers
    note right of RunRetraining: ECS RunTask (FARGATE_SPOT)<br/>TimeoutSeconds&#58; training_timeout_seconds<br/>HeartbeatSeconds&#58; heartbeat_seconds
    note right of CompareModels: Lambda&#58; champion vs challenger metrics
    note right of PromoteModel: Lambda&#58; promote challenger to Production in MLflow

State Reference

State Type Purpose
CheckRetrainingNeeded Task (Lambda) Check drift state and schedule triggers
EvaluateRetrainingNeed Choice Route based on decision (needs_retraining or skip)
SkipRetraining Pass Terminal state when retraining not needed
RunRetraining Task (ECS RunTask) Execute model training in strategy container
CompareModels Task (Lambda) Compare champion vs challenger metrics
DecidePromotion Choice Route based on comparison (promote or keep)
KeepCurrentModel Pass Skip promotion when challenger is not better
PromoteModel Task (Lambda) Promote challenger to Production in MLflow
NotifyCompletion Task (Lambda) Send success notification
NotifyFailure Task (Lambda) Send failure notification

Template Variables

Variable Description
config.retraining_timeout_seconds Overall workflow timeout (10800s = 3 hours)
config.training_timeout_seconds ECS training task timeout (7200s = 2 hours)
config.heartbeat_seconds ECS task heartbeat interval (900s = 15 min)
lambda_arns.check_retraining_needed CheckRetrainingNeeded Lambda ARN
lambda_arns.compare_models CompareModels Lambda ARN
lambda_arns.promote_model PromoteModel Lambda ARN
lambda_arns.notify_completion NotifyCompletion Lambda ARN
ecs.cluster_arn ECS cluster ARN
ecs.subnets Private subnet list
ecs.security_group_id ECS security group
dynamodb_table_name Workflow state table
mlflow_tracking_uri MLflow tracking URI

Error Handling

All states use Catch blocks routing to NotifyFailure on States.ALL errors. RunRetraining has no Retry -- ECS task failures go directly to NotifyFailure.


5. CloudWatch Integration

Metrics Published

Metric Namespace Dimensions Unit
ExecutionStarted TradAI/StepFunctions WorkflowType Count
ExecutionSucceeded TradAI/StepFunctions WorkflowType Count
ExecutionFailed TradAI/StepFunctions WorkflowType Count
ExecutionDuration TradAI/StepFunctions WorkflowType Seconds
BacktestDuration TradAI/Backtest Strategy Seconds

Note: CloudWatch alarms for Step Functions are defined in the edge stack (infra/edge/), not in the compute stack.


6. Cost Analysis

Step Functions Pricing (Standard)

Metric Rate Monthly Usage Cost
State transitions $0.025 per 1000 ~6000 (50 backtests x 12 states x 10) $0.15
Total $0.15

Comparison: EXPRESS vs STANDARD

Aspect EXPRESS STANDARD
Max Duration 5 minutes 1 year
Cost per 1M transitions $1.00 $25.00
State persistence No 90 days
Suitable for backtests NO YES

7. Testing Workflows

Test Input

{
  "run_id": "test-001",
  "trace_id": "trace-test-001",
  "strategy_name": "MomentumStrategy",
  "strategy_version": "1.0.0",
  "task_definition": "arn:aws:ecs:eu-central-1:123456:task-definition/momentum-strategy:5",
  "experiment_name": "test-experiment",
  "symbols": ["BTC/USDT:USDT"],
  "symbols_csv": "BTC/USDT:USDT",
  "exchange": "binance",
  "timeframe": "1h",
  "start_date": "20240101",
  "end_date": "20240601",
  "config_version_id": "abc123"
}

8. Next Steps

  1. Review 07-COST-ANALYSIS.md for complete cost breakdown
  2. Review 09-PULUMI-CODE.md for Step Functions deployment code

Changelog

Version Date Changes
11.1.0 2026-03-28 Added TL;DR, changelog, and dependencies sections

Dependencies

If This Changes Update This Doc
infra/compute/asl_templates/backtest_workflow.json.j2 Backtest workflow diagram + state reference (Section 3)
infra/compute/asl_templates/retraining_workflow.json.j2 Retraining workflow diagram + state reference (Section 4)
infra/compute/modules/lambda_funcs.py LAMBDA_CONFIGS Lambda ARN references in template variables
infra/compute/modules/step_functions.py Workflow creation, timeout, and config values (Section 2)