Skip to content

TradAI Final Architecture - Step Functions Workflows

Version: v11 | Date: 2026-02-21

Last synced with: infra/compute/asl_templates/backtest_workflow.json.j2


When to Use Step Functions

Step Functions is one of four execution modes for backtesting (see 05-SERVICES.md Section 6).

Mode Use Case Use Step Functions?
local Development/testing NO - Docker only
ecs Simple production backtests NO - Direct ECS launch
sqs High-volume with backpressure NO - SQS → Lambda → ECS
stepfunctions Complex multi-step workflows YES

Use Step Functions when you need: - Data freshness validation before backtest - Strategy validation (ECR/S3 checks) - Multi-step workflows (data sync → backtest → analyze) - Visual debugging and execution history - Complex retry/catch logic

Skip Step Functions when: - Simple backtest execution (use ECS direct or SQS mode) - Development/testing (use local mode) - Container handles its own status updates (v9.2 architecture)


Workflow Overview

Workflow Type States Duration Trigger
Backtest Workflow (v11) STANDARD 12 10-70 min SQS → Lambda
Data Sync Workflow STANDARD 5 10-30 min API Gateway
Deploy Workflow STANDARD 6 5-15 min API Gateway

Critical: All workflows use STANDARD type (not EXPRESS) because: - Backtests can run 30-60+ minutes - EXPRESS has 5-minute maximum duration - STANDARD provides execution history for 90 days


1. Backtest Workflow (v11)

The v11 workflow replaced the earlier v8.0 "full" and v9.2 "simplified" workflows with a single, unified workflow. Key changes from earlier versions:

  • Removed PrepareConfig (ECS Task) — task definition resolved dynamically from $.task_definition
  • Removed TransformResults (Lambda) — container handles DynamoDB/MLflow/S3 directly
  • Renamed CheckDataFreshnessEnsureData — now does DB-first coverage check, fetches only gaps
  • Added UpdateStatusRunning — explicitly sets RUNNING before backtest via update-status Lambda
  • Split UpdateStatusUpdateStatusCompleted / UpdateStatusFailed / UpdateStatusValidationFailed
  • Split HandleErrorHandleTimeout (Pass) + CleanupResources (Lambda)
  • Added HandleSuccess as a Parallel state (UpdateStatus + Notify run concurrently)

Workflow Diagram

                               ┌──────────────────┐
                               │      START       │
                               └────────┬─────────┘
                               ┌────────┴────────┐
                               │ValidateStrategy │
                               │    (Lambda)     │
                               └────────┬────────┘
                               ┌────────┴────────┐
                               │EvaluateValidation│
                               │    (Choice)     │
                               └────────┬────────┘
                          ┌─────────────┴─────────────┐
                          │ Valid                      │ Invalid
                          ▼                            ▼
                 ┌────────────────┐          ┌─────────────────────────┐
                 │  EnsureData   │          │UpdateStatusValidationFailed│
                 │   (Lambda)    │          │        (Lambda)          │
                 │ DB-first check│          └────────────┬────────────┘
                 └───────┬──────┘                        │
                         │                      ┌────────┴────────┐
                         ▼                      │  FailValidation │
                ┌─────────────────┐             │     (Fail)      │
                │UpdateStatusRunning│            └─────────────────┘
                │    (Lambda)     │
                └───────┬────────┘
                ┌───────┴────────┐
                │  RunBacktest   │
                │  (ECS RunTask) │
                └───────┬────────┘
           ┌────────────┼──────────────┐
           │ Success    │ Timeout      │ TaskFailed/Other
           ▼            ▼              ▼
    ┌──────────────┐ ┌──────────┐ ┌──────────────┐
    │ HandleSuccess│ │HandleTimeout│ │CleanupResources│
    │  (Parallel)  │ │  (Pass)  │ │   (Lambda)    │
    │ ┌──────────┐ │ └────┬─────┘ └──────┬───────┘
    │ │UpdateStatus│ │      │              │
    │ │Completed │ │      └──────┬───────┘
    │ └──────────┘ │             │
    │ ┌──────────┐ │    ┌────────┴────────┐
    │ │NotifySuccess│ │  │UpdateStatusFailed│
    │ └──────────┘ │    │    (Lambda)     │
    └──────┬───────┘    └────────┬────────┘
           │                     │
           │            ┌────────┴────────┐
           │            │ NotifyFailure   │
           │            │    (Lambda)     │
           │            └────────┬────────┘
           │                     │
           └────────┬────────────┘
             ┌──────┴──────┐
             │     END     │
             └─────────────┘

State Reference

State Type Purpose
ValidateStrategy Task (Lambda) Validate strategy exists and is deployable
EvaluateValidation Choice Route based on validation result
UpdateStatusValidationFailed Task (Lambda) Set job status to FAILED for validation errors
FailValidation Fail Terminal state for validation failures
EnsureData Task (Lambda) Check ArcticDB coverage, fetch gaps from exchange
UpdateStatusRunning Task (Lambda) Set job status to RUNNING before ECS launch
RunBacktest Task (ECS RunTask) Execute backtest in strategy container
HandleTimeout Pass Format timeout error for cleanup
CleanupResources Task (Lambda) Stop orphaned ECS tasks on failure
UpdateStatusFailed Task (Lambda) Set job status to FAILED
HandleSuccess Parallel Run UpdateStatusCompleted + NotifySuccess concurrently
NotifyFailure Task (Lambda) Send failure notification

State Machine Definition

See the canonical ASL template at infra/compute/asl_templates/backtest_workflow.json.j2.

Key configuration parameters (Jinja2 variables):

Variable Description
config.backtest_timeout_seconds Overall workflow timeout
config.heartbeat_seconds ECS task heartbeat interval
lambda_arns.validate_strategy ValidateStrategy Lambda ARN
lambda_arns.data_collection_proxy EnsureData Lambda ARN
lambda_arns.update_status UpdateStatus Lambda ARN
lambda_arns.cleanup_resources CleanupResources Lambda ARN
lambda_arns.notify_completion NotifyCompletion Lambda ARN
ecs.cluster_arn ECS cluster ARN
ecs.subnets Private subnet list
ecs.security_group_id ECS security group
dynamodb_table_name Workflow state table
mlflow_tracking_uri MLflow tracking URI

Key Design Decisions (v11)

  1. Sequential validation — ValidateStrategy runs first, then EnsureData. Earlier versions ran these in parallel, but v11 validates first to fail fast before expensive data operations.

  2. Dynamic task definitionTaskDefinition.$: "$.task_definition" is resolved from execution input, not hardcoded. This allows different strategy containers per execution.

  3. Explicit status updates via Lambda — Uses update-status Lambda instead of direct DynamoDB integration. This enforces state transition guards and publishes CloudWatch metrics.

  4. Separate timeout handlingHandleTimeout (Pass state) formats the error before routing to CleanupResources, keeping timeout and task-failure paths distinct.

  5. Parallel success handlingHandleSuccess runs UpdateStatusCompleted and NotifySuccess concurrently with independent catch-all states, so a notification failure doesn't block status updates.

Expected Flow (Happy Path)

1. ValidateStrategy
   └─ Output: {valid: true, resolved_version: "1.2.0"}

2. EvaluateValidation → EnsureData

3. EnsureData
   └─ Output: {coverage: {BTC/USDT: 100%}, gaps_filled: 0}

4. UpdateStatusRunning
   └─ Sets job status to RUNNING in DynamoDB

5. RunBacktest
   └─ Duration: 10-30 minutes
   └─ Container handles MLflow logging, S3 upload

6. HandleSuccess (Parallel)
   ├─ UpdateStatusCompleted → DynamoDB updated to COMPLETED
   └─ NotifySuccess → SNS notification sent

Version History

Version Key Changes
v8.0 Initial: ParallelValidation, PrepareConfig, TransformResults, direct DynamoDB
v9.2 Simplified: container handles status/MLflow/S3, removed PrepareConfig/TransformResults
v11 Current: sequential validation, EnsureData with DB-first check, explicit status Lambdas, parallel success handling

2. Error Handling Strategy

Retry Configuration

Error Type Max Attempts Interval Backoff
Lambda.ServiceException 3 2s 2.0x
Lambda.TooManyRequestsException 3 2s 2.0x
ECS.AmazonECSException 2 5s 2.0x
States.TaskFailed 1 10s 1.0x

Error Categories

Transient Errors (Retry):
├─ Lambda.ServiceException       → Auto-retry with backoff
├─ Lambda.TooManyRequestsException → Auto-retry with backoff
├─ ECS.AmazonECSException        → Auto-retry once
└─ States.Timeout                → Check heartbeat, may retry

Business Errors (Fail Fast):
├─ ValidationError               → Fail immediately
├─ DataStaleError               → Fail immediately
└─ ConfigurationError           → Fail immediately

Infrastructure Errors (Alert):
├─ States.Permissions           → Alert ops team
├─ States.ResultPathMatchFailure → Alert dev team
└─ Unknown errors               → DLQ + alert

Dead Letter Queue Integration

{
  "PublishToDLQ": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
      "QueueUrl": "${DeadLetterQueueUrl}",
      "MessageBody": {
        "workflow": "backtest",
        "run_id.$": "$.run_id",
        "error.$": "$.error",
        "timestamp.$": "$$.State.EnteredTime",
        "execution_arn.$": "$$.Execution.Id"
      }
    },
    "Next": "NotifyFailure"
  }
}

3. Data Sync Workflow (unchanged)

{
  "Comment": "TradAI Data Sync Workflow",
  "StartAt": "ValidateRequest",
  "States": {

    "ValidateRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-validate-data-request",
      "Next": "FetchAndStoreData"
    },

    "FetchAndStoreData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "TimeoutSeconds": 1800,
      "HeartbeatSeconds": 300,
      "Parameters": {
        "Cluster": "tradai-cluster",
        "TaskDefinition": "tradai-data-collection-task",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": ["${PrivateSubnet1}", "${PrivateSubnet2}"],
            "SecurityGroups": ["${ECSSecurityGroup}"],
            "AssignPublicIp": "DISABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": [
            {
              "Name": "data-collection",
              "Environment": [
                {"Name": "COMMAND", "Value": "full-sync"},
                {"Name": "SYMBOLS", "Value.$": "States.JsonToString($.symbols)"},
                {"Name": "TIMEFRAME", "Value.$": "$.timeframe"}
              ]
            }
          ]
        }
      },
      "ResultPath": "$.sync_result",
      "Next": "ValidateDataQuality"
    },

    "ValidateDataQuality": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-validate-data-quality",
      "Next": "NotifyCompletion"
    },

    "NotifyCompletion": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-notify-completion",
      "Parameters": {
        "workflow": "data-sync",
        "status": "COMPLETED"
      },
      "End": true
    }
  }
}

4. CloudWatch Integration

Metrics Published

Metric Namespace Dimensions Unit
ExecutionStarted TradAI/StepFunctions WorkflowType Count
ExecutionSucceeded TradAI/StepFunctions WorkflowType Count
ExecutionFailed TradAI/StepFunctions WorkflowType Count
ExecutionDuration TradAI/StepFunctions WorkflowType Seconds
BacktestDuration TradAI/Backtest Strategy Seconds

CloudWatch Alarms

Alarms:
  - Name: backtest-failures-high
    Metric: ExecutionFailed
    Threshold: 5 in 1 hour
    Action: SNS notification

  - Name: backtest-duration-long
    Metric: ExecutionDuration
    Threshold: 3600 seconds (1 hour)
    Action: SNS warning

  - Name: dlq-messages
    Metric: ApproximateNumberOfMessagesVisible
    Threshold: > 0
    Action: SNS alert

5. Cost Analysis

Step Functions Pricing (Standard)

Metric Rate Monthly Usage Cost
State transitions $0.025 per 1000 ~4500 (50 backtests × 9 states × 10) $0.11
Total $0.11

Comparison: EXPRESS vs STANDARD

Aspect EXPRESS STANDARD
Max Duration 5 minutes 1 year
Cost per 1M transitions $1.00 $25.00
State persistence No 90 days
Suitable for backtests NO YES

6. Testing Workflows

Test Input

{
  "run_id": "test-001",
  "trace_id": "trace-test-001",
  "strategy_name": "MomentumStrategy",
  "strategy_version": "1.0.0",
  "task_definition": "arn:aws:ecs:us-east-1:123456:task-definition/momentum-strategy:5",
  "experiment_name": "test-experiment",
  "symbols": ["BTC/USDT:USDT"],
  "symbols_csv": "BTC/USDT:USDT",
  "exchange": "binance",
  "timeframe": "1h",
  "start_date": "20240101",
  "end_date": "20240601",
  "config_version_id": "abc123"
}

Expected Flow (Happy Path)

1. ValidateStrategy: {valid: true, resolved_version: "1.0.0"}

2. EvaluateValidation → EnsureData

3. EnsureData: {coverage: 100%, gaps_filled: 0}

4. UpdateStatusRunning → DynamoDB status = RUNNING

5. RunBacktest (10-30 minutes, container handles MLflow/S3)

6. HandleSuccess (Parallel)
   ├─ UpdateStatusCompleted → DynamoDB status = COMPLETED
   └─ NotifySuccess → SNS notification sent

Next Steps

  1. Review 07-COST-ANALYSIS.md for complete cost breakdown
  2. Review 09-PULUMI-CODE.md for Step Functions deployment code