TradAI Final Architecture - Step Functions Workflows¶

Version: 11.1.0 | Date: 2026-03-28 | Status: CURRENT

Last synced with: infra/compute/asl_templates/backtest_workflow.json.j2, infra/compute/asl_templates/retraining_workflow.json.j2

TL;DR: 2 Step Functions workflows: Backtest (12 states, 7200s timeout) and Retraining (10 states). Both use Jinja2 ASL templates with Lambda ARN injection. DLQ handling is at SQS level, not workflow level.

1. When to Use Step Functions¶

Step Functions is one of four execution modes for backtesting (see 05-SERVICES.md Section 4).

Mode	Use Case	Use Step Functions?
local	Development/testing	NO - Docker only
ecs	Simple production backtests	NO - Direct ECS launch
sqs	High-volume with backpressure	NO - SQS → Lambda → ECS
stepfunctions	Complex multi-step workflows	YES

Use Step Functions when you need: - Data freshness validation before backtest - Strategy validation (ECR/S3 checks) - Multi-step workflows (data sync → backtest → analyze) - Visual debugging and execution history - Complex retry/catch logic

Skip Step Functions when: - Simple backtest execution (use ECS direct or SQS mode) - Development/testing (use local mode) - Container handles its own status updates (v9.2 architecture)

2. Workflow Overview¶

Workflow	Type	States	Timeout	Trigger
Backtest Workflow (v11)	STANDARD	12	7200s (2 hours)	SQS -> Lambda
Retraining Workflow (MO005)	STANDARD	10	10800s (3 hours)	EventBridge / Manual

Backtest Workflow Timeout: 7200s (2 hours)

The backtest workflow has a hard timeout of 7200 seconds. Backtests exceeding this limit (e.g., multi-year multi-pair runs) will be caught by States.Timeout and routed to HandleTimeout -> CleanupResources -> UpdateStatusFailed. Monitor ExecutionDuration CloudWatch metrics to detect backtests approaching the limit. Consider increasing the timeout for ML-heavy strategies.

Critical: All workflows use STANDARD type (not EXPRESS) because: - Backtests can run 30-60+ minutes - EXPRESS has 5-minute maximum duration - STANDARD provides execution history for 90 days

3. Backtest Workflow (v11)¶

The v11 workflow replaced the earlier v8.0 "full" and v9.2 "simplified" workflows with a single, unified workflow. Key changes from earlier versions:

Removed PrepareConfig (ECS Task) — task definition resolved dynamically from $.task_definition
Removed TransformResults (Lambda) — container handles DynamoDB/MLflow/S3 directly
Renamed CheckDataFreshness → EnsureData — now does DB-first coverage check, fetches only gaps
Added UpdateStatusRunning — explicitly sets RUNNING before backtest via update-status Lambda
Split UpdateStatus → UpdateStatusCompleted / UpdateStatusFailed / UpdateStatusValidationFailed
Split HandleError → HandleTimeout (Pass) + CleanupResources (Lambda)
Added HandleSuccess as a Parallel state (UpdateStatus + Notify run concurrently)

Workflow Diagram¶

Mermaid Diagram¶

stateDiagram-v2
    [*] --> ValidateStrategy
    ValidateStrategy --> EvaluateValidation

    state EvaluateValidation <<choice>>
    EvaluateValidation --> EnsureData: Valid
    EvaluateValidation --> UpdateStatusValidationFailed: Invalid

    UpdateStatusValidationFailed --> FailValidation
    FailValidation --> [*]

    EnsureData --> UpdateStatusRunning
    UpdateStatusRunning --> RunBacktest

    state RunBacktest_result <<choice>>
    RunBacktest --> RunBacktest_result
    RunBacktest_result --> HandleSuccess: Success
    RunBacktest_result --> HandleTimeout: Timeout
    RunBacktest_result --> CleanupResources: TaskFailed

    HandleTimeout --> CleanupResources
    CleanupResources --> UpdateStatusFailed
    UpdateStatusFailed --> NotifyFailure
    NotifyFailure --> [*]

    state HandleSuccess {
        [*] --> UpdateStatusCompleted
        [*] --> NotifySuccess
    }
    HandleSuccess --> [*]

State Reference¶

State	Type	Purpose
`ValidateStrategy`	Task (Lambda)	Validate strategy exists and is deployable
`EvaluateValidation`	Choice	Route based on validation result
`UpdateStatusValidationFailed`	Task (Lambda)	Set job status to FAILED for validation errors
`FailValidation`	Fail	Terminal state for validation failures
`EnsureData`	Task (Lambda)	Check ArcticDB coverage, fetch gaps from exchange
`UpdateStatusRunning`	Task (Lambda)	Set job status to RUNNING before ECS launch
`RunBacktest`	Task (ECS RunTask)	Execute backtest in strategy container
`HandleTimeout`	Pass	Format timeout error for cleanup
`CleanupResources`	Task (Lambda)	Stop orphaned ECS tasks on failure
`UpdateStatusFailed`	Task (Lambda)	Set job status to FAILED
`HandleSuccess`	Parallel	Run UpdateStatusCompleted + NotifySuccess concurrently
`NotifyFailure`	Task (Lambda)	Send failure notification

State Machine Definition¶

See the canonical ASL template at infra/compute/asl_templates/backtest_workflow.json.j2.

Key configuration parameters (Jinja2 variables):

Variable	Description
`config.backtest_timeout_seconds`	Overall workflow timeout (7200s = 2 hours)
`config.heartbeat_seconds`	ECS task heartbeat interval (900s = 15 min)

Two Distinct Heartbeats

This is the Step Functions ECS task heartbeat (900s = 15 min). It is distinct from the application-level DynamoDB heartbeat sent by HealthReporter every 30 seconds (see 16-OBSERVABILITY.md Section 4). The SFN heartbeat detects hung ECS tasks that stop communicating with Step Functions; the app heartbeat monitors live trading strategy health and publishes metrics to CloudWatch.

Key Design Decisions (v11)¶

Sequential validation — ValidateStrategy runs first, then EnsureData. Earlier versions ran these in parallel, but v11 validates first to fail fast before expensive data operations.
Dynamic task definition — TaskDefinition.$: "$.task_definition" is resolved from execution input, not hardcoded. This allows different strategy containers per execution.
Explicit status updates via Lambda — Uses update-status Lambda instead of direct DynamoDB integration. This enforces state transition guards and publishes CloudWatch metrics.
Separate timeout handling — HandleTimeout (Pass state) formats the error before routing to CleanupResources, keeping timeout and task-failure paths distinct.
Parallel success handling — HandleSuccess runs UpdateStatusCompleted and NotifySuccess concurrently with independent catch-all states, so a notification failure doesn't block status updates.

Error Handling (RunBacktest)¶

RunBacktest uses Catch blocks only (no Retry) for ECS task errors:

Error	Handler	Next State
`States.TaskFailed`	Catch	CleanupResources
`States.Timeout`	Catch	HandleTimeout → CleanupResources
`States.ALL`	Catch	CleanupResources

Lambda states (ValidateStrategy, EnsureData) have Retry blocks for transient AWS errors:

Error Type	Max Attempts	Interval	Backoff
Lambda.ServiceException	3	2s	2.0x
Lambda.TooManyRequestsException	3	2s	2.0x

DLQ Handling at SQS Level

Dead letter queue handling is at the SQS level, not within the Step Functions workflow. Failed messages in tradai-backtest-queue.fifo are routed to tradai-backtest-dlq.fifo after maxReceiveCount (3) attempts. Step Functions failures (e.g., Lambda errors, ECS task crashes) are handled by Catch blocks within the workflow, not by SQS retries.

Expected Flow (Happy Path)¶

1. ValidateStrategy
   └─ Output: {valid: true, resolved_version: "1.2.0"}

2. EvaluateValidation → EnsureData

3. EnsureData
   └─ Output: {coverage: {BTC/USDT: 100%}, gaps_filled: 0}

4. UpdateStatusRunning
   └─ Sets job status to RUNNING in DynamoDB

5. RunBacktest
   └─ Duration: 10-30 minutes
   └─ Container handles MLflow logging, S3 upload

6. HandleSuccess (Parallel)
   ├─ UpdateStatusCompleted → DynamoDB updated to COMPLETED
   └─ NotifySuccess → SNS notification sent

Version History¶

Version	Key Changes
v8.0	Initial: ParallelValidation, PrepareConfig, TransformResults, direct DynamoDB
v9.2	Simplified: container handles status/MLflow/S3, removed PrepareConfig/TransformResults
v11	Current: sequential validation, EnsureData with DB-first check, explicit status Lambdas, parallel success handling

4. Retraining Workflow (MO005)¶

The retraining workflow orchestrates ML model retraining, comparing the challenger model against the champion, and optionally promoting the new version.

See the canonical ASL template at infra/compute/asl_templates/retraining_workflow.json.j2.

Workflow Diagram¶

stateDiagram-v2
    [*] --> CheckRetrainingNeeded

    CheckRetrainingNeeded --> EvaluateRetrainingNeed
    CheckRetrainingNeeded --> NotifyFailure: States.ALL (Catch)

    state EvaluateRetrainingNeed <<choice>>
    EvaluateRetrainingNeed --> RunRetraining: needs_retraining
    EvaluateRetrainingNeed --> SkipRetraining: otherwise (default)

    SkipRetraining --> [*]

    RunRetraining --> CompareModels
    RunRetraining --> NotifyFailure: States.ALL (Catch)

    CompareModels --> DecidePromotion
    CompareModels --> NotifyFailure: States.ALL (Catch)

    state DecidePromotion <<choice>>
    DecidePromotion --> PromoteModel: promote
    DecidePromotion --> KeepCurrentModel: otherwise (default)

    PromoteModel --> NotifyCompletion
    PromoteModel --> NotifyFailure: States.ALL (Catch)
    KeepCurrentModel --> NotifyCompletion

    NotifyCompletion --> [*]
    NotifyFailure --> [*]

    note right of CheckRetrainingNeeded: Lambda&#58; check drift state and schedule triggers
    note right of RunRetraining: ECS RunTask (FARGATE_SPOT)<br/>TimeoutSeconds&#58; training_timeout_seconds<br/>HeartbeatSeconds&#58; heartbeat_seconds
    note right of CompareModels: Lambda&#58; champion vs challenger metrics
    note right of PromoteModel: Lambda&#58; promote challenger to Production in MLflow

State Reference¶

State	Type	Purpose
`CheckRetrainingNeeded`	Task (Lambda)	Check drift state and schedule triggers
`EvaluateRetrainingNeed`	Choice	Route based on decision (needs_retraining or skip)
`SkipRetraining`	Pass	Terminal state when retraining not needed
`RunRetraining`	Task (ECS RunTask)	Execute model training in strategy container
`CompareModels`	Task (Lambda)	Compare champion vs challenger metrics
`DecidePromotion`	Choice	Route based on comparison (promote or keep)
`KeepCurrentModel`	Pass	Skip promotion when challenger is not better
`PromoteModel`	Task (Lambda)	Promote challenger to Production in MLflow
`NotifyCompletion`	Task (Lambda)	Send success notification
`NotifyFailure`	Task (Lambda)	Send failure notification

Template Variables¶

Variable	Description
`config.retraining_timeout_seconds`	Overall workflow timeout (10800s = 3 hours)
`config.training_timeout_seconds`	ECS training task timeout (7200s = 2 hours)
`config.heartbeat_seconds`	ECS task heartbeat interval (900s = 15 min)
`lambda_arns.check_retraining_needed`	CheckRetrainingNeeded Lambda ARN
`lambda_arns.compare_models`	CompareModels Lambda ARN
`lambda_arns.promote_model`	PromoteModel Lambda ARN
`lambda_arns.notify_completion`	NotifyCompletion Lambda ARN
`ecs.cluster_arn`	ECS cluster ARN
`ecs.subnets`	Private subnet list
`ecs.security_group_id`	ECS security group
`dynamodb_table_name`	Workflow state table
`mlflow_tracking_uri`	MLflow tracking URI

Error Handling¶

All states use Catch blocks routing to NotifyFailure on States.ALL errors. RunRetraining has no Retry -- ECS task failures go directly to NotifyFailure.

5. CloudWatch Integration¶

Metrics Published¶

Metric	Namespace	Dimensions	Unit
ExecutionStarted	TradAI/StepFunctions	WorkflowType	Count
ExecutionSucceeded	TradAI/StepFunctions	WorkflowType	Count
ExecutionFailed	TradAI/StepFunctions	WorkflowType	Count
ExecutionDuration	TradAI/StepFunctions	WorkflowType	Seconds
BacktestDuration	TradAI/Backtest	Strategy	Seconds

Note: CloudWatch alarms for Step Functions are defined in the edge stack (infra/edge/), not in the compute stack.

6. Cost Analysis¶

Step Functions Pricing (Standard)¶

Metric	Rate	Monthly Usage	Cost
State transitions	$0.025 per 1000	~6000 (50 backtests x 12 states x 10)	$0.15
Total			$0.15

Comparison: EXPRESS vs STANDARD¶

Aspect	EXPRESS	STANDARD
Max Duration	5 minutes	1 year
Cost per 1M transitions	$1.00	$25.00
State persistence	No	90 days
Suitable for backtests	NO	YES

7. Testing Workflows¶

Test Input¶

{
  "run_id": "test-001",
  "trace_id": "trace-test-001",
  "strategy_name": "MomentumStrategy",
  "strategy_version": "1.0.0",
  "task_definition": "arn:aws:ecs:eu-central-1:123456:task-definition/momentum-strategy:5",
  "experiment_name": "test-experiment",
  "symbols": ["BTC/USDT:USDT"],
  "symbols_csv": "BTC/USDT:USDT",
  "exchange": "binance",
  "timeframe": "1h",
  "start_date": "20240101",
  "end_date": "20240601",
  "config_version_id": "abc123"
}

8. Next Steps¶

Review 07-COST-ANALYSIS.md for complete cost breakdown
Review 09-PULUMI-CODE.md for Step Functions deployment code

Changelog¶

Version	Date	Changes
11.1.0	2026-03-28	Added TL;DR, changelog, and dependencies sections

Dependencies¶

If This Changes	Update This Doc
`infra/compute/asl_templates/backtest_workflow.json.j2`	Backtest workflow diagram + state reference (Section 3)
`infra/compute/asl_templates/retraining_workflow.json.j2`	Retraining workflow diagram + state reference (Section 4)
`infra/compute/modules/lambda_funcs.py` LAMBDA_CONFIGS	Lambda ARN references in template variables
`infra/compute/modules/step_functions.py`	Workflow creation, timeout, and config values (Section 2)