TradAI Final Architecture - Step Functions Workflows¶
Version: 11.1.0 | Date: 2026-03-28 | Status: CURRENT
Last synced with:
infra/compute/asl_templates/backtest_workflow.json.j2,infra/compute/asl_templates/retraining_workflow.json.j2TL;DR: 2 Step Functions workflows: Backtest (12 states, 7200s timeout) and Retraining (10 states). Both use Jinja2 ASL templates with Lambda ARN injection. DLQ handling is at SQS level, not workflow level.
1. When to Use Step Functions¶
Step Functions is one of four execution modes for backtesting (see 05-SERVICES.md Section 4).
| Mode | Use Case | Use Step Functions? |
|---|---|---|
| local | Development/testing | NO - Docker only |
| ecs | Simple production backtests | NO - Direct ECS launch |
| sqs | High-volume with backpressure | NO - SQS → Lambda → ECS |
| stepfunctions | Complex multi-step workflows | YES |
Use Step Functions when you need: - Data freshness validation before backtest - Strategy validation (ECR/S3 checks) - Multi-step workflows (data sync → backtest → analyze) - Visual debugging and execution history - Complex retry/catch logic
Skip Step Functions when: - Simple backtest execution (use ECS direct or SQS mode) - Development/testing (use local mode) - Container handles its own status updates (v9.2 architecture)
2. Workflow Overview¶
| Workflow | Type | States | Timeout | Trigger |
|---|---|---|---|---|
| Backtest Workflow (v11) | STANDARD | 12 | 7200s (2 hours) | SQS -> Lambda |
| Retraining Workflow (MO005) | STANDARD | 10 | 10800s (3 hours) | EventBridge / Manual |
Backtest Workflow Timeout: 7200s (2 hours)
The backtest workflow has a hard timeout of 7200 seconds. Backtests exceeding this limit (e.g., multi-year multi-pair runs) will be caught by States.Timeout and routed to HandleTimeout -> CleanupResources -> UpdateStatusFailed. Monitor ExecutionDuration CloudWatch metrics to detect backtests approaching the limit. Consider increasing the timeout for ML-heavy strategies.
Critical: All workflows use STANDARD type (not EXPRESS) because: - Backtests can run 30-60+ minutes - EXPRESS has 5-minute maximum duration - STANDARD provides execution history for 90 days
3. Backtest Workflow (v11)¶
The v11 workflow replaced the earlier v8.0 "full" and v9.2 "simplified" workflows with a single, unified workflow. Key changes from earlier versions:
- Removed
PrepareConfig(ECS Task) — task definition resolved dynamically from$.task_definition - Removed
TransformResults(Lambda) — container handles DynamoDB/MLflow/S3 directly - Renamed
CheckDataFreshness→EnsureData— now does DB-first coverage check, fetches only gaps - Added
UpdateStatusRunning— explicitly sets RUNNING before backtest viaupdate-statusLambda - Split
UpdateStatus→UpdateStatusCompleted/UpdateStatusFailed/UpdateStatusValidationFailed - Split
HandleError→HandleTimeout(Pass) +CleanupResources(Lambda) - Added
HandleSuccessas a Parallel state (UpdateStatus + Notify run concurrently)
Workflow Diagram¶
Mermaid Diagram¶
stateDiagram-v2
[*] --> ValidateStrategy
ValidateStrategy --> EvaluateValidation
state EvaluateValidation <<choice>>
EvaluateValidation --> EnsureData: Valid
EvaluateValidation --> UpdateStatusValidationFailed: Invalid
UpdateStatusValidationFailed --> FailValidation
FailValidation --> [*]
EnsureData --> UpdateStatusRunning
UpdateStatusRunning --> RunBacktest
state RunBacktest_result <<choice>>
RunBacktest --> RunBacktest_result
RunBacktest_result --> HandleSuccess: Success
RunBacktest_result --> HandleTimeout: Timeout
RunBacktest_result --> CleanupResources: TaskFailed
HandleTimeout --> CleanupResources
CleanupResources --> UpdateStatusFailed
UpdateStatusFailed --> NotifyFailure
NotifyFailure --> [*]
state HandleSuccess {
[*] --> UpdateStatusCompleted
[*] --> NotifySuccess
}
HandleSuccess --> [*] State Reference¶
| State | Type | Purpose |
|---|---|---|
ValidateStrategy | Task (Lambda) | Validate strategy exists and is deployable |
EvaluateValidation | Choice | Route based on validation result |
UpdateStatusValidationFailed | Task (Lambda) | Set job status to FAILED for validation errors |
FailValidation | Fail | Terminal state for validation failures |
EnsureData | Task (Lambda) | Check ArcticDB coverage, fetch gaps from exchange |
UpdateStatusRunning | Task (Lambda) | Set job status to RUNNING before ECS launch |
RunBacktest | Task (ECS RunTask) | Execute backtest in strategy container |
HandleTimeout | Pass | Format timeout error for cleanup |
CleanupResources | Task (Lambda) | Stop orphaned ECS tasks on failure |
UpdateStatusFailed | Task (Lambda) | Set job status to FAILED |
HandleSuccess | Parallel | Run UpdateStatusCompleted + NotifySuccess concurrently |
NotifyFailure | Task (Lambda) | Send failure notification |
State Machine Definition¶
See the canonical ASL template at infra/compute/asl_templates/backtest_workflow.json.j2.
Key configuration parameters (Jinja2 variables):
| Variable | Description |
|---|---|
config.backtest_timeout_seconds | Overall workflow timeout (7200s = 2 hours) |
config.heartbeat_seconds | ECS task heartbeat interval (900s = 15 min) |
Two Distinct Heartbeats
This is the Step Functions ECS task heartbeat (900s = 15 min). It is distinct from the application-level DynamoDB heartbeat sent by HealthReporter every 30 seconds (see 16-OBSERVABILITY.md Section 4). The SFN heartbeat detects hung ECS tasks that stop communicating with Step Functions; the app heartbeat monitors live trading strategy health and publishes metrics to CloudWatch.
| lambda_arns.validate_strategy | ValidateStrategy Lambda ARN | | lambda_arns.data_collection_proxy | EnsureData Lambda ARN | | lambda_arns.update_status | UpdateStatus Lambda ARN | | lambda_arns.cleanup_resources | CleanupResources Lambda ARN | | lambda_arns.notify_completion | NotifyCompletion Lambda ARN | | ecs.cluster_arn | ECS cluster ARN | | ecs.subnets | Private subnet list | | ecs.security_group_id | ECS security group | | dynamodb_table_name | Workflow state table | | mlflow_tracking_uri | MLflow tracking URI | | arctic_bucket | S3 bucket for ArcticDB data | | arctic_library | ArcticDB library name (default: ohlcv) |
Key Design Decisions (v11)¶
-
Sequential validation — ValidateStrategy runs first, then EnsureData. Earlier versions ran these in parallel, but v11 validates first to fail fast before expensive data operations.
-
Dynamic task definition —
TaskDefinition.$: "$.task_definition"is resolved from execution input, not hardcoded. This allows different strategy containers per execution. -
Explicit status updates via Lambda — Uses
update-statusLambda instead of direct DynamoDB integration. This enforces state transition guards and publishes CloudWatch metrics. -
Separate timeout handling —
HandleTimeout(Pass state) formats the error before routing toCleanupResources, keeping timeout and task-failure paths distinct. -
Parallel success handling —
HandleSuccessrunsUpdateStatusCompletedandNotifySuccessconcurrently with independent catch-all states, so a notification failure doesn't block status updates.
Error Handling (RunBacktest)¶
RunBacktest uses Catch blocks only (no Retry) for ECS task errors:
| Error | Handler | Next State |
|---|---|---|
States.TaskFailed | Catch | CleanupResources |
States.Timeout | Catch | HandleTimeout → CleanupResources |
States.ALL | Catch | CleanupResources |
Lambda states (ValidateStrategy, EnsureData) have Retry blocks for transient AWS errors:
| Error Type | Max Attempts | Interval | Backoff |
|---|---|---|---|
| Lambda.ServiceException | 3 | 2s | 2.0x |
| Lambda.TooManyRequestsException | 3 | 2s | 2.0x |
DLQ Handling at SQS Level
Dead letter queue handling is at the SQS level, not within the Step Functions workflow. Failed messages in tradai-backtest-queue.fifo are routed to tradai-backtest-dlq.fifo after maxReceiveCount (3) attempts. Step Functions failures (e.g., Lambda errors, ECS task crashes) are handled by Catch blocks within the workflow, not by SQS retries.
Expected Flow (Happy Path)¶
1. ValidateStrategy
└─ Output: {valid: true, resolved_version: "1.2.0"}
2. EvaluateValidation → EnsureData
3. EnsureData
└─ Output: {coverage: {BTC/USDT: 100%}, gaps_filled: 0}
4. UpdateStatusRunning
└─ Sets job status to RUNNING in DynamoDB
5. RunBacktest
└─ Duration: 10-30 minutes
└─ Container handles MLflow logging, S3 upload
6. HandleSuccess (Parallel)
├─ UpdateStatusCompleted → DynamoDB updated to COMPLETED
└─ NotifySuccess → SNS notification sent
Version History¶
| Version | Key Changes |
|---|---|
| v8.0 | Initial: ParallelValidation, PrepareConfig, TransformResults, direct DynamoDB |
| v9.2 | Simplified: container handles status/MLflow/S3, removed PrepareConfig/TransformResults |
| v11 | Current: sequential validation, EnsureData with DB-first check, explicit status Lambdas, parallel success handling |
4. Retraining Workflow (MO005)¶
The retraining workflow orchestrates ML model retraining, comparing the challenger model against the champion, and optionally promoting the new version.
See the canonical ASL template at infra/compute/asl_templates/retraining_workflow.json.j2.
Workflow Diagram¶
stateDiagram-v2
[*] --> CheckRetrainingNeeded
CheckRetrainingNeeded --> EvaluateRetrainingNeed
CheckRetrainingNeeded --> NotifyFailure: States.ALL (Catch)
state EvaluateRetrainingNeed <<choice>>
EvaluateRetrainingNeed --> RunRetraining: needs_retraining
EvaluateRetrainingNeed --> SkipRetraining: otherwise (default)
SkipRetraining --> [*]
RunRetraining --> CompareModels
RunRetraining --> NotifyFailure: States.ALL (Catch)
CompareModels --> DecidePromotion
CompareModels --> NotifyFailure: States.ALL (Catch)
state DecidePromotion <<choice>>
DecidePromotion --> PromoteModel: promote
DecidePromotion --> KeepCurrentModel: otherwise (default)
PromoteModel --> NotifyCompletion
PromoteModel --> NotifyFailure: States.ALL (Catch)
KeepCurrentModel --> NotifyCompletion
NotifyCompletion --> [*]
NotifyFailure --> [*]
note right of CheckRetrainingNeeded: Lambda: check drift state and schedule triggers
note right of RunRetraining: ECS RunTask (FARGATE_SPOT)<br/>TimeoutSeconds: training_timeout_seconds<br/>HeartbeatSeconds: heartbeat_seconds
note right of CompareModels: Lambda: champion vs challenger metrics
note right of PromoteModel: Lambda: promote challenger to Production in MLflow State Reference¶
| State | Type | Purpose |
|---|---|---|
CheckRetrainingNeeded | Task (Lambda) | Check drift state and schedule triggers |
EvaluateRetrainingNeed | Choice | Route based on decision (needs_retraining or skip) |
SkipRetraining | Pass | Terminal state when retraining not needed |
RunRetraining | Task (ECS RunTask) | Execute model training in strategy container |
CompareModels | Task (Lambda) | Compare champion vs challenger metrics |
DecidePromotion | Choice | Route based on comparison (promote or keep) |
KeepCurrentModel | Pass | Skip promotion when challenger is not better |
PromoteModel | Task (Lambda) | Promote challenger to Production in MLflow |
NotifyCompletion | Task (Lambda) | Send success notification |
NotifyFailure | Task (Lambda) | Send failure notification |
Template Variables¶
| Variable | Description |
|---|---|
config.retraining_timeout_seconds | Overall workflow timeout (10800s = 3 hours) |
config.training_timeout_seconds | ECS training task timeout (7200s = 2 hours) |
config.heartbeat_seconds | ECS task heartbeat interval (900s = 15 min) |
lambda_arns.check_retraining_needed | CheckRetrainingNeeded Lambda ARN |
lambda_arns.compare_models | CompareModels Lambda ARN |
lambda_arns.promote_model | PromoteModel Lambda ARN |
lambda_arns.notify_completion | NotifyCompletion Lambda ARN |
ecs.cluster_arn | ECS cluster ARN |
ecs.subnets | Private subnet list |
ecs.security_group_id | ECS security group |
dynamodb_table_name | Workflow state table |
mlflow_tracking_uri | MLflow tracking URI |
Error Handling¶
All states use Catch blocks routing to NotifyFailure on States.ALL errors. RunRetraining has no Retry -- ECS task failures go directly to NotifyFailure.
5. CloudWatch Integration¶
Metrics Published¶
| Metric | Namespace | Dimensions | Unit |
|---|---|---|---|
| ExecutionStarted | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionSucceeded | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionFailed | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionDuration | TradAI/StepFunctions | WorkflowType | Seconds |
| BacktestDuration | TradAI/Backtest | Strategy | Seconds |
Note: CloudWatch alarms for Step Functions are defined in the edge stack (infra/edge/), not in the compute stack.
6. Cost Analysis¶
Step Functions Pricing (Standard)¶
| Metric | Rate | Monthly Usage | Cost |
|---|---|---|---|
| State transitions | $0.025 per 1000 | ~6000 (50 backtests x 12 states x 10) | $0.15 |
| Total | $0.15 |
Comparison: EXPRESS vs STANDARD¶
| Aspect | EXPRESS | STANDARD |
|---|---|---|
| Max Duration | 5 minutes | 1 year |
| Cost per 1M transitions | $1.00 | $25.00 |
| State persistence | No | 90 days |
| Suitable for backtests | NO | YES |
7. Testing Workflows¶
Test Input¶
{
"run_id": "test-001",
"trace_id": "trace-test-001",
"strategy_name": "MomentumStrategy",
"strategy_version": "1.0.0",
"task_definition": "arn:aws:ecs:eu-central-1:123456:task-definition/momentum-strategy:5",
"experiment_name": "test-experiment",
"symbols": ["BTC/USDT:USDT"],
"symbols_csv": "BTC/USDT:USDT",
"exchange": "binance",
"timeframe": "1h",
"start_date": "20240101",
"end_date": "20240601",
"config_version_id": "abc123"
}
8. Next Steps¶
- Review 07-COST-ANALYSIS.md for complete cost breakdown
- Review 09-PULUMI-CODE.md for Step Functions deployment code
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 11.1.0 | 2026-03-28 | Added TL;DR, changelog, and dependencies sections |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
infra/compute/asl_templates/backtest_workflow.json.j2 | Backtest workflow diagram + state reference (Section 3) |
infra/compute/asl_templates/retraining_workflow.json.j2 | Retraining workflow diagram + state reference (Section 4) |
infra/compute/modules/lambda_funcs.py LAMBDA_CONFIGS | Lambda ARN references in template variables |
infra/compute/modules/step_functions.py | Workflow creation, timeout, and config values (Section 2) |