TradAI Final Architecture - Step Functions Workflows¶
Version: v11 | Date: 2026-02-21
Last synced with: `infra/compute/asl_templates/backtest_workflow.json.j2`
When to Use Step Functions¶
Step Functions is one of four execution modes for backtesting (see 05-SERVICES.md Section 6).
| Mode | Use Case | Use Step Functions? |
|---|---|---|
| local | Development/testing | NO - Docker only |
| ecs | Simple production backtests | NO - Direct ECS launch |
| sqs | High-volume with backpressure | NO - SQS → Lambda → ECS |
| stepfunctions | Complex multi-step workflows | YES |
Use Step Functions when you need:
- Data freshness validation before backtest
- Strategy validation (ECR/S3 checks)
- Multi-step workflows (data sync → backtest → analyze)
- Visual debugging and execution history
- Complex retry/catch logic

Skip Step Functions when:
- Simple backtest execution (use ECS direct or SQS mode)
- Development/testing (use local mode)
- Container handles its own status updates (v9.2 architecture)
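The mode table and the criteria above can be read as a small decision rule. The following is a minimal sketch of that rule, not the project's dispatcher: the `BacktestRequest` class and its flag names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestRequest:
    """Hypothetical request flags; the field names are illustrative only."""
    is_development: bool = False
    needs_multi_step: bool = False   # e.g. data sync -> backtest -> analyze
    needs_validation: bool = False   # data freshness or ECR/S3 strategy checks
    high_volume: bool = False        # wants SQS backpressure


def choose_execution_mode(req: BacktestRequest) -> str:
    """Return one of the four modes from the table above."""
    if req.is_development:
        return "local"
    if req.needs_multi_step or req.needs_validation:
        return "stepfunctions"
    if req.high_volume:
        return "sqs"
    return "ecs"
```

For example, `choose_execution_mode(BacktestRequest(needs_validation=True))` selects `stepfunctions`, while a request with no flags set falls through to the direct `ecs` launch.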
Workflow Overview¶
| Workflow | Type | States | Duration | Trigger |
|---|---|---|---|---|
| Backtest Workflow (v11) | STANDARD | 12 | 10-70 min | SQS → Lambda |
| Data Sync Workflow | STANDARD | 5 | 10-30 min | API Gateway |
| Deploy Workflow | STANDARD | 6 | 5-15 min | API Gateway |
Critical: All workflows use STANDARD type (not EXPRESS) because:
- Backtests can run 30-60+ minutes
- EXPRESS has a 5-minute maximum duration
- STANDARD provides execution history for 90 days
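The reasoning above reduces to one check on expected duration and the need for execution history. A minimal sketch, assuming a hypothetical helper name `workflow_type_for`:

```python
EXPRESS_MAX_SECONDS = 5 * 60  # EXPRESS executions are capped at 5 minutes


def workflow_type_for(max_duration_seconds: int, needs_history: bool = True) -> str:
    """Pick STANDARD whenever the run may exceed the EXPRESS cap or history is required."""
    if max_duration_seconds > EXPRESS_MAX_SECONDS or needs_history:
        return "STANDARD"
    return "EXPRESS"
```

With the durations in the workflow table (5 to 70 minutes), every workflow resolves to `STANDARD`.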
1. Backtest Workflow (v11)¶
The v11 workflow replaced the earlier v8.0 "full" and v9.2 "simplified" workflows with a single, unified workflow. Key changes from earlier versions:
- Removed `PrepareConfig` (ECS Task): the task definition is resolved dynamically from `$.task_definition`
- Removed `TransformResults` (Lambda): the container handles DynamoDB/MLflow/S3 directly
- Renamed `CheckDataFreshness` → `EnsureData`: it now does a DB-first coverage check and fetches only gaps
- Added `UpdateStatusRunning`: explicitly sets RUNNING before the backtest via the `update-status` Lambda
- Split `UpdateStatus` → `UpdateStatusCompleted` / `UpdateStatusFailed` / `UpdateStatusValidationFailed`
- Split `HandleError` → `HandleTimeout` (Pass) + `CleanupResources` (Lambda)
- Added `HandleSuccess` as a Parallel state (UpdateStatus + Notify run concurrently)
Workflow Diagram¶
┌──────────────────┐
│ START │
└────────┬─────────┘
│
┌────────┴────────┐
│ValidateStrategy │
│ (Lambda) │
└────────┬────────┘
│
┌────────┴────────┐
│EvaluateValidation│
│ (Choice) │
└────────┬────────┘
│
┌─────────────┴─────────────┐
│ Valid │ Invalid
▼ ▼
┌────────────────┐ ┌─────────────────────────┐
│ EnsureData │ │UpdateStatusValidationFailed│
│ (Lambda) │ │ (Lambda) │
│ DB-first check│ └────────────┬────────────┘
└───────┬──────┘ │
│ ┌────────┴────────┐
▼ │ FailValidation │
┌─────────────────┐ │ (Fail) │
│UpdateStatusRunning│ └─────────────────┘
│ (Lambda) │
└───────┬────────┘
│
┌───────┴────────┐
│ RunBacktest │
│ (ECS RunTask) │
└───────┬────────┘
│
┌────────────┼──────────────┐
│ Success │ Timeout │ TaskFailed/Other
▼ ▼ ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ HandleSuccess│ │HandleTimeout│ │CleanupResources│
│ (Parallel) │ │ (Pass) │ │ (Lambda) │
│ ┌──────────┐ │ └────┬─────┘ └──────┬───────┘
│ │UpdateStatus│ │ │ │
│ │Completed │ │ └──────┬───────┘
│ └──────────┘ │ │
│ ┌──────────┐ │ ┌────────┴────────┐
│ │NotifySuccess│ │ │UpdateStatusFailed│
│ └──────────┘ │ │ (Lambda) │
└──────┬───────┘ └────────┬────────┘
│ │
│ ┌────────┴────────┐
│ │ NotifyFailure │
│ │ (Lambda) │
│ └────────┬────────┘
│ │
└────────┬────────────┘
│
┌──────┴──────┐
│ END │
└─────────────┘
State Reference¶
| State | Type | Purpose |
|---|---|---|
| `ValidateStrategy` | Task (Lambda) | Validate strategy exists and is deployable |
| `EvaluateValidation` | Choice | Route based on validation result |
| `UpdateStatusValidationFailed` | Task (Lambda) | Set job status to FAILED for validation errors |
| `FailValidation` | Fail | Terminal state for validation failures |
| `EnsureData` | Task (Lambda) | Check ArcticDB coverage, fetch gaps from exchange |
| `UpdateStatusRunning` | Task (Lambda) | Set job status to RUNNING before ECS launch |
| `RunBacktest` | Task (ECS RunTask) | Execute backtest in strategy container |
| `HandleTimeout` | Pass | Format timeout error for cleanup |
| `CleanupResources` | Task (Lambda) | Stop orphaned ECS tasks on failure |
| `UpdateStatusFailed` | Task (Lambda) | Set job status to FAILED |
| `HandleSuccess` | Parallel | Run UpdateStatusCompleted + NotifySuccess concurrently |
| `NotifyFailure` | Task (Lambda) | Send failure notification |
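The diagram and state table can be checked against each other by writing the transitions down as data. The sketch below transcribes the edges from the diagram; the outcome labels (`ok`, `invalid`, `timeout`, `failed`) are illustrative and are not names from the ASL template.

```python
# Transition map transcribed from the workflow diagram.
# Outcome labels are hypothetical; terminal states map to an empty dict.
TRANSITIONS = {
    "ValidateStrategy": {"ok": "EvaluateValidation"},
    "EvaluateValidation": {"ok": "EnsureData", "invalid": "UpdateStatusValidationFailed"},
    "UpdateStatusValidationFailed": {"ok": "FailValidation"},
    "FailValidation": {},
    "EnsureData": {"ok": "UpdateStatusRunning"},
    "UpdateStatusRunning": {"ok": "RunBacktest"},
    "RunBacktest": {"ok": "HandleSuccess", "timeout": "HandleTimeout", "failed": "CleanupResources"},
    "HandleTimeout": {"ok": "CleanupResources"},
    "CleanupResources": {"ok": "UpdateStatusFailed"},
    "UpdateStatusFailed": {"ok": "NotifyFailure"},
    "NotifyFailure": {},
    "HandleSuccess": {},
}


def walk(outcomes: dict, start: str = "ValidateStrategy") -> list:
    """Follow transitions from `start`; states missing from `outcomes` take the "ok" edge."""
    path = [start]
    state = start
    while TRANSITIONS[state]:
        state = TRANSITIONS[state][outcomes.get(state, "ok")]
        path.append(state)
    return path
```

Calling `walk({})` yields the happy path, and `walk({"RunBacktest": "timeout"})` routes through `HandleTimeout` and `CleanupResources` before the failure status update and notification.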
State Machine Definition¶
See the canonical ASL template at infra/compute/asl_templates/backtest_workflow.json.j2.
Key configuration parameters (Jinja2 variables):
| Variable | Description |
|---|---|
| `config.backtest_timeout_seconds` | Overall workflow timeout |
| `config.heartbeat_seconds` | ECS task heartbeat interval |
| `lambda_arns.validate_strategy` | ValidateStrategy Lambda ARN |
| `lambda_arns.data_collection_proxy` | EnsureData Lambda ARN |
| `lambda_arns.update_status` | UpdateStatus Lambda ARN |
| `lambda_arns.cleanup_resources` | CleanupResources Lambda ARN |
| `lambda_arns.notify_completion` | NotifyCompletion Lambda ARN |
| `ecs.cluster_arn` | ECS cluster ARN |
| `ecs.subnets` | Private subnet list |
| `ecs.security_group_id` | ECS security group |
| `dynamodb_table_name` | Workflow state table |
| `mlflow_tracking_uri` | MLflow tracking URI |
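A missing template variable is easier to catch before rendering than in a failed deployment. The sketch below checks a render context against the variables listed in the table. It assumes the context is passed as nested dictionaries, and `missing_vars` is a hypothetical helper rather than part of the deployment code.

```python
# Variable paths copied from the table above.
REQUIRED_VARS = [
    "config.backtest_timeout_seconds",
    "config.heartbeat_seconds",
    "lambda_arns.validate_strategy",
    "lambda_arns.data_collection_proxy",
    "lambda_arns.update_status",
    "lambda_arns.cleanup_resources",
    "lambda_arns.notify_completion",
    "ecs.cluster_arn",
    "ecs.subnets",
    "ecs.security_group_id",
    "dynamodb_table_name",
    "mlflow_tracking_uri",
]


def missing_vars(context: dict) -> list:
    """Return the dotted variable paths that are absent from the render context."""
    missing = []
    for path in REQUIRED_VARS:
        node = context
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

An empty result means every variable in the table has a value; anything else lists exactly which paths still need to be supplied.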
Key Design Decisions (v11)¶
- Sequential validation: ValidateStrategy runs first, then EnsureData. Earlier versions ran these in parallel, but v11 validates first to fail fast before expensive data operations.
- Dynamic task definition: `TaskDefinition.$: "$.task_definition"` is resolved from execution input, not hardcoded. This allows different strategy containers per execution.
- Explicit status updates via Lambda: uses the `update-status` Lambda instead of direct DynamoDB integration. This enforces state transition guards and publishes CloudWatch metrics.
- Separate timeout handling: `HandleTimeout` (Pass state) formats the error before routing to `CleanupResources`, keeping timeout and task-failure paths distinct.
- Parallel success handling: `HandleSuccess` runs `UpdateStatusCompleted` and `NotifySuccess` concurrently with independent catch-all states, so a notification failure doesn't block status updates.
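To make the "state transition guards" idea concrete, here is a minimal sketch of what such a guard could look like. It is not the `update-status` Lambda's code: the `PENDING` status and the exact set of allowed transitions are assumptions, and only RUNNING, COMPLETED, and FAILED appear in this document.

```python
# Assumed transition table. PENDING is a hypothetical initial status.
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING", "FAILED"},    # FAILED covers validation failures
    "RUNNING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),                  # terminal
    "FAILED": set(),                     # terminal
}


class InvalidTransition(ValueError):
    """Raised when a status update would violate the transition table."""


def guard_transition(current: str, new: str) -> str:
    """Return `new` if the move from `current` is allowed, otherwise raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```

A guard like this is the main reason to route status writes through a Lambda: a late or duplicated update (for example COMPLETED → RUNNING) is rejected instead of silently overwriting the job record.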
Expected Flow (Happy Path)¶
1. ValidateStrategy
└─ Output: {valid: true, resolved_version: "1.2.0"}
2. EvaluateValidation → EnsureData
3. EnsureData
└─ Output: {coverage: {BTC/USDT: 100%}, gaps_filled: 0}
4. UpdateStatusRunning
└─ Sets job status to RUNNING in DynamoDB
5. RunBacktest
└─ Duration: 10-30 minutes
└─ Container handles MLflow logging, S3 upload
6. HandleSuccess (Parallel)
├─ UpdateStatusCompleted → DynamoDB updated to COMPLETED
└─ NotifySuccess → SNS notification sent
Version History¶
| Version | Key Changes |
|---|---|
| v8.0 | Initial: ParallelValidation, PrepareConfig, TransformResults, direct DynamoDB |
| v9.2 | Simplified: container handles status/MLflow/S3, removed PrepareConfig/TransformResults |
| v11 | Current: sequential validation, EnsureData with DB-first check, explicit status Lambdas, parallel success handling |
2. Error Handling Strategy¶
Retry Configuration¶
| Error Type | Max Attempts | Interval | Backoff |
|---|---|---|---|
| Lambda.ServiceException | 3 | 2s | 2.0x |
| Lambda.TooManyRequestsException | 3 | 2s | 2.0x |
| ECS.AmazonECSException | 2 | 5s | 2.0x |
| States.TaskFailed | 1 | 10s | 1.0x |
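The retry table maps naturally onto Amazon States Language `Retry` entries, where the wait before retry n is `IntervalSeconds × BackoffRate^(n-1)`. The sketch below assumes the table's "Max Attempts" column corresponds directly to the ASL `MaxAttempts` field; the helper names are illustrative.

```python
# Rows copied from the retry table: (error, max_attempts, interval_seconds, backoff_rate)
RETRY_TABLE = [
    ("Lambda.ServiceException", 3, 2, 2.0),
    ("Lambda.TooManyRequestsException", 3, 2, 2.0),
    ("ECS.AmazonECSException", 2, 5, 2.0),
    ("States.TaskFailed", 1, 10, 1.0),
]


def to_asl_retry(rows):
    """Build an ASL-style Retry list, one retrier per error type."""
    return [
        {"ErrorEquals": [error], "MaxAttempts": attempts,
         "IntervalSeconds": interval, "BackoffRate": backoff}
        for error, attempts, interval, backoff in rows
    ]


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Seconds waited before each retry: interval * backoff^(n-1) for n = 1..max_attempts."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]
```

For the Lambda rows this gives waits of 2, 4, and 8 seconds, so a persistently failing Lambda call adds at most 14 seconds of retry delay before the error propagates.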
Error Categories¶
Transient Errors (Retry):
├─ Lambda.ServiceException → Auto-retry with backoff
├─ Lambda.TooManyRequestsException → Auto-retry with backoff
├─ ECS.AmazonECSException → Auto-retry once
└─ States.Timeout → Check heartbeat, may retry
Business Errors (Fail Fast):
├─ ValidationError → Fail immediately
├─ DataStaleError → Fail immediately
└─ ConfigurationError → Fail immediately
Infrastructure Errors (Alert):
├─ States.Permissions → Alert ops team
├─ States.ResultPathMatchFailure → Alert dev team
└─ Unknown errors → DLQ + alert
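The three categories above amount to a lookup from error name to handling action. A minimal sketch, assuming hypothetical action labels (`retry`, `fail_fast`, `alert`, `dlq_and_alert`) and a separate `unknown` category for anything not listed:

```python
# Error names copied from the category lists above.
ERROR_CATEGORIES = {
    "transient": {
        "Lambda.ServiceException",
        "Lambda.TooManyRequestsException",
        "ECS.AmazonECSException",
        "States.Timeout",
    },
    "business": {"ValidationError", "DataStaleError", "ConfigurationError"},
    "infrastructure": {"States.Permissions", "States.ResultPathMatchFailure"},
}

# Hypothetical action labels, one per category.
ACTIONS = {"transient": "retry", "business": "fail_fast", "infrastructure": "alert"}


def classify_error(error_name: str) -> tuple:
    """Return (category, action); unlisted errors go to the DLQ and raise an alert."""
    for category, names in ERROR_CATEGORIES.items():
        if error_name in names:
            return category, ACTIONS[category]
    return "unknown", "dlq_and_alert"
```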
Dead Letter Queue Integration¶
{
  "PublishToDLQ": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
      "QueueUrl": "${DeadLetterQueueUrl}",
      "MessageBody": {
        "workflow": "backtest",
        "run_id.$": "$.run_id",
        "error.$": "$.error",
        "timestamp.$": "$$.State.EnteredTime",
        "execution_arn.$": "$$.Execution.Id"
      }
    },
    "Next": "NotifyFailure"
  }
}
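On the consuming side, the message body arrives as a JSON string with the five fields set in `MessageBody`. The sketch below parses and checks such a body; `parse_dlq_message` is a hypothetical helper, and how messages are actually received from the queue is left out.

```python
import json

# Field names taken from the MessageBody in the PublishToDLQ state.
REQUIRED_FIELDS = ("workflow", "run_id", "error", "timestamp", "execution_arn")


def parse_dlq_message(body: str) -> dict:
    """Decode a DLQ message body and verify that all expected fields are present."""
    message = json.loads(body)
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        raise ValueError(f"DLQ message is missing fields: {missing}")
    return message
```

Keeping `execution_arn` in the message lets an operator jump from a DLQ entry straight to the failed execution's history.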
3. Data Sync Workflow (unchanged)¶
{
  "Comment": "TradAI Data Sync Workflow",
  "StartAt": "ValidateRequest",
  "States": {
    "ValidateRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-validate-data-request",
      "Next": "FetchAndStoreData"
    },
    "FetchAndStoreData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "TimeoutSeconds": 1800,
      "HeartbeatSeconds": 300,
      "Parameters": {
        "Cluster": "tradai-cluster",
        "TaskDefinition": "tradai-data-collection-task",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": ["${PrivateSubnet1}", "${PrivateSubnet2}"],
            "SecurityGroups": ["${ECSSecurityGroup}"],
            "AssignPublicIp": "DISABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": [
            {
              "Name": "data-collection",
              "Environment": [
                {"Name": "COMMAND", "Value": "full-sync"},
                {"Name": "SYMBOLS", "Value.$": "States.JsonToString($.symbols)"},
                {"Name": "TIMEFRAME", "Value.$": "$.timeframe"}
              ]
            }
          ]
        }
      },
      "ResultPath": "$.sync_result",
      "Next": "ValidateDataQuality"
    },
    "ValidateDataQuality": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-validate-data-quality",
      "Next": "NotifyCompletion"
    },
    "NotifyCompletion": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:${AccountId}:function:tradai-notify-completion",
      "Parameters": {
        "workflow": "data-sync",
        "status": "COMPLETED"
      },
      "End": true
    }
  }
}
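Because `SYMBOLS` is passed through `States.JsonToString`, the container receives the symbol list as a JSON-encoded string and has to decode it. The sketch below shows one way the container side could read the three overrides; `parse_sync_env` is a hypothetical function, not the data-collection service's actual entry point.

```python
import json


def parse_sync_env(env: dict) -> dict:
    """Read COMMAND, SYMBOLS, and TIMEFRAME from an environment mapping.

    SYMBOLS is expected to be a JSON array of strings, as produced by
    States.JsonToString($.symbols) in the state machine.
    """
    symbols = json.loads(env["SYMBOLS"])
    if not isinstance(symbols, list) or not all(isinstance(s, str) for s in symbols):
        raise ValueError("SYMBOLS must be a JSON array of strings")
    return {
        "command": env["COMMAND"],
        "symbols": symbols,
        "timeframe": env["TIMEFRAME"],
    }
```

In a real container this would be called with `os.environ`; passing a plain dictionary keeps the function easy to test.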
4. CloudWatch Integration¶
Metrics Published¶
| Metric | Namespace | Dimensions | Unit |
|---|---|---|---|
| ExecutionStarted | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionSucceeded | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionFailed | TradAI/StepFunctions | WorkflowType | Count |
| ExecutionDuration | TradAI/StepFunctions | WorkflowType | Seconds |
| BacktestDuration | TradAI/Backtest | Strategy | Seconds |
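Each row of the metrics table corresponds to one CloudWatch datum. The sketch below builds the request payload in the shape accepted by the CloudWatch `PutMetricData` API (for example via boto3's `put_metric_data(**payload)`); the builder function itself is hypothetical and makes no AWS call.

```python
def build_metric_payload(metric_name: str, workflow_type: str, value: float,
                         unit: str = "Count",
                         namespace: str = "TradAI/StepFunctions") -> dict:
    """Return a PutMetricData-style payload with a single WorkflowType dimension."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Dimensions": [{"Name": "WorkflowType", "Value": workflow_type}],
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }
```

For example, `build_metric_payload("ExecutionDuration", "backtest", 1800, unit="Seconds")` describes one 30-minute backtest execution.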
CloudWatch Alarms¶
Alarms:
  - Name: backtest-failures-high
    Metric: ExecutionFailed
    Threshold: 5 in 1 hour
    Action: SNS notification
  - Name: backtest-duration-long
    Metric: ExecutionDuration
    Threshold: 3600 seconds (1 hour)
    Action: SNS warning
  - Name: dlq-messages
    Metric: ApproximateNumberOfMessagesVisible
    Threshold: "> 0"
    Action: SNS alert
5. Cost Analysis¶
Step Functions Pricing (Standard)¶
| Metric | Rate | Monthly Usage | Cost |
|---|---|---|---|
| State transitions | $0.025 per 1000 | ~4500 (50 backtests × 9 states × 10) | $0.11 |
| Total | | | $0.11 |
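The figure in the table follows directly from the per-transition rate. The sketch below reproduces the arithmetic using the factors as written in the table (50 × 9 × 10); the source does not say what the final ×10 factor represents, so it is carried through unchanged.

```python
RATE_PER_1000_TRANSITIONS = 0.025  # USD, Standard workflows


def monthly_cost(transitions: int) -> float:
    """Cost in USD for a given number of Standard state transitions."""
    return transitions * RATE_PER_1000_TRANSITIONS / 1000


# Factors copied from the table: 50 backtests x 9 states x 10
transitions = 50 * 9 * 10
cost = monthly_cost(transitions)
print(f"{transitions} transitions -> ${cost:.2f}")
```

At this volume the orchestration cost is negligible next to compute, which is why the much higher per-transition price of STANDARD is not a concern here.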
Comparison: EXPRESS vs STANDARD¶
| Aspect | EXPRESS | STANDARD |
|---|---|---|
| Max Duration | 5 minutes | 1 year |
| Cost | $1.00 per 1M requests, plus duration charges | $25.00 per 1M state transitions |
| State persistence | No | 90 days |
| Suitable for backtests | NO | YES |
6. Testing Workflows¶
Test Input¶
{
  "run_id": "test-001",
  "trace_id": "trace-test-001",
  "strategy_name": "MomentumStrategy",
  "strategy_version": "1.0.0",
  "task_definition": "arn:aws:ecs:us-east-1:123456:task-definition/momentum-strategy:5",
  "experiment_name": "test-experiment",
  "symbols": ["BTC/USDT:USDT"],
  "symbols_csv": "BTC/USDT:USDT",
  "exchange": "binance",
  "timeframe": "1h",
  "start_date": "20240101",
  "end_date": "20240601",
  "config_version_id": "abc123"
}
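Before starting an execution with an input like the one above, it can help to sanity-check the payload locally. The sketch below is one possible pre-flight check, under three assumptions that the source does not confirm: every key in the sample is required, `symbols_csv` is the comma-joined form of `symbols`, and dates use the `YYYYMMDD` format shown.

```python
from datetime import datetime

# Keys taken from the sample input; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "run_id", "trace_id", "strategy_name", "strategy_version", "task_definition",
    "experiment_name", "symbols", "symbols_csv", "exchange", "timeframe",
    "start_date", "end_date", "config_version_id",
}


def validate_execution_input(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the input looks sane."""
    missing = sorted(REQUIRED_KEYS - payload.keys())
    if missing:
        return [f"missing keys: {missing}"]
    errors = []
    if payload["symbols_csv"] != ",".join(payload["symbols"]):
        errors.append("symbols_csv does not match symbols")
    try:
        start = datetime.strptime(payload["start_date"], "%Y%m%d")
        end = datetime.strptime(payload["end_date"], "%Y%m%d")
    except ValueError:
        errors.append("dates must be in YYYYMMDD format")
    else:
        if start >= end:
            errors.append("start_date must be before end_date")
    return errors
```

The sample input passes with no errors; swapping the two dates or dropping `task_definition` produces a specific message instead of a failed execution.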
Expected Flow (Happy Path)¶
1. ValidateStrategy: {valid: true, resolved_version: "1.0.0"}
2. EvaluateValidation → EnsureData
3. EnsureData: {coverage: 100%, gaps_filled: 0}
4. UpdateStatusRunning → DynamoDB status = RUNNING
5. RunBacktest (10-30 minutes, container handles MLflow/S3)
6. HandleSuccess (Parallel)
├─ UpdateStatusCompleted → DynamoDB status = COMPLETED
└─ NotifySuccess → SNS notification sent
Next Steps¶
- Review 07-COST-ANALYSIS.md for complete cost breakdown
- Review 09-PULUMI-CODE.md for Step Functions deployment code