TradAI Error Handling & Resilience¶
Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT
1. TL;DR¶
TradAI uses a layered error handling strategy spanning application code, AWS infrastructure, and workflow orchestration:
- Application layer -- `with_retry` decorator with exponential backoff and a configurable `RetryConfig` (max attempts, backoff factor, delay caps). A `CircuitBreaker` prevents cascading failures by transitioning through CLOSED / OPEN / HALF_OPEN states.
- Unified policy -- `ResiliencePolicy` combines retry + circuit breaker into a single decorator (`@policy.protect`) with presets for HTTP, database, and external API use cases.
- SQS + DLQ -- FIFO queues with `maxReceiveCount: 3` and 14-day DLQ retention. Lambda consumers return `batchItemFailures` for partial-batch retry.
- Step Functions -- Every state has `Retry` blocks for transient Lambda errors and `Catch` blocks that route to `UpdateStatusFailed` / `CleanupResources` / `NotifyFailure`.
- Exception hierarchy -- All errors descend from `TradAIError`. Retriable exceptions (`ExternalServiceError` and subclasses) are separated from non-retriable ones (`ValidationError`, `NotFoundError`).
Golden rule
Never catch Exception broadly. Retry only ExternalServiceError subtypes. Let ValidationError and ConfigurationError fail fast.
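The rule above reduces to a type check against the hierarchy in Section 8. A minimal sketch, using stand-in exception classes (the real ones live in `tradai.common.exceptions`):

```python
# Stand-ins mirroring the hierarchy in Section 8; the real classes live in
# tradai.common.exceptions.
class TradAIError(Exception): ...
class ValidationError(TradAIError): ...        # non-retriable: fail fast
class ConfigurationError(TradAIError): ...     # non-retriable: fail fast
class ExternalServiceError(TradAIError): ...   # retriable
class DataFetchError(ExternalServiceError): ...

RETRIABLE = (ExternalServiceError,)  # mirrors RetryConfig.retriable_exceptions

def is_retriable(exc: Exception) -> bool:
    """Retry only ExternalServiceError subtypes; everything else fails fast."""
    return isinstance(exc, RETRIABLE)
```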
2. Retry Patterns¶
2.1 Decision Flow¶
```mermaid
flowchart TD
    A[Call function] --> B{Exception raised?}
    B -- No --> C[Return result]
    B -- Yes --> D{Is exception retriable?}
    D -- No --> E[Raise immediately]
    D -- Yes --> F{Attempts remaining?}
    F -- No --> G[Log exhaustion & raise last exception]
    F -- Yes --> H[Call on_retry callback]
    H --> I[Log warning with attempt/delay]
    I --> J[Sleep: delay * backoff_factor]
    J --> K[Cap delay at max_delay]
    K --> A
```
2.2 RetryConfig Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | int | 3 | Total attempts (first call + retries) |
| `backoff_factor` | float | 2.0 | Multiplier applied to the delay each iteration |
| `initial_delay` | float | 1.0 | Delay in seconds before the first retry |
| `max_delay` | float | 30.0 | Upper bound on delay (prevents runaway waits) |
| `retriable_exceptions` | tuple[type[Exception], ...] | (ExternalServiceError,) | Only these exceptions trigger retries |
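The interaction of these fields can be sketched as a delay schedule. This is an illustration of the table's documented semantics, not the library's implementation:

```python
# Sketch of the backoff schedule implied by the RetryConfig fields above
# (illustrative only; the real implementation may differ).
def delay_schedule(max_attempts: int = 3,
                   initial_delay: float = 1.0,
                   backoff_factor: float = 2.0,
                   max_delay: float = 30.0) -> list[float]:
    """Return the sleep before each retry (the first call has no delay)."""
    delays = []
    delay = initial_delay
    for _ in range(max_attempts - 1):   # max_attempts counts the first call
        delays.append(min(delay, max_delay))
        delay *= backoff_factor
    return delays
```

With the defaults, 3 attempts means sleeps of 1.0s and 2.0s between calls; `max_delay` caps long schedules at 30s.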
2.3 Decorator API¶
```python
from tradai.common.resilience import with_retry, with_retry_async

@with_retry(max_attempts=3, retriable_exceptions=(ConnectionError,))
def fetch_data():
    return requests.get("https://api.example.com/data")

@with_retry_async(max_attempts=5, initial_delay=0.5, max_delay=10.0)
async def fetch_data_async():
    return await client.get("https://api.example.com/data")
```
2.4 Manual Retry Control¶
For scenarios requiring fine-grained control, use RetryContext:
```python
from tradai.common.resilience import RetryContext

with RetryContext(max_attempts=3) as retry:
    for attempt in retry:
        try:
            result = fetch_data()
            break
        except ConnectionError as e:
            if not retry.should_retry(e):
                raise
```
Sync and Async Unification
RetryEngine internally uses SyncContext (wrapping time.sleep) and AsyncContext (wrapping asyncio.sleep) to share retry logic across both sync and async code paths, eliminating duplication.
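A hypothetical sketch of that unification: the backoff arithmetic lives in one shared generator, and only the sleep primitive differs between the two paths. Names loosely mirror the note above, but this is not the library's implementation, and the retriable set is simplified to `ConnectionError` for brevity:

```python
import asyncio
import time

class RetryEngine:
    """Illustrative shared engine: one delay schedule, two sleep primitives."""

    def __init__(self, initial_delay=1.0, backoff_factor=2.0, max_delay=30.0):
        self.initial_delay = initial_delay
        self.backoff_factor = backoff_factor
        self.max_delay = max_delay

    def delays(self, max_attempts: int):
        """Backoff schedule shared by the sync and async paths."""
        delay = self.initial_delay
        for _ in range(max_attempts - 1):
            yield min(delay, self.max_delay)
            delay *= self.backoff_factor

    def run(self, fn, max_attempts=3):
        # Sync path: wraps time.sleep
        for delay in [*self.delays(max_attempts), None]:
            try:
                return fn()
            except ConnectionError:     # simplified retriable set
                if delay is None:       # final attempt: re-raise
                    raise
                time.sleep(delay)

    async def run_async(self, fn, max_attempts=3):
        # Async path: identical logic, wraps asyncio.sleep
        for delay in [*self.delays(max_attempts), None]:
            try:
                return await fn()
            except ConnectionError:
                if delay is None:
                    raise
                await asyncio.sleep(delay)
```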
3. Circuit Breaker¶
3.1 State Machine¶
```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN : failure_count >= failure_threshold (default 5)
    OPEN --> HALF_OPEN : recovery_timeout elapsed (default 60s)
    HALF_OPEN --> CLOSED : success_count >= success_threshold (default 1)
    HALF_OPEN --> OPEN : Any failure
    CLOSED --> CLOSED : Success (resets failure_count)
    OPEN --> OPEN : Requests rejected (CircuitOpenError)
```
3.2 CircuitBreakerConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `failure_threshold` | int | 5 | Consecutive failures before the circuit opens |
| `recovery_timeout` | float | 60.0 | Seconds in OPEN before trying HALF_OPEN |
| `success_threshold` | int | 1 | Successes in HALF_OPEN needed to close the circuit |
3.3 Recorded Exceptions¶
The circuit breaker only records transient I/O exceptions by default:
- `ConnectionError`
- `TimeoutError`
- `OSError`
Programming errors (TypeError, ValueError) pass through without tripping the breaker.
3.4 Thread Safety¶
All state transitions are protected by threading.Lock. The state property checks for timeout-based OPEN to HALF_OPEN transitions on every access, so recovery is automatic.
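The state machine in 3.1 and the access-time recovery check described here can be sketched as follows. This is an illustrative reimplementation of the documented behaviour, omitting the `threading.Lock`; the real class will differ in detail:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Stand-in for tradai's CircuitOpenError (see 3.5 note below)."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 success_threshold=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.opened_at = 0.0
        self._state = State.CLOSED

    @property
    def state(self) -> State:
        # Timeout-based OPEN -> HALF_OPEN transition checked on every access
        if (self._state is State.OPEN
                and time.monotonic() - self.opened_at >= self.recovery_timeout):
            self._state = State.HALF_OPEN
            self.success_count = 0
        return self._state

    def call(self, fn):
        if self.state is State.OPEN:
            raise CircuitOpenError("circuit is open; rejecting call")
        try:
            result = fn()
        except (ConnectionError, TimeoutError, OSError):
            self._record_failure()      # only transient I/O errors recorded
            raise
        self._record_success()
        return result

    def _record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()                # any failure in HALF_OPEN reopens
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._trip()

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self._state = State.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0      # success in CLOSED resets the count

    def _trip(self):
        self._state = State.OPEN
        self.opened_at = time.monotonic()
```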
CircuitOpenError
When the circuit is OPEN, all requests are immediately rejected with CircuitOpenError (a subclass of ExternalServiceError). Callers must handle this -- typically by returning a cached value or a degraded response.
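One common way to handle this is a last-known-good cache. The sketch below uses illustrative names (`get_quote`, `_last_good`, the injected `fetch`) and a stand-in exception class; only the pattern itself comes from the note above:

```python
class CircuitOpenError(Exception):
    """Stand-in for tradai's CircuitOpenError."""

_last_good: dict[str, dict] = {}

def get_quote(symbol: str, fetch) -> dict:
    """Serve a cached value when the circuit rejects the call."""
    try:
        quote = fetch(symbol)   # assume fetch is protected by a circuit breaker
        _last_good[symbol] = quote
        return quote
    except CircuitOpenError:
        # Circuit is open: return a degraded (possibly stale) response
        return _last_good.get(symbol, {"symbol": symbol, "stale": True})
```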
4. Unified Resilience Policy¶
ResiliencePolicy combines retry and circuit breaker into a single configuration with preset profiles:
4.1 Presets¶
| Preset | Retries | Delay | Backoff | CB Threshold | CB Timeout | Use Case |
|---|---|---|---|---|---|---|
| `for_http()` | 3 | 0.5s | 2.0x | 5 | 30s | FastAPI internal calls |
| `for_database()` | 3 | 2.0s | 2.0x | 3 | 60s | ArcticDB / DynamoDB |
| `for_external_api()` | 2 | 1.0s | 3.0x | 3 | 45s | CCXT / exchange APIs |
| `retry_only(n)` | n | 1.0s | 2.0x | disabled | -- | One-off operations |
| `circuit_breaker_only(n)` | disabled | -- | -- | n | 60s | High-throughput paths |
4.2 Execution Order¶
The circuit breaker check runs first (to fail fast); the retry logic then wraps the protected call:
1. Check circuit breaker -> if OPEN, raise CircuitOpenError
2. Execute function
3. On success -> record_success() on circuit breaker
4. On retriable failure -> record_failure() + retry with backoff
5. On retry exhaustion -> raise last exception
```python
from tradai.common.resilience import ResiliencePolicy, ResilienceConfig

policy = ResiliencePolicy(ResilienceConfig.for_external_api(), name="binance")

@policy.protect
def fetch_ohlcv(symbol: str) -> dict:
    return exchange.fetch_ohlcv(symbol)
```
5. DLQ Handling¶
5.1 SQS Queue Configuration¶
| Parameter | Main Queue | Dead Letter Queue |
|---|---|---|
| Type | FIFO | FIFO |
| Content Dedup | Enabled | -- |
| Visibility Timeout | 900s (15 min) | -- |
| Message Retention | 4 days | 14 days |
| Long Polling | 20s | -- |
| Max Receive Count | 3 | -- |
5.2 Message Lifecycle¶
```mermaid
sequenceDiagram
    participant API as Backend API
    participant SQS as SQS FIFO Queue
    participant Lambda as Backtest Consumer
    participant DLQ as Dead Letter Queue
    participant Ops as Operator
    API->>SQS: Send backtest request
    SQS->>Lambda: Deliver batch of messages
    alt Processing succeeds
        Lambda->>Lambda: SQSJobProcessor.process_batch()
        Lambda-->>SQS: Message deleted (ack)
    else Individual message fails
        Lambda-->>SQS: Return batchItemFailures
        Note over SQS: Message becomes visible again<br/>after 900s timeout
        SQS->>Lambda: Re-deliver (attempt 2 of 3)
    else All 3 attempts exhausted
        SQS->>DLQ: Move message to DLQ<br/>(14-day retention)
        Ops->>DLQ: Inspect failed messages
        Ops->>SQS: Replay after fixing root cause
    end
```
5.3 Partial Batch Failure¶
The backtest-consumer Lambda uses SQS partial batch failure reporting. On failure, it returns specific messageId values in batchItemFailures, so only failed messages are retried -- successful ones in the same batch are not reprocessed.
```python
# From lambdas/backtest-consumer/handler.py
return {
    "batchItemFailures": [
        {"itemIdentifier": mid} for mid in result.failed_message_ids
    ]
}
```
Idempotency
SQSJobProcessor checks DynamoDB for existing job state before processing, ensuring that redelivered messages do not trigger duplicate Step Functions executions or ECS tasks.
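The claim-before-process pattern behind this note can be sketched generically. The names below (`process_once`, `make_memory_claim`) are hypothetical; in the real system the claim is a DynamoDB conditional write (e.g. a put guarded by `attribute_not_exists`), stood in for here by an injected callable:

```python
# Illustrative idempotency guard: run the handler only if this job_id has
# never been claimed. `claim` stands in for an atomic DynamoDB conditional put.
def process_once(job_id: str, claim, handler):
    if not claim(job_id):           # atomic "create if absent"
        return "skipped-duplicate"  # redelivered message: do nothing
    return handler(job_id)

def make_memory_claim():
    """In-memory claim store for demonstration only."""
    seen: set[str] = set()
    def claim(job_id: str) -> bool:
        if job_id in seen:
            return False
        seen.add(job_id)
        return True
    return claim
```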
6. Step Functions Error Handling¶
6.1 Retry Blocks (ASL)¶
Every Lambda invocation in the backtest workflow includes a Retry block for transient AWS errors:
| State | Retried Errors | Interval | Max Attempts | Backoff Rate |
|---|---|---|---|---|
| `ValidateStrategy` | Lambda.ServiceException, Lambda.TooManyRequestsException | 2s | 3 | 2.0 |
| `EnsureData` | Lambda.ServiceException, Lambda.TooManyRequestsException | 5s | 3 | 2.0 |
| `UpdateStatusRunning` | Lambda.ServiceException | 1s | 2 | -- |
| `CleanupResources` | States.ALL | 2s | 2 | 1.5 |
| `UpdateStatusFailed` | States.ALL | 1s | 2 | -- |
| `UpdateStatusCompleted` | States.ALL | 1s | 2 | -- |
| `NotifySuccess` / `NotifyFailure` | States.ALL | 1s | 2 | -- |
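Concretely, a row of the table above corresponds to an ASL fragment shaped roughly like this. The field names are standard Amazon States Language; the actual state definition lives in the `backtest_workflow.json.j2` template and may differ:

```json
"ValidateStrategy": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "UpdateStatusFailed"
    }
  ]
}
```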
6.2 Catch Blocks and Error Routing¶
```
ValidateStrategy   ──Catch(States.ALL)────────> UpdateStatusFailed
EnsureData         ──Catch(States.ALL)────────> UpdateStatusFailed
RunBacktest        ──Catch(States.TaskFailed)─> CleanupResources
RunBacktest        ──Catch(States.Timeout)────> HandleTimeout ──> CleanupResources
RunBacktest        ──Catch(States.ALL)────────> CleanupResources
CleanupResources   ───────────────────────────> UpdateStatusFailed
UpdateStatusFailed ───────────────────────────> NotifyFailure
UpdateStatusFailed ──Catch(States.ALL)────────> NotifyFailure  (continue even if status update fails)
```
6.3 Timeout Handling¶
| Scope | Timeout | Purpose |
|---|---|---|
| Workflow global | config.backtest_timeout_seconds | Hard cap on entire execution |
| `EnsureData` state | 300s (5 min) | Step Functions state timeout (Lambda timeout is 180s) |
| `RunBacktest` ECS task | config.backtest_timeout_seconds | ECS task execution |
| `RunBacktest` heartbeat | config.heartbeat_seconds | Detect stuck containers |
Heartbeat failures
If the ECS container does not send a heartbeat within config.heartbeat_seconds, Step Functions treats it as a States.Timeout and routes to HandleTimeout -> CleanupResources, which stops orphaned ECS tasks.
6.4 Non-Critical Failure Tolerance¶
The UpdateStatusRunning state has a special Catch block with "Comment": "Continue even if status update fails". If the DynamoDB status update fails, the workflow proceeds to RunBacktest anyway -- the ECS container will update status itself. This prevents transient DynamoDB errors from blocking the entire backtest.
7. Failure Modes Table¶
| Service | Failure Mode | Strategy | Recovery |
|---|---|---|---|
| CCXT / Exchange | Rate limit / timeout | with_retry (3 attempts, exponential backoff) + CircuitBreaker | Automatic retry; circuit opens after 5 failures, recovers after 60s |
| ArcticDB | Connection failure | ResilienceConfig.for_database() (3 retries, 2s initial delay) | Automatic retry with conservative backoff; circuit opens after 3 failures |
| DynamoDB | Throttling | Lambda retries via SQS redelivery (3 attempts) | Partial batch failure; messages retry from SQS |
| SQS Consumer | Lambda crash | SQS visibility timeout (900s) + maxReceiveCount (3) | Re-delivery up to 3 times, then DLQ |
| Step Functions | Lambda.ServiceException | ASL Retry (2-3 attempts, 1-5s interval) | Automatic retry within state machine |
| ECS Backtest | Container crash | Catch(States.TaskFailed) -> CleanupResources | Status set to FAILED, notification sent, orphan tasks stopped |
| ECS Backtest | Stuck / no heartbeat | Catch(States.Timeout) -> HandleTimeout | Cleanup + status update + failure notification |
| WebSocket Stream | Disconnection | StreamDisconnectedError -> reconnect loop | StreamReconnectExhaustedError raised after max attempts |
| S3 / AWS APIs | Transient errors | ExternalServiceError base + retry | Automatic retry via with_retry decorator |
| Strategy Validation | Invalid strategy | No retry (non-retriable ValidationError) | Immediate fail -> UpdateStatusValidationFailed -> FailValidation |
8. Exception Hierarchy¶
```
TradAIError (base)
+-- ValidationError                   # Non-retriable: bad input
+-- FeatureValidationError            # Non-retriable: ML schema mismatch
+-- NotFoundError                     # Non-retriable: missing resource
|   +-- DataNotFoundError             # Missing data in storage
+-- ConfigurationError                # Non-retriable: bad config
+-- ConflictError                     # Non-retriable: state conflict
+-- AuthenticationError               # Non-retriable: auth failure
+-- BacktestError                     # Non-retriable: execution failure
+-- TradingError                      # Non-retriable: live trading failure
+-- ExternalServiceError              # RETRIABLE: external dependency
    +-- DataFetchError                # Exchange / storage read failure
    +-- StorageError                  # Storage write failure
    +-- CircuitOpenError              # Circuit breaker is open
    +-- WebSocketError                # WebSocket base
        +-- StreamDisconnectedError          # Connection lost
        +-- StreamReconnectExhaustedError    # Max reconnects exceeded
```
Retriability boundary
The ExternalServiceError subtree is the only branch retried by default. All other TradAIError subtypes fail immediately. This is enforced by RetryConfig.retriable_exceptions defaulting to (ExternalServiceError,).
FeatureValidationError
This exception halts ML inference to prevent silent prediction failures. It carries an errors: list[str] attribute with specific mismatches (feature count, missing features, ordering). Never suppress this error -- it indicates a training/inference schema drift.
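A sketch of the contract this implies, using a stand-in class rather than the real one: schema checks accumulate into the `errors` list, and the exception is raised rather than suppressed. The `check_features` helper is hypothetical:

```python
# Stand-in for tradai's FeatureValidationError: carries an errors list with
# the specific schema mismatches.
class FeatureValidationError(Exception):
    def __init__(self, errors: list[str]):
        super().__init__(f"{len(errors)} feature schema mismatch(es)")
        self.errors = errors

def check_features(expected: list[str], actual: list[str]) -> None:
    """Hypothetical pre-inference check against the training schema."""
    errors = []
    missing = [f for f in expected if f not in actual]
    if missing:
        errors.append(f"missing features: {missing}")
    if len(actual) != len(expected):
        errors.append(f"feature count: expected {len(expected)}, got {len(actual)}")
    if not errors and actual != expected:
        errors.append("feature ordering differs from training schema")
    if errors:
        raise FeatureValidationError(errors)  # halt inference; never suppress
```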
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial creation from resilience module analysis |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
| `libs/tradai-common/src/tradai/common/resilience/` | Retry and circuit breaker patterns (Sections 2-4) |
| `libs/tradai-common/src/tradai/common/exceptions.py` | Exception hierarchy (Section 8) |
| `infra/compute/asl_templates/backtest_workflow.json.j2` Catch/Retry blocks | Step Functions error handling (Section 6) |
| `infra/foundation/modules/sqs.py` DLQ config | DLQ handling (Section 5) |