
TradAI Error Handling & Resilience

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT


1. TL;DR

TradAI uses a layered error handling strategy spanning application code, AWS infrastructure, and workflow orchestration:

  • Application layer -- with_retry decorator with exponential backoff and configurable RetryConfig (max attempts, backoff factor, delay caps). A CircuitBreaker prevents cascading failures by transitioning through CLOSED / OPEN / HALF_OPEN states.
  • Unified policy -- ResiliencePolicy combines retry + circuit breaker into a single decorator (@policy.protect) with presets for HTTP, database, and external API use cases.
  • SQS + DLQ -- FIFO queues with maxReceiveCount: 3 and 14-day DLQ retention. Lambda consumers return batchItemFailures for partial-batch retry.
  • Step Functions -- Every state has Retry blocks for transient Lambda errors and Catch blocks that route to UpdateStatusFailed / CleanupResources / NotifyFailure.
  • Exception hierarchy -- All errors descend from TradAIError. Retriable exceptions (ExternalServiceError and subclasses) are separated from non-retriable ones (ValidationError, NotFoundError).

Golden rule

Never catch Exception broadly. Retry only ExternalServiceError subtypes. Let ValidationError and ConfigurationError fail fast.


2. Retry Patterns

2.1 Decision Flow

flowchart TD
    A[Call function] --> B{Exception raised?}
    B -- No --> C[Return result]
    B -- Yes --> D{Is exception retriable?}
    D -- No --> E[Raise immediately]
    D -- Yes --> F{Attempts remaining?}
    F -- No --> G[Log exhaustion & raise last exception]
    F -- Yes --> H[Call on_retry callback]
    H --> I[Log warning with attempt/delay]
    I --> J[Sleep: delay * backoff_factor]
    J --> K[Cap delay at max_delay]
    K --> A
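The flow above can be sketched as a plain retry loop. This is an illustrative stand-in, not the real `with_retry` implementation (which also logs and fires the `on_retry` callback); built-in exceptions are used for `retriable` so the sketch is self-contained, whereas the real default is `(ExternalServiceError,)`.

```python
import time

def retry_call(fn, *, max_attempts=3, initial_delay=1.0, backoff_factor=2.0,
               max_delay=30.0, retriable=(ConnectionError, TimeoutError)):
    """Sketch of the decision flow: retriable errors back off, others raise at once."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise  # attempts exhausted: re-raise the last exception
            time.sleep(delay)
            delay = min(delay * backoff_factor, max_delay)  # cap at max_delay
```

Note that non-retriable exceptions (here anything outside `retriable`) propagate immediately, matching the "Raise immediately" branch of the flowchart.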

2.2 RetryConfig Fields

| Field | Type | Default | Description |
|---|---|---|---|
| max_attempts | int | 3 | Total attempts (first call + retries) |
| backoff_factor | float | 2.0 | Multiplier applied to delay each iteration |
| initial_delay | float | 1.0 | Delay in seconds before first retry |
| max_delay | float | 30.0 | Upper bound on delay (prevents runaway waits) |
| retriable_exceptions | tuple[type[Exception], ...] | (ExternalServiceError,) | Only these exceptions trigger retries |

2.3 Decorator API

import requests

from tradai.common.resilience import with_retry, with_retry_async

@with_retry(max_attempts=3, retriable_exceptions=(ConnectionError,))
def fetch_data():
    return requests.get("https://api.example.com/data")

@with_retry_async(max_attempts=5, initial_delay=0.5, max_delay=10.0)
async def fetch_data_async():
    # `client` is any async HTTP client instance (e.g. an httpx.AsyncClient)
    return await client.get("https://api.example.com/data")

2.4 Manual Retry Control

For scenarios requiring fine-grained control, use RetryContext:

from tradai.common.resilience import RetryContext

with RetryContext(max_attempts=3) as retry:
    for attempt in retry:
        try:
            result = fetch_data()
            break
        except ConnectionError as e:
            if not retry.should_retry(e):
                raise

Sync and Async Unification

RetryEngine internally uses SyncContext (wrapping time.sleep) and AsyncContext (wrapping asyncio.sleep) to share retry logic across both sync and async code paths, eliminating duplication.
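One way to picture this unification (hypothetical names; the actual RetryEngine internals may differ): the backoff schedule is pure and shared, and only the sleep primitive differs between the two contexts.

```python
import asyncio
import time

def backoff_delays(initial: float = 1.0, factor: float = 2.0, cap: float = 30.0):
    """Pure generator of capped exponential delays -- the shared retry logic."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)

def sync_wait(delays) -> None:
    """SyncContext-style wrapper around time.sleep."""
    time.sleep(next(delays))

async def async_wait(delays) -> None:
    """AsyncContext-style wrapper around asyncio.sleep."""
    await asyncio.sleep(next(delays))
```

Because the delay computation never sleeps itself, the same generator can drive both `time.sleep` and `asyncio.sleep` call sites without duplicated backoff arithmetic.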


3. Circuit Breaker

3.1 State Machine

stateDiagram-v2
    [*] --> CLOSED

    CLOSED --> OPEN : failure_count >= failure_threshold (default 5)
    OPEN --> HALF_OPEN : recovery_timeout elapsed (default 60s)
    HALF_OPEN --> CLOSED : success_count >= success_threshold (default 1)
    HALF_OPEN --> OPEN : Any failure

    CLOSED --> CLOSED : Success (resets failure_count)
    OPEN --> OPEN : Requests rejected (CircuitOpenError)
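A minimal sketch of this state machine (illustrative only; the real CircuitBreaker takes a config object and richer bookkeeping, and `CircuitOpen` stands in for the real `CircuitOpenError`):

```python
import threading
import time

class CircuitOpen(Exception):
    """Local stand-in for CircuitOpenError."""

class MiniBreaker:
    """Sketch of the CLOSED / OPEN / HALF_OPEN machine with documented defaults."""
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self._failures = 0
        self._successes = 0
        self._state = "CLOSED"
        self._opened_at = 0.0
        self._lock = threading.Lock()

    @property
    def state(self) -> str:
        with self._lock:
            # Automatic OPEN -> HALF_OPEN transition checked on every access.
            if self._state == "OPEN" and time.monotonic() - self._opened_at >= self.recovery_timeout:
                self._state = "HALF_OPEN"
                self._successes = 0
            return self._state

    def call(self, fn):
        if self.state == "OPEN":
            raise CircuitOpen("circuit is open; failing fast")
        try:
            result = fn()
        except (ConnectionError, TimeoutError, OSError):
            # Only transient I/O errors are recorded; programming errors pass through.
            with self._lock:
                self._failures += 1
                if self._state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                    self._state = "OPEN"
                    self._opened_at = time.monotonic()
            raise
        with self._lock:
            if self._state == "HALF_OPEN":
                self._successes += 1
                if self._successes >= self.success_threshold:
                    self._state = "CLOSED"
                    self._failures = 0
            else:
                self._failures = 0  # success in CLOSED resets the count
        return result
```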

3.2 CircuitBreakerConfig

| Field | Type | Default | Description |
|---|---|---|---|
| failure_threshold | int | 5 | Consecutive failures before circuit opens |
| recovery_timeout | float | 60.0 | Seconds in OPEN before trying HALF_OPEN |
| success_threshold | int | 1 | Successes in HALF_OPEN needed to close circuit |

3.3 Recorded Exceptions

The circuit breaker only records transient I/O exceptions by default:

  • ConnectionError
  • TimeoutError
  • OSError

Programming errors (TypeError, ValueError) pass through without tripping the breaker.

3.4 Thread Safety

All state transitions are protected by threading.Lock. The state property checks for timeout-based OPEN to HALF_OPEN transitions on every access, so recovery is automatic.

CircuitOpenError

When the circuit is OPEN, all requests are immediately rejected with CircuitOpenError (a subclass of ExternalServiceError). Callers must handle this -- typically by returning a cached value or a degraded response.
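A typical degraded-response pattern looks like the sketch below (`CircuitOpenError` is stubbed locally here so the example is self-contained; in application code it comes from the tradai exception hierarchy):

```python
class CircuitOpenError(Exception):
    """Local stand-in; the real class subclasses ExternalServiceError."""

_CACHE: dict[str, dict] = {}

def get_ticker(symbol: str, fetch) -> dict:
    """Serve fresh data when possible; fall back to the last cached value when the circuit is open."""
    try:
        _CACHE[symbol] = fetch(symbol)
    except CircuitOpenError:
        if symbol not in _CACHE:
            raise  # no degraded value available; propagate to the caller
    return _CACHE[symbol]
```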


4. Unified Resilience Policy

ResiliencePolicy combines retry and circuit breaker into a single configuration with preset profiles:

4.1 Presets

| Preset | Retries | Delay | Backoff | CB Threshold | CB Timeout | Use Case |
|---|---|---|---|---|---|---|
| for_http() | 3 | 0.5s | 2.0x | 5 | 30s | FastAPI internal calls |
| for_database() | 3 | 2.0s | 2.0x | 3 | 60s | ArcticDB / DynamoDB |
| for_external_api() | 2 | 1.0s | 3.0x | 3 | 45s | CCXT / exchange APIs |
| retry_only(n) | n | 1.0s | 2.0x | disabled | -- | One-off operations |
| circuit_breaker_only(n) | disabled | -- | -- | n | 60s | High-throughput paths |

4.2 Execution Order

Circuit breaker check runs first (fail fast), then retry logic wraps the call:

1. Check circuit breaker -> if OPEN, raise CircuitOpenError
2. Execute function
3. On success -> record_success() on circuit breaker
4. On retriable failure -> record_failure() + retry with backoff
5. On retry exhaustion -> raise last exception

from tradai.common.resilience import ResiliencePolicy, ResilienceConfig

policy = ResiliencePolicy(ResilienceConfig.for_external_api(), name="binance")

@policy.protect
def fetch_ohlcv(symbol: str) -> dict:
    return exchange.fetch_ohlcv(symbol)

5. DLQ Handling

5.1 SQS Queue Configuration

| Parameter | Main Queue | Dead Letter Queue |
|---|---|---|
| Type | FIFO | FIFO |
| Content Dedup | Enabled | -- |
| Visibility Timeout | 900s (15 min) | -- |
| Message Retention | 4 days | 14 days |
| Long Polling | 20s | -- |
| Max Receive Count | 3 | -- |

5.2 Message Lifecycle

sequenceDiagram
    participant API as Backend API
    participant SQS as SQS FIFO Queue
    participant Lambda as Backtest Consumer
    participant DLQ as Dead Letter Queue
    participant Ops as Operator

    API->>SQS: Send backtest request
    SQS->>Lambda: Deliver batch of messages

    alt Processing succeeds
        Lambda->>Lambda: SQSJobProcessor.process_batch()
        Lambda-->>SQS: Message deleted (ack)
    else Individual message fails
        Lambda-->>SQS: Return batchItemFailures
        Note over SQS: Message becomes visible again<br/>after 900s timeout
        SQS->>Lambda: Re-deliver (attempt 2 of 3)
    else All 3 attempts exhausted
        SQS->>DLQ: Move message to DLQ<br/>(14-day retention)
        Ops->>DLQ: Inspect failed messages
        Ops->>SQS: Replay after fixing root cause
    end

5.3 Partial Batch Failure

The backtest-consumer Lambda uses SQS partial batch failure reporting. On failure, it returns specific messageId values in batchItemFailures, so only failed messages are retried -- successful ones in the same batch are not reprocessed.

# From lambdas/backtest-consumer/handler.py
return {
    "batchItemFailures": [
        {"itemIdentifier": mid} for mid in result.failed_message_ids
    ]
}

Idempotency

SQSJobProcessor checks DynamoDB for existing job state before processing, ensuring that redelivered messages do not trigger duplicate Step Functions executions or ECS tasks.
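The guard can be pictured as a conditional "claim" write. The sketch below uses an in-memory stand-in; the real implementation checks a DynamoDB table, conceptually equivalent to a conditional put that fails when the job key already exists. All names here are illustrative, not the actual SQSJobProcessor API.

```python
class DuplicateJob(Exception):
    """Raised when a redelivered message refers to a job already claimed."""

class JobStore:
    """In-memory stand-in for the DynamoDB job-state table."""
    def __init__(self):
        self._jobs: dict[str, str] = {}

    def claim(self, job_id: str) -> None:
        # Mirrors a conditional put: succeed only if the item does not exist yet.
        if job_id in self._jobs:
            raise DuplicateJob(job_id)
        self._jobs[job_id] = "PROCESSING"

def process_message(store: JobStore, job_id: str) -> str:
    try:
        store.claim(job_id)
    except DuplicateJob:
        return "skipped"  # redelivery: do not start a second execution
    # ... start Step Functions execution / ECS task here ...
    return "started"
```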


6. Step Functions Error Handling

6.1 Retry Blocks (ASL)

Every Lambda invocation in the backtest workflow includes a Retry block for transient AWS errors:

| State | Retried Errors | Interval | Max Attempts | Backoff Rate |
|---|---|---|---|---|
| ValidateStrategy | Lambda.ServiceException, Lambda.TooManyRequestsException | 2s | 3 | 2.0 |
| EnsureData | Lambda.ServiceException, Lambda.TooManyRequestsException | 5s | 3 | 2.0 |
| UpdateStatusRunning | Lambda.ServiceException | 1s | 2 | -- |
| CleanupResources | States.ALL | 2s | 2 | 1.5 |
| UpdateStatusFailed | States.ALL | 1s | 2 | -- |
| UpdateStatusCompleted | States.ALL | 1s | 2 | -- |
| NotifySuccess / NotifyFailure | States.ALL | 1s | 2 | -- |
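As a shape reference, a Retry/Catch pair matching the ValidateStrategy row might look like this in ASL (illustrative fragment only, not the literal contents of backtest_workflow.json.j2):

```json
"ValidateStrategy": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "UpdateStatusFailed"
    }
  ],
  "Next": "EnsureData"
}
```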

6.2 Catch Blocks and Error Routing

ValidateStrategy ──Catch(States.ALL)──> UpdateStatusFailed
EnsureData ──Catch(States.ALL)──> UpdateStatusFailed
RunBacktest ──Catch(States.TaskFailed)──> CleanupResources
RunBacktest ──Catch(States.Timeout)──> HandleTimeout ──> CleanupResources
RunBacktest ──Catch(States.ALL)──> CleanupResources
CleanupResources ──────────────────────> UpdateStatusFailed
UpdateStatusFailed ──────────────────> NotifyFailure
UpdateStatusFailed ──Catch(States.ALL)──> NotifyFailure  (continue even if status update fails)

6.3 Timeout Handling

| Scope | Timeout | Purpose |
|---|---|---|
| Workflow global | config.backtest_timeout_seconds | Hard cap on entire execution |
| EnsureData state | 300s (5 min) | Step Functions state timeout (Lambda timeout is 180s) |
| RunBacktest ECS task | config.backtest_timeout_seconds | ECS task execution |
| RunBacktest heartbeat | config.heartbeat_seconds | Detect stuck containers |

Heartbeat failures

If the ECS container does not send a heartbeat within config.heartbeat_seconds, Step Functions treats it as a States.Timeout and routes to HandleTimeout -> CleanupResources, which stops orphaned ECS tasks.

6.4 Non-Critical Failure Tolerance

The UpdateStatusRunning state has a special Catch block with "Comment": "Continue even if status update fails". If the DynamoDB status update fails, the workflow proceeds to RunBacktest anyway -- the ECS container will update status itself. This prevents transient DynamoDB errors from blocking the entire backtest.
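Such a tolerant Catch block is typically expressed by capturing the error into the state output and continuing. The fragment below is illustrative (the Comment text comes from the template as described above; the ResultPath value and other fields are assumptions):

```json
"UpdateStatusRunning": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Next": "RunBacktest",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.statusUpdateError",
      "Next": "RunBacktest",
      "Comment": "Continue even if status update fails"
    }
  ]
}
```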


7. Failure Modes Table

| Service | Failure Mode | Strategy | Recovery |
|---|---|---|---|
| CCXT / Exchange | Rate limit / timeout | with_retry (3 attempts, exponential backoff) + CircuitBreaker | Automatic retry; circuit opens after 5 failures, recovers after 60s |
| ArcticDB | Connection failure | ResilienceConfig.for_database() (3 retries, 2s initial delay) | Automatic retry with conservative backoff; circuit opens after 3 failures |
| DynamoDB | Throttling | Lambda retries via SQS redelivery (3 attempts) | Partial batch failure; messages retry from SQS |
| SQS Consumer | Lambda crash | SQS visibility timeout (900s) + maxReceiveCount (3) | Re-delivery up to 3 times, then DLQ |
| Step Functions | Lambda.ServiceException | ASL Retry (2-3 attempts, 1-5s interval) | Automatic retry within state machine |
| ECS Backtest | Container crash | Catch(States.TaskFailed) -> CleanupResources | Status set to FAILED, notification sent, orphan tasks stopped |
| ECS Backtest | Stuck / no heartbeat | Catch(States.Timeout) -> HandleTimeout | Cleanup + status update + failure notification |
| WebSocket Stream | Disconnection | StreamDisconnectedError -> reconnect loop | StreamReconnectExhaustedError raised after max attempts |
| S3 / AWS APIs | Transient errors | ExternalServiceError base + retry | Automatic retry via with_retry decorator |
| Strategy Validation | Invalid strategy | No retry (non-retriable ValidationError) | Immediate fail -> UpdateStatusValidationFailed -> FailValidation |

8. Exception Hierarchy

TradAIError (base)
+-- ValidationError                    # Non-retriable: bad input
+-- FeatureValidationError             # Non-retriable: ML schema mismatch
+-- NotFoundError                      # Non-retriable: missing resource
|   +-- DataNotFoundError              # Missing data in storage
+-- ConfigurationError                 # Non-retriable: bad config
+-- ConflictError                      # Non-retriable: state conflict
+-- AuthenticationError                # Non-retriable: auth failure
+-- BacktestError                      # Non-retriable: execution failure
+-- TradingError                       # Non-retriable: live trading failure
+-- ExternalServiceError               # RETRIABLE: external dependency
    +-- DataFetchError                 # Exchange / storage read failure
    +-- StorageError                   # Storage write failure
    +-- CircuitOpenError               # Circuit breaker is open
    +-- WebSocketError                 # WebSocket base
        +-- StreamDisconnectedError    # Connection lost
        +-- StreamReconnectExhaustedError  # Max reconnects exceeded

Retriability boundary

The ExternalServiceError subtree is the only branch retried by default. All other TradAIError subtypes fail immediately. This is enforced by RetryConfig.retriable_exceptions defaulting to (ExternalServiceError,).

FeatureValidationError

This exception halts ML inference to prevent silent prediction failures. It carries an errors: list[str] attribute with specific mismatches (feature count, missing features, ordering). Never suppress this error -- it indicates a training/inference schema drift.
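The kind of pre-inference check that raises it might look like the sketch below (local stand-in class and hypothetical function name; the real validation logic and field names may differ):

```python
class FeatureValidationError(Exception):
    """Local stand-in; like the real class, it carries an errors: list[str] attribute."""
    def __init__(self, errors: list[str]):
        super().__init__("; ".join(errors))
        self.errors = errors

def check_feature_schema(expected: list[str], actual: list[str]) -> None:
    """Collect every mismatch so the error reports all schema drift at once."""
    errors: list[str] = []
    if len(actual) != len(expected):
        errors.append(f"feature count {len(actual)} != expected {len(expected)}")
    missing = [f for f in expected if f not in actual]
    if missing:
        errors.append(f"missing features: {missing}")
    if not missing and len(actual) == len(expected) and actual != expected:
        errors.append("feature ordering differs from training schema")
    if errors:
        raise FeatureValidationError(errors)
```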


Changelog

| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial creation from resilience module analysis |

Dependencies

| If This Changes | Update This Doc |
|---|---|
| libs/tradai-common/src/tradai/common/resilience/ | Retry, circuit breaker, and policy patterns (Sections 2-4) |
| libs/tradai-common/src/tradai/common/exceptions.py | Exception hierarchy (Section 8) |
| infra/compute/asl_templates/backtest_workflow.json.j2 (Catch/Retry blocks) | Step Functions error handling (Section 6) |
| infra/foundation/modules/sqs.py (DLQ config) | DLQ handling (Section 5) |