TradAI Error Handling & Resilience¶
Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT
1. TL;DR¶
TradAI uses a layered error handling strategy spanning application code, AWS infrastructure, and workflow orchestration:
- Application layer -- `with_retry` decorator with exponential backoff and a configurable `RetryConfig` (max attempts, backoff factor, delay caps). A `CircuitBreaker` prevents cascading failures by transitioning through CLOSED / OPEN / HALF_OPEN states.
- Unified policy -- `ResiliencePolicy` combines retry + circuit breaker into a single decorator (`@policy.protect`) with presets for HTTP, database, and external API use cases.
- SQS + DLQ -- FIFO queues with `maxReceiveCount: 3` and 14-day DLQ retention. Lambda consumers return `batchItemFailures` for partial-batch retry.
- Step Functions -- Every state has `Retry` blocks for transient Lambda errors and `Catch` blocks that route to `UpdateStatusFailed` / `CleanupResources` / `NotifyFailure`.
- Exception hierarchy -- All errors descend from `TradAIError`. Retriable exceptions (`ExternalServiceError` and subclasses) are separated from non-retriable ones (`ValidationError`, `NotFoundError`).
Golden rule
Never catch Exception broadly. Retry only ExternalServiceError subtypes. Let ValidationError and ConfigurationError fail fast.
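The rule above reduces to a type check against the hierarchy in Section 8. A minimal sketch, using stand-in exception classes (the real ones live in `tradai.common.exceptions`):

```python
# Stand-ins mirroring the hierarchy in Section 8; the real classes live in
# tradai.common.exceptions.
class TradAIError(Exception): ...
class ValidationError(TradAIError): ...        # non-retriable: fail fast
class ConfigurationError(TradAIError): ...     # non-retriable: fail fast
class ExternalServiceError(TradAIError): ...   # retriable
class DataFetchError(ExternalServiceError): ...

RETRIABLE = (ExternalServiceError,)  # mirrors RetryConfig.retriable_exceptions

def is_retriable(exc: Exception) -> bool:
    """Retry only ExternalServiceError subtypes; everything else fails fast."""
    return isinstance(exc, RETRIABLE)
```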
2. Retry Patterns¶
2.1 Decision Flow¶
```mermaid
flowchart TD
    A[Call function] --> B{Exception raised?}
    B -- No --> C[Return result]
    B -- Yes --> D{Is exception retriable?}
    D -- No --> E[Raise immediately]
    D -- Yes --> F{Attempts remaining?}
    F -- No --> G[Log exhaustion & raise last exception]
    F -- Yes --> H[Call on_retry callback]
    H --> I[Log warning with attempt/delay]
    I --> J[Sleep: delay * backoff_factor]
    J --> K[Cap delay at max_delay]
    K --> A
```
2.2 RetryConfig Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | int | 3 | Total attempts (first call + retries) |
| `backoff_factor` | float | 2.0 | Multiplier applied to the delay each iteration |
| `initial_delay` | float | 1.0 | Delay in seconds before the first retry |
| `max_delay` | float | 30.0 | Upper bound on delay (prevents runaway waits) |
| `retriable_exceptions` | tuple[type[Exception], ...] | (ExternalServiceError,) | Only these exceptions trigger retries |
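The interaction of these fields can be sketched as a delay schedule. This is an illustration of the table's documented semantics, not the library's implementation:

```python
# Sketch of the backoff schedule implied by the RetryConfig fields above
# (illustrative only; the real implementation may differ).
def delay_schedule(max_attempts: int = 3,
                   initial_delay: float = 1.0,
                   backoff_factor: float = 2.0,
                   max_delay: float = 30.0) -> list[float]:
    """Return the sleep before each retry (the first call has no delay)."""
    delays = []
    delay = initial_delay
    for _ in range(max_attempts - 1):   # max_attempts counts the first call
        delays.append(min(delay, max_delay))
        delay *= backoff_factor
    return delays
```

With the defaults, 3 attempts means sleeps of 1.0s and 2.0s between calls; `max_delay` caps long schedules at 30s.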
2.3 Decorator API¶
```python
from tradai.common.resilience import with_retry, with_retry_async

@with_retry(max_attempts=3, retriable_exceptions=(ConnectionError,))
def fetch_data():
    return requests.get("https://api.example.com/data")

@with_retry_async(max_attempts=5, initial_delay=0.5, max_delay=10.0)
async def fetch_data_async():
    return await client.get("https://api.example.com/data")
```
2.4 Manual Retry Control¶
For scenarios requiring fine-grained control, use RetryContext:
```python
from tradai.common.resilience import RetryContext

with RetryContext(max_attempts=3) as retry:
    for attempt in retry:
        try:
            result = fetch_data()
            break
        except ConnectionError as e:
            if not retry.should_retry(e):
                raise
```
Sync and Async Unification
RetryEngine internally uses SyncContext (wrapping time.sleep) and AsyncContext (wrapping asyncio.sleep) to share retry logic across both sync and async code paths, eliminating duplication.
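A hypothetical sketch of that unification: the backoff arithmetic lives in one shared generator, and only the sleep primitive differs between the two paths. Names loosely mirror the note above, but this is not the library's implementation, and the retriable set is simplified to `ConnectionError` for brevity:

```python
import asyncio
import time

class RetryEngine:
    """Illustrative shared engine: one delay schedule, two sleep primitives."""

    def __init__(self, initial_delay=1.0, backoff_factor=2.0, max_delay=30.0):
        self.initial_delay = initial_delay
        self.backoff_factor = backoff_factor
        self.max_delay = max_delay

    def delays(self, max_attempts: int):
        """Backoff schedule shared by the sync and async paths."""
        delay = self.initial_delay
        for _ in range(max_attempts - 1):
            yield min(delay, self.max_delay)
            delay *= self.backoff_factor

    def run(self, fn, max_attempts=3):
        # Sync path: wraps time.sleep
        for delay in [*self.delays(max_attempts), None]:
            try:
                return fn()
            except ConnectionError:     # simplified retriable set
                if delay is None:       # final attempt: re-raise
                    raise
                time.sleep(delay)

    async def run_async(self, fn, max_attempts=3):
        # Async path: identical logic, wraps asyncio.sleep
        for delay in [*self.delays(max_attempts), None]:
            try:
                return await fn()
            except ConnectionError:
                if delay is None:
                    raise
                await asyncio.sleep(delay)
```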
3. Circuit Breaker¶
3.1 State Machine¶
```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN : failure_count >= failure_threshold (default 5)
    OPEN --> HALF_OPEN : recovery_timeout elapsed (default 60s)
    HALF_OPEN --> CLOSED : success_count >= success_threshold (default 1)
    HALF_OPEN --> OPEN : Any failure
    CLOSED --> CLOSED : Success (resets failure_count)
    OPEN --> OPEN : Requests rejected (CircuitOpenError)
```
3.2 CircuitBreakerConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `failure_threshold` | int | 5 | Consecutive failures before the circuit opens |
| `recovery_timeout` | float | 60.0 | Seconds in OPEN before trying HALF_OPEN |
| `success_threshold` | int | 1 | Successes in HALF_OPEN needed to close the circuit |
3.3 Recorded Exceptions¶
The circuit breaker only records transient I/O exceptions by default:
- `ConnectionError`
- `TimeoutError`
- `OSError`
Programming errors (TypeError, ValueError) pass through without tripping the breaker.
3.4 Thread Safety¶
All state transitions are protected by threading.Lock. The state property checks for timeout-based OPEN to HALF_OPEN transitions on every access, so recovery is automatic.
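The state machine in 3.1 and the access-time recovery check described here can be sketched as follows. This is an illustrative reimplementation of the documented behaviour, omitting the `threading.Lock`; the real class will differ in detail:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Stand-in for tradai's CircuitOpenError (see 3.5 note below)."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 success_threshold=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.opened_at = 0.0
        self._state = State.CLOSED

    @property
    def state(self) -> State:
        # Timeout-based OPEN -> HALF_OPEN transition checked on every access
        if (self._state is State.OPEN
                and time.monotonic() - self.opened_at >= self.recovery_timeout):
            self._state = State.HALF_OPEN
            self.success_count = 0
        return self._state

    def call(self, fn):
        if self.state is State.OPEN:
            raise CircuitOpenError("circuit is open; rejecting call")
        try:
            result = fn()
        except (ConnectionError, TimeoutError, OSError):
            self._record_failure()      # only transient I/O errors recorded
            raise
        self._record_success()
        return result

    def _record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()                # any failure in HALF_OPEN reopens
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._trip()

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self._state = State.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0      # success in CLOSED resets the count

    def _trip(self):
        self._state = State.OPEN
        self.opened_at = time.monotonic()
```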
CircuitOpenError
When the circuit is OPEN, all requests are immediately rejected with CircuitOpenError (a subclass of ExternalServiceError). Callers must handle this -- typically by returning a cached value or a degraded response.
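One common way to handle this is a last-known-good cache. The sketch below uses illustrative names (`get_quote`, `_last_good`, the injected `fetch`) and a stand-in exception class; only the pattern itself comes from the note above:

```python
class CircuitOpenError(Exception):
    """Stand-in for tradai's CircuitOpenError."""

_last_good: dict[str, dict] = {}

def get_quote(symbol: str, fetch) -> dict:
    """Serve a cached value when the circuit rejects the call."""
    try:
        quote = fetch(symbol)   # assume fetch is protected by a circuit breaker
        _last_good[symbol] = quote
        return quote
    except CircuitOpenError:
        # Circuit is open: return a degraded (possibly stale) response
        return _last_good.get(symbol, {"symbol": symbol, "stale": True})
```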
4. Unified Resilience Policy¶
ResiliencePolicy combines retry and circuit breaker into a single configuration with preset profiles:
4.1 Presets¶
| Preset | Retries | Delay | Backoff | CB Threshold | CB Timeout | Use Case |
|---|---|---|---|---|---|---|
| `for_http()` | 3 | 0.5s | 2.0x | 5 | 30s | FastAPI internal calls |
| `for_database()` | 3 | 2.0s | 2.0x | 3 | 60s | ArcticDB / DynamoDB |
| `for_external_api()` | 2 | 1.0s | 3.0x | 3 | 45s | CCXT / exchange APIs |
| `retry_only(n)` | n | 1.0s | 2.0x | disabled | -- | One-off operations |
| `circuit_breaker_only(n)` | disabled | -- | -- | n | 60s | High-throughput paths |
4.2 Execution Order¶
The circuit breaker check runs first (to fail fast); the retry logic then wraps the protected call:
1. Check circuit breaker -> if OPEN, raise CircuitOpenError
2. Execute function
3. On success -> record_success() on circuit breaker
4. On retriable failure -> record_failure() + retry with backoff
5. On retry exhaustion -> raise last exception
```python
from tradai.common.resilience import ResiliencePolicy, ResilienceConfig

policy = ResiliencePolicy(ResilienceConfig.for_external_api(), name="binance")

@policy.protect
def fetch_ohlcv(symbol: str) -> dict:
    return exchange.fetch_ohlcv(symbol)
```
5. DLQ Handling¶
5.1 SQS Queue Configuration¶
| Parameter | Main Queue | Dead Letter Queue |
|---|---|---|
| Type | FIFO | FIFO |
| Content Dedup | Enabled | -- |
| Visibility Timeout | 900s (15 min) | -- |
| Message Retention | 4 days | 14 days |
| Long Polling | 20s | -- |
| Max Receive Count | 3 | -- |
5.2 Message Lifecycle¶
```mermaid
sequenceDiagram
    participant API as Backend API
    participant SQS as SQS FIFO Queue
    participant Lambda as Backtest Consumer
    participant DLQ as Dead Letter Queue
    participant Ops as Operator
    API->>SQS: Send backtest request
    SQS->>Lambda: Deliver batch of messages
    alt Processing succeeds
        Lambda->>Lambda: SQSJobProcessor.process_batch()
        Lambda-->>SQS: Message deleted (ack)
    else Individual message fails
        Lambda-->>SQS: Return batchItemFailures
        Note over SQS: Message becomes visible again<br/>after 900s timeout
        SQS->>Lambda: Re-deliver (attempt 2 of 3)
    else All 3 attempts exhausted
        SQS->>DLQ: Move message to DLQ<br/>(14-day retention)
        Ops->>DLQ: Inspect failed messages
        Ops->>SQS: Replay after fixing root cause
    end
```
5.3 Partial Batch Failure¶
The backtest-consumer Lambda uses SQS partial batch failure reporting. On failure, it returns specific messageId values in batchItemFailures, so only failed messages are retried -- successful ones in the same batch are not reprocessed.
```python
# From lambdas/backtest-consumer/handler.py
return {
    "batchItemFailures": [
        {"itemIdentifier": mid} for mid in result.failed_message_ids
    ]
}
```
Idempotency
SQSJobProcessor checks DynamoDB for existing job state before processing, ensuring that redelivered messages do not trigger duplicate Step Functions executions or ECS tasks.
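The claim-before-process pattern behind this note can be sketched generically. The names below (`process_once`, `make_memory_claim`) are hypothetical; in the real system the claim is a DynamoDB conditional write (e.g. a put guarded by `attribute_not_exists`), stood in for here by an injected callable:

```python
# Illustrative idempotency guard: run the handler only if this job_id has
# never been claimed. `claim` stands in for an atomic DynamoDB conditional put.
def process_once(job_id: str, claim, handler):
    if not claim(job_id):           # atomic "create if absent"
        return "skipped-duplicate"  # redelivered message: do nothing
    return handler(job_id)

def make_memory_claim():
    """In-memory claim store for demonstration only."""
    seen: set[str] = set()
    def claim(job_id: str) -> bool:
        if job_id in seen:
            return False
        seen.add(job_id)
        return True
    return claim
```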
6. Step Functions Error Handling¶
6.1 Retry Blocks (ASL)¶
Every Lambda invocation in the backtest workflow includes a Retry block for transient AWS errors:
| State | Retried Errors | Interval | Max Attempts | Backoff Rate |
|---|---|---|---|---|
| `ValidateStrategy` | Lambda.ServiceException, Lambda.TooManyRequestsException | 2s | 3 | 2.0 |
| `EnsureData` | Lambda.ServiceException, Lambda.TooManyRequestsException | 5s | 3 | 2.0 |
| `UpdateStatusRunning` | Lambda.ServiceException | 1s | 2 | -- |
| `CleanupResources` | States.ALL | 2s | 2 | 1.5 |
| `UpdateStatusFailed` | States.ALL | 1s | 2 | -- |
| `UpdateStatusCompleted` | States.ALL | 1s | 2 | -- |
| `NotifySuccess` / `NotifyFailure` | States.ALL | 1s | 2 | -- |
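Concretely, a row of the table above corresponds to an ASL fragment shaped roughly like this. The field names are standard Amazon States Language; the actual state definition lives in the `backtest_workflow.json.j2` template and may differ:

```json
"ValidateStrategy": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "UpdateStatusFailed"
    }
  ]
}
```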
6.2 Catch Blocks and Error Routing¶
```
ValidateStrategy   ──Catch(States.ALL)────────> UpdateStatusFailed
EnsureData         ──Catch(States.ALL)────────> UpdateStatusFailed
RunBacktest        ──Catch(States.TaskFailed)─> CleanupResources
RunBacktest        ──Catch(States.Timeout)────> HandleTimeout ──> CleanupResources
RunBacktest        ──Catch(States.ALL)────────> CleanupResources
CleanupResources   ───────────────────────────> UpdateStatusFailed
UpdateStatusFailed ───────────────────────────> NotifyFailure
UpdateStatusFailed ──Catch(States.ALL)────────> NotifyFailure  (continue even if status update fails)
```
6.3 Timeout Handling¶
| Scope | Timeout | Purpose |
|---|---|---|
| Workflow global | config.backtest_timeout_seconds | Hard cap on entire execution |
| `EnsureData` state | 300s (5 min) | Step Functions state timeout (Lambda timeout is 180s) |
| `RunBacktest` ECS task | config.backtest_timeout_seconds | ECS task execution |
| `RunBacktest` heartbeat | config.heartbeat_seconds | Detect stuck containers |
Heartbeat failures
If the ECS container does not send a heartbeat within config.heartbeat_seconds, Step Functions treats it as a States.Timeout and routes to HandleTimeout -> CleanupResources, which stops orphaned ECS tasks.
6.4 Non-Critical Failure Tolerance¶
The UpdateStatusRunning state has a special Catch block with "Comment": "Continue even if status update fails". If the DynamoDB status update fails, the workflow proceeds to RunBacktest anyway -- the ECS container will update status itself. This prevents transient DynamoDB errors from blocking the entire backtest.
7. Failure Modes Table¶
| Service | Failure Mode | Strategy | Recovery |
|---|---|---|---|
| CCXT / Exchange | Rate limit / timeout | with_retry (3 attempts, exponential backoff) + CircuitBreaker | Automatic retry; circuit opens after 5 failures, recovers after 60s |
| ArcticDB | Connection failure | ResilienceConfig.for_database() (3 retries, 2s initial delay) | Automatic retry with conservative backoff; circuit opens after 3 failures |
| DynamoDB | Throttling | Lambda retries via SQS redelivery (3 attempts) | Partial batch failure; messages retry from SQS |
| SQS Consumer | Lambda crash | SQS visibility timeout (900s) + maxReceiveCount (3) | Re-delivery up to 3 times, then DLQ |
| Step Functions | Lambda.ServiceException | ASL Retry (2-3 attempts, 1-5s interval) | Automatic retry within state machine |
| ECS Backtest | Container crash | Catch(States.TaskFailed) -> CleanupResources | Status set to FAILED, notification sent, orphan tasks stopped |
| ECS Backtest | Stuck / no heartbeat | Catch(States.Timeout) -> HandleTimeout | Cleanup + status update + failure notification |
| WebSocket Stream | Disconnection | StreamDisconnectedError -> reconnect loop | StreamReconnectExhaustedError raised after max attempts |
| S3 / AWS APIs | Transient errors | ExternalServiceError base + retry | Automatic retry via with_retry decorator |
| Strategy Validation | Invalid strategy | No retry (non-retriable ValidationError) | Immediate fail -> UpdateStatusValidationFailed -> FailValidation |
8. Exception Hierarchy¶
```
TradAIError (base)
+-- ValidationError                   # Non-retriable: bad input
+-- FeatureValidationError            # Non-retriable: ML schema mismatch
+-- NotFoundError                     # Non-retriable: missing resource
|   +-- DataNotFoundError             # Missing data in storage
+-- ConfigurationError                # Non-retriable: bad config
+-- ConflictError                     # Non-retriable: state conflict
+-- AuthenticationError               # Non-retriable: auth failure
+-- BacktestError                     # Non-retriable: execution failure
+-- TradingError                      # Non-retriable: live trading failure
+-- ExternalServiceError              # RETRIABLE: external dependency
    +-- DataFetchError                # Exchange / storage read failure
    +-- StorageError                  # Storage write failure
    +-- CircuitOpenError              # Circuit breaker is open
    +-- WebSocketError                # WebSocket base
        +-- StreamDisconnectedError          # Connection lost
        +-- StreamReconnectExhaustedError    # Max reconnects exceeded
```
Retriability boundary
The ExternalServiceError subtree is the only branch retried by default. All other TradAIError subtypes fail immediately. This is enforced by RetryConfig.retriable_exceptions defaulting to (ExternalServiceError,).
FeatureValidationError
This exception halts ML inference to prevent silent prediction failures. It carries an errors: list[str] attribute with specific mismatches (feature count, missing features, ordering). Never suppress this error -- it indicates a training/inference schema drift.
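A sketch of the contract this implies, using a stand-in class rather than the real one: schema checks accumulate into the `errors` list, and the exception is raised rather than suppressed. The `check_features` helper is hypothetical:

```python
# Stand-in for tradai's FeatureValidationError: carries an errors list with
# the specific schema mismatches.
class FeatureValidationError(Exception):
    def __init__(self, errors: list[str]):
        super().__init__(f"{len(errors)} feature schema mismatch(es)")
        self.errors = errors

def check_features(expected: list[str], actual: list[str]) -> None:
    """Hypothetical pre-inference check against the training schema."""
    errors = []
    missing = [f for f in expected if f not in actual]
    if missing:
        errors.append(f"missing features: {missing}")
    if len(actual) != len(expected):
        errors.append(f"feature count: expected {len(expected)}, got {len(actual)}")
    if not errors and actual != expected:
        errors.append("feature ordering differs from training schema")
    if errors:
        raise FeatureValidationError(errors)  # halt inference; never suppress
```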
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial creation from resilience module analysis |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
| `libs/tradai-common/src/tradai/common/resilience/` | Retry and circuit breaker patterns (Sections 2-4) |
| `libs/tradai-common/src/tradai/common/exceptions.py` | Exception hierarchy (Section 8) |
| `infra/compute/asl_templates/backtest_workflow.json.j2` Catch/Retry blocks | Step Functions error handling (Section 6) |
| `infra/foundation/modules/sqs.py` DLQ config | DLQ handling (Section 5) |