TradAI State Machines
Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT
1. TL;DR
TradAI has five state machines governing different lifecycle concerns:
| State Machine | Enum | States | Storage |
|---|---|---|---|
| Backtest Job | JobStatus | 5 | tradai-workflow-state-{env} |
| Trading Session | TradingStatus | 6 | tradai-trading-state-{env} |
| Strategy Promotion | ModelStage | 4 | MLflow Model Registry |
| ML Retraining | RetrainingStatus | 9 | tradai-retraining-state-{env} |
| Config Version | ConfigVersionStatus | 3 | tradai-config-versions-{env} |
All state enums are (str, Enum) subclasses defined in libs/tradai-common/src/tradai/common/entities/. Transitions are enforced through can_transition_to() guards, is_terminal properties, and immutable-update with_status() methods.
2. Backtest Job Lifecycle
Source: tradai.common.entities.aws.JobStatus
Orchestrator: BacktestOrchestrationService (backend)
States
| State | Value | Terminal | Description |
|---|---|---|---|
| PENDING | "pending" | No | Job created, waiting to execute |
| RUNNING | "running" | No | Executor picked up the job |
| COMPLETED | "completed" | Yes | Backtest finished successfully |
| FAILED | "failed" | Yes | Execution or submission error |
| CANCELLED | "cancelled" | Yes | User-initiated cancellation |
Transition Diagram
stateDiagram-v2
[*] --> PENDING : submit_backtest()
PENDING --> RUNNING : Executor starts task
PENDING --> FAILED : Submission error
PENDING --> CANCELLED : cancel_backtest()
RUNNING --> COMPLETED : Backtest finishes OK
RUNNING --> FAILED : Runtime error / timeout
RUNNING --> CANCELLED : cancel_backtest() + ExecutionStopper
COMPLETED --> [*]
FAILED --> [*]
CANCELLED --> [*]
Valid Transitions (from _VALID_TRANSITIONS)
A transition to the same state is always accepted, which makes repeated status updates idempotent. Terminal states (COMPLETED, FAILED, CANCELLED) reject every transition to a different state.
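The transition table itself is not reproduced in this document. A minimal sketch of what it could look like, derived only from the transition diagram above; the dictionary layout and the method bodies are assumptions, not the code in aws.py:

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

    @property
    def is_terminal(self) -> bool:
        # A state with no outbound transitions is terminal.
        return not _VALID_TRANSITIONS[self]

    def can_transition_to(self, target: "JobStatus") -> bool:
        # Same-state transitions are accepted so retried updates are idempotent.
        return target is self or target in _VALID_TRANSITIONS[self]


# Assumed layout: source state -> set of allowed target states (from the diagram).
_VALID_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.CANCELLED: set(),
}
```

With this layout, JobStatus.PENDING.can_transition_to(JobStatus.COMPLETED) is False because a job has to pass through RUNNING first.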
Transition Triggers
| Transition | Trigger | Component |
|---|---|---|
| [*] -> PENDING | submit_backtest() creates job with UUID | BacktestOrchestrationService |
| PENDING -> RUNNING | SQS consumer / Step Functions picks up job | sqs-trigger Lambda |
| PENDING -> FAILED | Executor submission raises ExternalServiceError | BacktestOrchestrationService |
| PENDING/RUNNING -> CANCELLED | cancel_backtest() + optional ExecutionStopper.stop() | BacktestOrchestrationService |
| RUNNING -> COMPLETED | Backtest consumer processes results from S3 | backtest-consumer Lambda |
| RUNNING -> FAILED | Task error, timeout, or orphan scanner cleanup | update-status Lambda |
Concurrent Job Limit
BacktestOrchestrationService enforces a per-strategy limit on concurrent RUNNING jobs (default: 3). The count check and the job creation are not atomic, so a time-of-check/time-of-use (TOCTOU) window of a few milliseconds can briefly allow at most limit + 1 jobs. This is accepted as non-critical.
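The overshoot can be illustrated with a deterministic interleaving of two requests. This is only a sketch of the check-then-act pattern; may_submit and the in-memory list are hypothetical stand-ins for the service's real check and its DynamoDB state:

```python
def may_submit(running_count: int, limit: int = 3) -> bool:
    """Check step: allow a submission only while below the limit."""
    return running_count < limit


# Two requests both run the check before either one records its job.
running = ["job-1", "job-2"]          # 2 of 3 slots in use
check_a = may_submit(len(running))    # request A sees 2 running jobs
check_b = may_submit(len(running))    # request B reads the same stale count

if check_a:
    running.append("job-a")
if check_b:
    running.append("job-b")

overshoot = len(running) - 3          # how far past the limit we ended up
```

Both checks pass against the same stale count, so the list ends with four entries, one more than the limit.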
Event Publishing
State transitions emit a JobStateEvent via EventPublisher (M11). For failed transitions, the event payload includes an error field for downstream alerting.
3. Trading Session Lifecycle
Source: tradai.common.entities.trading_state.TradingStatus
Handler: TradingHandler (common entrypoint)
States
| State | Value | Active | Terminal | Description |
|---|---|---|---|---|
| INITIALIZING | "initializing" | No | No | Container starting, loading config & credentials |
| WARMUP | "warmup" | Yes | No | Loading historical data from ArcticDB |
| RUNNING | "running" | Yes | No | Actively trading via Freqtrade subprocess |
| PAUSED | "paused" | No | No | Temporarily paused (manual intervention) |
| ERROR | "error" | No | Yes | Fatal error, needs investigation |
| STOPPED | "stopped" | No | Yes | Gracefully stopped |
Transition Diagram
stateDiagram-v2
[*] --> INITIALIZING : Container starts
INITIALIZING --> WARMUP : Config + credentials loaded
WARMUP --> RUNNING : Historical data loaded
RUNNING --> PAUSED : Manual pause (C9)
PAUSED --> RUNNING : Manual resume (C9)
RUNNING --> STOPPED : Graceful shutdown (SIGTERM)
RUNNING --> ERROR : Unhandled exception
INITIALIZING --> ERROR : Config/credential failure
WARMUP --> ERROR : Warmup failure
PAUSED --> STOPPED : Shutdown while paused
PAUSED --> ERROR : Error while paused
ERROR --> [*]
STOPPED --> [*]
Lifecycle Phases in TradingHandler.run()
Phase 1: INITIALIZING
- _initialize_state_management() -> DynamoDB TradingStateRepository
- _load_strategy_config() -> MLflow/S3 config (LIVE fails on error)
- _load_exchange_credentials() -> AWS Secrets Manager
Phase 2: WARMUP
- _warmup_historical_data() -> ArcticDB candle data
Phase 3: RUNNING
- _create_trader() -> FreqtradeTrader instance
- _start_health_reporter() -> Heartbeats + MetricsCollector + RiskMonitor
- _execute_trading_loop() -> Freqtrade subprocess
Phase 4: STOPPED (or ERROR on exception)
- _cleanup() -> Stop health reporter, clear refs
LIVE vs DRY_RUN
In LIVE mode, config load failures and missing exchange credentials are fatal (raise TradingError). In DRY_RUN mode, these degrade gracefully with warnings.
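The difference between the two modes can be sketched as a small wrapper around a loader callable. The TradingMode enum, the function name, and the empty-dict fallback are assumptions for illustration; only TradingError and the fatal-versus-warning behaviour come from the text above:

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class TradingMode(str, Enum):  # hypothetical enum name
    LIVE = "live"
    DRY_RUN = "dry_run"


class TradingError(Exception):
    """Stand-in for the TradingError named above."""


def load_strategy_config(loader, mode: TradingMode) -> dict:
    """Call loader(); escalate failures in LIVE, fall back to {} in DRY_RUN."""
    try:
        return loader()
    except Exception as exc:
        if mode is TradingMode.LIVE:
            # LIVE: a missing config must stop the session before any order is placed.
            raise TradingError(f"config load failed: {exc}") from exc
        # DRY_RUN: log a warning and continue with an empty (assumed) default.
        logger.warning("config load failed in DRY_RUN, using defaults: %s", exc)
        return {}
```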
Terminal State Timestamps
with_status() automatically sets stopped_at when transitioning to ERROR or STOPPED. Transitioning to INITIALIZING clears stopped_at for container restart scenarios.
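A minimal sketch of this timestamp handling, assuming a frozen dataclass and a reduced field set; the real TradingState may be built differently and carries more attributes:

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TradingStatus(str, Enum):
    INITIALIZING = "initializing"
    WARMUP = "warmup"
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"
    STOPPED = "stopped"


@dataclass(frozen=True)
class TradingState:  # assumed fields; only status and stopped_at matter here
    strategy_id: str
    status: TradingStatus
    stopped_at: Optional[datetime] = None

    def with_status(self, status: TradingStatus) -> "TradingState":
        """Return a new instance; the original is never mutated."""
        if status in (TradingStatus.ERROR, TradingStatus.STOPPED):
            # Entering a terminal state stamps the stop time.
            return replace(self, status=status, stopped_at=datetime.now(timezone.utc))
        if status is TradingStatus.INITIALIZING:
            # A restarted container starts with a clean stop time.
            return replace(self, status=status, stopped_at=None)
        return replace(self, status=status)
```

Because the dataclass is frozen, callers have to persist the returned instance; the one they started from still holds the old status.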
4. Strategy Promotion
Source: tradai.common.entities.mlflow.ModelStage
Service: ModelPromotionService (strategy-service)
States
| Stage | Value | Description |
|---|---|---|
| NONE | "None" | Default for new model registrations |
| STAGING | "Staging" | Ready for validation/testing |
| PRODUCTION | "Production" | Live production model |
| ARCHIVED | "Archived" | Previous production version (rollback target) |
Transition Diagram
stateDiagram-v2
[*] --> NONE : Model registered in MLflow
NONE --> STAGING : stage()
STAGING --> PRODUCTION : promote_to_production()
PRODUCTION --> ARCHIVED : Auto-archive when new version promoted
ARCHIVED --> PRODUCTION : rollback()
STAGING --> PRODUCTION : promote_to_production(skip_validation=True)
NONE --> PRODUCTION : promote_to_production(skip_validation=True)
ARCHIVED --> STAGING : stage() (re-stage for re-validation)
Promotion Workflow
- Stage: service.stage(model_name, version) -- moves to Staging
- Validate: service.validate_for_production(model_name, version) -- checks:
  - Version exists and is in READY status
  - Version is in Staging stage
  - Required tags present (strategy_name, strategy_version)
  - Optional: validation backtest (7-day window)
- Promote: service.promote_to_production(model_name, version) -- atomic operation:
  - Archives all existing Production versions (via archive_existing_versions=True)
  - Promotes target version to Production
- Rollback: service.rollback(model_name, target_version) -- promotes an Archived/Staging/None version back to Production
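The mandatory validation checks in the workflow can be sketched as a pure function over a plain dict. The dict keys (status, current_stage, tags) and the function name are assumptions chosen for illustration, not the signature of validate_for_production(); the optional validation backtest is left out:

```python
from typing import Optional

REQUIRED_TAGS = ("strategy_name", "strategy_version")


def validation_errors(version: Optional[dict]) -> list[str]:
    """Return the reasons a version is not promotable; an empty list means it passes."""
    if version is None:
        return ["version does not exist"]
    errors = []
    if version.get("status") != "READY":
        errors.append("version is not in READY status")
    if version.get("current_stage") != "Staging":
        errors.append("version is not in Staging stage")
    missing = [tag for tag in REQUIRED_TAGS if tag not in version.get("tags", {})]
    if missing:
        errors.append("missing required tags: " + ", ".join(missing))
    return errors
```

Returning every failed check at once, rather than stopping at the first, lets an operator fix all of them in a single pass.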
A/B Test Guard
promote_to_production() checks for active A/B tests (ABTestStatus.RUNNING). Promoting the challenger or an unrelated version is blocked. Only the current champion may be promoted during an active test. Complete the A/B test first.
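The rule reduces to a single comparison once the running test is known. In this sketch the running_test record and its champion/challenger keys are hypothetical; how the service looks up a test in ABTestStatus.RUNNING is not shown here:

```python
from typing import Optional


def ab_test_blocks_promotion(version: str, running_test: Optional[dict]) -> bool:
    """Return True when promotion of `version` must be refused.

    running_test is a hypothetical record such as
    {"champion": "3", "challenger": "4"} for a test in RUNNING status,
    or None when no test is active.
    """
    if running_test is None:
        return False
    # Only the current champion may be promoted while a test is running.
    return version != running_test["champion"]
```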
Atomic Archive+Promote
The archive-then-promote operation is delegated to MLflow's transition_model_version_stage(archive_existing_versions=True) to prevent a race condition where a crash between archive and promote leaves zero Production versions.
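A sketch of the delegated call, assuming the client object behaves like mlflow.tracking.MlflowClient. The wrapper name atomic_promote is hypothetical; the method and its keyword arguments are those of MlflowClient.transition_model_version_stage:

```python
def atomic_promote(client, model_name: str, version: str):
    """Promote `version` and archive current Production versions in one registry call.

    `client` is expected to expose transition_model_version_stage() with the
    same keyword arguments as mlflow.tracking.MlflowClient.
    """
    return client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # archive and promote in the same request
    )
```

Passing the client in as an argument keeps the function testable with a stub and avoids issuing two separate registry calls from application code.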
5. ML Retraining Pipeline
Source: tradai.common.entities.retraining.RetrainingStatus
Orchestrator: Step Functions retraining workflow (MO005)
States
| State | Value | Description |
|---|---|---|
| PENDING | "pending" | Waiting to start |
| CHECKING | "checking" | Evaluating if retraining is needed |
| TRAINING | "training" | ECS training task running |
| VALIDATING | "validating" | Validation backtest running (see note below) |
| COMPARING | "comparing" | Comparing champion vs challenger model |
| PROMOTING | "promoting" | Promoting new model version |
| COMPLETED | "completed" | Successfully completed |
| FAILED | "failed" | Workflow failed at any step |
| SKIPPED | "skipped" | Retraining not needed (recently trained / no drift) |
VALIDATING state vs Step Functions workflow
VALIDATING is a code-level state that can be set by the retraining handler, but the Step Functions workflow (retraining_workflow.json.j2) does not have a separate RunValidation task -- validation happens within the training/comparison steps. The workflow goes directly from RunRetraining to CompareModels. See 06-STEP-FUNCTIONS.md Section 4 for the actual workflow states.
Transition Diagram
stateDiagram-v2
[*] --> PENDING : Trigger received
PENDING --> CHECKING : Workflow starts
CHECKING --> SKIPPED : No retraining needed
CHECKING --> TRAINING : Retraining needed
TRAINING --> VALIDATING : Training complete (handler-level)
TRAINING --> FAILED : Training error
VALIDATING --> COMPARING : Validation backtest done
VALIDATING --> FAILED : Validation error
COMPARING --> PROMOTING : Challenger wins
COMPARING --> COMPLETED : Champion wins (no promotion)
COMPARING --> FAILED : Comparison error
PROMOTING --> COMPLETED : Model promoted
PROMOTING --> FAILED : Promotion error
SKIPPED --> [*]
COMPLETED --> [*]
FAILED --> [*]
Triggers (RetrainingTrigger)
| Trigger | Value | Source |
|---|---|---|
| DRIFT_DETECTED | "drift_detected" | drift-monitor Lambda |
| SCHEDULED | "scheduled" | retraining-scheduler Lambda |
| MANUAL | "manual" | CLI / API manual trigger |
| PERFORMANCE_DEGRADATION | "performance_degradation" | CloudWatch alarm |
Check Decision (RetrainingDecision)
| Decision | Value | Description |
|---|---|---|
| NEEDS_RETRAINING | "needs_retraining" | Drift/schedule/performance triggers retraining |
| NO_RETRAINING | "no_retraining" | Metrics within acceptable bounds |
| RECENTLY_TRAINED | "recently_trained" | Last retraining too recent, cooldown active |
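Read together with the transition diagram, the check decision selects one of the two exits from CHECKING. A minimal sketch of that mapping; the function name status_after_check is hypothetical, and the enum values are copied from the tables in this section:

```python
from enum import Enum


class RetrainingStatus(str, Enum):
    PENDING = "pending"
    CHECKING = "checking"
    TRAINING = "training"
    VALIDATING = "validating"
    COMPARING = "comparing"
    PROMOTING = "promoting"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


class RetrainingDecision(str, Enum):
    NEEDS_RETRAINING = "needs_retraining"
    NO_RETRAINING = "no_retraining"
    RECENTLY_TRAINED = "recently_trained"


def status_after_check(decision: RetrainingDecision) -> RetrainingStatus:
    """Map the outcome of the CHECKING step to the next workflow status."""
    if decision is RetrainingDecision.NEEDS_RETRAINING:
        return RetrainingStatus.TRAINING
    # NO_RETRAINING and RECENTLY_TRAINED both end the run without training.
    return RetrainingStatus.SKIPPED
```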
State Persistence
RetrainingState is stored in DynamoDB with model_name as partition key. It tracks task_arn, step_function_execution_arn, and new_version for end-to-end traceability. The error_message field captures failure details.
6. Config Version Lifecycle
Source: tradai.common.entities.config_version.ConfigVersionStatus
Service: ConfigVersionService (common)
States
| State | Value | Terminal | Description |
|---|---|---|---|
| DRAFT | "draft" | No | Not yet validated or deployed |
| ACTIVE | "active" | No | Currently deployed (one per strategy) |
| DEPRECATED | "deprecated" | Yes | Superseded by newer version, auto-cleanup via TTL |
Transition Diagram
stateDiagram-v2
[*] --> DRAFT : Config version created
DRAFT --> ACTIVE : Validated and deployed
ACTIVE --> DEPRECATED : Superseded by new ACTIVE version
DEPRECATED --> [*] : DynamoDB TTL auto-deletes (90 days)
Key Behaviors
- Content-addressable: Each version has a SHA256 config_hash for deduplication.
- Single active: Only one ACTIVE config version per strategy at any time.
- Immutable updates: with_status(ACTIVE) sets deployed_at; with_status(DEPRECATED) sets deprecated_at and ttl.
- Auto-cleanup: Deprecated versions have a 90-day TTL (_TTL_DAYS = 90) after which DynamoDB auto-deletes them.
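The hash and TTL values can be computed as sketched below. Hashing a canonical JSON encoding (sorted keys, compact separators) is an assumption, since this document only states that the hash is SHA256; the helper names are hypothetical. DynamoDB TTL attributes hold an expiry time in Unix epoch seconds:

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

_TTL_DAYS = 90


def config_hash(config: dict) -> str:
    """SHA256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ttl_epoch(deprecated_at: datetime) -> int:
    """Expiry time for the DynamoDB ttl attribute, in Unix epoch seconds."""
    return int((deprecated_at + timedelta(days=_TTL_DAYS)).timestamp())


# Two configs with the same content but different key order deduplicate.
same = config_hash({"a": 1, "b": 2}) == config_hash({"b": 2, "a": 1})
expiry = ttl_epoch(datetime(2026, 1, 1, tzinfo=timezone.utc))
```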
Activation Side Effect
When a new version is activated, the BacktestOrchestrationService can resolve config_version_id="ACTIVE" to the current active config. Activating a new version implicitly makes all future "ACTIVE" references point to it.
7. DynamoDB State Tables
All tables use the naming pattern tradai-{purpose}-{environment}.
| Purpose | DynamoDB Table | Partition Key | Entity Class / Contents |
|---|---|---|---|
| Backtest Job | tradai-workflow-state-{env} | run_id | BacktestJobStatus |
| Trading Session | tradai-trading-state-{env} | strategy_id | TradingState |
| ML Retraining | tradai-retraining-state-{env} | model_name | RetrainingState |
| Config Version | tradai-config-versions-{env} | strategy_name (SK: config_id) | ConfigVersion |
| Health State | tradai-health-state-{env} | strategy_id | Health check records |
| Drift State | tradai-drift-state-{env} | model_name | Drift detection results |
| Rollback State | tradai-rollback-state-{env} | - | Rollback audit trail |
| Notifications | tradai-notifications-{env} | - | Alert/notification records |
| Idempotency | tradai-idempotency-{env} | - | Lambda deduplication |
Strategy Promotion uses MLflow Model Registry (not DynamoDB) for stage tracking.
Serialization
All DynamoDB-persisted state entities extend DynamoDBSerializableMixin, which provides:
- to_dynamodb_item() -- serialize to DynamoDB-compatible dict
- from_dynamodb_item() -- deserialize from DynamoDB item
Combined with frozen=True and with_status() methods, this ensures all state mutations produce new immutable instances that are safely persisted.
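A minimal sketch of how such a mixin could round-trip a frozen entity. The method bodies, the reduced ConfigVersion field set, and the use of dataclasses are all assumptions; only the two method names and the enum values come from this document:

```python
from dataclasses import asdict, dataclass, fields
from enum import Enum


class ConfigVersionStatus(str, Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


class DynamoDBSerializableMixin:
    """Sketch only: the real mixin's behaviour is not shown in this document."""

    def to_dynamodb_item(self) -> dict:
        # Enums are stored by value; None attributes are dropped.
        item = {}
        for key, value in asdict(self).items():
            if value is None:
                continue
            item[key] = value.value if isinstance(value, Enum) else value
        return item

    @classmethod
    def from_dynamodb_item(cls, item: dict):
        # Ignore attributes the entity does not declare.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in names})


@dataclass(frozen=True)
class ConfigVersion(DynamoDBSerializableMixin):  # reduced, assumed field set
    strategy_name: str
    config_id: str
    status: ConfigVersionStatus

    def __post_init__(self):
        # Coerce a raw string read from DynamoDB back into the enum.
        object.__setattr__(self, "status", ConfigVersionStatus(self.status))
```

Serializing and then deserializing yields an instance equal to the original, so a state can be written after each with_status() call and read back unchanged.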
Cross-Reference: Trace Context
Every backtest/training execution carries a unified trace context:
| Field | Propagated Through | Used By |
|---|---|---|
| trace_id | DynamoDB, Step Functions, ECS env | End-to-end correlation |
| job_id | DynamoDB, S3 results | Job tracking |
| mlflow_run_id | DynamoDB, BacktestResult | Experiment tracking |
| git_commit | BacktestResult, MLflow tags | Code version pinning |
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial creation — 5 state machines from actual enums |
Dependencies
| If This Changes | Update This Doc |
|---|---|
| libs/tradai-common/src/tradai/common/entities/aws.py JobStatus | Backtest states (Section 2) |
| libs/tradai-common/src/tradai/common/entities/trading_state.py TradingStatus | Trading states (Section 3) |
| libs/tradai-common/src/tradai/common/entities/mlflow.py ModelStage | Promotion states (Section 4) |
| libs/tradai-common/src/tradai/common/entities/retraining.py RetrainingStatus | Retraining states (Section 5) |
| libs/tradai-common/src/tradai/common/entities/config_version.py ConfigVersionStatus | Config states (Section 6) |