
TradAI State Machines

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT


1. TL;DR

TradAI has five state machines governing different lifecycle concerns:

| State Machine | Enum | States | Storage |
| --- | --- | --- | --- |
| Backtest Job | JobStatus | 5 | tradai-workflow-state-{env} |
| Trading Session | TradingStatus | 6 | tradai-trading-state-{env} |
| Strategy Promotion | ModelStage | 4 | MLflow Model Registry |
| ML Retraining | RetrainingStatus | 9 | tradai-retraining-state-{env} |
| Config Version | ConfigVersionStatus | 3 | tradai-config-versions-{env} |

All state enums are (str, Enum) subclasses defined in libs/tradai-common/src/tradai/common/entities/. Transitions are enforced through can_transition_to() guards, is_terminal properties, and with_status() methods that return an updated copy instead of mutating in place.


2. Backtest Job Lifecycle

Source: tradai.common.entities.aws.JobStatus | Orchestrator: BacktestOrchestrationService (backend)

States

| State | Value | Terminal | Description |
| --- | --- | --- | --- |
| PENDING | "pending" | No | Job created, waiting to execute |
| RUNNING | "running" | No | Executor picked up the job |
| COMPLETED | "completed" | Yes | Backtest finished successfully |
| FAILED | "failed" | Yes | Execution or submission error |
| CANCELLED | "cancelled" | Yes | User-initiated cancellation |

Transition Diagram

stateDiagram-v2
    [*] --> PENDING : submit_backtest()
    PENDING --> RUNNING : Executor starts task
    PENDING --> FAILED : Submission error
    PENDING --> CANCELLED : cancel_backtest()
    RUNNING --> COMPLETED : Backtest finishes OK
    RUNNING --> FAILED : Runtime error / timeout
    RUNNING --> CANCELLED : cancel_backtest() + ExecutionStopper
    COMPLETED --> [*]
    FAILED --> [*]
    CANCELLED --> [*]

Valid Transitions (from _VALID_TRANSITIONS)

PENDING  -> {RUNNING, FAILED, CANCELLED}
RUNNING  -> {COMPLETED, FAILED, CANCELLED}

Same-state transitions are allowed for idempotency. Terminal states reject all outbound transitions.
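
The guard described above can be sketched as follows. This is a minimal illustration, not the code in tradai.common.entities.aws: the enum values and the transition map come from this section, while the method bodies are assumptions. In particular, the sketch assumes a same-state transition is also accepted on terminal states, since it is not an outbound transition.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

    @property
    def is_terminal(self) -> bool:
        # A state with no outbound transitions is terminal.
        return not _VALID_TRANSITIONS.get(self)

    def can_transition_to(self, target: "JobStatus") -> bool:
        # Same-state transitions are accepted so retried updates are idempotent.
        if target is self:
            return True
        return target in _VALID_TRANSITIONS.get(self, frozenset())


# Transition map as listed in this section; terminal states have no entry.
_VALID_TRANSITIONS: dict[JobStatus, frozenset[JobStatus]] = {
    JobStatus.PENDING: frozenset(
        {JobStatus.RUNNING, JobStatus.FAILED, JobStatus.CANCELLED}
    ),
    JobStatus.RUNNING: frozenset(
        {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}
    ),
}
```

Deriving is_terminal from the map keeps the terminal set and the transition table from drifting apart; the real implementation may define it differently.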

Transition Triggers

| Transition | Trigger | Component |
| --- | --- | --- |
| [*] -> PENDING | submit_backtest() creates job with UUID | BacktestOrchestrationService |
| PENDING -> RUNNING | SQS consumer / Step Functions picks up job | sqs-trigger Lambda |
| PENDING -> FAILED | Executor submission raises ExternalServiceError | BacktestOrchestrationService |
| PENDING/RUNNING -> CANCELLED | cancel_backtest() + optional ExecutionStopper.stop() | BacktestOrchestrationService |
| RUNNING -> COMPLETED | Backtest consumer processes results from S3 | backtest-consumer Lambda |
| RUNNING -> FAILED | Task error, timeout, or orphan scanner cleanup | update-status Lambda |

Concurrent Job Limit

BacktestOrchestrationService enforces a per-strategy limit on concurrent RUNNING jobs (default: 3). The check is subject to a time-of-check-to-time-of-use (TOCTOU) race: within a window of a few milliseconds, at most limit + 1 jobs can briefly run. This is accepted as non-critical.
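
To illustrate where the race window sits, here is a minimal check-then-act sketch. Everything in it (InMemoryJobStore, admit_job, ConcurrentJobLimitError, the constant name) is hypothetical and stands in for the real repository and service code; only the default of 3 comes from this section.

```python
from dataclasses import dataclass, field

# Default stated in this section; the constant name is hypothetical.
DEFAULT_MAX_CONCURRENT_RUNNING = 3


class ConcurrentJobLimitError(RuntimeError):
    """Raised when a strategy already has the maximum number of RUNNING jobs."""


@dataclass
class InMemoryJobStore:
    """Hypothetical stand-in for the DynamoDB-backed job repository."""

    running: dict[str, set[str]] = field(default_factory=dict)

    def count_running(self, strategy: str) -> int:
        return len(self.running.get(strategy, set()))

    def mark_running(self, strategy: str, job_id: str) -> None:
        self.running.setdefault(strategy, set()).add(job_id)


def admit_job(
    store: InMemoryJobStore,
    strategy: str,
    job_id: str,
    limit: int = DEFAULT_MAX_CONCURRENT_RUNNING,
) -> None:
    # Time of check: read the current RUNNING count for this strategy.
    if store.count_running(strategy) >= limit:
        raise ConcurrentJobLimitError(
            f"{strategy} already has {limit} RUNNING jobs"
        )
    # Time of use: a concurrent caller that passed the same check before
    # this write lands is what produces the brief overshoot.
    store.mark_running(strategy, job_id)
```

Closing the window would require making the check and the write a single atomic step, which this section states is not considered necessary.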

Event Publishing

State transitions emit a JobStateEvent via EventPublisher (M11). Events for failed transitions include an error field in the payload for downstream alerting.


3. Trading Session Lifecycle

Source: tradai.common.entities.trading_state.TradingStatus | Handler: TradingHandler (common entrypoint)

States

| State | Value | Active | Terminal | Description |
| --- | --- | --- | --- | --- |
| INITIALIZING | "initializing" | No | No | Container starting, loading config & credentials |
| WARMUP | "warmup" | Yes | No | Loading historical data from ArcticDB |
| RUNNING | "running" | Yes | No | Actively trading via Freqtrade subprocess |
| PAUSED | "paused" | No | No | Temporarily paused (manual intervention) |
| ERROR | "error" | No | Yes | Fatal error, needs investigation |
| STOPPED | "stopped" | No | Yes | Gracefully stopped |

Transition Diagram

stateDiagram-v2
    [*] --> INITIALIZING : Container starts
    INITIALIZING --> WARMUP : Config + credentials loaded
    WARMUP --> RUNNING : Historical data loaded
    RUNNING --> PAUSED : Manual pause (C9)
    PAUSED --> RUNNING : Manual resume (C9)
    RUNNING --> STOPPED : Graceful shutdown (SIGTERM)
    RUNNING --> ERROR : Unhandled exception
    INITIALIZING --> ERROR : Config/credential failure
    WARMUP --> ERROR : Warmup failure
    PAUSED --> STOPPED : Shutdown while paused
    PAUSED --> ERROR : Error while paused
    ERROR --> [*]
    STOPPED --> [*]

Lifecycle Phases in TradingHandler.run()

Phase 1: INITIALIZING
  - _initialize_state_management()     -> DynamoDB TradingStateRepository
  - _load_strategy_config()            -> MLflow/S3 config (LIVE fails on error)
  - _load_exchange_credentials()       -> AWS Secrets Manager

Phase 2: WARMUP
  - _warmup_historical_data()          -> ArcticDB candle data

Phase 3: RUNNING
  - _create_trader()                   -> FreqtradeTrader instance
  - _start_health_reporter()           -> Heartbeats + MetricsCollector + RiskMonitor
  - _execute_trading_loop()            -> Freqtrade subprocess

Phase 4: STOPPED (or ERROR on exception)
  - _cleanup()                         -> Stop health reporter, clear refs

LIVE vs DRY_RUN

In LIVE mode, config load failures and missing exchange credentials are fatal (raise TradingError). In DRY_RUN mode, these degrade gracefully with warnings.

Terminal State Timestamps

with_status() automatically sets stopped_at when transitioning to ERROR or STOPPED. Transitioning to INITIALIZING clears stopped_at for container restart scenarios.
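
The timestamp rule above might look like the following on a frozen dataclass. This is a sketch under assumptions: the status values and the stopped_at behavior come from this section, but the field list of TradingState is reduced to what the example needs and the real entity likely carries more.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TradingStatus(str, Enum):
    INITIALIZING = "initializing"
    WARMUP = "warmup"
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"
    STOPPED = "stopped"

    @property
    def is_terminal(self) -> bool:
        return self in (TradingStatus.ERROR, TradingStatus.STOPPED)


@dataclass(frozen=True)
class TradingState:
    strategy_id: str
    status: TradingStatus = TradingStatus.INITIALIZING
    stopped_at: Optional[datetime] = None

    def with_status(self, status: TradingStatus) -> "TradingState":
        if status.is_terminal:
            # Entering ERROR or STOPPED stamps the stop time.
            stopped_at = datetime.now(timezone.utc)
        elif status is TradingStatus.INITIALIZING:
            # A container restart clears the previous stop time.
            stopped_at = None
        else:
            stopped_at = self.stopped_at
        # replace() returns a new instance; the original stays unchanged.
        return replace(self, status=status, stopped_at=stopped_at)
```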


4. Strategy Promotion

Source: tradai.common.entities.mlflow.ModelStage | Service: ModelPromotionService (strategy-service)

States

| Stage | Value | Description |
| --- | --- | --- |
| NONE | "None" | Default for new model registrations |
| STAGING | "Staging" | Ready for validation/testing |
| PRODUCTION | "Production" | Live production model |
| ARCHIVED | "Archived" | Previous production version (rollback target) |

Transition Diagram

stateDiagram-v2
    [*] --> NONE : Model registered in MLflow
    NONE --> STAGING : stage()
    STAGING --> PRODUCTION : promote_to_production()
    PRODUCTION --> ARCHIVED : Auto-archive when new version promoted
    ARCHIVED --> PRODUCTION : rollback()
    STAGING --> PRODUCTION : promote_to_production(skip_validation=True)
    NONE --> PRODUCTION : promote_to_production(skip_validation=True)
    ARCHIVED --> STAGING : stage() (re-stage for re-validation)

Promotion Workflow

  1. Stage: service.stage(model_name, version) -- moves the version to Staging.
  2. Validate: service.validate_for_production(model_name, version) -- checks:
       • Version exists and is in READY status
       • Version is in Staging stage
       • Required tags present (strategy_name, strategy_version)
       • Optional: validation backtest (7-day window)
  3. Promote: service.promote_to_production(model_name, version) -- atomic operation:
       • Archives all existing Production versions (via archive_existing_versions=True)
       • Promotes target version to Production
  4. Rollback: service.rollback(model_name, target_version) -- promotes an Archived/Staging/None version back to Production.
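
The mandatory validation checks from the workflow above can be sketched as a pure function. VersionInfo and validation_errors are hypothetical names introduced for illustration; the actual validate_for_production() may return a different shape, and the optional validation backtest is left out.

```python
from dataclasses import dataclass, field

# Required tags as listed in the promotion workflow.
REQUIRED_TAGS = ("strategy_name", "strategy_version")


@dataclass
class VersionInfo:
    """Hypothetical snapshot of one registered model version."""

    version: str
    status: str  # registry status, e.g. "READY"
    stage: str  # ModelStage value, e.g. "Staging"
    tags: dict[str, str] = field(default_factory=dict)


def validation_errors(info: VersionInfo) -> list[str]:
    """Return every failed check; an empty list means the version passes."""
    errors: list[str] = []
    if info.status != "READY":
        errors.append(
            f"version {info.version} has status {info.status}, expected READY"
        )
    if info.stage != "Staging":
        errors.append(
            f"version {info.version} is in stage {info.stage}, expected Staging"
        )
    missing = [tag for tag in REQUIRED_TAGS if tag not in info.tags]
    if missing:
        errors.append("missing required tags: " + ", ".join(missing))
    return errors
```

Collecting all failures rather than stopping at the first one is a design choice of this sketch; it lets a caller report everything that blocks promotion in one pass.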

A/B Test Guard

promote_to_production() checks for active A/B tests (ABTestStatus.RUNNING). While a test is running, only the current champion may be promoted; promoting the challenger or an unrelated version is blocked until the A/B test completes.
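
A guard of this kind could be sketched as below. Only ABTestStatus.RUNNING is named in this document; the enum values, the COMPLETED member, the ABTest fields, PromotionBlockedError, and check_ab_test_guard are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ABTestStatus(str, Enum):
    RUNNING = "running"  # member named in this section; value is assumed
    COMPLETED = "completed"  # hypothetical member for the example


@dataclass(frozen=True)
class ABTest:
    """Hypothetical A/B test record."""

    model_name: str
    champion_version: str
    challenger_version: str
    status: ABTestStatus


class PromotionBlockedError(RuntimeError):
    """Raised when an active A/B test forbids the requested promotion."""


def check_ab_test_guard(
    model_name: str, version: str, tests: Iterable[ABTest]
) -> None:
    for test in tests:
        # Only running tests for the same model are relevant.
        if test.model_name != model_name:
            continue
        if test.status is not ABTestStatus.RUNNING:
            continue
        # During an active test only the current champion may be promoted.
        if version != test.champion_version:
            raise PromotionBlockedError(
                f"A/B test active for {model_name}: only champion "
                f"version {test.champion_version} may be promoted"
            )
```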

Atomic Archive+Promote

The archive-then-promote operation is delegated to MLflow's transition_model_version_stage(archive_existing_versions=True) to prevent a race condition where a crash between archive and promote leaves zero Production versions.


5. ML Retraining Pipeline

Source: tradai.common.entities.retraining.RetrainingStatus | Orchestrator: Step Functions retraining workflow (MO005)

States

| State | Value | Description |
| --- | --- | --- |
| PENDING | "pending" | Waiting to start |
| CHECKING | "checking" | Evaluating if retraining is needed |
| TRAINING | "training" | ECS training task running |
| VALIDATING | "validating" | Validation backtest running (see note below) |
| COMPARING | "comparing" | Comparing champion vs challenger model |
| PROMOTING | "promoting" | Promoting new model version |
| COMPLETED | "completed" | Successfully completed |
| FAILED | "failed" | Workflow failed at any step |
| SKIPPED | "skipped" | Retraining not needed (recently trained / no drift) |

VALIDATING state vs Step Functions workflow

VALIDATING is a code-level state that can be set by the retraining handler, but the Step Functions workflow (retraining_workflow.json.j2) does not have a separate RunValidation task -- validation happens within the training/comparison steps. The workflow goes directly from RunRetraining to CompareModels. See 06-STEP-FUNCTIONS.md Section 4 for the actual workflow states.

Transition Diagram

stateDiagram-v2
    [*] --> PENDING : Trigger received
    PENDING --> CHECKING : Workflow starts
    CHECKING --> SKIPPED : No retraining needed
    CHECKING --> TRAINING : Retraining needed
    TRAINING --> VALIDATING : Training complete (handler-level)
    TRAINING --> FAILED : Training error
    VALIDATING --> COMPARING : Validation backtest done
    VALIDATING --> FAILED : Validation error
    COMPARING --> PROMOTING : Challenger wins
    COMPARING --> COMPLETED : Champion wins (no promotion)
    COMPARING --> FAILED : Comparison error
    PROMOTING --> COMPLETED : Model promoted
    PROMOTING --> FAILED : Promotion error
    SKIPPED --> [*]
    COMPLETED --> [*]
    FAILED --> [*]
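
The diagram above can be encoded as a transition map and used to check a recorded status history. This is an illustrative sketch: TRANSITIONS and is_valid_path are hypothetical names, and the map contains only the edges drawn in the diagram, so the direct RunRetraining to CompareModels hop of the Step Functions workflow is not modeled.

```python
from enum import Enum
from typing import Sequence


class RetrainingStatus(str, Enum):
    PENDING = "pending"
    CHECKING = "checking"
    TRAINING = "training"
    VALIDATING = "validating"
    COMPARING = "comparing"
    PROMOTING = "promoting"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


S = RetrainingStatus

# Edges copied from the state diagram; terminal states have no entry.
TRANSITIONS: dict[RetrainingStatus, frozenset[RetrainingStatus]] = {
    S.PENDING: frozenset({S.CHECKING}),
    S.CHECKING: frozenset({S.SKIPPED, S.TRAINING}),
    S.TRAINING: frozenset({S.VALIDATING, S.FAILED}),
    S.VALIDATING: frozenset({S.COMPARING, S.FAILED}),
    S.COMPARING: frozenset({S.PROMOTING, S.COMPLETED, S.FAILED}),
    S.PROMOTING: frozenset({S.COMPLETED, S.FAILED}),
}


def is_valid_path(path: Sequence[RetrainingStatus]) -> bool:
    """True if every consecutive pair in the path is a drawn edge."""
    return all(
        nxt in TRANSITIONS.get(cur, frozenset())
        for cur, nxt in zip(path, path[1:])
    )
```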

Triggers (RetrainingTrigger)

| Trigger | Value | Source |
| --- | --- | --- |
| DRIFT_DETECTED | "drift_detected" | drift-monitor Lambda |
| SCHEDULED | "scheduled" | retraining-scheduler Lambda |
| MANUAL | "manual" | CLI / API manual trigger |
| PERFORMANCE_DEGRADATION | "performance_degradation" | CloudWatch alarm |

Check Decision (RetrainingDecision)

| Decision | Value | Description |
| --- | --- | --- |
| NEEDS_RETRAINING | "needs_retraining" | Drift/schedule/performance triggers retraining |
| NO_RETRAINING | "no_retraining" | Metrics within acceptable bounds |
| RECENTLY_TRAINED | "recently_trained" | Last retraining too recent, cooldown active |
State Persistence

RetrainingState is stored in DynamoDB with model_name as partition key. It tracks task_arn, step_function_execution_arn, and new_version for end-to-end traceability. The error_message field captures failure details.


6. Config Version Lifecycle

Source: tradai.common.entities.config_version.ConfigVersionStatus | Service: ConfigVersionService (common)

States

| State | Value | Terminal | Description |
| --- | --- | --- | --- |
| DRAFT | "draft" | No | Not yet validated or deployed |
| ACTIVE | "active" | No | Currently deployed (one per strategy) |
| DEPRECATED | "deprecated" | Yes | Superseded by newer version, auto-cleanup via TTL |

Transition Diagram

stateDiagram-v2
    [*] --> DRAFT : Config version created
    DRAFT --> ACTIVE : Validated and deployed
    ACTIVE --> DEPRECATED : Superseded by new ACTIVE version
    DEPRECATED --> [*] : DynamoDB TTL auto-deletes (90 days)

Key Behaviors

  • Content-addressable: Each version has a SHA256 config_hash for deduplication.
  • Single active: Only one ACTIVE config version per strategy at any time.
  • Immutable updates: with_status(ACTIVE) sets deployed_at; with_status(DEPRECATED) sets deprecated_at and ttl.
  • Auto-cleanup: Deprecated versions have a 90-day TTL (_TTL_DAYS = 90) after which DynamoDB auto-deletes them.
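
The behaviors listed above can be sketched together. The status values, the field names config_hash, deployed_at, deprecated_at, and ttl, and _TTL_DAYS = 90 come from this section; compute_config_hash, the canonical-JSON hashing scheme, the optional now parameter, and the reduced field list are assumptions. The ttl is stored as epoch seconds, which is the format DynamoDB TTL attributes use.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Any, Optional

_TTL_DAYS = 90


class ConfigVersionStatus(str, Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


def compute_config_hash(config: dict[str, Any]) -> str:
    # Assumed scheme: hash a canonical JSON form so key order is irrelevant.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ConfigVersion:
    strategy_name: str
    config_hash: str
    status: ConfigVersionStatus = ConfigVersionStatus.DRAFT
    deployed_at: Optional[datetime] = None
    deprecated_at: Optional[datetime] = None
    ttl: Optional[int] = None  # epoch seconds

    def with_status(
        self, status: ConfigVersionStatus, now: Optional[datetime] = None
    ) -> "ConfigVersion":
        now = now or datetime.now(timezone.utc)
        if status is ConfigVersionStatus.ACTIVE:
            return replace(self, status=status, deployed_at=now)
        if status is ConfigVersionStatus.DEPRECATED:
            expires = now + timedelta(days=_TTL_DAYS)
            return replace(
                self,
                status=status,
                deprecated_at=now,
                ttl=int(expires.timestamp()),
            )
        return replace(self, status=status)
```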

Activation Side Effect

When a new version is activated, the BacktestOrchestrationService can resolve config_version_id="ACTIVE" to the current active config. Activating a new version implicitly makes all future "ACTIVE" references point to it.
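
Resolving the "ACTIVE" alias amounts to a lookup at request time. In the sketch below, resolve_config_version_id and the active_by_strategy mapping are hypothetical; in the real service the lookup would go through the config version store.

```python
from typing import Mapping

ACTIVE_ALIAS = "ACTIVE"


def resolve_config_version_id(
    requested: str,
    strategy_name: str,
    active_by_strategy: Mapping[str, str],
) -> str:
    """Return a concrete config version id for the request."""
    # Explicit ids pass through unchanged.
    if requested != ACTIVE_ALIAS:
        return requested
    # The alias is resolved when the request is made, so activating a new
    # version redirects all later "ACTIVE" requests to it.
    try:
        return active_by_strategy[strategy_name]
    except KeyError:
        raise LookupError(
            f"no ACTIVE config version for strategy {strategy_name}"
        ) from None
```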


7. DynamoDB State Tables

All tables use the naming pattern tradai-{purpose}-{env}.

| State Machine | DynamoDB Table | Partition Key | Entity Class |
| --- | --- | --- | --- |
| Backtest Job | tradai-workflow-state-{env} | run_id | BacktestJobStatus |
| Trading Session | tradai-trading-state-{env} | strategy_id | TradingState |
| ML Retraining | tradai-retraining-state-{env} | model_name | RetrainingState |
| Config Version | tradai-config-versions-{env} | strategy_name (SK: config_id) | ConfigVersion |
| Health State | tradai-health-state-{env} | strategy_id | Health check records |
| Drift State | tradai-drift-state-{env} | model_name | Drift detection results |
| Rollback State | tradai-rollback-state-{env} | - | Rollback audit trail |
| Notifications | tradai-notifications-{env} | - | Alert/notification records |
| Idempotency | tradai-idempotency-{env} | - | Lambda deduplication |

Strategy Promotion uses MLflow Model Registry (not DynamoDB) for stage tracking.
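
The table naming pattern for the DynamoDB tables listed above can be captured in a small helper. state_table_name and its input validation are hypothetical, and "dev" in the usage note is only an example environment name.

```python
import re

# Assumed constraint for the example: lowercase words joined by hyphens.
_NAME_PART = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def state_table_name(purpose: str, environment: str) -> str:
    """Build a table name following tradai-{purpose}-{environment}."""
    for label, value in (("purpose", purpose), ("environment", environment)):
        if not _NAME_PART.match(value):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"tradai-{purpose}-{environment}"
```

For example, state_table_name("workflow-state", "dev") yields "tradai-workflow-state-dev".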

Serialization

All DynamoDB-persisted state entities extend DynamoDBSerializableMixin, which provides:

  • to_dynamodb_item() -- serialize to DynamoDB-compatible dict
  • from_dynamodb_item() -- deserialize from DynamoDB item

Combined with frozen=True and with_status() methods, this ensures all state mutations produce new immutable instances that are safely persisted.
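
A simplified sketch of how such a mixin and a frozen entity could fit together follows. The method names and the BacktestJobStatus and run_id names come from this document; the method bodies, the error field, and the handling of enums and None values are assumptions, and the real mixin presumably covers more types (timestamps, nested values).

```python
from dataclasses import dataclass, fields, replace
from enum import Enum
from typing import Any, Optional


class DynamoDBSerializableMixin:
    """Simplified stand-in; expects to be mixed into a dataclass."""

    def to_dynamodb_item(self) -> dict[str, Any]:
        item: dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None:
                continue  # omit unset attributes from the item
            item[f.name] = value.value if isinstance(value, Enum) else value
        return item

    @classmethod
    def from_dynamodb_item(cls, item: dict[str, Any]):
        kwargs: dict[str, Any] = {}
        for f in fields(cls):
            if f.name not in item:
                continue
            raw = item[f.name]
            # Rebuild enum members from their stored string values.
            is_enum = isinstance(f.type, type) and issubclass(f.type, Enum)
            kwargs[f.name] = f.type(raw) if is_enum else raw
        return cls(**kwargs)


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"


@dataclass(frozen=True)
class BacktestJobStatus(DynamoDBSerializableMixin):
    run_id: str
    status: JobStatus
    error: Optional[str] = None  # hypothetical field for the example

    def with_status(self, status: JobStatus) -> "BacktestJobStatus":
        # frozen=True forbids mutation, so return an updated copy.
        return replace(self, status=status)
```

A typical update would load an item, call with_status() on the deserialized entity, and write the result of to_dynamodb_item() back, leaving the previously loaded instance untouched.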

Cross-Reference: Trace Context

Every backtest/training execution carries a unified trace context:

| Field | Propagated Through | Used By |
| --- | --- | --- |
| trace_id | DynamoDB, Step Functions, ECS env | End-to-end correlation |
| job_id | DynamoDB, S3 results | Job tracking |
| mlflow_run_id | DynamoDB, BacktestResult | Experiment tracking |
| git_commit | BacktestResult, MLflow tags | Code version pinning |

Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0.0 | 2026-03-28 | Initial creation: 5 state machines from actual enums |

Dependencies

| If This Changes | Update This Doc |
| --- | --- |
| libs/tradai-common/src/tradai/common/entities/aws.py JobStatus | Backtest states (Section 2) |
| libs/tradai-common/src/tradai/common/entities/trading_state.py TradingStatus | Trading states (Section 3) |
| libs/tradai-common/src/tradai/common/entities/mlflow.py ModelStage | Promotion states (Section 4) |
| libs/tradai-common/src/tradai/common/entities/retraining.py RetrainingStatus | Retraining states (Section 5) |
| libs/tradai-common/src/tradai/common/entities/config_version.py ConfigVersionStatus | Config states (Section 6) |