TradAI State Machines¶

Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT

1. TL;DR¶

TradAI has five state machines governing different lifecycle concerns:

State Machine	Enum	States	Storage
Backtest Job	`JobStatus`	5	`tradai-workflow-state-{env}`
Trading Session	`TradingStatus`	6	`tradai-trading-state-{env}`
Strategy Promotion	`ModelStage`	4	MLflow Model Registry
ML Retraining	`RetrainingStatus`	9	`tradai-retraining-state-{env}`
Config Version	`ConfigVersionStatus`	3	`tradai-config-versions-{env}`

All state enums subclass both `str` and `Enum` and are defined in `libs/tradai-common/src/tradai/common/entities/`. Transitions are enforced through `can_transition_to()` guards, `is_terminal` properties, and `with_status()` immutable-update methods.

2. Backtest Job Lifecycle¶

Source: `tradai.common.entities.aws.JobStatus` | Orchestrator: `BacktestOrchestrationService` (backend)

States¶

State	Value	Terminal	Description
`PENDING`	`"pending"`	No	Job created, waiting to execute
`RUNNING`	`"running"`	No	Executor picked up the job
`COMPLETED`	`"completed"`	Yes	Backtest finished successfully
`FAILED`	`"failed"`	Yes	Execution or submission error
`CANCELLED`	`"cancelled"`	Yes	User-initiated cancellation

Transition Diagram¶

stateDiagram-v2
    [*] --> PENDING : submit_backtest()
    PENDING --> RUNNING : Executor starts task
    PENDING --> FAILED : Submission error
    PENDING --> CANCELLED : cancel_backtest()
    RUNNING --> COMPLETED : Backtest finishes OK
    RUNNING --> FAILED : Runtime error / timeout
    RUNNING --> CANCELLED : cancel_backtest() + ExecutionStopper
    COMPLETED --> [*]
    FAILED --> [*]
    CANCELLED --> [*]

Valid Transitions (from `_VALID_TRANSITIONS`)¶

PENDING  -> {RUNNING, FAILED, CANCELLED}
RUNNING  -> {COMPLETED, FAILED, CANCELLED}

Same-state transitions are allowed for idempotency. Terminal states reject all outbound transitions.
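
The guard described above can be sketched as follows. This is a minimal illustration, not the actual `JobStatus` source: the module-level placement of `_VALID_TRANSITIONS`, the `_TERMINAL_STATES` name, and the choice to accept a same-state write even on a terminal state are assumptions.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

    @property
    def is_terminal(self) -> bool:
        return self in _TERMINAL_STATES

    def can_transition_to(self, target: "JobStatus") -> bool:
        # Assumption: a same-state write is an idempotent no-op, even on a
        # terminal state (for example, a redelivered COMPLETED update).
        if target is self:
            return True
        # Terminal states have no entry in the map, so they reject everything.
        return target in _VALID_TRANSITIONS.get(self, frozenset())


# Defined at module level (assumption) so the mapping is not treated as an
# enum member.
_VALID_TRANSITIONS = {
    JobStatus.PENDING: frozenset(
        {JobStatus.RUNNING, JobStatus.FAILED, JobStatus.CANCELLED}
    ),
    JobStatus.RUNNING: frozenset(
        {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}
    ),
}
_TERMINAL_STATES = frozenset(
    {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}
)
```

A caller would typically check `current.can_transition_to(new)` before persisting the new status and reject the update otherwise.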

Transition Triggers¶

Transition	Trigger	Component
`[*] -> PENDING`	`submit_backtest()` creates job with UUID	`BacktestOrchestrationService`
`PENDING -> RUNNING`	SQS consumer / Step Functions picks up job	`sqs-trigger` Lambda
`PENDING -> FAILED`	Executor submission raises `ExternalServiceError`	`BacktestOrchestrationService`
`PENDING/RUNNING -> CANCELLED`	`cancel_backtest()` + optional `ExecutionStopper.stop()`	`BacktestOrchestrationService`
`RUNNING -> COMPLETED`	Backtest consumer processes results from S3	`backtest-consumer` Lambda
`RUNNING -> FAILED`	Task error, timeout, or orphan scanner cleanup	`update-status` Lambda

Concurrent Job Limit

`BacktestOrchestrationService` enforces a per-strategy limit on concurrent RUNNING jobs (default: 3). The check is subject to a TOCTOU (time-of-check to time-of-use) race: within a window of a few milliseconds, at most limit + 1 jobs can briefly run. This is accepted as non-critical.
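
To make the race concrete, the sketch below replays one interleaving by hand: two submissions both pass the capacity check before either records its job. `InMemoryJobStore` and `has_capacity` are hypothetical stand-ins for the DynamoDB-backed store and the service's check, not the real implementation.

```python
from collections import defaultdict

DEFAULT_LIMIT = 3  # default per-strategy limit stated above


class InMemoryJobStore:
    """Hypothetical stand-in for the job state table."""

    def __init__(self) -> None:
        self._running = defaultdict(set)

    def count_running(self, strategy: str) -> int:
        return len(self._running[strategy])

    def mark_running(self, strategy: str, job_id: str) -> None:
        self._running[strategy].add(job_id)


def has_capacity(store: InMemoryJobStore, strategy: str,
                 limit: int = DEFAULT_LIMIT) -> bool:
    # Check step only; the write happens later, so the pair is not atomic.
    return store.count_running(strategy) < limit


store = InMemoryJobStore()
store.mark_running("demo", "job-1")
store.mark_running("demo", "job-2")

# Interleaving: both checks run before either write.
first_ok = has_capacity(store, "demo")
second_ok = has_capacity(store, "demo")
if first_ok:
    store.mark_running("demo", "job-3")
if second_ok:
    store.mark_running("demo", "job-4")

running_after_race = store.count_running("demo")  # one over the limit
```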

Event Publishing

State transitions emit a `JobStateEvent` via `EventPublisher` (M11). Failed transitions include an `error` field in the event payload for downstream alerting.

3. Trading Session Lifecycle¶

Source: `tradai.common.entities.trading_state.TradingStatus` | Handler: `TradingHandler` (common entrypoint)

States¶

State	Value	Active	Terminal	Description
`INITIALIZING`	`"initializing"`	No	No	Container starting, loading config & credentials
`WARMUP`	`"warmup"`	Yes	No	Loading historical data from ArcticDB
`RUNNING`	`"running"`	Yes	No	Actively trading via Freqtrade subprocess
`PAUSED`	`"paused"`	No	No	Temporarily paused (manual intervention)
`ERROR`	`"error"`	No	Yes	Fatal error, needs investigation
`STOPPED`	`"stopped"`	No	Yes	Gracefully stopped

Transition Diagram¶

stateDiagram-v2
    [*] --> INITIALIZING : Container starts
    INITIALIZING --> WARMUP : Config + credentials loaded
    WARMUP --> RUNNING : Historical data loaded
    RUNNING --> PAUSED : Manual pause (C9)
    PAUSED --> RUNNING : Manual resume (C9)
    RUNNING --> STOPPED : Graceful shutdown (SIGTERM)
    RUNNING --> ERROR : Unhandled exception
    INITIALIZING --> ERROR : Config/credential failure
    WARMUP --> ERROR : Warmup failure
    PAUSED --> STOPPED : Shutdown while paused
    PAUSED --> ERROR : Error while paused
    ERROR --> [*]
    STOPPED --> [*]

Lifecycle Phases in `TradingHandler.run()`¶

Phase 1: INITIALIZING
  - _initialize_state_management()     -> DynamoDB TradingStateRepository
  - _load_strategy_config()            -> MLflow/S3 config (LIVE fails on error)
  - _load_exchange_credentials()       -> AWS Secrets Manager

Phase 2: WARMUP
  - _warmup_historical_data()          -> ArcticDB candle data

Phase 3: RUNNING
  - _create_trader()                   -> FreqtradeTrader instance
  - _start_health_reporter()           -> Heartbeats + MetricsCollector + RiskMonitor
  - _execute_trading_loop()            -> Freqtrade subprocess

Phase 4: STOPPED (or ERROR on exception)
  - _cleanup()                         -> Stop health reporter, clear refs

LIVE vs DRY_RUN

In LIVE mode, config load failures and missing exchange credentials are fatal (raise TradingError). In DRY_RUN mode, these degrade gracefully with warnings.
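
The difference between the two modes can be sketched as a small wrapper around a loader. The `TradingMode` enum, the `load_or_degrade` helper, and the empty-dict fallback are assumptions made for illustration; only `TradingError` and the fatal-versus-warning behavior come from the description above.

```python
import logging
from enum import Enum
from typing import Any, Callable, Dict

logger = logging.getLogger("trading_handler_sketch")


class TradingError(Exception):
    """Local stand-in for the TradingError named above."""


class TradingMode(str, Enum):  # hypothetical enum name
    LIVE = "live"
    DRY_RUN = "dry_run"


def load_or_degrade(loader: Callable[[], Dict[str, Any]],
                    mode: TradingMode, what: str) -> Dict[str, Any]:
    """Run a loader; fail hard in LIVE, warn and continue in DRY_RUN."""
    try:
        return loader()
    except Exception as exc:
        if mode is TradingMode.LIVE:
            raise TradingError(f"{what} unavailable in LIVE mode") from exc
        logger.warning("%s unavailable in DRY_RUN mode, continuing: %s",
                       what, exc)
        return {}  # assumed fallback value
```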

Terminal State Timestamps

with_status() automatically sets stopped_at when transitioning to ERROR or STOPPED. Transitioning to INITIALIZING clears stopped_at for container restart scenarios.
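
A minimal sketch of this timestamp behavior, assuming a frozen dataclass; the real `TradingState` has more fields, and only `strategy_id`, the status, and `stopped_at` are modelled here.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TradingStatus(str, Enum):
    INITIALIZING = "initializing"
    WARMUP = "warmup"
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"
    STOPPED = "stopped"


@dataclass(frozen=True)
class TradingState:
    strategy_id: str
    status: TradingStatus = TradingStatus.INITIALIZING
    stopped_at: Optional[datetime] = None

    def with_status(self, new_status: TradingStatus) -> "TradingState":
        """Return a new instance; the original is never mutated."""
        if new_status in (TradingStatus.ERROR, TradingStatus.STOPPED):
            now = datetime.now(timezone.utc)
            return replace(self, status=new_status, stopped_at=now)
        if new_status is TradingStatus.INITIALIZING:
            # Container restart: clear the previous stop timestamp.
            return replace(self, status=new_status, stopped_at=None)
        return replace(self, status=new_status)
```

Because the dataclass is frozen, every call returns a fresh object, so a caller can keep the previous state for comparison or logging.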

4. Strategy Promotion¶

Source: `tradai.common.entities.mlflow.ModelStage` | Service: `ModelPromotionService` (strategy-service)

States¶

Stage	Value	Description
`NONE`	`"None"`	Default for new model registrations
`STAGING`	`"Staging"`	Ready for validation/testing
`PRODUCTION`	`"Production"`	Live production model
`ARCHIVED`	`"Archived"`	Previous production version (rollback target)

Transition Diagram¶

stateDiagram-v2
    [*] --> NONE : Model registered in MLflow
    NONE --> STAGING : stage()
    STAGING --> PRODUCTION : promote_to_production()
    PRODUCTION --> ARCHIVED : Auto-archive when new version promoted
    ARCHIVED --> PRODUCTION : rollback()
    STAGING --> PRODUCTION : promote_to_production(skip_validation=True)
    NONE --> PRODUCTION : promote_to_production(skip_validation=True)
    ARCHIVED --> STAGING : stage() (re-stage for re-validation)

Promotion Workflow¶

1. Stage: `service.stage(model_name, version)` -- moves the version to Staging
2. Validate: `service.validate_for_production(model_name, version)` -- checks:
   - Version exists and is in READY status
   - Version is in Staging stage
   - Required tags present (`strategy_name`, `strategy_version`)
   - Optional: validation backtest (7-day window)
3. Promote: `service.promote_to_production(model_name, version)` -- atomic operation:
   - Archives all existing Production versions (via `archive_existing_versions=True`)
   - Promotes target version to Production
4. Rollback: `service.rollback(model_name, target_version)` -- promotes an Archived/Staging/None version back to Production

A/B Test Guard

`promote_to_production()` checks for active A/B tests (`ABTestStatus.RUNNING`). While a test is running, only the current champion may be promoted; promoting the challenger or an unrelated version is blocked until the A/B test completes.
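
The guard can be pictured roughly as below. The `ABTest` record shape, the `PromotionBlockedError` name, the `ensure_promotable` helper, and the `COMPLETED` member are all hypothetical; only `ABTestStatus.RUNNING` and the champion-only rule are taken from this section.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ABTestStatus(str, Enum):
    RUNNING = "running"      # the only member named in this document
    COMPLETED = "completed"  # assumed for illustration


@dataclass(frozen=True)
class ABTest:  # hypothetical record shape
    model_name: str
    champion_version: str
    challenger_version: str
    status: ABTestStatus


class PromotionBlockedError(Exception):
    """Hypothetical error raised when the guard rejects a promotion."""


def ensure_promotable(model_name: str, version: str,
                      tests: Iterable[ABTest]) -> None:
    """Raise unless `version` may be promoted given the known A/B tests."""
    for test in tests:
        is_active = (test.model_name == model_name
                     and test.status is ABTestStatus.RUNNING)
        # During an active test only the champion passes the guard.
        if is_active and version != test.champion_version:
            raise PromotionBlockedError(
                f"version {version} of {model_name} is not the champion "
                "of a running A/B test"
            )
```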

Atomic Archive+Promote

The archive-then-promote operation is delegated to MLflow's `transition_model_version_stage(archive_existing_versions=True)` so that a crash between the archive and promote steps cannot leave the model with zero Production versions.
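
As a sketch, the delegation amounts to a single registry call. The `promote_atomically` helper and the `RegistryClient` protocol are illustrative names; the client passed in is assumed to be an `mlflow.tracking.MlflowClient`, whose `transition_model_version_stage` method accepts these keyword arguments.

```python
from typing import Any, Protocol


class RegistryClient(Protocol):
    """Structural type for the one MlflowClient method this sketch uses."""

    def transition_model_version_stage(
        self, name: str, version: str, stage: str,
        archive_existing_versions: bool = False,
    ) -> Any: ...


def promote_atomically(client: RegistryClient, model_name: str,
                       version: str) -> Any:
    # One registry call archives the current Production versions and
    # promotes the target, instead of two separate client-side steps.
    return client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
```

Typing the client as a protocol keeps the helper testable with a fake registry, without a running MLflow server.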

5. ML Retraining Pipeline¶

Source: `tradai.common.entities.retraining.RetrainingStatus` | Orchestrator: Step Functions retraining workflow (MO005)

States¶

State	Value	Description
`PENDING`	`"pending"`	Waiting to start
`CHECKING`	`"checking"`	Evaluating if retraining is needed
`TRAINING`	`"training"`	ECS training task running
`VALIDATING`	`"validating"`	Validation backtest running (see note below)
`COMPARING`	`"comparing"`	Comparing champion vs challenger model
`PROMOTING`	`"promoting"`	Promoting new model version
`COMPLETED`	`"completed"`	Successfully completed
`FAILED`	`"failed"`	Workflow failed at any step
`SKIPPED`	`"skipped"`	Retraining not needed (recently trained / no drift)

VALIDATING state vs Step Functions workflow

VALIDATING is a code-level state that can be set by the retraining handler, but the Step Functions workflow (retraining_workflow.json.j2) does not have a separate RunValidation task -- validation happens within the training/comparison steps. The workflow goes directly from RunRetraining to CompareModels. See 06-STEP-FUNCTIONS.md Section 4 for the actual workflow states.

Transition Diagram¶

stateDiagram-v2
    [*] --> PENDING : Trigger received
    PENDING --> CHECKING : Workflow starts
    CHECKING --> SKIPPED : No retraining needed
    CHECKING --> TRAINING : Retraining needed
    TRAINING --> VALIDATING : Training complete (handler-level)
    TRAINING --> FAILED : Training error
    VALIDATING --> COMPARING : Validation backtest done
    VALIDATING --> FAILED : Validation error
    COMPARING --> PROMOTING : Challenger wins
    COMPARING --> COMPLETED : Champion wins (no promotion)
    COMPARING --> FAILED : Comparison error
    PROMOTING --> COMPLETED : Model promoted
    PROMOTING --> FAILED : Promotion error
    SKIPPED --> [*]
    COMPLETED --> [*]
    FAILED --> [*]
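
The diagram above can be transcribed into a transition map for quick path checks. This is a sketch built only from the edges drawn in the diagram; the `_TRANSITIONS` name and the `is_valid_path` helper are hypothetical, and the direct Step Functions path from training to comparison is deliberately not modelled.

```python
from enum import Enum
from typing import Sequence


class RetrainingStatus(str, Enum):
    PENDING = "pending"
    CHECKING = "checking"
    TRAINING = "training"
    VALIDATING = "validating"
    COMPARING = "comparing"
    PROMOTING = "promoting"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


_S = RetrainingStatus

# Edges copied from the diagram; terminal states have no outgoing edges.
_TRANSITIONS = {
    _S.PENDING: frozenset({_S.CHECKING}),
    _S.CHECKING: frozenset({_S.SKIPPED, _S.TRAINING}),
    _S.TRAINING: frozenset({_S.VALIDATING, _S.FAILED}),
    _S.VALIDATING: frozenset({_S.COMPARING, _S.FAILED}),
    _S.COMPARING: frozenset({_S.PROMOTING, _S.COMPLETED, _S.FAILED}),
    _S.PROMOTING: frozenset({_S.COMPLETED, _S.FAILED}),
}


def is_valid_path(path: Sequence[RetrainingStatus]) -> bool:
    """True if every consecutive pair in `path` is an edge in the diagram."""
    return all(
        nxt in _TRANSITIONS.get(cur, frozenset())
        for cur, nxt in zip(path, path[1:])
    )
```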

Triggers (`RetrainingTrigger`)¶

Trigger	Value	Source
`DRIFT_DETECTED`	`"drift_detected"`	`drift-monitor` Lambda
`SCHEDULED`	`"scheduled"`	`retraining-scheduler` Lambda
`MANUAL`	`"manual"`	CLI / API manual trigger
`PERFORMANCE_DEGRADATION`	`"performance_degradation"`	CloudWatch alarm

Check Decision (`RetrainingDecision`)¶

Decision	Value	Description
`NEEDS_RETRAINING`	`"needs_retraining"`	Drift/schedule/performance triggers retraining
`NO_RETRAINING`	`"no_retraining"`	Metrics within acceptable bounds
`RECENTLY_TRAINED`	`"recently_trained"`	Last retraining too recent, cooldown active
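
Read together with the transition diagram, the check decision selects the state that follows CHECKING. A minimal sketch, assuming a hypothetical `status_after_check` helper and listing only the two statuses it needs:

```python
from enum import Enum


class RetrainingDecision(str, Enum):
    NEEDS_RETRAINING = "needs_retraining"
    NO_RETRAINING = "no_retraining"
    RECENTLY_TRAINED = "recently_trained"


class RetrainingStatus(str, Enum):
    # Only the two members this sketch needs; the real enum has nine.
    TRAINING = "training"
    SKIPPED = "skipped"


def status_after_check(decision: RetrainingDecision) -> RetrainingStatus:
    """Map the outcome of the CHECKING step to the next workflow state."""
    if decision is RetrainingDecision.NEEDS_RETRAINING:
        return RetrainingStatus.TRAINING
    # Both "metrics in bounds" and "cooldown active" end the run as SKIPPED.
    return RetrainingStatus.SKIPPED
```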

State Persistence

RetrainingState is stored in DynamoDB with model_name as partition key. It tracks task_arn, step_function_execution_arn, and new_version for end-to-end traceability. The error_message field captures failure details.

6. Config Version Lifecycle¶

Source: `tradai.common.entities.config_version.ConfigVersionStatus` | Service: `ConfigVersionService` (common)

States¶

State	Value	Terminal	Description
`DRAFT`	`"draft"`	No	Not yet validated or deployed
`ACTIVE`	`"active"`	No	Currently deployed (one per strategy)
`DEPRECATED`	`"deprecated"`	Yes	Superseded by newer version, auto-cleanup via TTL

Transition Diagram¶

stateDiagram-v2
    [*] --> DRAFT : Config version created
    DRAFT --> ACTIVE : Validated and deployed
    ACTIVE --> DEPRECATED : Superseded by new ACTIVE version
    DEPRECATED --> [*] : DynamoDB TTL auto-deletes (90 days)

Key Behaviors¶

- Content-addressable: Each version has a SHA-256 `config_hash` for deduplication.
- Single active: Only one ACTIVE config version per strategy at any time.
- Immutable updates: `with_status(ACTIVE)` sets `deployed_at`; `with_status(DEPRECATED)` sets `deprecated_at` and `ttl`.
- Auto-cleanup: Deprecated versions have a 90-day TTL (`_TTL_DAYS = 90`) after which DynamoDB auto-deletes them.
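
The hash and TTL behaviors can be sketched as two small helpers. Canonical JSON with sorted keys is an assumed serialization for the hash input, and the helper names are hypothetical; the epoch-seconds format is what DynamoDB's TTL feature expects.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

_TTL_DAYS = 90  # value stated above


def config_hash(config: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ttl_epoch_seconds(deprecated_at: datetime) -> int:
    """Expiry time as Unix epoch seconds, _TTL_DAYS after deprecation."""
    return int((deprecated_at + timedelta(days=_TTL_DAYS)).timestamp())


deprecated_at = datetime(2026, 3, 28, tzinfo=timezone.utc)
expires_at = ttl_epoch_seconds(deprecated_at)
```

Two configs with the same content but different key order produce the same hash, which is what makes the hash usable for deduplication.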

Activation Side Effect

When a new version is activated, the BacktestOrchestrationService can resolve config_version_id="ACTIVE" to the current active config. Activating a new version implicitly makes all future "ACTIVE" references point to it.
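
A sketch of the alias resolution, assuming a hypothetical `resolve_config_version_id` helper and a plain mapping from strategy name to its active config ID in place of the real lookup:

```python
from typing import Mapping

ACTIVE_ALIAS = "ACTIVE"


def resolve_config_version_id(requested: str, strategy_name: str,
                              active_ids: Mapping[str, str]) -> str:
    """Return a concrete config ID, expanding the "ACTIVE" alias if used."""
    if requested != ACTIVE_ALIAS:
        return requested  # already a concrete ID, pass through unchanged
    try:
        return active_ids[strategy_name]
    except KeyError:
        raise LookupError(
            f"no ACTIVE config version for {strategy_name!r}"
        ) from None
```

Because the alias is resolved at lookup time, activating a new version changes the result of later calls without touching the callers.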

7. DynamoDB State Tables¶

All tables use the naming pattern `tradai-{purpose}-{env}`, where `{env}` is the deployment environment.

State Machine	DynamoDB Table	Partition Key	Entity Class
Backtest Job	`tradai-workflow-state-{env}`	`run_id`	`BacktestJobStatus`
Trading Session	`tradai-trading-state-{env}`	`strategy_id`	`TradingState`
ML Retraining	`tradai-retraining-state-{env}`	`model_name`	`RetrainingState`
Config Version	`tradai-config-versions-{env}`	`strategy_name` (SK: `config_id`)	`ConfigVersion`
Health State	`tradai-health-state-{env}`	`strategy_id`	Health check records
Drift State	`tradai-drift-state-{env}`	`model_name`	Drift detection results
Rollback State	`tradai-rollback-state-{env}`	-	Rollback audit trail
Notifications	`tradai-notifications-{env}`	-	Alert/notification records
Idempotency	`tradai-idempotency-{env}`	-	Lambda deduplication

Strategy Promotion uses MLflow Model Registry (not DynamoDB) for stage tracking.

Serialization¶

All DynamoDB-persisted state entities extend DynamoDBSerializableMixin, which provides:

to_dynamodb_item() -- serialize to DynamoDB-compatible dict
from_dynamodb_item() -- deserialize from DynamoDB item

Combined with frozen=True and with_status() methods, this ensures all state mutations produce new immutable instances that are safely persisted.
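
The pattern can be illustrated with a simplified mixin. This is a sketch, not the real `DynamoDBSerializableMixin`: it assumes frozen dataclasses, handles only plain values and enums, and uses a cut-down `ConfigVersion` with three fields.

```python
from dataclasses import dataclass, fields, replace
from enum import Enum
from typing import Any, Dict


class DynamoDBSerializableMixin:
    """Simplified stand-in: converts dataclass fields to and from a dict."""

    def to_dynamodb_item(self) -> Dict[str, Any]:
        item = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                # Store enums by value so the item holds plain strings.
                item[f.name] = value.value if isinstance(value, Enum) else value
        return item

    @classmethod
    def from_dynamodb_item(cls, item: Dict[str, Any]):
        kwargs = {}
        for f in fields(cls):
            if f.name in item:
                raw = item[f.name]
                is_enum = isinstance(f.type, type) and issubclass(f.type, Enum)
                kwargs[f.name] = f.type(raw) if is_enum else raw
        return cls(**kwargs)


class ConfigVersionStatus(str, Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class ConfigVersion(DynamoDBSerializableMixin):
    strategy_name: str
    config_id: str
    status: ConfigVersionStatus = ConfigVersionStatus.DRAFT

    def with_status(self, status: ConfigVersionStatus) -> "ConfigVersion":
        return replace(self, status=status)  # new instance, original intact
```

A typical flow would be `item = version.with_status(...).to_dynamodb_item()` followed by a write of `item` to the table, with `from_dynamodb_item()` used on reads.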

Cross-Reference: Trace Context¶

Every backtest/training execution carries a unified trace context:

Field	Propagated Through	Used By
`trace_id`	DynamoDB, Step Functions, ECS env	End-to-end correlation
`job_id`	DynamoDB, S3 results	Job tracking
`mlflow_run_id`	DynamoDB, BacktestResult	Experiment tracking
`git_commit`	BacktestResult, MLflow tags	Code version pinning

Changelog¶

Version	Date	Changes
1.0.0	2026-03-28	Initial creation — 5 state machines from actual enums

Dependencies¶

If This Changes	Update This Doc
`libs/tradai-common/src/tradai/common/entities/aws.py` JobStatus	Backtest states (Section 2)
`libs/tradai-common/src/tradai/common/entities/trading_state.py` TradingStatus	Trading states (Section 3)
`libs/tradai-common/src/tradai/common/entities/mlflow.py` ModelStage	Promotion states (Section 4)
`libs/tradai-common/src/tradai/common/entities/retraining.py` RetrainingStatus	Retraining states (Section 5)
`libs/tradai-common/src/tradai/common/entities/config_version.py` ConfigVersionStatus	Config states (Section 6)
