TradAI ML Strategy Lifecycle Architecture

Version: 2.0.0 | Date: 2026-03-28 | Status: IMPLEMENTED -- E2E Validation Pending

Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md

TL;DR: TradAI implements a complete ML strategy lifecycle: FreqAI walk-forward training with LightGBM/CatBoost/XGBoost, automated MLflow model registration with reproducibility manifests, and a full MLOps pipeline (drift detection via PSI/KS statistics, scheduled retraining, champion-vs-challenger comparison, and auto-rollback). All components (ML001-ML005, MO001-MO004) are built and unit-tested. End-to-end validation across the full pipeline remains pending.

1. Lifecycle Overview

flowchart LR
    A["Feature<br/>Engineering"] --> B["Model<br/>Training"]
    B --> C["MLflow<br/>Registry"]
    C --> D["Backtest<br/>Validation"]
    D --> E["Live/Paper<br/>Trading"]
    E --> F["Drift<br/>Detection"]
    F -->|"retrain<br/>trigger"| B
    F -->|"rollback<br/>trigger"| C

    style A fill:#1565c0,color:#fff
    style B fill:#1565c0,color:#fff
    style C fill:#2e7d32,color:#fff
    style D fill:#2e7d32,color:#fff
    style E fill:#f57f17,color:#000
    style F fill:#d32f2f,color:#fff

The ML lifecycle has six stages:

Stage	Description	Key Component
Feature Engineering	Strategy defines `%`-prefixed features and `&`-prefixed targets	FreqAI `feature_engineering_expand_all()`
Model Training	Walk-forward training via `TrainHandler` orchestrating Freqtrade	`TrainHandler`, `TrainingPredictionModel`
Model Registry	Auto-registration in MLflow with metrics, tags, reproducibility manifests	`ModelRegistrar`, `MLflowReporter`
Backtest Validation	Inference-only backtesting using registered models	`TradAIPredictionModel`
Drift Detection	PSI and KS-statistic monitoring on production predictions	`DriftDetector`, `drift-monitor` Lambda
Retraining / Rollback	Scheduled or drift-triggered retraining, automatic rollback	`retraining-scheduler`, `model-rollback` Lambdas
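The prefix convention in the Feature Engineering row can be illustrated with a minimal sketch. The helper name and the example column names are hypothetical; only the `%` and `&` prefixes come from the table above.

```python
def split_feature_columns(columns):
    """Partition column names using the FreqAI prefix convention."""
    features = [c for c in columns if c.startswith("%")]  # model inputs
    targets = [c for c in columns if c.startswith("&")]   # prediction targets
    return features, targets


# Illustrative column names only; real strategies define their own.
columns = ["date", "close", "%-rsi-period_14", "%-pct-change", "&-s_close"]
features, targets = split_feature_columns(columns)
print(features)
print(targets)
```

Columns without either prefix (such as `date` and `close` here) are treated as neither features nor targets.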

2. Training Pipeline

Implemented: ML001-ML003

TrainHandler orchestrates the full training workflow: Freqtrade subprocess execution, result parsing, MLflow metric logging, model registration, and reproducibility manifest upload.

Architecture

TrainHandler delegates to four collaborators following single-responsibility:

Collaborator	Responsibility	File
`TrainingResultParser`	Parse Freqtrade stdout/backtest JSON for metrics	`entrypoint/training/result_parser.py`
`MLflowReporter`	Log metrics, tags, and environment info to MLflow	`entrypoint/training/mlflow_reporter.py`
`ModelRegistrar`	Upload artifacts, create model versions, optionally promote	`entrypoint/training/model_registrar.py`
`ReproducibilityManifestBuilder`	Build JSON manifests with feature schemas, seeds, git info	`entrypoint/training/manifest_builder.py`

Training Flow

TrainHandler.run() sets global random seeds (if configured) for reproducibility
Builds TrainingConfig from environment settings (strategy, FreqAI model, pairs, timeframe)
Executes Freqtrade backtesting as a subprocess with the chosen FreqAI model (e.g., LightGBMRegressor)
TrainingResultParser extracts validation metrics (Sharpe, profit, drawdown, trade count)
MLflowReporter.log() creates an MLflow run with all metrics and tags
ModelRegistrar.register() uploads model artifacts and creates a new model version
Updates DynamoDB job status to COMPLETED
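The ordering of these steps can be sketched with injected callables standing in for the collaborators. This is not the actual `TrainHandler` code: the class name `TrainingRunSketch`, its field names, and the callable signatures are all assumptions made for illustration, and the `TrainingConfig` construction step is omitted for brevity.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TrainingRunSketch:
    """Hypothetical stand-in for TrainHandler; every step is an injected callable."""

    run_training: Callable[[], str]       # would run the Freqtrade subprocess
    parse_metrics: Callable[[str], dict]  # role of TrainingResultParser
    log_metrics: Callable[[dict], str]    # role of MLflowReporter, returns a run id
    register_model: Callable[[str], int]  # role of ModelRegistrar, returns a version
    update_status: Callable[[str], None]  # e.g. the DynamoDB job status update
    seed: Optional[int] = None
    steps: list = field(default_factory=list)

    def run(self) -> dict:
        if self.seed is not None:
            random.seed(self.seed)  # seed first so later steps are reproducible
            self.steps.append("seed")
        raw = self.run_training()
        self.steps.append("train")
        metrics = self.parse_metrics(raw)
        self.steps.append("parse")
        run_id = self.log_metrics(metrics)
        self.steps.append("log")
        version = self.register_model(run_id)
        self.steps.append("register")
        self.update_status("COMPLETED")
        self.steps.append("status")
        return {"run_id": run_id, "version": version, "metrics": metrics}


statuses = []
sketch = TrainingRunSketch(
    run_training=lambda: "sharpe=1.2",
    parse_metrics=lambda raw: {"sharpe": float(raw.split("=")[1])},
    log_metrics=lambda metrics: "run-1",
    register_model=lambda run_id: 1,
    update_status=statuses.append,
    seed=42,
)
result = sketch.run()
```

Injecting each collaborator keeps the orchestration testable without Freqtrade, MLflow, or DynamoDB being available.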

See libs/tradai-common/src/tradai/common/entrypoint/training/handler.py

Training Configuration

TrainingConfig is a frozen Pydantic model that validates all training parameters:

strategy -- FreqAI-enabled strategy class name
freqai_model -- ML model type, validated against FreqAIModelRegistry (LightGBM, CatBoost, XGBoost)
train_period_days -- Training window (7-365 days)
backtest_period_days -- Walk-forward validation period (1-30 days)
n_estimators, learning_rate, max_depth -- Model hyperparameters
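A rough idea of the range validation, sketched with a frozen stdlib dataclass as a stand-in for the Pydantic model. The class name, the default values, and the `SUPPORTED_MODELS` set are assumptions; the real set of accepted names is whatever `FreqAIModelRegistry` defines.

```python
from dataclasses import FrozenInstanceError, dataclass

# Placeholder for the names FreqAIModelRegistry accepts (assumption).
SUPPORTED_MODELS = {"LightGBMRegressor", "CatboostRegressor", "XGBoostRegressor"}


@dataclass(frozen=True)
class TrainingConfigSketch:
    strategy: str
    freqai_model: str
    train_period_days: int = 30     # default chosen for illustration
    backtest_period_days: int = 7   # default chosen for illustration

    def __post_init__(self) -> None:
        if self.freqai_model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported FreqAI model: {self.freqai_model}")
        if not 7 <= self.train_period_days <= 365:
            raise ValueError("train_period_days must be within 7-365")
        if not 1 <= self.backtest_period_days <= 30:
            raise ValueError("backtest_period_days must be within 1-30")


config = TrainingConfigSketch(strategy="MyStrategy", freqai_model="LightGBMRegressor")
```

Because the instance is frozen, assigning to `config.train_period_days` after construction raises `FrozenInstanceError`, so a validated config cannot drift out of range later.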

See libs/tradai-common/src/tradai/common/entrypoint/training/config.py

3. Walk-Forward Training Model (FreqAI)

Implemented: ML005

TrainingPredictionModel extends FreqAI's BaseRegressionModel to support actual ML training during walk-forward backtesting with LightGBM, CatBoost, and XGBoost.

TradAI provides two custom FreqAI models:

Model	Mode	Purpose
`TrainingPredictionModel`	Training	Walk-forward training with progressive model retraining
`TradAIPredictionModel`	Inference	Loads pre-trained models from CSV, S3, or MLflow (no training)

TrainingPredictionModel implements the fit() method called by FreqAI for each training window. It extracts FitParams (train/test splits, sample weights), delegates to the configured ML framework, and logs per-window metrics.
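To make "each training window" concrete, the following sketch generates sliding windows from a training length and a validation length. It assumes the window advances by the validation period each step; the function name is hypothetical and FreqAI computes its own windows internally.

```python
from datetime import datetime, timedelta


def walk_forward_windows(start, end, train_days, backtest_days):
    """Yield (train_start, train_end, test_end), sliding forward by backtest_days."""
    train = timedelta(days=train_days)
    test = timedelta(days=backtest_days)
    cursor = start
    # Stop once a full train + test window no longer fits before `end`.
    while cursor + train + test <= end:
        train_end = cursor + train
        yield cursor, train_end, train_end + test
        cursor += test


windows = list(
    walk_forward_windows(datetime(2025, 1, 1), datetime(2025, 3, 1), 30, 7)
)
for train_start, train_end, test_end in windows:
    print(train_start.date(), train_end.date(), test_end.date())
```

Under this assumption, `fit()` would be called once per yielded tuple, training on `[train_start, train_end)` and validating on `[train_end, test_end)`.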

TradAIPredictionModel uses a PredictionLoader protocol with two implementations: CSVLoader (local file) and MLflowLoader (model registry). The PredictionSource enum controls which loader is used.
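A minimal sketch of how a loader protocol and a source enum can fit together. The `load()` method, the constructor arguments, the `prediction` column name, and the `make_loader` factory are assumptions; only the names `PredictionLoader`, `CSVLoader`, `MLflowLoader`, and `PredictionSource` appear in the text above.

```python
import csv
import io
from enum import Enum
from typing import Protocol


class PredictionSource(Enum):
    CSV = "csv"
    MLFLOW = "mlflow"


class PredictionLoader(Protocol):
    def load(self) -> list[float]: ...


class CSVLoader:
    def __init__(self, text: str) -> None:
        self._text = text  # CSV content; a real loader would read from a path

    def load(self) -> list[float]:
        reader = csv.DictReader(io.StringIO(self._text))
        return [float(row["prediction"]) for row in reader]


class MLflowLoader:
    def __init__(self, model_uri: str) -> None:
        self._model_uri = model_uri

    def load(self) -> list[float]:
        # Placeholder: would fetch the registered model and run inference.
        raise NotImplementedError("registry access is out of scope for this sketch")


def make_loader(source: PredictionSource, location: str) -> PredictionLoader:
    """Pick a loader implementation from the enum value."""
    if source is PredictionSource.CSV:
        return CSVLoader(location)
    return MLflowLoader(location)


predictions = make_loader(PredictionSource.CSV, "prediction\n0.1\n-0.2\n").load()
```

Using a structural protocol means neither loader needs to inherit from a shared base class; any object with a matching `load()` satisfies it.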

Key files:

libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py
libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

4. Model Registry (MLflow)

Implemented: ML004 (Hyperparameter Optimization) + MLflowAdapter with full CRUD

ML004 provides Optuna integration for hyperparameter search (cli/src/tradai/cli/optuna_commands.py, services/strategy-service/src/tradai/strategy_service/core/optuna_handler.py).

The MLflowAdapter composes four mixins (MLflowClientMixin, ExperimentsMixin, RegistryMixin, ArtifactsMixin) that provide experiment tracking, model registration, stage transitions, and artifact management. It is compatible with MLflow 3.x.

Registry Lifecycle

Training Complete
    --> ModelRegistrar uploads artifacts (model files, feature importance, manifests)
    --> Creates new model version in MLflow (stage: None)
    --> Optionally promotes to Staging if metrics exceed thresholds
    --> compare-models Lambda compares champion vs challenger
    --> promote-model Lambda transitions winner to Production

Stage Transitions

Stage	Meaning	Trigger
`None`	Newly registered, not yet evaluated	Auto after training
`Staging`	Passed initial thresholds, under evaluation	Auto-promotion in `ModelRegistrar`
`Production`	Active model for live inference	`promote-model` Lambda
`Archived`	Previous version, retained for rollback	On promotion of replacement
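The table reads naturally as a small state machine. The sketch below encodes one plausible transition map; the map itself is an assumption derived from the triggers listed above (plus an `Archived` to `Production` edge for rollback) and is not enforced by MLflow, which accepts arbitrary stage changes.

```python
# Assumed transition map; MLflow itself does not restrict stage changes.
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},         # auto-promotion in ModelRegistrar
    "Staging": {"Production"},   # promote-model Lambda
    "Production": {"Archived"},  # replaced by a newer version
    "Archived": {"Production"},  # restored during rollback
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the assumed lifecycle permits current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("Staging", "Production"))
print(can_transition("None", "Production"))
```

A guard like this could sit in front of `transition_model_version_stage` to reject transitions that skip evaluation.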

Key Operations (RegistryMixin)

get_model_version(name, version) -- Fetch version details
create_model_version(name, source, run_id, tags) -- Register new version
transition_model_version_stage(name, version, stage) -- Stage promotion/demotion
get_registered_model(name) -- Get model with all versions
search_registered_models(filter_string) -- Query models

See libs/tradai-common/src/tradai/common/mlflow/adapter.py and libs/tradai-common/src/tradai/common/mlflow/registry.py

5. Drift Detection

Implemented: MO002

DriftDetector calculates Population Stability Index (PSI) and Kolmogorov-Smirnov statistics for both predictions and individual features. The drift-monitor Lambda runs on an EventBridge schedule, fetching metrics from MLflow and publishing CloudWatch metrics.

DriftDetector

The DriftDetector class provides drift analysis with configurable thresholds:

Metric	Method	Interpretation
PSI (Population Stability Index)	Quantile-based binning (10 bins)	< 0.10 none, 0.10-0.25 moderate, >= 0.25 significant
KS Statistic	Two-sample Kolmogorov-Smirnov test	p-value < 0.05 indicates distribution shift
Accuracy Deviation	Sign-match accuracy comparison	> 10% deviation triggers alert
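The PSI and KS rows can be illustrated with a stdlib-only sketch. It assumes bin edges are taken from the reference sample's quantiles and that a small epsilon replaces empty bins; the function names are hypothetical and `DriftDetector` may differ in both respects. Only the KS D statistic is computed here; a p-value needs the Kolmogorov distribution (for example via `scipy.stats.ks_2samp`).

```python
import math
from bisect import bisect_right


def quantile_edges(reference, bins=10):
    """Interior bin edges so each bin holds roughly equal reference mass."""
    ordered = sorted(reference)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // bins)] for i in range(1, bins)]


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index: sum((cur - ref) * ln(cur / ref)) over bins."""
    edges = quantile_edges(reference, bins)

    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def classify(psi_value):
    """Map a PSI value onto the severity bands from the table."""
    if psi_value < 0.10:
        return "none"
    if psi_value < 0.25:
        return "moderate"
    return "significant"


def ks_statistic(a, b):
    """Two-sample KS D statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


reference = [i / 100 for i in range(1000)]
shifted = [x + 5 for x in reference]
print(classify(psi(reference, reference)), classify(psi(reference, shifted)))
```

Taking the edges from the reference sample means each reference bin holds about 10% of the mass, so PSI reflects how far the current sample departs from that even split.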

The analysis produces a DriftResult containing:

Overall PSI and severity classification (DriftSeverity enum)
Per-feature drift metrics (FeatureDrift with PSI, KS stat, mean/std shifts)
Actionable recommendations (auto-generated)
Alert message (when requires_attention=True)
CloudWatch-compatible metric output (to_cloudwatch_metrics())
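A possible shape for such a result object, as a sketch. The type and method names come from the list above; the field names, the metric and dimension names, and the rule that any severity above `NONE` requires attention are assumptions. The returned entries follow the `MetricData` structure that CloudWatch `PutMetricData` accepts.

```python
from dataclasses import dataclass
from enum import Enum


class DriftSeverity(Enum):
    NONE = "none"
    MODERATE = "moderate"
    SIGNIFICANT = "significant"


@dataclass(frozen=True)
class FeatureDrift:
    name: str
    psi: float
    ks_stat: float


@dataclass(frozen=True)
class DriftResult:
    model_name: str
    overall_psi: float
    severity: DriftSeverity
    features: tuple = ()

    @property
    def requires_attention(self) -> bool:
        # Assumption: anything above NONE warrants an alert.
        return self.severity is not DriftSeverity.NONE

    def to_cloudwatch_metrics(self) -> list:
        """Return entries shaped like PutMetricData's MetricData list."""
        model_dim = {"Name": "ModelName", "Value": self.model_name}
        metrics = [{
            "MetricName": "OverallPSI",  # hypothetical metric name
            "Dimensions": [model_dim],
            "Value": self.overall_psi,
            "Unit": "None",
        }]
        for feature in self.features:
            metrics.append({
                "MetricName": "FeaturePSI",  # hypothetical metric name
                "Dimensions": [model_dim, {"Name": "Feature", "Value": feature.name}],
                "Value": feature.psi,
                "Unit": "None",
            })
        return metrics


result = DriftResult(
    "demo-model", 0.31, DriftSeverity.SIGNIFICANT,
    (FeatureDrift("%-rsi-period_14", 0.42, 0.18),),
)
metrics = result.to_cloudwatch_metrics()
```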

Key files:

libs/tradai-common/src/tradai/common/drift/detector.py
libs/tradai-common/src/tradai/common/drift/entities.py

drift-monitor Lambda

The Lambda runs on a schedule and:

Fetches backtest metrics from MLflow experiments for each monitored model
Compares current period against reference (training) period
Calculates PSI and KS statistics for key metrics (profit_total, win_rate, sharpe_ratio, max_drawdown, trades_count)
Publishes CloudWatch metrics under the DriftMonitoring namespace
Sends SNS alerts when drift exceeds thresholds
Persists drift state to DynamoDB via DynamoDBStateRepository

See lambdas/drift-monitor/handler.py

6. Retraining Pipeline

Implemented: MO001

The retraining-scheduler Lambda triggers ECS Fargate training tasks based on scheduled intervals, drift detection results, or manual requests.

retraining-scheduler Lambda

The Lambda supports three trigger modes:

Trigger	Source	Behavior
Scheduled	EventBridge cron	Checks each model's last retraining timestamp against configured interval
Drift-triggered	Drift state in DynamoDB	Launches retraining when drift is detected
Manual	Direct invocation with `"trigger": "manual"`	Force-retrain specified models
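The three trigger modes can be expressed as a single decision function. This is a sketch under assumptions: the function name, its parameters, and the precedence (manual, then drift, then schedule) are illustrative; only the `"trigger": "manual"` event field comes from the table above.

```python
from datetime import datetime, timedelta, timezone


def retraining_reason(event, last_retrained, interval, drift_detected, now):
    """Return why a model should retrain, or None if no trigger applies."""
    if event.get("trigger") == "manual":
        return "manual"       # forced by direct invocation
    if drift_detected:
        return "drift"        # drift state read from storage
    if now - last_retrained >= interval:
        return "scheduled"    # configured interval has elapsed
    return None


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
last = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(retraining_reason({}, last, timedelta(days=7), False, now))
```

Passing `now` explicitly rather than calling `datetime.now()` inside keeps the decision deterministic in tests.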

Each retraining request launches an ECS Fargate task with TRADING_MODE=train and model-specific configuration (strategy, FreqAI model, pairs, training period).
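As an illustration, the request for such a task could be assembled as below. The keys match the keyword arguments of the boto3 ECS client's `run_task`; the function name, the container name, and every environment variable other than `TRADING_MODE` are hypothetical.

```python
def build_run_task_request(cluster, task_definition, container, subnets, config):
    """Build keyword arguments for an ECS run_task call (no AWS call is made)."""
    environment = [{"name": "TRADING_MODE", "value": "train"}]
    # Model-specific settings become container environment variables.
    environment += [{"name": k, "value": str(v)} for k, v in sorted(config.items())]
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [{"name": container, "environment": environment}]
        },
    }


request = build_run_task_request(
    "training-cluster", "training-task", "trainer", ["subnet-0example"],
    {"STRATEGY": "MyStrategy", "FREQAI_MODEL": "LightGBMRegressor"},  # hypothetical names
)
```

The resulting dictionary could then be passed as `ecs_client.run_task(**request)`.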

See lambdas/retraining-scheduler/handler.py

7. Model Comparison and Promotion

Implemented: MO003-MO004

ModelComparator provides champion-vs-challenger comparison. The compare-models Lambda produces promotion decisions, the promote-model Lambda executes stage transitions, and the model-rollback Lambda handles automatic rollback with cooldown protection.

ModelComparator

The ModelComparator class fetches backtest metrics for both champion and challenger from MLflow, compares key performance indicators, and returns a PromotionDecision:

Decision	Meaning
`PROMOTE`	Challenger outperforms champion with sufficient confidence
`KEEP`	Champion is better or difference is negligible
`INCONCLUSIVE`	Metrics are too close to call
`NEEDS_MORE_DATA`	Insufficient samples for reliable comparison
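One way the four outcomes could be separated is sketched below using a single metric. The thresholds (`min_samples`, `negligible`, `margin`), the single-metric comparison, and the function name are assumptions; `ModelComparator` compares several indicators and its actual rules are not described here.

```python
from enum import Enum


class PromotionDecision(Enum):
    PROMOTE = "PROMOTE"
    KEEP = "KEEP"
    INCONCLUSIVE = "INCONCLUSIVE"
    NEEDS_MORE_DATA = "NEEDS_MORE_DATA"


def decide(champion, challenger, samples,
           min_samples=30, negligible=0.01, margin=0.10):
    """Compare one metric (higher is better) and return a decision."""
    if samples < min_samples:
        return PromotionDecision.NEEDS_MORE_DATA
    delta = challenger - champion
    if abs(delta) < negligible:
        return PromotionDecision.KEEP          # difference is negligible
    if abs(delta) < margin:
        return PromotionDecision.INCONCLUSIVE  # visible gap, but too close to call
    return PromotionDecision.PROMOTE if delta > 0 else PromotionDecision.KEEP


print(decide(champion=1.0, challenger=1.5, samples=100))
```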

See libs/tradai-common/src/tradai/common/model_comparison/comparator.py

compare-models Lambda

Invoked by Step Functions after training completes. It receives model_name and challenger_run_id and returns a decision with a confidence score and profit improvement metrics.

See lambdas/compare-models/handler.py

promote-model Lambda

Executes the promotion: transitions the challenger to Production and archives the previous champion. Promotion proceeds only when the confidence score meets a minimum threshold.
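The promotion step can be sketched against an in-memory mapping of version to stage. The function name, the `min_confidence` default of 0.8, and the error raised below the threshold are assumptions; the real Lambda performs the transitions through the MLflow registry.

```python
def promote(stages, challenger, confidence, min_confidence=0.8):
    """Return a new version -> stage mapping with the challenger in Production."""
    if confidence < min_confidence:
        raise ValueError(
            f"confidence {confidence} is below the minimum {min_confidence}"
        )
    # Archive whichever version currently holds Production.
    updated = {
        version: ("Archived" if stage == "Production" else stage)
        for version, stage in stages.items()
    }
    updated[challenger] = "Production"
    return updated


print(promote({1: "Production", 2: "Staging"}, challenger=2, confidence=0.9))
```

Returning a new mapping instead of mutating the input leaves the previous state available if the caller needs to roll back.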

See lambdas/promote-model/handler.py

model-rollback Lambda

Rolls back to a previous model version when:

CloudWatch alarms trigger (drift or performance degradation)
Manual rollback is requested

Includes a configurable cooldown period (ROLLBACK_COOLDOWN_HOURS, default 24h) to prevent rapid rollback oscillation. Persists rollback state to DynamoDB.
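A minimal sketch of the cooldown check, assuming the last rollback timestamp is available from the persisted state. The function names and parameters are hypothetical; `ROLLBACK_COOLDOWN_HOURS` and its 24-hour default come from the text above.

```python
import os
from datetime import datetime, timedelta, timezone


def cooldown_hours(env=None):
    """Read ROLLBACK_COOLDOWN_HOURS, falling back to the 24h default."""
    env = os.environ if env is None else env
    return float(env.get("ROLLBACK_COOLDOWN_HOURS", "24"))


def rollback_allowed(last_rollback, now, hours):
    """Allow a rollback only if the cooldown has elapsed since the last one."""
    if last_rollback is None:
        return True  # no rollback recorded yet
    return now - last_rollback >= timedelta(hours=hours)


now = datetime(2026, 3, 28, 12, 0, tzinfo=timezone.utc)
print(rollback_allowed(now - timedelta(hours=1), now, cooldown_hours({})))
```

Without the cooldown, two alarms firing in quick succession could flip the Production model back and forth between versions.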

See lambdas/model-rollback/handler.py

8. Key Source Files

Component	File Path
TrainHandler	`libs/tradai-common/src/tradai/common/entrypoint/training/handler.py`
TrainingConfig / TrainingResult	`libs/tradai-common/src/tradai/common/entrypoint/training/config.py`
TrainingResultParser	`libs/tradai-common/src/tradai/common/entrypoint/training/result_parser.py`
MLflowReporter	`libs/tradai-common/src/tradai/common/entrypoint/training/mlflow_reporter.py`
ModelRegistrar	`libs/tradai-common/src/tradai/common/entrypoint/training/model_registrar.py`
ReproducibilityManifestBuilder	`libs/tradai-common/src/tradai/common/entrypoint/training/manifest_builder.py`
MLflowAdapter	`libs/tradai-common/src/tradai/common/mlflow/adapter.py`
RegistryMixin	`libs/tradai-common/src/tradai/common/mlflow/registry.py`
ExperimentsMixin	`libs/tradai-common/src/tradai/common/mlflow/experiments.py`
ArtifactsMixin	`libs/tradai-common/src/tradai/common/mlflow/artifacts.py`
DriftDetector	`libs/tradai-common/src/tradai/common/drift/detector.py`
DriftResult / DriftThresholds	`libs/tradai-common/src/tradai/common/drift/entities.py`
ModelComparator	`libs/tradai-common/src/tradai/common/model_comparison/comparator.py`
ComparisonResult / PromotionDecision	`libs/tradai-common/src/tradai/common/model_comparison/entities.py`
TrainingPredictionModel	`libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py`
TradAIPredictionModel	`libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py`
FreqAIConfigBuilder	`libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py`
drift-monitor Lambda	`lambdas/drift-monitor/handler.py`
retraining-scheduler Lambda	`lambdas/retraining-scheduler/handler.py`
compare-models Lambda	`lambdas/compare-models/handler.py`
promote-model Lambda	`lambdas/promote-model/handler.py`
model-rollback Lambda	`lambdas/model-rollback/handler.py`

9. Known Limitations

E2E Validation Pending

All components are implemented and unit-tested individually. The full pipeline (training -> registration -> drift detection -> retraining -> promotion) has not been validated end-to-end in a staging environment.

No Automated Drift-to-Retraining Trigger

The drift-monitor Lambda persists drift state to DynamoDB and publishes CloudWatch metrics, and the retraining-scheduler Lambda can read that drift state. However, no EventBridge rule or Step Functions workflow automatically chains drift detection into retraining. This connection must be configured in infrastructure.

Live Trading Integration Partial

Live trading mode (TradingHandler wiring, ECS service definition, exchange credentials via Secrets Manager) is not fully wired. The training and monitoring pipeline operates against backtest results, not live prediction streams.

10. Changelog

Version	Date	Changes
2.0.0	2026-03-28	Complete rewrite: removed stale planning content, scaffold pseudo-code, and phase estimates; replaced with implementation reference based on actual source code
1.1.0	2026-03-28	Added TL;DR, Dependencies section; all 9 components marked implemented
1.0.0	2025-12-22	Initial ML Lifecycle architecture document

Dependencies

If This Changes	Update This Section
`libs/tradai-common/src/tradai/common/entrypoint/training/`	Section 2 (Training Pipeline)
`libs/tradai-strategy/src/tradai/strategy/freqai/`	Section 3 (Walk-Forward Training Model)
`libs/tradai-common/src/tradai/common/mlflow/`	Section 4 (Model Registry)
`libs/tradai-common/src/tradai/common/drift/`	Section 5 (Drift Detection)
`lambdas/retraining-scheduler/`	Section 6 (Retraining Pipeline)
`lambdas/compare-models/`, `lambdas/promote-model/`, `lambdas/model-rollback/`	Section 7 (Model Comparison and Promotion)