TradAI ML Strategy Lifecycle Architecture
Version: 2.0.0 | Date: 2026-03-28 | Status: IMPLEMENTED -- E2E Validation Pending
Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md
TL;DR: TradAI implements a complete ML strategy lifecycle: FreqAI walk-forward training with LightGBM/CatBoost/XGBoost, automated MLflow model registration with reproducibility manifests, and a full MLOps pipeline (drift detection via PSI/KS statistics, scheduled retraining, champion-vs-challenger comparison, and auto-rollback). All components (ML001-ML005, MO001-MO004) are built and unit-tested. End-to-end validation across the full pipeline remains pending.
1. Lifecycle Overview
flowchart LR
A["Feature<br/>Engineering"] --> B["Model<br/>Training"]
B --> C["MLflow<br/>Registry"]
C --> D["Backtest<br/>Validation"]
D --> E["Live/Paper<br/>Trading"]
E --> F["Drift<br/>Detection"]
F -->|"retrain<br/>trigger"| B
F -->|"rollback<br/>trigger"| C
style A fill:#1565c0,color:#fff
style B fill:#1565c0,color:#fff
style C fill:#2e7d32,color:#fff
style D fill:#2e7d32,color:#fff
style E fill:#f57f17,color:#000
style F fill:#d32f2f,color:#fff
The ML lifecycle has six stages:
| Stage | Description | Key Component |
|---|---|---|
| Feature Engineering | Strategy defines %-prefixed features and &-prefixed targets | FreqAI feature_engineering_expand_all() |
| Model Training | Walk-forward training via TrainHandler orchestrating Freqtrade | TrainHandler, TrainingPredictionModel |
| Model Registry | Auto-registration in MLflow with metrics, tags, reproducibility manifests | ModelRegistrar, MLflowReporter |
| Backtest Validation | Inference-only backtesting using registered models | TradAIPredictionModel |
| Drift Detection | PSI and KS-statistic monitoring on production predictions | DriftDetector, drift-monitor Lambda |
| Retraining / Rollback | Scheduled or drift-triggered retraining, automatic rollback | retraining-scheduler, model-rollback Lambdas |
2. Training Pipeline
Implemented: ML001-ML003
TrainHandler orchestrates the full training workflow: Freqtrade subprocess execution, result parsing, MLflow metric logging, model registration, and reproducibility manifest upload.
Architecture
TrainHandler delegates to four collaborators following single-responsibility:
| Collaborator | Responsibility | File |
|---|---|---|
| TrainingResultParser | Parse Freqtrade stdout/backtest JSON for metrics | entrypoint/training/result_parser.py |
| MLflowReporter | Log metrics, tags, and environment info to MLflow | entrypoint/training/mlflow_reporter.py |
| ModelRegistrar | Upload artifacts, create model versions, optionally promote | entrypoint/training/model_registrar.py |
| ReproducibilityManifestBuilder | Build JSON manifests with feature schemas, seeds, git info | entrypoint/training/manifest_builder.py |
Training Flow
- TrainHandler.run() sets global random seeds (if configured) for reproducibility
- Builds TrainingConfig from environment settings (strategy, FreqAI model, pairs, timeframe)
- Executes Freqtrade backtesting as a subprocess with the chosen FreqAI model (e.g., LightGBMRegressor)
- TrainingResultParser extracts validation metrics (Sharpe, profit, drawdown, trade count)
- MLflowReporter.log() creates an MLflow run with all metrics and tags
- ModelRegistrar.register() uploads model artifacts and creates a new model version
- Updates DynamoDB job status to COMPLETED
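The ordering of these steps can be illustrated with a minimal sketch. It is not the TrainHandler implementation: `TrainingRun`, `run_training`, and the callable parameters are hypothetical stand-ins for the collaborators listed above, and config building is folded into the backtest callable for brevity.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TrainingRun:
    """What a single run produced (hypothetical container, not TradAI's)."""

    metrics: Dict[str, float] = field(default_factory=dict)
    run_id: Optional[str] = None
    model_version: Optional[int] = None
    status: str = "RUNNING"
    steps: List[str] = field(default_factory=list)


def run_training(
    seed: Optional[int],
    execute_backtest: Callable[[], str],
    parse_metrics: Callable[[str], Dict[str, float]],
    log_to_mlflow: Callable[[Dict[str, float]], str],
    register_model: Callable[[str], int],
    update_status: Callable[[str], None],
) -> TrainingRun:
    """Run the stages in the documented order; each stage is injected."""
    run = TrainingRun()

    # Seed the global RNG only when a seed is configured.
    if seed is not None:
        random.seed(seed)
        run.steps.append("seed")

    # Stand-in for the Freqtrade subprocess; returns its raw output.
    raw_output = execute_backtest()
    run.steps.append("backtest")

    # Extract validation metrics from the raw output.
    run.metrics = parse_metrics(raw_output)
    run.steps.append("parse")

    # Log metrics and obtain a run identifier.
    run.run_id = log_to_mlflow(run.metrics)
    run.steps.append("log")

    # Register a new model version for that run.
    run.model_version = register_model(run.run_id)
    run.steps.append("register")

    # Mark the job as finished (DynamoDB in the real pipeline).
    run.status = "COMPLETED"
    update_status(run.status)
    run.steps.append("status")
    return run
```

Passing the stages in as callables keeps the sketch testable without Freqtrade, MLflow, or DynamoDB; the actual handler delegates to the collaborator classes in the table above.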
See libs/tradai-common/src/tradai/common/entrypoint/training/handler.py
Training Configuration
TrainingConfig is a frozen Pydantic model that validates all training parameters:
- strategy -- FreqAI-enabled strategy class name
- freqai_model -- ML model type, validated against FreqAIModelRegistry (LightGBM, CatBoost, XGBoost)
- train_period_days -- Training window (7-365 days)
- backtest_period_days -- Walk-forward validation period (1-30 days)
- n_estimators, learning_rate, max_depth -- Model hyperparameters
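The range checks above can be sketched as follows. This is an assumption-laden illustration: it uses a frozen stdlib dataclass as a stand-in for the Pydantic model so that it runs without dependencies, `SUPPORTED_MODELS` is an illustrative substitute for FreqAIModelRegistry, and the default values are invented for the example.

```python
from dataclasses import dataclass

# Illustrative stand-in for FreqAIModelRegistry; the real registry may differ.
SUPPORTED_MODELS = frozenset(
    {"LightGBMRegressor", "CatboostRegressor", "XGBoostRegressor"}
)


@dataclass(frozen=True)
class TrainingConfigSketch:
    """Immutable config with the documented bounds (hypothetical class)."""

    strategy: str
    freqai_model: str
    train_period_days: int = 30    # assumed default
    backtest_period_days: int = 7  # assumed default

    def __post_init__(self) -> None:
        if not self.strategy:
            raise ValueError("strategy must be a non-empty class name")
        if self.freqai_model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported FreqAI model: {self.freqai_model}")
        if not 7 <= self.train_period_days <= 365:
            raise ValueError("train_period_days must be within 7-365")
        if not 1 <= self.backtest_period_days <= 30:
            raise ValueError("backtest_period_days must be within 1-30")
```

Validating in the constructor and freezing the instance means an invalid or mutated configuration cannot reach the training subprocess, which is the property a frozen Pydantic model provides as well.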
See libs/tradai-common/src/tradai/common/entrypoint/training/config.py
3. Walk-Forward Training Model (FreqAI)
Implemented: ML005
TrainingPredictionModel extends FreqAI's BaseRegressionModel to support actual ML training during walk-forward backtesting with LightGBM, CatBoost, and XGBoost.
TradAI provides two custom FreqAI models:
| Model | Mode | Purpose |
|---|---|---|
| TrainingPredictionModel | Training | Walk-forward training with progressive model retraining |
| TradAIPredictionModel | Inference | Loads pre-trained models from CSV, S3, or MLflow (no training) |
TrainingPredictionModel implements the fit() method called by FreqAI for each training window. It extracts FitParams (train/test splits, sample weights), delegates to the configured ML framework, and logs per-window metrics.
TradAIPredictionModel uses a PredictionLoader protocol with two implementations: CSVLoader (local file) and MLflowLoader (model registry). The PredictionSource enum controls which loader is used.
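A loader protocol selected by an enum could look like the sketch below. The `load()` signature, the `date` and `prediction` column names, the `fetch` callable, and the `make_loader` factory are assumptions made for illustration; only the names PredictionLoader, CSVLoader, MLflowLoader, and PredictionSource come from the text above.

```python
import csv
from enum import Enum
from typing import Callable, Dict, Optional, Protocol, TextIO


class PredictionSource(Enum):
    CSV = "csv"
    MLFLOW = "mlflow"


class PredictionLoader(Protocol):
    """Anything with a load() returning predictions keyed by timestamp."""

    def load(self) -> Dict[str, float]: ...


class CSVLoader:
    """Reads predictions from a CSV stream (column names are assumed)."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def load(self) -> Dict[str, float]:
        reader = csv.DictReader(self._stream)
        return {row["date"]: float(row["prediction"]) for row in reader}


class MLflowLoader:
    """Delegates to an injected fetch function instead of a real registry."""

    def __init__(self, fetch: Callable[[], Dict[str, float]]) -> None:
        self._fetch = fetch

    def load(self) -> Dict[str, float]:
        return dict(self._fetch())


def make_loader(
    source: PredictionSource,
    stream: Optional[TextIO] = None,
    fetch: Optional[Callable[[], Dict[str, float]]] = None,
) -> PredictionLoader:
    """Pick the loader implementation that matches the configured source."""
    if source is PredictionSource.CSV:
        if stream is None:
            raise ValueError("CSV source requires a stream")
        return CSVLoader(stream)
    if fetch is None:
        raise ValueError("MLflow source requires a fetch callable")
    return MLflowLoader(fetch)
```

Because the protocol is structural, the inference model only depends on `load()`, so a further source could be added without changing the model itself.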
Key files:
- libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py
- libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
- libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py
4. Model Registry (MLflow)
Implemented: ML004 (Hyperparameter Optimization) + MLflowAdapter with full CRUD
ML004: Optuna integration for hyperparameter search (cli/src/tradai/cli/optuna_commands.py, services/strategy-service/src/tradai/strategy_service/core/optuna_handler.py).
The MLflowAdapter composes four mixins (MLflowClientMixin, ExperimentsMixin, RegistryMixin, ArtifactsMixin) providing experiment tracking, model registration, stage transitions, and artifact management. Compatible with MLflow 3.x.
Registry Lifecycle
Training Complete
--> ModelRegistrar uploads artifacts (model files, feature importance, manifests)
--> Creates new model version in MLflow (stage: None)
--> Optionally promotes to Staging if metrics exceed thresholds
--> compare-models Lambda compares champion vs challenger
--> promote-model Lambda transitions winner to Production
Stage Transitions
| Stage | Meaning | Trigger |
|---|---|---|
| None | Newly registered, not yet evaluated | Auto after training |
| Staging | Passed initial thresholds, under evaluation | Auto-promotion in ModelRegistrar |
| Production | Active model for live inference | promote-model Lambda |
| Archived | Previous version, retained for rollback | On promotion of replacement |
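The stage semantics in the table can be sketched with an in-memory registry. `InMemoryRegistry` is a hypothetical teaching aid, not the MLflowAdapter; it only models the rule that promoting a version to Production archives the version it replaces.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class InMemoryRegistry:
    """Tracks one model's versions and their stages (hypothetical)."""

    stages: Dict[int, str] = field(default_factory=dict)

    def register(self) -> int:
        """New versions start in stage 'None', as after training."""
        version = len(self.stages) + 1
        self.stages[version] = "None"
        return version

    def transition(self, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if version not in self.stages:
            raise KeyError(f"unknown version: {version}")
        if stage == "Production":
            # Archive whichever other version currently holds Production.
            for other, current in self.stages.items():
                if current == "Production" and other != version:
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def production_version(self) -> Optional[int]:
        for version, stage in self.stages.items():
            if stage == "Production":
                return version
        return None
```

Keeping the replaced version in Archived rather than deleting it is what leaves a rollback target available later.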
Key Operations (RegistryMixin)
- get_model_version(name, version) -- Fetch version details
- create_model_version(name, source, run_id, tags) -- Register new version
- transition_model_version_stage(name, version, stage) -- Stage promotion/demotion
- get_registered_model(name) -- Get model with all versions
- search_registered_models(filter_string) -- Query models
See libs/tradai-common/src/tradai/common/mlflow/adapter.py and libs/tradai-common/src/tradai/common/mlflow/registry.py
5. Drift Detection
Implemented: MO002
DriftDetector calculates Population Stability Index (PSI) and Kolmogorov-Smirnov statistics for both predictions and individual features. The drift-monitor Lambda runs on an EventBridge schedule, fetching metrics from MLflow and publishing CloudWatch metrics.
DriftDetector
The DriftDetector class provides drift analysis with configurable thresholds:
| Metric | Method | Interpretation |
|---|---|---|
| PSI (Population Stability Index) | Quantile-based binning (10 bins) | < 0.10 none, 0.10-0.25 moderate, >= 0.25 significant |
| KS Statistic | Two-sample Kolmogorov-Smirnov test | p-value < 0.05 indicates distribution shift |
| Accuracy Deviation | Sign-match accuracy comparison | > 10% deviation triggers alert |
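For reference, PSI with quantile-based binning is PSI = sum over bins of (c_i - r_i) * ln(c_i / r_i), where r_i and c_i are the fractions of reference and current samples in bin i. The following is a minimal pure-Python sketch of that formula using the thresholds from the table; the function names and the `eps` floor for empty bins are assumptions, and DriftDetector may bin or smooth differently.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def quantile_edges(reference: Sequence[float], bins: int = 10) -> List[float]:
    """Interior bin edges taken at the reference sample's quantiles."""
    ordered = sorted(reference)
    n = len(ordered)
    edges = {ordered[min(n - 1, (i * n) // bins)] for i in range(1, bins)}
    return sorted(edges)  # duplicates collapse when values repeat


def bin_fractions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Share of values per bin, floored at eps so the log stays finite."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [max(count / len(values), eps) for count in counts]


def psi(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Population Stability Index of current against reference."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    edges = quantile_edges(reference, bins)
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def classify(psi_value: float) -> str:
    """Map a PSI value onto the severity bands from the table."""
    if psi_value < 0.10:
        return "none"
    if psi_value < 0.25:
        return "moderate"
    return "significant"
```

Taking the bin edges from the reference sample means each reference bin holds roughly 10% of the data, so a shift in the current sample shows up as uneven bin fractions.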
The analysis produces a DriftResult containing:
- Overall PSI and severity classification (DriftSeverity enum)
- Per-feature drift metrics (FeatureDrift with PSI, KS stat, mean/std shifts)
- Actionable recommendations (auto-generated)
- Alert message (when requires_attention=True)
- CloudWatch-compatible metric output (to_cloudwatch_metrics())
Key files:
- libs/tradai-common/src/tradai/common/drift/detector.py
- libs/tradai-common/src/tradai/common/drift/entities.py
drift-monitor Lambda
The Lambda runs on a schedule and:
- Fetches backtest metrics from MLflow experiments for each monitored model
- Compares current period against reference (training) period
- Calculates PSI and KS statistics for key metrics (profit_total, win_rate, sharpe_ratio, max_drawdown, trades_count)
- Publishes CloudWatch metrics under the DriftMonitoring namespace
- Sends SNS alerts when drift exceeds thresholds
- Persists drift state to DynamoDB via DynamoDBStateRepository
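The KS comparison of a current period against a reference period can be sketched in pure Python. This is an illustration, not the Lambda's code: the function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction, whereas a production implementation would more likely call a statistics library.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(reference), sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_p_value(d: float, n: int, m: int, terms: int = 100) -> float:
    """Asymptotic two-sided p-value for statistic d with sample sizes n, m."""
    root = math.sqrt(n * m / (n + m))
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is ~1 anyway.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def distribution_shifted(
    reference: Sequence[float], current: Sequence[float], alpha: float = 0.05
) -> bool:
    """True when the p-value falls below alpha (0.05 per the table above)."""
    d = ks_statistic(reference, current)
    return ks_p_value(d, len(reference), len(current)) < alpha
```

In the Lambda's setting, `reference` and `current` would hold values of one metric (for example sharpe_ratio) from the training period and the current period respectively.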
See lambdas/drift-monitor/handler.py
6. Retraining Pipeline
Implemented: MO001
The retraining-scheduler Lambda triggers ECS Fargate training tasks based on scheduled intervals, drift detection results, or manual requests.
retraining-scheduler Lambda
The Lambda supports three trigger modes:
| Trigger | Source | Behavior |
|---|---|---|
| Scheduled | EventBridge cron | Checks each model's last retraining timestamp against configured interval |
| Drift-triggered | Drift state in DynamoDB | Launches retraining when drift is detected |
| Manual | Direct invocation with "trigger": "manual" | Force-retrain specified models |
Each retraining request launches an ECS Fargate task with TRADING_MODE=train and model-specific configuration (strategy, FreqAI model, pairs, training period).
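The three trigger modes and the task environment can be sketched as below. The precedence order (manual, then drift, then schedule) and every environment variable name other than TRADING_MODE are assumptions for illustration; the list of name/value dicts follows the shape ECS expects for container environment overrides.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Sequence


def retraining_reason(
    trigger: str,
    last_retrained: Optional[datetime],
    interval: timedelta,
    drift_detected: bool,
    now: datetime,
) -> Optional[str]:
    """Return why a model should retrain, or None if it should not."""
    if trigger == "manual":
        return "manual"  # forced regardless of schedule or drift
    if drift_detected:
        return "drift"
    if last_retrained is None or now - last_retrained >= interval:
        return "scheduled"
    return None


def container_environment(
    strategy: str,
    freqai_model: str,
    pairs: Sequence[str],
    train_period_days: int,
) -> List[Dict[str, str]]:
    """Environment entries for the training task (names mostly assumed)."""
    values = {
        "TRADING_MODE": "train",  # from the text above
        "STRATEGY": strategy,  # hypothetical name
        "FREQAI_MODEL": freqai_model,  # hypothetical name
        "PAIRS": ",".join(pairs),  # hypothetical name
        "TRAIN_PERIOD_DAYS": str(train_period_days),  # hypothetical name
    }
    return [{"name": key, "value": value} for key, value in values.items()]
```

Passing `now` in explicitly keeps the interval check deterministic under test; a handler would typically supply `datetime.now(timezone.utc)`.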
See lambdas/retraining-scheduler/handler.py
7. Model Comparison and Promotion
Implemented: MO003-MO004
ModelComparator provides champion-vs-challenger comparison. The compare-models Lambda produces promotion decisions, the promote-model Lambda executes stage transitions, and the model-rollback Lambda handles automatic rollback with cooldown protection.
ModelComparator
The ModelComparator class fetches backtest metrics for both champion and challenger from MLflow, compares key performance indicators, and returns a PromotionDecision:
| Decision | Meaning |
|---|---|
| PROMOTE | Challenger outperforms champion with sufficient confidence |
| KEEP | Champion is better or difference is negligible |
| INCONCLUSIVE | Metrics are too close to call |
| NEEDS_MORE_DATA | Insufficient samples for reliable comparison |
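One way the four outcomes could be separated is sketched below. The thresholds (`min_trades`, `noise_band`, `min_improvement`), the use of total profit as the single compared metric, and the `decide` function are all assumptions; the actual ModelComparator may weigh several indicators and compute a confidence score.

```python
from enum import Enum


class PromotionDecision(Enum):
    PROMOTE = "PROMOTE"
    KEEP = "KEEP"
    INCONCLUSIVE = "INCONCLUSIVE"
    NEEDS_MORE_DATA = "NEEDS_MORE_DATA"


def decide(
    champion_profit: float,
    challenger_profit: float,
    champion_trades: int,
    challenger_trades: int,
    min_trades: int = 30,  # assumed sample-size floor
    noise_band: float = 0.01,  # assumed "too close to call" width
    min_improvement: float = 0.05,  # assumed promotion margin
) -> PromotionDecision:
    """Classify a champion-vs-challenger comparison on total profit."""
    # Too few trades on either side makes the comparison unreliable.
    if min(champion_trades, challenger_trades) < min_trades:
        return PromotionDecision.NEEDS_MORE_DATA
    difference = challenger_profit - champion_profit
    # Differences inside the noise band cannot be attributed to the model.
    if abs(difference) < noise_band:
        return PromotionDecision.INCONCLUSIVE
    # Promote only when the challenger clears the improvement margin.
    if difference >= min_improvement:
        return PromotionDecision.PROMOTE
    # Champion is better, or the challenger's edge is too small to act on.
    return PromotionDecision.KEEP
```

Checking sample size first avoids promoting a challenger whose apparent edge comes from a handful of trades.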
See libs/tradai-common/src/tradai/common/model_comparison/comparator.py
compare-models Lambda
Invoked by Step Functions after training completes. Receives model_name and challenger_run_id, returns a decision with confidence score and profit improvement metrics.
See lambdas/compare-models/handler.py
promote-model Lambda
Executes the promotion: transitions the challenger to Production and archives the previous champion. Promotion proceeds only when the decision meets a minimum confidence threshold.
See lambdas/promote-model/handler.py
model-rollback Lambda
Rolls back to a previous model version when:
- CloudWatch alarms trigger (drift or performance degradation)
- Manual rollback is requested
Includes a configurable cooldown period (ROLLBACK_COOLDOWN_HOURS, default 24h) to prevent rapid rollback oscillation. Persists rollback state to DynamoDB.
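The cooldown check can be sketched as follows. Only the ROLLBACK_COOLDOWN_HOURS variable and its 24-hour default come from the text above; the function names and the idea of passing the last rollback timestamp in directly (rather than reading it from DynamoDB) are simplifications for illustration.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def cooldown_hours(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the cooldown from the environment, defaulting to 24 hours."""
    source = os.environ if env is None else env
    return float(source.get("ROLLBACK_COOLDOWN_HOURS", "24"))


def rollback_allowed(
    last_rollback: Optional[datetime], now: datetime, hours: float
) -> bool:
    """Allow a rollback only once the cooldown window has elapsed."""
    if last_rollback is None:
        return True  # no previous rollback recorded
    return now - last_rollback >= timedelta(hours=hours)
```

Without such a guard, two versions that each trip an alarm could cause the Lambda to flip between them on every invocation.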
See lambdas/model-rollback/handler.py
8. Key Source Files
| Component | File Path |
|---|---|
| TrainHandler | libs/tradai-common/src/tradai/common/entrypoint/training/handler.py |
| TrainingConfig / TrainingResult | libs/tradai-common/src/tradai/common/entrypoint/training/config.py |
| TrainingResultParser | libs/tradai-common/src/tradai/common/entrypoint/training/result_parser.py |
| MLflowReporter | libs/tradai-common/src/tradai/common/entrypoint/training/mlflow_reporter.py |
| ModelRegistrar | libs/tradai-common/src/tradai/common/entrypoint/training/model_registrar.py |
| ReproducibilityManifestBuilder | libs/tradai-common/src/tradai/common/entrypoint/training/manifest_builder.py |
| MLflowAdapter | libs/tradai-common/src/tradai/common/mlflow/adapter.py |
| RegistryMixin | libs/tradai-common/src/tradai/common/mlflow/registry.py |
| ExperimentsMixin | libs/tradai-common/src/tradai/common/mlflow/experiments.py |
| ArtifactsMixin | libs/tradai-common/src/tradai/common/mlflow/artifacts.py |
| DriftDetector | libs/tradai-common/src/tradai/common/drift/detector.py |
| DriftResult / DriftThresholds | libs/tradai-common/src/tradai/common/drift/entities.py |
| ModelComparator | libs/tradai-common/src/tradai/common/model_comparison/comparator.py |
| ComparisonResult / PromotionDecision | libs/tradai-common/src/tradai/common/model_comparison/entities.py |
| TrainingPredictionModel | libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py |
| TradAIPredictionModel | libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py |
| FreqAIConfigBuilder | libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py |
| drift-monitor Lambda | lambdas/drift-monitor/handler.py |
| retraining-scheduler Lambda | lambdas/retraining-scheduler/handler.py |
| compare-models Lambda | lambdas/compare-models/handler.py |
| promote-model Lambda | lambdas/promote-model/handler.py |
| model-rollback Lambda | lambdas/model-rollback/handler.py |
9. Known Limitations
E2E Validation Pending
All components are implemented and unit-tested individually. The full pipeline (training -> registration -> drift detection -> retraining -> promotion) has not been validated end-to-end in a staging environment.
No Automated Drift-to-Retraining Trigger
The drift-monitor Lambda publishes drift state to DynamoDB and CloudWatch metrics, and the retraining-scheduler Lambda can read drift state. However, there is no EventBridge rule or Step Functions workflow that automatically chains drift detection into retraining. This connection must be configured in infrastructure.
Live Trading Integration Partial
Live trading mode (TradingHandler wiring, ECS service definition, exchange credentials via Secrets Manager) is not fully wired. The training and monitoring pipeline operates against backtest results, not live prediction streams.
10. Changelog
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2026-03-28 | Complete rewrite: removed stale planning content, scaffold pseudo-code, and phase estimates; replaced with implementation reference based on actual source code |
| 1.1.0 | 2026-03-28 | Added TL;DR, Dependencies section; all 9 components marked implemented |
| 1.0.0 | 2025-12-22 | Initial ML Lifecycle architecture document |
Dependencies
| If This Changes | Update This Section |
|---|---|
| libs/tradai-common/src/tradai/common/entrypoint/training/ | Section 2 (Training Pipeline) |
| libs/tradai-strategy/src/tradai/strategy/freqai/ | Section 3 (Walk-Forward Training Model) |
| libs/tradai-common/src/tradai/common/mlflow/ | Section 4 (Model Registry) |
| libs/tradai-common/src/tradai/common/drift/ | Section 5 (Drift Detection) |
| lambdas/retraining-scheduler/ | Section 6 (Retraining Pipeline) |
| lambdas/compare-models/, promote-model/, model-rollback/ | Section 7 (Model Comparison and Promotion) |