TradAI ML Strategy Lifecycle Architecture

Version: 2.0.0 | Date: 2026-03-28 | Status: IMPLEMENTED -- E2E Validation Pending

Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md

TL;DR: TradAI implements a complete ML strategy lifecycle: FreqAI walk-forward training with LightGBM/CatBoost/XGBoost, automated MLflow model registration with reproducibility manifests, and a full MLOps pipeline (drift detection via PSI/KS statistics, scheduled retraining, champion-vs-challenger comparison, and auto-rollback). All components (ML001-ML005, MO001-MO004) are built and unit-tested. End-to-end validation across the full pipeline remains pending.


1. Lifecycle Overview

flowchart LR
    A["Feature<br/>Engineering"] --> B["Model<br/>Training"]
    B --> C["MLflow<br/>Registry"]
    C --> D["Backtest<br/>Validation"]
    D --> E["Live/Paper<br/>Trading"]
    E --> F["Drift<br/>Detection"]
    F -->|"retrain<br/>trigger"| B
    F -->|"rollback<br/>trigger"| C

    style A fill:#1565c0,color:#fff
    style B fill:#1565c0,color:#fff
    style C fill:#2e7d32,color:#fff
    style D fill:#2e7d32,color:#fff
    style E fill:#f57f17,color:#000
    style F fill:#d32f2f,color:#fff

The ML lifecycle has six stages:

| Stage | Description | Key Component |
|---|---|---|
| Feature Engineering | Strategy defines %-prefixed features and &-prefixed targets | FreqAI feature_engineering_expand_all() |
| Model Training | Walk-forward training via TrainHandler orchestrating Freqtrade | TrainHandler, TrainingPredictionModel |
| Model Registry | Auto-registration in MLflow with metrics, tags, reproducibility manifests | ModelRegistrar, MLflowReporter |
| Backtest Validation | Inference-only backtesting using registered models | TradAIPredictionModel |
| Drift Detection | PSI and KS-statistic monitoring of model outputs (currently computed from backtest metrics; see Section 9) | DriftDetector, drift-monitor Lambda |
| Retraining / Rollback | Scheduled or drift-triggered retraining, automatic rollback | retraining-scheduler, model-rollback Lambdas |

2. Training Pipeline

Implemented: ML001-ML003

TrainHandler orchestrates the full training workflow: Freqtrade subprocess execution, result parsing, MLflow metric logging, model registration, and reproducibility manifest upload.

Architecture

TrainHandler delegates to four collaborators following single-responsibility:

| Collaborator | Responsibility | File |
|---|---|---|
| TrainingResultParser | Parse Freqtrade stdout/backtest JSON for metrics | entrypoint/training/result_parser.py |
| MLflowReporter | Log metrics, tags, and environment info to MLflow | entrypoint/training/mlflow_reporter.py |
| ModelRegistrar | Upload artifacts, create model versions, optionally promote | entrypoint/training/model_registrar.py |
| ReproducibilityManifestBuilder | Build JSON manifests with feature schemas, seeds, git info | entrypoint/training/manifest_builder.py |

Training Flow

  1. TrainHandler.run() sets global random seeds (if configured) for reproducibility
  2. Builds TrainingConfig from environment settings (strategy, FreqAI model, pairs, timeframe)
  3. Executes Freqtrade backtesting as a subprocess with the chosen FreqAI model (e.g., LightGBMRegressor)
  4. TrainingResultParser extracts validation metrics (Sharpe, profit, drawdown, trade count)
  5. MLflowReporter.log() creates an MLflow run with all metrics and tags
  6. ModelRegistrar.register() uploads model artifacts and creates a new model version
  7. Updates DynamoDB job status to COMPLETED

See libs/tradai-common/src/tradai/common/entrypoint/training/handler.py
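
The training flow above can be sketched as a small orchestrator that calls its collaborators in order. This is a minimal illustration, not the contents of handler.py: TrainHandlerSketch and its callable fields are hypothetical stand-ins for the real collaborators, and step 2 (building TrainingConfig) is omitted for brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TrainHandlerSketch:
    """Hypothetical stand-in for TrainHandler; each collaborator is a plain callable."""

    run_backtest: Callable[[], str]                   # step 3: returns raw Freqtrade output
    parse_metrics: Callable[[str], Dict[str, float]]  # step 4: TrainingResultParser role
    log_run: Callable[[Dict[str, float]], str]        # step 5: MLflowReporter role, returns a run id
    register_model: Callable[[str], int]              # step 6: ModelRegistrar role, returns a version
    update_status: Callable[[str], None]              # step 7: job status writer
    seed: Optional[int] = None

    def run(self) -> Dict[str, object]:
        if self.seed is not None:
            random.seed(self.seed)  # step 1; a real handler may seed other libraries too
        raw_output = self.run_backtest()
        metrics = self.parse_metrics(raw_output)
        run_id = self.log_run(metrics)
        version = self.register_model(run_id)
        self.update_status("COMPLETED")
        return {"run_id": run_id, "version": version, "metrics": metrics}


# Usage with fake collaborators (all values are made up for illustration).
statuses = []
handler = TrainHandlerSketch(
    run_backtest=lambda: "sharpe=1.4",
    parse_metrics=lambda raw: {"sharpe": float(raw.split("=")[1])},
    log_run=lambda metrics: "run-001",
    register_model=lambda run_id: 1,
    update_status=statuses.append,
    seed=42,
)
result = handler.run()
```

Injecting the collaborators as callables keeps each step replaceable in unit tests, which matches the single-responsibility split described under Architecture.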

Training Configuration

TrainingConfig is a frozen Pydantic model that validates all training parameters:

  • strategy -- FreqAI-enabled strategy class name
  • freqai_model -- ML model type, validated against FreqAIModelRegistry (LightGBM, CatBoost, XGBoost)
  • train_period_days -- Training window (7-365 days)
  • backtest_period_days -- Walk-forward validation period (1-30 days)
  • n_estimators, learning_rate, max_depth -- Model hyperparameters

See libs/tradai-common/src/tradai/common/entrypoint/training/config.py
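
The range checks listed above can be illustrated with an immutable config object. This sketch uses a stdlib frozen dataclass as a dependency-free stand-in for the frozen Pydantic model; the class name, the default values, and the contents of SUPPORTED_MODELS are assumptions, not taken from config.py.

```python
from dataclasses import dataclass

# Assumed registry contents; the real FreqAIModelRegistry may list different names.
SUPPORTED_MODELS = ("LightGBMRegressor", "CatboostRegressor", "XGBoostRegressor")


@dataclass(frozen=True)
class TrainingConfigSketch:
    """Hypothetical stand-in for TrainingConfig with the documented bounds."""

    strategy: str
    freqai_model: str
    train_period_days: int = 30     # assumed default; documented range is 7-365
    backtest_period_days: int = 7   # assumed default; documented range is 1-30

    def __post_init__(self) -> None:
        if self.freqai_model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported FreqAI model: {self.freqai_model}")
        if not 7 <= self.train_period_days <= 365:
            raise ValueError("train_period_days must be between 7 and 365")
        if not 1 <= self.backtest_period_days <= 30:
            raise ValueError("backtest_period_days must be between 1 and 30")


config = TrainingConfigSketch(strategy="MyFreqAIStrategy", freqai_model="LightGBMRegressor")
```

Because the object is frozen, a validated configuration cannot be mutated later in the run, so the values logged to MLflow are the values that were actually used.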


3. Walk-Forward Training Model (FreqAI)

Implemented: ML005

TrainingPredictionModel extends FreqAI's BaseRegressionModel to support actual ML training during walk-forward backtesting with LightGBM, CatBoost, and XGBoost.

TradAI provides two custom FreqAI models:

| Model | Mode | Purpose |
|---|---|---|
| TrainingPredictionModel | Training | Walk-forward training with progressive model retraining |
| TradAIPredictionModel | Inference | Loads pre-trained models from CSV, S3, or MLflow (no training) |

TrainingPredictionModel implements the fit() method called by FreqAI for each training window. It extracts FitParams (train/test splits, sample weights), delegates to the configured ML framework, and logs per-window metrics.

TradAIPredictionModel uses a PredictionLoader protocol with two implementations: CSVLoader (local file) and MLflowLoader (model registry). The PredictionSource enum controls which loader is used.
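
A loader protocol selected by an enum might look roughly like the following. Only the names PredictionLoader, CSVLoader, MLflowLoader, and PredictionSource come from this document; the load() signature, the enum values, and the make_loader helper are assumptions made for illustration, and the MLflow loader is deliberately left unimplemented.

```python
import csv
import io
from enum import Enum
from typing import Dict, List, Protocol, TextIO


class PredictionSource(Enum):
    CSV = "csv"        # enum values are assumed
    MLFLOW = "mlflow"


class PredictionLoader(Protocol):
    """Anything with a load() method returning prediction rows."""

    def load(self) -> List[Dict[str, str]]: ...


class CSVLoader:
    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def load(self) -> List[Dict[str, str]]:
        # One dict per CSV row, keyed by the header line.
        return list(csv.DictReader(self._stream))


class MLflowLoader:
    def __init__(self, model_uri: str) -> None:
        self.model_uri = model_uri

    def load(self) -> List[Dict[str, str]]:
        raise NotImplementedError("registry access is out of scope for this sketch")


def make_loader(source: PredictionSource, **kwargs) -> PredictionLoader:
    """Hypothetical factory: pick the loader that matches the configured source."""
    if source is PredictionSource.CSV:
        return CSVLoader(kwargs["stream"])
    if source is PredictionSource.MLFLOW:
        return MLflowLoader(kwargs["model_uri"])
    raise ValueError(f"unknown source: {source}")


rows = make_loader(
    PredictionSource.CSV,
    stream=io.StringIO("date,prediction\n2026-01-01,0.5\n"),
).load()
```

Using a structural Protocol means a new source only needs a class with a compatible load() method; it does not have to inherit from a shared base class.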

Key files:

  • libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py
  • libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
  • libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

4. Model Registry (MLflow)

Implemented: ML004 (Hyperparameter Optimization) + MLflowAdapter with full CRUD

ML004: Optuna integration for hyperparameter search (cli/src/tradai/cli/optuna_commands.py, services/strategy-service/src/tradai/strategy_service/core/optuna_handler.py).

The MLflowAdapter composes four mixins (MLflowClientMixin, ExperimentsMixin, RegistryMixin, ArtifactsMixin) providing experiment tracking, model registration, stage transitions, and artifact management. Compatible with MLflow 3.x.

Registry Lifecycle

Training Complete
    --> ModelRegistrar uploads artifacts (model files, feature importance, manifests)
    --> Creates new model version in MLflow (stage: None)
    --> Optionally promotes to Staging if metrics exceed thresholds
    --> compare-models Lambda compares champion vs challenger
    --> promote-model Lambda transitions winner to Production

Stage Transitions

| Stage | Meaning | Trigger |
|---|---|---|
| None | Newly registered, not yet evaluated | Auto after training |
| Staging | Passed initial thresholds, under evaluation | Auto-promotion in ModelRegistrar |
| Production | Active model for live inference | promote-model Lambda |
| Archived | Previous version, retained for rollback | On promotion of replacement |
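
The stage semantics in the table can be modeled without MLflow as a pure function over a version-to-stage mapping. This is only an in-memory sketch of the documented behavior (promoting a version to Production archives the previous Production version); the transition function is hypothetical and does not call the registry. Recent MLflow releases mark stage-based APIs as deprecated in favor of aliases, so this mirrors the document's stage model rather than a recommended MLflow pattern.

```python
from typing import Dict

STAGES = ("None", "Staging", "Production", "Archived")


def transition(stages: Dict[int, str], version: int, new_stage: str) -> Dict[int, str]:
    """Return a new mapping with `version` moved to `new_stage`.

    Promoting to Production archives whichever other version held that stage.
    """
    if new_stage not in STAGES:
        raise ValueError(f"unknown stage: {new_stage}")
    if version not in stages:
        raise KeyError(version)
    updated = dict(stages)
    if new_stage == "Production":
        for other, stage in stages.items():
            if stage == "Production" and other != version:
                updated[other] = "Archived"
    updated[version] = new_stage
    return updated


registry = {1: "Production", 2: "Staging"}
after_promotion = transition(registry, 2, "Production")
```

Returning a new mapping instead of mutating the input keeps the previous state available, which is convenient when a rollback needs to know which version was Production before.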

Key Operations (RegistryMixin)

  • get_model_version(name, version) -- Fetch version details
  • create_model_version(name, source, run_id, tags) -- Register new version
  • transition_model_version_stage(name, version, stage) -- Stage promotion/demotion
  • get_registered_model(name) -- Get model with all versions
  • search_registered_models(filter_string) -- Query models

See libs/tradai-common/src/tradai/common/mlflow/adapter.py and libs/tradai-common/src/tradai/common/mlflow/registry.py


5. Drift Detection

Implemented: MO002

DriftDetector calculates Population Stability Index (PSI) and Kolmogorov-Smirnov statistics for both predictions and individual features. The drift-monitor Lambda runs on an EventBridge schedule, fetching metrics from MLflow and publishing CloudWatch metrics.

DriftDetector

The DriftDetector class provides drift analysis with configurable thresholds:

| Metric | Method | Interpretation |
|---|---|---|
| PSI (Population Stability Index) | Quantile-based binning (10 bins) | < 0.10 none, 0.10-0.25 moderate, >= 0.25 significant |
| KS Statistic | Two-sample Kolmogorov-Smirnov test | p-value < 0.05 indicates distribution shift |
| Accuracy Deviation | Sign-match accuracy comparison | > 10% deviation triggers alert |
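
To make the PSI and KS rows concrete, here is a dependency-free sketch of both statistics. It assumes quantile bin edges taken from the reference sample and a small epsilon to avoid log(0); the function names and the epsilon value are illustrative and are not taken from detector.py. The KS function returns only the D statistic, not the p-value the table refers to.

```python
import math
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum((actual - expected) * ln(actual / expected))."""
    edges = quantiles(reference, n=bins)  # bins - 1 cut points from the reference sample

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            counts[bisect_right(edges, value)] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    expected = proportions(reference)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def classify(psi_value: float) -> str:
    """Map a PSI value to the severity bands from the table."""
    if psi_value < 0.10:
        return "none"
    if psi_value < 0.25:
        return "moderate"
    return "significant"


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted) - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


reference = [float(x) for x in range(100)]
shifted = [x + 50.0 for x in reference]
severity = classify(psi(reference, shifted))
```

Deriving the bin edges from the reference sample gives every reference bin roughly equal mass, so the PSI reflects how far the current sample has moved away from that baseline.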

The analysis produces a DriftResult containing:

  • Overall PSI and severity classification (DriftSeverity enum)
  • Per-feature drift metrics (FeatureDrift with PSI, KS stat, mean/std shifts)
  • Actionable recommendations (auto-generated)
  • Alert message (when requires_attention=True)
  • CloudWatch-compatible metric output (to_cloudwatch_metrics())

Key files:

  • libs/tradai-common/src/tradai/common/drift/detector.py
  • libs/tradai-common/src/tradai/common/drift/entities.py

drift-monitor Lambda

The Lambda runs on a schedule and:

  1. Fetches backtest metrics from MLflow experiments for each monitored model
  2. Compares current period against reference (training) period
  3. Calculates PSI and KS statistics for key metrics (profit_total, win_rate, sharpe_ratio, max_drawdown, trades_count)
  4. Publishes CloudWatch metrics under the DriftMonitoring namespace
  5. Sends SNS alerts when drift exceeds thresholds
  6. Persists drift state to DynamoDB via DynamoDBStateRepository

See lambdas/drift-monitor/handler.py
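
Step 4 can be illustrated by building the metric entries the Lambda would publish. The entry shape below follows the MetricData structure accepted by the CloudWatch PutMetricData API; the metric naming scheme (PSI_<metric>) and the ModelName dimension are assumptions, and nothing is sent to AWS in this sketch.

```python
from typing import Dict, List

NAMESPACE = "DriftMonitoring"  # namespace named in this document


def to_metric_data(model_name: str, psi_by_metric: Dict[str, float]) -> List[dict]:
    """Build CloudWatch MetricData entries, one per monitored metric.

    Metric and dimension names here are hypothetical.
    """
    return [
        {
            "MetricName": f"PSI_{metric}",
            "Dimensions": [{"Name": "ModelName", "Value": model_name}],
            "Value": float(value),
            "Unit": "None",
        }
        for metric, value in sorted(psi_by_metric.items())
    ]


metric_data = to_metric_data("example-model", {"sharpe_ratio": 0.31, "profit_total": 0.05})
# A handler could then publish with boto3, for example:
#   boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
```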


6. Retraining Pipeline

Implemented: MO001

The retraining-scheduler Lambda triggers ECS Fargate training tasks based on scheduled intervals, drift detection results, or manual requests.

retraining-scheduler Lambda

The Lambda supports three trigger modes:

| Trigger | Source | Behavior |
|---|---|---|
| Scheduled | EventBridge cron | Checks each model's last retraining timestamp against configured interval |
| Drift-triggered | Drift state in DynamoDB | Launches retraining when drift is detected |
| Manual | Direct invocation with "trigger": "manual" | Force-retrain specified models |

Each retraining request launches an ECS Fargate task with TRADING_MODE=train and model-specific configuration (strategy, FreqAI model, pairs, training period).

See lambdas/retraining-scheduler/handler.py
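
The three trigger modes reduce to a small decision function. In this sketch only the "manual" trigger value appears in this document; the "scheduled" and "drift" values, the function name, and its parameters are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_retrain(
    trigger: str,
    last_retrained: Optional[datetime],
    interval: timedelta,
    drift_detected: bool = False,
    now: Optional[datetime] = None,
) -> bool:
    """Decide whether to launch a retraining task for one model."""
    if trigger == "manual":
        return True  # forced retraining
    if trigger == "drift":
        return drift_detected  # driven by persisted drift state
    if trigger == "scheduled":
        if last_retrained is None:
            return True  # never trained on schedule before
        now = now or datetime.now(timezone.utc)
        return now - last_retrained >= interval
    raise ValueError(f"unknown trigger: {trigger}")


now = datetime(2026, 3, 28, tzinfo=timezone.utc)
is_due = should_retrain("scheduled", now - timedelta(days=8), timedelta(days=7), now=now)
```

Passing `now` explicitly keeps the function deterministic, which makes the schedule boundary easy to unit-test.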


7. Model Comparison and Promotion

Implemented: MO003-MO004

ModelComparator provides champion-vs-challenger comparison. The compare-models Lambda produces promotion decisions, the promote-model Lambda executes stage transitions, and the model-rollback Lambda handles automatic rollback with cooldown protection.

ModelComparator

The ModelComparator class fetches backtest metrics for both champion and challenger from MLflow, compares key performance indicators, and returns a PromotionDecision:

| Decision | Meaning |
|---|---|
| PROMOTE | Challenger outperforms champion with sufficient confidence |
| KEEP | Champion is better or difference is negligible |
| INCONCLUSIVE | Metrics are too close to call |
| NEEDS_MORE_DATA | Insufficient samples for reliable comparison |

See libs/tradai-common/src/tradai/common/model_comparison/comparator.py
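
One way the four decisions could be derived from a single metric is sketched below. The thresholds (min_trades, min_improvement, noise_band), the use of profit as the only metric, and the way KEEP and INCONCLUSIVE are separated are all assumptions; the actual rules live in comparator.py and may differ.

```python
from enum import Enum


class PromotionDecision(Enum):
    PROMOTE = "PROMOTE"
    KEEP = "KEEP"
    INCONCLUSIVE = "INCONCLUSIVE"
    NEEDS_MORE_DATA = "NEEDS_MORE_DATA"


def decide(
    champion_profit: float,
    challenger_profit: float,
    challenger_trades: int,
    min_trades: int = 30,            # assumed sample-size floor
    min_improvement: float = 0.05,   # assumed improvement needed to promote
    noise_band: float = 0.01,        # assumed band treated as "too close to call"
) -> PromotionDecision:
    """Hypothetical single-metric champion-vs-challenger rule."""
    if challenger_trades < min_trades:
        return PromotionDecision.NEEDS_MORE_DATA
    delta = challenger_profit - champion_profit
    if delta >= min_improvement:
        return PromotionDecision.PROMOTE
    if abs(delta) < noise_band:
        return PromotionDecision.INCONCLUSIVE
    return PromotionDecision.KEEP  # champion is better, or the gain is too small


decision = decide(champion_profit=0.10, challenger_profit=0.20, challenger_trades=100)
```

Checking the sample size first means a lucky challenger with very few trades can never be promoted, regardless of its headline profit.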

compare-models Lambda

Invoked by Step Functions after training completes. Receives model_name and challenger_run_id, returns a decision with confidence score and profit improvement metrics.

See lambdas/compare-models/handler.py

promote-model Lambda

Executes the promotion: transitions the challenger to Production, archives the previous champion. Requires a minimum confidence threshold.

See lambdas/promote-model/handler.py

model-rollback Lambda

Rolls back to a previous model version when:

  • CloudWatch alarms trigger (drift or performance degradation)
  • Manual rollback is requested

Includes a configurable cooldown period (ROLLBACK_COOLDOWN_HOURS, default 24h) to prevent rapid rollback oscillation. Persists rollback state to DynamoDB.

See lambdas/model-rollback/handler.py
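
The cooldown protection amounts to a timestamp comparison. In this sketch only the ROLLBACK_COOLDOWN_HOURS variable and its 24-hour default come from this document; the function names and the way the last rollback time is supplied are assumptions (the real Lambda reads that state from DynamoDB).

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def cooldown_hours() -> float:
    """Read the cooldown from the environment, defaulting to 24 hours."""
    return float(os.environ.get("ROLLBACK_COOLDOWN_HOURS", "24"))


def rollback_allowed(
    last_rollback: Optional[datetime],
    now: datetime,
    hours: Optional[float] = None,
) -> bool:
    """Allow a rollback only if the cooldown window has fully elapsed."""
    if last_rollback is None:
        return True  # no earlier rollback recorded
    window = timedelta(hours=cooldown_hours() if hours is None else hours)
    return now - last_rollback >= window


now = datetime(2026, 3, 28, 12, 0, tzinfo=timezone.utc)
blocked = not rollback_allowed(now - timedelta(hours=3), now, hours=24)
```

Without such a window, two versions that each trip an alarm could keep rolling back to one another; the cooldown forces a pause between automatic reversions.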


8. Key Source Files

| Component | File Path |
|---|---|
| TrainHandler | libs/tradai-common/src/tradai/common/entrypoint/training/handler.py |
| TrainingConfig / TrainingResult | libs/tradai-common/src/tradai/common/entrypoint/training/config.py |
| TrainingResultParser | libs/tradai-common/src/tradai/common/entrypoint/training/result_parser.py |
| MLflowReporter | libs/tradai-common/src/tradai/common/entrypoint/training/mlflow_reporter.py |
| ModelRegistrar | libs/tradai-common/src/tradai/common/entrypoint/training/model_registrar.py |
| ReproducibilityManifestBuilder | libs/tradai-common/src/tradai/common/entrypoint/training/manifest_builder.py |
| MLflowAdapter | libs/tradai-common/src/tradai/common/mlflow/adapter.py |
| RegistryMixin | libs/tradai-common/src/tradai/common/mlflow/registry.py |
| ExperimentsMixin | libs/tradai-common/src/tradai/common/mlflow/experiments.py |
| ArtifactsMixin | libs/tradai-common/src/tradai/common/mlflow/artifacts.py |
| DriftDetector | libs/tradai-common/src/tradai/common/drift/detector.py |
| DriftResult / DriftThresholds | libs/tradai-common/src/tradai/common/drift/entities.py |
| ModelComparator | libs/tradai-common/src/tradai/common/model_comparison/comparator.py |
| ComparisonResult / PromotionDecision | libs/tradai-common/src/tradai/common/model_comparison/entities.py |
| TrainingPredictionModel | libs/tradai-strategy/src/tradai/strategy/freqai/training_model.py |
| TradAIPredictionModel | libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py |
| FreqAIConfigBuilder | libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py |
| drift-monitor Lambda | lambdas/drift-monitor/handler.py |
| retraining-scheduler Lambda | lambdas/retraining-scheduler/handler.py |
| compare-models Lambda | lambdas/compare-models/handler.py |
| promote-model Lambda | lambdas/promote-model/handler.py |
| model-rollback Lambda | lambdas/model-rollback/handler.py |

9. Known Limitations

E2E Validation Pending

All components are implemented and unit-tested individually. The full pipeline (training -> registration -> drift detection -> retraining -> promotion) has not been validated end-to-end in a staging environment.

No Automated Drift-to-Retraining Trigger

The drift-monitor Lambda persists drift state to DynamoDB and publishes CloudWatch metrics, and the retraining-scheduler Lambda can read that drift state. However, no EventBridge rule or Step Functions workflow currently chains drift detection into retraining automatically; this connection must be configured in infrastructure.

Live Trading Integration Partial

Live trading mode (TradingHandler wiring, ECS service definition, exchange credentials via Secrets Manager) is not fully wired. The training and monitoring pipeline operates against backtest results, not live prediction streams.


10. Changelog

| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2026-03-28 | Complete rewrite: removed stale planning content, scaffold pseudo-code, and phase estimates; replaced with implementation reference based on actual source code |
| 1.1.0 | 2026-03-28 | Added TL;DR, Dependencies section; all 9 components marked implemented |
| 1.0.0 | 2025-12-22 | Initial ML Lifecycle architecture document |

Dependencies

| If This Changes | Update This Section |
|---|---|
| libs/tradai-common/src/tradai/common/entrypoint/training/ | Section 2 (Training Pipeline) |
| libs/tradai-strategy/src/tradai/strategy/freqai/ | Section 3 (Walk-Forward Training Model) |
| libs/tradai-common/src/tradai/common/mlflow/ | Section 4 (Model Registry) |
| libs/tradai-common/src/tradai/common/drift/ | Section 5 (Drift Detection) |
| lambdas/retraining-scheduler/ | Section 6 (Retraining Pipeline) |
| lambdas/compare-models/, promote-model/, model-rollback/ | Section 7 (Model Comparison and Promotion) |