TradAI ML Strategy Lifecycle Architecture

Version: 1.0.0
Date: 2025-12-22
Status: DRAFT - E2E VALIDATION REQUIRED
Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md

Status: Not Yet Implemented

This document describes the target architecture. Implementation is tracked in reports/implementation-tasks/.

Executive Summary

This document defines the complete lifecycle for ML-based trading strategies in TradAI, covering:

- Model Development - Feature engineering, training, hyperparameter tuning
- Experimentation - MLflow tracking, experiment comparison
- Model Registry - Versioning, staging, promotion
- Backtesting - Walk-forward validation, historical simulation
- Live/Paper Trading - Inference, monitoring
- Retraining - Scheduled retraining, drift detection, rollback

Current Implementation Status

| Phase | Status | Completion |
|---|---|---|
| Model Development | Manual | 0% automated |
| Experimentation (MLflow) | Implemented | 100% |
| Model Registry | Implemented | 100% |
| Backtesting with ML | Implemented | 100% |
| Live Inference | Designed | 60% |
| Retraining Pipeline | Code Complete | 80% (needs IaC) |
| Drift Detection | Code Complete | 80% (needs IaC) |

Overall ML Lifecycle: ~60% Complete

Note (2025-12-24): Retraining and drift detection Lambda code EXISTS in lambdas/retraining-scheduler/handler.py and lambdas/drift-monitor/handler.py. Remaining work is Pulumi IaC integration (EventBridge schedules, DynamoDB tables, IAM roles).

System Architecture: ML Strategy Flow

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    ML STRATEGY E2E LIFECYCLE                                                 │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 1: MODEL DEVELOPMENT                                                     │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  Feature     │───►│  FreqAI      │───►│  Model       │───►│  Local       │───►│  Manual      │     │ │
│  │  │  Engineering │    │  Training    │    │  Selection   │    │  Validation  │    │  Upload      │     │ │
│  │  │              │    │  Config      │    │              │    │              │    │  to MLflow   │     │ │
│  │  │  %-features  │    │  LightGBM    │    │  Hyperopt    │    │  Backtests   │    │              │     │ │
│  │  │  &-targets   │    │  CatBoost    │    │  Grid Search │    │              │    │              │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  Status: ⚠️ MANUAL - No automated training pipeline                                                    │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 2: EXPERIMENTATION & REGISTRY (✅ IMPLEMENTED)                          │ │
│  │                                                                                                         │ │
│  │  ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐  │ │
│  │  │                              MLflow Tracking Server                                               │  │ │
│  │  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │  │ │
│  │  │  │ Experiments     │  │ Runs            │  │ Artifacts       │  │ Model Registry  │              │  │ │
│  │  │  │                 │  │                 │  │                 │  │                 │              │  │ │
│  │  │  │ strategy-exp    │  │ run_id: abc123  │  │ model.pkl       │  │ RadStrategy     │              │  │ │
│  │  │  │ hyperopt-exp    │  │ metrics:        │  │ config.json     │  │ ├─ v1 (Staging) │              │  │ │
│  │  │  │ backtest-exp    │  │  - sharpe: 1.5  │  │ features.csv    │  │ └─ v2 (Prod)    │              │  │ │
│  │  │  │                 │  │  - profit: 0.15 │  │ backtest.json   │  │                 │              │  │ │
│  │  │  └─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────────┘              │  │ │
│  │  └──────────────────────────────────────────────────────────────────────────────────────────────────┘  │ │
│  │                                                                                                         │ │
│  │  Implemented: MLflowAdapter with full CRUD (CL007)                                                     │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 3: BACKTESTING WITH ML (✅ IMPLEMENTED)                                 │ │
│  │                                                                                                         │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  BacktestSvc │───►│  DataQuery   │───►│  Freqtrade   │───►│  MLflow      │───►│  S3          │     │ │
│  │  │              │    │  Service     │    │  Runner      │    │  Logging     │    │  Upload      │     │ │
│  │  │  config:     │    │              │    │              │    │              │    │              │     │ │
│  │  │  - strategy  │    │  OHLCV data  │    │  --freqai    │    │  metrics     │    │  results.json│     │ │
│  │  │  - freqai    │    │  → .feather  │    │  model=...   │    │  artifacts   │    │              │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  TradAIPredictionModel: Inference-only (loads from CSV/MLflow, does NOT train)                         │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 4: LIVE/PAPER TRADING (⚠️ PARTIAL - 60%)                                │ │
│  │                                                                                                         │ │
│  │  ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐  │ │
│  │  │                          Strategy Container (Live Mode)                                           │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  IMPLEMENTED (Building Blocks):                                                                   │  │ │
│  │  │  ✅ LF002: StrategyConfigLoader (load from MLflow tags + S3)                                      │  │ │
│  │  │  ✅ LF003: ArcticDBWarmupLoader (historical data warmup)                                          │  │ │
│  │  │  ✅ LF004: TradingStateRepository (DynamoDB state)                                                │  │ │
│  │  │  ✅ LF005: HealthReporter (container health)                                                      │  │ │
│  │  │  ✅ LM003: FreqAIConfigBuilder (inference config)                                                 │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  NOT IMPLEMENTED (Blockers):                                                                      │  │ │
│  │  │  ❌ LF006: ECS Service definition                                                                 │  │ │
│  │  │  ❌ LF007: Exchange credentials (Secrets Manager)                                                 │  │ │
│  │  │  ❌ LF008: TradingHandler wiring                                                                  │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  FLOW: StrategyConfigLoader → Load MLflow Model → FreqAI Inference (no training)                  │  │ │
│  │  └──────────────────────────────────────────────────────────────────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 5: RETRAINING & MONITORING (❌ NOT IMPLEMENTED)                         │ │
│  │                                                                                                         │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  Drift       │───►│  Retrain     │───►│  Model       │───►│  A/B         │───►│  Auto        │     │ │
│  │  │  Detection   │    │  Trigger     │    │  Training    │    │  Testing     │    │  Promotion   │     │ │
│  │  │              │    │              │    │              │    │              │    │              │     │ │
│  │  │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  Required Tasks: ML001-ML008 (see Implementation Plan below)                                           │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                                                                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Component Detail

1. FreqAI Integration Architecture

TradAI uses FreqAI (Freqtrade's ML framework) for model training and inference:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         FreqAI Architecture in TradAI                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Strategy Layer                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  RadStrategy (extends TradAIStrategy)                                   │ │
│  │  ├── feature_engineering_expand_all()  →  Define %-prefixed features   │ │
│  │  ├── set_freqai_targets()              →  Define &-prefixed targets    │ │
│  │  ├── populate_indicators()             →  self.freqai.start()          │ │
│  │  └── populate_entry_trend()            →  Use &-predictions            │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                      │
│                                       ▼                                      │
│  FreqAI Layer (Freqtrade)                                                   │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  --freqaimodel selection:                                               │ │
│  │  ├── LightGBMRegressor (built-in)     →  Train during backtest         │ │
│  │  ├── CatBoostRegressor (built-in)     →  Train during backtest         │ │
│  │  ├── TradAIPredictionModel (custom)   →  Load from CSV/MLflow          │ │
│  │  │   ├── train() → NO-OP (skip training)                               │ │
│  │  │   └── predict() → Load predictions from external source             │ │
│  │  └── TradAIMLflowModel (custom)       →  Load from MLflow registry     │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                      │
│                                       ▼                                      │
│  Mode Selection                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  TRAINING MODE (not implemented):                                       │ │
│  │  freqtrade backtesting --freqaimodel LightGBMRegressor                 │ │
│  │  → FreqAI trains models during walk-forward backtest                    │ │
│  │  → Models saved to user_data/models/                                    │ │
│  │  → Manual upload to MLflow required                                     │ │
│  │                                                                         │ │
│  │  INFERENCE MODE (implemented):                                          │ │
│  │  freqtrade backtesting --freqaimodel TradAIPredictionModel             │ │
│  │  → Loads pre-trained model from MLflow                                  │ │
│  │  → Runs predict() without training                                      │ │
│  │  → Used for backtesting and live trading                                │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
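The `%` and `&` prefixes in the Strategy Layer are a naming convention: feature columns start with `%` and target columns start with `&`. FreqAI performs this discovery itself; the sketch below only illustrates the convention with plain Python. The function name `split_freqai_columns` and the sample column names are assumptions made for illustration.

```python
def split_freqai_columns(columns):
    """Split column names into (features, targets) by FreqAI's prefix convention."""
    features = [name for name in columns if name.startswith("%")]
    targets = [name for name in columns if name.startswith("&")]
    return features, targets


# Illustrative column names; a real strategy defines its own.
columns = ["date", "close", "%-rsi-period_10", "%-pct-change", "&-s_close"]
features, targets = split_freqai_columns(columns)
print(features)  # ['%-rsi-period_10', '%-pct-change']
print(targets)   # ['&-s_close']
```

Columns without either prefix (such as `date` and `close` here) are neither features nor targets.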

2. Custom FreqAI Model: TradAIPredictionModel

# libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py

import numpy as np
from freqtrade.freqai.freqai_interface import IFreqaiModel


class TradAIPredictionModel(IFreqaiModel):
    """Custom FreqAI model for inference-only mode.

    Loads predictions from CSV or MLflow registry instead of training.
    """

    def train(self, unfiltered_df, pair, dk, **kwargs) -> None:
        """No training - return immediately."""
        self.logger.debug(f"Skipping training for {pair}")
        return None  # NO-OP

    def predict(self, unfiltered_df, dk, **kwargs) -> tuple[np.ndarray, np.ndarray]:
        """Load predictions from external source."""
        source = self._get_prediction_source()

        if source == "csv":
            return self._load_csv_predictions(dk)
        elif source == "mlflow":
            return self._load_mlflow_predictions(dk)
        else:
            raise ValueError(f"Unknown source: {source}")

Key Files:

- libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
- libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

3. MLflow Model Registry Integration

┌─────────────────────────────────────────────────────────────────────────────┐
│                         MLflow Model Lifecycle                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Model Registration Flow:                                                    │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  Backtest    │───►│  Manual      │───►│  Model       │                   │
│  │  Complete    │    │  Upload      │    │  Registry    │                   │
│  │              │    │  to MLflow   │    │              │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
│  Model Registry Schema:                                                      │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  RadStrategy                                                            │ │
│  │  ├── Version 1                                                          │ │
│  │  │   ├── Stage: Archived                                                │ │
│  │  │   ├── Tags: timeframe=1h, pairs=BTC/USDT:USDT                       │ │
│  │  │   └── Artifacts: model.pkl, config.json                              │ │
│  │  ├── Version 2                                                          │ │
│  │  │   ├── Stage: Staging                                                 │ │
│  │  │   ├── Tags: timeframe=1h, sharpe=1.8                                 │ │
│  │  │   └── Artifacts: model.pkl, feature_importance.json                  │ │
│  │  └── Version 3                                                          │ │
│  │      ├── Stage: Production                                              │ │
│  │      ├── Tags: timeframe=1h, sharpe=2.1, promoted_at=2025-12-15        │ │
│  │      └── Artifacts: model.pkl, backtest_results.json                    │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  Live Trading Model Loading:                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  Strategy    │───►│  MLflow      │───►│  FreqAI      │                   │
│  │  Container   │    │  Registry    │    │  Inference   │                   │
│  │              │    │              │    │              │                   │
│  │  ENV:        │    │  Query:      │    │  Model       │                   │
│  │  STRATEGY_   │    │  Production  │    │  predict()   │                   │
│  │  STAGE=Prod  │    │  Stage       │    │              │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implemented Methods (MLflowAdapter):

- create_model_version(name, source, run_id, tags) ✅
- transition_model_version_stage(name, version, stage) ✅
- get_registered_model(name) ✅
- search_registered_models(filter_string) ✅
- load_model(model_uri) ✅
- delete_model_version(name, version) ✅
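The staging and promotion flow in the registry schema above can be illustrated without a tracking server. What follows is a minimal in-memory sketch, not MLflowAdapter and not MLflow's API: `InMemoryRegistry`, `promote`, and `production_version` are hypothetical names, and the sketch assumes that promoting a version archives whichever version previously held the Production stage.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    stage: str = "None"  # one of: None, Staging, Production, Archived
    tags: dict = field(default_factory=dict)


class InMemoryRegistry:
    """Hypothetical stand-in for a model registry, for illustration only."""

    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def create_version(self, name, tags=None):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(version=len(versions) + 1, tags=dict(tags or {}))
        versions.append(mv)
        return mv

    def transition(self, name, version, stage):
        for mv in self._versions.get(name, []):
            if mv.version == version:
                mv.stage = stage
                return mv
        raise KeyError(f"{name} has no version {version}")

    def promote(self, name, version):
        """Move a version to Production, then archive the previous one (assumed policy)."""
        target = self.transition(name, version, "Production")
        for mv in self._versions[name]:
            if mv is not target and mv.stage == "Production":
                mv.stage = "Archived"
        return target

    def production_version(self, name):
        """Return the version a live container would load, or None."""
        for mv in self._versions.get(name, []):
            if mv.stage == "Production":
                return mv
        return None


registry = InMemoryRegistry()
registry.create_version("RadStrategy", {"timeframe": "1h"})
registry.create_version("RadStrategy", {"timeframe": "1h", "sharpe": "1.8"})
registry.promote("RadStrategy", 1)
registry.transition("RadStrategy", 2, "Staging")
registry.promote("RadStrategy", 2)
```

After the last call, version 2 is in Production and version 1 is Archived, which mirrors the progression shown in the schema.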

4. FreqAI Configuration Builder

# libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

class FreqAIConfigBuilder:
    """Build FreqAI configuration for different modes."""

    def build_inference_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for inference-only mode (live trading)."""
        return {
            "freqai": {
                "enabled": True,
                "live_retrain_hours": 0,  # No retraining during live
                "fit_live_predictions_candles": 0,
                "purge_old_models": False,
                "model_training_parameters": {},
                "feature_parameters": self._build_feature_params(strategy_metadata),
                "data_split_parameters": {"test_size": 0},
            }
        }

    def build_training_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for training mode (NOT YET IMPLEMENTED)."""
        raise NotImplementedError("Training mode requires ML001-ML003")
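A live container could guard against accidentally starting with a training-capable config. The sketch below checks the same keys that build_inference_config() sets; the function name `inference_config_violations` and the idea of running such a check at startup are assumptions, not part of the existing builder.

```python
def inference_config_violations(config: dict) -> list:
    """Return reasons why a config is not inference-only (empty list if it is)."""
    freqai = config.get("freqai", {})
    problems = []
    if not freqai.get("enabled", False):
        problems.append("freqai.enabled must be True")
    if freqai.get("live_retrain_hours", 0) != 0:
        problems.append("live_retrain_hours must be 0 (no retraining during live)")
    if freqai.get("data_split_parameters", {}).get("test_size", 0) != 0:
        problems.append("data_split_parameters.test_size must be 0")
    if freqai.get("purge_old_models", False):
        problems.append("purge_old_models must be False")
    return problems


# Same shape as the dict returned by build_inference_config();
# feature_parameters is left empty here because its contents are strategy-specific.
inference_config = {
    "freqai": {
        "enabled": True,
        "live_retrain_hours": 0,
        "fit_live_predictions_candles": 0,
        "purge_old_models": False,
        "model_training_parameters": {},
        "feature_parameters": {},
        "data_split_parameters": {"test_size": 0},
    }
}
print(inference_config_violations(inference_config))  # []
```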

Gap Analysis: What's Missing

Critical Missing Components

| ID | Component | Purpose | Blocks |
|---|---|---|---|
| ML001 | TradingMode.TRAIN | Add TRAIN mode to TradingMode enum | ML002 |
| ML002 | TrainingHandler | Orchestrate model training via Freqtrade | ML003, ML005 |
| ML003 | MLflow Model Upload | Auto-upload trained models to registry | MO001 |
| ML004 | ML Hyperparameter Optimization | Optuna for model hyperparams | - |
| ML005 | TrainingPredictionModel | FreqAI model that actually trains (walk-forward) | - |
| MO001 | Retraining Scheduler | EventBridge + Lambda for scheduled retraining | MO002 |
| MO002 | Model Drift Monitor | Track prediction accuracy vs actuals | MO003 |
| MO003 | Model Comparison Service | A/B testing between model versions | MO004 |
| MO004 | Model Rollback Service | Auto-rollback on performance degradation | - |
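The "Blocks" column defines a dependency graph, so one valid implementation order can be derived mechanically. The sketch below uses the standard library's `graphlib.TopologicalSorter`; the `BLOCKS` mapping is transcribed from the table above, while the helper name `implementation_order` is an assumption.

```python
from graphlib import TopologicalSorter

# "Blocks" column from the table: task -> tasks that cannot start before it
BLOCKS = {
    "ML001": ["ML002"],
    "ML002": ["ML003", "ML005"],
    "ML003": ["MO001"],
    "ML004": [],
    "ML005": [],
    "MO001": ["MO002"],
    "MO002": ["MO003"],
    "MO003": ["MO004"],
    "MO004": [],
}


def implementation_order(blocks):
    """Return one valid order in which every task comes after its blockers."""
    predecessors = {task: set() for task in blocks}
    for task, blocked in blocks.items():
        for dependent in blocked:
            predecessors[dependent].add(task)
    # TopologicalSorter expects node -> predecessors and raises CycleError on cycles.
    return list(TopologicalSorter(predecessors).static_order())


order = implementation_order(BLOCKS)
print(order)
```

Several orders are valid (ML004, for instance, has no dependencies and can be scheduled at any point), so the printed order is only one of them.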

Walk-Forward Backtesting (ML005)

Current Problem: TradAIPredictionModel.train() is a no-op, so nothing is retrained during a backtest and walk-forward validation is not possible with it.

FreqAI Native Walk-Forward:

{
  "freqai": {
    "train_period_days": 30,      // Train on 30 days of data
    "backtest_period_days": 7,    // Retrain every 7 days
    "data_split_parameters": {
      "test_size": 0.1,
      "shuffle": false
    }
  }
}

This enables progressive training during backtesting:

1. Train model on days 1-30
2. Predict on days 31-37
3. Retrain on days 8-37 (sliding window)
4. Predict on days 38-44
5. Repeat until end of backtest range
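The sliding-window schedule described above can be reproduced with a few lines of arithmetic. This is a sketch of the schedule only, not FreqAI's internal implementation; the function name `walk_forward_windows` and the 1-indexed inclusive day numbering are assumptions chosen to match the description.

```python
def walk_forward_windows(total_days, train_period_days=30, backtest_period_days=7):
    """Yield (train_start, train_end, predict_start, predict_end), 1-indexed inclusive."""
    train_start = 1
    while True:
        train_end = train_start + train_period_days - 1
        predict_start = train_end + 1
        if predict_start > total_days:
            break  # no data left to predict on
        # The final prediction window is clipped to the end of the range.
        predict_end = min(predict_start + backtest_period_days - 1, total_days)
        yield train_start, train_end, predict_start, predict_end
        train_start += backtest_period_days  # slide the training window forward


windows = list(walk_forward_windows(total_days=51))
for window in windows:
    print(window)
# (1, 30, 31, 37)
# (8, 37, 38, 44)
# (15, 44, 45, 51)
```

With 51 days of data, the model is trained three times, and every prediction day is covered by a model that only saw earlier data.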

Solution: Create a TrainingPredictionModel that extends IFreqaiModel with an actual train() implementation (LightGBM, CatBoost, etc.).

Component Dependency Graph

ML001 (TradingMode.TRAIN) ─────────────────────────────────────────────────────┐
                                                                                │
                                                                                ▼
                                         ┌──────────────────────────────────────┴────────────┐
                                         │                                                    │
                                         ▼                                                    ▼
                        ML002 (TrainingHandler) ────────────────►  ML005 (TrainingPredictionModel)
                                         │                         Walk-forward backtesting
                                         │
                                         ▼
                        ML003 (MLflow Upload) ────────► ML004 (Hyperopt - Optuna)
                                         │
                                         ▼
                        MO001 (Retraining Scheduler)
                                         │
                                         ▼
                        MO002 (Drift Monitor) ─────────────────────────────────────┐
                                         │                                          │
                                         ▼                                          ▼
                        MO003 (A/B Testing) ◄───────────────────────────── MO004 (Rollback)

Implementation Plan

Phase 1: Training Pipeline (Estimated: 2-3 days)

ML001: Add TradingMode.TRAIN

# libs/tradai-common/src/tradai/common/entrypoint/settings.py

from enum import Enum


class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"
    TRAIN = "train"  # NEW: Training mode for ML models
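Once TRAIN exists, the container entrypoint has to map a raw setting (such as the TRADING_MODE value) to the enum and pick a handler. The sketch below shows one way this could look; `parse_trading_mode` and the `HANDLERS` mapping are hypothetical, and the handler names are plain strings here so the example runs on its own.

```python
from enum import Enum


class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"
    TRAIN = "train"


def parse_trading_mode(raw: str) -> TradingMode:
    """Map a raw setting to a TradingMode member, with a readable error."""
    try:
        return TradingMode(raw.strip().lower())
    except ValueError:
        valid = ", ".join(mode.value for mode in TradingMode)
        raise ValueError(f"Unknown trading mode {raw!r}; expected one of: {valid}") from None


# Hypothetical dispatch table showing where TRAIN would plug in.
HANDLERS = {
    TradingMode.BACKTEST: "BacktestHandler",
    TradingMode.TRAIN: "TrainingHandler",
}

mode = parse_trading_mode("train")
print(mode, HANDLERS[mode])
```

Because the enum subclasses `str`, a member also compares equal to its raw value, so `TradingMode.TRAIN == "train"` is true.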

ML002: TrainingHandler Implementation

# libs/tradai-common/src/tradai/common/entrypoint/training.py

import mlflow


class TrainingHandler(BaseEntrypointHandler):
    """Handle ML model training via Freqtrade."""

    def __init__(
        self,
        config_loader: StrategyConfigLoader,
        mlflow_adapter: MLflowAdapter,
        freqtrade_config_builder: FreqtradeConfigBuilder,
    ):
        self.config_loader = config_loader
        self.mlflow = mlflow_adapter
        self.config_builder = freqtrade_config_builder

    def run(self) -> int:
        """Execute training workflow."""
        # 1. Load strategy config
        config = self.config_loader.load_config(
            strategy_name=self.settings.strategy_name,
            stage="Staging",  # Train against staging config
        )

        # 2. Build Freqtrade training config
        ft_config = self.config_builder.build_training_config(config)

        # 3. Execute Freqtrade backtesting with real FreqAI model
        #    (LightGBM, CatBoost, etc. - NOT TradAIPredictionModel)
        result = self._execute_training(ft_config)

        # 4. Upload trained model to MLflow
        model_version = self._upload_to_mlflow(result)

        # 5. Log metrics
        self._log_training_metrics(result, model_version)

        return 0

    def _execute_training(self, config: dict) -> TrainingResult:
        """Run Freqtrade backtesting with actual training."""
        cmd = [
            "freqtrade", "backtesting",
            "--strategy", config["strategy"],
            "--freqaimodel", config["freqai"]["model"],  # LightGBMRegressor
            "--config", self._write_temp_config(config),
            "--timerange", config["timerange"],
        ]
        # ... execute and parse results

    def _upload_to_mlflow(self, result: TrainingResult) -> ModelVersion:
        """Upload trained model artifacts to MLflow registry."""
        with mlflow.start_run() as run:
            # Log artifacts under "model/" so the runs:/ URI below resolves to them
            mlflow.log_artifact(result.model_path, artifact_path="model")
            mlflow.log_artifact(result.feature_importance_path, artifact_path="model")

            # Log metrics
            mlflow.log_metrics({
                "sharpe_ratio": result.sharpe,
                "profit_total": result.profit,
                "max_drawdown": result.max_drawdown,
            })

            # Register the logged artifacts as a new model version
            model_uri = f"runs:/{run.info.run_id}/model"
            return mlflow.register_model(model_uri, self.settings.strategy_name)

ML003: MLflow Model Auto-Upload

This is integrated into the TrainingHandler above (see _upload_to_mlflow).

Phase 2: Hyperparameter Optimization (Estimated: 2 days)

ML004: ML Hyperparameter Optimizer

# libs/tradai-strategy/src/tradai/strategy/training/hyperopt.py

import mlflow


class MLHyperparameterOptimizer:
    """Optimize ML model hyperparameters using Optuna."""

    def __init__(
        self,
        training_handler: TrainingHandler,
        mlflow_adapter: MLflowAdapter,
    ):
        self.training_handler = training_handler
        self.mlflow = mlflow_adapter

    def optimize(
        self,
        strategy_name: str,
        n_trials: int = 100,
        objective_metric: str = "sharpe_ratio",
    ) -> HyperoptResult:
        """Run hyperparameter optimization."""
        import optuna

        def objective(trial):
            # Sample hyperparameters
            params = {
                "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
                "max_depth": trial.suggest_int("max_depth", 3, 10),
                "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            }

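            # Train with these params. run_with_params() is not defined on the
            # TrainingHandler shown earlier and would need to be added; it must
            # return an object exposing a metrics dict.
            result = self.training_handler.run_with_params(params)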

            # Log to MLflow
            with mlflow.start_run(nested=True):
                mlflow.log_params(params)
                mlflow.log_metric(objective_metric, result.metrics[objective_metric])

            return result.metrics[objective_metric]

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials)

        return HyperoptResult(
            best_params=study.best_params,
            best_value=study.best_value,
            n_trials=n_trials,
        )
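The objective/study structure above can be exercised without Optuna or a real training run. The sketch below swaps in plain random search from the standard library for Optuna's sampler, and a made-up scoring function for the training run plus backtest. `sample_params`, `fake_sharpe`, and `random_search` are hypothetical names; only the search-space bounds come from the optimizer code above.

```python
import random


def sample_params(rng):
    """Draw one candidate from the same ranges as the Optuna search space."""
    return {
        "n_estimators": rng.randint(100, 1000),
        "learning_rate": rng.uniform(0.01, 0.3),
        "max_depth": rng.randint(3, 10),
        "min_child_samples": rng.randint(5, 100),
    }


def fake_sharpe(params):
    """Made-up objective standing in for a training run plus backtest."""
    return 2.0 - abs(params["learning_rate"] - 0.1) * 5 - abs(params["max_depth"] - 6) * 0.1


def random_search(objective, n_trials=100, seed=0):
    """Maximize objective over n_trials random samples; return (best_params, best_value)."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


best_params, best_value = random_search(fake_sharpe, n_trials=50, seed=42)
print(best_params, round(best_value, 3))
```

Seeding the generator makes a search reproducible, which is useful when comparing runs logged to an experiment tracker.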

Phase 3: Retraining & Monitoring (Estimated: 3-4 days)

ML005: Retraining Scheduler (MO001 in the Gap Analysis table)

# lambdas/retrain_scheduler/handler.py

import os
from datetime import datetime

import boto3


def handler(event, context):
    """EventBridge-triggered retraining scheduler."""
    dynamodb = boto3.resource("dynamodb")
    ecs = boto3.client("ecs")

    # Get strategies that need retraining
    strategies = get_strategies_for_retraining(dynamodb)

    for strategy in strategies:
        # Check if retraining is due
        if should_retrain(strategy):
            # Launch ECS training task
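            ecs.run_task(
                cluster=os.environ["ECS_CLUSTER"],
                taskDefinition=f"tradai-train-{strategy.name}",
                launchType="FARGATE",
                # Fargate tasks use awsvpc networking, so run_task needs a
                # network configuration. SUBNET_IDS and SECURITY_GROUP_IDS are
                # placeholder env var names holding comma-separated IDs.
                networkConfiguration={
                    "awsvpcConfiguration": {
                        "subnets": os.environ["SUBNET_IDS"].split(","),
                        "securityGroups": os.environ["SECURITY_GROUP_IDS"].split(","),
                    }
                },
                overrides={
                    "containerOverrides": [{
                        "name": "trainer",
                        "environment": [
                            {"name": "TRADING_MODE", "value": "train"},
                            {"name": "STRATEGY_NAME", "value": strategy.name},
                        ]
                    }]
                }
            )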

            # Update last_retrain timestamp
            update_retrain_timestamp(dynamodb, strategy)

def should_retrain(strategy) -> bool:
    """Check if strategy should be retrained."""
    # Option 1: Time-based (weekly/monthly)
    days_since_last = (datetime.utcnow() - strategy.last_retrain).days
    if days_since_last >= strategy.retrain_interval_days:
        return True

    # Option 2: Drift-based (if drift monitor triggered)
    if strategy.drift_detected:
        return True

    return False
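The retraining decision is easier to test when the current time is passed in rather than read inside the function. The sketch below restates should_retrain() in that form with timezone-aware datetimes (`datetime.utcnow()` is deprecated as of Python 3.12). The `StrategySchedule` dataclass is a hypothetical record whose field names mirror the attributes used above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StrategySchedule:
    """Hypothetical record; field names mirror the attributes used above."""
    name: str
    last_retrain: datetime
    retrain_interval_days: int = 7
    drift_detected: bool = False


def should_retrain(strategy, now=None):
    """Return True when the interval has elapsed or drift was flagged."""
    now = now or datetime.now(timezone.utc)
    if strategy.drift_detected:
        return True
    return (now - strategy.last_retrain).days >= strategy.retrain_interval_days


now = datetime(2025, 12, 24, tzinfo=timezone.utc)
fresh = StrategySchedule("RadStrategy", last_retrain=now - timedelta(days=2))
stale = StrategySchedule("RadStrategy", last_retrain=now - timedelta(days=8))
drifted = StrategySchedule(
    "RadStrategy", last_retrain=now - timedelta(days=2), drift_detected=True
)

print(should_retrain(fresh, now))    # False
print(should_retrain(stale, now))    # True
print(should_retrain(drifted, now))  # True
```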

ML006: Model Drift Monitor (MO002 in the Gap Analysis table)

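# libs/tradai-common/src/tradai/common/monitoring/drift.py

from datetime import datetime, timedelta
from decimal import Decimal

import boto3
import numpy as np


class ModelDriftMonitor:
    """Monitor model prediction accuracy and detect drift."""

    def __init__(
        self,
        dynamodb_table: str,
        cloudwatch_namespace: str,
    ):
        self.dynamodb = boto3.resource("dynamodb").Table(dynamodb_table)
        self.cloudwatch = boto3.client("cloudwatch")
        # Stored because calculate_drift_metrics() reads self.cloudwatch_namespace
        self.cloudwatch_namespace = cloudwatch_namespace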

    def record_prediction(
        self,
        strategy_name: str,
        prediction: float,
        actual: float,
        timestamp: datetime,
    ):
        """Record prediction vs actual for drift analysis."""
        self.dynamodb.put_item(Item={
            "pk": f"PRED#{strategy_name}",
            "sk": timestamp.isoformat(),
            "prediction": Decimal(str(prediction)),
            "actual": Decimal(str(actual)),
            "error": Decimal(str(abs(prediction - actual))),
            "ttl": int((timestamp + timedelta(days=30)).timestamp()),
        })

    def calculate_drift_metrics(
        self,
        strategy_name: str,
        window_days: int = 7,
    ) -> DriftMetrics:
        """Calculate drift metrics over time window."""
        # Query predictions from DynamoDB
        predictions = self._query_predictions(strategy_name, window_days)

        # Calculate metrics (DynamoDB returns numbers as Decimal, so convert to float)
        mae = float(np.mean([float(p["error"]) for p in predictions]))
        baseline_mae = self._get_baseline_mae(strategy_name)
        drift_ratio = mae / baseline_mae if baseline_mae > 0 else 1.0

        # Publish to CloudWatch
        self.cloudwatch.put_metric_data(
            Namespace=self.cloudwatch_namespace,
            MetricData=[
                {
                    "MetricName": "ModelDriftRatio",
                    "Dimensions": [{"Name": "Strategy", "Value": strategy_name}],
                    "Value": drift_ratio,
                    "Unit": "None",
                }
            ]
        )

        return DriftMetrics(
            mae=mae,
            baseline_mae=baseline_mae,
            drift_ratio=drift_ratio,
            drift_detected=drift_ratio > 1.5,  # 50% degradation threshold
        )
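The drift calculation itself is independent of DynamoDB and CloudWatch and can be checked in isolation. The sketch below computes the same mean absolute error and ratio from in-memory (prediction, actual) pairs; the function names and sample values are assumptions, while the 1.5 threshold comes from the code above.

```python
from statistics import fmean


def drift_ratio(pairs, baseline_mae):
    """Return (mae, ratio) for (prediction, actual) pairs against a baseline MAE."""
    if not pairs:
        raise ValueError("need at least one (prediction, actual) pair")
    mae = fmean(abs(prediction - actual) for prediction, actual in pairs)
    # Same fallback as above: a non-positive baseline yields a neutral ratio.
    ratio = mae / baseline_mae if baseline_mae > 0 else 1.0
    return mae, ratio


def drift_detected(ratio, threshold=1.5):
    """Flag drift when the recent error exceeds the baseline by the threshold factor."""
    return ratio > threshold


# Absolute errors are 0.01, 0.03 and 0.02, so the MAE is about 0.02.
recent = [(0.02, 0.01), (0.00, 0.03), (-0.01, 0.01)]
mae, ratio = drift_ratio(recent, baseline_mae=0.01)
print(round(mae, 4), round(ratio, 2), drift_detected(ratio))  # 0.02 2.0 True
```

A ratio of about 2.0 means the recent error is twice the baseline, which is above the 1.5 threshold and would mark the strategy for retraining.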

ML007: Model Comparison Service (MO003 in the Gap Analysis table)

# services/strategy-service/src/tradai/strategy_service/core/comparison_service.py

from datetime import datetime, timedelta


class ModelComparisonService:
    """Compare performance between model versions."""

    def compare_versions(
        self,
        strategy_name: str,
        version_a: str,
        version_b: str,
        timerange: str,
    ) -> ComparisonResult:
        """Run side-by-side backtest comparison."""
        # Run backtest with version A
        result_a = self._run_backtest(strategy_name, version_a, timerange)

        # Run backtest with version B
        result_b = self._run_backtest(strategy_name, version_b, timerange)

        # Calculate comparison metrics
        return ComparisonResult(
            version_a=version_a,
            version_b=version_b,
            sharpe_diff=result_b.sharpe - result_a.sharpe,
            profit_diff=result_b.profit - result_a.profit,
            drawdown_diff=result_b.max_drawdown - result_a.max_drawdown,
            recommendation=self._recommend_version(result_a, result_b),
        )

    def shadow_test(
        self,
        strategy_name: str,
        production_version: str,
        candidate_version: str,
        duration_days: int = 7,
    ) -> ShadowTestConfig:
        """Configure shadow testing for candidate model."""
        # Capture the start once so end_time is exactly duration_days later
        start_time = datetime.utcnow()
        return ShadowTestConfig(
            strategy_name=strategy_name,
            production_version=production_version,
            candidate_version=candidate_version,
            start_time=start_time,
            end_time=start_time + timedelta(days=duration_days),
            metrics_to_compare=["sharpe_ratio", "profit_total", "win_rate"],
        )
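
`compare_versions` delegates the final decision to `_recommend_version`, which this document does not define. One possible rule, sketched here under assumptions, is to recommend the candidate only when its Sharpe ratio improves by a minimum margin and its drawdown does not worsen beyond a tolerance. `BacktestSummary`, `min_sharpe_gain`, and `max_drawdown_tolerance` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestSummary:
    """Hypothetical stand-in for the backtest result fields used above."""
    sharpe: float
    profit: float
    max_drawdown: float  # positive fraction, e.g. 0.12 for a 12% drawdown


def recommend_version(
    result_a: BacktestSummary,
    result_b: BacktestSummary,
    min_sharpe_gain: float = 0.1,
    max_drawdown_tolerance: float = 0.02,
) -> str:
    """Return "b" only when B is clearly better than A; otherwise keep "a"."""
    sharpe_gain = result_b.sharpe - result_a.sharpe
    drawdown_increase = result_b.max_drawdown - result_a.max_drawdown
    if sharpe_gain >= min_sharpe_gain and drawdown_increase <= max_drawdown_tolerance:
        return "b"
    return "a"


current = BacktestSummary(sharpe=1.2, profit=0.18, max_drawdown=0.10)
candidate = BacktestSummary(sharpe=1.5, profit=0.21, max_drawdown=0.11)
choice = recommend_version(current, candidate)
```

Defaulting to version A when the evidence is marginal keeps the rule conservative, which fits the `auto_promote: false` default used elsewhere in this document.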

ML008: Model Rollback Service¶

# services/strategy-service/src/tradai/strategy_service/core/rollback_service.py

class ModelRollbackService:
    """Handle model version rollback."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        ecs_client: ECSClient,
    ):
        self.mlflow = mlflow_adapter
        self.ecs = ecs_client

    def rollback(
        self,
        strategy_name: str,
        target_version: Optional[str] = None,
        reason: str = "",
    ) -> RollbackResult:
        """Rollback to previous model version."""
        # Get current production version
        current = self.mlflow.get_model_version(
            name=strategy_name,
            stage="Production",
        )

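        # Determine target version (previous or specified)
        if target_version is None:
            target_version = self._get_previous_version(strategy_name, current.version)
        if target_version == current.version:
            raise ValueError(f"Version {target_version} is already in Production")

        # Transition stages. Promote the target first so that a failure between
        # the two calls never leaves the strategy without a Production version.
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=target_version,
            stage="Production",
        )
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=current.version,
            stage="Archived",
        )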

        # Restart live trading container to pick up new model
        self._restart_strategy_container(strategy_name)

        # Log rollback event
        self._log_rollback_event(strategy_name, current.version, target_version, reason)

        return RollbackResult(
            previous_version=current.version,
            new_version=target_version,
            timestamp=datetime.utcnow(),
            reason=reason,
        )

    def auto_rollback_on_drift(
        self,
        strategy_name: str,
        drift_threshold: float = 1.5,
    ):
        """Configure automatic rollback when drift exceeds threshold."""
        # Create CloudWatch alarm
        cloudwatch = boto3.client("cloudwatch")
        cloudwatch.put_metric_alarm(
            AlarmName=f"{strategy_name}-drift-rollback",
            MetricName="ModelDriftRatio",
            Namespace="TradAI/MLOps",
            Dimensions=[{"Name": "Strategy", "Value": strategy_name}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,
            Threshold=drift_threshold,
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=[os.environ["ROLLBACK_SNS_TOPIC"]],
        )
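
The alarm above only publishes to the SNS topic, so something still has to consume the notification and call `rollback()`. The sketch below shows one way a subscriber could extract the strategy name from the notification. It assumes the standard SNS envelope and the CloudWatch alarm JSON layout (`NewStateValue`, and `Trigger.Dimensions` with lowercase `name`/`value` keys), which should be verified against a real notification. `parse_drift_alarm` and `handle_alarm` are hypothetical names.

```python
import json
from typing import Any, Callable, Optional


def parse_drift_alarm(event: dict[str, Any]) -> Optional[str]:
    """Return the strategy name if the SNS event is a drift alarm in ALARM state."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None  # ignore OK / INSUFFICIENT_DATA transitions
    trigger = message.get("Trigger", {})
    if trigger.get("MetricName") != "ModelDriftRatio":
        return None
    # Assumption: alarm notifications use lowercase "name"/"value" keys here.
    for dimension in trigger.get("Dimensions", []):
        if dimension.get("name") == "Strategy":
            return dimension.get("value")
    return None


def handle_alarm(event: dict[str, Any], rollback: Callable[[str, str], Any]) -> bool:
    """Call rollback(strategy_name, reason) for a drift alarm; return True if it ran."""
    strategy_name = parse_drift_alarm(event)
    if strategy_name is None:
        return False
    rollback(strategy_name, "auto-rollback: ModelDriftRatio alarm")
    return True


# Example SNS event, trimmed to the fields the parser reads.
sample_event = {
    "Records": [{"Sns": {"Message": json.dumps({
        "AlarmName": "RadStrategy-drift-rollback",
        "NewStateValue": "ALARM",
        "Trigger": {
            "MetricName": "ModelDriftRatio",
            "Namespace": "TradAI/MLOps",
            "Dimensions": [{"name": "Strategy", "value": "RadStrategy"}],
        },
    })}}]
}
calls: list[tuple[str, str]] = []
handled = handle_alarm(sample_event, lambda name, reason: calls.append((name, reason)))
```

In a deployed subscriber, the `rollback` callable would wrap `ModelRollbackService.rollback(strategy_name=..., reason=...)`.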

Infrastructure Requirements¶

New AWS Resources¶

Resource	Purpose	Cost/Month
ECS Task Definition (Training)	Run training tasks	Pay-per-use
EventBridge Rule	Trigger scheduled retraining	~$1
Lambda (Retrain Scheduler)	Orchestrate retraining	~$1
Lambda (Drift Monitor)	Calculate drift metrics	~$1
DynamoDB Table (Predictions)	Store prediction history	~$5
CloudWatch Alarms	Drift detection alerts	~$1
SNS Topic	Rollback notifications	~$0.50

Total Additional Cost: ~$10/month in fixed costs, plus pay-per-use ECS training tasks

ECS Task Definition for Training¶

Resource: AWS::ECS::TaskDefinition
  Family: tradai-train-{strategy}
  Cpu: 2048  # More CPU for training
  Memory: 4096  # More memory for model training
  NetworkMode: awsvpc
  RequiresCompatibilities:
    - FARGATE
  ContainerDefinitions:
    - Name: trainer
      Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-{strategy}:latest
      Essential: true
      Environment:
        - Name: TRADING_MODE
          Value: train
        - Name: STRATEGY_NAME
          Value: "{strategy}"  # template placeholder; quoted so YAML reads it as a string
        - Name: MLFLOW_TRACKING_URI
          Value: http://mlflow.internal:5000
        - Name: FREQAI_MODEL
          Value: LightGBMRegressor  # Actual training model
        - Name: TRAIN_PERIOD_DAYS
          Value: "30"
        - Name: BACKTEST_PERIOD_DAYS
          Value: "7"

Configuration Schema¶

FreqAI Training Configuration¶

{
  "freqai": {
    "enabled": true,
    "model": "LightGBMRegressor",
    "train_period_days": 30,
    "backtest_period_days": 7,
    "identifier": "rad-strategy-v3",
    "model_training_parameters": {
      "n_estimators": 500,
      "learning_rate": 0.05,
      "max_depth": 7,
      "min_child_samples": 20
    },
    "feature_parameters": {
      "label_period_candles": 24,
      "include_timeframes": ["1h", "4h"],
      "indicator_periods_candles": [10, 20, 50]
    },
    "data_split_parameters": {
      "test_size": 0.15,
      "shuffle": false
    }
  }
}
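
A configuration like the one above can be sanity-checked before a training task is launched. The sketch below is one possible project-level check, not part of FreqAI itself. `validate_freqai_config` is a hypothetical helper, and the rules it enforces (positive periods, a test split strictly between 0 and 1, no shuffling for time-series data) are assumptions drawn from the values shown in this section.

```python
from typing import Any


def validate_freqai_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means no issues."""
    problems: list[str] = []
    freqai = config.get("freqai", {})
    if not freqai.get("enabled", False):
        problems.append("freqai.enabled must be true for ML strategies")
    for key in ("train_period_days", "backtest_period_days"):
        value = freqai.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"freqai.{key} must be a positive integer")
    split = freqai.get("data_split_parameters", {})
    if not 0 < split.get("test_size", 0) < 1:
        problems.append("data_split_parameters.test_size must be between 0 and 1")
    if split.get("shuffle", False):
        # Shuffling time-ordered candles leaks future data into the training set
        problems.append("data_split_parameters.shuffle should be false for time series")
    return problems


config = {
    "freqai": {
        "enabled": True,
        "model": "LightGBMRegressor",
        "train_period_days": 30,
        "backtest_period_days": 7,
        "data_split_parameters": {"test_size": 0.15, "shuffle": False},
    }
}
problems = validate_freqai_config(config)
```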

Retraining Schedule Configuration¶

# DynamoDB: tradai-ml-config

{
  "pk": "CONFIG#RadStrategy",
  "sk": "RETRAIN",
  "retrain_interval_days": 7,
  "retrain_on_drift": true,
  "drift_threshold": 1.5,
  "auto_promote": false,
  "shadow_test_days": 3,
  "rollback_enabled": true,
  "notification_emails": ["team@tradai.io"]
}
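
The scheduler has to combine the interval and drift settings above into a single decision. The following is a minimal sketch of that decision, assuming the config item has already been read from DynamoDB and the latest drift ratio is available. `should_retrain` and its return values are hypothetical and shown only to make the intended semantics concrete.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def should_retrain(
    config: dict[str, Any],
    last_trained_at: datetime,
    drift_ratio: Optional[float],
    now: datetime,
) -> tuple[bool, str]:
    """Decide whether to start a retraining run and report the reason."""
    interval = timedelta(days=int(config["retrain_interval_days"]))
    if now - last_trained_at >= interval:
        return True, "schedule"
    if (
        config.get("retrain_on_drift", False)
        and drift_ratio is not None
        and drift_ratio > float(config["drift_threshold"])
    ):
        return True, "drift"
    return False, "not_due"


config = {"retrain_interval_days": 7, "retrain_on_drift": True, "drift_threshold": 1.5}
now = datetime(2025, 12, 22, tzinfo=timezone.utc)
three_days_ago = now - timedelta(days=3)

decision_quiet = should_retrain(config, three_days_ago, 1.2, now)
decision_drift = should_retrain(config, three_days_ago, 1.8, now)
decision_due = should_retrain(config, now - timedelta(days=7), None, now)
```

Passing `now` explicitly keeps the function deterministic and easy to test. The `int()` and `float()` conversions also accept the Decimal values that DynamoDB returns.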

Monitoring & Alerting¶

CloudWatch Metrics¶

Metric	Namespace	Description
`ModelDriftRatio`	TradAI/MLOps	Ratio of current MAE to baseline
`TrainingDuration`	TradAI/MLOps	Time to train model (seconds)
`ModelVersion`	TradAI/MLOps	Current production model version
`RetrainingCount`	TradAI/MLOps	Number of retraining events
`RollbackCount`	TradAI/MLOps	Number of rollback events

Alerts¶

Alert	Threshold	Action
High Model Drift	ModelDriftRatio > 1.5	SNS → Auto-rollback
Training Failed	Duration > 2h or Exit != 0	SNS → Page on-call
Rollback Triggered	Any rollback	SNS → Notify team

Implementation Roadmap¶

Phase 1: Training Pipeline (Priority: HIGH)¶

Task	Description	Est. Hours	Dependencies
ML001	Add TradingMode.TRAIN	2	-
ML002	Implement TrainingHandler	8	ML001
ML003	MLflow model auto-upload	4	ML002

Subtotal: 14 hours

Phase 2: Hyperparameter Optimization (Priority: MEDIUM)¶

Task	Description	Est. Hours	Dependencies
ML004	Optuna integration for ML hyperparams	12	ML002

Subtotal: 12 hours

Phase 3: Retraining & Monitoring (Priority: MEDIUM)¶

Task	Description	Est. Hours	Dependencies
ML005	Retraining scheduler (EventBridge + Lambda)	8	ML002
ML006	Model drift monitor	10	-
ML007	Model comparison service	8	-
ML008	Model rollback service	6	ML007

Subtotal: 32 hours

Total Effort: ~58 hours (7-8 days)¶

Success Criteria¶

Criterion	Validation
Training mode works	`TRADING_MODE=train` successfully trains and uploads model
Models auto-registered	Trained models appear in MLflow registry with correct tags
Hyperopt works	`ml_hyperopt` command optimizes model parameters
Scheduled retraining	EventBridge triggers retraining on schedule
Drift detected	CloudWatch alarm fires when drift > threshold
Rollback works	`rollback` command transitions model versions correctly
Live containers reload	Restarted containers pick up new production model

Document Changelog¶

Version	Date	Changes
1.0.0	2025-12-22	Initial ML Lifecycle architecture document

References¶

11-LIVE-TRADING.md - Live trading architecture with inference mode
Freqtrade FreqAI Documentation - FreqAI official docs
MLflow Model Registry - MLflow registry docs

Next Actions:

1. Complete LF007/LF008 to enable live inference (prerequisite)
2. Implement ML001-ML003 for training pipeline
3. Add ML004 for hyperparameter optimization
4. Deploy ML005-ML008 for full MLOps lifecycle