TradAI ML Strategy Lifecycle Architecture

Version: 1.0.0
Date: 2025-12-22
Status: DRAFT - E2E VALIDATION REQUIRED
Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md


Executive Summary

This document defines the complete lifecycle for ML-based trading strategies in TradAI, covering:

  1. Model Development - Feature engineering, training, hyperparameter tuning
  2. Experimentation - MLflow tracking, experiment comparison
  3. Model Registry - Versioning, staging, promotion
  4. Backtesting - Walk-forward validation, historical simulation
  5. Live/Paper Trading - Inference, monitoring
  6. Retraining - Scheduled retraining, drift detection, rollback

Current Implementation Status

Phase                     Status         Completion
Model Development         Manual         0% automated
Experimentation (MLflow)  Implemented    100%
Model Registry            Implemented    100%
Backtesting with ML       Implemented    100%
Live Inference            Designed       60%
Retraining Pipeline       Code Complete  80% (needs IaC)
Drift Detection           Code Complete  80% (needs IaC)

Overall ML Lifecycle: ~60% Complete

Note (2025-12-24): Retraining and drift detection Lambda code EXISTS in lambdas/retraining-scheduler/handler.py and lambdas/drift-monitor/handler.py. Remaining work is Pulumi IaC integration (EventBridge schedules, DynamoDB tables, IAM roles).


System Architecture: ML Strategy Flow

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    ML STRATEGY E2E LIFECYCLE                                                 │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 1: MODEL DEVELOPMENT                                                     │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  Feature     │───►│  FreqAI      │───►│  Model       │───►│  Local       │───►│  Manual      │     │ │
│  │  │  Engineering │    │  Training    │    │  Selection   │    │  Validation  │    │  Upload      │     │ │
│  │  │              │    │  Config      │    │              │    │              │    │  to MLflow   │     │ │
│  │  │  %-features  │    │  LightGBM    │    │  Hyperopt    │    │  Backtests   │    │              │     │ │
│  │  │  &-targets   │    │  CatBoost    │    │  Grid Search │    │              │    │              │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  Status: ⚠️ MANUAL - No automated training pipeline                                                    │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 2: EXPERIMENTATION & REGISTRY (✅ IMPLEMENTED)                          │ │
│  │                                                                                                         │ │
│  │  ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐  │ │
│  │  │                              MLflow Tracking Server                                               │  │ │
│  │  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │  │ │
│  │  │  │ Experiments     │  │ Runs            │  │ Artifacts       │  │ Model Registry  │              │  │ │
│  │  │  │                 │  │                 │  │                 │  │                 │              │  │ │
│  │  │  │ strategy-exp    │  │ run_id: abc123  │  │ model.pkl       │  │ RadStrategy     │              │  │ │
│  │  │  │ hyperopt-exp    │  │ metrics:        │  │ config.json     │  │ ├─ v1 (Staging) │              │  │ │
│  │  │  │ backtest-exp    │  │  - sharpe: 1.5  │  │ features.csv    │  │ └─ v2 (Prod)    │              │  │ │
│  │  │  │                 │  │  - profit: 0.15 │  │ backtest.json   │  │                 │              │  │ │
│  │  │  └─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────────┘              │  │ │
│  │  └──────────────────────────────────────────────────────────────────────────────────────────────────┘  │ │
│  │                                                                                                         │ │
│  │  Implemented: MLflowAdapter with full CRUD (CL007)                                                     │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 3: BACKTESTING WITH ML (✅ IMPLEMENTED)                                 │ │
│  │                                                                                                         │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  BacktestSvc │───►│  DataQuery   │───►│  Freqtrade   │───►│  MLflow      │───►│  S3          │     │ │
│  │  │              │    │  Service     │    │  Runner      │    │  Logging     │    │  Upload      │     │ │
│  │  │  config:     │    │              │    │              │    │              │    │              │     │ │
│  │  │  - strategy  │    │  OHLCV data  │    │  --freqai    │    │  metrics     │    │  results.json│     │ │
│  │  │  - freqai    │    │  → .feather  │    │  model=...   │    │  artifacts   │    │              │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  TradAIPredictionModel: Inference-only (loads from CSV/MLflow, does NOT train)                         │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 4: LIVE/PAPER TRADING (⚠️ PARTIAL - 60%)                                │ │
│  │                                                                                                         │ │
│  │  ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐  │ │
│  │  │                          Strategy Container (Live Mode)                                           │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  IMPLEMENTED (Building Blocks):                                                                   │  │ │
│  │  │  ✅ LF002: StrategyConfigLoader (load from MLflow tags + S3)                                      │  │ │
│  │  │  ✅ LF003: ArcticDBWarmupLoader (historical data warmup)                                          │  │ │
│  │  │  ✅ LF004: TradingStateRepository (DynamoDB state)                                                │  │ │
│  │  │  ✅ LF005: HealthReporter (container health)                                                      │  │ │
│  │  │  ✅ LM003: FreqAIConfigBuilder (inference config)                                                 │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  NOT IMPLEMENTED (Blockers):                                                                      │  │ │
│  │  │  ❌ LF006: ECS Service definition                                                                 │  │ │
│  │  │  ❌ LF007: Exchange credentials (Secrets Manager)                                                 │  │ │
│  │  │  ❌ LF008: TradingHandler wiring                                                                  │  │ │
│  │  │                                                                                                   │  │ │
│  │  │  FLOW: StrategyConfigLoader → Load MLflow Model → FreqAI Inference (no training)                  │  │ │
│  │  └──────────────────────────────────────────────────────────────────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                              │                                                               │
│                                              ▼                                                               │
│  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│  │                          PHASE 5: RETRAINING & MONITORING (❌ NOT IMPLEMENTED)                         │ │
│  │                                                                                                         │ │
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │ │
│  │  │  Drift       │───►│  Retrain     │───►│  Model       │───►│  A/B         │───►│  Auto        │     │ │
│  │  │  Detection   │    │  Trigger     │    │  Training    │    │  Testing     │    │  Promotion   │     │ │
│  │  │              │    │              │    │              │    │              │    │              │     │ │
│  │  │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │    │  ❌ MISSING  │     │ │
│  │  └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘     │ │
│  │                                                                                                         │ │
│  │  Required Tasks: ML001-ML008 (see Implementation Plan below)                                           │ │
│  └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│                                                                                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Component Detail

1. FreqAI Integration Architecture

TradAI uses FreqAI (Freqtrade's ML framework) for model training and inference:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         FreqAI Architecture in TradAI                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Strategy Layer                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  RadStrategy (extends TradAIStrategy)                                   │ │
│  │  ├── feature_engineering_expand_all()  →  Define %-prefixed features   │ │
│  │  ├── set_freqai_targets()              →  Define &-prefixed targets    │ │
│  │  ├── populate_indicators()             →  self.freqai.start()          │ │
│  │  └── populate_entry_trend()            →  Use &-predictions            │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                      │
│                                       ▼                                      │
│  FreqAI Layer (Freqtrade)                                                   │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  --freqaimodel selection:                                               │ │
│  │  ├── LightGBMRegressor (built-in)     →  Train during backtest         │ │
│  │  ├── CatBoostRegressor (built-in)     →  Train during backtest         │ │
│  │  ├── TradAIPredictionModel (custom)   →  Load from CSV/MLflow          │ │
│  │  │   ├── train() → NO-OP (skip training)                               │ │
│  │  │   └── predict() → Load predictions from external source             │ │
│  │  └── TradAIMLflowModel (custom)       →  Load from MLflow registry     │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                       │                                      │
│                                       ▼                                      │
│  Mode Selection                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  TRAINING MODE (not implemented):                                       │ │
│  │  freqtrade backtesting --freqaimodel LightGBMRegressor                 │ │
│  │  → FreqAI trains models during walk-forward backtest                    │ │
│  │  → Models saved to user_data/models/                                    │ │
│  │  → Manual upload to MLflow required                                     │ │
│  │                                                                         │ │
│  │  INFERENCE MODE (implemented):                                          │ │
│  │  freqtrade backtesting --freqaimodel TradAIPredictionModel             │ │
│  │  → Loads pre-trained model from MLflow                                  │ │
│  │  → Runs predict() without training                                      │ │
│  │  → Used for backtesting and live trading                                │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

2. Custom FreqAI Model: TradAIPredictionModel

# libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py

import numpy as np
from freqtrade.freqai.freqai_interface import IFreqaiModel


class TradAIPredictionModel(IFreqaiModel):
    """Custom FreqAI model for inference-only mode.

    Loads predictions from CSV or MLflow registry instead of training.
    """

    def train(self, unfiltered_df, pair, dk, **kwargs) -> None:
        """No training - return immediately."""
        self.logger.debug(f"Skipping training for {pair}")
        return None  # NO-OP

    def predict(self, unfiltered_df, dk, **kwargs) -> tuple[np.ndarray, np.ndarray]:
        """Load predictions from external source."""
        source = self._get_prediction_source()

        if source == "csv":
            return self._load_csv_predictions(dk)
        elif source == "mlflow":
            return self._load_mlflow_predictions(dk)
        else:
            raise ValueError(f"Unknown source: {source}")

Key Files:
  - libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
  - libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

3. MLflow Model Registry Integration

┌─────────────────────────────────────────────────────────────────────────────┐
│                         MLflow Model Lifecycle                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Model Registration Flow:                                                    │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  Backtest    │───►│  Manual      │───►│  Model       │                   │
│  │  Complete    │    │  Upload      │    │  Registry    │                   │
│  │              │    │  to MLflow   │    │              │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
│  Model Registry Schema:                                                      │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  RadStrategy                                                            │ │
│  │  ├── Version 1                                                          │ │
│  │  │   ├── Stage: Archived                                                │ │
│  │  │   ├── Tags: timeframe=1h, pairs=BTC/USDT:USDT                       │ │
│  │  │   └── Artifacts: model.pkl, config.json                              │ │
│  │  ├── Version 2                                                          │ │
│  │  │   ├── Stage: Staging                                                 │ │
│  │  │   ├── Tags: timeframe=1h, sharpe=1.8                                 │ │
│  │  │   └── Artifacts: model.pkl, feature_importance.json                  │ │
│  │  └── Version 3                                                          │ │
│  │      ├── Stage: Production                                              │ │
│  │      ├── Tags: timeframe=1h, sharpe=2.1, promoted_at=2025-12-15        │ │
│  │      └── Artifacts: model.pkl, backtest_results.json                    │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  Live Trading Model Loading:                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  Strategy    │───►│  MLflow      │───►│  FreqAI      │                   │
│  │  Container   │    │  Registry    │    │  Inference   │                   │
│  │              │    │              │    │              │                   │
│  │  ENV:        │    │  Query:      │    │  Model       │                   │
│  │  STRATEGY_   │    │  Production  │    │  predict()   │                   │
│  │  STAGE=Prod  │    │  Stage       │    │              │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implemented Methods (MLflowAdapter):
  - create_model_version(name, source, run_id, tags) ✅
  - transition_model_version_stage(name, version, stage) ✅
  - get_registered_model(name) ✅
  - search_registered_models(filter_string) ✅
  - load_model(model_uri) ✅
  - delete_model_version(name, version)
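
The registry schema and the live-loading flow above come down to one lookup: "which version of this strategy is currently in stage X?". The following is a minimal in-memory sketch of that lookup using plain dictionaries instead of MLflow itself. The `REGISTRY` layout and the `resolve_version` and `model_uri` helpers are hypothetical and only mirror the RadStrategy example; they are not part of MLflowAdapter.

```python
from typing import Optional

# Hypothetical in-memory mirror of the RadStrategy registry entries shown above.
REGISTRY = {
    "RadStrategy": [
        {"version": 1, "stage": "Archived"},
        {"version": 2, "stage": "Staging"},
        {"version": 3, "stage": "Production"},
    ]
}


def resolve_version(registry: dict, name: str, stage: str) -> Optional[int]:
    """Return the highest version number of `name` in `stage`, or None."""
    candidates = [
        entry["version"]
        for entry in registry.get(name, [])
        if entry["stage"] == stage
    ]
    return max(candidates) if candidates else None


def model_uri(name: str, version: int) -> str:
    """Build an MLflow-style models:/ URI for a concrete version."""
    return f"models:/{name}/{version}"
```

A container started with STRATEGY_STAGE=Prod would, under these assumptions, resolve the Production version first and then hand the resulting URI to whatever loads the model.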

4. FreqAI Configuration Builder

# libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py

class FreqAIConfigBuilder:
    """Build FreqAI configuration for different modes."""

    def build_inference_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for inference-only mode (live trading)."""
        return {
            "freqai": {
                "enabled": True,
                "live_retrain_hours": 0,  # No retraining during live
                "fit_live_predictions_candles": 0,
                "purge_old_models": False,
                "model_training_parameters": {},
                "feature_parameters": self._build_feature_params(strategy_metadata),
                "data_split_parameters": {"test_size": 0},
            }
        }

    def build_training_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for training mode (NOT YET IMPLEMENTED)."""
        raise NotImplementedError("Training mode requires ML001-ML003")
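
Because live containers must never retrain, it can help to verify a generated config before it is handed to Freqtrade. The sketch below is one possible guard, not part of FreqAIConfigBuilder: `inference_config_problems` is a hypothetical helper, and the checks simply restate the values set in build_inference_config() above. The empty `feature_parameters` dict is a placeholder for whatever _build_feature_params() returns.

```python
def inference_config_problems(config: dict) -> list[str]:
    """Return reasons why `config` is not inference-only (empty list = OK)."""
    freqai = config.get("freqai", {})
    problems = []
    if not freqai.get("enabled"):
        problems.append("freqai.enabled must be True")
    if freqai.get("live_retrain_hours", 0) != 0:
        problems.append("live_retrain_hours must be 0")
    if freqai.get("model_training_parameters"):
        problems.append("model_training_parameters must be empty")
    if freqai.get("data_split_parameters", {}).get("test_size", 0) != 0:
        problems.append("data_split_parameters.test_size must be 0")
    return problems


# Same shape as the dict returned by build_inference_config() above.
INFERENCE_CONFIG = {
    "freqai": {
        "enabled": True,
        "live_retrain_hours": 0,
        "fit_live_predictions_candles": 0,
        "purge_old_models": False,
        "model_training_parameters": {},
        "feature_parameters": {},  # placeholder for _build_feature_params(...)
        "data_split_parameters": {"test_size": 0},
    }
}
```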

Gap Analysis: What's Missing

Critical Missing Components

ID     Component                       Purpose                                           Blocks
ML001  TradingMode.TRAIN               Add TRAIN mode to TradingMode enum                ML002
ML002  TrainingHandler                 Orchestrate model training via Freqtrade          ML003, ML005
ML003  MLflow Model Upload             Auto-upload trained models to registry            MO001
ML004  ML Hyperparameter Optimization  Optuna for model hyperparams                      -
ML005  TrainingPredictionModel         FreqAI model that actually trains (walk-forward)  -
MO001  Retraining Scheduler            EventBridge + Lambda for scheduled retraining     MO002
MO002  Model Drift Monitor             Track prediction accuracy vs actuals              MO003
MO003  Model Comparison Service        A/B testing between model versions                MO004
MO004  Model Rollback Service          Auto-rollback on performance degradation          -
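
The "Blocks" column implies a build order. As a rough illustration, the sketch below feeds those relationships into Python's standard-library `graphlib.TopologicalSorter` to derive one valid implementation sequence. The `BLOCKS` mapping is transcribed from the table; `build_order` is a hypothetical helper, and the exact order among independent items is not significant.

```python
from graphlib import TopologicalSorter

# component -> components it blocks, transcribed from the table above
BLOCKS = {
    "ML001": ["ML002"],
    "ML002": ["ML003", "ML005"],
    "ML003": ["MO001"],
    "ML004": [],
    "ML005": [],
    "MO001": ["MO002"],
    "MO002": ["MO003"],
    "MO003": ["MO004"],
    "MO004": [],
}


def build_order(blocks: dict) -> list:
    """Return one ordering in which every component precedes what it blocks."""
    sorter = TopologicalSorter()
    for component, blocked in blocks.items():
        sorter.add(component)
        for dependent in blocked:
            # `dependent` cannot start until `component` is done
            sorter.add(dependent, component)
    return list(sorter.static_order())


ORDER = build_order(BLOCKS)
```

`static_order()` raises `graphlib.CycleError` if the table ever gains a circular dependency, which makes this a cheap consistency check on the plan itself.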

Walk-Forward Backtesting (ML005)

Current Problem: TradAIPredictionModel.train() is a NO-OP, so walk-forward validation is not possible with it.

FreqAI Native Walk-Forward:

{
  "freqai": {
    "train_period_days": 30,      // Train on 30 days of data
    "backtest_period_days": 7,    // Retrain every 7 days
    "data_split_parameters": {
      "test_size": 0.1,
      "shuffle": false
    }
  }
}

This enables progressive training during backtesting:

  1. Train model on days 1-30
  2. Predict on days 31-37
  3. Retrain on days 8-37 (sliding window)
  4. Predict on days 38-44
  5. Repeat until end of backtest range
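
The sliding-window schedule described above can be reproduced with a few lines of arithmetic. This sketch only illustrates the day ranges; it is not FreqAI's internal implementation. `walk_forward_windows` is a hypothetical helper that uses 1-based, inclusive day numbers, and its parameter names are borrowed from the config keys above.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int]


def walk_forward_windows(
    total_days: int,
    train_period_days: int = 30,
    backtest_period_days: int = 7,
) -> Iterator[Tuple[Window, Window]]:
    """Yield ((train_start, train_end), (predict_start, predict_end)) pairs."""
    train_start = 1
    while True:
        train_end = train_start + train_period_days - 1
        predict_start = train_end + 1
        if predict_start > total_days:
            break  # nothing left to predict on
        # The final prediction window is clipped to the end of the range.
        predict_end = min(predict_start + backtest_period_days - 1, total_days)
        yield (train_start, train_end), (predict_start, predict_end)
        # Slide the training window forward by one backtest period.
        train_start += backtest_period_days


WINDOWS = list(walk_forward_windows(total_days=44))
# WINDOWS == [((1, 30), (31, 37)), ((8, 37), (38, 44))]
```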

Solution: Create a TrainingPredictionModel that extends IFreqaiModel with an actual train() implementation (LightGBM, CatBoost, etc.).

Component Dependency Graph

ML001 (TradingMode.TRAIN) ─────────────────────────────────────────────────────┐
                                         ┌──────────────────────────────────────┴────────────┐
                                         │                                                    │
                                         ▼                                                    ▼
                        ML002 (TrainingHandler) ────────────────►  ML005 (TrainingPredictionModel)
                                         │                         Walk-forward backtesting
                        ML003 (MLflow Upload) ────────► ML004 (Hyperopt - Optuna)
                        MO001 (Retraining Scheduler)
                        MO002 (Drift Monitor) ─────────────────────────────────────┐
                                         │                                          │
                                         ▼                                          ▼
                        MO003 (A/B Testing) ◄───────────────────────────── MO004 (Rollback)

Implementation Plan

Phase 1: Training Pipeline (Estimated: 2-3 days)

ML001: Add TradingMode.TRAIN

# libs/tradai-common/src/tradai/common/entrypoint/settings.py

from enum import Enum


class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"
    TRAIN = "train"  # NEW: Training mode for ML models
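
Since the mode reaches the container as a TRADING_MODE environment variable, the string value has to be mapped back onto the enum. The sketch below repeats the enum so it runs on its own and adds a hypothetical `parse_trading_mode` helper. The whitespace and case normalization is an assumption about what would be convenient, not existing behavior.

```python
from enum import Enum


class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"
    TRAIN = "train"


def parse_trading_mode(raw: str) -> TradingMode:
    """Map an environment-variable value onto TradingMode, failing loudly on typos."""
    try:
        # Enum lookup by value: TradingMode("train") -> TradingMode.TRAIN
        return TradingMode(raw.strip().lower())
    except ValueError:
        valid = ", ".join(mode.value for mode in TradingMode)
        raise ValueError(
            f"Unknown TRADING_MODE {raw!r}; expected one of: {valid}"
        ) from None
```

Typical use would be `parse_trading_mode(os.environ["TRADING_MODE"])` at container start, so that a misspelled mode stops the task before any handler is selected.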

ML002: TrainingHandler Implementation

# libs/tradai-common/src/tradai/common/entrypoint/training.py

import mlflow


class TrainingHandler(BaseEntrypointHandler):
    """Handle ML model training via Freqtrade."""

    def __init__(
        self,
        config_loader: StrategyConfigLoader,
        mlflow_adapter: MLflowAdapter,
        freqtrade_config_builder: FreqtradeConfigBuilder,
    ):
        self.config_loader = config_loader
        self.mlflow = mlflow_adapter
        self.config_builder = freqtrade_config_builder

    def run(self) -> int:
        """Execute training workflow."""
        # 1. Load strategy config
        config = self.config_loader.load_config(
            strategy_name=self.settings.strategy_name,
            stage="Staging",  # Train against staging config
        )

        # 2. Build Freqtrade training config
        ft_config = self.config_builder.build_training_config(config)

        # 3. Execute Freqtrade backtesting with real FreqAI model
        #    (LightGBM, CatBoost, etc. - NOT TradAIPredictionModel)
        result = self._execute_training(ft_config)

        # 4. Upload trained model to MLflow
        model_version = self._upload_to_mlflow(result)

        # 5. Log metrics
        self._log_training_metrics(result, model_version)

        return 0

    def _execute_training(self, config: dict) -> TrainingResult:
        """Run Freqtrade backtesting with actual training."""
        cmd = [
            "freqtrade", "backtesting",
            "--strategy", config["strategy"],
            "--freqaimodel", config["freqai"]["model"],  # LightGBMRegressor
            "--config", self._write_temp_config(config),
            "--timerange", config["timerange"],
        ]
        # ... execute and parse results

    def _upload_to_mlflow(self, result: TrainingResult) -> ModelVersion:
        """Upload trained model artifacts to MLflow registry."""
        with mlflow.start_run():
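            # Log model artifacts under "model/" so that the
            # runs:/<run_id>/model URI registered below resolves to them
            mlflow.log_artifact(result.model_path, artifact_path="model")
            mlflow.log_artifact(result.feature_importance_path)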

            # Log metrics
            mlflow.log_metrics({
                "sharpe_ratio": result.sharpe,
                "profit_total": result.profit,
                "max_drawdown": result.max_drawdown,
            })

            # Register model
            model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
            return mlflow.register_model(model_uri, self.settings.strategy_name)
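
The handler above calls `_write_temp_config()` without showing it. One plausible shape for that helper, together with the command assembly from `_execute_training()`, is sketched below as free functions. Both `write_temp_config` and `build_training_command` are hypothetical; the command-line flags are the ones already used in the listing above.

```python
import json
import tempfile


def write_temp_config(config: dict) -> str:
    """Serialize `config` to a temporary JSON file and return its path."""
    # delete=False keeps the file on disk after the handle closes so that a
    # freqtrade subprocess can read it; the caller is responsible for cleanup.
    with tempfile.NamedTemporaryFile(
        mode="w",
        suffix=".json",
        prefix="freqtrade-train-",
        delete=False,
        encoding="utf-8",
    ) as handle:
        json.dump(config, handle, indent=2)
        return handle.name


def build_training_command(config: dict, config_path: str) -> list:
    """Assemble the freqtrade backtesting command used for training."""
    return [
        "freqtrade", "backtesting",
        "--strategy", config["strategy"],
        "--freqaimodel", config["freqai"]["model"],
        "--config", config_path,
        "--timerange", config["timerange"],
    ]
```

Passing the command as a list (rather than a shell string) to `subprocess.run` avoids quoting problems with strategy names and time ranges.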

ML003: MLflow Model Auto-Upload

Integrated into TrainingHandler above (see _upload_to_mlflow()).

Phase 2: Hyperparameter Optimization (Estimated: 2 days)

ML004: ML Hyperparameter Optimizer

# libs/tradai-strategy/src/tradai/strategy/training/hyperopt.py

import mlflow


class MLHyperparameterOptimizer:
    """Optimize ML model hyperparameters using Optuna."""

    def __init__(
        self,
        training_handler: TrainingHandler,
        mlflow_adapter: MLflowAdapter,
    ):
        self.training_handler = training_handler
        self.mlflow = mlflow_adapter

    def optimize(
        self,
        strategy_name: str,
        n_trials: int = 100,
        objective_metric: str = "sharpe_ratio",
    ) -> HyperoptResult:
        """Run hyperparameter optimization."""
        import optuna

        def objective(trial):
            # Sample hyperparameters
            params = {
                "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
                "max_depth": trial.suggest_int("max_depth", 3, 10),
                "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            }

            # Train with these params
            result = self.training_handler.run_with_params(params)

            # Log to MLflow
            with mlflow.start_run(nested=True):
                mlflow.log_params(params)
                mlflow.log_metric(objective_metric, result.metrics[objective_metric])

            return result.metrics[objective_metric]

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials)

        return HyperoptResult(
            best_params=study.best_params,
            best_value=study.best_value,
            n_trials=n_trials,
        )

Phase 3: Retraining & Monitoring (Estimated: 3-4 days)

ML005: Retraining Scheduler (MO001 in the Gap Analysis table)

# lambdas/retraining-scheduler/handler.py

import os
from datetime import datetime

import boto3


def handler(event, context):
    """EventBridge-triggered retraining scheduler."""
    dynamodb = boto3.resource("dynamodb")
    ecs = boto3.client("ecs")

    # Get strategies that need retraining
    strategies = get_strategies_for_retraining(dynamodb)

    for strategy in strategies:
        # Check if retraining is due
        if should_retrain(strategy):
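            # Launch ECS training task.
            # Note: Fargate tasks run in awsvpc network mode, so a real call
            # also needs a networkConfiguration block (subnets, security groups).
            response = ecs.run_task(
                cluster=os.environ["ECS_CLUSTER"],
                taskDefinition=f"tradai-train-{strategy.name}",
                launchType="FARGATE",
                overrides={
                    "containerOverrides": [{
                        "name": "trainer",
                        "environment": [
                            {"name": "TRADING_MODE", "value": "train"},
                            {"name": "STRATEGY_NAME", "value": strategy.name},
                        ]
                    }]
                }
            )

            # Update last_retrain timestamp only if ECS accepted the task
            if not response.get("failures"):
                update_retrain_timestamp(dynamodb, strategy)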

def should_retrain(strategy) -> bool:
    """Check if strategy should be retrained."""
    # Option 1: Time-based (weekly/monthly)
    days_since_last = (datetime.utcnow() - strategy.last_retrain).days
    if days_since_last >= strategy.retrain_interval_days:
        return True

    # Option 2: Drift-based (if drift monitor triggered)
    if strategy.drift_detected:
        return True

    return False
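
The retraining decision is easier to unit-test when the clock is injectable. The sketch below restates `should_retrain()` in that form. `StrategyRecord` is a hypothetical stand-in for whatever `get_strategies_for_retraining()` returns, and the timestamps are assumed to be timezone-aware UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StrategyRecord:
    """Hypothetical shape of a strategy row read from DynamoDB."""
    name: str
    last_retrain: datetime
    retrain_interval_days: int
    drift_detected: bool = False


def should_retrain(strategy: StrategyRecord, now: Optional[datetime] = None) -> bool:
    """Retrain when the interval has elapsed or the drift monitor has fired."""
    now = now or datetime.now(timezone.utc)
    days_since_last = (now - strategy.last_retrain).days
    if days_since_last >= strategy.retrain_interval_days:
        return True  # time-based trigger
    return strategy.drift_detected  # drift-based trigger


NOW = datetime(2025, 12, 24, tzinfo=timezone.utc)
overdue = StrategyRecord("RadStrategy", NOW - timedelta(days=8), 7)
fresh = StrategyRecord("RadStrategy", NOW - timedelta(days=3), 7)
drifting = StrategyRecord("RadStrategy", NOW - timedelta(days=3), 7, drift_detected=True)
```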

ML006: Model Drift Monitor (MO002 in the Gap Analysis table)

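# libs/tradai-common/src/tradai/common/monitoring/drift.py

from datetime import datetime, timedelta
from decimal import Decimal

import boto3
import numpy as np


class ModelDriftMonitor:
    """Monitor model prediction accuracy and detect drift."""

    def __init__(
        self,
        dynamodb_table: str,
        cloudwatch_namespace: str,
    ):
        self.dynamodb = boto3.resource("dynamodb").Table(dynamodb_table)
        self.cloudwatch = boto3.client("cloudwatch")
        # Stored for put_metric_data() in calculate_drift_metrics()
        self.cloudwatch_namespace = cloudwatch_namespace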

    def record_prediction(
        self,
        strategy_name: str,
        prediction: float,
        actual: float,
        timestamp: datetime,
    ):
        """Record prediction vs actual for drift analysis."""
        self.dynamodb.put_item(Item={
            "pk": f"PRED#{strategy_name}",
            "sk": timestamp.isoformat(),
            "prediction": Decimal(str(prediction)),
            "actual": Decimal(str(actual)),
            "error": Decimal(str(abs(prediction - actual))),
            "ttl": int((timestamp + timedelta(days=30)).timestamp()),
        })

    def calculate_drift_metrics(
        self,
        strategy_name: str,
        window_days: int = 7,
    ) -> DriftMetrics:
        """Calculate drift metrics over time window."""
        # Query predictions from DynamoDB
        predictions = self._query_predictions(strategy_name, window_days)

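        # Calculate metrics (DynamoDB returns numbers as Decimal, so convert
        # to float before averaging and before publishing to CloudWatch)
        mae = float(np.mean([float(p["error"]) for p in predictions]))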
        baseline_mae = self._get_baseline_mae(strategy_name)
        drift_ratio = mae / baseline_mae if baseline_mae > 0 else 1.0

        # Publish to CloudWatch
        self.cloudwatch.put_metric_data(
            Namespace=self.cloudwatch_namespace,
            MetricData=[
                {
                    "MetricName": "ModelDriftRatio",
                    "Dimensions": [{"Name": "Strategy", "Value": strategy_name}],
                    "Value": drift_ratio,
                    "Unit": "None",
                }
            ]
        )

        return DriftMetrics(
            mae=mae,
            baseline_mae=baseline_mae,
            drift_ratio=drift_ratio,
            drift_detected=drift_ratio > 1.5,  # 50% degradation threshold
        )
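
Stripped of the DynamoDB and CloudWatch calls, the drift check is a ratio of recent mean absolute error to a baseline. The sketch below isolates that arithmetic using only the standard library. `compute_drift` is a hypothetical helper, the local `DriftMetrics` dataclass is a stand-in with the same four fields used above, and raising on an empty window is an assumption about how missing data should be handled.

```python
from dataclasses import dataclass
from decimal import Decimal
from statistics import fmean
from typing import Iterable, Union

Number = Union[float, Decimal]


@dataclass(frozen=True)
class DriftMetrics:
    mae: float
    baseline_mae: float
    drift_ratio: float
    drift_detected: bool


def compute_drift(
    errors: Iterable[Number],
    baseline_mae: float,
    threshold: float = 1.5,
) -> DriftMetrics:
    """Compare the window's mean absolute error against the baseline."""
    values = [float(e) for e in errors]  # DynamoDB hands back Decimal
    if not values:
        raise ValueError("no predictions in window")
    mae = fmean(values)
    # Same fallback as above: without a usable baseline, report "no change".
    drift_ratio = mae / baseline_mae if baseline_mae > 0 else 1.0
    return DriftMetrics(mae, baseline_mae, drift_ratio, drift_ratio > threshold)
```

For example, errors of 3.0 and 5.0 against a baseline of 2.0 give an MAE of 4.0 and a drift ratio of 2.0, which exceeds the 1.5 threshold.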

ML007: Model Comparison Service (MO003 in the Gap Analysis table)

# services/strategy-service/src/tradai/strategy_service/core/comparison_service.py

class ModelComparisonService:
    """Compare performance between model versions."""

    def compare_versions(
        self,
        strategy_name: str,
        version_a: str,
        version_b: str,
        timerange: str,
    ) -> ComparisonResult:
        """Run side-by-side backtest comparison."""
        # Run backtest with version A
        result_a = self._run_backtest(strategy_name, version_a, timerange)

        # Run backtest with version B
        result_b = self._run_backtest(strategy_name, version_b, timerange)

        # Calculate comparison metrics
        return ComparisonResult(
            version_a=version_a,
            version_b=version_b,
            sharpe_diff=result_b.sharpe - result_a.sharpe,
            profit_diff=result_b.profit - result_a.profit,
            drawdown_diff=result_b.max_drawdown - result_a.max_drawdown,
            recommendation=self._recommend_version(result_a, result_b),
        )

    def shadow_test(
        self,
        strategy_name: str,
        production_version: str,
        candidate_version: str,
        duration_days: int = 7,
    ) -> ShadowTestConfig:
        """Configure shadow testing for candidate model."""
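        # Capture the start once so end_time is exactly duration_days later.
        # datetime.now(timezone.utc) replaces the deprecated datetime.utcnow()
        # and requires `from datetime import timezone`.
        start_time = datetime.now(timezone.utc)
        return ShadowTestConfig(
            strategy_name=strategy_name,
            production_version=production_version,
            candidate_version=candidate_version,
            start_time=start_time,
            end_time=start_time + timedelta(days=duration_days),
            metrics_to_compare=["sharpe_ratio", "profit_total", "win_rate"],
        )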
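
The source does not define `_recommend_version`. One possible rule, sketched here purely as an assumption, is to prefer the candidate only when it improves the Sharpe ratio by a margin without losing profit or adding much drawdown. The `BacktestResult` dataclass and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestResult:
    # Hypothetical stand-in for the object returned by _run_backtest
    sharpe: float
    profit: float
    max_drawdown: float  # assumed positive fraction, e.g. 0.12 = 12% drawdown


def recommend_version(
    a: BacktestResult,
    b: BacktestResult,
    min_sharpe_gain: float = 0.1,
    max_extra_drawdown: float = 0.02,
) -> str:
    """Return "B" only if B clearly beats A without much more drawdown."""
    sharpe_diff = b.sharpe - a.sharpe
    profit_diff = b.profit - a.profit
    drawdown_diff = b.max_drawdown - a.max_drawdown
    if (
        sharpe_diff >= min_sharpe_gain
        and profit_diff >= 0
        and drawdown_diff <= max_extra_drawdown
    ):
        return "B"
    return "A"  # default to the incumbent when the evidence is weak


production = BacktestResult(sharpe=1.2, profit=0.18, max_drawdown=0.10)
candidate = BacktestResult(sharpe=1.5, profit=0.22, max_drawdown=0.11)
print(recommend_version(production, candidate))
```

Defaulting to version A keeps the incumbent model in place unless the candidate wins on every criterion, which fits the `auto_promote: false` default used elsewhere in this document.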

ML008: Model Rollback Service

# services/strategy-service/src/tradai/strategy_service/core/rollback_service.py

class ModelRollbackService:
    """Handle model version rollback."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        ecs_client: ECSClient,
    ):
        self.mlflow = mlflow_adapter
        self.ecs = ecs_client

    def rollback(
        self,
        strategy_name: str,
        target_version: Optional[str] = None,
        reason: str = "",
    ) -> RollbackResult:
        """Rollback to previous model version."""
        # Get current production version
        current = self.mlflow.get_model_version(
            name=strategy_name,
            stage="Production",
        )

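        # Determine target version (previous or specified)
        if target_version is None:
            target_version = self._get_previous_version(strategy_name, current.version)
        if str(target_version) == str(current.version):
            raise ValueError(f"{strategy_name} v{target_version} is already in Production")

        # Transition stages: promote the target first so a Production version
        # always exists, then archive the version being rolled back
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=target_version,
            stage="Production",
        )
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=current.version,
            stage="Archived",
        )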

        # Restart live trading container to pick up new model
        self._restart_strategy_container(strategy_name)

        # Log rollback event
        self._log_rollback_event(strategy_name, current.version, target_version, reason)

        return RollbackResult(
            previous_version=current.version,
            new_version=target_version,
            timestamp=datetime.now(timezone.utc),  # utcnow() is deprecated; needs `timezone` import
            reason=reason,
        )

    def auto_rollback_on_drift(
        self,
        strategy_name: str,
        drift_threshold: float = 1.5,
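    ) -> None:
        """Configure automatic rollback when drift exceeds threshold."""
        # Create CloudWatch alarm: fires after 3 consecutive 5-minute periods
        # above the threshold. ROLLBACK_SNS_TOPIC must contain the SNS topic ARN.
        cloudwatch = boto3.client("cloudwatch")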
        cloudwatch.put_metric_alarm(
            AlarmName=f"{strategy_name}-drift-rollback",
            MetricName="ModelDriftRatio",
            Namespace="TradAI/MLOps",
            Dimensions=[{"Name": "Strategy", "Value": strategy_name}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,
            Threshold=drift_threshold,
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=[os.environ["ROLLBACK_SNS_TOPIC"]],
        )
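
The alarm above only publishes to SNS; something must consume the notification and call `rollback`. The sketch below shows how a subscriber (for example a Lambda function) might extract the strategy name from the alarm message. It assumes the standard CloudWatch alarm notification JSON, in which dimension entries use lowercase `name` and `value` keys; the function name and sample event are hypothetical, and the format should be checked against a real notification before relying on it.

```python
import json
from typing import Any, Dict, Optional


def strategy_from_alarm_event(event: Dict[str, Any]) -> Optional[str]:
    """Return the strategy to roll back, or None if no rollback is needed."""
    record = event["Records"][0]
    message = json.loads(record["Sns"]["Message"])  # payload is a JSON string
    if message.get("NewStateValue") != "ALARM":
        return None  # ignore OK / INSUFFICIENT_DATA transitions
    trigger = message.get("Trigger", {})
    if trigger.get("MetricName") != "ModelDriftRatio":
        return None  # not a drift alarm
    for dimension in trigger.get("Dimensions", []):
        if dimension.get("name") == "Strategy":
            return dimension.get("value")
    return None


# Hand-built sample event (assumed shape, trimmed to the fields used above)
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "RadStrategy-drift-rollback",
                        "NewStateValue": "ALARM",
                        "Trigger": {
                            "MetricName": "ModelDriftRatio",
                            "Namespace": "TradAI/MLOps",
                            "Dimensions": [
                                {"name": "Strategy", "value": "RadStrategy"}
                            ],
                        },
                    }
                )
            }
        }
    ]
}

print(strategy_from_alarm_event(sample_event))
```

A handler built on this would pass the returned name to `ModelRollbackService.rollback` with a reason such as "drift alarm".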

Infrastructure Requirements

New AWS Resources

| Resource | Purpose | Cost/Month |
|----------|---------|------------|
| ECS Task Definition (Training) | Run training tasks | Pay-per-use |
| EventBridge Rule | Trigger scheduled retraining | ~$1 |
| Lambda (Retrain Scheduler) | Orchestrate retraining | ~$1 |
| Lambda (Drift Monitor) | Calculate drift metrics | ~$1 |
| DynamoDB Table (Predictions) | Store prediction history | ~$5 |
| CloudWatch Alarms | Drift detection alerts | ~$1 |
| SNS Topic | Rollback notifications | ~$0.50 |

Total Additional Cost: ~$10/month for the fixed-cost resources; ECS training tasks are billed per use on top of this

ECS Task Definition for Training

Resource: AWS::ECS::TaskDefinition
  Family: tradai-train-{strategy}
  Cpu: 2048  # 2 vCPU: more CPU for training
  Memory: 4096  # 4096 MiB: more memory for model training
  NetworkMode: awsvpc
  RequiresCompatibilities:
    - FARGATE
  ContainerDefinitions:
    - Name: trainer
      Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-{strategy}:latest
      Essential: true
      Environment:
        - Name: TRADING_MODE
          Value: train
        - Name: STRATEGY_NAME
          Value: "{strategy}"  # Quoted so YAML reads the placeholder as a string, not a mapping
        - Name: MLFLOW_TRACKING_URI
          Value: http://mlflow.internal:5000
        - Name: FREQAI_MODEL
          Value: LightGBMRegressor  # Actual training model
        - Name: TRAIN_PERIOD_DAYS
          Value: "30"
        - Name: BACKTEST_PERIOD_DAYS
          Value: "7"
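
To show how a scheduler might launch this task definition, the sketch below builds the keyword arguments for boto3's `ecs.run_task`. It is a sketch under assumptions: the helper name, cluster name, subnet and security group IDs are placeholders, and the real retraining scheduler may pass different overrides.

```python
from typing import Any, Dict, List


def build_training_task_request(
    strategy: str,
    cluster: str,
    subnets: List[str],
    security_groups: List[str],
    train_period_days: int = 30,
) -> Dict[str, Any]:
    """Build keyword arguments for ecs.run_task (hypothetical helper)."""
    return {
        "cluster": cluster,
        # Family name follows the tradai-train-{strategy} pattern above
        "taskDefinition": f"tradai-train-{strategy}",
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {  # required for NetworkMode awsvpc
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": "trainer",  # container name from the definition
                    "environment": [
                        {
                            "name": "TRAIN_PERIOD_DAYS",
                            "value": str(train_period_days),
                        },
                    ],
                }
            ]
        },
    }


request = build_training_task_request(
    strategy="rad-strategy",             # placeholder strategy name
    cluster="tradai-cluster",            # placeholder cluster name
    subnets=["subnet-EXAMPLE"],          # placeholder subnet ID
    security_groups=["sg-EXAMPLE"],      # placeholder security group ID
)
# With AWS credentials configured:
#   import boto3
#   boto3.client("ecs").run_task(**request)
print(request["taskDefinition"])
```

Keeping the request construction separate from the AWS call makes the scheduler logic testable without network access.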

Configuration Schema

FreqAI Training Configuration

{
  "freqai": {
    "enabled": true,
    "model": "LightGBMRegressor",
    "train_period_days": 30,
    "backtest_period_days": 7,
    "identifier": "rad-strategy-v3",
    "model_training_parameters": {
      "n_estimators": 500,
      "learning_rate": 0.05,
      "max_depth": 7,
      "min_child_samples": 20
    },
    "feature_parameters": {
      "label_period_candles": 24,
      "include_timeframes": ["1h", "4h"],
      "indicator_periods_candles": [10, 20, 50]
    },
    "data_split_parameters": {
      "test_size": 0.15,
      "shuffle": false
    }
  }
}
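
The `train_period_days` and `backtest_period_days` values describe a sliding-window scheme: train on 30 days, evaluate on the next 7, then slide forward by 7 days. The sketch below illustrates that idea under simplifying assumptions; the function name is hypothetical, and FreqAI's own window boundaries may differ in detail.

```python
from datetime import date, timedelta
from typing import List, Tuple

Window = Tuple[date, date, date]  # (train_start, train_end, test_end)


def walk_forward_windows(
    start: date,
    end: date,
    train_period_days: int = 30,
    backtest_period_days: int = 7,
) -> List[Window]:
    """Generate consecutive train/test windows that fit inside [start, end]."""
    windows: List[Window] = []
    train_start = start
    while True:
        train_end = train_start + timedelta(days=train_period_days)
        test_end = train_end + timedelta(days=backtest_period_days)
        if test_end > end:
            break  # the next test window would run past the data
        windows.append((train_start, train_end, test_end))
        # Slide forward by one test window so test periods do not overlap
        train_start += timedelta(days=backtest_period_days)
    return windows


for train_start, train_end, test_end in walk_forward_windows(
    date(2025, 1, 1), date(2025, 3, 1)
):
    print(f"train {train_start} to {train_end}, test until {test_end}")
```

Each model is evaluated only on data that follows its training window, which is why `shuffle` is set to `false` in `data_split_parameters`.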

Retraining Schedule Configuration

# DynamoDB: tradai-ml-config

{
  "pk": "CONFIG#RadStrategy",
  "sk": "RETRAIN",
  "retrain_interval_days": 7,
  "retrain_on_drift": true,
  "drift_threshold": 1.5,
  "auto_promote": false,
  "shadow_test_days": 3,
  "rollback_enabled": true,
  "notification_emails": ["team@tradai.io"]
}
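
A scheduler reading this item has to decide whether a retraining run is due. The sketch below shows one way to combine the interval and drift settings; the function name and return values are assumptions, and the real Lambda handler may apply additional rules.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def should_retrain(
    config: Dict[str, Any],
    last_trained_at: datetime,
    drift_ratio: Optional[float] = None,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return the reason to retrain ("drift" or "schedule"), or None."""
    now = now or datetime.now(timezone.utc)
    # Drift takes priority over the schedule when retrain_on_drift is enabled
    if (
        config.get("retrain_on_drift")
        and drift_ratio is not None
        and drift_ratio > float(config["drift_threshold"])
    ):
        return "drift"
    interval = timedelta(days=int(config["retrain_interval_days"]))
    if now - last_trained_at >= interval:
        return "schedule"
    return None


config = {
    "retrain_interval_days": 7,
    "retrain_on_drift": True,
    "drift_threshold": 1.5,
}
last_run = datetime(2025, 12, 15, tzinfo=timezone.utc)
check_time = datetime(2025, 12, 18, tzinfo=timezone.utc)

# Three days after the last run: only drift can trigger retraining
print(should_retrain(config, last_run, drift_ratio=1.8, now=check_time))
print(should_retrain(config, last_run, drift_ratio=1.1, now=check_time))
```

Values read from DynamoDB arrive as `Decimal`, which is why the sketch converts them with `float()` and `int()` before comparing.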

Monitoring & Alerting

CloudWatch Metrics

| Metric | Namespace | Description |
|--------|-----------|-------------|
| ModelDriftRatio | TradAI/MLOps | Ratio of current MAE to baseline |
| TrainingDuration | TradAI/MLOps | Time to train model (seconds) |
| ModelVersion | TradAI/MLOps | Current production model version |
| RetrainingCount | TradAI/MLOps | Number of retraining events |
| RollbackCount | TradAI/MLOps | Number of rollback events |

Alerts

| Alert | Threshold | Action |
|-------|-----------|--------|
| High Model Drift | ModelDriftRatio > 1.5 | SNS → Auto-rollback |
| Training Failed | TrainingDuration > 2 h or exit code != 0 | SNS → Page on-call |
| Rollback Triggered | Any rollback | SNS → Notify team |
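
The alert rules in the table above can be written as a small evaluation function. This is an illustrative sketch only: in the deployed system these conditions would be CloudWatch alarms, and the function name and action labels are assumptions.

```python
from typing import List, Optional, Tuple

MAX_TRAINING_SECONDS = 2 * 60 * 60  # the 2-hour limit from the alert table


def evaluate_alerts(
    drift_ratio: Optional[float] = None,
    training_duration_s: Optional[float] = None,
    training_exit_code: Optional[int] = None,
    rollback_triggered: bool = False,
    drift_threshold: float = 1.5,
) -> List[Tuple[str, str]]:
    """Return (alert, action) pairs for every rule that fires."""
    fired: List[Tuple[str, str]] = []
    if drift_ratio is not None and drift_ratio > drift_threshold:
        fired.append(("High Model Drift", "auto-rollback"))
    too_slow = (
        training_duration_s is not None
        and training_duration_s > MAX_TRAINING_SECONDS
    )
    bad_exit = training_exit_code is not None and training_exit_code != 0
    if too_slow or bad_exit:
        fired.append(("Training Failed", "page on-call"))
    if rollback_triggered:
        fired.append(("Rollback Triggered", "notify team"))
    return fired


# A training run that finished in 90 minutes but exited with an error
print(evaluate_alerts(training_duration_s=5400, training_exit_code=1))
```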

Implementation Roadmap

Phase 1: Training Pipeline (Priority: HIGH)

| Task | Description | Est. Hours | Dependencies |
|------|-------------|------------|--------------|
| ML001 | Add TradingMode.TRAIN | 2 | - |
| ML002 | Implement TrainingHandler | 8 | ML001 |
| ML003 | MLflow model auto-upload | 4 | ML002 |

Subtotal: 14 hours

Phase 2: Hyperparameter Optimization (Priority: MEDIUM)

| Task | Description | Est. Hours | Dependencies |
|------|-------------|------------|--------------|
| ML004 | Optuna integration for ML hyperparams | 12 | ML002 |

Subtotal: 12 hours

Phase 3: Retraining & Monitoring (Priority: MEDIUM)

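| Task | Description | Est. Hours | Dependencies |
|------|-------------|------------|--------------|
| ML005 | Retraining scheduler (EventBridge + Lambda) | 8 | ML002 |
| ML006 | Model drift monitor | 10 | - |
| ML007 | Model comparison service | 8 | - |
| ML008 | Model rollback service | 6 | ML007 |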

Subtotal: 32 hours

Total Effort: ~58 hours (7-8 days)


Success Criteria

| Criterion | Validation |
|-----------|------------|
| Training mode works | TRADING_MODE=train successfully trains and uploads model |
| Models auto-registered | Trained models appear in MLflow registry with correct tags |
| Hyperopt works | ml_hyperopt command optimizes model parameters |
| Scheduled retraining | EventBridge triggers retraining on schedule |
| Drift detected | CloudWatch alarm fires when drift > threshold |
| Rollback works | rollback command transitions model versions correctly |
| Live containers reload | Restarted containers pick up new production model |

Document Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-22 | Initial ML Lifecycle architecture document |


Next Actions:

1. Complete LF007/LF008 to enable live inference (prerequisite)
2. Implement ML001-ML003 for the training pipeline
3. Add ML004 for hyperparameter optimization
4. Deploy ML005-ML008 for the full MLOps lifecycle