TradAI ML Strategy Lifecycle Architecture¶
Version: 1.0.0
Date: 2025-12-22
Status: DRAFT - E2E VALIDATION REQUIRED
Depends On: 11-LIVE-TRADING.md, 02-ARCHITECTURE-OVERVIEW.md
Executive Summary¶
This document defines the complete lifecycle for ML-based trading strategies in TradAI, covering:
- Model Development - Feature engineering, training, hyperparameter tuning
- Experimentation - MLflow tracking, experiment comparison
- Model Registry - Versioning, staging, promotion
- Backtesting - Walk-forward validation, historical simulation
- Live/Paper Trading - Inference, monitoring
- Retraining - Scheduled retraining, drift detection, rollback
Current Implementation Status¶
| Phase | Status | Completion |
|---|---|---|
| Model Development | Manual | 0% automated |
| Experimentation (MLflow) | Implemented | 100% |
| Model Registry | Implemented | 100% |
| Backtesting with ML | Implemented | 100% |
| Live Inference | Designed | 60% |
| Retraining Pipeline | Code Complete | 80% (needs IaC) |
| Drift Detection | Code Complete | 80% (needs IaC) |
Overall ML Lifecycle: ~60% Complete
Note (2025-12-24): Retraining and drift detection Lambda code EXISTS in lambdas/retraining-scheduler/handler.py and lambdas/drift-monitor/handler.py. Remaining work is Pulumi IaC integration (EventBridge schedules, DynamoDB tables, IAM roles); a sketch of that wiring appears under Infrastructure Requirements below. The Phase 5 diagram and gap analysis that follow predate this note and still list these components as missing.
System Architecture: ML Strategy Flow¶
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ML STRATEGY E2E LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 1: MODEL DEVELOPMENT │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Feature │───►│ FreqAI │───►│ Model │───►│ Local │───►│ Manual │ │ │
│ │ │ Engineering │ │ Training │ │ Selection │ │ Validation │ │ Upload │ │ │
│ │ │ │ │ Config │ │ │ │ │ │ to MLflow │ │ │
│ │ │ %-features │ │ LightGBM │ │ Hyperopt │ │ Backtests │ │ │ │ │
│ │ │ &-targets │ │ CatBoost │ │ Grid Search │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ Status: ⚠️ MANUAL - No automated training pipeline │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 2: EXPERIMENTATION & REGISTRY (✅ IMPLEMENTED) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ MLflow Tracking Server │ │ │
│ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │ │
│ │ │ │ Experiments │ │ Runs │ │ Artifacts │ │ Model Registry │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ strategy-exp │ │ run_id: abc123 │ │ model.pkl │ │ RadStrategy │ │ │ │
│ │ │ │ hyperopt-exp │ │ metrics: │ │ config.json │ │ ├─ v1 (Staging) │ │ │ │
│ │ │ │ backtest-exp │ │ - sharpe: 1.5 │ │ features.csv │ │ └─ v2 (Prod) │ │ │ │
│ │ │ │ │ │ - profit: 0.15 │ │ backtest.json │ │ │ │ │ │
│ │ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Implemented: MLflowAdapter with full CRUD (CL007) │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 3: BACKTESTING WITH ML (✅ IMPLEMENTED) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ BacktestSvc │───►│ DataQuery │───►│ Freqtrade │───►│ MLflow │───►│ S3 │ │ │
│ │ │ │ │ Service │ │ Runner │ │ Logging │ │ Upload │ │ │
│ │ │ config: │ │ │ │ │ │ │ │ │ │ │
│ │ │ - strategy │ │ OHLCV data │ │ --freqai │ │ metrics │ │ results.json│ │ │
│ │ │ - freqai │ │ → .feather │ │ model=... │ │ artifacts │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ TradAIPredictionModel: Inference-only (loads from CSV/MLflow, does NOT train) │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 4: LIVE/PAPER TRADING (⚠️ PARTIAL - 60%) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Strategy Container (Live Mode) │ │ │
│ │ │ │ │ │
│ │ │ IMPLEMENTED (Building Blocks): │ │ │
│ │ │ ✅ LF002: StrategyConfigLoader (load from MLflow tags + S3) │ │ │
│ │ │ ✅ LF003: ArcticDBWarmupLoader (historical data warmup) │ │ │
│ │ │ ✅ LF004: TradingStateRepository (DynamoDB state) │ │ │
│ │ │ ✅ LF005: HealthReporter (container health) │ │ │
│ │ │ ✅ LM003: FreqAIConfigBuilder (inference config) │ │ │
│ │ │ │ │ │
│ │ │ NOT IMPLEMENTED (Blockers): │ │ │
│ │ │ ❌ LF006: ECS Service definition │ │ │
│ │ │ ❌ LF007: Exchange credentials (Secrets Manager) │ │ │
│ │ │ ❌ LF008: TradingHandler wiring │ │ │
│ │ │ │ │ │
│ │ │ FLOW: StrategyConfigLoader → Load MLflow Model → FreqAI Inference (no training) │ │ │
│ │ └──────────────────────────────────────────────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 5: RETRAINING & MONITORING (❌ NOT IMPLEMENTED) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Drift │───►│ Retrain │───►│ Model │───►│ A/B │───►│ Auto │ │ │
│ │ │ Detection │ │ Trigger │ │ Training │ │ Testing │ │ Promotion │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ ❌ MISSING │ │ ❌ MISSING │ │ ❌ MISSING │ │ ❌ MISSING │ │ ❌ MISSING │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ Required Tasks: ML001-ML008 (see Implementation Plan below) │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Component Detail¶
1. FreqAI Integration Architecture¶
TradAI uses FreqAI (Freqtrade's ML framework) for model training and inference:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FreqAI Architecture in TradAI │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Strategy Layer │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ RadStrategy (extends TradAIStrategy) │ │
│ │ ├── feature_engineering_expand_all() → Define %-prefixed features │ │
│ │ ├── set_freqai_targets() → Define &-prefixed targets │ │
│ │ ├── populate_indicators() → self.freqai.start() │ │
│ │ └── populate_entry_trend() → Use &-predictions │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ FreqAI Layer (Freqtrade) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ --freqaimodel selection: │ │
│ │ ├── LightGBMRegressor (built-in) → Train during backtest │ │
│ │ ├── CatBoostRegressor (built-in) → Train during backtest │ │
│ │ ├── TradAIPredictionModel (custom) → Load from CSV/MLflow │ │
│ │ │ ├── train() → NO-OP (skip training) │ │
│ │ │ └── predict() → Load predictions from external source │ │
│ │ └── TradAIMLflowModel (custom) → Load from MLflow registry │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Mode Selection │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ TRAINING MODE (not implemented): │ │
│ │ freqtrade backtesting --freqaimodel LightGBMRegressor │ │
│ │ → FreqAI trains models during walk-forward backtest │ │
│ │ → Models saved to user_data/models/ │ │
│ │ → Manual upload to MLflow required │ │
│ │ │ │
│ │ INFERENCE MODE (implemented): │ │
│ │ freqtrade backtesting --freqaimodel TradAIPredictionModel │ │
│ │ → Loads pre-trained model from MLflow │ │
│ │ → Runs predict() without training │ │
│ │ → Used for backtesting and live trading │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
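To make the Strategy Layer hooks above concrete, the sketch below shows the general shape of the four hooks. It is illustrative only: the real RadStrategy extends TradAIStrategy, and the specific indicators, target definition, entry threshold, and stoploss here are placeholder choices, not the project's actual feature set.

```python
# Illustrative sketch of the FreqAI strategy hooks (not the real RadStrategy).
import talib.abstract as ta
from pandas import DataFrame
from freqtrade.strategy import IStrategy


class RadStrategySketch(IStrategy):  # TradAI's RadStrategy extends TradAIStrategy instead
    timeframe = "1h"
    stoploss = -0.10

    def feature_engineering_expand_all(self, dataframe: DataFrame, period: int,
                                       metadata: dict, **kwargs) -> DataFrame:
        # %-prefixed columns are picked up by FreqAI as model features
        dataframe["%-rsi-period"] = ta.RSI(dataframe, timeperiod=period)
        dataframe["%-pct-change"] = dataframe["close"].pct_change()
        return dataframe

    def set_freqai_targets(self, dataframe: DataFrame, metadata: dict, **kwargs) -> DataFrame:
        # &-prefixed columns are the prediction targets
        label_period = self.freqai_info["feature_parameters"]["label_period_candles"]
        dataframe["&-target"] = (
            dataframe["close"].shift(-label_period) / dataframe["close"] - 1
        )
        return dataframe

    def populate_indicators(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        # Hands the dataframe to FreqAI, which appends the prediction columns
        return self.freqai.start(dataframe, metadata, self)

    def populate_entry_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        # Enter long when the model is trusted (do_predict == 1) and predicts an up-move
        dataframe.loc[
            (dataframe["do_predict"] == 1) & (dataframe["&-target"] > 0.01),
            "enter_long",
        ] = 1
        return dataframe

    def populate_exit_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        return dataframe
```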
2. Custom FreqAI Model: TradAIPredictionModel¶
# libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
class TradAIPredictionModel(IFreqaiModel):
    """Custom FreqAI model for inference-only mode.

    Loads predictions from CSV or MLflow registry instead of training.
    """

    def train(self, unfiltered_df, pair, dk, **kwargs) -> None:
        """No training - return immediately."""
        self.logger.debug(f"Skipping training for {pair}")
        return None  # NO-OP

    def predict(self, unfiltered_df, dk, **kwargs) -> tuple[np.ndarray, np.ndarray]:
        """Load predictions from external source."""
        source = self._get_prediction_source()
        if source == "csv":
            return self._load_csv_predictions(dk)
        elif source == "mlflow":
            return self._load_mlflow_predictions(dk)
        else:
            raise ValueError(f"Unknown source: {source}")
Key Files:
- libs/tradai-strategy/src/tradai/strategy/freqai/prediction_model.py
- libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py
3. MLflow Model Registry Integration¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MLflow Model Lifecycle │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Registration Flow: │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Backtest │───►│ Manual │───►│ Model │ │
│ │ Complete │ │ Upload │ │ Registry │ │
│ │ │ │ to MLflow │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Model Registry Schema: │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ RadStrategy │ │
│ │ ├── Version 1 │ │
│ │ │ ├── Stage: Archived │ │
│ │ │ ├── Tags: timeframe=1h, pairs=BTC/USDT:USDT │ │
│ │ │ └── Artifacts: model.pkl, config.json │ │
│ │ ├── Version 2 │ │
│ │ │ ├── Stage: Staging │ │
│ │ │ ├── Tags: timeframe=1h, sharpe=1.8 │ │
│ │ │ └── Artifacts: model.pkl, feature_importance.json │ │
│ │ └── Version 3 │ │
│ │ ├── Stage: Production │ │
│ │ ├── Tags: timeframe=1h, sharpe=2.1, promoted_at=2025-12-15 │ │
│ │ └── Artifacts: model.pkl, backtest_results.json │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Live Trading Model Loading: │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Strategy │───►│ MLflow │───►│ FreqAI │ │
│ │ Container │ │ Registry │ │ Inference │ │
│ │ │ │ │ │ │ │
│ │ ENV: │ │ Query: │ │ Model │ │
│ │ STRATEGY_ │ │ Production │ │ predict() │ │
│ │ STAGE=Prod │ │ Stage │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implemented Methods (MLflowAdapter):
- create_model_version(name, source, run_id, tags) ✅
- transition_model_version_stage(name, version, stage) ✅
- get_registered_model(name) ✅
- search_registered_models(filter_string) ✅
- load_model(model_uri) ✅
- delete_model_version(name, version) ✅
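For orientation, these adapter methods correspond closely to the standard MlflowClient API. The snippet below shows the underlying MLflow calls they presumably wrap; the tracking URI, model name, and run id are placeholders taken from examples elsewhere in this document, and this is not the adapter's actual implementation.

```python
# Raw MLflow equivalents of the MLflowAdapter methods listed above (illustrative).
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
client = MlflowClient()

# create_model_version(name, source, run_id, tags)
version = client.create_model_version(
    name="RadStrategy",
    source="runs:/abc123/model",
    run_id="abc123",
    tags={"timeframe": "1h"},
)

# transition_model_version_stage(name, version, stage)
client.transition_model_version_stage(
    name="RadStrategy", version=version.version, stage="Staging"
)

# get_registered_model / search_registered_models
model = client.get_registered_model("RadStrategy")
candidates = client.search_registered_models(filter_string="name LIKE 'Rad%'")

# load_model(model_uri) - stage-based URIs are what live containers resolve
loaded = mlflow.pyfunc.load_model("models:/RadStrategy/Production")

# delete_model_version(name, version)
client.delete_model_version(name="RadStrategy", version="1")
```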
4. FreqAI Configuration Builder¶
# libs/tradai-strategy/src/tradai/strategy/freqai/config_builder.py
class FreqAIConfigBuilder:
    """Build FreqAI configuration for different modes."""

    def build_inference_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for inference-only mode (live trading)."""
        return {
            "freqai": {
                "enabled": True,
                "live_retrain_hours": 0,  # No retraining during live
                "fit_live_predictions_candles": 0,
                "purge_old_models": False,
                "model_training_parameters": {},
                "feature_parameters": self._build_feature_params(strategy_metadata),
                "data_split_parameters": {"test_size": 0},
            }
        }

    def build_training_config(self, strategy_metadata: StrategyMetadata) -> dict:
        """Build config for training mode (NOT YET IMPLEMENTED)."""
        raise NotImplementedError("Training mode requires ML001-ML003")
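For context, the block returned by build_inference_config() is merged into the container's Freqtrade configuration together with the inference-only model selection. A hedged usage sketch, where the surrounding keys and the strategy_metadata instance are placeholders:

```python
# Illustrative merge of the inference-only FreqAI block into a Freqtrade config.
builder = FreqAIConfigBuilder()
freqai_block = builder.build_inference_config(strategy_metadata)  # StrategyMetadata from LF002

freqtrade_config = {
    "strategy": "RadStrategy",
    "freqaimodel": "TradAIPredictionModel",  # inference-only, no training (see above)
    "timeframe": "1h",
    **freqai_block,  # contributes the "freqai" section with live_retrain_hours=0
}
```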
Gap Analysis: What's Missing¶
Critical Missing Components¶
| ID | Component | Purpose | Blocks |
|---|---|---|---|
| ML001 | TradingMode.TRAIN | Add TRAIN mode to TradingMode enum | ML002 |
| ML002 | TrainingHandler | Orchestrate model training via Freqtrade | ML003, ML005 |
| ML003 | MLflow Model Upload | Auto-upload trained models to registry | MO001 |
| ML004 | ML Hyperparameter Optimization | Optuna for model hyperparams | - |
| ML005 | TrainingPredictionModel | FreqAI model that actually trains (walk-forward) | - |
| MO001 | Retraining Scheduler | EventBridge + Lambda for scheduled retraining | MO002 |
| MO002 | Model Drift Monitor | Track prediction accuracy vs actuals | MO003 |
| MO003 | Model Comparison Service | A/B testing between model versions | MO004 |
| MO004 | Model Rollback Service | Auto-rollback on performance degradation | - |
Walk-Forward Backtesting (ML005)¶
Current Problem: TradAIPredictionModel.train() is a NO-OP, so it cannot perform walk-forward validation (retraining inside the backtest window).
FreqAI Native Walk-Forward:
{
  "freqai": {
    "train_period_days": 30,      // Train on 30 days of data
    "backtest_period_days": 7,    // Retrain every 7 days
    "data_split_parameters": {
      "test_size": 0.1,
      "shuffle": false
    }
  }
}
This enables progressive training during backtesting:
1. Train model on days 1-30
2. Predict on days 31-37
3. Retrain on days 8-37 (sliding window)
4. Predict on days 38-44
5. Repeat until the end of the backtest range
Solution: Create a TrainingPredictionModel that extends IFreqaiModel with a real train() implementation (LightGBM, CatBoost, etc.), as sketched below.
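A minimal sketch of what ML005 could look like, modeled on Freqtrade's built-in LightGBMRegressor. The import paths and data_dictionary keys follow current Freqtrade/FreqAI conventions but should be verified against the pinned Freqtrade version; hyperparameters come from model_training_parameters in the FreqAI config.

```python
# Hypothetical ML005 sketch: a FreqAI model that actually trains on each
# walk-forward window, modeled on Freqtrade's built-in LightGBMRegressor.
from typing import Any, Dict

from lightgbm import LGBMRegressor
from freqtrade.freqai.base_models.BaseRegressionModel import BaseRegressionModel
from freqtrade.freqai.data_kitchen import FreqaiDataKitchen


class TrainingPredictionModel(BaseRegressionModel):
    """Trains a LightGBM regressor on each walk-forward training window."""

    def fit(self, data_dictionary: Dict, dk: FreqaiDataKitchen, **kwargs) -> Any:
        X = data_dictionary["train_features"]
        y = data_dictionary["train_labels"]

        # Optional hold-out set produced by data_split_parameters
        eval_set = None
        if len(data_dictionary["test_features"]) > 0:
            eval_set = [(data_dictionary["test_features"], data_dictionary["test_labels"])]

        # Hyperparameters come from "model_training_parameters" in the FreqAI config
        model = LGBMRegressor(**self.model_training_parameters)
        model.fit(X=X, y=y, eval_set=eval_set)
        return model
```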
Component Dependency Graph¶
ML001 (TradingMode.TRAIN) ─────────────────────────────────────────────────────┐
│
▼
┌──────────────────────────────────────┴────────────┐
│ │
▼ ▼
ML002 (TrainingHandler) ────────────────► ML005 (TrainingPredictionModel)
│ Walk-forward backtesting
│
▼
ML003 (MLflow Upload) ────────► ML004 (Hyperopt - Optuna)
│
▼
MO001 (Retraining Scheduler)
│
▼
MO002 (Drift Monitor) ─────────────────────────────────────┐
│ │
▼ ▼
MO003 (A/B Testing) ◄───────────────────────────── MO004 (Rollback)
Implementation Plan¶
Phase 1: Training Pipeline (Estimated: 2-3 days)¶
ML001: Add TradingMode.TRAIN¶
# libs/tradai-common/src/tradai/common/entrypoint/settings.py
class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"
    TRAIN = "train"  # NEW: Training mode for ML models
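Because TradingMode is a str enum, the TRADING_MODE environment variable used elsewhere in this document parses directly into the new member:

```python
# Illustrative: TRADING_MODE=train resolves to the new enum member.
import os

mode = TradingMode(os.environ.get("TRADING_MODE", "backtest"))
assert TradingMode("train") is TradingMode.TRAIN
```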
ML002: TrainingHandler Implementation¶
# libs/tradai-common/src/tradai/common/entrypoint/training.py
class TrainingHandler(BaseEntrypointHandler):
    """Handle ML model training via Freqtrade."""

    def __init__(
        self,
        config_loader: StrategyConfigLoader,
        mlflow_adapter: MLflowAdapter,
        freqtrade_config_builder: FreqtradeConfigBuilder,
    ):
        self.config_loader = config_loader
        self.mlflow = mlflow_adapter
        self.config_builder = freqtrade_config_builder

    def run(self) -> int:
        """Execute training workflow."""
        # 1. Load strategy config
        config = self.config_loader.load_config(
            strategy_name=self.settings.strategy_name,
            stage="Staging",  # Train against staging config
        )

        # 2. Build Freqtrade training config
        ft_config = self.config_builder.build_training_config(config)

        # 3. Execute Freqtrade backtesting with real FreqAI model
        #    (LightGBM, CatBoost, etc. - NOT TradAIPredictionModel)
        result = self._execute_training(ft_config)

        # 4. Upload trained model to MLflow
        model_version = self._upload_to_mlflow(result)

        # 5. Log metrics
        self._log_training_metrics(result, model_version)
        return 0

    def _execute_training(self, config: dict) -> TrainingResult:
        """Run Freqtrade backtesting with actual training."""
        cmd = [
            "freqtrade", "backtesting",
            "--strategy", config["strategy"],
            "--freqaimodel", config["freqai"]["model"],  # LightGBMRegressor
            "--config", self._write_temp_config(config),
            "--timerange", config["timerange"],
        ]
        # ... execute and parse results

    def _upload_to_mlflow(self, result: TrainingResult) -> ModelVersion:
        """Upload trained model artifacts to MLflow registry."""
        with mlflow.start_run():
            # Log model artifacts under "model/" so the registered source URI below resolves
            mlflow.log_artifact(result.model_path, artifact_path="model")
            mlflow.log_artifact(result.feature_importance_path)

            # Log metrics
            mlflow.log_metrics({
                "sharpe_ratio": result.sharpe,
                "profit_total": result.profit,
                "max_drawdown": result.max_drawdown,
            })

            # Register model
            model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
            return mlflow.register_model(model_uri, self.settings.strategy_name)
ML003: MLflow Model Auto-Upload¶
Integrated into TrainingHandler above.
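One detail ML003 also needs to cover is tagging: the registry schema shown earlier relies on version tags (timeframe, sharpe, etc.) that register_model() does not set by itself. A hedged sketch using the standard MlflowClient calls, where `version` is the ModelVersion returned by _upload_to_mlflow() and the tag values are placeholders:

```python
# Illustrative post-registration step: tag the new version and move it to Staging
# so the live config loader (LF002) can discover it by tag. Values are placeholders.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.set_model_version_tag("RadStrategy", version.version, "timeframe", "1h")
client.set_model_version_tag("RadStrategy", version.version, "sharpe", "1.8")
client.transition_model_version_stage(
    name="RadStrategy", version=version.version, stage="Staging"
)
```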
Phase 2: Hyperparameter Optimization (Estimated: 2 days)¶
ML004: ML Hyperparameter Optimizer¶
# libs/tradai-strategy/src/tradai/strategy/training/hyperopt.py
class MLHyperparameterOptimizer:
    """Optimize ML model hyperparameters using Optuna."""

    def __init__(
        self,
        training_handler: TrainingHandler,
        mlflow_adapter: MLflowAdapter,
    ):
        self.training_handler = training_handler
        self.mlflow = mlflow_adapter

    def optimize(
        self,
        strategy_name: str,
        n_trials: int = 100,
        objective_metric: str = "sharpe_ratio",
    ) -> HyperoptResult:
        """Run hyperparameter optimization."""
        import optuna

        def objective(trial):
            # Sample hyperparameters
            params = {
                "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
                "max_depth": trial.suggest_int("max_depth", 3, 10),
                "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            }

            # Train with these params
            result = self.training_handler.run_with_params(params)

            # Log to MLflow
            with mlflow.start_run(nested=True):
                mlflow.log_params(params)
                mlflow.log_metric(objective_metric, result.metrics[objective_metric])

            return result.metrics[objective_metric]

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials)

        return HyperoptResult(
            best_params=study.best_params,
            best_value=study.best_value,
            n_trials=n_trials,
        )
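A possible driver for the optimizer is sketched below; wrapping study.optimize() in a parent MLflow run groups the nested per-trial runs under a single parent (the handler/adapter instances, run name, and trial count are illustrative).

```python
# Illustrative driver for ML004.
import mlflow

optimizer = MLHyperparameterOptimizer(training_handler, mlflow_adapter)

with mlflow.start_run(run_name="rad-strategy-ml-hyperopt"):
    result = optimizer.optimize(
        strategy_name="RadStrategy",
        n_trials=50,
        objective_metric="sharpe_ratio",
    )

print(result.best_params, result.best_value)
```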
Phase 3: Retraining & Monitoring (Estimated: 3-4 days)¶
MO001: Retraining Scheduler¶
# lambdas/retrain_scheduler/handler.py
def handler(event, context):
    """EventBridge-triggered retraining scheduler."""
    dynamodb = boto3.resource("dynamodb")
    ecs = boto3.client("ecs")

    # Get strategies that need retraining
    strategies = get_strategies_for_retraining(dynamodb)

    for strategy in strategies:
        # Check if retraining is due
        if should_retrain(strategy):
            # Launch ECS training task
            ecs.run_task(
                cluster=os.environ["ECS_CLUSTER"],
                taskDefinition=f"tradai-train-{strategy.name}",
                launchType="FARGATE",
                overrides={
                    "containerOverrides": [{
                        "name": "trainer",
                        "environment": [
                            {"name": "TRADING_MODE", "value": "train"},
                            {"name": "STRATEGY_NAME", "value": strategy.name},
                        ]
                    }]
                }
            )

            # Update last_retrain timestamp
            update_retrain_timestamp(dynamodb, strategy)


def should_retrain(strategy) -> bool:
    """Check if strategy should be retrained."""
    # Option 1: Time-based (weekly/monthly)
    days_since_last = (datetime.utcnow() - strategy.last_retrain).days
    if days_since_last >= strategy.retrain_interval_days:
        return True

    # Option 2: Drift-based (if drift monitor triggered)
    if strategy.drift_detected:
        return True

    return False
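get_strategies_for_retraining() and update_retrain_timestamp() are left abstract above. As a rough sketch, the former could read the per-strategy RETRAIN items from the tradai-ml-config table described later in this document; the table name, key shape, and SimpleNamespace packaging below are assumptions for illustration.

```python
# Hypothetical sketch of get_strategies_for_retraining() against the
# tradai-ml-config table (pk="CONFIG#<strategy>", sk="RETRAIN").
import os
from datetime import datetime
from types import SimpleNamespace

from boto3.dynamodb.conditions import Attr


def get_strategies_for_retraining(dynamodb):
    table = dynamodb.Table(os.environ.get("ML_CONFIG_TABLE", "tradai-ml-config"))
    response = table.scan(FilterExpression=Attr("sk").eq("RETRAIN"))

    strategies = []
    for item in response["Items"]:
        last = item.get("last_retrain")
        strategies.append(SimpleNamespace(
            name=item["pk"].removeprefix("CONFIG#"),
            retrain_interval_days=int(item["retrain_interval_days"]),
            last_retrain=datetime.fromisoformat(last) if last else datetime.min,
            drift_detected=bool(item.get("drift_detected", False)),
        ))
    return strategies
```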
MO002: Model Drift Monitor¶
# libs/tradai-common/src/tradai/common/monitoring/drift.py
class ModelDriftMonitor:
    """Monitor model prediction accuracy and detect drift."""

    def __init__(
        self,
        dynamodb_table: str,
        cloudwatch_namespace: str,
    ):
        self.dynamodb = boto3.resource("dynamodb").Table(dynamodb_table)
        self.cloudwatch = boto3.client("cloudwatch")
        self.cloudwatch_namespace = cloudwatch_namespace

    def record_prediction(
        self,
        strategy_name: str,
        prediction: float,
        actual: float,
        timestamp: datetime,
    ):
        """Record prediction vs actual for drift analysis."""
        self.dynamodb.put_item(Item={
            "pk": f"PRED#{strategy_name}",
            "sk": timestamp.isoformat(),
            "prediction": Decimal(str(prediction)),
            "actual": Decimal(str(actual)),
            "error": Decimal(str(abs(prediction - actual))),
            "ttl": int((timestamp + timedelta(days=30)).timestamp()),
        })

    def calculate_drift_metrics(
        self,
        strategy_name: str,
        window_days: int = 7,
    ) -> DriftMetrics:
        """Calculate drift metrics over time window."""
        # Query predictions from DynamoDB
        predictions = self._query_predictions(strategy_name, window_days)

        # Calculate metrics (DynamoDB returns Decimal, so cast to float)
        mae = float(np.mean([float(p["error"]) for p in predictions]))
        baseline_mae = self._get_baseline_mae(strategy_name)
        drift_ratio = mae / baseline_mae if baseline_mae > 0 else 1.0

        # Publish to CloudWatch
        self.cloudwatch.put_metric_data(
            Namespace=self.cloudwatch_namespace,
            MetricData=[
                {
                    "MetricName": "ModelDriftRatio",
                    "Dimensions": [{"Name": "Strategy", "Value": strategy_name}],
                    "Value": drift_ratio,
                    "Unit": "None",
                }
            ]
        )

        return DriftMetrics(
            mae=mae,
            baseline_mae=baseline_mae,
            drift_ratio=drift_ratio,
            drift_detected=drift_ratio > 1.5,  # 50% degradation threshold
        )
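In operation, the live strategy (or a post-trade reconciliation step) would call record_prediction() once an outcome is known, and the drift-monitor Lambda would call calculate_drift_metrics() on a schedule. A hedged usage sketch; the predictions table name is an assumption, while the namespace reuses TradAI/MLOps from the metrics table below.

```python
# Illustrative usage of ModelDriftMonitor.
from datetime import datetime, timezone

monitor = ModelDriftMonitor(
    dynamodb_table="tradai-predictions",   # assumed table name
    cloudwatch_namespace="TradAI/MLOps",
)

# Called when a prediction's realized outcome becomes known
monitor.record_prediction(
    strategy_name="RadStrategy",
    prediction=0.012,
    actual=0.004,
    timestamp=datetime.now(timezone.utc),
)

# Called periodically (e.g. by the drift-monitor Lambda)
metrics = monitor.calculate_drift_metrics("RadStrategy", window_days=7)
if metrics.drift_detected:
    print(f"Drift ratio {metrics.drift_ratio:.2f} exceeds threshold - flag for retraining")
```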
MO003: Model Comparison Service¶
# services/strategy-service/src/tradai/strategy_service/core/comparison_service.py
class ModelComparisonService:
    """Compare performance between model versions."""

    def compare_versions(
        self,
        strategy_name: str,
        version_a: str,
        version_b: str,
        timerange: str,
    ) -> ComparisonResult:
        """Run side-by-side backtest comparison."""
        # Run backtest with version A
        result_a = self._run_backtest(strategy_name, version_a, timerange)

        # Run backtest with version B
        result_b = self._run_backtest(strategy_name, version_b, timerange)

        # Calculate comparison metrics
        return ComparisonResult(
            version_a=version_a,
            version_b=version_b,
            sharpe_diff=result_b.sharpe - result_a.sharpe,
            profit_diff=result_b.profit - result_a.profit,
            drawdown_diff=result_b.max_drawdown - result_a.max_drawdown,
            recommendation=self._recommend_version(result_a, result_b),
        )

    def shadow_test(
        self,
        strategy_name: str,
        production_version: str,
        candidate_version: str,
        duration_days: int = 7,
    ) -> ShadowTestConfig:
        """Configure shadow testing for candidate model."""
        return ShadowTestConfig(
            strategy_name=strategy_name,
            production_version=production_version,
            candidate_version=candidate_version,
            start_time=datetime.utcnow(),
            end_time=datetime.utcnow() + timedelta(days=duration_days),
            metrics_to_compare=["sharpe_ratio", "profit_total", "win_rate"],
        )
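_recommend_version() is left open above; a simple rule-of-thumb sketch, assuming max_drawdown is reported as a positive fraction and using an illustrative 10% drawdown tolerance:

```python
# Hypothetical recommendation rule for ModelComparisonService._recommend_version().
def _recommend_version(self, result_a, result_b) -> str:
    # Prefer the candidate (B) only if it improves Sharpe without
    # materially worsening drawdown; otherwise keep A.
    sharpe_improves = result_b.sharpe > result_a.sharpe
    drawdown_acceptable = result_b.max_drawdown <= result_a.max_drawdown * 1.1
    return "version_b" if (sharpe_improves and drawdown_acceptable) else "version_a"
```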
MO004: Model Rollback Service¶
# services/strategy-service/src/tradai/strategy_service/core/rollback_service.py
class ModelRollbackService:
    """Handle model version rollback."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        ecs_client: ECSClient,
    ):
        self.mlflow = mlflow_adapter
        self.ecs = ecs_client

    def rollback(
        self,
        strategy_name: str,
        target_version: Optional[str] = None,
        reason: str = "",
    ) -> RollbackResult:
        """Rollback to previous model version."""
        # Get current production version
        current = self.mlflow.get_model_version(
            name=strategy_name,
            stage="Production",
        )

        # Determine target version (previous or specified)
        if target_version is None:
            target_version = self._get_previous_version(strategy_name, current.version)

        # Transition stages
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=current.version,
            stage="Archived",
        )
        self.mlflow.transition_model_version_stage(
            name=strategy_name,
            version=target_version,
            stage="Production",
        )

        # Restart live trading container to pick up new model
        self._restart_strategy_container(strategy_name)

        # Log rollback event
        self._log_rollback_event(strategy_name, current.version, target_version, reason)

        return RollbackResult(
            previous_version=current.version,
            new_version=target_version,
            timestamp=datetime.utcnow(),
            reason=reason,
        )

    def auto_rollback_on_drift(
        self,
        strategy_name: str,
        drift_threshold: float = 1.5,
    ):
        """Configure automatic rollback when drift exceeds threshold."""
        # Create CloudWatch alarm
        cloudwatch = boto3.client("cloudwatch")
        cloudwatch.put_metric_alarm(
            AlarmName=f"{strategy_name}-drift-rollback",
            MetricName="ModelDriftRatio",
            Namespace="TradAI/MLOps",
            Dimensions=[{"Name": "Strategy", "Value": strategy_name}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,
            Threshold=drift_threshold,
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=[os.environ["ROLLBACK_SNS_TOPIC"]],
        )
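auto_rollback_on_drift() only creates the alarm; something still has to react to the SNS notification. A hedged sketch of the subscriber Lambda, assuming the alarm-name convention above and a factory (build_rollback_service) that wires up the MLflow and ECS clients:

```python
# Hypothetical SNS-subscribed Lambda that triggers a rollback when the
# "<strategy>-drift-rollback" CloudWatch alarm fires.
import json


def handler(event, context):
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])   # CloudWatch alarm notification payload
        alarm_name = alarm["AlarmName"]                # e.g. "RadStrategy-drift-rollback"
        strategy_name = alarm_name.removesuffix("-drift-rollback")

        rollback_service = build_rollback_service()    # assumed factory for MLflowAdapter/ECSClient
        rollback_service.rollback(
            strategy_name=strategy_name,
            reason=f"CloudWatch alarm {alarm_name} entered ALARM state",
        )
```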
Infrastructure Requirements¶
New AWS Resources¶
| Resource | Purpose | Cost/Month |
|---|---|---|
| ECS Task Definition (Training) | Run training tasks | Pay-per-use |
| EventBridge Rule | Trigger scheduled retraining | ~$1 |
| Lambda (Retrain Scheduler) | Orchestrate retraining | ~$1 |
| Lambda (Drift Monitor) | Calculate drift metrics | ~$1 |
| DynamoDB Table (Predictions) | Store prediction history | ~$5 |
| CloudWatch Alarms | Drift detection alerts | ~$1 |
| SNS Topic | Rollback notifications | ~$0.50 |
Total Additional Cost: ~$10/month
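Most of these resources are the Pulumi IaC work flagged in the status note at the top of this document. A minimal Pulumi sketch of that wiring is shown below; the resource names, runtime, weekly schedule, IAM policy scope, and table schema are illustrative assumptions, not the project's actual Pulumi program.

```python
# Hypothetical Pulumi sketch: schedule the existing retraining Lambda with EventBridge.
import json

import pulumi
import pulumi_aws as aws

# DynamoDB table for per-strategy retraining config/state (illustrative schema)
ml_config_table = aws.dynamodb.Table(
    "tradai-ml-config",
    billing_mode="PAY_PER_REQUEST",
    hash_key="pk",
    range_key="sk",
    attributes=[
        aws.dynamodb.TableAttributeArgs(name="pk", type="S"),
        aws.dynamodb.TableAttributeArgs(name="sk", type="S"),
    ],
)

# Execution role for the scheduler Lambda (DynamoDB/ECS access policies omitted here)
lambda_role = aws.iam.Role(
    "retraining-lambda-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
        }],
    }),
)

# Package the existing handler code (lambdas/retraining-scheduler/handler.py)
retrain_fn = aws.lambda_.Function(
    "retraining-scheduler",
    runtime="python3.12",
    handler="handler.handler",
    role=lambda_role.arn,
    code=pulumi.FileArchive("./lambdas/retraining-scheduler"),
    environment=aws.lambda_.FunctionEnvironmentArgs(
        variables={"ML_CONFIG_TABLE": ml_config_table.name},
    ),
)

# Weekly EventBridge schedule that invokes the scheduler
schedule = aws.cloudwatch.EventRule(
    "retraining-schedule",
    schedule_expression="rate(7 days)",
)
aws.cloudwatch.EventTarget(
    "retraining-schedule-target",
    rule=schedule.name,
    arn=retrain_fn.arn,
)
aws.lambda_.Permission(
    "allow-eventbridge-invoke",
    action="lambda:InvokeFunction",
    function=retrain_fn.name,
    principal="events.amazonaws.com",
    source_arn=schedule.arn,
)
```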
ECS Task Definition for Training¶
Resource: AWS::ECS::TaskDefinition
Family: tradai-train-{strategy}
Cpu: 2048      # More CPU for training
Memory: 4096   # More memory for model training
NetworkMode: awsvpc
RequiresCompatibilities:
  - FARGATE
ContainerDefinitions:
  - Name: trainer
    Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-{strategy}:latest
    Essential: true
    Environment:
      - Name: TRADING_MODE
        Value: train
      - Name: STRATEGY_NAME
        Value: {strategy}
      - Name: MLFLOW_TRACKING_URI
        Value: http://mlflow.internal:5000
      - Name: FREQAI_MODEL
        Value: LightGBMRegressor  # Actual training model
      - Name: TRAIN_PERIOD_DAYS
        Value: "30"
      - Name: BACKTEST_PERIOD_DAYS
        Value: "7"
Configuration Schema¶
FreqAI Training Configuration¶
{
  "freqai": {
    "enabled": true,
    "model": "LightGBMRegressor",
    "train_period_days": 30,
    "backtest_period_days": 7,
    "identifier": "rad-strategy-v3",
    "model_training_parameters": {
      "n_estimators": 500,
      "learning_rate": 0.05,
      "max_depth": 7,
      "min_child_samples": 20
    },
    "feature_parameters": {
      "label_period_candles": 24,
      "include_timeframes": ["1h", "4h"],
      "indicator_periods_candles": [10, 20, 50]
    },
    "data_split_parameters": {
      "test_size": 0.15,
      "shuffle": false
    }
  }
}
Retraining Schedule Configuration¶
# DynamoDB: tradai-ml-config
{
"pk": "CONFIG#RadStrategy",
"sk": "RETRAIN",
"retrain_interval_days": 7,
"retrain_on_drift": true,
"drift_threshold": 1.5,
"auto_promote": false,
"shadow_test_days": 3,
"rollback_enabled": true,
"notification_emails": ["team@tradai.io"]
}
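Note that when this item is written through the boto3 resource API, non-integer numbers such as drift_threshold must be passed as Decimal rather than float. A minimal sketch (the table name is reused from the comment above):

```python
# Illustrative write of the retraining config item; boto3's resource API
# requires Decimal for non-integer numbers such as drift_threshold.
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("tradai-ml-config")
table.put_item(Item={
    "pk": "CONFIG#RadStrategy",
    "sk": "RETRAIN",
    "retrain_interval_days": 7,
    "retrain_on_drift": True,
    "drift_threshold": Decimal("1.5"),
    "auto_promote": False,
    "shadow_test_days": 3,
    "rollback_enabled": True,
    "notification_emails": ["team@tradai.io"],
})
```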
Monitoring & Alerting¶
CloudWatch Metrics¶
| Metric | Namespace | Description |
|---|---|---|
| ModelDriftRatio | TradAI/MLOps | Ratio of current MAE to baseline |
| TrainingDuration | TradAI/MLOps | Time to train model (seconds) |
| ModelVersion | TradAI/MLOps | Current production model version |
| RetrainingCount | TradAI/MLOps | Number of retraining events |
| RollbackCount | TradAI/MLOps | Number of rollback events |
Alerts¶
| Alert | Threshold | Action |
|---|---|---|
| High Model Drift | DriftRatio > 1.5 | SNS → Auto-rollback |
| Training Failed | Duration > 2h or Exit != 0 | SNS → Page on-call |
| Rollback Triggered | Any rollback | SNS → Notify team |
Implementation Roadmap¶
Phase 1: Training Pipeline (Priority: HIGH)¶
| Task | Description | Est. Hours | Dependencies |
|---|---|---|---|
| ML001 | Add TradingMode.TRAIN | 2 | - |
| ML002 | Implement TrainingHandler | 8 | ML001 |
| ML003 | MLflow model auto-upload | 4 | ML002 |
Subtotal: 14 hours
Phase 2: Hyperparameter Optimization (Priority: MEDIUM)¶
| Task | Description | Est. Hours | Dependencies |
|---|---|---|---|
| ML004 | Optuna integration for ML hyperparams | 12 | ML002 |
Subtotal: 12 hours
Phase 3: Retraining & Monitoring (Priority: MEDIUM)¶
| Task | Description | Est. Hours | Dependencies |
|---|---|---|---|
| MO001 | Retraining scheduler (EventBridge + Lambda) | 8 | ML002 |
| MO002 | Model drift monitor | 10 | - |
| MO003 | Model comparison service | 8 | - |
| MO004 | Model rollback service | 6 | MO003 |
Subtotal: 32 hours
Total Effort: ~58 hours (7-8 days)¶
Success Criteria¶
| Criterion | Validation |
|---|---|
| Training mode works | TRADING_MODE=train successfully trains and uploads model |
| Models auto-registered | Trained models appear in MLflow registry with correct tags |
| Hyperopt works | ml_hyperopt command optimizes model parameters |
| Scheduled retraining | EventBridge triggers retraining on schedule |
| Drift detected | CloudWatch alarm fires when drift > threshold |
| Rollback works | rollback command transitions model versions correctly |
| Live containers reload | Restarted containers pick up new production model |
Document Changelog¶
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-12-22 | Initial ML Lifecycle architecture document |
References¶
- 11-LIVE-TRADING.md - Live trading architecture with inference mode
- Freqtrade FreqAI Documentation - FreqAI official docs
- MLflow Model Registry - MLflow registry docs
Next Actions:
1. Complete LF007/LF008 to enable live inference (prerequisite)
2. Implement ML001-ML003 for the training pipeline
3. Add ML004 for hyperparameter optimization
4. Deploy MO001-MO004 for the full MLOps lifecycle