Model Drift Response Runbook¶
Procedures for responding to model drift alerts and managing model retraining.
```mermaid
flowchart TD
    Start["Drift Alert Received"] --> PSI{"PSI Score?"}
    PSI -->|"< 0.10"| Monitor["Monitor<br/>No action needed"]
    PSI -->|"0.10 - 0.25"| Assess["Assess Impact"]
    PSI -->|"> 0.25"| Critical["Critical Drift"]
    Assess --> A1["Check recent trades"]
    Assess --> A2["Compare model versions"]
    Assess --> A3["Review data freshness"]
    Critical --> C1["Pause live trading"]
    Critical --> C2["Trigger retraining"]
    Critical --> C3["Alert on-call"]
```

Understanding PSI (Population Stability Index)¶
| PSI Range | Severity | Interpretation | Action Required |
|---|---|---|---|
| < 0.10 | None | No significant drift | Monitor only |
| 0.10 - 0.25 | Moderate | Some drift detected | Investigate, plan retraining |
| > 0.25 | Significant | Major distribution shift | Immediate action required |
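For context, PSI compares the reference (training-time) distribution of a feature with its current distribution. Assuming the drift monitor uses the standard binned definition:

$$
\mathrm{PSI} = \sum_{i=1}^{B} (a_i - e_i)\,\ln\frac{a_i}{e_i}
$$

where \(e_i\) is the expected (reference) proportion of observations in bin \(i\), \(a_i\) is the actual (current) proportion, and \(B\) is the number of bins. Each term is non-negative, so drift in any single bin can only raise the score.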
Moderate Drift (PSI 0.10 - 0.25)¶
Symptoms¶
- CloudWatch alarm: `tradai-{env}-drift-moderate`
- SNS notification with severity "moderate"
- Drift monitor Lambda detected PSI between 0.10 and 0.25
Assessment¶
1. Check drift details in DynamoDB.

2. Review recent backtest metrics.

3. Identify affected metrics:
    - Which features have the highest PSI contribution?
    - Are market conditions significantly different?
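The drift-details check can be sketched against the `tradai-drift-state-{ENV}` table listed under Retraining Monitoring; the `model_name` key schema is an assumption, mirroring the retraining-state table:

```bash
# Fetch the current drift record for one model
# (key schema model_name is an assumption; mirrors the retraining-state table)
aws dynamodb get-item \
  --table-name tradai-drift-state-${ENVIRONMENT} \
  --key '{"model_name": {"S": "MODEL_NAME"}}'
```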
Response Options¶
Option 1: Continue monitoring

- If drift is minor and the model still performs acceptably
- Set a reminder to check in 24 hours
Option 2: Schedule retraining
```bash
# Trigger retraining via Lambda
aws lambda invoke \
  --function-name tradai-retraining-scheduler-${ENVIRONMENT} \
  --payload '{"models": [{"name": "MODEL_NAME"}], "force": false}' \
  output.json
```
Option 3: Reduce stake amount

- Reduce exposure while monitoring
- Update strategy config in DynamoDB
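A minimal sketch of the stake reduction; the table name (`tradai-strategy-config-...`) and `stake_amount` attribute are assumptions, so verify the actual strategy config schema before running:

```bash
# Hypothetical: lower the stake amount on a strategy config item.
# Table name and stake_amount attribute are assumptions; check the real schema.
aws dynamodb update-item \
  --table-name tradai-strategy-config-${ENVIRONMENT} \
  --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
  --update-expression "SET stake_amount = :stake" \
  --expression-attribute-values '{":stake": {"N": "10"}}'
```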
Significant Drift (PSI > 0.25)¶
Symptoms¶
- CloudWatch alarm: `tradai-{env}-drift-significant`
- SNS notification with severity "significant"
- PSI exceeds 0.25 threshold
Immediate Actions¶
1. Consider pausing trading (if not already paused):

    ```bash
    # Update trading state to paused
    aws dynamodb update-item \
      --table-name tradai-trading-state-${ENVIRONMENT} \
      --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
      --update-expression "SET #status = :paused" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":paused": {"S": "paused"}}'
    ```
2. Trigger immediate retraining.

3. Review open positions:
    - Check if any positions need manual intervention
    - Consider reducing exposure
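The immediate retraining trigger can be sketched by reusing the retraining scheduler Lambda from the moderate-drift path with `force` set to `true`; that the scheduler treats `force: true` as "run now" is an assumption based on the earlier payload:

```bash
# Force an immediate retraining run via the scheduler Lambda
# (force: true semantics assumed from the moderate-drift invocation)
aws lambda invoke \
  --function-name tradai-retraining-scheduler-${ENVIRONMENT} \
  --payload '{"models": [{"name": "MODEL_NAME"}], "force": true}' \
  output.json
```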
Investigation¶
1. Analyze the drift:
    - Review MLflow experiment comparisons
    - Check if external events caused market shift

2. Verify data quality (no gaps, correct sources).

3. Check for data issues:

    ```bash
    # Verify data freshness via the Data Collection service API
    curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq

    # Or from ECS (execute-command requires --interactive)
    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $DATA_TASK_ARN \
      --container data-collection \
      --interactive \
      --command "curl -s http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT"
    ```

4. Review feature distributions:
    - Compare reference vs current feature distributions
    - Identify which features changed most
Model Rollback Procedure¶
If a newly trained model performs worse than expected:
1. Identify previous model version:

    ```bash
    # Check MLflow model registry for previous versions
    # (Port 5001 is the local dev mapping. In AWS, use Service Discovery at
    # mlflow.tradai-{env}.local:5000)
    curl "http://localhost:5001/api/2.0/mlflow/registered-models/get-latest-versions?name=MODEL_NAME" | jq

    # Or check S3 model artifacts
    aws s3 ls s3://tradai-mlflow-${ENVIRONMENT}/models/MODEL_NAME/ --recursive
    ```

2. Transition model version in MLflow registry:

    ```bash
    # Transition the bad version to "Archived"
    curl -X POST "http://localhost:5001/api/2.0/mlflow/model-versions/transition-stage" \
      -H "Content-Type: application/json" \
      -d '{"name": "MODEL_NAME", "version": "BAD_VERSION", "stage": "Archived"}'

    # Transition the previous version back to "Production"
    curl -X POST "http://localhost:5001/api/2.0/mlflow/model-versions/transition-stage" \
      -H "Content-Type: application/json" \
      -d '{"name": "MODEL_NAME", "version": "PREVIOUS_VERSION", "stage": "Production"}'
    ```
3. Restart strategy service to pick up new model.

4. Verify rollback:
    - Check strategy service logs for model loading
    - Verify predictions are using correct model version
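The strategy service restart can be sketched as a forced ECS deployment; the service name `tradai-strategy-${ENVIRONMENT}` is an assumption, so confirm it with `aws ecs list-services` first:

```bash
# Force a new deployment so the strategy service reloads the Production model
# (service name is an assumption; verify with aws ecs list-services)
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-strategy-${ENVIRONMENT} \
  --force-new-deployment
```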
Retraining Monitoring¶
DynamoDB tables used in drift response:
| Table | Purpose |
|---|---|
| `tradai-drift-state-{ENV}` | Current drift metrics per model |
| `tradai-retraining-state-{ENV}` | Retraining job status and history |
| `tradai-shadow-test-state-{ENV}` | Shadow test results for retrained models (GSI: `model_name-status_created_at-index`) |
Check retraining status:¶
```bash
aws dynamodb get-item \
  --table-name tradai-retraining-state-${ENVIRONMENT} \
  --key '{"model_name": {"S": "MODEL_NAME"}}'
```
Check ECS training task:¶
```bash
aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --started-by tradai-retraining

# Get task details
aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks TASK_ARN
```
Check training logs:¶
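A sketch for tailing the training task's CloudWatch logs; the log group name `/ecs/tradai-training-{env}` is an assumption, so check the training task definition's `awslogs` configuration for the real value:

```bash
# Tail training task logs (log group name is an assumption; verify it in the
# task definition's awslogs configuration)
aws logs tail /ecs/tradai-training-${ENVIRONMENT} --follow --since 1h
```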
Check shadow test results (after retraining):¶
```bash
# Query by model name using the GSI
aws dynamodb query \
  --table-name tradai-shadow-test-state-${ENVIRONMENT} \
  --index-name model_name-status_created_at-index \
  --key-condition-expression "model_name = :name" \
  --expression-attribute-values '{":name": {"S": "MODEL_NAME"}}' \
  --scan-index-forward false \
  --limit 5
```
Post-Drift Resolution¶
1. Reset drift state (after successful retraining).

2. Resume trading (if it was paused).

3. Monitor closely for the next 24-48 hours.

4. Update the incident log with resolution details.
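The first two steps can be sketched as follows. Resuming mirrors the pause command from the significant-drift section (the `"active"` status value is an assumption); resetting drift state by deleting the item assumes the monitor recreates it on its next run:

```bash
# Resume trading: flip the trading-state item back to "active"
# ("active" is an assumption; "paused" comes from the pause step)
aws dynamodb update-item \
  --table-name tradai-trading-state-${ENVIRONMENT} \
  --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
  --update-expression "SET #status = :active" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":active": {"S": "active"}}'

# Reset drift state by deleting the model's item (schema assumption:
# keyed by model_name; the drift monitor rewrites it on its next run)
aws dynamodb delete-item \
  --table-name tradai-drift-state-${ENVIRONMENT} \
  --key '{"model_name": {"S": "MODEL_NAME"}}'
```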