Model Drift Response Runbook

Procedures for responding to model drift alerts and managing model retraining.

flowchart TD
    Start["Drift Alert Received"] --> PSI{"PSI Score?"}
    PSI -->|"< 0.10"| Monitor["Monitor<br/>No action needed"]
    PSI -->|"0.10 - 0.25"| Assess["Assess Impact"]
    PSI -->|"> 0.25"| Critical["Critical Drift"]
    Assess --> A1["Check recent trades"]
    Assess --> A2["Compare model versions"]
    Assess --> A3["Review data freshness"]
    Critical --> C1["Pause live trading"]
    Critical --> C2["Trigger retraining"]
    Critical --> C3["Alert on-call"]
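The branching in the flowchart can be sketched as a small shell helper. This is an illustration only: the function name `classify_psi` is hypothetical, and treating a score that lands exactly on 0.10 or 0.25 as moderate is an assumption based on the ranges in this runbook, not a statement about how the drift monitor Lambda behaves.

```bash
# Hypothetical helper: map a PSI score to the severity bands used in this runbook.
# awk does the comparison because POSIX shell arithmetic is integer-only.
classify_psi() {
  awk -v psi="$1" 'BEGIN {
    if (psi < 0.10)       print "none"
    else if (psi <= 0.25) print "moderate"
    else                  print "significant"
  }'
}
```

For example, `classify_psi 0.18` prints `moderate`.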

Understanding PSI (Population Stability Index)
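
For reference, PSI is conventionally computed by binning a feature and comparing the share of observations in each bin between a reference sample and a current sample. The formula below is the standard definition. This runbook does not state how the drift monitor Lambda chooses bins or combines per-feature scores into an overall value, so the symbols are illustrative.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( a_i - e_i \right) \ln\!\left( \frac{a_i}{e_i} \right)
```

Here $e_i$ is the share of reference observations in bin $i$, $a_i$ is the share of current observations in the same bin, and $B$ is the number of bins. Every term is non-negative, so identical distributions give a PSI of 0 and larger shifts give larger values.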

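| PSI Range   | Severity    | Interpretation           | Action Required              |
|-------------|-------------|--------------------------|------------------------------|
| < 0.10      | None        | No significant drift     | Monitor only                 |
| 0.10 - 0.25 | Moderate    | Some drift detected      | Investigate, plan retraining |
| > 0.25      | Significant | Major distribution shift | Immediate action required    |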

Moderate Drift (PSI 0.10 - 0.25)

Symptoms

  • CloudWatch alarm: tradai-{env}-drift-moderate
  • SNS notification with severity "moderate"
  • Drift monitor Lambda detected PSI between 0.10 and 0.25

Assessment

  1. Check drift details in DynamoDB:

    aws dynamodb get-item \
      --table-name tradai-drift-state-${ENVIRONMENT} \
      --key '{"model_name": {"S": "MODEL_NAME"}}'
    

  2. Review recent backtest metrics:

    # Check MLflow for recent experiment runs
    # Compare current performance to reference period
    

  3. Identify affected metrics:

       • Which features have the highest PSI contribution?
       • Are market conditions significantly different?
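
Step 2 above names MLflow but gives no command. A minimal sketch using the MLflow REST endpoint `runs/search` follows. The helper name `recent_runs` and the `EXPERIMENT_ID` placeholder are hypothetical, the base URL assumes the local dev port mapping (5001) mentioned elsewhere in this runbook, and the metric names to compare depend on how the training jobs log them.

```bash
# Assumption: MLflow is reachable on the local dev port used in this runbook.
MLFLOW_URL="${MLFLOW_URL:-http://localhost:5001}"

# Hypothetical helper: fetch the most recent runs of one experiment.
# $1 = MLflow experiment ID (placeholder, not given in this runbook)
# $2 = number of runs to return (default 5)
recent_runs() {
  curl -s -X POST "${MLFLOW_URL}/api/2.0/mlflow/runs/search" \
    -H "Content-Type: application/json" \
    -d "{\"experiment_ids\": [\"$1\"], \"max_results\": ${2:-5}, \"order_by\": [\"attributes.start_time DESC\"]}"
}
```

Usage: `recent_runs EXPERIMENT_ID 5 | jq '.runs[].data.metrics'` lists the logged metrics of each run so they can be compared against the reference period.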

Response Options

Option 1: Continue monitoring

  • If drift is minor and the model still performs acceptably
  • Set a reminder to check again in 24 hours

Option 2: Schedule retraining

# Trigger retraining via Lambda
# (AWS CLI v2: add --cli-binary-format raw-in-base64-out so the raw JSON payload is accepted)
aws lambda invoke \
  --function-name tradai-retraining-scheduler-${ENVIRONMENT} \
  --payload '{"models": [{"name": "MODEL_NAME"}], "force": false}' \
  output.json

Option 3: Reduce stake amount

  • Reduce exposure while monitoring
  • Update strategy config in DynamoDB
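
The Lambda invocation in Option 2 can be wrapped so that the payload is built in one place and a function-level error is not missed. This is a sketch under assumptions: the helper names are hypothetical, the payload shape is copied from this runbook, and `--cli-binary-format` is an AWS CLI v2 option that must be dropped on CLI v1.

```bash
# Hypothetical helper: build the scheduler payload shown in this runbook.
# $1 = model name, $2 = "true" to force retraining (default "false")
retrain_payload() {
  printf '{"models": [{"name": "%s"}], "force": %s}' "$1" "${2:-false}"
}

# Hypothetical helper: invoke the scheduler and fail if the function reported an error.
trigger_retraining() {
  err="$(aws lambda invoke \
    --function-name "tradai-retraining-scheduler-${ENVIRONMENT}" \
    --cli-binary-format raw-in-base64-out \
    --payload "$(retrain_payload "$1" "$2")" \
    --query 'FunctionError' --output text \
    output.json)" || return 1
  # With --output text, an absent FunctionError field prints as "None".
  [ "$err" = "None" ]
}
```

Usage: `trigger_retraining MODEL_NAME false && cat output.json`. A zero exit status only means the scheduler accepted the request. The retraining job itself still has to be checked separately.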


Significant Drift (PSI > 0.25)

Symptoms

  • CloudWatch alarm: tradai-{env}-drift-significant
  • SNS notification with severity "significant"
  • PSI exceeds 0.25 threshold

Immediate Actions

  1. Consider pausing trading (if not already paused):

    # Update trading state to paused
    aws dynamodb update-item \
      --table-name tradai-trading-state-${ENVIRONMENT} \
      --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
      --update-expression "SET #status = :paused" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":paused": {"S": "paused"}}'
    

  2. Trigger immediate retraining:

    # AWS CLI v2: add --cli-binary-format raw-in-base64-out so the raw JSON payload is accepted
    aws lambda invoke \
      --function-name tradai-retraining-scheduler-${ENVIRONMENT} \
      --payload '{"models": [{"name": "MODEL_NAME"}], "force": true}' \
      output.json
    

  3. Review open positions:

       • Check if any positions need manual intervention
       • Consider reducing exposure
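
Step 1 says to pause only if trading is not already paused. One way to sketch that check is to read the current status first. The helper names are hypothetical. The table, key, and `status` attribute are the ones used in this runbook, and the assumption is that `status` holds a plain string such as `running` or `paused`.

```bash
# Hypothetical helper: read the current status of one strategy.
trading_status() {
  aws dynamodb get-item \
    --table-name "tradai-trading-state-${ENVIRONMENT}" \
    --key "{\"strategy_id\": {\"S\": \"$1\"}}" \
    --projection-expression "#status" \
    --expression-attribute-names '{"#status": "status"}' \
    --query 'Item.status.S' --output text
}

# Hypothetical helper: pause the strategy unless it is already paused.
pause_if_running() {
  current="$(trading_status "$1")" || return 1
  if [ "$current" = "paused" ]; then
    echo "already paused"
    return 0
  fi
  aws dynamodb update-item \
    --table-name "tradai-trading-state-${ENVIRONMENT}" \
    --key "{\"strategy_id\": {\"S\": \"$1\"}}" \
    --update-expression "SET #status = :paused" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values '{":paused": {"S": "paused"}}' \
    && echo "paused (was: $current)"
}
```

The read and the write are two separate calls, so another writer could change the status in between. If that matters, a `--condition-expression` on the update is the atomic alternative.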

Investigation

  1. Analyze the drift:

       • Review MLflow experiment comparisons
       • Check if external events caused a market shift
       • Verify data quality (no gaps, correct sources)

  2. Check for data issues:

    # Verify data freshness via the Data Collection service API
    curl "http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT" | jq
    
    # Or from ECS (execute-command requires --interactive)
    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $DATA_TASK_ARN \
      --container data-collection \
      --interactive \
      --command "curl -s http://localhost:8002/api/v1/freshness?symbols=BTC/USDT:USDT"
    

  3. Review feature distributions:

       • Compare reference vs current feature distributions
       • Identify which features changed most
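
To compare reference and current distributions for a single feature by hand, PSI can be recomputed from binned shares as the sum over bins of (current - reference) × ln(current / reference). The sketch below is not the drift monitor's implementation. The helper name, the input format, and the choice to skip bins with a zero share are all assumptions made for illustration.

```bash
# Hypothetical helper: compute PSI from pre-binned proportions.
# Reads lines of "<reference_share> <current_share>" on stdin, one line per bin.
# Bins where either share is zero are skipped to avoid log(0) and division
# by zero; that handling is an assumption of this sketch.
psi_from_bins() {
  awk '
    $1 > 0 && $2 > 0 { psi += ($2 - $1) * log($2 / $1) }
    END { printf "%.4f\n", psi }
  '
}
```

For example, `printf '0.5 0.6\n0.5 0.4\n' | psi_from_bins` prints `0.0405`, which falls in the "no significant drift" band. Running this per feature gives a rough ranking of which features changed most.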

Model Rollback Procedure

If a newly trained model performs worse than expected:

  1. Identify previous model version:

    # List the latest version in each stage (get-latest-versions does not return older versions within a stage)
    curl "http://localhost:5001/api/2.0/mlflow/registered-models/get-latest-versions?name=MODEL_NAME" | jq
    # (Port 5001 is the local dev mapping. In AWS, use Service Discovery at `mlflow.tradai-{env}.local:5000`)
    
    # Or check S3 model artifacts
    aws s3 ls s3://tradai-mlflow-${ENVIRONMENT}/models/MODEL_NAME/ --recursive
    

  2. Transition model version in MLflow registry:

    # Transition the bad version to "Archived"
    curl -X POST "http://localhost:5001/api/2.0/mlflow/model-versions/transition-stage" \
      -H "Content-Type: application/json" \
      -d '{"name": "MODEL_NAME", "version": "BAD_VERSION", "stage": "Archived"}'
    # (Port 5001 is the local dev mapping. In AWS, use Service Discovery at `mlflow.tradai-{env}.local:5000`)
    
    # Transition the previous version back to "Production"
    curl -X POST "http://localhost:5001/api/2.0/mlflow/model-versions/transition-stage" \
      -H "Content-Type: application/json" \
      -d '{"name": "MODEL_NAME", "version": "PREVIOUS_VERSION", "stage": "Production"}'
    

  3. Restart the strategy service so it loads the restored model version:

    aws ecs update-service \
      --cluster tradai-${ENVIRONMENT} \
      --service tradai-strategy-service-${ENVIRONMENT} \
      --force-new-deployment
    

  4. Verify rollback:

       • Check strategy service logs for model loading
       • Verify predictions are using correct model version
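
Two parts of the rollback can be sketched as helpers: listing every registered version with its stage (useful for choosing `PREVIOUS_VERSION`) and waiting for the forced deployment to settle before verifying. The helper names are hypothetical, and the MLflow URL assumes the local dev mapping described above. In AWS the Service Discovery address would be used instead.

```bash
MLFLOW_URL="${MLFLOW_URL:-http://localhost:5001}"   # local dev mapping from this runbook

# Hypothetical helper: list every registered version of a model.
# Uses the MLflow REST endpoint model-versions/search; --data-urlencode
# encodes the quotes and "=" inside the filter expression.
model_versions() {
  curl -s --get "${MLFLOW_URL}/api/2.0/mlflow/model-versions/search" \
    --data-urlencode "filter=name='$1'"
}

# Hypothetical helper: block until the forced deployment has settled.
# "aws ecs wait services-stable" polls until the service reaches a steady
# state and returns non-zero if it times out.
wait_for_rollout() {
  aws ecs wait services-stable \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-strategy-service-${ENVIRONMENT}"
}
```

Usage: `model_versions MODEL_NAME | jq -r '.model_versions[] | "\(.version)\t\(.current_stage)"'` prints one line per version. After `wait_for_rollout` returns successfully, check the service logs for the model version that was loaded.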

Retraining Monitoring

DynamoDB tables used in drift response:

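| Table                          | Purpose                                                                              |
|--------------------------------|--------------------------------------------------------------------------------------|
| tradai-drift-state-{ENV}       | Current drift metrics per model                                                      |
| tradai-retraining-state-{ENV}  | Retraining job status and history                                                    |
| tradai-shadow-test-state-{ENV} | Shadow test results for retrained models (GSI: model_name-status_created_at-index)   |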

Check retraining status:

aws dynamodb get-item \
  --table-name tradai-retraining-state-${ENVIRONMENT} \
  --key '{"model_name": {"S": "MODEL_NAME"}}'

Check ECS training task:

aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --started-by tradai-retraining

# Get task details
aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks TASK_ARN
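
`list-tasks` returns only running tasks unless `--desired-status` is set, so a retraining task that has already finished or failed will not show up in the listing above. The sketch below chains the two commands and accepts the desired status as an argument. The helper name is hypothetical, and the `started-by` value is the one used in this runbook.

```bash
# Hypothetical helper: summarize retraining tasks in a given state.
# $1 = RUNNING (default) or STOPPED.
retraining_tasks() {
  arns="$(aws ecs list-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --started-by tradai-retraining \
    --desired-status "${1:-RUNNING}" \
    --query 'taskArns[]' --output text)" || return 1
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no ${1:-RUNNING} retraining tasks"
    return 0
  fi
  # $arns is intentionally unquoted: --tasks takes one argument per ARN.
  aws ecs describe-tasks \
    --cluster "tradai-${ENVIRONMENT}" \
    --tasks $arns \
    --query 'tasks[].[taskArn, lastStatus, stoppedReason, containers[0].exitCode]' \
    --output text
}
```

Usage: `retraining_tasks STOPPED` shows the stop reason and the exit code of the first container of each recently stopped task. ECS keeps stopped tasks visible only for a limited time, so older runs have to be traced through the logs instead.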

Check training logs:

aws logs tail /ecs/tradai/${ENVIRONMENT} --follow

Check shadow test results (after retraining):

# Query by model name using the GSI
aws dynamodb query \
  --table-name tradai-shadow-test-state-${ENVIRONMENT} \
  --index-name model_name-status_created_at-index \
  --key-condition-expression "model_name = :name" \
  --expression-attribute-values '{":name": {"S": "MODEL_NAME"}}' \
  --no-scan-index-forward \
  --limit 5

Post-Drift Resolution

  1. Reset drift state (after successful retraining):

    aws dynamodb update-item \
      --table-name tradai-drift-state-${ENVIRONMENT} \
      --key '{"model_name": {"S": "MODEL_NAME"}}' \
      --update-expression "SET is_drifted = :f, overall_psi = :psi" \
      --expression-attribute-values '{":f": {"BOOL": false}, ":psi": {"N": "0"}}'
    

  2. Resume trading (if it was paused):

    aws dynamodb update-item \
      --table-name tradai-trading-state-${ENVIRONMENT} \
      --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
      --update-expression "SET #status = :running" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":running": {"S": "running"}}'
    

  3. Monitor closely for the next 24-48 hours

  4. Update the incident log with resolution details
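
One caveat for the reset in step 1: `update-item` creates a new item when the key does not exist, so a mistyped model name would silently add a stray record. A guarded variant can be sketched with a condition expression. The helper name is hypothetical, and the attribute names are the ones used in this runbook.

```bash
# Hypothetical helper: reset drift state only if the item already exists.
# With the condition in place, an unknown model name makes the call fail
# with a ConditionalCheckFailedException instead of creating a new item.
reset_drift_state() {
  aws dynamodb update-item \
    --table-name "tradai-drift-state-${ENVIRONMENT}" \
    --key "{\"model_name\": {\"S\": \"$1\"}}" \
    --update-expression "SET is_drifted = :f, overall_psi = :psi" \
    --condition-expression "attribute_exists(model_name)" \
    --expression-attribute-values '{":f": {"BOOL": false}, ":psi": {"N": "0"}}'
}
```

Usage: `reset_drift_state MODEL_NAME`. The same guard, with `attribute_exists(strategy_id)`, could be applied to the trading-state update in step 2.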