model-rollback

Rolls back a model to a previous version when performance degrades or alarms trigger.

Overview

| Property | Value |
|----------|-------|
| Trigger  | CloudWatch Alarm / Step Functions |
| Runtime  | Python 3.11 |
| Timeout  | 120 seconds |
| Memory   | 256 MB |

Input Schema

{
    "model_name": "PascalStrategy",      # Required
    "target_version": "2",               # Optional; defaults to the latest Archived version
    "reason": "PERFORMANCE_DEGRADATION", # Default: "MANUAL_REQUEST"
    "alarm_name": "high-drawdown-alarm", # Optional
    "dry_run": false                     # Default: false
}

Rollback Reasons:

  • MANUAL_REQUEST - Manual rollback request
  • PERFORMANCE_DEGRADATION - Performance metrics degraded
  • TEST_FAILURE - Post-deployment tests failed
  • DRIFT_SEVERE - Severe model drift detected
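
The input schema and its defaults can be expressed as a small parser. This is a minimal sketch, not the function's actual code: the names `RollbackReason`, `RollbackRequest`, and `parse_request` are hypothetical, and rejecting an unknown reason with `ValueError` is an assumption about how invalid input might be handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class RollbackReason(str, Enum):
    """The four documented rollback reasons."""
    MANUAL_REQUEST = "MANUAL_REQUEST"
    PERFORMANCE_DEGRADATION = "PERFORMANCE_DEGRADATION"
    TEST_FAILURE = "TEST_FAILURE"
    DRIFT_SEVERE = "DRIFT_SEVERE"


@dataclass(frozen=True)
class RollbackRequest:
    """Hypothetical typed view of the input event."""
    model_name: str
    target_version: Optional[str] = None  # None means "use latest Archived"
    reason: RollbackReason = RollbackReason.MANUAL_REQUEST
    alarm_name: Optional[str] = None
    dry_run: bool = False


def parse_request(event: Mapping[str, Any]) -> RollbackRequest:
    """Apply the documented defaults and reject malformed input."""
    model_name = event.get("model_name")
    if not isinstance(model_name, str) or not model_name:
        raise ValueError("model_name is required")
    target = event.get("target_version")
    return RollbackRequest(
        model_name=model_name,
        # Versions are strings in the schema; coerce ints for convenience.
        target_version=None if target is None else str(target),
        # Enum lookup raises ValueError for an unknown reason (assumed behavior).
        reason=RollbackReason(event.get("reason", "MANUAL_REQUEST")),
        alarm_name=event.get("alarm_name"),
        dry_run=bool(event.get("dry_run", False)),
    )
```

Calling `parse_request({"model_name": "PascalStrategy"})` yields a request with reason `MANUAL_REQUEST`, no target version, and `dry_run` set to `False`.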

Output Schema

{
    "rolled_back": true,
    "model_name": "PascalStrategy",
    "from_version": "3",                 # Previous Production version
    "to_version": "2",                   # New Production version
    "reason": "PERFORMANCE_DEGRADATION",
    "alarm_name": "high-drawdown-alarm",
    "timestamp": "2024-01-01T12:00:00Z"
}

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| MLFLOW_TRACKING_URI | Yes | - | MLflow server URL |
| MLFLOW_TRACKING_USERNAME | No | - | MLflow username |
| MLFLOW_TRACKING_PASSWORD | No | - | MLflow password |
| ROLLBACK_STATE_TABLE | No | "tradai-rollback-state" | DynamoDB state table |
| ROLLBACK_COOLDOWN_HOURS | No | 24 | Min hours between rollbacks |
| ALERT_SNS_TOPIC_ARN | Yes | - | SNS topic for notifications |
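
One way to load these variables with their documented defaults is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and failing fast with `RuntimeError` when a required variable is missing is an assumption, not confirmed behavior of the function.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    mlflow_tracking_uri: str
    alert_sns_topic_arn: str
    mlflow_tracking_username: Optional[str]
    mlflow_tracking_password: Optional[str]
    rollback_state_table: str
    rollback_cooldown_hours: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration, applying the defaults from the table above."""
    required = ("MLFLOW_TRACKING_URI", "ALERT_SNS_TOPIC_ARN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Assumed behavior: refuse to start without required variables.
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return Settings(
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
        alert_sns_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        mlflow_tracking_username=env.get("MLFLOW_TRACKING_USERNAME"),
        mlflow_tracking_password=env.get("MLFLOW_TRACKING_PASSWORD"),
        rollback_state_table=env.get("ROLLBACK_STATE_TABLE", "tradai-rollback-state"),
        rollback_cooldown_hours=int(env.get("ROLLBACK_COOLDOWN_HOURS", "24")),
    )
```

Passing the environment as a parameter keeps the loader testable without mutating `os.environ`.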

Rollback Process

flowchart TD
    A[Rollback Request] --> B{Dry run?}
    B -->|Yes| C[Return preview]
    B -->|No| D{Cooldown check}
    D -->|In cooldown| E[Reject: Too recent]
    D -->|OK| F[Archive current Production]
    F --> G[Promote target to Production]
    G --> H[Record rollback state]
    H --> I[Send SNS notification]
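
The flowchart can be read as the following control flow. This is a sketch under stated assumptions: every side effect (cooldown lookup, archive, promote, state write, notification) is injected as a callable, the `CooldownActive` exception is a hypothetical way to represent the "Reject: Too recent" branch, and returning the output payload with `rolled_back` set to `False` is an assumed shape for the dry-run preview.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional


class CooldownActive(Exception):
    """Hypothetical error for the 'Reject: Too recent' branch."""


def run_rollback(
    *,
    model_name: str,
    from_version: str,
    to_version: str,
    reason: str = "MANUAL_REQUEST",
    alarm_name: Optional[str] = None,
    dry_run: bool = False,
    in_cooldown: Callable[[str], bool],
    archive: Callable[[str, str], None],
    promote: Callable[[str, str], None],
    record_state: Callable[[Dict[str, Any]], None],
    notify: Callable[[Dict[str, Any]], None],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Follow the flowchart: dry run, cooldown, archive, promote, record, notify."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    result: Dict[str, Any] = {
        "rolled_back": False,
        "model_name": model_name,
        "from_version": from_version,
        "to_version": to_version,
        "reason": reason,
        "alarm_name": alarm_name,
        "timestamp": stamp,
    }
    if dry_run:
        return result  # preview only; nothing is changed
    if in_cooldown(model_name):
        raise CooldownActive(f"{model_name} was rolled back too recently")
    archive(model_name, from_version)  # current Production -> Archived
    promote(model_name, to_version)    # target -> Production
    result["rolled_back"] = True
    record_state(result)
    notify(result)
    return result
```

With MLflow's stage-based registry, `archive` and `promote` could each wrap `MlflowClient.transition_model_version_stage`, although recent MLflow releases mark the stage API as deprecated.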

Key Features

  • Enforces cooldown period to prevent thrashing
  • Dry-run mode for inspection without execution
  • Archives old Production version before promoting new one
  • Tracks rollback count and reason history
  • Sends detailed SNS notifications
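
The cooldown and history features above might be implemented along these lines. This is a sketch only: the state field names (`rollback_count`, `last_rollback_at`, `history`) are hypothetical, since the DynamoDB item layout is not documented here, and treating a rollback at exactly the cooldown boundary as allowed is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def in_cooldown(
    last_rollback_at: Optional[datetime],
    now: datetime,
    cooldown_hours: float = 24,
) -> bool:
    """True while fewer than cooldown_hours have passed since the last rollback."""
    if last_rollback_at is None:
        return False  # no previous rollback recorded
    return now - last_rollback_at < timedelta(hours=cooldown_hours)


def next_state(prev: Optional[Dict[str, Any]], reason: str, now: datetime) -> Dict[str, Any]:
    """Return an updated state record with the count and reason history extended."""
    prev = prev or {"rollback_count": 0, "history": []}
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "rollback_count": prev["rollback_count"] + 1,
        "last_rollback_at": stamp,
        "history": [*prev["history"], {"reason": reason, "timestamp": stamp}],
    }
```

Keeping both functions free of I/O means the thrashing guard can be unit-tested with fixed timestamps, with the DynamoDB read and write handled separately.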

CloudWatch Alarm Integration

{
  "AlarmName": "high-drawdown-alarm",
  "AlarmActions": [
    "arn:aws:lambda:...:model-rollback"
  ],
  "Dimensions": [
    {
      "Name": "ModelName",
      "Value": "PascalStrategy"
    }
  ]
}
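
When the alarm invokes the function directly, the event has to be translated into the input schema. The sketch below assumes the event carries `alarmData.alarmName`, `alarmData.state.value`, and metric dimensions under `alarmData.configuration.metrics[].metricStat.metric.dimensions`, which is the shape AWS documents for alarm-to-Lambda actions; verify it against a real event before relying on it. The `request_from_alarm` name and the choice of `PERFORMANCE_DEGRADATION` as the reason are assumptions.

```python
from typing import Any, Dict, Mapping, Optional


def request_from_alarm(event: Mapping[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a CloudWatch alarm event to a rollback request, or None if not applicable."""
    data = event.get("alarmData") or {}
    if (data.get("state") or {}).get("value") != "ALARM":
        return None  # ignore OK / INSUFFICIENT_DATA transitions
    metrics = (data.get("configuration") or {}).get("metrics") or []
    for metric in metrics:
        inner = (metric.get("metricStat") or {}).get("metric") or {}
        dimensions = inner.get("dimensions") or {}
        model_name = dimensions.get("ModelName")
        if model_name:
            return {
                "model_name": model_name,
                "reason": "PERFORMANCE_DEGRADATION",  # assumed mapping
                "alarm_name": data.get("alarmName"),
                "dry_run": False,
            }
    return None  # no ModelName dimension found
```

Returning `None` for anything other than an `ALARM` state with a `ModelName` dimension keeps unrelated invocations from triggering a rollback.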

SNS Notification Format

{
    "subject": "Model Rollback: PascalStrategy",
    "message": {
        "model_name": "PascalStrategy",
        "from_version": "3",
        "to_version": "2",
        "reason": "PERFORMANCE_DEGRADATION",
        "triggered_by": "high-drawdown-alarm",
        "timestamp": "2024-01-01T12:00:00Z"
    }
}
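
A notification in this format could be built and published as sketched below. The function names are hypothetical, `sns_client` is assumed to be a boto3 SNS client (only its `publish` call with `TopicArn`, `Subject`, and `Message` is used), and serializing the message body as JSON is an assumption about the wire format.

```python
import json
from typing import Any, Dict, Mapping


def build_notification(result: Mapping[str, Any]) -> Dict[str, Any]:
    """Shape a rollback result into the documented subject/message structure."""
    return {
        "subject": f"Model Rollback: {result['model_name']}",
        "message": {
            "model_name": result["model_name"],
            "from_version": result["from_version"],
            "to_version": result["to_version"],
            "reason": result["reason"],
            # None when no alarm was involved (assumed; not documented).
            "triggered_by": result.get("alarm_name"),
            "timestamp": result["timestamp"],
        },
    }


def publish_notification(sns_client: Any, topic_arn: str, result: Mapping[str, Any]) -> Any:
    """Publish the notification; SNS limits Subject to 100 characters."""
    note = build_notification(result)
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject=note["subject"][:100],
        Message=json.dumps(note["message"]),
    )
```

In the function itself, `topic_arn` would come from `ALERT_SNS_TOPIC_ARN`.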