
check-retraining-needed

Evaluates whether a model requires retraining based on drift detection, scheduled intervals, or manual triggers.

Overview

| Property | Value |
| --- | --- |
| Trigger | Step Functions state machine |
| Runtime | Python 3.11 |
| Timeout | 30 seconds |
| Memory | 256 MB |

Input Schema

{
    "model_name": "PascalStrategy",  # Required
    "force": false,                   # Optional, forces retraining
    "manual_trigger": false           # Optional, marks as manual request
}

Output Schema

{
    "model_name": "PascalStrategy",
    "decision": "needs_retraining",  # needs_retraining | recently_trained | no_retraining | invalid_model
    "trigger": "drift_detected",     # manual | drift_detected | scheduled
    "reason": "Significant drift detected (PSI=0.350)",
    "hours_since_retrain": 168.5,
    "days_since_retrain": 7,
    "drift_severity": "significant",
    "drift_psi": 0.35
}

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DYNAMODB_TABLE_NAME | Yes | - | State repository table |
| DRIFT_STATE_TABLE | No | "tradai-drift-state" | Drift state table |
| RETRAINING_STATE_TABLE | No | "tradai-retraining-state" | Retraining state table |
| RETRAINING_INTERVAL_DAYS | No | 7 | Days between scheduled retraining |
| MIN_HOURS_BETWEEN_RETRAINING | No | 24 | Minimum hours between attempts |
| ALLOWED_MODELS | No | "" (all) | Comma-separated allowlist of valid model names. When set, only listed models are accepted; others return invalid_model. An empty string permits any model that passes format validation. |
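
The table above could be read into a typed config object along these lines. This is a minimal sketch, not the Lambda's actual code: `RetrainingConfig` and `load_config` are hypothetical names, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainingConfig:
    """Hypothetical container for the settings listed in the table."""
    state_table: str
    drift_state_table: str
    retraining_state_table: str
    retraining_interval_days: int
    min_hours_between_retraining: int
    allowed_models: frozenset  # empty set means "no allowlist"


def load_config(env=None):
    """Build a RetrainingConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Split the comma-separated allowlist, ignoring blanks and whitespace.
    raw_allowed = env.get("ALLOWED_MODELS", "")
    allowed = frozenset(n.strip() for n in raw_allowed.split(",") if n.strip())
    return RetrainingConfig(
        state_table=env["DYNAMODB_TABLE_NAME"],  # required: KeyError if unset
        drift_state_table=env.get("DRIFT_STATE_TABLE", "tradai-drift-state"),
        retraining_state_table=env.get(
            "RETRAINING_STATE_TABLE", "tradai-retraining-state"
        ),
        retraining_interval_days=int(env.get("RETRAINING_INTERVAL_DAYS", "7")),
        min_hours_between_retraining=int(
            env.get("MIN_HOURS_BETWEEN_RETRAINING", "24")
        ),
        allowed_models=allowed,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in unit tests.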

Decision Logic

flowchart TD
    A[Start] --> V{model_name format valid?}
    V -->|No| X[INVALID_MODEL]
    V -->|Yes| AL{In ALLOWED_MODELS?}
    AL -->|No| X
    AL -->|Yes / allowlist empty| B{Force or manual_trigger?}
    B -->|Yes| C[NEEDS_RETRAINING: MANUAL]
    B -->|No| MH{Hours since last retrain?}
    MH -->|< min_hours| I[RECENTLY_TRAINED]
    MH -->|>= min_hours| D{Check drift state}
    D -->|Significant drift| E[NEEDS_RETRAINING: DRIFT_DETECTED]
    D -->|No drift| F{Days since last retrain?}
    F -->|>= interval| H[NEEDS_RETRAINING: SCHEDULED]
    F -->|< interval| J[NO_RETRAINING]
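
The branch order in the flowchart, after the validation steps, can be sketched as a pure function. This is an illustration under assumptions rather than the deployed implementation: the function name `decide` and its parameters are hypothetical, and treating a model with no recorded retrain as "infinitely old" is a guess, since the flowchart does not cover that case.

```python
def decide(*, force=False, manual_trigger=False, hours_since_retrain=None,
           drift_severity=None, interval_days=7, min_hours=24):
    """Return the decision and trigger for one model, following the flowchart."""
    # 1. Forced or manual requests bypass every other check.
    if force or manual_trigger:
        return {"decision": "needs_retraining", "trigger": "manual"}

    # Assumption: a model with no recorded retrain counts as infinitely old.
    hours = float("inf") if hours_since_retrain is None else hours_since_retrain

    # 2. Cooldown: refuse to retrain again inside the minimum window.
    if hours < min_hours:
        return {"decision": "recently_trained", "trigger": None}

    # 3. Significant drift wins over the schedule.
    if drift_severity == "significant":
        return {"decision": "needs_retraining", "trigger": "drift_detected"}

    # 4. Otherwise fall back to the scheduled interval.
    if hours / 24 >= interval_days:
        return {"decision": "needs_retraining", "trigger": "scheduled"}

    return {"decision": "no_retraining", "trigger": None}
```

Note that the cooldown sits ahead of the drift check, so significant drift detected within `min_hours` of the last retrain still yields `recently_trained` unless the request is forced or manual.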

Key Features

  • Input validation: rejects any model_name that does not match ^[A-Z][A-Za-z0-9_]{1,49}\Z (an identifier that starts with an uppercase letter, 2 to 50 characters, letters, digits, and underscores only; \Z rather than $ so a trailing newline cannot slip through)
  • Allowlist gate: when ALLOWED_MODELS is set, rejects any model not in the list, so bogus strategy names cannot waste Fargate compute
  • Checks the recorded drift severity; significant drift triggers retraining
  • Enforces a minimum interval between retraining attempts (MIN_HOURS_BETWEEN_RETRAINING, default 24 hours)
  • Supports forced retraining and manual triggers
  • Returns decision rationale for audit trail
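
The first two checks in the list can be illustrated with a short sketch. The regular expression is the one quoted above; the helper name `is_valid_model` and the way the allowlist is passed in are assumptions made for the example.

```python
import re

# Pattern quoted in the feature list: uppercase first letter, then 1-49
# letters, digits, or underscores. \Z anchors at the absolute end of the
# string, whereas $ would also match just before a trailing newline.
MODEL_NAME_RE = re.compile(r"^[A-Z][A-Za-z0-9_]{1,49}\Z")


def is_valid_model(model_name, allowed_models=""):
    """Return True if model_name passes format and allowlist checks.

    allowed_models mirrors the ALLOWED_MODELS variable: a comma-separated
    string, where an empty string means any well-formed name is accepted.
    """
    if not isinstance(model_name, str):
        return False
    if MODEL_NAME_RE.match(model_name) is None:
        return False
    allowlist = {n.strip() for n in allowed_models.split(",") if n.strip()}
    return not allowlist or model_name in allowlist
```

For instance, `is_valid_model("PascalStrategy")` passes, while `is_valid_model("PascalStrategy\n")` fails on the `\Z` anchor and `is_valid_model("OtherStrategy", "PascalStrategy")` fails on the allowlist.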

Step Functions Integration

The Lambda feeds the EvaluateRetrainingNeed Choice state, which branches on $.check_result.decision:

| Decision | Route |
| --- | --- |
| needs_retraining | RunRetraining (ECS Fargate) |
| invalid_model | HandleInvalidModelNotifyFailure |
| recently_trained / no_retraining | SkipRetraining |

See infra/compute/asl_templates/retraining_workflow.json.j2 for the full ASL definition.
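
For orientation, a Choice state with this routing might look roughly like the structure generated below. This sketch is not copied from the template: the target state names are taken verbatim from the routing table, while the use of one `StringEquals` rule per decision value and the absence of a `Default` branch are assumptions.

```python
import json

# JSONPath the Choice state branches on, as stated in this section.
DECISION_PATH = "$.check_result.decision"

# Decision value -> next state, transcribed from the routing table.
ROUTES = {
    "needs_retraining": "RunRetraining",
    "invalid_model": "HandleInvalidModelNotifyFailure",
    "recently_trained": "SkipRetraining",
    "no_retraining": "SkipRetraining",
}


def build_choice_state(routes=ROUTES):
    """Build an ASL Choice state with one StringEquals rule per decision."""
    return {
        "Type": "Choice",
        "Choices": [
            {"Variable": DECISION_PATH, "StringEquals": decision, "Next": target}
            for decision, target in routes.items()
        ],
    }


choice_state = build_choice_state()
print(json.dumps({"EvaluateRetrainingNeed": choice_state}, indent=2))
```

Without a `Default` branch, Step Functions fails the execution with `States.NoChoiceMatched` on an unexpected decision value, so a real definition may prefer an explicit fallback.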