retraining-scheduler

Schedules and triggers model retraining based on drift detection, scheduled intervals, or manual requests.

Overview

| Property | Value |
|----------|-------|
| Trigger | EventBridge / Manual |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
| Settings class | RetrainingSchedulerSettings |

Input Schema

{
    "models": [                         # Optional, uses DEFAULT_MODELS if not provided
        {
            "name": "PascalStrategy",
            "strategy": "PascalFreqAIStrategy",
            "freqai_model": "LightGBMRegressor",
            "train_period_days": 30,
            "pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
            "timeframe": "1h"
        }
    ],
    "force": false,                     # Force retraining regardless of state
    "trigger": "manual"                 # Optional: override trigger type
}
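
As a minimal sketch of how an event of this shape could be normalized, the snippet below fills in the documented defaults (LightGBMRegressor, 30 days, 1h). The `ModelConfig` dataclass, the `parse_event` helper, and the contents of `DEFAULT_MODELS` are assumptions made for illustration; the real handler may structure this differently.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-in: the real DEFAULT_MODELS list is not shown in this page.
DEFAULT_MODELS = [{"name": "PascalStrategy", "strategy": "PascalFreqAIStrategy"}]


@dataclass
class ModelConfig:
    """One entry of the "models" array, with the documented defaults."""
    name: str
    strategy: str
    freqai_model: str = "LightGBMRegressor"
    train_period_days: int = 30
    pairs: list = field(default_factory=list)
    timeframe: str = "1h"


def parse_event(event: dict):
    """Return (models, force, trigger) from a raw Lambda event."""
    raw_models = event.get("models") or DEFAULT_MODELS
    models = [ModelConfig(**m) for m in raw_models]
    force = bool(event.get("force", False))
    trigger: Optional[str] = event.get("trigger")  # None when not overridden
    return models, force, trigger


models, force, trigger = parse_event({"force": True})
print(models[0].name, models[0].timeframe, force, trigger)
```

An empty or missing "models" key falls back to the default list, which mirrors the "Optional" note in the schema above.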

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "models_evaluated": 2,
            "retraining_triggered": 1,
            "skipped": 1,
            "errors": 0
        },
        "results": [
            {
                "model_name": "PascalStrategy",
                "status": "triggered",
                "trigger": "drift_detected",
                "task_arn": "arn:aws:ecs:...:task/abc123",
                "retraining_triggered": true,
                "timestamp": "2024-01-01T12:00:00+00:00"
            },
            {
                "model_name": "RadStrategy",
                "status": "skipped",
                "reason": "recently_retrained",
                "hours_since_last": 12.5,
                "retraining_triggered": false,
                "timestamp": "2024-01-01T12:00:00+00:00"
            }
        ]
    },
    "environment": "dev"
}
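
The summary block can be derived from the per-model results. The following is a sketch only: the `summarize` helper is hypothetical, and the `"error"` status value is an assumption, since the example above shows only `triggered` and `skipped`.

```python
from collections import Counter


def summarize(results: list) -> dict:
    """Aggregate per-model results into the "summary" object."""
    counts = Counter(r["status"] for r in results)
    return {
        "models_evaluated": len(results),
        "retraining_triggered": counts["triggered"],
        "skipped": counts["skipped"],
        "errors": counts["error"],  # assumed status name for failed evaluations
    }


results = [
    {"model_name": "PascalStrategy", "status": "triggered"},
    {"model_name": "RadStrategy", "status": "skipped"},
]
print(summarize(results))
```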

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| ECS_CLUSTER | Yes | - | ECS cluster name/ARN |
| ECS_SUBNETS | Yes | - | Comma-separated subnet IDs |
| ECS_SECURITY_GROUPS | Yes | - | Comma-separated SG IDs |
| ECS_TASK_DEFINITION_PREFIX | No | strategy- | Task definition prefix |
| ECS_CONTAINER_NAME | No | strategy | Container name for overrides |
| USE_SPOT | No | false | Use Fargate Spot instances |
| RETRAINING_STATE_TABLE | Yes | - | DynamoDB state table for retraining records |
| DRIFT_STATE_TABLE | Yes | - | Drift detection state table |
| MIN_HOURS_BETWEEN_RETRAINING | No | 24 | Minimum cooldown between retraining runs, in hours |
| RETRAINING_INTERVAL_DAYS | No | 7 | Scheduled retraining interval in days |
| MLFLOW_TRACKING_URI | No | - | MLflow server URL (passed to ECS task as env var) |
| SNS_ALERTS_TOPIC_ARN | No | - | SNS topic ARN for notifications |
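
To illustrate how these variables and defaults fit together, here is a rough standard-library sketch. `SchedulerEnv` and `load_env` are hypothetical names and are not the actual RetrainingSchedulerSettings class, whose implementation is not shown here; the parsing of `USE_SPOT` (only the string "true" enables it) and the integer types are also assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _split_csv(value: str) -> list:
    """Split a comma-separated string, dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


@dataclass(frozen=True)
class SchedulerEnv:  # hypothetical, for illustration only
    ecs_cluster: str
    ecs_subnets: list
    ecs_security_groups: list
    retraining_state_table: str
    drift_state_table: str
    ecs_task_definition_prefix: str = "strategy-"
    ecs_container_name: str = "strategy"
    use_spot: bool = False
    min_hours_between_retraining: int = 24
    retraining_interval_days: int = 7
    mlflow_tracking_uri: Optional[str] = None
    sns_alerts_topic_arn: Optional[str] = None


def load_env(env: Mapping[str, str] = os.environ) -> SchedulerEnv:
    """Read the documented variables; a missing required one raises KeyError."""
    return SchedulerEnv(
        ecs_cluster=env["ECS_CLUSTER"],
        ecs_subnets=_split_csv(env["ECS_SUBNETS"]),
        ecs_security_groups=_split_csv(env["ECS_SECURITY_GROUPS"]),
        retraining_state_table=env["RETRAINING_STATE_TABLE"],
        drift_state_table=env["DRIFT_STATE_TABLE"],
        ecs_task_definition_prefix=env.get("ECS_TASK_DEFINITION_PREFIX", "strategy-"),
        ecs_container_name=env.get("ECS_CONTAINER_NAME", "strategy"),
        use_spot=env.get("USE_SPOT", "false").lower() == "true",
        min_hours_between_retraining=int(env.get("MIN_HOURS_BETWEEN_RETRAINING", "24")),
        retraining_interval_days=int(env.get("RETRAINING_INTERVAL_DAYS", "7")),
        mlflow_tracking_uri=env.get("MLFLOW_TRACKING_URI") or None,
        sns_alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN") or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.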

Trigger Types

Uses RetrainingTrigger enum from tradai.common.entities.retraining:

| Trigger | Description | Priority |
|---------|-------------|----------|
| drift_detected | Significant PSI drift detected in drift state table | High |
| scheduled | Periodic retraining interval reached (RETRAINING_INTERVAL_DAYS) | Medium |
| manual | Explicit user request (trigger: "manual" in event or force: true) | Highest |

Retraining Decision Flow

flowchart TD
    A[Evaluate Model] --> B{Force flag?}
    B -->|Yes| C[Trigger: manual]
    B -->|No| D{Recently retrained?}
    D -->|Yes, within MIN_HOURS| E[Skip: cooldown]
    D -->|No| F{Drift detected in DynamoDB?}
    F -->|Yes, is_drifted=true| G[Trigger: drift_detected]
    F -->|No| H{Scheduled interval due?}
    H -->|Yes, days >= RETRAINING_INTERVAL_DAYS| I[Trigger: scheduled]
    H -->|No| J[Skip: no trigger]
    C --> K[Launch ECS Task]
    G --> K
    I --> K
    K --> L[Update Retraining State in DynamoDB]
    L --> M{Drift triggered?}
    M -->|Yes| N[Reset drift state: is_drifted=false]
    M -->|No| O[Send Notification]
    N --> O
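
The decision part of the flowchart (nodes A through J) can be expressed as a small pure function. This is a sketch under stated assumptions, not the Lambda's actual code: the `decide` name and signature are hypothetical, the `"no_trigger"` reason string is invented for the "Skip: no trigger" branch, and a model with no retraining record is assumed to be due for scheduled retraining.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def decide(
    force: bool,
    last_retraining: Optional[datetime],
    is_drifted: bool,
    now: datetime,
    min_hours: int = 24,      # MIN_HOURS_BETWEEN_RETRAINING
    interval_days: int = 7,   # RETRAINING_INTERVAL_DAYS
) -> Tuple[str, str]:
    """Return (status, trigger_or_reason) following the flowchart order."""
    if force:
        return ("triggered", "manual")
    if last_retraining is not None:
        if now - last_retraining < timedelta(hours=min_hours):
            return ("skipped", "recently_retrained")
    if is_drifted:
        return ("triggered", "drift_detected")
    # Assumption: no prior record means the scheduled interval is already due.
    if last_retraining is None or now - last_retraining >= timedelta(days=interval_days):
        return ("triggered", "scheduled")
    return ("skipped", "no_trigger")  # hypothetical reason string


last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(decide(False, last, True, last + timedelta(hours=12.5)))
print(decide(False, last, True, last + timedelta(hours=48)))
```

Note the ordering: the cooldown check runs before the drift check, so a drifted model that was retrained within the cooldown window is still skipped unless the force flag is set.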

ECS Task Configuration

The Lambda launches Fargate tasks with these container environment overrides:

| Env Var | Value | Description |
|---------|-------|-------------|
| TRADING_MODE | train | Matches EntrypointSettings |
| STRATEGY | From model config | Strategy class name |
| MODEL_NAME | From model config | Model identifier |
| FREQAI_MODEL | From model config | FreqAI model class (default: LightGBMRegressor) |
| TRAIN_PERIOD_DAYS | From model config | Training period (default: 30) |
| PAIRS | Comma-joined list | Trading pairs |
| TIMEFRAME | From model config | Candle timeframe (default: 1h) |
| ENVIRONMENT | From settings | Deployment environment |
| MLFLOW_TRACKING_URI | From settings | Only added if MLFLOW_TRACKING_URI is configured |

  • Task definition: {ECS_TASK_DEFINITION_PREFIX}{strategy_name_lowercase}
  • Capacity provider: FARGATE_SPOT (weight=1, base=0) + FARGATE (weight=0) if USE_SPOT=true, otherwise launchType=FARGATE
  • No command override: uses container ENTRYPOINT from image
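
The configuration above maps onto the keyword arguments of the ECS `run_task` call roughly as follows. This sketch only builds the argument dictionary and makes no AWS call. `build_run_task_kwargs` is a hypothetical helper, and using the model's `strategy` field (rather than `name`) for the task definition suffix is an assumption.

```python
from typing import Optional


def build_run_task_kwargs(
    model: dict,
    *,
    cluster: str,
    subnets: list,
    security_groups: list,
    prefix: str = "strategy-",
    container_name: str = "strategy",
    environment: str = "dev",
    use_spot: bool = False,
    mlflow_tracking_uri: Optional[str] = None,
) -> dict:
    """Assemble kwargs suitable for boto3's ecs_client.run_task(**kwargs)."""
    env = {
        "TRADING_MODE": "train",
        "STRATEGY": model["strategy"],
        "MODEL_NAME": model["name"],
        "FREQAI_MODEL": model.get("freqai_model", "LightGBMRegressor"),
        "TRAIN_PERIOD_DAYS": str(model.get("train_period_days", 30)),
        "PAIRS": ",".join(model.get("pairs", [])),
        "TIMEFRAME": model.get("timeframe", "1h"),
        "ENVIRONMENT": environment,
    }
    if mlflow_tracking_uri:  # only forwarded when configured
        env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri

    kwargs = {
        "cluster": cluster,
        # Assumption: the suffix is the lowercased strategy class name.
        "taskDefinition": f"{prefix}{model['strategy'].lower()}",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "securityGroups": security_groups}
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "environment": [{"name": k, "value": v} for k, v in env.items()],
                }
            ]
        },
    }
    if use_spot:
        # launchType and capacityProviderStrategy are mutually exclusive in run_task.
        kwargs["capacityProviderStrategy"] = [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1, "base": 0},
            {"capacityProvider": "FARGATE", "weight": 0},
        ]
    else:
        kwargs["launchType"] = "FARGATE"
    return kwargs
```

Because no `command` key is set in the container override, the image's own ENTRYPOINT runs unchanged, as noted in the list above.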

DynamoDB State Management

Retraining State Table

Records retraining job state with 30-day TTL:

{
    "model_name": "PascalStrategy",     # Partition key
    "last_retraining": "2024-01-01T12:00:00+00:00",
    "task_arn": "arn:aws:ecs:...",
    "trigger": "drift_detected",
    "status": "running",
    "expires_at": 1706702400            # Unix timestamp, 30 days after last_retraining
}
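
A record like this could be assembled as sketched below, with the TTL attribute computed as an epoch timestamp 30 days after the retraining start. `build_state_item` is a hypothetical helper, and deriving `expires_at` from the same moment as `last_retraining` is an assumption about how the TTL is anchored.

```python
from datetime import datetime, timedelta, timezone

TTL_DAYS = 30  # matches the 30-day TTL described above


def build_state_item(model_name: str, task_arn: str, trigger: str, started_at: datetime) -> dict:
    """Build the DynamoDB item; started_at must be timezone-aware."""
    expires_at = int((started_at + timedelta(days=TTL_DAYS)).timestamp())
    return {
        "model_name": model_name,
        "last_retraining": started_at.isoformat(),
        "task_arn": task_arn,
        "trigger": trigger,
        "status": "running",
        "expires_at": expires_at,  # DynamoDB TTL expects epoch seconds
    }


item = build_state_item(
    "PascalStrategy",
    "arn:aws:ecs:...",
    "drift_detected",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(item["last_retraining"], item["expires_at"])
```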

Drift State Table

The scheduler reads the is_drifted flag from this table and resets it to false after a drift-triggered retraining:

# Reset operation
table.update_item(
    Key={"model_name": model_name},
    UpdateExpression="SET is_drifted = :val, reset_at = :ts",
    ...
)
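
The elided arguments in the reset call are not shown in this page. One plausible completion, offered purely as an assumption, binds `:val` to `False` and `:ts` to an ISO-8601 timestamp. The `build_drift_reset_kwargs` helper is hypothetical and only builds the arguments.

```python
from datetime import datetime, timezone


def build_drift_reset_kwargs(model_name: str, now: datetime) -> dict:
    """Arguments for a boto3 DynamoDB Table.update_item call (assumed values)."""
    return {
        "Key": {"model_name": model_name},
        "UpdateExpression": "SET is_drifted = :val, reset_at = :ts",
        "ExpressionAttributeValues": {
            ":val": False,            # clear the drift flag
            ":ts": now.isoformat(),   # assumed timestamp format
        },
    }


kwargs = build_drift_reset_kwargs("PascalStrategy", datetime.now(timezone.utc))
print(kwargs["ExpressionAttributeValues"][":val"])
```

With a boto3 table resource in hand, the call would then be `table.update_item(**kwargs)`.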

SQS Message Format

When consumed by the sqs-consumer, retraining messages use this format:

{
    "model_name": "PascalStrategy",
    "trigger_type": "drift_detected",    # or "scheduled", "manual"
    "custom_params": { ... }             # Optional additional parameters
}
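
A consumer might validate a message body of this shape before acting on it. The sketch below is illustrative only: `parse_retraining_message` is a hypothetical function, not the sqs-consumer's actual code, and treating a missing `custom_params` as an empty dict is an assumption.

```python
import json

VALID_TRIGGERS = {"drift_detected", "scheduled", "manual"}


def parse_retraining_message(body: str) -> dict:
    """Parse and validate a retraining message body (JSON string)."""
    msg = json.loads(body)
    if not msg.get("model_name"):
        raise ValueError("model_name is required")
    if msg.get("trigger_type") not in VALID_TRIGGERS:
        raise ValueError(f"unknown trigger_type: {msg.get('trigger_type')!r}")
    msg.setdefault("custom_params", {})  # optional field
    return msg


body = json.dumps({"model_name": "PascalStrategy", "trigger_type": "drift_detected"})
print(parse_retraining_message(body))
```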

CloudWatch Metrics

Namespace suffix: ModelRetraining

| Metric | Dimensions | Description |
|--------|------------|-------------|
| RetrainingTriggered | Model, Environment | 1.0 if triggered, 0.0 if skipped |
| RetrainingTrigger_{type} | Model, Environment | Per-trigger-type count |
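
The two metrics could be expressed as a `MetricData` payload for CloudWatch's `put_metric_data` as in this sketch. `build_metric_data` is a hypothetical helper, the `Count` unit is an assumption, and the namespace prefix that precedes `ModelRetraining` is not specified here.

```python
from typing import Optional


def build_metric_data(model_name: str, environment: str, trigger: Optional[str]) -> list:
    """Build MetricData entries; trigger is None when retraining was skipped."""
    dimensions = [
        {"Name": "Model", "Value": model_name},
        {"Name": "Environment", "Value": environment},
    ]
    data = [
        {
            "MetricName": "RetrainingTriggered",
            "Dimensions": dimensions,
            "Value": 1.0 if trigger else 0.0,
            "Unit": "Count",  # assumed unit
        }
    ]
    if trigger:
        data.append(
            {
                "MetricName": f"RetrainingTrigger_{trigger}",
                "Dimensions": dimensions,
                "Value": 1.0,
                "Unit": "Count",
            }
        )
    return data


print([m["MetricName"] for m in build_metric_data("PascalStrategy", "dev", "drift_detected")])
```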

EventBridge Schedule

{
  "ScheduleExpression": "rate(6 hours)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:retraining-scheduler",
    "Input": "{\"models\": [{\"name\": \"PascalStrategy\", \"strategy\": \"PascalFreqAIStrategy\"}]}"
  }]
}
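
The target's Input field is itself a JSON document encoded as a string, which is why the quotes are escaped above. Rather than escaping by hand, the string can be generated; `build_target_input` below is a hypothetical helper shown only to illustrate the encoding.

```python
import json


def build_target_input(models: list) -> str:
    """Serialize the Lambda event payload for an EventBridge target's Input."""
    return json.dumps({"models": models})


target_input = build_target_input(
    [{"name": "PascalStrategy", "strategy": "PascalFreqAIStrategy"}]
)
# Decoding the string yields the event the Lambda receives.
print(json.loads(target_input)["models"][0]["name"])
```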

SNS Notification Format

TradAI Model Retraining Notification

Environment: prod
Model: PascalStrategy
Trigger: Drift Detected
Task ARN: arn:aws:ecs:...:task/abc123
Timestamp: 2024-01-01T12:00:00+00:00

A model retraining job has been triggered. You will receive another
notification when the training completes.

Reason: Significant drift was detected in model predictions.
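
A message in this layout could be rendered with a small formatter such as the sketch below. `format_notification` is a hypothetical function. Only the drift reason text appears in this page, so other triggers omit the Reason line here, which is an assumption rather than documented behavior.

```python
# Only the drift reason is documented; other triggers have no known text.
REASONS = {
    "drift_detected": "Significant drift was detected in model predictions.",
}


def format_notification(environment: str, model_name: str, trigger: str,
                        task_arn: str, timestamp: str) -> str:
    """Render the notification body in the layout shown above."""
    lines = [
        "TradAI Model Retraining Notification",
        "",
        f"Environment: {environment}",
        f"Model: {model_name}",
        f"Trigger: {trigger.replace('_', ' ').title()}",
        f"Task ARN: {task_arn}",
        f"Timestamp: {timestamp}",
        "",
        "A model retraining job has been triggered. You will receive another",
        "notification when the training completes.",
    ]
    reason = REASONS.get(trigger)
    if reason:
        lines += ["", f"Reason: {reason}"]
    return "\n".join(lines)


print(format_notification("prod", "PascalStrategy", "drift_detected",
                          "arn:aws:ecs:...:task/abc123", "2024-01-01T12:00:00+00:00"))
```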
