retraining-scheduler

Schedules and triggers model retraining based on drift detection, scheduled intervals, or manual requests.

Overview

| Property | Value |
|----------|-------|
| Trigger | EventBridge / Manual |
| Runtime | Python 3.11 |
| Timeout | 300 seconds |
| Memory | 512 MB |

Input Schema

{
    "models": [                         # Optional, uses DEFAULT_MODELS if not provided
        {
            "name": "PascalStrategy",
            "strategy": "PascalFreqAIStrategy",
            "freqai_model": "LightGBMRegressor",
            "train_period_days": 30,
            "pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
            "timeframe": "1h"
        }
    ],
    "force": false,                     # Force retraining regardless of state
    "trigger": "manual"                 # Optional: override trigger type
}
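
As a rough illustration of how the optional fields above might be normalized, here is a minimal sketch. The function name `parse_event`, the placeholder contents of `DEFAULT_MODELS`, and the rule that every model entry needs a `name` are assumptions made for this example, not the Lambda's actual code.

```python
from typing import Any, Dict, List

# Hypothetical placeholder; the real DEFAULT_MODELS is defined in the Lambda source.
DEFAULT_MODELS: List[Dict[str, Any]] = [
    {"name": "PascalStrategy", "strategy": "PascalFreqAIStrategy"},
]


def parse_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize an invocation event into models, force, and trigger."""
    # Fall back to the defaults when "models" is missing or empty.
    models = event.get("models") or DEFAULT_MODELS
    for model in models:
        if "name" not in model:  # assumed validation rule
            raise ValueError("each model entry needs a 'name'")
    return {
        "models": models,
        "force": bool(event.get("force", False)),
        # None means "no override": the trigger type is decided per model.
        "trigger": event.get("trigger"),
    }
```

Calling `parse_event({})` in this sketch yields the default models with `force` set to `False` and no trigger override.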

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "models_evaluated": 2,
            "retraining_triggered": 1,
            "skipped": 1,
            "errors": 0
        },
        "results": [
            {
                "model_name": "PascalStrategy",
                "status": "triggered",
                "trigger": "drift_detected",
                "task_arn": "arn:aws:ecs:...:task/abc123",
                "retraining_triggered": true,
                "timestamp": "2024-01-01T12:00:00Z"
            },
            {
                "model_name": "RadStrategy",
                "status": "skipped",
                "reason": "recently_retrained",
                "hours_since_last": 12.5,
                "retraining_triggered": false,
                "timestamp": "2024-01-01T12:00:00Z"
            }
        ]
    }
}
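
The `summary` counters can plausibly be derived from the `results` list. The sketch below shows one way to do that; the helper name `summarize` and the `"error"` status value are assumptions, since the sample output only shows `"triggered"` and `"skipped"`.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, int]:
    """Build the summary counters from per-model results."""
    statuses = Counter(result.get("status") for result in results)
    return {
        "models_evaluated": len(results),
        "retraining_triggered": sum(
            1 for result in results if result.get("retraining_triggered")
        ),
        "skipped": statuses["skipped"],
        "errors": statuses["error"],  # "error" is an assumed status value
    }
```

Applied to the two results shown above, this returns the same counts as the sample `summary` object.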

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| ECS_CLUSTER | Yes | - | ECS cluster name/ARN |
| ECS_SUBNETS | Yes | - | Comma-separated subnet IDs |
| ECS_SECURITY_GROUPS | Yes | - | Comma-separated SG IDs |
| ECS_TASK_DEFINITION_PREFIX | No | "tradai-" | Task definition prefix |
| ECS_CONTAINER_NAME | No | "strategy" | Container name for overrides |
| USE_SPOT | No | false | Use Fargate Spot instances |
| RETRAINING_STATE_TABLE | Yes | - | DynamoDB state table |
| DRIFT_STATE_TABLE | Yes | - | Drift detection state table |
| MIN_HOURS_BETWEEN_RETRAINING | No | 24 | Minimum cooldown |
| RETRAINING_INTERVAL_DAYS | No | 7 | Scheduled retraining interval |
| MLFLOW_TRACKING_URI | No | - | MLflow server URL |
| ALERT_SNS_TOPIC_ARN | Yes | - | SNS topic ARN |
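
The variables above could be read into a typed configuration object along these lines. This is a sketch only: the `SchedulerConfig` class, the `load_config` helper, the accepted truthy strings for `USE_SPOT`, and the numeric types are assumptions, not the Lambda's actual implementation.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

REQUIRED = (
    "ECS_CLUSTER", "ECS_SUBNETS", "ECS_SECURITY_GROUPS",
    "RETRAINING_STATE_TABLE", "DRIFT_STATE_TABLE", "ALERT_SNS_TOPIC_ARN",
)


@dataclass(frozen=True)
class SchedulerConfig:
    ecs_cluster: str
    ecs_subnets: List[str]
    ecs_security_groups: List[str]
    task_definition_prefix: str
    container_name: str
    use_spot: bool
    retraining_state_table: str
    drift_state_table: str
    min_hours_between_retraining: float
    retraining_interval_days: int
    mlflow_tracking_uri: Optional[str]
    alert_sns_topic_arn: str


def _csv(value: str) -> List[str]:
    """Split a comma-separated value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_config(env: Optional[Mapping[str, str]] = None) -> SchedulerConfig:
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    return SchedulerConfig(
        ecs_cluster=env["ECS_CLUSTER"],
        ecs_subnets=_csv(env["ECS_SUBNETS"]),
        ecs_security_groups=_csv(env["ECS_SECURITY_GROUPS"]),
        task_definition_prefix=env.get("ECS_TASK_DEFINITION_PREFIX", "tradai-"),
        container_name=env.get("ECS_CONTAINER_NAME", "strategy"),
        # Assumed parsing rule: "true", "1", and "yes" enable Spot.
        use_spot=env.get("USE_SPOT", "false").lower() in ("true", "1", "yes"),
        retraining_state_table=env["RETRAINING_STATE_TABLE"],
        drift_state_table=env["DRIFT_STATE_TABLE"],
        min_hours_between_retraining=float(env.get("MIN_HOURS_BETWEEN_RETRAINING", "24")),
        retraining_interval_days=int(env.get("RETRAINING_INTERVAL_DAYS", "7")),
        mlflow_tracking_uri=env.get("MLFLOW_TRACKING_URI"),
        alert_sns_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
    )
```

Accepting a mapping argument rather than reading `os.environ` directly keeps the loader easy to exercise in tests.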

Trigger Types

| Trigger | Description | Priority |
|---------|-------------|----------|
| drift_detected | Significant PSI drift detected | High |
| scheduled | Periodic retraining interval reached | Medium |
| manual | Explicit user request | Highest |

Retraining Decision Flow

flowchart TD
    A[Evaluate Model] --> B{Force flag?}
    B -->|Yes| C[Trigger: manual]
    B -->|No| D{Recently retrained?}
    D -->|Yes| E[Skip: cooldown]
    D -->|No| F{Drift detected?}
    F -->|Yes| G[Trigger: drift_detected]
    F -->|No| H{Scheduled interval due?}
    H -->|Yes| I[Trigger: scheduled]
    H -->|No| J[Skip: no trigger]
    C --> K[Launch ECS Task]
    G --> K
    I --> K
    K --> L[Update State]
    L --> M[Send Notification]
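
The decision branches in the flowchart can be sketched as a single pure function. The function name, its parameters, the `"no_trigger"` reason string, and the handling of a model that has never been retrained (treated here as due for a scheduled run) are assumptions for illustration; only the branch order follows the diagram.

```python
from typing import Any, Dict, Optional


def decide_retraining(
    *,
    force: bool,
    drift_detected: bool,
    hours_since_last: Optional[float],
    min_hours_between: float = 24.0,
    interval_days: float = 7.0,
) -> Dict[str, Any]:
    """Walk the decision flow: force, cooldown, drift, schedule."""
    if force:
        # The force flag bypasses the cooldown check entirely.
        return {"retrain": True, "trigger": "manual"}
    if hours_since_last is not None and hours_since_last < min_hours_between:
        return {"retrain": False, "reason": "recently_retrained"}
    if drift_detected:
        return {"retrain": True, "trigger": "drift_detected"}
    # Assumption: a model with no recorded retraining counts as due.
    if hours_since_last is None or hours_since_last >= interval_days * 24:
        return {"retrain": True, "trigger": "scheduled"}
    return {"retrain": False, "reason": "no_trigger"}  # hypothetical reason string
```

Note that in this ordering the cooldown takes precedence over drift: a model retrained 12.5 hours ago is skipped even if drift is reported, which matches the `RadStrategy` entry in the sample output.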

CloudWatch Metrics

| Metric | Description |
|--------|-------------|
| RetrainingTriggered | Count of triggered retraining jobs |
| RetrainingTrigger_drift_detected | Drift-triggered retraining count |
| RetrainingTrigger_scheduled | Scheduled retraining count |
| RetrainingTrigger_manual | Manual retraining count |
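
The metric names suggest that each triggered job increments both the overall counter and a per-trigger counter. A minimal sketch of building that payload follows; the helper name and the choice of `Count` as the unit are assumptions, and the namespace is not documented here, so it is left out.

```python
from typing import Any, Dict, List

KNOWN_TRIGGERS = ("drift_detected", "scheduled", "manual")


def build_metric_data(trigger: str) -> List[Dict[str, Any]]:
    """Return MetricData entries for one triggered retraining job."""
    if trigger not in KNOWN_TRIGGERS:
        raise ValueError(f"unknown trigger type: {trigger}")
    return [
        {"MetricName": "RetrainingTriggered", "Value": 1, "Unit": "Count"},
        {"MetricName": f"RetrainingTrigger_{trigger}", "Value": 1, "Unit": "Count"},
    ]
```

The returned list has the shape expected by the `MetricData` argument of boto3's CloudWatch `put_metric_data` call, which also requires a `Namespace`.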

ECS Task Configuration

The Lambda launches Fargate tasks with:

- Environment variables: TRADING_MODE=train, STRATEGY, MODEL_NAME, etc.
- Capacity provider: FARGATE_SPOT if USE_SPOT is enabled, otherwise FARGATE
- Container command: not overridden; the image's ENTRYPOINT is used
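
To make the task configuration concrete, the sketch below assembles keyword arguments for boto3's ECS `run_task` call. The helper name, the way the task definition name is derived from the prefix, and the `assignPublicIp` setting are assumptions; only the environment variables, capacity provider choice, and absence of a command override come from the description above.

```python
from typing import Any, Dict, List


def build_run_task_kwargs(
    model: Dict[str, Any],
    *,
    cluster: str,
    subnets: List[str],
    security_groups: List[str],
    task_definition_prefix: str = "tradai-",
    container_name: str = "strategy",
    use_spot: bool = False,
) -> Dict[str, Any]:
    """Build arguments for ecs.run_task without overriding the command."""
    environment = {
        "TRADING_MODE": "train",
        "STRATEGY": model["strategy"],
        "MODEL_NAME": model["name"],
    }
    return {
        "cluster": cluster,
        # Hypothetical naming scheme: prefix + model name.
        "taskDefinition": f"{task_definition_prefix}{model['name']}",
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE_SPOT" if use_spot else "FARGATE", "weight": 1}
        ],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",  # assumed; depends on the VPC setup
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # No "command" key, so the image ENTRYPOINT runs as-is.
                    "environment": [
                        {"name": key, "value": value}
                        for key, value in environment.items()
                    ],
                }
            ]
        },
    }
```

The result would be passed as `boto3.client("ecs").run_task(**kwargs)`. Because a capacity provider strategy is supplied, `launchType` is deliberately left out; the ECS API does not accept both together.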

EventBridge Schedule

{
  "ScheduleExpression": "rate(6 hours)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:retraining-scheduler",
    "Input": "{\"models\": [{\"name\": \"PascalStrategy\", \"strategy\": \"PascalFreqAIStrategy\"}]}"
  }]
}
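
The `Input` field of an EventBridge target is a JSON document encoded as a string, which is why the quotes above are escaped. Rather than writing the escapes by hand, the string can be generated; the helper name below is hypothetical.

```python
import json
from typing import Any, Dict, List


def build_target_input(models: List[Dict[str, Any]]) -> str:
    """Serialize the Lambda event payload for an EventBridge target Input."""
    return json.dumps({"models": models})


target_input = build_target_input(
    [{"name": "PascalStrategy", "strategy": "PascalFreqAIStrategy"}]
)
```

Here `target_input` is the unescaped form of the `Input` value shown above; embedding it in a JSON rule definition adds the backslash escapes.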

SNS Notification Format

TradAI Model Retraining Notification

Environment: prod
Model: PascalStrategy
Trigger: Drift Detected
Task ARN: arn:aws:ecs:...:task/abc123
Timestamp: 2024-01-01T12:00:00Z

A model retraining job has been triggered. You will receive another
notification when the training completes.

Reason: Significant drift was detected in model predictions.
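
A message in this format could be assembled as sketched below. The function name and parameters are hypothetical, and only the drift reason text is taken from the sample above; the wording for other trigger types is not documented, so the sketch omits the reason line for them.

```python
REASONS = {
    # Only this reason appears in the sample notification.
    "drift_detected": "Significant drift was detected in model predictions.",
}


def format_notification(
    environment: str, model_name: str, trigger: str, task_arn: str, timestamp: str
) -> str:
    """Render the plain-text body of a retraining notification."""
    lines = [
        "TradAI Model Retraining Notification",
        "",
        f"Environment: {environment}",
        f"Model: {model_name}",
        # "drift_detected" becomes "Drift Detected".
        f"Trigger: {trigger.replace('_', ' ').title()}",
        f"Task ARN: {task_arn}",
        f"Timestamp: {timestamp}",
        "",
        "A model retraining job has been triggered. You will receive another",
        "notification when the training completes.",
    ]
    reason = REASONS.get(trigger)
    if reason:
        lines += ["", f"Reason: {reason}"]
    return "\n".join(lines)
```

The resulting string is suitable as the `Message` argument of an SNS `publish` call to the topic in `ALERT_SNS_TOPIC_ARN`.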
