retraining-scheduler¶
Schedules and triggers model retraining based on drift detection, scheduled intervals, or manual requests.
Overview¶
| Property | Value |
|---|---|
| Trigger | EventBridge / Manual |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
| Settings class | RetrainingSchedulerSettings |
Input Schema¶
{
"models": [ # Optional, uses DEFAULT_MODELS if not provided
{
"name": "PascalStrategy",
"strategy": "PascalFreqAIStrategy",
"freqai_model": "LightGBMRegressor",
"train_period_days": 30,
"pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
"timeframe": "1h"
}
],
"force": false, # Force retraining regardless of state
"trigger": "manual" # Optional: override trigger type
}
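A minimal sketch of how an incoming event might be normalized before evaluation. The function name `normalize_event` and the `default_models` argument are hypothetical; the per-model defaults are the ones listed in the ECS Task Configuration table, and the real handler may apply them elsewhere.

```python
from typing import Any

# Per-model defaults as documented in the ECS Task Configuration table.
MODEL_DEFAULTS: dict[str, Any] = {
    "freqai_model": "LightGBMRegressor",
    "train_period_days": 30,
    "timeframe": "1h",
}


def normalize_event(
    event: dict[str, Any], default_models: list[dict[str, Any]]
) -> dict[str, Any]:
    """Fill in optional fields of the input event (hypothetical helper)."""
    # "models" is optional; fall back to the configured defaults when absent or empty.
    models = event.get("models") or default_models
    return {
        "models": [{**MODEL_DEFAULTS, **model} for model in models],
        "force": bool(event.get("force", False)),
        "trigger": event.get("trigger"),  # None means "no override"
    }
```

Calling `normalize_event({"models": [{"name": "PascalStrategy", "strategy": "PascalFreqAIStrategy"}]}, [])` would yield one model entry with the three defaults filled in, `force` set to `False`, and no trigger override.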
Output Schema¶
{
"success": true,
"data": {
"summary": {
"models_evaluated": 2,
"retraining_triggered": 1,
"skipped": 1,
"errors": 0
},
"results": [
{
"model_name": "PascalStrategy",
"status": "triggered",
"trigger": "drift_detected",
"task_arn": "arn:aws:ecs:...:task/abc123",
"retraining_triggered": true,
"timestamp": "2024-01-01T12:00:00+00:00"
},
{
"model_name": "RadStrategy",
"status": "skipped",
"reason": "recently_retrained",
"hours_since_last": 12.5,
"retraining_triggered": false,
"timestamp": "2024-01-01T12:00:00+00:00"
}
]
},
"environment": "dev"
}
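The `summary` block can be derived from the `results` list. The sketch below shows one way to do that; the helper name `summarize` and the `"error"` status value are assumptions, since the sample output only shows the `"triggered"` and `"skipped"` statuses.

```python
from collections import Counter
from typing import Any


def summarize(results: list[dict[str, Any]]) -> dict[str, int]:
    """Aggregate per-model results into the summary counters (hypothetical helper)."""
    statuses = Counter(result.get("status") for result in results)
    return {
        "models_evaluated": len(results),
        "retraining_triggered": sum(
            1 for result in results if result.get("retraining_triggered")
        ),
        "skipped": statuses["skipped"],
        # Assumption: failed evaluations are reported with status "error".
        "errors": statuses["error"],
    }
```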
Environment Variables¶
| Variable | Required | Default | Description |
|---|---|---|---|
| ECS_CLUSTER | Yes | - | ECS cluster name/ARN |
| ECS_SUBNETS | Yes | - | Comma-separated subnet IDs |
| ECS_SECURITY_GROUPS | Yes | - | Comma-separated SG IDs |
| ECS_TASK_DEFINITION_PREFIX | No | strategy- | Task definition prefix |
| ECS_CONTAINER_NAME | No | strategy | Container name for overrides |
| USE_SPOT | No | false | Use Fargate Spot instances |
| RETRAINING_STATE_TABLE | Yes | - | DynamoDB state table for retraining records |
| DRIFT_STATE_TABLE | Yes | - | Drift detection state table |
| MIN_HOURS_BETWEEN_RETRAINING | No | 24 | Minimum cooldown between retraining |
| RETRAINING_INTERVAL_DAYS | No | 7 | Scheduled retraining interval in days |
| MLFLOW_TRACKING_URI | No | - | MLflow server URL (passed to ECS task as env var) |
| SNS_ALERTS_TOPIC_ARN | No | - | SNS topic ARN for notifications |
Trigger Types¶
Uses RetrainingTrigger enum from tradai.common.entities.retraining:
| Trigger | Description | Priority |
|---|---|---|
| drift_detected | Significant PSI drift detected in drift state table | High |
| scheduled | Periodic retraining interval reached (RETRAINING_INTERVAL_DAYS) | Medium |
| manual | Explicit user request (trigger: "manual" in event or force: true) | Highest |
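How these triggers interact with the force flag and the cooldown can be sketched as a single decision function. This is an illustration under stated assumptions, not the Lambda's actual code: the `Decision` type, the `decide` function, and the `"no_trigger"` reason string are hypothetical, and treating a model with no retraining record as due for a scheduled run is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Decision:
    retrain: bool
    trigger: Optional[str] = None  # "manual", "drift_detected", or "scheduled"
    reason: Optional[str] = None   # set only when retraining is skipped


def decide(
    now: datetime,
    last_retraining: Optional[datetime],
    is_drifted: bool,
    force: bool = False,
    min_hours: float = 24,    # MIN_HOURS_BETWEEN_RETRAINING
    interval_days: float = 7,  # RETRAINING_INTERVAL_DAYS
) -> Decision:
    # 1. A forced request bypasses every other check.
    if force:
        return Decision(True, trigger="manual")
    # 2. The cooldown is checked before drift, so a recent run suppresses drift too.
    if last_retraining is not None and now - last_retraining < timedelta(hours=min_hours):
        return Decision(False, reason="recently_retrained")
    # 3. Drift flagged in the drift state table.
    if is_drifted:
        return Decision(True, trigger="drift_detected")
    # 4. Scheduled interval reached (assumption: no prior record counts as due).
    if last_retraining is None or now - last_retraining >= timedelta(days=interval_days):
        return Decision(True, trigger="scheduled")
    return Decision(False, reason="no_trigger")
```

Note that the cooldown is evaluated before drift, so a model retrained within `MIN_HOURS_BETWEEN_RETRAINING` is skipped even when drift is flagged, unless the request is forced.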
Retraining Decision Flow¶
flowchart TD
A[Evaluate Model] --> B{Force flag?}
B -->|Yes| C[Trigger: manual]
B -->|No| D{Recently retrained?}
D -->|Yes, within MIN_HOURS| E[Skip: cooldown]
D -->|No| F{Drift detected in DynamoDB?}
F -->|Yes, is_drifted=true| G[Trigger: drift_detected]
F -->|No| H{Scheduled interval due?}
H -->|Yes, days >= RETRAINING_INTERVAL_DAYS| I[Trigger: scheduled]
H -->|No| J[Skip: no trigger]
C --> K[Launch ECS Task]
G --> K
I --> K
K --> L[Update Retraining State in DynamoDB]
L --> M{Drift triggered?}
M -->|Yes| N[Reset drift state: is_drifted=false]
M -->|No| O[Send Notification]
N --> O
ECS Task Configuration¶
The Lambda launches Fargate tasks with these container environment overrides:
| Env Var | Value | Description |
|---|---|---|
| TRADING_MODE | train | Matches EntrypointSettings |
| STRATEGY | From model config | Strategy class name |
| MODEL_NAME | From model config | Model identifier |
| FREQAI_MODEL | From model config | FreqAI model class (default: LightGBMRegressor) |
| TRAIN_PERIOD_DAYS | From model config | Training period (default: 30) |
| PAIRS | Comma-joined list | Trading pairs |
| TIMEFRAME | From model config | Candle timeframe (default: 1h) |
| ENVIRONMENT | From settings | Deployment environment |
| MLFLOW_TRACKING_URI | From settings | Only added if MLFLOW_TRACKING_URI is configured |
- Task definition: {ECS_TASK_DEFINITION_PREFIX}{strategy_name_lowercase}
- Capacity provider: FARGATE_SPOT (weight=1, base=0) + FARGATE (weight=0) if USE_SPOT=true, otherwise launchType=FARGATE
- No command override: uses container ENTRYPOINT from image
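The configuration above can be sketched as a function that assembles the keyword arguments for boto3's ECS `run_task` call. The function name and parameters are hypothetical, and it assumes that `strategy_name_lowercase` refers to the model's `strategy` field; the real Lambda may derive the task definition name differently.

```python
from typing import Any, Optional


def build_run_task_kwargs(
    model: dict[str, Any],
    *,
    cluster: str,
    subnets: list[str],
    security_groups: list[str],
    environment: str,
    task_definition_prefix: str = "strategy-",
    container_name: str = "strategy",
    use_spot: bool = False,
    mlflow_tracking_uri: Optional[str] = None,
) -> dict[str, Any]:
    """Build kwargs for ecs_client.run_task(**kwargs) (hypothetical helper)."""
    env = {
        "TRADING_MODE": "train",
        "STRATEGY": model["strategy"],
        "MODEL_NAME": model["name"],
        "FREQAI_MODEL": model.get("freqai_model", "LightGBMRegressor"),
        "TRAIN_PERIOD_DAYS": str(model.get("train_period_days", 30)),
        "PAIRS": ",".join(model.get("pairs", [])),
        "TIMEFRAME": model.get("timeframe", "1h"),
        "ENVIRONMENT": environment,
    }
    if mlflow_tracking_uri:  # only passed through when configured
        env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri

    kwargs: dict[str, Any] = {
        "cluster": cluster,
        # Assumption: the task definition name uses the lowercased strategy class name.
        "taskDefinition": f"{task_definition_prefix}{model['strategy'].lower()}",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
            }
        },
        # No "command" key: the container's ENTRYPOINT from the image is used.
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "environment": [
                        {"name": key, "value": value} for key, value in env.items()
                    ],
                }
            ]
        },
    }
    # launchType and capacityProviderStrategy are mutually exclusive in run_task.
    if use_spot:
        kwargs["capacityProviderStrategy"] = [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1, "base": 0},
            {"capacityProvider": "FARGATE", "weight": 0},
        ]
    else:
        kwargs["launchType"] = "FARGATE"
    return kwargs
```

The result would be passed as `boto3.client("ecs").run_task(**kwargs)`. With `weight=0` on `FARGATE`, the strategy places every task on Spot capacity and keeps on-demand Fargate listed as a provider without assigning it any share.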
DynamoDB State Management¶
Retraining State Table¶
Records the state of each retraining job, with a 30-day TTL:
{
"model_name": "PascalStrategy", # Partition key
"last_retraining": "2024-01-01T12:00:00+00:00",
"task_arn": "arn:aws:ecs:...",
"trigger": "drift_detected",
"status": "running",
"expires_at": 1706745600 # Unix timestamp, 30 days TTL
}
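A minimal sketch of how such a record could be built, assuming the TTL is computed from the time the task is launched. The helper name `build_state_item` is hypothetical; the field names follow the record above.

```python
from datetime import datetime, timedelta, timezone

TTL_DAYS = 30  # matches the 30-day TTL described above


def build_state_item(
    model_name: str, task_arn: str, trigger: str, now: datetime
) -> dict:
    """Build the DynamoDB item written after a task launch (hypothetical helper)."""
    return {
        "model_name": model_name,  # partition key
        "last_retraining": now.isoformat(),
        "task_arn": task_arn,
        "trigger": trigger,
        "status": "running",
        # DynamoDB TTL attributes must hold a Unix epoch time in seconds.
        "expires_at": int((now + timedelta(days=TTL_DAYS)).timestamp()),
    }


item = build_state_item(
    "PascalStrategy",
    "arn:aws:ecs:...",
    "drift_detected",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

The item could then be written with `table.put_item(Item=item)` on a boto3 DynamoDB `Table` resource.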
Drift State Table¶
The scheduler checks this table for the is_drifted flag and resets the flag after a drift-triggered retraining:
# Reset operation
table.update_item(
Key={"model_name": model_name},
UpdateExpression="SET is_drifted = :val, reset_at = :ts",
...
)
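The arguments elided in the snippet above are not documented here. As an illustration only, the sketch below shows one plausible set of arguments, assuming `:val` is bound to `False` (per `is_drifted=false` in the decision flow) and `:ts` to an ISO 8601 timestamp; the helper name and the timestamp format are assumptions.

```python
from datetime import datetime, timezone
from typing import Any


def build_drift_reset_kwargs(model_name: str, now: datetime) -> dict[str, Any]:
    """Build kwargs for Table.update_item(**kwargs) (hypothetical helper)."""
    return {
        "Key": {"model_name": model_name},
        "UpdateExpression": "SET is_drifted = :val, reset_at = :ts",
        "ExpressionAttributeValues": {
            ":val": False,            # clear the drift flag
            ":ts": now.isoformat(),   # assumption: reset_at stored as ISO 8601
        },
    }


reset_kwargs = build_drift_reset_kwargs(
    "PascalStrategy", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
```

With a boto3 `Table` resource this would be applied as `table.update_item(**reset_kwargs)`.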
SQS Message Format¶
When consumed by the sqs-consumer, retraining messages use this format:
{
"model_name": "PascalStrategy",
"trigger_type": "drift_detected", # or "scheduled", "manual"
"custom_params": { ... } # Optional additional parameters
}
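A small sketch of building a message body in this format. The helper name is hypothetical, and rejecting unknown trigger types is an assumption about what a producer might reasonably do, not documented behavior.

```python
import json
from typing import Any, Optional

VALID_TRIGGER_TYPES = {"drift_detected", "scheduled", "manual"}


def build_retraining_message(
    model_name: str,
    trigger_type: str,
    custom_params: Optional[dict[str, Any]] = None,
) -> str:
    """Serialize a retraining request as a JSON message body (hypothetical helper)."""
    if trigger_type not in VALID_TRIGGER_TYPES:
        raise ValueError(f"unknown trigger_type: {trigger_type!r}")
    body: dict[str, Any] = {"model_name": model_name, "trigger_type": trigger_type}
    if custom_params:  # optional; omitted when empty
        body["custom_params"] = custom_params
    return json.dumps(body)
```

The returned string could be passed as `MessageBody` to boto3's `sqs.send_message`.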
CloudWatch Metrics¶
Namespace suffix: ModelRetraining
| Metric | Dimensions | Description |
|---|---|---|
| RetrainingTriggered | Model, Environment | 1.0 if triggered, 0.0 if skipped |
| RetrainingTrigger_{type} | Model, Environment | Per-trigger-type count |
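The metrics above could be assembled for CloudWatch's `put_metric_data` roughly as follows. The helper name and the `Count` unit are assumptions, and the full namespace is not documented here, so it is left to the caller.

```python
from typing import Any, Optional


def build_metric_data(
    model_name: str,
    environment: str,
    triggered: bool,
    trigger: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Build the MetricData list for put_metric_data (hypothetical helper)."""
    dimensions = [
        {"Name": "Model", "Value": model_name},
        {"Name": "Environment", "Value": environment},
    ]
    data: list[dict[str, Any]] = [
        {
            "MetricName": "RetrainingTriggered",
            "Dimensions": dimensions,
            "Value": 1.0 if triggered else 0.0,
            "Unit": "Count",  # assumption: unit is not documented
        }
    ]
    # The per-trigger-type metric is only emitted when retraining was triggered.
    if triggered and trigger:
        data.append(
            {
                "MetricName": f"RetrainingTrigger_{trigger}",
                "Dimensions": dimensions,
                "Value": 1.0,
                "Unit": "Count",
            }
        )
    return data
```

A caller would publish the list with `cloudwatch.put_metric_data(Namespace=namespace, MetricData=data)`, where `namespace` ends with the `ModelRetraining` suffix.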
EventBridge Schedule¶
{
"ScheduleExpression": "rate(6 hours)",
"Targets": [{
"Arn": "arn:aws:lambda:...:retraining-scheduler",
"Input": "{\"models\": [{\"name\": \"PascalStrategy\", \"strategy\": \"PascalFreqAIStrategy\"}]}"
}]
}
SNS Notification Format¶
TradAI Model Retraining Notification
Environment: prod
Model: PascalStrategy
Trigger: Drift Detected
Task ARN: arn:aws:ecs:...:task/abc123
Timestamp: 2024-01-01T12:00:00+00:00
A model retraining job has been triggered. You will receive another
notification when the training completes.
Reason: Significant drift was detected in model predictions.
See Also¶
Related Lambdas:
- drift-monitor - Detects model drift (triggers retraining)
- sqs-consumer - SQS-based retraining task launcher
- compare-models - Champion vs challenger comparison
- promote-model - Model promotion after training
Architecture:
- ML Lifecycle - Full ML training pipeline
- ML Lifecycle - Pipeline diagram
Services:
- Strategy Service - Model registry integration
CLI:
- CLI Reference - A/B testing commands