Retraining Workflow Runbook
End-to-end operational guide for the tradai-retraining-workflow-{ENV} Step Functions state machine: what it does, how to trigger it, how to reproduce each path in AWS dev, where to look when something goes wrong, and how to onboard a new strategy.
Scope: applies to the post-#91 workflow (13 states, with UpdateRetrainingState, HandleInvalidModel, and INVALID_MODEL routing). See the Troubleshooting section for historical issues that were fixed here.
1. Overview
The retraining workflow trains a FreqAI model, registers it in MLflow Model Registry, compares it against the champion, optionally promotes it, persists last_retrained metadata, and sends a completion notification.
trigger → CheckRetrainingNeeded → EvaluateRetrainingNeed
   ↓ needs_retraining           ↓ invalid_model             ↓ no_retraining / recently_trained
RunRetraining (ECS Fargate)     HandleInvalidModel (Pass)   SkipRetraining
   ↓                            ↓ normalizes $.error
CompareModels                   ↓
   ↓                            NotifyFailure
DecidePromotion → PromoteModel / KeepCurrentModel
   ↓
UpdateRetrainingState (writes last_retrained to DynamoDB)
   ↓
NotifyCompletion (SNS retraining_success)
All failure-catching task states (CheckRetrainingNeeded, RunRetraining, CompareModels, PromoteModel) have Catch: States.ALL → NotifyFailure. The invalid_model choice branch does NOT go through a Catch (the decision is explicit, not an error), so HandleInvalidModel is an intermediate Pass state that synthesises $.error so NotifyFailure sees a uniform payload shape.
2. Architecture
State machine (13 states)
| State | Type | Purpose |
|---|---|---|
| NormalizeInput | Pass | Merge defaults (e.g. empty config_version_id) into input |
| CheckRetrainingNeeded | Task (Lambda) | Validate model_name; decide based on drift/schedule |
| EvaluateRetrainingNeed | Choice | Route on decision: needs_retraining, invalid_model, else |
| RunRetraining | Task (ECS runTask.sync) | Train via generic tradai-strategy-generic-{ENV} task def |
| HandleInvalidModel | Pass | Normalize Choice-state context into $.error before NotifyFailure (no Catch fires on the explicit invalid_model branch, so shape has to be synthesised) |
| SkipRetraining | Pass | Terminal success for no-op runs |
| CompareModels | Task (Lambda) | Champion vs challenger comparison |
| DecidePromotion | Choice | Route on decision: promote vs keep |
| PromoteModel | Task (Lambda) | Transition new version to Production |
| KeepCurrentModel | Pass | Continue without promotion |
| UpdateRetrainingState | Task (DynamoDB updateItem) | Write last_retrained = State.EnteredTime keyed by model_name |
| NotifyCompletion | Task (Lambda) | SNS retraining_success |
| NotifyFailure | Task (Lambda) | SNS retraining_failed with $.error payload |
Data flow
| Path | Producer | Consumer | Storage |
|---|---|---|---|
| Decision | check-retraining-needed Lambda | EvaluateRetrainingNeed Choice | DynamoDB tradai-retraining-state-{ENV}, tradai-drift-state-{ENV} |
| Model artifacts | ECS training task | MLflow Model Registry | S3 tradai-mlflow-{ENV}/artifacts/{experiment}/{run_id}/artifacts/model/ |
| Status | ECS training task | update-status Lambda | DynamoDB tradai-workflow-state-{ENV} |
| Completion | Last state | SNS topic tradai-alerts-{ENV} | SNS → subscribers |
3. Triggers
Manual (most common in dev)
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
--name "manual-$(date -u +%Y%m%d-%H%M%S)" \
--input '{
"model_name": "E2ETestStrategy",
"manual_trigger": true,
"force": true
}' \
--region eu-central-1 \
--profile tradai
config_version_id is optional; NormalizeInput defaults it to "" when absent.
Scheduled: retraining-scheduler Lambda
Runs on EventBridge rule tradai-retraining-scheduler-schedule-{ENV}. For each model in tradai-retraining-state-{ENV}, kicks off an execution with manual_trigger=false, force=false. The workflow's own CheckRetrainingNeeded gates whether training actually happens.
Drift-based: drift-monitor Lambda
When drift-monitor writes severity=significant to tradai-drift-state-{ENV}, an EventBridge rule triggers a retraining execution.
4. Prerequisites
- AWS access with tradai profile (SSO or IAM user)
- Model exists in MLflow Model Registry or is being trained for the first time (onboarding, see §10)
- Strategy Docker image pushed to ECR at ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{strategy}:latest
- OHLCV data for the symbol/timeframe in ArcticDB (s3://tradai-arcticdb-{ENV}/{exchange}/)
- ECS task definition tradai-strategy-generic-{ENV} deployed and active
5. Happy path: reproduction in AWS dev
# Resolve the account from your profile so the same snippet works in any environment.
export AWS_PROFILE=tradai
export ENV=dev
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# 1. Trigger
ARN=$(aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
--name "doc-happy-$(date -u +%Y%m%d-%H%M%S)" \
--input '{"model_name":"E2ETestStrategy","manual_trigger":true,"force":true}' \
--region eu-central-1 --query executionArn --output text)
echo "$ARN"
# 2. Wait for completion (~3–5 min for E2ETestStrategy)
aws stepfunctions describe-execution --execution-arn "$ARN" --region eu-central-1 \
--query '[status,startDate,stopDate]' --output text
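The describe-execution call above reports the status once. To actually block until the run finishes, a small polling helper can wrap it. The following is a minimal sketch, not part of the repository: the function name wait_for_execution, the 15-second interval, and the 60-poll cap are all assumptions chosen for illustration.

```shell
# wait_for_execution EXECUTION_ARN [INTERVAL_SECONDS] [MAX_POLLS]
# Hypothetical helper: polls describe-execution until the execution leaves
# a non-terminal status, prints the final status, and returns 0 only for
# SUCCEEDED. Defaults (15 s, 60 polls) are illustrative assumptions.
wait_for_execution() {
  arn=$1
  interval=${2:-15}
  max_polls=${3:-60}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    exec_status=$(aws stepfunctions describe-execution \
      --execution-arn "$arn" --region eu-central-1 \
      --query status --output text) || return 2
    case $exec_status in
      RUNNING|PENDING_REDRIVE)
        sleep "$interval" ;;
      *)
        # SUCCEEDED, FAILED, TIMED_OUT or ABORTED
        echo "$exec_status"
        if [ "$exec_status" = SUCCEEDED ]; then return 0; fi
        return 1 ;;
    esac
    i=$((i + 1))
  done
  echo "GAVE_UP_WAITING"
  return 3
}
```

Usage would be `wait_for_execution "$ARN"` right after the trigger step. Remember that the INVALID_MODEL and skip paths also end in SUCCEEDED, so a zero exit code alone does not prove that training ran.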
Expected outputs by state
| Check | Command | Expected |
|---|---|---|
| Step Functions status | describe-execution | status=SUCCEEDED |
| MLflow Registry | SSM curl http://mlflow.tradai-${ENV}.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search | E2ETestStrategy present, new version incremented, status=READY |
| S3 artifacts | aws s3 ls --summarize s3://tradai-mlflow-${ENV}/artifacts/{experiment}/{run_id}/artifacts/model/ --recursive | ~75 files: sub-train-BTC_*, metadata.json, backtesting_predictions/ |
| last_retrained persisted | aws dynamodb get-item --table-name tradai-retraining-state-${ENV} --key '{"model_name":{"S":"E2ETestStrategy"}}' | Row with last_retrained ≈ current time (ISO 8601) |
| Workflow state | aws dynamodb get-item --table-name tradai-workflow-state-${ENV} --key "{\"run_id\":{\"S\":\"$ARN\"}}" (double quotes so that $ARN expands) | status=completed, mlflow_run_id populated |
| SNS notification | SNS subscribers / CloudWatch metric tradai-dev/Notifications NotificationsSent | notification_type=retraining_success, severity=INFO |
6. Failure paths: reproduction
6.1 INVALID_MODEL: typo / injection in model_name
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
--name "doc-bad-$(date -u +%Y%m%d-%H%M%S)" \
--input '{"model_name":"bad name!","manual_trigger":true}' \
--region eu-central-1
Expected:
- CheckRetrainingNeeded returns decision=invalid_model; reason cites the regex
- EvaluateRetrainingNeed routes to HandleInvalidModel and then NotifyFailure (not RunRetraining)
- ECS task NEVER starts, confirmed by zero new entries in aws ecs list-tasks --family tradai-strategy-generic-${ENV}
- SNS notification_type=retraining_failed, severity=ERROR
- Execution status=SUCCEEDED (the failure notification is a successful step by design)
Two layers of validation:
- Format: must match ^[A-Z][A-Za-z0-9_]{1,49}\Z (PascalCase, 2 to 50 characters; \Z blocks trailing newlines).
- Allowlist: when the ALLOWED_MODELS env var is set (e.g. E2ETestStrategy), only listed models are accepted. A format-valid but unknown name (e.g. BogusStrategy) returns INVALID_MODEL without starting any ECS task.
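To catch a malformed name before starting an execution, the format rule can be approximated locally. This is a sketch under assumptions: is_valid_model_name is a hypothetical helper, it mirrors only the format layer (not the ALLOWED_MODELS allowlist), and the Lambda remains the source of truth. It uses a length check plus shell patterns instead of the Python-style \Z anchor, which plain POSIX shell does not have.

```shell
# is_valid_model_name NAME
# Hypothetical local pre-flight check approximating the format rule
# ^[A-Z][A-Za-z0-9_]{1,49}\Z. Returns 0 if NAME passes, 1 otherwise.
# The body runs in a subshell so LC_ALL does not leak to the caller.
is_valid_model_name() (
  LC_ALL=C   # byte-wise ranges, so [A-Z] cannot match lowercase letters
  name=$1
  # One leading letter plus 1..49 further characters: 2..50 in total.
  [ "${#name}" -ge 2 ] && [ "${#name}" -le 50 ] || exit 1
  case $name in
    [!A-Z]*) exit 1 ;;           # must start with an uppercase ASCII letter
    *[!A-Za-z0-9_]*) exit 1 ;;   # rejects spaces, "!", newlines, etc.
  esac
  exit 0
)
```

For example, `is_valid_model_name "E2ETestStrategy"` succeeds, while `is_valid_model_name "bad name!"` fails because of the space and the exclamation mark.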
6.2 recently_trained: skip path after a recent run
# Immediately after a successful happy-path run (§5), without force/manual:
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
--name "doc-skip-$(date -u +%Y%m%d-%H%M%S)" \
--input '{"model_name":"E2ETestStrategy","manual_trigger":false,"force":false}' \
--region eu-central-1
Expected:
- CheckRetrainingNeeded reads tradai-retraining-state-${ENV}.last_retrained, returns decision=recently_trained
- EvaluateRetrainingNeed routes to SkipRetraining
- Execution status=SUCCEEDED, no ECS task started, no SNS notification sent
- Pre-#91 this path was unreachable (the gate field was never written back; see §9)
6.3 ECS training task exits non-zero
Any non-zero exit from the container, or a task timeout, triggers RunRetraining's Catch: States.ALL → NotifyFailure. Investigate via §7.
7. Observability
Step Functions execution history
aws stepfunctions get-execution-history --execution-arn "$ARN" --region eu-central-1 \
--query 'events[?type==`TaskSucceeded` || type==`TaskFailed`].[timestamp,name,type]' \
--output table
The per-task Payload in taskSucceededEventDetails.output contains the Lambda return value; useful to see what CompareModels decided.
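To pull the output of one named state rather than scanning the whole table, the history can be filtered on state-exit events, which carry the state name in stateExitedEventDetails.name. The helper below is a sketch: state_output is a hypothetical name, and it assumes the state of interest exited normally, so an exit event exists for it.

```shell
# state_output EXECUTION_ARN STATE_NAME
# Hypothetical helper: prints the output JSON of the first exit event
# recorded for STATE_NAME, or "None" if no such event exists.
state_output() {
  aws stepfunctions get-execution-history \
    --execution-arn "$1" --region eu-central-1 \
    --query "events[?stateExitedEventDetails.name=='$2'].stateExitedEventDetails.output | [0]" \
    --output text
}
```

For example, `state_output "$ARN" CompareModels` would show what the comparison returned. With --output text the CLI applies --query to each result page separately, so for unusually long histories --output json is the safer choice.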
CloudWatch log groups
| Service | Log group |
|---|---|
| check-retraining-needed | /aws/lambda/tradai-check-retraining-needed-{ENV} |
| compare-models | /aws/lambda/tradai-compare-models-{ENV} |
| promote-model | /aws/lambda/tradai-promote-model-{ENV} |
| notify-completion | /aws/lambda/tradai-notify-completion-{ENV} |
| update-status | /aws/lambda/tradai-update-status-{ENV} |
| ECS training | /ecs/tradai-strategy-generic-{ENV} |
| Step Functions | /aws/states/tradai-backtest-workflow-{ENV} (shared) |
DynamoDB tables
| Table | Key | Granularity and contents |
|---|---|---|
| tradai-workflow-state-{ENV} | run_id | One row per Step Functions execution: training status + mlflow_run_id |
| tradai-retraining-state-{ENV} | model_name | One row per model (upserted): last_retrained |
| tradai-drift-state-{ENV} | model_name | One row per model: PSI, is_drifted, severity |
MLflow UI
ALB path: https://<alb-dns>/mlflow/ (not reachable from outside VPC). For read-only queries over SSM:
aws ssm send-command --instance-ids <consolidated-instance-id> \
--document-name AWS-RunShellScript \
--parameters 'commands=["curl -s http://mlflow.tradai-dev.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search?max_results=20"]' \
--region eu-central-1
aws ssm get-command-invocation --command-id <from-send-command> \
--instance-id <consolidated-instance-id>
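The two SSM calls above can be chained so that the command ID does not have to be copied by hand. This is a sketch only: mlflow_get is a hypothetical helper name, and it assumes the URL contains no single or double quotes, since the URL is embedded in the remote curl command.

```shell
# mlflow_get INSTANCE_ID URL
# Hypothetical helper: runs "curl -s URL" on the instance via SSM, waits
# for the command to finish, and prints its standard output.
mlflow_get() {
  instance_id=$1
  url=$2
  command_id=$(aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name AWS-RunShellScript \
    --parameters "commands=[\"curl -s '$url'\"]" \
    --region eu-central-1 \
    --query Command.CommandId --output text) || return 1
  # Blocks until the invocation reaches a terminal state.
  aws ssm wait command-executed \
    --command-id "$command_id" --instance-id "$instance_id" \
    --region eu-central-1 || return 1
  aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --region eu-central-1 \
    --query StandardOutputContent --output text
}
```

It would be called with the consolidated instance ID and the same registered-models/search URL used above.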
S3 artifact layout
s3://tradai-mlflow-{ENV}/
└── artifacts/
    └── {experiment_id}/                      # e.g. 9
        └── {run_id}/                         # 32-char hex
            └── artifacts/
                ├── model/
                │   ├── sub-train-BTC_{ts}/   # ~7 directories per run
                │   │   ├── cb_btc_{ts}_metadata.json
                │   │   └── cb_btc_{ts}_model.joblib
                │   └── backtesting_predictions/
                └── reproducibility/
                    └── manifest.json
8. Components
| Component | Source | Deployed by |
|---|---|---|
| State machine definition | infra/compute/asl_templates/retraining_workflow.json.j2 | Pulumi (infra/compute) |
| check-retraining-needed | lambdas/check-retraining-needed/ | just lambda-push-all or deploy-lambdas.yml |
| compare-models | lambdas/compare-models/ | same |
| promote-model | lambdas/promote-model/ | same |
| notify-completion | lambdas/notify-completion/ | same |
| update-status | lambdas/update-status/ | same |
| Training ECS task def | infra/compute/modules/ecs.py | Pulumi |
| Strategy image | strategies/{name}/Dockerfile | just release <strategy> <ver> in strategies repo |
| IAM: StepFunctions role | infra/compute/modules/step_functions.py::_create_execution_role | Pulumi (has dynamodb:UpdateItem on tradai-*, needed for UpdateRetrainingState) |
9. Troubleshooting
Skip path never triggers (CheckRetrainingNeeded always decides "first training")
Pre-#91 symptom. tradai-retraining-state-{ENV}.last_retrained was never written back by any post-training step, so the gate in CheckRetrainingNeeded saw the row as absent on every invocation.
Fix (post-#91). The UpdateRetrainingState state writes last_retrained = State.EnteredTime to tradai-retraining-state-{ENV}, keyed by $.model_name, right after PromoteModel/KeepCurrentModel. It is implemented as a DynamoDB SDK integration (no extra Lambda). Failures in this state route directly to NotifyCompletion with $.error, because last_retrained is best-effort metadata, not a correctness invariant.
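For intuition, the write performed by UpdateRetrainingState corresponds roughly to the CLI call sketched below. This is an illustration, not the state's actual definition: set_last_retrained is a hypothetical helper, the exact update expression in the ASL template is an assumption, and the timestamp format only approximates State.EnteredTime. It could also serve to seed the row by hand when reproducing the skip path in dev.

```shell
# set_last_retrained MODEL_NAME [ISO8601_TIMESTAMP]
# Hypothetical helper: upserts last_retrained for MODEL_NAME in the
# retraining-state table. Requires ENV to be set (e.g. ENV=dev).
set_last_retrained() {
  model=$1
  ts=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}   # defaults to the current UTC time
  aws dynamodb update-item \
    --table-name "tradai-retraining-state-${ENV:?set ENV first}" \
    --key "{\"model_name\":{\"S\":\"$model\"}}" \
    --update-expression 'SET last_retrained = :ts' \
    --expression-attribute-values "{\":ts\":{\"S\":\"$ts\"}}" \
    --region eu-central-1
}
```

Because update-item creates the item when the key does not exist yet, the same call covers both the first training and later refreshes, which matches the single upserted row per model.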
Bad model_name silently triggers 5-minute generic-image training failure
Pre-#91 symptom. model_name was forwarded to ECS without validation; a typo or shell-injection-shaped string produced an opaque freqtrade crash deep inside the pipeline.
Fix (post-#91). CheckRetrainingNeeded rejects names that don't match ^[A-Z][A-Za-z0-9_]{1,49}\Z up-front, returns RetrainingDecision.INVALID_MODEL, and EvaluateRetrainingNeed routes to NotifyFailure before any ECS task starts. Additionally, when ALLOWED_MODELS is set, format-valid but unknown names are also rejected — preventing bogus strategies from consuming Fargate compute.
MLflow SDK calls 404 behind the proxy
Pre-#91 symptom. Dev MLflow runs with --static-prefix /mlflow; the ALB only exposes /mlflow/ajax-api/*. The MLflow Python SDK hard-codes /api/2.0/mlflow/ and every SDK-mediated call returned 404 HTML.
Fix (post-#91). MLflowAdapter bypasses the SDK for Registry and artifact operations and calls raw HTTP against the discovered /ajax-api base (create_model_version, transition_model_version_stage, get_run, log_artifacts_direct). Artifact upload uses boto3 directly against run.info.artifact_uri, with a single reused S3 client for the whole walk.
ExternalServiceError masks transient network failures
Pre-#91 symptom. _request raised bare ExternalServiceError for every HTTP failure including 404. Registry's create-if-missing flow swallowed that, so a 500 / 503 / timeout on the existence probe silently triggered a duplicate create.
Fix (post-#91). MLflowClientMixin._request raises NotFoundError specifically on 404 and keeps ExternalServiceError for other 4xx/5xx and network errors. _create_model_version_via_http catches only NotFoundError before deciding to create the registered model. Transient failures now propagate.
10. Onboarding a new strategy
# 1. Generate strategy skeleton (in tradai-strategies repo)
cd ../tradai-strategies
just new MyStrategy
# 2. Implement the strategy, add tests
# (freqai-enabled; see strategies/e2e-test-strategy/ for a minimal example)
# 3. Build + push image
cd ../tradai
just build-libs # freshens tradai wheels in dist/
cd ../tradai-strategies
docker build --build-arg USE_LOCAL_WHEELS=true \
-t ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest \
-f strategies/my-strategy/Dockerfile .
aws ecr get-login-password --region eu-central-1 | \
docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com
docker push ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest
# 4. Add the strategy to ALLOWED_MODELS (if the allowlist is active)
# In infra/compute/modules/lambda_funcs.py, add "MyStrategy" to the
# ALLOWED_MODELS value in env_var_mapping, then deploy:
# just infra-up-compute ${ENV}
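# Or temporarily via CLI. Note that --environment replaces the whole
# variable set, so keep the existing variables in place of "...". JSON form
# is used because the comma in the value would otherwise be parsed as a
# shorthand separator:
# aws lambda update-function-configuration \
# --function-name tradai-check-retraining-needed-${ENV} \
# --environment '{"Variables":{...,"ALLOWED_MODELS":"E2ETestStrategy,MyStrategy"}}' \
# --region eu-central-1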
# 5. First training — manual_trigger bypasses schedule/drift gates;
# CheckRetrainingNeeded validates the name + allowlist and proceeds.
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
--name "onboard-mystrategy-$(date -u +%Y%m%d-%H%M%S)" \
--input '{"model_name":"MyStrategy","manual_trigger":true,"force":true}' \
--region eu-central-1
# 6. Verify per §5.
After the first successful run, tradai-retraining-state-{ENV} has a row for MyStrategy with last_retrained, so the scheduled and drift-based triggers can gate correctly from then on.
References
- Step Functions state-machine ARN: arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-{ENV}
- ASL template source: infra/compute/asl_templates/retraining_workflow.json.j2
- Training entrypoint: libs/tradai-common/src/tradai/common/entrypoint/training/
- MLflow adapter: libs/tradai-common/src/tradai/common/mlflow/
- Debug sibling runbook: Debug Workflows (generic Step Functions / DynamoDB tracing)