
Retraining Workflow Runbook

End-to-end operational guide for the tradai-retraining-workflow-{ENV} Step Functions state machine: what it does, how to trigger it, how to reproduce each path in AWS dev, where to look when something goes wrong, and how to onboard a new strategy.

Scope: applies to the post-#91 workflow (13 states, with UpdateRetrainingState, HandleInvalidModel, and INVALID_MODEL routing). See §9 (Troubleshooting) for the historical issues that #91 fixed.


1. Overview

The retraining workflow trains a FreqAI model, registers it in MLflow Model Registry, compares it against the champion, optionally promotes it, persists last_retrained metadata, and sends a completion notification.

trigger → CheckRetrainingNeeded → EvaluateRetrainingNeed
              ↓ needs_retraining       ↓ invalid_model              ↓ no_retraining / recently_trained
         RunRetraining (ECS Fargate)  HandleInvalidModel (Pass)    SkipRetraining
              ↓                        ↓ normalizes $.error
         CompareModels                 ↓
              ↓                        NotifyFailure
         DecidePromotion → PromoteModel / KeepCurrentModel
         UpdateRetrainingState  (writes last_retrained to DynamoDB)
         NotifyCompletion  (SNS retraining_success)

All failure-catching task states (CheckRetrainingNeeded, RunRetraining, CompareModels, PromoteModel) have Catch: States.ALL → NotifyFailure. The invalid_model choice branch does NOT go through a Catch (the decision is explicit, not an error), so HandleInvalidModel is an intermediate Pass state that synthesises $.error so NotifyFailure sees a uniform payload shape.
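
The branching described above can be modelled as a small lookup. This is only an illustrative sketch: the decision strings and state names come from this runbook, while next_state is a hypothetical helper. The real routing lives in the ASL Choice state.

```python
# Illustrative model of the EvaluateRetrainingNeed Choice state.
# State names and decision values are taken from this runbook;
# next_state() is a hypothetical helper, not project code.

def next_state(decision: str) -> str:
    """Return the state that follows EvaluateRetrainingNeed for a decision."""
    if decision == "needs_retraining":
        return "RunRetraining"
    if decision == "invalid_model":
        # Explicit branch, no Catch involved: the Pass state builds $.error
        # before control reaches NotifyFailure.
        return "HandleInvalidModel"
    # Default branch: no_retraining, recently_trained, and anything else.
    return "SkipRetraining"
```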

2. Architecture

State machine (13 states)

| State | Type | Purpose |
|---|---|---|
| NormalizeInput | Pass | Merge defaults (e.g. empty config_version_id) into input |
| CheckRetrainingNeeded | Task (Lambda) | Validate model_name; decide based on drift/schedule |
| EvaluateRetrainingNeed | Choice | Route on decision: needs_retraining, invalid_model, else |
| RunRetraining | Task (ECS runTask.sync) | Train via generic tradai-strategy-generic-{ENV} task def |
| HandleInvalidModel | Pass | Normalize Choice-state context into $.error before NotifyFailure (no Catch fires on the explicit invalid_model branch, so shape has to be synthesised) |
| SkipRetraining | Pass | Terminal success for no-op runs |
| CompareModels | Task (Lambda) | Champion vs challenger comparison |
| DecidePromotion | Choice | Route on decision: promote vs keep |
| PromoteModel | Task (Lambda) | Transition new version to Production |
| KeepCurrentModel | Pass | Continue without promotion |
| UpdateRetrainingState | Task (DynamoDB updateItem) | Write last_retrained = State.EnteredTime keyed by model_name |
| NotifyCompletion | Task (Lambda) | SNS retraining_success |
| NotifyFailure | Task (Lambda) | SNS retraining_failed with $.error payload |

Data flow

| Path | Producer | Consumer | Storage |
|---|---|---|---|
| Decision | check-retraining-needed Lambda | EvaluateRetrainingNeed Choice | DynamoDB retraining-state-{ENV}, drift-state-{ENV} |
| Model artifacts | ECS training task | MLflow Model Registry | S3 tradai-mlflow-{ENV}/artifacts/{experiment}/{run_id}/artifacts/model/ |
| Status | ECS training task | update-status Lambda | DynamoDB workflow-state-{ENV} |
| Completion | Last state | SNS topic tradai-alerts-{ENV} | SNS → subscribers |

3. Triggers

Manual — most common in dev

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "manual-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{
    "model_name": "E2ETestStrategy",
    "manual_trigger": true,
    "force": true
  }' \
  --region eu-central-1 \
  --profile tradai

config_version_id is optional; defaulted to "" by NormalizeInput.
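
The same trigger can be prepared from Python. The sketch below only builds the arguments. The ARN pattern, execution-name format and input fields mirror the CLI call above, while build_start_execution_kwargs is a hypothetical helper. Passing the result to boto3's start_execution is an assumption about how you would call it, not something the project ships.

```python
import json
from datetime import datetime, timezone

REGION = "eu-central-1"  # region used throughout this runbook


def build_start_execution_kwargs(account, env, model_name, *, manual_trigger=True,
                                 force=True, config_version_id=None,
                                 prefix="manual", now=None):
    """Build stateMachineArn/name/input for a retraining execution (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    payload = {"model_name": model_name, "manual_trigger": manual_trigger, "force": force}
    # config_version_id is optional; NormalizeInput defaults it to "" when omitted.
    if config_version_id is not None:
        payload["config_version_id"] = config_version_id
    return {
        "stateMachineArn": (
            f"arn:aws:states:{REGION}:{account}:stateMachine:"
            f"tradai-retraining-workflow-{env}"
        ),
        "name": f"{prefix}-{now:%Y%m%d-%H%M%S}",
        "input": json.dumps(payload),
    }
```

With boto3 installed, the result could be passed as `boto3.client("stepfunctions", region_name=REGION).start_execution(**kwargs)`.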

Scheduled — retraining-scheduler Lambda

Runs on EventBridge rule tradai-retraining-scheduler-schedule-{ENV}. For each model in tradai-retraining-state-{ENV}, kicks off an execution with manual_trigger=false, force=false. The workflow's own CheckRetrainingNeeded gates whether training actually happens.

Drift-based — drift-monitor Lambda

When drift-monitor writes severity=significant to tradai-drift-state-{ENV}, an EventBridge rule triggers a retraining execution.

4. Prerequisites

  • AWS access with tradai profile (SSO or IAM user)
  • Model exists in MLflow Model Registry or is being trained for the first time (onboarding — see §10)
  • Strategy Docker image pushed to ECR at ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{strategy}:latest
  • OHLCV data for the symbol/timeframe in ArcticDB (s3://tradai-arcticdb-{ENV}/{exchange}/)
  • ECS task definition tradai-strategy-generic-{ENV} deployed and active

5. Happy path — reproduction in AWS dev

# Resolve the account from your profile so the same snippet works in any environment.
export AWS_PROFILE=tradai
export ENV=dev
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# 1. Trigger
ARN=$(aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-happy-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"E2ETestStrategy","manual_trigger":true,"force":true}' \
  --region eu-central-1 --query executionArn --output text)
echo "$ARN"

# 2. Check status; re-run until it is no longer RUNNING (~3–5 min for E2ETestStrategy)
aws stepfunctions describe-execution --execution-arn "$ARN" --region eu-central-1 \
  --query '[status,startDate,stopDate]' --output text
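
Because describe-execution returns immediately, waiting for completion means polling. A minimal sketch of such a loop follows. wait_for_execution and its default intervals are hypothetical, and the describe callable is assumed to behave like boto3's describe_execution, which returns a dict with a "status" key.

```python
import time

# Terminal statuses reported by Step Functions for a standard execution.
TERMINAL = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(describe, execution_arn, poll_seconds=15.0,
                       timeout_seconds=900.0, sleep=time.sleep):
    """Poll describe(executionArn=...) until a terminal status or timeout."""
    waited = 0.0
    while True:
        status = describe(executionArn=execution_arn)["status"]
        if status in TERMINAL:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{execution_arn} still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Injecting describe and sleep keeps the loop testable without AWS access. In practice describe would be the describe_execution method of a boto3 Step Functions client.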

Expected outputs by state

| Check | Command | Expected |
|---|---|---|
| Step Functions status | describe-execution | status=SUCCEEDED |
| MLflow Registry | SSM curl http://mlflow.tradai-${ENV}.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search | E2ETestStrategy present, new version incremented, status=READY |
| S3 artifacts | aws s3 ls --summarize s3://tradai-mlflow-${ENV}/artifacts/{experiment}/{run_id}/artifacts/model/ --recursive | ~75 files: sub-train-BTC_*, metadata.json, backtesting_predictions/ |
| last_retrained persisted | aws dynamodb get-item --table-name tradai-retraining-state-${ENV} --key '{"model_name":{"S":"E2ETestStrategy"}}' | Row with last_retrained ≈ current time (ISO 8601) |
| Workflow state | aws dynamodb get-item --table-name tradai-workflow-state-${ENV} --key "{\"run_id\":{\"S\":\"$ARN\"}}" (double quotes so that $ARN expands) | status=completed, mlflow_run_id populated |
| SNS notification | SNS subscribers / CloudWatch metric tradai-dev/Notifications NotificationsSent | notification_type=retraining_success, severity=INFO |

6. Failure paths — reproduction

6.1 INVALID_MODEL — typo / injection in model_name

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-bad-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"bad name!","manual_trigger":true}' \
  --region eu-central-1

Expected:

  • CheckRetrainingNeeded returns decision=invalid_model; the reason cites the regex
  • EvaluateRetrainingNeed routes to HandleInvalidModel, which hands off to NotifyFailure (RunRetraining is never entered)
  • ECS task NEVER starts, confirmed by zero new entries in aws ecs list-tasks --family tradai-strategy-generic-${ENV}
  • SNS notification_type=retraining_failed, severity=ERROR
  • Execution status=SUCCEEDED (failure-notification is a success step by design)

Two layers of validation:

  1. Format: must match ^[A-Z][A-Za-z0-9_]{1,49}\Z (PascalCase, max 50 chars, \Z blocks trailing newlines).
  2. Allowlist: when ALLOWED_MODELS env var is set (e.g. E2ETestStrategy), only listed models are accepted. A format-valid but unknown name (e.g. BogusStrategy) returns INVALID_MODEL without starting any ECS task.
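
The two layers can be sketched as follows. The regex is the one quoted above. validate_model_name and parse_allowlist are hypothetical names, and treating ALLOWED_MODELS as a comma-separated string is an assumption based on the CLI example in §10, not a description of the Lambda's actual code.

```python
import re

# Format rule quoted in this runbook; \Z (unlike $) rejects a trailing newline.
MODEL_NAME_RE = re.compile(r"^[A-Z][A-Za-z0-9_]{1,49}\Z")


def parse_allowlist(raw):
    """Split a comma-separated ALLOWED_MODELS value; None/empty disables the check."""
    if not raw:
        return None
    names = {n.strip() for n in raw.split(",") if n.strip()}
    return names or None


def validate_model_name(name, allowed_models_env=None):
    """Return (ok, layer), where layer is 'format', 'allowlist' or 'ok'."""
    if not isinstance(name, str) or not MODEL_NAME_RE.match(name):
        return (False, "format")
    allowed = parse_allowlist(allowed_models_env)
    if allowed is not None and name not in allowed:
        return (False, "allowlist")
    return (True, "ok")
```

Note that {1,49} requires at least one character after the leading capital, so a single-letter name fails the format layer.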

6.2 recently_trained — skip path after a recent run

# Immediately after a successful happy-path run (§5), without force/manual:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-skip-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"E2ETestStrategy","manual_trigger":false,"force":false}' \
  --region eu-central-1

Expected:

  • CheckRetrainingNeeded reads tradai-retraining-state-${ENV}.last_retrained, returns decision=recently_trained
  • EvaluateRetrainingNeed routes to SkipRetraining
  • Execution status=SUCCEEDED, no ECS task started, no SNS notification sent
  • Pre-#91 this path was unreachable (the gate field was never written back; see §9)
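
The recency gate behind this path might look roughly like the sketch below. The runbook does not state the window CheckRetrainingNeeded uses, so the 7-day min_interval default is purely a placeholder. recency_gate and its bypass flag (standing in for force/manual_trigger) are hypothetical names as well.

```python
from datetime import datetime, timedelta, timezone


def recency_gate(last_retrained_iso, *, bypass=False,
                 min_interval=timedelta(days=7), now=None):
    """Return True when a run should be skipped as recently_trained.

    last_retrained_iso: ISO 8601 string from the retraining-state row, or
    None when the model has no row yet (first training).
    min_interval: ASSUMED window; the real value is not given in this runbook.
    """
    if bypass or last_retrained_iso is None:
        return False
    now = now or datetime.now(timezone.utc)
    # State.EnteredTime ends in "Z"; normalise it for datetime.fromisoformat.
    last = datetime.fromisoformat(last_retrained_iso.replace("Z", "+00:00"))
    return now - last < min_interval
```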

6.3 ECS training task exits non-zero

Any non-zero container exit or task timeout triggers RunRetraining's Catch: States.ALL → NotifyFailure. Investigate via §7.

7. Observability

Step Functions execution history

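# State names appear only on *StateEntered events (stateEnteredEventDetails.name).
aws stepfunctions get-execution-history --execution-arn "$ARN" --region eu-central-1 \
  --query 'events[?type==`TaskStateEntered` || type==`TaskSucceeded` || type==`TaskFailed`].[timestamp,type,stateEnteredEventDetails.name]' \
  --output table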

The per-task Payload in taskSucceededEventDetails.output contains the Lambda return value; useful to see what CompareModels decided.
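
To see which state each result belongs to, the history events can be folded in order. The sketch below assumes events shaped like the get-execution-history response (id, type, and the *EventDetails objects) and a workflow without Parallel or Map states, so the most recently entered state owns the next task result. summarize_tasks is a hypothetical helper.

```python
def summarize_tasks(events):
    """Pair TaskSucceeded/TaskFailed events with the most recently entered state.

    Returns a list of (state_name, outcome, detail) tuples, where detail is
    the raw task output on success or the error name on failure.
    """
    current = None
    rows = []
    for ev in sorted(events, key=lambda e: e["id"]):
        etype = ev["type"]
        if etype.endswith("StateEntered"):
            current = ev["stateEnteredEventDetails"]["name"]
        elif etype == "TaskSucceeded":
            rows.append((current, "succeeded",
                         ev["taskSucceededEventDetails"].get("output")))
        elif etype == "TaskFailed":
            rows.append((current, "failed",
                         ev["taskFailedEventDetails"].get("error")))
    return rows
```

The output field is a JSON string, so json.loads on it exposes the Lambda's Payload, for example what CompareModels decided.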

CloudWatch log groups

| Service | Log group |
|---|---|
| check-retraining-needed | /aws/lambda/tradai-check-retraining-needed-{ENV} |
| compare-models | /aws/lambda/tradai-compare-models-{ENV} |
| promote-model | /aws/lambda/tradai-promote-model-{ENV} |
| notify-completion | /aws/lambda/tradai-notify-completion-{ENV} |
| update-status | /aws/lambda/tradai-update-status-{ENV} |
| ECS training | /ecs/tradai-strategy-generic-{ENV} |
| Step Functions | /aws/states/tradai-backtest-workflow-{ENV} (shared) |

DynamoDB tables

| Table | Key | Rows per |
|---|---|---|
| tradai-workflow-state-{ENV} | run_id | Per Step Functions execution: training status + mlflow_run_id |
| tradai-retraining-state-{ENV} | model_name | Per model: last_retrained (single row per model, upserted) |
| tradai-drift-state-{ENV} | model_name | Per model: PSI, is_drifted, severity |

MLflow UI

ALB path: https://<alb-dns>/mlflow/ (not reachable from outside VPC). For read-only queries over SSM:

aws ssm send-command --instance-ids <consolidated-instance-id> \
  --document-name AWS-RunShellScript \
  --parameters 'commands=["curl -s http://mlflow.tradai-dev.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search?max_results=20"]' \
  --region eu-central-1
aws ssm get-command-invocation --command-id <from-send-command> \
  --instance-id <consolidated-instance-id> \
  --region eu-central-1

S3 artifact layout

s3://tradai-mlflow-{ENV}/
└── artifacts/
    └── {experiment_id}/           # e.g. 9
        └── {run_id}/              # 32-char hex
            └── artifacts/
                ├── model/
                │   ├── sub-train-BTC_{ts}/    # ~7 directories per run
                │   │   ├── cb_btc_{ts}_metadata.json
                │   │   └── cb_btc_{ts}_model.joblib
                │   └── backtesting_predictions/
                └── reproducibility/
                    └── manifest.json
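
For scripted checks, the model prefix in this layout can be assembled from its parts. A small sketch: model_artifact_uri is a hypothetical helper, and the lowercase-hex check on run_id is an assumption derived from the "32-char hex" note above.

```python
import re

# Assumed shape of an MLflow run id: 32 lowercase hexadecimal characters.
_RUN_ID_RE = re.compile(r"\A[0-9a-f]{32}\Z")


def model_artifact_uri(env, experiment_id, run_id):
    """Return the S3 prefix holding the model artifacts for one run."""
    if not _RUN_ID_RE.match(run_id):
        raise ValueError(f"run_id does not look like a 32-char hex id: {run_id!r}")
    return f"s3://tradai-mlflow-{env}/artifacts/{experiment_id}/{run_id}/artifacts/model/"
```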

8. Components

| Component | Source | Deployed by |
|---|---|---|
| State machine definition | infra/compute/asl_templates/retraining_workflow.json.j2 | Pulumi (infra/compute) |
| check-retraining-needed | lambdas/check-retraining-needed/ | just lambda-push-all or deploy-lambdas.yml |
| compare-models | lambdas/compare-models/ | same |
| promote-model | lambdas/promote-model/ | same |
| notify-completion | lambdas/notify-completion/ | same |
| update-status | lambdas/update-status/ | same |
| Training ECS task def | infra/compute/modules/ecs.py | Pulumi |
| Strategy image | strategies/{name}/Dockerfile | just release <strategy> <ver> in strategies repo |
| IAM: StepFunctions role | infra/compute/modules/step_functions.py::_create_execution_role | Pulumi (has dynamodb:UpdateItem on tradai-*, needed for UpdateRetrainingState) |

9. Troubleshooting

Skip path never triggers (CheckRetrainingNeeded always decides "first training")

Pre-#91 symptom. tradai-retraining-state-{ENV}.last_retrained was never written back by any post-training step, so the gate in CheckRetrainingNeeded saw the row as absent on every invocation.

Fix (post-#91). The UpdateRetrainingState state writes last_retrained = State.EnteredTime to retraining-state-{ENV}, keyed by $.model_name, right after PromoteModel/KeepCurrentModel. It is implemented as a DynamoDB SDK integration (no extra Lambda). Failures in this state route directly to NotifyCompletion with $.error set: last_retrained is best-effort metadata, not a correctness invariant.
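
A state of this kind could be expressed roughly as below, written as a Python dict that serialises to ASL JSON. This is a sketch, not the contents of retraining_workflow.json.j2. The UpdateExpression, the placeholder name :ts and the null ResultPath are assumptions. The dynamodb:updateItem resource ARN and the $$.State.EnteredTime context path are standard Step Functions features.

```python
import json


def update_retraining_state(env: str) -> dict:
    """Sketch of an UpdateRetrainingState Task (hypothetical reconstruction)."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::dynamodb:updateItem",
        "Parameters": {
            "TableName": f"tradai-retraining-state-{env}",
            "Key": {"model_name": {"S.$": "$.model_name"}},
            # ASSUMED expression; upserts the single row per model.
            "UpdateExpression": "SET last_retrained = :ts",
            "ExpressionAttributeValues": {":ts": {"S.$": "$$.State.EnteredTime"}},
        },
        # ASSUMED: discard the DynamoDB response so the state input passes through.
        "ResultPath": None,
        # Best-effort: a failed write still ends in NotifyCompletion, with $.error set.
        "Catch": [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "NotifyCompletion",
        }],
        "Next": "NotifyCompletion",
    }
```

Calling json.dumps(update_retraining_state("dev"), indent=2) yields the JSON form, with ResultPath rendered as null.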

Bad model_name silently triggers 5-minute generic-image training failure

Pre-#91 symptom. model_name was forwarded to ECS without validation; a typo or shell-injection-shaped string produced an opaque freqtrade crash deep inside the pipeline.

Fix (post-#91). CheckRetrainingNeeded rejects names that don't match ^[A-Z][A-Za-z0-9_]{1,49}\Z up-front and returns RetrainingDecision.INVALID_MODEL; EvaluateRetrainingNeed then routes through HandleInvalidModel to NotifyFailure before any ECS task starts. Additionally, when ALLOWED_MODELS is set, format-valid but unknown names are also rejected, which prevents bogus strategies from consuming Fargate compute.

MLflow SDK calls 404 behind the proxy

Pre-#91 symptom. Dev MLflow runs with --static-prefix /mlflow; the ALB only exposes /mlflow/ajax-api/*. The MLflow Python SDK hard-codes /api/2.0/mlflow/ and every SDK-mediated call returned 404 HTML.

Fix (post-#91). MLflowAdapter bypasses the SDK for Registry and artifact operations and calls raw HTTP against the discovered /ajax-api base (create_model_version, transition_model_version_stage, get_run, log_artifacts_direct). Artifact upload uses boto3 directly against run.info.artifact_uri, with a single reused S3 client for the whole walk.

ExternalServiceError masks transient network failures

Pre-#91 symptom. _request raised bare ExternalServiceError for every HTTP failure including 404. Registry's create-if-missing flow swallowed that, so a 500 / 503 / timeout on the existence probe silently triggered a duplicate create.

Fix (post-#91). MLflowClientMixin._request raises NotFoundError specifically on 404 and keeps ExternalServiceError for other 4xx/5xx and network errors. _create_model_version_via_http catches only NotFoundError before deciding to create the registered model. Transient failures now propagate.
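
The pattern can be illustrated in isolation. The two exception classes below are stand-ins that reuse the names from this section; their real hierarchy is not shown in the runbook. raise_for_status and ensure_registered_model are hypothetical helpers, not the adapter's methods.

```python
class ExternalServiceError(Exception):
    """Stand-in: any non-404 HTTP failure or network error."""


class NotFoundError(Exception):
    """Stand-in: the resource does not exist (HTTP 404)."""


def raise_for_status(status: int, url: str) -> None:
    """Map an HTTP status to the two error types described above."""
    if status == 404:
        raise NotFoundError(url)
    if status >= 400:
        raise ExternalServiceError(f"HTTP {status} from {url}")


def ensure_registered_model(name, get_model, create_model) -> bool:
    """Create the model only when the existence probe reports 404.

    Returns True if a create was issued. Any other failure (500, 503,
    timeout) propagates instead of triggering a duplicate create.
    """
    try:
        get_model(name)
        return False
    except NotFoundError:
        create_model(name)
        return True
```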

10. Onboarding a new strategy

# 1. Generate strategy skeleton (in tradai-strategies repo)
cd ../tradai-strategies
just new MyStrategy

# 2. Implement the strategy, add tests
#    (freqai-enabled; see strategies/e2e-test-strategy/ for a minimal example)

# 3. Build + push image
cd ../tradai
just build-libs   # freshens tradai wheels in dist/
cd ../tradai-strategies
docker build --build-arg USE_LOCAL_WHEELS=true \
  -t ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest \
  -f strategies/my-strategy/Dockerfile .
aws ecr get-login-password --region eu-central-1 | \
  docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com
docker push ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest

# 4. Add the strategy to ALLOWED_MODELS (if the allowlist is active)
#    In infra/compute/modules/lambda_funcs.py, add "MyStrategy" to the
#    ALLOWED_MODELS value in env_var_mapping, then deploy:
#      just infra-up-compute ${ENV}
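#    Or temporarily via CLI. Use the JSON form: the shorthand syntax does not
#    accept an unquoted comma inside a value. This call replaces ALL variables,
#    so keep the existing ones where "..." is shown:
#      aws lambda update-function-configuration \
#        --function-name tradai-check-retraining-needed-${ENV} \
#        --environment '{"Variables":{...,"ALLOWED_MODELS":"E2ETestStrategy,MyStrategy"}}' \
#        --region eu-central-1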

# 5. First training — manual_trigger bypasses schedule/drift gates;
#    CheckRetrainingNeeded validates the name + allowlist and proceeds.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "onboard-mystrategy-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"MyStrategy","manual_trigger":true,"force":true}' \
  --region eu-central-1

# 6. Verify per §5.

After the first successful run, tradai-retraining-state-{ENV} has a row for MyStrategy with last_retrained, so the scheduled and drift-based triggers can gate correctly from then on.
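
For the temporary CLI route in step 4, the new environment has to contain every existing variable, because update-function-configuration replaces the whole set. The sketch below merges a name into the allowlist and renders the JSON for --environment. Both helper names are hypothetical, and the comma-separated format of ALLOWED_MODELS is inferred from the example above.

```python
import json


def add_to_allowlist(variables: dict, model_name: str) -> dict:
    """Return a copy of the Lambda env vars with model_name in ALLOWED_MODELS."""
    updated = dict(variables)
    names = [n.strip() for n in updated.get("ALLOWED_MODELS", "").split(",") if n.strip()]
    if model_name not in names:
        names.append(model_name)
    updated["ALLOWED_MODELS"] = ",".join(names)
    return updated


def environment_argument(variables: dict) -> str:
    """Render the JSON value for `--environment`."""
    return json.dumps({"Variables": variables})
```

The current variables can be read with aws lambda get-function-configuration (the Environment.Variables field) before merging.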


References

  • Step Functions state-machine ARN: arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-{ENV}
  • ASL template source: infra/compute/asl_templates/retraining_workflow.json.j2
  • Training entrypoint: libs/tradai-common/src/tradai/common/entrypoint/training/
  • MLflow adapter: libs/tradai-common/src/tradai/common/mlflow/
  • Debug sibling runbook: Debug Workflows — generic Step Functions / DynamoDB tracing