Retraining Workflow Runbook¶

End-to-end operational guide for the tradai-retraining-workflow-{ENV} Step Functions state machine: what it does, how to trigger it, how to reproduce each path in AWS dev, where to look when something goes wrong, and how to onboard a new strategy.

Scope: applies to the post-#91 workflow (13 states, with UpdateRetrainingState, HandleInvalidModel, and INVALID_MODEL routing). See §9 (Troubleshooting) for the pre-#91 issues that this version fixes.

1. Overview¶

The retraining workflow trains a FreqAI model, registers it in MLflow Model Registry, compares it against the champion, optionally promotes it, persists last_retrained metadata, and sends a completion notification.

trigger → NormalizeInput → CheckRetrainingNeeded → EvaluateRetrainingNeed
              ↓ needs_retraining       ↓ invalid_model              ↓ no_retraining / recently_trained
         RunRetraining (ECS Fargate)  HandleInvalidModel (Pass)    SkipRetraining
              ↓                        ↓ normalizes $.error
         CompareModels                 ↓
              ↓                        NotifyFailure
         DecidePromotion → PromoteModel / KeepCurrentModel
              ↓
         UpdateRetrainingState  (writes last_retrained to DynamoDB)
              ↓
         NotifyCompletion  (SNS retraining_success)

The task states CheckRetrainingNeeded, RunRetraining, CompareModels, and PromoteModel each have Catch: States.ALL → NotifyFailure. The invalid_model choice branch does NOT go through a Catch (the decision is explicit, not an error), so HandleInvalidModel is an intermediate Pass state that synthesises $.error. That way NotifyFailure sees the same payload shape on both routes.

2. Architecture¶

State machine (13 states)¶

State	Type	Purpose
NormalizeInput	Pass	Merge defaults (e.g. empty `config_version_id`) into input
CheckRetrainingNeeded	Task (Lambda)	Validate model_name; decide based on drift/schedule
EvaluateRetrainingNeed	Choice	Route on `decision`: `needs_retraining`, `invalid_model`, else
RunRetraining	Task (ECS runTask.sync)	Train via generic `tradai-strategy-generic-{ENV}` task def
HandleInvalidModel	Pass	Normalize Choice-state context into `$.error` before `NotifyFailure` (no Catch fires on the explicit `invalid_model` branch, so shape has to be synthesised)
SkipRetraining	Pass	Terminal success for no-op runs
CompareModels	Task (Lambda)	Champion vs challenger comparison
DecidePromotion	Choice	Route on `decision`: `promote` vs keep
PromoteModel	Task (Lambda)	Transition new version to Production
KeepCurrentModel	Pass	Continue without promotion
UpdateRetrainingState	Task (DynamoDB updateItem)	Write `last_retrained = State.EnteredTime` keyed by `model_name`
NotifyCompletion	Task (Lambda)	SNS `retraining_success`
NotifyFailure	Task (Lambda)	SNS `retraining_failed` with `$.error` payload

Data flow¶

Path	Producer	Consumer	Storage
Decision	`check-retraining-needed` Lambda	`EvaluateRetrainingNeed` Choice	DynamoDB `tradai-retraining-state-{ENV}`, `tradai-drift-state-{ENV}`
Model artifacts	ECS training task	MLflow Model Registry	S3 `tradai-mlflow-{ENV}/artifacts/{experiment}/{run_id}/artifacts/model/`
Status	ECS training task	`update-status` Lambda	DynamoDB `tradai-workflow-state-{ENV}`
Completion	Last state	SNS topic `tradai-alerts-{ENV}`	SNS → subscribers

3. Triggers¶

Manual — most common in dev¶

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "manual-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{
    "model_name": "E2ETestStrategy",
    "manual_trigger": true,
    "force": true
  }' \
  --region eu-central-1 \
  --profile tradai

config_version_id is optional; defaulted to "" by NormalizeInput.
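The manual trigger above can be wrapped in a small helper so the two flags always reach the state machine as JSON booleans rather than strings. This is a minimal sketch: `retraining_input` and `start_retraining` are hypothetical names (not part of the repo), `ACCOUNT` and `ENV` are assumed to be exported as in §5, and the model name is assumed to already satisfy the format rule in §6.1.

```bash
# Hypothetical helper: print the execution input JSON for one model.
# $1 = model_name, $2 = manual_trigger (default true), $3 = force (default false)
retraining_input() {
  local model="$1" manual="${2:-true}" force="${3:-false}"
  # Only literal JSON booleans are accepted for the two flags.
  case "$manual:$force" in
    true:true|true:false|false:true|false:false) ;;
    *) echo "manual_trigger and force must be true or false" >&2; return 1 ;;
  esac
  printf '{"model_name":"%s","manual_trigger":%s,"force":%s}\n' "$model" "$manual" "$force"
}

# Hypothetical helper: start an execution and print its ARN.
start_retraining() {
  local input
  input=$(retraining_input "$@") || return 1
  aws stepfunctions start-execution \
    --state-machine-arn "arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV}" \
    --name "manual-$(date -u +%Y%m%d-%H%M%S)" \
    --input "$input" \
    --region eu-central-1 --query executionArn --output text
}
```

For example, `ARN=$(start_retraining E2ETestStrategy true true)` should behave like the command above while also capturing the execution ARN for later checks.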

Scheduled — `retraining-scheduler` Lambda¶

Runs on EventBridge rule tradai-retraining-scheduler-schedule-{ENV}. For each model in tradai-retraining-state-{ENV}, kicks off an execution with manual_trigger=false, force=false. The workflow's own CheckRetrainingNeeded gates whether training actually happens.

Drift-based — `drift-monitor` Lambda¶

When drift-monitor writes severity=significant to tradai-drift-state-{ENV}, an EventBridge rule triggers a retraining execution.

4. Prerequisites¶

AWS access with tradai profile (SSO or IAM user)
Model exists in MLflow Model Registry or is being trained for the first time (onboarding — see §10)
Strategy Docker image pushed to ECR at ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/{strategy}:latest
OHLCV data for the symbol/timeframe in ArcticDB (s3://tradai-arcticdb-{ENV}/{exchange}/)
ECS task definition tradai-strategy-generic-{ENV} deployed and active

5. Happy path — reproduction in AWS dev¶

# Resolve the account from your profile so the same snippet works in any environment.
export AWS_PROFILE=tradai
export ENV=dev
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# 1. Trigger
ARN=$(aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-happy-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"E2ETestStrategy","manual_trigger":true,"force":true}' \
  --region eu-central-1 --query executionArn --output text)
echo "$ARN"

# 2. Check status; re-run until it is no longer RUNNING (~3–5 min for E2ETestStrategy)
aws stepfunctions describe-execution --execution-arn "$ARN" --region eu-central-1 \
  --query '[status,startDate,stopDate]' --output text
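Because `describe-execution` returns immediately, a polling loop is handy when the later checks should only run after the execution has finished. The sketch below is one way to do that; `wait_for_execution` is a hypothetical helper name and the 15-second default interval is an arbitrary choice.

```bash
# Hypothetical helper: poll an execution until it leaves RUNNING.
# $1 = execution ARN, $2 = poll interval in seconds (default 15)
wait_for_execution() {
  local arn="$1" interval="${2:-15}" status
  while :; do
    status=$(aws stepfunctions describe-execution --execution-arn "$arn" \
      --region eu-central-1 --query status --output text) || return 2
    [ "$status" != "RUNNING" ] && break
    sleep "$interval"
  done
  echo "$status"
  # Exit code 0 only for SUCCEEDED, so the call can gate later checks.
  [ "$status" = "SUCCEEDED" ]
}
```

Usage: `wait_for_execution "$ARN" && echo "ready to verify"`. The final status is printed either way, so a FAILED or TIMED_OUT run is visible in the terminal.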

Expected outputs by state¶

Check	Command	Expected
Step Functions status	`describe-execution`	`status=SUCCEEDED`
MLflow Registry	SSM `curl http://mlflow.tradai-${ENV}.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search`	`E2ETestStrategy` present, new version incremented, `status=READY`
S3 artifacts	`aws s3 ls --summarize s3://tradai-mlflow-${ENV}/artifacts/{experiment}/{run_id}/artifacts/model/ --recursive`	~75 files: sub-train-BTC_*, metadata.json, backtesting_predictions/
`last_retrained` persisted	`aws dynamodb get-item --table-name tradai-retraining-state-${ENV} --key '{"model_name":{"S":"E2ETestStrategy"}}'`	Row with `last_retrained` ≈ current time (ISO 8601)
Workflow state	`aws dynamodb get-item --table-name tradai-workflow-state-${ENV} --key "{\"run_id\":{\"S\":\"$ARN\"}}"`	`status=completed`, `mlflow_run_id` populated
SNS notification	SNS subscribers / CloudWatch metric `tradai-dev/Notifications NotificationsSent`	`notification_type=retraining_success`, `severity=INFO`

6. Failure paths — reproduction¶

6.1 `INVALID_MODEL` — typo / injection in model_name¶

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-bad-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"bad name!","manual_trigger":true}' \
  --region eu-central-1

Expected:
- CheckRetrainingNeeded returns decision=invalid_model; the reason cites the regex
- EvaluateRetrainingNeed routes to HandleInvalidModel and then NotifyFailure (not RunRetraining)
- ECS task NEVER starts, confirmed by zero new entries in aws ecs list-tasks --family tradai-strategy-generic-${ENV}
- SNS notification_type=retraining_failed, severity=ERROR
- Execution status=SUCCEEDED (failure-notification is a success step by design)

Two layers of validation:

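Format: must match ^[A-Z][A-Za-z0-9_]{1,49}\Z (PascalCase-style: an uppercase first letter followed by letters, digits, or underscores; 2 to 50 characters in total; \Z rejects a trailing newline that $ would otherwise accept).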
Allowlist: when ALLOWED_MODELS env var is set (e.g. E2ETestStrategy), only listed models are accepted. A format-valid but unknown name (e.g. BogusStrategy) returns INVALID_MODEL without starting any ECS task.
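The two layers can be mirrored in a local pre-flight check, so an obviously bad name is caught before an execution is started. This is a sketch only: `is_valid_model_name` is a hypothetical helper, it re-expresses the regex with shell patterns rather than calling the Lambda's code, and it assumes ALLOWED_MODELS is a comma-separated list.

```bash
# Hypothetical pre-flight check mirroring the two validation layers.
# $1 = model name, $2 = comma-separated allowlist (empty = allowlist disabled)
is_valid_model_name() {
  local LC_ALL=C   # keep the character ranges below ASCII-only
  local name="$1" allowed="${2:-}"
  # Layer 1: format. 2 to 50 characters, first one an uppercase letter,
  # the rest letters, digits or underscores (a trailing newline fails too).
  [ "${#name}" -ge 2 ] && [ "${#name}" -le 50 ] || return 1
  case $name in
    [A-Z]*) ;;
    *) return 1 ;;
  esac
  case $name in
    *[!A-Za-z0-9_]*) return 1 ;;
  esac
  # Layer 2: allowlist, only when one is configured.
  [ -z "$allowed" ] && return 0
  case ",$allowed," in
    *",$name,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With an allowlist of `E2ETestStrategy`, the call `is_valid_model_name BogusStrategy E2ETestStrategy` fails even though the name is format-valid, which matches the behaviour described above.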

6.2 `recently_trained` — skip path after a recent run¶

# Immediately after a successful happy-path run (§5), without force/manual:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "doc-skip-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"E2ETestStrategy","manual_trigger":false,"force":false}' \
  --region eu-central-1

Expected:
- CheckRetrainingNeeded reads tradai-retraining-state-${ENV}.last_retrained, returns decision=recently_trained
- EvaluateRetrainingNeed routes to SkipRetraining
- Execution status=SUCCEEDED, no ECS task started, no SNS notification sent
- Pre-#91 this path was unreachable (the gate field was never written back; see §9)

6.3 ECS training task exits non-zero¶

Any non-zero exit from the container, or a task timeout, hits RunRetraining's Catch: States.ALL → NotifyFailure. Dig in via §7.

7. Observability¶

Step Functions execution history¶

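# History events carry no top-level name; the state name is only present on *StateEntered events.
aws stepfunctions get-execution-history --execution-arn "$ARN" --region eu-central-1 \
  --query "events[?type=='TaskStateEntered' || type=='TaskSucceeded' || type=='TaskFailed'].[timestamp,type,stateEnteredEventDetails.name]" \
  --output table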

The per-task Payload in taskSucceededEventDetails.output contains the Lambda return value; useful to see what CompareModels decided.
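To confirm which route an execution actually took (for example, that an `invalid_model` run never entered RunRetraining), the entered states can be listed from the same history. A minimal sketch; `entered_states` and `execution_entered` are hypothetical helper names.

```bash
# Hypothetical helper: print the name of every state the execution entered, one per line.
# $1 = execution ARN
entered_states() {
  aws stepfunctions get-execution-history --execution-arn "$1" --region eu-central-1 \
    --query "events[?ends_with(type, 'StateEntered')].stateEnteredEventDetails.name" \
    --output text | tr '\t' '\n'
}

# Hypothetical helper: succeed if the execution entered the given state.
# $1 = execution ARN, $2 = state name
execution_entered() {
  entered_states "$1" | grep -Fqx -- "$2"
}
```

Usage: `execution_entered "$ARN" RunRetraining || echo "no training task was started"`.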

CloudWatch log groups¶

Service	Log group
check-retraining-needed	`/aws/lambda/tradai-check-retraining-needed-{ENV}`
compare-models	`/aws/lambda/tradai-compare-models-{ENV}`
promote-model	`/aws/lambda/tradai-promote-model-{ENV}`
notify-completion	`/aws/lambda/tradai-notify-completion-{ENV}`
update-status	`/aws/lambda/tradai-update-status-{ENV}`
ECS training	`/ecs/tradai-strategy-generic-{ENV}`
Step Functions	`/aws/states/tradai-backtest-workflow-{ENV}` (shared)

DynamoDB tables¶

Table	Key	Rows per
`tradai-workflow-state-{ENV}`	`run_id`	Per Step Functions execution — training status + `mlflow_run_id`
`tradai-retraining-state-{ENV}`	`model_name`	Per model — `last_retrained` (single row per model, upserted)
`tradai-drift-state-{ENV}`	`model_name`	Per model — PSI, `is_drifted`, `severity`

MLflow UI¶

ALB path: https://<alb-dns>/mlflow/ (not reachable from outside VPC). For read-only queries over SSM:

aws ssm send-command --instance-ids <consolidated-instance-id> \
  --document-name AWS-RunShellScript \
  --parameters 'commands=["curl -s http://mlflow.tradai-dev.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search?max_results=20"]' \
  --region eu-central-1
aws ssm get-command-invocation --command-id <from-send-command> \
  --instance-id <consolidated-instance-id>
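The two SSM calls can be chained so the command ID does not have to be copied by hand. The following is a sketch under a few assumptions: `ssm_curl` is a hypothetical helper, the parameters are passed as JSON instead of the shorthand form, and the URL is single-quoted for the remote shell.

```bash
# Hypothetical helper: run one curl on the instance via SSM and print its stdout.
# $1 = instance id, $2 = URL reachable from inside the VPC
ssm_curl() {
  local instance="$1" url="$2" cmd_id
  cmd_id=$(aws ssm send-command --instance-ids "$instance" \
    --document-name AWS-RunShellScript \
    --parameters "{\"commands\":[\"curl -s '$url'\"]}" \
    --region eu-central-1 --query Command.CommandId --output text) || return 1
  # Block until the command has finished; the waiter fails if the command failed.
  aws ssm wait command-executed --command-id "$cmd_id" \
    --instance-id "$instance" --region eu-central-1 || return 1
  aws ssm get-command-invocation --command-id "$cmd_id" \
    --instance-id "$instance" --region eu-central-1 \
    --query StandardOutputContent --output text
}
```

Usage: `ssm_curl <consolidated-instance-id> "http://mlflow.tradai-dev.local:5000/mlflow/ajax-api/2.0/mlflow/registered-models/search?max_results=20"`.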

S3 artifact layout¶

s3://tradai-mlflow-{ENV}/
└── artifacts/
    └── {experiment_id}/           # e.g. 9
        └── {run_id}/              # 32-char hex
            └── artifacts/
                ├── model/
                │   ├── sub-train-BTC_{ts}/    # ~7 directories per run
                │   │   ├── cb_btc_{ts}_metadata.json
                │   │   └── cb_btc_{ts}_model.joblib
                │   └── backtesting_predictions/
                └── reproducibility/
                    └── manifest.json
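Given this layout, the model prefix for a run can be derived from the environment, experiment ID, and run ID, and the object count compared with the roughly 75 files expected in §5. A small sketch; both function names are hypothetical, and the count is read from the `Total Objects:` line that `aws s3 ls --summarize` prints.

```bash
# Hypothetical helper: build the model artifact prefix.
# $1 = ENV, $2 = experiment id, $3 = MLflow run id
model_artifact_prefix() {
  printf 's3://tradai-mlflow-%s/artifacts/%s/%s/artifacts/model/\n' "$1" "$2" "$3"
}

# Hypothetical helper: print the number of objects under the model prefix.
count_model_files() {
  aws s3 ls --recursive --summarize "$(model_artifact_prefix "$@")" \
    | awk '/Total Objects:/ {print $3}'
}
```

Usage: `count_model_files "$ENV" 9 "$RUN_ID"`, where `RUN_ID` is the `mlflow_run_id` recorded in the workflow-state table.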

8. Components¶

Component	Source	Deployed by
State machine definition	`infra/compute/asl_templates/retraining_workflow.json.j2`	Pulumi (`infra/compute`)
check-retraining-needed	`lambdas/check-retraining-needed/`	`just lambda-push-all` or `deploy-lambdas.yml`
compare-models	`lambdas/compare-models/`	same
promote-model	`lambdas/promote-model/`	same
notify-completion	`lambdas/notify-completion/`	same
update-status	`lambdas/update-status/`	same
Training ECS task def	`infra/compute/modules/ecs.py`	Pulumi
Strategy image	`strategies/{name}/Dockerfile`	`just release <strategy> <ver>` in strategies repo
IAM: StepFunctions role	`infra/compute/modules/step_functions.py::_create_execution_role`	Pulumi (has `dynamodb:UpdateItem` on `tradai-*`, needed for `UpdateRetrainingState`)

9. Troubleshooting¶

Skip path never triggers (`CheckRetrainingNeeded` always decides "first training")¶

Pre-#91 symptom. tradai-retraining-state-{ENV}.last_retrained was never written back by any post-training step, so the gate in CheckRetrainingNeeded saw the row as absent on every invocation.

Fix (post-#91). The UpdateRetrainingState state writes last_retrained = State.EnteredTime to tradai-retraining-state-{ENV}, keyed by $.model_name, right after PromoteModel/KeepCurrentModel. It is implemented as a DynamoDB SDK integration (no extra Lambda). Failures in this state route directly to NotifyCompletion with $.error, because last_retrained is best-effort metadata, not a correctness invariant.
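For illustration, the write performed by this state is roughly equivalent to the CLI call below. This is a sketch, not the state's actual definition: `set_last_retrained` is a hypothetical helper, and it assumes last_retrained is stored as a string attribute holding an ISO 8601 UTC timestamp.

```bash
# Hypothetical helper: upsert last_retrained for one model.
# $1 = model name, $2 = timestamp (defaults to the current UTC time)
set_last_retrained() {
  local model="$1" ts="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  aws dynamodb update-item \
    --table-name "tradai-retraining-state-${ENV}" \
    --key "{\"model_name\":{\"S\":\"$model\"}}" \
    --update-expression 'SET last_retrained = :ts' \
    --expression-attribute-values "{\":ts\":{\"S\":\"$ts\"}}" \
    --region eu-central-1
}
```

Because update-item creates the row when it is missing, one item per model is maintained, which is the upsert behaviour the gate in CheckRetrainingNeeded relies on.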

Bad `model_name` silently triggers 5-minute generic-image training failure¶

Pre-#91 symptom. model_name was forwarded to ECS without validation; a typo or shell-injection-shaped string produced an opaque freqtrade crash deep inside the pipeline.

Fix (post-#91). CheckRetrainingNeeded rejects names that don't match ^[A-Z][A-Za-z0-9_]{1,49}\Z up-front, returns RetrainingDecision.INVALID_MODEL, and EvaluateRetrainingNeed routes to NotifyFailure before any ECS task starts. Additionally, when ALLOWED_MODELS is set, format-valid but unknown names are also rejected — preventing bogus strategies from consuming Fargate compute.

MLflow SDK calls 404 behind the proxy¶

Pre-#91 symptom. Dev MLflow runs with --static-prefix /mlflow; the ALB only exposes /mlflow/ajax-api/*. The MLflow Python SDK hard-codes /api/2.0/mlflow/ and every SDK-mediated call returned 404 HTML.

Fix (post-#91). MLflowAdapter bypasses the SDK for Registry and artifact operations and calls raw HTTP against the discovered /ajax-api base (create_model_version, transition_model_version_stage, get_run, log_artifacts_direct). Artifact upload uses boto3 directly against run.info.artifact_uri, with a single reused S3 client for the whole walk.

`ExternalServiceError` masks transient network failures¶

Pre-#91 symptom. _request raised bare ExternalServiceError for every HTTP failure including 404. Registry's create-if-missing flow swallowed that, so a 500 / 503 / timeout on the existence probe silently triggered a duplicate create.

Fix (post-#91). MLflowClientMixin._request raises NotFoundError specifically on 404 and keeps ExternalServiceError for other 4xx/5xx and network errors. _create_model_version_via_http catches only NotFoundError before deciding to create the registered model. Transient failures now propagate.
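The distinction can be illustrated outside the adapter with a plain status-code check. The sketch below is shell, not the Python implementation; `classify_http_status` is a hypothetical helper, and only a 404 is treated as "safe to create".

```bash
# Hypothetical helper: map an HTTP status code to the three outcomes described above.
# $1 = status code as printed by: curl -s -o /dev/null -w '%{http_code}' URL
classify_http_status() {
  case "$1" in
    2??) echo ok ;;
    404) echo not_found ;;        # the only case where create-if-missing may proceed
    *)   echo external_error ;;   # other 4xx/5xx, and 000 when curl got no response
  esac
}
```

A 503 or a timeout on the existence probe therefore yields `external_error` and should stop the flow instead of triggering a create.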

10. Onboarding a new strategy¶

# 1. Generate strategy skeleton (in tradai-strategies repo)
cd ../tradai-strategies
just new MyStrategy

# 2. Implement the strategy, add tests
#    (freqai-enabled; see strategies/e2e-test-strategy/ for a minimal example)

# 3. Build + push image
cd ../tradai
just build-libs   # freshens tradai wheels in dist/
cd ../tradai-strategies
docker build --build-arg USE_LOCAL_WHEELS=true \
  -t ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest \
  -f strategies/my-strategy/Dockerfile .
aws ecr get-login-password --region eu-central-1 | \
  docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com
docker push ${ACCOUNT}.dkr.ecr.eu-central-1.amazonaws.com/tradai/mystrategy:latest

# 4. Add the strategy to ALLOWED_MODELS (if the allowlist is active)
#    In infra/compute/modules/lambda_funcs.py, add "MyStrategy" to the
#    ALLOWED_MODELS value in env_var_mapping, then deploy:
#      just infra-up-compute ${ENV}
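#    Or temporarily via CLI. --environment replaces the whole variable map, so keep
#    the existing variables where "..." is, and quote the comma-separated value:
#      aws lambda update-function-configuration \
#        --function-name tradai-check-retraining-needed-${ENV} \
#        --environment 'Variables={...,ALLOWED_MODELS="E2ETestStrategy,MyStrategy"}' \
#        --region eu-central-1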

# 5. First training — manual_trigger bypasses schedule/drift gates;
#    CheckRetrainingNeeded validates the name + allowlist and proceeds.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-${ENV} \
  --name "onboard-mystrategy-$(date -u +%Y%m%d-%H%M%S)" \
  --input '{"model_name":"MyStrategy","manual_trigger":true,"force":true}' \
  --region eu-central-1

# 6. Verify per §5.

After the first successful run, tradai-retraining-state-{ENV} has a row for MyStrategy with last_retrained, so the scheduled and drift-based triggers can gate correctly from then on.
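When extending the allowlist in step 4, it helps to compute the new value idempotently so a strategy is never listed twice. A minimal sketch, assuming ALLOWED_MODELS is a comma-separated list; `add_allowed_model` is a hypothetical helper.

```bash
# Hypothetical helper: add one model to a comma-separated allowlist, idempotently.
# $1 = current ALLOWED_MODELS value (may be empty), $2 = model to add
add_allowed_model() {
  local current="$1" model="$2"
  case ",$current," in
    *",$model,"*) printf '%s\n' "$current" ;;              # already listed
    ,,)           printf '%s\n' "$model" ;;                # list was empty
    *)            printf '%s,%s\n' "$current" "$model" ;;  # append
  esac
}
```

Usage: `add_allowed_model "E2ETestStrategy" MyStrategy` prints the value to put into the Lambda's ALLOWED_MODELS variable.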

References¶

Step Functions state-machine ARN: arn:aws:states:eu-central-1:${ACCOUNT}:stateMachine:tradai-retraining-workflow-{ENV}
ASL template source: infra/compute/asl_templates/retraining_workflow.json.j2
Training entrypoint: libs/tradai-common/src/tradai/common/entrypoint/training/
MLflow adapter: libs/tradai-common/src/tradai/common/mlflow/
Debug sibling runbook: Debug Workflows — generic Step Functions / DynamoDB tracing