Issue #91 — E2E Verification

End-to-end verification of every acceptance criterion from issue #91 against the deployed AWS dev environment. Total runtime: ~10 minutes (one ~5-minute training job plus short-path scenarios).

Every check is a pipeable one-liner. Most print a single line of output; compare it against the Expected text that follows each command.

Prerequisites

  • AWS CLI authenticated with SSO profile tradai (account 600802701449, region eu-central-1).
  • python3 (or python — see below) on PATH.
  • A POSIX shell — native on macOS / Linux; Git Bash or WSL on Windows. On Windows where the default python3 is the Microsoft Store app-execution alias, either disable the alias in Settings → Apps → App execution aliases, or replace every python3 in the commands below with python.
  • SSM SendCommand permission on the consolidated EC2 instance (used for MLflow REST queries, since the MLflow endpoint is VPC-internal).

Setup (run once per session)

aws sso login --profile tradai
export AWS_PROFILE=tradai
export AWS_REGION=eu-central-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export SM_ARN="arn:aws:states:${AWS_REGION}:${ACCOUNT}:stateMachine:tradai-retraining-workflow-dev"
export MLFLOW_INSTANCE=$(aws ec2 describe-instances --filters 'Name=tag:Name,Values=tradai-consolidated-dev' 'Name=instance-state-name,Values=running' --region "$AWS_REGION" --query 'Reservations[0].Instances[0].InstanceId' --output text)
export ECR_E2E="tradai/e2eteststrategy"
export MODEL=E2ETestStrategy

Define the MLflow helper (one-liner; handles both GET and POST via SSM). Paste it verbatim:

mlflow_call() { local M="$1" P="$2" B="$3" S; if [ "$M" = "GET" ]; then S="curl -s 'http://mlflow.tradai-dev.local:5000/mlflow${P}'"; else local X=$(printf '%s' "$B" | base64 | tr -d '\n'); S="echo $X | base64 -d | curl -s -X POST 'http://mlflow.tradai-dev.local:5000/mlflow${P}' -H 'Content-Type: application/json' --data-binary @-"; fi; local J=$(SCRIPT="$S" python3 -c "import json,os; print(json.dumps({'commands':[os.environ['SCRIPT']]}))"); local C=$(aws ssm send-command --instance-ids "$MLFLOW_INSTANCE" --document-name AWS-RunShellScript --parameters "$J" --region "$AWS_REGION" --query Command.CommandId --output text); local s; until s=$(aws ssm list-command-invocations --command-id "$C" --region "$AWS_REGION" --query 'CommandInvocations[0].Status' --output text); case "$s" in Success|Failed|Cancelled|TimedOut) true;; *) false;; esac; do sleep 2; done; aws ssm get-command-invocation --command-id "$C" --instance-id "$MLFLOW_INSTANCE" --region "$AWS_REGION" --query StandardOutputContent --output text; }
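For readers who find the one-liner hard to audit, the sketch below restates in Python how the helper appears to build the command that SSM runs. The function names build_remote_script and build_ssm_parameters are illustrative and do not exist in the repository; the base URL is copied from the helper above.

```python
import base64
import json

# Copied from the shell helper above: the VPC-internal MLflow base URL.
MLFLOW_BASE = "http://mlflow.tradai-dev.local:5000/mlflow"


def build_remote_script(method: str, path: str, body: str = "") -> str:
    """Return the shell command that SSM would run on the instance."""
    url = f"{MLFLOW_BASE}{path}"
    if method == "GET":
        return f"curl -s '{url}'"
    # Base64 keeps quotes in the JSON body from colliding with shell quoting.
    encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")
    return (
        f"echo {encoded} | base64 -d | curl -s -X POST '{url}' "
        "-H 'Content-Type: application/json' --data-binary @-"
    )


def build_ssm_parameters(method: str, path: str, body: str = "") -> str:
    """Return the JSON string passed to `aws ssm send-command --parameters`."""
    return json.dumps({"commands": [build_remote_script(method, path, body)]})


print(build_ssm_parameters("GET", "/ajax-api/2.0/mlflow/experiments/search"))
```

The POST body travels as base64 so that the JSON never has to survive two layers of shell quoting (local shell, then the remote shell started by SSM).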

Section A — Static infrastructure invariants

A.1 State machine has 13 states including HandleInvalidModel and UpdateRetrainingState

aws stepfunctions describe-state-machine --state-machine-arn "$SM_ARN" --region "$AWS_REGION" --query definition --output text | python3 -c "import json,sys; s=json.loads(sys.stdin.read())['States']; print(len(s), '|', ','.join(s))"

Expected: 13 | NormalizeInput,CheckRetrainingNeeded,EvaluateRetrainingNeed,HandleInvalidModel,SkipRetraining,RunRetraining,CompareModels,DecidePromotion,KeepCurrentModel,PromoteModel,UpdateRetrainingState,NotifyCompletion,NotifyFailure
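The one-liner prints states in the key order of the definition JSON, so a reordered but otherwise identical definition would not match the Expected line character for character. As a hedged alternative, the sketch below compares the state names as sets and reports what is missing or unexpected. diff_states is a hypothetical helper, not part of the repository.

```python
import json

# State names taken from the Expected line of check A.1.
EXPECTED_STATES = [
    "NormalizeInput", "CheckRetrainingNeeded", "EvaluateRetrainingNeed",
    "HandleInvalidModel", "SkipRetraining", "RunRetraining", "CompareModels",
    "DecidePromotion", "KeepCurrentModel", "PromoteModel",
    "UpdateRetrainingState", "NotifyCompletion", "NotifyFailure",
]


def diff_states(definition_json, expected=EXPECTED_STATES):
    """Return (missing, unexpected) state names, each sorted alphabetically."""
    actual = set(json.loads(definition_json)["States"])
    wanted = set(expected)
    return sorted(wanted - actual), sorted(actual - wanted)


# Example with a made-up definition that lacks one state and adds another.
sample = {"States": {name: {} for name in EXPECTED_STATES if name != "HandleInvalidModel"}}
sample["States"]["Extra"] = {}
print(diff_states(json.dumps(sample)))
```

Feed it the output of the describe-state-machine command from A.1; two empty lists mean the state set matches.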

A.2 Strategy ECS task definition active, pointing at the E2ETestStrategy image

aws ecs describe-task-definition --task-definition tradai-strategy-generic-dev --region "$AWS_REGION" --query 'taskDefinition.[family,status,containerDefinitions[0].image]' --output text

Expected: tradai-strategy-generic-dev ACTIVE 600802701449.dkr.ecr.eu-central-1.amazonaws.com/tradai/e2eteststrategy:latest

A.3 E2ETestStrategy image present in ECR

aws ecr describe-images --repository-name "$ECR_E2E" --image-ids imageTag=latest --region "$AWS_REGION" --query 'imageDetails[0].[imageDigest,imagePushedAt]' --output text

Expected: a sha256:... digest and a timestamp. Empty or ImageNotFoundException = fail.

A.4 Required DynamoDB tables exist and are active

for t in tradai-retraining-state-dev tradai-drift-state-dev tradai-workflow-state-dev; do aws dynamodb describe-table --table-name "$t" --region "$AWS_REGION" --query 'Table.[TableName,TableStatus]' --output text; done

Expected: three lines, each ending in ACTIVE.

Section B — Scenario A: INVALID_MODEL routing (~5 s)

Malformed model_name. Must route through HandleInvalidModel → NotifyFailure without touching ECS.

B.1 Start execution and capture its ARN

export ARN_A=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" --name "verify-invalid-$(date -u +%Y%m%d-%H%M%S)" --input '{"model_name":"bad name!","manual_trigger":true,"force":false}' --region "$AWS_REGION" --query executionArn --output text) && until [ "$(aws stepfunctions describe-execution --execution-arn "$ARN_A" --region "$AWS_REGION" --query status --output text)" != "RUNNING" ]; do sleep 2; done && echo "$ARN_A"

Expected: a single ARN line arn:aws:states:...:verify-invalid-....

B.2 State path ends at NotifyFailure

aws stepfunctions get-execution-history --execution-arn "$ARN_A" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails].stateEnteredEventDetails.name' --output text

Expected: NormalizeInput CheckRetrainingNeeded EvaluateRetrainingNeed HandleInvalidModel NotifyFailure

B.3 Execution status SUCCEEDED

aws stepfunctions describe-execution --execution-arn "$ARN_A" --region "$AWS_REGION" --query status --output text

Expected: SUCCEEDED (the execution itself succeeds because the workflow handled the invalid input and delivered the failure notification).

B.4 NotifyFailure Lambda reported retraining_failed

aws stepfunctions get-execution-history --execution-arn "$ARN_A" --region "$AWS_REGION" --query 'events[?type==`TaskSucceeded`].taskSucceededEventDetails.output' --output json | python3 -c "import json,sys; outs=json.load(sys.stdin); [print(json.loads(json.loads(o)['Payload']['body'])['notification_type']) for o in outs if isinstance(json.loads(o).get('Payload',{}),dict) and 'body' in json.loads(o).get('Payload',{})]"

Expected: retraining_failed
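The B.4 one-liner decodes JSON twice: each TaskSucceeded output is a JSON string whose Payload.body is itself a JSON string. The sketch below spells out that unwrapping step by step. The sample values are made up for illustration; only the nesting is taken from the command above.

```python
import json


def notification_types(task_outputs):
    """Extract notification_type from each Lambda-style task output."""
    found = []
    for raw in task_outputs:
        payload = json.loads(raw).get("Payload")
        # Non-Lambda tasks (for example ECS) have no dict Payload with a body.
        if isinstance(payload, dict) and "body" in payload:
            found.append(json.loads(payload["body"])["notification_type"])
    return found


# Made-up sample: one Lambda output and one output without a Payload.
sample_outputs = [
    json.dumps({"Payload": {"body": json.dumps({"notification_type": "retraining_failed"})}}),
    json.dumps({"TaskArn": "arn:aws:ecs:example"}),
]
print(notification_types(sample_outputs))
```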

B.5 HandleInvalidModel synthesised $.error with the regex-violation cause

aws stepfunctions get-execution-history --execution-arn "$ARN_A" --region "$AWS_REGION" --query 'events[?stateExitedEventDetails.name==`HandleInvalidModel`].stateExitedEventDetails.output' --output text | python3 -c "import json,sys; d=json.loads(sys.stdin.read()); c=d.get('error',{}).get('Cause','') or d.get('error',{}).get('cause',''); print('error_present=', 'error' in d, '| cause_startswith_invalid=', c.startswith(\"Invalid model_name 'bad name!'\"))"

Expected: error_present= True | cause_startswith_invalid= True

B.6 No RunRetraining state was entered (⇒ zero ECS tasks)

aws stepfunctions get-execution-history --execution-arn "$ARN_A" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails.name==`RunRetraining`]|length(@)' --output text

Expected: 0

B.7 ALLOWED_MODELS rejects a format-valid but unknown model

A PascalCase name that passes regex validation but is NOT in the ALLOWED_MODELS env var must also route to HandleInvalidModel → NotifyFailure.

export ARN_AL=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" --name "verify-allowlist-$(date -u +%Y%m%d-%H%M%S)" --input '{"model_name":"BogusStrategy","manual_trigger":true,"force":true}' --region "$AWS_REGION" --query executionArn --output text) && until [ "$(aws stepfunctions describe-execution --execution-arn "$ARN_AL" --region "$AWS_REGION" --query status --output text)" != "RUNNING" ]; do sleep 2; done && echo "$ARN_AL"

Expected: a single ARN line.

B.8 Allowlist-rejected model routes to HandleInvalidModel → NotifyFailure

aws stepfunctions get-execution-history --execution-arn "$ARN_AL" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails].stateEnteredEventDetails.name' --output text

Expected: NormalizeInput CheckRetrainingNeeded EvaluateRetrainingNeed HandleInvalidModel NotifyFailure

B.9 Allowlist rejection reason mentions ALLOWED_MODELS

aws stepfunctions get-execution-history --execution-arn "$ARN_AL" --region "$AWS_REGION" --query 'events[?stateExitedEventDetails.name==`HandleInvalidModel`].stateExitedEventDetails.output' --output text | python3 -c "import json,sys; d=json.loads(sys.stdin.read()); c=d.get('error',{}).get('Cause','') or d.get('error',{}).get('cause',''); print('has_allowed_models_ref=', 'ALLOWED_MODELS' in c)"

Expected: has_allowed_models_ref= True
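Checks B.5 and B.9 imply a two-stage validation: a format check first, then an allowlist lookup. The sketch below illustrates that ordering only. The regex, the function name validate_model_name, and the message wording beyond the fragments asserted in B.5 and B.9 are assumptions, not the shipped Lambda code.

```python
import re

# Hypothetical PascalCase pattern; the real regex is not shown in this document.
MODEL_NAME_RE = re.compile(r"[A-Z][A-Za-z0-9]*")


def validate_model_name(name, allowed):
    """Return an error cause string, or None when the name is acceptable."""
    if not MODEL_NAME_RE.fullmatch(name):
        # B.5 asserts only that the cause starts with this prefix.
        return f"Invalid model_name {name!r}: does not match the required format"
    if name not in allowed:
        # B.9 asserts only that the cause mentions ALLOWED_MODELS.
        return f"model_name {name!r} is not listed in ALLOWED_MODELS"
    return None


allowed = {"E2ETestStrategy"}
for candidate in ("bad name!", "BogusStrategy", "E2ETestStrategy"):
    print(candidate, "->", validate_model_name(candidate, allowed))
```

Running the format check before the allowlist means a malformed name never reaches the lookup, which is consistent with the two distinct causes that B.5 and B.9 test for.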

Section C — Scenario B: happy path (~5 min)

Valid E2ETestStrategy, force=true. Runs the full training pipeline and must persist last_retrained so Scenario C can skip.

C.1 Start execution and wait for completion

export ARN_B=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" --name "verify-happy-$(date -u +%Y%m%d-%H%M%S)" --input '{"model_name":"E2ETestStrategy","manual_trigger":true,"force":true}' --region "$AWS_REGION" --query executionArn --output text) && echo "$ARN_B" && until [ "$(aws stepfunctions describe-execution --execution-arn "$ARN_B" --region "$AWS_REGION" --query status --output text)" != "RUNNING" ]; do sleep 30; done && aws stepfunctions describe-execution --execution-arn "$ARN_B" --region "$AWS_REGION" --query status --output text

Expected: ARN line, then (after ~5 minutes) SUCCEEDED.

C.2 State path runs through all training stages

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails].stateEnteredEventDetails.name' --output text

Expected (either branch is valid): NormalizeInput CheckRetrainingNeeded EvaluateRetrainingNeed RunRetraining CompareModels DecidePromotion KeepCurrentModel UpdateRetrainingState NotifyCompletion

PromoteModel substitutes for KeepCurrentModel if the challenger was promoted.
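Because two state paths are acceptable here, a plain string comparison against one Expected line can report a false failure. A minimal sketch of a check that accepts either branch follows; is_valid_happy_path is a hypothetical helper name.

```python
# State names taken from the Expected text of check C.2.
COMMON_PREFIX = [
    "NormalizeInput", "CheckRetrainingNeeded", "EvaluateRetrainingNeed",
    "RunRetraining", "CompareModels", "DecidePromotion",
]
COMMON_SUFFIX = ["UpdateRetrainingState", "NotifyCompletion"]
BRANCH_STATES = ("KeepCurrentModel", "PromoteModel")


def is_valid_happy_path(history_text):
    """Accept the whitespace-separated state path for either promotion branch."""
    states = history_text.split()
    return any(
        states == COMMON_PREFIX + [branch] + COMMON_SUFFIX
        for branch in BRANCH_STATES
    )


keep_path = " ".join(COMMON_PREFIX + ["KeepCurrentModel"] + COMMON_SUFFIX)
print(is_valid_happy_path(keep_path))
```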

C.3 NormalizeInput defaulted config_version_id to ""

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?stateExitedEventDetails.name==`NormalizeInput`].stateExitedEventDetails.output' --output text | python3 -c "import json,sys; d=json.loads(sys.stdin.read()); print('config_version_id=' + repr(d.get('config_version_id')))"

Expected: config_version_id=''

C.4 RunRetraining finished within the 900 s threshold

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?contains([`RunRetraining`], stateEnteredEventDetails.name) || contains([`RunRetraining`], stateExitedEventDetails.name)].timestamp' --output text | python3 -c "import sys,datetime; t=[datetime.datetime.fromisoformat(x.replace('+0000','+00:00')) for x in sys.stdin.read().split()]; print('duration_seconds=', int((t[-1]-t[0]).total_seconds()))"

Expected: duration_seconds= <N> where N < 900. The issue says "~5 minutes"; observed durations range from 339 s to 654 s across runs, because Fargate Spot scheduling and cold start add 60–90 s of overhead. The 900 s threshold confirms the lightweight design while tolerating infrastructure jitter.
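The threshold comparison is left to the reader in the command above. The sketch below folds the duration arithmetic and the 900 s limit into one function, under the assumption that the timestamps arrive as ISO-8601 strings with a UTC offset, as the one-liner expects. The function names are illustrative.

```python
from datetime import datetime

MAX_SECONDS = 900  # threshold stated in the Expected text of C.4


def run_duration_seconds(timestamps):
    """Seconds between the earliest and latest RunRetraining event."""
    parsed = [
        datetime.fromisoformat(t.replace("+0000", "+00:00")) for t in timestamps
    ]
    return int((max(parsed) - min(parsed)).total_seconds())


def within_threshold(timestamps, limit=MAX_SECONDS):
    return run_duration_seconds(timestamps) < limit


# Made-up timestamps, 6 minutes 30 seconds apart.
sample = ["2026-04-23T10:00:00+00:00", "2026-04-23T10:06:30+00:00"]
print(run_duration_seconds(sample), within_threshold(sample))
```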

C.5 ECS training log has no error-class lines

export TASK_ID=$(aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --output json | python3 -c "import json,sys; h=json.load(sys.stdin); t=[json.loads(e['taskSucceededEventDetails']['output']).get('TaskArn','') for e in h['events'] if e['type']=='TaskSucceeded' and 'TaskArn' in e.get('taskSucceededEventDetails',{}).get('output','')]; print(t[0].rsplit('/',1)[1] if t else '')") && MSYS_NO_PATHCONV=1 aws logs filter-log-events --log-group-name /ecs/tradai/dev --log-stream-name-prefix "strategy/strategy/${TASK_ID}" --filter-pattern '?ERROR ?Traceback ?CRITICAL' --region "$AWS_REGION" --no-paginate --query 'length(events)' --output text

Expected: 0

C.6 CompareModels returned both decision and confidence

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?type==`TaskSucceeded`].taskSucceededEventDetails.output' --output json | python3 -c "import json,sys; outs=json.load(sys.stdin); [print('decision=' + str(p.get('decision')) + ' confidence=' + str(p.get('confidence'))) for o in outs for p in [json.loads(o).get('Payload') if isinstance(json.loads(o).get('Payload'),dict) else {}] if 'confidence' in p]"

Expected: one line of the form decision=<value> confidence=<numeric> — both keys must be present. Example: decision=needs_more_data confidence=0.0.

C.7 DecidePromotion routed to exactly one of the terminal-branch states

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails.name==`PromoteModel` || stateEnteredEventDetails.name==`KeepCurrentModel`].stateEnteredEventDetails.name' --output text

Expected: PromoteModel or KeepCurrentModel (exactly one token).

C.8 NotifyCompletion Lambda reported retraining_success

aws stepfunctions get-execution-history --execution-arn "$ARN_B" --region "$AWS_REGION" --query 'events[?type==`TaskSucceeded`].taskSucceededEventDetails.output' --output json | python3 -c "import json,sys; outs=json.load(sys.stdin); [print(json.loads(json.loads(o)['Payload']['body'])['notification_type']) for o in outs if isinstance(json.loads(o).get('Payload',{}),dict) and 'body' in json.loads(o).get('Payload',{})]"

Expected: retraining_success

C.9 UpdateRetrainingState wrote last_retrained

aws dynamodb get-item --table-name tradai-retraining-state-dev --key "{\"model_name\":{\"S\":\"${MODEL}\"}}" --region "$AWS_REGION" --query 'Item.last_retrained.S' --output text

Expected: an ISO-8601 UTC timestamp within the last few minutes.
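"Within the last few minutes" can be turned into a mechanical check. The sketch below assumes a 15-minute window, which is an arbitrary choice for illustration and not a value taken from the issue; is_fresh is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_retrained, max_age=timedelta(minutes=15), now=None):
    """True when the ISO-8601 timestamp is at most max_age old."""
    ts = datetime.fromisoformat(last_retrained.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return timedelta(0) <= now - ts <= max_age


fixed_now = datetime(2026, 4, 23, 10, 8, tzinfo=timezone.utc)
print(is_fresh("2026-04-23T10:05:00+00:00", now=fixed_now))
```

Pipe the output of the get-item command into a call like this to avoid judging the timestamp by eye.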

C.9b tradai-workflow-state-dev row created with status=completed

DW#27 requires the workflow state table to track job status.

MSYS_NO_PATHCONV=1 aws dynamodb get-item --table-name tradai-workflow-state-dev --key "{\"run_id\":{\"S\":\"${ARN_B}\"}}" --region "$AWS_REGION" --query 'Item.status.S' --output text

Expected: completed

C.10 MLflow experiment and run exist

Retraining creates a per-strategy experiment {MODEL}_training (e.g. E2ETestStrategy_training) and tags each run with job_id=<execution arn>.

# Discover experiment ID by name
export EXP_ID=$(mlflow_call GET "/ajax-api/2.0/mlflow/experiments/get-by-name?experiment_name=${MODEL}_training" | python3 -c "import json,sys; print(json.load(sys.stdin)['experiment']['experiment_id'])")
echo "experiment_id=${EXP_ID}"

Expected: experiment_id=<integer> (non-empty).

mlflow_call POST "/ajax-api/2.0/mlflow/runs/search" "$(printf "{\"experiment_ids\":[\"%s\"],\"filter\":\"tags.job_id = '%s'\",\"max_results\":1}" "$EXP_ID" "$ARN_B")" | python3 -c "import json,sys; d=json.load(sys.stdin); r=(d.get('runs') or [None])[0]; print('found=', bool(r), '| run_id=', (r or {}).get('info',{}).get('run_id','') , '| status=', (r or {}).get('info',{}).get('status',''))"

Expected: found= True | run_id= <32-char hex> | status= FINISHED.

C.11 MLflow run carries expected tags and training metrics

mlflow_call POST "/ajax-api/2.0/mlflow/runs/search" "$(printf "{\"experiment_ids\":[\"%s\"],\"filter\":\"tags.job_id = '%s'\",\"max_results\":1}" "$EXP_ID" "$ARN_B")" | python3 -c "import json,sys; r=json.load(sys.stdin)['runs'][0]; t={x['key']:x['value'] for x in r['data'].get('tags',[])}; m=[x['key'] for x in r['data'].get('metrics',[])]; print('strategy=' + t.get('strategy','') + ' | freqai_model=' + t.get('freqai_model','') + ' | metrics=' + ','.join(sorted(m)))"

Expected: strategy=E2ETestStrategy | freqai_model=LightGBMRegressor | metrics=training_profit_pct,training_sharpe_ratio,training_total_trades

C.12 S3 artefacts present at s3://tradai-mlflow-dev/artifacts/{EXP_ID}/<run_id>/artifacts/

export RUN_ID=$(mlflow_call POST "/ajax-api/2.0/mlflow/runs/search" "$(printf "{\"experiment_ids\":[\"%s\"],\"filter\":\"tags.job_id = '%s'\",\"max_results\":1}" "$EXP_ID" "$ARN_B")" | python3 -c "import json,sys; print(json.load(sys.stdin)['runs'][0]['info']['run_id'])") && aws s3 ls "s3://tradai-mlflow-dev/artifacts/${EXP_ID}/${RUN_ID}/artifacts/" --recursive --region "$AWS_REGION" --summarize --human-readable | tail -2

Expected: two summary lines (Total Objects: <N≥50> and Total Size: <non-zero>).
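The two summary lines can be checked mechanically instead of by eye. A minimal sketch follows, assuming the usual "Total Objects:" and "Total Size:" labels printed by aws s3 ls --summarize. The helper names are hypothetical, and the 50-object minimum comes from the Expected text above.

```python
import re


def parse_s3_summary(text):
    """Return (object_count, size_value) from the --summarize trailer."""
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*([\d.]+)", text)
    if not objects or not size:
        raise ValueError("summary lines not found")
    return int(objects.group(1)), float(size.group(1))


def artefacts_ok(text, min_objects=50):
    count, size = parse_s3_summary(text)
    return count >= min_objects and size > 0


# Made-up sample of the last two lines of output.
sample = "Total Objects: 57\n   Total Size: 12.3 MiB\n"
print(parse_s3_summary(sample), artefacts_ok(sample))
```

The size value is compared only against zero, so the unit suffix added by --human-readable is ignored.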

C.13 Model version registered in MLflow Registry with source pointing at this run

mlflow_call GET "/ajax-api/2.0/mlflow/registered-models/search?max_results=200" | python3 -c "import json,sys,os; run=os.environ['RUN_ID']; d=json.load(sys.stdin); hits=[v for m in d.get('registered_models',[]) for v in m.get('latest_versions',[]) if v.get('run_id')==run]; print('registered=', bool(hits), '| version=', (hits or [{}])[0].get('version',''), '| stage=', (hits or [{}])[0].get('current_stage',''))"

Expected: registered= True | version= <N> | stage= <None|Staging|Production>.

Section D — Scenario C: skip path (~5 s, run immediately after Scenario B)

Same model, force=false, manual_trigger=false. The fresh last_retrained from C.9 must short-circuit into SkipRetraining.

D.1 Start, wait, and read the state path

export ARN_C=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" --name "verify-skip-$(date -u +%Y%m%d-%H%M%S)" --input "{\"model_name\":\"${MODEL}\",\"manual_trigger\":false,\"force\":false}" --region "$AWS_REGION" --query executionArn --output text) && until [ "$(aws stepfunctions describe-execution --execution-arn "$ARN_C" --region "$AWS_REGION" --query status --output text)" != "RUNNING" ]; do sleep 2; done && aws stepfunctions get-execution-history --execution-arn "$ARN_C" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails].stateEnteredEventDetails.name' --output text

Expected: NormalizeInput CheckRetrainingNeeded EvaluateRetrainingNeed SkipRetraining

D.2 Execution status SUCCEEDED

aws stepfunctions describe-execution --execution-arn "$ARN_C" --region "$AWS_REGION" --query status --output text

Expected: SUCCEEDED

D.3 No RunRetraining state entered

aws stepfunctions get-execution-history --execution-arn "$ARN_C" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails.name==`RunRetraining`]|length(@)' --output text

Expected: 0

Section E — Scenario D: config_version_id passthrough (~5 s)

Start with an explicit config_version_id. Because we set manual_trigger=false, force=false and C.9's last_retrained is still fresh, this skips through SkipRetraining — no training needed to prove passthrough.

E.1 Start with explicit config_version_id

export CVID="verify-91-$(date -u +%Y%m%d%H%M%S)" && export ARN_D=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" --name "verify-cv-${CVID}" --input "{\"model_name\":\"${MODEL}\",\"manual_trigger\":false,\"force\":false,\"config_version_id\":\"${CVID}\"}" --region "$AWS_REGION" --query executionArn --output text) && until [ "$(aws stepfunctions describe-execution --execution-arn "$ARN_D" --region "$AWS_REGION" --query status --output text)" != "RUNNING" ]; do sleep 2; done && aws stepfunctions describe-execution --execution-arn "$ARN_D" --region "$AWS_REGION" --query status --output text

Expected: SUCCEEDED

E.2 NormalizeInput preserved the supplied config_version_id

aws stepfunctions get-execution-history --execution-arn "$ARN_D" --region "$AWS_REGION" --query 'events[?stateExitedEventDetails.name==`NormalizeInput`].stateExitedEventDetails.output' --output text | python3 -c "import json,sys,os; d=json.loads(sys.stdin.read()); print('passed_through=', d.get('config_version_id')==os.environ['CVID'], '| value=', d.get('config_version_id'))"

Expected: passed_through= True | value= verify-91-<timestamp>

E.3 CheckRetrainingNeeded input carried the same config_version_id

aws stepfunctions get-execution-history --execution-arn "$ARN_D" --region "$AWS_REGION" --query 'events[?stateEnteredEventDetails.name==`CheckRetrainingNeeded`].stateEnteredEventDetails.input' --output text | python3 -c "import json,sys,os; d=json.loads(sys.stdin.read()); print('downstream=', d.get('config_version_id')==os.environ['CVID'])"

Expected: downstream= True

Section F — Mapping to #91 Done When

Every checkbox in the original issue maps to one or more checks above. Two criteria carry caveats (#9, #27; documented against the implementation as shipped) and three live outside the deployed environment (#2, #3, #5; verified by CI on PR #11 in the strategies repo).

| # | Acceptance criterion | Check(s) | Status |
|---|---|---|---|
| 1 | Strategy created in tradai-strategies/strategies/e2e-test-strategy/ | A.3 + full Scenario B | Automated |
| 2 | Unit tests pass | gh pr checks 11 --repo tradai-bot/strategies | Manual (CI on PR #11) |
| 3 | Lint + typecheck pass | same as #2 | Manual (CI on PR #11) |
| 4 | Docker image built + pushed to ECR | A.3 | Automated |
| 5 | Smoke backtest completes locally | PR #11 CI | Manual (CI on PR #11) |
| 6 | Workflow completes successfully | C.1 + C.2 | Automated |
| 7 | ECS training task runs without errors | C.5 | Automated |
| 8 | Training completes in ~5 minutes | C.4 | Automated |
| 9 | MLflow experiment created for E2ETestStrategy | C.10 + C.11 | Automated: per-strategy experiment E2ETestStrategy_training is created automatically |
| 10 | Model metrics + params logged in MLflow | C.11 | Automated (metrics present; params dict is empty in the current training path; metrics carry the learning signal) |
| 11 | Model artefacts stored in S3 | C.12 | Automated |
| 12 | Feature importance stored in MLflow run | C.12 (part of artifacts/model/ dump) | Automated |
| 13 | Model registered in MLflow Model Registry | C.13 | Automated |
| 14 | CompareModels returns decision + confidence | C.6 | Automated |
| 15 | DecidePromotion routes correctly | C.7 | Automated |
| 16 | NotifyCompletion fires retraining_success | C.8 | Automated |
| 17 | Skip when force=false + fresh model | D.1–D.3 | Automated |
| 18 | Bad input triggers NotifyFailure | B.2 + B.7–B.9 | Automated (regex violation B.2, ALLOWED_MODELS rejection B.7–B.9) |
| 19 | NotifyFailure fires retraining_failed + error details | B.4 + B.5 | Automated |
| 20 | No orphaned ECS tasks after failure | B.6 | Automated |
| 21 | config_version_id defaults to "" when absent | C.3 | Automated |
| 22 | config_version_id passes through when supplied | E.2 + E.3 | Automated |
| 23 | Step Functions visual workflow correct | C.2 / B.2 / D.1 (the Console visual is a rendering of get-execution-history) | Automated |
| 24 | CloudWatch logs clean | C.5 | Automated |
| 25 | MLflow UI shows experiment/run/metrics/model | C.10 + C.11 + C.13 (REST API queries return the same data the UI renders) | Automated |
| 26 | S3 bucket contains model artefacts at expected path | C.12 | Automated |
| 27 | DynamoDB tradai-models-{env} updated with job status | A.4 + C.9 + C.9b | Caveat: tradai-models-dev does not exist; job status is tracked in tradai-workflow-state-dev (C.9b, keyed by execution ARN, status=completed), and the per-model last_retrained gate is in tradai-retraining-state-dev (C.9). The issue text references a table name that was renamed during implementation. |
| 28 | Execution ARN of a successful run | $ARN_B from C.1 | Automated |
| 29 | MLflow experiment screenshot | C.10 + C.11 (REST output replaces a UI screenshot) | Automated (no literal screenshot) |
| 30 | Model Registry screenshot | C.13 | Automated (no literal screenshot) |
| 31 | Document issues + fixes | #366 filed; fix in PR #367 | Done |
| 32 | Update runbook / docs | This document; docs/runbooks/retraining-workflow.md references #366 fix location | Done |

Pass criteria

Issue #91 is verified when the output of every command above matches its Expected text; placeholders such as <N> stand for run-specific values. Alternatively, run docs/verification/issue91-verify.sh, which automates all checks and prints a PASS/FAIL summary.
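Several Expected lines contain angle-bracket placeholders, so a literal string comparison does not work for them. The sketch below shows one way to treat each placeholder as a wildcard. It is an illustration only and makes no claim about how issue91-verify.sh is implemented; matches_expected is a hypothetical name, and numeric constraints such as N < 900 still need a separate check.

```python
import re


def matches_expected(actual, expected):
    """Compare output to an Expected line, treating <...> as a wildcard."""
    literal_parts = re.split(r"<[^<>]*>", expected)
    pattern = ".+?".join(re.escape(part) for part in literal_parts)
    return re.fullmatch(pattern, actual.strip()) is not None


print(matches_expected("duration_seconds= 412", "duration_seconds= <N>"))
print(matches_expected("FAILED", "SUCCEEDED"))
```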

Caveat rows (F.9, F.27) document where the implementation-as-shipped legitimately diverged from the original ticket wording — accept those caveats or revise the ticket before closure. Manual rows (F.2, F.3, F.5) are verified by CI on the strategies-repo PR #11 and are not re-runnable against deployed dev.

Section G — Gap regression checks (from independent verification 2026-04-23)

Added after the independent verification in issue91-independent-verification-20260423.md surfaced three code gaps that the A–E checks did not catch: hyperparameters never logged as MLflow params (DW#10), feature importance absent from artefacts (DW#12), and SNS failure notifications lacking error details (DW#19). Each sub-check below maps 1:1 to the fix in PR fix/91-verification-gaps.

The checks below assume ARN_B (happy-path execution), EXP_ID, and RUN_ID are already exported from §C, and ARN_A (failure-path) from §B. Re-export them if you restart the shell:

export ARN_B=...  # from §C.1
export ARN_A=...  # from §B.1
export EXP_ID=... # from §C.10
export RUN_ID=... # from §C.12

G.1 (DW#10) MLflow run has non-zero params — hyperparameters logged

mlflow_call GET "/ajax-api/2.0/mlflow/runs/get?run_id=${RUN_ID}" | python3 -c "import json,sys; r=json.load(sys.stdin)['run']; params={p['key']:p['value'] for p in r['data'].get('params',[])}; print('params_count=' + str(len(params)) + ' | has_freqai_n_estimators=' + str(any(k.endswith('n_estimators') for k in params)))"

Expected: params_count=<N ≥ 5> | has_freqai_n_estimators=True. The base TrainingConfig params (timeframe, periods, pairs, freqai_model) plus the freqai model_training_parameters must be present.

G.2 (DW#12) feature_importance.json is present at the expected S3 path

aws s3 ls "s3://tradai-mlflow-dev/artifacts/${EXP_ID}/${RUN_ID}/artifacts/model/feature_importance.json" --region "$AWS_REGION" --human-readable | awk '{print $3, $4, $5}'

Expected: a single line with a byte size (e.g. 512 Bytes) and the literal filename feature_importance.json. Empty output = fail.

G.3 (DW#12) feature_importance.json contents reference rsi or sma-ratio

The artefact is always written; when booster scores are absent, it carries a training_features_list populated from FreqAI per-sub-train metadata.

aws s3 cp "s3://tradai-mlflow-dev/artifacts/${EXP_ID}/${RUN_ID}/artifacts/model/feature_importance.json" - --region "$AWS_REGION" | python3 -c "import json,sys; d=json.load(sys.stdin); feats=list(d.get('feature_importance', {}).keys()) + d.get('training_features_list', []); flat=' '.join(feats).lower(); print('has_rsi=' + str('rsi' in flat) + ' | has_sma_ratio=' + str('sma-ratio' in flat or 'sma_ratio' in flat))"

Expected: has_rsi=True | has_sma_ratio=True
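The G.3 command accepts two artefact shapes: booster scores under feature_importance, or the fallback training_features_list. The sketch below shows the same lookup against both shapes. The feature names in the samples are made up for illustration, and referenced_features is a hypothetical helper.

```python
def referenced_features(doc):
    """Report whether rsi and sma-ratio features appear in either shape."""
    names = list(doc.get("feature_importance", {}).keys())
    names += list(doc.get("training_features_list", []))
    flat = " ".join(names).lower()
    return {
        "has_rsi": "rsi" in flat,
        "has_sma_ratio": "sma-ratio" in flat or "sma_ratio" in flat,
    }


# Made-up artefact with booster scores.
with_scores = {"feature_importance": {"%-rsi-period_14": 0.6, "%-sma-ratio": 0.4}}
# Made-up artefact using the fallback feature list.
fallback_only = {"training_features_list": ["%-rsi-period_14", "%-sma_ratio"]}

print(referenced_features(with_scores))
print(referenced_features(fallback_only))
```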

G.4 (DW#19) Failure-path SNS body includes error Cause

_send_sns_notification sends the body via AlertPublisher.publish; the SNS message content itself is not written to CloudWatch. The Lambda emits a dedicated INFO marker, "SNS failure body includes error details: …", only when _extract_error_lines produced at least one line. That marker is what we grep for.

MSYS_NO_PATHCONV=1 aws logs filter-log-events --log-group-name /aws/lambda/tradai-notify-completion-dev --region "$AWS_REGION" --start-time $(python3 -c "import time; print(int((time.time()-3600)*1000))") --filter-pattern 'SNS failure body includes error details' --no-paginate --query 'length(events)' --output text

Expected: a count ≥ 1 — at least one failure invocation in the last hour rendered an error block. 0 means either no failure ran in the window, or the lambda is running pre-fix code (no marker emitted).