
Issue #91 — Independent Verification Report (2026-04-23)

A fresh, from-scratch verification of issue #91 against origin/main @ 9b6064f. Written without consulting docs/verification/issue91.md or issue91-verify.sh, both of which were authored by the same contributor who implemented the #91 stack. Assertions were derived solely from the 32 Done When checkboxes in the issue body.

Purpose: provide an audit trail for whether #91 is ready to close and, where the implementation diverges from the ticket, to explicitly flag the divergence rather than silently adopt the implementation as the spec.

Method

  1. Pulled issue #91 body via gh issue view 91 --repo tradai-bot/tradai. Extracted the 32 Done When items.
  2. Authenticated to AWS dev (account 600802701449, eu-central-1).
  3. Started four fresh Step Functions executions on tradai-retraining-workflow-dev with execution names prefixed independent-* so they are distinguishable from the author's verify-* runs:
     | Scenario | Execution name |
     | --- | --- |
     | Failure (invalid model_name) | independent-fail-20260423T074145Z |
     | Happy (full training + registration) | independent-happy-20260423T074145Z |
     | Skip (after happy) | independent-skip-20260423T075330Z |
     | config_version_id passthrough | independent-cvid-20260423T075330Z |
  4. For each Done-When item, issued a query against AWS / MLflow REST / S3 that maps directly to the ticket text. No ticket text was silently reinterpreted.
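The executions in step 3 were started with `aws stepfunctions start-execution`. A dry-run sketch for the happy-path run (the command is echoed rather than executed so it can be reviewed without credentials; the state-machine ARN is inferred from the account/region above, and the exact input payload shape is an assumption):

```shell
# State-machine ARN inferred from account 600802701449 / eu-central-1 (assumption).
SM_ARN="arn:aws:states:eu-central-1:600802701449:stateMachine:tradai-retraining-workflow-dev"
STAMP="20260423T074145Z"
NAME="independent-happy-${STAMP}"

# Echoed, not executed: review the command before running it against dev.
echo aws stepfunctions start-execution \
  --state-machine-arn "$SM_ARN" \
  --name "$NAME" \
  --input '{"model_name":"E2ETestStrategy","force":true}'
```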

Happy-path evidence (primary)

| Field | Value |
| --- | --- |
| Execution ARN | arn:aws:states:eu-central-1:600802701449:execution:tradai-retraining-workflow-dev:independent-happy-20260423T074145Z |
| Status | SUCCEEDED |
| State path | NormalizeInput → CheckRetrainingNeeded → EvaluateRetrainingNeed → RunRetraining → CompareModels → DecidePromotion → KeepCurrentModel → UpdateRetrainingState → NotifyCompletion |
| RunRetraining duration | 339 s (5.65 min) |
| ECS task ID | c9f67b6517a647259b3b9a5e37dcb1c8 (launchType=FARGATE; the cluster's default strategy prefers FARGATE_SPOT at weight=1, but Spot capacity was unavailable in the placement window) |
| MLflow run | fb8dc111d79549e088fed947e8c8eadf in experiment default_training (id=9) |
| Model registry | E2ETestStrategy v24, source=runs:/fb8dc111…/model, stage None |
| S3 artefacts | 76 objects, 152.2 KiB under s3://tradai-mlflow-dev/artifacts/9/fb8dc111…/ |
| DynamoDB row | tradai-workflow-state-dev, HASH run_id=<ARN>, status=completed, created_at=2026-04-23T07:42:38Z, updated_at=2026-04-23T07:46:57Z |
| last_retrained written | tradai-retraining-state-dev[E2ETestStrategy].last_retrained = 2026-04-23T07:48:01.396Z |
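The execution-level fields above can be re-pulled with `describe-execution` and `get-execution-history`. A sketch (commands echoed rather than executed; the JMESPath filter for the state path is my own, not taken from the author's script):

```shell
EXEC_ARN="arn:aws:states:eu-central-1:600802701449:execution:tradai-retraining-workflow-dev:independent-happy-20260423T074145Z"

# Status, start/stop timestamps:
echo aws stepfunctions describe-execution --execution-arn "$EXEC_ARN"

# State path: keep only *StateEntered events from the history.
echo "aws stepfunctions get-execution-history --execution-arn $EXEC_ARN --query 'events[?contains(type, \`StateEntered\`)].stateEnteredEventDetails.name'"
```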

Per-criterion status

Legend: ✅ PASS · 🔶 literal FAIL / spirit PASS · ⚠️ partial FAIL · ❌ FAIL · ∅ not independently verifiable · ✳ substitute/meta

E2ETestStrategy created (5 items)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 1 | Strategy in tradai-strategies/strategies/e2e-test-strategy/ | ✅ | 7 files present in tradai-bot/strategies#11 (OPEN, CLEAN, MERGEABLE) |
| 2 | Unit tests pass | ✅ | Lint & Test step "Run lint and tests" = success on PR #11 |
| 3 | Lint + typecheck pass | ✅ | same step |
| 4 | Docker image pushed to ECR | ✅ | tradai/e2eteststrategy:latest sha256:33c057d… pushed 2026-04-17 |
| 5 | Smoke backtest completes locally | ∅ | Not reproducible from this env; Smoke Backtest CI job blocked by #354 (Binance HTTP 451 to GitHub runners) |

Happy path (11 items)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 6 | Workflow succeeds, all states green | ✅ | state path above |
| 7 | ECS training task runs without errors | ✅ | aws logs filter-log-events /ecs/tradai/dev with pattern ?ERROR ?Traceback ?CRITICAL ?Exception → 0 events |
| 8 | Training completes in ~5 minutes | ✅ | 339 s |
| 9 | MLflow experiment created for E2ETestStrategy | 🔶 | Experiment strategies/e2eteststrategy does not exist; the run lives in the shared default_training experiment (id=9) with tag strategy=E2ETestStrategy. Literal ticket text unsatisfied. |
| 10 | Model metrics + parameters logged correctly | ⚠️ | 3 metrics present (training_profit_pct, training_sharpe_ratio, training_total_trades); 0 params — the issue-body-required n_estimators=50, learning_rate=0.1, max_depth=3 are in neither params nor tags. Fixed by this PR's commit 1. |
| 11 | Model artefacts in S3 | ✅ | 76 objects at expected path |
| 12 | Feature importance stored (rsi, sma-ratio) | ❌ | No feature_importance.{json,csv} in S3; no booster pickle. training_features_list in per-sub-train _metadata.json names the features but not importance scores. Fixed by this PR's commit 2. |
| 13 | Model registered in MLflow Model Registry | ✅ | E2ETestStrategy v24 |
| 14 | CompareModels returns decision + confidence | ✅ | decision=needs_more_data, confidence=0.0 |
| 15 | DecidePromotion routes correctly | ✅ | routed to KeepCurrentModel (expected for needs_more_data) |
| 16 | NotifyCompletion fires retraining_success | ✅ | notification_type=retraining_success, sns=true, sent=true |
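The MLflow rows above (run metrics, registered version) were checked over REST. A sketch, echoed rather than executed; the tracking-server host is a placeholder (`MLFLOW_URL` is a hypothetical variable, not a value from this report), while the endpoints are the ones named under item 25:

```shell
# Placeholder host -- substitute the dev MLflow tracking URL (assumption).
MLFLOW_URL="${MLFLOW_URL:-https://mlflow.dev.example.internal}"
RUN_ID="fb8dc111d79549e088fed947e8c8eadf"

# Run details: metrics, params, tags.
echo curl -s "$MLFLOW_URL/ajax-api/2.0/mlflow/runs/get?run_id=$RUN_ID"

# Registered versions for the model (filter is URL-encoded name='E2ETestStrategy').
echo curl -s "$MLFLOW_URL/ajax-api/2.0/mlflow/model-versions/search?filter=name%3D%27E2ETestStrategy%27"
```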

Skip path (1 item)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 17 | Fresh model + force=false → SkipRetraining | ✅ | Execution after happy: path ends at SkipRetraining, RunRetraining not entered |

Failure path (3 items)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 18 | Bad input triggers NotifyFailure | ✅ | model_name="not a valid model name" → path … → HandleInvalidModel → NotifyFailure |
| 19 | retraining_failed fires and includes error details | ⚠️ | Type fires, but SNS body = "Model <name> retraining failed" — no error cause in the message. The Lambda receives $.error.Cause via details.error but does not render it. Fixed by this PR's commit 3. |
| 20 | No orphaned ECS tasks after failure | ✅ | RunRetraining not entered; aws ecs list-tasks --cluster tradai-dev = empty |

config_version_id (2 items)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 21 | Defaults to "" when absent | ✅ | NormalizeInput.output.config_version_id=''; live ECS env on the training task: CONFIG_VERSION_ID="" |
| 22 | Passes through when supplied | ✅ | NormalizeInput preserves it; CheckRetrainingNeeded input carries it; ASL wires CONFIG_VERSION_ID = $.config_version_id in RunRetraining.Parameters.Overrides.ContainerOverrides[0].Environment |
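The ASL wiring described in item 22 looks roughly like the fragment below. This is reconstructed from the description in this report, not copied from the deployed state-machine definition; cluster, task definition, and the other container settings are elided:

```json
"RunRetraining": {
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "Overrides": {
      "ContainerOverrides": [
        {
          "Environment": [
            { "Name": "CONFIG_VERSION_ID", "Value.$": "$.config_version_id" }
          ]
        }
      ]
    }
  }
}
```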

Manual verification (5 items)

| # | Criterion | Status | Evidence |
| --- | --- | --- | --- |
| 23 | Step Functions console correct | ✅ | 13 states, all expected transitions, all 5 task states have Catch: States.ALL (CheckRetrainingNeeded/RunRetraining/CompareModels/PromoteModel → NotifyFailure; UpdateRetrainingState → NotifyCompletion on catch — intentional: a state-write failure should not silence a successful training) |
| 24 | CloudWatch logs clean | ✳ | duplicate of DW#7 |
| 25 | MLflow UI shows experiment/run/metrics/model | ✅ | REST substitute (/ajax-api/2.0/mlflow/runs/search, /runs/get, /model-versions/search) |
| 26 | S3 model artefacts at expected path | ✅ | duplicate of DW#11 |
| 27 | DynamoDB tradai-models-{env} updated with job status | 🔶 | Table tradai-models-dev does not exist. Closest analog tradai-workflow-state-dev IS updated by the happy-path training job (row run_id=<ARN>, status=completed), but the failure path writes nothing — the failure execution's ARN has 0 rows in any DynamoDB table. Literal ticket text (table name) unsatisfied; functional analog partial. |
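The DynamoDB check in item 27 maps to a `get-item` call like the following (echoed, not executed; the key shape follows the happy-path evidence table, which names run_id as the HASH key):

```shell
TABLE="tradai-workflow-state-dev"
EXEC_ARN="arn:aws:states:eu-central-1:600802701449:execution:tradai-retraining-workflow-dev:independent-happy-20260423T074145Z"

# Echoed for review; run without the leading echo against dev to fetch the row.
echo aws dynamodb get-item \
  --table-name "$TABLE" \
  --key "{\"run_id\":{\"S\":\"$EXEC_ARN\"}}" \
  --region eu-central-1
```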

Deliverables (5 items)

# Criterion Status Evidence
28 Execution ARN of a successful run independent-happy-20260423T074145Z
29 MLflow experiment screenshot REST-substitute: tags/metrics/status captured in this report
30 Model Registry screenshot REST-substitute: v24 captured
31 Document issues found This report + PR description
32 Update runbook docs/runbooks/retraining-workflow.md 311 lines, landed via #363

Tally

  • ✅ PASS: 21
  • 🔶 Literal FAIL / spirit PASS: 2 (#9, #27)
  • ⚠️ Partial FAIL: 2 (#10, #19) — both fixed by this PR
  • ❌ FAIL: 1 (#12) — fixed by this PR
  • ∅ Not independently verifiable: 1 (#5)
  • ✳ Substitute/meta: 5 (#24 duplicate of #7; #29, #30, #31, #32)

Side findings (outside the DW list, surfaced during investigation)

All of these are legitimate production issues observed during the investigation but not tracked by the #91 ticket text. They are flagged here so maintainers can triage; fixes are NOT in this PR.

  1. IAM: the tradai-notify-completion-dev Lambda receives AccessDeniedException on tradai-notifications-dev for dynamodb:GetItem (throttle check) and dynamodb:PutItem (audit record). The SNS primary path still works; deduplication and the audit record are silently broken.
  2. Reproducibility manifest: artifacts/.../reproducibility/*.json shows git_commit="unknown" and feature_schema.feature_count=0 despite 4 real features (present in sub-train metadata). Manifest is shipped but not functional.
  3. tradai-workflow-state-dev schema gap: strategy_name column in the row written by the training job is empty. Column is declared in the table schema but the training handler never populates it.
  4. Heartbeat divergence: runbook table documents heartbeat_seconds=300; ASL actual = HeartbeatSeconds: 900. Safe today (5.65-min training < 15 min), but documentation-implementation drift.
  5. Fargate placement: cluster default strategy prefers FARGATE_SPOT (weight=1) over FARGATE (weight=0, base=0). My run landed on FARGATE because Spot capacity was temporarily unavailable in eu-central-1 for the chosen task size. Fallback works; expected.

Recommendation (at the time of this report)

Do not close #91 at 9b6064f. Three criteria with unambiguous ticket-text violations (#10, #12, #19) need code, not reinterpretation. Two with divergent ticket text (#9, #27) need an explicit product-owner decision — either code to match the ticket, or amend the ticket.

This PR (fix/91-verification-gaps) addresses #10, #12, #19 with code + tests. #9 and #27 are called out in the PR description for product-owner judgement. Side findings (1–5 above) remain as follow-up issues.

Resolution (post-report)

All three code gaps (#10, #12, #19) were fixed in PR #370. An additional gap discovered during re-verification — format-valid but unknown model names bypassing validation when force=true — was fixed in PR #371 (ALLOWED_MODELS allowlist). Both PRs merged to main. Final verification with 4 live Step Functions executions confirmed all scenarios pass. Caveats #9 and #27 accepted as implementation decisions.

How to reproduce this report

Run docs/verification/issue91-verify.sh from the repo root with AWS_PROFILE=tradai and a valid SSO session. The script executes all checks from sections A–G of docs/verification/issue91.md and prints PASS/FAIL per check. See that file for manual check-by-check reproduction.
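A minimal invocation sketch (echoed so it can be reviewed without credentials; assumes the SSO profile named tradai mentioned above):

```shell
PROFILE="tradai"

# Refresh the SSO session, then run the author's script from the repo root.
echo "AWS_PROFILE=$PROFILE aws sso login"
echo "AWS_PROFILE=$PROFILE ./docs/verification/issue91-verify.sh"
```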