A fresh, from-scratch verification of issue #91 against origin/main @ 9b6064f. Written by a verifier who made no use of docs/verification/issue91.md or issue91-verify.sh (both authored by the same contributor who implemented the #91 stack). Assertions were derived solely from the 32 Done When checkboxes in the issue body.
Purpose: provide an audit trail for whether #91 is ready to close and, where the implementation diverges from the ticket, to explicitly flag the divergence rather than silently adopt the implementation as the spec.
Pulled the issue #91 body via `gh issue view 91 --repo tradai-bot/tradai` and extracted the 32 Done When items.
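The extraction step can be sketched roughly as follows. This is an illustrative helper, not the repo's actual tooling; the function name and regex are assumptions, matching only standard GitHub task-list syntax:

```python
import re

# Matches GitHub task-list lines such as "- [x] Model registered ..."
CHECKBOX_RE = re.compile(r"^\s*[-*]\s*\[( |x|X)\]\s+(.*)$")

def extract_done_when(issue_body: str) -> list[tuple[bool, str]]:
    """Return (checked, text) for every markdown checkbox line in the body."""
    items = []
    for line in issue_body.splitlines():
        m = CHECKBOX_RE.match(line)
        if m:
            items.append((m.group(1).lower() == "x", m.group(2).strip()))
    return items

sample = """\
## Done When
- [x] Model registered in MLflow Model Registry
- [ ] Feature importance stored (rsi, sma-ratio)
"""
print(extract_done_when(sample))
```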
Authenticated to AWS dev (account 600802701449, eu-central-1).
Started four fresh Step Functions executions on `tradai-retraining-workflow-dev` with execution names prefixed `independent-*` so they are distinguishable from the author's `verify-*` runs:
| Scenario | Execution name |
|---|---|
| Failure (invalid `model_name`) | `independent-fail-20260423T074145Z` |
| Happy (full training + registration) | `independent-happy-20260423T074145Z` |
| Skip (after happy) | `independent-skip-20260423T075330Z` |
| `config_version_id` passthrough | `independent-cvid-20260423T075330Z` |
For each Done-When item, I issued a query against AWS / MLflow REST / S3 that maps directly to the ticket text. No ticket text was reinterpreted silently.
| Evidence | Value |
|---|---|
| ECS task | `c9f67b6517a647259b3b9a5e37dcb1c8` (launchType=FARGATE, capacityProvider=FARGATE; FARGATE_SPOT weight=1 preferred by cluster default, but Spot capacity unavailable in the placement window) |
| MLflow run | `fb8dc111d79549e088fed947e8c8eadf` in experiment `default_training` (id=9) |

Note: experiment `strategies/e2eteststrategy` does not exist. The run lives in the shared `default_training` experiment (id=9) with tag `strategy=E2ETestStrategy`, so the literal ticket text is unsatisfied.
| # | Done When | Status | Evidence / notes |
|---|---|---|---|
| 10 | Model metrics + parameters logged correctly | ⚠️ | 3 metrics present (`training_profit_pct`, `training_sharpe_ratio`, `training_total_trades`); 0 params. Issue-body-required `n_estimators=50`, `learning_rate=0.1`, `max_depth=3` are in neither params nor tags. → fixed by this PR's commit 1. |
| 11 | Model artefacts in S3 | ✅ | 76 objects at the expected path |
| 12 | Feature importance stored (rsi, sma-ratio) | ❌ | No `feature_importance.{json,csv}` in S3; no booster pickle. `training_features_list` in per-sub-train `_metadata.json` names the features but not their importance scores. → fixed by this PR's commit 2. |
| 13 | Model registered in MLflow Model Registry | ✅ | `E2ETestStrategy` v24 |
| 14 | CompareModels returns decision + confidence | ✅ | `decision=needs_more_data`, `confidence=0.0` |
| 15 | DecidePromotion routes correctly | ✅ | Routed to KeepCurrentModel (expected for `needs_more_data`) |
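The two code gaps above (#10's missing params, #12's missing importance file) can be sketched as follows. This is a hedged illustration of what commits 1 and 2 need to produce, not the actual training-job code; only `rsi` and `sma-ratio` are named in the ticket (the real run has 4 features), and the hyperparameters would presumably be passed to something like `mlflow.log_params(...)`:

```python
import json
import tempfile
from pathlib import Path

FEATURES = ["rsi", "sma-ratio"]  # the ticket's named features; the real run has 4

def params_payload() -> dict:
    # Commit 1: the hyperparameters the issue body requires to be logged.
    return {"n_estimators": 50, "learning_rate": 0.1, "max_depth": 3}

def write_feature_importance(importances, out_dir: Path) -> Path:
    # Commit 2: persist the scores (not just the names) as feature_importance.json
    # alongside the other S3-bound artefacts.
    payload = {name: round(float(s), 6) for name, s in zip(FEATURES, importances)}
    out = out_dir / "feature_importance.json"
    out.write_text(json.dumps(payload, indent=2))
    return out

demo = write_feature_importance([0.61, 0.39], Path(tempfile.mkdtemp()))
print(demo.read_text())
```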
Failure scenario: `model_name="not a valid model name"` → path … → HandleInvalidModel → NotifyFailure.
| # | Done When | Status | Evidence / notes |
|---|---|---|---|
| 19 | `retraining_failed` fires and includes error details | ⚠️ | The type fires, but the SNS body is "Model <name> retraining failed" with no error cause in the message. The Lambda receives `$.error.Cause` via `details.error` but does not render it. → fixed by this PR's commit 3. |
| 20 | No orphaned ECS tasks after failure | ✅ | RunRetraining not entered; `aws ecs list-tasks --cluster tradai-dev` returns empty |
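The #19 fix (commit 3) could look roughly like this. A minimal sketch, assuming the Lambda receives the catch output under `details.error`; the function name and field handling are illustrative, not the deployed code:

```python
import json

def format_failure_message(details: dict) -> str:
    """Render the retraining_failed SNS body, including the error cause."""
    model = details.get("model_name", "<unknown>")
    lines = [f"Model {model} retraining failed"]
    cause = (details.get("error") or {}).get("Cause")
    if cause:
        try:
            # States.ALL causes are often JSON-encoded; surface errorMessage if so.
            parsed = json.loads(cause)
            if isinstance(parsed, dict) and "errorMessage" in parsed:
                cause = parsed["errorMessage"]
        except ValueError:
            pass  # plain-string cause: render as-is
        lines.append(f"Cause: {cause}")
    return "\n".join(lines)
```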
State machine definition: 13 states, all expected transitions. All 5 Task states have a Catch on `States.ALL` (CheckRetrainingNeeded/RunRetraining/CompareModels/PromoteModel → NotifyFailure; UpdateRetrainingState → NotifyCompletion on catch, which is intentional: a state-write failure should not silence a successful training).
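The catch wiring described above could look like the following ASL fragment. This is illustrative, not the deployed definition; the Resource ARN and ResultPath are placeholders, with only the `$.error` path grounded in the `$.error.Cause` observation for #19:

```json
"UpdateRetrainingState": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:updateItem",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "NotifyCompletion"
    }
  ],
  "Next": "NotifyCompletion"
}
```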
| # | Done When | Status | Evidence / notes |
|---|---|---|---|
|  | DynamoDB `tradai-models-{env}` updated with job status | 🔶 | Table `tradai-models-dev` does not exist. The closest analog, `tradai-workflow-state-dev`, IS updated by the happy-path training job (row `run_id=<ARN>`, `status=completed`), but the failure path writes nothing: my ARN_FAIL has 0 rows in any DynamoDB table. Literal ticket text (table name) unsatisfied; functional analog partial. |
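If the failure path were to write a row too, the item might look like this. A sketch only, assuming the attributes observed on the happy path (`run_id`, `status`) plus the declared-but-unpopulated `strategy_name` column; `error_cause` is a hypothetical attribute:

```python
def failure_item(execution_arn: str, strategy: str, cause: str) -> dict:
    """Build a DynamoDB item (AttributeValue format) for a failed run."""
    return {
        "run_id": {"S": execution_arn},      # observed key on the happy path
        "status": {"S": "failed"},
        "strategy_name": {"S": strategy},    # declared in schema, currently never set
        "error_cause": {"S": cause},         # hypothetical attribute
    }
```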
- ⚠️ Partial FAIL: 2 (#10, #19) — both fixed by this PR
- ❌ FAIL: 1 (#12) — fixed by this PR
- ∅ Not independently verifiable: 1 (#5)
- ✳ Substitute/meta: 5 (#29, #30, #31, #32; #24 is a duplicate)
Side findings (outside the Done-When list, surfaced during investigation)
All of these are legitimate production issues I observed that are not tracked by the #91 text. They are flagged here so maintainers can triage them; fixes are NOT in this PR.
1. IAM: the `tradai-notify-completion-dev` Lambda receives `AccessDeniedException` on `tradai-notifications-dev` for `dynamodb:GetItem` (throttle check) and `dynamodb:PutItem` (audit record). The SNS primary path still works; deduplication and the audit record are silently broken.
2. Reproducibility manifest: `artifacts/.../reproducibility/*.json` shows `git_commit="unknown"` and `feature_schema.feature_count=0` despite 4 real features (present in the sub-train metadata). The manifest is shipped but not functional.
3. `tradai-workflow-state-dev` schema gap: the `strategy_name` column in the row written by the training job is empty. The column is declared in the table schema, but the training handler never populates it.
4. Heartbeat divergence: the runbook table documents `heartbeat_seconds=300`; the actual ASL has `HeartbeatSeconds: 900`. Safe today (5.65-minute training < 15 minutes), but documentation-implementation drift.
5. Fargate placement: the cluster default strategy prefers FARGATE_SPOT (weight=1) over FARGATE (weight=0, base=0). My run landed on FARGATE because Spot capacity was temporarily unavailable in eu-central-1 for the chosen task size. The fallback works; this is expected behavior.
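The heartbeat drift above sits in the training task's state definition. An illustrative ASL fragment, assuming RunRetraining is the ECS-backed Task state (the Resource ARN is a placeholder); either this value or the runbook's 300 should be made authoritative:

```json
"RunRetraining": {
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "HeartbeatSeconds": 900,
  "Next": "CompareModels"
}
```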
Do not close #91 at 9b6064f. Three criteria with unambiguous ticket-text violations (#10, #12, #19) need code, not reinterpretation. Two with divergent ticket text (#9, #27) need an explicit product-owner decision — either code to match the ticket, or amend the ticket.
This PR (fix/91-verification-gaps) addresses #10, #12, #19 with code + tests. #9 and #27 are called out in the PR description for product-owner judgement. Side findings (1–5 above) remain as follow-up issues.
All three code gaps (#10, #12, #19) were fixed in PR #370. An additional gap discovered during re-verification — format-valid but unknown model names bypassing validation when force=true — was fixed in PR #371 (ALLOWED_MODELS allowlist). Both PRs merged to main. Final verification with 4 live Step Functions executions confirmed all scenarios pass. Caveats #9 and #27 accepted as implementation decisions.
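The PR #371 behavior described above can be sketched as follows. This is a hedged illustration, not the merged code: the regex, function name, and allowlist contents are assumptions, with only `ALLOWED_MODELS` and the force-bypass gap taken from the description:

```python
import re

ALLOWED_MODELS = {"E2ETestStrategy"}  # illustrative allowlist contents
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{0,63}$")  # assumed format rule

def validate_model_name(name: str, force: bool = False) -> bool:
    if not NAME_RE.match(name):
        return False  # malformed names fail regardless of force
    # Pre-fix behavior let format-valid but unknown names through when
    # force=True; post-fix, the allowlist check always runs.
    return name in ALLOWED_MODELS

print(validate_model_name("not a valid model name"))       # → False (format)
print(validate_model_name("UnknownStrategy", force=True))  # → False (allowlist)
print(validate_model_name("E2ETestStrategy"))              # → True
```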
Run `docs/verification/issue91-verify.sh` from the repo root with `AWS_PROFILE=tradai` and a valid SSO session. The script executes all checks from sections A–G of docs/verification/issue91.md and prints PASS/FAIL per check. See that file for manual check-by-check reproduction.