TradAI AWS Dev Running Demo — Guide
Audience: anyone running the end-to-end TradAI running demo against the AWS dev environment, or reviewing its results afterwards. Scope: dry-run trading only. Live exchange trading is intentionally out of scope for this flow.
The full demo executes 14 architecture steps end-to-end (auth → catalog → data sync → backtest → progress → results → leaderboard → stage → dry-run launch → live PnL → promote → observability → rollback → cleanup). The running demo is verified against demo-architecture.md.
AWS dev environment: account 600802701449, region eu-central-1, public entrypoint https://api-dev.tradai-system.com.
1. Prerequisites
1.1 Tools
You need these on $PATH:
bash # 4+; on Windows use Git Bash
curl
jq
aws # AWS CLI v2
docker # only if you also want to rebuild/redeploy
Quick check:
for t in bash curl jq aws; do command -v "$t" >/dev/null && echo "✓ $t" || echo "✗ $t MISSING"; done
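The quick check confirms that the tools exist, but not that bash meets the 4+ requirement (stock macOS ships bash 3.2). A minimal sketch of a version check follows; the function name bash_at_least is hypothetical and not part of the demo scripts.

```shell
# Hypothetical helper (not part of scripts/demo/): succeed only when the
# running shell is bash with a major version >= the requested one.
bash_at_least() {
  local want="$1"
  # In bash, $BASH_VERSINFO expands to the major version; elsewhere it is unset.
  local have="${BASH_VERSINFO:-0}"
  [ "$have" -ge "$want" ]
}

# Report the result up front instead of failing deep inside a script.
if bash_at_least 4; then
  echo "✓ bash ${BASH_VERSION}"
else
  echo "✗ bash 4+ required (found: ${BASH_VERSION:-not bash})"
fi
```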
1.2 AWS profile
The demo uses an SSO profile named tradai. Configure once with aws configure sso against the SSO start URL https://d-99675533ed.awsapps.com/start, region eu-central-1, role AdministratorAccess (or whichever role you were granted).
Result (~/.aws/config):
[profile tradai]
sso_start_url = https://d-99675533ed.awsapps.com/start
sso_region = eu-central-1
sso_account_id = 600802701449
sso_role_name = AdministratorAccess
region = eu-central-1
output = json
Sign in:
aws sso login --profile tradai
run-all.sh runs an SSO preflight on start: if the session is missing or expired, it runs aws sso login --profile $AWS_PROFILE for you. Set DEMO_AUTO_SSO_LOGIN=0 to fail fast instead of launching the browser (useful for CI).
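The preflight behaviour described above could look roughly like the sketch below. The function name sso_preflight is hypothetical and the real run-all.sh logic may differ; only the aws sts get-caller-identity and aws sso login commands and the two environment variables come from this guide.

```shell
# Sketch of an SSO preflight (hypothetical function, not the run-all.sh source).
sso_preflight() {
  local profile="${AWS_PROFILE:-tradai}"
  # A valid session lets STS answer; a missing or expired one makes the call fail.
  if AWS_PROFILE="$profile" aws sts get-caller-identity >/dev/null 2>&1; then
    echo "[PASS] SSO session valid for profile $profile"
    return 0
  fi
  if [ "${DEMO_AUTO_SSO_LOGIN:-1}" = "0" ]; then
    echo "[FAIL] SSO session missing or expired (auto-login disabled)" >&2
    return 1
  fi
  # Starts the browser-based SSO flow, then re-checks the session.
  AWS_PROFILE="$profile" aws sso login &&
    AWS_PROFILE="$profile" aws sts get-caller-identity >/dev/null
}
```

With DEMO_AUTO_SSO_LOGIN=0 the function returns non-zero without opening a browser, which is the behaviour a CI job would want.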
1.3 Clone the repo and check out main
The demo scripts live in scripts/demo/. They are Bash scripts (Bash 4+, see section 1.1), verified on Linux, macOS, and Git Bash on Windows.
1.4 Environment variables (optional)
The scripts work out of the box. Override these only if you want to point at a different environment or change the demo target:
| Variable | Default | What it does |
|---|---|---|
| AWS_PROFILE | tradai | AWS SSO profile to use |
| AWS_REGION | eu-central-1 | AWS region |
| ENVIRONMENT | dev | Environment name used in resource naming |
| API_BASE | https://api-dev.tradai-system.com | Public ALB entrypoint |
| DEMO_STRATEGY | E2ETestStrategy | Strategy name used by stage/run/promote/rollback |
| DEMO_MODEL_NAME | $DEMO_STRATEGY | MLflow registered-model name |
| DEMO_LEADERBOARD_MODEL | PascalStrategy | Strategy bootstrapped to Production if the leaderboard is empty |
| DEMO_SYMBOL | BTC/USDT:USDT | Symbol used for data sync + backtest |
| DEMO_TIMEFRAME | 1h | Timeframe |
| DEMO_EXCHANGE | binance_futures | Exchange |
| DEMO_START_DATE / DEMO_END_DATE | 2025-01-01 / 2025-02-01 | Backtest date range |
| DEMO_BACKTEST_TIMEOUT_SECONDS | 900 | Max wait for backtest to complete |
| DEMO_DRY_RUN_TIMEOUT_SECONDS | 300 | Max wait for ECS task to reach RUNNING + emit PnL |
| DEMO_POLL_INTERVAL_SECONDS | 15 | Polling interval |
| DEMO_ROLLBACK_TARGET_VERSION | unset | Explicit rollback target (otherwise auto-captured from step 11) |
| DEMO_AUTO_SSO_LOGIN | 1 | Auto-run aws sso login on expired session (set to 0 for fail-fast) |
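One common way to implement "default unless overridden" in bash is the ${VAR:=default} expansion. The sketch below applies it to a subset of the variables above; load_demo_defaults is a hypothetical name, and the real lib.sh may set its defaults differently.

```shell
# Sketch of default handling (hypothetical helper, not the actual lib.sh code).
load_demo_defaults() {
  # ":=" assigns only when the variable is unset or empty, so values exported
  # by the caller take precedence over the documented defaults.
  : "${AWS_PROFILE:=tradai}"
  : "${AWS_REGION:=eu-central-1}"
  : "${ENVIRONMENT:=dev}"
  : "${API_BASE:=https://api-dev.tradai-system.com}"
  : "${DEMO_STRATEGY:=E2ETestStrategy}"
  : "${DEMO_MODEL_NAME:=$DEMO_STRATEGY}"   # follows DEMO_STRATEGY unless set
  : "${DEMO_SYMBOL:=BTC/USDT:USDT}"
  : "${DEMO_TIMEFRAME:=1h}"
  : "${DEMO_BACKTEST_TIMEOUT_SECONDS:=900}"
  : "${DEMO_POLL_INTERVAL_SECONDS:=15}"
}
```

Under this pattern, a one-off override such as DEMO_BACKTEST_TIMEOUT_SECONDS=1800 placed in front of a script invocation changes only that invocation.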
2. Run the full demo
bash scripts/demo/run-all.sh
That executes the preflight + steps 00-14 in order. Each step prints [PASS] / [FAIL] lines, and a successful run ends with a final success line.
If a step fails, the cleanup_on_error trap automatically runs 14-cleanup.sh so we never leave the dry-run ECS task running.
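The cleanup-on-failure idea can be sketched with an EXIT trap scoped to a subshell. This is an illustration only: run_steps and cleanup_step are hypothetical stand-ins (cleanup_step would wrap 14-cleanup.sh), and run-all.sh may be structured differently.

```shell
# Sketch only: run each step in order and run cleanup if any step fails.
run_steps() (
  # The function body is a subshell, so the EXIT trap is scoped to this run.
  trap 'rc=$?; if [ "$rc" -ne 0 ]; then cleanup_step; fi; exit "$rc"' EXIT
  for step in "$@"; do
    if "$step"; then
      echo "[PASS] $step"
    else
      echo "[FAIL] $step"
      exit 1   # triggers the EXIT trap, which runs the cleanup
    fi
  done
)
```

An EXIT trap is used here rather than set -e or an ERR trap because bash ignores both inside conditions and || lists, whereas the EXIT trap always fires when the subshell ends.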
Typical wall-clock duration: 6-10 minutes on a warm environment (dominated by the backtest in step 04 and the dry-run polling in steps 09 and 10).
Running a single step
Every script is standalone. State (token, job_id, model version, instance id) is persisted under ./.demo-state/ so consecutive scripts can pick up where the previous left off. Examples:
bash scripts/demo/01-login-auth.sh # acquire a fresh Cognito M2M token
bash scripts/demo/07-leaderboard.sh # just the leaderboard check
bash scripts/demo/14-cleanup.sh # scale strategy service back to 0
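The file-based hand-off between steps can be pictured with two small helpers. The names save_state and load_state and the STATE_DIR variable are assumptions made for illustration; they are not the actual lib.sh API.

```shell
# Hypothetical helpers illustrating the state hand-off between steps.
STATE_DIR="${STATE_DIR:-./.demo-state}"

save_state() {   # save_state <key> <value>
  mkdir -p "$STATE_DIR"
  printf '%s\n' "$2" > "$STATE_DIR/$1"
}

load_state() {   # load_state <key>; fails if the key was never written
  if [ ! -s "$STATE_DIR/$1" ]; then
    echo "[FAIL] missing state '$1'; run the step that produces it first" >&2
    return 1
  fi
  cat "$STATE_DIR/$1"
}
```

Failing loudly on a missing key matters when steps are run individually: a later step then reports which earlier step has not been run instead of sending an empty value to the API.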
3. Step-by-step walkthrough
Every step corresponds to a section in demo-architecture.md. Reading the architecture doc alongside this guide is recommended on a first run.
Step 00 — Pre-demo setup
Script: 00-pre-demo-setup.sh
Verifies AWS identity, S3 buckets, ALB HTTPS + redirect, WAF association, the Cognito user pool and M2M client, a confirmed SNS subscription, ECS deployment circuit breakers on all three demo strategy services, registration of the demo model, and backend health. If the default Production leaderboard is empty, it bootstraps DEMO_LEADERBOARD_MODEL to Production.
Step 01 — Login / Auth
Script: 01-login-auth.sh
Anonymous protected route returns 401. Cognito M2M token is acquired from the configured user pool. Authenticated request returns 200. Token is saved to .demo-state/token.
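The 401-then-200 check can be reproduced by hand with curl's status-code output. The helpers below are a sketch with hypothetical names; which route the real script probes is not stated here, so the URL is left as a parameter.

```shell
# Sketch: print only the HTTP status code of a GET request.
# With a second argument the request carries a Bearer token.
http_status() {
  local url="$1" token="${2:-}"
  if [ -n "$token" ]; then
    curl -sS -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $token" "$url"
  else
    curl -sS -o /dev/null -w '%{http_code}' "$url"
  fi
}

# Compare against the expected code and print a demo-style result line.
expect_status() {   # expect_status <expected> <url> [token]
  local want="$1" got
  got=$(http_status "$2" "${3:-}")
  if [ "$got" = "$want" ]; then
    echo "[PASS] $2 -> $got"
  else
    echo "[FAIL] $2 -> $got (expected $want)"
    return 1
  fi
}
```

For example, expect_status 401 "$API_BASE/<protected-route>" followed by expect_status 200 "$API_BASE/<protected-route>" "$(cat .demo-state/token)" mirrors the two assertions of this step for any protected route.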
Step 02 — Catalog
Script: 02-catalog.sh
Listing returns an array. Detail endpoint for DEMO_LEADERBOARD_MODEL matches its name. DEMO_MODEL_NAME has at least one registered MLflow version.
Step 03 — Data sync and coverage
Script: 03-data-sync.sh
Reads freshness for DEMO_SYMBOL, posts a sync for the demo date range, verifies the architecture-level coverage endpoint returns the requested range.
Step 04 — Submit backtest
Script: 04-submit-backtest.sh
POST /api/v1/backtests returns 201 with a job_id. Saved to .demo-state/job_id. The backtest runs in Step Functions / ECS in the background.
Step 05 — Track progress
Script: 05-track-progress.sh
Polls GET /api/v1/backtests/{job_id} until status is completed. Default timeout 15 min.
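A polling loop with a timeout, of the kind this step describes, might be sketched as follows. The function name is hypothetical, and treating a failed status as terminal is an assumption; the guide only states that polling continues until the status is completed.

```shell
# Sketch of a poll-until-completed loop. The trailing command is anything that
# prints the current job status; "failed" as a terminal state is an assumption.
poll_until_completed() {   # poll_until_completed <timeout_s> <interval_s> <cmd...>
  local timeout="$1" interval="$2" elapsed=0 status
  shift 2
  while :; do
    status=$("$@") || status="unknown"
    case "$status" in
      completed) echo "[PASS] completed after ${elapsed}s"; return 0 ;;
      failed)    echo "[FAIL] job failed after ${elapsed}s"; return 1 ;;
    esac
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "[FAIL] timed out after ${timeout}s (last status: $status)"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Possible wiring with the documented endpoint and defaults (needs curl and jq):
#   backtest_status() {
#     curl -sS -H "Authorization: Bearer $(cat .demo-state/token)" \
#       "$API_BASE/api/v1/backtests/$(cat .demo-state/job_id)" | jq -r '.status'
#   }
#   poll_until_completed 900 15 backtest_status
```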
Step 06 — Results and KPIs
Script: 06-results-kpis.sh
Verifies metric/KPI payload, equity endpoint, report-data endpoint, backtest traceability (non-null trace_id or mlflow_run_id, real 40-character git_commit).
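The traceability conditions translate into two small predicates. The field names come from the step description; the helper names are hypothetical, and the sketch assumes values were extracted with jq -r, which prints the string null for missing fields.

```shell
# Sketch of the traceability checks (hypothetical helpers).
is_full_git_sha() {
  # Exactly 40 characters, all of them lowercase hex digits.
  [ "${#1}" -eq 40 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
}

has_trace_ref() {   # has_trace_ref <trace_id> <mlflow_run_id>
  # Succeeds if at least one argument is neither empty nor the string "null".
  local v
  for v in "$@"; do
    if [ -n "$v" ] && [ "$v" != "null" ]; then return 0; fi
  done
  return 1
}
```

A placeholder such as unknown or an abbreviated 7-character hash fails is_full_git_sha, which is the point of requiring a real 40-character commit.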
Step 07 — Leaderboard
Script: 07-leaderboard.sh
Default Production leaderboard is non-empty. Configured stage leaderboard has at least one scored entry.
Step 08 — Stage model
Script: 08-stage-model.sh
POST /api/v1/strategies/{name}/stage returns 200, MLflow staging alias is set. State .demo-state/staged_version is updated.
Step 09 — Start dry-run trading
Script: 09-dry-run-start.sh
POST /api/v1/strategies/{name}/run returns 201. Backend registers a new ECS task definition revision with TRADING_MODE=dry-run, STRATEGY_ID, TRADING_STATE_TABLE, PAIRS, and CONFIG_OVERRIDES, and updates the strategy service to desired=1. Polls until ECS reports the task as RUNNING. instance_id saved to .demo-state/instance_id.
Step 10 — Status, PnL, logs
Script: 10-status-pnl-logs.sh
Logs are read either through the backend logs endpoint or, as a documented fallback, directly from CloudWatch. Trading status returns summary and instances arrays. PnL endpoint returns at least one strategy snapshot for the running task.
Step 11 — Promote to Production (alias only, no live trading)
Script: 11-promote-production.sh
Captures the current Production version (for the step 13 rollback). POST /api/v1/strategies/{name}/promote sets the MLflow champion alias and archives the previous champion. Verifies the new version is persisted as Production and the previous one is Archived.
Step 12 — Observability
Script: 12-observability.sh
Checks that there is a confirmed SNS email subscription, that ECS deployment circuit breakers are enabled, that the CloudWatch dashboard tradai-dev exists, and that log group /ecs/tradai/dev is readable.
Step 13 — Rollback
Script: 13-rollback.sh
Rolls back the model to DEMO_ROLLBACK_TARGET_VERSION if set, otherwise to the Production version captured in step 11. Verifies the target version is persisted as Production after rollback.
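The target selection described above amounts to a short precedence rule. In the sketch below the function name and STATE_DIR are assumptions; the state file name matches the table in section 4.1 and the error text matches the message in section 5.5.

```shell
# Sketch of the rollback target selection (hypothetical helper).
resolve_rollback_target() {
  local state_dir="${STATE_DIR:-./.demo-state}"
  if [ -n "${DEMO_ROLLBACK_TARGET_VERSION:-}" ]; then
    echo "$DEMO_ROLLBACK_TARGET_VERSION"           # explicit override wins
  elif [ -s "$state_dir/previous_production_version" ]; then
    cat "$state_dir/previous_production_version"  # captured in step 11
  else
    echo "No previous Production version captured" >&2
    return 1
  fi
}
```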
Step 14 — Cleanup
Script: 14-cleanup.sh
Stops the dry-run task, scales the strategy service back to zero, waits for rollout COMPLETED. Always idempotent.
4. Where to find each artifact / report
After (or during) a run, you can inspect everything the demo produced through the API, the AWS Console, and the local .demo-state/ directory.
4.1 Local state files
./.demo-state/ (created in your working directory by lib.sh):
| File | Content |
|---|---|
| token | Cognito M2M access token from step 01 |
| job_id | Backtest job UUID from step 04 |
| backtest.json | Full backtest result (trades, KPIs, equity); refreshed by step 05/06 |
| model_version | MLflow version targeted by the demo (steps 08/11/13) |
| staged_version | Version staged in step 08 |
| production_version | Version promoted in step 11 |
| previous_production_version | Production version captured before step 11 (target for step 13) |
| instance_id | ECS task ID launched in step 09 |
| alb_arn | Discovered dev ALB ARN |
Summarise backtest.json:
jq '{job_id, status, result: {trades: .result.total_trades, sharpe: .result.metrics.sharpe, drawdown_pct: .result.max_drawdown_pct}}' .demo-state/backtest.json
4.2 Backtest results via API
Replace <job_id> with $(cat .demo-state/job_id):
| Endpoint | Returns |
|---|---|
| GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id> | Job + full result (metrics, trades, traceability) |
| GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id>/equity | Equity curve (time series) |
| GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id>/report-data | Detailed trade list + per-trade PnL |
You need a Bearer token; reuse .demo-state/token or run 01-login-auth.sh again.
Example:
TOKEN=$(cat .demo-state/token)
JOB=$(cat .demo-state/job_id)
curl -sS -H "Authorization: Bearer $TOKEN" \
"https://api-dev.tradai-system.com/api/v1/backtests/$JOB" \
| jq '.result.metrics'
4.3 MLflow runs and registry
MLflow is internal — there is no public web URL. Access through the backend proxy or the platform team's port-forward:
| Endpoint | Returns |
|---|---|
| GET /api/v1/models/{name}/versions?include_archived=true | Every version with alias-driven current_stage |
| GET /api/v1/catalog/strategies/{name} | Strategy summary (latest version, stage, tags, KPIs) |
| GET /api/v1/catalog/leaderboard | Ranked Production strategies |
| Backtest result .result.mlflow_run_id | MLflow run ID for that backtest |
If you have the platform team's MLflow access: https://mlflow-internal.tradai-system.com/#/experiments/<exp_id>/runs/<run_id> (ask in #tradai-platform for the exact internal URL).
4.4 Live trading state (CloudWatch + DynamoDB)
For the dry-run instance launched in step 09:
CloudWatch logs (replace <instance_id> with $(cat .demo-state/instance_id)):
- Console: Log group /ecs/tradai/dev
- Stream name: strategy/strategy/<instance_id>
- CLI (Git Bash users: prefix with MSYS_NO_PATHCONV=1 so the / in the group name is preserved):
aws logs get-log-events --profile tradai --region eu-central-1 \
  --log-group-name /ecs/tradai/dev \
  --log-stream-name "strategy/strategy/$(cat .demo-state/instance_id)" \
  --limit 50
DynamoDB state (heartbeat, status, PnL snapshot, trades):
- Console: Table tradai-trading-state-dev
- CLI:
ECS service: strategy-e2eteststrategy
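For a quick look at the DynamoDB state table from the command line, a bounded scan works without knowing the table's key schema. This is a generic sketch, not necessarily the command the demo uses; the function name is hypothetical, and the table name, profile, and region are the ones given in this guide.

```shell
# Sketch: peek at a few items of the trading-state table.
# scan reads the table sequentially, so keep --max-items small.
peek_trading_state() {
  local table="${1:-tradai-trading-state-dev}"
  AWS_PROFILE="${AWS_PROFILE:-tradai}" aws dynamodb scan \
    --region "${AWS_REGION:-eu-central-1}" \
    --table-name "$table" \
    --max-items 5 \
    --output json
}
```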
4.5 Step Functions execution (backtest workflow)
- Console: State machines in eu-central-1 — the backtest workflow shows every step (NormalizeInput → ValidateStrategy → EnsureData → RunBacktest → RegisterModel → UpdateStatus) and links to the per-step CloudWatch logs.
4.6 Observability
- CloudWatch dashboard: tradai-dev (latency, error rate, ECS task counts).
- SNS alerts topic: tradai-alerts-dev (at least one confirmed email subscription).
- CloudWatch Alarms: filter tradai-.
4.7 S3 artifacts
| Bucket | Holds |
|---|---|
| tradai-arcticdb-dev | ArcticDB market data |
| tradai-mlflow-dev | MLflow run artifacts (model files, configs) |
| tradai-results-dev | Backtest result JSONs, organised by job_id/trace_id |
| tradai-configs-dev | Strategy configs |
4.8 Strategy / backend container logs
Backend and strategy-service log to /ecs/tradai/dev too, on streams prefixed with their container name. CloudWatch Logs Insights:
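As a command-line alternative to a Logs Insights query, aws logs filter-log-events can pull recent messages from every stream that shares a prefix. The sketch below uses a hypothetical function name; the stream prefix is left as a parameter because the exact container prefixes are not listed here.

```shell
# Sketch: print recent messages from streams with a given prefix.
tail_container_logs() {   # tail_container_logs <stream-prefix> [max-items]
  # MSYS_NO_PATHCONV=1 stops Git Bash from rewriting the leading "/" of the
  # log group name; it has no effect on Linux and macOS.
  MSYS_NO_PATHCONV=1 AWS_PROFILE="${AWS_PROFILE:-tradai}" aws logs filter-log-events \
    --region "${AWS_REGION:-eu-central-1}" \
    --log-group-name /ecs/tradai/dev \
    --log-stream-name-prefix "$1" \
    --max-items "${2:-50}" \
    --query 'events[].message' \
    --output text
}
```

For example, tail_container_logs strategy/strategy/ would cover the dry-run streams named in section 4.4.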
5. Troubleshooting
5.1 aws sso login keeps asking on every run
You probably ran the preflight check after an interactive login but inside a different shell. Verify the cache is shared:
If get-caller-identity works with AWS_PROFILE= env-var but not --profile flag, that's a known Windows quirk — use the env-var form (run-all.sh already does).
5.2 Step 04 / 05 times out
Backtests can be slow on first invocation when warm-up data is missing. Bump the timeout:
export DEMO_BACKTEST_TIMEOUT_SECONDS=1800   # exported so step 05, which does the waiting, also sees it
bash scripts/demo/04-submit-backtest.sh
bash scripts/demo/05-track-progress.sh
If the job is still running but never completes, check the Step Functions execution in the AWS Console for the actual failure.
5.3 Step 09 dry-run never reaches RUNNING
Look at the most recent ECS task:
aws ecs list-tasks --profile tradai --region eu-central-1 \
--cluster tradai-dev --service-name strategy-e2eteststrategy \
--desired-status STOPPED | jq '.taskArns'
Then aws ecs describe-tasks ... on a stopped task — stoppedReason usually identifies the issue (image pull failure, missing env, etc.).
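The two commands can be chained so that the stop reason is printed in one go. This is a sketch with a hypothetical function name; it relies on AWS_PROFILE and AWS_REGION being exported, and it inspects the first stopped task that list-tasks returns, which is not guaranteed to be the newest.

```shell
# Sketch: print stoppedReason for one stopped task of a service.
last_stopped_reason() {
  local cluster="${1:-tradai-dev}" service="${2:-strategy-e2eteststrategy}" task
  task=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns[0]' --output text) || return 1
  # With --output text, an empty result is printed as the string "None".
  if [ -z "$task" ] || [ "$task" = "None" ]; then
    echo "no stopped tasks found for $service" >&2
    return 1
  fi
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
    --query 'tasks[0].stoppedReason' --output text
}
```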
5.4 Step 10 PnL endpoint returns strategies: []
This is fixed in main. If you see it on a deployed environment, that deployment is behind main and is missing one of:
- PR #434 (HealthReporter initial zero-PnL snapshot + ensure_exists on the state row)
- PR #435 (backend injects TRADING_STATE_TABLE into the strategy runtime env so state management is not skipped)
Re-deploy backend + strategy-service from main.
5.5 Step 13 fails with No previous Production version captured
The demo cannot capture a previous Production version when the just-staged version (step 08) was itself the previous Production. This happens on a "clean" registry where the latest non-archived version is the one that was just rolled back to in a prior run.
Workaround (until gap A in #437 lands): register a fresh model version pointing at the most recent backtest's MLflow run, then re-run:
TOKEN=$(cat .demo-state/token)
RUN_ID=$(jq -r '.result.mlflow_run_id' .demo-state/backtest.json)
curl -sS -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "{\"backtest_run_id\":\"$RUN_ID\",\"docker_image_uri\":\"600802701449.dkr.ecr.eu-central-1.amazonaws.com/tradai/e2eteststrategy:latest\",\"skip_validation\":true,\"strategy_version\":\"0.1.bump\",\"description\":\"manual bootstrap\"}" \
"https://api-dev.tradai-system.com/api/v1/strategies/E2ETestStrategy/register"
bash scripts/demo/run-all.sh
The real fix is for the backtest workflow to register a new model version on every successful run (tracked in #437).
5.6 MSYS_NO_PATHCONV=1 everywhere on Windows
Git Bash on Windows converts /ecs/tradai/dev to C:/Program Files/Git/ecs/... which breaks AWS CLI. The demo scripts that touch log groups already prefix MSYS_NO_PATHCONV=1. If you call AWS CLI by hand from Git Bash, do the same.
6. Known limitations
The full set of known gaps and follow-ups is tracked in #437. The most relevant for someone running the demo today:
- The demo is not idempotent without manual model-version registration. After a complete promote+rollback cycle the registry's only non-archived version is the previous champion; the next run's stage step clears that champion alias and step 13 has nothing to roll back to. Fix is gap A in #437.
- Live trading is intentionally out of scope for run-all.sh. Live mode coverage is gated on #418/#419/#421.
- X-Ray service map is not exercised (deferred per #415).
For the demo flow that is in scope (dry-run paper trading, promote/rollback as alias-only operations, full observability checks), the running demo passes 14/14 against AWS dev when the registry has a fresh non-archived version to stage. The three PRs that made this end-to-end flow possible (#434, #435, #436) are merged into main.
7. Sources
- Demo architecture: demo-architecture.md
- Script reference: aws-dev-demo-runbook.md
- Gap tracking: tradai-bot/tradai#437
- Plan / status history: demo-week-plan.md