
TradAI AWS Dev Running Demo — Guide

Audience: anyone running the end-to-end TradAI running demo against the AWS dev environment, or reviewing its results afterwards. Scope: dry-run trading only. Live exchange trading is intentionally out of scope for this flow.

The full demo executes 14 architecture steps end-to-end (auth → catalog → data sync → backtest → progress → results → leaderboard → stage → dry-run launch → live PnL → promote → observability → rollback → cleanup). The running demo is verified against demo-architecture.md.

AWS dev environment: account 600802701449, region eu-central-1, public entrypoint https://api-dev.tradai-system.com.


1. Prerequisites

1.1 Tools

You need these on $PATH:

bash   # 4+; on Windows use Git Bash
curl
jq
aws    # AWS CLI v2
docker # only if you also want to rebuild/redeploy

Quick check:

for t in bash curl jq aws; do command -v "$t" >/dev/null && echo "✓ $t" || echo "✗ $t MISSING"; done

1.2 AWS profile

The demo uses an SSO profile named tradai. Configure once with aws configure sso against the SSO start URL https://d-99675533ed.awsapps.com/start, region eu-central-1, role AdministratorAccess (or whichever role you were granted).

Result (~/.aws/config):

[profile tradai]
sso_start_url = https://d-99675533ed.awsapps.com/start
sso_region    = eu-central-1
sso_account_id = 600802701449
sso_role_name = AdministratorAccess
region        = eu-central-1
output        = json

Sign in:

aws sso login --profile tradai

run-all.sh runs an SSO preflight on start: if the session is missing or expired it auto-invokes aws sso login --profile $AWS_PROFILE for you. Set DEMO_AUTO_SSO_LOGIN=0 to fail-fast instead of launching the browser (useful for CI).
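Here is a minimal sketch of what such a preflight can look like. It assumes that a failing aws sts get-caller-identity means the session is missing or expired. The function name ensure_sso_session is hypothetical and is not taken from run-all.sh.

```bash
# Hypothetical SSO preflight sketch (not the actual run-all.sh code).
# Assumption: a failing `aws sts get-caller-identity` means the SSO
# session is missing or expired.
ensure_sso_session() {
  local profile="${AWS_PROFILE:-tradai}"

  # Env-var form rather than --profile (see troubleshooting 5.1).
  if AWS_PROFILE="$profile" aws sts get-caller-identity >/dev/null 2>&1; then
    return 0
  fi

  if [ "${DEMO_AUTO_SSO_LOGIN:-1}" = "0" ]; then
    echo "[FAIL] SSO session for profile '$profile' is missing or expired" >&2
    return 1
  fi

  # Opens the browser sign-in, then re-checks that the session now works.
  aws sso login --profile "$profile" || return 1
  AWS_PROFILE="$profile" aws sts get-caller-identity >/dev/null 2>&1
}
```

With DEMO_AUTO_SSO_LOGIN=0 the function returns non-zero instead of opening a browser, which is the fail-fast behaviour CI needs.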

1.3 Clone the repo and check out main

git clone git@github.com:tradai-bot/tradai.git
cd tradai
git checkout main && git pull

The demo scripts live in scripts/demo/. They are plain bash scripts (bash 4+), verified on Linux, macOS, and Git Bash on Windows.

1.4 Environment variables (optional)

The scripts work out of the box. Override these only if you want to point at a different environment or change the demo target:

Variable | Default | What it does
--- | --- | ---
AWS_PROFILE | tradai | AWS SSO profile to use
AWS_REGION | eu-central-1 | AWS region
ENVIRONMENT | dev | Environment name used in resource naming
API_BASE | https://api-dev.tradai-system.com | Public ALB entrypoint
DEMO_STRATEGY | E2ETestStrategy | Strategy name used by stage/run/promote/rollback
DEMO_MODEL_NAME | $DEMO_STRATEGY | MLflow registered-model name
DEMO_LEADERBOARD_MODEL | PascalStrategy | Strategy bootstrapped to Production if the leaderboard is empty
DEMO_SYMBOL | BTC/USDT:USDT | Symbol used for data sync and backtest
DEMO_TIMEFRAME | 1h | Timeframe
DEMO_EXCHANGE | binance_futures | Exchange
DEMO_START_DATE / DEMO_END_DATE | 2025-01-01 / 2025-02-01 | Backtest date range
DEMO_BACKTEST_TIMEOUT_SECONDS | 900 | Max wait (seconds) for the backtest to complete
DEMO_DRY_RUN_TIMEOUT_SECONDS | 300 | Max wait (seconds) for the ECS task to reach RUNNING and emit PnL
DEMO_POLL_INTERVAL_SECONDS | 15 | Polling interval (seconds)
DEMO_ROLLBACK_TARGET_VERSION | unset | Explicit rollback target (otherwise auto-captured in step 11)
DEMO_AUTO_SSO_LOGIN | 1 | Auto-run aws sso login on an expired session (set to 0 to fail fast)
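These defaults behave like ordinary bash parameter-expansion fallbacks. Below is a minimal sketch, assuming lib.sh resolves them with the := expansion (the real script may do it differently). DEMO_MODEL_NAME is resolved after DEMO_STRATEGY, so an overridden strategy name carries through.

```bash
# Hypothetical sketch of default resolution; only variables from the
# table above are used. Any value already in the environment wins.
: "${AWS_PROFILE:=tradai}"
: "${AWS_REGION:=eu-central-1}"
: "${ENVIRONMENT:=dev}"
: "${API_BASE:=https://api-dev.tradai-system.com}"
: "${DEMO_STRATEGY:=E2ETestStrategy}"
: "${DEMO_MODEL_NAME:=$DEMO_STRATEGY}"      # falls back to the strategy name
: "${DEMO_LEADERBOARD_MODEL:=PascalStrategy}"
: "${DEMO_SYMBOL:=BTC/USDT:USDT}"
: "${DEMO_TIMEFRAME:=1h}"
: "${DEMO_EXCHANGE:=binance_futures}"
: "${DEMO_START_DATE:=2025-01-01}"
: "${DEMO_END_DATE:=2025-02-01}"
: "${DEMO_BACKTEST_TIMEOUT_SECONDS:=900}"
: "${DEMO_DRY_RUN_TIMEOUT_SECONDS:=300}"
: "${DEMO_POLL_INTERVAL_SECONDS:=15}"
: "${DEMO_AUTO_SSO_LOGIN:=1}"
# DEMO_ROLLBACK_TARGET_VERSION is intentionally left unset by default.
export AWS_PROFILE AWS_REGION
```

With this pattern, a one-off override is just an environment prefix, for example DEMO_POLL_INTERVAL_SECONDS=30 bash scripts/demo/run-all.sh.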

2. Run the full demo

AWS_PROFILE=tradai bash scripts/demo/run-all.sh

This runs the SSO preflight and then steps 00-14 in order. Each step prints [PASS] / [FAIL] lines; on success the last line is:

[PASS] Full running demo flow completed.

If a step fails, the cleanup_on_error trap automatically runs 14-cleanup.sh, so a failed run never leaves the dry-run ECS task running.
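A minimal sketch of that pattern follows. It assumes a bash ERR trap combined with set -E, so the trap also fires inside functions. Only cleanup_on_error and 14-cleanup.sh come from this guide; run_steps and DEMO_DIR are hypothetical names.

```bash
# Hypothetical sketch of an ERR-trap cleanup; not the real run-all.sh.
DEMO_DIR="${DEMO_DIR:-scripts/demo}"

cleanup_on_error() {
  local status=$?                  # exit status of the command that failed
  trap - ERR                       # do not re-enter the trap during cleanup
  echo "[FAIL] step failed (exit $status); running 14-cleanup.sh" >&2
  bash "$DEMO_DIR/14-cleanup.sh" || echo "[WARN] cleanup failed too" >&2
  exit "$status"
}

run_steps() {                      # run_steps <script>...
  set -Eeo pipefail                # -E: functions inherit the ERR trap
  trap cleanup_on_error ERR
  local step
  for step in "$@"; do
    bash "$DEMO_DIR/$step"
  done
  trap - ERR
  echo "[PASS] Full running demo flow completed."
}
```

run_steps changes shell options and exits on failure, so call it in a subshell, for example ( run_steps 00-pre-demo-setup.sh 01-login-auth.sh ).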

Typical wall-clock duration: 6-10 minutes on a warm environment. Most of that time goes to the backtest (submitted in step 04 and polled in step 05) and the dry-run polling in steps 09 and 10.

Running a single step

Every script is standalone. State (token, job_id, model version, instance_id) is persisted under ./.demo-state/, so each script can pick up where the previous one left off. Examples:

bash scripts/demo/01-login-auth.sh   # acquire a fresh Cognito M2M token
bash scripts/demo/07-leaderboard.sh  # just the leaderboard check
bash scripts/demo/14-cleanup.sh      # scale strategy service back to 0
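The state handoff between scripts can be sketched as two small helpers, one file per key. save_state, load_state, and STATE_DIR are illustrative names, not necessarily what lib.sh defines.

```bash
# Hypothetical state helpers in the spirit of lib.sh.
STATE_DIR="${STATE_DIR:-./.demo-state}"

save_state() {                     # save_state <key> <value>
  mkdir -p "$STATE_DIR"
  printf '%s\n' "$2" > "$STATE_DIR/$1"
}

load_state() {                     # load_state <key>; fails if never saved
  local file="$STATE_DIR/$1"
  if [ ! -s "$file" ]; then
    echo "[FAIL] no '$1' in $STATE_DIR; run the step that produces it first" >&2
    return 1
  fi
  cat "$file"
}
```

A later step would then start with something like JOB=$(load_state job_id) || exit 1, so running a script out of order fails with a clear message instead of an empty variable.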

3. Step-by-step walkthrough

Every step corresponds to a section in demo-architecture.md. On a first run, reading the architecture doc alongside this guide is highly recommended.

Step 00 — Pre-demo setup

Script: 00-pre-demo-setup.sh

Verifies the AWS identity, S3 buckets, ALB HTTPS and redirect, WAF association, Cognito user pool and M2M client, a confirmed SNS subscription, ECS deployment circuit breakers on all three demo strategy services, registration of the demo model, and backend health. If the default Production leaderboard is empty, it bootstraps DEMO_LEADERBOARD_MODEL to Production.

Step 01 — Login / Auth

Script: 01-login-auth.sh

An anonymous request to a protected route returns 401. A Cognito M2M token is acquired from the configured user pool, and an authenticated request returns 200. The token is saved to .demo-state/token.
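Both status checks can be expressed with curl's --write-out option. In this sketch, http_status and expect_status are hypothetical helpers, and PROTECTED_URL in the usage note stands in for whichever protected route the script actually calls.

```bash
# Hypothetical helpers for step 01-style status checks.
http_status() {                    # http_status <url> [curl args...]
  local url=$1; shift
  curl -sS -o /dev/null -w '%{http_code}' "$@" "$url"
}

expect_status() {                  # expect_status <code> <url> [curl args...]
  local want=$1 got; shift
  got=$(http_status "$@")
  if [ "$got" = "$want" ]; then
    echo "[PASS] $1 -> $got"
  else
    echo "[FAIL] $1 -> $got (expected $want)" >&2
    return 1
  fi
}
```

For example, run expect_status 401 "$PROTECTED_URL", then expect_status 200 "$PROTECTED_URL" -H "Authorization: Bearer $(cat .demo-state/token)".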

Step 02 — Catalog

Script: 02-catalog.sh

The catalog listing returns an array, the detail endpoint for DEMO_LEADERBOARD_MODEL returns a matching name, and DEMO_MODEL_NAME has at least one registered MLflow version.

Step 03 — Data sync and coverage

Script: 03-data-sync.sh

Reads freshness for DEMO_SYMBOL, posts a sync for the demo date range, and verifies that the architecture-level coverage endpoint returns the requested range.

Step 04 — Submit backtest

Script: 04-submit-backtest.sh

POST /api/v1/backtests returns 201 with a job_id, which is saved to .demo-state/job_id. The backtest then runs in the background on Step Functions / ECS.

Step 05 — Track progress

Script: 05-track-progress.sh

Polls GET /api/v1/backtests/{job_id} until the status is completed. The default timeout is 15 minutes (DEMO_BACKTEST_TIMEOUT_SECONDS=900).
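Here is a minimal sketch of a deadline-bounded poll loop like the one this step performs. It assumes completed and failed are the terminal statuses and that the job JSON carries a top-level .status field. poll_status and backtest_status are hypothetical names.

```bash
# Hypothetical poll loop in the spirit of 05-track-progress.sh.
poll_status() {                    # poll_status <timeout_s> <interval_s> <cmd...>
  local timeout=$1 interval=$2 status=""
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    status=$("$@") || status="error"   # transient errors: keep polling
    case "$status" in
      completed) echo "[PASS] status=completed"; return 0 ;;
      failed)    echo "[FAIL] status=failed" >&2; return 1 ;;
    esac
    echo "status=$status; next check in ${interval}s"
    sleep "$interval"
  done
  echo "[FAIL] timed out after ${timeout}s (last status: ${status:-none})" >&2
  return 1
}

# Example status command (assumes a top-level .status in the job JSON).
backtest_status() {
  curl -sS -H "Authorization: Bearer $(cat .demo-state/token)" \
    "${API_BASE:-https://api-dev.tradai-system.com}/api/v1/backtests/$(cat .demo-state/job_id)" \
    | jq -r '.status'
}
```

Usage: poll_status "$DEMO_BACKTEST_TIMEOUT_SECONDS" "$DEMO_POLL_INTERVAL_SECONDS" backtest_status.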

Step 06 — Results and KPIs

Script: 06-results-kpis.sh

Verifies the metrics/KPI payload, the equity and report-data endpoints, and backtest traceability: a non-null trace_id or mlflow_run_id, plus a real 40-character git_commit.
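The traceability rule can be sketched as a few string checks on values extracted with jq -r, which prints the literal null for JSON null. The function names are hypothetical. The sketch accepts only lowercase hex SHAs. Of the JSON paths in the commented example, only .result.mlflow_run_id appears elsewhere in this guide; the other two are guesses.

```bash
# Hypothetical sketch of step 06's traceability rule.
non_null() { [ -n "$1" ] && [ "$1" != "null" ]; }

is_git_sha() { [[ $1 =~ ^[0-9a-f]{40}$ ]]; }   # full 40-char lowercase hex SHA

check_traceability() {             # check_traceability <trace_id> <mlflow_run_id> <git_commit>
  if ! non_null "$1" && ! non_null "$2"; then
    echo "[FAIL] both trace_id and mlflow_run_id are null" >&2
    return 1
  fi
  if ! is_git_sha "$3"; then
    echo "[FAIL] git_commit '$3' is not a 40-character commit SHA" >&2
    return 1
  fi
  echo "[PASS] backtest is traceable"
}

# Example (the .result.trace_id and .result.git_commit paths are guesses):
# f=.demo-state/backtest.json
# check_traceability "$(jq -r '.result.trace_id' "$f")" \
#   "$(jq -r '.result.mlflow_run_id' "$f")" "$(jq -r '.result.git_commit' "$f")"
```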

Step 07 — Leaderboard

Script: 07-leaderboard.sh

The default Production leaderboard is non-empty, and the configured stage leaderboard has at least one scored entry.

Step 08 — Stage model

Script: 08-stage-model.sh

POST /api/v1/strategies/{name}/stage returns 200 and sets the MLflow staging alias. The state file .demo-state/staged_version is updated.

Step 09 — Start dry-run trading

Script: 09-dry-run-start.sh

POST /api/v1/strategies/{name}/run returns 201. The backend registers a new ECS task definition revision with TRADING_MODE=dry-run, STRATEGY_ID, TRADING_STATE_TABLE, PAIRS, and CONFIG_OVERRIDES, and updates the strategy service to desired=1. The script polls until ECS reports the task as RUNNING, then saves its instance_id to .demo-state/instance_id.
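To check the same condition by hand, the AWS CLI's built-in waiter can block until the task is RUNNING. This sketch assumes the cluster is tradai-dev (as in troubleshooting 5.3) and that instance_id holds a task ID that ECS accepts. The waiter uses its own polling schedule rather than DEMO_DRY_RUN_TIMEOUT_SECONDS. wait_task_running is a hypothetical name.

```bash
# Hypothetical manual check for step 09: block until the dry-run task runs.
wait_task_running() {
  local task_id
  task_id=$(cat "${STATE_DIR:-./.demo-state}/instance_id") || return 1
  # The waiter exits non-zero if the task stops or polling gives up.
  aws ecs wait tasks-running \
    --profile "${AWS_PROFILE:-tradai}" --region "${AWS_REGION:-eu-central-1}" \
    --cluster tradai-dev --tasks "$task_id" \
    && echo "[PASS] task $task_id is RUNNING"
}
```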

Step 10 — Status, PnL, logs

Script: 10-status-pnl-logs.sh

Logs are read either through the backend logs endpoint or, as a documented fallback, directly from CloudWatch. The trading status endpoint returns summary and instances arrays, and the PnL endpoint returns at least one strategy snapshot for the running task.

Step 11 — Promote to Production (alias only, no live trading)

Script: 11-promote-production.sh

Captures the current Production version (for the step 13 rollback). POST /api/v1/strategies/{name}/promote sets the MLflow champion alias and archives the previous champion. Verifies the new version is persisted as Production and the previous one is Archived.

Step 12 — Observability

Script: 12-observability.sh

Verifies a confirmed SNS email subscription and the ECS deployment circuit breakers. It also checks that the CloudWatch dashboard tradai-dev exists and that the log group /ecs/tradai/dev is readable.

Step 13 — Rollback

Script: 13-rollback.sh

Rolls back the model to DEMO_ROLLBACK_TARGET_VERSION if set, otherwise to the Production version captured in step 11. Verifies the target version is persisted as Production after rollback.
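The choice of rollback target can be sketched as a simple precedence rule: the explicit variable wins, then the version captured in step 11, else a failure. The error text mirrors the one discussed in 5.5. rollback_target and STATE_DIR are hypothetical names.

```bash
# Hypothetical sketch of step 13's target choice.
rollback_target() {
  local state_dir="${STATE_DIR:-./.demo-state}"
  if [ -n "${DEMO_ROLLBACK_TARGET_VERSION:-}" ]; then
    echo "$DEMO_ROLLBACK_TARGET_VERSION"          # explicit override
  elif [ -s "$state_dir/previous_production_version" ]; then
    cat "$state_dir/previous_production_version"  # captured in step 11
  else
    echo "[FAIL] No previous Production version captured" >&2
    return 1
  fi
}
```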

Step 14 — Cleanup

Script: 14-cleanup.sh

Stops the dry-run task, scales the strategy service back to zero, and waits for the rollout to reach COMPLETED. It is idempotent, so it is safe to run at any time.
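If you ever need to scale the service down by hand, a rough equivalent uses update-service and the services-stable waiter. This sketch assumes the dev cluster and service names used elsewhere in this guide. services-stable (running count equals desired count and a single deployment remains) is close to, but not the same check as, the rollout-COMPLETED wait in 14-cleanup.sh. scale_down_strategy is a hypothetical name.

```bash
# Hypothetical manual equivalent of 14-cleanup.sh for the dev defaults.
# Setting desired-count 0 on an already-stopped service is harmless.
scale_down_strategy() {
  local cluster=${1:-tradai-dev} service=${2:-strategy-e2eteststrategy}
  local aws_opts=(--profile "${AWS_PROFILE:-tradai}" --region "${AWS_REGION:-eu-central-1}")
  aws ecs update-service "${aws_opts[@]}" \
    --cluster "$cluster" --service "$service" --desired-count 0 >/dev/null || return 1
  # Waits until running == desired (0) and only one deployment remains.
  aws ecs wait services-stable "${aws_opts[@]}" \
    --cluster "$cluster" --services "$service" || return 1
  echo "[PASS] $service scaled to 0"
}
```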


4. Where to find each artifact / report

After (or during) a run, you can inspect everything the demo produced through the API, the AWS Console, and the local .demo-state/ directory.

4.1 Local state files

./.demo-state/ (created in your working directory by lib.sh):

File | Content
--- | ---
token | Cognito M2M access token from step 01
job_id | Backtest job UUID from step 04
backtest.json | Full backtest result (trades, KPIs, equity); refreshed by steps 05/06
model_version | MLflow version targeted by the demo (steps 08/11/13)
staged_version | Version staged in step 08
production_version | Version promoted in step 11
previous_production_version | Production version captured before step 11 (target for step 13)
instance_id | ECS task ID launched in step 09
alb_arn | Discovered dev ALB ARN

Summarise backtest.json:

jq '{job_id, status, result: {trades: .result.total_trades, sharpe: .result.metrics.sharpe, drawdown_pct: .result.max_drawdown_pct}}' .demo-state/backtest.json

4.2 Backtest results via API

Replace <job_id> with $(cat .demo-state/job_id):

Endpoint | Returns
--- | ---
GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id> | Job + full result (metrics, trades, traceability)
GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id>/equity | Equity curve (time series)
GET https://api-dev.tradai-system.com/api/v1/backtests/<job_id>/report-data | Detailed trade list + per-trade PnL

You need a Bearer token; reuse .demo-state/token or run 01-login-auth.sh again.

Example:

TOKEN=$(cat .demo-state/token)
JOB=$(cat .demo-state/job_id)
curl -sS -H "Authorization: Bearer $TOKEN" \
  "https://api-dev.tradai-system.com/api/v1/backtests/$JOB" \
  | jq '.result.metrics'

4.3 MLflow runs and registry

MLflow is internal; there is no public web URL. Access it through the backend proxy or the platform team's port-forward:

Endpoint | Returns
--- | ---
GET /api/v1/models/{name}/versions?include_archived=true | Every version with alias-driven current_stage
GET /api/v1/catalog/strategies/{name} | Strategy summary (latest version, stage, tags, KPIs)
GET /api/v1/catalog/leaderboard | Ranked Production strategies
Backtest result field .result.mlflow_run_id | MLflow run ID for that backtest

If you have the platform team's MLflow access: https://mlflow-internal.tradai-system.com/#/experiments/<exp_id>/runs/<run_id> (ask in #tradai-platform for the exact internal URL).

4.4 Live trading state (CloudWatch + DynamoDB)

For the dry-run instance launched in step 09:

CloudWatch logs (replace <instance_id> with $(cat .demo-state/instance_id)):

  • Console: Log group /ecs/tradai/dev
  • Stream name: strategy/strategy/<instance_id>
  • CLI:
    aws logs get-log-events --profile tradai --region eu-central-1 \
      --log-group-name /ecs/tradai/dev \
      --log-stream-name "strategy/strategy/$(cat .demo-state/instance_id)" \
      --limit 50
    
    (Git Bash users: prefix the command with MSYS_NO_PATHCONV=1 so the leading / in the log group name is not rewritten into a Windows path; see 5.6.)

DynamoDB state (heartbeat, status, PnL snapshot, trades):

  • Console: Table tradai-trading-state-dev
  • CLI:
    aws dynamodb get-item --profile tradai --region eu-central-1 \
      --table-name tradai-trading-state-dev \
      --key '{"strategy_id":{"S":"E2ETestStrategy"}}'
    

ECS service: strategy-e2eteststrategy

4.5 Step Functions execution (backtest workflow)

  • Console: State machines in eu-central-1 — the backtest workflow shows every step (NormalizeInput → ValidateStrategy → EnsureData → RunBacktest → RegisterModel → UpdateStatus) and links to the per-step CloudWatch logs.

4.6 Observability

4.7 S3 artifacts

Bucket | Holds
--- | ---
tradai-arcticdb-dev | ArcticDB market data
tradai-mlflow-dev | MLflow run artifacts (model files, configs)
tradai-results-dev | Backtest result JSONs, organised by job_id/trace_id
tradai-configs-dev | Strategy configs

4.8 Strategy / backend container logs

The backend and strategy-service containers also log to /ecs/tradai/dev, on streams prefixed with their container name. To query them in CloudWatch Logs Insights:

fields @timestamp, @message
| filter @logStream like /strategy/
| sort @timestamp desc
| limit 200
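The same query can be run from the CLI with start-query and get-query-results. This sketch covers the last hour and polls until the query leaves the Scheduled/Running states. The one-hour window and the name run_insights_query are assumptions. MSYS_NO_PATHCONV=1 keeps Git Bash from rewriting the group name.

```bash
# Hypothetical CLI equivalent of the Logs Insights query above.
run_insights_query() {
  local group=${1:-/ecs/tradai/dev} now qid status
  local query='fields @timestamp, @message | filter @logStream like /strategy/ | sort @timestamp desc | limit 200'
  local aws_opts=(--profile "${AWS_PROFILE:-tradai}" --region "${AWS_REGION:-eu-central-1}")
  now=$(date +%s)
  qid=$(MSYS_NO_PATHCONV=1 aws logs start-query "${aws_opts[@]}" \
    --log-group-name "$group" --start-time $(( now - 3600 )) --end-time "$now" \
    --query-string "$query" --query queryId --output text) || return 1
  # Poll until the query finishes; any status other than Complete is a failure.
  while :; do
    status=$(aws logs get-query-results "${aws_opts[@]}" --query-id "$qid" \
      --query status --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 2 ;;
      Complete) break ;;
      *) echo "[FAIL] query $qid ended as $status" >&2; return 1 ;;
    esac
  done
  aws logs get-query-results "${aws_opts[@]}" --query-id "$qid" --output json
}
```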

5. Troubleshooting

5.1 aws sso login keeps asking on every run

You probably ran the preflight check after an interactive login but inside a different shell. Verify the cache is shared:

ls ~/.aws/sso/cache
AWS_PROFILE=tradai aws sts get-caller-identity

If get-caller-identity works with the AWS_PROFILE environment variable but not with the --profile flag, you are hitting a known Windows quirk. Use the environment-variable form (run-all.sh already does).

5.2 Step 04 / 05 times out

Backtests can be slow on first invocation when warm-up data is missing. Bump the timeout, exporting it so the polling in step 05 also sees it:

export DEMO_BACKTEST_TIMEOUT_SECONDS=1800
bash scripts/demo/04-submit-backtest.sh
bash scripts/demo/05-track-progress.sh

If the job stays in a running state without completing, check its Step Functions execution in the AWS Console for the actual failure.

5.3 Step 09 dry-run never reaches RUNNING

Look at the most recent ECS task:

aws ecs list-tasks --profile tradai --region eu-central-1 \
  --cluster tradai-dev --service-name strategy-e2eteststrategy \
  --desired-status STOPPED | jq '.taskArns'

Then run aws ecs describe-tasks ... on a stopped task. Its stoppedReason usually identifies the issue (image pull failure, missing environment variable, etc.).
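The two commands can be combined into one helper that prints taskArn and stoppedReason for each stopped task. This sketch assumes the dev cluster and service defaults above. show_stopped_reasons is a hypothetical name.

```bash
# Hypothetical helper for 5.3.
show_stopped_reasons() {
  local cluster=${1:-tradai-dev} service=${2:-strategy-e2eteststrategy} arns
  local aws_opts=(--profile "${AWS_PROFILE:-tradai}" --region "${AWS_REGION:-eu-central-1}")
  arns=$(aws ecs list-tasks "${aws_opts[@]}" --cluster "$cluster" \
    --service-name "$service" --desired-status STOPPED \
    --query 'taskArns' --output text) || return 1
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks found for $service"
    return 0
  fi
  # Unquoted $arns is deliberate: text output separates ARNs with tabs.
  # shellcheck disable=SC2086
  aws ecs describe-tasks "${aws_opts[@]}" --cluster "$cluster" --tasks $arns \
    --query 'tasks[].[taskArn,stoppedReason]' --output text
}
```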

5.4 Step 10 PnL endpoint returns strategies: []

This is fixed in main. If you see it on a deployed environment, that deployment is behind main and is missing at least one of:

  • PR #434 (HealthReporter initial zero-PnL snapshot + ensure_exists on the state row)
  • PR #435 (backend injects TRADING_STATE_TABLE into the strategy runtime env so state management is not skipped)

Re-deploy backend + strategy-service from main.

5.5 Step 13 fails with "No previous Production version captured"

The demo cannot capture a previous Production version when the version just staged in step 08 was itself the previous Production version. This happens on a "clean" registry where the latest non-archived version is the one a prior run just rolled back to.

Workaround (until gap A in #437 lands): register a fresh model version pointing at the most recent backtest's MLflow run, then re-run the demo:

TOKEN=$(cat .demo-state/token)
RUN_ID=$(jq -r '.result.mlflow_run_id' .demo-state/backtest.json)
curl -sS -X POST -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d "{\"backtest_run_id\":\"$RUN_ID\",\"docker_image_uri\":\"600802701449.dkr.ecr.eu-central-1.amazonaws.com/tradai/e2eteststrategy:latest\",\"skip_validation\":true,\"strategy_version\":\"0.1.bump\",\"description\":\"manual bootstrap\"}" \
     "https://api-dev.tradai-system.com/api/v1/strategies/E2ETestStrategy/register"
bash scripts/demo/run-all.sh

The real fix is for the backtest workflow to register a new model version on every successful run (tracked in #437).

5.6 MSYS_NO_PATHCONV=1 everywhere on Windows

Git Bash on Windows rewrites /ecs/tradai/dev to C:/Program Files/Git/ecs/..., which breaks the AWS CLI call. The demo scripts that touch log groups already prefix their commands with MSYS_NO_PATHCONV=1. If you call the AWS CLI by hand from Git Bash, do the same.


6. Known limitations

The full set of known gaps and follow-ups is tracked in #437. The most relevant for someone running the demo today:

  • The demo is not idempotent without manual model-version registration. After a complete promote+rollback cycle, the registry's only non-archived version is the previous champion. The next run's stage step then clears that champion alias, and step 13 has nothing to roll back to. The fix is tracked as gap A in #437.
  • Live trading is intentionally out of scope for run-all.sh. Live mode coverage is gated on #418/#419/#421.
  • X-Ray service map is not exercised (deferred per #415).

For the demo flow that is in scope (dry-run paper trading, promote/rollback as alias-only operations, full observability checks), the running demo passes 14/14 against AWS dev when the registry has a fresh non-archived version to stage. The PRs that made this end-to-end flow possible (#434, #435, #436) are all merged into main.


7. Sources