TradAI Live Trading Architecture¶

Version: 9.1 (Live Trading Integration)
Date: 2025-12-09
Status: READY FOR IMPLEMENTATION
Depends On: Architecture v9.2.1 (Backtesting Foundation)

Status: Not Yet Implemented

This document describes the target architecture. Implementation is tracked in reports/implementation-tasks/.

Executive Summary¶

This document extends the TradAI architecture v8.1 to support live trading. The design follows six key principles:

Unified Strategy Container - Single container per strategy handles all modes (backtest, hyperopt, dry-run, live)
MLflow-Centric Configuration - Strategy config loaded from MLflow tags + S3, NOT hardcoded env vars
MLflow Model Lifecycle - FreqAI trains during backtest, MLflow serves models for live trading
Direct ArcticDB Access - Live trading reads S3 directly, decoupled from data-collection service
Self-Managed Sessions - Strategy containers manage their own DynamoDB state
ECS Services for Live - Always-on services (NOT Spot) for production trading

Cost Impact¶

Component	Monthly Cost
Base Platform (v8.1)	$78-99
Per Live Strategy	+$46
Total (3 strategies)	$216-237
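
The total row follows from adding the per-strategy cost to each end of the base range. A minimal sketch of that arithmetic, using only the figures in the table (the function name is illustrative, not part of TradAI):

```python
# Figures taken from the cost table above.
BASE_PLATFORM_USD = (78, 99)   # monthly range for the base platform
PER_LIVE_STRATEGY_USD = 46     # monthly cost of one live strategy


def monthly_cost_range(live_strategies: int) -> tuple[int, int]:
    """Return the (low, high) monthly cost in USD for N live strategies."""
    if live_strategies < 0:
        raise ValueError("live_strategies must be non-negative")
    extra = live_strategies * PER_LIVE_STRATEGY_USD
    low, high = BASE_PLATFORM_USD
    return (low + extra, high + extra)


print(monthly_cost_range(3))  # (216, 237), matching the total row
```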

System Architecture v9.0¶

┌─────────────────────────────────────────────────────────────┐
│                      AWS Cloud                               │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                    Public Subnet                        ││
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ││
│  │  │   API GW     │  │     ALB      │  │ NAT Instance │  ││
│  │  │  + Cognito   │  │              │  │              │  ││
│  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  ││
│  └─────────┼─────────────────┼─────────────────┼──────────┘│
│            │                 │                 │           │
│  ┌─────────┼─────────────────┼─────────────────┼──────────┐│
│  │         │        Private Subnet             │          ││
│  │         ▼                 ▼                 ▼          ││
│  │  ┌─────────────────────────────────────────────────┐   ││
│  │  │              ECS Fargate Cluster                │   ││
│  │  │  ┌─────────────────────────────────────────┐    │   ││
│  │  │  │         Backend Service (Always-on)     │    │   ││
│  │  │  │  • REST API endpoints                   │    │   ││
│  │  │  │  • Strategy lifecycle management        │    │   ││
│  │  │  │  • Session control (start/stop/pause)   │    │   ││
│  │  │  └─────────────────────────────────────────┘    │   ││
│  │  │                                                  │   ││
│  │  │  ┌─────────────────────────────────────────┐    │   ││
│  │  │  │       Strategy Service (On-demand)      │    │   ││
│  │  │  │  • Backtest orchestration               │    │   ││
│  │  │  │  • Config management                    │    │   ││
│  │  │  │  • MLflow logging                       │    │   ││
│  │  │  └─────────────────────────────────────────┘    │   ││
│  │  │                                                  │   ││
│  │  │  ┌─────────────────────────────────────────┐    │   ││
│  │  │  │  Strategy Containers (Per Strategy)     │    │   ││
│  │  │  │  ┌───────────────┐ ┌───────────────┐    │    │   ││
│  │  │  │  │ pascal:2.0.0  │ │ momentum:1.0  │    │    │   ││
│  │  │  │  │ LIVE MODE     │ │ DRY-RUN MODE  │    │    │   ││
│  │  │  │  │ ECS Service   │ │ ECS Service   │    │    │   ││
│  │  │  │  └───────────────┘ └───────────────┘    │    │   ││
│  │  │  │  ┌───────────────┐ ┌───────────────┐    │    │   ││
│  │  │  │  │ ml-trend:3.1  │ │ Backtest Task │    │    │   ││
│  │  │  │  │ LIVE MODE     │ │ BACKTEST MODE │    │    │   ││
│  │  │  │  │ ECS Service   │ │ ECS Task/Spot │    │    │   ││
│  │  │  │  └───────────────┘ └───────────────┘    │    │   ││
│  │  │  └─────────────────────────────────────────┘    │   ││
│  │  │                                                  │   ││
│  │  │  ┌─────────────────────────────────────────┐    │   ││
│  │  │  │    Data Collection (Scheduled Tasks)    │    │   ││
│  │  │  │  • Historical data fetch                │    │   ││
│  │  │  │  • ArcticDB write                       │    │   ││
│  │  │  └─────────────────────────────────────────┘    │   ││
│  │  │                                                  │   ││
│  │  │  ┌─────────────────────────────────────────┐    │   ││
│  │  │  │         MLflow (Always-on)              │    │   ││
│  │  │  │  • Strategy versioning                  │    │   ││
│  │  │  │  • ML model registry                    │    │   ││
│  │  │  │  • Experiment tracking                  │    │   ││
│  │  │  └─────────────────────────────────────────┘    │   ││
│  │  └────────────────────────────────────────────────┘    ││
│  └────────────────────────────────────────────────────────┘│
│                                                            │
│  ┌────────────────────────────────────────────────────────┐│
│  │                    Data Layer                          ││
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  ││
│  │  │ ArcticDB │ │ DynamoDB │ │    S3    │ │   RDS    │  ││
│  │  │ (S3)     │ │ State    │ │ Configs  │ │ Postgres │  ││
│  │  │          │ │ + Live   │ │ Results  │ │          │  ││
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  ││
│  └────────────────────────────────────────────────────────┘│
│                                                            │
│  ┌────────────────────────────────────────────────────────┐│
│  │              Monitoring & Events                       ││
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  ││
│  │  │EventBrdge│ │ Lambda   │ │CloudWatch│ │   SNS    │  ││
│  │  │ Rules    │ │ Health   │ │ Alarms   │ │ Alerts   │  ││
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  ││
│  └────────────────────────────────────────────────────────┘│
└────────────────────────────────────────────────────────────┘

External:
┌──────────────┐  ┌──────────────┐
│   Binance    │  │    CCXT      │
│   Exchange   │  │   (Futures)  │
└──────────────┘  └──────────────┘

Unified Strategy Container¶

Container Structure¶

Each strategy is packaged as a single Docker image that supports all trading modes:

tradai-{strategy-name}:{version}
├── Freqtrade runtime
├── Strategy code (IStrategy implementation)
├── FreqAI models (for ML strategies)
├── ArcticDB warmup loader
├── DynamoDB state manager
├── Health reporter
└── Mode configuration

Trading Modes¶

Mode	Description	Launch Type	Spot OK?
`backtest`	Historical simulation	ECS Task	Yes
`hyperopt`	Parameter optimization	ECS Task	Yes
`dry-run`	Paper trading (no real orders)	ECS Service	No
`live`	Production trading	ECS Service	No
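
The table can be read as a simple rule: batch modes run as one-off ECS Tasks and tolerate Spot interruption, while session modes run as ECS Services and do not. A sketch of that mapping, assuming a hypothetical `launch_plan` helper (the enum and function names are not from the TradAI codebase):

```python
from enum import Enum


class TradingMode(str, Enum):
    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"


# Batch modes run to completion; session modes must stay up.
_BATCH_MODES = {TradingMode.BACKTEST, TradingMode.HYPEROPT}


def launch_plan(mode: str) -> dict:
    """Return the launch type and Spot eligibility for a trading mode."""
    parsed = TradingMode(mode)  # raises ValueError for modes not in the table
    is_batch = parsed in _BATCH_MODES
    return {
        "launch_type": "ECS Task" if is_batch else "ECS Service",
        "spot_ok": is_batch,
    }


print(launch_plan("live"))      # {'launch_type': 'ECS Service', 'spot_ok': False}
print(launch_plan("backtest"))  # {'launch_type': 'ECS Task', 'spot_ok': True}
```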

Environment Variables (Minimal)¶

The container uses a minimal set of environment variables, just enough to locate the strategy in MLflow. All other configuration is loaded from MLflow tags and S3.

# REQUIRED: Identify strategy and trading state
# Issue 10 Fix: Changed from STRATEGY_NAME/STRATEGY_STAGE to match actual code
MLFLOW_TRACKING_URI: "http://mlflow.internal:5000"
STRATEGY: "PascalStrategy"  # Strategy class name (FREQTRADE_STRATEGY as fallback)
STRATEGY_ID: "pascal-btc"   # Unique strategy identifier for trading state

# REQUIRED: Trading mode
TRADING_MODE: "live"  # backtest | hyperopt | dry-run | live | train

# REQUIRED: Secrets (must stay in Secrets Manager)
EXCHANGE_SECRET_NAME: "tradai/exchange/binance"  # AWS Secrets Manager secret

# INFRASTRUCTURE (rarely changes)
DYNAMODB_TABLE: "tradai-workflow-state"
ARCTICDB_S3_URI: "s3://tradai-arcticdb-prod"

Why minimal env vars?

- Single source of truth: Config lives in MLflow, not scattered across deployments
- Version tracking: Config changes are tied to strategy versions
- Audit trail: MLflow tracks who changed what when
- Environment parity: Same container image, different config via MLflow stage
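
One way to keep the bootstrap surface small is to read and validate these variables in a single place at startup. The sketch below is an assumption about how that could look; `BootstrapEnv` and `read_bootstrap_env` are hypothetical names, and treating every variable as mandatory is a choice made here, not something the design states.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Modes listed in the TRADING_MODE comment above.
VALID_MODES = {"backtest", "hyperopt", "dry-run", "live", "train"}


@dataclass(frozen=True)
class BootstrapEnv:
    """Hypothetical holder for the bootstrap variables (not a TradAI class)."""
    mlflow_tracking_uri: str
    strategy: str
    strategy_id: str
    trading_mode: str
    exchange_secret_name: str
    dynamodb_table: str
    arcticdb_s3_uri: str


def read_bootstrap_env(env: Mapping[str, str] = os.environ) -> BootstrapEnv:
    """Validate the minimal env vars and fail fast if one is missing."""
    # STRATEGY takes precedence; FREQTRADE_STRATEGY is the documented fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    if not strategy:
        raise KeyError("STRATEGY (or FREQTRADE_STRATEGY) must be set")
    mode = env["TRADING_MODE"]
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported TRADING_MODE: {mode!r}")
    return BootstrapEnv(
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
        strategy=strategy,
        strategy_id=env["STRATEGY_ID"],
        trading_mode=mode,
        exchange_secret_name=env["EXCHANGE_SECRET_NAME"],
        dynamodb_table=env["DYNAMODB_TABLE"],
        arcticdb_s3_uri=env["ARCTICDB_S3_URI"],
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the process environment.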

MLflow-Centric Configuration¶

Design Principle¶

Strategy configuration flows from MLflow, NOT from environment variables:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Config Loading Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Container starts with minimal env vars:                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  STRATEGY="PascalStrategy"                                           │    │
│  │  STRATEGY_ID="pascal-btc"                                            │    │
│  │  MLFLOW_TRACKING_URI="http://mlflow.internal:5000"                   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Step 1: Query MLflow for model version                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  MLflowAdapter.get_model_version(name, stage)                        │    │
│  │  → Returns ModelVersion with tags:                                   │    │
│  │    • strategy_version: "2.0.0"                                       │    │
│  │    • timeframe: "1h"                                                 │    │
│  │    • configuration_file: "strategies/pascal/config.json"             │    │
│  │    • warmup_days: "30"                                               │    │
│  │    • pairs: "BTC/USDT:USDT,ETH/USDT:USDT"                           │    │
│  │    • category: "mean_reversion"                                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Step 2: Load full config from S3 (via tag reference)                        │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ConfigMergeService.load_config(config_s3_path)                      │    │
│  │  → Returns full Freqtrade config with all parameters                 │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Step 3: Apply runtime overrides (optional)                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ConfigMergeService.apply_overrides(base, overrides)                 │    │
│  │  → OmegaConf deep merge for any session-specific changes             │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Final: Validated strategy config ready for Freqtrade                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Integration with Existing Patterns¶

This design leverages existing TradAI components:

Component	File	Purpose
`StrategyMetadata`	`libs/tradai-strategy/src/tradai/strategy/metadata.py`	Defines metadata schema with `to_mlflow_tags()`
`MLflowAdapter`	`libs/tradai-common/src/tradai/common/mlflow_adapter.py`	REST API wrapper for MLflow
`ConfigMergeService`	`libs/tradai-common/src/tradai/common/config/merge.py`	S3 config loading + OmegaConf merge
`S3ConfigRepository`	`libs/tradai-common/src/tradai/common/aws/s3_config_repository.py`	S3 config storage

Config Loading Implementation¶

from tradai.common.mlflow_adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService  # module path per libs/tradai-common/src/tradai/common/config/merge.py
from tradai.common.aws.s3_config_repository import S3ConfigRepository

class StrategyConfigLoader:
    """Loads strategy configuration from MLflow and S3."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        config_service: ConfigMergeService,
    ):
        self.mlflow = mlflow_adapter
        self.config_service = config_service

    def load_config(
        self,
        strategy_name: str,
        stage: str = "Production",
        runtime_overrides: dict | None = None,
    ) -> dict:
        """Load complete strategy config from MLflow + S3.

        Args:
            strategy_name: Name of strategy in MLflow registry
            stage: MLflow stage (Production, Staging, None)
            runtime_overrides: Optional overrides for this session

        Returns:
            Complete validated strategy configuration
        """
        # 1. Get model version from MLflow (includes all tags)
        model_version = self.mlflow.get_model_version(
            name=strategy_name,
            stage=stage,
        )

        # 2. Extract config metadata from tags
        tags = {t["key"]: t["value"] for t in model_version.tags}
        config_s3_path = tags["configuration_file"]

        # 3. Build strategy metadata from tags
        strategy_config = {
            "strategy_name": strategy_name,
            "strategy_version": tags["strategy_version"],
            "timeframe": tags["timeframe"],
            "warmup_days": int(tags.get("warmup_days", "30")),
            "pairs": tags["pairs"].split(","),
            "can_short": tags.get("can_short", "false").lower() == "true",
        }

        # 4. Load full Freqtrade config from S3
        freqtrade_config = self.config_service.load_config(config_s3_path)

        # 5. Merge strategy metadata into config
        full_config = {
            **freqtrade_config,
            "tradai": strategy_config,
        }

        # 6. Apply runtime overrides if provided
        if runtime_overrides:
            full_config = self.config_service.apply_overrides(
                base=full_config,
                overrides=runtime_overrides,
            )

        return full_config
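
To make the override step concrete, the sketch below shows deep-merge semantics with plain dictionaries. It is a stand-in for illustration only: the design uses OmegaConf inside `ConfigMergeService.apply_overrides`, whose exact behavior is not shown here, and `deep_merge` is a hypothetical helper. In this sketch nested dicts are merged key by key, while scalars and lists are replaced wholesale.

```python
from copy import deepcopy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict where overrides win; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the base value entirely.
            merged[key] = deepcopy(value)
    return merged


base = {
    "stake_amount": 100,
    "exchange": {
        "name": "binance",
        "pair_whitelist": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
    },
}
session_overrides = {
    "max_open_trades": 1,
    "exchange": {"pair_whitelist": ["BTC/USDT:USDT"]},
}

merged = deep_merge(base, session_overrides)
print(merged["exchange"]["name"])            # binance (kept from base)
print(merged["exchange"]["pair_whitelist"])  # ['BTC/USDT:USDT'] (replaced)
```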

MLflow Tag Schema (StrategyMetadata.to_mlflow_tags())¶

When a strategy is registered, StrategyMetadata.to_mlflow_tags() stores:

# Core identity (from StrategyMetadata)
strategy_name: "PascalStrategy"
strategy_version: "2.0.0"
timeframe: "1h"
category: "mean_reversion"
author: "TradAI Team"
status: "production"  # testing | staging | production

# Trading config
can_short: "true"
pairs: "BTC/USDT:USDT,ETH/USDT:USDT"
stake_amount: "100"
max_open_trades: "3"

# Data requirements
warmup_days: "30"
data_format: "ohlcv"

# Config file reference (S3 path)
configuration_file: "strategies/pascal/v2.0.0/config.json"

# ML model reference (for FreqAI strategies)
freqai_model: "models:/PascalStrategy-FreqAI/Production"

# Docker image reference
ecr_url: "123456789.dkr.ecr.us-east-1.amazonaws.com/tradai-pascal:2.0.0"
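
Because MLflow tags are string key/value pairs, typed fields have to be flattened on the way in and parsed on the way out. The following is a minimal sketch of that flattening for a subset of the fields above. `StrategyMetadataSketch` is a hypothetical stand-in, not the real `StrategyMetadata` class, whose fields and method body are not shown in this document.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyMetadataSketch:
    """Illustrative subset of the tag schema; not the real StrategyMetadata."""
    strategy_name: str
    strategy_version: str
    timeframe: str
    can_short: bool = False
    pairs: list[str] = field(default_factory=list)
    warmup_days: int = 30

    def to_mlflow_tags(self) -> dict[str, str]:
        """Flatten every field to a string so it can be stored as a tag."""
        return {
            "strategy_name": self.strategy_name,
            "strategy_version": self.strategy_version,
            "timeframe": self.timeframe,
            "can_short": str(self.can_short).lower(),  # "true" / "false"
            "pairs": ",".join(self.pairs),             # comma-separated list
            "warmup_days": str(self.warmup_days),
        }


meta = StrategyMetadataSketch(
    strategy_name="PascalStrategy",
    strategy_version="2.0.0",
    timeframe="1h",
    can_short=True,
    pairs=["BTC/USDT:USDT", "ETH/USDT:USDT"],
)
tags = meta.to_mlflow_tags()
print(tags["pairs"])      # BTC/USDT:USDT,ETH/USDT:USDT
print(tags["can_short"])  # true
```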

Container Initialization Flow¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Container Startup Sequence (MLflow-First)                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Bootstrap (from minimal env vars)                                        │
│     ├── Read STRATEGY (or FREQTRADE_STRATEGY), STRATEGY_ID from environment  │
│     ├── Read TRADING_MODE from environment                                   │
│     └── Initialize MLflowAdapter with MLFLOW_TRACKING_URI                    │
│                                                                              │
│  2. Load Config from MLflow + S3                                             │
│     ├── MLflowAdapter.get_model_version(name, stage)                         │
│     │   └── Returns ModelVersion with all metadata tags                      │
│     ├── Extract config_s3_path from tags["configuration_file"]               │
│     ├── ConfigMergeService.load_config(config_s3_path)                       │
│     │   └── Returns full Freqtrade config from S3                            │
│     └── Apply runtime overrides (from DynamoDB session if resuming)          │
│                                                                              │
│  3. Initialize State (DynamoDB)                                              │
│     ├── Create session record with config snapshot                           │
│     ├── Set status = "initializing"                                          │
│     └── Record strategy_version, config_hash, start_timestamp                │
│                                                                              │
│  4. Load ML Models (if FreqAI strategy)                                      │
│     ├── Extract freqai_model URI from tags                                   │
│     ├── MLflowAdapter.get_model_version(freqai_model)                        │
│     └── Load model artifacts for inference (NO retraining)                   │
│                                                                              │
│  5. Warmup Data (ArcticDB)                                                   │
│     ├── Extract warmup_days, pairs, timeframe from config                    │
│     ├── Connect to S3-backed ArcticDB                                        │
│     └── Load historical data → populate Freqtrade dataframes                 │
│                                                                              │
│  6. Start Freqtrade                                                          │
│     ├── Initialize exchange connection (credentials from Secrets Manager)    │
│     ├── Start strategy loop                                                  │
│     └── Update DynamoDB status = "running"                                   │
│                                                                              │
│  7. Health Reporting Loop (async)                                            │
│     ├── Every 30 seconds: update DynamoDB last_heartbeat                     │
│     ├── Report metrics to CloudWatch (PnL, trades, latency)                  │
│     └── Check exchange connectivity                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
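
Step 3 records a config_hash alongside the session. The design does not say how the hash is computed; one plausible approach, sketched below as an assumption, is a SHA-256 digest over a canonical JSON serialization so that key order does not affect the result.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a JSON-serializable config; key order does not change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"timeframe": "1h", "tradai": {"warmup_days": 30, "pairs": ["BTC/USDT:USDT"]}}
b = {"tradai": {"pairs": ["BTC/USDT:USDT"], "warmup_days": 30}, "timeframe": "1h"}

print(config_hash(a) == config_hash(b))  # True: same content, different key order
```

Storing the digest in the session record makes it cheap to detect whether a resumed session is running with the same configuration it started with.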

Startup Code Example¶

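import logging
import os

from tradai.common.mlflow_adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService

# StrategyConfigLoader is the class shown under "Config Loading Implementation".
logger = logging.getLogger(__name__)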

def main():
    # 1. Bootstrap from minimal env vars
    strategy_name = os.environ.get("STRATEGY") or os.environ["FREQTRADE_STRATEGY"]
    strategy_id = os.environ.get("STRATEGY_ID", "")
    trading_mode = os.environ["TRADING_MODE"]
    mlflow_uri = os.environ["MLFLOW_TRACKING_URI"]

    # 2. Initialize adapters
    mlflow = MLflowAdapter(base_url=mlflow_uri)
    config_service = ConfigMergeService(bucket="tradai-configs")
    config_loader = StrategyConfigLoader(mlflow, config_service)

    # 3. Load complete config from MLflow + S3
    config = config_loader.load_config(
        strategy_name=strategy_name,
    )

    # 4. Extract runtime parameters (all from config, NOT env vars)
    warmup_days = config["tradai"]["warmup_days"]
    pairs = config["tradai"]["pairs"]
    timeframe = config["tradai"]["timeframe"]

    # 5. Initialize state, warmup data, start trading...
    logger.info(f"Starting {strategy_name} v{config['tradai']['strategy_version']}")
    logger.info(f"Mode: {trading_mode}, Pairs: {pairs}, Timeframe: {timeframe}")

MLflow Model Lifecycle¶

Training Phase (Backtest Mode)¶

During backtesting with FreqAI strategies:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         FreqAI Training Flow                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Backtest Run                                                                │
│       │                                                                      │
│       ▼                                                                      │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                    │
│  │  FreqAI     │────▶│   Train     │────▶│   Save to   │                    │
│  │  Strategy   │     │   Models    │     │   MLflow    │                    │
│  └─────────────┘     └─────────────┘     └──────┬──────┘                    │
│                                                  │                           │
│                                                  ▼                           │
│                                          ┌─────────────┐                     │
│                                          │   MLflow    │                     │
│                                          │  Registry   │                     │
│                                          │             │                     │
│                                          │ • Staging   │                     │
│                                          │ • Prod      │                     │
│                                          └─────────────┘                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Inference Phase (Live Mode)¶

During live trading:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         MLflow Inference Flow                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Container Start                                                             │
│       │                                                                      │
│       ▼                                                                      │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                    │
│  │   Load      │────▶│   MLflow    │────▶│  Cached     │                    │
│  │   Model     │     │   Client    │     │  Model      │                    │
│  └─────────────┘     └─────────────┘     └──────┬──────┘                    │
│                                                  │                           │
│                                                  ▼                           │
│                                          ┌─────────────┐                     │
│                                          │  FreqAI     │                     │
│                                          │  Predict    │                     │
│                                          │  (No Train) │                     │
│                                          └─────────────┘                     │
│                                                                              │
│  Key: FreqAI does NOT retrain during live trading.                          │
│       Models are loaded once from MLflow and used for inference only.       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
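
The "load once, then predict" rule can be illustrated with a small cache in front of the model loader. This is a sketch under assumptions: `CachedModelProvider` is a hypothetical name, and the loader is passed in as a callable so the example does not depend on any particular MLflow client call.

```python
from typing import Any, Callable


class CachedModelProvider:
    """Loads each model URI at most once and serves it from memory afterwards."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._cache: dict[str, Any] = {}

    def get(self, model_uri: str) -> Any:
        if model_uri not in self._cache:
            # First request: fetch the artifacts through the injected loader.
            self._cache[model_uri] = self._loader(model_uri)
        return self._cache[model_uri]


# Usage with a fake loader that counts how often it is called.
calls: list[str] = []


def fake_loader(uri: str) -> dict:
    calls.append(uri)
    return {"uri": uri}


provider = CachedModelProvider(fake_loader)
first = provider.get("models:/PascalStrategy-FreqAI/Production")
second = provider.get("models:/PascalStrategy-FreqAI/Production")
print(len(calls))  # 1: the second call was served from the cache
```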

Model Registry Schema¶

MLflow Model Registry
├── PascalStrategy
│   ├── Version 1 (Staging) - trained 2025-12-01
│   ├── Version 2 (Production) - trained 2025-12-05
│   └── Version 3 (None) - training 2025-12-08
├── MomentumStrategy
│   └── Version 1 (Production) - trained 2025-12-03
└── MLTrendStrategy
    ├── Version 1 (Archived) - trained 2025-11-15
    └── Version 2 (Production) - trained 2025-12-06

Unified DynamoDB State¶

Table Schema¶

Single table tradai-workflow-state handles both backtest workflows AND live trading sessions:

Table: tradai-workflow-state
  Partition Key: workflow_id (String)
  Sort Key: None

  # Backtest Workflow Record
  {
    "workflow_id": "backtest-pascal-20251208-abc123",
    "workflow_type": "backtest",
    "strategy_name": "PascalStrategy",
    "strategy_version": "2.0.0",
    "status": "completed",
    "started_at": "2025-12-08T10:00:00Z",
    "completed_at": "2025-12-08T10:15:00Z",
    "config_s3_path": "s3://tradai-configs/backtest/...",
    "result_s3_path": "s3://tradai-results/backtest/...",
    "ttl": 1767830400  # 2026-01-08T00:00:00Z, about 30 days after completion
  }

  # Live Trading Session Record
  {
    "workflow_id": "live-pascal-20251208-001",
    "workflow_type": "live",
    "strategy_name": "PascalStrategy",
    "strategy_version": "2.0.0",
    "status": "running",
    "trading_mode": "live",
    "started_at": "2025-12-08T00:00:00Z",
    "last_heartbeat": "2025-12-08T14:30:00Z",
    "exchange": "binance",
    "pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
    "metrics": {
      "total_trades": 45,
      "win_rate": 0.62,
      "pnl_today": 125.50,
      "pnl_total": 1250.00
    },
    "health": {
      "cpu_percent": 35,
      "memory_mb": 512,
      "exchange_connected": true,
      "last_trade_at": "2025-12-08T14:25:00Z"
    }
    # No TTL for live sessions - persist until explicitly deleted
  }
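
DynamoDB TTL attributes hold an expiry time in epoch seconds. A minimal sketch of deriving that value from a record's completed_at timestamp, assuming a 30-day retention for backtest records (`ttl_epoch` is a hypothetical helper):

```python
from datetime import datetime, timedelta


def ttl_epoch(completed_at: str, retention_days: int = 30) -> int:
    """Return the expiry time as epoch seconds for a UTC ISO-8601 timestamp."""
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    completed = datetime.fromisoformat(completed_at.replace("Z", "+00:00"))
    return int((completed + timedelta(days=retention_days)).timestamp())


print(ttl_epoch("2025-12-08T10:15:00Z"))  # 1767780900 (2026-01-07T10:15:00Z)
```

Live session records simply omit the attribute, so DynamoDB never expires them.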

GSI for Status Queries¶

GSI: status-index
  Partition Key: status (String)
  Sort Key: started_at (String)

  # Query: Get all running live sessions
  # Key condition (index partition key): status = "running"
  # Filter (workflow_type is not an index key): workflow_type = "live"
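
As a sketch of how this query could be expressed, the function below builds keyword arguments suitable for a boto3 `Table.query(**kwargs)` call. The helper name is hypothetical. Two details are worth noting: `status` is a DynamoDB reserved word, so it is aliased through ExpressionAttributeNames, and `workflow_type` is not part of the index key, so it is applied as a filter after the key condition.

```python
def running_live_sessions_query(index_name: str = "status-index") -> dict:
    """Build query arguments for all running live sessions on the status GSI."""
    return {
        "IndexName": index_name,
        # Only index key attributes may appear in the key condition.
        "KeyConditionExpression": "#s = :status",
        # Non-key attributes are filtered after items are read from the index.
        "FilterExpression": "workflow_type = :wtype",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":status": "running", ":wtype": "live"},
    }


kwargs = running_live_sessions_query()
print(kwargs["KeyConditionExpression"])  # #s = :status
# With boto3 (assumed): boto3.resource("dynamodb").Table("tradai-workflow-state").query(**kwargs)
```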

ArcticDB Direct Access¶

Architecture Decision¶

Live trading containers access ArcticDB directly via S3, NOT through the data-collection service:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Data Access Patterns                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Data Collection Service (Scheduled)                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  • Runs on schedule (hourly/daily)                                   │    │
│  │  • Fetches from CCXT/Binance                                         │    │
│  │  • WRITES to ArcticDB (S3)                                           │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│                       ┌─────────────┐                                        │
│                       │  ArcticDB   │                                        │
│                       │    (S3)     │                                        │
│                       └──────┬──────┘                                        │
│                              │                                               │
│                              ▼                                               │
│  Strategy Containers (Live Trading)                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  • Always running                                                    │    │
│  │  • READS from ArcticDB (S3) directly                                 │    │
│  │  • No dependency on data-collection service                          │    │
│  │  • Can also fetch real-time data from exchange                       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Benefits:                                                                   │
│  • Decoupled - live trading unaffected by data-collection failures          │
│  • Low latency - direct S3 access, no service hop                           │
│  • Scalable - each container has independent read path                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Warmup Loader Implementation¶

from datetime import datetime, timedelta, timezone

import pandas as pd
from arcticdb import Arctic


class ArcticDBWarmupLoader:
    """Loads historical data from ArcticDB for strategy warmup."""

    def __init__(self, s3_uri: str, warmup_days: int = 30):
        self.arctic = Arctic(s3_uri)
        self.warmup_days = warmup_days
        self.library = self.arctic.get_library("ohlcv")

    def load_warmup_data(
        self,
        symbols: list[str],
        timeframe: str,
    ) -> dict[str, pd.DataFrame]:
        """Load warmup data for all symbols."""
        # Use a timezone-aware UTC timestamp; datetime.utcnow() is deprecated.
        end = datetime.now(timezone.utc)
        start = end - timedelta(days=self.warmup_days)

        result = {}
        for symbol in symbols:
            key = f"{symbol.replace('/', '_')}_{timeframe}"
            df = self.library.read(
                key,
                date_range=(start, end),
            ).data
            result[symbol] = df

        return result
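
After loading, it can be useful to check that the warmup window returned roughly the expected number of candles. The helpers below are a hypothetical sketch: `symbol_key` mirrors the key format used by the loader, and `expected_candles` assumes timeframes of the form number plus unit (m, h, d, w) and a market that trades around the clock.

```python
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440, "w": 10080}


def symbol_key(symbol: str, timeframe: str) -> str:
    """Build the library key the same way the loader does."""
    return f"{symbol.replace('/', '_')}_{timeframe}"


def timeframe_minutes(timeframe: str) -> int:
    """Convert a timeframe such as '5m', '1h' or '1d' to minutes."""
    value, unit = timeframe[:-1], timeframe[-1:]
    if not value.isdigit() or int(value) == 0 or unit not in _UNIT_MINUTES:
        raise ValueError(f"Unsupported timeframe: {timeframe!r}")
    return int(value) * _UNIT_MINUTES[unit]


def expected_candles(warmup_days: int, timeframe: str) -> int:
    """Number of full candles in the warmup window for a 24/7 market."""
    return warmup_days * 1440 // timeframe_minutes(timeframe)


print(symbol_key("BTC/USDT:USDT", "1h"))  # BTC_USDT:USDT_1h
print(expected_candles(30, "1h"))         # 720
```

A frame with far fewer rows than `expected_candles` suggests a gap in collected data and may be worth logging before the strategy starts.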

Health Monitoring¶

Health Check Flow (C3/H1)¶

The Health Reporter now collects real-time metrics from Freqtrade and evaluates risk controls on every heartbeat cycle:

Freqtrade Process (localhost:8080)
    │  GET /profit, /status, /balance
    ▼
MetricsCollector ──► StrategyPnL + [LiveTrade]
    │
    ▼
HealthReporter._heartbeat_loop() (every 30s default)
    ├── _collect_metrics()     → poll Freqtrade via MetricsCollector
    ├── _send_heartbeat()      → persist to DynamoDB (pnl_snapshot, open_trades_snapshot)
    ├── _check_risk()          → RiskMonitor evaluates limits
    │   └── On breach: pause/stop Freqtrade, update DynamoDB, send CRITICAL alert
    └── _check_pause_resume()  → C9 pause/resume from backend API

Risk controls (C3): RiskMonitor evaluates drawdown, open-trade count, and leverage against configured limits, and fails closed when monitoring itself fails. The RiskLimits entity stores the configurable limits, and RiskLimits.validate_deployment_bounds() performs pre-flight validation.
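
The fail-closed idea can be sketched as follows: if no metrics snapshot is available, the evaluation reports a breach instead of assuming everything is fine. All names below (`RiskLimitsSketch`, `RiskSnapshot`, `evaluate_risk`) and the field choices are hypothetical and do not reflect the actual RiskMonitor or RiskLimits interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskLimitsSketch:
    max_drawdown_pct: float
    max_open_trades: int
    max_leverage: float


@dataclass(frozen=True)
class RiskSnapshot:
    drawdown_pct: float
    open_trades: int
    leverage: float


def evaluate_risk(
    limits: RiskLimitsSketch, snapshot: Optional[RiskSnapshot]
) -> list[str]:
    """Return the list of breached limits; an empty list means within limits."""
    if snapshot is None:
        # Fail closed: missing metrics count as a breach.
        return ["metrics_unavailable"]
    breaches = []
    if snapshot.drawdown_pct > limits.max_drawdown_pct:
        breaches.append("drawdown")
    if snapshot.open_trades > limits.max_open_trades:
        breaches.append("open_trades")
    if snapshot.leverage > limits.max_leverage:
        breaches.append("leverage")
    return breaches


limits = RiskLimitsSketch(max_drawdown_pct=10.0, max_open_trades=3, max_leverage=2.0)
print(evaluate_risk(limits, RiskSnapshot(4.0, 2, 1.0)))   # []
print(evaluate_risk(limits, RiskSnapshot(12.5, 4, 1.0)))  # ['drawdown', 'open_trades']
print(evaluate_risk(limits, None))                        # ['metrics_unavailable']
```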

Real-time monitoring (H1): PnL and trade snapshots are persisted to DynamoDB and served via GET /strategies/pnl and GET /strategies/{id}/trades.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Health Monitoring Flow                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Strategy Container                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Health Reporter (every 30s)                                         │    │
│  │  ├── Collect metrics from Freqtrade API (H1)                         │    │
│  │  ├── Update DynamoDB heartbeat + metric snapshots                    │    │
│  │  ├── Evaluate risk limits (C3: drawdown, trades, leverage)           │    │
│  │  └── On breach: pause/stop + CRITICAL alert                          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│                       ┌─────────────┐                                        │
│                       │  DynamoDB   │                                        │
│                       │  + CW       │                                        │
│                       └──────┬──────┘                                        │
│                              │                                               │
│                              ▼                                               │
│  EventBridge Rule (every 5 min)                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Trigger: rate(5 minutes)                                            │    │
│  │  Target: Lambda health-check-live                                    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Lambda: health-check-live                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  • Query DynamoDB for status="running" AND workflow_type="live"      │    │
│  │  • Check last_heartbeat < 3 minutes ago                              │    │
│  │  • If stale: publish to SNS alert topic                              │    │
│  │  • Update CloudWatch custom metrics                                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼ (if unhealthy)                                │
│                       ┌─────────────┐                                        │
│                       │    SNS      │──────▶ PagerDuty / Slack / Email      │
│                       │   Alert     │                                        │
│                       └─────────────┘                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
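
The stale-heartbeat step of the health-check-live Lambda can be sketched as a pure function, which keeps it testable without AWS access. The status, workflow_type, and last_heartbeat attributes and the 3-minute threshold come from the flow above; the session_id attribute, the ISO 8601 timestamp encoding, and the choice to treat an unreadable heartbeat as stale are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=3)  # threshold taken from the flow above


def find_stale_sessions(items, now, stale_after=STALE_AFTER):
    """Return ids of running live sessions whose heartbeat is too old.

    `items` mimics rows read from the state table. `session_id` and the
    ISO 8601 encoding of `last_heartbeat` are assumptions of this sketch.
    """
    stale = []
    for item in items:
        if item.get("status") != "running" or item.get("workflow_type") != "live":
            continue
        try:
            age = now - datetime.fromisoformat(item.get("last_heartbeat"))
        except (TypeError, ValueError):
            # Fail closed: a missing or unparseable heartbeat counts as stale.
            stale.append(item["session_id"])
            continue
        if age > stale_after:
            stale.append(item["session_id"])
    return stale
```

A handler built on this would publish one SNS message per returned id and emit the heartbeat age as a CloudWatch custom metric.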

CloudWatch Alarms¶

Live Trading Alarms:
  - Name: live-trading-heartbeat-stale
    Metric: LiveTradingHeartbeatAge
    Threshold: > 180 seconds
    Period: 60 seconds
    EvaluationPeriods: 3
    Action: SNS Alert

  - Name: live-trading-exchange-disconnect
    Metric: LiveTradingExchangeConnected
    Threshold: < 1
    Period: 60 seconds
    EvaluationPeriods: 2
    Action: SNS Alert

  - Name: live-trading-high-loss
    Metric: LiveTradingPnLDaily
    Threshold: < -500  # USD
    Period: 300 seconds
    EvaluationPeriods: 1
    Action: SNS Alert + Auto-pause
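
As a sketch, the three alarms above can be turned into keyword arguments for the CloudWatch put_metric_alarm API. Alarm names, metrics, thresholds, periods, and evaluation periods are copied from the list; the namespace, the statistic per alarm, and treating missing data as breaching are assumptions, since the document does not specify them. The auto-pause action is not modeled here.

```python
from typing import NamedTuple


class AlarmSpec(NamedTuple):
    name: str
    metric: str
    comparison: str          # CloudWatch ComparisonOperator value
    threshold: float
    period_seconds: int
    evaluation_periods: int
    statistic: str           # assumed; not specified in this document


ALARM_SPECS = [
    AlarmSpec("live-trading-heartbeat-stale", "LiveTradingHeartbeatAge",
              "GreaterThanThreshold", 180, 60, 3, "Maximum"),
    AlarmSpec("live-trading-exchange-disconnect", "LiveTradingExchangeConnected",
              "LessThanThreshold", 1, 60, 2, "Minimum"),
    AlarmSpec("live-trading-high-loss", "LiveTradingPnLDaily",
              "LessThanThreshold", -500, 300, 1, "Minimum"),
]


def to_put_metric_alarm_kwargs(spec, topic_arn, namespace="TradAI/LiveTrading"):
    """Build kwargs for put_metric_alarm; the default namespace is hypothetical."""
    return {
        "AlarmName": spec.name,
        "Namespace": namespace,
        "MetricName": spec.metric,
        "Statistic": spec.statistic,
        "ComparisonOperator": spec.comparison,
        "Threshold": float(spec.threshold),
        "Period": spec.period_seconds,
        "EvaluationPeriods": spec.evaluation_periods,
        # Assumption: a metric that stops arriving is treated as a failure.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }
```

With boto3 installed, each dictionary could be passed as `cloudwatch.put_metric_alarm(**kwargs)` on a CloudWatch client.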

ECS Service Configuration¶

Live Trading Service Definition¶

# ECS Service for Live Trading (NOT Spot)
Resource: AWS::ECS::Service
  ServiceName: tradai-live-pascal
  Cluster: tradai-cluster
  LaunchType: FARGATE  # NOT Spot for live trading
  DesiredCount: 1
  DeploymentConfiguration:
    MaximumPercent: 100
    MinimumHealthyPercent: 0  # Allow full replacement
  TaskDefinition: tradai-strategy-pascal
  NetworkConfiguration:
    AwsvpcConfiguration:
      Subnets:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroups:
        - !Ref StrategySecurityGroup
  EnableExecuteCommand: true  # For debugging

Task Definition¶

Resource: AWS::ECS::TaskDefinition
  Family: tradai-strategy-pascal
  Cpu: 512
  Memory: 1024
  NetworkMode: awsvpc
  RequiresCompatibilities:
    - FARGATE
  ExecutionRoleArn: !Ref ECSExecutionRole
  TaskRoleArn: !Ref StrategyTaskRole
  ContainerDefinitions:
    - Name: strategy
      Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-pascal:2.0.0
      Essential: true
      Environment:
        # MINIMAL env vars - config loaded from MLflow
        - Name: STRATEGY_NAME
          Value: PascalStrategy
        - Name: STRATEGY_STAGE
          Value: Production  # or use STRATEGY_VERSION for pinned version
        - Name: TRADING_MODE
          Value: live
        # Infrastructure (rarely changes)
        - Name: MLFLOW_TRACKING_URI
          Value: http://mlflow.internal:5000
        - Name: DYNAMODB_TABLE
          Value: tradai-workflow-state
        - Name: ARCTICDB_S3_URI
          Value: !Sub s3://${ArcticDBBucket}
        - Name: S3_CONFIG_BUCKET
          Value: !Ref ConfigsBucket
      Secrets:
        # Secrets stay in Secrets Manager (not MLflow)
        - Name: EXCHANGE_API_KEY
          ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-key
        - Name: EXCHANGE_API_SECRET
          ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-secret
      LogConfiguration:
        LogDriver: awslogs
        Options:
          awslogs-group: /ecs/tradai-live-pascal
          awslogs-region: !Ref AWS::Region
          awslogs-stream-prefix: strategy
      HealthCheck:
        Command:
          - CMD-SHELL
          - curl -f http://localhost:8080/health || exit 1
        Interval: 30
        Timeout: 5
        Retries: 3
        StartPeriod: 60

# Note: All strategy-specific config (pairs, timeframe, warmup_days, etc.)
# is loaded at runtime from MLflow tags + S3, NOT from env vars.
# To change config: update MLflow model version, NOT task definition.
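
To illustrate the minimal-env-var approach, here is a sketch of how a container entrypoint might validate these variables and derive the MLflow registry URI before loading the rest of the configuration. The variable names come from the task definition above and the `models:/<name>/<stage-or-version>` form is MLflow's registry URI scheme; the RuntimeSettings type, the load_settings() helper, the exact mode strings other than live, and the rule that exactly one of STRATEGY_STAGE and STRATEGY_VERSION is set are assumptions.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Mode names follow the four modes listed in this document; the exact
# strings other than "live" are assumed.
VALID_MODES = frozenset({"backtest", "hyperopt", "dry-run", "live"})
REQUIRED_VARS = (
    "STRATEGY_NAME", "TRADING_MODE", "MLFLOW_TRACKING_URI",
    "DYNAMODB_TABLE", "ARCTICDB_S3_URI", "S3_CONFIG_BUCKET",
)


@dataclass(frozen=True)
class RuntimeSettings:
    strategy_name: str
    trading_mode: str
    stage: Optional[str]
    version: Optional[str]
    mlflow_tracking_uri: str
    dynamodb_table: str
    arcticdb_s3_uri: str
    s3_config_bucket: str

    @property
    def model_uri(self) -> str:
        """MLflow registry URI: models:/<name>/<stage-or-version>."""
        return f"models:/{self.strategy_name}/{self.version or self.stage}"


def load_settings(env: Mapping[str, str]) -> RuntimeSettings:
    """Validate the infrastructure env vars and return typed settings."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    if env["TRADING_MODE"] not in VALID_MODES:
        raise ValueError(f"unsupported TRADING_MODE: {env['TRADING_MODE']!r}")
    stage = env.get("STRATEGY_STAGE") or None
    version = env.get("STRATEGY_VERSION") or None
    if (stage is None) == (version is None):
        raise ValueError("set exactly one of STRATEGY_STAGE or STRATEGY_VERSION")
    return RuntimeSettings(
        strategy_name=env["STRATEGY_NAME"],
        trading_mode=env["TRADING_MODE"],
        stage=stage,
        version=version,
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
        dynamodb_table=env["DYNAMODB_TABLE"],
        arcticdb_s3_uri=env["ARCTICDB_S3_URI"],
        s3_config_bucket=env["S3_CONFIG_BUCKET"],
    )
```

Calling `load_settings(os.environ)` at startup would fail fast on a misconfigured task, while pairs, timeframe, and other strategy-specific values would still be resolved afterwards from MLflow tags and S3.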

Auto Scaling (Optional)¶

For strategies that benefit from horizontal scaling:

Resource: AWS::ApplicationAutoScaling::ScalableTarget
  ServiceNamespace: ecs
  ResourceId: !Sub service/tradai-cluster/tradai-live-pascal
  ScalableDimension: ecs:service:DesiredCount
  MinCapacity: 1
  MaxCapacity: 5

Resource: AWS::ApplicationAutoScaling::ScalingPolicy
  PolicyName: cpu-scaling
  PolicyType: TargetTrackingScaling
  ScalingTargetId: !Ref ScalableTarget
  TargetTrackingScalingPolicyConfiguration:
    PredefinedMetricSpecification:
      PredefinedMetricType: ECSServiceAverageCPUUtilization
    TargetValue: 70
    ScaleInCooldown: 300
    ScaleOutCooldown: 60

Cost Analysis¶

Per-Strategy Costs (Live Trading)¶

Component	Specification	Monthly Cost
ECS Fargate	0.5 vCPU, 1GB, 24/7	$15.33
CloudWatch Logs	~1GB/month	$0.50
DynamoDB	On-demand, ~1M reads	$0.25
Secrets Manager	2 secrets	$0.80
EventBridge	8,640 invocations	~$0.01
Lambda (health)	8,640 x 128MB x 1s	$0.11
SNS	~100 alerts	$0.01
Subtotal per strategy		$17.01
Reserve (exchange fees, data)		+$29
Total per strategy		~$46

Platform Total (3 Strategies)¶

Component	Monthly Cost
Base Platform (v8.1)	$78-99
Pascal Strategy (live)	$46
Momentum Strategy (dry-run)	$46
ML Trend Strategy (live)	$46
Platform Total	$216-237
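
The arithmetic behind the two cost tables can be re-derived with a short script. All dollar figures are copied from the tables; the 30-day month is an assumption implied by the 8,640 invocation count (one health check every 5 minutes), and rounding the per-strategy total to whole dollars mirrors the "~$46" shown above.

```python
from decimal import Decimal

# Line items copied from the per-strategy table (USD per month).
PER_STRATEGY_ITEMS = {
    "ECS Fargate": Decimal("15.33"),
    "CloudWatch Logs": Decimal("0.50"),
    "DynamoDB": Decimal("0.25"),
    "Secrets Manager": Decimal("0.80"),
    "EventBridge": Decimal("0.01"),
    "Lambda (health)": Decimal("0.11"),
    "SNS": Decimal("0.01"),
}
RESERVE = Decimal("29")
BASE_PLATFORM = (Decimal("78"), Decimal("99"))  # low and high estimate
STRATEGY_COUNT = 3

# One health check every 5 minutes over an assumed 30-day month.
health_checks_per_month = (60 // 5) * 24 * 30

subtotal = sum(PER_STRATEGY_ITEMS.values(), Decimal("0"))
per_strategy = (subtotal + RESERVE).quantize(Decimal("1"))
platform_total = tuple(base + STRATEGY_COUNT * per_strategy for base in BASE_PLATFORM)

print(health_checks_per_month)  # 8640
print(subtotal)                 # 17.01
print(per_strategy)             # 46
print(platform_total)           # (Decimal('216'), Decimal('237'))
```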

Security Controls¶

Exchange Credentials¶

# Secrets Manager Configuration
Secret: tradai/exchange/api-key
  Description: Binance API key for live trading
  Rotation: Manual (exchange limitation)
  Access: StrategyTaskRole only

Secret: tradai/exchange/api-secret
  Description: Binance API secret
  Rotation: Manual
  Access: StrategyTaskRole only

# IAM Policy for Strategy Task Role
Policy:
  Effect: Allow
  Action:
    - secretsmanager:GetSecretValue
  Resource:
    - arn:aws:secretsmanager:*:*:secret:tradai/exchange/*
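  Condition:
    StringEquals:
      # Match the Environment tag on the secret itself; GetSecretValue
      # requests carry no tags, so aws:RequestTag would never match.
      aws:ResourceTag/Environment: !Ref Environment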

Network Security¶

# Strategy Security Group
Inbound:
  - Port 8080 (health check) from ALB only
  - No inbound traffic from the internet (tasks run in private subnets)

Outbound:
  - Port 443 to exchange APIs (Binance)
  - Port 443 to S3 (VPC endpoint)
  - Port 443 to DynamoDB (VPC endpoint)
  - Port 5000 to MLflow (internal)

Audit Logging¶

CloudTrail Events:
  - ECS RunTask / StopTask
  - Secrets Manager GetSecretValue
  - DynamoDB PutItem / UpdateItem (data events, which must be enabled explicitly on the trail)

CloudWatch Logs:
  - /ecs/tradai-live-{strategy}
  - Retention: 30 days
  - Export to S3: After 7 days

Implementation Checklist (Phase 6)¶

Week 17-18: Live Trading Foundation¶

[ ] Update Strategy Container Dockerfile for multi-mode support
[ ] Implement ArcticDB warmup loader
[ ] Implement DynamoDB state manager for live sessions
[ ] Implement health reporter component
[ ] Create ECS Service definitions (non-Spot)
[ ] Configure Secrets Manager for exchange credentials

Week 19-20: MLflow Integration¶

[ ] Update MLflow adapter for model serving
[ ] Implement model loading in FreqAI strategies
[ ] Create model promotion workflow (Staging → Production)
[ ] Test model versioning with live trading

Week 21-22: Monitoring & Operations¶

[ ] Deploy EventBridge + Lambda health monitoring
[ ] Configure CloudWatch alarms
[ ] Set up SNS alerting
[ ] Create operational runbooks
[ ] Perform dry-run validation
[ ] Go-live with first strategy

Document Changelog¶

Version	Date	Changes
9.1	2025-12-09	MLflow-centric configuration: minimal env vars, config from MLflow tags + S3, leverages existing `StrategyMetadata`, `MLflowAdapter`, `ConfigMergeService` patterns
9.0	2025-12-08	Initial live trading architecture document

Related document: 12-ML-LIFECYCLE.md covers the complete ML strategy lifecycle, including the training pipeline, hyperparameter optimization, retraining scheduling, drift detection, and model rollback. Read it for the full MLOps story.

Next Action: Begin Phase 6 implementation after completing Phases 1-5 of the backtesting foundation.