TradAI Live Trading Architecture

Version: 9.1.2 (Live Trading Integration)
Date: 2026-03-28
Status: ASPIRATIONAL (Partially Implemented)
Depends On: Architecture v9.2.1 (Backtesting Foundation)

TL;DR: Live trading architecture using ECS Fargate (1 vCPU, 2GB, non-Spot). TradingHandler orchestrates Freqtrade with health monitoring (MetricsCollector, RiskMonitor, HealthReporter). Status: partially implemented; core components are ready, but E2E is not yet validated.

Status: Aspirational (Partially Implemented)

This document describes the target architecture. Core infrastructure is implemented: ECS service definitions (live-trading, dry-run-trading in config.py), the trading entrypoint (tradai.common.entrypoint.trading.TradingHandler), health reporting (HealthReporter, RiskMonitor, MetricsCollector), and the trading-heartbeat-check Lambda. Remaining work (MLflow model serving in live mode, full DynamoDB session management, exchange credential rotation) is tracked in reports/implementation-tasks/.

Implemented Components

The following building blocks are implemented and tested: StrategyConfigLoader (MLflow tags + S3), ArcticDBWarmupLoader (historical data warmup), TradingStateRepository (DynamoDB session state), HealthReporter + RiskMonitor + MetricsCollector (container health and risk controls), FreqAIConfigBuilder (inference config), and the trading-heartbeat-check Lambda. ECS service definitions for live-trading and dry-run-trading are in config.py.

Prerequisites

Live trading (Phase 6) requires ALL of the following before starting:

  1. Phases 1-5 complete with a stable backtesting platform
  2. At least one successful end-to-end backtest via Step Functions
  3. MLflow model registry operational with at least one registered model
  4. Health monitoring and CloudWatch alarms verified in production

1. Executive Summary

This document extends the TradAI architecture v8.1 to support live trading capabilities. The design follows key principles:

  1. Unified Strategy Container - Single container per strategy handles all modes (backtest, hyperopt, dry-run, live)
  2. MLflow-Centric Configuration - Strategy config loaded from MLflow tags + S3, NOT hardcoded env vars
  3. MLflow Model Lifecycle - FreqAI trains during backtest, MLflow serves models for live trading
  4. Direct ArcticDB Access - Live trading reads S3 directly, decoupled from data-collection service
  5. Self-Managed Sessions - Strategy containers manage their own DynamoDB state
  6. ECS Services for Live - Always-on services (NOT Spot) for production trading

Cost Impact

| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $122-128 |
| Per Live Strategy | +$61 |
| Total (3 strategies) | $305-311 |

2. System Architecture v9.0

graph TB
    subgraph Public["Public Subnet"]
        APIGW["API Gateway<br/>+ Cognito"]
        ALB["ALB"]
        NAT["NAT Instance"]
    end

    subgraph Private["Private Subnet - ECS Fargate Cluster"]
        Backend["Backend Service<br/>(Always-on)"]
        Strategy["Strategy Service<br/>(On-demand)"]
        Containers["Strategy Containers<br/>pascal:2.0.0 LIVE<br/>momentum:1.0 DRY-RUN"]
        DataColl["Data Collection<br/>(Scheduled)"]
        MLflow["MLflow<br/>(Always-on)"]
    end

    subgraph Data["Data Layer"]
        ArcticDB["ArcticDB (S3)"]
        DynamoDB["DynamoDB State"]
        S3["S3 Configs/Results"]
        RDS["RDS Postgres"]
    end

    subgraph Monitoring["Monitoring & Events"]
        EB["EventBridge"]
        Lambda["Lambda Health"]
        CW["CloudWatch Alarms"]
        SNS["SNS Alerts"]
    end

    subgraph External["External"]
        Binance["Binance Exchange"]
        CCXT["CCXT (Futures)"]
    end

    APIGW --> ALB
    ALB --> Backend
    ALB --> Strategy
    ALB --> MLflow
    Backend --> Containers
    Containers --> ArcticDB
    Containers --> DynamoDB
    Containers --> MLflow
    Containers --> Binance
    DataColl --> ArcticDB
    EB --> Lambda
    Lambda --> DynamoDB
    Lambda --> SNS
    CW --> SNS

See source code for detailed ASCII reference.


3. Unified Strategy Container

Container Structure

Each strategy is packaged as a single Docker image that supports all trading modes:

tradai-{strategy-name}:{version}
├── Freqtrade runtime
├── Strategy code (IStrategy implementation)
├── FreqAI models (for ML strategies)
├── ArcticDB warmup loader
├── DynamoDB state manager
├── Health reporter
└── Mode configuration

Trading Modes

| Mode | Description | Launch Type | Spot OK? |
|---|---|---|---|
| backtest | Historical simulation | ECS Task | Yes |
| hyperopt | Parameter optimization | ECS Task | Yes |
| dry-run | Paper trading (no real orders) | ECS Service | Yes |
| live | Production trading | ECS Service | No |

Environment Variables (Minimal)

The container uses minimal environment variables - just enough to locate the strategy in MLflow. All other configuration is loaded from MLflow tags and S3.

# REQUIRED: Identify strategy and trading state
# Issue 10 Fix: Changed from STRATEGY_NAME/STRATEGY_STAGE to match actual code
MLFLOW_TRACKING_URI: "http://mlflow.tradai-{env}.local:5000/mlflow"
STRATEGY: "PascalStrategy"  # Strategy class name (FREQTRADE_STRATEGY as fallback)
STRATEGY_ID: "pascal-btc"   # Unique strategy identifier for trading state

# REQUIRED: Trading mode
TRADING_MODE: "live"  # backtest | hyperopt | dry-run | live | train

# REQUIRED: Secrets (must stay in Secrets Manager)
EXCHANGE_SECRET_NAME: "tradai/exchange/binance"  # AWS Secrets Manager secret

# INFRASTRUCTURE (rarely changes)
DYNAMODB_TABLE: "tradai-workflow-state"
ARCTICDB_S3_URI: "s3://tradai-arcticdb-prod"

Why minimal env vars?

  • Single source of truth: Config lives in MLflow, not scattered across deployments
  • Version tracking: Config changes are tied to strategy versions
  • Audit trail: MLflow tracks who changed what when
  • Environment parity: Same container image, different config via MLflow stage


4. MLflow-Centric Configuration

Design Principle

Strategy configuration flows from MLflow, NOT from environment variables:

flowchart TD
    A["Minimal env vars"]
    B["Query MLflow tags"]
    C["Load S3 config"]
    D["Apply runtime overrides"]
    E["Validated config"]

    A --> B --> C --> D --> E

| Step | Action | Detail |
|---|---|---|
| Minimal env vars | Bootstrap lookup | Only enough data to locate the strategy in MLflow, typically STRATEGY, STRATEGY_ID, and MLFLOW_TRACKING_URI |
| Query MLflow tags | Resolve model metadata | MLflowAdapter.get_model_version(...) returns tags such as strategy_version, timeframe, configuration_file, warmup_days, pairs, and category |
| Load S3 config | Fetch base config | ConfigMergeService.load_config(...) loads the full Freqtrade configuration referenced by the MLflow metadata |
| Apply runtime overrides | Merge session-specific values | ConfigMergeService.apply_overrides(...) performs the final OmegaConf deep merge |
| Validated config | Start trading runtime | The container launches Freqtrade with the merged, validated strategy configuration |
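
To make the "Apply runtime overrides" step concrete, here is a minimal sketch of a recursive deep merge. It is a plain-Python stand-in, not OmegaConf and not the actual ConfigMergeService; the function name `deep_merge` and the sample config keys are hypothetical.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `overrides` win over `base`.

    Nested dicts are merged key by key; any other value (lists included)
    is replaced wholesale. Neither input is mutated.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base config (as loaded from S3) and session overrides.
base_config = {
    "stake_amount": 100,
    "exchange": {"name": "binance", "pair_whitelist": ["BTC/USDT:USDT"]},
}
session_overrides = {
    "exchange": {"pair_whitelist": ["ETH/USDT:USDT"]},
    "dry_run": True,
}

merged_config = deep_merge(base_config, session_overrides)
```

The override replaces only `exchange.pair_whitelist` and adds `dry_run`; sibling keys such as `exchange.name` survive, which is the property a deep merge offers over a shallow `{**base, **overrides}`.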

Integration with Existing Patterns

This design leverages existing TradAI components:

| Component | File | Purpose |
|---|---|---|
| StrategyMetadata | libs/tradai-strategy/src/tradai/strategy/metadata.py | Defines metadata schema with to_mlflow_tags() |
| MLflowAdapter | libs/tradai-common/src/tradai/common/mlflow/adapter.py | REST API wrapper for MLflow |
| ConfigMergeService | libs/tradai-common/src/tradai/common/config/merge.py | S3 config loading + OmegaConf merge |
| S3ConfigRepository | libs/tradai-common/src/tradai/common/aws/s3_config_repository.py | S3 config storage |

Config Loading Implementation

from tradai.common.mlflow.adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService
from tradai.common.aws.s3_config_repository import S3ConfigRepository

class StrategyConfigLoader:
    """Loads strategy configuration from MLflow and S3."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        config_service: ConfigMergeService,
    ):
        self.mlflow = mlflow_adapter
        self.config_service = config_service

    def load_config(
        self,
        strategy_name: str,
        stage: str = "Production",
        runtime_overrides: dict | None = None,
    ) -> dict:
        """Load complete strategy config from MLflow + S3.

        Args:
            strategy_name: Name of strategy in MLflow registry
            stage: MLflow stage (Production, Staging, None)
            runtime_overrides: Optional overrides for this session

        Returns:
            Complete validated strategy configuration
        """
        # 1. Get model version from MLflow (includes all tags)
        model_version = self.mlflow.get_model_version(
            name=strategy_name,
            stage=stage,
        )

        # 2. Extract config metadata from tags
        tags = {t["key"]: t["value"] for t in model_version.tags}
        config_s3_path = tags["configuration_file"]

        # 3. Build strategy metadata from tags
        strategy_config = {
            "strategy_name": strategy_name,
            "strategy_version": tags["strategy_version"],
            "timeframe": tags["timeframe"],
            "warmup_days": int(tags.get("warmup_days", "30")),
            "pairs": tags["pairs"].split(","),
            "can_short": tags.get("can_short", "false").lower() == "true",
        }

        # 4. Load full Freqtrade config from S3
        freqtrade_config = self.config_service.load_config(config_s3_path)

        # 5. Merge strategy metadata into config
        full_config = {
            **freqtrade_config,
            "tradai": strategy_config,
        }

        # 6. Apply runtime overrides if provided
        if runtime_overrides:
            full_config = self.config_service.apply_overrides(
                base=full_config,
                overrides=runtime_overrides,
            )

        return full_config

MLflow Tag Schema (StrategyMetadata.to_mlflow_tags())

When a strategy is registered, StrategyMetadata.to_mlflow_tags() stores:

# Core identity (from StrategyMetadata)
strategy_name: "PascalStrategy"
strategy_version: "2.0.0"
timeframe: "1h"
category: "mean_reversion"
author: "TradAI Team"
status: "production"  # testing | staging | production

# Trading config
can_short: "true"
pairs: "BTC/USDT:USDT,ETH/USDT:USDT"
stake_amount: "100"
max_open_trades: "3"

# Data requirements
warmup_days: "30"
data_format: "ohlcv"

# Config file reference (S3 path)
configuration_file: "strategies/pascal/v2.0.0/config.json"

# ML model reference (for FreqAI strategies)
freqai_model: "models:/PascalStrategy-FreqAI/Production"

# Docker image reference
ecr_url: "123456789.dkr.ecr.eu-central-1.amazonaws.com/tradai/pascal:2.0.0"
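
Because MLflow tag values are stored as strings, a consumer has to convert them back into typed values. The sketch below shows one way to do that for a subset of the tags above; `ParsedStrategyTags` and `parse_strategy_tags` are hypothetical names for illustration and are not part of StrategyMetadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedStrategyTags:
    """Typed view over a subset of the string-valued tags (hypothetical)."""

    strategy_version: str
    timeframe: str
    pairs: list[str]
    warmup_days: int
    can_short: bool
    max_open_trades: int


def parse_strategy_tags(tags: dict[str, str]) -> ParsedStrategyTags:
    """Convert string tags into typed fields, failing fast on missing keys."""
    return ParsedStrategyTags(
        strategy_version=tags["strategy_version"],
        timeframe=tags["timeframe"],
        # Comma-separated list; ignore stray whitespace and empty entries.
        pairs=[p.strip() for p in tags["pairs"].split(",") if p.strip()],
        warmup_days=int(tags.get("warmup_days", "30")),
        can_short=tags.get("can_short", "false").lower() == "true",
        max_open_trades=int(tags["max_open_trades"]),
    )


example_tags = {
    "strategy_version": "2.0.0",
    "timeframe": "1h",
    "pairs": "BTC/USDT:USDT,ETH/USDT:USDT",
    "warmup_days": "30",
    "can_short": "true",
    "max_open_trades": "3",
}
parsed_tags = parse_strategy_tags(example_tags)
```

Required tags raise `KeyError` when absent, so a misregistered model version fails at startup rather than during trading.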

Container Initialization Flow

sequenceDiagram
    participant Env as Environment
    participant MLflow as MLflow Registry
    participant S3 as S3 Config
    participant DDB as DynamoDB
    participant Arctic as ArcticDB
    participant FT as Freqtrade
    participant CW as CloudWatch

    Note over Env: 1. Bootstrap
    Env->>MLflow: get_model_version(STRATEGY, stage)
    MLflow-->>Env: ModelVersion + tags

    Note over Env: 2. Load Config
    Env->>S3: load_config(config_s3_path)
    S3-->>Env: Full Freqtrade config
    Note over Env: Apply runtime overrides

    Note over Env: 3. Initialize State
    Env->>DDB: Create session (status=INITIALIZING)

    Note over Env: 4. Load ML Models (if FreqAI)
    Env->>MLflow: get_model_version(freqai_model)
    MLflow-->>Env: Model artifacts

    Note over Env: 5. Warmup Data
    Env->>Arctic: Load historical OHLCV data

    Note over Env: 6. Start Trading
    Env->>FT: Start Freqtrade subprocess
    Env->>DDB: Update status=RUNNING

    loop Every 30 seconds
        Note over Env: 7. Health Reporting
        FT-->>Env: /profit, /status
        Env->>DDB: Update last_heartbeat
        Env->>CW: Publish metrics
    end

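Startup Code Example

import logging
import os
from tradai.common.mlflow.adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService

# StrategyConfigLoader is the class shown under "Config Loading Implementation".
logger = logging.getLogger(__name__)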

def main():
    # 1. Bootstrap from minimal env vars
    strategy_name = os.environ.get("STRATEGY") or os.environ["FREQTRADE_STRATEGY"]
    strategy_id = os.environ.get("STRATEGY_ID", "")
    trading_mode = os.environ["TRADING_MODE"]
    mlflow_uri = os.environ["MLFLOW_TRACKING_URI"]

    # 2. Initialize adapters
    mlflow = MLflowAdapter(base_url=mlflow_uri)
    config_service = ConfigMergeService(bucket="tradai-configs")
    config_loader = StrategyConfigLoader(mlflow, config_service)

    # 3. Load complete config from MLflow + S3
    config = config_loader.load_config(
        strategy_name=strategy_name,
    )

    # 4. Extract runtime parameters (all from config, NOT env vars)
    warmup_days = config["tradai"]["warmup_days"]
    pairs = config["tradai"]["pairs"]
    timeframe = config["tradai"]["timeframe"]

    # 5. Initialize state, warmup data, start trading...
    logger.info(f"Starting {strategy_name} v{config['tradai']['strategy_version']}")
    logger.info(f"Mode: {trading_mode}, Pairs: {pairs}, Timeframe: {timeframe}")

5. MLflow Model Lifecycle

Training Phase (Backtest Mode)

During backtesting with FreqAI strategies:

flowchart LR
    subgraph Training["Training Phase (Backtest Mode)"]
        A[Backtest Run] --> B[FreqAI Strategy]
        B --> C[Train Models]
        C --> D[Save to MLflow]
        D --> E["MLflow Registry<br/>(Staging / Prod)"]
    end

    subgraph Inference["Inference Phase (Live Mode)"]
        F[Container Start] --> G[Load Model]
        G --> H[MLflow Client]
        H --> I[Cached Model]
        I --> J["FreqAI Predict<br/>(No Train)"]
    end

    E -.-> G

Note: FreqAI does NOT retrain during live trading. Models are loaded once from MLflow and used for inference only.
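
The "load once, then infer only" behavior can be sketched as a small cache around an injected loader. This is an illustration under assumptions: `CachedModelProvider` is a hypothetical name, and the loader is a stub standing in for whatever call actually fetches the model from MLflow.

```python
from typing import Any, Callable


class CachedModelProvider:
    """Loads each model URI once through `loader` and reuses the result."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._cache: dict[str, Any] = {}

    def get(self, model_uri: str) -> Any:
        # Only the first request for a URI reaches the loader.
        if model_uri not in self._cache:
            self._cache[model_uri] = self._loader(model_uri)
        return self._cache[model_uri]


# Stub loader that records how often it is called (stands in for MLflow).
load_calls: list[str] = []


def fake_loader(model_uri: str) -> dict[str, str]:
    load_calls.append(model_uri)
    return {"uri": model_uri}


provider = CachedModelProvider(fake_loader)
first_model = provider.get("models:/PascalStrategy-FreqAI/Production")
second_model = provider.get("models:/PascalStrategy-FreqAI/Production")
```

With this shape, picking up a newly promoted model version requires a container restart, which is consistent with the note that models are not retrained or reloaded during live trading.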

Model Registry Schema

MLflow Model Registry
├── PascalStrategy
│   ├── Version 1 (Staging) - trained 2025-12-01
│   ├── Version 2 (Production) - trained 2025-12-05
│   └── Version 3 (None) - training 2025-12-08
├── MomentumStrategy
│   └── Version 1 (Production) - trained 2025-12-03
└── MLTrendStrategy
    ├── Version 1 (Archived) - trained 2025-11-15
    └── Version 2 (Production) - trained 2025-12-06

6. Unified DynamoDB State

Table Schema

Single table tradai-workflow-state handles both backtest workflows AND live trading sessions:

Table: tradai-workflow-state-{env}
  Partition Key: run_id (String)
  Sort Key: None

  # Backtest Workflow Record
  {
    "run_id": "backtest-pascal-20251208-abc123",
    "workflow_type": "backtest",
    "strategy_name": "PascalStrategy",
    "strategy_version": "2.0.0",
    "status": "completed",
    "created_at": "2025-12-08T10:00:00Z",
    "started_at": "2025-12-08T10:00:00Z",
    "completed_at": "2025-12-08T10:15:00Z",
    "config_s3_path": "s3://tradai-configs/backtest/...",
    "result_s3_path": "s3://tradai-results/backtest/...",
    "ttl": 1765792800  # created_at + 7 days (2025-12-15T10:00:00Z)
  }

  # Live Trading Session Record
  {
    "run_id": "live-pascal-20251208-001",
    "workflow_type": "live",
    "strategy_name": "PascalStrategy",
    "strategy_version": "2.0.0",
    "status": "running",
    "trading_mode": "live",
    "created_at": "2025-12-08T00:00:00Z",
    "started_at": "2025-12-08T00:00:00Z",
    "last_heartbeat": "2025-12-08T14:30:00Z",
    "exchange": "binance",
    "pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
    "metrics": {
      "total_trades": 45,
      "win_rate": 0.62,
      "pnl_today": 125.50,
      "pnl_total": 1250.00
    },
    "health": {
      "cpu_percent": 35,
      "memory_mb": 512,
      "exchange_connected": true,
      "last_trade_at": "2025-12-08T14:25:00Z"
    }
    # No TTL for live sessions - persist until explicitly deleted
  }

DynamoDB item size limit for long-running sessions

Long-running trading sessions accumulate metrics snapshots in the same DynamoDB item. DynamoDB has a 400KB item size limit. For strategies running 6+ months, consider archiving historical metrics to S3 periodically.
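
One way to decide when to archive is to estimate the item size before each write. The sketch below approximates size from compact JSON, which is not identical to DynamoDB's own size accounting; the 80% threshold and the function names are assumptions chosen for illustration.

```python
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # 400KB limit, taken as 400 * 1024 bytes
ARCHIVE_THRESHOLD = 0.8  # hypothetical safety margin below the hard limit


def approx_item_size_bytes(item: dict) -> int:
    """Rough size estimate: UTF-8 length of the compact JSON encoding."""
    return len(json.dumps(item, separators=(",", ":")).encode("utf-8"))


def should_archive(item: dict, threshold: float = ARCHIVE_THRESHOLD) -> bool:
    """True once the estimated size crosses the chosen fraction of the limit."""
    return approx_item_size_bytes(item) >= threshold * DYNAMODB_ITEM_LIMIT_BYTES


small_session = {
    "run_id": "live-pascal-20251208-001",
    "metrics": {"total_trades": 45, "pnl_total": 1250.0},
}
# Simulate months of accumulated snapshots stored inside the same item.
large_session = dict(
    small_session,
    snapshots=[{"ts": i, "pnl_total": 1250.0} for i in range(20000)],
)

archive_small = should_archive(small_session)
archive_large = should_archive(large_session)
```

When the check returns True, older snapshots could be written to S3 and trimmed from the item before the next heartbeat update.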

GSI for Status Queries

GSI: status-created_at-index
  Partition Key: status (String)
  Sort Key: created_at (String)

  # Query: Get all running live sessions
  status = "running" (then filter workflow_type = "live")

GSI: trace_id-index
  Partition Key: trace_id (String)
  Projection: INCLUDE (job_id, status, created_at, execution_arn)
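
The "running live sessions" lookup against status-created_at-index might be expressed as the following low-level query parameters. This is a sketch: the table name suffix `dev` and the function name are assumptions, and the dictionary is only built here, not sent to AWS.

```python
def build_running_live_query(table_name: str) -> dict:
    """Build keyword arguments for a DynamoDB Query on the status GSI.

    `status` is a DynamoDB reserved word, so it is aliased through
    ExpressionAttributeNames. The workflow_type filter is applied after
    items are read from the index, so non-live running items still
    consume read capacity.
    """
    return {
        "TableName": table_name,
        "IndexName": "status-created_at-index",
        "KeyConditionExpression": "#status = :status",
        "FilterExpression": "workflow_type = :workflow_type",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":status": {"S": "running"},
            ":workflow_type": {"S": "live"},
        },
    }


query_kwargs = build_running_live_query("tradai-workflow-state-dev")
# With boto3 this would be used as:
#   boto3.client("dynamodb").query(**query_kwargs)
```

If running backtests ever outnumber live sessions by a wide margin, a dedicated index keyed on workflow_type may be worth considering, since the filter does not reduce the data read.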

7. ArcticDB Direct Access

Architecture Decision

Live trading containers access ArcticDB directly via S3, NOT through the data-collection service:

┌──────────────────────────────────────────────────────────────────────────────┐
│                             Data Access Patterns                             │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Data Collection Service (Scheduled)                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │  • Runs on schedule (hourly/daily)                                   │    │
│  │  • Fetches from CCXT/Binance                                         │    │
│  │  • WRITES to ArcticDB (S3)                                           │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│                       ┌─────────────┐                                        │
│                       │  ArcticDB   │                                        │
│                       │    (S3)     │                                        │
│                       └──────┬──────┘                                        │
│                              │                                               │
│                              ▼                                               │
│  Strategy Containers (Live Trading)                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │  • Always running                                                    │    │
│  │  • READS from ArcticDB (S3) directly                                 │    │
│  │  • No dependency on data-collection service                          │    │
│  │  • Can also fetch real-time data from exchange                       │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Benefits:                                                                   │
│  • Decoupled - live trading unaffected by data-collection failures           │
│  • Low latency - direct S3 access, no service hop                            │
│  • Scalable - each container has independent read path                       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Warmup Loader Implementation

from datetime import datetime, timedelta, timezone

import pandas as pd
from arcticdb import Arctic


class ArcticDBWarmupLoader:
    """Loads historical data from ArcticDB for strategy warmup."""

    def __init__(self, s3_uri: str, warmup_days: int = 30):
        self.arctic = Arctic(s3_uri)
        self.warmup_days = warmup_days
        self.library = self.arctic.get_library("ohlcv")

    def load_warmup_data(
        self,
        symbols: list[str],
        timeframe: str,
    ) -> dict[str, pd.DataFrame]:
        """Load warmup data for all symbols."""
        # Naive UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        end = datetime.now(timezone.utc).replace(tzinfo=None)
        start = end - timedelta(days=self.warmup_days)

        result = {}
        for symbol in symbols:
            key = f"{symbol.replace('/', '_')}_{timeframe}"
            df = self.library.read(
                key,
                date_range=(start, end),
            ).data
            result[symbol] = df

        return result
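
A loader like this benefits from a sanity check that the returned frame actually covers the warmup window. The sketch below computes the expected candle count from warmup_days and the timeframe string; the helper names and the 95% coverage tolerance are assumptions, not part of ArcticDBWarmupLoader.

```python
from datetime import timedelta

# Seconds per unit for timeframe strings such as "5m", "1h", "1d".
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800}


def timeframe_to_timedelta(timeframe: str) -> timedelta:
    """Parse a timeframe string like '5m' or '1h' into a timedelta."""
    amount, unit = int(timeframe[:-1]), timeframe[-1]
    if unit not in _UNIT_SECONDS or amount <= 0:
        raise ValueError(f"Unsupported timeframe: {timeframe!r}")
    return timedelta(seconds=amount * _UNIT_SECONDS[unit])


def expected_candles(warmup_days: int, timeframe: str) -> int:
    """Number of whole candles that fit into the warmup window."""
    return int(timedelta(days=warmup_days) / timeframe_to_timedelta(timeframe))


def warmup_is_complete(
    rows_loaded: int,
    warmup_days: int,
    timeframe: str,
    min_coverage: float = 0.95,  # hypothetical tolerance for small gaps
) -> bool:
    """True when the loaded row count covers enough of the window."""
    return rows_loaded >= min_coverage * expected_candles(warmup_days, timeframe)


hourly_candles = expected_candles(30, "1h")
five_minute_candles = expected_candles(30, "5m")
```

For the default 30-day window this gives 720 hourly candles, so a frame with only a few hundred rows would signal a gap in the collected data before trading starts.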

8. Health Monitoring

Health Check Flow (C3/H1)

The Health Reporter now collects real-time metrics from Freqtrade and evaluates risk controls on every heartbeat cycle:

Freqtrade Process (localhost:8080)
    │  GET /profit, /status, /balance
    ▼
MetricsCollector ──► StrategyPnL + [LiveTrade]
    │
    ▼
HealthReporter._heartbeat_loop() (every 30s default)
    ├── _collect_metrics()     → poll Freqtrade via MetricsCollector
    ├── _send_heartbeat()      → persist to DynamoDB (pnl_snapshot, open_trades_snapshot)
    ├── _check_risk()          → RiskMonitor evaluates limits
    │   └── On breach: pause/stop Freqtrade, update DynamoDB, send CRITICAL alert
    └── _check_pause_resume()  → C9 pause/resume from backend API

Risk controls (C3): RiskMonitor evaluates drawdown, open trades, leverage, and fail-closed monitoring failures. RiskLimits entity stores configurable limits. Pre-flight validation via RiskLimits.validate_deployment_bounds().

Real-time monitoring (H1): PnL and trade snapshots are persisted to DynamoDB and served via GET /strategies/pnl and GET /strategies/{id}/trades.
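
The fail-closed idea behind the risk check can be illustrated with a small evaluation function. This is a sketch under assumptions: `Limits`, `Snapshot`, and `evaluate` are hypothetical names and do not reproduce the actual RiskLimits or RiskMonitor interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    max_drawdown_pct: float
    max_open_trades: int
    max_leverage: float


@dataclass(frozen=True)
class Snapshot:
    drawdown_pct: float
    open_trades: int
    leverage: float


def evaluate(limits: Limits, snapshot: Optional[Snapshot]) -> list[str]:
    """Return the list of breached limits; an empty list means all clear.

    A missing snapshot (metrics could not be collected) is itself treated
    as a breach, so monitoring failures stop trading instead of hiding risk.
    """
    if snapshot is None:
        return ["metrics_unavailable"]
    breaches: list[str] = []
    if snapshot.drawdown_pct > limits.max_drawdown_pct:
        breaches.append("drawdown")
    if snapshot.open_trades > limits.max_open_trades:
        breaches.append("open_trades")
    if snapshot.leverage > limits.max_leverage:
        breaches.append("leverage")
    return breaches


limits = Limits(max_drawdown_pct=10.0, max_open_trades=3, max_leverage=2.0)
healthy = evaluate(limits, Snapshot(drawdown_pct=4.2, open_trades=2, leverage=1.0))
breached = evaluate(limits, Snapshot(drawdown_pct=12.5, open_trades=4, leverage=1.0))
blind = evaluate(limits, None)
```

Returning reasons rather than a boolean lets the caller include the specific breach in the CRITICAL alert and in the DynamoDB status update.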

┌──────────────────────────────────────────────────────────────────────────────┐
│                            Health Monitoring Flow                            │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Strategy Container                                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │  Health Reporter (every 30s)                                         │    │
│  │  ├── Collect metrics from Freqtrade API (H1)                         │    │
│  │  ├── Update DynamoDB heartbeat + metric snapshots                    │    │
│  │  ├── Evaluate risk limits (C3: drawdown, trades, leverage)           │    │
│  │  └── On breach: pause/stop + CRITICAL alert                          │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│                       ┌─────────────┐                                        │
│                       │  DynamoDB   │                                        │
│                       │  + CW       │                                        │
│                       └──────┬──────┘                                        │
│                              │                                               │
│                              ▼                                               │
│  EventBridge Rule (every 5 min)                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │  Trigger: rate(5 minutes)                                            │    │
│  │  Target: Lambda trading-heartbeat-check                              │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼                                               │
│  Lambda: trading-heartbeat-check                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │  • Query DynamoDB for status="running" AND workflow_type="live"      │    │
│  │  • Check last_heartbeat < 3 minutes ago                              │    │
│  │  • If stale: publish to SNS alert topic                              │    │
│  │  • Update CloudWatch custom metrics                                  │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                              │                                               │
│                              ▼ (if unhealthy)                                │
│                       ┌─────────────┐                                        │
│                       │    SNS      │──────▶ PagerDuty / Slack / Email       │
│                       │   Alert     │                                        │
│                       └─────────────┘                                        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
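
The staleness check performed by the heartbeat Lambda can be sketched as a pure function over session records. This is an illustration, not the handler in lambdas/trading-heartbeat-check; the function names are hypothetical and the records are plain dicts shaped like the session schema in Section 6.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=3)


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps of the form 2025-12-08T14:30:00Z as UTC."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )


def find_stale_sessions(sessions: list[dict], now: datetime) -> list[str]:
    """Return run_ids of running live sessions whose heartbeat is too old.

    A running live session without any heartbeat is also reported.
    """
    stale: list[str] = []
    for session in sessions:
        if session.get("status") != "running":
            continue
        if session.get("workflow_type") != "live":
            continue
        heartbeat = session.get("last_heartbeat")
        if heartbeat is None or now - parse_utc(heartbeat) > STALE_AFTER:
            stale.append(session["run_id"])
    return stale


now = parse_utc("2025-12-08T14:34:00Z")
sessions = [
    {"run_id": "live-pascal-20251208-001", "status": "running",
     "workflow_type": "live", "last_heartbeat": "2025-12-08T14:30:00Z"},
    {"run_id": "live-momentum-20251208-001", "status": "running",
     "workflow_type": "live", "last_heartbeat": "2025-12-08T14:32:30Z"},
    {"run_id": "backtest-pascal-20251208-abc123", "status": "running",
     "workflow_type": "backtest"},
]
stale_run_ids = find_stale_sessions(sessions, now)
```

Because the rule fires every 5 minutes and the threshold is 3 minutes, a container that stops reporting may go unnoticed for up to roughly 8 minutes under this scheme.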

CloudWatch Alarms

Live Trading Alarms:
  - Name: live-trading-heartbeat-stale
    Metric: LiveTradingHeartbeatAge
    Threshold: > 180 seconds
    Period: 60 seconds
    EvaluationPeriods: 3
    Action: SNS Alert

  - Name: live-trading-exchange-disconnect
    Metric: LiveTradingExchangeConnected
    Threshold: < 1
    Period: 60 seconds
    EvaluationPeriods: 2
    Action: SNS Alert

  - Name: live-trading-high-loss
    Metric: LiveTradingPnLDaily
    Threshold: < -500  # USD
    Period: 300 seconds
    EvaluationPeriods: 1
    Action: SNS Alert + Auto-pause
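
How Threshold and EvaluationPeriods interact can be shown with a simplified model: the alarm fires when the most recent N datapoints all breach the threshold. The sketch ignores CloudWatch's missing-data handling and the separate "datapoints to alarm" setting, and `alarm_fires` is a hypothetical helper rather than an AWS API.

```python
from typing import Sequence


def alarm_fires(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    comparison: str = ">",
) -> bool:
    """True when the last `evaluation_periods` datapoints all breach."""
    if comparison not in (">", "<"):
        raise ValueError(f"Unsupported comparison: {comparison!r}")
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = list(datapoints)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # simplification: not enough data means no alarm
    if comparison == ">":
        return all(value > threshold for value in recent)
    return all(value < threshold for value in recent)


# live-trading-heartbeat-stale: > 180 seconds for 3 consecutive periods
heartbeat_age_seconds = [40.0, 200.0, 240.0, 300.0]
heartbeat_alarm = alarm_fires(heartbeat_age_seconds, 180, 3)

# live-trading-high-loss: < -500 USD for 1 period
daily_pnl_usd = [-120.0, -510.0]
loss_alarm = alarm_fires(daily_pnl_usd, -500, 1, comparison="<")
```

A single recovered datapoint inside the window resets the condition, which is why the heartbeat alarm uses three periods while the loss alarm reacts after one.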

9. ECS Service Configuration

Live Trading Service Definition

# ECS Service for Live Trading (NOT Spot)
Resource: AWS::ECS::Service
  ServiceName: tradai-live-pascal
  Cluster: tradai-cluster
  LaunchType: FARGATE  # NOT Spot for live trading
  DesiredCount: 1
  DeploymentConfiguration:
    MaximumPercent: 100
    MinimumHealthyPercent: 0  # Allow full replacement (see warning below)
  TaskDefinition: tradai-strategy-pascal
  NetworkConfiguration:
    AwsvpcConfiguration:
      Subnets:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroups:
        - !Ref StrategySecurityGroup
  EnableExecuteCommand: true  # For debugging

Zero-container window during deployment

With MinimumHealthyPercent: 0, there will be a window with zero running trading containers during deployment. For strategies with open positions, coordinate deployments during market off-hours or implement a graceful handoff mechanism.

Task Definition

Resource: AWS::ECS::TaskDefinition
  Family: tradai-strategy-pascal
  Cpu: 1024
  Memory: 2048
  NetworkMode: awsvpc
  RequiresCompatibilities:
    - FARGATE
  ExecutionRoleArn: !Ref ECSExecutionRole
  TaskRoleArn: !Ref StrategyTaskRole
  ContainerDefinitions:
    - Name: strategy
      Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-pascal:2.0.0
      Essential: true
      Environment:
        # MINIMAL env vars - config loaded from MLflow
        - Name: STRATEGY
          Value: PascalStrategy
        - Name: STRATEGY_ID
          Value: pascal-btc  # Unique strategy identifier for trading state
        - Name: TRADING_MODE
          Value: live
        # Infrastructure (rarely changes)
        - Name: MLFLOW_TRACKING_URI
          Value: http://mlflow.tradai-{env}.local:5000/mlflow
        - Name: DYNAMODB_TABLE
          Value: tradai-workflow-state
        - Name: ARCTICDB_S3_URI
          Value: !Sub s3://${ArcticDBBucket}
        - Name: S3_CONFIG_BUCKET
          Value: !Ref ConfigsBucket
      Secrets:
        # Secrets stay in Secrets Manager (not MLflow)
        - Name: EXCHANGE_API_KEY
          ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-key
        - Name: EXCHANGE_API_SECRET
          ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-secret
      LogConfiguration:
        LogDriver: awslogs
        Options:
          awslogs-group: /ecs/tradai-live-pascal
          awslogs-region: !Ref AWS::Region
          awslogs-stream-prefix: strategy
      HealthCheck:
        Command:
          - CMD-SHELL
          - curl -f http://localhost:8004/api/v1/health || exit 1
        Interval: 30
        Timeout: 5
        Retries: 3
        StartPeriod: 120

# Note: All strategy-specific config (pairs, timeframe, warmup_days, etc.)
# is loaded at runtime from MLflow tags + S3, NOT from env vars.
# To change config: update MLflow model version, NOT task definition.

Auto Scaling (Optional)

Do NOT auto-scale live trading containers

CPU-based auto-scaling for live trading strategies will create DUPLICATE trading instances placing conflicting orders. Auto-scaling should only apply to the backtest/strategy-service containers, NOT to live trading containers. Live trading uses desired_count: 1 (or 0) per strategy -- never auto-scaled.

For strategies that benefit from horizontal scaling (backtest/strategy-service only):

Resource: AWS::ApplicationAutoScaling::ScalableTarget
  ServiceNamespace: ecs
  ResourceId: service/tradai-cluster/{strategy-service-name}  # backtest/strategy-service only, never a tradai-live-* service
  ScalableDimension: ecs:service:DesiredCount
  MinCapacity: 1
  MaxCapacity: 5

Resource: AWS::ApplicationAutoScaling::ScalingPolicy
  PolicyName: cpu-scaling
  PolicyType: TargetTrackingScaling
  ScalingTargetId: !Ref ScalableTarget
  TargetTrackingScalingPolicyConfiguration:
    PredefinedMetricSpecification:
      PredefinedMetricType: ECSServiceAverageCPUUtilization
    TargetValue: 70
    ScaleInCooldown: 300
    ScaleOutCooldown: 60

10. Cost Analysis

Per-Strategy Costs (Live Trading)

| Component | Specification | Monthly Cost |
|---|---|---|
| ECS Fargate | 1 vCPU, 2GB, 24/7 | $30.66 |
| CloudWatch Logs | ~1GB/month | $0.50 |
| DynamoDB | On-demand, ~1M reads | $0.25 |
| Secrets Manager | 2 secrets | $0.80 |
| EventBridge | 8,640 invocations | ~$0.01 |
| Lambda (health) | 8,640 x 256MB x 1s | $0.22 |
| SNS | ~100 alerts | $0.01 |
| Subtotal per strategy | | $32.45 |
| Reserve (exchange fees, data) | | +$29 |
| Total per strategy | | ~$61 |

Platform Total (3 Strategies)

| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $122-128 |
| Pascal Strategy (live) | $61 |
| Momentum Strategy (dry-run) | $61 |
| ML Trend Strategy (live) | $61 |
| Platform Total | $305-311 |
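
The totals in both cost tables can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated above; the variable and function names are illustrative.

```python
from decimal import Decimal

# Per-strategy line items from the "Per-Strategy Costs" table (USD/month).
per_strategy_items = {
    "ECS Fargate": Decimal("30.66"),
    "CloudWatch Logs": Decimal("0.50"),
    "DynamoDB": Decimal("0.25"),
    "Secrets Manager": Decimal("0.80"),
    "EventBridge": Decimal("0.01"),
    "Lambda (health)": Decimal("0.22"),
    "SNS": Decimal("0.01"),
}

subtotal = sum(per_strategy_items.values(), Decimal("0"))
reserve = Decimal("29")
per_strategy_total = subtotal + reserve  # the table rounds this to ~$61


def platform_total(
    base_low: int, base_high: int, strategies: int, per_strategy: int = 61
) -> tuple[int, int]:
    """Low and high monthly platform cost for a given strategy count."""
    extra = strategies * per_strategy
    return base_low + extra, base_high + extra


three_strategy_range = platform_total(122, 128, strategies=3)
```

The same function gives a quick estimate for other fleet sizes, for example `platform_total(122, 128, strategies=5)`, assuming the per-strategy figure stays at $61.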

11. Security Controls

Exchange Credentials

# Secrets Manager Configuration
Secret: tradai/exchange/api-key
  Description: Binance API key for live trading
  Rotation: Manual (exchange limitation)
  Access: StrategyTaskRole only

Secret: tradai/exchange/api-secret
  Description: Binance API secret
  Rotation: Manual
  Access: StrategyTaskRole only

# IAM Policy for Strategy Task Role
Policy:
  Effect: Allow
  Action:
    - secretsmanager:GetSecretValue
  Resource:
    - arn:aws:secretsmanager:*:*:secret:tradai/exchange/*
  Condition:
    StringEquals:
      aws:ResourceTag/Environment: !Ref Environment

Network Security

# Strategy Security Group
Inbound:
  - Port 8004 (health check) from ALB only
  - No direct internet access

Outbound:
  - Port 443 to exchange APIs (Binance)
  - Port 443 to S3 (VPC endpoint)
  - Port 443 to DynamoDB (VPC endpoint)
  - Port 5000 to MLflow (internal)

Audit Logging

CloudTrail Events:
  - ECS RunTask / StopTask
  - Secrets Manager GetSecretValue
  - DynamoDB PutItem / UpdateItem

CloudWatch Logs:
  - /ecs/tradai-live-{strategy}
  - Retention: 30 days
  - Export to S3: After 7 days

12. Implementation Checklist (Phase 6)

Week 17-18: Live Trading Foundation

  • Update Strategy Container Dockerfile for multi-mode support
  • Implement ArcticDB warmup loader
  • Implement DynamoDB state manager for live sessions
  • Implement health reporter component
  • Create ECS Service definitions (non-Spot)
  • Configure Secrets Manager for exchange credentials

Week 19-20: MLflow Integration

  • Update MLflow adapter for model serving
  • Implement model loading in FreqAI strategies
  • Create model promotion workflow (Staging → Production)
  • Test model versioning with live trading

Week 21-22: Monitoring & Operations

  • Deploy EventBridge + Lambda health monitoring
  • Configure CloudWatch alarms
  • Set up SNS alerting
  • Create operational runbooks
  • Perform dry-run validation
  • Go-live with first strategy

13. Document Changelog

| Version | Date | Changes |
|---|---|---|
| 9.1.1 | 2026-03-28 | Added section numbers, Mermaid architecture diagram, status admonitions |
| 9.1 | 2025-12-09 | MLflow-centric configuration: minimal env vars, config from MLflow tags + S3, leverages existing StrategyMetadata, MLflowAdapter, ConfigMergeService patterns |
| 9.0 | 2025-12-08 | Initial live trading architecture document |

Dependencies

| If This Changes | Update This Doc |
|---|---|
| infra/shared/tradai_infra_shared/config.py (live-trading service) | Section 9 (ECS Service Configuration) |
| libs/tradai-common/src/tradai/common/entrypoint/trading.py | Section 4 (Container Initialization Flow), Section 8 (Health Monitoring) |
| lambdas/trading-heartbeat-check/handler.py | Section 8 (Health Monitoring) |
| libs/tradai-common/src/tradai/common/mlflow/adapter.py | Section 5 (MLflow Model Lifecycle) |
| DynamoDB schema changes | Section 6 (Unified DynamoDB State) |

  • 12-ML-LIFECYCLE.md - Complete ML strategy lifecycle including training pipeline, hyperparameter optimization, retraining scheduling, drift detection, and model rollback. Read this for the full MLOps story.

Next Action: Begin Phase 6 implementation after completing Phases 1-5 of the backtesting foundation.