TradAI Live Trading Architecture
Version: 9.1 (Live Trading Integration)
Date: 2025-12-09
Status: READY FOR IMPLEMENTATION
Depends On: Architecture v9.2.1 (Backtesting Foundation)
Executive Summary
This document extends TradAI architecture v8.1 to support live trading. The design follows six key principles:
- Unified Strategy Container - Single container per strategy handles all modes (backtest, hyperopt, dry-run, live)
- MLflow-Centric Configuration - Strategy config loaded from MLflow tags + S3, NOT hardcoded env vars
- MLflow Model Lifecycle - FreqAI trains during backtest, MLflow serves models for live trading
- Direct ArcticDB Access - Live trading reads S3 directly, decoupled from data-collection service
- Self-Managed Sessions - Strategy containers manage their own DynamoDB state
- ECS Services for Live - Always-on services (NOT Spot) for production trading
Cost Impact
| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $78-99 |
| Per Live Strategy | +$46 |
| Total (3 strategies) | $216-237 |
System Architecture v9.0
┌─────────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Public Subnet ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ API GW │ │ ALB │ │ NAT Instance │ ││
│ │ │ + Cognito │ │ │ │ │ ││
│ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ ││
│ └─────────┼─────────────────┼─────────────────┼──────────┘│
│ │ │ │ │
│ ┌─────────┼─────────────────┼─────────────────┼──────────┐│
│ │ │ Private Subnet │ ││
│ │ ▼ ▼ ▼ ││
│ │ ┌─────────────────────────────────────────────────┐ ││
│ │ │ ECS Fargate Cluster │ ││
│ │ │ ┌─────────────────────────────────────────┐ │ ││
│ │ │ │ Backend Service (Always-on) │ │ ││
│ │ │ │ • REST API endpoints │ │ ││
│ │ │ │ • Strategy lifecycle management │ │ ││
│ │ │ │ • Session control (start/stop/pause) │ │ ││
│ │ │ └─────────────────────────────────────────┘ │ ││
│ │ │ │ ││
│ │ │ ┌─────────────────────────────────────────┐ │ ││
│ │ │ │ Strategy Service (On-demand) │ │ ││
│ │ │ │ • Backtest orchestration │ │ ││
│ │ │ │ • Config management │ │ ││
│ │ │ │ • MLflow logging │ │ ││
│ │ │ └─────────────────────────────────────────┘ │ ││
│ │ │ │ ││
│ │ │ ┌─────────────────────────────────────────┐ │ ││
│ │ │ │ Strategy Containers (Per Strategy) │ │ ││
│ │ │ │ ┌───────────────┐ ┌───────────────┐ │ │ ││
│ │ │ │ │ pascal:2.0.0 │ │ momentum:1.0 │ │ │ ││
│ │ │ │ │ LIVE MODE │ │ DRY-RUN MODE │ │ │ ││
│ │ │ │ │ ECS Service │ │ ECS Service │ │ │ ││
│ │ │ │ └───────────────┘ └───────────────┘ │ │ ││
│ │ │ │ ┌───────────────┐ ┌───────────────┐ │ │ ││
│ │ │ │ │ ml-trend:3.1 │ │ Backtest Task │ │ │ ││
│ │ │ │ │ LIVE MODE │ │ BACKTEST MODE │ │ │ ││
│ │ │ │ │ ECS Service │ │ ECS Task/Spot │ │ │ ││
│ │ │ │ └───────────────┘ └───────────────┘ │ │ ││
│ │ │ └─────────────────────────────────────────┘ │ ││
│ │ │ │ ││
│ │ │ ┌─────────────────────────────────────────┐ │ ││
│ │ │ │ Data Collection (Scheduled Tasks) │ │ ││
│ │ │ │ • Historical data fetch │ │ ││
│ │ │ │ • ArcticDB write │ │ ││
│ │ │ └─────────────────────────────────────────┘ │ ││
│ │ │ │ ││
│ │ │ ┌─────────────────────────────────────────┐ │ ││
│ │ │ │ MLflow (Always-on) │ │ ││
│ │ │ │ • Strategy versioning │ │ ││
│ │ │ │ • ML model registry │ │ ││
│ │ │ │ • Experiment tracking │ │ ││
│ │ │ └─────────────────────────────────────────┘ │ ││
│ │ └────────────────────────────────────────────────┘ ││
│ └────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────┐│
│ │ Data Layer ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ ArcticDB │ │ DynamoDB │ │ S3 │ │ RDS │ ││
│ │ │ (S3) │ │ State │ │ Configs │ │ Postgres │ ││
│ │ │ │ │ + Live │ │ Results │ │ │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ └────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────┐│
│ │ Monitoring & Events ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │EventBrdge│ │ Lambda │ │CloudWatch│ │ SNS │ ││
│ │ │ Rules │ │ Health │ │ Alarms │ │ Alerts │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ └────────────────────────────────────────────────────────┘│
└────────────────────────────────────────────────────────────┘
External:
┌──────────────┐ ┌──────────────┐
│ Binance │ │ CCXT │
│ Exchange │ │ (Futures) │
└──────────────┘ └──────────────┘
Unified Strategy Container
Container Structure
Each strategy is packaged as a single Docker image that supports all trading modes:
tradai-{strategy-name}:{version}
├── Freqtrade runtime
├── Strategy code (IStrategy implementation)
├── FreqAI models (for ML strategies)
├── ArcticDB warmup loader
├── DynamoDB state manager
├── Health reporter
└── Mode configuration
Trading Modes
| Mode | Description | Launch Type | Spot OK? |
|---|---|---|---|
| backtest | Historical simulation | ECS Task | Yes |
| hyperopt | Parameter optimization | ECS Task | Yes |
| dry-run | Paper trading (no real orders) | ECS Service | No |
| live | Production trading | ECS Service | No |
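The mode table can be captured in code so that deployment tooling never launches a trading mode on Spot capacity by accident. The following is a minimal sketch; the `TradingMode` enum and its `launch_type` and `spot_ok` properties are hypothetical names, not part of the existing TradAI codebase.

```python
from enum import Enum


class TradingMode(str, Enum):
    """Trading modes from the table above (hypothetical helper)."""

    BACKTEST = "backtest"
    HYPEROPT = "hyperopt"
    DRY_RUN = "dry-run"
    LIVE = "live"

    @property
    def launch_type(self) -> str:
        """Batch modes run as one-off ECS tasks; trading modes as ECS services."""
        return "ECS Task" if self in _BATCH_MODES else "ECS Service"

    @property
    def spot_ok(self) -> bool:
        """Only interruptible batch work may run on Spot capacity."""
        return self in _BATCH_MODES


# Defined after the class because it refers to the enum members.
_BATCH_MODES = frozenset({TradingMode.BACKTEST, TradingMode.HYPEROPT})
```

A launcher could then call `TradingMode(os.environ["TRADING_MODE"]).spot_ok` before choosing a capacity provider.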
Environment Variables (Minimal)
The container uses minimal environment variables - just enough to locate the strategy in MLflow. All other configuration is loaded from MLflow tags and S3.
# REQUIRED: Identify strategy and trading state
# Issue 10 Fix: Changed from STRATEGY_NAME/STRATEGY_STAGE to match actual code
MLFLOW_TRACKING_URI: "http://mlflow.internal:5000"
STRATEGY: "PascalStrategy" # Strategy class name (FREQTRADE_STRATEGY as fallback)
STRATEGY_ID: "pascal-btc" # Unique strategy identifier for trading state
# REQUIRED: Trading mode
TRADING_MODE: "live" # backtest | hyperopt | dry-run | live | train
# REQUIRED: Secrets (must stay in Secrets Manager)
EXCHANGE_SECRET_NAME: "tradai/exchange/binance" # AWS Secrets Manager secret
# INFRASTRUCTURE (rarely changes)
DYNAMODB_TABLE: "tradai-workflow-state"
ARCTICDB_S3_URI: "s3://tradai-arcticdb-prod"
Why minimal env vars?
- Single source of truth: Config lives in MLflow, not scattered across deployments
- Version tracking: Config changes are tied to strategy versions
- Audit trail: MLflow tracks who changed what and when
- Environment parity: Same container image, different config via MLflow stage
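Because the container depends on only a handful of variables, it can validate them once at startup and fail fast. The sketch below reads the variables listed above; the `BootstrapEnv` dataclass and `load_bootstrap_env` function are hypothetical names introduced for illustration, and the fallback order for `STRATEGY` is an assumption based on the comment in the variable list.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Values accepted for TRADING_MODE, as listed in the variable comments above.
VALID_MODES = {"backtest", "hyperopt", "dry-run", "live", "train"}


@dataclass(frozen=True)
class BootstrapEnv:
    mlflow_tracking_uri: str
    strategy: str
    strategy_id: str
    trading_mode: str
    exchange_secret_name: str
    dynamodb_table: str
    arcticdb_s3_uri: str


def load_bootstrap_env(env: Mapping[str, str] = os.environ) -> BootstrapEnv:
    """Read the minimal variables and fail fast on anything missing or invalid."""
    # Assumption: STRATEGY takes precedence, FREQTRADE_STRATEGY is the fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    if not strategy:
        raise KeyError("STRATEGY (or FREQTRADE_STRATEGY) must be set")

    mode = env["TRADING_MODE"]
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported TRADING_MODE: {mode!r}")

    return BootstrapEnv(
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
        strategy=strategy,
        strategy_id=env["STRATEGY_ID"],
        trading_mode=mode,
        exchange_secret_name=env["EXCHANGE_SECRET_NAME"],
        dynamodb_table=env["DYNAMODB_TABLE"],
        arcticdb_s3_uri=env["ARCTICDB_S3_URI"],
    )
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.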
MLflow-Centric Configuration
Design Principle
Strategy configuration flows from MLflow, NOT from environment variables:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Config Loading Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Container starts with minimal env vars: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ STRATEGY_NAME="PascalStrategy" │ │
│ │ STRATEGY_STAGE="Production" │ │
│ │ MLFLOW_TRACKING_URI="http://mlflow.internal:5000" │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 1: Query MLflow for model version │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ MLflowAdapter.get_model_version(name, stage) │ │
│ │ → Returns ModelVersion with tags: │ │
│ │ • strategy_version: "2.0.0" │ │
│ │ • timeframe: "1h" │ │
│ │ • configuration_file: "strategies/pascal/config.json" │ │
│ │ • warmup_days: "30" │ │
│ │ • pairs: "BTC/USDT:USDT,ETH/USDT:USDT" │ │
│ │ • category: "mean_reversion" │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 2: Load full config from S3 (via tag reference) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ConfigMergeService.load_config(config_s3_path) │ │
│ │ → Returns full Freqtrade config with all parameters │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 3: Apply runtime overrides (optional) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ConfigMergeService.apply_overrides(base, overrides) │ │
│ │ → OmegaConf deep merge for any session-specific changes │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Final: Validated strategy config ready for Freqtrade │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
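Step 3 relies on a deep merge, where nested dictionaries are combined key by key and any other value in the overrides replaces the base value. The sketch below is a plain-Python stand-in that illustrates those semantics; it is not the OmegaConf-based `ConfigMergeService.apply_overrides`, whose exact behaviour may differ, and `deep_merge` is a hypothetical name.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict: nested dicts are merged, everything else is replaced."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the overrides replace the base value outright.
            merged[key] = deepcopy(value)
    return merged


base = {
    "stake_amount": 100,
    "exchange": {"name": "binance", "pair_whitelist": ["BTC/USDT:USDT"]},
}
overrides = {
    "exchange": {"pair_whitelist": ["ETH/USDT:USDT"]},
    "max_open_trades": 1,
}
merged = deep_merge(base, overrides)
# merged keeps exchange.name from the base, takes pair_whitelist from the
# overrides, and gains max_open_trades; base itself is left unmodified.
```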
Integration with Existing Patterns
This design leverages existing TradAI components:
| Component | File | Purpose |
|---|---|---|
| StrategyMetadata | libs/tradai-strategy/src/tradai/strategy/metadata.py | Defines metadata schema with to_mlflow_tags() |
| MLflowAdapter | libs/tradai-common/src/tradai/common/mlflow_adapter.py | REST API wrapper for MLflow |
| ConfigMergeService | libs/tradai-common/src/tradai/common/config_service.py | S3 config loading + OmegaConf merge |
| S3ConfigRepository | libs/tradai-common/src/tradai/common/aws/s3_config_repository.py | S3 config storage |
Config Loading Implementation
from tradai.common.mlflow_adapter import MLflowAdapter
from tradai.common.config_service import ConfigMergeService
from tradai.common.aws.s3_config_repository import S3ConfigRepository


class StrategyConfigLoader:
    """Loads strategy configuration from MLflow and S3."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        config_service: ConfigMergeService,
    ):
        self.mlflow = mlflow_adapter
        self.config_service = config_service

    def load_config(
        self,
        strategy_name: str,
        stage: str = "Production",
        runtime_overrides: dict | None = None,
    ) -> dict:
        """Load complete strategy config from MLflow + S3.

        Args:
            strategy_name: Name of strategy in MLflow registry
            stage: MLflow stage (Production, Staging, None)
            runtime_overrides: Optional overrides for this session

        Returns:
            Complete validated strategy configuration
        """
        # 1. Get model version from MLflow (includes all tags)
        model_version = self.mlflow.get_model_version(
            name=strategy_name,
            stage=stage,
        )

        # 2. Extract config metadata from tags
        tags = {t["key"]: t["value"] for t in model_version.tags}
        config_s3_path = tags["configuration_file"]

        # 3. Build strategy metadata from tags
        strategy_config = {
            "strategy_name": strategy_name,
            "strategy_version": tags["strategy_version"],
            "timeframe": tags["timeframe"],
            "warmup_days": int(tags.get("warmup_days", "30")),
            "pairs": tags["pairs"].split(","),
            "can_short": tags.get("can_short", "false").lower() == "true",
        }

        # 4. Load full Freqtrade config from S3
        freqtrade_config = self.config_service.load_config(config_s3_path)

        # 5. Merge strategy metadata into config
        full_config = {
            **freqtrade_config,
            "tradai": strategy_config,
        }

        # 6. Apply runtime overrides if provided
        if runtime_overrides:
            full_config = self.config_service.apply_overrides(
                base=full_config,
                overrides=runtime_overrides,
            )

        return full_config
MLflow Tag Schema (StrategyMetadata.to_mlflow_tags())
When a strategy is registered, StrategyMetadata.to_mlflow_tags() stores:
# Core identity (from StrategyMetadata)
strategy_name: "PascalStrategy"
strategy_version: "2.0.0"
timeframe: "1h"
category: "mean_reversion"
author: "TradAI Team"
status: "production" # testing | staging | production
# Trading config
can_short: "true"
pairs: "BTC/USDT:USDT,ETH/USDT:USDT"
stake_amount: "100"
max_open_trades: "3"
# Data requirements
warmup_days: "30"
data_format: "ohlcv"
# Config file reference (S3 path)
configuration_file: "strategies/pascal/v2.0.0/config.json"
# ML model reference (for FreqAI strategies)
freqai_model: "models:/PascalStrategy-FreqAI/Production"
# Docker image reference
ecr_url: "123456789.dkr.ecr.us-east-1.amazonaws.com/tradai-pascal:2.0.0"
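MLflow stores every tag value as a string, so consumers have to convert numbers, booleans, and the comma-separated pair list back into typed values. The sketch below shows one way to do that for a subset of the tags above; `ParsedTags` and `parse_tags` are hypothetical names, and the defaults mirror those used in the config loader rather than any confirmed schema rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedTags:
    strategy_version: str
    timeframe: str
    pairs: list[str]
    can_short: bool
    warmup_days: int
    stake_amount: float
    max_open_trades: int
    configuration_file: str


def parse_tags(tags: dict[str, str]) -> ParsedTags:
    """Convert string-valued MLflow tags into typed fields."""
    return ParsedTags(
        strategy_version=tags["strategy_version"],
        timeframe=tags["timeframe"],
        # Tolerate stray whitespace and trailing commas in the pair list.
        pairs=[p.strip() for p in tags["pairs"].split(",") if p.strip()],
        can_short=tags.get("can_short", "false").strip().lower() == "true",
        warmup_days=int(tags.get("warmup_days", "30")),
        stake_amount=float(tags["stake_amount"]),
        max_open_trades=int(tags["max_open_trades"]),
        configuration_file=tags["configuration_file"],
    )
```

Parsing in one place means a malformed tag raises at startup instead of surfacing later inside the trading loop.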
Container Initialization Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ Container Startup Sequence (MLflow-First) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Bootstrap (from minimal env vars) │
│ ├── Read STRATEGY_NAME, STRATEGY_STAGE from environment │
│ ├── Read TRADING_MODE from environment │
│ └── Initialize MLflowAdapter with MLFLOW_TRACKING_URI │
│ │
│ 2. Load Config from MLflow + S3 │
│ ├── MLflowAdapter.get_model_version(name, stage) │
│ │ └── Returns ModelVersion with all metadata tags │
│ ├── Extract config_s3_path from tags["configuration_file"] │
│ ├── ConfigMergeService.load_config(config_s3_path) │
│ │ └── Returns full Freqtrade config from S3 │
│ └── Apply runtime overrides (from DynamoDB session if resuming) │
│ │
│ 3. Initialize State (DynamoDB) │
│ ├── Create session record with config snapshot │
│ ├── Set status = "initializing" │
│ └── Record strategy_version, config_hash, start_timestamp │
│ │
│ 4. Load ML Models (if FreqAI strategy) │
│ ├── Extract freqai_model URI from tags │
│ ├── MLflowAdapter.get_model_version(freqai_model) │
│ └── Load model artifacts for inference (NO retraining) │
│ │
│ 5. Warmup Data (ArcticDB) │
│ ├── Extract warmup_days, pairs, timeframe from config │
│ ├── Connect to S3-backed ArcticDB │
│ └── Load historical data → populate Freqtrade dataframes │
│ │
│ 6. Start Freqtrade │
│ ├── Initialize exchange connection (credentials from Secrets Manager) │
│ ├── Start strategy loop │
│ └── Update DynamoDB status = "running" │
│ │
│ 7. Health Reporting Loop (async) │
│ ├── Every 60 seconds: update DynamoDB last_heartbeat │
│ ├── Report metrics to CloudWatch (PnL, trades, latency) │
│ └── Check exchange connectivity │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Startup Code Example
import logging
import os

from tradai.common.mlflow_adapter import MLflowAdapter
from tradai.common.config_service import ConfigMergeService

# StrategyConfigLoader is the class shown under "Config Loading Implementation";
# import it from wherever it lives in the container package.

logger = logging.getLogger(__name__)


def main():
    # 1. Bootstrap from minimal env vars
    strategy_name = os.environ["STRATEGY_NAME"]
    strategy_stage = os.environ.get("STRATEGY_STAGE", "Production")
    trading_mode = os.environ["TRADING_MODE"]
    mlflow_uri = os.environ["MLFLOW_TRACKING_URI"]

    # 2. Initialize adapters
    mlflow = MLflowAdapter(base_url=mlflow_uri)
    config_service = ConfigMergeService(bucket="tradai-configs")
    config_loader = StrategyConfigLoader(mlflow, config_service)

    # 3. Load complete config from MLflow + S3
    config = config_loader.load_config(
        strategy_name=strategy_name,
        stage=strategy_stage,
    )

    # 4. Extract runtime parameters (all from config, NOT env vars)
    warmup_days = config["tradai"]["warmup_days"]
    pairs = config["tradai"]["pairs"]
    timeframe = config["tradai"]["timeframe"]

    # 5. Initialize state, warmup data, start trading...
    logger.info(f"Starting {strategy_name} v{config['tradai']['strategy_version']}")
    logger.info(f"Mode: {trading_mode}, Pairs: {pairs}, Timeframe: {timeframe}")


if __name__ == "__main__":
    main()
MLflow Model Lifecycle
Training Phase (Backtest Mode)
During backtesting with FreqAI strategies:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FreqAI Training Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Backtest Run │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ FreqAI │────▶│ Train │────▶│ Save to │ │
│ │ Strategy │ │ Models │ │ MLflow │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ MLflow │ │
│ │ Registry │ │
│ │ │ │
│ │ • Staging │ │
│ │ • Prod │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Inference Phase (Live Mode)
During live trading:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MLflow Inference Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Container Start │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Load │────▶│ MLflow │────▶│ Cached │ │
│ │ Model │ │ Client │ │ Model │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ FreqAI │ │
│ │ Predict │ │
│ │ (No Train) │ │
│ └─────────────┘ │
│ │
│ Key: FreqAI does NOT retrain during live trading. │
│ Models are loaded once from MLflow and used for inference only. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Model Registry Schema
MLflow Model Registry
├── PascalStrategy
│ ├── Version 1 (Staging) - trained 2025-12-01
│ ├── Version 2 (Production) - trained 2025-12-05
│ └── Version 3 (None) - training 2025-12-08
├── MomentumStrategy
│ └── Version 1 (Production) - trained 2025-12-03
└── MLTrendStrategy
├── Version 1 (Archived) - trained 2025-11-15
└── Version 2 (Production) - trained 2025-12-06
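Entries in this registry are addressed through the `freqai_model` tag, which holds a URI of the form `models:/<name>/<stage-or-version>`. The sketch below splits such a URI into its parts so a container can log exactly which registry entry it is about to load; `ModelRef` and `parse_model_uri` are hypothetical helpers, and alias-style URIs (`models:/<name>@<alias>`) are deliberately not handled.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    name: str
    selector: str      # a stage name such as "Production", or a version number
    is_version: bool   # True when the selector is a numeric version


def parse_model_uri(uri: str) -> ModelRef:
    """Split 'models:/<name>/<stage-or-version>' into its components."""
    prefix = "models:/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a model registry URI: {uri!r}")
    # rpartition keeps any '/' inside the model name on the name side.
    name, sep, selector = uri[len(prefix):].rpartition("/")
    if not sep or not name or not selector:
        raise ValueError(f"Expected models:/<name>/<stage-or-version>, got {uri!r}")
    return ModelRef(name=name, selector=selector, is_version=selector.isdigit())


ref = parse_model_uri("models:/PascalStrategy-FreqAI/Production")
# ref.name is "PascalStrategy-FreqAI", ref.selector is "Production"
```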
Unified DynamoDB State
Table Schema
Single table tradai-workflow-state handles both backtest workflows AND live trading sessions:
Table: tradai-workflow-state
Partition Key: workflow_id (String)
Sort Key: None
# Backtest Workflow Record
{
"workflow_id": "backtest-pascal-20251208-abc123",
"workflow_type": "backtest",
"strategy_name": "PascalStrategy",
"strategy_version": "2.0.0",
"status": "completed",
"started_at": "2025-12-08T10:00:00Z",
"completed_at": "2025-12-08T10:15:00Z",
"config_s3_path": "s3://tradai-configs/backtest/...",
"result_s3_path": "s3://tradai-results/backtest/...",
"ttl": 1767780900 # completed_at + 30 days, in epoch seconds
}
# Live Trading Session Record
{
"workflow_id": "live-pascal-20251208-001",
"workflow_type": "live",
"strategy_name": "PascalStrategy",
"strategy_version": "2.0.0",
"status": "running",
"trading_mode": "live",
"started_at": "2025-12-08T00:00:00Z",
"last_heartbeat": "2025-12-08T14:30:00Z",
"exchange": "binance",
"pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
"metrics": {
"total_trades": 45,
"win_rate": 0.62,
"pnl_today": 125.50,
"pnl_total": 1250.00
},
"health": {
"cpu_percent": 35,
"memory_mb": 512,
"exchange_connected": true,
"last_trade_at": "2025-12-08T14:25:00Z"
}
# No TTL for live sessions - persist until explicitly deleted
}
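DynamoDB TTL expects a Number attribute holding an epoch timestamp in seconds, so the 30-day retention for backtest records has to be computed from `completed_at`. A minimal sketch of that arithmetic is shown below; `ttl_epoch` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch(completed_at: str, retention_days: int = 30) -> int:
    """Return the expiry time (epoch seconds) for a completed workflow record."""
    finished = datetime.strptime(completed_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return int((finished + timedelta(days=retention_days)).timestamp())


expiry = ttl_epoch("2025-12-08T10:15:00Z")
print(expiry)  # 1767780900, i.e. 2026-01-07T10:15:00Z
```

Live session records simply omit the attribute, which keeps them out of TTL expiry.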
GSI for Status Queries
GSI: status-index
Partition Key: status (String)
Sort Key: started_at (String)
# Query: Get all running live sessions
# Key condition (served by the index): status = "running"
# Filter expression (applied after the key lookup): workflow_type = "live"
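Since `workflow_type` is not part of the index key, a Query against `status-index` can only use `status` in its key condition and has to apply `workflow_type` as a filter. The sketch below builds the request parameters for the low-level DynamoDB Query API; the function name is hypothetical, and `status` is aliased because it is a DynamoDB reserved word.

```python
def running_live_sessions_query(table_name: str = "tradai-workflow-state") -> dict:
    """Build Query parameters for all running live sessions."""
    return {
        "TableName": table_name,
        "IndexName": "status-index",
        # Only key attributes may appear in the key condition.
        "KeyConditionExpression": "#st = :status",
        # Non-key attributes are filtered after items are read from the index.
        "FilterExpression": "workflow_type = :wtype",
        "ExpressionAttributeNames": {"#st": "status"},
        "ExpressionAttributeValues": {
            ":status": {"S": "running"},
            ":wtype": {"S": "live"},
        },
    }


params = running_live_sessions_query()
```

With boto3 installed, these parameters could be passed as `boto3.client("dynamodb").query(**params)`. The filter does not reduce read cost, because filtered-out items are still read from the index.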
ArcticDB Direct Access
Architecture Decision
Live trading containers access ArcticDB directly via S3, NOT through the data-collection service:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Data Access Patterns │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Data Collection Service (Scheduled) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ • Runs on schedule (hourly/daily) │ │
│ │ • Fetches from CCXT/Binance │ │
│ │ • WRITES to ArcticDB (S3) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ ArcticDB │ │
│ │ (S3) │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ Strategy Containers (Live Trading) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ • Always running │ │
│ │ • READS from ArcticDB (S3) directly │ │
│ │ • No dependency on data-collection service │ │
│ │ • Can also fetch real-time data from exchange │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Decoupled - live trading unaffected by data-collection failures │
│ • Low latency - direct S3 access, no service hop │
│ • Scalable - each container has independent read path │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Warmup Loader Implementation
from datetime import datetime, timedelta

import pandas as pd
from arcticdb import Arctic


class ArcticDBWarmupLoader:
    """Loads historical data from ArcticDB for strategy warmup."""

    def __init__(self, s3_uri: str, warmup_days: int = 30):
        self.arctic = Arctic(s3_uri)
        self.warmup_days = warmup_days
        self.library = self.arctic.get_library("ohlcv")

    def load_warmup_data(
        self,
        symbols: list[str],
        timeframe: str,
    ) -> dict[str, pd.DataFrame]:
        """Load warmup data for all symbols."""
        # Naive UTC timestamps bounding the warmup window
        end = datetime.utcnow()
        start = end - timedelta(days=self.warmup_days)

        result = {}
        for symbol in symbols:
            # Symbol key convention: "BTC/USDT:USDT" + "1h" -> "BTC_USDT:USDT_1h"
            key = f"{symbol.replace('/', '_')}_{timeframe}"
            df = self.library.read(
                key,
                date_range=(start, end),
            ).data
            result[symbol] = df
        return result
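A warmup load can silently return too little history if the data-collection schedule has gaps, so it may be worth checking the row count before starting the strategy. The sketch below estimates how many candles a warmup window should contain; the helper names and the 95% tolerance are illustrative assumptions, not part of the existing loader.

```python
# Minutes per timeframe unit, for strings such as "5m", "1h", "4h", "1d".
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440, "w": 10080}


def timeframe_minutes(timeframe: str) -> int:
    """Convert a timeframe string such as '1h' into minutes."""
    value, unit = timeframe[:-1], timeframe[-1:]
    if unit not in _UNIT_MINUTES or not value.isdigit() or int(value) == 0:
        raise ValueError(f"Unsupported timeframe: {timeframe!r}")
    return int(value) * _UNIT_MINUTES[unit]


def expected_candles(timeframe: str, warmup_days: int) -> int:
    """Number of candles a gap-free warmup window should contain."""
    return warmup_days * 1440 // timeframe_minutes(timeframe)


def has_enough_history(
    row_count: int, timeframe: str, warmup_days: int, tolerance: float = 0.95
) -> bool:
    """Accept small gaps: require at least `tolerance` of the expected rows."""
    return row_count >= expected_candles(timeframe, warmup_days) * tolerance
```

For the documented defaults (`1h` timeframe, 30 warmup days) this expects 720 candles per pair, so a frame with fewer than 684 rows would be rejected.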
Health Monitoring
Health Check Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ Health Monitoring Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Strategy Container │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Health Reporter (every 60s) │ │
│ │ ├── Update DynamoDB heartbeat │ │
│ │ ├── Publish CloudWatch metrics │ │
│ │ └── Report exchange connectivity │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ DynamoDB │ │
│ │ + CW │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ EventBridge Rule (every 5 min) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Trigger: rate(5 minutes) │ │
│ │ Target: Lambda health-check-live │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Lambda: health-check-live │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ • Query DynamoDB for status="running" AND workflow_type="live" │ │
│ │ • Treat a session as stale if last_heartbeat is older than 3 minutes │ │
│ │ • If stale: publish to SNS alert topic │ │
│ │ • Update CloudWatch custom metrics │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ (if unhealthy) │
│ ┌─────────────┐ │
│ │ SNS │──────▶ PagerDuty / Slack / Email │
│ │ Alert │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
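The core of the `health-check-live` Lambda is a comparison between each session's `last_heartbeat` and the current time. The sketch below shows that check in isolation, with DynamoDB and SNS access left out; `find_stale_sessions` is a hypothetical helper, and treating a missing heartbeat as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone

# A container heartbeats every 60 seconds; three missed beats count as stale.
STALE_AFTER = timedelta(minutes=3)


def parse_ts(value: str) -> datetime:
    """Parse the ISO-8601 UTC timestamps used in the session records."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stale_sessions(sessions: list[dict], now: datetime) -> list[str]:
    """Return workflow_ids whose heartbeat is missing or older than STALE_AFTER."""
    stale = []
    for session in sessions:
        heartbeat = session.get("last_heartbeat")
        if heartbeat is None or now - parse_ts(heartbeat) > STALE_AFTER:
            stale.append(session["workflow_id"])
    return stale
```

The Lambda handler would pass in the items returned by the status query and publish one SNS message per returned id.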
CloudWatch Alarms
Live Trading Alarms:
  - Name: live-trading-heartbeat-stale
    Metric: LiveTradingHeartbeatAge
    Threshold: > 180 seconds
    Period: 60 seconds
    EvaluationPeriods: 3
    Action: SNS Alert
  - Name: live-trading-exchange-disconnect
    Metric: LiveTradingExchangeConnected
    Threshold: < 1
    Period: 60 seconds
    EvaluationPeriods: 2
    Action: SNS Alert
  - Name: live-trading-high-loss
    Metric: LiveTradingPnLDaily
    Threshold: < -500 # USD
    Period: 300 seconds
    EvaluationPeriods: 1
    Action: SNS Alert + Auto-pause
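The `EvaluationPeriods` setting determines how quickly each alarm reacts: the alarm fires only when the most recent N datapoints all breach the threshold. The sketch below models that rule in a simplified form, ignoring CloudWatch's missing-data handling and "M out of N" settings; `alarm_breached` is a hypothetical helper for reasoning about alarm latency, not a CloudWatch API.

```python
def alarm_breached(
    datapoints: list[float],
    threshold: float,
    evaluation_periods: int,
    comparison: str = ">",
) -> bool:
    """Return True when the last `evaluation_periods` datapoints all breach."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data to evaluate yet
    if comparison == ">":
        return all(value > threshold for value in recent)
    if comparison == "<":
        return all(value < threshold for value in recent)
    raise ValueError(f"Unsupported comparison: {comparison!r}")


# Heartbeat age in seconds, one datapoint per 60-second period.
heartbeat_age = [45, 190, 250, 310]
stale = alarm_breached(heartbeat_age, threshold=180, evaluation_periods=3)
```

With a 60-second period and three evaluation periods, the heartbeat alarm needs about three minutes of consecutive breaches before it notifies, on top of the 180 seconds of staleness the threshold already encodes.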
ECS Service Configuration
Live Trading Service Definition
# ECS Service for Live Trading (NOT Spot)
Resource: AWS::ECS::Service
ServiceName: tradai-live-pascal
Cluster: tradai-cluster
LaunchType: FARGATE # NOT Spot for live trading
DesiredCount: 1
DeploymentConfiguration:
  MaximumPercent: 100
  MinimumHealthyPercent: 0 # Allow full replacement
TaskDefinition: tradai-strategy-pascal
NetworkConfiguration:
  AwsvpcConfiguration:
    Subnets:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
    SecurityGroups:
      - !Ref StrategySecurityGroup
EnableExecuteCommand: true # For debugging
Task Definition
Resource: AWS::ECS::TaskDefinition
Family: tradai-strategy-pascal
Cpu: 512
Memory: 1024
NetworkMode: awsvpc
RequiresCompatibilities:
  - FARGATE
ExecutionRoleArn: !Ref ECSExecutionRole
TaskRoleArn: !Ref StrategyTaskRole
ContainerDefinitions:
  - Name: strategy
    Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-pascal:2.0.0
    Essential: true
    Environment:
      # MINIMAL env vars - config loaded from MLflow
      - Name: STRATEGY_NAME
        Value: PascalStrategy
      - Name: STRATEGY_STAGE
        Value: Production # or use STRATEGY_VERSION for pinned version
      - Name: TRADING_MODE
        Value: live
      # Infrastructure (rarely changes)
      - Name: MLFLOW_TRACKING_URI
        Value: http://mlflow.internal:5000
      - Name: DYNAMODB_TABLE
        Value: tradai-workflow-state
      - Name: ARCTICDB_S3_URI
        Value: !Sub s3://${ArcticDBBucket}
      - Name: S3_CONFIG_BUCKET
        Value: !Ref ConfigsBucket
    Secrets:
      # Secrets stay in Secrets Manager (not MLflow)
      - Name: EXCHANGE_API_KEY
        ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-key
      - Name: EXCHANGE_API_SECRET
        ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-secret
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-group: /ecs/tradai-live-pascal
        awslogs-region: !Ref AWS::Region
        awslogs-stream-prefix: strategy
    HealthCheck:
      Command:
        - CMD-SHELL
        - curl -f http://localhost:8080/health || exit 1
      Interval: 30
      Timeout: 5
      Retries: 3
      StartPeriod: 60
# Note: All strategy-specific config (pairs, timeframe, warmup_days, etc.)
# is loaded at runtime from MLflow tags + S3, NOT from env vars.
# To change config: update MLflow model version, NOT task definition.
Auto Scaling (Optional)
For strategies that benefit from horizontal scaling:
Resource: AWS::ApplicationAutoScaling::ScalableTarget
ServiceNamespace: ecs
ResourceId: !Sub service/tradai-cluster/tradai-live-pascal
ScalableDimension: ecs:service:DesiredCount
MinCapacity: 1
MaxCapacity: 5

Resource: AWS::ApplicationAutoScaling::ScalingPolicy
PolicyName: cpu-scaling
PolicyType: TargetTrackingScaling
ScalingTargetId: !Ref ScalableTarget
TargetTrackingScalingPolicyConfiguration:
  PredefinedMetricSpecification:
    PredefinedMetricType: ECSServiceAverageCPUUtilization
  TargetValue: 70
  ScaleInCooldown: 300
  ScaleOutCooldown: 60
Cost Analysis
Per-Strategy Costs (Live Trading)
| Component | Specification | Monthly Cost |
|---|---|---|
| ECS Fargate | 0.5 vCPU, 1GB, 24/7 | $15.33 |
| CloudWatch Logs | ~1GB/month | $0.50 |
| DynamoDB | On-demand, ~1M reads | $0.25 |
| Secrets Manager | 2 secrets | $0.80 |
| EventBridge | 8,640 invocations | ~$0.01 |
| Lambda (health) | 8,640 x 128MB x 1s | $0.11 |
| SNS | ~100 alerts | $0.01 |
| Subtotal per strategy | | $17.01 |
| Reserve (exchange fees, data) | | +$29 |
| Total per strategy | | ~$46 |
Platform Total (3 Strategies)
| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $78-99 |
| Pascal Strategy (live) | $46 |
| Momentum Strategy (dry-run) | $46 |
| ML Trend Strategy (live) | $46 |
| Platform Total | $216-237 |
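The totals in both cost tables follow from simple addition, which the short check below reproduces using the figures from the tables. It uses `Decimal` to avoid floating-point rounding in currency sums; the variable names are illustrative only.

```python
from decimal import Decimal

# Per-strategy monthly components, taken from the per-strategy table.
components = {
    "ECS Fargate": Decimal("15.33"),
    "CloudWatch Logs": Decimal("0.50"),
    "DynamoDB": Decimal("0.25"),
    "Secrets Manager": Decimal("0.80"),
    "EventBridge": Decimal("0.01"),
    "Lambda (health)": Decimal("0.11"),
    "SNS": Decimal("0.01"),
}
subtotal = sum(components.values(), Decimal("0"))   # 17.01
reserve = Decimal("29")
per_strategy = subtotal + reserve                   # 46.01, shown as ~$46

# Platform total for three strategies on top of the base platform range.
base_low, base_high = Decimal("78"), Decimal("99")
strategies = 3
rounded_per_strategy = Decimal("46")
total_low = base_low + strategies * rounded_per_strategy    # 216
total_high = base_high + strategies * rounded_per_strategy  # 237
```

The EventBridge and Lambda rows share the same 8,640 figure because a 5-minute schedule fires 12 times per hour over a 30-day month (12 x 24 x 30).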
Security Controls
Exchange Credentials
# Secrets Manager Configuration
Secret: tradai/exchange/api-key
Description: Binance API key for live trading
Rotation: Manual (exchange limitation)
Access: StrategyTaskRole only
Secret: tradai/exchange/api-secret
Description: Binance API secret
Rotation: Manual
Access: StrategyTaskRole only
# IAM Policy for Strategy Task Role
Policy:
  Effect: Allow
  Action:
    - secretsmanager:GetSecretValue
  Resource:
    - arn:aws:secretsmanager:*:*:secret:tradai/exchange/*
  Condition:
    StringEquals:
      # Match the tag on the secret itself; aws:RequestTag is only present on tagging requests
      secretsmanager:ResourceTag/Environment: !Ref Environment
Network Security
# Strategy Security Group
Inbound:
  - Port 8080 (health check) from ALB only
  - No direct internet access
Outbound:
  - Port 443 to exchange APIs (Binance)
  - Port 443 to S3 (VPC endpoint)
  - Port 443 to DynamoDB (VPC endpoint)
  - Port 5000 to MLflow (internal)
Audit Logging
CloudTrail Events:
  - ECS RunTask / StopTask
  - Secrets Manager GetSecretValue
  - DynamoDB PutItem / UpdateItem
CloudWatch Logs:
  - /ecs/tradai-live-{strategy}
  - Retention: 30 days
  - Export to S3: After 7 days
Implementation Checklist (Phase 6)
Week 17-18: Live Trading Foundation
- [ ] Update Strategy Container Dockerfile for multi-mode support
- [ ] Implement ArcticDB warmup loader
- [ ] Implement DynamoDB state manager for live sessions
- [ ] Implement health reporter component
- [ ] Create ECS Service definitions (non-Spot)
- [ ] Configure Secrets Manager for exchange credentials
Week 19-20: MLflow Integration
- [ ] Update MLflow adapter for model serving
- [ ] Implement model loading in FreqAI strategies
- [ ] Create model promotion workflow (Staging → Production)
- [ ] Test model versioning with live trading
Week 21-22: Monitoring & Operations
- [ ] Deploy EventBridge + Lambda health monitoring
- [ ] Configure CloudWatch alarms
- [ ] Set up SNS alerting
- [ ] Create operational runbooks
- [ ] Perform dry-run validation
- [ ] Go-live with first strategy
Document Changelog
| Version | Date | Changes |
|---|---|---|
| 9.1 | 2025-12-09 | MLflow-centric configuration: minimal env vars, config from MLflow tags + S3, leverages existing StrategyMetadata, MLflowAdapter, ConfigMergeService patterns |
| 9.0 | 2025-12-08 | Initial live trading architecture document |
Related Documents
- 12-ML-LIFECYCLE.md - Complete ML strategy lifecycle including training pipeline, hyperparameter optimization, retraining scheduling, drift detection, and model rollback. Read this for the full MLOps story.
Next Action: Begin Phase 6 implementation after completing Phases 1-5 of the backtesting foundation.