TradAI Live Trading Architecture
Version: 9.1.2 (Live Trading Integration)
Date: 2026-03-28
Status: ASPIRATIONAL (Partially Implemented)
Depends On: Architecture v9.2.1 (Backtesting Foundation)
TL;DR: Live trading architecture using ECS Fargate (1 vCPU, 2GB, non-Spot). TradingHandler orchestrates Freqtrade with health monitoring (MetricsCollector, RiskMonitor, HealthReporter). Status: partially implemented; core components are ready, but E2E is not yet validated.
Status: Aspirational (Partially Implemented)
This document describes the target architecture. Core infrastructure is implemented: ECS service definitions (live-trading, dry-run-trading in config.py), the trading entrypoint (tradai.common.entrypoint.trading.TradingHandler), health reporting (HealthReporter, RiskMonitor, MetricsCollector), and the trading-heartbeat-check Lambda. Remaining work (MLflow model serving in live mode, full DynamoDB session management, exchange credential rotation) is tracked in reports/implementation-tasks/.
Implemented Components
The following building blocks are implemented and tested: StrategyConfigLoader (MLflow tags + S3), ArcticDBWarmupLoader (historical data warmup), TradingStateRepository (DynamoDB session state), HealthReporter + RiskMonitor + MetricsCollector (container health and risk controls), FreqAIConfigBuilder (inference config), and the trading-heartbeat-check Lambda. ECS service definitions for live-trading and dry-run-trading are in config.py.
Prerequisites
Live trading (Phase 6) requires ALL of the following before starting:
- Phases 1-5 complete with a stable backtesting platform
- At least one successful end-to-end backtest via Step Functions
- MLflow model registry operational with at least one registered model
- Health monitoring and CloudWatch alarms verified in production
1. Executive Summary
This document extends the TradAI architecture v8.1 to support live trading capabilities. The design follows key principles:
- Unified Strategy Container - Single container per strategy handles all modes (backtest, hyperopt, dry-run, live)
- MLflow-Centric Configuration - Strategy config loaded from MLflow tags + S3, NOT hardcoded env vars
- MLflow Model Lifecycle - FreqAI trains during backtest, MLflow serves models for live trading
- Direct ArcticDB Access - Live trading reads S3 directly, decoupled from data-collection service
- Self-Managed Sessions - Strategy containers manage their own DynamoDB state
- ECS Services for Live - Always-on services (NOT Spot) for production trading
Cost Impact
| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $122-128 |
| Per Live Strategy | +$61 |
| Total (3 strategies) | $305-311 |
2. System Architecture v9.0
graph TB
subgraph Public["Public Subnet"]
APIGW["API Gateway<br/>+ Cognito"]
ALB["ALB"]
NAT["NAT Instance"]
end
subgraph Private["Private Subnet - ECS Fargate Cluster"]
Backend["Backend Service<br/>(Always-on)"]
Strategy["Strategy Service<br/>(On-demand)"]
Containers["Strategy Containers<br/>pascal:2.0.0 LIVE<br/>momentum:1.0 DRY-RUN"]
DataColl["Data Collection<br/>(Scheduled)"]
MLflow["MLflow<br/>(Always-on)"]
end
subgraph Data["Data Layer"]
ArcticDB["ArcticDB (S3)"]
DynamoDB["DynamoDB State"]
S3["S3 Configs/Results"]
RDS["RDS Postgres"]
end
subgraph Monitoring["Monitoring & Events"]
EB["EventBridge"]
Lambda["Lambda Health"]
CW["CloudWatch Alarms"]
SNS["SNS Alerts"]
end
subgraph External["External"]
Binance["Binance Exchange"]
CCXT["CCXT (Futures)"]
end
APIGW --> ALB
ALB --> Backend
ALB --> Strategy
ALB --> MLflow
Backend --> Containers
Containers --> ArcticDB
Containers --> DynamoDB
Containers --> MLflow
Containers --> Binance
DataColl --> ArcticDB
EB --> Lambda
Lambda --> DynamoDB
Lambda --> SNS
CW --> SNS
See source code for detailed ASCII reference.
3. Unified Strategy Container
Container Structure
Each strategy is packaged as a single Docker image that supports all trading modes:
tradai-{strategy-name}:{version}
├── Freqtrade runtime
├── Strategy code (IStrategy implementation)
├── FreqAI models (for ML strategies)
├── ArcticDB warmup loader
├── DynamoDB state manager
├── Health reporter
└── Mode configuration
Trading Modes
| Mode | Description | Launch Type | Spot OK? |
|---|---|---|---|
| backtest | Historical simulation | ECS Task | Yes |
| hyperopt | Parameter optimization | ECS Task | Yes |
| dry-run | Paper trading (no real orders) | ECS Service | Yes |
| live | Production trading | ECS Service | No |
Environment Variables (Minimal)
The container uses minimal environment variables - just enough to locate the strategy in MLflow. All other configuration is loaded from MLflow tags and S3.
# REQUIRED: Identify strategy and trading state
# Issue 10 Fix: Changed from STRATEGY_NAME/STRATEGY_STAGE to match actual code
MLFLOW_TRACKING_URI: "http://mlflow.tradai-{env}.local:5000/mlflow"
STRATEGY: "PascalStrategy" # Strategy class name (FREQTRADE_STRATEGY as fallback)
STRATEGY_ID: "pascal-btc" # Unique strategy identifier for trading state
# REQUIRED: Trading mode
TRADING_MODE: "live" # backtest | hyperopt | dry-run | live | train
# REQUIRED: Secrets (must stay in Secrets Manager)
EXCHANGE_SECRET_NAME: "tradai/exchange/binance" # AWS Secrets Manager secret
# INFRASTRUCTURE (rarely changes)
DYNAMODB_TABLE: "tradai-workflow-state"
ARCTICDB_S3_URI: "s3://tradai-arcticdb-prod"
Why minimal env vars?
- Single source of truth: Config lives in MLflow, not scattered across deployments
- Version tracking: Config changes are tied to strategy versions
- Audit trail: MLflow tracks who changed what and when
- Environment parity: Same container image, different config via MLflow stage
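The bootstrap step can be illustrated with a small, self-contained sketch. It reads only the variables listed above and fails fast when one is missing or invalid. `BootstrapSettings` and `load_bootstrap` are hypothetical names used for illustration, not part of the TradAI codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Modes listed in the TRADING_MODE comment above.
VALID_MODES = {"backtest", "hyperopt", "dry-run", "live", "train"}


@dataclass(frozen=True)
class BootstrapSettings:
    """Hypothetical holder for the minimal bootstrap variables."""

    mlflow_tracking_uri: str
    strategy: str
    strategy_id: str
    trading_mode: str


def load_bootstrap(env: Mapping[str, str] = os.environ) -> BootstrapSettings:
    """Read the minimal env vars; raise early if one is missing or invalid."""
    # STRATEGY takes precedence, FREQTRADE_STRATEGY is the documented fallback.
    strategy = env.get("STRATEGY") or env.get("FREQTRADE_STRATEGY")
    if not strategy:
        raise KeyError("STRATEGY (or FREQTRADE_STRATEGY) must be set")
    mode = env["TRADING_MODE"]
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported TRADING_MODE: {mode!r}")
    return BootstrapSettings(
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
        strategy=strategy,
        strategy_id=env.get("STRATEGY_ID", ""),
        trading_mode=mode,
    )
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.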
4. MLflow-Centric Configuration
Design Principle
Strategy configuration flows from MLflow, NOT from environment variables:
flowchart TD
A["Minimal env vars"]
B["Query MLflow tags"]
C["Load S3 config"]
D["Apply runtime overrides"]
E["Validated config"]
A --> B --> C --> D --> E
| Step | Action | Detail |
|---|---|---|
| Minimal env vars | Bootstrap lookup | Only enough data to locate the strategy in MLflow, typically STRATEGY, STRATEGY_ID, and MLFLOW_TRACKING_URI |
| Query MLflow tags | Resolve model metadata | MLflowAdapter.get_model_version(...) returns tags such as strategy_version, timeframe, configuration_file, warmup_days, pairs, and category |
| Load S3 config | Fetch base config | ConfigMergeService.load_config(...) loads the full Freqtrade configuration referenced by the MLflow metadata |
| Apply runtime overrides | Merge session-specific values | ConfigMergeService.apply_overrides(...) performs the final OmegaConf deep merge |
| Validated config | Start trading runtime | The container launches Freqtrade with the merged, validated strategy configuration |
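To make the "Apply runtime overrides" step concrete, here is a minimal sketch of deep-merge semantics. It uses a plain-Python recursive merge as a stand-in for the OmegaConf merge; `deep_merge` and the sample keys are illustrative assumptions, not the `ConfigMergeService` implementation.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where nested override keys win over base keys."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale by the override.
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only.
base = {
    "stake_amount": 100,
    "exchange": {"name": "binance", "pair_whitelist": ["BTC/USDT:USDT"]},
}
overrides = {"dry_run": True, "exchange": {"pair_whitelist": ["ETH/USDT:USDT"]}}
config = deep_merge(base, overrides)
```

Here `config["exchange"]["name"]` is kept from the base config, while the whitelist and `dry_run` come from the overrides; `base` itself is left unmodified.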
Integration with Existing Patterns
This design leverages existing TradAI components:
| Component | File | Purpose |
|---|---|---|
| StrategyMetadata | libs/tradai-strategy/src/tradai/strategy/metadata.py | Defines metadata schema with to_mlflow_tags() |
| MLflowAdapter | libs/tradai-common/src/tradai/common/mlflow/adapter.py | REST API wrapper for MLflow |
| ConfigMergeService | libs/tradai-common/src/tradai/common/config/merge.py | S3 config loading + OmegaConf merge |
| S3ConfigRepository | libs/tradai-common/src/tradai/common/aws/s3_config_repository.py | S3 config storage |
Config Loading Implementation
from tradai.common.mlflow.adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService
from tradai.common.aws.s3_config_repository import S3ConfigRepository


class StrategyConfigLoader:
    """Loads strategy configuration from MLflow and S3."""

    def __init__(
        self,
        mlflow_adapter: MLflowAdapter,
        config_service: ConfigMergeService,
    ):
        self.mlflow = mlflow_adapter
        self.config_service = config_service

    def load_config(
        self,
        strategy_name: str,
        stage: str = "Production",
        runtime_overrides: dict | None = None,
    ) -> dict:
        """Load complete strategy config from MLflow + S3.

        Args:
            strategy_name: Name of strategy in MLflow registry
            stage: MLflow stage (Production, Staging, None)
            runtime_overrides: Optional overrides for this session

        Returns:
            Complete validated strategy configuration
        """
        # 1. Get model version from MLflow (includes all tags)
        model_version = self.mlflow.get_model_version(
            name=strategy_name,
            stage=stage,
        )
        # 2. Extract config metadata from tags
        tags = {t["key"]: t["value"] for t in model_version.tags}
        config_s3_path = tags["configuration_file"]
        # 3. Build strategy metadata from tags
        strategy_config = {
            "strategy_name": strategy_name,
            "strategy_version": tags["strategy_version"],
            "timeframe": tags["timeframe"],
            "warmup_days": int(tags.get("warmup_days", "30")),
            "pairs": tags["pairs"].split(","),
            "can_short": tags.get("can_short", "false").lower() == "true",
        }
        # 4. Load full Freqtrade config from S3
        freqtrade_config = self.config_service.load_config(config_s3_path)
        # 5. Merge strategy metadata into config
        full_config = {
            **freqtrade_config,
            "tradai": strategy_config,
        }
        # 6. Apply runtime overrides if provided
        if runtime_overrides:
            full_config = self.config_service.apply_overrides(
                base=full_config,
                overrides=runtime_overrides,
            )
        return full_config
MLflow Tag Schema (StrategyMetadata.to_mlflow_tags())
When a strategy is registered, StrategyMetadata.to_mlflow_tags() stores:
# Core identity (from StrategyMetadata)
strategy_name: "PascalStrategy"
strategy_version: "2.0.0"
timeframe: "1h"
category: "mean_reversion"
author: "TradAI Team"
status: "production" # testing | staging | production
# Trading config
can_short: "true"
pairs: "BTC/USDT:USDT,ETH/USDT:USDT"
stake_amount: "100"
max_open_trades: "3"
# Data requirements
warmup_days: "30"
data_format: "ohlcv"
# Config file reference (S3 path)
configuration_file: "strategies/pascal/v2.0.0/config.json"
# ML model reference (for FreqAI strategies)
freqai_model: "models:/PascalStrategy-FreqAI/Production"
# Docker image reference
ecr_url: "123456789.dkr.ecr.eu-central-1.amazonaws.com/tradai/pascal:2.0.0"
Container Initialization Flow
sequenceDiagram
participant Env as Environment
participant MLflow as MLflow Registry
participant S3 as S3 Config
participant DDB as DynamoDB
participant Arctic as ArcticDB
participant FT as Freqtrade
participant CW as CloudWatch
Note over Env: 1. Bootstrap
Env->>MLflow: get_model_version(STRATEGY, stage)
MLflow-->>Env: ModelVersion + tags
Note over Env: 2. Load Config
Env->>S3: load_config(config_s3_path)
S3-->>Env: Full Freqtrade config
Note over Env: Apply runtime overrides
Note over Env: 3. Initialize State
Env->>DDB: Create session (status=INITIALIZING)
Note over Env: 4. Load ML Models (if FreqAI)
Env->>MLflow: get_model_version(freqai_model)
MLflow-->>Env: Model artifacts
Note over Env: 5. Warmup Data
Env->>Arctic: Load historical OHLCV data
Note over Env: 6. Start Trading
Env->>FT: Start Freqtrade subprocess
Env->>DDB: Update status=RUNNING
loop Every 30 seconds
Note over Env: 7. Health Reporting
FT-->>Env: /profit, /status
Env->>DDB: Update last_heartbeat
Env->>CW: Publish metrics
end
Startup Code Example
import logging
import os

from tradai.common.mlflow.adapter import MLflowAdapter
from tradai.common.config.merge import ConfigMergeService

# StrategyConfigLoader is the class shown under "Config Loading Implementation";
# its import path is not given in this document.

logger = logging.getLogger(__name__)


def main():
    # 1. Bootstrap from minimal env vars
    strategy_name = os.environ.get("STRATEGY") or os.environ["FREQTRADE_STRATEGY"]
    strategy_id = os.environ.get("STRATEGY_ID", "")
    trading_mode = os.environ["TRADING_MODE"]
    mlflow_uri = os.environ["MLFLOW_TRACKING_URI"]
    # 2. Initialize adapters
    mlflow = MLflowAdapter(base_url=mlflow_uri)
    config_service = ConfigMergeService(bucket="tradai-configs")
    config_loader = StrategyConfigLoader(mlflow, config_service)
    # 3. Load complete config from MLflow + S3
    config = config_loader.load_config(
        strategy_name=strategy_name,
    )
    # 4. Extract runtime parameters (all from config, NOT env vars)
    warmup_days = config["tradai"]["warmup_days"]
    pairs = config["tradai"]["pairs"]
    timeframe = config["tradai"]["timeframe"]
    # 5. Initialize state, warmup data, start trading...
    logger.info(f"Starting {strategy_name} v{config['tradai']['strategy_version']}")
    logger.info(f"Mode: {trading_mode}, Pairs: {pairs}, Timeframe: {timeframe}")
5. MLflow Model Lifecycle
Training Phase (Backtest Mode)
During backtesting with FreqAI strategies:
flowchart LR
subgraph Training["Training Phase (Backtest Mode)"]
A[Backtest Run] --> B[FreqAI Strategy]
B --> C[Train Models]
C --> D[Save to MLflow]
D --> E["MLflow Registry\n(Staging / Prod)"]
end
subgraph Inference["Inference Phase (Live Mode)"]
F[Container Start] --> G[Load Model]
G --> H[MLflow Client]
H --> I[Cached Model]
I --> J["FreqAI Predict\n(No Train)"]
end
E -.-> G
Note: FreqAI does NOT retrain during live trading. Models are loaded once from MLflow and used for inference only.
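The "load once, then predict only" behaviour can be sketched with a memoized loader. `make_cached_loader` is a hypothetical helper. In a real container the wrapped callable might be something like `mlflow.pyfunc.load_model`, assuming the model was logged in a pyfunc-compatible flavor; the document does not specify this.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(fetch: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap a model-fetching callable so each model URI is fetched at most once."""

    @lru_cache(maxsize=None)
    def load(model_uri: str) -> Any:
        # Runs only on the first request for a given URI; later calls hit the cache.
        return fetch(model_uri)

    return load


# Illustrative stand-in for a real model download.
fetch_calls: list[str] = []


def fake_fetch(model_uri: str) -> object:
    fetch_calls.append(model_uri)
    return object()


load_model = make_cached_loader(fake_fetch)
first = load_model("models:/PascalStrategy-FreqAI/Production")
second = load_model("models:/PascalStrategy-FreqAI/Production")
```

Both calls return the same object and `fake_fetch` runs only once, which mirrors the inference-only lifecycle: a new model version takes effect on container restart rather than mid-session.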
Model Registry Schema
MLflow Model Registry
├── PascalStrategy
│   ├── Version 1 (Staging) - trained 2025-12-01
│   ├── Version 2 (Production) - trained 2025-12-05
│   └── Version 3 (None) - training 2025-12-08
├── MomentumStrategy
│   └── Version 1 (Production) - trained 2025-12-03
└── MLTrendStrategy
    ├── Version 1 (Archived) - trained 2025-11-15
    └── Version 2 (Production) - trained 2025-12-06
6. Unified DynamoDB State
Table Schema
Single table tradai-workflow-state handles both backtest workflows AND live trading sessions:
Table: tradai-workflow-state-{env}
Partition Key: run_id (String)
Sort Key: None
# Backtest Workflow Record
{
"run_id": "backtest-pascal-20251208-abc123",
"workflow_type": "backtest",
"strategy_name": "PascalStrategy",
"strategy_version": "2.0.0",
"status": "completed",
"created_at": "2025-12-08T10:00:00Z",
"started_at": "2025-12-08T10:00:00Z",
"completed_at": "2025-12-08T10:15:00Z",
"config_s3_path": "s3://tradai-configs/backtest/...",
"result_s3_path": "s3://tradai-results/backtest/...",
"ttl": 1765792800 # epoch seconds: created_at + 7 days
}
# Live Trading Session Record
{
"run_id": "live-pascal-20251208-001",
"workflow_type": "live",
"strategy_name": "PascalStrategy",
"strategy_version": "2.0.0",
"status": "running",
"trading_mode": "live",
"created_at": "2025-12-08T00:00:00Z",
"started_at": "2025-12-08T00:00:00Z",
"last_heartbeat": "2025-12-08T14:30:00Z",
"exchange": "binance",
"pairs": ["BTC/USDT:USDT", "ETH/USDT:USDT"],
"metrics": {
"total_trades": 45,
"win_rate": 0.62,
"pnl_today": 125.50,
"pnl_total": 1250.00
},
"health": {
"cpu_percent": 35,
"memory_mb": 512,
"exchange_connected": true,
"last_trade_at": "2025-12-08T14:25:00Z"
}
# No TTL for live sessions - persist until explicitly deleted
}
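DynamoDB TTL attributes are stored as Unix epoch seconds. The following sketch shows one way a 7-day TTL could be derived from a record's `created_at` timestamp. Using `created_at` as the basis (rather than `completed_at`) is an assumption, and `ttl_epoch` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch(created_at: str, days: int = 7) -> int:
    """Return epoch seconds `days` after an ISO-8601 UTC timestamp like 2025-12-08T10:00:00Z."""
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return int((created + timedelta(days=days)).timestamp())


backtest_ttl = ttl_epoch("2025-12-08T10:00:00Z")  # 2025-12-15T10:00:00Z as epoch seconds
```

Live session records simply omit the `ttl` attribute, so DynamoDB never expires them.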
DynamoDB item size limit for long-running sessions
Long-running trading sessions accumulate metrics snapshots in the same DynamoDB item. DynamoDB has a 400KB item size limit. For strategies running 6+ months, consider archiving historical metrics to S3 periodically.
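A rough guard against the item size limit could look like the sketch below. It approximates the stored size by the length of the compact JSON encoding, which is not how DynamoDB counts bytes exactly, so the 80% threshold is a deliberately conservative assumption. The function names are hypothetical.

```python
import json

# DynamoDB's documented maximum item size is 400 KB.
DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024


def approx_item_bytes(item: dict) -> int:
    """Approximate an item's size by its compact JSON encoding (an estimate only)."""
    return len(json.dumps(item, separators=(",", ":")).encode("utf-8"))


def needs_archiving(item: dict, threshold: float = 0.8) -> bool:
    """True once the estimated size reaches `threshold` of the item limit."""
    return approx_item_bytes(item) >= threshold * DYNAMODB_ITEM_LIMIT_BYTES


# Illustrative records only.
small_session = {"run_id": "live-pascal-20251208-001", "metrics": {"total_trades": 45}}
large_session = {"run_id": "live-pascal-20251208-001", "snapshots": ["x" * 100] * 4000}
```

A health reporter could run such a check before each write and move older snapshots to S3 when it returns `True`.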
GSI for Status Queries
GSI: status-created_at-index
Partition Key: status (String)
Sort Key: created_at (String)
# Query: Get all running live sessions
status = "running" (then filter workflow_type = "live")
GSI: trace_id-index
Partition Key: trace_id (String)
Projection: INCLUDE (job_id, status, created_at, execution_arn)
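The "running live sessions" query above can be expressed as keyword arguments for the low-level DynamoDB `query` operation. This is a sketch: `running_live_sessions_query` is a hypothetical helper, and the `#s` alias is used because `status` is a DynamoDB reserved word.

```python
def running_live_sessions_query(table_name: str) -> dict:
    """Build query arguments for the status-created_at-index GSI."""
    return {
        "TableName": table_name,
        "IndexName": "status-created_at-index",
        # Key condition on the GSI partition key.
        "KeyConditionExpression": "#s = :status",
        # workflow_type is not a key attribute, so it is applied as a filter.
        "FilterExpression": "workflow_type = :wtype",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":status": {"S": "running"},
            ":wtype": {"S": "live"},
        },
    }


query_kwargs = running_live_sessions_query("tradai-workflow-state")
```

With boto3 these arguments would be passed as `boto3.client("dynamodb").query(**query_kwargs)`. The filter is applied after items are read from the index, so a caller should keep following `LastEvaluatedKey` until it is absent.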
7. ArcticDB Direct Access
Architecture Decision
Live trading containers access ArcticDB directly via S3, NOT through the data-collection service:
┌─────────────────────────────────────────────────────────────────────────────┐
│                            Data Access Patterns                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Data Collection Service (Scheduled)                                        │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • Runs on schedule (hourly/daily)                                     │  │
│  │ • Fetches from CCXT/Binance                                           │  │
│  │ • WRITES to ArcticDB (S3)                                             │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                      │                                      │
│                                      ▼                                      │
│                               ┌─────────────┐                               │
│                               │  ArcticDB   │                               │
│                               │    (S3)     │                               │
│                               └──────┬──────┘                               │
│                                      │                                      │
│                                      ▼                                      │
│  Strategy Containers (Live Trading)                                         │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • Always running                                                      │  │
│  │ • READS from ArcticDB (S3) directly                                   │  │
│  │ • No dependency on data-collection service                            │  │
│  │ • Can also fetch real-time data from exchange                         │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  Benefits:                                                                  │
│  • Decoupled - live trading unaffected by data-collection failures          │
│  • Low latency - direct S3 access, no service hop                           │
│  • Scalable - each container has independent read path                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Warmup Loader Implementation
from datetime import datetime, timedelta, timezone

import pandas as pd
from arcticdb import Arctic


class ArcticDBWarmupLoader:
    """Loads historical data from ArcticDB for strategy warmup."""

    def __init__(self, s3_uri: str, warmup_days: int = 30):
        self.arctic = Arctic(s3_uri)
        self.warmup_days = warmup_days
        self.library = self.arctic.get_library("ohlcv")

    def load_warmup_data(
        self,
        symbols: list[str],
        timeframe: str,
    ) -> dict[str, pd.DataFrame]:
        """Load warmup data for all symbols."""
        # Naive UTC timestamp; replaces the deprecated datetime.utcnow()
        end = datetime.now(timezone.utc).replace(tzinfo=None)
        start = end - timedelta(days=self.warmup_days)
        result = {}
        for symbol in symbols:
            # e.g. "BTC/USDT:USDT" with "1h" becomes "BTC_USDT:USDT_1h"
            key = f"{symbol.replace('/', '_')}_{timeframe}"
            df = self.library.read(
                key,
                date_range=(start, end),
            ).data
            result[symbol] = df
        return result
8. Health Monitoring
Health Check Flow (C3/H1)
The Health Reporter now collects real-time metrics from Freqtrade and evaluates risk controls on every heartbeat cycle:
Freqtrade Process (localhost:8080)
    │  GET /profit, /status, /balance
    ▼
MetricsCollector ──► StrategyPnL + [LiveTrade]
    │
    ▼
HealthReporter._heartbeat_loop() (every 30s default)
    ├── _collect_metrics() → poll Freqtrade via MetricsCollector
    ├── _send_heartbeat() → persist to DynamoDB (pnl_snapshot, open_trades_snapshot)
    ├── _check_risk() → RiskMonitor evaluates limits
    │   └── On breach: pause/stop Freqtrade, update DynamoDB, send CRITICAL alert
    └── _check_pause_resume() → C9 pause/resume from backend API
Risk controls (C3): RiskMonitor evaluates drawdown, open trades, leverage, and fail-closed monitoring failures. RiskLimits entity stores configurable limits. Pre-flight validation via RiskLimits.validate_deployment_bounds().
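The risk evaluation described above can be sketched as a pure function. The `Limits` and `Snapshot` dataclasses, their field names, and `check_risk` are hypothetical stand-ins, not the actual `RiskLimits` or `RiskMonitor` API. The key point illustrated is fail-closed behaviour: missing metrics count as a breach.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical stand-in for the configurable risk limits."""

    max_drawdown_pct: float
    max_open_trades: int
    max_leverage: float


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical view of the metrics collected on one heartbeat."""

    drawdown_pct: float
    open_trades: int
    leverage: float


def check_risk(limits: Limits, snapshot: Optional[Snapshot]) -> list[str]:
    """Return breach descriptions; an empty list means all limits are respected."""
    if snapshot is None:
        # Fail closed: if monitoring data is unavailable, treat it as a breach.
        return ["metrics unavailable (fail-closed)"]
    breaches = []
    if snapshot.drawdown_pct > limits.max_drawdown_pct:
        breaches.append(f"drawdown {snapshot.drawdown_pct}% > {limits.max_drawdown_pct}%")
    if snapshot.open_trades > limits.max_open_trades:
        breaches.append(f"open trades {snapshot.open_trades} > {limits.max_open_trades}")
    if snapshot.leverage > limits.max_leverage:
        breaches.append(f"leverage {snapshot.leverage} > {limits.max_leverage}")
    return breaches


limits = Limits(max_drawdown_pct=10.0, max_open_trades=3, max_leverage=2.0)
healthy = check_risk(limits, Snapshot(drawdown_pct=4.2, open_trades=2, leverage=1.0))
breached = check_risk(limits, Snapshot(drawdown_pct=12.5, open_trades=4, leverage=1.0))
unknown = check_risk(limits, None)
```

A caller would pause or stop trading and raise a CRITICAL alert whenever the returned list is non-empty.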
Real-time monitoring (H1): PnL and trade snapshots are persisted to DynamoDB and served via GET /strategies/pnl and GET /strategies/{id}/trades.
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Health Monitoring Flow                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Strategy Container                                                         │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ Health Reporter (every 30s)                                           │  │
│  │ ├── Collect metrics from Freqtrade API (H1)                           │  │
│  │ ├── Update DynamoDB heartbeat + metric snapshots                      │  │
│  │ ├── Evaluate risk limits (C3: drawdown, trades, leverage)             │  │
│  │ └── On breach: pause/stop + CRITICAL alert                            │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                      │                                      │
│                                      ▼                                      │
│                               ┌─────────────┐                               │
│                               │  DynamoDB   │                               │
│                               │    + CW     │                               │
│                               └──────┬──────┘                               │
│                                      │                                      │
│                                      ▼                                      │
│  EventBridge Rule (every 5 min)                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ Trigger: rate(5 minutes)                                              │  │
│  │ Target: Lambda trading-heartbeat-check                                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                      │                                      │
│                                      ▼                                      │
│  Lambda: trading-heartbeat-check                                            │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • Query DynamoDB for status="running" AND workflow_type="live"        │  │
│  │ • Check last_heartbeat < 3 minutes ago                                │  │
│  │ • If stale: publish to SNS alert topic                                │  │
│  │ • Update CloudWatch custom metrics                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                      │                                      │
│                                      ▼ (if unhealthy)                       │
│                               ┌─────────────┐                               │
│                               │     SNS     │──▶ PagerDuty / Slack / Email  │
│                               │    Alert    │                               │
│                               └─────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
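The staleness rule applied by the heartbeat Lambda can be sketched as follows. The 3-minute threshold and the `last_heartbeat` / `run_id` fields come from this document; the function names are hypothetical, and the DynamoDB query and SNS publish steps are omitted.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=3)


def is_stale(last_heartbeat: str, now: datetime, stale_after: timedelta = STALE_AFTER) -> bool:
    """True when the last heartbeat is older than `stale_after`."""
    beat = datetime.strptime(last_heartbeat, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now - beat > stale_after


def stale_sessions(sessions: list[dict], now: datetime) -> list[str]:
    """Return run_ids of sessions whose heartbeat is stale."""
    return [s["run_id"] for s in sessions if is_stale(s["last_heartbeat"], now)]


# Illustrative session records.
now = datetime(2025, 12, 8, 14, 34, 0, tzinfo=timezone.utc)
sessions = [
    {"run_id": "live-pascal-20251208-001", "last_heartbeat": "2025-12-08T14:30:00Z"},
    {"run_id": "live-momentum-20251208-001", "last_heartbeat": "2025-12-08T14:32:30Z"},
]
alerts = stale_sessions(sessions, now)
```

In this example only the first session is reported, since its heartbeat is 4 minutes old. Because the rule runs every 5 minutes, a dead container may go unnoticed for up to roughly 8 minutes in the worst case.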
CloudWatch Alarms
Live Trading Alarms:
  - Name: live-trading-heartbeat-stale
    Metric: LiveTradingHeartbeatAge
    Threshold: > 180 seconds
    Period: 60 seconds
    EvaluationPeriods: 3
    Action: SNS Alert
  - Name: live-trading-exchange-disconnect
    Metric: LiveTradingExchangeConnected
    Threshold: < 1
    Period: 60 seconds
    EvaluationPeriods: 2
    Action: SNS Alert
  - Name: live-trading-high-loss
    Metric: LiveTradingPnLDaily
    Threshold: < -500 # USD
    Period: 300 seconds
    EvaluationPeriods: 1
    Action: SNS Alert + Auto-pause
9. ECS Service Configuration
Live Trading Service Definition
# ECS Service for Live Trading (NOT Spot)
Resource: AWS::ECS::Service
ServiceName: tradai-live-pascal
Cluster: tradai-cluster
LaunchType: FARGATE # NOT Spot for live trading
DesiredCount: 1
DeploymentConfiguration:
  MaximumPercent: 100
  MinimumHealthyPercent: 0 # Allow full replacement (see warning below)
TaskDefinition: tradai-strategy-pascal
NetworkConfiguration:
  AwsvpcConfiguration:
    Subnets:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
    SecurityGroups:
      - !Ref StrategySecurityGroup
EnableExecuteCommand: true # For debugging
Zero-container window during deployment
With MinimumHealthyPercent: 0, there will be a window with zero running trading containers during deployment. For strategies with open positions, coordinate deployments during market off-hours or implement a graceful handoff mechanism.
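The zero-container window follows directly from how ECS derives task-count bounds for a rolling deployment: the lower bound is the desired count times the minimum healthy percent, rounded up, and the upper bound is the desired count times the maximum percent, rounded down. A small sketch (the function name is hypothetical):

```python
import math


def deployment_task_bounds(
    desired_count: int,
    minimum_healthy_percent: int,
    maximum_percent: int,
) -> tuple[int, int]:
    """Return (min running tasks, max running tasks) during a rolling deployment."""
    lower = math.ceil(desired_count * minimum_healthy_percent / 100)
    upper = math.floor(desired_count * maximum_percent / 100)
    return lower, upper


live_bounds = deployment_task_bounds(1, 0, 100)       # settings used above
overlap_bounds = deployment_task_bounds(1, 100, 200)  # alternative, for comparison
```

With the settings above the bounds are `(0, 1)`: ECS stops the old task before starting the new one. The `100/200` alternative gives `(1, 2)`, which removes the gap but briefly runs two containers; for live trading that overlap would presumably risk duplicate orders, which is why the stop-then-start setting is paired with careful deployment timing.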
Task Definition
Resource: AWS::ECS::TaskDefinition
Family: tradai-strategy-pascal
Cpu: 1024
Memory: 2048
NetworkMode: awsvpc
RequiresCompatibilities:
  - FARGATE
ExecutionRoleArn: !Ref ECSExecutionRole
TaskRoleArn: !Ref StrategyTaskRole
ContainerDefinitions:
  - Name: strategy
    Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/tradai-pascal:2.0.0
    Essential: true
    Environment:
      # MINIMAL env vars - config loaded from MLflow
      - Name: STRATEGY
        Value: PascalStrategy
      - Name: STRATEGY_ID
        Value: pascal-btc # Unique strategy identifier for trading state
      - Name: TRADING_MODE
        Value: live
      # Infrastructure (rarely changes)
      - Name: MLFLOW_TRACKING_URI
        Value: http://mlflow.tradai-{env}.local:5000/mlflow
      - Name: DYNAMODB_TABLE
        Value: tradai-workflow-state
      - Name: ARCTICDB_S3_URI
        Value: !Sub s3://${ArcticDBBucket}
      - Name: S3_CONFIG_BUCKET
        Value: !Ref ConfigsBucket
    Secrets:
      # Secrets stay in Secrets Manager (not MLflow)
      - Name: EXCHANGE_API_KEY
        ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-key
      - Name: EXCHANGE_API_SECRET
        ValueFrom: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:tradai/exchange/api-secret
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-group: /ecs/tradai-live-pascal
        awslogs-region: !Ref AWS::Region
        awslogs-stream-prefix: strategy
    HealthCheck:
      Command:
        - CMD-SHELL
        - curl -f http://localhost:8004/api/v1/health || exit 1
      Interval: 30
      Timeout: 5
      Retries: 3
      StartPeriod: 120
# Note: All strategy-specific config (pairs, timeframe, warmup_days, etc.)
# is loaded at runtime from MLflow tags + S3, NOT from env vars.
# To change config: update MLflow model version, NOT task definition.
Auto Scaling (Optional)
Do NOT auto-scale live trading containers
CPU-based auto-scaling for live trading strategies will create DUPLICATE trading instances placing conflicting orders. Auto-scaling should only apply to the backtest/strategy-service containers, NOT to live trading containers. Live trading uses desired_count: 1 (or 0) per strategy -- never auto-scaled.
For services that benefit from horizontal scaling (backtest/strategy-service only):
Resource: AWS::ApplicationAutoScaling::ScalableTarget
ServiceNamespace: ecs
# ScalableServiceName is a placeholder for a backtest/strategy-service ECS service; never a tradai-live-* service
ResourceId: !Sub service/tradai-cluster/${ScalableServiceName}
ScalableDimension: ecs:service:DesiredCount
MinCapacity: 1
MaxCapacity: 5
Resource: AWS::ApplicationAutoScaling::ScalingPolicy
PolicyName: cpu-scaling
PolicyType: TargetTrackingScaling
ScalingTargetId: !Ref ScalableTarget
TargetTrackingScalingPolicyConfiguration:
  PredefinedMetricSpecification:
    PredefinedMetricType: ECSServiceAverageCPUUtilization
  TargetValue: 70
  ScaleInCooldown: 300
  ScaleOutCooldown: 60
10. Cost Analysis
Per-Strategy Costs (Live Trading)
| Component | Specification | Monthly Cost |
|---|---|---|
| ECS Fargate | 1 vCPU, 2GB, 24/7 | $30.66 |
| CloudWatch Logs | ~1GB/month | $0.50 |
| DynamoDB | On-demand, ~1M reads | $0.25 |
| Secrets Manager | 2 secrets | $0.80 |
| EventBridge | 8,640 invocations | ~$0.01 |
| Lambda (health) | 8,640 x 256MB x 1s | $0.22 |
| SNS | ~100 alerts | $0.01 |
| Subtotal per strategy | | $32.45 |
| Reserve (exchange fees, data) | | +$29 |
| Total per strategy | | ~$61 |
Platform Total (3 Strategies)
| Component | Monthly Cost |
|---|---|
| Base Platform (v8.1) | $122-128 |
| Pascal Strategy (live) | $61 |
| Momentum Strategy (dry-run) | $61 |
| ML Trend Strategy (live) | $61 |
| Platform Total | $305-311 |
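The arithmetic behind both cost tables can be checked with a short sketch. All figures are taken from the tables in this section; the variable and function names are illustrative.

```python
from decimal import Decimal

# Per-strategy line items from the table above (USD per month).
PER_STRATEGY_ITEMS = {
    "ECS Fargate": Decimal("30.66"),
    "CloudWatch Logs": Decimal("0.50"),
    "DynamoDB": Decimal("0.25"),
    "Secrets Manager": Decimal("0.80"),
    "EventBridge": Decimal("0.01"),
    "Lambda (health)": Decimal("0.22"),
    "SNS": Decimal("0.01"),
}
RESERVE = Decimal("29")

subtotal = sum(PER_STRATEGY_ITEMS.values(), Decimal("0"))
per_strategy_total = subtotal + RESERVE  # the table rounds this to ~$61


def platform_total(strategies: int, base: tuple[int, int] = (122, 128), per_strategy: int = 61) -> tuple[int, int]:
    """Return the (low, high) monthly platform cost for a number of strategies."""
    return base[0] + strategies * per_strategy, base[1] + strategies * per_strategy


three_strategies = platform_total(3)
```

The subtotal comes to 32.45, the per-strategy total to 61.45 (quoted as ~$61), and three strategies on the $122-128 base give the $305-311 range shown above.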
11. Security Controls
Exchange Credentials
# Secrets Manager Configuration
Secret: tradai/exchange/api-key
  Description: Binance API key for live trading
  Rotation: Manual (exchange limitation)
  Access: StrategyTaskRole only
Secret: tradai/exchange/api-secret
  Description: Binance API secret
  Rotation: Manual
  Access: StrategyTaskRole only
# IAM Policy for Strategy Task Role
Policy:
  Effect: Allow
  Action:
    - secretsmanager:GetSecretValue
  Resource:
    - arn:aws:secretsmanager:*:*:secret:tradai/exchange/*
  Condition:
    StringEquals:
      # Match the tag on the secret itself; aws:RequestTag only covers tags sent in the request
      aws:ResourceTag/Environment: !Ref Environment
Network Security
# Strategy Security Group
Inbound:
  - Port 8004 (health check) from ALB only
  - No direct internet access
Outbound:
  - Port 443 to exchange APIs (Binance)
  - Port 443 to S3 (VPC endpoint)
  - Port 443 to DynamoDB (VPC endpoint)
  - Port 5000 to MLflow (internal)
Audit Logging
CloudTrail Events:
  - ECS RunTask / StopTask
  - Secrets Manager GetSecretValue
  - DynamoDB PutItem / UpdateItem
CloudWatch Logs:
  - /ecs/tradai-live-{strategy}
  - Retention: 30 days
  - Export to S3: After 7 days
12. Implementation Checklist (Phase 6)
Week 17-18: Live Trading Foundation
- Update Strategy Container Dockerfile for multi-mode support
- Implement ArcticDB warmup loader
- Implement DynamoDB state manager for live sessions
- Implement health reporter component
- Create ECS Service definitions (non-Spot)
- Configure Secrets Manager for exchange credentials
Week 19-20: MLflow Integration
- Update MLflow adapter for model serving
- Implement model loading in FreqAI strategies
- Create model promotion workflow (Staging → Production)
- Test model versioning with live trading
Week 21-22: Monitoring & Operations
- Deploy EventBridge + Lambda health monitoring
- Configure CloudWatch alarms
- Set up SNS alerting
- Create operational runbooks
- Perform dry-run validation
- Go-live with first strategy
13. Document Changelog
| Version | Date | Changes |
|---|---|---|
| 9.1.1 | 2026-03-28 | Added section numbers, Mermaid architecture diagram, status admonitions |
| 9.1 | 2025-12-09 | MLflow-centric configuration: minimal env vars, config from MLflow tags + S3, leverages existing StrategyMetadata, MLflowAdapter, ConfigMergeService patterns |
| 9.0 | 2025-12-08 | Initial live trading architecture document |
Dependencies
| If This Changes | Update This Doc |
|---|---|
| infra/shared/tradai_infra_shared/config.py (live-trading service) | Section 9 (ECS Service Configuration) |
| libs/tradai-common/src/tradai/common/entrypoint/trading.py | Section 4 (Container Initialization Flow), Section 8 (Health Monitoring) |
| lambdas/trading-heartbeat-check/handler.py | Section 8 (Health Monitoring) |
| libs/tradai-common/src/tradai/common/mlflow/adapter.py | Section 5 (MLflow Model Lifecycle) |
| DynamoDB schema changes | Section 6 (Unified DynamoDB State) |
14. Related Documents
- 12-ML-LIFECYCLE.md - Complete ML strategy lifecycle including training pipeline, hyperparameter optimization, retraining scheduling, drift detection, and model rollback. Read this for the full MLOps story.
Next Action: Begin Phase 6 implementation after completing Phases 1-5 of the backtesting foundation.