TradAI Observability Architecture¶
Version: 1.0.0 | Date: 2026-03-28 | Status: CURRENT
1. TL;DR¶
TradAI observability is built on four pillars: structured logging via LoggerMixin, metrics collection from Freqtrade into DynamoDB and CloudWatch, health monitoring through a protocol-based checker framework with async concurrent execution, and risk monitoring via a pure-evaluator RiskMonitor that triggers circuit-breaker actions. End-to-end traceability is achieved through a trace_id correlation field that flows from job submission through Step Functions, ECS tasks, DynamoDB records, and MLflow experiment tags. CloudWatch provides the unified alerting and dashboarding layer with custom metric namespaces, composite alarms, and a pre-built operations dashboard.
2. Logging Architecture¶
LoggerMixin Pattern¶
All TradAI classes use LoggerMixin for structured logging via cooperative multiple inheritance. The mixin auto-initializes a logger named after the fully qualified class (module.ClassName).
Source: libs/tradai-common/src/tradai/common/logger_mixin.py
Key behaviors:
- Logger is auto-initialized in __init__ with the fully qualified class name
- Lazy re-initialization on first self.logger access (handles unpickling)
- Pickle-safe: logger is excluded from __getstate__ and re-created in __setstate__
- Hydra integration via get_hydra_logfile_dir_and_path() for file-based logging
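The behaviors above can be sketched in a few lines. This is an illustrative reconstruction of the pattern, not the actual implementation in logger_mixin.py:

```python
import logging


class LoggerMixin:
    """Minimal sketch of the LoggerMixin pattern (illustrative only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # cooperative multiple inheritance
        self._logger = self._make_logger()

    def _make_logger(self) -> logging.Logger:
        # Fully qualified name: module.ClassName
        cls = type(self)
        return logging.getLogger(f"{cls.__module__}.{cls.__qualname__}")

    @property
    def logger(self) -> logging.Logger:
        # Lazy re-initialization: covers instances restored via unpickling,
        # where __init__ never ran.
        if getattr(self, "_logger", None) is None:
            self._logger = self._make_logger()
        return self._logger

    def __getstate__(self):
        # Pickle-safe: loggers hold locks and handlers that cannot be pickled.
        state = self.__dict__.copy()
        state.pop("_logger", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._logger = None  # re-created lazily on first access
```

Any class mixing this in gets a correctly named logger without boilerplate, and instances survive pickling for multiprocessing or checkpointing.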
Log Level Convention
- DEBUG: Internal diagnostics (heartbeat counts, metric parse details)
- INFO: Key operations (status updates, MLflow logging, strategy start/stop)
- WARNING: Recoverable failures (heartbeat fails, metric collection fails)
- ERROR: Unrecoverable failures (health check exceptions, critical breaches)
- Never log secrets or PII.
CloudWatch Log Groups¶
All log groups use the canonical retention policy from infra/shared/tradai_infra_shared/config.py:
| Log Group Pattern | Source | Retention |
|---|---|---|
/ecs/tradai/{env} | ECS cluster (shared) | 30d / 90d prod |
/ecs/tradai/{env}/services | ECS service-level logs | 30d / 90d prod |
/aws/lambda/tradai-{func}-{env} | Lambda functions (18 handlers) | 30d / 90d prod |
/aws/states/tradai-backtest-workflow-{env} | Step Functions execution logs | 30d / 90d prod |
aws-waf-logs-tradai-{env} | WAF request logs | 30d / 90d prod |
/aws/cloudtrail/tradai | CloudTrail audit logs | 30d / 90d prod |
/tradai/consolidated/user-data | EC2 consolidated bootstrap | 7d |
/tradai/consolidated/containers | EC2 Docker container logs | 7d |
3. Metrics Collection¶
The metrics pipeline collects real-time PnL and trade data from Freqtrade during live trading, persists snapshots to DynamoDB, and publishes derived metrics to CloudWatch.
Source files:
- libs/tradai-common/src/tradai/common/health/metrics_collector.py
- libs/tradai-common/src/tradai/common/health/freqtrade_client.py
- libs/tradai-common/src/tradai/common/health/reporter.py
Collection Flow¶
sequenceDiagram
participant HR as HealthReporter (30s)
participant MC as MetricsCollector
participant FT as Freqtrade API
participant DB as DynamoDB
HR->>MC: collect_all()
MC->>FT: GET /api/v1/profit
FT-->>MC: profit_data (ratios)
MC->>FT: GET /api/v1/status
FT-->>MC: open_trades[]
MC-->>HR: (StrategyPnL, list[LiveTrade])
HR->>DB: update_metrics(strategy_id, pnl, trades)
HR->>HR: _check_risk(pnl, trades)
MetricsCollector¶
MetricsCollector transforms Freqtrade API responses into typed entities:
- GET /api/v1/profit produces StrategyPnL with: realized_pnl_pct, drawdown_pct, open_trades, closed_trades, timestamp
- GET /api/v1/status produces list[LiveTrade] with: pair, is_open, is_short, open_rate, stake_amount, profit_pct, leverage
- Unrealized PnL is computed as the stake-weighted average of open trade profits
Ratio to Percentage Conversion
Freqtrade returns ratios (0.0-1.0). MetricsCollector multiplies by 100 for all *_pct fields. This conversion happens exactly once at the collection boundary.
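The boundary conversion can be sketched as follows. The entity fields mirror the StrategyPnL description above, but the field names on the Freqtrade response dict are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StrategyPnL:
    """Illustrative subset of the StrategyPnL entity."""
    realized_pnl_pct: float
    drawdown_pct: float
    open_trades: int
    closed_trades: int
    timestamp: str


def pnl_from_profit_response(profit: dict) -> StrategyPnL:
    # Freqtrade returns ratios (0.0-1.0); the *-to-percent multiplication
    # happens exactly once, here at the collection boundary. Response key
    # names below are hypothetical.
    return StrategyPnL(
        realized_pnl_pct=profit["profit_closed_ratio"] * 100,
        drawdown_pct=profit["max_drawdown"] * 100,
        open_trades=profit["trade_count"] - profit["closed_trade_count"],
        closed_trades=profit["closed_trade_count"],
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```

Doing the conversion in one place keeps every downstream consumer (DynamoDB, risk checks, CloudWatch) working in percentages without double-scaling bugs.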
FreqtradeAPIClient¶
The FreqtradeAPIClient wraps the Freqtrade REST API (localhost:8080) with JWT auth. Provides get_profit(), get_open_trades(), get_balance(), pause(), and resume(). All calls use 5-second timeouts and are best-effort (failures logged, never raised).
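The best-effort contract can be sketched like this. The endpoints, 5-second timeout, and JWT header follow the description above; the use of urllib and everything else is an assumption for illustration:

```python
import json
import logging
import urllib.request
from typing import Optional


class FreqtradeAPIClient:
    """Illustrative best-effort wrapper around the Freqtrade REST API."""

    def __init__(self, base_url: str = "http://localhost:8080", token: str = ""):
        self.base_url = base_url.rstrip("/")
        self.token = token
        self.log = logging.getLogger(__name__)

    def _get(self, path: str) -> Optional[dict]:
        # Best-effort: 5-second timeout; failures are logged and swallowed,
        # never raised to the caller.
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            headers={"Authorization": f"Bearer {self.token}"},
        )
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                return json.loads(resp.read())
        except Exception as exc:
            self.log.warning("Freqtrade API call %s failed: %s", path, exc)
            return None

    def get_profit(self) -> Optional[dict]:
        return self._get("/api/v1/profit")

    def get_open_trades(self) -> Optional[dict]:
        return self._get("/api/v1/status")
```

Callers treat None as "metrics unavailable this cycle", which feeds the fail-closed counter described in the risk monitoring section.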
4. Health Monitoring¶
The health monitoring framework uses a protocol-based architecture with async concurrent execution and three-state health status (HEALTHY, DEGRADED, UNHEALTHY).
Module Files¶
| File | Purpose |
|---|---|
health/__init__.py | Module entry point with lazy loading |
health/protocols.py | HealthChecker protocol (@runtime_checkable) |
health/base.py | BaseHealthChecker with latency measurement and error handling |
health/aws_checkers.py | AWSHealthChecker base with lazy boto3 client management |
health/checkers.py | Pre-built checkers: Redis, HTTP, Database, DynamoDB, SQS, S3, MLflow, SNS |
health/service.py | HealthService aggregator with async concurrent checks |
health/reporter.py | HealthReporter background heartbeat thread |
health/metrics_collector.py | MetricsCollector for Freqtrade PnL/trade data |
health/risk_monitor.py | RiskMonitor pure evaluator for risk limits |
health/freqtrade_client.py | FreqtradeAPIClient for Freqtrade REST API |
Architecture Flowchart¶
flowchart TB
subgraph "HTTP Health Endpoint"
HS[HealthService<br/>async concurrent]
HS --> RC[RedisHealthChecker]
HS --> HC[HTTPHealthChecker]
HS --> DC[DatabaseHealthChecker]
HS --> DDB[DynamoDBHealthChecker]
HS --> SQC[SQSHealthChecker]
HS --> S3C[S3HealthChecker]
HS --> MLC[MLflowHealthChecker<br/>critical=False]
HS --> SNSC[SNSHealthChecker]
end
subgraph "Background Heartbeat"
HR[HealthReporter<br/>30s interval thread]
HR -->|collect_all| MC[MetricsCollector]
HR -->|record_heartbeat| DY[(DynamoDB<br/>trading-state)]
HR -->|evaluate| RM[RiskMonitor]
HR -->|pause/resume| FT[FreqtradeAPIClient]
HR -->|send_critical| AS[AlertService<br/>SNS]
end
subgraph "Aggregated Result"
HS -->|HealthResult| HR2{Status}
HR2 -->|all critical OK| HEALTHY[HEALTHY]
HR2 -->|non-critical failing| DEGRADED[DEGRADED]
HR2 -->|critical failing| UNHEALTHY[UNHEALTHY]
end
HealthService¶
HealthService runs all registered HealthChecker instances via asyncio.gather (timeout: 5s). Status: HEALTHY (all OK), DEGRADED (non-critical failing), UNHEALTHY (critical failing).
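The three-state aggregation can be sketched as below; a minimal reconstruction, assuming the names Check, Status, and aggregate, which are not from the source:

```python
import asyncio
import enum
from dataclasses import dataclass
from typing import Awaitable, Callable


class Status(enum.Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"
    UNHEALTHY = "UNHEALTHY"


@dataclass
class Check:
    name: str
    run: Callable[[], Awaitable[bool]]  # True = check passed
    critical: bool = True


async def aggregate(checks: list[Check], timeout: float = 5.0) -> Status:
    """Run all checkers concurrently, then fold results into three states:
    HEALTHY (all OK), DEGRADED (only non-critical failing),
    UNHEALTHY (any critical failing or timing out)."""

    async def guarded(check: Check) -> bool:
        try:
            return await asyncio.wait_for(check.run(), timeout)
        except Exception:
            return False  # a timeout or checker error counts as failure

    results = await asyncio.gather(*(guarded(c) for c in checks))
    if any(not ok and c.critical for ok, c in zip(results, checks)):
        return Status.UNHEALTHY
    if not all(results):
        return Status.DEGRADED
    return Status.HEALTHY
```

Wrapping each checker in its own timeout means one slow dependency cannot stall the whole health endpoint.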
HealthReporter¶
Background non-daemon thread running a 30-second heartbeat loop: (1) collect metrics from Freqtrade, (2) send heartbeat + PnL snapshots to DynamoDB, (3) check risk via RiskMonitor, (4) check pause/resume state from DynamoDB for operator control.
Alerting on Heartbeat Failures
After 3 consecutive failures (configurable), an SNS alert fires via AlertService. One alert per failure run to avoid storms. Resets on next successful heartbeat.
Pre-built Health Checkers¶
| Checker | Check Method | Critical |
|---|---|---|
RedisHealthChecker | PING command | Yes |
HTTPHealthChecker | GET + status code validation | Yes |
DatabaseHealthChecker | Async ping callable | Yes |
DynamoDBHealthChecker | DescribeTable (ACTIVE) | Yes |
SQSHealthChecker | GetQueueAttributes | Yes |
S3HealthChecker | HeadBucket | Yes |
SNSHealthChecker | GetTopicAttributes | Yes |
MLflowHealthChecker | is_available() via thread pool | No |
AWS checkers inherit AWSHealthChecker which provides lazy boto3 client caching, LocalStack endpoint compatibility, and short timeouts (3s connect, 3s read, no retries).
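The lazy client caching can be sketched as follows; client_factory stands in for boto3.client, which in the real checker would be configured with 3-second connect/read timeouts and retries disabled (this class shape is an assumption):

```python
class AWSHealthChecker:
    """Illustrative sketch of lazy boto3 client caching."""

    def __init__(self, service: str, client_factory, endpoint_url=None):
        self.service = service
        self.endpoint_url = endpoint_url  # override for LocalStack compatibility
        self._factory = client_factory
        self._client = None

    @property
    def client(self):
        # The client is created on first use, then cached for the checker's
        # lifetime, so importing/constructing checkers stays cheap.
        if self._client is None:
            self._client = self._factory(self.service, endpoint_url=self.endpoint_url)
        return self._client
```

Deferring client construction matters because boto3 client creation is comparatively expensive and may require credentials that are absent at import time.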
5. Risk Monitoring¶
The RiskMonitor is a pure evaluator with no I/O side effects. It receives pre-collected metrics and returns a RiskCheckResult indicating whether any limits are breached.
Source files:
- libs/tradai-common/src/tradai/common/health/risk_monitor.py
- libs/tradai-common/src/tradai/common/entities/risk_limits.py
RiskLimits Configuration¶
class RiskLimits(BaseModel):
max_drawdown_pct: float = 20.0 # 1.0 - 100.0 (deployment: 1.0 - 50.0)
max_open_trades: int = 5 # 1 - 50 (deployment: 1 - 20)
max_leverage: float = 3.0 # 1.0 - 125.0 (deployment: 1.0 - 10.0)
action_on_breach: str = "pause" # "pause" or "stop"
Deployment-Safe Bounds
RiskLimits enforces stricter deployment bounds (via validate_deployment_bounds()) than the entity validation bounds. This guards against misconfiguration before any container runs.
Evaluation Order¶
Checks in order: (1) drawdown vs max_drawdown_pct, (2) open trade count vs max_open_trades, (3) max trade leverage vs max_leverage, (4) fail-closed after 5 consecutive metric failures (triggers monitoring_failure breach to stop blind trading).
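A pure-evaluator sketch of this ordering, assuming attribute shapes that mirror the entities in this document (the exact signatures are illustrative):

```python
from dataclasses import dataclass, field


@dataclass
class RiskCheckResult:
    breached: bool
    reasons: list[str] = field(default_factory=list)


def evaluate(pnl, trades, limits, consecutive_metric_failures: int = 0) -> RiskCheckResult:
    """Pure evaluation: pre-collected metrics in, RiskCheckResult out, no I/O."""
    if pnl is None or consecutive_metric_failures >= 5:
        # Fail closed: after 5 consecutive metric failures we can no longer
        # see the account, so stop blind trading.
        return RiskCheckResult(True, ["monitoring_failure"])
    reasons = []
    if pnl.drawdown_pct > limits.max_drawdown_pct:
        reasons.append("max_drawdown_pct")
    if len(trades) > limits.max_open_trades:
        reasons.append("max_open_trades")
    if trades and max(t.leverage for t in trades) > limits.max_leverage:
        reasons.append("max_leverage")
    return RiskCheckResult(bool(reasons), reasons)
```

Keeping the evaluator free of side effects means it can be unit-tested exhaustively; the circuit-breaker actions below stay in the caller.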
Circuit Breaker Flow¶
flowchart LR
RM[RiskMonitor.evaluate] -->|breached=True| BA{action_on_breach}
BA -->|pause| FTP[FreqtradeAPIClient.pause]
BA -->|stop| SC[shutdown_callback]
SC -->|fallback on error| FTP
FTP --> DDB[DynamoDB: status=PAUSED]
DDB --> SNS[AlertService.send_critical]
RM -->|breached=False| REC{Was previously breached?}
REC -->|Yes| RECA[AlertService.send_info<br/>Recovery notification]
REC -->|No| NOP[No action]
Breach actions execute once per breach run: "stop" calls shutdown_callback (falls back to pause on error), "pause" calls FreqtradeAPIClient.pause(). DynamoDB status is set to PAUSED with breach details, and a critical SNS alert fires. On recovery, an info-level notification is sent.
6. Tracing & Correlation¶
Every backtest and training execution carries a unified trace context for end-to-end debugging.
Trace Context Fields¶
| Field | Set By | Stored In |
|---|---|---|
trace_id (E2E correlation UUID) | API Gateway / CLI | DynamoDB, Step Functions, ECS env, MLflow tags |
job_id (DynamoDB PK) | Backend API | DynamoDB (run_id), S3, MLflow tags |
mlflow_run_id | Strategy container | DynamoDB, BacktestResult |
git_commit (7 or 40 char SHA) | get_git_commit() | BacktestResult, MLflow tags |
Trace ID Flow¶
flowchart LR
API[API Gateway<br/>POST /backtests] -->|trace_id in SQS message| SQS[SQS FIFO Queue]
SQS -->|BacktestJobMessage| SFN[Step Functions<br/>trace_id in input]
SFN -->|TRACE_ID env var| ECS[ECS Task<br/>Strategy Container]
ECS -->|trace_id tag| MLF[MLflow Run]
ECS -->|trace_id field| DDB[(DynamoDB<br/>BacktestJobStatus)]
DDB -->|trace_id-index GSI| QUERY[Correlation Queries]
Propagation chain: BacktestJobMessage.trace_id (SQS) -> Step Functions input ("trace_id") -> ECS env var (TRACE_ID) -> MLflow tag (trace_id) -> DynamoDB field (trace_id).
DynamoDB GSI Queries¶
The workflow-state table has two GSIs:
| GSI Name | Hash Key | Range Key | Use Case |
|---|---|---|---|
status-created_at-index | status | created_at | List jobs by status |
trace_id-index | trace_id | -- | E2E correlation lookup |
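A correlation lookup against the trace_id-index GSI boils down to a DynamoDB Query request, sketched here as a pure parameter builder in the low-level (typed attribute value) shape, suitable for a boto3 client.query(**params) call; the table name is a parameter, not a value from the source:

```python
def trace_lookup_params(table_name: str, trace_id: str) -> dict:
    """Build low-level DynamoDB Query parameters for an E2E correlation
    lookup via the trace_id-index GSI (illustrative sketch)."""
    return {
        "TableName": table_name,
        "IndexName": "trace_id-index",
        "KeyConditionExpression": "trace_id = :t",
        "ExpressionAttributeValues": {":t": {"S": trace_id}},
    }
```

Because the GSI has no range key, a single Query with the trace_id hash key returns every job record touched by that execution.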
7. MLflow Experiment Tracking¶
MLflow provides experiment tracking for backtest results, model versioning, and artifact storage.
Source files:
- libs/tradai-common/src/tradai/common/mlflow/adapter.py
- libs/tradai-common/src/tradai/common/mlflow/backtest_logger.py
Tracking URI¶
MLflow is accessed via Service Discovery.
The MLflowAdapter uses the REST API with Basic Authentication and supports lazy initialization with double-checked locking for thread safety.
What Gets Logged¶
Metrics: total_trades, winning_trades, losing_trades, total_profit, total_profit_pct (required). Optional: sharpe_ratio, sortino_ratio, calmar_ratio, profit_factor, win_rate, max_drawdown, sqn, cagr, expectancy.
Tags: strategy, timeframe, exchange, start_date, end_date, symbols, trading_mode, job_id, trace_id, git_commit, logged_at.
Parameters: strategy, timeframe, start_date, end_date, symbols, exchange, stake_amount, stoploss (logged via log_parameters()).
Artifacts: configs/config.json -- full strategy configuration for reproducibility.
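The required/optional split above is a contract worth enforcing before logging; a hedged sketch of such a guard (the helper itself is illustrative, not the actual backtest_logger behavior):

```python
REQUIRED_METRICS = {"total_trades", "winning_trades", "losing_trades",
                    "total_profit", "total_profit_pct"}
OPTIONAL_METRICS = {"sharpe_ratio", "sortino_ratio", "calmar_ratio",
                    "profit_factor", "win_rate", "max_drawdown", "sqn",
                    "cagr", "expectancy"}


def validate_backtest_metrics(metrics: dict) -> dict:
    """Reject a metrics payload that misses a required key or carries an
    unknown one, so malformed runs fail before reaching MLflow."""
    missing = REQUIRED_METRICS - metrics.keys()
    if missing:
        raise ValueError(f"missing required metrics: {sorted(missing)}")
    unknown = metrics.keys() - REQUIRED_METRICS - OPTIONAL_METRICS
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    return metrics
```

Failing fast here keeps the MLflow experiment comparable across runs: every run is guaranteed to carry the same required metric set.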
MLflow Health Check¶
MLflowHealthChecker is non-critical (critical=False). If unavailable, the service enters DEGRADED status but continues operating.
8. CloudWatch Integration¶
CloudWatch serves as the centralized monitoring layer with custom metric namespaces, configurable alarms, and an operations dashboard.
Source files:
- infra/edge/modules/cloudwatch_alarms.py
- infra/edge/modules/cloudwatch_dashboard.py
- infra/ami/files/cloudwatch-agent.json
Custom Metric Namespaces¶
| Namespace | Metrics | Published By |
|---|---|---|
tradai/ServiceHealth | ServiceHealthy, HealthCheckLatency, ConsecutiveFailures | Lambda health-check (every 2 min) |
tradai/TradingHealth | ActiveStrategies, StaleHeartbeats, HealthyHeartbeats | Lambda trading-heartbeat-check (every 5 min) |
tradai/InfraDrift | DriftDetected, ResourcesToCreate/Update/Delete, CheckSuccess | Lambda pulumi-drift-detector (every 6 hrs) |
TradAI/Consolidated | disk_used_percent, mem_used_percent | CloudWatch Agent on EC2 (every 5 min) |
CloudWatch Alarms¶
All alarms notify via SNS topic (tradai-alerts-{env}).
| Alarm | Metric / Namespace | Condition | Period | Missing Data |
|---|---|---|---|---|
| Service Health (per service) | ServiceHealthy / ServiceHealth | < 1 for 2 periods | 120s | breaching (always-on) or notBreaching (on-demand) |
| Service Latency (per service) | HealthCheckLatency / ServiceHealth | > 5000ms for 3 periods | 120s | notBreaching |
| Stale Heartbeats | StaleHeartbeats / TradingHealth | >= 1 for 2 periods | 300s | notBreaching |
| Active Strategies | ActiveStrategies / TradingHealth | < configured min | 300s | notBreaching |
| Lambda Errors (per function) | Errors / AWS/Lambda | >= 1 | 300s | notBreaching |
| DLQ Messages | ApproximateNumberOfMessagesVisible / AWS/SQS | > 0 | 300s | notBreaching |
| Infrastructure Drift | DriftDetected / InfraDrift | >= 1 | 21600s | notBreaching |
| RDS CPU | CPUUtilization / AWS/RDS | > 80% for 2 periods | 300s | notBreaching |
| RDS Storage | FreeStorageSpace / AWS/RDS | < 2GB | 300s | notBreaching |
| API 5xx Rate | Math: 5xx / Count * 100 / AWS/ApiGateway | > 1% for 2 periods | 300s | notBreaching |
| Step Function Failures | ExecutionsFailed / AWS/States | > 0 | 300s | notBreaching |
On-Demand Service Handling
Services with desired_count=0 (live-trading, dry-run-trading, strategy-container) use treat_missing_data="notBreaching" so they do not trigger false alarms when not running. Note: strategy-service (desired_count=1) is always-on and uses treat_missing_data="breaching" like other always-on services.
Composite Alarm¶
A composite alarm (tradai-system-health-{env}) triggers when ANY of these fire:
- Any critical service health alarm (backend-api, data-collection, mlflow, strategy-service, live-trading, dry-run-trading)
- Stale heartbeats alarm
- RDS CPU or storage alarms
- API Gateway 5xx rate alarm
- Step Functions failure alarms
Operations Dashboard¶
The CloudWatch dashboard (tradai-{env}) auto-refreshes every 60 seconds across four rows:
| Row | Widgets | Source |
|---|---|---|
| 1 | ECS CPU, Memory, Running Task Count | AWS/ECS, ECS/ContainerInsights |
| 2 | Health Check Latency (p50/p95/p99), Service Health | tradai/ServiceHealth |
| 3 | Active Strategies, Stale/Healthy Heartbeats | tradai/TradingHealth |
| 4 | Alarm Status overview, Recent Logs (ERROR/WARNING) | Alarms, Logs Insights |
Monitored Lambda Functions¶
| Function | Schedule | Purpose |
|---|---|---|
orphan-scanner | 5 min | Detect orphaned ECS tasks |
health-check | 2 min | Publish service health metrics |
trading-heartbeat-check | 5 min | Publish trading heartbeat metrics |
drift-monitor | 12 hrs | ML model/data drift detection |
retraining-scheduler | 6 hrs | Schedule model retraining |
pulumi-drift-detector | 6 hrs | Infrastructure drift detection |
Changelog¶
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial creation from health module analysis |
Dependencies¶
| If This Changes | Update This Doc |
|---|---|
libs/tradai-common/src/tradai/common/health/ | Health monitoring (Section 4) |
libs/tradai-common/src/tradai/common/logger_mixin.py | Logging architecture (Section 2) |
infra/edge/modules/cloudwatch_alarms.py | CloudWatch integration (Section 8) |
| New CloudWatch namespace or alarm added | Metrics and alarms tables |