trading-heartbeat-check
Monitors live trading containers by checking the DynamoDB trading-state table for stale heartbeats, and alerts via SNS after consecutive failures.
Overview
| Property | Value |
|---|---|
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
| Settings class | HeartbeatCheckSettings (extends DynamoDBSettings) |
Input Schema
No specific input required (scheduled event).
Output Schema
{
"success": true,
"data": {
"summary": {
"active_strategies": 5,
"stale": 1,
"healthy": 4,
"alerts_sent": 1,
"latency_ms": 245.32
}
},
"environment": "dev"
}
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| TRADING_STATE_TABLE | Yes | - | Trading state DynamoDB table (read via validation_alias) |
| DYNAMODB_TABLE_NAME | Yes | - | Health state table for failure tracking |
| HEARTBEAT_THRESHOLD_MINUTES | No | 2 | Stale threshold in minutes |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 2 | Consecutive failures before alerting |
| DYNAMODB_TTL_SECONDS | No | 86400 | State record TTL (24 hours default) |
| SNS_ALERTS_TOPIC_ARN | No | - | SNS topic for alerts |
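As a rough illustration of how these variables and their defaults fit together, the sketch below loads them into a plain dataclass. It is a stand-in and not the real HeartbeatCheckSettings class: the name `HeartbeatEnv`, its field names, the integer types, and the `from_env` helper are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HeartbeatEnv:
    """Hypothetical stand-in for HeartbeatCheckSettings (not the real class)."""

    trading_state_table: str
    dynamodb_table_name: str
    heartbeat_threshold_minutes: int = 2
    consecutive_failures_threshold: int = 2
    dynamodb_ttl_seconds: int = 86400
    sns_alerts_topic_arn: Optional[str] = None

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "HeartbeatEnv":
        # The two required variables raise KeyError when they are missing.
        return cls(
            trading_state_table=env["TRADING_STATE_TABLE"],
            dynamodb_table_name=env["DYNAMODB_TABLE_NAME"],
            # Optional variables fall back to the documented defaults.
            heartbeat_threshold_minutes=int(env.get("HEARTBEAT_THRESHOLD_MINUTES", 2)),
            consecutive_failures_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", 2)),
            dynamodb_ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", 86400)),
            sns_alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN"),
        )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a unit test.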
Active Strategy Filtering
The handler reads TRADING_STATE_TABLE and keeps only strategies with one of these statuses:
- warmup - Container starting up
- running - Actively trading
Other statuses (stopped, failed, terminated) are excluded via a DynamoDB FilterExpression. The handler uses a paginated scan with the low-level DynamoDB client (not the boto3 resource) to handle large tables.
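A minimal sketch of such a paginated, filtered scan is shown below. The function name `scan_active_strategies` is hypothetical, and the client is passed in as a parameter so the loop can be exercised without AWS access; in a Lambda it would typically be `boto3.client("dynamodb")`. The actual handler may structure this differently.

```python
from typing import Any, Dict, Iterator


def scan_active_strategies(client: Any, table_name: str) -> Iterator[Dict[str, Any]]:
    """Yield items whose status is warmup or running, following pagination."""
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        # "status" is a DynamoDB reserved word, so it is aliased as #s.
        "FilterExpression": "#s IN (:warmup, :running)",
        "ExpressionAttributeNames": {"#s": "status"},
        # The low-level client expects typed attribute values.
        "ExpressionAttributeValues": {
            ":warmup": {"S": "warmup"},
            ":running": {"S": "running"},
        },
    }
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        # A missing LastEvaluatedKey means the final page has been read.
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
```

Note that a FilterExpression is applied after each page is read, so a page can legitimately come back with no items while still carrying a LastEvaluatedKey; the loop keeps going in that case.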
ISO 8601 Timestamp Handling
Heartbeat timestamps are parsed using _parse_iso_timestamp(), which:
- Handles both Z suffix and +00:00 offset formats (normalizes Z to +00:00)
- Uses datetime.fromisoformat() for parsing
- Returns None for unparseable timestamps (treated as stale)
- Logs a warning for invalid timestamp formats
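The behavior listed above can be sketched roughly as follows. This is an illustrative reimplementation, not the actual _parse_iso_timestamp() source; the handling of naive timestamps (treated as UTC) and of non-string input is an assumption added so that later comparisons against an aware "now" do not fail.

```python
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)


def parse_iso_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None if it cannot be parsed."""
    if not isinstance(value, str) or not value:
        return None
    # Normalize a trailing "Z" to an explicit UTC offset before parsing.
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        parsed = datetime.fromisoformat(normalized)
    except ValueError:
        logger.warning("Invalid heartbeat timestamp format: %r", value)
        return None
    # Assumption: naive timestamps are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```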
Heartbeat Detection Flow
flowchart TD
A[EventBridge Trigger] --> B[Query Active Strategies]
B -->|Filter: status IN warmup, running| C{Strategies found?}
C -->|No| D[Return empty summary]
C -->|Yes| E[Check each heartbeat]
E --> F{last_heartbeat present?}
F -->|No| G[Mark as stale]
F -->|Yes| H{Parse ISO timestamp}
H -->|Parse failed| G
H -->|Success| I{Older than threshold?}
I -->|Yes| G
I -->|No| J[Mark as healthy]
G --> K[Increment failure count]
J --> L[Reset failure count to 0]
K --> M{Failures == threshold?}
M -->|Yes, exactly 2| N[Send SNS Alert]
M -->|No| O[Continue]
L --> O
N --> O
O --> P{More strategies?}
P -->|Yes| E
P -->|No| Q[Publish metrics]
Q --> R[Return summary]
Staleness Criteria
A strategy heartbeat is considered stale if:
1. last_heartbeat timestamp is older than HEARTBEAT_THRESHOLD_MINUTES (default: 2 minutes)
2. last_heartbeat field is missing (never recorded)
3. last_heartbeat timestamp is unparseable (invalid format)
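The three criteria reduce to a small predicate. The sketch below assumes the timestamp has already been parsed into a timezone-aware datetime, with None standing in for both the missing and the unparseable case; the function name `is_stale` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_heartbeat: Optional[datetime],
    now: datetime,
    threshold_minutes: float = 2,
) -> bool:
    """Return True when the heartbeat should be counted as stale."""
    # Criteria 2 and 3: missing or unparseable timestamps arrive as None.
    if last_heartbeat is None:
        return True
    # Criterion 1: strictly older than the configured threshold.
    return now - last_heartbeat > timedelta(minutes=threshold_minutes)
```

Passing `now` explicitly, rather than calling `datetime.now(timezone.utc)` inside the function, keeps the check deterministic in tests.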
Consecutive Failure Tracking
Failure counts are persisted using DynamoDBStateRepository with the HeartbeatState entity (from tradai.common.lambda_). An example record:
{
"container_id": "trading-heartbeat:PascalStrategy", # Namespaced key
"strategy_name": "PascalStrategy",
"last_heartbeat": "2024-01-01T12:00:00+00:00",
"status": "stale", # or "healthy"
"consecutive_missed": 2
}
Key behaviors:
- Alert sent exactly when consecutive_missed == CONSECUTIVE_FAILURES_THRESHOLD (default: 2), preventing alert spam
- Healthy strategies get their failure count reset to 0 with status "healthy"
- ConsecutiveHeartbeatFailures metric published per-strategy on each failure increment
- Key format trading-heartbeat:{strategy_id} namespaces heartbeat state in the shared DynamoDB table
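To make the alert-once behavior concrete, here is a minimal sketch that uses an in-memory dict in place of DynamoDBStateRepository. The function name `record_check` and the dict-based store are hypothetical; only the equality check and the key format come from the description above.

```python
from typing import Dict


def record_check(
    counts: Dict[str, int],
    strategy_id: str,
    stale: bool,
    threshold: int = 2,
) -> bool:
    """Update the consecutive-miss count and return True if an alert is due."""
    key = f"trading-heartbeat:{strategy_id}"  # namespaced key
    if not stale:
        # A healthy heartbeat resets the streak.
        counts[key] = 0
        return False
    counts[key] = counts.get(key, 0) + 1
    # Equality (not >=) means one alert per streak, not one per check.
    return counts[key] == threshold
```

With the default threshold, three stale checks in a row produce an alert only on the second one; a later healthy check resets the count, so a new streak can alert again.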
CloudWatch Metrics
Namespace suffix: TradingHealth
| Metric | Dimensions | Description |
|---|---|---|
| ActiveStrategies | Environment | Total strategies being monitored |
| StaleHeartbeats | Environment | Count of stale heartbeats detected |
| HealthyHeartbeats | Environment | Count of healthy heartbeats |
| ConsecutiveHeartbeatFailures | StrategyId, Environment | Per-strategy failure count |
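For orientation, the sketch below builds a payload in the shape accepted by the MetricData parameter of CloudWatch's put_metric_data, using the metric and dimension names from the table. The function name `build_metric_data` and the "Count" unit are assumptions; the full namespace is not shown here because only its suffix is documented.

```python
from typing import Any, Dict, List, Mapping


def build_metric_data(
    environment: str,
    active: int,
    stale: int,
    healthy: int,
    failures_by_strategy: Mapping[str, int],
) -> List[Dict[str, Any]]:
    """Build a MetricData list for the four heartbeat metrics."""
    env_dim = {"Name": "Environment", "Value": environment}

    def metric(name: str, value: int, dims: List[Dict[str, str]]) -> Dict[str, Any]:
        # Unit "Count" is an assumption, not taken from the handler.
        return {"MetricName": name, "Dimensions": dims, "Value": value, "Unit": "Count"}

    data = [
        metric("ActiveStrategies", active, [env_dim]),
        metric("StaleHeartbeats", stale, [env_dim]),
        metric("HealthyHeartbeats", healthy, [env_dim]),
    ]
    # One per-strategy datapoint for each strategy with a failure count.
    for strategy_id, count in failures_by_strategy.items():
        strategy_dim = {"Name": "StrategyId", "Value": strategy_id}
        data.append(metric("ConsecutiveHeartbeatFailures", count, [strategy_dim, env_dim]))
    return data
```

The resulting list could then be passed as `MetricData` to a `boto3.client("cloudwatch").put_metric_data(...)` call, together with whatever namespace the deployment uses.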
EventBridge Schedule
{
"ScheduleExpression": "rate(5 minutes)",
"Targets": [{
"Arn": "arn:aws:lambda:...:trading-heartbeat-check"
}]
}
SNS Alert Format
TradAI Trading Heartbeat Alert
Environment: prod
Strategy ID: PascalStrategy
Instance ID: task-abc123
Status: STALE HEARTBEAT
Last Heartbeat: 2024-01-01T11:55:00Z
Staleness: 5.2 minutes
Threshold: 2 minutes
This alert was triggered after 2 consecutive
stale heartbeat detections. The trading container may have crashed or become
unresponsive.
Recommended Actions:
1. Check ECS task status for PascalStrategy
2. Review CloudWatch logs for errors
3. If container is unhealthy, consider manual restart
4. Check DynamoDB trading-state table for current status
Subject: [{ENV}] Trading Alert: {strategy_id} heartbeat stale
Message attributes: strategy_id, environment, severity=HIGH, alert_type=stale_heartbeat
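The subject and message attributes map onto the arguments of the SNS publish call roughly as sketched below. The function name `build_alert_publish_kwargs` is hypothetical, and rendering {ENV} as the upper-cased environment name is an assumption.

```python
from typing import Any, Dict


def build_alert_publish_kwargs(
    topic_arn: str,
    environment: str,
    strategy_id: str,
    message: str,
) -> Dict[str, Any]:
    """Build keyword arguments for an SNS publish call."""

    def attr(value: str) -> Dict[str, str]:
        # SNS message attributes carry an explicit data type.
        return {"DataType": "String", "StringValue": value}

    return {
        "TopicArn": topic_arn,
        # Assumption: {ENV} is the upper-cased environment name.
        "Subject": f"[{environment.upper()}] Trading Alert: {strategy_id} heartbeat stale",
        "Message": message,
        "MessageAttributes": {
            "strategy_id": attr(strategy_id),
            "environment": attr(environment),
            "severity": attr("HIGH"),
            "alert_type": attr("stale_heartbeat"),
        },
    }
```

The dict can be unpacked into `boto3.client("sns").publish(**kwargs)`. Since SNS_ALERTS_TOPIC_ARN is optional, a caller would presumably skip publishing when it is unset.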
Trading State Table Schema
Expected schema in TRADING_STATE_TABLE:
| Attribute | Type | Description |
|---|---|---|
| strategy_id | S | Partition key |
| instance_id | S | ECS task instance |
| status | S | warmup, running, stopped, failed, terminated |
| last_heartbeat | S | ISO 8601 timestamp |
| started_at | S | Container start time |
| error | S | Error message if failed |
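Because the handler reads this table with the low-level client, each item comes back in DynamoDB's typed attribute-value form rather than as plain strings. Since every attribute in the schema is of type S, a small helper is enough to flatten an item; the name `flatten_string_item` is hypothetical.

```python
from typing import Any, Dict


def flatten_string_item(item: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Convert {"attr": {"S": "value"}} into {"attr": "value"}.

    Attributes that are not of type S are skipped.
    """
    return {name: typed["S"] for name, typed in item.items() if "S" in typed}


# Example item in the shape returned by the low-level client.
raw_item = {
    "strategy_id": {"S": "PascalStrategy"},
    "instance_id": {"S": "task-abc123"},
    "status": {"S": "running"},
    "last_heartbeat": {"S": "2024-01-01T12:00:00+00:00"},
}
flat_item = flatten_string_item(raw_item)
```

For tables with mixed attribute types, boto3's `TypeDeserializer` (in `boto3.dynamodb.types`) handles the general case.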
Related
- health-check - Service-level health checks
- orphan-scanner - Orphaned task detection