trading-heartbeat-check
Monitors live trading containers by checking the DynamoDB trading-state table for stale heartbeats.
Overview
| Property | Value |
|---|---|
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
Input Schema
No specific input required (scheduled event).
Output Schema
{
  "success": true,
  "data": {
    "summary": {
      "active_strategies": 5,
      "stale": 1,
      "healthy": 4,
      "alerts_sent": 1,
      "latency_ms": 245.32
    }
  }
}
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| TRADING_STATE_TABLE | Yes | "tradai-trading-state" | Trading state DynamoDB table |
| DYNAMODB_TABLE_NAME | Yes | - | Health state table for failure tracking |
| HEARTBEAT_THRESHOLD_MINUTES | No | 2 | Stale threshold in minutes |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 2 | Failures before alerting |
| DYNAMODB_TTL_SECONDS | No | 604800 | State record TTL (7 days) |
| ALERT_SNS_TOPIC_ARN | Yes | - | SNS topic for alerts |
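As an illustration of how these variables might be read at cold start, here is a minimal sketch. The `HeartbeatConfig` class and `load_config` function are hypothetical names, not taken from the function's source. The sketch also assumes that the default listed for TRADING_STATE_TABLE acts as a fallback, which the table does not state explicitly.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical container for the documented environment variables."""
    trading_state_table: str
    health_state_table: str
    alert_sns_topic_arn: str
    heartbeat_threshold_minutes: int = 2
    consecutive_failures_threshold: int = 2
    dynamodb_ttl_seconds: int = 604800  # 7 days


def load_config(env=None):
    """Build a HeartbeatConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Variables documented as required with no default must be present.
    required = ("DYNAMODB_TABLE_NAME", "ALERT_SNS_TOPIC_ARN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing required environment variables: " + ", ".join(missing))
    return HeartbeatConfig(
        # Assumption: the documented default is used when the variable is unset.
        trading_state_table=env.get("TRADING_STATE_TABLE", "tradai-trading-state"),
        health_state_table=env["DYNAMODB_TABLE_NAME"],
        alert_sns_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        heartbeat_threshold_minutes=int(env.get("HEARTBEAT_THRESHOLD_MINUTES", 2)),
        consecutive_failures_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", 2)),
        dynamodb_ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", 604800)),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the loader easy to exercise in unit tests.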
Heartbeat Detection Flow
flowchart TD
A[EventBridge Trigger] --> B[Query Active Strategies]
B --> C{Strategies found?}
C -->|No| D[Return empty summary]
C -->|Yes| E[Check each heartbeat]
E --> F{Heartbeat fresh?}
F -->|Yes| G[Reset failure count]
F -->|No| H[Increment failure count]
H --> I{Threshold reached?}
I -->|Yes| J[Send SNS Alert]
I -->|No| K[Continue]
G --> K
J --> K
K --> L{More strategies?}
L -->|Yes| E
L -->|No| M[Publish metrics]
M --> N[Return summary]
Active Strategy Detection
Queries the trading-state table for strategies with status:
- warmup - Container starting up
- running - Actively trading

Uses a paginated DynamoDB scan with a filter expression.
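The paginated scan could look roughly like the sketch below. It assumes `table` behaves like a boto3 DynamoDB Table resource, and the function name `scan_active_strategies` is hypothetical. The attribute name is aliased because `status` is a DynamoDB reserved word.

```python
def scan_active_strategies(table):
    """Return every item whose status is warmup or running.

    `table` is assumed to expose scan(**kwargs) like a boto3 Table resource.
    """
    kwargs = {
        # "status" is a reserved word in DynamoDB expressions, hence the #s alias.
        "FilterExpression": "#s IN (:warmup, :running)",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":warmup": "warmup", ":running": "running"},
    }
    items = []
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means the scan has reached the end of the table.
            return items
        # Resume the scan from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

A scan applies the filter after reading each page, so a page can come back empty while still carrying a `LastEvaluatedKey`. The loop therefore keys off `LastEvaluatedKey` rather than the number of items returned.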
Staleness Criteria
A strategy heartbeat is considered stale if:
1. last_heartbeat timestamp is older than HEARTBEAT_THRESHOLD_MINUTES
2. last_heartbeat field is missing
3. last_heartbeat timestamp is unparseable
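The three criteria can be expressed as a single predicate. The sketch below is illustrative only: `is_stale` is a hypothetical name, naive timestamps are assumed to be UTC, and a heartbeat exactly at the threshold is treated as fresh, which is one reading of "older than".

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_heartbeat, now, threshold_minutes=2):
    """Return True if the heartbeat is missing, unparseable, or too old.

    `now` must be a timezone-aware datetime.
    """
    if not last_heartbeat:
        return True  # criterion 2: field is missing
    text = str(last_heartbeat)
    if text.endswith("Z"):
        # Normalize the UTC suffix so older Python versions can parse it too.
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return True  # criterion 3: timestamp is unparseable
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Criterion 1: older than the configured threshold.
    return now - parsed > timedelta(minutes=threshold_minutes)
```

Treating missing and unparseable values as stale means a container that never wrote a valid heartbeat is flagged rather than silently skipped.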
Consecutive Failure Tracking
Uses DynamoDBStateRepository with HeartbeatState entity:
{
  "container_id": "trading-heartbeat:PascalStrategy",
  "strategy_name": "PascalStrategy",
  "last_heartbeat": "2024-01-01T12:00:00Z",
  "status": "stale",
  "consecutive_missed": 2
}
Alerts are sent only when consecutive_missed equals CONSECUTIVE_FAILURES_THRESHOLD exactly, which avoids repeated alerts while a strategy stays stale.
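To make the exact-match rule concrete, here is a small sketch of the counter update. It is not the DynamoDBStateRepository implementation, and `record_check` is a hypothetical helper that returns the new count and whether this check should raise an alert.

```python
def record_check(consecutive_missed, stale, threshold=2):
    """Return (new_count, should_alert) after one heartbeat check."""
    if not stale:
        # A fresh heartbeat resets the failure count.
        return 0, False
    new_count = consecutive_missed + 1
    # Alert only on the check that reaches the threshold, not on later ones.
    return new_count, new_count == threshold
```

With the default threshold of 2 and a one-minute schedule, a strategy that stays stale for five consecutive runs would alert once, on the second run. A recovery followed by a new outage would alert again, because the counter is reset to zero in between.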
CloudWatch Metrics
| Metric | Description |
|---|---|
| ActiveStrategies | Total strategies being monitored |
| StaleHeartbeats | Count of stale heartbeats detected |
| HealthyHeartbeats | Count of healthy heartbeats |
| ConsecutiveHeartbeatFailures | Per-strategy failure count |
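One way to assemble these metrics for CloudWatch's `put_metric_data` call is sketched below. The helper name `build_metric_data` and the `StrategyName` dimension are assumptions. This page does not state the namespace or the dimensions the function actually uses.

```python
def build_metric_data(active, stale, healthy, failures_by_strategy):
    """Return a MetricData list in the shape put_metric_data accepts."""

    def count_metric(name, value, dimensions=None):
        metric = {"MetricName": name, "Value": float(value), "Unit": "Count"}
        if dimensions:
            metric["Dimensions"] = dimensions
        return metric

    data = [
        count_metric("ActiveStrategies", active),
        count_metric("StaleHeartbeats", stale),
        count_metric("HealthyHeartbeats", healthy),
    ]
    for strategy, failures in sorted(failures_by_strategy.items()):
        # Assumption: the per-strategy metric is keyed by a StrategyName dimension.
        dims = [{"Name": "StrategyName", "Value": strategy}]
        data.append(count_metric("ConsecutiveHeartbeatFailures", failures, dims))
    return data
```

The returned list would be passed as the `MetricData` argument to a boto3 CloudWatch client's `put_metric_data`, together with a `Namespace` chosen by the deployment.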
EventBridge Schedule
{
  "ScheduleExpression": "rate(1 minute)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:trading-heartbeat-check"
  }]
}
SNS Alert Format
TradAI Trading Heartbeat Alert
Environment: prod
Strategy ID: PascalStrategy
Instance ID: task-abc123
Status: STALE HEARTBEAT
Last Heartbeat: 2024-01-01T11:55:00Z
Staleness: 5.2 minutes
Threshold: 2 minutes
This alert was triggered after 2 consecutive
stale heartbeat detections. The trading container may have crashed or become
unresponsive.
Recommended Actions:
1. Check ECS task status for PascalStrategy
2. Review CloudWatch logs for errors
3. If container is unhealthy, consider manual restart
4. Check DynamoDB trading-state table for current status
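The header portion of the alert above could be produced along these lines. This is a sketch under stated assumptions: `format_alert` is a hypothetical helper, both datetimes are expected to be timezone-aware UTC values, and the recommended-actions section is omitted for brevity.

```python
from datetime import datetime, timezone


def format_alert(environment, strategy_id, instance_id, last_heartbeat, now,
                 threshold_minutes=2, consecutive=2):
    """Return the alert body as one newline-joined string."""
    # Staleness is the age of the last heartbeat, expressed in minutes.
    staleness_minutes = (now - last_heartbeat).total_seconds() / 60
    lines = [
        "TradAI Trading Heartbeat Alert",
        f"Environment: {environment}",
        f"Strategy ID: {strategy_id}",
        f"Instance ID: {instance_id}",
        "Status: STALE HEARTBEAT",
        # Assumption: last_heartbeat is in UTC, so a literal Z suffix is accurate.
        f"Last Heartbeat: {last_heartbeat.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"Staleness: {staleness_minutes:.1f} minutes",
        f"Threshold: {threshold_minutes} minutes",
        f"This alert was triggered after {consecutive} consecutive "
        "stale heartbeat detections.",
    ]
    return "\n".join(lines)
```

The resulting string would typically be sent as the `Message` argument of a boto3 SNS client's `publish` call, with `TopicArn` set from ALERT_SNS_TOPIC_ARN.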
Trading State Table Schema
Expected schema in trading-state table:
| Attribute | Type | Description |
|---|---|---|
| strategy_id | S | Partition key |
| instance_id | S | ECS task instance |
| status | S | warmup, running, stopped, failed |
| last_heartbeat | S | ISO 8601 timestamp |
| started_at | S | Container start time |
| error | S | Error message if failed |
Related
- health-check - Service-level health checks
- orphan-scanner - Orphaned task detection