trading-heartbeat-check

Monitors live trading containers by checking the DynamoDB trading-state table for stale heartbeats. Sends an alert via SNS after a configurable number of consecutive failures.

Overview

| Property | Value |
| --- | --- |
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
| Settings class | HeartbeatCheckSettings (extends DynamoDBSettings) |

Input Schema

No specific input required (scheduled event).

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "active_strategies": 5,
            "stale": 1,
            "healthy": 4,
            "alerts_sent": 1,
            "latency_ms": 245.32
        }
    },
    "environment": "dev"
}

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| TRADING_STATE_TABLE | Yes | - | Trading state DynamoDB table (read via validation_alias) |
| DYNAMODB_TABLE_NAME | Yes | - | Health state table for failure tracking |
| HEARTBEAT_THRESHOLD_MINUTES | No | 2 | Stale threshold in minutes |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 2 | Consecutive failures before alerting |
| DYNAMODB_TTL_SECONDS | No | 86400 | State record TTL (24 hours default) |
| SNS_ALERTS_TOPIC_ARN | No | - | SNS topic for alerts |
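
As a rough illustration of how these variables and their defaults fit together, the sketch below loads them from a plain mapping using only the standard library. It is not the actual HeartbeatCheckSettings class: `HeartbeatEnv` and `load_env` are hypothetical names introduced here, and treating the thresholds as integers is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HeartbeatEnv:
    """Hypothetical stand-in for HeartbeatCheckSettings (illustrative only)."""

    trading_state_table: str
    dynamodb_table_name: str
    heartbeat_threshold_minutes: int = 2
    consecutive_failures_threshold: int = 2
    dynamodb_ttl_seconds: int = 86400
    sns_alerts_topic_arn: Optional[str] = None


def load_env(env: Mapping[str, str]) -> HeartbeatEnv:
    """Build settings from a mapping such as os.environ."""
    return HeartbeatEnv(
        # Required variables: a missing key raises KeyError.
        trading_state_table=env["TRADING_STATE_TABLE"],
        dynamodb_table_name=env["DYNAMODB_TABLE_NAME"],
        # Optional variables fall back to the documented defaults.
        heartbeat_threshold_minutes=int(env.get("HEARTBEAT_THRESHOLD_MINUTES", "2")),
        consecutive_failures_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "2")),
        dynamodb_ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "86400")),
        # An unset or empty topic ARN is treated as "no alert topic configured".
        sns_alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN") or None,
    )
```

In a Lambda handler this could be called as `load_env(os.environ)`.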

Active Strategy Filtering

The handler scans TRADING_STATE_TABLE and keeps only strategies in one of these statuses:

  • warmup - Container starting up
  • running - Actively trading

Other statuses (stopped, failed, terminated) are excluded via a DynamoDB FilterExpression. The handler uses a paginated scan with the low-level DynamoDB client (not the boto3 resource) so that tables larger than a single scan page are read in full.
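
The filtered, paginated scan described above might look roughly like the following. This is a minimal sketch, not the handler's actual code: `scan_active_strategies` is a hypothetical name, and the `client` argument is assumed to be a low-level DynamoDB client such as `boto3.client("dynamodb")`. Because `status` is a DynamoDB reserved word, the sketch aliases it through ExpressionAttributeNames.

```python
from typing import Any, Dict, List


def scan_active_strategies(client: Any, table_name: str) -> List[Dict[str, Any]]:
    """Return all items with status warmup or running, following pagination."""
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        # "status" is a reserved word in DynamoDB expressions, so alias it as #s.
        "FilterExpression": "#s IN (:warmup, :running)",
        "ExpressionAttributeNames": {"#s": "status"},
        # The low-level client expects typed attribute values.
        "ExpressionAttributeValues": {
            ":warmup": {"S": "warmup"},
            ":running": {"S": "running"},
        },
    }
    items: List[Dict[str, Any]] = []
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means the final page has been read.
            return items
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

Note that a FilterExpression is applied after each page is read, so an individual page may come back empty even when later pages contain matches; the loop therefore relies on LastEvaluatedKey rather than on the item count.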

ISO 8601 Timestamp Handling

Heartbeat timestamps are parsed with _parse_iso_timestamp(), which:

  • Handles both Z suffix and +00:00 offset formats (normalizes Z to +00:00)
  • Uses datetime.fromisoformat() for parsing
  • Returns None for unparseable timestamps (treated as stale)
  • Logs a warning for invalid timestamp formats
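
A parser with the behavior listed above could be sketched as follows. This is an approximation rather than the real `_parse_iso_timestamp()`: the public name `parse_iso_timestamp`, the log message, and the choice to treat offset-less timestamps as UTC are all assumptions made for this example.

```python
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)


def parse_iso_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None if it is missing or invalid."""
    if not value:
        return None
    # Normalize a trailing "Z" to "+00:00" before handing off to fromisoformat().
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        parsed = datetime.fromisoformat(normalized)
    except (ValueError, TypeError):
        logger.warning("Invalid heartbeat timestamp format: %r", value)
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Both `parse_iso_timestamp("2024-01-01T12:00:00Z")` and `parse_iso_timestamp("2024-01-01T12:00:00+00:00")` yield the same timezone-aware datetime, while an unparseable string yields None and is then treated as stale by the caller.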

Heartbeat Detection Flow

flowchart TD
    A[EventBridge Trigger] --> B[Query Active Strategies]
    B -->|Filter: status IN warmup, running| C{Strategies found?}
    C -->|No| D[Return empty summary]
    C -->|Yes| E[Check each heartbeat]
    E --> F{last_heartbeat present?}
    F -->|No| G[Mark as stale]
    F -->|Yes| H{Parse ISO timestamp}
    H -->|Parse failed| G
    H -->|Success| I{Older than threshold?}
    I -->|Yes| G
    I -->|No| J[Mark as healthy]
    G --> K[Increment failure count]
    J --> L[Reset failure count to 0]
    K --> M{Failures == threshold?}
    M -->|Yes, exactly 2| N[Send SNS Alert]
    M -->|No| O[Continue]
    L --> O
    N --> O
    O --> P{More strategies?}
    P -->|Yes| E
    P -->|No| Q[Publish metrics]
    Q --> R[Return summary]

Staleness Criteria

A strategy heartbeat is considered stale if any of the following holds:

1. The last_heartbeat timestamp is older than HEARTBEAT_THRESHOLD_MINUTES (default: 2 minutes)
2. The last_heartbeat field is missing (never recorded)
3. The last_heartbeat timestamp is unparseable (invalid format)
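
The three criteria reduce to a small predicate. The sketch below assumes the timestamp has already been parsed into a timezone-aware datetime, with None standing in for both the missing and the unparseable case; `is_stale` is a hypothetical helper name, and the strict "older than" comparison is an assumption about boundary behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_heartbeat: Optional[datetime],
    now: datetime,
    threshold_minutes: int = 2,
) -> bool:
    """Return True if the heartbeat is missing, unparseable, or too old."""
    if last_heartbeat is None:
        # Covers criteria 2 and 3: never recorded, or failed to parse.
        return True
    # Criterion 1: strictly older than the configured threshold.
    return now - last_heartbeat > timedelta(minutes=threshold_minutes)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=5), now))   # True
print(is_stale(now - timedelta(seconds=30), now))  # False
print(is_stale(None, now))                         # True
```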

Consecutive Failure Tracking

Failure counts are persisted through DynamoDBStateRepository using the HeartbeatState entity (from tradai.common.lambda_):

{
    "container_id": "trading-heartbeat:PascalStrategy",  # Namespaced key
    "strategy_name": "PascalStrategy",
    "last_heartbeat": "2024-01-01T12:00:00+00:00",
    "status": "stale",                                    # or "healthy"
    "consecutive_missed": 2
}

Key behaviors:

  • An alert is sent exactly when consecutive_missed == CONSECUTIVE_FAILURES_THRESHOLD (default: 2), preventing alert spam
  • Healthy strategies get their failure count reset to 0 with status "healthy"
  • The ConsecutiveHeartbeatFailures metric is published per strategy on each failure increment
  • The key format trading-heartbeat:{strategy_id} namespaces heartbeat state in the shared DynamoDB table
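
The alert-once behavior follows from comparing the counter to the threshold with equality rather than "greater than or equal". The sketch below shows that state transition in isolation; `next_failure_state` is a hypothetical function, not part of DynamoDBStateRepository or HeartbeatState.

```python
from typing import List, Tuple


def next_failure_state(
    consecutive_missed: int,
    stale: bool,
    threshold: int = 2,
) -> Tuple[int, str, bool]:
    """Return (new consecutive_missed, new status, whether to alert)."""
    if not stale:
        # A healthy check resets the counter and never alerts.
        return 0, "healthy", False
    missed = consecutive_missed + 1
    # Equality (not >=) means the alert fires once, on the check that
    # reaches the threshold, and stays silent on later stale checks.
    return missed, "stale", missed == threshold


# Walk one strategy through: stale, stale, stale, healthy, stale.
missed = 0
alerts: List[bool] = []
for stale in (True, True, True, False, True):
    missed, status, alert = next_failure_state(missed, stale)
    alerts.append(alert)
print(alerts)  # [False, True, False, False, False]
```

After the reset by the healthy check, a new run of two consecutive stale checks is needed before the next alert.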

CloudWatch Metrics

Namespace suffix: TradingHealth

| Metric | Dimensions | Description |
| --- | --- | --- |
| ActiveStrategies | Environment | Total strategies being monitored |
| StaleHeartbeats | Environment | Count of stale heartbeats detected |
| HealthyHeartbeats | Environment | Count of healthy heartbeats |
| ConsecutiveHeartbeatFailures | StrategyId, Environment | Per-strategy failure count |
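
For orientation, the metrics above could be assembled into a CloudWatch `MetricData` list along these lines. This is a sketch only: `build_metric_data` is a hypothetical helper, the `Count` unit is an assumption, and the full namespace (of which TradingHealth is the suffix) is not specified here, so it is left to the caller.

```python
from typing import Any, Dict, List, Mapping


def build_metric_data(
    environment: str,
    active: int,
    stale: int,
    healthy: int,
    failures_by_strategy: Mapping[str, int],
) -> List[Dict[str, Any]]:
    """Build a MetricData list in the shape accepted by put_metric_data."""
    env_dim = {"Name": "Environment", "Value": environment}

    def metric(name: str, value: int, dims: List[Dict[str, str]]) -> Dict[str, Any]:
        # Unit "Count" is an assumption; the source does not state the unit.
        return {"MetricName": name, "Dimensions": dims, "Value": float(value), "Unit": "Count"}

    data = [
        metric("ActiveStrategies", active, [env_dim]),
        metric("StaleHeartbeats", stale, [env_dim]),
        metric("HealthyHeartbeats", healthy, [env_dim]),
    ]
    # One per-strategy datapoint for each strategy whose counter was incremented.
    for strategy_id, count in failures_by_strategy.items():
        dims = [{"Name": "StrategyId", "Value": strategy_id}, env_dim]
        data.append(metric("ConsecutiveHeartbeatFailures", count, dims))
    return data
```

The result would be passed as `MetricData=` to the CloudWatch client's `put_metric_data` call together with the `Namespace=` value used by the deployment.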

EventBridge Schedule

{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:trading-heartbeat-check"
  }]
}
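
The schedule interval, the stale threshold, and the consecutive-failure threshold together determine how long after the last heartbeat an alert can fire. The arithmetic can be sketched as below; `alert_delay_window` is a hypothetical helper, and the calculation assumes every scheduled invocation runs on time and the strategy's failure counter starts at zero.

```python
from typing import Tuple


def alert_delay_window(
    schedule_minutes: float,
    threshold_minutes: float,
    consecutive: int,
) -> Tuple[float, float]:
    """Return (earliest, latest) minutes from the last heartbeat to the alert."""
    # The first stale detection happens on the first check after the heartbeat
    # age passes the threshold, i.e. between `threshold` and
    # `threshold + schedule` minutes after the last heartbeat.
    first_min = threshold_minutes
    first_max = threshold_minutes + schedule_minutes
    # Each additional required failure adds one full schedule interval.
    extra = (consecutive - 1) * schedule_minutes
    return first_min + extra, first_max + extra


# rate(5 minutes), 2-minute threshold, 2 consecutive failures (the defaults).
print(alert_delay_window(5, 2, 2))  # (7, 12)
```

Under those assumptions and the default settings, an alert would arrive roughly 7 to 12 minutes after a container's last heartbeat; a shorter schedule interval narrows that window.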

SNS Alert Format

TradAI Trading Heartbeat Alert

Environment: prod
Strategy ID: PascalStrategy
Instance ID: task-abc123
Status: STALE HEARTBEAT
Last Heartbeat: 2024-01-01T11:55:00Z
Staleness: 5.2 minutes
Threshold: 2 minutes

This alert was triggered after 2 consecutive
stale heartbeat detections. The trading container may have crashed or become
unresponsive.

Recommended Actions:
1. Check ECS task status for PascalStrategy
2. Review CloudWatch logs for errors
3. If container is unhealthy, consider manual restart
4. Check DynamoDB trading-state table for current status

Subject: [{ENV}] Trading Alert: {strategy_id} heartbeat stale

Message attributes: strategy_id, environment, severity=HIGH, alert_type=stale_heartbeat
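
The subject line and message attributes map onto the arguments of the SNS `publish` call roughly as follows. This is a sketch under assumptions: `build_alert_publish_kwargs` is a hypothetical helper, the environment name is inserted into the subject as given (the real handler may change its case), and the truncation reflects the SNS requirement that subjects be shorter than 100 characters.

```python
from typing import Any, Dict


def build_alert_publish_kwargs(
    topic_arn: str,
    environment: str,
    strategy_id: str,
    message: str,
) -> Dict[str, Any]:
    """Build keyword arguments for an SNS client's publish() call."""

    def string_attr(value: str) -> Dict[str, str]:
        # SNS message attributes carry an explicit data type.
        return {"DataType": "String", "StringValue": value}

    subject = f"[{environment}] Trading Alert: {strategy_id} heartbeat stale"
    return {
        "TopicArn": topic_arn,
        # SNS subjects must be fewer than 100 characters, so cap at 99.
        "Subject": subject[:99],
        "Message": message,
        "MessageAttributes": {
            "strategy_id": string_attr(strategy_id),
            "environment": string_attr(environment),
            "severity": string_attr("HIGH"),
            "alert_type": string_attr("stale_heartbeat"),
        },
    }
```

With a real client this would be used as `sns_client.publish(**build_alert_publish_kwargs(...))`. Since SNS_ALERTS_TOPIC_ARN is optional, a caller would presumably skip publishing when no topic is configured.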

Trading State Table Schema

Expected schema in TRADING_STATE_TABLE:

| Attribute | Type | Description |
| --- | --- | --- |
| strategy_id | S | Partition key |
| instance_id | S | ECS task instance |
| status | S | warmup, running, stopped, failed, terminated |
| last_heartbeat | S | ISO 8601 timestamp |
| started_at | S | Container start time |
| error | S | Error message if failed |
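
Because the handler reads this table with the low-level client, each item arrives with typed attribute values such as `{"S": "running"}`. The sketch below unpacks such an item into a plain record; `TradingStateRow` and `row_from_item` are hypothetical names, and treating every attribute except strategy_id and status as optional is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class TradingStateRow:
    """Illustrative view of one TRADING_STATE_TABLE item."""

    strategy_id: str
    status: str
    instance_id: Optional[str] = None
    last_heartbeat: Optional[str] = None
    started_at: Optional[str] = None
    error: Optional[str] = None


def _string_attr(item: Dict[str, Any], name: str) -> Optional[str]:
    # Low-level items wrap string values as {"S": "..."}; absent attributes map to None.
    attr = item.get(name)
    return attr.get("S") if attr else None


def row_from_item(item: Dict[str, Any]) -> TradingStateRow:
    """Convert a low-level DynamoDB item into a TradingStateRow."""
    return TradingStateRow(
        strategy_id=item["strategy_id"]["S"],  # partition key, always present
        status=item["status"]["S"],
        instance_id=_string_attr(item, "instance_id"),
        last_heartbeat=_string_attr(item, "last_heartbeat"),
        started_at=_string_attr(item, "started_at"),
        error=_string_attr(item, "error"),
    )
```

A row whose last_heartbeat comes back as None corresponds to the "never recorded" staleness case.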