
trading-heartbeat-check

Monitors live trading containers by checking the DynamoDB trading-state table for stale heartbeats.

Overview

| Property | Value |
| --- | --- |
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |

Input Schema

No specific input required (scheduled event).

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "active_strategies": 5,
            "stale": 1,
            "healthy": 4,
            "alerts_sent": 1,
            "latency_ms": 245.32
        }
    }
}

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| TRADING_STATE_TABLE | Yes | "tradai-trading-state" | Trading state DynamoDB table |
| DYNAMODB_TABLE_NAME | Yes | - | Health state table for failure tracking |
| HEARTBEAT_THRESHOLD_MINUTES | No | 2 | Stale threshold in minutes |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 2 | Failures before alerting |
| DYNAMODB_TTL_SECONDS | No | 604800 | State record TTL (7 days) |
| ALERT_SNS_TOPIC_ARN | Yes | - | SNS topic for alerts |
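
As an illustration of how these variables might be read at cold start, here is a minimal sketch. The `HeartbeatConfig` class and `load_config` helper are hypothetical names, not taken from the Lambda's source. TRADING_STATE_TABLE is listed as required but also has a default, so the sketch assumes the default acts as a fallback.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical container for the settings in the table above."""
    trading_state_table: str
    health_state_table: str
    alert_topic_arn: str
    heartbeat_threshold_minutes: int
    consecutive_failures_threshold: int
    ttl_seconds: int


def load_config(env: Mapping[str, str] = os.environ) -> HeartbeatConfig:
    """Read settings from the environment, applying the documented defaults."""
    # Variables without a default must be present; fail fast if they are not.
    missing = [n for n in ("DYNAMODB_TABLE_NAME", "ALERT_SNS_TOPIC_ARN") if not env.get(n)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    return HeartbeatConfig(
        # Assumption: the documented default is used when the variable is unset.
        trading_state_table=env.get("TRADING_STATE_TABLE", "tradai-trading-state"),
        health_state_table=env["DYNAMODB_TABLE_NAME"],
        alert_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        heartbeat_threshold_minutes=int(env.get("HEARTBEAT_THRESHOLD_MINUTES", "2")),
        consecutive_failures_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "2")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "604800")),
    )
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in unit tests.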

Heartbeat Detection Flow

flowchart TD
    A[EventBridge Trigger] --> B[Query Active Strategies]
    B --> C{Strategies found?}
    C -->|No| D[Return empty summary]
    C -->|Yes| E[Check each heartbeat]
    E --> F{Heartbeat fresh?}
    F -->|Yes| G[Reset failure count]
    F -->|No| H[Increment failure count]
    H --> I{Threshold reached?}
    I -->|Yes| J[Send SNS Alert]
    I -->|No| K[Continue]
    G --> K
    J --> K
    K --> L{More strategies?}
    L -->|Yes| E
    L -->|No| M[Publish metrics]
    M --> N[Return summary]

Active Strategy Detection

Queries the trading-state table for strategies with status:

- warmup: Container starting up
- running: Actively trading

Uses a paginated DynamoDB scan with a filter expression.
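
A paginated scan of this kind could look roughly like the sketch below. It assumes `table` behaves like a boto3 DynamoDB `Table` resource (anything with a compatible `scan` method works), and the function name `iter_active_strategies` is hypothetical. Because `status` is a DynamoDB reserved word, the filter refers to it through the `#s` placeholder.

```python
from typing import Any, Dict, Iterator


def iter_active_strategies(table: Any) -> Iterator[Dict[str, Any]]:
    """Yield every item whose status is warmup or running, following pagination."""
    kwargs: Dict[str, Any] = {
        "FilterExpression": "#s IN (:warmup, :running)",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":warmup": "warmup", ":running": "running"},
    }
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        # DynamoDB returns LastEvaluatedKey only when more pages remain.
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
```

Note that a filter expression is applied after each page is read, so a page can come back empty while later pages still contain matches. The loop therefore stops on the absence of `LastEvaluatedKey`, not on an empty `Items` list.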

Staleness Criteria

A strategy heartbeat is considered stale if:

1. last_heartbeat timestamp is older than HEARTBEAT_THRESHOLD_MINUTES
2. last_heartbeat field is missing
3. last_heartbeat timestamp is unparseable
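
The three criteria can be expressed as a single predicate. The following is a sketch only: the function name `is_stale` is hypothetical, and treating a timestamp without a timezone as UTC is an assumption, not documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_heartbeat: Optional[str], now: datetime, threshold_minutes: int = 2) -> bool:
    """Return True if the heartbeat is missing, unparseable, or too old."""
    if not last_heartbeat:
        return True  # criterion 2: field is missing
    try:
        # Normalize a trailing "Z" so older Python versions can parse it too.
        ts = datetime.fromisoformat(last_heartbeat.replace("Z", "+00:00"))
    except (ValueError, AttributeError, TypeError):
        return True  # criterion 3: timestamp is unparseable
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    # Criterion 1: strictly older than the threshold.
    return now - ts > timedelta(minutes=threshold_minutes)
```

For example, with `now` at 2024-01-01T12:00:00Z and the default threshold, a heartbeat of `"2024-01-01T11:55:00Z"` is stale and one of `"2024-01-01T11:59:00Z"` is not.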

Consecutive Failure Tracking

Uses DynamoDBStateRepository with HeartbeatState entity:

{
    "container_id": "trading-heartbeat:PascalStrategy",
    "strategy_name": "PascalStrategy",
    "last_heartbeat": "2024-01-01T12:00:00Z",
    "status": "stale",
    "consecutive_missed": 2
}

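Alerts are sent only when consecutive_missed equals CONSECUTIVE_FAILURES_THRESHOLD exactly. Because the counter keeps increasing while a strategy stays stale, this produces one alert per outage rather than one per check.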
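
The counting rule can be sketched as a pure function. The name `record_check` is hypothetical, and persistence through DynamoDBStateRepository is left out.

```python
from typing import Tuple


def record_check(previous_missed: int, stale: bool, threshold: int = 2) -> Tuple[int, bool]:
    """Return (new consecutive_missed, whether to alert) for one heartbeat check."""
    if not stale:
        return 0, False  # a fresh heartbeat resets the counter
    missed = previous_missed + 1
    # Equality, not >=, so later checks in the same outage do not re-alert.
    return missed, missed == threshold
```

With the default threshold of 2, four stale checks in a row yield counts 1, 2, 3, 4 and a single alert on the second check. A healthy check then resets the count to 0, so a later outage can alert again.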

CloudWatch Metrics

| Metric | Description |
| --- | --- |
| ActiveStrategies | Total strategies being monitored |
| StaleHeartbeats | Count of stale heartbeats detected |
| HealthyHeartbeats | Count of healthy heartbeats |
| ConsecutiveHeartbeatFailures | Per-strategy failure count |
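
As a rough illustration, the metrics above could be assembled into a `MetricData` list for CloudWatch's `put_metric_data` call. The helper name `build_metric_data` and the `Strategy` dimension name are assumptions made for this sketch. The namespace the Lambda publishes to is not documented here.

```python
from typing import Any, Dict, List, Mapping


def build_metric_data(
    active: int,
    stale: int,
    healthy: int,
    failures_by_strategy: Mapping[str, int],
) -> List[Dict[str, Any]]:
    """Build CloudWatch MetricData entries for one heartbeat check run."""
    data: List[Dict[str, Any]] = [
        {"MetricName": "ActiveStrategies", "Value": active, "Unit": "Count"},
        {"MetricName": "StaleHeartbeats", "Value": stale, "Unit": "Count"},
        {"MetricName": "HealthyHeartbeats", "Value": healthy, "Unit": "Count"},
    ]
    # One datapoint per strategy; the dimension name is an assumption.
    for strategy, count in sorted(failures_by_strategy.items()):
        data.append({
            "MetricName": "ConsecutiveHeartbeatFailures",
            "Dimensions": [{"Name": "Strategy", "Value": strategy}],
            "Value": count,
            "Unit": "Count",
        })
    return data
```

The result can be passed as the `MetricData` argument of a boto3 CloudWatch client's `put_metric_data`, together with whichever `Namespace` the deployment uses.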

EventBridge Schedule

{
  "ScheduleExpression": "rate(1 minute)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:trading-heartbeat-check"
  }]
}

SNS Alert Format

TradAI Trading Heartbeat Alert

Environment: prod
Strategy ID: PascalStrategy
Instance ID: task-abc123
Status: STALE HEARTBEAT
Last Heartbeat: 2024-01-01T11:55:00Z
Staleness: 5.2 minutes
Threshold: 2 minutes

This alert was triggered after 2 consecutive
stale heartbeat detections. The trading container may have crashed or become
unresponsive.

Recommended Actions:
1. Check ECS task status for PascalStrategy
2. Review CloudWatch logs for errors
3. If container is unhealthy, consider manual restart
4. Check DynamoDB trading-state table for current status
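
A message in this format could be assembled along the following lines. This is a sketch: `format_alert` is a hypothetical helper, timestamps are assumed to be timezone-aware UTC values, and the recommended-actions section is omitted for brevity.

```python
from datetime import datetime, timezone


def format_alert(
    environment: str,
    strategy_id: str,
    instance_id: str,
    last_heartbeat: datetime,
    now: datetime,
    threshold_minutes: int = 2,
    consecutive: int = 2,
) -> str:
    """Render the header and summary portion of the heartbeat alert."""
    staleness_minutes = (now - last_heartbeat).total_seconds() / 60
    return "\n".join([
        "TradAI Trading Heartbeat Alert",
        "",
        f"Environment: {environment}",
        f"Strategy ID: {strategy_id}",
        f"Instance ID: {instance_id}",
        "Status: STALE HEARTBEAT",
        # Assumes last_heartbeat is in UTC, hence the literal "Z" suffix.
        f"Last Heartbeat: {last_heartbeat.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"Staleness: {staleness_minutes:.1f} minutes",
        f"Threshold: {threshold_minutes} minutes",
        "",
        f"This alert was triggered after {consecutive} consecutive stale heartbeat detections.",
    ])
```

The returned string can serve as the `Message` argument of an SNS `publish` call to the topic in ALERT_SNS_TOPIC_ARN.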

Trading State Table Schema

Expected schema in trading-state table:

| Attribute | Type | Description |
| --- | --- | --- |
| strategy_id | S | Partition key |
| instance_id | S | ECS task instance |
| status | S | warmup, running, stopped, failed |
| last_heartbeat | S | ISO 8601 timestamp |
| started_at | S | Container start time |
| error | S | Error message if failed |