
health-check

Performs periodic health checks on all ECS services via Service Discovery.

Overview

Property	Value
Trigger	EventBridge scheduled event
Runtime	Python 3.11
Timeout	60 seconds
Memory	256 MB

Input Schema

{
    "services": [
        {
            "name": "backend-api",
            "port": 8000,
            "path": "/api/v1/health"
        },
        {
            "name": "data-collection",
            "port": 8002,
            "path": "/api/v1/health"
        },
        {
            "name": "strategy-service",
            "port": 8003,
            "path": "/api/v1/health"
        },
        {
            "name": "mlflow",
            "port": 5000,
            "path": "/mlflow/health"
        }
    ]
}

Output Schema

{
    "summary": {
        "healthy": 3,
        "total": 4
    },
    "results": [
        {
            "service": "backend-api",
            "healthy": true,
            "status_code": 200,
            "latency_ms": 45.2,
            "timestamp": "2024-01-01T12:00:00Z"
        },
        {
            "service": "mlflow",
            "healthy": false,
            "status_code": 503,
            "latency_ms": 1500.0,
            "error": "Service unavailable",
            "timestamp": "2024-01-01T12:00:00Z"
        }
    ]
}
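
The relationship between `summary` and `results` can be sketched as follows. This is a minimal illustration, assuming `summary` simply counts over the full `results` list (the example above appears to list only two of the four entries); the function name `summarize` is hypothetical and not taken from the Lambda's source.

```python
from typing import Any


def summarize(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap per-service results in the documented output shape."""
    # Assumption: a result counts as healthy only when its "healthy" flag is true.
    healthy = sum(1 for r in results if r.get("healthy") is True)
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }


example = summarize([
    {"service": "backend-api", "healthy": True, "status_code": 200},
    {"service": "mlflow", "healthy": False, "status_code": 503},
])
print(example["summary"])  # {'healthy': 1, 'total': 2}
```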

Environment Variables

Variable	Required	Default	Description
`SERVICE_DISCOVERY_NAMESPACE`	No	"tradai.local"	Cloud Map namespace
`HEALTH_CHECK_TIMEOUT`	No	10	HTTP timeout (seconds)
`DYNAMODB_TABLE_NAME`	Yes	-	State repository table
`SNS_ALERTS_TOPIC_ARN`	No	-	SNS topic for alerts
`CONSECUTIVE_FAILURES_THRESHOLD`	No	2	Failures before alert
`DYNAMODB_TTL_SECONDS`	No	86400	State record TTL (24 hours)
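
One way to read the variables in the table with their documented defaults is sketched below. The `Settings` class and `load_settings` function are hypothetical names introduced for illustration; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    namespace: str
    timeout_seconds: float
    table_name: str
    alerts_topic_arn: Optional[str]
    failure_threshold: int
    ttl_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration, applying the defaults listed in the table."""
    return Settings(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        timeout_seconds=float(env.get("HEALTH_CHECK_TIMEOUT", "10")),
        # Required: a missing value raises KeyError instead of failing later.
        table_name=env["DYNAMODB_TABLE_NAME"],
        # Optional: an unset or empty value is treated as "no topic configured".
        alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN") or None,
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "2")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "86400")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a unit test.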

Service Discovery DNS

Services are resolved via Cloud Map DNS:

http://{service-name}.{namespace}:{port}{path}

Example: http://backend-api.tradai.local:8000/api/v1/health
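
The URL pattern above can be composed from one entry of the input schema as in this small sketch. The helper name `build_health_url` is hypothetical, and normalizing a path without a leading slash is an added convenience rather than documented behavior.

```python
def build_health_url(name: str, port: int, path: str,
                     namespace: str = "tradai.local") -> str:
    """Compose http://{service-name}.{namespace}:{port}{path}."""
    # Assumption: a path given without a leading slash is normalized.
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{name}.{namespace}:{port}{path}"


entry = {"name": "backend-api", "port": 8000, "path": "/api/v1/health"}
print(build_health_url(entry["name"], entry["port"], entry["path"]))
# http://backend-api.tradai.local:8000/api/v1/health
```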

CloudWatch Metrics

Metric	Description
`ServiceHealthy`	1 if healthy, 0 if not (per service)
`HealthCheckLatency`	Response time in ms
`ConsecutiveFailures`	Count of consecutive failures
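
As a rough sketch, the three metrics could be assembled per service in the list-of-dicts shape that boto3's CloudWatch `put_metric_data` accepts as `MetricData`. The dimension name `Service`, the chosen units, and the helper name `build_metric_data` are assumptions; this page does not state the metric namespace or dimensions.

```python
from typing import Any


def build_metric_data(service: str, healthy: bool, latency_ms: float,
                      consecutive_failures: int) -> list[dict[str, Any]]:
    """Build one datum per documented metric for a single service."""
    # Assumption: metrics are split per service by a "Service" dimension.
    dimensions = [{"Name": "Service", "Value": service}]
    return [
        {"MetricName": "ServiceHealthy", "Dimensions": dimensions,
         "Value": 1.0 if healthy else 0.0, "Unit": "Count"},
        {"MetricName": "HealthCheckLatency", "Dimensions": dimensions,
         "Value": float(latency_ms), "Unit": "Milliseconds"},
        {"MetricName": "ConsecutiveFailures", "Dimensions": dimensions,
         "Value": float(consecutive_failures), "Unit": "Count"},
    ]


data = build_metric_data("mlflow", healthy=False, latency_ms=1500.0,
                         consecutive_failures=3)
# The list could then be sent with a boto3 CloudWatch client, for example:
#   cloudwatch.put_metric_data(Namespace="<your namespace>", MetricData=data)
```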

Key Features¶

Uses Service Discovery DNS resolution
Tracks consecutive failures to avoid alert spam
Resets failure count on successful health check
Distinguishes between connection errors and HTTP status codes
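
The last three features can be illustrated with a standard-library sketch. It is not the Lambda's actual code: `probe` and `update_failures` are hypothetical names, the in-memory dict stands in for the DynamoDB state table, and alerting when the count exceeds the threshold follows the wording in the Alert Format section.

```python
import time
import urllib.error
import urllib.request
from typing import Any, Callable


def probe(url: str, timeout: float = 10.0,
          opener: Callable[..., Any] = urllib.request.urlopen) -> dict[str, Any]:
    """Check one URL, separating HTTP error statuses from connection errors."""
    started = time.perf_counter()
    result: dict[str, Any] = {"healthy": False, "status_code": None}
    try:
        with opener(url, timeout=timeout) as resp:
            result["status_code"] = resp.status
            result["healthy"] = 200 <= resp.status < 300
    except urllib.error.HTTPError as exc:
        # The service answered, but with a 4xx/5xx status code.
        result["status_code"] = exc.code
        result["error"] = f"HTTP {exc.code}"
    except OSError as exc:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        # (URLError and TimeoutError are both subclasses of OSError.)
        result["error"] = f"Connection error: {exc}"
    result["latency_ms"] = round((time.perf_counter() - started) * 1000, 1)
    return result


def update_failures(state: dict[str, int], service: str, healthy: bool,
                    threshold: int = 2) -> tuple[int, bool]:
    """Reset the counter on success, increment it on failure.

    Returns (count, should_alert); alerting only past the threshold is what
    keeps a single failed check from producing a notification.
    """
    count = 0 if healthy else state.get(service, 0) + 1
    state[service] = count
    return count, count > threshold


state: dict[str, int] = {}
for healthy in (False, False, False, True):
    print(update_failures(state, "mlflow", healthy))
# (1, False)
# (2, False)
# (3, True)
# (0, False)
```

The `opener` parameter exists only so the HTTP call can be replaced in tests; in normal use `probe(url, timeout)` calls `urllib.request.urlopen`.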

EventBridge Schedule

{
  "ScheduleExpression": "rate(2 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}

Alert Format

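When a service's consecutive failure count exceeds `CONSECUTIVE_FAILURES_THRESHOLD`, an alert in the following format is sent to the SNS alerts topic: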

Subject: [DEV] Service Alert: backend-api is unhealthy

TradAI Service Health Alert

Environment: dev
Service: backend-api
Status: UNHEALTHY
Consecutive Failures: 3
Timestamp: 2024-01-01T12:00:00+00:00

Error Details:
Connection timeout

This alert was triggered after 3 consecutive health check failures.
Please investigate the service immediately.

---
TradAI Health Check System
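
A message in this layout could be rendered as in the sketch below. The function name `format_alert` and its parameters are hypothetical; the wording mirrors the example alert, and the timestamp relies on `datetime.isoformat()`, which yields the `+00:00` suffix for UTC-aware values.

```python
from datetime import datetime, timezone


def format_alert(environment: str, service: str, failures: int,
                 error: str, when: datetime) -> tuple[str, str]:
    """Return (subject, body) in the layout of the example alert."""
    subject = f"[{environment.upper()}] Service Alert: {service} is unhealthy"
    body = "\n".join([
        "TradAI Service Health Alert",
        "",
        f"Environment: {environment}",
        f"Service: {service}",
        "Status: UNHEALTHY",
        f"Consecutive Failures: {failures}",
        f"Timestamp: {when.isoformat()}",
        "",
        "Error Details:",
        error,
        "",
        f"This alert was triggered after {failures} consecutive health check failures.",
        "Please investigate the service immediately.",
        "",
        "---",
        "TradAI Health Check System",
    ])
    return subject, body


subject, body = format_alert(
    "dev", "backend-api", 3, "Connection timeout",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(subject)  # [DEV] Service Alert: backend-api is unhealthy
# With boto3, the pair could be passed to an SNS client, for example:
#   sns.publish(TopicArn=topic_arn, Subject=subject, Message=body)
```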

See Also

Related Lambdas:

trading-heartbeat-check - Trading container heartbeat monitoring
orphan-scanner - Task health and cleanup

Architecture:

Services - ECS service definitions
Services - Lambda infrastructure diagram

Services:

Backend Service - Backend health endpoint
Data Collection - Data service health endpoint
Strategy Service - Strategy service health endpoint

CLI:

CLI Reference - tradai monitor health command