health-check
Performs periodic health checks on all ECS services via Service Discovery.
Overview
| Property | Value |
|---|---|
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
Input Schema
```json
{
  "services": [
    {
      "name": "backend-api",
      "port": 8000,
      "path": "/api/v1/health"
    },
    {
      "name": "data-collection",
      "port": 8002,
      "path": "/api/v1/health"
    },
    {
      "name": "strategy-service",
      "port": 8003,
      "path": "/api/v1/health"
    },
    {
      "name": "mlflow",
      "port": 5000,
      "path": "/mlflow/health"
    }
  ]
}
```
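A handler would typically validate this payload before probing anything. The following is a minimal sketch of that step, not the Lambda's actual code; `ServiceTarget` and `parse_services` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTarget:
    """One entry of the `services` array (hypothetical helper type)."""
    name: str
    port: int
    path: str


def parse_services(event: dict) -> list:
    """Convert the raw event into typed targets, rejecting malformed entries."""
    targets = []
    for entry in event.get("services", []):
        missing = {"name", "port", "path"} - entry.keys()
        if missing:
            raise ValueError(f"service entry missing fields: {sorted(missing)}")
        targets.append(ServiceTarget(entry["name"], int(entry["port"]), entry["path"]))
    return targets


event = {"services": [{"name": "backend-api", "port": 8000, "path": "/api/v1/health"}]}
targets = parse_services(event)
```

Failing fast on a malformed entry keeps a typo in the schedule payload from being reported as a service outage.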
Output Schema
```json
{
  "summary": {
    "healthy": 3,
    "total": 4
  },
  "results": [
    {
      "service": "backend-api",
      "healthy": true,
      "status_code": 200,
      "latency_ms": 45.2,
      "timestamp": "2024-01-01T12:00:00Z"
    },
    {
      "service": "mlflow",
      "healthy": false,
      "status_code": 503,
      "latency_ms": 1500.0,
      "error": "Service unavailable",
      "timestamp": "2024-01-01T12:00:00Z"
    }
  ]
}
```
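The `summary` block can be derived from the per-service results. The sketch below assumes that `healthy` counts the results whose `healthy` flag is true and that `total` is the number of results; `summarize` is a hypothetical helper, not a function from the Lambda's source.

```python
def summarize(results: list) -> dict:
    """Wrap per-service results in the documented output shape."""
    healthy = sum(1 for result in results if result["healthy"])
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }


report = summarize([
    {"service": "backend-api", "healthy": True, "status_code": 200},
    {"service": "mlflow", "healthy": False, "status_code": 503},
])
```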
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| SERVICE_DISCOVERY_NAMESPACE | No | "tradai.local" | Cloud Map namespace |
| HEALTH_CHECK_TIMEOUT | No | 10 | HTTP timeout (seconds) |
| DYNAMODB_TABLE_NAME | Yes | - | State repository table |
| SNS_ALERTS_TOPIC_ARN | No | - | SNS topic for alerts |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 2 | Failures before alert |
| DYNAMODB_TTL_SECONDS | No | 86400 | State record TTL (24 hours) |
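As an illustration of how these variables and defaults might be read at cold start, here is a minimal sketch. `HealthCheckConfig` and `load_config` are hypothetical names; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HealthCheckConfig:
    namespace: str
    timeout_seconds: float
    table_name: str
    alerts_topic_arn: Optional[str]
    failure_threshold: int
    ttl_seconds: int


def load_config(env: Mapping = os.environ) -> HealthCheckConfig:
    """Read settings from the environment, applying the documented defaults."""
    return HealthCheckConfig(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        timeout_seconds=float(env.get("HEALTH_CHECK_TIMEOUT", "10")),
        table_name=env["DYNAMODB_TABLE_NAME"],  # required: raises KeyError if unset
        alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN") or None,  # optional
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "2")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "86400")),
    )


config = load_config({"DYNAMODB_TABLE_NAME": "health-state"})
```

Passing the environment as a parameter keeps the loader testable without mutating `os.environ`.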
Service Discovery DNS
Services are resolved via Cloud Map DNS:
Example: http://backend-api.tradai.local:8000/api/v1/health
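The example URL is consistent with joining the service name, the Cloud Map namespace, the port, and the path from the input schema. A small sketch of that composition follows; `build_health_url` is a hypothetical helper, and the plain `http` scheme is taken from the example above.

```python
def build_health_url(name: str, port: int, path: str, namespace: str = "tradai.local") -> str:
    """Compose the health-check URL for a service registered in Cloud Map."""
    if not path.startswith("/"):
        path = "/" + path  # tolerate paths given without a leading slash
    return f"http://{name}.{namespace}:{port}{path}"


url = build_health_url("backend-api", 8000, "/api/v1/health")
```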
CloudWatch Metrics
| Metric | Description |
|---|---|
| ServiceHealthy | 1 if healthy, 0 if not (per service) |
| HealthCheckLatency | Response time in ms |
| ConsecutiveFailures | Count of consecutive failures |
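For reference, the three metrics could be assembled into a `MetricData` list of the shape accepted by boto3's `put_metric_data`. This is a sketch under assumptions: the `Service` dimension name and the `build_metric_data` helper are hypothetical, and the CloudWatch namespace used by the Lambda is not stated on this page.

```python
def build_metric_data(service: str, healthy: bool, latency_ms: float,
                      consecutive_failures: int) -> list:
    """Build one CloudWatch datum per documented metric for a single service."""
    dimensions = [{"Name": "Service", "Value": service}]  # dimension name is an assumption
    return [
        {"MetricName": "ServiceHealthy", "Dimensions": dimensions,
         "Value": 1 if healthy else 0, "Unit": "Count"},
        {"MetricName": "HealthCheckLatency", "Dimensions": dimensions,
         "Value": latency_ms, "Unit": "Milliseconds"},
        {"MetricName": "ConsecutiveFailures", "Dimensions": dimensions,
         "Value": consecutive_failures, "Unit": "Count"},
    ]


metric_data = build_metric_data("mlflow", False, 1500.0, 3)
# The list would then be passed to CloudWatch, for example:
# boto3.client("cloudwatch").put_metric_data(Namespace=<your namespace>, MetricData=metric_data)
```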
Key Features
- Uses Service Discovery DNS resolution
- Tracks consecutive failures to avoid alert spam
- Resets failure count on successful health check
- Distinguishes between connection errors and HTTP status codes
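The last three features can be sketched with the standard library alone. This is an illustration, not the Lambda's implementation: `probe` and `next_failure_state` are hypothetical names, and alerting when the count strictly exceeds the threshold is an assumption based on the wording in the Alert Format section.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> dict:
    """Probe a URL, separating HTTP-level failures from connection-level ones."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return {"healthy": True, "status_code": response.status}
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status such as 503.
        return {"healthy": False, "status_code": exc.code, "error": str(exc.reason)}
    except OSError as exc:
        # No HTTP response at all: DNS failure, refused connection, or timeout.
        # (urllib.error.URLError is a subclass of OSError.)
        return {"healthy": False, "status_code": None, "error": str(exc)}


def next_failure_state(previous_failures: int, healthy: bool, threshold: int = 2) -> tuple:
    """Return (consecutive_failures, should_alert) after one check."""
    if healthy:
        return 0, False  # a success resets the counter
    failures = previous_failures + 1
    return failures, failures > threshold  # assumption: alert only above the threshold


# Three failed checks in a row, then a recovery.
count, history = 0, []
for healthy in (False, False, False, True):
    count, should_alert = next_failure_state(count, healthy)
    history.append((count, should_alert))
```

With the default threshold of 2, a single failed check never alerts on its own, which is what keeps transient blips from generating notifications.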
EventBridge Schedule
```json
{
  "ScheduleExpression": "rate(2 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}
```
Alert Format
When a service's consecutive failure count exceeds `CONSECUTIVE_FAILURES_THRESHOLD`, an alert in the following format is sent to the SNS alerts topic:
```text
Subject: [DEV] Service Alert: backend-api is unhealthy
TradAI Service Health Alert
Environment: dev
Service: backend-api
Status: UNHEALTHY
Consecutive Failures: 3
Timestamp: 2024-01-01T12:00:00+00:00
Error Details:
Connection timeout
This alert was triggered after 3 consecutive health check failures.
Please investigate the service immediately.
---
TradAI Health Check System
```
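A message in this format could be rendered as follows. This is a sketch only: `format_alert` is a hypothetical helper, and the exact line spacing of the real message is not recoverable from this page. The subject is truncated because SNS limits subjects to 100 characters.

```python
from datetime import datetime, timezone


def format_alert(environment: str, service: str, consecutive_failures: int,
                 error: str, timestamp: datetime) -> tuple:
    """Return (subject, body) for an unhealthy-service alert."""
    subject = f"[{environment.upper()}] Service Alert: {service} is unhealthy"[:100]
    body = "\n".join([
        "TradAI Service Health Alert",
        f"Environment: {environment}",
        f"Service: {service}",
        "Status: UNHEALTHY",
        f"Consecutive Failures: {consecutive_failures}",
        f"Timestamp: {timestamp.isoformat()}",
        "Error Details:",
        error,
        f"This alert was triggered after {consecutive_failures} consecutive health check failures.",
        "Please investigate the service immediately.",
        "---",
        "TradAI Health Check System",
    ])
    return subject, body


subject, body = format_alert(
    "dev", "backend-api", 3, "Connection timeout",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```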
See Also
Related Lambdas:
- trading-heartbeat-check - Trading container heartbeat monitoring
- orphan-scanner - Task health and cleanup
Architecture:
Services:
- Backend Service - Backend health endpoint
- Data Collection - Data service health endpoint
- Strategy Service - Strategy service health endpoint
CLI:
- CLI Reference - `tradai monitor health` command