health-check
Performs periodic health checks on all ECS services via Service Discovery.
Overview
| Property | Value |
|---|---|
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 256 MB |
Input Schema
{
  "services": [
    {
      "name": "backend-api",
      "port": 8000,
      "path": "/api/v1/health"
    },
    {
      "name": "data-collection",
      "port": 8002,
      "path": "/api/v1/health"
    },
    {
      "name": "strategy-service",
      "port": 8003,
      "path": "/api/v1/health"
    },
    {
      "name": "mlflow",
      "port": 5000,
      "path": "/mlflow/health"
    }
  ]
}
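A minimal sketch of how a handler might validate this payload before probing anything. `ServiceTarget` and `parse_event` are hypothetical names, not taken from the Lambda's source, and the validation rules (port range, leading slash) are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ServiceTarget:
    """One entry of the "services" array in the event payload."""
    name: str
    port: int
    path: str


def parse_event(event: dict[str, Any]) -> list[ServiceTarget]:
    """Validate the event and return its service targets (hypothetical helper)."""
    services = event.get("services")
    if not isinstance(services, list) or not services:
        raise ValueError('event must contain a non-empty "services" list')

    targets = []
    for entry in services:
        name, port, path = entry["name"], entry["port"], entry["path"]
        # Assumed rules: a valid TCP port and an absolute URL path.
        if not isinstance(port, int) or not 1 <= port <= 65535:
            raise ValueError(f"invalid port for {name}: {port!r}")
        if not path.startswith("/"):
            raise ValueError(f"path for {name} must start with '/': {path!r}")
        targets.append(ServiceTarget(name=name, port=port, path=path))
    return targets


event = {"services": [{"name": "backend-api", "port": 8000, "path": "/api/v1/health"}]}
targets = parse_event(event)
```

Rejecting a malformed event up front keeps a bad schedule payload from being reported as a service outage.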
Output Schema
{
  "summary": {
    "healthy": 3,
    "total": 4
  },
  "results": [
    {
      "service": "backend-api",
      "healthy": true,
      "status_code": 200,
      "latency_ms": 45.2,
      "timestamp": "2024-01-01T12:00:00Z"
    },
    {
      "service": "mlflow",
      "healthy": false,
      "status_code": 503,
      "latency_ms": 1500.0,
      "error": "Service unavailable",
      "timestamp": "2024-01-01T12:00:00Z"
    }
  ]
}
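The `summary` block can be derived from the per-service records. The sketch below assumes that is how it is computed; `build_output` is a hypothetical helper name.

```python
from typing import Any


def build_output(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap per-service results in the documented summary/results envelope."""
    healthy = sum(1 for record in results if record.get("healthy") is True)
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }


# Only the two records shown in the schema example are used here.
sample_results = [
    {"service": "backend-api", "healthy": True, "status_code": 200,
     "latency_ms": 45.2, "timestamp": "2024-01-01T12:00:00Z"},
    {"service": "mlflow", "healthy": False, "status_code": 503,
     "latency_ms": 1500.0, "error": "Service unavailable",
     "timestamp": "2024-01-01T12:00:00Z"},
]
output = build_output(sample_results)
```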
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| SERVICE_DISCOVERY_NAMESPACE | No | "tradai.local" | Cloud Map namespace |
| HEALTH_CHECK_TIMEOUT | No | 30 | HTTP timeout (seconds) |
| DYNAMODB_TABLE_NAME | Yes | - | State repository table |
| ALERT_SNS_TOPIC_ARN | Yes | - | SNS topic for alerts |
| CONSECUTIVE_FAILURES_THRESHOLD | No | 3 | Failures before alert |
| DYNAMODB_TTL_SECONDS | No | 604800 | State record TTL (7 days) |
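One way to load these variables with the documented defaults is sketched below. `Settings`, `load_settings`, and `ttl_expiry` are hypothetical names, and the table name and topic ARN in the usage lines are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("DYNAMODB_TABLE_NAME", "ALERT_SNS_TOPIC_ARN")


@dataclass(frozen=True)
class Settings:
    namespace: str
    http_timeout: float
    table_name: str
    alert_topic_arn: str
    failure_threshold: int
    ttl_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration, failing fast when a required variable is absent."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    return Settings(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        http_timeout=float(env.get("HEALTH_CHECK_TIMEOUT", "30")),
        table_name=env["DYNAMODB_TABLE_NAME"],
        alert_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "3")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "604800")),
    )


def ttl_expiry(now_epoch: int, ttl_seconds: int) -> int:
    """DynamoDB TTL attributes hold an expiry time in epoch seconds."""
    return now_epoch + ttl_seconds


# Placeholder values, for illustration only.
settings = load_settings({
    "DYNAMODB_TABLE_NAME": "example-table",
    "ALERT_SNS_TOPIC_ARN": "arn:aws:sns:example",
})
```

Passing the environment as a mapping keeps the loader easy to exercise in unit tests without touching `os.environ`.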
Service Discovery DNS
Services are resolved via Cloud Map DNS:
Example: http://backend-api.tradai.local.internal:8000/api/v1/health
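The URL in the example can be assembled from a service entry and a DNS domain. Note that the example host ends in `tradai.local.internal`, while the default namespace is `tradai.local`; how the Lambda derives the domain from the namespace is not stated here, so this sketch takes the domain as an explicit parameter. `health_url` is a hypothetical helper.

```python
def health_url(name: str, dns_domain: str, port: int, path: str) -> str:
    """Build the health-check URL for a service registered in Cloud Map."""
    return f"http://{name}.{dns_domain}:{port}{path}"


# Reproduces the documented example; the domain is passed in, not derived.
url = health_url("backend-api", "tradai.local.internal", 8000, "/api/v1/health")
```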
CloudWatch Metrics
| Metric | Description |
|---|---|
| ServiceHealthy | 1 if healthy, 0 if not (per service) |
| HealthCheckLatency | Response time in ms |
| ConsecutiveFailures | Count of consecutive failures |
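A sketch of the metric payload for one service, in the shape accepted by the `MetricData` parameter of boto3's CloudWatch `put_metric_data`. The dimension name `Service` and the units are assumptions; the metric names come from the table above.

```python
from typing import Any


def metric_data(service: str, healthy: bool, latency_ms: float,
                consecutive_failures: int) -> list[dict[str, Any]]:
    """Build the three per-service datapoints (hypothetical helper)."""
    dimensions = [{"Name": "Service", "Value": service}]  # assumed dimension name
    return [
        {"MetricName": "ServiceHealthy", "Dimensions": dimensions,
         "Value": 1.0 if healthy else 0.0, "Unit": "Count"},
        {"MetricName": "HealthCheckLatency", "Dimensions": dimensions,
         "Value": float(latency_ms), "Unit": "Milliseconds"},
        {"MetricName": "ConsecutiveFailures", "Dimensions": dimensions,
         "Value": float(consecutive_failures), "Unit": "Count"},
    ]


data = metric_data("mlflow", healthy=False, latency_ms=1500.0, consecutive_failures=3)

# Publishing would then look roughly like this (the namespace is a placeholder):
#   boto3.client("cloudwatch").put_metric_data(Namespace="Example/HealthCheck",
#                                              MetricData=data)
```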
Key Features
- Uses Service Discovery DNS resolution
- Tracks consecutive failures to avoid alert spam
- Resets failure count on successful health check
- Distinguishes between connection errors and HTTP status codes
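The last three features can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Lambda's actual code: `probe` and `next_failure_count` are hypothetical names, and treating any 2xx response as healthy is an assumption.

```python
import time
import urllib.error
import urllib.request
from typing import Any, Callable


def probe(url: str, timeout: float,
          opener: Callable[..., Any] = urllib.request.urlopen) -> dict[str, Any]:
    """Issue one GET and classify the outcome."""
    started = time.monotonic()
    result: dict[str, Any] = {"healthy": False, "status_code": None}
    try:
        with opener(url, timeout=timeout) as response:
            result["status_code"] = response.status
            result["healthy"] = 200 <= response.status < 300  # assumed rule
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status such as 503.
        # HTTPError subclasses URLError, so it must be caught first.
        result["status_code"] = exc.code
        result["error"] = f"HTTP {exc.code}"
    except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        result["error"] = f"connection error: {exc}"
    result["latency_ms"] = round((time.monotonic() - started) * 1000, 1)
    return result


def next_failure_count(previous: int, healthy: bool) -> int:
    """Reset the counter on success, otherwise extend the failure streak."""
    return 0 if healthy else previous + 1
```

The `opener` parameter exists only so the classification can be tested without a network; in normal use the default `urllib.request.urlopen` applies.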
EventBridge Schedule
{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}
Alert Format
When a service's consecutive failures exceed the threshold (CONSECUTIVE_FAILURES_THRESHOLD), an alert is published to the SNS topic:
{
  "subject": "Service Unhealthy: backend-api",
  "message": {
    "service": "backend-api",
    "consecutive_failures": 3,
    "last_error": "Connection timeout",
    "timestamp": "2024-01-01T12:00:00Z"
  }
}
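A sketch of building this alert. `should_alert` and `build_alert` are hypothetical helpers. The sample above shows 3 failures against a default threshold of 3, so the sketch fires at `>=`; the exact comparison used by the Lambda is an assumption.

```python
import json
from typing import Any


def should_alert(consecutive_failures: int, threshold: int) -> bool:
    """Assumed boundary: alert once the streak reaches the threshold."""
    return consecutive_failures >= threshold


def build_alert(service: str, consecutive_failures: int,
                last_error: str, timestamp: str) -> dict[str, Any]:
    """Assemble the alert in the documented subject/message shape."""
    return {
        "subject": f"Service Unhealthy: {service}",
        "message": {
            "service": service,
            "consecutive_failures": consecutive_failures,
            "last_error": last_error,
            "timestamp": timestamp,
        },
    }


alert = build_alert("backend-api", 3, "Connection timeout", "2024-01-01T12:00:00Z")
body = json.dumps(alert["message"])

# Publishing would then look roughly like this:
#   boto3.client("sns").publish(TopicArn=topic_arn,
#                               Subject=alert["subject"], Message=body)
```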
See Also
Related Lambdas:
- trading-heartbeat-check - Trading container heartbeat monitoring
- orphan-scanner - Task health and cleanup
Architecture:
- Services - ECS service definitions
- Architecture Overview - Lambda infrastructure diagram
Services:
- Backend Service - Backend health endpoint
- Data Collection - Data service health endpoint
- Strategy Service - Strategy service health endpoint
CLI:
- CLI Reference - `tradai monitor health` command