health-check

Performs periodic health checks on all ECS services via Service Discovery.

Overview

Property   Value
Trigger    EventBridge scheduled event
Runtime    Python 3.11
Timeout    60 seconds
Memory     256 MB

Input Schema

{
    "services": [
        {
            "name": "backend-api",
            "port": 8000,
            "path": "/api/v1/health"
        },
        {
            "name": "data-collection",
            "port": 8002,
            "path": "/api/v1/health"
        },
        {
            "name": "strategy-service",
            "port": 8003,
            "path": "/api/v1/health"
        },
        {
            "name": "mlflow",
            "port": 5000,
            "path": "/mlflow/health"
        }
    ]
}

Output Schema

The example below is abbreviated: results lists two of the four services counted in summary.

{
    "summary": {
        "healthy": 3,
        "total": 4
    },
    "results": [
        {
            "service": "backend-api",
            "healthy": true,
            "status_code": 200,
            "latency_ms": 45.2,
            "timestamp": "2024-01-01T12:00:00Z"
        },
        {
            "service": "mlflow",
            "healthy": false,
            "status_code": 503,
            "latency_ms": 1500.0,
            "error": "Service unavailable",
            "timestamp": "2024-01-01T12:00:00Z"
        }
    ]
}
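
A minimal sketch of how the summary block could be derived from the per-service results. The `summarize` helper is hypothetical and not taken from the Lambda's source; it assumes each result carries a boolean `healthy` field as in the example above.

```python
def summarize(results: list[dict]) -> dict:
    """Wrap per-service results with healthy/total counts."""
    # A result counts as healthy only if its "healthy" field is truthy.
    healthy = sum(1 for result in results if result.get("healthy"))
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }
```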

Environment Variables

Variable                        Required  Default         Description
SERVICE_DISCOVERY_NAMESPACE     No        "tradai.local"  Cloud Map namespace
HEALTH_CHECK_TIMEOUT            No        30              HTTP timeout (seconds)
DYNAMODB_TABLE_NAME             Yes       -               State repository table
ALERT_SNS_TOPIC_ARN             Yes       -               SNS topic for alerts
CONSECUTIVE_FAILURES_THRESHOLD  No        3               Failures before alert
DYNAMODB_TTL_SECONDS            No        604800          State record TTL (7 days)
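
One way these variables might be read at startup is sketched below. The `Config` dataclass and `load_config` helper are hypothetical names introduced for illustration; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    namespace: str
    timeout_seconds: float
    table_name: str
    alert_topic_arn: str
    failure_threshold: int
    ttl_seconds: int


def load_config(env=None) -> Config:
    """Read settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    required = ("DYNAMODB_TABLE_NAME", "ALERT_SNS_TOPIC_ARN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail fast so a misconfigured deployment is obvious on first invocation.
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Config(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        timeout_seconds=float(env.get("HEALTH_CHECK_TIMEOUT", "30")),
        table_name=env["DYNAMODB_TABLE_NAME"],
        alert_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "3")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "604800")),
    )
```

Passing a plain dict as `env` keeps the helper easy to exercise in unit tests without touching the real process environment.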

Service Discovery DNS

Services are resolved via Cloud Map DNS:

http://{service-name}.{namespace}.internal:{port}{path}

Example: http://backend-api.tradai.local.internal:8000/api/v1/health
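
The pattern above can be expressed as a small helper. This is only a sketch: `build_health_url` is a hypothetical name, and the default namespace is taken from the SERVICE_DISCOVERY_NAMESPACE default.

```python
def build_health_url(name: str, port: int, path: str, namespace: str = "tradai.local") -> str:
    """Build a health check URL following the documented DNS pattern."""
    # Tolerate paths given without a leading slash.
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{name}.{namespace}.internal:{port}{path}"
```

For the first entry in the input example, `build_health_url("backend-api", 8000, "/api/v1/health")` produces the example URL shown above.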

CloudWatch Metrics

Metric               Description
ServiceHealthy       1 if healthy, 0 if not (per service)
HealthCheckLatency   Response time in ms
ConsecutiveFailures  Count of consecutive failures
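
As an illustration, the three metrics could be assembled into the `MetricData` list that CloudWatch's `put_metric_data` call accepts. The `build_metric_data` helper and the `ServiceName` dimension are assumptions made for this sketch; the page does not state which dimensions or units the Lambda actually uses.

```python
def build_metric_data(result: dict, consecutive_failures: int) -> list[dict]:
    """Build CloudWatch MetricData entries for one health check result."""
    # Assumed dimension name; the real Lambda may use a different one.
    dimensions = [{"Name": "ServiceName", "Value": result["service"]}]
    return [
        {
            "MetricName": "ServiceHealthy",
            "Dimensions": dimensions,
            "Value": 1.0 if result["healthy"] else 0.0,
            "Unit": "Count",
        },
        {
            "MetricName": "HealthCheckLatency",
            "Dimensions": dimensions,
            "Value": float(result["latency_ms"]),
            "Unit": "Milliseconds",
        },
        {
            "MetricName": "ConsecutiveFailures",
            "Dimensions": dimensions,
            "Value": float(consecutive_failures),
            "Unit": "Count",
        },
    ]
```

The returned list would be passed as the `MetricData` argument, together with a `Namespace` of your choosing, to a boto3 CloudWatch client.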

Key Features

  • Uses Service Discovery DNS resolution
  • Tracks consecutive failures to avoid alert spam
  • Resets failure count on successful health check
  • Distinguishes between connection errors and HTTP status codes
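
The distinction between connection errors and HTTP status codes might look roughly like the sketch below, which uses only the standard library. The `probe` function and its injectable `opener` parameter are hypothetical, and treating any 2xx response as healthy is an assumption; the actual Lambda may use a different HTTP client and different rules.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(service: str, url: str, timeout: float = 30.0, opener=urllib.request.urlopen) -> dict:
    """Check one endpoint and return a result shaped like the output example."""
    result = {"service": service, "healthy": False, "status_code": None}
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            result["status_code"] = response.status
            result["healthy"] = 200 <= response.status < 300
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status (urlopen raises on 4xx/5xx).
        result["status_code"] = exc.code
        result["error"] = f"HTTP {exc.code}: {exc.reason}"
    except OSError as exc:
        # DNS failure, refused connection, or timeout: no HTTP status is available.
        # urllib.error.URLError is a subclass of OSError, so it is caught here too.
        result["error"] = f"Connection error: {exc}"
    result["latency_ms"] = round((time.monotonic() - started) * 1000, 1)
    result["timestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return result
```

A connection failure leaves `status_code` as `None`, so downstream code can tell "unreachable" apart from "reachable but unhealthy".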

EventBridge Schedule

{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}

Alert Format

When a service's consecutive failure count reaches the threshold (CONSECUTIVE_FAILURES_THRESHOLD), an alert is published to the SNS topic:

{
    "subject": "Service Unhealthy: backend-api",
    "message": {
        "service": "backend-api",
        "consecutive_failures": 3,
        "last_error": "Connection timeout",
        "timestamp": "2024-01-01T12:00:00Z"
    }
}
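
The failure counting and alert construction could be sketched as follows. All three helper names are hypothetical, and alerting whenever the count is at or above the threshold is an assumption; the page does not say whether repeat alerts are suppressed once the threshold has been passed.

```python
def next_failure_count(previous: int, healthy: bool) -> int:
    """Reset the counter on success, otherwise increment it."""
    return 0 if healthy else previous + 1


def should_alert(consecutive_failures: int, threshold: int = 3) -> bool:
    """Assumed rule: alert once the count is at or above the threshold."""
    return consecutive_failures >= threshold


def build_alert(service: str, consecutive_failures: int, last_error: str, timestamp: str) -> dict:
    """Build an alert shaped like the example above."""
    return {
        "subject": f"Service Unhealthy: {service}",
        "message": {
            "service": service,
            "consecutive_failures": consecutive_failures,
            "last_error": last_error,
            "timestamp": timestamp,
        },
    }
```

When publishing through SNS, the message body has to be a string, so the `message` dict would typically be serialized with `json.dumps` before being sent.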
