
health-check

Performs periodic health checks on all ECS services via Service Discovery.

Overview

Property  Value
--------  ---------------------------
Trigger   EventBridge scheduled event
Runtime   Python 3.11
Timeout   60 seconds
Memory    256 MB

Input Schema

{
    "services": [
        {
            "name": "backend-api",
            "port": 8000,
            "path": "/api/v1/health"
        },
        {
            "name": "data-collection",
            "port": 8002,
            "path": "/api/v1/health"
        },
        {
            "name": "strategy-service",
            "port": 8003,
            "path": "/api/v1/health"
        },
        {
            "name": "mlflow",
            "port": 5000,
            "path": "/mlflow/health"
        }
    ]
}

Output Schema

{
    "summary": {
        "healthy": 3,
        "total": 4
    },
    "results": [
        {
            "service": "backend-api",
            "healthy": true,
            "status_code": 200,
            "latency_ms": 45.2,
            "timestamp": "2024-01-01T12:00:00Z"
        },
        {
            "service": "mlflow",
            "healthy": false,
            "status_code": 503,
            "latency_ms": 1500.0,
            "error": "Service unavailable",
            "timestamp": "2024-01-01T12:00:00Z"
        }
    ]
}
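
The summary block can be derived from the per-service results. The following is a minimal sketch of that aggregation, assuming each result carries a boolean "healthy" field as in the example above; the function name summarize is hypothetical and not taken from the Lambda's source.

```python
from typing import Any


def summarize(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap per-service results in the documented output shape."""
    # Count only results explicitly marked healthy; a missing flag counts as unhealthy.
    healthy = sum(1 for result in results if result.get("healthy") is True)
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }


example = summarize([
    {"service": "backend-api", "healthy": True, "status_code": 200},
    {"service": "mlflow", "healthy": False, "status_code": 503},
])
print(example["summary"])
```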

Environment Variables

Variable                        Required  Default         Description
------------------------------  --------  --------------  ---------------------------
SERVICE_DISCOVERY_NAMESPACE     No        "tradai.local"  Cloud Map namespace
HEALTH_CHECK_TIMEOUT            No        10              HTTP timeout (seconds)
DYNAMODB_TABLE_NAME             Yes       -               State repository table
SNS_ALERTS_TOPIC_ARN            No        -               SNS topic for alerts
CONSECUTIVE_FAILURES_THRESHOLD  No        2               Failures before alert
DYNAMODB_TTL_SECONDS            No        86400           State record TTL (24 hours)
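
To illustrate how these variables and their defaults fit together, here is a minimal sketch of a configuration loader. The HealthCheckConfig class and load_config function are hypothetical names, and the actual Lambda may read its settings differently; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HealthCheckConfig:
    namespace: str
    timeout_seconds: float
    table_name: str
    alerts_topic_arn: Optional[str]
    failure_threshold: int
    ttl_seconds: int


def load_config(env: Mapping[str, str] = os.environ) -> HealthCheckConfig:
    """Read settings from a mapping, applying the documented defaults."""
    table_name = env.get("DYNAMODB_TABLE_NAME")
    if not table_name:
        # The only variable marked as required in the table.
        raise ValueError("DYNAMODB_TABLE_NAME is required")
    return HealthCheckConfig(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        timeout_seconds=float(env.get("HEALTH_CHECK_TIMEOUT", "10")),
        table_name=table_name,
        # An unset or empty topic ARN is treated as "no alert topic configured".
        alerts_topic_arn=env.get("SNS_ALERTS_TOPIC_ARN") or None,
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "2")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "86400")),
    )


# Example with an explicit mapping; "health-state" is a placeholder table name.
print(load_config({"DYNAMODB_TABLE_NAME": "health-state"}))
```

Passing the environment as a mapping keeps the loader easy to exercise without touching the real process environment.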

Service Discovery DNS

Services are resolved via Cloud Map DNS:

http://{service-name}.{namespace}:{port}{path}

Example: http://backend-api.tradai.local:8000/api/v1/health
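
The URL pattern above can be sketched as a small helper. This is an illustration only; build_health_url is a hypothetical name, and the leading-slash normalisation is an added assumption rather than documented behaviour.

```python
def build_health_url(name: str, port: int, path: str,
                     namespace: str = "tradai.local") -> str:
    """Build http://{service-name}.{namespace}:{port}{path} for one service entry."""
    # Assumption: tolerate paths given without a leading slash.
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{name}.{namespace}:{port}{path}"


service = {"name": "backend-api", "port": 8000, "path": "/api/v1/health"}
print(build_health_url(service["name"], service["port"], service["path"]))
```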

CloudWatch Metrics

Metric               Description
-------------------  ------------------------------------
ServiceHealthy       1 if healthy, 0 if not (per service)
HealthCheckLatency   Response time in milliseconds
ConsecutiveFailures  Count of consecutive failures
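
As a rough sketch, one health check result could be turned into these three metrics as follows. The metric names come from the table above; the "Service" dimension name, the chosen units, and the build_metric_data helper are assumptions, not confirmed details of the Lambda.

```python
from typing import Any


def build_metric_data(result: dict[str, Any],
                      consecutive_failures: int) -> list[dict[str, Any]]:
    """Build one entry per documented metric for a single service result."""
    # Assumed dimension: one time series per service name.
    dimensions = [{"Name": "Service", "Value": result["service"]}]
    return [
        {
            "MetricName": "ServiceHealthy",
            "Dimensions": dimensions,
            "Value": 1.0 if result["healthy"] else 0.0,
            "Unit": "Count",
        },
        {
            "MetricName": "HealthCheckLatency",
            "Dimensions": dimensions,
            "Value": float(result["latency_ms"]),
            "Unit": "Milliseconds",
        },
        {
            "MetricName": "ConsecutiveFailures",
            "Dimensions": dimensions,
            "Value": float(consecutive_failures),
            "Unit": "Count",
        },
    ]


result = {"service": "mlflow", "healthy": False, "latency_ms": 1500.0}
for entry in build_metric_data(result, consecutive_failures=3):
    print(entry["MetricName"], entry["Value"])
```

A list shaped like this could be passed as the MetricData argument of the boto3 CloudWatch client's put_metric_data call, together with a namespace of your choosing.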

Key Features

  • Uses Service Discovery DNS resolution
  • Tracks consecutive failures to avoid alert spam
  • Resets failure count on successful health check
  • Distinguishes between connection errors and HTTP status codes
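
The failure tracking and error classification described above might look roughly like the sketch below. It uses only the standard library. The function names are hypothetical, state is passed in rather than read from DynamoDB, and the alert rule follows the "exceed threshold" wording used in the Alert Format section.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Any


def check_service(name: str, url: str, timeout: float = 10.0) -> dict[str, Any]:
    """Probe one URL and separate HTTP error statuses from connection failures."""
    result: dict[str, Any] = {"service": name, "healthy": False, "status_code": None}
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            result["status_code"] = response.status
            result["healthy"] = 200 <= response.status < 300
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status such as 503.
        result["status_code"] = exc.code
        result["error"] = f"HTTP {exc.code}"
    except OSError as exc:
        # No HTTP response at all: DNS failure, refused connection, or timeout.
        # (URLError and TimeoutError are both subclasses of OSError.)
        result["error"] = f"Connection error: {exc}"
    result["latency_ms"] = round((time.monotonic() - started) * 1000, 1)
    result["timestamp"] = datetime.now(timezone.utc).isoformat()
    return result


def next_failure_count(previous: int, healthy: bool) -> int:
    """Reset the counter on success, otherwise increment it."""
    return 0 if healthy else previous + 1


def should_alert(failure_count: int, threshold: int = 2) -> bool:
    """Alert only once the count exceeds the threshold."""
    return failure_count > threshold


# Simulate four scheduled runs: fail, fail, fail, recover.
count = 0
for healthy in (False, False, False, True):
    count = next_failure_count(count, healthy)
    print(count, should_alert(count))
```

HTTPError is caught before the broader OSError clause because it is itself a subclass of URLError; reversing the order would hide the status code.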

EventBridge Schedule

{
  "ScheduleExpression": "rate(2 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}

Alert Format

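When a service's consecutive failure count exceeds CONSECUTIVE_FAILURES_THRESHOLD, an alert in the following format is sent to the SNS alerts topic: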

Subject: [DEV] Service Alert: backend-api is unhealthy

TradAI Service Health Alert

Environment: dev
Service: backend-api
Status: UNHEALTHY
Consecutive Failures: 3
Timestamp: 2024-01-01T12:00:00+00:00

Error Details:
Connection timeout

This alert was triggered after 3 consecutive health check failures.
Please investigate the service immediately.

---
TradAI Health Check System
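
For illustration, the alert above could be assembled with a helper like the one below. The format_alert name and its parameters are hypothetical; the wording of the subject and body is copied from the example.

```python
def format_alert(environment: str, service: str, failures: int,
                 error: str, timestamp: str) -> tuple[str, str]:
    """Return (subject, body) matching the documented alert layout."""
    subject = f"[{environment.upper()}] Service Alert: {service} is unhealthy"
    body = "\n".join([
        "TradAI Service Health Alert",
        "",
        f"Environment: {environment}",
        f"Service: {service}",
        "Status: UNHEALTHY",
        f"Consecutive Failures: {failures}",
        f"Timestamp: {timestamp}",
        "",
        "Error Details:",
        error,
        "",
        f"This alert was triggered after {failures} consecutive health check failures.",
        "Please investigate the service immediately.",
        "",
        "---",
        "TradAI Health Check System",
    ])
    return subject, body


subject, body = format_alert(
    "dev", "backend-api", 3, "Connection timeout", "2024-01-01T12:00:00+00:00"
)
print(subject)
print(body)
```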
