health-check

Performs periodic health checks on all ECS services via Service Discovery.

Overview

Property	Value
Trigger	EventBridge scheduled event
Runtime	Python 3.11
Timeout	60 seconds
Memory	256 MB

Input Schema

{
    "services": [
        {
            "name": "backend-api",
            "port": 8000,
            "path": "/api/v1/health"
        },
        {
            "name": "data-collection",
            "port": 8002,
            "path": "/api/v1/health"
        },
        {
            "name": "strategy-service",
            "port": 8003,
            "path": "/api/v1/health"
        },
        {
            "name": "mlflow",
            "port": 5000,
            "path": "/mlflow/health"
        }
    ]
}
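
A minimal sketch of how a handler might validate this event before checking anything. The `ServiceTarget` class and `parse_targets` function are hypothetical names chosen for illustration; only the `services`, `name`, `port`, and `path` fields come from the schema above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTarget:
    """One entry of the "services" array in the input event."""
    name: str
    port: int
    path: str


def parse_targets(event: dict) -> list[ServiceTarget]:
    """Return one ServiceTarget per service entry, rejecting incomplete entries."""
    targets = []
    for entry in event.get("services", []):
        missing = {"name", "port", "path"} - entry.keys()
        if missing:
            raise ValueError(f"service entry missing fields: {sorted(missing)}")
        targets.append(ServiceTarget(entry["name"], int(entry["port"]), entry["path"]))
    return targets
```

Failing fast on a malformed entry is one possible design choice; a handler could equally skip the bad entry and report it in the output.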

Output Schema

{
    "summary": {
        "healthy": 3,
        "total": 4
    },
    "results": [
        {
            "service": "backend-api",
            "healthy": true,
            "status_code": 200,
            "latency_ms": 45.2,
            "timestamp": "2024-01-01T12:00:00Z"
        },
        {
            "service": "mlflow",
            "healthy": false,
            "status_code": 503,
            "latency_ms": 1500.0,
            "error": "Service unavailable",
            "timestamp": "2024-01-01T12:00:00Z"
        }
    ]
}
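
The sample above is abbreviated: it shows two of the four entries counted in `summary`. As a rough sketch, the summary could be derived from the per-service results like this. The `summarize` function is a hypothetical helper, not the Lambda's actual code.

```python
def summarize(results: list[dict]) -> dict:
    """Wrap per-service results in the documented output shape."""
    healthy = sum(1 for result in results if result["healthy"])
    return {
        "summary": {"healthy": healthy, "total": len(results)},
        "results": results,
    }
```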

Environment Variables

Variable	Required	Default	Description
`SERVICE_DISCOVERY_NAMESPACE`	No	"tradai.local"	Cloud Map namespace
`HEALTH_CHECK_TIMEOUT`	No	30	HTTP timeout (seconds)
`DYNAMODB_TABLE_NAME`	Yes	-	State repository table
`ALERT_SNS_TOPIC_ARN`	Yes	-	SNS topic for alerts
`CONSECUTIVE_FAILURES_THRESHOLD`	No	3	Failures before alert
`DYNAMODB_TTL_SECONDS`	No	604800	State record TTL (7 days)
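
One way to read these variables with the documented defaults is sketched below. The `Config` class and `load_config` function are illustrative names, and raising `KeyError` for a missing required variable is an assumption about the desired failure mode.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    namespace: str
    timeout_seconds: float
    table_name: str
    alert_topic_arn: str
    failure_threshold: int
    ttl_seconds: int


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read settings from the environment, applying the documented defaults."""
    return Config(
        namespace=env.get("SERVICE_DISCOVERY_NAMESPACE", "tradai.local"),
        timeout_seconds=float(env.get("HEALTH_CHECK_TIMEOUT", "30")),
        table_name=env["DYNAMODB_TABLE_NAME"],      # required: KeyError if unset
        alert_topic_arn=env["ALERT_SNS_TOPIC_ARN"],  # required: KeyError if unset
        failure_threshold=int(env.get("CONSECUTIVE_FAILURES_THRESHOLD", "3")),
        ttl_seconds=int(env.get("DYNAMODB_TTL_SECONDS", "604800")),  # 7 days
    )
```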

Service Discovery DNS

Services are resolved via Cloud Map DNS:

http://{service-name}.{namespace}.internal:{port}{path}

Example: http://backend-api.tradai.local.internal:8000/api/v1/health
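
The URL pattern above can be expressed as a small helper. `health_url` is a hypothetical name; the format string simply mirrors the documented pattern.

```python
def health_url(name: str, port: int, path: str, namespace: str = "tradai.local") -> str:
    """Build the health endpoint URL following the documented DNS pattern."""
    return f"http://{name}.{namespace}.internal:{port}{path}"


print(health_url("backend-api", 8000, "/api/v1/health"))
```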

CloudWatch Metrics

Metric	Description
`ServiceHealthy`	1 if healthy, 0 if not (per service)
`HealthCheckLatency`	Response time in ms
`ConsecutiveFailures`	Count of consecutive failures
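
As a sketch, the three metrics for one service could be assembled in the shape that the CloudWatch `put_metric_data` call accepts for its `MetricData` parameter. The `Service` dimension name and the choice of units are assumptions; the page does not state the CloudWatch namespace or dimensions actually used.

```python
def build_metric_data(
    service: str, healthy: bool, latency_ms: float, consecutive_failures: int
) -> list[dict]:
    """Build one datum per documented metric for a single service."""
    # Assumption: metrics are split per service by a "Service" dimension.
    dimensions = [{"Name": "Service", "Value": service}]
    return [
        {
            "MetricName": "ServiceHealthy",
            "Dimensions": dimensions,
            "Value": 1.0 if healthy else 0.0,
            "Unit": "Count",
        },
        {
            "MetricName": "HealthCheckLatency",
            "Dimensions": dimensions,
            "Value": float(latency_ms),
            "Unit": "Milliseconds",
        },
        {
            "MetricName": "ConsecutiveFailures",
            "Dimensions": dimensions,
            "Value": float(consecutive_failures),
            "Unit": "Count",
        },
    ]
```

The resulting list could then be passed as `MetricData` to a boto3 CloudWatch client, together with whatever namespace the deployment uses.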

Key Features

Uses Service Discovery DNS resolution
Tracks consecutive failures to avoid alert spam
Resets failure count on successful health check
Distinguishes between connection errors and HTTP status codes
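
The last three features can be illustrated with a standard-library sketch. It is not the Lambda's actual implementation: `probe` and `next_failure_count` are hypothetical names, and treating any 2xx response as healthy is an assumption.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 30.0) -> dict:
    """Issue one GET and report health, status code, latency, and any error."""
    start = time.perf_counter()
    status_code: int | None = None
    error: str | None = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status_code = response.status
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status such as 503.
        status_code, error = exc.code, str(exc.reason)
    except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
        # No HTTP response at all: DNS failure, refused connection, or timeout.
        error = f"connection error: {exc}"
    latency_ms = round((time.perf_counter() - start) * 1000, 1)
    healthy = status_code is not None and 200 <= status_code < 300
    return {
        "healthy": healthy,
        "status_code": status_code,
        "latency_ms": latency_ms,
        "error": error,
    }


def next_failure_count(previous: int, healthy: bool) -> int:
    """Reset the counter on success, otherwise extend the failure streak."""
    return 0 if healthy else previous + 1
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would report HTTP error statuses as connection errors.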

EventBridge Schedule

{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:health-check"
  }]
}

Alert Format

When a service's consecutive failures reach `CONSECUTIVE_FAILURES_THRESHOLD`, an alert in the following format is published to the SNS topic:

{
    "subject": "Service Unhealthy: backend-api",
    "message": {
        "service": "backend-api",
        "consecutive_failures": 3,
        "last_error": "Connection timeout",
        "timestamp": "2024-01-01T12:00:00Z"
    }
}

See Also

Related Lambdas:

trading-heartbeat-check - Trading container heartbeat monitoring
orphan-scanner - Task health and cleanup

Architecture:

Services - ECS service definitions
Architecture Overview - Lambda infrastructure diagram

Services:

Backend Service - Backend health endpoint
Data Collection - Data service health endpoint
Strategy Service - Strategy service health endpoint

CLI:

CLI Reference - tradai monitor health command