Dev Environment Health Audit

Date: 2026-04-27
Account: 600802701449 | Region: eu-central-1 | Profile: tradai

Summary

| Category | Status | Details |
|---|---|---|
| Infrastructure (S3, DynamoDB, ECR, VPC, RDS, SQS, SNS) | OK | All resources exist and accessible |
| Compute (EC2 Consolidated, ALB) | OK | All 4 services running, ALB healthy |
| API Gateway | OK | 28 routes, health returns 200 |
| Scheduled Lambdas | ISSUES | health-check timeouts due to stale Service Discovery |
| Step Functions | OK | Both workflows exist, recent executions SUCCEEDED |
| Backtest Pipeline (SQS→Lambda→StepFunctions) | NOT TESTED | backtest-consumer Lambda never invoked |
| Monitoring (CloudWatch, WAF) | PARTIAL | WAF not associated, 1 alarm intermittent |
| Service Discovery | DEGRADED | Stale entries from terminated instances |

1. PERSISTENT Stack — ALL OK

S3 Buckets (5/5)

| Bucket | Objects | Size | Status |
|---|---|---|---|
| tradai-configs-dev | 1 | 240 B | OK |
| tradai-results-dev | 4 | ~1 MB | OK |
| tradai-arcticdb-dev | 101 | ~3.5 MB | OK |
| tradai-mlflow-dev | 2,973 | ~6 MB | OK |
| tradai-logs-dev | 44,812 | ~555 MB | OK |

DynamoDB Tables (12/12) — All exist

| Table | Has Data | Notes |
|---|---|---|
| tradai-workflow-state-dev | Yes | 3+ jobs (failed, completed) |
| tradai-health-state-dev | Yes | 4 services tracked, but last_checked = None |
| tradai-trading-state-dev | Exists | — |
| tradai-deployments-dev | Exists | — |
| tradai-drift-state-dev | Exists | — |
| tradai-retraining-state-dev | Exists | — |
| tradai-rollback-state-dev | Exists | — |
| tradai-shadow-test-state-dev | Exists | — |
| tradai-notifications-dev | Exists | — |
| tradai-idempotency-dev | Exists | — |
| tradai-infra-drift-state-dev | Exists | — |
| tradai-config-versions-dev | Exists | — |

ECR Repositories (26 total, 24 expected + 2 strategy)

- All service repos have images tagged v0.1.1 + latest (pushed 2026-04-25).
- All lambda repos have images tagged v0.1.1.

Cognito — OK

User Pool: tradai-users-dev (eu-central-1_dHPbxry36)

2. FOUNDATION Stack — ALL OK

VPC & Networking

| Resource | ID/Value | Status |
|---|---|---|
| VPC | vpc-0828db9a63c49b746 (10.0.0.0/16) | OK |
| Public 1a | subnet-01731fb85a12f7121 (10.0.1.0/24) | OK |
| Public 1b | subnet-0abd670a1d25213c3 (10.0.2.0/24) | OK |
| Private 1a | subnet-028bafcb206fba662 (10.0.11.0/24) | OK |
| Private 1b | subnet-00cea26bde780f164 (10.0.12.0/24) | OK |
| Database 1a | subnet-078bb239781c956f7 (10.0.21.0/24) | OK |
| Database 1b | subnet-05197068b40ebecb3 (10.0.22.0/24) | OK |
| NAT Instance | i-04abe49eb07a829ad (t4g.nano, 3.77.81.37) | Running |

RDS — OK

- ID: tradai-mlflow-dev
- Status: available
- Engine: PostgreSQL (db.t4g.micro)
- Endpoint: tradai-mlflow-dev.crcc44o2kjg3.eu-central-1.rds.amazonaws.com

SQS — OK

- tradai-backtest-queue-dev.fifo: exists, 0 messages
- tradai-backtest-dlq-dev.fifo: exists

SNS — OK

- tradai-alerts-dev
- tradai-registration-dev

3. COMPUTE Stack

EC2 Consolidated — OK

| Instance | Type | IP | Status |
|---|---|---|---|
| tradai-consolidated-dev (i-0d81ba92e56c19946) | t3.small | 10.0.11.129 | Running, Healthy |

ASG: tradai-consolidated-asg-dev (min=1, max=1, desired=1), OK

Docker containers on consolidated EC2:

| Container | Status | Port |
|---|---|---|
| backend-api | Up 13h (healthy) | 0.0.0.0:8000 |
| data-collection | Up 13h (healthy) | 0.0.0.0:8002 |
| strategy-service | Up 13h (healthy) | 0.0.0.0:8003 |
| mlflow | Up 13h (healthy) | 0.0.0.0:5000 |

No container errors in CloudWatch logs for the last hour.

ALB — OK

| Setting | Value |
|---|---|
| Name | tradai-dev |
| DNS | tradai-dev-1942285475.eu-central-1.elb.amazonaws.com |
| State | active |
| Type | application |

Target Group Health:

| Target Group | Port | Health Path | Target | Status |
|---|---|---|---|---|
| tradai-backend-api-d | 8000 | /api/v1/health | i-0d81ba92e56c19946 | healthy |
| tradai-mlflow-d | 5000 | /health | i-0d81ba92e56c19946 | healthy |
| tradai-live-trading-d | 8004 | /api/v1/health | (none) | no targets |
| tradai-dry-run-trading-d | 8005 | /api/v1/health | (none) | no targets |

Health endpoint responses:
GET /api/v1/health → 200 OK (0.13s)
Backend: healthy
DynamoDB: healthy (11ms)
SQS: healthy (8ms, 0 msgs)
data-collection: healthy (14ms)
strategy-service: healthy (11ms)
GET /mlflow/health → 200 OK (0.08s)
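
The two checks above can be repeated with a small probe that records the status code and the round-trip time. This is a minimal sketch, not the tooling used for the audit: the `probe` helper and its `opener` parameter are hypothetical names, and the `http://` scheme on the ALB DNS name is an assumption (the audit does not state which listener was used).

```python
import time
import urllib.error
import urllib.request

# ALB DNS name and health paths are taken from this audit; the scheme is an assumption.
ALB_BASE = "http://tradai-dev-1942285475.eu-central-1.elb.amazonaws.com"
HEALTH_PATHS = ["/api/v1/health", "/mlflow/health"]


def probe(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return (status_code, elapsed_seconds) for a single GET request."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want to report.
        status = err.code
    return status, time.monotonic() - start


if __name__ == "__main__":
    for path in HEALTH_PATHS:
        code, elapsed = probe(ALB_BASE + path)
        print(f"GET {path} -> {code} ({elapsed:.2f}s)")
```

The `opener` argument exists only so the function can be exercised without network access; in normal use the default `urllib.request.urlopen` is sufficient.
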

ECS Services

| Service | Desired | Running | Status |
|---|---|---|---|
| tradai-live-trading-dev | 0 | 0 | ACTIVE (by design) |
| tradai-dry-run-trading-dev | 0 | 0 | ACTIVE (by design) |

Note: Main services (backend, data-collection, strategy-service, mlflow) run on consolidated EC2, NOT as ECS services. Only live/dry-run-trading have ECS task definitions.

Lambda Functions (18 total) — ALL EXIST

Scheduled Lambdas execution status:

| Lambda | Schedule | Last Duration | Status |
|---|---|---|---|
| health-check | rate(2 min) | 20-60s (TIMEOUTS) | DEGRADED |
| orphan-scanner | rate(5 min) | 400-470ms | OK |
| trading-heartbeat-check | rate(5 min) | 130-150ms | OK |
| drift-monitor | rate(12 hours) | 4.5ms actual | OK (no-op?) |
| retraining-scheduler | rate(6 hours) | 1.4s | OK |
| pulumi-drift-detector | rate(6 hours) | 3.2ms actual | OK (no-op?) |

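
The durations in the table can be derived from the `REPORT` lines that the Lambda runtime writes to CloudWatch Logs. The following is a minimal sketch of that classification, assuming the log lines have already been fetched as plain strings. The `classify_invocations` function and the 10 ms no-op threshold are hypothetical choices for illustration, not values taken from the audit.

```python
import re

# Standard Lambda runtime log lines: a REPORT line per invocation, plus a
# "Task timed out" line (prefixed by timestamp and request ID) on timeout.
REPORT_RE = re.compile(r"REPORT RequestId: (\S+)\s+Duration: ([\d.]+) ms")
TIMEOUT_RE = re.compile(r"(\S+) Task timed out after [\d.]+ seconds")


def classify_invocations(log_lines, noop_ms=10.0):
    """Count timeouts, suspected no-ops, and normal runs per invocation.

    noop_ms is an assumed threshold: runs shorter than this are flagged as
    suspected no-ops (init and return without real work).
    """
    lines = list(log_lines)
    # First pass: collect request IDs that hit the function timeout.
    timed_out = {m.group(1) for line in lines if (m := TIMEOUT_RE.search(line))}

    counts = {"timeout": 0, "noop": 0, "normal": 0}
    # Second pass: classify each invocation once, using its REPORT line.
    for line in lines:
        m = REPORT_RE.search(line)
        if not m:
            continue
        request_id, duration_ms = m.group(1), float(m.group(2))
        if request_id in timed_out:
            counts["timeout"] += 1
        elif duration_ms < noop_ms:
            counts["noop"] += 1
        else:
            counts["normal"] += 1
    return counts
```

Keying on the request ID avoids counting a timed-out invocation twice, since a timeout produces both a `REPORT` line and a `Task timed out` line.
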
Step Functions Lambdas:

| Lambda | Log Bytes | Status |
|---|---|---|
| validate-strategy | 9,746 | Has been invoked |
| data-collection-proxy | 9,476 | Has been invoked |
| update-status | 16,093 | Has been invoked |
| notify-completion | 91,744 | Has been invoked |
| check-retraining-needed | 56,382 | Has been invoked |
| compare-models | 59,697 | Has been invoked |
| cleanup-resources | 3,439 | Has been invoked |
| backtest-consumer | 0 | NEVER INVOKED |
| promote-model | 0 | NEVER INVOKED |
| model-rollback | 0 | NEVER INVOKED |
| sqs-consumer | 0 | NEVER INVOKED |

Step Functions — OK

| Workflow | Recent Executions | All Succeeded |
|---|---|---|
| tradai-backtest-workflow-dev | 5 (Apr 10-12) | Yes |
| tradai-retraining-workflow-dev | 5 (Apr 23) | Yes |

Service Discovery — DEGRADED

Namespace: tradai-dev.local (ns-upztkwfphvzqzn2j)

Stale entries detected. Only i-0d81ba92e56c19946 (10.0.11.129) is the current consolidated instance. Old entries from terminated instances remain:

| Service | Total Entries | Current (10.0.11.129) | Stale |
|---|---|---|---|
| backend-api | 4 | 1 | 3 (IPs: 10.0.11.100, 10.0.11.52, 10.0.11.201) |
| data-collection | 2 | 1 | 1 (IP: 10.0.11.201) |
| strategy-service | 3 | 1 | 2 (IPs: 10.0.11.162, 10.0.11.201) |
| mlflow | 4 | 1 | 3 (IPs: 10.0.11.100, 10.0.11.162, 10.0.11.201) |
| live-trading | 0 | 0 | 0 |
| dry-run-trading | 0 | 0 | 0 |

Impact: Lambda health-check resolves DNS and sometimes hits stale IPs, causing 10-30s timeouts per service. This explains the 20-60s total duration and frequent 60s timeouts.

Stale instance IDs:

- i-0036ccac4827f82ca: terminated
- i-09ff8e414fe16f7e7: does not exist
- i-0abb996a2c9885150: does not exist
- i-0aefcaeee239c441d: terminated

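
Removing the stale entries can be scripted against the Cloud Map API. The sketch below assumes that each Service Discovery instance is registered under its EC2 instance ID (which the ID list above suggests) and that every service in the namespace has fewer entries than one API page, so pagination is omitted. The `find_stale` and `deregister_stale` names are hypothetical, and the script defaults to a dry run.

```python
from __future__ import annotations

from typing import Iterable

# Current consolidated instance and namespace ID, both taken from this audit.
CURRENT_INSTANCE_ID = "i-0d81ba92e56c19946"
NAMESPACE_ID = "ns-upztkwfphvzqzn2j"


def find_stale(instances: Iterable[dict], current_id: str = CURRENT_INSTANCE_ID) -> list[str]:
    """Return the IDs of registered instances that are not the current one."""
    return [inst["Id"] for inst in instances if inst["Id"] != current_id]


def deregister_stale(namespace_id: str = NAMESPACE_ID, dry_run: bool = True) -> dict[str, list[str]]:
    """List (and, unless dry_run, deregister) stale entries per service."""
    import boto3  # imported lazily so find_stale works without AWS access

    sd = boto3.client("servicediscovery")
    services = sd.list_services(
        Filters=[{"Name": "NAMESPACE_ID", "Values": [namespace_id], "Condition": "EQ"}]
    )["Services"]

    result: dict[str, list[str]] = {}
    for svc in services:
        instances = sd.list_instances(ServiceId=svc["Id"])["Instances"]
        stale = find_stale(instances)
        if not dry_run:
            for instance_id in stale:
                sd.deregister_instance(ServiceId=svc["Id"], InstanceId=instance_id)
        result[svc["Name"]] = stale
    return result


if __name__ == "__main__":
    for name, stale_ids in deregister_stale(dry_run=True).items():
        print(f"{name}: {len(stale_ids)} stale -> {stale_ids}")
```

Running with `dry_run=True` first should list the same 9 stale entries as the table above; only then would `dry_run=False` be appropriate. This cleans up the current state but does not address why terminated instances are not deregistered in the first place.
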
4. EDGE Stack

API Gateway — OK

| Setting | Value |
|---|---|
| Name | tradai-api-dev |
| Endpoint | https://z9uaqcerrd.execute-api.eu-central-1.amazonaws.com |
| Protocol | HTTP |
| Routes | 28 |
| Auth | JWT (Cognito) on all except GET /health |

GET /api/v1/health via API Gateway → 200 OK

WAF — EXISTS, NOT ASSOCIATED

- WebACL: tradai-waf-dev (0bffab6d-0f5c-47b5-b9a3-6f96061034b2)
- Log group: aws-waf-logs-tradai-dev (0 bytes, never received traffic)
- Known bug: WAFv2 cannot parse $default in HTTP API stage ARN

CloudWatch Alarms (28 total)

| State | Count | Alarms |
|---|---|---|
| OK | 27 | All except mlflow latency |
| ALARM | 0-1 | tradai-mlflow-high-latency-dev (intermittent, currently OK) |

ISSUES TO FIX (Priority Order)

P0 — Critical

1. Stale Service Discovery entries causing health-check timeouts
   - Problem: 9 stale SD entries from terminated EC2 instances
   - Impact: health-check Lambda times out on ~50% of invocations (60s timeout); services are sometimes unreachable via SD DNS
   - Fix: Deregister stale instances from Service Discovery
   - Stale instances to remove:
     - i-0036ccac4827f82ca from backend-api, mlflow
     - i-09ff8e414fe16f7e7 from backend-api
     - i-0abb996a2c9885150 from strategy-service, mlflow
     - i-0aefcaeee239c441d from all 4 services
   - Root cause: ASG lifecycle hook or cloud-init cleanup not deregistering on termination

P1 — Important

2. backtest-consumer Lambda never invoked
   - Problem: SQS event source mapping is Enabled, but the Lambda has 0 log bytes ever
   - Impact: POST /api/v1/backtests via API Gateway → SQS → backtest-consumer → Step Functions pipeline is untested
   - Diagnosis needed: Either the API Gateway SQS integration never receives requests, or the integration is misconfigured
   - Action: Submit a test backtest through API Gateway POST /api/v1/backtests to verify the full pipeline

3. WAF not associated with API Gateway
   - Problem: WebACL exists but is not attached to the HTTP API $default stage
   - Impact: No WAF protection on API Gateway
   - Known issue: WAFv2 cannot parse $default stage ARN

P2 — Monitor

4. drift-monitor and pulumi-drift-detector running as no-ops
   - Problem: Both execute in 3-5ms (just init+return), not doing real work
   - Impact: No actual drift detection happening
   - Likely cause: Missing configuration or empty state; need to check whether MLflow has models to monitor

5. promote-model and model-rollback Lambdas never invoked
   - Problem: These Lambdas have 0 bytes in logs (never executed)
   - Impact: Model promotion and rollback pipeline untested
   - Note: Acceptable if no models have been promoted yet in dev

6. DynamoDB health_state missing timestamps
   - Problem: last_checked field is None for all 4 services
   - Impact: Health history not persisted correctly
   - Likely cause: health-check Lambda may be using different attribute names, or timeouts prevent writing

P3 — Cleanup

7. Container Insights log group empty
   - /aws/ecs/containerinsights/tradai-dev/performance: 0 bytes (1 day retention)
   - Consolidated EC2 mode may not emit Container Insights metrics

8. ECS log group has minimal data
   - /aws/ecs/tradai-dev: 444 KB; /ecs/tradai/dev: 558 KB
   - Expected: most logs go to /tradai/consolidated/containers (~97 MB)

RESOURCE INVENTORY CHECKLIST

Persistent Stack
Foundation Stack
Compute Stack
Edge Stack