
Dev Environment Health Audit

Date: 2026-04-27 | Account: 600802701449 | Region: eu-central-1 | Profile: tradai

Summary

Category | Status | Details
Infrastructure (S3, DynamoDB, ECR, VPC, RDS, SQS, SNS) | OK | All resources exist and accessible
Compute (EC2 Consolidated, ALB) | OK | All 4 services running, ALB healthy
API Gateway | OK | 28 routes, health returns 200
Scheduled Lambdas | ISSUES | health-check timeouts due to stale Service Discovery
Step Functions | OK | Both workflows exist, recent executions SUCCEEDED
Backtest Pipeline (SQS→Lambda→StepFunctions) | NOT TESTED | backtest-consumer Lambda never invoked
Monitoring (CloudWatch, WAF) | PARTIAL | WAF not associated, 1 alarm intermittent
Service Discovery | DEGRADED | Stale entries from terminated instances

1. PERSISTENT Stack — ALL OK

S3 Buckets (5/5)

Bucket | Objects | Size | Status
tradai-configs-dev | 1 | 240 B | OK
tradai-results-dev | 4 | ~1 MB | OK
tradai-arcticdb-dev | 101 | ~3.5 MB | OK
tradai-mlflow-dev | 2,973 | ~6 MB | OK
tradai-logs-dev | 44,812 | ~555 MB | OK

DynamoDB Tables (12/12) — All exist

Table Has Data Notes
tradai-workflow-state-dev Yes 3+ jobs (failed, completed)
tradai-health-state-dev Yes 4 services tracked, but last_checked = None
tradai-trading-state-dev Exists
tradai-deployments-dev Exists
tradai-drift-state-dev Exists
tradai-retraining-state-dev Exists
tradai-rollback-state-dev Exists
tradai-shadow-test-state-dev Exists
tradai-notifications-dev Exists
tradai-idempotency-dev Exists
tradai-infra-drift-state-dev Exists
tradai-config-versions-dev Exists

ECR Repositories (26 total, 24 expected + 2 strategy)

All service repos have images tagged v0.1.1 + latest (pushed 2026-04-25). All lambda repos have images tagged v0.1.1.

Cognito — OK

  • User Pool: tradai-users-dev (eu-central-1_dHPbxry36)

2. FOUNDATION Stack — ALL OK

VPC & Networking

Resource | ID/Value | Status
VPC | vpc-0828db9a63c49b746 (10.0.0.0/16) | OK
Public 1a | subnet-01731fb85a12f7121 (10.0.1.0/24) | OK
Public 1b | subnet-0abd670a1d25213c3 (10.0.2.0/24) | OK
Private 1a | subnet-028bafcb206fba662 (10.0.11.0/24) | OK
Private 1b | subnet-00cea26bde780f164 (10.0.12.0/24) | OK
Database 1a | subnet-078bb239781c956f7 (10.0.21.0/24) | OK
Database 1b | subnet-05197068b40ebecb3 (10.0.22.0/24) | OK
NAT Instance | i-04abe49eb07a829ad (t4g.nano, 3.77.81.37) | Running

RDS — OK

  • ID: tradai-mlflow-dev
  • Status: available
  • Engine: PostgreSQL (db.t4g.micro)
  • Endpoint: tradai-mlflow-dev.crcc44o2kjg3.eu-central-1.rds.amazonaws.com

SQS — OK

  • tradai-backtest-queue-dev.fifo — exists, 0 messages
  • tradai-backtest-dlq-dev.fifo — exists

SNS — OK

  • tradai-alerts-dev
  • tradai-registration-dev

3. COMPUTE Stack

EC2 Consolidated — OK

Instance | Type | IP | Status
tradai-consolidated-dev (i-0d81ba92e56c19946) | t3.small | 10.0.11.129 | Running, Healthy
ASG: tradai-consolidated-asg-dev | min=1, max=1, desired=1 | | OK

Docker containers on consolidated EC2:

Container | Status | Port
backend-api | Up 13h (healthy) | 0.0.0.0:8000
data-collection | Up 13h (healthy) | 0.0.0.0:8002
strategy-service | Up 13h (healthy) | 0.0.0.0:8003
mlflow | Up 13h (healthy) | 0.0.0.0:5000

No container errors in CloudWatch logs for the last hour.

ALB — OK

Setting | Value
Name | tradai-dev
DNS | tradai-dev-1942285475.eu-central-1.elb.amazonaws.com
State | active
Type | application

Target Group Health:

Target Group | Port | Health Path | Target | Status
tradai-backend-api-d | 8000 | /api/v1/health | i-0d81ba92e56c19946 | healthy
tradai-mlflow-d | 5000 | /health | i-0d81ba92e56c19946 | healthy
tradai-live-trading-d | 8004 | /api/v1/health | (none) | no targets
tradai-dry-run-trading-d | 8005 | /api/v1/health | (none) | no targets

Health endpoint responses:

GET /api/v1/health → 200 OK (0.13s)
  Backend: healthy
  DynamoDB: healthy (11ms)
  SQS: healthy (8ms, 0 msgs)
  data-collection: healthy (14ms)
  strategy-service: healthy (11ms)

GET /mlflow/health → 200 OK (0.08s)

ECS Services

Service | Desired | Running | Status
tradai-live-trading-dev | 0 | 0 | ACTIVE (by design)
tradai-dry-run-trading-dev | 0 | 0 | ACTIVE (by design)

Note: Main services (backend, data-collection, strategy-service, mlflow) run on consolidated EC2, NOT as ECS services. Only live/dry-run-trading have ECS task definitions.

Lambda Functions (18 total) — ALL EXIST

Scheduled Lambdas execution status:

Lambda | Schedule | Last Duration | Status
health-check | rate(2 min) | 20-60s (TIMEOUTS) | DEGRADED
orphan-scanner | rate(5 min) | 400-470ms | OK
trading-heartbeat-check | rate(5 min) | 130-150ms | OK
drift-monitor | rate(12 hours) | 4.5ms actual | OK (no-op?)
retraining-scheduler | rate(6 hours) | 1.4s | OK
pulumi-drift-detector | rate(6 hours) | 3.2ms actual | OK (no-op?)

Step Functions Lambdas:

Lambda | Log Bytes | Status
validate-strategy | 9,746 | Has been invoked
data-collection-proxy | 9,476 | Has been invoked
update-status | 16,093 | Has been invoked
notify-completion | 91,744 | Has been invoked
check-retraining-needed | 56,382 | Has been invoked
compare-models | 59,697 | Has been invoked
cleanup-resources | 3,439 | Has been invoked
backtest-consumer | 0 | NEVER INVOKED
promote-model | 0 | NEVER INVOKED
model-rollback | 0 | NEVER INVOKED
sqs-consumer | 0 | NEVER INVOKED
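
The "never invoked" classification above rests on log groups with zero stored bytes. A minimal sketch of that check follows. It assumes the Lambdas log to groups under the prefix /aws/lambda/tradai- (an assumption, not confirmed by this audit), and it treats a missing storedBytes value as zero. Note that storedBytes can lag behind recent writes, so a zero is a hint and not proof.

```python
from typing import Dict, List


def never_invoked(log_groups: List[Dict]) -> List[str]:
    """Return names of log groups that report zero stored bytes."""
    return sorted(
        g["logGroupName"] for g in log_groups if g.get("storedBytes", 0) == 0
    )


def fetch_lambda_log_groups(prefix: str = "/aws/lambda/tradai-") -> List[Dict]:
    """Fetch log group descriptions via CloudWatch Logs.

    The prefix is hypothetical; adjust it to the real function naming.
    """
    import boto3  # imported lazily so never_invoked() works offline

    logs = boto3.client("logs", region_name="eu-central-1")
    groups: List[Dict] = []
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        groups.extend(page["logGroups"])
    return groups
```

Usage would be along the lines of never_invoked(fetch_lambda_log_groups()), run with the tradai profile active.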

Step Functions — OK

Workflow | Recent Executions | All Succeeded
tradai-backtest-workflow-dev | 5 (Apr 10-12) | Yes
tradai-retraining-workflow-dev | 5 (Apr 23) | Yes

Service Discovery — DEGRADED

Namespace: tradai-dev.local (ns-upztkwfphvzqzn2j)

Stale entries detected. Only i-0d81ba92e56c19946 (10.0.11.129) is the current consolidated instance. Old entries from terminated instances remain:

Service | Total Entries | Current (10.0.11.129) | Stale
backend-api | 4 | 1 | 3 (IPs: 10.0.11.100, 10.0.11.52, 10.0.11.201)
data-collection | 2 | 1 | 1 (IP: 10.0.11.201)
strategy-service | 3 | 1 | 2 (IPs: 10.0.11.162, 10.0.11.201)
mlflow | 4 | 1 | 3 (IPs: 10.0.11.100, 10.0.11.162, 10.0.11.201)
live-trading | 0 | 0 | 0
dry-run-trading | 0 | 0 | 0

Impact: Lambda health-check resolves DNS and sometimes hits stale IPs, causing 10-30s timeouts per service. This explains the 20-60s total duration and frequent 60s timeouts.
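
To illustrate why slow runs are so frequent, the sketch below computes the share of stale records per service from the entry counts in the table above. It assumes the resolver picks one returned record uniformly at random and that picks are independent across services. Both are simplifying assumptions and were not measured in this audit.

```python
from typing import Dict, Tuple

# (total entries, stale entries) per service, copied from the table above.
ENTRIES: Dict[str, Tuple[int, int]] = {
    "backend-api": (4, 3),
    "data-collection": (2, 1),
    "strategy-service": (3, 2),
    "mlflow": (4, 3),
}


def stale_fraction(total: int, stale: int) -> float:
    """Chance that a single uniformly random record is stale."""
    return stale / total if total else 0.0


def all_fresh_probability(entries: Dict[str, Tuple[int, int]]) -> float:
    """Chance that every service resolves to a live record in one run."""
    p = 1.0
    for total, stale in entries.values():
        p *= 1.0 - stale_fraction(total, stale)
    return p


if __name__ == "__main__":
    for name, (total, stale) in ENTRIES.items():
        print(f"{name}: {stale_fraction(total, stale):.0%} of records stale")
    print(f"run with no stale hit: {all_fresh_probability(ENTRIES):.1%}")
```

Under these assumptions a run that avoids every stale record is rare, which is consistent with the 20-60s durations reported for health-check.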

Stale instance IDs:

  • i-0036ccac4827f82ca: terminated
  • i-09ff8e414fe16f7e7: does not exist
  • i-0abb996a2c9885150: does not exist
  • i-0aefcaeee239c441d: terminated


4. EDGE Stack

API Gateway — OK

Setting | Value
Name | tradai-api-dev
Endpoint | https://z9uaqcerrd.execute-api.eu-central-1.amazonaws.com
Protocol | HTTP
Routes | 28
Auth | JWT (Cognito) on all except GET /health

GET /api/v1/health via API Gateway → 200 OK

WAF — EXISTS, NOT ASSOCIATED

  • WebACL: tradai-waf-dev (0bffab6d-0f5c-47b5-b9a3-6f96061034b2)
  • Log group: aws-waf-logs-tradai-dev (0 bytes — never received traffic)
  • Known bug: WAFv2 cannot parse $default in HTTP API stage ARN

CloudWatch Alarms (28 total)

State | Count | Alarms
OK | 27 | All except mlflow latency
ALARM | 0-1 | tradai-mlflow-high-latency-dev (intermittent, currently OK)

ISSUES TO FIX (Priority Order)

P0 — Critical

1. Stale Service Discovery entries causing health-check timeouts

  • Problem: 9 stale Service Discovery (SD) entries left behind by terminated EC2 instances
  • Impact: the health-check Lambda times out on roughly 50% of invocations (60s timeout), and services are sometimes unreachable via SD DNS
  • Fix: deregister the stale instances from Service Discovery
  • Stale instances to remove:
      • i-0036ccac4827f82ca from backend-api, mlflow
      • i-09ff8e414fe16f7e7 from backend-api
      • i-0abb996a2c9885150 from strategy-service, mlflow
      • i-0aefcaeee239c441d from all 4 services
  • Likely root cause: the ASG lifecycle hook or cloud-init cleanup does not deregister instances on termination
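
The deregistration fix could be scripted roughly as follows. This is a sketch under stated assumptions: it treats any registered instance whose AWS_INSTANCE_IPV4 attribute differs from the current consolidated IP (10.0.11.129) as stale, it expects the caller to supply the Cloud Map service ID (not recorded in this audit), and it ignores pagination because each service has only a handful of entries. It defaults to a dry run.

```python
from typing import Dict, Iterable, List

CURRENT_IP = "10.0.11.129"  # current consolidated instance, from this audit


def find_stale(instances: Iterable[Dict], current_ip: str = CURRENT_IP) -> List[str]:
    """Return IDs of instances registered with an IPv4 other than current_ip.

    Each item follows the shape of the Cloud Map ListInstances result:
    {"Id": "...", "Attributes": {"AWS_INSTANCE_IPV4": "..."}}.
    Instances without an IPv4 attribute are skipped, not flagged.
    """
    stale = []
    for inst in instances:
        ip = inst.get("Attributes", {}).get("AWS_INSTANCE_IPV4")
        if ip is not None and ip != current_ip:
            stale.append(inst["Id"])
    return stale


def deregister_stale(service_id: str, dry_run: bool = True) -> List[str]:
    """List stale instances for one service and optionally deregister them."""
    import boto3  # imported lazily so find_stale() can be tested offline

    sd = boto3.client("servicediscovery", region_name="eu-central-1")
    instances = sd.list_instances(ServiceId=service_id)["Instances"]
    stale = find_stale(instances)
    if not dry_run:
        for instance_id in stale:
            sd.deregister_instance(ServiceId=service_id, InstanceId=instance_id)
    return stale
```

Running it once per service with dry_run=True and comparing the result against the stale list above is a reasonable check before passing dry_run=False.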

P1 — Important

2. backtest-consumer Lambda never invoked

  • Problem: the SQS event source mapping is Enabled, but the Lambda's log group has never received any bytes
  • Impact: the POST /api/v1/backtests pipeline (API Gateway → SQS → backtest-consumer → Step Functions) is untested
  • Diagnosis needed: either the API Gateway SQS integration never receives requests, or the integration is misconfigured. Note that tradai-backtest-workflow-dev has 5 recent successful executions, so those were presumably started by a path other than backtest-consumer
  • Action: submit a test backtest through API Gateway (POST /api/v1/backtests) to verify the full pipeline
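
A test submission could look like the sketch below. The endpoint and route come from this audit. The request payload and the way the Cognito JWT is obtained are not documented here, so the payload passed in is hypothetical and the token is expected from the caller. The Bearer prefix on the Authorization header is also an assumption about how the JWT authorizer is configured.

```python
import json
import urllib.request

# API Gateway endpoint recorded in the EDGE Stack section of this audit.
API_BASE = "https://z9uaqcerrd.execute-api.eu-central-1.amazonaws.com"


def build_backtest_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/backtests request."""
    return urllib.request.Request(
        url=f"{API_BASE}/api/v1/backtests",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed header format
            "Content-Type": "application/json",
        },
    )


def submit_backtest(token: str, payload: dict, timeout: float = 10.0):
    """Send the request and return (HTTP status, response body)."""
    request = build_backtest_request(token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

After a submission, the queue depth of tradai-backtest-queue-dev.fifo and the backtest-consumer log group are the two places to confirm whether the message travelled past API Gateway.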

3. WAF not associated with API Gateway

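  • Problem: the WebACL exists but is not attached to the HTTP API $default stage
  • Impact: no WAF protection on API Gateway
  • Known issue: WAFv2 cannot parse the $default stage ARN. AWS documentation lists API Gateway REST API stages, not HTTP APIs, as supported WAF resources, so the association may not be possible for this API type at all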

P2 — Monitor

4. drift-monitor and pulumi-drift-detector running as no-ops

  • Problem: both finish in 3-5ms (init and return only), so they are not doing real work
  • Impact: no actual drift detection is happening
  • Likely cause: missing configuration or empty state. For drift-monitor, check whether MLflow has any models to monitor
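
One way to confirm the no-op behaviour over a longer window is to parse the REPORT lines that Lambda writes to CloudWatch Logs at the end of each invocation. The sketch below works on log lines that have already been fetched. The 10 ms threshold is an arbitrary assumption chosen to sit above the 3-5ms durations seen here.

```python
import re
from typing import Iterable, List

# Lambda ends each invocation with a line such as:
# REPORT RequestId: <id>  Duration: 4.50 ms  Billed Duration: 5 ms  ...
REPORT_RE = re.compile(r"^REPORT RequestId: \S+\s+Duration: ([\d.]+) ms")


def durations_ms(log_lines: Iterable[str]) -> List[float]:
    """Extract the Duration value (in ms) from every REPORT line."""
    found = []
    for line in log_lines:
        match = REPORT_RE.match(line)
        if match:
            found.append(float(match.group(1)))
    return found


def looks_like_noop(log_lines: Iterable[str], threshold_ms: float = 10.0) -> bool:
    """True if there is at least one invocation and all are under threshold."""
    durations = durations_ms(log_lines)
    return bool(durations) and max(durations) < threshold_ms
```

If every invocation in the retention window stays under the threshold, the function is most likely returning before it reaches any drift logic.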

5. promote-model, model-rollback, sqs-consumer never invoked

  • Problem: These Lambdas have 0 bytes in logs (never executed)
  • Impact: Model promotion and rollback pipeline untested
  • Note: Acceptable if no models have been promoted yet in dev

6. DynamoDB tradai-health-state-dev missing timestamps

  • Problem: last_checked field is None for all 4 services
  • Impact: Health history not persisted correctly
  • Likely cause: health-check Lambda may be using different attribute names, or timeouts prevent writing
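
Both likely causes can be narrowed down by inspecting the items directly. The sketch below lists the services whose last_checked is missing or None and collects every attribute name in use, which would reveal a differently named timestamp field. The key name service_name is hypothetical, since the table schema is not recorded in this audit.

```python
from typing import Dict, Iterable, List, Set


def missing_last_checked(items: Iterable[Dict], key: str = "service_name") -> List[str]:
    """Return the services whose last_checked is absent or None.

    `key` is a hypothetical name for the attribute identifying the service.
    """
    return sorted(
        str(item.get(key, "<unknown>"))
        for item in items
        if item.get("last_checked") is None
    )


def attribute_names(items: Iterable[Dict]) -> Set[str]:
    """Union of all attribute names, to spot an alternative timestamp field."""
    names: Set[str] = set()
    for item in items:
        names.update(item.keys())
    return names


def scan_health_state(table_name: str = "tradai-health-state-dev") -> List[Dict]:
    """Read all items from the table (small table, so a single scan is assumed)."""
    import boto3  # imported lazily so the helpers above work offline

    table = boto3.resource("dynamodb", region_name="eu-central-1").Table(table_name)
    return table.scan()["Items"]
```

If attribute_names() shows a timestamp under another name, the attribute-name mismatch is the more likely cause; if no timestamp attribute exists at all, the write is probably being cut off by the timeouts.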

P3 — Cleanup

7. Container Insights log group empty

  • /aws/ecs/containerinsights/tradai-dev/performance — 0 bytes (1 day retention)
  • Consolidated EC2 mode may not emit Container Insights metrics

8. ECS log groups have minimal data

  • /aws/ecs/tradai-dev — 444 KB, /ecs/tradai/dev — 558 KB
  • Expected: most logs go to /tradai/consolidated/containers (~97 MB)

RESOURCE INVENTORY CHECKLIST

Persistent Stack

  • S3: tradai-configs-dev
  • S3: tradai-results-dev
  • S3: tradai-arcticdb-dev
  • S3: tradai-mlflow-dev
  • S3: tradai-logs-dev
  • DynamoDB: 12/12 tables
  • ECR: 24+ repositories with images
  • Cognito: tradai-users-dev
  • CodeArtifact: (not checked individually)
  • CloudTrail: logs flowing (543 MB)

Foundation Stack

  • VPC: 10.0.0.0/16
  • Subnets: 6/6 (2 public, 2 private, 2 database)
  • NAT Instance: running (t4g.nano)
  • RDS: available (PostgreSQL 15)
  • SQS: 2 queues (backtest + DLQ)
  • SNS: 2 topics (alerts + registration)
  • Security Groups: (checked via VPC)
  • VPC Endpoints: (not checked individually)

Compute Stack

  • ECS Cluster: tradai-dev
  • EC2 Consolidated: running, 4 containers healthy
  • ALB: active, targets healthy
  • Lambda: 18/18 functions exist
  • EventBridge: 7 rules, all ENABLED
  • Step Functions: 2 workflows
  • IAM Roles: (implicit — services functioning)
  • Service Discovery: namespace exists, BUT stale entries

Edge Stack

  • API Gateway: 28 routes, health OK
  • WAF: exists (NOT associated)
  • CloudWatch Alarms: 28 alarms
  • CloudWatch Dashboard: (not checked)