Dev Environment Health Audit

Date: 2026-04-27 | Account: 600802701449 | Region: eu-central-1 | Profile: tradai

Summary

Category	Status	Details
Infrastructure (S3, DynamoDB, ECR, VPC, RDS, SQS, SNS)	OK	All resources exist and accessible
Compute (EC2 Consolidated, ALB)	OK	All 4 services running, ALB healthy
API Gateway	OK	28 routes, health returns 200
Scheduled Lambdas	ISSUES	health-check timeouts due to stale Service Discovery
Step Functions	OK	Both workflows exist, recent executions SUCCEEDED
Backtest Pipeline (SQS→Lambda→StepFunctions)	NOT TESTED	backtest-consumer Lambda never invoked
Monitoring (CloudWatch, WAF)	PARTIAL	WAF not associated, 1 alarm intermittent
Service Discovery	DEGRADED	Stale entries from terminated instances

1. PERSISTENT Stack: ALL OK

S3 Buckets (5/5)

Bucket	Objects	Size	Status
tradai-configs-dev	1	240 B	OK
tradai-results-dev	4	~1 MB	OK
tradai-arcticdb-dev	101	~3.5 MB	OK
tradai-mlflow-dev	2,973	~6 MB	OK
tradai-logs-dev	44,812	~555 MB	OK

DynamoDB Tables (12/12): all exist

Table	Has Data	Notes
tradai-workflow-state-dev	Yes	3+ jobs (failed, completed)
tradai-health-state-dev	Yes	4 services tracked, but `last_checked` = None
tradai-trading-state-dev	Exists	—
tradai-deployments-dev	Exists	—
tradai-drift-state-dev	Exists	—
tradai-retraining-state-dev	Exists	—
tradai-rollback-state-dev	Exists	—
tradai-shadow-test-state-dev	Exists	—
tradai-notifications-dev	Exists	—
tradai-idempotency-dev	Exists	—
tradai-infra-drift-state-dev	Exists	—
tradai-config-versions-dev	Exists	—

ECR Repositories (26 total, 24 expected + 2 strategy)

All service repos have images tagged v0.1.1 + latest (pushed 2026-04-25). All lambda repos have images tagged v0.1.1.

Cognito: OK

User Pool: tradai-users-dev (eu-central-1_dHPbxry36)

2. FOUNDATION Stack: ALL OK

VPC & Networking

Resource	ID/Value	Status
VPC	vpc-0828db9a63c49b746 (10.0.0.0/16)	OK
Public 1a	subnet-01731fb85a12f7121 (10.0.1.0/24)	OK
Public 1b	subnet-0abd670a1d25213c3 (10.0.2.0/24)	OK
Private 1a	subnet-028bafcb206fba662 (10.0.11.0/24)	OK
Private 1b	subnet-00cea26bde780f164 (10.0.12.0/24)	OK
Database 1a	subnet-078bb239781c956f7 (10.0.21.0/24)	OK
Database 1b	subnet-05197068b40ebecb3 (10.0.22.0/24)	OK
NAT Instance	i-04abe49eb07a829ad (t4g.nano, 3.77.81.37)	Running

RDS: OK

ID: tradai-mlflow-dev
Status: available
Engine: PostgreSQL (db.t4g.micro)
Endpoint: tradai-mlflow-dev.crcc44o2kjg3.eu-central-1.rds.amazonaws.com

SQS: OK

tradai-backtest-queue-dev.fifo — exists, 0 messages
tradai-backtest-dlq-dev.fifo — exists

SNS topics:
tradai-alerts-dev
tradai-registration-dev

3. COMPUTE Stack

EC2 Consolidated: OK

Instance	Type	IP	Status
tradai-consolidated-dev (i-0d81ba92e56c19946)	t3.small	10.0.11.129	Running, Healthy
ASG: tradai-consolidated-asg-dev	min=1, max=1, desired=1	—	OK

Docker containers on consolidated EC2:

Container	Status	Port
backend-api	Up 13h (healthy)	0.0.0.0:8000
data-collection	Up 13h (healthy)	0.0.0.0:8002
strategy-service	Up 13h (healthy)	0.0.0.0:8003
mlflow	Up 13h (healthy)	0.0.0.0:5000

No container errors in CloudWatch logs for the last hour.

ALB: OK

Setting	Value
Name	tradai-dev
DNS	tradai-dev-1942285475.eu-central-1.elb.amazonaws.com
State	active
Type	application

Target Group Health:

Target Group	Port	Health Path	Target	Status
tradai-backend-api-d	8000	/api/v1/health	i-0d81ba92e56c19946	healthy
tradai-mlflow-d	5000	/health	i-0d81ba92e56c19946	healthy
tradai-live-trading-d	8004	/api/v1/health	(none)	no targets
tradai-dry-run-trading-d	8005	/api/v1/health	(none)	no targets

Health endpoint responses:

GET /api/v1/health → 200 OK (0.13s)
  Backend: healthy
  DynamoDB: healthy (11ms)
  SQS: healthy (8ms, 0 msgs)
  data-collection: healthy (14ms)
  strategy-service: healthy (11ms)

GET /mlflow/health → 200 OK (0.08s)

ECS Services

Service	Desired	Running	Status
tradai-live-trading-dev	0	0	ACTIVE (by design)
tradai-dry-run-trading-dev	0	0	ACTIVE (by design)

Note: Main services (backend, data-collection, strategy-service, mlflow) run on consolidated EC2, NOT as ECS services. Only live/dry-run-trading have ECS task definitions.

Lambda Functions (18 total): ALL EXIST

Scheduled Lambdas execution status:

Lambda	Schedule	Last Duration	Status
health-check	rate(2 min)	20-60s (TIMEOUTS)	DEGRADED
orphan-scanner	rate(5 min)	400-470ms	OK
trading-heartbeat-check	rate(5 min)	130-150ms	OK
drift-monitor	rate(12 hours)	4.5ms actual	OK (no-op?)
retraining-scheduler	rate(6 hours)	1.4s	OK
pulumi-drift-detector	rate(6 hours)	3.2ms actual	OK (no-op?)

Step Functions Lambdas:

Lambda	Log Bytes	Status
validate-strategy	9,746	Has been invoked
data-collection-proxy	9,476	Has been invoked
update-status	16,093	Has been invoked
notify-completion	91,744	Has been invoked
check-retraining-needed	56,382	Has been invoked
compare-models	59,697	Has been invoked
cleanup-resources	3,439	Has been invoked
backtest-consumer	0	NEVER INVOKED
promote-model	0	NEVER INVOKED
model-rollback	0	NEVER INVOKED
sqs-consumer	0	NEVER INVOKED

Step Functions: OK

Workflow	Recent Executions	All Succeeded
tradai-backtest-workflow-dev	5 (Apr 10-12)	Yes
tradai-retraining-workflow-dev	5 (Apr 23)	Yes

Service Discovery: DEGRADED

Namespace: tradai-dev.local (ns-upztkwfphvzqzn2j)

Stale entries detected. Only i-0d81ba92e56c19946 (10.0.11.129) is the current consolidated instance. Old entries from terminated instances remain:

Service	Total Entries	Current (10.0.11.129)	Stale
backend-api	4	1	3 (IPs: 10.0.11.100, 10.0.11.52, 10.0.11.201)
data-collection	2	1	1 (IP: 10.0.11.201)
strategy-service	3	1	2 (IPs: 10.0.11.162, 10.0.11.201)
mlflow	4	1	3 (IPs: 10.0.11.100, 10.0.11.162, 10.0.11.201)
live-trading	0	0	0
dry-run-trading	0	0	0

Impact: The health-check Lambda resolves each service through Service Discovery DNS and sometimes gets a stale IP, which costs a 10-30s timeout per affected service. This accounts for the 20-60s total durations and the frequent hits on the 60s Lambda timeout.
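To get a feel for how often a health-check run could touch a stale record, the entry counts from the table above can be turned into a rough probability model. This is only a sketch: it assumes each lookup picks uniformly at random among the registered records and that the four lookups are independent, neither of which was measured in this audit.

```python
from fractions import Fraction

# (total entries, current entries) per service, copied from the table above.
ENTRIES = {
    "backend-api": (4, 1),
    "data-collection": (2, 1),
    "strategy-service": (3, 1),
    "mlflow": (4, 1),
}


def stale_probability(total: int, current: int) -> Fraction:
    """Chance that one lookup lands on a stale record (ASSUMES a uniform pick)."""
    return Fraction(total - current, total)


def all_fresh_probability(entries: dict) -> Fraction:
    """Chance that every service resolves to a live record (ASSUMES independence)."""
    p = Fraction(1)
    for total, current in entries.values():
        p *= 1 - stale_probability(total, current)
    return p


def expected_stale_hits(entries: dict) -> Fraction:
    """Expected number of services that resolve to a stale record in one run."""
    return sum(
        (stale_probability(total, current) for total, current in entries.values()),
        Fraction(0),
    )


if __name__ == "__main__":
    for name, (total, current) in ENTRIES.items():
        print(f"{name}: P(stale) = {stale_probability(total, current)}")
    print("P(no stale hit in one run) =", all_fresh_probability(ENTRIES))
    print("Expected stale hits per run =", expected_stale_hits(ENTRIES))
```

Under these assumptions a fully clean run would be rare (1 in 96) and a run would hit 8/3 stale records on average. A run only times out when the accumulated delays exceed 60s, and real resolvers may cache or retry, so the observed timeout rate can be lower than this model suggests.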

Stale instance IDs:
i-0036ccac4827f82ca (terminated)
i-09ff8e414fe16f7e7 (does not exist)
i-0abb996a2c9885150 (does not exist)
i-0aefcaeee239c441d (terminated)

4. EDGE Stack

API Gateway: OK

Setting	Value
Name	tradai-api-dev
Endpoint	https://z9uaqcerrd.execute-api.eu-central-1.amazonaws.com
Protocol	HTTP
Routes	28
Auth	JWT (Cognito) on all except GET /health

GET /api/v1/health via API Gateway → 200 OK

WAF: EXISTS, NOT ASSOCIATED

WebACL: tradai-waf-dev (0bffab6d-0f5c-47b5-b9a3-6f96061034b2)
Log group: aws-waf-logs-tradai-dev (0 bytes — never received traffic)
Known limitation: WAFv2 rejects the $default stage ARN of the HTTP API; AWS WAF supports API Gateway REST API stages, not HTTP APIs

CloudWatch Alarms (28 total)

State	Count	Alarms
OK	27	All except mlflow latency
ALARM	0-1	tradai-mlflow-high-latency-dev (intermittent, currently OK)

ISSUES TO FIX (Priority Order)

P0: Critical

1. Stale Service Discovery entries causing health-check timeouts

Problem: 9 stale Service Discovery (SD) entries left behind by terminated EC2 instances
Impact: the health-check Lambda times out on ~50% of invocations (60s timeout), and services are intermittently unreachable via SD DNS
Fix: Deregister the stale instances from Service Discovery
Stale instances to remove:
i-0036ccac4827f82ca from backend-api, mlflow
i-09ff8e414fe16f7e7 from backend-api
i-0abb996a2c9885150 from strategy-service, mlflow
i-0aefcaeee239c441d from all 4 services
Root cause (suspected): the ASG lifecycle hook or cloud-init cleanup is not deregistering instances on termination
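The deregistration fix above could be scripted roughly as follows. This is a sketch, not the project's tooling: it assumes boto3 is installed, that the registrations carry the standard `AWS_INSTANCE_IPV4` attribute, and that the caller supplies the Cloud Map service ID (the `service_id` parameter is a placeholder, since the audit does not list service IDs). It defaults to a dry run.

```python
CURRENT_IP = "10.0.11.129"  # current consolidated instance, from the audit


def find_stale(instances: list, current_ip: str) -> list:
    """Return IDs of registrations whose IPv4 attribute differs from current_ip.

    Registrations without an IPv4 attribute are skipped rather than flagged,
    so nothing is deregistered on missing data.
    """
    stale = []
    for inst in instances:
        ip = inst.get("Attributes", {}).get("AWS_INSTANCE_IPV4")
        if ip is not None and ip != current_ip:
            stale.append(inst["Id"])
    return stale


def deregister_stale(service_id: str, current_ip: str = CURRENT_IP,
                     dry_run: bool = True) -> list:
    """List a service's registrations and deregister the stale ones."""
    import boto3  # imported lazily so find_stale works without boto3

    session = boto3.Session(profile_name="tradai", region_name="eu-central-1")
    sd = session.client("servicediscovery")

    instances = []
    for page in sd.get_paginator("list_instances").paginate(ServiceId=service_id):
        instances.extend(page["Instances"])

    stale = find_stale(instances, current_ip)
    for instance_id in stale:
        if dry_run:
            print(f"would deregister {instance_id} from {service_id}")
        else:
            sd.deregister_instance(ServiceId=service_id, InstanceId=instance_id)
    return stale
```

Running it once per service with `dry_run=True` and comparing the output against the stale-instance list above is a cheap check before deregistering anything.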

P1: Important

2. backtest-consumer Lambda never invoked

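Problem: the SQS event source mapping is Enabled, but the Lambda's log group has never received any data (0 bytes)
Impact: the POST /api/v1/backtests via API Gateway → SQS → backtest-consumer → Step Functions pipeline is untested; the 5 successful backtest workflow executions (Apr 10-12) were presumably started by another path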
Diagnosis needed: Either API Gateway SQS integration never receives requests, or the integration is misconfigured
Action: Submit a test backtest through API Gateway POST /api/v1/backtests to verify the full pipeline
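A test submission for the action above might look like the sketch below, using only the standard library. The endpoint is the API Gateway URL from this audit; the payload fields are hypothetical because the request schema is not part of the audit, and the JWT must come from the Cognito user pool by whatever means the team already uses.

```python
import json
import urllib.request

API_BASE = "https://z9uaqcerrd.execute-api.eu-central-1.amazonaws.com"  # from the audit


def build_backtest_request(token: str, payload: dict,
                           base_url: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST /api/v1/backtests request."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/backtests",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def submit(req: urllib.request.Request, timeout: float = 30.0) -> tuple:
    """Send the request and return (status, body).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

For example, `submit(build_backtest_request(token, {"strategy": "example"}))` with a valid token, where the `strategy` field is a placeholder. After a successful submission, the backtest-consumer log group should no longer be empty and a new execution should appear in tradai-backtest-workflow-dev.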

3. WAF not associated with API Gateway

Problem: WebACL exists but not attached to the HTTP API $default stage
Impact: No WAF protection on API Gateway
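Known issue: WAFv2 rejects the $default stage ARN; AWS WAF supports API Gateway REST API stages, not HTTP APIs, so this API cannot be associated directly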

P2: Monitor

4. drift-monitor and pulumi-drift-detector running as no-ops

Problem: Both execute in 3-5ms (just init+return), not doing real work
Impact: No actual drift detection happening
Likely cause: Missing configuration or empty state — need to check if MLflow has models to monitor
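One way to keep an eye on this is to flag suspiciously short invocations from the `REPORT` lines that Lambda writes to CloudWatch Logs. The sketch below assumes the log lines have already been fetched as strings; the 10 ms threshold is an arbitrary illustration, not a value taken from this audit.

```python
import re
from typing import Iterable, List, Optional

# Matches the invocation duration, but not "Billed Duration" or "Init Duration".
_DURATION = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")


def duration_ms(report_line: str) -> Optional[float]:
    """Extract the invocation duration in ms from a Lambda REPORT line."""
    match = _DURATION.search(report_line)
    return float(match.group(1)) if match else None


def suspected_noops(report_lines: Iterable[str],
                    threshold_ms: float = 10.0) -> List[float]:
    """Return durations below threshold_ms; lines without a duration are ignored."""
    durations = (duration_ms(line) for line in report_lines)
    return [d for d in durations if d is not None and d < threshold_ms]
```

If every recent invocation of drift-monitor or pulumi-drift-detector falls under the threshold, that supports the no-op reading; a mix of short and long runs would point to intermittent work instead.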

5. promote-model, model-rollback, sqs-consumer never invoked

Problem: These Lambdas have 0 bytes in logs (never executed)
Impact: Model promotion and rollback pipeline untested
Note: Acceptable if no models have been promoted yet in dev

6. DynamoDB tradai-health-state-dev missing timestamps

Problem: last_checked field is None for all 4 services
Impact: Health history not persisted correctly
Likely cause: health-check Lambda may be using different attribute names, or timeouts prevent writing
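Both suspected causes can be narrowed down by scanning the table and listing which attributes are actually present. The sketch below assumes boto3 is installed and uses the table name and profile from this audit; only `last_checked` is known from the audit, and any other attribute names it prints are whatever the table happens to contain.

```python
def missing_last_checked(items: list, attr: str = "last_checked") -> list:
    """Return items where attr is absent or stored as None (DynamoDB NULL)."""
    return [item for item in items if item.get(attr) is None]


def attribute_names(items: list) -> list:
    """Sorted union of attribute names, to spot a differently named timestamp."""
    names = set()
    for item in items:
        names.update(item.keys())
    return sorted(names)


def scan_health_state(table_name: str = "tradai-health-state-dev") -> list:
    """Scan the whole table, following pagination."""
    import boto3  # imported lazily so the helpers above work without boto3

    session = boto3.Session(profile_name="tradai", region_name="eu-central-1")
    table = session.resource("dynamodb").Table(table_name)

    items, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

If `attribute_names` shows a timestamp under another name, the attribute-name mismatch is the more likely cause; if no timestamp attribute exists at all, the write is probably being skipped, for instance because of the timeouts.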

P3: Cleanup

7. Container Insights log group empty

/aws/ecs/containerinsights/tradai-dev/performance — 0 bytes (1 day retention)
Consolidated EC2 mode may not emit Container Insights metrics

8. ECS log group has minimal data

/aws/ecs/tradai-dev — 444 KB, /ecs/tradai/dev — 558 KB
Expected: most logs go to /tradai/consolidated/containers (~97 MB)

RESOURCE INVENTORY CHECKLIST

Persistent Stack

S3: 5/5 buckets
DynamoDB: 12/12 tables
ECR: 26 repositories (24 expected + 2 strategy)
Cognito: user pool tradai-users-dev

Foundation Stack

VPC: 10.0.0.0/16
Subnets: 6/6 (2 public, 2 private, 2 database)
NAT Instance: running (t4g.nano)
RDS: available (PostgreSQL 15)
SQS: 2 queues (backtest + DLQ)
SNS: 2 topics (alerts + registration)
Security Groups: (checked via VPC)
VPC Endpoints: (not checked individually)

Compute Stack

ECS Cluster: tradai-dev
EC2 Consolidated: running, 4 containers healthy
ALB: active, targets healthy
Lambda: 18/18 functions exist
EventBridge: 7 rules, all ENABLED
Step Functions: 2 workflows
IAM Roles: (implicit — services functioning)
Service Discovery: namespace exists, BUT stale entries

Edge Stack

API Gateway: 28 routes, health OK
WAF: exists (NOT associated)
CloudWatch Alarms: 28 alarms
CloudWatch Dashboard: (not checked)