TradAI Final Architecture - VPC & Networking
Version: 10.0.0 | Date: 2026-03-28 | Source: infra/foundation/modules/, infra/compute/modules/alb.py, infra/shared/tradai_infra_shared/config.py
TL;DR: Single VPC (10.0.0.0/16) in eu-central-1 with 6 subnets across 2 AZs (public, private, database). A cost-effective NAT instance (t4g.nano, ~$3/month) replaces NAT Gateway. 11 VPC endpoints (2 gateway + 9 interface) keep AWS API traffic off the public internet. ALB handles path-based routing to 4 ECS services.
1. VPC Layout
Region: eu-central-1 (default; overridable via Pulumi config or AWS_REGION env var)
| Property | Value |
|---|---|
| VPC CIDR | 10.0.0.0/16 |
| Availability Zones | eu-central-1a, eu-central-1b |
| DNS Hostnames | Enabled |
| DNS Support | Enabled |
| Subnet tiers | 3 (public, private, database) |
| Total subnets | 6 (2 per tier) |
Mermaid Diagram
graph TB
IGW[Internet Gateway]
subgraph VPC["VPC 10.0.0.0/16"]
subgraph AZ1["eu-central-1a"]
PUB1["Public 10.0.1.0/24<br/>ALB · NAT · IGW"]
PRIV1["Private 10.0.11.0/24<br/>ECS · Lambda"]
DB1["Database 10.0.21.0/24<br/>RDS PostgreSQL"]
end
subgraph AZ2["eu-central-1b"]
PUB2["Public 10.0.2.0/24<br/>ALB"]
PRIV2["Private 10.0.12.0/24<br/>ECS · Lambda"]
DB2["Database 10.0.22.0/24<br/>RDS Multi-AZ"]
end
end
IGW --> PUB1
IGW --> PUB2
PUB1 -->|NAT| PRIV1
PUB1 -->|NAT| PRIV2
PRIV1 --> DB1
PRIV2 --> DB2
ASCII Diagram (Reference)
eu-central-1a                 eu-central-1b
+------------------------+    +------------------------+
| public 10.0.1.0/24     |    | public 10.0.2.0/24     |
| ALB, NAT, IGW          |    | ALB                    |
+------------------------+    +------------------------+
| private 10.0.11.0/24   |    | private 10.0.12.0/24   |
| ECS, Lambda            |    | ECS, Lambda            |
+------------------------+    +------------------------+
| database 10.0.21.0/24  |    | database 10.0.22.0/24  |
| RDS PostgreSQL         |    | RDS (Multi-AZ)         |
+------------------------+    +------------------------+
Source: infra/shared/tradai_infra_shared/config.py lines 47-55, infra/foundation/modules/vpc.py
2. Subnets
| Tier | AZ | CIDR | Public IP | Purpose |
|---|---|---|---|---|
| Public | eu-central-1a | 10.0.1.0/24 | Yes | ALB, NAT instance |
| Public | eu-central-1b | 10.0.2.0/24 | Yes | ALB |
| Private | eu-central-1a | 10.0.11.0/24 | No | ECS tasks, Lambda, VPC endpoints |
| Private | eu-central-1b | 10.0.12.0/24 | No | ECS tasks, Lambda, VPC endpoints |
| Database | eu-central-1a | 10.0.21.0/24 | No | RDS PostgreSQL |
| Database | eu-central-1b | 10.0.22.0/24 | No | RDS PostgreSQL (Multi-AZ standby) |
Source: infra/shared/tradai_infra_shared/config.py SUBNETS dict
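The subnet plan above can be sanity-checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the shape of the `SUBNETS` mapping here is an assumption made for illustration and may not match the actual dict in `config.py`.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# (tier, az) -> CIDR, copied from the subnet table.
# The dict layout is hypothetical, not the real config.py structure.
SUBNETS = {
    ("public", "eu-central-1a"): "10.0.1.0/24",
    ("public", "eu-central-1b"): "10.0.2.0/24",
    ("private", "eu-central-1a"): "10.0.11.0/24",
    ("private", "eu-central-1b"): "10.0.12.0/24",
    ("database", "eu-central-1a"): "10.0.21.0/24",
    ("database", "eu-central-1b"): "10.0.22.0/24",
}


def validate_layout(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = {key: ipaddress.ip_network(cidr) for key, cidr in subnets.items()}
    # Every subnet must sit inside the VPC CIDR.
    for key, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{key} {net} is outside {vpc}")
    # No two subnets may overlap.
    for (k1, n1), (k2, n2) in combinations(nets.items(), 2):
        if n1.overlaps(n2):
            problems.append(f"{k1} {n1} overlaps {k2} {n2}")
    return problems
```

Calling `validate_layout(VPC_CIDR, SUBNETS)` returns an empty list for the table as documented; a check like this could run in CI whenever the subnet definitions change.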
3. Route Tables
There is one route table per tier (not per-AZ). Both AZ subnets within a tier share the same route table.
| Route Table | Default Route | Notes |
|---|---|---|
| tradai-public-rt | 0.0.0.0/0 -> Internet Gateway | Attached to both public subnets |
| tradai-private-rt | 0.0.0.0/0 -> NAT instance (added by Lambda) | Single RT shared by both private subnets |
| tradai-database-rt | Local only (implicit 10.0.0.0/16) | No internet access; S3/DynamoDB via gateway endpoints |
The private route table's NAT route is not configured inline. It is added dynamically by the NAT route-updater Lambda when the NAT ASG launches an instance.
Source: infra/foundation/modules/vpc.py _create_route_tables()
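To make the per-tier routing concrete, here is a small sketch that models the three route tables and resolves a destination by longest-prefix match, which is how VPC route tables pick a route. The target labels are placeholders rather than real resource IDs, the private default route is shown as it looks after the Lambda has added it, and the prefix-list routes that gateway endpoints add are left out.

```python
import ipaddress

# Routes per tier as (destination CIDR, target label).
# Labels are illustrative only; gateway-endpoint routes are omitted.
ROUTE_TABLES = {
    "public": [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")],
    "private": [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-instance")],
    "database": [("10.0.0.0/16", "local")],
}


def resolve(tier, destination):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in ROUTE_TABLES[tier]:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the packet is dropped
    # Longest prefix wins, so the /16 local route beats the /0 default.
    return max(matches)[1]
```

For example, `resolve("private", "10.0.21.5")` yields `"local"` while `resolve("database", "8.8.8.8")` yields `None`, reflecting that the database tier has no path to the internet.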
4. Security Groups
All security groups are created in infra/foundation/modules/security_groups.py using SecurityRuleBuilder from infra/shared/tradai_infra_shared/core/security_rule_builder.py.
Port Definitions (CommonPorts)
| Name | Port(s) | Protocol |
|---|---|---|
| HTTP | 80 | TCP |
| HTTPS | 443 | TCP |
| POSTGRESQL | 5432 | TCP |
| BACKEND_API | 8000 | TCP |
| DATA_COLLECTION | 8002 | TCP |
| STRATEGY_SERVICE | 8003 | TCP |
| MLFLOW | 5000 | TCP |
| ALL_SERVICES | 8000-8003 | TCP |
| ALL_TCP | 0-65535 | TCP |
4.1 ALB Security Group (tradai-alb-sg)
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 80 | 0.0.0.0/0 | HTTP from internet |
| Ingress | 443 | 0.0.0.0/0 | HTTPS from internet |
| Egress | 8000-8003 | ECS SG | Service ports to ECS |
| Egress | 5000 | ECS SG | MLflow to ECS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |
*Consolidated SG rules are created only when is_consolidated_mode() is true (dev/staging).
4.2 ECS Security Group (tradai-ecs-sg)
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 8000 | ALB SG | Backend API from ALB |
| Ingress | 8002 | ALB SG | Data Collection from ALB |
| Ingress | 8003 | ALB SG | Strategy Service from ALB |
| Ingress | 5000 | ALB SG | MLflow from ALB |
| Ingress | 8000-8003 | Lambda SG | Service ports from Lambda |
| Ingress | 5000 | Lambda SG | MLflow from Lambda |
| Egress | 443 | 0.0.0.0/0 | HTTPS to internet |
| Egress | 5432 | RDS SG | PostgreSQL to RDS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |
4.3 Lambda Security Group (tradai-lambda-sg)
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Egress | 443 | 0.0.0.0/0 | HTTPS to internet |
| Egress | 8000-8003 | ECS SG | Service ports to ECS |
| Egress | 5000 | ECS SG | MLflow to ECS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |
No ingress rules. Lambda is egress-only.
4.4 RDS Security Group (tradai-rds-sg)
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 5432 | ECS SG | PostgreSQL from ECS |
| Ingress | 5432 | Lambda SG | PostgreSQL from Lambda |
| Ingress | 5432 | Consolidated SG* | PostgreSQL from EC2 |
No egress rules defined (uses default).
4.5 NAT Security Group (tradai-nat-sg)
NAT Instance accepts ALL TCP from private subnets
The NAT security group allows all TCP ports (0-65535) from both private subnets. This is intentional -- the NAT instance must forward arbitrary traffic from ECS tasks and Lambda functions to AWS APIs and the internet. The security boundary is enforced by the source security groups on the originating services, not on the NAT instance itself.
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 0-65535 (ALL TCP) | 10.0.11.0/24 | All TCP from private subnet AZ-1 |
| Ingress | 0-65535 (ALL TCP) | 10.0.12.0/24 | All TCP from private subnet AZ-2 |
| Egress | 0-65535 (ALL TCP) | 0.0.0.0/0 | All TCP to internet |
4.6 VPC Endpoint Security Group (tradai-endpoint-sg)
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 443 | 10.0.11.0/24 | HTTPS from private subnet AZ-1 |
| Ingress | 443 | 10.0.12.0/24 | HTTPS from private subnet AZ-2 |
| Ingress | 443 | ECS SG | HTTPS from ECS tasks |
| Ingress | 443 | Lambda SG | HTTPS from Lambda functions |
| Ingress | 443 | Consolidated SG* | HTTPS from EC2 (if consolidated mode) |
4.7 Consolidated Security Group (tradai-consolidated-sg) -- dev/staging only
Only created when is_consolidated_mode() returns True.
| Direction | Port(s) | Source/Destination | Description |
|---|---|---|---|
| Ingress | 8000 | ALB SG | Backend API from ALB |
| Ingress | 8002 | ALB SG | Data Collection from ALB |
| Ingress | 5000 | ALB SG | MLflow from ALB |
| Ingress | 8000-8003 | Lambda SG | Service ports from Lambda |
| Ingress | 5000 | Lambda SG | MLflow from Lambda |
| Ingress | 8000-8003 | ECS SG | Service ports from ECS Fargate (backtest tasks) |
| Ingress | 5000 | ECS SG | MLflow from ECS Fargate |
| Egress | 443 | 0.0.0.0/0 | HTTPS to AWS APIs |
| Egress | 5432 | RDS SG | PostgreSQL to RDS |
Source: infra/foundation/modules/security_groups.py
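Traffic between two security groups flows only when the source group's egress rules and the destination group's ingress rules both permit the port. The sketch below compares the ALB egress rules (4.1) with the ECS ingress rules (4.2) to show the effective port set. The helper names are made up for this example and are not part of `SecurityRuleBuilder`.

```python
def ports(*specs):
    """Expand port specs such as 443 or (8000, 8003) into a set of ints."""
    expanded = set()
    for spec in specs:
        if isinstance(spec, tuple):
            low, high = spec
            expanded.update(range(low, high + 1))
        else:
            expanded.add(spec)
    return expanded


# Egress from the ALB SG towards the ECS SG (section 4.1).
ALB_EGRESS_TO_ECS = ports((8000, 8003), 5000)
# Ingress on the ECS SG from the ALB SG (section 4.2).
ECS_INGRESS_FROM_ALB = ports(8000, 8002, 8003, 5000)


def effective_ports(egress, ingress):
    """Ports on which traffic can actually flow: allowed on both sides."""
    return egress & ingress


def unmatched_egress(egress, ingress):
    """Ports the source may send to but the destination does not accept."""
    return egress - ingress
```

With the rules as documented, the effective set is 5000, 8000, 8002 and 8003; port 8001 is permitted by the ALB egress range but has no matching ECS ingress rule, so it is effectively closed.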
5. NAT Instance
A cost-effective NAT instance replaces NAT Gateway, saving ~$32/month.
| Property | Value |
|---|---|
| Instance type | t4g.nano (ARM64, ~$3/month) |
| AMI | Latest Amazon Linux 2023 ARM64 |
| Placement | First public subnet (eu-central-1a) |
| EIP | Dedicated Elastic IP for stable address |
| IMDSv2 | Required (http_tokens="required") |
| Monitoring | Detailed monitoring enabled |
High Availability
The NAT instance runs in an Auto Scaling Group (min=1, max=1, desired=1) for automatic replacement on failure.
| Component | Purpose |
|---|---|
| ASG | Single-instance HA; replaces failed instances automatically |
| Lifecycle Hook | EC2_INSTANCE_LAUNCHING with 300s heartbeat; delays traffic until ready |
| EventBridge Rule | Triggers on ASG launch events for the NAT ASG |
| Lambda (update-nat-routes) | Updates private route table to point 0.0.0.0/0 at new instance |
| IAM Role | ec2:AssociateAddress, ec2:ModifyInstanceAttribute, ec2:DescribeInstances |
The Lambda and EventBridge rule are created before the ASG to prevent a race condition where the first instance launches before the event handling chain is ready.
User data is rendered from a Jinja2 template (infra/foundation/templates/nat-userdata.sh.j2) and configures IP forwarding, iptables NAT masquerading, EIP association, and source/destination check disabling.
Source: infra/foundation/modules/nat_instance.py
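The route-updater step described above could look roughly like the following. This is a sketch, not the actual `update-nat-routes` code: it assumes the event carries the instance ID under `detail.EC2InstanceId` (as Auto Scaling launch events do), and the environment variable name `PRIVATE_ROUTE_TABLE_ID` is hypothetical. The EC2 client is injected so the logic can be exercised without AWS; `FakeEc2` is an in-memory stand-in for that purpose.

```python
import os

DEFAULT_ROUTE = "0.0.0.0/0"


def update_nat_route(ec2, route_table_id, instance_id):
    """Point the default route at instance_id; return "created" or "replaced".

    `ec2` must offer describe_route_tables, create_route and replace_route
    with the same keyword arguments as a boto3 EC2 client.
    """
    tables = ec2.describe_route_tables(RouteTableIds=[route_table_id])["RouteTables"]
    routes = tables[0]["Routes"] if tables else []
    has_default = any(r.get("DestinationCidrBlock") == DEFAULT_ROUTE for r in routes)
    # replace_route requires an existing route, so create it on first launch.
    call = ec2.replace_route if has_default else ec2.create_route
    call(RouteTableId=route_table_id,
         DestinationCidrBlock=DEFAULT_ROUTE,
         InstanceId=instance_id)
    return "replaced" if has_default else "created"


def handler(event, context, ec2=None, route_table_id=None):
    """Lambda-style entry point for ASG launch events."""
    if ec2 is None:
        import boto3  # lazy import: only needed when running against AWS
        ec2 = boto3.client("ec2")
    if route_table_id is None:
        route_table_id = os.environ["PRIVATE_ROUTE_TABLE_ID"]  # hypothetical name
    instance_id = event["detail"]["EC2InstanceId"]
    return update_nat_route(ec2, route_table_id, instance_id)


class FakeEc2:
    """In-memory stand-in for the three client calls used above."""

    def __init__(self):
        self.routes = {}  # destination CIDR -> instance ID

    def describe_route_tables(self, RouteTableIds):
        routes = [{"DestinationCidrBlock": cidr, "InstanceId": iid}
                  for cidr, iid in self.routes.items()]
        return {"RouteTables": [{"RouteTableId": RouteTableIds[0], "Routes": routes}]}

    def create_route(self, RouteTableId, DestinationCidrBlock, InstanceId):
        self.routes[DestinationCidrBlock] = InstanceId

    def replace_route(self, RouteTableId, DestinationCidrBlock, InstanceId):
        self.routes[DestinationCidrBlock] = InstanceId
```

Checking for the route before choosing between create and replace keeps the handler idempotent across the first launch and later replacements. Completing the lifecycle hook is not covered by this sketch.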
6. VPC Endpoints
6.1 Gateway Endpoints (free)
| Endpoint | Service | Route Tables | Policy |
|---|---|---|---|
| S3 | com.amazonaws.eu-central-1.s3 | Private, Database | TradAI buckets (tradai-*-{env}), ECR layer buckets (prod-*-starport-layer-bucket), system package buckets (amazon-ssm-*, al2023-repos-*, amazonlinux-*) |
| DynamoDB | com.amazonaws.eu-central-1.dynamodb | Private, Database | TradAI tables only (tradai-*), includes index access |
VPC Endpoint Costs
Interface endpoints cost ~$7.30/month per endpoint per AZ. With 9 interface endpoints, that is ~$65.70/month per AZ, or ~$131.40/month if every endpoint is provisioned in both AZs. Gateway endpoints (S3, DynamoDB) are free. Evaluate whether the NAT instance can handle the traffic instead -- see 07-COST-ANALYSIS.md Optimization 2.
6.2 Interface Endpoints (~$65.70/month per AZ)
All interface endpoints are placed in private subnets, use the endpoint security group, and have private DNS enabled.
| Endpoint | Service | Purpose | Cost/month/AZ |
|---|---|---|---|
| ECR API | ecr.api | Container image pulls (API) | ~$7.30 |
| ECR DKR | ecr.dkr | Container image pulls (Docker) | ~$7.30 |
| STS | sts | Credential refresh (ECR auth) | ~$7.30 |
| Secrets Manager | secretsmanager | RDS credentials for MLflow | ~$7.30 |
| CloudWatch Logs | logs | Log delivery without NAT | ~$7.30 |
| SSM | ssm | Session Manager (debugging) | ~$7.30 |
| SSM Messages | ssmmessages | Session Manager (debugging) | ~$7.30 |
| EC2 Messages | ec2messages | Session Manager (debugging) | ~$7.30 |
| SQS | sqs | Backtest queue without NAT | ~$7.30 |
Interface endpoints are created when both private_subnet_ids and endpoint_security_group_id are provided to the VpcEndpoints constructor.
Source: infra/foundation/modules/vpc_endpoints.py
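The fixed monthly charge for interface endpoints is the product of endpoint count, AZ count, and the per-AZ rate. A minimal sketch, assuming the approximate ~$7.30 rate quoted in the table and ignoring per-GB data processing charges; which total applies depends on how many AZs each endpoint is actually provisioned in.

```python
# Approximate USD per endpoint per AZ per month, as quoted in the table.
RATE_PER_ENDPOINT_PER_AZ = 7.30

# Service suffixes of the nine interface endpoints listed in 6.2.
INTERFACE_ENDPOINTS = [
    "ecr.api", "ecr.dkr", "sts", "secretsmanager", "logs",
    "ssm", "ssmmessages", "ec2messages", "sqs",
]


def monthly_cost(endpoint_count, az_count, rate=RATE_PER_ENDPOINT_PER_AZ):
    """Fixed monthly charge in USD, excluding data processing fees."""
    return round(endpoint_count * az_count * rate, 2)
```

`monthly_cost(len(INTERFACE_ENDPOINTS), 1)` gives 65.7 and `monthly_cost(len(INTERFACE_ENDPOINTS), 2)` gives 131.4, which is the number to compare against the NAT instance when weighing the optimization mentioned above.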
7. Network ACLs
NACLs provide stateless subnet-level filtering as defense-in-depth on top of security groups.
7.1 Public NACL
Inbound:
| Rule # | Protocol | Port(s) | Source | Description |
|---|---|---|---|---|
| 90 | TCP | 22 | 0.0.0.0/0 | SSH (Packer AMI builds, admin) |
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS |
| 110 | TCP | 80 | 0.0.0.0/0 | HTTP |
| 120 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (return traffic) |
Outbound:
| Rule # | Protocol | Port(s) | Destination | Description |
|---|---|---|---|---|
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS to internet |
| 110 | TCP | 80 | 0.0.0.0/0 | HTTP to internet |
| 120 | TCP | 8000-8003 | 10.0.0.0/16 | ECS service ports to VPC |
| 130 | TCP | 5000 | 10.0.0.0/16 | MLflow to VPC |
| 140 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (responses) |
7.2 Private NACL
Inbound:
| Rule # | Protocol | Port(s) | Source | Description |
|---|---|---|---|---|
| 100 | TCP | 8000-8003 | 10.0.1.0/24 | ECS ports from public AZ-1 (ALB) |
| 110 | TCP | 8000-8003 | 10.0.2.0/24 | ECS ports from public AZ-2 (ALB) |
| 115 | TCP | 8000-8003 | 10.0.11.0/24 | ECS ports from private AZ-1 (Lambda/intra-VPC) |
| 116 | TCP | 8000-8003 | 10.0.12.0/24 | ECS ports from private AZ-2 (Lambda/intra-VPC) |
| 120 | TCP | 5000 | 10.0.0.0/16 | MLflow from VPC |
| 125 | TCP | 443 | 10.0.0.0/16 | HTTPS from VPC (cross-subnet VPC endpoint traffic) |
| 130 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (return traffic from NAT/internet) |
| 140 | ICMP | All | 0.0.0.0/0 | ICMP (NAT instance, network diagnostics) |
Outbound:
| Rule # | Protocol | Port(s) | Destination | Description |
|---|---|---|---|---|
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS to internet (via NAT) |
| 110 | TCP | 5432 | 10.0.21.0/24 | PostgreSQL to database AZ-1 |
| 120 | TCP | 5432 | 10.0.22.0/24 | PostgreSQL to database AZ-2 |
| 130 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (responses) |
7.3 Database NACL
Inbound:
| Rule # | Protocol | Port(s) | Source | Description |
|---|---|---|---|---|
| 100 | TCP | 5432 | 10.0.11.0/24 | PostgreSQL from private AZ-1 |
| 110 | TCP | 5432 | 10.0.12.0/24 | PostgreSQL from private AZ-2 |
Outbound:
| Rule # | Protocol | Port(s) | Destination | Description |
|---|---|---|---|---|
| 100 | TCP | 1024-65535 | 10.0.11.0/24 | Ephemeral to private AZ-1 |
| 110 | TCP | 1024-65535 | 10.0.12.0/24 | Ephemeral to private AZ-2 |
This is the most restrictive NACL: it admits only PostgreSQL traffic from the private subnets and permits only ephemeral-port return traffic back to them.
Source: infra/foundation/modules/nacl.py
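NACL rules are evaluated in ascending rule-number order, the first match decides, and anything unmatched hits the implicit deny. The sketch below models that evaluation for the database inbound rules; the `Rule` type and function names are invented for illustration and do not mirror `nacl.py`.

```python
import ipaddress
from typing import NamedTuple


class Rule(NamedTuple):
    number: int
    protocol: str
    port_low: int
    port_high: int
    cidr: str
    action: str = "allow"


# Inbound rules of the database NACL (section 7.3).
DATABASE_INBOUND = [
    Rule(100, "tcp", 5432, 5432, "10.0.11.0/24"),
    Rule(110, "tcp", 5432, 5432, "10.0.12.0/24"),
]


def evaluate(rules, protocol, port, source_ip):
    """Return "allow" or "deny" for a single packet, NACL-style."""
    address = ipaddress.ip_address(source_ip)
    # Lowest rule number first; the first matching rule decides.
    for rule in sorted(rules, key=lambda r: r.number):
        if (rule.protocol == protocol
                and rule.port_low <= port <= rule.port_high
                and address in ipaddress.ip_network(rule.cidr)):
            return rule.action
    return "deny"  # implicit final deny
```

Because NACLs are stateless, the return path needs its own check against the outbound rules; a full model would call `evaluate` once per direction.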
8. Application Load Balancer
| Property | Value |
|---|---|
| Type | Internet-facing, Application |
| Subnets | Both public subnets |
| Security group | ALB SG |
| Deletion protection | Enabled in prod only |
Listeners
| Listener | Port | Behavior |
|---|---|---|
| HTTPS (with cert) | 443 | Path-based routing, TLS policy ELBSecurityPolicy-TLS13-1-2-2021-06 |
| HTTP (with cert) | 80 | 301 redirect to HTTPS |
| HTTP (no cert, dev) | 80 | Path-based routing directly (no TLS) |
Default action for routing listeners: 404 fixed response. Path rules route to target groups.
Path Routing
| Service | Path Pattern(s) | Priority | Target Type |
|---|---|---|---|
| MLflow | /mlflow, /mlflow/* | 100+ (specific) | ip (Fargate) or instance (consolidated) |
| Live Trading | /api/v1/live, /api/v1/live/* | 100+ (specific) | ip |
| Dry-Run Trading | /api/v1/dry-run, /api/v1/dry-run/* | 100+ (specific) | ip |
| Backend API | /api/v1/* | 900+ (catch-all) | ip or instance |
| Data Collection | (none -- internal only) | -- | -- |
| Strategy Service | (none -- internal only) | -- | -- |
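The priority scheme in the table can be illustrated with a small matcher: rules are tried in ascending priority and the first matching pattern wins, so the specific paths must rank ahead of the `/api/v1/*` catch-all. The exact priority numbers and target names below are assumptions (the table only gives "100+" and "900+"), and `fnmatchcase` is used as an approximation of ALB wildcard matching.

```python
from fnmatch import fnmatchcase

# (priority, path patterns, target). Priorities and target names are
# illustrative; only the 100+/900+ bands come from the table.
RULES = [
    (900, ("/api/v1/*",), "backend-api"),
    (100, ("/mlflow", "/mlflow/*"), "mlflow"),
    (110, ("/api/v1/live", "/api/v1/live/*"), "live-trading"),
    (120, ("/api/v1/dry-run", "/api/v1/dry-run/*"), "dry-run-trading"),
]


def route(path):
    """Return the target for path, or None for the 404 default action."""
    for _priority, patterns, target in sorted(RULES, key=lambda r: r[0]):
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return target
    return None
```

For instance, `/api/v1/live/positions` resolves to the live trading target even though it also matches the catch-all, because priority 110 is evaluated before 900.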
Health Checks
All target groups use: healthy threshold 2, unhealthy threshold 3, timeout 5s, interval 30s, matcher 200.
| Service | Health Path | Port |
|---|---|---|
| Backend API | /api/v1/health | 8000 |
| Data Collection | /api/v1/health | 8002 |
| Strategy Service | /api/v1/health | 8003 |
| MLflow | /mlflow/ | 5000 |
Source: infra/compute/modules/alb.py, infra/shared/tradai_infra_shared/config.py ALB_PATH_PATTERNS
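The health check thresholds act as a simple streak counter: a target flips state only after enough consecutive results contradict its current state. The following sketch models that behaviour with the documented thresholds (2 healthy, 3 unhealthy); it is a simplification that ignores timeouts and the initial registration state.

```python
def track_health(results, healthy_threshold=2, unhealthy_threshold=3,
                 initially_healthy=True):
    """Return the target state after each check (True means healthy).

    `results` is a sequence of booleans, True for a check that returned 200.
    """
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            needed = unhealthy_threshold if healthy else healthy_threshold
            if streak >= needed:
                healthy = not healthy
                streak = 0
        states.append(healthy)
    return states
```

With a 30s interval, three consecutive failures put detection of a failed target on the order of 90 seconds, and recovery after two consecutive passes on the order of 60 seconds.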
9. DNS & Service Discovery
Internal service-to-service communication uses AWS Cloud Map (not Route 53 public DNS).
| Property | Value |
|---|---|
| Namespace | tradai-{env}.local |
| Type | Private DNS namespace |
| Example | mlflow.tradai-dev.local:5000 |
Services register via Cloud Map and are addressable by name within the VPC. The MLflow tracking URI is constructed as http://mlflow.{namespace}:5000/mlflow.
Source: infra/shared/tradai_infra_shared/config.py get_sd_namespace(), get_mlflow_tracking_uri()
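The naming scheme above composes mechanically from the environment name. The helpers below are a sketch of that composition; they are stand-ins written for this example, not the real `get_sd_namespace()` or `get_mlflow_tracking_uri()`, whose signatures are not shown here.

```python
MLFLOW_PORT = 5000


def sd_namespace(env):
    """Private DNS namespace for an environment, e.g. tradai-dev.local."""
    return f"tradai-{env}.local"


def service_host(service, env):
    """Cloud Map hostname of a service inside the VPC."""
    return f"{service}.{sd_namespace(env)}"


def mlflow_tracking_uri(env):
    """MLflow tracking URI following the documented pattern."""
    return f"http://{service_host('mlflow', env)}:{MLFLOW_PORT}/mlflow"
```

`mlflow_tracking_uri("dev")` produces `http://mlflow.tradai-dev.local:5000/mlflow`, matching the example in the table.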
10. VPC Flow Logs
| Property | Value |
|---|---|
| Traffic type | ALL (accept + reject) |
| Destination | CloudWatch Logs (/aws/vpc/tradai-flow-logs) |
| Retention | 7 days |
| IAM role | Dedicated role for vpc-flow-logs.amazonaws.com |
Flow logs capture source/destination IPs, ports, protocol, packet/byte counts, and accept/reject actions for all VPC traffic.
Source: infra/foundation/modules/vpc_flow_logs.py
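For ad hoc analysis, flow log records can be split into named fields. This sketch assumes the default (version 2) record format; if `vpc_flow_logs.py` sets a custom log format, the field list would need adjusting. The sample record in the usage note is made up for illustration.

```python
# Field order of the default (version 2) flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = ("version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end")


def parse_record(line):
    """Parse one space-separated flow log record into a dict."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for name in INT_FIELDS:
        # NODATA and SKIPDATA records use "-" for fields with no value.
        if record[name] != "-":
            record[name] = int(record[name])
    return record
```

Parsing a fabricated record such as `2 123456789012 eni-0a1b2c3d 10.0.11.25 10.0.21.10 49152 5432 6 10 840 1700000000 1700000060 ACCEPT OK` yields `dstport` 5432 and `action` `ACCEPT`, which is the shape expected for private-to-database PostgreSQL traffic.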
Changelog
| Version | Date | Changes |
|---|---|---|
| 10.0.0 | 2026-03-28 | Full regeneration from infra code. Corrected SG rules, VPC endpoints, NAT config |
Dependencies
| If This Changes | Update This Doc |
|---|---|
| infra/foundation/modules/security_groups.py | Security group rules |
| infra/foundation/modules/vpc_endpoints.py | VPC endpoint list |
| infra/foundation/modules/nat_instance.py | NAT configuration |
| infra/shared/tradai_infra_shared/config.py VPC_CIDR/SUBNETS | Network layout |