
TradAI Final Architecture - VPC & Networking

Version: 10.0.0 | Date: 2026-03-28 | Source: infra/foundation/modules/, infra/compute/modules/alb.py, infra/shared/tradai_infra_shared/config.py

TL;DR: Single VPC (10.0.0.0/16) in eu-central-1 with 6 subnets across 2 AZs (public, private, database). A cost-effective NAT instance (t4g.nano, ~$3/month) replaces NAT Gateway. 11 VPC endpoints (2 gateway + 9 interface) keep AWS API traffic off the public internet. ALB handles path-based routing to 4 ECS services.


1. VPC Layout

Region: eu-central-1 (default; overridable via Pulumi config or AWS_REGION env var)

| Property | Value |
| --- | --- |
| VPC CIDR | 10.0.0.0/16 |
| Availability Zones | eu-central-1a, eu-central-1b |
| DNS Hostnames | Enabled |
| DNS Support | Enabled |
| Subnet tiers | 3 (public, private, database) |
| Total subnets | 6 (2 per tier) |

Mermaid Diagram

graph TB
    IGW[Internet Gateway]

    subgraph VPC["VPC 10.0.0.0/16"]
        subgraph AZ1["eu-central-1a"]
            PUB1["Public 10.0.1.0/24<br/>ALB · NAT · IGW"]
            PRIV1["Private 10.0.11.0/24<br/>ECS · Lambda"]
            DB1["Database 10.0.21.0/24<br/>RDS PostgreSQL"]
        end

        subgraph AZ2["eu-central-1b"]
            PUB2["Public 10.0.2.0/24<br/>ALB"]
            PRIV2["Private 10.0.12.0/24<br/>ECS · Lambda"]
            DB2["Database 10.0.22.0/24<br/>RDS Multi-AZ"]
        end
    end

    IGW --> PUB1
    IGW --> PUB2
    PUB1 -->|NAT| PRIV1
    PUB1 -->|NAT| PRIV2
    PRIV1 --> DB1
    PRIV2 --> DB2

ASCII Diagram (Reference)

eu-central-1a                          eu-central-1b
+------------------------+             +------------------------+
| public 10.0.1.0/24     |             | public 10.0.2.0/24     |
|  ALB, NAT, IGW         |             |  ALB                   |
+------------------------+             +------------------------+
| private 10.0.11.0/24   |             | private 10.0.12.0/24   |
|  ECS, Lambda           |             |  ECS, Lambda           |
+------------------------+             +------------------------+
| database 10.0.21.0/24  |             | database 10.0.22.0/24  |
|  RDS PostgreSQL        |             |  RDS (Multi-AZ)        |
+------------------------+             +------------------------+

Source: infra/shared/tradai_infra_shared/config.py lines 47-55, infra/foundation/modules/vpc.py


2. Subnets

| Tier | AZ | CIDR | Public IP | Purpose |
| --- | --- | --- | --- | --- |
| Public | eu-central-1a | 10.0.1.0/24 | Yes | ALB, NAT instance |
| Public | eu-central-1b | 10.0.2.0/24 | Yes | ALB |
| Private | eu-central-1a | 10.0.11.0/24 | No | ECS tasks, Lambda, VPC endpoints |
| Private | eu-central-1b | 10.0.12.0/24 | No | ECS tasks, Lambda, VPC endpoints |
| Database | eu-central-1a | 10.0.21.0/24 | No | RDS PostgreSQL |
| Database | eu-central-1b | 10.0.22.0/24 | No | RDS PostgreSQL (Multi-AZ standby) |

Source: infra/shared/tradai_infra_shared/config.py SUBNETS dict
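
The layout above can be sanity-checked with the standard library. The following is a minimal sketch, not the actual SUBNETS dict from config.py: the names VPC_CIDR, SUBNET_CIDRS and validate_layout are illustrative, and the CIDRs are copied from the table.

```python
import ipaddress

# Values transcribed from the subnet table; the structure is an assumption,
# not the real SUBNETS dict in config.py.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNET_CIDRS = {
    ("public", "eu-central-1a"): "10.0.1.0/24",
    ("public", "eu-central-1b"): "10.0.2.0/24",
    ("private", "eu-central-1a"): "10.0.11.0/24",
    ("private", "eu-central-1b"): "10.0.12.0/24",
    ("database", "eu-central-1a"): "10.0.21.0/24",
    ("database", "eu-central-1b"): "10.0.22.0/24",
}


def validate_layout(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = {key: ipaddress.ip_network(cidr) for key, cidr in subnets.items()}

    # Every subnet must sit inside the VPC range.
    for key, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{key} {net} is outside {vpc}")

    # No two subnets may share addresses.
    keys = list(nets)
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(validate_layout(VPC_CIDR, SUBNET_CIDRS))
```

A check like this could run in a unit test so that a CIDR typo is caught before a deploy.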


3. Route Tables

There is one route table per tier (not per-AZ). Both AZ subnets within a tier share the same route table.

| Route Table | Default Route | Notes |
| --- | --- | --- |
| tradai-public-rt | 0.0.0.0/0 -> Internet Gateway | Attached to both public subnets |
| tradai-private-rt | 0.0.0.0/0 -> NAT instance (added by Lambda) | Single RT shared by both private subnets |
| tradai-database-rt | Local only (implicit 10.0.0.0/16) | No internet access; S3/DynamoDB via gateway endpoints |

The private route table's NAT route is not configured inline. It is added dynamically by the NAT route-updater Lambda when the NAT ASG launches an instance.
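
The dynamic route update can be sketched as follows. This is a hedged illustration, not the code of the update-nat-routes Lambda: the function name and return values are hypothetical, and `ec2` is assumed to behave like a boto3 EC2 client (only `describe_route_tables`, `create_route` and `replace_route` are used).

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def point_default_route_at(ec2, route_table_id, instance_id):
    """Create or replace the 0.0.0.0/0 route so it targets the given instance.

    `ec2` is expected to behave like a boto3 EC2 client; it is injected so the
    function can be exercised with a stub.
    """
    tables = ec2.describe_route_tables(RouteTableIds=[route_table_id])["RouteTables"]
    routes = tables[0]["Routes"] if tables else []

    # A fresh private route table has only the local route, so the first
    # launch needs create_route; later replacements need replace_route.
    has_default = any(r.get("DestinationCidrBlock") == DEFAULT_ROUTE for r in routes)
    call = ec2.replace_route if has_default else ec2.create_route
    call(
        RouteTableId=route_table_id,
        DestinationCidrBlock=DEFAULT_ROUTE,
        InstanceId=instance_id,
    )
    return "replaced" if has_default else "created"
```

Checking for an existing default route first keeps the update idempotent whether the ASG is launching its first instance or replacing a failed one.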

Source: infra/foundation/modules/vpc.py _create_route_tables()


4. Security Groups

All security groups are created in infra/foundation/modules/security_groups.py using SecurityRuleBuilder from infra/shared/tradai_infra_shared/core/security_rule_builder.py.

Port Definitions (CommonPorts)

| Name | Port(s) | Protocol |
| --- | --- | --- |
| HTTP | 80 | TCP |
| HTTPS | 443 | TCP |
| POSTGRESQL | 5432 | TCP |
| BACKEND_API | 8000 | TCP |
| DATA_COLLECTION | 8002 | TCP |
| STRATEGY_SERVICE | 8003 | TCP |
| MLFLOW | 5000 | TCP |
| ALL_SERVICES | 8000-8003 | TCP |
| ALL_TCP | 0-65535 | TCP |

4.1 ALB Security Group (tradai-alb-sg)

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 80 | 0.0.0.0/0 | HTTP from internet |
| Ingress | 443 | 0.0.0.0/0 | HTTPS from internet |
| Egress | 8000-8003 | ECS SG | Service ports to ECS |
| Egress | 5000 | ECS SG | MLflow to ECS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |

*Consolidated SG rules are created only when is_consolidated_mode() is true (dev/staging). This applies wherever Consolidated SG* appears in the tables in this section.

4.2 ECS Security Group (tradai-ecs-sg)

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 8000 | ALB SG | Backend API from ALB |
| Ingress | 8002 | ALB SG | Data Collection from ALB |
| Ingress | 8003 | ALB SG | Strategy Service from ALB |
| Ingress | 5000 | ALB SG | MLflow from ALB |
| Ingress | 8000-8003 | Lambda SG | Service ports from Lambda |
| Ingress | 5000 | Lambda SG | MLflow from Lambda |
| Egress | 443 | 0.0.0.0/0 | HTTPS to internet |
| Egress | 5432 | RDS SG | PostgreSQL to RDS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |

4.3 Lambda Security Group (tradai-lambda-sg)

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Egress | 443 | 0.0.0.0/0 | HTTPS to internet |
| Egress | 8000-8003 | ECS SG | Service ports to ECS |
| Egress | 5000 | ECS SG | MLflow to ECS |
| Egress | 8000-8003 | Consolidated SG* | Service ports to EC2 |
| Egress | 5000 | Consolidated SG* | MLflow to EC2 |

No ingress rules. Lambda is egress-only.

4.4 RDS Security Group (tradai-rds-sg)

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 5432 | ECS SG | PostgreSQL from ECS |
| Ingress | 5432 | Lambda SG | PostgreSQL from Lambda |
| Ingress | 5432 | Consolidated SG* | PostgreSQL from EC2 |

No egress rules defined (uses default).

4.5 NAT Security Group (tradai-nat-sg)

NAT Instance accepts ALL TCP from private subnets

The NAT security group allows all TCP ports (0-65535) from both private subnets. This is intentional -- the NAT instance must forward arbitrary traffic from ECS tasks and Lambda functions to AWS APIs and the internet. The security boundary is enforced by the source security groups on the originating services, not on the NAT instance itself.

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 0-65535 (ALL TCP) | 10.0.11.0/24 | All TCP from private subnet AZ-1 |
| Ingress | 0-65535 (ALL TCP) | 10.0.12.0/24 | All TCP from private subnet AZ-2 |
| Egress | 0-65535 (ALL TCP) | 0.0.0.0/0 | All TCP to internet |

4.6 VPC Endpoint Security Group (tradai-endpoint-sg)

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 443 | 10.0.11.0/24 | HTTPS from private subnet AZ-1 |
| Ingress | 443 | 10.0.12.0/24 | HTTPS from private subnet AZ-2 |
| Ingress | 443 | ECS SG | HTTPS from ECS tasks |
| Ingress | 443 | Lambda SG | HTTPS from Lambda functions |
| Ingress | 443 | Consolidated SG* | HTTPS from EC2 (if consolidated mode) |

4.7 Consolidated Security Group (tradai-consolidated-sg) -- dev/staging only

Only created when is_consolidated_mode() returns True.

| Direction | Port(s) | Source/Destination | Description |
| --- | --- | --- | --- |
| Ingress | 8000 | ALB SG | Backend API from ALB |
| Ingress | 8002 | ALB SG | Data Collection from ALB |
| Ingress | 5000 | ALB SG | MLflow from ALB |
| Ingress | 8000-8003 | Lambda SG | Service ports from Lambda |
| Ingress | 5000 | Lambda SG | MLflow from Lambda |
| Ingress | 8000-8003 | ECS SG | Service ports from ECS Fargate (backtest tasks) |
| Ingress | 5000 | ECS SG | MLflow from ECS Fargate |
| Egress | 443 | 0.0.0.0/0 | HTTPS to AWS APIs |
| Egress | 5432 | RDS SG | PostgreSQL to RDS |

Source: infra/foundation/modules/security_groups.py
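
The ingress tables in 4.2 to 4.4 can be read as an allow-list keyed by destination group. The sketch below transcribes the non-consolidated rules into plain data and answers "may this source reach that destination on this port?". It is an illustration only: it does not use SecurityRuleBuilder, the names INGRESS and ingress_allowed are hypothetical, and it checks the ingress side only (a real connection also needs a matching egress rule on the source group).

```python
# Ingress rules transcribed from sections 4.2-4.4 (consolidated-mode rules omitted).
# Each entry is (source group, first port, last port).
INGRESS = {
    "ecs": [
        ("alb", 8000, 8000),
        ("alb", 8002, 8002),
        ("alb", 8003, 8003),
        ("alb", 5000, 5000),
        ("lambda", 8000, 8003),
        ("lambda", 5000, 5000),
    ],
    "rds": [
        ("ecs", 5432, 5432),
        ("lambda", 5432, 5432),
    ],
    "lambda": [],  # egress-only, no ingress rules
}


def ingress_allowed(source, destination, port):
    """True if some ingress rule on `destination` admits `source` on `port`."""
    rules = INGRESS.get(destination, [])
    return any(src == source and first <= port <= last for src, first, last in rules)


print(ingress_allowed("alb", "ecs", 8000))   # ALB to Backend API
print(ingress_allowed("alb", "rds", 5432))   # ALB has no path to the database
```

Read this way, the tables show that the ALB reaches ECS on 8000, 8002, 8003 and 5000 individually, while Lambda is admitted on the whole 8000-8003 range.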


5. NAT Instance

A cost-effective NAT instance replaces NAT Gateway, saving ~$32/month.

| Property | Value |
| --- | --- |
| Instance type | t4g.nano (ARM64, ~$3/month) |
| AMI | Latest Amazon Linux 2023 ARM64 |
| Placement | First public subnet (eu-central-1a) |
| EIP | Dedicated Elastic IP for stable address |
| IMDSv2 | Required (http_tokens="required") |
| Monitoring | Detailed monitoring enabled |

High Availability

The NAT instance runs in an Auto Scaling Group (min=1, max=1, desired=1) for automatic replacement on failure.

| Component | Purpose |
| --- | --- |
| ASG | Single-instance HA; replaces failed instances automatically |
| Lifecycle Hook | EC2_INSTANCE_LAUNCHING with 300s heartbeat; delays traffic until ready |
| EventBridge Rule | Triggers on ASG launch events for the NAT ASG |
| Lambda (update-nat-routes) | Updates private route table to point 0.0.0.0/0 at new instance |
| IAM Role | ec2:AssociateAddress, ec2:ModifyInstanceAttribute, ec2:DescribeInstances |

The Lambda and EventBridge rule are created before the ASG to prevent a race condition where the first instance launches before the event handling chain is ready.

User data is rendered from a Jinja2 template (infra/foundation/templates/nat-userdata.sh.j2) and configures IP forwarding, iptables NAT masquerading, EIP association, and source/destination check disabling.

Source: infra/foundation/modules/nat_instance.py


6. VPC Endpoints

6.1 Gateway Endpoints (free)

| Endpoint | Service | Route Tables | Policy |
| --- | --- | --- | --- |
| S3 | com.amazonaws.eu-central-1.s3 | Private, Database | TradAI buckets (tradai-*-{env}), ECR layer buckets (prod-*-starport-layer-bucket), system package buckets (amazon-ssm-*, al2023-repos-*, amazonlinux-*) |
| DynamoDB | com.amazonaws.eu-central-1.dynamodb | Private, Database | TradAI tables only (tradai-*), includes index access |

VPC Endpoint Costs

Interface endpoints cost ~$7.30/month per endpoint per AZ. The 9 interface endpoints therefore cost ~$65.70/month per AZ (9 x $7.30), or ~$131.40/month if every endpoint is attached in both AZs. Gateway endpoints (S3, DynamoDB) are free. Evaluate whether the NAT instance can handle the traffic instead -- see 07-COST-ANALYSIS.md Optimization 2.

6.2 Interface Endpoints (~$65.70/month per AZ)

All interface endpoints are placed in private subnets, use the endpoint security group, and have private DNS enabled.

| Endpoint | Service | Purpose | Cost/month/AZ |
| --- | --- | --- | --- |
| ECR API | ecr.api | Container image pulls (API) | ~$7.30 |
| ECR DKR | ecr.dkr | Container image pulls (Docker) | ~$7.30 |
| STS | sts | Credential refresh (ECR auth) | ~$7.30 |
| Secrets Manager | secretsmanager | RDS credentials for MLflow | ~$7.30 |
| CloudWatch Logs | logs | Log delivery without NAT | ~$7.30 |
| SSM | ssm | Session Manager (debugging) | ~$7.30 |
| SSM Messages | ssmmessages | Session Manager (debugging) | ~$7.30 |
| EC2 Messages | ec2messages | Session Manager (debugging) | ~$7.30 |
| SQS | sqs | Backtest queue without NAT | ~$7.30 |

Interface endpoints are created when both private_subnet_ids and endpoint_security_group_id are provided to the VpcEndpoints constructor.
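
Taking the quoted rate of ~$7.30 per endpoint per AZ at face value, the monthly bill scales with the number of AZs each endpoint is attached to. The sketch below makes that arithmetic explicit; the rate is the approximate figure from this section, not a value read from AWS pricing, and the function name is illustrative.

```python
# Approximate rate quoted in this section (USD per endpoint per AZ per month).
RATE_PER_ENDPOINT_PER_AZ = 7.30

# Service names from the interface endpoint table.
INTERFACE_ENDPOINTS = [
    "ecr.api", "ecr.dkr", "sts", "secretsmanager", "logs",
    "ssm", "ssmmessages", "ec2messages", "sqs",
]


def monthly_cost(endpoints, az_count, rate=RATE_PER_ENDPOINT_PER_AZ):
    """Fixed monthly charge, excluding per-GB data processing fees."""
    return round(len(endpoints) * az_count * rate, 2)


print(monthly_cost(INTERFACE_ENDPOINTS, az_count=1))
print(monthly_cost(INTERFACE_ENDPOINTS, az_count=2))
```

Dropping an endpoint, or attaching endpoints to a single AZ in non-production environments, changes the result linearly, which makes this a convenient input for the NAT-versus-endpoint comparison.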

Source: infra/foundation/modules/vpc_endpoints.py


7. Network ACLs

NACLs provide stateless subnet-level filtering as defense-in-depth on top of security groups.

7.1 Public NACL

Inbound:

| Rule # | Protocol | Port(s) | Source | Description |
| --- | --- | --- | --- | --- |
| 90 | TCP | 22 | 0.0.0.0/0 | SSH (Packer AMI builds, admin) |
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS |
| 110 | TCP | 80 | 0.0.0.0/0 | HTTP |
| 120 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (return traffic) |

Outbound:

| Rule # | Protocol | Port(s) | Destination | Description |
| --- | --- | --- | --- | --- |
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS to internet |
| 110 | TCP | 80 | 0.0.0.0/0 | HTTP to internet |
| 120 | TCP | 8000-8003 | 10.0.0.0/16 | ECS service ports to VPC |
| 130 | TCP | 5000 | 10.0.0.0/16 | MLflow to VPC |
| 140 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (responses) |

7.2 Private NACL

Inbound:

| Rule # | Protocol | Port(s) | Source | Description |
| --- | --- | --- | --- | --- |
| 100 | TCP | 8000-8003 | 10.0.1.0/24 | ECS ports from public AZ-1 (ALB) |
| 110 | TCP | 8000-8003 | 10.0.2.0/24 | ECS ports from public AZ-2 (ALB) |
| 115 | TCP | 8000-8003 | 10.0.11.0/24 | ECS ports from private AZ-1 (Lambda/intra-VPC) |
| 116 | TCP | 8000-8003 | 10.0.12.0/24 | ECS ports from private AZ-2 (Lambda/intra-VPC) |
| 120 | TCP | 5000 | 10.0.0.0/16 | MLflow from VPC |
| 125 | TCP | 443 | 10.0.0.0/16 | HTTPS from VPC (cross-subnet VPC endpoint traffic) |
| 130 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (return traffic from NAT/internet) |
| 140 | ICMP | All | 0.0.0.0/0 | ICMP (NAT instance, network diagnostics) |

Outbound:

| Rule # | Protocol | Port(s) | Destination | Description |
| --- | --- | --- | --- | --- |
| 100 | TCP | 443 | 0.0.0.0/0 | HTTPS to internet (via NAT) |
| 110 | TCP | 5432 | 10.0.21.0/24 | PostgreSQL to database AZ-1 |
| 120 | TCP | 5432 | 10.0.22.0/24 | PostgreSQL to database AZ-2 |
| 130 | TCP | 1024-65535 | 0.0.0.0/0 | Ephemeral (responses) |

7.3 Database NACL

Inbound:

| Rule # | Protocol | Port(s) | Source | Description |
| --- | --- | --- | --- | --- |
| 100 | TCP | 5432 | 10.0.11.0/24 | PostgreSQL from private AZ-1 |
| 110 | TCP | 5432 | 10.0.12.0/24 | PostgreSQL from private AZ-2 |

Outbound:

| Rule # | Protocol | Port(s) | Destination | Description |
| --- | --- | --- | --- | --- |
| 100 | TCP | 1024-65535 | 10.0.11.0/24 | Ephemeral to private AZ-1 |
| 110 | TCP | 1024-65535 | 10.0.12.0/24 | Ephemeral to private AZ-2 |

This is the most restrictive NACL -- only PostgreSQL from private subnets, responses back.
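
NACL rules are evaluated in ascending rule-number order: the first match decides, and anything unmatched hits the implicit deny. A minimal sketch of that evaluation against the Database inbound table follows; the NaclRule type and evaluate function are hypothetical helpers, not part of nacl.py.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    protocol: str
    first_port: int
    last_port: int
    cidr: str
    allow: bool = True


# Database NACL inbound rules, transcribed from section 7.3.
DATABASE_INBOUND = [
    NaclRule(100, "tcp", 5432, 5432, "10.0.11.0/24"),
    NaclRule(110, "tcp", 5432, 5432, "10.0.12.0/24"),
]


def evaluate(rules, protocol, port, address):
    """Lowest-numbered matching rule wins; no match means implicit deny."""
    ip = ipaddress.ip_address(address)
    for rule in sorted(rules, key=lambda r: r.number):
        if (
            rule.protocol == protocol
            and rule.first_port <= port <= rule.last_port
            and ip in ipaddress.ip_network(rule.cidr)
        ):
            return rule.allow
    return False


print(evaluate(DATABASE_INBOUND, "tcp", 5432, "10.0.11.25"))  # private subnet
print(evaluate(DATABASE_INBOUND, "tcp", 5432, "10.0.1.25"))   # public subnet
```

Because NACLs are stateless, the return path has to be evaluated separately against the outbound table, which is why each tier carries ephemeral-port rules.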

Source: infra/foundation/modules/nacl.py


8. Application Load Balancer

| Property | Value |
| --- | --- |
| Type | Internet-facing, Application |
| Subnets | Both public subnets |
| Security group | ALB SG |
| Deletion protection | Enabled in prod only |

Listeners

| Listener | Port | Behavior |
| --- | --- | --- |
| HTTPS (with cert) | 443 | Path-based routing, TLS policy ELBSecurityPolicy-TLS13-1-2-2021-06 |
| HTTP (with cert) | 80 | 301 redirect to HTTPS |
| HTTP (no cert, dev) | 80 | Path-based routing directly (no TLS) |

Default action for routing listeners: 404 fixed response. Path rules route to target groups.

Path Routing

| Service | Path Pattern(s) | Priority | Target Type |
| --- | --- | --- | --- |
| MLflow | /mlflow, /mlflow/* | 100+ (specific) | ip (Fargate) or instance (consolidated) |
| Live Trading | /api/v1/live, /api/v1/live/* | 100+ (specific) | ip |
| Dry-Run Trading | /api/v1/dry-run, /api/v1/dry-run/* | 100+ (specific) | ip |
| Backend API | /api/v1/* | 900+ (catch-all) | ip or instance |
| Data Collection | (none -- internal only) | -- | -- |
| Strategy Service | (none -- internal only) | -- | -- |
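
ALB listener rules are evaluated from the lowest priority number upward, so the specific patterns must sit below the /api/v1/* catch-all. The sketch below mimics that ordering with fnmatch. The concrete priorities (100, 110, 120, 900) are illustrative values inside the "100+" and "900+" bands, the target names are made up, and fnmatch only approximates ALB wildcard matching.

```python
from fnmatch import fnmatchcase

# (priority, path patterns, target). Priorities and target names are
# assumptions chosen to fit the bands in the table above.
RULES = [
    (900, ["/api/v1/*"], "backend-api"),
    (100, ["/mlflow", "/mlflow/*"], "mlflow"),
    (110, ["/api/v1/live", "/api/v1/live/*"], "live-trading"),
    (120, ["/api/v1/dry-run", "/api/v1/dry-run/*"], "dry-run-trading"),
]

DEFAULT_ACTION = "404 fixed response"


def route(path):
    """Return the target of the first matching rule in priority order."""
    for _priority, patterns, target in sorted(RULES, key=lambda rule: rule[0]):
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return target
    return DEFAULT_ACTION


for sample in ("/api/v1/live/orders", "/api/v1/strategies", "/mlflow", "/unknown"):
    print(sample, "->", route(sample))
```

If the catch-all had the lowest number, every /api/v1/live and /api/v1/dry-run request would land on the Backend API, which is the failure mode the priority bands prevent.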

Health Checks

All target groups use: healthy threshold 2, unhealthy threshold 3, timeout 5s, interval 30s, matcher 200.

| Service | Health Path | Port |
| --- | --- | --- |
| Backend API | /api/v1/health | 8000 |
| Data Collection | /api/v1/health | 8002 |
| Strategy Service | /api/v1/health | 8003 |
| MLflow | /mlflow/ | 5000 |
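
The thresholds translate into rough detection times. This is a back-of-the-envelope sketch: it multiplies threshold by interval and ignores the 5s timeout and where in the interval a failure occurs, so treat the results as estimates. The HealthCheck class is a hypothetical helper, not a type from alb.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    # Defaults are the values shared by all target groups in this section.
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    timeout_s: int = 5
    interval_s: int = 30

    def seconds_to_unhealthy(self):
        """Approximate time for a failing target to be taken out of rotation."""
        return self.unhealthy_threshold * self.interval_s

    def seconds_to_healthy(self):
        """Approximate time for a new or recovered target to receive traffic."""
        return self.healthy_threshold * self.interval_s


check = HealthCheck()
print(check.seconds_to_unhealthy(), check.seconds_to_healthy())
```

With these settings a failed task keeps receiving traffic for about a minute and a half, and a fresh task waits about a minute before it serves requests.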

Source: infra/compute/modules/alb.py, infra/shared/tradai_infra_shared/config.py ALB_PATH_PATTERNS


9. DNS & Service Discovery

Internal service-to-service communication uses AWS Cloud Map (not Route 53 public DNS).

| Property | Value |
| --- | --- |
| Namespace | tradai-{env}.local |
| Type | Private DNS namespace |
| Example | mlflow.tradai-dev.local:5000 |

Services register via Cloud Map and are addressable by name within the VPC. The MLflow tracking URI is constructed as http://mlflow.{namespace}:5000/mlflow.
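
The naming scheme can be sketched in a few lines. The functions below are stand-ins written for illustration; the real get_sd_namespace() and get_mlflow_tracking_uri() in config.py may take different arguments.

```python
def sd_namespace(env):
    """Private DNS namespace for an environment, e.g. tradai-dev.local."""
    return f"tradai-{env}.local"


def service_host(service, env):
    """Cloud Map name of a service inside the namespace."""
    return f"{service}.{sd_namespace(env)}"


def mlflow_tracking_uri(env, port=5000):
    """Tracking URI in the http://mlflow.{namespace}:5000/mlflow form."""
    return f"http://{service_host('mlflow', env)}:{port}/mlflow"


print(mlflow_tracking_uri("dev"))
```

Because the namespace is private, these names resolve only from inside the VPC.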

Source: infra/shared/tradai_infra_shared/config.py get_sd_namespace(), get_mlflow_tracking_uri()


10. VPC Flow Logs

| Property | Value |
| --- | --- |
| Traffic type | ALL (accept + reject) |
| Destination | CloudWatch Logs (/aws/vpc/tradai-flow-logs) |
| Retention | 7 days |
| IAM role | Dedicated role for vpc-flow-logs.amazonaws.com |

Flow logs capture source/destination IPs, ports, protocol, packet/byte counts, and accept/reject actions for all VPC traffic.
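
A record can be split into named fields for ad-hoc analysis. The sketch below assumes the flow logs use the default (version 2) record format, which this document does not state explicitly; the sample records are made up for illustration.

```python
# Field order of the default VPC flow log format (version 2).
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_record(line):
    """Split one space-separated record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Parsed records whose action is REJECT."""
    return [rec for rec in map(parse_record, lines) if rec["action"] == "REJECT"]


# Fabricated sample records: one accepted, one rejected.
SAMPLE = [
    "2 123456789012 eni-0aaa1111bbbb2222c 10.0.11.25 10.0.21.10 49152 5432 6 12 3400 1711612800 1711612860 ACCEPT OK",
    "2 123456789012 eni-0aaa1111bbbb2222c 10.0.1.40 10.0.21.10 50000 5432 6 1 60 1711612800 1711612860 REJECT OK",
]

for rec in rejected(SAMPLE):
    print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"])
```

Filtering on REJECT is a quick way to confirm that the NACL and security group rules block what they are meant to block, for example a public-subnet address probing the database port.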

Source: infra/foundation/modules/vpc_flow_logs.py


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 10.0.0 | 2026-03-28 | Full regeneration from infra code. Corrected SG rules, VPC endpoints, NAT config |

Dependencies

| If This Changes | Update This Doc |
| --- | --- |
| infra/foundation/modules/security_groups.py | Security group rules |
| infra/foundation/modules/vpc_endpoints.py | VPC endpoint list |
| infra/foundation/modules/nat_instance.py | NAT configuration |
| infra/shared/tradai_infra_shared/config.py VPC_CIDR/SUBNETS | Network layout |