ADR-001: NAT Instance over NAT Gateway
Date: 2025-11
Status: Accepted
Context
TradAI runs in a single VPC (10.0.0.0/16) in eu-central-1 with private subnets that need outbound internet access for ECS tasks, Lambda functions, and AWS API calls. AWS offers two primary options for NAT:
- NAT Gateway -- managed service, highly available per AZ, billed at $0.045/hr plus $0.045/GB processed.
- NAT Instance -- self-managed EC2 instance running iptables NAT, billed at standard EC2 rates.
At TradAI's traffic volume (~100 GB/month outbound), a dual-AZ NAT Gateway costs approximately $70/month. The platform targets a base cost of ~$120-130/month, making networking a significant fraction of the budget.
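The ~$70 figure can be reproduced from the rates quoted above. The following is a minimal sketch, assuming a 730-hour month and one gateway per AZ; the function name and the hours constant are illustrative and do not come from the TradAI codebase.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def nat_gateway_monthly_cost(gateways, gb_processed,
                             hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate monthly NAT Gateway cost in USD from the rates quoted in this ADR."""
    hourly_charge = gateways * hourly_rate * HOURS_PER_MONTH
    data_charge = gb_processed * per_gb_rate
    return hourly_charge + data_charge


if __name__ == "__main__":
    # Two gateways (one per AZ) and ~100 GB/month of processed traffic.
    print(f"${nat_gateway_monthly_cost(2, 100):.2f}/month")
```

Under these assumptions the hourly charge (about $65.70) dominates the data charge ($4.50), so the cost is largely fixed regardless of traffic volume.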
Decision
Use a single t4g.nano NAT instance (ARM64, ~$3/month) in the first public subnet instead of NAT Gateway. The instance runs in an Auto Scaling Group (min=1, max=1, desired=1) for automatic replacement on failure.
High-availability components:
- ASG replaces failed instances automatically.
- Lifecycle hook (EC2_INSTANCE_LAUNCHING, 300s heartbeat) delays traffic until the new instance is ready.
- EventBridge rule triggers on ASG launch events.
- Lambda (update-nat-routes) updates the private route table to point 0.0.0.0/0 at the new instance.
- User data (Jinja2 template) configures IP forwarding, iptables masquerading, EIP association, and source/destination check disabling.
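The route-update step can be sketched as follows. This is not the actual update-nat-routes code: it assumes the Lambda receives the EventBridge "EC2 Instance-launch Lifecycle Action" event, that a default route already exists in the private route table, and that the Lambda (rather than user data) completes the lifecycle action. The function names and the PRIVATE_ROUTE_TABLE_ID environment variable are hypothetical.

```python
def handle_launch_event(event, ec2, autoscaling, route_table_id):
    """Point the private default route at the newly launched NAT instance.

    `ec2` and `autoscaling` are boto3 clients (or test doubles);
    `route_table_id` identifies the private route table to update.
    """
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Repoint 0.0.0.0/0 in the private route table at the new instance.
    # replace_route requires the route to exist already (assumption).
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id,
    )

    # Release the lifecycle hook so the instance can move to InService.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return instance_id


def lambda_handler(event, context):
    """Lambda entry point; wires real boto3 clients into the helper above."""
    import os
    import boto3  # imported lazily so the helper stays testable without AWS

    return handle_launch_event(
        event,
        boto3.client("ec2"),
        boto3.client("autoscaling"),
        os.environ["PRIVATE_ROUTE_TABLE_ID"],  # hypothetical variable name
    )
```

Passing the clients in as arguments keeps the routing logic testable with simple stand-ins, which matters for a component that only runs during failures.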
IMDSv2 is required (http_tokens="required") and detailed monitoring is enabled.
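With http_tokens="required", any script on the instance that reads metadata (for example, to look up its own instance ID before associating the EIP) must first obtain a session token. The sketch below shows the two-step IMDSv2 exchange using only the standard library; the helper names are illustrative and are not taken from the user data template.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def token_request(ttl_seconds=300):
    """Build the PUT request that obtains an IMDSv2 session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path, token):
    """Build a GET request for a metadata path, authenticated with the token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_instance_id(timeout=2):
    """Fetch this instance's ID; only works when run on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(
        metadata_request("instance-id", token), timeout=timeout
    ) as resp:
        return resp.read().decode()
```

A metadata request sent without the token header is rejected when tokens are required, which is the behavior this setting enforces.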
Consequences
Benefits:
- Monthly cost drops from ~$70 (NAT Gateway, 2 AZs) to ~$6 (NAT instance + overhead), saving ~$64/month.
- EIP provides a stable outbound IP address.
- ASG provides automated recovery without manual intervention.
Trade-offs:
- Single AZ placement means a brief outage (~2-3 minutes) during instance replacement.
- Self-managed: AMI updates, security patching, and iptables configuration are our responsibility.
- Throughput limited to t4g.nano network bandwidth (~5 Gbps burst). Sufficient for current traffic but would need resizing for higher throughput.
- More infrastructure code to maintain (nat_instance.py, Lambda route updater, lifecycle hooks).
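To put the throughput trade-off in perspective, the monthly volume can be converted to an average rate. This is a rough sketch assuming a 30-day month and decimal gigabytes; it describes the average only and says nothing about short traffic bursts.

```python
def average_mbps(gb_per_month, days=30):
    """Convert a monthly transfer volume (decimal GB) to an average rate in Mbit/s."""
    bits = gb_per_month * 1e9 * 8
    seconds = days * 24 * 3600
    return bits / seconds / 1e6


if __name__ == "__main__":
    avg = average_mbps(100)
    print(f"average rate: {avg:.2f} Mbit/s")
    # Compare against the ~5 Gbit/s burst figure quoted above.
    print(f"share of burst bandwidth: {avg / 5000:.4%}")
```

At ~100 GB/month the average works out to roughly 0.3 Mbit/s, several orders of magnitude below the burst figure, so resizing would only become relevant with a very large increase in traffic or sustained peaks.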
Mitigations:
- ASG automatic replacement limits downtime to the time needed to boot the new instance and update the route (~2-3 minutes).
- VPC endpoints handle high-volume AWS API traffic (ECR, S3, CloudWatch) without routing through NAT.
- Monitoring and alarms track NAT instance health.
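This ADR does not say which metrics are alarmed. As one possible shape, the sketch below builds the arguments for a CloudWatch alarm on the EC2 StatusCheckFailed metric, aggregated by Auto Scaling group so that it survives instance replacement. The alarm name, thresholds, and SNS topic are assumptions chosen for illustration.

```python
def nat_status_alarm_kwargs(asg_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch `put_metric_alarm`."""
    return {
        "AlarmName": f"{asg_name}-status-check-failed",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        # Keyed on the ASG rather than an instance ID, which changes on replacement.
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Maximum",
        "Period": 60,            # 1-minute datapoints (detailed monitoring is enabled)
        "EvaluationPeriods": 2,  # assumption: alarm after two failing minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Treat a missing instance (no datapoints) as a failure.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The dictionary could then be passed to a boto3 CloudWatch client, for example `boto3.client("cloudwatch").put_metric_alarm(**nat_status_alarm_kwargs(name, topic_arn))`.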
References
- infra/foundation/modules/nat_instance.py
- reports/final-architecture/03-VPC-NETWORKING.md (Section 5)
- reports/final-architecture/07-COST-ANALYSIS.md (Optimization 1)