ADR-001: NAT Instance over NAT Gateway

Date: 2025-11

Status: Accepted

Context

TradAI runs in a single VPC (10.0.0.0/16) in eu-central-1 with private subnets that need outbound internet access for ECS tasks, Lambda functions, and AWS API calls. AWS offers two primary options for NAT:

  1. NAT Gateway -- managed service, highly available per AZ, billed at $0.045/hr plus $0.045/GB processed.
  2. NAT Instance -- self-managed EC2 instance running iptables NAT, billed at standard EC2 rates.

At TradAI's traffic volume (~100 GB/month outbound), a dual-AZ NAT Gateway costs approximately $70/month. The platform targets a base cost of ~$120-130/month, making networking a significant fraction of the budget.
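The ~$70/month figure can be reproduced from the rates quoted above. The sketch below is only a back-of-the-envelope check: the 730-hour month and the function name are assumptions for illustration, and the rates are the ones listed in this ADR rather than values read from a price list.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for estimates


def nat_gateway_monthly_cost(gateways, gb_processed,
                             hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate the monthly NAT Gateway cost in USD.

    The default rates are the figures quoted in this ADR.
    """
    fixed = gateways * hourly_rate * HOURS_PER_MONTH  # per-gateway hourly charge
    variable = gb_processed * per_gb_rate             # data processing charge
    return fixed + variable


dual_az = nat_gateway_monthly_cost(gateways=2, gb_processed=100)
print(f"Dual-AZ NAT Gateway: ${dual_az:.2f}/month")
```

At this traffic volume the hourly charge accounts for roughly $65.70 of the total and data processing for about $4.50, so the cost is almost entirely fixed.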

Decision

Use a single t4g.nano NAT instance (ARM64, ~$3/month) in the first public subnet instead of a NAT Gateway. The instance runs in an Auto Scaling Group (min=1, max=1, desired=1) so that a failed instance is replaced automatically.

High-availability components:

  • ASG replaces failed instances automatically.
  • Lifecycle hook (EC2_INSTANCE_LAUNCHING, 300s heartbeat) holds the new instance in a wait state until it is ready, so traffic is not routed to it prematurely.
  • EventBridge rule triggers on ASG launch events.
  • Lambda (update-nat-routes) updates the private route table to point 0.0.0.0/0 at the new instance.
  • User data (Jinja2 template) configures IP forwarding, iptables masquerading, EIP association, and source/destination check disabling.
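The route-update step can be sketched as a small Lambda handler. This is a minimal illustration, not the actual update-nat-routes code: the `PRIVATE_ROUTE_TABLE_ID` environment variable, the function names, and the injectable `ec2_client` parameter are assumptions. It relies on the boto3 EC2 client's `replace_route` call and on Auto Scaling launch events carrying the instance ID in `detail.EC2InstanceId`.

```python
import os

DEFAULT_ROUTE = "0.0.0.0/0"


def update_default_route(ec2_client, route_table_id, instance_id):
    """Point the route table's default route at the given NAT instance."""
    ec2_client.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=DEFAULT_ROUTE,
        InstanceId=instance_id,
    )


def handler(event, context, ec2_client=None):
    """Lambda entry point for ASG launch events delivered by EventBridge."""
    # Auto Scaling launch events carry the new instance ID here.
    instance_id = event["detail"]["EC2InstanceId"]
    # Hypothetical environment variable holding the private route table ID.
    route_table_id = os.environ["PRIVATE_ROUTE_TABLE_ID"]
    if ec2_client is None:
        # Imported lazily so the handler can be exercised with a stub client.
        import boto3
        ec2_client = boto3.client("ec2")
    update_default_route(ec2_client, route_table_id, instance_id)
    return {"route_table_id": route_table_id, "instance_id": instance_id}
```

`replace_route` only succeeds when a 0.0.0.0/0 route already exists in the table, so a real implementation would probably need to create the route on first deployment and to handle multiple private route tables if there are any.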

IMDSv2 is required (http_tokens="required") and detailed monitoring is enabled.

Consequences

Benefits:

  • Monthly cost drops from ~$70 (NAT Gateway, 2 AZs) to ~$6 (NAT instance + overhead), saving ~$64/month.
  • EIP provides a stable outbound IP address.
  • ASG provides automated recovery without manual intervention.

Trade-offs:

  • Single-AZ placement means outbound traffic from the private subnets is interrupted briefly (~2-3 minutes) while a failed instance is replaced.
  • Self-managed: AMI updates, security patching, and iptables configuration are our responsibility.
  • Throughput is limited to t4g.nano network bandwidth (up to 5 Gbps burst, with a much lower sustained baseline). This is sufficient for current traffic, but the instance would need resizing for higher throughput.
  • More infrastructure code to maintain (nat_instance.py, Lambda route updater, lifecycle hooks).

Mitigations:

  • ASG automatic replacement limits downtime to roughly the time needed to boot the replacement instance and update the route (the ~2-3 minutes noted above).
  • VPC endpoints handle high-volume AWS API traffic (ECR, S3, CloudWatch) without routing through NAT.
  • Monitoring and alarms cover NAT instance health.
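One way the health alarm might be defined is sketched below. It builds the keyword arguments for the CloudWatch `put_metric_alarm` call. The alarm name, thresholds, SNS topic, and the choice of the `StatusCheckFailed` metric aggregated by the `AutoScalingGroupName` dimension are all assumptions, not the alarms actually deployed.

```python
def nat_health_alarm_params(asg_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Assumes EC2 status-check metrics are available under the
    AutoScalingGroupName dimension, so the alarm survives instance replacement.
    """
    return {
        "AlarmName": f"{asg_name}-status-check-failed",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per evaluation period
        "EvaluationPeriods": 2,      # two consecutive failing minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "breaching",  # no data likely means no instance
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder names for illustration only.
params = nat_health_alarm_params(
    "nat-instance-asg",
    "arn:aws:sns:eu-central-1:123456789012:nat-alerts",
)
# To create the alarm: boto3.client("cloudwatch").put_metric_alarm(**params)
```

Treating missing data as breaching means the alarm also fires while no NAT instance is running, which is the window in which private subnets have no outbound route.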

References

  • infra/foundation/modules/nat_instance.py
  • reports/final-architecture/03-VPC-NETWORKING.md (Section 5)
  • reports/final-architecture/07-COST-ANALYSIS.md (Optimization 1)