Infrastructure Issues Runbook
Procedures for handling NAT instance failures, VPC connectivity issues, and RDS problems.
NAT Instance Failure
Symptoms
- ECS tasks cannot reach external APIs (exchanges)
- Timeout errors when accessing internet
- CloudWatch alarm: tradai-{env}-nat-unhealthy
Diagnosis
- Check NAT instance status
- Check NAT instance health
- Verify route table
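The three checks above could be run with the AWS CLI roughly as follows. This is a sketch, not the team's confirmed procedure: the helper function names are hypothetical, and it assumes `NAT_INSTANCE_ID` and `PRIVATE_RT_ID` are already set as in the Resolution commands.

```shell
# Hypothetical helpers; each takes the resource ID as its first argument.

# 1. Instance state (running, stopped, terminated, ...)
nat_state() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].State.Name' \
    --output text
}

# 2. System and instance status checks (ok, impaired, ...).
#    --include-all-instances also reports instances that are not running.
nat_health() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
    --output text
}

# 3. Default route of the private route table; it should point at the NAT instance.
nat_default_route() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0']" \
    --output json
}

# Usage:
#   nat_state "$NAT_INSTANCE_ID"
#   nat_health "$NAT_INSTANCE_ID"
#   nat_default_route "$PRIVATE_RT_ID"
```

If the default route shows a `State` of `blackhole`, or an `InstanceId` that differs from the current NAT instance, the route table is the likely culprit rather than the instance itself.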
Resolution
Option 1: Restart NAT instance
# Stop instance
aws ec2 stop-instances --instance-ids $NAT_INSTANCE_ID
# Wait for stopped state
aws ec2 wait instance-stopped --instance-ids $NAT_INSTANCE_ID
# Start instance
aws ec2 start-instances --instance-ids $NAT_INSTANCE_ID
# Wait for running state
aws ec2 wait instance-running --instance-ids $NAT_INSTANCE_ID
Option 2: Replace NAT instance
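# Launch new NAT instance from AMI and capture its ID
NEW_NAT_INSTANCE_ID=$(aws ec2 run-instances \
--image-id ami-XXXXXXXXX \
--instance-type t3.micro \
--subnet-id $PUBLIC_SUBNET_ID \
--security-group-ids $NAT_SG_ID \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=tradai-nat-${ENVIRONMENT}}]" \
--query 'Instances[0].InstanceId' \
--output text)
# Wait for running state
aws ec2 wait instance-running --instance-ids $NEW_NAT_INSTANCE_ID
# Disable source/destination checking (required for NAT; run-instances has no flag for it)
aws ec2 modify-instance-attribute \
--instance-id $NEW_NAT_INSTANCE_ID \
--no-source-dest-check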
# Update route table
aws ec2 replace-route \
--route-table-id $PRIVATE_RT_ID \
--destination-cidr-block 0.0.0.0/0 \
--instance-id $NEW_NAT_INSTANCE_ID
# Terminate old instance after verification
aws ec2 terminate-instances --instance-ids $OLD_NAT_INSTANCE_ID
Option 3: Switch to NAT Gateway (recommended for production)
# Create NAT Gateway and capture its ID
NAT_GW_ID=$(aws ec2 create-nat-gateway \
--subnet-id $PUBLIC_SUBNET_ID \
--allocation-id $ELASTIC_IP_ALLOCATION_ID \
--tag-specifications "ResourceType=natgateway,Tags=[{Key=Name,Value=tradai-nat-gw-${ENVIRONMENT}}]" \
--query 'NatGateway.NatGatewayId' \
--output text)
# Wait for NAT Gateway to be available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_GW_ID
# Update route table
aws ec2 replace-route \
--route-table-id $PRIVATE_RT_ID \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id $NAT_GW_ID
Verification
# Test connectivity from ECS task
aws ecs execute-command \
--cluster tradai-${ENVIRONMENT} \
--task $TASK_ARN \
--container $CONTAINER_NAME \
--interactive \
--command "curl -I https://api.binance.com/api/v3/ping"
VPC Endpoint Issues
Symptoms
- DynamoDB/S3/Secrets Manager access failing
- Timeout errors for AWS services
- "Could not connect" errors in logs
Diagnosis
- List VPC endpoints
- Check endpoint status
- Verify security groups
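One way to carry out these checks with the AWS CLI is sketched below. The function names are hypothetical, and `VPC_ID`, `ENDPOINT_ID`, and `ENDPOINT_SG_ID` are assumed to be set by the operator. Note that only Interface endpoints (for example Secrets Manager) have security groups; Gateway endpoints (DynamoDB, S3) are controlled through route tables instead.

```shell
# Hypothetical helpers for inspecting VPC endpoints.

# 1. List every endpoint in the VPC with its service, type, and state.
list_vpc_endpoints() {
  aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].[VpcEndpointId,ServiceName,VpcEndpointType,State]' \
    --output table
}

# 2. State of a single endpoint (expected: available).
endpoint_state() {
  aws ec2 describe-vpc-endpoints \
    --vpc-endpoint-ids "$1" \
    --query 'VpcEndpoints[0].State' \
    --output text
}

# 3. Inbound rules of the endpoint security group (Interface endpoints only);
#    look for TCP 443 from the ECS task security group.
endpoint_sg_rules() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query 'SecurityGroups[0].IpPermissions' \
    --output json
}

# Usage:
#   list_vpc_endpoints "$VPC_ID"
#   endpoint_state "$ENDPOINT_ID"
#   endpoint_sg_rules "$ENDPOINT_SG_ID"
```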
Resolution
Fix security group rules:
# Allow HTTPS from ECS tasks
aws ec2 authorize-security-group-ingress \
--group-id $ENDPOINT_SG_ID \
--protocol tcp \
--port 443 \
--source-group $ECS_SG_ID
Recreate VPC endpoint (if corrupted):
# Delete existing endpoint
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID
# Create new endpoint
aws ec2 create-vpc-endpoint \
--vpc-id $VPC_ID \
--service-name com.amazonaws.${AWS_REGION}.dynamodb \
--vpc-endpoint-type Gateway \
--route-table-ids $PRIVATE_RT_ID
RDS Connectivity Issues
Symptoms
- Database connection errors
- Timeout connecting to RDS
- "Connection refused" errors
Diagnosis
- Check RDS instance status
- Verify security groups
- Check from ECS task
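The checks above might look like the following sketch. The function names are hypothetical. The instance identifier `tradai-${ENVIRONMENT}` and port 5432 are taken from the Resolution commands in this section; the presence of `nc` inside the container is an assumption, so substitute whatever client the image actually ships (for example `pg_isready`).

```shell
# Hypothetical helpers for RDS connectivity diagnosis.

# 1. Instance status plus endpoint host and port.
rds_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,Endpoint.Port]' \
    --output text
}

# 2. Inbound rules of the first security group attached to the instance.
rds_sg_rules() {
  local sg_id
  sg_id=$(aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
    --output text)
  aws ec2 describe-security-groups \
    --group-ids "$sg_id" \
    --query 'SecurityGroups[0].IpPermissions' \
    --output json
}

# 3. TCP reachability from inside a running task.
#    Arguments: cluster, task ARN, container name, RDS hostname.
rds_check_from_task() {
  aws ecs execute-command \
    --cluster "$1" \
    --task "$2" \
    --container "$3" \
    --interactive \
    --command "nc -zv $4 5432"
}

# Usage:
#   rds_status "tradai-${ENVIRONMENT}"
#   rds_sg_rules "tradai-${ENVIRONMENT}"
#   rds_check_from_task "tradai-${ENVIRONMENT}" "$TASK_ARN" "$CONTAINER_NAME" "$RDS_HOST"
```

A status other than `available` (for example `stopped` or `maintenance`) maps directly onto the rows of the Common Causes table.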
Common Causes
| Issue | Cause | Fix |
|---|---|---|
| Security group | ECS SG not allowed | Add inbound rule |
| Wrong endpoint | Using incorrect hostname | Update env vars |
| RDS stopped | Instance in stopped state | Start instance |
| RDS maintenance | Scheduled maintenance | Wait for completion |
| Max connections | Connection pool exhausted | Restart service |
Resolution
Fix security group:
# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
--db-instance-identifier tradai-${ENVIRONMENT} \
--query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
--output text)
# Allow ECS tasks
aws ec2 authorize-security-group-ingress \
--group-id $RDS_SG \
--protocol tcp \
--port 5432 \
--source-group $ECS_SG_ID
Restart RDS (if hung):
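A reboot could be issued as in this sketch, which assumes the instance identifier `tradai-${ENVIRONMENT}` used in the security group commands above. The function name is hypothetical. Expect a short outage while the instance restarts.

```shell
# Hypothetical helper: reboot an RDS instance and block until it is available again.
# For Multi-AZ deployments, --force-failover can be added to the reboot call
# to fail over to the standby instead of restarting in place.
rds_reboot() {
  aws rds reboot-db-instance --db-instance-identifier "$1" &&
    aws rds wait db-instance-available --db-instance-identifier "$1"
}

# Usage:
#   rds_reboot "tradai-${ENVIRONMENT}"
```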
Secrets Manager Access Issues
Symptoms
- "AccessDeniedException" when loading secrets
- "ResourceNotFoundException" errors (secret not found)
- Service fails to start due to missing config
Diagnosis
- Verify secret exists
- Check ECS task role permissions
- Verify secret value is valid JSON (if applicable)
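These checks could be sketched as follows. The function names and the `SECRET_ID` variable are hypothetical; the role name `tradai-ecs-task-role-${ENVIRONMENT}` comes from the Resolution command in this section. The JSON check assumes `python3` is available on the operator's machine and discards the parsed output so the secret value is not printed to the terminal.

```shell
# Hypothetical helpers for Secrets Manager diagnosis.

# 1. Prints the secret ARN if the secret exists; fails otherwise.
secret_exists() {
  aws secretsmanager describe-secret \
    --secret-id "$1" \
    --query 'ARN' \
    --output text
}

# 2. Inline and attached managed policies on the task role.
task_role_policies() {
  aws iam list-role-policies --role-name "$1"
  aws iam list-attached-role-policies --role-name "$1"
}

# 3. Exit status 0 if the secret string parses as JSON, non-zero otherwise.
secret_is_json() {
  aws secretsmanager get-secret-value \
    --secret-id "$1" \
    --query 'SecretString' \
    --output text | python3 -m json.tool > /dev/null
}

# Usage:
#   secret_exists "$SECRET_ID"
#   task_role_policies "tradai-ecs-task-role-${ENVIRONMENT}"
#   secret_is_json "$SECRET_ID" && echo "valid JSON" || echo "invalid JSON"
```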
Resolution
Grant access to secret:
# Add policy to task role
# (the document is double-quoted so the shell expands the variables in the ARN)
aws iam put-role-policy \
--role-name tradai-ecs-task-role-${ENVIRONMENT} \
--policy-name secrets-access \
--policy-document "{
\"Version\": \"2012-10-17\",
\"Statement\": [{
\"Effect\": \"Allow\",
\"Action\": [\"secretsmanager:GetSecretValue\"],
\"Resource\": \"arn:aws:secretsmanager:${AWS_REGION}:${ACCOUNT_ID}:secret:tradai/${ENVIRONMENT}/*\"
}]
}"
DNS Resolution Issues
Symptoms
- "Name or service not known" errors
- DNS lookups failing
- Intermittent connectivity
Diagnosis
- Test from container
- Check VPC DNS settings
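A possible way to run both checks is sketched below. The function names are hypothetical. The lookup target `api.binance.com` is borrowed from the NAT verification step, and `nslookup` being installed in the container image is an assumption (`getent hosts` is a common fallback on glibc-based images).

```shell
# Hypothetical helpers for DNS diagnosis.

# 1. Resolve an external hostname from inside a running task.
#    Arguments: cluster, task ARN, container name.
dns_test_from_task() {
  aws ecs execute-command \
    --cluster "$1" \
    --task "$2" \
    --container "$3" \
    --interactive \
    --command "nslookup api.binance.com"
}

# 2. Print the two VPC DNS attributes; both should be True.
vpc_dns_settings() {
  aws ec2 describe-vpc-attribute \
    --vpc-id "$1" \
    --attribute enableDnsSupport \
    --query 'EnableDnsSupport.Value' \
    --output text
  aws ec2 describe-vpc-attribute \
    --vpc-id "$1" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' \
    --output text
}

# Usage:
#   dns_test_from_task "tradai-${ENVIRONMENT}" "$TASK_ARN" "$CONTAINER_NAME"
#   vpc_dns_settings "$VPC_ID"
```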
Resolution
Enable VPC DNS:
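# Each attribute takes an explicit value and must be set in its own call
aws ec2 modify-vpc-attribute \
--vpc-id $VPC_ID \
--enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute \
--vpc-id $VPC_ID \
--enable-dns-hostnames '{"Value":true}'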
Verification Checklist
After any infrastructure resolution:
- [ ] ECS tasks can reach external APIs
- [ ] Database connectivity restored
- [ ] Secrets loading correctly
- [ ] VPC endpoints functional
- [ ] DNS resolution working
- [ ] All services reporting healthy
- [ ] CloudWatch alarms cleared
- [ ] No error patterns in logs (last 15 minutes)
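The last two checklist items can be partly automated. The sketch below assumes alarm names start with `tradai-{env}` (as the NAT alarm does) and that errors are logged with the literal string `ERROR`. The function names and the log group name passed to `recent_errors` are hypothetical.

```shell
# Hypothetical helpers for post-incident verification.

# Names of alarms for the environment that are still in ALARM state
# (empty output means all alarms have cleared).
alarms_in_alarm() {
  aws cloudwatch describe-alarms \
    --alarm-name-prefix "tradai-$1" \
    --state-value ALARM \
    --query 'MetricAlarms[].AlarmName' \
    --output text
}

# Log messages containing ERROR from the last 15 minutes in one log group.
recent_errors() {
  local start_ms
  start_ms=$(( ($(date +%s) - 900) * 1000 ))  # 15 minutes ago, in epoch milliseconds
  aws logs filter-log-events \
    --log-group-name "$1" \
    --start-time "$start_ms" \
    --filter-pattern "ERROR" \
    --query 'events[].message' \
    --output text
}

# Usage:
#   alarms_in_alarm "$ENVIRONMENT"
#   recent_errors "$LOG_GROUP_NAME"
```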