Infrastructure Issues Runbook

Procedures for handling NAT instance failures, VPC endpoint issues, RDS connectivity problems, Secrets Manager access errors, and DNS resolution failures.

NAT Instance Failure

Symptoms

  • ECS tasks cannot reach external APIs (exchanges)
  • Timeout errors when accessing the internet
  • CloudWatch alarm: tradai-{env}-nat-unhealthy

Diagnosis

  1. Check NAT instance status:

    aws ec2 describe-instances \
      --filters "Name=tag:Name,Values=tradai-nat-${ENVIRONMENT}" \
      --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Reason:StateTransitionReason}'
    

  2. Check NAT instance health:

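    # --include-all-instances also reports stopped instances (the default is running only)
    aws ec2 describe-instance-status \
      --instance-ids $NAT_INSTANCE_ID \
      --include-all-instances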
    

  3. Verify route table:

    # Get private subnet route table
    aws ec2 describe-route-tables \
      --filters "Name=tag:Name,Values=tradai-private-rt-${ENVIRONMENT}"
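
When a route's target is unavailable (for example, a terminated NAT instance), EC2 reports that route's state as blackhole, so the route table check can be narrowed to the default route alone. The following is a minimal sketch, assuming the tradai-private-rt-${ENVIRONMENT} Name tag used in step 3; default_route_state and default_route_ok are hypothetical helper names, not part of any existing tooling.

```shell
# Hypothetical helper: print the state of the 0.0.0.0/0 route
# ("active" or "blackhole") for the private route table.
default_route_state() {
  aws ec2 describe-route-tables \
    --filters "Name=tag:Name,Values=tradai-private-rt-${ENVIRONMENT}" \
    --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'].State | [0]" \
    --output text
}

# Hypothetical helper: succeed only when the default route is usable.
default_route_ok() {
  [ "$(default_route_state)" = "active" ]
}
```

A call such as `default_route_ok || echo "default route is not active"` gives a quick yes/no answer before moving on to the resolution options.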
    

Resolution

Option 1: Restart NAT instance

# Stop instance
aws ec2 stop-instances --instance-ids $NAT_INSTANCE_ID

# Wait for stopped state
aws ec2 wait instance-stopped --instance-ids $NAT_INSTANCE_ID

# Start instance
aws ec2 start-instances --instance-ids $NAT_INSTANCE_ID

# Wait for running state
aws ec2 wait instance-running --instance-ids $NAT_INSTANCE_ID
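
The instance-running waiter returns as soon as the instance state is running, which can be before the EC2 status checks pass. The sketch below chains the same four commands and adds the instance-status-ok waiter at the end; restart_nat_and_wait is a hypothetical wrapper name, and stopping at the first failing step is an assumption about the desired behavior.

```shell
# Hypothetical helper: stop/start the NAT instance, then block until both
# EC2 status checks pass rather than only until the state is "running".
# $1 = NAT instance ID. Stops at the first command that fails.
restart_nat_and_wait() {
  aws ec2 stop-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-stopped --instance-ids "$1" &&
    aws ec2 start-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-running --instance-ids "$1" &&
    aws ec2 wait instance-status-ok --instance-ids "$1"
}
```

Usage would be `restart_nat_and_wait "$NAT_INSTANCE_ID"`, followed by the connectivity test in the Verification section.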

Option 2: Replace NAT instance

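# Launch new NAT instance from AMI and capture its ID
# (double quotes on the tag spec so ${ENVIRONMENT} is expanded)
NEW_NAT_INSTANCE_ID=$(aws ec2 run-instances \
  --image-id ami-XXXXXXXXX \
  --instance-type t3.micro \
  --subnet-id $PUBLIC_SUBNET_ID \
  --security-group-ids $NAT_SG_ID \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=tradai-nat-${ENVIRONMENT}}]" \
  --query 'Instances[0].InstanceId' \
  --output text)

# Disable the source/destination check (required for NAT; it is not a run-instances option)
aws ec2 modify-instance-attribute \
  --instance-id $NEW_NAT_INSTANCE_ID \
  --no-source-dest-check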

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --instance-id $NEW_NAT_INSTANCE_ID

# Terminate old instance after verification
aws ec2 terminate-instances --instance-ids $OLD_NAT_INSTANCE_ID

Option 3: Switch to NAT Gateway (recommended for production)

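# Create NAT Gateway and capture its ID
# (double quotes on the tag spec so ${ENVIRONMENT} is expanded)
NAT_GW_ID=$(aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_ID \
  --allocation-id $ELASTIC_IP_ALLOCATION_ID \
  --tag-specifications "ResourceType=natgateway,Tags=[{Key=Name,Value=tradai-nat-gw-${ENVIRONMENT}}]" \
  --query 'NatGateway.NatGatewayId' \
  --output text)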

# Wait for NAT Gateway to be available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_GW_ID

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW_ID

Verification

# Test connectivity from ECS task
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container $CONTAINER_NAME \
  --interactive \
  --command "curl -I https://api.binance.com/api/v3/ping"
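
A single curl can fail transiently right after a route change, so it may help to retry the probe a few times before declaring the fix unsuccessful. This is a minimal sketch meant to run inside a container or host in the private subnets; probe_egress is a hypothetical helper, and the default attempt count and delay are assumptions.

```shell
# Hypothetical helper: retry an HTTPS probe until it succeeds or attempts run out.
# $1 = URL, $2 = max attempts (default 5), $3 = seconds between attempts (default 10)
probe_egress() {
  probe_attempt=1
  while :; do
    if curl --silent --fail --max-time 5 --output /dev/null "$1"; then
      echo "reachable after ${probe_attempt} attempt(s)"
      return 0
    fi
    # Stop once the attempt budget is used up; otherwise pause and retry.
    if [ "$probe_attempt" -ge "${2:-5}" ]; then
      break
    fi
    probe_attempt=$((probe_attempt + 1))
    sleep "${3:-10}"
  done
  echo "unreachable after ${probe_attempt} attempt(s)" >&2
  return 1
}
```

For example, `probe_egress https://api.binance.com/api/v3/ping 5 10` retries for roughly a minute before giving up.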

VPC Endpoint Issues

Symptoms

  • DynamoDB/S3/Secrets Manager access failing
  • Timeout errors for AWS services
  • "Could not connect" errors in logs

Diagnosis

  1. List VPC endpoints:

    aws ec2 describe-vpc-endpoints \
      --filters "Name=vpc-id,Values=$VPC_ID" \
      --query 'VpcEndpoints[].{Service:ServiceName,State:State,ID:VpcEndpointId}'
    

  2. Check endpoint status:

    aws ec2 describe-vpc-endpoints \
      --vpc-endpoint-ids $ENDPOINT_ID
    

  3. Verify security groups:

    # VPC endpoint security group must allow traffic from ECS tasks
    aws ec2 describe-security-groups \
      --group-ids $ENDPOINT_SG_ID
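
With several endpoints in one VPC, it can be quicker to print only those that are not in the available state. The sketch below filters the listing from step 1 with awk; unhealthy_vpc_endpoints is a hypothetical helper name, and the state comparison is case-insensitive as a precaution.

```shell
# Hypothetical helper: list endpoints in the VPC whose state is not "available".
# $1 = VPC ID. Prints "endpoint-id  service-name  state" for each such endpoint.
unhealthy_vpc_endpoints() {
  aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].[VpcEndpointId,ServiceName,State]' \
    --output text |
    awk 'tolower($3) != "available"'
}
```

An empty result from `unhealthy_vpc_endpoints "$VPC_ID"` suggests the endpoints themselves are fine and the security group check in step 3 is the next place to look.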
    

Resolution

Fix security group rules:

# Allow HTTPS from ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $ENDPOINT_SG_ID \
  --protocol tcp \
  --port 443 \
  --source-group $ECS_SG_ID
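
If the rule already exists, authorize-security-group-ingress fails with an InvalidPermission.Duplicate error, which is harmless during an incident but can look like a failure. A minimal sketch of a wrapper that treats that case as success follows; allow_ingress_from_sg is a hypothetical helper name, and matching on the error text is an assumption about how the CLI reports the error code.

```shell
# Hypothetical helper: add an ingress rule, treating "rule already exists" as success.
# $1 = target security group ID, $2 = TCP port, $3 = source security group ID
allow_ingress_from_sg() {
  # Capture stderr only; the JSON written to stdout on success is discarded.
  if ingress_err=$(aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port "$2" --source-group "$3" 2>&1 >/dev/null); then
    echo "rule added"
  else
    case "$ingress_err" in
      *InvalidPermission.Duplicate*) echo "rule already present" ;;
      *) echo "$ingress_err" >&2; return 1 ;;
    esac
  fi
}
```

The same wrapper would apply to the RDS rule on port 5432, for example `allow_ingress_from_sg "$RDS_SG" 5432 "$ECS_SG_ID"`.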

Recreate VPC endpoint (if corrupted):

# Delete existing endpoint
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID

# Create new endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.${AWS_REGION}.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids $PRIVATE_RT_ID


RDS Connectivity Issues

Symptoms

  • Database connection errors
  • Timeout connecting to RDS
  • "Connection refused" errors

Diagnosis

  1. Check RDS instance status:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint}'
    

  2. Verify security groups:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].VpcSecurityGroups'
    

  3. Check from ECS task:

    # Execute command in running container
    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container $CONTAINER_NAME \
      --interactive \
      --command "nc -zv $RDS_ENDPOINT 5432"
    

Common Causes

Issue           | Cause                     | Fix
----------------|---------------------------|--------------------
Security group  | ECS SG not allowed        | Add inbound rule
Wrong endpoint  | Using incorrect hostname  | Update env vars
RDS stopped     | Instance in stopped state | Start instance
RDS maintenance | Scheduled maintenance     | Wait for completion
Max connections | Connection pool exhausted | Restart service
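
The status-related rows above can be turned into a small lookup on the DBInstanceStatus value returned in diagnosis step 1. This is a sketch only: rds_next_step is a hypothetical helper, and grouping the transitional statuses under "wait" is an assumption rather than documented team policy.

```shell
# Hypothetical helper: translate a DBInstanceStatus value into the next runbook step.
rds_next_step() {
  case "$1" in
    available)
      echo "instance is up: check security group, endpoint, and connection count" ;;
    stopped)
      echo "start the instance" ;;
    starting | rebooting | maintenance | modifying | upgrading | backing-up)
      echo "wait for the operation to complete" ;;
    *)
      echo "unrecognized status '$1': inspect the instance manually" ;;
  esac
}
```

It could be fed directly from the CLI, for instance `rds_next_step "$(aws rds describe-db-instances --db-instance-identifier tradai-${ENVIRONMENT} --query 'DBInstances[0].DBInstanceStatus' --output text)"`.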

Resolution

Fix security group:

# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text)

# Allow ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp \
  --port 5432 \
  --source-group $ECS_SG_ID

Restart RDS (if hung):

aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}


Secrets Manager Access Issues

Symptoms

  • "AccessDeniedException" when loading secrets
  • "ResourceNotFoundException" errors (secret not found)
  • Service fails to start due to missing config

Diagnosis

  1. Verify secret exists:

    aws secretsmanager describe-secret \
      --secret-id tradai/${ENVIRONMENT}/api-keys
    

  2. Check ECS task role permissions:

    aws iam get-role-policy \
      --role-name tradai-ecs-task-role-${ENVIRONMENT} \
      --policy-name secrets-access
    

  3. Verify secret value is valid JSON (if applicable):

    aws secretsmanager get-secret-value \
      --secret-id tradai/${ENVIRONMENT}/api-keys \
      --query 'SecretString' \
      --output text | jq .
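
The error code in the CLI or SDK message usually indicates which of the checks above matters: Secrets Manager reports a permissions problem as AccessDeniedException and a missing secret as ResourceNotFoundException. A minimal sketch of that triage, with classify_secret_error as a hypothetical helper name:

```shell
# Hypothetical helper: map an AWS error message to the likely runbook step.
# $1 = error text copied from the CLI output or the service logs
classify_secret_error() {
  case "$1" in
    *AccessDeniedException*)
      echo "permissions: check the task role policy" ;;
    *ResourceNotFoundException*)
      echo "missing: check the secret name and region" ;;
    *)
      echo "other: inspect the full error" ;;
  esac
}
```

For example, passing the stderr of a failed get-secret-value call to classify_secret_error points to either the role policy check or the describe-secret check.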
    

Resolution

Grant access to secret:

# Add policy to task role
# (the document is double-quoted so the shell expands the ${...} variables)
aws iam put-role-policy \
  --role-name tradai-ecs-task-role-${ENVIRONMENT} \
  --policy-name secrets-access \
  --policy-document "{
    \"Version\": \"2012-10-17\",
    \"Statement\": [{
      \"Effect\": \"Allow\",
      \"Action\": [\"secretsmanager:GetSecretValue\"],
      \"Resource\": \"arn:aws:secretsmanager:${AWS_REGION}:${ACCOUNT_ID}:secret:tradai/${ENVIRONMENT}/*\"
    }]
  }"


DNS Resolution Issues

Symptoms

  • "Name or service not known" errors
  • DNS lookups failing
  • Intermittent connectivity

Diagnosis

  1. Test from container:

    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container $CONTAINER_NAME \
      --interactive \
      --command "nslookup api.binance.com"
    

  2. Check VPC DNS settings:

    aws ec2 describe-vpc-attribute \
      --vpc-id $VPC_ID \
      --attribute enableDnsSupport
    
    aws ec2 describe-vpc-attribute \
      --vpc-id $VPC_ID \
      --attribute enableDnsHostnames
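
The two attribute queries can be combined into one report that names whichever setting is off. The sketch below assumes the CLI's text output renders booleans as True/False (it lowercases the value before comparing); dns_attribute_enabled and report_vpc_dns are hypothetical helper names.

```shell
# Hypothetical helper: succeed when the given VPC DNS attribute is enabled.
# $1 = VPC ID, $2 = attribute name (e.g. enableDnsSupport),
# $3 = key in the response (e.g. EnableDnsSupport)
dns_attribute_enabled() {
  dns_value=$(aws ec2 describe-vpc-attribute \
    --vpc-id "$1" --attribute "$2" \
    --query "$3.Value" --output text | tr 'A-Z' 'a-z')
  [ "$dns_value" = "true" ]
}

# Hypothetical helper: print one line per disabled DNS attribute.
report_vpc_dns() {
  dns_attribute_enabled "$1" enableDnsSupport EnableDnsSupport ||
    echo "enableDnsSupport is disabled"
  dns_attribute_enabled "$1" enableDnsHostnames EnableDnsHostnames ||
    echo "enableDnsHostnames is disabled"
}
```

No output from `report_vpc_dns "$VPC_ID"` means both attributes are enabled and the cause lies elsewhere.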
    

Resolution

Enable VPC DNS:

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-support

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-hostnames


Verification Checklist

After any infrastructure resolution:

  • [ ] ECS tasks can reach external APIs
  • [ ] Database connectivity restored
  • [ ] Secrets loading correctly
  • [ ] VPC endpoints functional
  • [ ] DNS resolution working
  • [ ] All services reporting healthy
  • [ ] CloudWatch alarms cleared
  • [ ] No error patterns in logs (last 15 minutes)
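
Parts of this checklist can be scripted so that each item prints an explicit PASS or FAIL. The following is a minimal sketch, not an existing tool: run_check and CHECKS_FAILED are hypothetical names, and the commands shown in the comments are illustrative and would need to run from a host or container inside the VPC.

```shell
# Count of failed checks, updated by run_check.
CHECKS_FAILED=0

# Hypothetical helper: run one command and record PASS or FAIL under a label.
# $1 = label, remaining arguments = command to run
run_check() {
  check_name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $check_name"
  else
    echo "FAIL  $check_name"
    CHECKS_FAILED=$((CHECKS_FAILED + 1))
  fi
}

# Illustrative usage (commands are assumptions, adjust to the environment):
#   run_check "External API reachable" curl --silent --fail https://api.binance.com/api/v3/ping
#   run_check "Database port open"     nc -z "$RDS_ENDPOINT" 5432
#   run_check "DNS resolution"         nslookup api.binance.com
#   [ "$CHECKS_FAILED" -eq 0 ] || echo "$CHECKS_FAILED check(s) failed"
```

Items such as cleared CloudWatch alarms and log review still need a manual look.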