Infrastructure Issues Runbook

Procedures for handling NAT instance failures, VPC connectivity issues, and RDS problems.

NAT Instance Failure

Symptoms

  • ECS tasks cannot reach external APIs (exchanges)
  • Timeout errors when accessing internet
  • CloudWatch alarm: tradai-{env}-nat-unhealthy

Diagnosis

  1. Check NAT instance status:

    aws ec2 describe-instances \
      --filters "Name=tag:Name,Values=tradai-nat-${ENVIRONMENT}" \
      --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,IP:PrivateIpAddress}'
    

  2. Check NAT instance health:

    aws ec2 describe-instance-status \
      --instance-ids $NAT_INSTANCE_ID
    

  3. Verify route table:

    # Get private subnet route table
    aws ec2 describe-route-tables \
      --filters "Name=tag:Name,Values=tradai-private-rt-${ENVIRONMENT}"
    

Resolution

Option 1: Restart NAT instance

# Stop instance
aws ec2 stop-instances --instance-ids $NAT_INSTANCE_ID

# Wait for stopped state
aws ec2 wait instance-stopped --instance-ids $NAT_INSTANCE_ID

# Start instance
aws ec2 start-instances --instance-ids $NAT_INSTANCE_ID

# Wait for running state
aws ec2 wait instance-running --instance-ids $NAT_INSTANCE_ID

Option 2: Replace NAT instance

# Launch new NAT instance from AMI, capturing its ID for the next steps
NEW_NAT_INSTANCE_ID=$(aws ec2 run-instances \
  --image-id ami-XXXXXXXXX \
  --instance-type t3.micro \
  --subnet-id $PUBLIC_SUBNET_ID \
  --security-group-ids $NAT_SG_ID \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=tradai-nat-${ENVIRONMENT}}]" \
  --query 'Instances[0].InstanceId' \
  --output text)

# Disable source/destination checking (required for NAT traffic;
# run-instances has no such flag, so set it after launch)
aws ec2 modify-instance-attribute \
  --instance-id $NEW_NAT_INSTANCE_ID \
  --no-source-dest-check

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --instance-id $NEW_NAT_INSTANCE_ID

# Terminate old instance after verification
aws ec2 terminate-instances --instance-ids $OLD_NAT_INSTANCE_ID

Option 3: Switch to NAT Gateway (recommended for production)

# Create NAT Gateway, capturing its ID for the next steps
NAT_GW_ID=$(aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_ID \
  --allocation-id $ELASTIC_IP_ALLOCATION_ID \
  --tag-specifications "ResourceType=natgateway,Tags=[{Key=Name,Value=tradai-nat-gw-${ENVIRONMENT}}]" \
  --query 'NatGateway.NatGatewayId' \
  --output text)

# Wait for NAT Gateway to be available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_GW_ID

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW_ID

Verification

# Test connectivity from ECS task
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container $CONTAINER_NAME \
  --interactive \
  --command "curl -I https://api.binance.com/api/v3/ping"
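Connectivity can flap for a minute or two after a route-table change, so a single failed probe is not conclusive. A small retry helper (a sketch, not part of the existing tooling) that reruns a probe command until it succeeds or the attempts run out:

```shell
# Retry a command up to $1 times with $2 seconds between attempts.
retry() {
  local tries=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0      # probe succeeded
    sleep "$delay"        # back off before the next attempt
    i=$((i + 1))
  done
  return 1                # all attempts failed
}

# Usage: probe external connectivity for up to a minute (6 x 10s)
#   retry 6 10 curl -fsI https://api.binance.com/api/v3/ping
```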

VPC Endpoint Issues

Symptoms

  • DynamoDB/S3/Secrets Manager access failing
  • Timeout errors for AWS services
  • "Could not connect" errors in logs

Diagnosis

  1. List VPC endpoints:

    aws ec2 describe-vpc-endpoints \
      --filters "Name=vpc-id,Values=$VPC_ID" \
      --query 'VpcEndpoints[].{Service:ServiceName,State:State,ID:VpcEndpointId}'
    

  2. Check endpoint status:

    aws ec2 describe-vpc-endpoints \
      --vpc-endpoint-ids $ENDPOINT_ID
    

  3. Verify security groups:

    # VPC endpoint security group must allow traffic from ECS tasks
    aws ec2 describe-security-groups \
      --group-ids $ENDPOINT_SG_ID
    

Resolution

Fix security group rules:

# Allow HTTPS from ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $ENDPOINT_SG_ID \
  --protocol tcp \
  --port 443 \
  --source-group $ECS_SG_ID
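Note that authorize-security-group-ingress is not idempotent: rerunning it when the rule already exists fails with InvalidPermission.Duplicate. A hypothetical wrapper (not part of the existing tooling) that treats that one error as success, so the fix can be rerun safely:

```shell
# Run an authorize command; swallow only the "rule already exists" error.
authorize_ingress_idempotent() {
  local out
  out=$("$@" 2>&1) && return 0
  case "$out" in
    *InvalidPermission.Duplicate*) return 0 ;;  # rule already present
    *) echo "$out" >&2; return 1 ;;             # genuine failure
  esac
}

# Usage:
#   authorize_ingress_idempotent aws ec2 authorize-security-group-ingress \
#     --group-id $ENDPOINT_SG_ID --protocol tcp --port 443 --source-group $ECS_SG_ID
```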

Recreate VPC endpoint (if corrupted):

# Delete existing endpoint
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID

# Create new endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.${AWS_REGION}.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids $PRIVATE_RT_ID
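The Gateway endpoint type above only applies to S3 and DynamoDB; every other service (Secrets Manager, SSM, ECR, ...) needs an Interface endpoint, which attaches to subnets and a security group rather than route tables. A small sketch of that distinction (subnet/SG variables in the usage comment are assumptions):

```shell
# Map an AWS service short name to the VPC endpoint type it requires.
endpoint_type_for() {
  case "$1" in
    s3|dynamodb) echo Gateway ;;
    *)           echo Interface ;;
  esac
}

# Usage sketch for an Interface-type service:
#   svc=secretsmanager
#   aws ec2 create-vpc-endpoint \
#     --vpc-id $VPC_ID \
#     --service-name com.amazonaws.${AWS_REGION}.${svc} \
#     --vpc-endpoint-type $(endpoint_type_for $svc) \
#     --subnet-ids $PRIVATE_SUBNET_ID \
#     --security-group-ids $ENDPOINT_SG_ID \
#     --private-dns-enabled
```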


RDS Connectivity Issues

Symptoms

  • Database connection errors
  • Timeout connecting to RDS
  • "Connection refused" errors

Diagnosis

  1. Check RDS instance status:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint}'
    

  2. Verify security groups:

    aws rds describe-db-instances \
      --db-instance-identifier tradai-${ENVIRONMENT} \
      --query 'DBInstances[0].VpcSecurityGroups'
    

  3. Check from ECS task:

    # Execute command in running container
    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container $CONTAINER_NAME \
      --interactive \
      --command "nc -zv $RDS_ENDPOINT 5432"
    

Common Causes

| Issue           | Cause                     | Fix                 |
|-----------------|---------------------------|---------------------|
| Security group  | ECS SG not allowed        | Add inbound rule    |
| Wrong endpoint  | Using incorrect hostname  | Update env vars     |
| RDS stopped     | Instance in stopped state | Start instance      |
| RDS maintenance | Scheduled maintenance     | Wait for completion |
| Max connections | Connection pool exhausted | Restart service     |
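The triage in the table above can be sketched as a helper that maps the reported DBInstanceStatus string to a next step (the status values and actions beyond what the table states are assumptions):

```shell
# Map an RDS DBInstanceStatus to the runbook's next action.
rds_next_step() {
  case "$1" in
    available)                        echo "check security groups / endpoint / connection pool" ;;
    stopped)                          echo "start instance" ;;
    maintenance|upgrading|backing-up) echo "wait for completion" ;;
    rebooting|starting)               echo "wait" ;;
    *)                                echo "investigate: $1" ;;
  esac
}

# Usage:
#   STATUS=$(aws rds describe-db-instances \
#     --db-instance-identifier tradai-${ENVIRONMENT} \
#     --query 'DBInstances[0].DBInstanceStatus' --output text)
#   rds_next_step "$STATUS"
```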

Resolution

Fix security group:

# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text)

# Allow ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp \
  --port 5432 \
  --source-group $ECS_SG_ID

Restart RDS (if hung):

aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}


Secrets Manager Access Issues

Symptoms

  • "AccessDeniedException" when loading secrets
  • "SecretNotFoundException" errors
  • Service fails to start due to missing config

Diagnosis

  1. Verify secret exists:

    aws secretsmanager describe-secret \
      --secret-id tradai/${ENVIRONMENT}/api-keys
    

  2. Check ECS task role permissions:

    aws iam get-role-policy \
      --role-name tradai-ecs-task-role-${ENVIRONMENT} \
      --policy-name secrets-access
    

  3. Verify secret value is valid JSON (if applicable):

    aws secretsmanager get-secret-value \
      --secret-id tradai/${ENVIRONMENT}/api-keys \
      --query 'SecretString' \
      --output text | jq .
    
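Beyond being valid JSON, the secret must contain the keys the services actually read. A crude grep-based presence check (the key names in the usage comment are hypothetical, not taken from this runbook):

```shell
# Report any expected key missing from a JSON string; nonzero exit if any.
check_secret_keys() {
  local json=$1 key missing=0
  shift
  for key in "$@"; do
    printf '%s' "$json" | grep -q "\"$key\"" || { echo "missing: $key"; missing=1; }
  done
  return $missing
}

# Usage (assumed key names):
#   SECRET=$(aws secretsmanager get-secret-value \
#     --secret-id tradai/${ENVIRONMENT}/api-keys \
#     --query 'SecretString' --output text)
#   check_secret_keys "$SECRET" binance_api_key binance_api_secret
```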

Resolution

Grant access to secret:

# Add policy to task role. The policy JSON is built with a heredoc so
# that ${AWS_REGION}, ${ACCOUNT_ID} and ${ENVIRONMENT} expand — they
# would not inside a single-quoted string.
aws iam put-role-policy \
  --role-name tradai-ecs-task-role-${ENVIRONMENT} \
  --policy-name secrets-access \
  --policy-document "$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue"],
    "Resource": "arn:aws:secretsmanager:${AWS_REGION}:${ACCOUNT_ID}:secret:tradai/${ENVIRONMENT}/*"
  }]
}
EOF
)"


DNS Resolution Issues

Symptoms

  • "Name or service not known" errors
  • DNS lookups failing
  • Intermittent connectivity

Diagnosis

  1. Test from container:

    aws ecs execute-command \
      --cluster tradai-${ENVIRONMENT} \
      --task $TASK_ARN \
      --container $CONTAINER_NAME \
      --interactive \
      --command "nslookup api.binance.com"
    

  2. Check VPC DNS settings:

    aws ec2 describe-vpc-attribute \
      --vpc-id $VPC_ID \
      --attribute enableDnsSupport
    
    aws ec2 describe-vpc-attribute \
      --vpc-id $VPC_ID \
      --attribute enableDnsHostnames
    

Resolution

Enable VPC DNS:

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-support

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-hostnames


SSM Session Manager Issues

Symptoms

  • aws ssm start-session fails with "TargetNotConnected"
  • Instance not appearing in SSM Fleet Manager
  • ssm describe-instance-information returns empty

Diagnosis

  1. Check instance is running:

    aws ec2 describe-instances \
      --filters "Name=tag:Name,Values=tradai-consolidated-dev" "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,IP:PrivateIpAddress}' \
      --profile tradai --region eu-central-1
    

  2. Check SSM agent registration:

    aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=<instance-id>" \
      --query 'InstanceInformationList[].{ID:InstanceId,Status:PingStatus,Agent:AgentVersion}' \
      --profile tradai --region eu-central-1
    

  3. Check EC2 console output (boot logs):

    aws ec2 get-console-output --instance-id <instance-id> \
      --profile tradai --region eu-central-1 | jq -r '.Output'
    

Common Causes

| Issue                        | Cause                                           | Fix                                                 |
|------------------------------|-------------------------------------------------|-----------------------------------------------------|
| Agent not installed          | Custom AMI missing SSM agent                    | Redeploy compute stack (userdata installs it)       |
| S3 403 on agent RPM download | VPC endpoint policy blocks amazon-ssm-* buckets | Deploy foundation stack (AllowSystemPackageBuckets) |
| No IAM permissions           | Missing AmazonSSMManagedInstanceCore            | Check consolidated role in compute/modules/iam.py   |
| VPC endpoint missing         | No SSM interface endpoints                      | Check foundation/modules/vpc_endpoints.py           |

Resolution

If agent not installed (most common):

# Terminate the instance — ASG will launch a new one with updated userdata
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1

# Wait for new instance (120-180s)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tradai-consolidated-asg-dev \
  --query 'AutoScalingGroups[0].Instances[].{ID:InstanceId,State:LifecycleState}' \
  --profile tradai --region eu-central-1

# Verify SSM registration
aws ssm describe-instance-information \
  --filters "Key=tag:Name,Values=tradai-consolidated-dev" \
  --profile tradai --region eu-central-1
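Unlike EC2, the CLI has no built-in waiter for SSM registration, so polling has to be done by hand. A generic sketch (not part of the existing tooling) that repeatedly runs a probe command until its output matches the wanted value:

```shell
# Poll a probe command until it prints $1, up to $2 tries, $3s apart.
wait_until_equals() {
  local want=$1 tries=$2 delay=$3 i=1 got
  shift 3
  while [ "$i" -le "$tries" ]; do
    got=$("$@" 2>/dev/null)
    [ "$got" = "$want" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to 5 minutes for PingStatus=Online
#   wait_until_equals Online 30 10 \
#     aws ssm describe-instance-information \
#       --filters "Key=InstanceIds,Values=<instance-id>" \
#       --query 'InstanceInformationList[0].PingStatus' --output text \
#       --profile tradai --region eu-central-1
```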

If VPC endpoint policy is blocking:

# Redeploy foundation stack (includes AllowSystemPackageBuckets fix)
just infra-up-foundation dev

# Then terminate instance to pick up changes
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1


Consolidated EC2 Debugging

Symptoms

  • Services not accessible via ALB
  • Health checks failing
  • Docker containers crashing

Diagnosis

  1. SSM into the instance (see SSM section above)

  2. Check containers:

    sudo docker ps                    # Running containers
    sudo docker ps -a                 # Include stopped/crashed
    sudo docker logs backend-api      # Service logs
    sudo docker inspect backend-api   # Full container config
    

  3. Check docker-compose:

    cat /opt/tradai/docker-compose.yml   # Verify config
    sudo docker-compose ps               # Service status
    

  4. Check system resources:

    sudo docker stats --no-stream     # Container CPU/memory
    df -h                             # Disk usage
    free -m                           # System memory
    

Common Causes

| Issue                       | Cause                                | Fix                                               |
|-----------------------------|--------------------------------------|---------------------------------------------------|
| Container exits immediately | Missing env vars or secrets          | Check docker-compose.yml and docker logs          |
| Image pull fails            | ECR auth expired                     | aws ecr get-login-password ... \| docker login    |
| Port conflict               | Stale container from previous deploy | sudo docker rm -f <container>                     |
| Out of memory               | Instance too small                   | Check docker stats, consider larger instance type |
| ArcticDB S3 redirect        | S3 client following 301 redirects    | Known issue; configure correct S3 region          |

Resolution

Restart all services:

cd /opt/tradai
sudo docker-compose down --remove-orphans
sudo docker-compose pull
sudo docker-compose up -d
sudo docker ps
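After a restart, confirm nothing is missing rather than eyeballing docker ps. A small sketch that compares the running container names against an expected list (service names other than backend-api are assumptions; the checklist below only says there should be 4):

```shell
# Print every expected container name absent from the running set.
# $1 is the newline-separated output of `docker ps --format '{{.Names}}'`.
missing_containers() {
  local running=$1 name
  shift
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qx "$name" || echo "$name"
  done
}

# Usage:
#   RUNNING=$(sudo docker ps --format '{{.Names}}')
#   missing_containers "$RUNNING" backend-api <other-services>
```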

Force fresh deploy (new ECR images):

# 1. Push new images from dev machine
just service-push-all

# 2. On EC2 instance via SSM
aws ecr get-login-password --region eu-central-1 | sudo docker login --username AWS --password-stdin <ECR_REGISTRY>
cd /opt/tradai
sudo docker-compose pull
sudo docker-compose up -d --force-recreate

Replace the entire instance:

# Terminate — ASG launches fresh instance with latest launch template
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1

# Or use justfile
just asg-refresh
just asg-status


Verification Checklist

After any infrastructure resolution:

  • [ ] ECS tasks can reach external APIs
  • [ ] Database connectivity restored
  • [ ] Secrets loading correctly
  • [ ] VPC endpoints functional
  • [ ] DNS resolution working
  • [ ] All services reporting healthy
  • [ ] CloudWatch alarms cleared
  • [ ] No error patterns in logs (last 15 minutes)
  • [ ] SSM Session Manager connecting to consolidated EC2
  • [ ] All Docker containers running (sudo docker ps shows 4 services)