# Infrastructure Issues Runbook

Procedures for handling NAT instance failures, VPC connectivity issues, and RDS problems.
## NAT Instance Failure

### Symptoms

- ECS tasks cannot reach external APIs (exchanges)
- Timeout errors when accessing the internet
- CloudWatch alarm: `tradai-{env}-nat-unhealthy`
### Diagnosis

1. Check NAT instance status:
2. Check NAT instance health:
3. Verify the route table:
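The commands for these steps are not included above; a plausible sketch, reusing the variables from the Resolution commands below (`$NAT_INSTANCE_ID` and `$PRIVATE_RT_ID` are assumed to be set):

```bash
# 1. Instance state (should be "running")
aws ec2 describe-instances \
  --instance-ids $NAT_INSTANCE_ID \
  --query 'Reservations[0].Instances[0].State.Name' \
  --output text

# 2. Status checks (both should be "ok")
aws ec2 describe-instance-status \
  --instance-ids $NAT_INSTANCE_ID \
  --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
  --output text

# 3. The default route should point at the NAT instance
aws ec2 describe-route-tables \
  --route-table-ids $PRIVATE_RT_ID \
  --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0']"
```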
### Resolution

**Option 1: Restart the NAT instance**

```bash
# Stop instance
aws ec2 stop-instances --instance-ids $NAT_INSTANCE_ID

# Wait for stopped state
aws ec2 wait instance-stopped --instance-ids $NAT_INSTANCE_ID

# Start instance
aws ec2 start-instances --instance-ids $NAT_INSTANCE_ID

# Wait for running state
aws ec2 wait instance-running --instance-ids $NAT_INSTANCE_ID
```
**Option 2: Replace the NAT instance**

```bash
# Launch new NAT instance from AMI
# (double quotes so ${ENVIRONMENT} expands in the tag value)
aws ec2 run-instances \
  --image-id ami-XXXXXXXXX \
  --instance-type t3.micro \
  --subnet-id $PUBLIC_SUBNET_ID \
  --security-group-ids $NAT_SG_ID \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=tradai-nat-${ENVIRONMENT}}]"

# Disable source/dest check; run-instances has no flag for this,
# it must be set after launch
aws ec2 modify-instance-attribute \
  --instance-id $NEW_NAT_INSTANCE_ID \
  --no-source-dest-check

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --instance-id $NEW_NAT_INSTANCE_ID

# Terminate old instance after verification
aws ec2 terminate-instances --instance-ids $OLD_NAT_INSTANCE_ID
```
**Option 3: Switch to a NAT Gateway (recommended for production)**

```bash
# Create NAT Gateway
# (double quotes so ${ENVIRONMENT} expands in the tag value)
aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_ID \
  --allocation-id $ELASTIC_IP_ALLOCATION_ID \
  --tag-specifications "ResourceType=natgateway,Tags=[{Key=Name,Value=tradai-nat-gw-${ENVIRONMENT}}]"

# Wait for NAT Gateway to be available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_GW_ID

# Update route table
aws ec2 replace-route \
  --route-table-id $PRIVATE_RT_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW_ID
```
### Verification

```bash
# Test connectivity from an ECS task
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container $CONTAINER_NAME \
  --interactive \
  --command "curl -I https://api.binance.com/api/v3/ping"
```
## VPC Endpoint Issues

### Symptoms

- DynamoDB/S3/Secrets Manager access failing
- Timeout errors for AWS services
- "Could not connect" errors in logs
### Diagnosis

1. List VPC endpoints:
2. Check endpoint status:
3. Verify security groups:
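The commands for these steps are missing above; a sketch, assuming `$VPC_ID`, `$ENDPOINT_ID`, and `$ENDPOINT_SG_ID` are set as in the Resolution below:

```bash
# 1. List endpoints in the VPC
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=$VPC_ID \
  --query 'VpcEndpoints[].[VpcEndpointId,ServiceName,State]' \
  --output table

# 2. Check a specific endpoint's state (should be "available")
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids $ENDPOINT_ID \
  --query 'VpcEndpoints[0].State' \
  --output text

# 3. Inspect the endpoint security group's inbound rules
aws ec2 describe-security-groups \
  --group-ids $ENDPOINT_SG_ID \
  --query 'SecurityGroups[0].IpPermissions'
```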
### Resolution

Fix security group rules:

```bash
# Allow HTTPS from ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $ENDPOINT_SG_ID \
  --protocol tcp \
  --port 443 \
  --source-group $ECS_SG_ID
```
Recreate the VPC endpoint (if corrupted):

```bash
# Delete existing endpoint
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID

# Create new endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.${AWS_REGION}.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids $PRIVATE_RT_ID
```
## RDS Connectivity Issues

### Symptoms

- Database connection errors
- Timeout connecting to RDS
- "Connection refused" errors
### Diagnosis

1. Check RDS instance status:
2. Verify security groups:
3. Check from an ECS task:
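No commands are shown for these steps; a sketch, reusing variables from the Resolution below (`$RDS_ENDPOINT` is a hypothetical variable holding the endpoint address printed by step 1):

```bash
# 1. Instance status (should be "available"), plus endpoint and port
aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,Endpoint.Port]' \
  --output text

# 2. Inbound rules on the RDS security group
aws ec2 describe-security-groups \
  --group-ids $RDS_SG \
  --query 'SecurityGroups[0].IpPermissions'

# 3. TCP reachability from inside a running task
aws ecs execute-command \
  --cluster tradai-${ENVIRONMENT} \
  --task $TASK_ARN \
  --container $CONTAINER_NAME \
  --interactive \
  --command "nc -zv $RDS_ENDPOINT 5432"
```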
### Common Causes
| Issue | Cause | Fix |
|---|---|---|
| Security group | ECS SG not allowed | Add inbound rule |
| Wrong endpoint | Using incorrect hostname | Update env vars |
| RDS stopped | Instance in stopped state | Start instance |
| RDS maintenance | Scheduled maintenance | Wait for completion |
| Max connections | Connection pool exhausted | Restart service |
### Resolution

Fix security group:

```bash
# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text)

# Allow ECS tasks
aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp \
  --port 5432 \
  --source-group $ECS_SG_ID
```
Restart RDS (if hung):
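No command is given for this step; the standard approach is a reboot, which causes a brief outage (assuming a single instance named `tradai-${ENVIRONMENT}`, consistent with the rest of this runbook):

```bash
# Reboot the instance (brief outage)
aws rds reboot-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT}

# Wait until it is available again
aws rds wait db-instance-available \
  --db-instance-identifier tradai-${ENVIRONMENT}
```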
## Secrets Manager Access Issues

### Symptoms

- "AccessDeniedException" when loading secrets
- "SecretNotFoundException" errors
- Service fails to start due to missing config
### Diagnosis

1. Verify the secret exists:
2. Check ECS task role permissions:
3. Verify the secret value is valid JSON (if applicable):
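The commands for these steps are not shown; a sketch using the role name from the Resolution below. The secret name `tradai/${ENVIRONMENT}/app-config` is a hypothetical example matching the `tradai/${ENVIRONMENT}/*` pattern in the policy, not a confirmed name:

```bash
# 1. Does the secret exist? (secret name is an assumption)
aws secretsmanager describe-secret \
  --secret-id tradai/${ENVIRONMENT}/app-config

# 2. What inline policies does the task role have?
aws iam list-role-policies \
  --role-name tradai-ecs-task-role-${ENVIRONMENT}
aws iam get-role-policy \
  --role-name tradai-ecs-task-role-${ENVIRONMENT} \
  --policy-name secrets-access

# 3. Does the value parse as JSON?
aws secretsmanager get-secret-value \
  --secret-id tradai/${ENVIRONMENT}/app-config \
  --query SecretString --output text | python3 -m json.tool
```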
### Resolution

Grant access to the secret:

```bash
# Add policy to task role; the heredoc lets ${AWS_REGION}, ${ACCOUNT_ID},
# and ${ENVIRONMENT} expand (they would not inside single quotes)
aws iam put-role-policy \
  --role-name tradai-ecs-task-role-${ENVIRONMENT} \
  --policy-name secrets-access \
  --policy-document "$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue"],
    "Resource": "arn:aws:secretsmanager:${AWS_REGION}:${ACCOUNT_ID}:secret:tradai/${ENVIRONMENT}/*"
  }]
}
EOF
)"
```
## DNS Resolution Issues

### Symptoms

- "Name or service not known" errors
- DNS lookups failing
- Intermittent connectivity
### Diagnosis

1. Test from a container:
2. Check VPC DNS settings:
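The commands for these steps are missing; a sketch (the in-container commands assume a shell obtained via `aws ecs execute-command`, as in the NAT Verification section):

```bash
# 1. From inside a container: does resolution work, and which resolver is used?
nslookup api.binance.com
cat /etc/resolv.conf   # should point at the VPC resolver

# 2. VPC DNS attributes (both should report "Value": true)
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID \
  --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID \
  --attribute enableDnsHostnames
```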
### Resolution

Enable VPC DNS:

```bash
aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-support

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-hostnames
```
## SSM Session Manager Issues

### Symptoms

- `aws ssm start-session` fails with "TargetNotConnected"
- Instance not appearing in SSM Fleet Manager
- `ssm describe-instance-information` returns empty
### Diagnosis

1. Check the instance is running:
2. Check SSM agent registration:
3. Check EC2 console output (boot logs):
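The commands for these steps are not shown; a sketch, reusing the tag names, profile, and region that appear in the Resolution below:

```bash
# 1. Is the instance running?
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=tradai-consolidated-dev" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output text --profile tradai --region eu-central-1

# 2. Is it registered with SSM? (empty output means the agent never connected)
aws ssm describe-instance-information \
  --filters "Key=tag:Name,Values=tradai-consolidated-dev" \
  --profile tradai --region eu-central-1

# 3. Boot logs: look for SSM agent install/start errors in userdata output
aws ec2 get-console-output --instance-id <instance-id> \
  --output text --profile tradai --region eu-central-1
```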
### Common Causes

| Issue | Cause | Fix |
|---|---|---|
| Agent not installed | Custom AMI missing SSM agent | Redeploy compute stack (userdata installs it) |
| S3 403 on agent RPM download | VPC endpoint policy blocks `amazon-ssm-*` buckets | Deploy foundation stack (`AllowSystemPackageBuckets`) |
| No IAM permissions | Missing `AmazonSSMManagedInstanceCore` | Check consolidated role in `compute/modules/iam.py` |
| VPC endpoint missing | No SSM interface endpoints | Check `foundation/modules/vpc_endpoints.py` |
### Resolution

If the agent is not installed (most common):

```bash
# Terminate the instance; the ASG will launch a new one with updated userdata
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1

# Wait for the new instance (120-180s)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tradai-consolidated-asg-dev \
  --query 'AutoScalingGroups[0].Instances[].{ID:InstanceId,State:LifecycleState}' \
  --profile tradai --region eu-central-1

# Verify SSM registration
aws ssm describe-instance-information \
  --filters "Key=tag:Name,Values=tradai-consolidated-dev" \
  --profile tradai --region eu-central-1
```
If the VPC endpoint policy is blocking:

```bash
# Redeploy foundation stack (includes the AllowSystemPackageBuckets fix)
just infra-up-foundation dev

# Then terminate the instance to pick up changes
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1
```
## Consolidated EC2 Debugging

### Symptoms

- Services not accessible via ALB
- Health checks failing
- Docker containers crashing
### Diagnosis

1. SSM into the instance (see SSM section above)
2. Check containers:
3. Check docker-compose:
4. Check system resources:
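The commands for steps 2-4 are not shown; a sketch of what to run on the instance once connected via SSM (`<container>` is a placeholder for a container name from `docker ps`):

```bash
# 2. Containers and recent logs
sudo docker ps -a
sudo docker logs --tail 100 <container>

# 3. docker-compose service state and logs
cd /opt/tradai
sudo docker-compose ps
sudo docker-compose logs --tail 50

# 4. System resources: memory, disk, CPU
free -h
df -h
top -bn1 | head -20
```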
### Common Causes
| Issue | Cause | Fix |
|---|---|---|
| Container exits immediately | Missing env vars or secrets | Check docker-compose.yml and docker logs |
| Image pull fails | ECR auth expired | aws ecr get-login-password ... \| docker login |
| Port conflict | Stale container from previous deploy | sudo docker rm -f <container> |
| Out of memory | Instance too small | Check docker stats, consider larger instance type |
| ArcticDB S3 redirect | S3 client following 301 redirects | Known issue, configure correct S3 region |
### Resolution

Restart all services:

```bash
cd /opt/tradai
sudo docker-compose down --remove-orphans
sudo docker-compose pull
sudo docker-compose up -d
sudo docker ps
```
Force a fresh deploy (new ECR images):

```bash
# 1. Push new images from the dev machine
just service-push-all

# 2. On the EC2 instance via SSM
aws ecr get-login-password --region eu-central-1 | \
  sudo docker login --username AWS --password-stdin <ECR_REGISTRY>
cd /opt/tradai
sudo docker-compose pull
sudo docker-compose up -d --force-recreate
```
Replace the entire instance:

```bash
# Terminate; the ASG launches a fresh instance from the latest launch template
aws ec2 terminate-instances --instance-ids <instance-id> \
  --profile tradai --region eu-central-1

# Or use the justfile
just asg-refresh
just asg-status
```
## Verification Checklist

After any infrastructure resolution:

- [ ] ECS tasks can reach external APIs
- [ ] Database connectivity restored
- [ ] Secrets loading correctly
- [ ] VPC endpoints functional
- [ ] DNS resolution working
- [ ] All services reporting healthy
- [ ] CloudWatch alarms cleared
- [ ] No error patterns in logs (last 15 minutes)
- [ ] SSM Session Manager connecting to consolidated EC2
- [ ] All Docker containers running (`sudo docker ps` shows 4 services)