Performance Degradation Runbook¶
Procedures for diagnosing and resolving performance issues across ECS services, Lambda functions, databases, and the application layer.
High API Latency¶
Symptoms¶
- CloudWatch alarm: tradai-{env}-api-latency-high (API responses taking >2 seconds)
- User-reported slowness
- Increased timeout errors
Diagnosis¶
- Check ALB latency metrics:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/tradai-${ENVIRONMENT}/xxxxx \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average \
  --extended-statistics p99

- Check per-service latency:

# Backend service
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=TargetGroup,Value=targetgroup/tradai-backend-${ENVIRONMENT}/xxxxx \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average

- Check request count (to identify load spike):
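A hedged sketch using the ALB RequestCount metric, reusing the load balancer placeholder from the latency check above:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/tradai-${ENVIRONMENT}/xxxxx \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum

A sustained jump in the per-minute sum points to a load spike rather than a slow dependency.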
Common Causes and Fixes¶
| Cause | Indicator | Fix |
|---|---|---|
| Load spike | High request count | Scale out ECS tasks |
| Database slow | RDS CPU high | Check slow queries, scale RDS |
| Cold start | Lambda duration spike | Use provisioned concurrency |
| Memory pressure | ECS memory > 80% | Increase task memory |
| External API slow | Timeout errors | Check exchange API status |
ECS Service Performance¶
CPU/Memory Issues¶
- Check service metrics:

# CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average Maximum

# Memory utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-backend-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average Maximum

- Check running task count:
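One way to do this, assuming the backend service naming used elsewhere in this runbook:

aws ecs describe-services \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-backend-${ENVIRONMENT} \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'

If running is below desired, check service events and stopped tasks before scaling further.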
Scaling ECS Services¶
Manual scale out:
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-backend-${ENVIRONMENT} \
--desired-count 3
Increase task resources (requires new task definition):
# Get current task definition
aws ecs describe-task-definition \
--task-definition tradai-backend-${ENVIRONMENT} \
--query 'taskDefinition.{cpu:cpu,memory:memory}'
# Register new task definition with more resources
# Then update service to use it
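A minimal sketch of that workflow; the cpu/memory values are illustrative, and the read-only fields must be stripped from the exported JSON before re-registering:

# Export the current definition
aws ecs describe-task-definition \
  --task-definition tradai-backend-${ENVIRONMENT} \
  --query 'taskDefinition' > taskdef.json

# Edit taskdef.json: raise "cpu"/"memory" (e.g. "1024"/"2048") and delete the
# read-only fields (taskDefinitionArn, revision, status, requiresAttributes,
# compatibilities, registeredAt, registeredBy)

aws ecs register-task-definition --cli-input-json file://taskdef.json

# Point the service at the new revision (omitting the revision number uses
# the latest ACTIVE revision of the family)
aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --task-definition tradai-backend-${ENVIRONMENT}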
Lambda Performance¶
Cold Start Issues¶
- Check duration metrics:

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average Maximum

- Check init duration (cold start indicator):
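Cold starts show up as an Init Duration field in the function's REPORT log lines. A sketch, assuming the default /aws/lambda/ log group prefix:

aws logs filter-log-events \
  --log-group-name /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern '"Init Duration"' \
  --query 'events[].message'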
Resolution¶
Enable provisioned concurrency (for critical functions):
aws lambda put-provisioned-concurrency-config \
--function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
--qualifier $ALIAS_OR_VERSION \
--provisioned-concurrent-executions 5
Increase memory (also increases CPU):
aws lambda update-function-configuration \
--function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
--memory-size 512
Lambda Timeout Issues¶
- Check timeout errors in the function logs (see the sketch after this list).
- Increase the function timeout (example command below).
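A sketch of both steps, assuming the default /aws/lambda/ log group prefix; the 60-second timeout is only an example value:

# Look for timeout errors in the function logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern '"Task timed out"'

# Raise the timeout (in seconds) only if the work genuinely needs more time
aws lambda update-function-configuration \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --timeout 60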
Database Performance¶
RDS Slow Queries¶
- Check RDS CPU utilization (sketches for all three checks follow this list).
- Check read/write IOPS.
- Find slow queries (if Performance Insights is enabled).
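Sketches for all three checks; the DB instance identifier assumes the tradai-${ENVIRONMENT} naming used in the Resolution below, and the Performance Insights call first looks up the instance's DbiResourceId:

# CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average Maximum

# Read/write IOPS
for METRIC in ReadIOPS WriteIOPS; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name $METRIC \
    --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 60 \
    --statistics Average
done

# Top SQL by load via Performance Insights (only works if it is enabled)
RESOURCE_ID=$(aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].DbiResourceId' --output text)
aws pi describe-dimension-keys \
  --service-type RDS \
  --identifier $RESOURCE_ID \
  --metric db.load.avg \
  --group-by '{"Group": "db.sql"}' \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)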
DynamoDB Throttling¶
- Check throttled requests (see the sketch after this list).
- Check consumed capacity:

aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
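For the throttling check itself, the per-table ReadThrottleEvents / WriteThrottleEvents metrics are one option (same table name as above):

for METRIC in ReadThrottleEvents WriteThrottleEvents; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name $METRIC \
    --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 60 \
    --statistics Sum
done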
Resolution¶
Scale RDS instance:
aws rds modify-db-instance \
--db-instance-identifier tradai-${ENVIRONMENT} \
--db-instance-class db.t4g.small \
--apply-immediately
DynamoDB is on-demand (auto-scales), but check for hot partitions if throttling persists.
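One way to confirm a hot partition is CloudWatch Contributor Insights for DynamoDB, which reports the most-accessed keys. Enabling it on the workflow-state table is shown here as an example (it adds cost and only captures traffic from the point it is enabled):

aws dynamodb update-contributor-insights \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --contributor-insights-action ENABLE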
Backtest Performance¶
Slow Backtests¶
- Check ECS task metrics for strategy tasks (sketches for both checks follow this list).
- Check for memory issues.
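Sketches for both checks, assuming backtests run under the strategy-service named elsewhere in this runbook; a stopped task with exit code 137 or an OutOfMemory stop reason points at an OOM kill:

# CPU/memory for the strategy service
for METRIC in CPUUtilization MemoryUtilization; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ECS \
    --metric-name $METRIC \
    --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-strategy-service-${ENVIRONMENT} \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 60 \
    --statistics Average Maximum
done

# Recently stopped tasks: look for OutOfMemory stop reasons / exit code 137
TASKS=$(aws ecs list-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --desired-status STOPPED \
  --query 'taskArns' --output text)
aws ecs describe-tasks \
  --cluster tradai-${ENVIRONMENT} \
  --tasks $TASKS \
  --query 'tasks[].{reason:stoppedReason,exit:containers[0].exitCode}'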
Common Backtest Performance Issues¶
| Issue | Cause | Fix |
|---|---|---|
| Slow data loading | Large date range | Use incremental loading |
| OOM during backtest | Too many pairs | Reduce pairs per task |
| Slow indicator calc | Inefficient indicators | Optimize indicator code |
| FreqAI slow | Large model | Reduce model complexity |
External Service Latency¶
Exchange API Issues¶
- Check for timeout errors in the data-collection logs (see the sketch after this list).
- Verify exchange status:
- Binance: https://api.binance.com/api/v3/ping
- Check exchange status pages for maintenance
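For the timeout check, the same log-group approach as the rate-limit query below works; the exact error strings are an assumption about what the services log:

aws logs filter-log-events \
  --log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern '?"timeout" ?"timed out" ?"ReadTimeout"'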
Rate Limiting¶
If hitting exchange rate limits:
# Check rate limit errors
aws logs filter-log-events \
--log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
--start-time $(date -u -v-1H +%s)000 \
  --filter-pattern '?"429" ?"rate limit" ?"too many requests"'
Resolution:
- Implement request throttling in code
- Use exchange WebSocket for real-time data
- Cache frequently accessed data
Quick Performance Checks¶
All-in-one health check¶
# ECS services
for SERVICE in backend strategy-service data-collection; do
echo "=== $SERVICE ==="
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-${SERVICE}-${ENVIRONMENT} \
--start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Average \
--query 'Datapoints[0].Average'
done
API response time test¶
# Test each service health endpoint
for PORT in 8000 8003 8002; do
echo "=== Port $PORT ==="
time curl -s http://localhost:$PORT/api/v1/health | jq '.status'
done
Verification Checklist¶
After performance issue resolution:
- [ ] API latency back to normal (<500ms p99)
- [ ] ECS CPU/memory utilization <70%
- [ ] No Lambda timeout errors
- [ ] Database connections stable
- [ ] No throttling on DynamoDB
- [ ] CloudWatch alarms cleared
- [ ] User-facing functionality verified
- [ ] Performance baseline documented