Performance Degradation Runbook

Procedures for diagnosing and resolving performance issues across ECS services, Lambda functions, databases, and the application layer.

High API Latency

Symptoms

  • CloudWatch alarm: tradai-{env}-api-latency-high
  • API responses taking >2 seconds
  • User-reported slowness
  • Increased timeout errors

Diagnosis

  1. Check ALB latency metrics:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB \
      --metric-name TargetResponseTime \
      --dimensions Name=LoadBalancer,Value=app/tradai-${ENVIRONMENT}/xxxxx \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average \
      --extended-statistics p99
    

  2. Check per-service latency:

    # Backend service
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB \
      --metric-name TargetResponseTime \
      --dimensions Name=TargetGroup,Value=targetgroup/tradai-backend-${ENVIRONMENT}/xxxxx Name=LoadBalancer,Value=app/tradai-${ENVIRONMENT}/xxxxx \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average
    

  3. Check request count (to identify load spike):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB \
      --metric-name RequestCount \
      --dimensions Name=LoadBalancer,Value=app/tradai-${ENVIRONMENT}/xxxxx \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Sum
    

Common Causes and Fixes

Cause               Indicator               Fix
-----------------   ---------------------   -----------------------------
Load spike          High request count      Scale out ECS tasks
Database slow       RDS CPU high            Check slow queries, scale RDS
Cold start          Lambda duration spike   Use provisioned concurrency
Memory pressure     ECS memory > 80%        Increase task memory
External API slow   Timeout errors          Check exchange API status

ECS Service Performance

CPU/Memory Issues

  1. Check service metrics:

    # CPU utilization
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ECS \
      --metric-name CPUUtilization \
      --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-backend-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average,Maximum
    
    # Memory utilization
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ECS \
      --metric-name MemoryUtilization \
      --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-backend-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average,Maximum
    

  2. Check running task count:

    aws ecs describe-services \
      --cluster tradai-${ENVIRONMENT} \
      --services tradai-backend-${ENVIRONMENT} \
      --query 'services[0].{Running:runningCount,Desired:desiredCount,Pending:pendingCount}'
    

Scaling ECS Services

Manual scale out:

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --desired-count 3
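
After scaling out, you can optionally wait for the service to reach a steady state before re-checking latency:

aws ecs wait services-stable \
  --cluster tradai-${ENVIRONMENT} \
  --services tradai-backend-${ENVIRONMENT}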

Increase task resources (requires new task definition):

# Get current task definition
aws ecs describe-task-definition \
  --task-definition tradai-backend-${ENVIRONMENT} \
  --query 'taskDefinition.{cpu:cpu,memory:memory}'

# Register new task definition with more resources
# Then update service to use it
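
A minimal sketch of those two steps, assuming jq is available (the cpu/memory values and scratch file names below are illustrative, not recommendations):

# Export the current task definition, strip read-only fields,
# and bump cpu/memory (example values)
aws ecs describe-task-definition \
  --task-definition tradai-backend-${ENVIRONMENT} \
  --query 'taskDefinition' > taskdef.json

jq '.cpu = "1024" | .memory = "2048"
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)' \
  taskdef.json > taskdef-new.json

# Register the new revision, then point the service at the latest revision
aws ecs register-task-definition --cli-input-json file://taskdef-new.json

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-backend-${ENVIRONMENT} \
  --task-definition tradai-backend-${ENVIRONMENT}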


Lambda Performance

Cold Start Issues

  1. Check duration metrics:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/Lambda \
      --metric-name Duration \
      --dimensions Name=FunctionName,Value=tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average,Maximum
    

  2. Check init duration (cold start indicator):

    aws logs filter-log-events \
      --log-group-name /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%s)000 \
      --filter-pattern '"Init Duration"'
    

Resolution

Enable provisioned concurrency (for critical functions):

aws lambda put-provisioned-concurrency-config \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --qualifier $ALIAS_OR_VERSION \
  --provisioned-concurrent-executions 5
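
Allocation takes a few minutes; confirm it with:

aws lambda get-provisioned-concurrency-config \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --qualifier $ALIAS_OR_VERSION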

Increase memory (also increases CPU):

aws lambda update-function-configuration \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --memory-size 512

Lambda Timeout Issues

  1. Check timeout errors:

    aws logs filter-log-events \
      --log-group-name /aws/lambda/tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%s)000 \
      --filter-pattern '"Task timed out"'
    

  2. Increase timeout (verify afterward with the check below):

    aws lambda update-function-configuration \
      --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
      --timeout 60
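
After changing the timeout (or memory), confirm the new configuration took effect:

aws lambda get-function-configuration \
  --function-name tradai-${FUNCTION_NAME}-${ENVIRONMENT} \
  --query '{Timeout:Timeout,MemorySize:MemorySize,LastModified:LastModified}'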
    


Database Performance

RDS Slow Queries

  1. Check RDS CPU:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name CPUUtilization \
      --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average,Maximum
    

  2. Check read/write IOPS:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name ReadIOPS \
      --dimensions Name=DBInstanceIdentifier,Value=tradai-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Average
    

  3. Find the top SQL statements by database load (requires Performance Insights):

    aws pi get-resource-metrics \
      --service-type RDS \
      --identifier db-XXXXX \
      --metric-queries '[{"Metric": "db.load.avg", "GroupBy": {"Group": "db.sql_tokenized"}}]' \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period-in-seconds 60
    

DynamoDB Throttling

  1. Check throttled requests:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/DynamoDB \
      --metric-name ThrottledRequests \
      --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Sum
    

  2. Check consumed capacity:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/DynamoDB \
      --metric-name ConsumedReadCapacityUnits \
      --dimensions Name=TableName,Value=tradai-${ENVIRONMENT}-workflow-state \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 60 \
      --statistics Sum
    

Resolution

Scale RDS instance:

aws rds modify-db-instance \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --db-instance-class db.t4g.small \
  --apply-immediately
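
The class change can take several minutes and may briefly interrupt connections; watch progress with:

aws rds describe-db-instances \
  --db-instance-identifier tradai-${ENVIRONMENT} \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Pending:PendingModifiedValues}'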

DynamoDB is in on-demand mode (capacity scales automatically), but on-demand tables can still throttle on hot partition keys, so check the key access pattern if throttling persists.
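
One way to surface hot keys is CloudWatch Contributor Insights for DynamoDB (a sketch; enabling it incurs additional cost):

# Enable Contributor Insights to track the most-accessed partition keys
aws dynamodb update-contributor-insights \
  --table-name tradai-${ENVIRONMENT}-workflow-state \
  --contributor-insights-action ENABLE

# Confirm it is enabled, then review the rules in the CloudWatch console
aws dynamodb describe-contributor-insights \
  --table-name tradai-${ENVIRONMENT}-workflow-state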


Backtest Performance

Slow Backtests

  1. Check backtest completion times in the strategy task logs:

    aws logs filter-log-events \
      --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%s)000 \
      --filter-pattern '"backtest completed"'
    

  2. Check for memory issues:

    aws logs filter-log-events \
      --log-group-name /ecs/tradai-strategy-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%s)000 \
      --filter-pattern 'MemoryError OR "out of memory"'
    

Common Backtest Performance Issues

Issue                 Cause                    Fix
-------------------   ----------------------   -----------------------
Slow data loading     Large date range         Use incremental loading
OOM during backtest   Too many pairs           Reduce pairs per task
Slow indicator calc   Inefficient indicators   Optimize indicator code
FreqAI slow           Large model              Reduce model complexity

External Service Latency

Exchange API Issues

  1. Check for timeout errors:

    aws logs filter-log-events \
      --log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
      --start-time $(date -u -v-1H +%s)000 \
      --filter-pattern '"timeout" OR "connection refused" OR "rate limit"'
    

  2. Verify exchange status:

     • Binance ping endpoint: https://api.binance.com/api/v3/ping (see the curl check below)
     • Check the exchange status pages for scheduled maintenance
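
A quick way to check reachability and response time of the Binance ping endpoint listed above:

curl -s -o /dev/null \
  -w 'HTTP %{http_code}, total %{time_total}s\n' \
  https://api.binance.com/api/v3/ping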

Rate Limiting

If hitting exchange rate limits:

# Check rate limit errors
aws logs filter-log-events \
  --log-group-name /ecs/tradai-data-collection-${ENVIRONMENT} \
  --start-time $(date -u -v-1H +%s)000 \
  --filter-pattern '"429" OR "rate limit" OR "too many requests"'

Resolution:

  • Implement request throttling in code
  • Use the exchange WebSocket streams for real-time data
  • Cache frequently accessed data


Quick Performance Checks

All-in-one health check

# ECS services
for SERVICE in backend strategy-service data-collection; do
  echo "=== $SERVICE ==="
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ECS \
    --metric-name CPUUtilization \
    --dimensions Name=ClusterName,Value=tradai-${ENVIRONMENT} Name=ServiceName,Value=tradai-${SERVICE}-${ENVIRONMENT} \
    --start-time $(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 300 \
    --statistics Average \
    --query 'Datapoints[0].Average'
done

API response time test

# Test each service health endpoint
for PORT in 8000 8003 8002; do
  echo "=== Port $PORT ==="
  time curl -s http://localhost:$PORT/api/v1/health | jq '.status'
done

Verification Checklist

After performance issue resolution:

  • [ ] API latency back to normal (<500ms p99)
  • [ ] ECS CPU/memory utilization <70%
  • [ ] No Lambda timeout errors
  • [ ] Database connections stable
  • [ ] No throttling on DynamoDB
  • [ ] CloudWatch alarms cleared (quick check below)
  • [ ] User-facing functionality verified
  • [ ] Performance baseline documented
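
To confirm no tradai alarms remain in the ALARM state (assuming the tradai-${ENVIRONMENT} naming prefix):

aws cloudwatch describe-alarms \
  --alarm-name-prefix tradai-${ENVIRONMENT} \
  --state-value ALARM \
  --query 'MetricAlarms[].AlarmName'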