Trading Issues Runbook

Procedures for handling stale heartbeats, stuck positions, and order execution failures.

Stale Heartbeat Alert

Symptoms

  • CloudWatch alarm: tradai-{env}-heartbeat-stale
  • SNS notification about missing heartbeat
  • Trading heartbeat Lambda reports more than 2 minutes since the last heartbeat

Diagnosis

  1. Check trading state in DynamoDB:

    aws dynamodb get-item \
      --table-name tradai-trading-state-${ENVIRONMENT} \
      --key '{"strategy_id": {"S": "STRATEGY_ID"}}'
    

  2. Check if strategy service is running:

    aws ecs describe-services \
      --cluster tradai-${ENVIRONMENT} \
      --services tradai-strategy-service-${ENVIRONMENT}
    

  3. Check strategy service logs:

    aws logs tail /ecs/tradai-strategy-service-${ENVIRONMENT} --follow
    

  4. Check for connectivity issues:

    • Exchange API connectivity
    • Database connectivity
    • Network issues
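
Steps 2 and 4 can be wrapped in two small helpers for repeated use during an incident. This is a minimal sketch, not the team's tooling: the function names are made up here, and the health URL you pass to `endpoint_reachable` is an assumption, since this runbook does not name an exchange endpoint.

```shell
# Sketch only: helper names are illustrative and not part of this runbook.

# Print "running/desired" task counts for the strategy service.
service_task_counts() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-strategy-service-${ENVIRONMENT}" \
    --query 'services[0].[runningCount,desiredCount]' \
    --output text | awk '{ print $1 "/" $2 }'
}

# Return 0 if the URL answers with an HTTP success status within 5 seconds.
# The URL is supplied by the caller; no exchange endpoint is assumed here.
endpoint_reachable() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}
```

A result such as `0/1` from `service_task_counts` points at the "Service crashed" cause, while a failing `endpoint_reachable` call suggests an exchange or network problem.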

Common Causes

Cause             | Signs                   | Resolution
Service crashed   | No running tasks        | Restart service
Exchange API down | Timeout errors in logs  | Wait or switch exchange
Rate limiting     | 429 errors in logs      | Reduce request frequency
Memory exhaustion | OOM in logs             | Increase memory limits
Deadlock          | Service frozen, no logs | Force restart
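
When scanning a large volume of logs, the signs listed above can be matched mechanically. The sketch below is a rough triage aid only; the match patterns are assumptions about the log format, which this runbook does not specify, and a bare `429` may also match unrelated numbers.

```shell
# Map a log line to a likely cause from the Common Causes table.
# Patterns are assumptions; adjust them to the service's real log format.
classify_log_line() {
  case "$1" in
    *429*)                         echo "rate limiting" ;;
    *OOM*|*"out of memory"*)       echo "memory exhaustion" ;;
    *[Tt]imeout*|*"timed out"*)    echo "exchange API down" ;;
    *)                             echo "unknown" ;;
  esac
}
```

Piping recent log lines through this function and counting the results with `sort | uniq -c` gives a quick picture of which cause dominates.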

Resolution

Option 1: Force service restart

aws ecs update-service \
  --cluster tradai-${ENVIRONMENT} \
  --service tradai-strategy-service-${ENVIRONMENT} \
  --force-new-deployment
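
If it helps to block until the redeployment has settled, the restart can be combined with the AWS CLI waiter `aws ecs wait services-stable`. This is a sketch; the function name is illustrative.

```shell
# Force a new deployment, then wait until ECS reports the service as stable.
# The waiter polls and exits non-zero if stability is not reached in time.
restart_strategy_service() {
  aws ecs update-service \
    --cluster "tradai-${ENVIRONMENT}" \
    --service "tradai-strategy-service-${ENVIRONMENT}" \
    --force-new-deployment > /dev/null &&
  aws ecs wait services-stable \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-strategy-service-${ENVIRONMENT}"
}
```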

Option 2: Stop and restart strategy

# Update state to stopped
aws dynamodb update-item \
  --table-name tradai-trading-state-${ENVIRONMENT} \
  --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
  --update-expression "SET #status = :stopped" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":stopped": {"S": "stopped"}}'

# Wait for service to acknowledge
sleep 30

# Update state to running
aws dynamodb update-item \
  --table-name tradai-trading-state-${ENVIRONMENT} \
  --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
  --update-expression "SET #status = :running" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":running": {"S": "running"}}'
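
The two updates above differ only in the status value, so they can be folded into one helper. A minimal sketch: the function names and the `ACK_WAIT_SECONDS` override are hypothetical, and the default wait is the 30 seconds used above.

```shell
# Set the status attribute for one strategy in the trading-state table.
set_strategy_status() {
  strategy_id=$1
  new_status=$2
  aws dynamodb update-item \
    --table-name "tradai-trading-state-${ENVIRONMENT}" \
    --key "{\"strategy_id\": {\"S\": \"${strategy_id}\"}}" \
    --update-expression "SET #status = :s" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values "{\":s\": {\"S\": \"${new_status}\"}}"
}

# Stop, pause so the service can acknowledge, then resume.
# ACK_WAIT_SECONDS is a hypothetical override; it defaults to 30 seconds.
bounce_strategy() {
  set_strategy_status "$1" stopped &&
  sleep "${ACK_WAIT_SECONDS:-30}" &&
  set_strategy_status "$1" running
}
```

Usage would look like `bounce_strategy STRATEGY_ID`. The second update is skipped if the first one fails.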

Verification

After resolution:

  1. Verify heartbeat updates within 60 seconds
  2. Check strategy is processing signals normally
  3. Monitor for 10 minutes to ensure stability
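
The 60-second check comes down to comparing two timestamps. The sketch below takes epoch seconds as arguments; reading the last-heartbeat time from DynamoDB is deliberately left out, because the attribute name is not given in this runbook.

```shell
# Return 0 if the heartbeat is at most max_age seconds old (default 60).
# Arguments: last heartbeat (epoch s), current time (epoch s), optional max age.
heartbeat_is_fresh() {
  last_epoch=$1
  now_epoch=$2
  max_age=${3:-60}
  [ $(( now_epoch - last_epoch )) -le "$max_age" ]
}
```

For example, `heartbeat_is_fresh "$LAST" "$(date +%s)"` succeeds only when the heartbeat is within the 60-second window, and passing `120` as the third argument checks against the 2-minute alarm threshold instead.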


Position Stuck Open

Symptoms

  • Position open for longer than expected
  • Exit signal was generated but order not filled
  • Manual review shows position should be closed

Immediate Assessment

  1. Check position details:

    aws dynamodb get-item \
      --table-name tradai-positions-${ENVIRONMENT} \
      --key '{"position_id": {"S": "POSITION_ID"}}'
    

  2. Check order history:

    aws dynamodb query \
      --table-name tradai-orders-${ENVIRONMENT} \
      --key-condition-expression "position_id = :pid" \
      --expression-attribute-values '{":pid": {"S": "POSITION_ID"}}'
    

  3. Check exchange for actual position:

    • Log into exchange dashboard
    • Verify actual position matches our records
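
Comparing the recorded size with the size shown on the exchange is easy to get wrong by eye when many decimals are involved. The helper below is a sketch that treats both sizes as decimal strings; the tolerance default is an assumption and should match the instrument's precision.

```shell
# Return 0 if the recorded and exchange position sizes agree within a tolerance.
# Arguments: recorded size, exchange size, optional tolerance.
positions_match() {
  awk -v a="$1" -v b="$2" -v tol="${3:-0.00000001}" 'BEGIN {
    d = a - b
    if (d < 0) d = -d
    exit ((d <= tol) ? 0 : 1)
  }'
}
```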

Resolution Options

Option 1: Wait for next cycle

  • If market conditions are acceptable
  • Strategy will attempt exit on next signal

Option 2: Manual market order via exchange

  • Log into exchange
  • Place market order to close position
  • Update our records after fill

Option 3: Emergency close via API

# This requires appropriate API access and should be done carefully
# Pseudocode - actual implementation depends on exchange

# 1. Get current position from exchange
# 2. Place market order to close
# 3. Update DynamoDB with closed status
# 4. Log the manual intervention
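
The exchange calls in steps 1 and 2 depend on the exchange and are not sketched here. The exchange-independent parts, choosing the side of the closing order and recording the intervention (step 4), could look like the following. The `long`/`short` labels, the log path, and the log format are all assumptions.

```shell
# Derive the side of the market order that closes a position.
# "long"/"short" are assumed labels; this runbook does not define the schema.
closing_side() {
  case "$1" in
    long)  echo "sell" ;;
    short) echo "buy" ;;
    *)     echo "unknown position side: $1" >&2; return 1 ;;
  esac
}

# Append a one-line audit record of the manual intervention.
# Arguments: position ID, optional log file (hypothetical default path).
log_manual_close() {
  printf '%s position=%s action=manual_close operator=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "${USER:-unknown}" \
    >> "${2:-manual-interventions.log}"
}
```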

Post-Resolution

  1. Update position record:

    # Replace POSITION_ID and the closed_at timestamp with the actual values (UTC, ISO 8601)
    aws dynamodb update-item \
      --table-name tradai-positions-${ENVIRONMENT} \
      --key '{"position_id": {"S": "POSITION_ID"}}' \
      --update-expression "SET #status = :closed, closed_manually = :true, closed_at = :ts" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{
        ":closed": {"S": "closed"},
        ":true": {"BOOL": true},
        ":ts": {"S": "2024-01-01T00:00:00Z"}
      }'
    

  2. Create incident report documenting:

    • Why position got stuck
    • Manual intervention taken
    • P&L impact
    • Root cause analysis
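
The `closed_at` value in step 1 is a fixed example timestamp. One way to avoid pasting a stale value is to generate the attribute-values JSON with the current UTC time. This is a sketch with an illustrative function name.

```shell
# Build the --expression-attribute-values JSON for the position update.
# Uses the timestamp passed as $1, or the current UTC time if none is given.
closed_values_json() {
  printf '{":closed": {"S": "closed"}, ":true": {"BOOL": true}, ":ts": {"S": "%s"}}' \
    "${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
}
```

The result can be passed as `--expression-attribute-values "$(closed_values_json)"`. If the fill happened earlier, pass the actual fill time as the argument.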

Order Execution Failures

Symptoms

  • Orders not being filled
  • Timeout errors in logs
  • Order status stuck in "pending"

Common Causes

Error                | Cause                   | Resolution
Insufficient balance | Not enough margin       | Reduce position size or add funds
Invalid price        | Price moved too fast    | Use market orders or wider limits
Rate limited         | Too many requests       | Implement backoff
Exchange maintenance | Exchange offline        | Wait for maintenance to end
Invalid symbol       | Symbol delisted/changed | Update symbol config

Diagnosis

  1. Check order logs:

    # -v-1H is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%s000
    aws logs filter-log-events \
      --log-group-name /ecs/tradai-strategy-service-${ENVIRONMENT} \
      --filter-pattern "order" \
      --start-time $(date -u -v-1H +%s000)
    

  2. Check exchange status:

    • Exchange status page
    • API health endpoints

  3. Verify account status:

    • Check if account is frozen
    • Verify API key permissions
    • Check margin/balance
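
The `--start-time` value in step 1 uses `date` flags that differ between BSD/macOS and GNU/Linux. A portable alternative is to do the subtraction in shell arithmetic. The function name is illustrative.

```shell
# Print epoch milliseconds for N hours ago.
# Only `date +%s` is used, which behaves the same with BSD and GNU date.
epoch_ms_hours_ago() {
  echo $(( ( $(date +%s) - $1 * 3600 ) * 1000 ))
}
```

With this helper, the log query can use `--start-time "$(epoch_ms_hours_ago 1)"` on either platform.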

Resolution

For rate limiting:

# Reduce order frequency in config
aws dynamodb update-item \
  --table-name tradai-strategy-config-${ENVIRONMENT} \
  --key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
  --update-expression "SET order_cooldown_seconds = :v" \
  --expression-attribute-values '{":v": {"N": "60"}}'
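
Alongside the cooldown setting, the "implement backoff" resolution can be illustrated with a generic retry wrapper. This is a sketch of exponential backoff for ad hoc commands run by an operator. It is not how the strategy service itself retries, which this runbook does not describe.

```shell
# Retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    # Double the wait before the next attempt.
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_with_backoff 5 2 aws dynamodb get-item ...` waits 2, 4, 8, and 16 seconds between its five attempts.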

For insufficient balance:

  1. Close some positions to free margin
  2. Reduce stake amount in strategy config
  3. Add funds to exchange account


Emergency Stop

If trading needs to be stopped immediately:

  1. Stop all strategies:

    # Scan for all running strategies
    aws dynamodb scan \
      --table-name tradai-trading-state-${ENVIRONMENT} \
      --filter-expression "#status = :running" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":running": {"S": "running"}}'
    
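    # Update each to stopped with update-item, one call per strategy
    # (batch-write-item only puts or deletes whole items; it cannot update an attribute)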
    

  2. Scale down strategy service:

    aws ecs update-service \
      --cluster tradai-${ENVIRONMENT} \
      --service tradai-strategy-service-${ENVIRONMENT} \
      --desired-count 0
    

  3. Notify stakeholders:

    • Send alert via SNS
    • Update status page if available
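
Steps 1 and 3 can be scripted so that nothing is missed under pressure. The sketch below sets every running strategy to stopped with one `update-item` call each, then publishes an SNS notice. `ALERT_TOPIC_ARN`, the subject line, and the function names are assumptions; this runbook does not name the SNS topic.

```shell
# Print the IDs of all strategies whose status is "running", one per line.
running_strategy_ids() {
  aws dynamodb scan \
    --table-name "tradai-trading-state-${ENVIRONMENT}" \
    --filter-expression "#status = :running" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values '{":running": {"S": "running"}}' \
    --query 'Items[].strategy_id.S' \
    --output text | tr '\t' '\n'
}

# Set every running strategy to stopped, one update-item call per strategy.
stop_all_running_strategies() {
  running_strategy_ids | while read -r strategy_id; do
    # Skip blank lines (for example, when no strategy is running).
    [ -n "$strategy_id" ] || continue
    aws dynamodb update-item \
      --table-name "tradai-trading-state-${ENVIRONMENT}" \
      --key "{\"strategy_id\": {\"S\": \"${strategy_id}\"}}" \
      --update-expression "SET #status = :stopped" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":stopped": {"S": "stopped"}}' < /dev/null
  done
}

# Publish a notice to an SNS topic.
# ALERT_TOPIC_ARN is a hypothetical variable holding the topic ARN.
notify_stakeholders() {
  aws sns publish \
    --topic-arn "$ALERT_TOPIC_ARN" \
    --subject "Trading emergency stop (${ENVIRONMENT})" \
    --message "$1"
}
```

Run `stop_all_running_strategies` before scaling the service to zero, then call `notify_stakeholders` with a short description of why trading was halted.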

Verification Checklist

After any trading issue resolution:

  • [ ] All positions accounted for (none stuck)
  • [ ] Heartbeat updating normally (< 2 minutes old)
  • [ ] Order execution working (test with small order if safe)
  • [ ] Strategy processing signals normally
  • [ ] No error patterns in logs
  • [ ] CloudWatch metrics show healthy state
  • [ ] Incident documented in log