Trading Issues Runbook
Procedures for handling stale heartbeats, stuck positions, and order execution failures.
Stale Heartbeat Alert
Symptoms
- CloudWatch alarm: tradai-{env}-heartbeat-stale
- SNS notification about missing heartbeat
- Trading heartbeat Lambda detected > 2 minutes since last heartbeat
Diagnosis
- Check the trading state in DynamoDB.
- Check whether the strategy service is running.
- Check the strategy service logs.
- Check for connectivity issues:
  - Exchange API connectivity
  - Database connectivity
  - Network issues
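The first three checks can be sketched with the AWS CLI as follows. This is a minimal sketch, not the team's confirmed procedure: the table, cluster, and service names reuse the patterns from the Resolution commands in this runbook, while the function names and the default log group name are hypothetical and should be replaced with the real values.

```shell
# Sketch only. Table, cluster, and service names follow the patterns used in
# this runbook's Resolution commands. The default log group is an ASSUMPTION.

# 1. Read the trading state item for one strategy.
check_trading_state() {
  local strategy_id="$1"
  aws dynamodb get-item \
    --table-name "tradai-trading-state-${ENVIRONMENT}" \
    --key "{\"strategy_id\": {\"S\": \"${strategy_id}\"}}"
}

# 2. Compare desired and running task counts for the strategy service.
check_service_tasks() {
  aws ecs describe-services \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-strategy-service-${ENVIRONMENT}" \
    --query 'services[0].{desired: desiredCount, running: runningCount, status: status}'
}

# 3. Show the last 15 minutes of strategy service logs.
#    Override LOG_GROUP if the service logs elsewhere (hypothetical default).
check_service_logs() {
  local log_group="${LOG_GROUP:-/ecs/tradai-strategy-service-${ENVIRONMENT}}"
  aws logs tail "${log_group}" --since 15m
}
```

With ENVIRONMENT exported, a typical sequence would be check_trading_state STRATEGY_ID, then check_service_tasks, then check_service_logs. A running count of 0 against a non-zero desired count points to the "Service crashed" cause.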
Common Causes
| Cause | Signs | Resolution |
|---|---|---|
| Service crashed | No running tasks | Restart service |
| Exchange API down | Timeout errors in logs | Wait or switch exchange |
| Rate limiting | 429 errors in logs | Reduce request frequency |
| Memory exhaustion | OOM in logs | Increase memory limits |
| Deadlock | Service frozen, no logs | Force restart |
Resolution
Option 1: Force service restart
aws ecs update-service \
--cluster tradai-${ENVIRONMENT} \
--service tradai-strategy-service-${ENVIRONMENT} \
--force-new-deployment
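A forced deployment returns immediately, so it can help to block until the service has settled before moving on to verification. The sketch below uses the ECS waiter from the AWS CLI; the function name is illustrative and the cluster and service names are the ones used in the command above.

```shell
# Sketch: wait until the strategy service reports a stable deployment.
# The waiter exits non-zero if the service does not stabilize before it
# gives up polling.
wait_for_strategy_service() {
  aws ecs wait services-stable \
    --cluster "tradai-${ENVIRONMENT}" \
    --services "tradai-strategy-service-${ENVIRONMENT}"
}
```

Run wait_for_strategy_service right after the update-service command; a non-zero exit suggests the new tasks are failing to start and the logs should be checked.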
Option 2: Stop and restart strategy
# Update state to stopped
aws dynamodb update-item \
--table-name tradai-trading-state-${ENVIRONMENT} \
--key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
--update-expression "SET #status = :stopped" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{":stopped": {"S": "stopped"}}'
# Wait for service to acknowledge
sleep 30
# Update state to running
aws dynamodb update-item \
--table-name tradai-trading-state-${ENVIRONMENT} \
--key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
--update-expression "SET #status = :running" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{":running": {"S": "running"}}'
Verification
After resolution:
1. Verify heartbeat updates within 60 seconds
2. Check strategy is processing signals normally
3. Monitor for 10 minutes to ensure stability
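The heartbeat check in step 1 can be sketched as a small helper. This runbook does not say how the heartbeat is stored, so the sketch assumes an ISO 8601 UTC timestamp string that the operator has already read from the trading state item. It also assumes GNU date for the -d option. The function name is hypothetical.

```shell
# Sketch: succeed (exit 0) if a heartbeat timestamp is recent enough.
# ASSUMPTIONS: the heartbeat is an ISO 8601 UTC string, and GNU date is
# available (needed for `date -d`).
#   $1 = heartbeat timestamp, e.g. 2024-01-01T00:00:00Z
#   $2 = maximum allowed age in seconds (default 60)
#   $3 = current time as epoch seconds (default: now)
heartbeat_is_fresh() {
  local heartbeat_iso="$1"
  local max_age="${2:-60}"
  local now_epoch="${3:-$(date -u +%s)}"
  local heartbeat_epoch age
  # Exit 2 signals an unparseable timestamp rather than a stale one.
  heartbeat_epoch="$(date -u -d "${heartbeat_iso}" +%s)" || return 2
  age=$(( now_epoch - heartbeat_epoch ))
  echo "heartbeat age: ${age}s (limit ${max_age}s)"
  [ "${age}" -le "${max_age}" ]
}
```

For the alarm threshold described under Symptoms, pass 120 as the second argument instead of the default 60.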
Position Stuck Open
Symptoms
- Position open for longer than expected
- Exit signal was generated but order not filled
- Manual review shows position should be closed
Immediate Assessment
- Check position details.
- Check order history.
- Check exchange for actual position:
  - Log into exchange dashboard
  - Verify actual position matches our records
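The position lookup can be sketched with the AWS CLI, using the positions table and key that appear in the Post-Resolution command of this runbook. The function name is illustrative. Where order history is stored is not stated here, so that check is left out.

```shell
# Sketch: fetch our record of a position so it can be compared with the
# exchange dashboard. Table name and key follow the Post-Resolution command.
get_position() {
  local position_id="$1"
  aws dynamodb get-item \
    --table-name "tradai-positions-${ENVIRONMENT}" \
    --key "{\"position_id\": {\"S\": \"${position_id}\"}}" \
    --consistent-read
}
```

The --consistent-read flag requests a strongly consistent read, which avoids acting on a stale copy of the item while the strategy may still be writing to it.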
Resolution Options
Option 1: Wait for next cycle
- If market conditions are acceptable
- Strategy will attempt exit on next signal
Option 2: Manual market order via exchange
- Log into exchange
- Place market order to close position
- Update our records after fill
Option 3: Emergency close via API
# This requires appropriate API access and should be done carefully
# Pseudocode - actual implementation depends on exchange
# 1. Get current position from exchange
# 2. Place market order to close
# 3. Update DynamoDB with closed status
# 4. Log the manual intervention
Post-Resolution
- Update position record:
# Replace POSITION_ID and the closed_at timestamp with the actual values (UTC, ISO 8601)
aws dynamodb update-item \
--table-name tradai-positions-${ENVIRONMENT} \
--key '{"position_id": {"S": "POSITION_ID"}}' \
--update-expression "SET #status = :closed, closed_manually = :true, closed_at = :ts" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{
":closed": {"S": "closed"},
":true": {"BOOL": true},
":ts": {"S": "2024-01-01T00:00:00Z"}
}'
- Create incident report documenting:
  - Why position got stuck
  - Manual intervention taken
  - P&L impact
  - Root cause analysis
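To avoid hand-editing the close timestamp in the position update, the same update-item call can be wrapped so that the timestamp is filled in automatically. This is a sketch under the same table and attribute names as the command in this section; the function name and its optional second argument are illustrative.

```shell
# Sketch: mark a position as manually closed, stamping the current UTC time.
#   $1 = position ID
#   $2 = optional close time (ISO 8601 UTC); defaults to now
mark_position_closed() {
  local position_id="$1"
  local closed_at="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  aws dynamodb update-item \
    --table-name "tradai-positions-${ENVIRONMENT}" \
    --key "{\"position_id\": {\"S\": \"${position_id}\"}}" \
    --update-expression "SET #status = :closed, closed_manually = :true, closed_at = :ts" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values "{\":closed\": {\"S\": \"closed\"}, \":true\": {\"BOOL\": true}, \":ts\": {\"S\": \"${closed_at}\"}}"
}
```

Pass the actual fill time from the exchange as the second argument when it is known, since that is more accurate for the incident report than the time the record was updated.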
Order Execution Failures
Symptoms
- Orders not being filled
- Timeout errors in logs
- Order status stuck in "pending"
Common Causes
| Error | Cause | Resolution |
|---|---|---|
| Insufficient balance | Not enough margin | Reduce position size or add funds |
| Invalid price | Price moved too fast | Use market orders or wider limits |
| Rate limited | Too many requests | Implement backoff |
| Exchange maintenance | Exchange offline | Wait for maintenance to end |
| Invalid symbol | Symbol delisted/changed | Update symbol config |
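The "Implement backoff" resolution in the table refers to retrying with increasing delays. How the strategy service itself retries is not shown in this runbook, so the following is only an illustration of the pattern as a shell wrapper, which can also be useful around operator CLI calls that get throttled. The function name is hypothetical.

```shell
# Sketch: retry a command with exponential backoff.
#   $1 = maximum number of attempts
#   $2 = initial delay in seconds (doubled after each failure)
#   remaining arguments = the command to run
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "${delay}"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, retry_with_backoff 5 2 followed by a command runs it up to five times, sleeping 2, 4, 8, and 16 seconds between attempts.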
Diagnosis
- Check order logs.
- Check exchange status:
  - Exchange status page
  - API health endpoints
- Verify account status:
  - Check if account is frozen
  - Verify API key permissions
  - Check margin/balance
Resolution
For rate limiting:
# Reduce order frequency in config
aws dynamodb update-item \
--table-name tradai-strategy-config-${ENVIRONMENT} \
--key '{"strategy_id": {"S": "STRATEGY_ID"}}' \
--update-expression "SET order_cooldown_seconds = :v" \
--expression-attribute-values '{":v": {"N": "60"}}'
For insufficient balance:
1. Close some positions to free margin
2. Reduce stake amount in strategy config
3. Add funds to exchange account
Emergency Stop
If trading needs to be stopped immediately:
- Stop all strategies:
# Scan for all running strategies
aws dynamodb scan \
--table-name tradai-trading-state-${ENVIRONMENT} \
--filter-expression "#status = :running" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{":running": {"S": "running"}}'
# Update each to stopped with update-item (batch-write-item only puts or deletes whole items, so it cannot change a single attribute)
- Scale down strategy service.
- Notify stakeholders:
  - Send alert via SNS
  - Update status page if available
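The first two emergency steps can be sketched as shell functions. This is a sketch, not a tested procedure: it reuses the table, cluster, and service names from this runbook, assumes strategy IDs contain no whitespace, and assumes that scaling the service to zero tasks is what "scale down" means here. Function names are illustrative.

```shell
# Sketch: set every running strategy to stopped, one update-item per strategy.
# ASSUMPTION: strategy IDs contain no whitespace (they are word-split below).
stop_all_strategies() {
  local table="tradai-trading-state-${ENVIRONMENT}"
  local ids id
  ids="$(aws dynamodb scan \
    --table-name "${table}" \
    --filter-expression "#status = :running" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values '{":running": {"S": "running"}}' \
    --query 'Items[].strategy_id.S' \
    --output text)" || return 1
  for id in ${ids}; do
    # Text output can print "None" when the query matches nothing.
    [ "${id}" = "None" ] && continue
    if aws dynamodb update-item \
      --table-name "${table}" \
      --key "{\"strategy_id\": {\"S\": \"${id}\"}}" \
      --update-expression "SET #status = :stopped" \
      --expression-attribute-names '{"#status": "status"}' \
      --expression-attribute-values '{":stopped": {"S": "stopped"}}'; then
      echo "stopped ${id}"
    else
      echo "FAILED to stop ${id}" >&2
    fi
  done
}

# Sketch: scale the strategy service to zero tasks (ASSUMED target count).
scale_down_strategy_service() {
  aws ecs update-service \
    --cluster "tradai-${ENVIRONMENT}" \
    --service "tradai-strategy-service-${ENVIRONMENT}" \
    --desired-count 0
}
```

Run stop_all_strategies first so the state table reflects the stop, then scale_down_strategy_service. Any "FAILED" lines on stderr identify strategies that still need a manual update.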
Verification Checklist
After any trading issue resolution:
- [ ] All positions accounted for (none stuck)
- [ ] Heartbeat updating normally (< 2 minutes old)
- [ ] Order execution working (test with small order if safe)
- [ ] Strategy processing signals normally
- [ ] No error patterns in logs
- [ ] CloudWatch metrics show healthy state
- [ ] Incident documented in log