Security Incident Response Runbook
Procedures for handling security incidents including credential exposure, unauthorized access, and security policy violations.
Severity Classification
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Active breach, data exposure | Immediate (<15 min) | Compromised AWS creds, unauthorized trading |
| P2 | Suspected compromise | < 1 hour | Suspicious API activity, failed auth spike |
| P3 | Policy violation | < 4 hours | Exposed secrets in logs, missing encryption |
API Key/Secret Exposure
Symptoms
- Secrets found in logs, code commits, or public repositories
- Unexpected API calls from unknown sources
- Exchange reporting unusual activity
Immediate Actions (P1)
- Identify exposed credentials:
- Rotate exchange API keys immediately:
  # 1. Create new API key on exchange (manually via exchange UI)
  # 2. Update Secrets Manager
  aws secretsmanager put-secret-value \
    --secret-id tradai/${ENVIRONMENT}/exchange-keys \
    --secret-string '{"api_key":"NEW_KEY","api_secret":"NEW_SECRET"}'
  # 3. Restart services to pick up new credentials
  aws ecs update-service \
    --cluster tradai-${ENVIRONMENT} \
    --service tradai-strategy-service-${ENVIRONMENT} \
    --force-new-deployment
- Revoke old exchange keys (via exchange UI)
- Check for unauthorized activity:
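The identification and activity checks on the AWS side might be sketched as follows. This assumes the secret follows the `tradai/${ENVIRONMENT}/exchange-keys` naming used in the rotation step and that CloudTrail management events are available; the function names are hypothetical helpers, not part of the project. Activity on the exchange itself (orders, withdrawals, API key usage) still has to be reviewed in the exchange UI.

```shell
# Hypothetical read-only helpers; none of them modify any resource.

# Show the versions of the exchange-keys secret and when each was created.
# $1 = environment name (the ${ENVIRONMENT} value used elsewhere in this runbook)
secret_versions() {
    aws secretsmanager list-secret-version-ids \
        --secret-id "tradai/$1/exchange-keys" \
        --query 'Versions[].{Id:VersionId,Stages:VersionStages,Created:CreatedDate}' \
        --output table
}

# Show who called GetSecretValue since a given time.
# $1 = ISO 8601 start time, e.g. 2024-05-01T00:00:00Z
secret_reads() {
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=GetSecretValue \
        --start-time "$1" \
        --query 'Events[].{Time:EventTime,User:Username,Key:AccessKeyId}' \
        --output table
}

# Find commits on any branch that added or removed a given string.
# $1 = a distinctive fragment of the leaked value (avoid pasting the full secret)
git_leak_search() {
    git log --all --oneline -S"$1"
}
```

Usage: `secret_reads 2024-05-01T00:00:00Z` lists every principal that read a secret in the window, which helps separate service reads from unexpected ones.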
AWS Credential Rotation
If AWS credentials are exposed:
- Disable compromised credentials:
- Create new credentials:
- Review CloudTrail for unauthorized actions:
- Delete compromised credentials (after verification):
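The four steps above map onto IAM access-key operations. A minimal sketch, assuming the exposed credential is an IAM user access key (role or SSO credentials need a different procedure); the function names are illustrative only:

```shell
# Hypothetical helpers for an exposed IAM user access key.

# 1. Disable the key. It stays attached to the user, so this is reversible.
# $1 = IAM user name, $2 = access key ID
disable_key() {
    aws iam update-access-key \
        --user-name "$1" --access-key-id "$2" --status Inactive
}

# 2. Issue a replacement key. The secret is returned only in this response.
# $1 = IAM user name
create_key() {
    aws iam create-access-key --user-name "$1"
}

# 3. List the management events recorded for the compromised key.
# $1 = access key ID
key_activity() {
    aws cloudtrail lookup-events \
        --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=$1" \
        --query 'Events[].{Time:EventTime,Event:EventName,Source:EventSource}' \
        --output table
}

# 4. Delete the old key once nothing legitimate depends on it.
# $1 = IAM user name, $2 = access key ID
delete_key() {
    aws iam delete-access-key --user-name "$1" --access-key-id "$2"
}
```

Disabling before deleting keeps the key ID visible in IAM while the CloudTrail review is under way. Note that `lookup-events` only covers the most recent 90 days of management events.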
JWT Token Invalidation
If the JWT signing key is compromised:
Cognito security configuration (canonical source: infra/shared/tradai_infra_shared/config.py):
| Setting | Value |
|---|---|
| MFA | Required (TOTP) |
| Password minimum length | 12 characters |
| Password requirements | Uppercase + lowercase + numbers + symbols |
| Access token lifetime | 1 hour |
| Refresh token lifetime | 30 days |
- Rotate Cognito app client secret:
- Force sign-out all users:
- Verify MFA is enforced (should always be TOTP):
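The sign-out and MFA checks could look roughly like this. The user pool ID is passed in as an argument because the source does not name it, and the function names are hypothetical:

```shell
# Hypothetical helpers; $1 is the Cognito user pool ID in both functions.

# Sign out every user in the pool, one AdminUserGlobalSignOut call per user.
sign_out_all_users() {
    pool_id=$1
    aws cognito-idp list-users \
        --user-pool-id "$pool_id" \
        --query 'Users[].Username' --output text |
    tr '\t' '\n' |
    while IFS= read -r username; do
        [ -n "$username" ] || continue
        aws cognito-idp admin-user-global-sign-out \
            --user-pool-id "$pool_id" --username "$username"
    done
}

# Print the pool's MFA mode (expected: ON) and whether TOTP is enabled.
mfa_status() {
    aws cognito-idp get-user-pool-mfa-config \
        --user-pool-id "$1" \
        --query '{Mfa:MfaConfiguration,Totp:SoftwareTokenMfaConfiguration.Enabled}'
}
```

Global sign-out revokes refresh tokens, but services that validate JWTs locally may keep accepting an already-issued access token until it expires (up to 1 hour with the lifetime in the table above).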
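CodeArtifact Credential Rotation
CodeArtifact auth tokens expire after 12 hours by default. If credentials are suspected of being compromised or need rotation:
- Issue a new token for legitimate clients. Note that generating a new token does not invalidate earlier ones; a leaked token stays valid until it expires unless the IAM principal that requested it is denied access or has its sessions revoked:
- Refresh all CI/CD pipeline credentials:
- Rotate the domain policy if domain-level access is compromised: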
  # Review current domain policy
  aws codeartifact get-domain-permissions-policy \
    --domain tradai \
    --domain-owner ${AWS_ACCOUNT_ID}
  # Update policy to restrict access
  aws codeartifact put-domain-permissions-policy \
    --domain tradai \
    --domain-owner ${AWS_ACCOUNT_ID} \
    --policy-document file://updated-domain-policy.json
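For the token steps, one way to limit exposure is to request short-lived tokens during and after an incident. A sketch, assuming the `tradai` domain shown above; the function name and the `CODEARTIFACT_AUTH_TOKEN` variable are illustrative, since the source does not say how pipelines consume the token:

```shell
# Hypothetical helper: print a fresh CodeArtifact auth token.
# $1 = AWS account ID that owns the domain
# $2 = lifetime in seconds (optional; 900 is the shortest non-zero value allowed)
codeartifact_token() {
    aws codeartifact get-authorization-token \
        --domain tradai \
        --domain-owner "$1" \
        --duration-seconds "${2:-900}" \
        --query authorizationToken \
        --output text
}
```

Usage: `export CODEARTIFACT_AUTH_TOKEN=$(codeartifact_token "$AWS_ACCOUNT_ID")` in a pipeline step, so each run gets its own 15-minute token instead of reusing a 12-hour one.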
Unauthorized Access Detected
Symptoms
- Unexpected IAM activity in CloudTrail
- Unknown IP addresses in access logs
- Privilege escalation attempts
- Resource creation outside normal patterns
Diagnosis
- Review CloudTrail events:
- Check for new IAM users/roles:
- Check API Gateway access logs: (Not currently configured in infrastructure. API Gateway access logging is not enabled.)
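The CloudTrail and IAM checks might be sketched like this. The helpers are hypothetical and read-only, and the event names in the usage note are examples rather than a complete list:

```shell
# Hypothetical read-only helpers for spotting new or unexpected principals.

# CloudTrail events for one API call name since a given time.
# $1 = event name, e.g. CreateUser; $2 = ISO 8601 start time
events_by_name() {
    aws cloudtrail lookup-events \
        --lookup-attributes "AttributeKey=EventName,AttributeValue=$1" \
        --start-time "$2" \
        --query 'Events[].{Time:EventTime,User:Username,Source:EventSource}' \
        --output table
}

# IAM users sorted by creation date, newest last.
iam_users_by_age() {
    aws iam list-users \
        --query 'Users[].[CreateDate,UserName]' --output text | sort
}

# IAM roles sorted by creation date, newest last.
iam_roles_by_age() {
    aws iam list-roles \
        --query 'Roles[].[CreateDate,RoleName]' --output text | sort
}
```

Usage: run `events_by_name` for `CreateUser`, `CreateAccessKey`, `CreateRole`, and `AttachRolePolicy` over the incident window, then compare against the tail of `iam_users_by_age` and `iam_roles_by_age`.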
ECS Task Isolation
If a container is suspected of being compromised:
- Stop the suspicious task:
- Update security group to deny all egress (preserve for forensics):
- Capture task logs for forensics:
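A sketch of the three isolation steps. The cluster, task, security group, and log group are passed as arguments because the source does not name them, and the function names are hypothetical:

```shell
# Hypothetical helpers for isolating a suspect ECS task.

# Stop the task and record why.
# $1 = cluster name, $2 = task ID or ARN
stop_task() {
    aws ecs stop-task \
        --cluster "$1" --task "$2" \
        --reason "Security incident: suspected compromise"
}

# Remove the allow-all egress rule from a security group.
# Any other egress rules on the group stay in place and must be reviewed separately.
# $1 = security group ID
deny_all_egress() {
    aws ec2 revoke-security-group-egress \
        --group-id "$1" \
        --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
}

# Save the task's CloudWatch log events to a local JSON file.
# $1 = log group name, $2 = log stream name prefix, $3 = output file
save_task_logs() {
    aws logs filter-log-events \
        --log-group-name "$1" \
        --log-stream-name-prefix "$2" \
        --output json > "$3"
}
```

Changing a security group affects every network interface that uses it, so check which other tasks share the group before revoking egress.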
Network Security Breach
Symptoms
- Unexpected outbound connections
- Data exfiltration attempts in VPC Flow Logs
- Unusual traffic patterns
Diagnosis
- Check VPC Flow Logs:
- Look for unexpected destinations:
  # Check for traffic to unexpected IPs
  aws logs filter-log-events \
    --log-group-name /aws/vpc/tradai-flow-logs \
    --start-time $(date -u -v-1H +%s)000 \
    --filter-pattern '[version, accountid, interfaceid, srcaddr, dstaddr != 10.*, srcport, dstport, protocol, packets, bytes, start, end, action=ACCEPT, logstatus]'
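The `date -v` flag used above is BSD/macOS syntax and fails with GNU `date` on most Linux hosts. A sketch of a portable alternative, together with two flow-log checks; the function names are hypothetical, and the log group is the `/aws/vpc/tradai-flow-logs` name from the command above:

```shell
# Hypothetical helpers for flow-log triage.

# Epoch milliseconds for "N seconds ago", using shell arithmetic
# instead of date flags that differ between BSD and GNU.
# $1 = number of seconds to look back
epoch_ms_ago() {
    echo $(( ($(date +%s) - $1) * 1000 ))
}

# Confirm that flow logs exist and are active, and where they are delivered.
flow_logs_status() {
    aws ec2 describe-flow-logs \
        --query 'FlowLogs[].{Id:FlowLogId,Resource:ResourceId,Status:FlowLogStatus,Group:LogGroupName}' \
        --output table
}

# Print flow records that were rejected since a given time.
# $1 = start time in epoch milliseconds
rejected_flows() {
    aws logs filter-log-events \
        --log-group-name /aws/vpc/tradai-flow-logs \
        --start-time "$1" \
        --filter-pattern '[version, accountid, interfaceid, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action="REJECT", logstatus]' \
        --query 'events[].message' --output text
}
```

Usage: `rejected_flows "$(epoch_ms_ago 3600)"` covers the last hour. The field list assumes the default flow log record format.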
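Containment
- Block suspicious IPs via WAF:
- Block the IP at the network layer (security groups support only allow rules, so remove the matching allow rule or add a network ACL deny entry):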
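Both containment steps might be sketched as follows. This assumes a regional WAFv2 web ACL that already references an IP set in a block rule; the IP set name and ID, the network ACL ID, and the function names are hypothetical. Because security groups have no deny rules, the network-level block here uses a network ACL entry instead.

```shell
# Hypothetical containment helpers.

# Append a CIDR to an existing WAFv2 IP set.
# update-ip-set replaces the whole address list, so the current entries are
# read back first. The lock token is fetched before the addresses so that a
# concurrent change makes the update fail instead of silently dropping entries.
# $1 = IP set name, $2 = IP set ID, $3 = CIDR to block, e.g. 203.0.113.7/32
waf_block_ip() {
    lock_token=$(aws wafv2 get-ip-set --name "$1" --scope REGIONAL --id "$2" \
        --query LockToken --output text)
    current=$(aws wafv2 get-ip-set --name "$1" --scope REGIONAL --id "$2" \
        --query 'IPSet.Addresses' --output text)
    # $current is left unquoted on purpose: each address becomes its own argument.
    aws wafv2 update-ip-set --name "$1" --scope REGIONAL --id "$2" \
        --addresses $current "$3" \
        --lock-token "$lock_token"
}

# Add an inbound deny entry to a network ACL.
# $1 = network ACL ID, $2 = unused rule number lower than the allow rules,
# $3 = CIDR to block
nacl_deny_ip() {
    aws ec2 create-network-acl-entry \
        --network-acl-id "$1" \
        --ingress \
        --rule-number "$2" \
        --protocol -1 \
        --cidr-block "$3" \
        --rule-action deny
}
```

Network ACL rules are evaluated in ascending rule-number order, so the deny entry needs a lower number than any allow rule that would otherwise match.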
CloudTrail Audit Analysis
Common Queries
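Failed API calls (lookup-events does not return the error code as a top-level field; it sits inside the CloudTrailEvent JSON payload, so it is extracted here with jq, which must be installed):
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ReadOnly,AttributeValue=false \
--start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[].CloudTrailEvent' --output json \
| jq -r '.[] | fromjson | select(.errorCode != null) | [.eventTime, .eventName, .errorCode, (.userIdentity.arn // "unknown")] | @tsv'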
Root account activity (should be rare):
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=Username,AttributeValue=root \
--start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)
IAM policy changes:
aws cloudtrail lookup-events \
--start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[?contains(EventName, `Policy`)].{Time:EventTime,Event:EventName,User:Username}'
Security group changes:
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventSource,AttributeValue=ec2.amazonaws.com \
--start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[?contains(EventName, `SecurityGroup`)].{Time:EventTime,Event:EventName,User:Username}'
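The `date -u -v-24H` form in these queries is BSD/macOS syntax; GNU `date` on Linux uses `-d` instead. A small wrapper that tries both, offered as a sketch (the function name is hypothetical, and other `date` implementations such as BusyBox may support neither form):

```shell
# Hypothetical helper: print the UTC timestamp for N hours ago in the
# ISO 8601 form accepted by --start-time.
# $1 = number of hours to look back
iso_hours_ago() {
    if date -u -v-1H +%s >/dev/null 2>&1; then
        # BSD/macOS date
        date -u -v-"$1"H +%Y-%m-%dT%H:%M:%SZ
    else
        # GNU date
        date -u -d "$1 hours ago" +%Y-%m-%dT%H:%M:%SZ
    fi
}
```

Usage: `aws cloudtrail lookup-events --start-time "$(iso_hours_ago 24)"`, or `iso_hours_ago 168` for the 7-day root account query.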
Post-Incident Actions
Immediate (within 1 hour)
- Contain the incident (rotate credentials, isolate resources)
- Preserve evidence (logs, configurations, artifacts)
- Notify stakeholders (team lead, security contact)
- Document initial findings
Short-term (within 24 hours)
- Complete root cause analysis
- Verify all compromised credentials rotated
- Review all related resources for tampering
- Check for persistence mechanisms (backdoors, new users/keys)
- Update WAF/security group rules as needed
Long-term (within 7 days)
- Write incident report with timeline
- Identify and implement preventive measures
- Update runbooks with lessons learned
- Review and update monitoring/alerting
- Conduct team debrief
Emergency Contacts
| Role | Contact | When to Escalate |
|---|---|---|
| On-call Engineer | [PagerDuty] | P1/P2 incidents |
| Security Lead | [Contact info] | All P1, suspected breaches |
| AWS Support | [Support case] | AWS resource compromise |
| Exchange Support | [Exchange contact] | Trading account compromise |
Verification Checklist
After security incident resolution:
- All compromised credentials rotated
- Unauthorized access terminated
- No persistent backdoors found
- CloudTrail showing normal activity
- VPC Flow Logs clean
- Services operating normally
- Monitoring alerts configured for recurrence
- Incident documented and reported