Production Deployment Checklist
Comprehensive checklist for deploying TradAI to production.
Pre-Deployment
Code Quality
- [ ] All tests pass: `just test`
- [ ] Type checking passes: `just typecheck`
- [ ] Linting passes: `just lint`
- [ ] 80%+ test coverage for new code
- [ ] No `# type: ignore` comments without justification
- [ ] No hardcoded secrets or credentials
Strategy Validation
- [ ] Preflight validation passes
- [ ] Backtests run successfully on production data sample
- [ ] Sanity checks pass (drawdown limits, trade frequency)
- [ ] Model drift monitoring configured
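The sanity checks in the list above (drawdown limits, trade frequency) could be automated along these lines. This is a minimal sketch, not TradAI's actual validation code: `SanityLimits`, `max_drawdown`, `sanity_check`, and all threshold values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SanityLimits:
    """Hypothetical thresholds; tune them per strategy."""
    max_drawdown: float = 0.25        # largest tolerated peak-to-trough loss (fraction)
    min_trades_per_day: float = 0.1   # below this the strategy is suspiciously idle
    max_trades_per_day: float = 50.0  # above this the strategy is suspiciously busy


def max_drawdown(equity: Sequence[float]) -> float:
    """Return the largest peak-to-trough decline as a fraction of the peak."""
    if not equity:
        raise ValueError("equity curve is empty")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        if peak > 0:
            worst = max(worst, (peak - value) / peak)
    return worst


def sanity_check(
    equity: Sequence[float],
    n_trades: int,
    n_days: int,
    limits: SanityLimits = SanityLimits(),
) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    failures = []
    drawdown = max_drawdown(equity)
    if drawdown > limits.max_drawdown:
        failures.append(
            f"max drawdown {drawdown:.1%} exceeds limit {limits.max_drawdown:.1%}"
        )
    trades_per_day = n_trades / n_days
    if not limits.min_trades_per_day <= trades_per_day <= limits.max_trades_per_day:
        failures.append(f"trade frequency {trades_per_day:.2f}/day is out of range")
    return failures
```

Returning a list of failures rather than raising on the first one lets a pre-deployment job report every violated limit in a single run.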
Documentation
- [ ] README updated for new features
- [ ] API documentation updated
- [ ] Environment variables documented
- [ ] Runbook updated for new failure modes
Infrastructure
AWS Resources
- [ ] VPC and subnets configured
- [ ] Security groups reviewed (least privilege)
- [ ] IAM roles created with minimal permissions
- [ ] S3 buckets created with encryption enabled
- [ ] DynamoDB tables provisioned
- [ ] Secrets Manager secrets created
ECS Configuration
- [ ] Task definitions reviewed
- [ ] Container resource limits set (CPU, memory)
- [ ] Health checks configured
- [ ] Auto-scaling policies defined
- [ ] Service discovery registered
Database
- [ ] RDS/Aurora instance sized appropriately
- [ ] Backup retention configured
- [ ] Multi-AZ enabled for production
- [ ] Connection pooling configured
- [ ] Performance Insights enabled
Security
Authentication
- [ ] Cognito User Pool configured (if applicable)
- [ ] API Gateway authentication enabled
- [ ] JWT validation configured
- [ ] Token expiration settings reviewed
Secrets Management
- [ ] All secrets in AWS Secrets Manager
- [ ] Rotation policies configured
- [ ] No plaintext secrets in environment variables (inject them from Secrets Manager at runtime)
- [ ] No secrets in code or config files
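One way to keep secrets out of code and config files is to read them from Secrets Manager at startup. The sketch below assumes the secret is stored as a JSON string and that the caller passes in a boto3 Secrets Manager client; `load_exchange_credentials` is a hypothetical helper, and the secret name is taken from the environment variable list in this checklist.

```python
import json
from typing import Any


def load_exchange_credentials(
    client: Any, secret_name: str = "tradai/exchange-credentials"
) -> dict:
    """Fetch a secret through a Secrets Manager client and decode it as JSON.

    `client` is expected to be `boto3.client("secretsmanager")`; it is passed
    in so the helper can be exercised with a stub instead of real AWS access.
    Assumption: the secret value is a JSON object stored in `SecretString`.
    """
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def load_from_aws() -> dict:
    """Convenience wrapper that builds the real client (requires boto3 and IAM access)."""
    import boto3  # imported lazily so importing this module does not need boto3

    return load_exchange_credentials(boto3.client("secretsmanager"))
```

Because the client is injected, the IAM role of the ECS task supplies the credentials in production and no access keys need to appear in configuration.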
Network Security
- [ ] Services in private subnets
- [ ] ALB in public subnet (if needed)
- [ ] Security groups restrict ingress
- [ ] VPC flow logs enabled
- [ ] WAF rules configured (if applicable)
Monitoring
CloudWatch
- [ ] Log groups created with retention policy
- [ ] Metric alarms configured:
  - [ ] CPU utilization > 80%
  - [ ] Memory utilization > 80%
  - [ ] Error rate > 1%
  - [ ] Latency p99 > threshold
- [ ] Dashboard created for key metrics
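The CPU and memory alarms above can be created with CloudWatch's `put_metric_alarm`. The following is a sketch under stated assumptions: the services run on ECS and publish the standard `AWS/ECS` utilization metrics, the helper names are hypothetical, and the evaluation window (five one-minute periods) is an illustrative choice rather than a project setting.

```python
from typing import Any


def ecs_utilization_alarm(
    metric_name: str,
    cluster: str,
    service: str,
    sns_topic_arn: str,
    threshold: float = 80.0,
) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch `put_metric_alarm`.

    `metric_name` should be "CPUUtilization" or "MemoryUtilization".
    """
    return {
        "AlarmName": f"{service}-{metric_name}-high",
        "Namespace": "AWS/ECS",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,  # alarm after 5 consecutive breaches (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_utilization_alarms(
    cloudwatch: Any, cluster: str, service: str, sns_topic_arn: str
) -> None:
    """Create the CPU and memory alarms; `cloudwatch` is `boto3.client("cloudwatch")`."""
    for metric in ("CPUUtilization", "MemoryUtilization"):
        cloudwatch.put_metric_alarm(
            **ecs_utilization_alarm(metric, cluster, service, sns_topic_arn)
        )
```

Error-rate and latency alarms depend on which metrics the load balancer or application emits, so they are left out of this sketch.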
Alerting
- [ ] SNS topics configured
- [ ] Alert notifications set up (email, Slack, PagerDuty)
- [ ] On-call rotation defined
- [ ] Escalation policy documented
MLflow
- [ ] Experiment tracking configured
- [ ] Model registry set up
- [ ] Artifact storage configured (S3)
- [ ] Authentication enabled
Data
ArcticDB
- [ ] S3 bucket configured
- [ ] Library created
- [ ] Data backfill completed
- [ ] Data quality validation passed
Exchange Connectivity
- [ ] API keys configured in Secrets Manager
- [ ] Rate limits configured
- [ ] Fallback exchange configured (if applicable)
Deployment
CI/CD Pipeline
- [ ] GitHub Actions / Bitbucket Pipelines configured
- [ ] Docker images built and pushed to ECR
- [ ] Deployment workflow tested
- [ ] Rollback procedure documented and tested
Pulumi Infrastructure
Post-Deployment Verification
- [ ] Health check endpoints responding
- [ ] Services registered in Service Discovery
- [ ] MLflow accessible
- [ ] Data sync working
- [ ] Strategy execution tested
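The first item in the list above, confirming that health check endpoints respond, can be scripted as a small poller. This sketch uses only the standard library; the function names, retry counts, and the idea that a healthy service returns HTTP 200 are assumptions, and the endpoint URL must be supplied by the caller.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give the service time to finish starting
    return False
```

A deployment job could call `wait_until_healthy` once per service and fail the pipeline when it returns `False`; `fetch` and `sleep` are injectable so the retry logic can be tested without a network.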
Rollback Plan
Automated Rollback
Configure ECS to automatically roll back failed deployments.
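One mechanism ECS offers for this is the deployment circuit breaker, which stops a failing rolling deployment and restores the last completed one. The project's actual configuration is not shown on this page, so the snippet below is only a sketch: `enable_circuit_breaker` is a hypothetical helper, and the cluster and service names are borrowed from the manual rollback command in this checklist.

```python
from typing import Any


def enable_circuit_breaker(
    ecs: Any,
    cluster: str = "tradai-cluster",
    service: str = "strategy-service",
) -> Any:
    """Turn on the deployment circuit breaker with automatic rollback.

    `ecs` is expected to be `boto3.client("ecs")`. The setting applies to
    services that use the rolling update (ECS) deployment controller.
    """
    return ecs.update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration={
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    )
```

If the service is managed by Pulumi, the same setting is better declared on the service resource so that it is not overwritten by the next `pulumi up`.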
Manual Rollback
```bash
# Revert to the previous task definition revision
# (replace PREVIOUS_VERSION with the revision number to restore)
aws ecs update-service \
  --cluster tradai-cluster \
  --service strategy-service \
  --task-definition strategy-service:PREVIOUS_VERSION

# Or back up the Pulumi stack state, then force-replace the ECS service resource
pulumi stack export --stack prod > backup.json
pulumi up --stack prod --target-replace urn:pulumi:prod::tradai::aws:ecs/service:Service::strategy-service
```
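Finding the value for `PREVIOUS_VERSION` by hand is error-prone during an incident, so the lookup can be scripted. The sketch below assumes that the task definition family has the same name as the service and that the newest ACTIVE revision is the one currently deployed; both function names are hypothetical.

```python
from typing import Any


def previous_task_definition(ecs: Any, family: str = "strategy-service") -> str:
    """Return the ARN of the second-newest ACTIVE revision in a task definition family."""
    response = ecs.list_task_definitions(
        familyPrefix=family, status="ACTIVE", sort="DESC", maxResults=20
    )
    # familyPrefix also matches longer family names, so keep exact matches only.
    arns = [
        arn
        for arn in response["taskDefinitionArns"]
        if arn.rsplit("/", 1)[-1].rsplit(":", 1)[0] == family
    ]
    if len(arns) < 2:
        raise RuntimeError(f"no earlier ACTIVE revision found for {family!r}")
    return arns[1]


def roll_back(
    ecs: Any,
    cluster: str = "tradai-cluster",
    service: str = "strategy-service",
) -> str:
    """Point the service at the previous revision and return that revision's ARN."""
    target = previous_task_definition(ecs, family=service)
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    return target
```

Here `ecs` is expected to be `boto3.client("ecs")`. If revisions were registered but never deployed, the second-newest revision may not be the last known-good one, so check the returned ARN before relying on it.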
Post-Deployment
Smoke Tests
- [ ] API endpoints responding
- [ ] Health checks passing
- [ ] Backtest job executes successfully
- [ ] Data sync completes
- [ ] Model inference works
Monitoring Check
- [ ] Logs appearing in CloudWatch
- [ ] Metrics being recorded
- [ ] No error spikes in dashboards
- [ ] Performance within expected bounds
Documentation Update
- [ ] Deployment notes recorded
- [ ] Known issues documented
- [ ] Changelog updated
Production Environment Variables
Ensure these are set in production:
```bash
# AWS
AWS_REGION=us-east-1
# (IAM roles handle credentials)

# MLflow
MLFLOW_TRACKING_URI=https://mlflow.your-domain.com
MLFLOW_TRACKING_USERNAME=<from-secrets-manager>
MLFLOW_TRACKING_PASSWORD=<from-secrets-manager>

# Services
BACKEND_EXECUTOR_MODE=stepfunctions
STRATEGY_SERVICE_S3_CONFIG_BUCKET=tradai-config-prod
STRATEGY_SERVICE_EXCHANGE_SECRET_NAME=tradai/exchange-credentials

# Data
DATA_COLLECTION_ARCTIC_S3_BUCKET=tradai-arctic-prod
```
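A startup check can catch a missing variable before the service begins handling work. This is a minimal sketch: the variable names come from the list above, while `REQUIRED_VARS` and `missing_vars` are hypothetical names, and treating whitespace-only values as missing is an assumption.

```python
import os
from typing import Mapping

REQUIRED_VARS = (
    "AWS_REGION",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
    "BACKEND_EXECUTOR_MODE",
    "STRATEGY_SERVICE_S3_CONFIG_BUCKET",
    "STRATEGY_SERVICE_EXCHANGE_SECRET_NAME",
    "DATA_COLLECTION_ARCTIC_S3_BUCKET",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

A service entry point could call `missing_vars()` and exit with a non-zero status when the result is not empty, logging only the variable names and never their values.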