Production Deployment Checklist
Comprehensive checklist for deploying TradAI to production.
Pre-Deployment
Code Quality
- All tests pass: just test
- Type checking passes: just typecheck
- Linting passes: just lint
- 80%+ test coverage for new code (CI gate: 60%, target: 80%)
- No # type: ignore comments without justification
- No hardcoded secrets or credentials
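The three `just` recipes above can be chained so that the first failing gate stops the run. The following is a minimal sketch, not part of the TradAI tooling: the function name `run_gates` and the `JUST_BIN` override are assumptions introduced here so the function can be dry-run without `just` installed.

```shell
# run_gates: run each named `just` recipe in order and stop at the first failure.
# JUST_BIN is a hypothetical override (an assumption, not a TradAI setting);
# it defaults to the real `just` binary.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "==> just $gate"
    if ! "${JUST_BIN:-just}" "$gate"; then
      echo "FAILED: just $gate" >&2
      return 1
    fi
  done
  echo "All quality gates passed"
}
```

Calling `run_gates test typecheck lint` runs the recipes listed in this section in the same order.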
Strategy Validation
- Preflight validation passes
- Backtests run successfully on production data sample
- Sanity checks pass (drawdown limits, trade frequency)
- Model drift monitoring configured
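The checklist does not say how the drawdown limit is evaluated. As one possible illustration, the sketch below assumes drawdown means the largest peak-to-trough decline of an equity curve, read as one value per line on standard input. The function name `check_drawdown` and the limit passed to it are hypothetical.

```shell
# check_drawdown LIMIT_PCT: read one equity value per line from stdin,
# print the maximum peak-to-trough drawdown, and return non-zero when it
# exceeds LIMIT_PCT. Assumes a peak-to-trough definition of drawdown.
check_drawdown() {
  local limit_pct="$1"
  awk -v limit="$limit_pct" '
    NR == 1 || $1 > peak { peak = $1 }   # track the running peak
    peak > 0 {
      dd = (peak - $1) / peak * 100      # decline from the peak, in percent
      if (dd > max) max = dd
    }
    END {
      printf "max drawdown: %.2f%% (limit: %.2f%%)\n", max, limit
      exit (max > limit ? 1 : 0)
    }
  '
}
```

For the series 100, 120, 90, 110 the peak is 120 and the trough after it is 90, so the reported drawdown is 25.00%.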
Documentation
- README updated for new features
- API documentation updated
- Environment variables documented
- Runbook updated for new failure modes
Infrastructure
AWS Resources
- VPC and subnets configured
- Security groups reviewed (least privilege)
- IAM roles created with minimal permissions
- S3 buckets created with encryption enabled
- DynamoDB tables provisioned
- Secrets Manager secrets created
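The S3 encryption item can be spot-checked from the AWS CLI. This is a sketch under assumptions: `check_bucket_encryption` and the `AWS_BIN` override are names introduced here for illustration, and the bucket names in the usage note are the ones that appear in the environment variables at the end of this checklist.

```shell
# check_bucket_encryption BUCKET...: ask S3 for each bucket's default
# encryption configuration and fail on the first bucket that returns an error.
# AWS_BIN is a hypothetical override for dry runs (e.g. AWS_BIN=echo).
check_bucket_encryption() {
  local bucket
  for bucket in "$@"; do
    if ! "${AWS_BIN:-aws}" s3api get-bucket-encryption --bucket "$bucket"; then
      echo "no encryption configuration returned for $bucket" >&2
      return 1
    fi
  done
}
```

A typical call would be `check_bucket_encryption tradai-configs-prod tradai-arcticdb-prod`.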
ECS Configuration
- Task definitions reviewed
- Container resource limits set (CPU, memory)
- Health checks configured
- Auto-scaling policies defined
- Service discovery registered
Database
- RDS/Aurora instance sized appropriately
- Backup retention configured
- Multi-AZ enabled for production
- Connection pooling configured
- Performance Insights enabled
Security
Authentication
- Cognito User Pool configured (if applicable)
- API Gateway authentication enabled
- JWT validation configured
- Token expiration settings reviewed
Secrets Management
- All secrets in AWS Secrets Manager
- Rotation policies configured
- Secrets are injected from AWS Secrets Manager into container environment variables at runtime, never stored as plaintext in config files or .env
- No secrets in code or config files
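A quick way to catch one common class of leaked secret is to search the tree for strings shaped like AWS access key IDs (`AKIA` followed by 16 uppercase letters or digits). The sketch below is only a partial check: it does not detect passwords, tokens, or exchange API keys, and the function name `scan_for_aws_keys` is hypothetical.

```shell
# scan_for_aws_keys [DIR]: recursively search DIR (default: current directory)
# for strings that look like AWS access key IDs. Returns non-zero on a match.
# This covers only one secret format; it is not a full secret scanner.
scan_for_aws_keys() {
  local dir="${1:-.}"
  if grep -rnE --exclude-dir=.git 'AKIA[0-9A-Z]{16}' "$dir"; then
    echo "Possible AWS access key ID found under $dir" >&2
    return 1
  fi
  echo "No AWS access key IDs found under $dir"
}
```

Running `scan_for_aws_keys .` from the repository root before a release gives a fast first pass; a dedicated secret scanner is still advisable for broader coverage.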
Network Security
- Services in private subnets
- ALB in public subnet (if needed)
- Security groups restrict ingress
- VPC flow logs enabled
- WAF rules configured (if applicable)
Monitoring
CloudWatch
- Log groups created with retention policy
- Metric alarms configured:
- CPU utilization > 80%
- Memory utilization > 80%
- Error rate > 1%
- Latency p99 > threshold
- Dashboard created for key metrics
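The CPU and memory alarms above map onto the `CPUUtilization` and `MemoryUtilization` metrics in the `AWS/ECS` namespace. The following sketch creates one such alarm with the 80% threshold from the list. The evaluation window (two 5-minute periods), the alarm naming scheme, the function name `create_ecs_alarm`, and the `AWS_BIN` override are assumptions made for illustration.

```shell
# create_ecs_alarm METRIC CLUSTER SERVICE TOPIC_ARN
# Create a CloudWatch alarm that fires when the average of METRIC
# (CPUUtilization or MemoryUtilization) stays above 80% for two consecutive
# 5-minute periods (window length is an assumption) and notifies TOPIC_ARN.
# AWS_BIN is a hypothetical override for dry runs (e.g. AWS_BIN=echo).
create_ecs_alarm() {
  local metric="$1" cluster="$2" service="$3" topic_arn="$4"
  "${AWS_BIN:-aws}" cloudwatch put-metric-alarm \
    --alarm-name "${service}-${metric}-high" \
    --namespace AWS/ECS \
    --metric-name "$metric" \
    --dimensions "Name=ClusterName,Value=${cluster}" "Name=ServiceName,Value=${service}" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}
```

Calling it once with `CPUUtilization` and once with `MemoryUtilization` covers the first two alarms; the error-rate and latency alarms depend on metrics this checklist does not name, so they are left out.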
Alerting
- SNS topics configured
- Alert notifications set up (email, Slack, PagerDuty)
- On-call rotation defined
- Escalation policy documented
MLflow
- Experiment tracking configured
- Model registry set up
- Artifact storage configured (S3)
- Authentication enabled
Data
ArcticDB
- S3 bucket configured
- Library created
- Data backfill completed
- Data quality validation passed
Exchange Connectivity
- API keys configured in Secrets Manager
- Rate limits configured
- Fallback exchange configured (if applicable)
Deployment
CI/CD Pipeline
- GitHub Actions configured
- Docker images built and pushed to ECR
- Deployment workflow tested
- Rollback procedure documented and tested
Pulumi Infrastructure
# Preview all stacks
just infra-preview dev
# Deploy all stacks
just infra-bootstrap dev
# Verify outputs
just infra-outputs compute dev
Post-Deployment Verification
- Health check endpoints responding
- Services registered in Service Discovery
- MLflow accessible
- Data sync working
- Strategy execution tested
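Health endpoints often take a little while to come up after a deployment, so a single request can give a false negative. The sketch below polls a URL until it answers with a success status. The function name `wait_for_health`, the default retry counts, and the `CURL_BIN` override are assumptions; the health endpoint URL itself is not given in this checklist and must be supplied by the caller.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Poll URL until curl gets a successful (non-error) HTTP response, or give up
# after ATTEMPTS tries. CURL_BIN is a hypothetical override for dry runs.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if "${CURL_BIN:-curl}" -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy: $url (attempt $i)"
      return 0
    fi
    echo "not ready: $url (attempt $i/$attempts)" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

The same function can be pointed at each service's health endpoint and at the MLflow URL in turn.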
Rollback Plan
Automated Rollback
Configure ECS to roll back failed deployments automatically.
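One mechanism ECS provides for this is the deployment circuit breaker, which can be enabled on an existing service from the AWS CLI. The sketch below assumes that approach; the project may instead configure it in its Pulumi code. The function name `enable_circuit_breaker` and the `AWS_BIN` override are hypothetical, and the cluster and service names in the usage note are taken from the manual rollback command in this section.

```shell
# enable_circuit_breaker CLUSTER SERVICE
# Turn on the ECS deployment circuit breaker with automatic rollback, so a
# deployment that fails to stabilize reverts to the last completed one.
# AWS_BIN is a hypothetical override for dry runs (e.g. AWS_BIN=echo).
enable_circuit_breaker() {
  local cluster="$1" service="$2"
  "${AWS_BIN:-aws}" ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"
}
```

For example: `enable_circuit_breaker tradai-prod tradai-strategy-service-prod`.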
Manual Rollback
# Revert to the previous task definition
# (replace PREVIOUS_VERSION with the revision number to restore)
aws ecs update-service \
  --cluster tradai-prod \
  --service tradai-strategy-service-prod \
  --task-definition tradai-strategy-service-prod:PREVIOUS_VERSION
# Or replace the service through Pulumi
# Back up the current stack state first
cd infra/compute && source ../.env && pulumi stack export --stack prod > backup.json
# Then force-replace the ECS service resource (still inside infra/compute)
pulumi up --stack prod --target-replace urn:pulumi:prod::tradai::aws:ecs/service:Service::strategy-service
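To fill in PREVIOUS_VERSION, it helps to list the most recent revisions of the task definition family. The sketch below assumes the family name matches the service name used above; `list_recent_revisions` and the `AWS_BIN` override are hypothetical names. Note that the second-newest revision is not necessarily the one that was last running in production, so the result should be checked against deployment notes.

```shell
# list_recent_revisions FAMILY [COUNT]
# Print the newest COUNT (default 5) active task definition ARNs for FAMILY,
# newest first. AWS_BIN is a hypothetical override for dry runs.
list_recent_revisions() {
  local family="$1" count="${2:-5}"
  "${AWS_BIN:-aws}" ecs list-task-definitions \
    --family-prefix "$family" \
    --status ACTIVE \
    --sort DESC \
    --max-items "$count"
}
```

For example: `list_recent_revisions tradai-strategy-service-prod`.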
Post-Deployment
Smoke Tests
- API endpoints responding
- Health checks passing
- Backtest job executes successfully
- Data sync completes
- Model inference works
Monitoring Check
- Logs appearing in CloudWatch
- Metrics being recorded
- No error spikes in dashboards
- Performance within expected bounds
Documentation Update
- Deployment notes recorded
- Known issues documented
- Changelog updated
Production Environment Variables
Ensure these are set in production:
# AWS
AWS_REGION=eu-central-1
# (IAM roles handle credentials)
# MLflow
MLFLOW_TRACKING_URI=https://mlflow.your-domain.com
MLFLOW_TRACKING_USERNAME=<from-secrets-manager>
MLFLOW_TRACKING_PASSWORD=<from-secrets-manager>
# Services
BACKEND_EXECUTOR_MODE=stepfunctions
STRATEGY_SERVICE_S3_CONFIG_BUCKET=tradai-configs-prod
STRATEGY_SERVICE_EXCHANGE_SECRET_NAME=tradai/exchange-credentials
# Data
DATA_COLLECTION_ARCTIC_S3_BUCKET=tradai-arcticdb-prod
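A startup or CI step can verify that none of the variables listed above is missing or empty. This is a minimal sketch; the function name `check_required_env` is hypothetical, and it only checks presence, not whether the values are correct.

```shell
# check_required_env NAME...
# Report every named variable that is unset or empty; return non-zero if any
# is missing. Names are expected to be trusted identifiers, since the lookup
# uses eval for portable variable indirection.
check_required_env() {
  local missing=0 name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call covering this section would be `check_required_env AWS_REGION MLFLOW_TRACKING_URI MLFLOW_TRACKING_USERNAME MLFLOW_TRACKING_PASSWORD BACKEND_EXECUTOR_MODE STRATEGY_SERVICE_S3_CONFIG_BUCKET STRATEGY_SERVICE_EXCHANGE_SECRET_NAME DATA_COLLECTION_ARCTIC_S3_BUCKET`.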