Production Deployment Checklist

Comprehensive checklist for deploying TradAI to production.


Pre-Deployment

Code Quality

  • [ ] All tests pass: just test
  • [ ] Type checking passes: just typecheck
  • [ ] Linting passes: just lint
  • [ ] 80%+ test coverage for new code
  • [ ] No # type: ignore comments without justification
  • [ ] No hardcoded secrets or credentials
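
The type-ignore item above can be checked mechanically. This is a minimal sketch, assuming that "justified" means the comment names an error code in brackets and is followed by a trailing reason. That convention and the function name find_unjustified_ignores are illustrative assumptions, not project rules.

```python
import re

# Assumed convention (not a project rule): "# type: ignore[code]  # reason".
_IGNORE = re.compile(r"#\s*type:\s*ignore(?P<code>\[[^\]]+\])?(?P<rest>.*)$")


def find_unjustified_ignores(source: str) -> list[int]:
    """Return 1-based line numbers whose type-ignore lacks a code or a reason."""
    flagged = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _IGNORE.search(line)
        if match is None:
            continue
        has_code = match.group("code") is not None
        rest = match.group("rest")
        # A reason is any non-empty text after a second "#" on the same line.
        has_reason = "#" in rest and rest.split("#", 1)[1].strip() != ""
        if not (has_code and has_reason):
            flagged.append(lineno)
    return flagged
```

The check is purely textual, so a matching string literal would also be flagged; treat hits as candidates for review rather than hard failures.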

Strategy Validation

  • [ ] Preflight validation passes
  • [ ] Backtests run successfully on production data sample
  • [ ] Sanity checks pass (drawdown limits, trade frequency)
  • [ ] Model drift monitoring configured
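
The sanity checks above (drawdown limits, trade frequency) could look roughly like the following. This is a sketch only: the thresholds, the function names, and the choice of an equity curve as input are assumptions made for illustration, not TradAI's actual preflight logic.

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    if not equity:
        raise ValueError("equity curve is empty")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)  # assumes equity stays positive
    return worst


def sanity_check(
    equity: list[float],
    trade_count: int,
    days: int,
    max_dd: float = 0.25,              # hypothetical limit
    min_trades_per_day: float = 0.1,   # hypothetical limit
    max_trades_per_day: float = 50.0,  # hypothetical limit
) -> list[str]:
    """Return human-readable failures; an empty list means the backtest passes."""
    failures = []
    drawdown = max_drawdown(equity)
    if drawdown > max_dd:
        failures.append(f"max drawdown {drawdown:.1%} exceeds {max_dd:.1%}")
    frequency = trade_count / days
    if not (min_trades_per_day <= frequency <= max_trades_per_day):
        failures.append(f"trade frequency {frequency:.2f}/day is out of range")
    return failures
```

Returning a list of failures rather than raising on the first one lets a deployment gate report every violated limit in a single run.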

Documentation

  • [ ] README updated for new features
  • [ ] API documentation updated
  • [ ] Environment variables documented
  • [ ] Runbook updated for new failure modes

Infrastructure

AWS Resources

  • [ ] VPC and subnets configured
  • [ ] Security groups reviewed (least privilege)
  • [ ] IAM roles created with minimal permissions
  • [ ] S3 buckets created with encryption enabled
  • [ ] DynamoDB tables provisioned
  • [ ] Secrets Manager secrets created

ECS Configuration

  • [ ] Task definitions reviewed
  • [ ] Container resource limits set (CPU, memory)
  • [ ] Health checks configured
  • [ ] Auto-scaling policies defined
  • [ ] Service discovery registered

Database

  • [ ] RDS/Aurora instance sized appropriately
  • [ ] Backup retention configured
  • [ ] Multi-AZ enabled for production
  • [ ] Connection pooling configured
  • [ ] Performance Insights enabled

Security

Authentication

  • [ ] Cognito User Pool configured (if applicable)
  • [ ] API Gateway authentication enabled
  • [ ] JWT validation configured
  • [ ] Token expiration settings reviewed

Secrets Management

  • [ ] All secrets in AWS Secrets Manager
  • [ ] Rotation policies configured
  • [ ] No plaintext secrets in environment variables (inject them from Secrets Manager at runtime)
  • [ ] No secrets in code or config files
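
To illustrate reading a secret at runtime instead of baking it into code or config, here is a minimal sketch. It assumes the secret is stored as a JSON string and that it contains the keys api_key and api_secret; both the layout and the function name are assumptions. The client argument is expected to behave like boto3.client("secretsmanager").

```python
import json


def load_exchange_credentials(client, secret_name: str) -> dict:
    """Fetch a JSON secret and verify that the expected keys are present."""
    response = client.get_secret_value(SecretId=secret_name)
    credentials = json.loads(response["SecretString"])
    # Key names are assumed for illustration; adjust to the real secret layout.
    missing = {"api_key", "api_secret"} - credentials.keys()
    if missing:
        raise KeyError(f"secret {secret_name!r} is missing keys: {sorted(missing)}")
    return credentials
```

Passing the client in, rather than creating it inside the function, keeps the function testable with a stub and leaves credential resolution to the task's IAM role.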

Network Security

  • [ ] Services in private subnets
  • [ ] ALB in public subnet (if needed)
  • [ ] Security groups restrict ingress
  • [ ] VPC flow logs enabled
  • [ ] WAF rules configured (if applicable)

Monitoring

CloudWatch

  • [ ] Log groups created with retention policy
  • [ ] Metric alarms configured:
    • [ ] CPU utilization > 80%
    • [ ] Memory utilization > 80%
    • [ ] Error rate > 1%
    • [ ] Latency p99 > threshold
  • [ ] Dashboard created for key metrics
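
The CPU and memory alarms above can be expressed as parameter sets for CloudWatch's put_metric_alarm call. The sketch below only builds the parameters. The alarm names, the 60-second period, and the five evaluation periods are assumptions, and no alarm actions (for example an SNS topic) are attached.

```python
def ecs_alarm_params(cluster: str, service: str, metric: str, threshold: float) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm on an ECS service metric."""
    return {
        "AlarmName": f"{service}-{metric}-high",  # naming scheme is an assumption
        "Namespace": "AWS/ECS",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds; assumed
        "EvaluationPeriods": 5,  # assumed
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def build_alarms(cluster: str, service: str) -> list[dict]:
    """CPU and memory alarms at the 80% threshold from the checklist."""
    metrics = ("CPUUtilization", "MemoryUtilization")
    return [ecs_alarm_params(cluster, service, metric, 80.0) for metric in metrics]


def apply_alarms(alarms: list[dict], cloudwatch) -> None:
    """Create or update each alarm; cloudwatch is a boto3 CloudWatch client."""
    for params in alarms:
        cloudwatch.put_metric_alarm(**params)
```

Separating the parameter builder from the API call makes the alarm definitions easy to review and unit-test before anything is created in AWS.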

Alerting

  • [ ] SNS topics configured
  • [ ] Alert notifications set up (email, Slack, PagerDuty)
  • [ ] On-call rotation defined
  • [ ] Escalation policy documented

MLflow

  • [ ] Experiment tracking configured
  • [ ] Model registry set up
  • [ ] Artifact storage configured (S3)
  • [ ] Authentication enabled

Data

ArcticDB

  • [ ] S3 bucket configured
  • [ ] Library created
  • [ ] Data backfill completed
  • [ ] Data quality validation passed
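
One simple data quality check after a backfill is looking for missing candles. This is a sketch under the assumption that the data is keyed by integer timestamps at a fixed interval; the function name is hypothetical and this is not ArcticDB's API.

```python
def find_gaps(timestamps: list[int], interval: int) -> list[tuple[int, int]]:
    """Return (previous, next) pairs that are more than one interval apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > interval]
```

An empty result means the series is contiguous at the given interval; each returned pair marks the boundaries of a range that still needs backfilling.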

Exchange Connectivity

  • [ ] API keys configured in Secrets Manager
  • [ ] Rate limits configured
  • [ ] Fallback exchange configured (if applicable)

Deployment

CI/CD Pipeline

  • [ ] GitHub Actions / Bitbucket Pipelines configured
  • [ ] Docker images built and pushed to ECR
  • [ ] Deployment workflow tested
  • [ ] Rollback procedure documented and tested

Pulumi Infrastructure

# Preview changes
cd infra && pulumi preview

# Deploy
pulumi up

# Verify
pulumi stack output

Post-Deployment Verification

  • [ ] Health check endpoints responding
  • [ ] Services registered in Service Discovery
  • [ ] MLflow accessible
  • [ ] Data sync working
  • [ ] Strategy execution tested
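
The health check item above can be automated with a small poller. This is a sketch using only the standard library; the retry count, the delay, and any URL passed to it are assumptions, since the actual health endpoints are not listed here.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:  # must precede URLError, its parent class
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_healthy(url, attempts=10, delay=3.0, fetch=http_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The fetch and sleep parameters exist so the loop can be exercised in tests without network access or real waiting.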

Rollback Plan

Automated Rollback

Configure ECS to roll back automatically when a deployment fails:

# ECS service definition (the circuit breaker is a service setting, not part of the task definition)
deployment_circuit_breaker:
  enable: true
  rollback: true
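
The deployment circuit breaker is a setting on the ECS service. If the service is managed outside Pulumi, the same setting can be applied through the ECS API. The sketch below builds the arguments for boto3's update_service call; the helper name is hypothetical.

```python
def circuit_breaker_update(cluster: str, service: str) -> dict:
    """Keyword arguments for ecs.update_service that enable automatic rollback."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    }
```

With a boto3 ECS client named ecs, this would be applied as ecs.update_service(**circuit_breaker_update("tradai-cluster", "strategy-service")). The circuit breaker applies to services that use the rolling update deployment type.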

Manual Rollback

# Revert to previous task definition
aws ecs update-service \
  --cluster tradai-cluster \
  --service strategy-service \
  --task-definition strategy-service:PREVIOUS_VERSION

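# Or redeploy through Pulumi from the last known-good commit
# (export the stack state first as a backup; the export alone reverts nothing)
pulumi stack export --stack prod > backup.json
pulumi up --stack prod --target-replace urn:pulumi:prod::tradai::aws:ecs/service:Service::strategy-service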
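
The PREVIOUS_VERSION placeholder in the manual rollback command has to be resolved to a concrete revision. Here is a minimal sketch that derives revision N-1 from the current family:revision string; the helper name is hypothetical. It assumes the immediately preceding revision is the one to restore, which is not guaranteed if revisions were registered without being deployed or have since been deregistered.

```python
def previous_task_definition(current: str) -> str:
    """Map 'family:N' (or an ARN ending in ':N') to the same family at revision N-1."""
    family, separator, revision = current.rpartition(":")
    if not separator or not revision.isdigit():
        raise ValueError(f"expected 'family:revision', got {current!r}")
    number = int(revision)
    if number <= 1:
        raise ValueError("there is no earlier revision to roll back to")
    return f"{family}:{number - 1}"
```

Verify the result against the service's deployment history before passing it to --task-definition.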

Post-Deployment

Smoke Tests

  • [ ] API endpoints responding
  • [ ] Health checks passing
  • [ ] Backtest job executes successfully
  • [ ] Data sync completes
  • [ ] Model inference works

Monitoring Check

  • [ ] Logs appearing in CloudWatch
  • [ ] Metrics being recorded
  • [ ] No error spikes in dashboards
  • [ ] Performance within expected bounds

Documentation Update

  • [ ] Deployment notes recorded
  • [ ] Known issues documented
  • [ ] Changelog updated

Production Environment Variables

Ensure these are set in production:

# AWS
AWS_REGION=us-east-1
# Credentials come from the task's IAM role; do not set AWS access keys here

# MLflow
MLFLOW_TRACKING_URI=https://mlflow.your-domain.com
MLFLOW_TRACKING_USERNAME=<from-secrets-manager>
MLFLOW_TRACKING_PASSWORD=<from-secrets-manager>

# Services
BACKEND_EXECUTOR_MODE=stepfunctions
STRATEGY_SERVICE_S3_CONFIG_BUCKET=tradai-config-prod
STRATEGY_SERVICE_EXCHANGE_SECRET_NAME=tradai/exchange-credentials

# Data
DATA_COLLECTION_ARCTIC_S3_BUCKET=tradai-arctic-prod
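
A service can fail fast at startup if any of the variables listed above is missing. This is a minimal sketch; the function name is hypothetical, and treating a whitespace-only value as missing is an assumption.

```python
import os

# Variable names copied from the list above.
REQUIRED_VARS = (
    "AWS_REGION",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
    "BACKEND_EXECUTOR_MODE",
    "STRATEGY_SERVICE_S3_CONFIG_BUCKET",
    "STRATEGY_SERVICE_EXCHANGE_SECRET_NAME",
    "DATA_COLLECTION_ARCTIC_S3_BUCKET",
)


def missing_env_vars(env=None) -> list[str]:
    """Return the required variables that are unset or blank, in listed order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling missing_env_vars() at startup and exiting with the returned names in the error message surfaces a misconfigured task before it handles any work.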
