Production Deployment Checklist

Comprehensive checklist for deploying TradAI to production.


Pre-Deployment

Code Quality

  • All tests pass: just test
  • Type checking passes: just typecheck
  • Linting passes: just lint
  • New code has at least 80% test coverage (the CI gate enforces 60%; 80% is the target)
  • No # type: ignore comments without justification
  • No hardcoded secrets or credentials
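
The "# type: ignore" item above can be checked mechanically. The following is a minimal sketch, assuming a convention in which a justification is any comment text that follows the ignore marker on the same line. That convention and the function name unjustified_ignores are illustrative assumptions, not part of the TradAI tooling.

```python
import re

# Assumed convention: a justification is extra comment text after the marker,
# e.g. "# type: ignore[arg-type]  # stubs lack this overload".
IGNORE_RE = re.compile(r"#\s*type:\s*ignore(\[[^\]]*\])?(?P<rest>.*)$")


def unjustified_ignores(source: str) -> list[int]:
    """Return 1-based line numbers whose type-ignore has no trailing explanation."""
    flagged = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = IGNORE_RE.search(line)
        # Strip spaces and '#' so a bare trailing '#' does not count as a reason.
        if match and not match.group("rest").strip(" #\t"):
            flagged.append(lineno)
    return flagged


sample = (
    "x = load()  # type: ignore\n"
    "y = call()  # type: ignore[arg-type]  # stubs lack this overload\n"
    "z = 1\n"
)
print(unjustified_ignores(sample))
```

A check like this could run as a pre-commit hook or CI step over changed files. It is a text heuristic, so it will also match the marker inside string literals.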

Strategy Validation

  • Preflight validation passes
  • Backtests run successfully on production data sample
  • Sanity checks pass (drawdown limits, trade frequency)
  • Model drift monitoring configured
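
As an illustration of the sanity checks named above, here is a minimal sketch that computes maximum drawdown from an equity curve and checks trade frequency against bounds. The default thresholds (25% drawdown, 0.1 to 50 trades per day) and the function names are hypothetical placeholders, not TradAI's actual limits.

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak.

    Assumes a non-empty curve of strictly positive equity values.
    """
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sanity_check(equity, n_trades, n_days, max_dd=0.25, min_tpd=0.1, max_tpd=50.0):
    """Return a list of human-readable failures; an empty list means no limit was hit."""
    failures = []
    dd = max_drawdown(equity)
    if dd > max_dd:
        failures.append(f"drawdown {dd:.1%} exceeds limit {max_dd:.1%}")
    trades_per_day = n_trades / n_days
    if not (min_tpd <= trades_per_day <= max_tpd):
        failures.append(
            f"trade frequency {trades_per_day:.2f}/day outside [{min_tpd}, {max_tpd}]"
        )
    return failures
```

Returning a list of failures rather than raising on the first one lets a backtest report show every violated limit at once.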

Documentation

  • README updated for new features
  • API documentation updated
  • Environment variables documented
  • Runbook updated for new failure modes

Infrastructure

AWS Resources

  • VPC and subnets configured
  • Security groups reviewed (least privilege)
  • IAM roles created with minimal permissions
  • S3 buckets created with encryption enabled
  • DynamoDB tables provisioned
  • Secrets Manager secrets created

ECS Configuration

  • Task definitions reviewed
  • Container resource limits set (CPU, memory)
  • Health checks configured
  • Auto-scaling policies defined
  • Service discovery registered

Database

  • RDS/Aurora instance sized appropriately
  • Backup retention configured
  • Multi-AZ enabled for production
  • Connection pooling configured
  • Performance Insights enabled

Security

Authentication

  • Cognito User Pool configured (if applicable)
  • API Gateway authentication enabled
  • JWT validation configured
  • Token expiration settings reviewed

Secrets Management

  • All secrets in AWS Secrets Manager
  • Rotation policies configured
  • Secrets injected from AWS Secrets Manager into container environment variables at runtime; never stored as plaintext in config files or .env files
  • No secrets in code or config files

Network Security

  • Services in private subnets
  • ALB in public subnets across at least two Availability Zones (if needed)
  • Security groups restrict ingress
  • VPC flow logs enabled
  • WAF rules configured (if applicable)

Monitoring

CloudWatch

  • Log groups created with retention policy
  • Metric alarms configured:
    • CPU utilization > 80%
    • Memory utilization > 80%
    • Error rate > 1%
    • Latency p99 > threshold
  • Dashboard created for key metrics
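
The CPU and memory alarms above could be defined in code along these lines. This sketch only builds the keyword arguments accepted by the CloudWatch PutMetricAlarm API (boto3's put_metric_alarm) for the AWS/ECS service metrics. The evaluation window (five 60-second periods), the alarm naming scheme, and the SNS topic ARN in the test are assumptions chosen for illustration. Error-rate and latency alarms depend on application-specific metrics and are not covered.

```python
def ecs_utilization_alarms(cluster: str, service: str, sns_topic_arn: str,
                           threshold: float = 80.0) -> list[dict]:
    """Build put_metric_alarm keyword arguments, one dict per utilization metric."""
    alarms = []
    for metric in ("CPUUtilization", "MemoryUtilization"):
        alarms.append({
            "AlarmName": f"{service}-{metric}-high",  # hypothetical naming scheme
            "Namespace": "AWS/ECS",
            "MetricName": metric,
            "Dimensions": [
                {"Name": "ClusterName", "Value": cluster},
                {"Name": "ServiceName", "Value": service},
            ],
            "Statistic": "Average",
            "Period": 60,            # seconds per datapoint (assumed)
            "EvaluationPeriods": 5,  # alarm after 5 consecutive breaches (assumed)
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [sns_topic_arn],
        })
    return alarms
```

Each dict can then be passed to a CloudWatch client, for example `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. If the alarms are managed through Pulumi instead, the same values map onto the corresponding alarm resource.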

Alerting

  • SNS topics configured
  • Alert notifications set up (email, Slack, PagerDuty)
  • On-call rotation defined
  • Escalation policy documented

MLflow

  • Experiment tracking configured
  • Model registry set up
  • Artifact storage configured (S3)
  • Authentication enabled

Data

ArcticDB

  • S3 bucket configured
  • Library created
  • Data backfill completed
  • Data quality validation passed

Exchange Connectivity

  • API keys configured in Secrets Manager
  • Rate limits configured
  • Fallback exchange configured (if applicable)
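
For the rate-limit item above, one common client-side approach is a token bucket. The sketch below is a generic illustration under that assumption. It is not how TradAI or any particular exchange client implements limits, and the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of at most `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The rate and capacity should be set below the exchange's published limits so that retries and clock skew do not push the client over them.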

Deployment

CI/CD Pipeline

  • GitHub Actions configured
  • Docker images built and pushed to ECR
  • Deployment workflow tested
  • Rollback procedure documented and tested

Pulumi Infrastructure

# Preview all stacks
just infra-preview dev

# Deploy all stacks
just infra-bootstrap dev

# Verify outputs
just infra-outputs compute dev

Post-Deployment Verification

  • Health check endpoints responding
  • Services registered in Service Discovery
  • MLflow accessible
  • Data sync working
  • Strategy execution tested
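
The first verification item can be scripted. Here is a minimal sketch that polls a list of health URLs until all respond with a 2xx status or the attempts run out. The retry counts and delays are assumed defaults, and the actual endpoint URLs are deployment-specific, so none are given here.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures, and timeouts.
        return False


def wait_until_healthy(urls, probe=http_ok, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll until every URL is healthy; return the URLs still failing at the end."""
    pending = list(urls)
    for attempt in range(attempts):
        pending = [u for u in pending if not probe(u)]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return pending
```

A deployment script could exit non-zero when the returned list is non-empty, which makes the check usable as a pipeline gate.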

Rollback Plan

Automated Rollback

Configure ECS to automatically roll back failed deployments:

# ECS service definition (the circuit breaker is a service-level deployment setting)
deployment_circuit_breaker:
  enable: true
  rollback: true

Manual Rollback

# Revert to previous task definition
aws ecs update-service \
  --cluster tradai-prod \
  --service tradai-strategy-service-prod \
  --task-definition tradai-strategy-service-prod:PREVIOUS_VERSION

# Or back up the Pulumi stack state, then force-replace the service
cd infra/compute && source ../.env
pulumi stack export --stack prod > backup.json
pulumi up --stack prod --target-replace urn:pulumi:prod::tradai::aws:ecs/service:Service::strategy-service
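
The PREVIOUS_VERSION placeholder in the manual rollback command stands for a task definition revision number. As a rough sketch, the helper below derives the immediately preceding revision from the current "family:revision" string or full ARN. It assumes that revision N-1 is still registered and is the one that was last deployed, which is not guaranteed, so the result should be checked against the service's deployment history before use. The function name is hypothetical.

```python
def previous_revision(task_definition: str) -> str:
    """Map 'family:N' (or an ARN ending in ':N') to the same string with N-1."""
    prefix, _, revision = task_definition.rpartition(":")
    if not prefix or not revision.isdigit():
        raise ValueError(f"no numeric revision in {task_definition!r}")
    number = int(revision)
    if number <= 1:
        raise ValueError("revision 1 has no predecessor")
    return f"{prefix}:{number - 1}"


print(previous_revision("tradai-strategy-service-prod:42"))
```

The returned string can be supplied as the `--task-definition` value in the `aws ecs update-service` command.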

Post-Deployment

Smoke Tests

  • API endpoints responding
  • Health checks passing
  • Backtest job executes successfully
  • Data sync completes
  • Model inference works

Monitoring Check

  • Logs appearing in CloudWatch
  • Metrics being recorded
  • No error spikes in dashboards
  • Performance within expected bounds

Documentation Update

  • Deployment notes recorded
  • Known issues documented
  • Changelog updated

Production Environment Variables

Ensure these are set in production:

# AWS
AWS_REGION=eu-central-1
# (IAM roles handle credentials)

# MLflow
MLFLOW_TRACKING_URI=https://mlflow.your-domain.com
MLFLOW_TRACKING_USERNAME=<from-secrets-manager>
MLFLOW_TRACKING_PASSWORD=<from-secrets-manager>

# Services
BACKEND_EXECUTOR_MODE=stepfunctions
STRATEGY_SERVICE_S3_CONFIG_BUCKET=tradai-configs-prod
STRATEGY_SERVICE_EXCHANGE_SECRET_NAME=tradai/exchange-credentials

# Data
DATA_COLLECTION_ARCTIC_S3_BUCKET=tradai-arcticdb-prod
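
A service can verify these variables at startup instead of failing later on first use. The following is a minimal sketch that treats every variable listed above as required. Whether all of them are mandatory for every service is an assumption, and the function name missing_vars is illustrative.

```python
import os

# Names taken from the list above; treating all of them as required is an assumption.
REQUIRED_VARS = (
    "AWS_REGION",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
    "BACKEND_EXECUTOR_MODE",
    "STRATEGY_SERVICE_S3_CONFIG_BUCKET",
    "STRATEGY_SERVICE_EXCHANGE_SECRET_NAME",
    "DATA_COLLECTION_ARCTIC_S3_BUCKET",
)


def missing_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

The check reports only names, never values, so it is safe to log even for the credentials injected from Secrets Manager.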
