Lambda Functions¶
Serverless functions for TradAI platform orchestration, monitoring, and automation.
Overview¶
TradAI uses AWS Lambda for event-driven processing, scheduled tasks, and Step Functions workflow integration.
Deployment Model¶
All Lambdas are deployed as container images (not zip packages). Each Lambda has its own Dockerfile in lambdas/<name>/ that extends a shared base image containing tradai-common.
lambdas/
├── base/ # Shared base Docker image (tradai-common wheel)
├── backtest-consumer/ # Each Lambda is a separate container image
│ ├── Dockerfile
│ ├── handler.py
│ └── requirements.txt
├── drift-monitor/
│ ├── Dockerfile
│ └── handler.py
└── ...
Handler Pattern¶
Most Lambdas use the @lambda_handler decorator for consistent error handling and settings injection:
from tradai.common.lambda_ import lambda_handler, LambdaResponse

@lambda_handler(MySettings)
def handler(event: dict, context) -> dict:
    # Settings are injected via the decorator
    return LambdaResponse.success({"result": "data"})
Note: Some Lambdas (e.g., backtest-consumer, sqs-consumer) use direct handler implementations without the decorator for specialized SQS batch processing.
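As a rough illustration of what a decorator of this kind could do, the sketch below builds a settings object on each invocation and converts uncaught exceptions into error responses. It is not the tradai-common implementation: sketch_lambda_handler, DemoSettings, the from_env constructor, and the settings keyword argument are all hypothetical, and the real decorator may inject settings in a different way.

```python
import functools
import os
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical settings class used only for this sketch."""
    environment: str

    @classmethod
    def from_env(cls) -> "DemoSettings":
        return cls(environment=os.environ.get("ENVIRONMENT", "dev"))


def sketch_lambda_handler(settings_cls: type) -> Callable:
    """Hypothetical stand-in for @lambda_handler (not the tradai-common code)."""
    def decorator(func: Callable[..., dict]) -> Callable[[dict, Any], dict]:
        @functools.wraps(func)
        def wrapper(event: dict, context: Any) -> dict:
            try:
                # Build settings per invocation and pass them to the handler.
                settings = settings_cls.from_env()
                return func(event, context, settings=settings)
            except ValueError as exc:
                # Assumption: validation problems map to a 400 response.
                return {"statusCode": 400, "body": {"error": str(exc)}}
            except Exception as exc:
                # Assumption: anything else maps to a 500 response.
                return {"statusCode": 500, "body": {"error": str(exc)}}
        return wrapper
    return decorator


@sketch_lambda_handler(DemoSettings)
def handler(event: dict, context: Any, settings: DemoSettings) -> dict:
    if "job_id" not in event:
        raise ValueError("job_id is required")
    body = {"job_id": event["job_id"], "environment": settings.environment}
    return {"statusCode": 200, "body": body}
```

Keeping the error mapping in one wrapper means individual handlers only raise exceptions and never build error payloads themselves.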
Lambda Categories¶
Workflow Orchestration¶
| Lambda | Trigger | Purpose |
|---|---|---|
| backtest-consumer | SQS | Launches ECS backtesting tasks from queue |
| sqs-consumer | SQS | Launches ECS retraining tasks from queue |
| update-status | Step Functions | Updates job status in DynamoDB workflow-state table |
| cleanup-resources | Step Functions | Cleans up orphaned ECS tasks on workflow failure |
Model Lifecycle¶
| Lambda | Trigger | Purpose |
|---|---|---|
| check-retraining-needed | Step Functions | Evaluates if model needs retraining |
| compare-models | Step Functions | Compares champion vs challenger models |
| promote-model | Step Functions | Promotes model to Production in MLflow |
| model-rollback | CloudWatch Alarm | Rolls back model to previous version |
| retraining-scheduler | EventBridge | Schedules and triggers model retraining |
Monitoring & Health¶
| Lambda | Trigger | Purpose |
|---|---|---|
| health-check | EventBridge | Checks ECS service health via Service Discovery |
| trading-heartbeat-check | EventBridge | Monitors live trading container heartbeats |
| drift-monitor | EventBridge | Detects model drift using PSI metrics |
| orphan-scanner | EventBridge | Finds and stops orphaned ECS tasks |
| pulumi-drift-detector | EventBridge | Detects infrastructure drift via Pulumi preview |
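For readers unfamiliar with the PSI metric mentioned for drift-monitor, the following is a minimal sketch of the Population Stability Index, computed as the sum over bins of (actual - expected) * ln(actual / expected). The equal-width binning, the eps floor for empty bins, and the function name psi are assumptions made for illustration; the binning and thresholds used by drift-monitor itself are not described here.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two non-empty samples."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the baseline range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

Identical distributions give a PSI of zero, and the value grows as the two distributions diverge.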
Deployment & Validation¶
| Lambda | Trigger | Purpose |
|---|---|---|
| validate-strategy | Step Functions | Validates strategy config before deployment |
| notify-completion | Step Functions | Sends SNS/Slack notifications |
Data Services¶
| Lambda | Trigger | Purpose |
|---|---|---|
| data-collection-proxy | API/Step Functions | Proxy to data-collection service |
Common Patterns¶
Environment Variables¶
All Lambdas use these common environment variables:
| Variable | Description |
|---|---|
| ENVIRONMENT | Environment name (dev/staging/prod) |
| DYNAMODB_TABLE_NAME | Default state repository table |
| SNS_ALERTS_TOPIC_ARN | SNS topic for alerts |
| ECS_CLUSTER | ECS cluster name/ARN |
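One way to read these variables is to load them once into a typed object and fail fast when any is missing. The sketch below assumes all four are required; CommonEnv and load_common_env are hypothetical names and are not part of tradai-common.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "ENVIRONMENT",
    "DYNAMODB_TABLE_NAME",
    "SNS_ALERTS_TOPIC_ARN",
    "ECS_CLUSTER",
)


@dataclass(frozen=True)
class CommonEnv:
    """Hypothetical container for the common Lambda environment variables."""
    environment: str
    dynamodb_table_name: str
    sns_alerts_topic_arn: str
    ecs_cluster: str


def load_common_env(env: Mapping[str, str] = os.environ) -> CommonEnv:
    # Report every missing variable at once instead of failing on the first.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return CommonEnv(
        environment=env["ENVIRONMENT"],
        dynamodb_table_name=env["DYNAMODB_TABLE_NAME"],
        sns_alerts_topic_arn=env["SNS_ALERTS_TOPIC_ARN"],
        ecs_cluster=env["ECS_CLUSTER"],
    )
```

Accepting the mapping as a parameter lets tests pass a plain dict instead of patching os.environ.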
Response Format¶
All Lambdas return responses in the LambdaResponse format:
# Success
{
  "statusCode": 200,
  "body": {"result": "data"}
}
# Error
{
  "statusCode": 400,
  "body": {"error": "Error message"}
}
Step Functions Integration¶
Lambdas invoked from Step Functions workflows call to_step_functions(), which returns the body dict directly (with environment, timestamp, and data fields at the top level). Other Lambdas call to_dict(), which wraps the JSON-serialized body in {"statusCode": 200, "body": "{...}"}:
# For Step Functions (e.g., validate-strategy, update-status)
return LambdaResponse.success({"decision": "PROMOTE"}).to_step_functions()
# Returns: {"environment": "dev", "timestamp": "...", "decision": "PROMOTE"}
# For API Gateway / EventBridge (e.g., health-check, drift-monitor)
return LambdaResponse.success({"healthy": True}).to_dict()
# Returns: {"statusCode": 200, "body": "{\"environment\": \"dev\", \"healthy\": true, ...}"}
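To make the difference between the two shapes concrete, here is a minimal stand-in that reproduces them. SketchResponse is a hypothetical class written for this page; it mirrors only the output shapes shown above and is not the actual LambdaResponse from tradai-common.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SketchResponse:
    """Hypothetical stand-in that mimics the two documented output shapes."""
    data: dict
    status_code: int = 200
    environment: str = "dev"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def _body(self) -> dict:
        # Metadata and payload fields share the top level of the body.
        return {"environment": self.environment,
                "timestamp": self.timestamp,
                **self.data}

    def to_step_functions(self) -> dict:
        # Step Functions states read fields directly, so return a plain dict.
        return self._body()

    def to_dict(self) -> dict:
        # HTTP-style callers expect a status code and a JSON string body.
        return {"statusCode": self.status_code, "body": json.dumps(self._body())}


resp = SketchResponse({"decision": "PROMOTE"}, timestamp="2024-01-01T00:00:00+00:00")
print(resp.to_step_functions())
print(resp.to_dict())
```

Returning a plain dict to Step Functions lets later states reference fields such as $.decision without first parsing a JSON string.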
SQS Batch Processing¶
SQS-triggered Lambdas support partial batch failure, so only the messages that fail are returned to the queue for retry.
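A minimal sketch of the partial batch response contract follows. The batchItemFailures / itemIdentifier response shape is the standard AWS Lambda contract for SQS event sources and requires ReportBatchItemFailures to be enabled on the event source mapping. The process_batch and launch_task functions are hypothetical and stand in for whatever each consumer actually does with a message.

```python
import json
from typing import Any, Callable, Dict, List


def process_batch(event: dict, process_record: Callable[[dict], None]) -> dict:
    """Process each SQS record and report only the failed message IDs."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; successful ones are deleted by Lambda.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def launch_task(message: dict) -> None:
    """Hypothetical per-message work; a real consumer would start an ECS task."""
    if "strategy" not in message:
        raise KeyError("strategy")


def handler(event: dict, context: Any) -> dict:
    return process_batch(event, launch_task)
```

An empty batchItemFailures list tells Lambda that the whole batch succeeded, while an unhandled exception from the handler causes the entire batch to be retried.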
Deployment¶
Lambdas are deployed as container images via Pulumi in infra/compute/modules/lambda_funcs.py:
# Container-image deployment (actual pattern)
import pulumi_aws as aws

# name, ecr_registry, config, and env_vars are assumed to come from the surrounding code
lambda_func = aws.lambda_.Function(
    f"{name}-lambda",
    package_type="Image",
    image_uri=f"{ecr_registry}/tradai/lambda-{name}:latest",
    timeout=config.timeout,
    memory_size=config.memory,
    environment=aws.lambda_.FunctionEnvironmentArgs(
        variables=env_vars,
    ),
)
Build and deploy with:
just lambda-bootstrap # Full pipeline: build wheel -> images -> push to ECR
just lambda-build-all # Build all Lambda container images
just lambda-push-all # Push all Lambda images to ECR
IAM Permissions¶
Each Lambda has specific IAM permissions defined in infra/compute/modules/lambda_funcs.py. Common patterns:
- ECS Lambdas: ecs:RunTask, ecs:DescribeTasks, ecs:ListTasks, ecs:StopTask
- DynamoDB Lambdas: dynamodb:GetItem, dynamodb:PutItem, dynamodb:Query, dynamodb:Scan
- SNS Lambdas: sns:Publish
- CloudWatch Lambdas: cloudwatch:PutMetricData
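The patterns above can be expressed as an IAM policy document. The sketch below only shows the JSON shape. PERMISSION_PATTERNS and build_policy are hypothetical helpers, not code from lambda_funcs.py, and the wildcard resource is a placeholder that a real policy would normally narrow to specific ARNs.

```python
import json
from typing import Iterable

# Actions copied from the permission patterns listed above.
PERMISSION_PATTERNS = {
    "ecs": ["ecs:RunTask", "ecs:DescribeTasks", "ecs:ListTasks", "ecs:StopTask"],
    "dynamodb": ["dynamodb:GetItem", "dynamodb:PutItem",
                 "dynamodb:Query", "dynamodb:Scan"],
    "sns": ["sns:Publish"],
    "cloudwatch": ["cloudwatch:PutMetricData"],
}


def build_policy(patterns: Iterable[str], resource: str = "*") -> str:
    """Return an IAM policy JSON string covering the selected patterns."""
    actions = [action
               for pattern in patterns
               for action in PERMISSION_PATTERNS[pattern]]
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": resource},
        ],
    }
    return json.dumps(document, indent=2)


# Example: a Lambda that launches ECS tasks and publishes alerts.
print(build_policy(["ecs", "sns"]))
```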
Monitoring¶
CloudWatch Metrics¶
All Lambdas publish custom metrics to CloudWatch namespace TradAI/Lambda:
| Metric | Description |
|---|---|
| Invocations | Total invocations |
| Errors | Error count |
| Duration | Execution time (ms) |
| ConcurrentExecutions | Concurrent executions |
CloudWatch Alarms¶
Critical Lambdas have CloudWatch alarms configured:
# Example alarm for health-check failures
aws.cloudwatch.MetricAlarm(
    "health-check-alarm",
    comparison_operator="GreaterThanThreshold",
    evaluation_periods=3,
    metric_name="Errors",
    namespace="AWS/Lambda",
    period=300,
    statistic="Sum",
    threshold=1,
)
See Also¶
Architecture:
- Architecture Overview - System diagrams including Lambda infrastructure
- Step Functions - Workflow orchestration details
- Services - ECS service definitions
- ML Lifecycle - Model training and drift detection
SDK Reference:
- tradai-common - Lambda decorators and utilities
- tradai-strategy - Strategy validation
Services:
- Backend Service - Backtest submission
- Strategy Service - Model registry integration
CLI:
- CLI Reference - Command-line tools
Infrastructure:
- Pulumi Code - Lambda deployment code