Lambda Functions

Serverless functions for TradAI platform orchestration, monitoring, and automation.

Overview

TradAI uses AWS Lambda for event-driven processing, scheduled tasks, and Step Functions workflow integration.

Deployment Model

All Lambdas are deployed as container images (not zip packages). Each Lambda has its own Dockerfile in lambdas/<name>/ that extends a shared base image containing tradai-common.

lambdas/
├── base/              # Shared base Docker image (tradai-common wheel)
├── backtest-consumer/ # Each Lambda is a separate container image
│   ├── Dockerfile
│   ├── handler.py
│   └── requirements.txt
├── drift-monitor/
│   ├── Dockerfile
│   └── handler.py
└── ...

Handler Pattern

Most Lambdas use the @lambda_handler decorator for consistent error handling and settings injection:

from tradai.common.lambda_ import lambda_handler, LambdaResponse

@lambda_handler(MySettings)
def handler(event: dict, context) -> dict:
    # Settings are injected via the decorator
    return LambdaResponse.success({"result": "data"})

Note: Some Lambdas (e.g., backtest-consumer, sqs-consumer) use direct handler implementations without the decorator for specialized SQS batch processing.
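
To make the decorator pattern concrete, here is a minimal sketch of what a decorator of this kind could do. It is not the tradai.common implementation: lambda_handler_sketch, DemoSettings, and the choice to pass settings as a third argument are all assumptions made for illustration.

```python
import functools
import os
from dataclasses import dataclass


@dataclass
class DemoSettings:
    """Hypothetical settings class that reads one common variable."""

    environment: str = "dev"

    @classmethod
    def from_env(cls) -> "DemoSettings":
        return cls(environment=os.environ.get("ENVIRONMENT", "dev"))


def lambda_handler_sketch(settings_cls):
    """Hypothetical stand-in for @lambda_handler: load settings, map errors."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(event, context):
            try:
                # Settings are built on each invocation and handed to the handler.
                settings = settings_cls.from_env()
                body = func(event, context, settings)
                return {"statusCode": 200, "body": body}
            except ValueError as exc:
                # Validation problems become a 400-style error response.
                return {"statusCode": 400, "body": {"error": str(exc)}}

        return wrapper

    return decorator


@lambda_handler_sketch(DemoSettings)
def handler(event, context, settings):
    if "job_id" not in event:
        raise ValueError("job_id is required")
    return {"job_id": event["job_id"], "environment": settings.environment}
```

Keeping error mapping in one wrapper means individual handlers only raise exceptions and return data, which is the consistency benefit described above.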

Lambda Categories

Workflow Orchestration

Lambda	Trigger	Purpose
backtest-consumer	SQS	Launches ECS backtesting tasks from queue
sqs-consumer	SQS	Launches ECS retraining tasks from queue
update-status	Step Functions	Updates job status in DynamoDB workflow-state table
cleanup-resources	Step Functions	Cleans up orphaned ECS tasks on workflow failure

Model Lifecycle

Lambda	Trigger	Purpose
check-retraining-needed	Step Functions	Evaluates if model needs retraining
compare-models	Step Functions	Compares champion vs challenger models
promote-model	Step Functions	Promotes model to Production in MLflow
model-rollback	CloudWatch Alarm	Rolls back model to previous version
retraining-scheduler	EventBridge	Schedules and triggers model retraining

Monitoring & Health

Lambda	Trigger	Purpose
health-check	EventBridge	Checks ECS service health via Service Discovery
trading-heartbeat-check	EventBridge	Monitors live trading container heartbeats
drift-monitor	EventBridge	Detects model drift using PSI metrics
orphan-scanner	EventBridge	Finds and stops orphaned ECS tasks
pulumi-drift-detector	EventBridge	Detects infrastructure drift via Pulumi preview

Deployment & Validation

Lambda	Trigger	Purpose
validate-strategy	Step Functions	Validates strategy config before deployment
notify-completion	Step Functions	Sends SNS/Slack notifications

Data Services

Lambda	Trigger	Purpose
data-collection-proxy	API/Step Functions	Proxy to data-collection service

Common Patterns

Environment Variables

All Lambdas use these common environment variables:

Variable	Description
`ENVIRONMENT`	Environment name (dev/staging/prod)
`DYNAMODB_TABLE_NAME`	Default state repository table
`SNS_ALERTS_TOPIC_ARN`	SNS topic for alerts
`ECS_CLUSTER`	ECS cluster name/ARN
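
As an illustration of how a handler might read these variables, the sketch below loads them into a small typed object and fails fast when one is missing. CommonEnv and load_common_env are hypothetical names, not part of tradai-common, and the real settings classes may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "ENVIRONMENT",
    "DYNAMODB_TABLE_NAME",
    "SNS_ALERTS_TOPIC_ARN",
    "ECS_CLUSTER",
)


@dataclass(frozen=True)
class CommonEnv:
    """Hypothetical container for the common environment variables."""

    environment: str
    dynamodb_table_name: str
    sns_alerts_topic_arn: str
    ecs_cluster: str


def load_common_env(env: Optional[Mapping[str, str]] = None) -> CommonEnv:
    """Read the common variables, raising if any is unset or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return CommonEnv(
        environment=source["ENVIRONMENT"],
        dynamodb_table_name=source["DYNAMODB_TABLE_NAME"],
        sns_alerts_topic_arn=source["SNS_ALERTS_TOPIC_ARN"],
        ecs_cluster=source["ECS_CLUSTER"],
    )
```

Accepting an optional mapping makes the loader easy to exercise in unit tests without touching the process environment.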

Response Format

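Lambdas built on LambdaResponse return a status code and a body. The shapes below show the logical structure; the serialized form depends on the method used (see Step Functions Integration), and SQS consumers return the batch-failure format described under SQS Batch Processing instead: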

# Success
{
    "statusCode": 200,
    "body": {"result": "data"}
}

# Error
{
    "statusCode": 400,
    "body": {"error": "Error message"}
}

Step Functions Integration

Lambdas in Step Functions workflows call to_step_functions(), which returns the body dict directly, with the environment, timestamp, and data fields at the top level. Other Lambdas call to_dict(), which wraps the response as {"statusCode": 200, "body": "{...}"}, where body is a JSON-encoded string:

# For Step Functions (e.g., validate-strategy, update-status)
return LambdaResponse.success({"decision": "PROMOTE"}).to_step_functions()
# Returns: {"environment": "dev", "timestamp": "...", "decision": "PROMOTE"}

# For API Gateway / EventBridge (e.g., health-check, drift-monitor)
return LambdaResponse.success({"healthy": True}).to_dict()
# Returns: {"statusCode": 200, "body": "{\"environment\": \"dev\", \"healthy\": true, ...}"}
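
The difference between the two serializations can be sketched with a small stand-in class. ResponseSketch is hypothetical and only mirrors the behaviour described above; the field names and defaults of the real LambdaResponse are assumptions.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for LambdaResponse."""

    status_code: int
    data: Dict[str, Any]
    environment: str = "dev"  # assumed default, for illustration only
    timestamp: str = field(default_factory=_utc_now)

    @classmethod
    def success(cls, data: Dict[str, Any]) -> "ResponseSketch":
        return cls(status_code=200, data=dict(data))

    def _body(self) -> Dict[str, Any]:
        # Metadata and payload fields share one flat dict.
        return {
            "environment": self.environment,
            "timestamp": self.timestamp,
            **self.data,
        }

    def to_step_functions(self) -> Dict[str, Any]:
        # Flat dict: fields are available at the top level of the state output.
        return self._body()

    def to_dict(self) -> Dict[str, Any]:
        # Proxy-style envelope: body is serialized to a JSON string.
        return {"statusCode": self.status_code, "body": json.dumps(self._body())}
```

A caller of to_dict() has to json.loads the body before reading fields, whereas the to_step_functions() result can be indexed directly.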

SQS Batch Processing

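SQS-triggered Lambdas support partial batch failure: the handler returns the IDs of the messages that failed, so only those messages are retried rather than the whole batch. This relies on ReportBatchItemFailures being enabled on the SQS event source mapping. The response has this shape: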

return {
    "batchItemFailures": [
        {"itemIdentifier": "failed-message-id"}
    ]
}
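
A minimal sketch of a handler loop that produces this response follows. The event layout (Records, messageId, body) is the standard SQS event shape; process_batch and launch_task are hypothetical helpers, and the real consumers launch ECS tasks rather than just validating the message.

```python
import json
from typing import Any, Callable, Dict, List


def process_batch(
    event: Dict[str, Any],
    process_message: Callable[[Dict[str, Any]], None],
) -> Dict[str, List[Dict[str, str]]]:
    """Process each SQS record and report only the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Any failure, including malformed JSON, marks this message for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def launch_task(message: Dict[str, Any]) -> None:
    """Hypothetical processor; a real consumer would start an ECS task here."""
    if "strategy" not in message:
        raise ValueError("strategy is required")
```

An empty batchItemFailures list tells Lambda the whole batch succeeded, so every message in it is deleted from the queue.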

Deployment

Lambdas are deployed as container images via Pulumi in infra/compute/modules/lambda_funcs.py:

# Container-image deployment (actual pattern)
lambda_func = aws.lambda_.Function(
    f"{name}-lambda",
    package_type="Image",
    image_uri=f"{ecr_registry}/tradai/lambda-{name}:latest",
    timeout=config.timeout,
    memory_size=config.memory,
    environment=aws.lambda_.FunctionEnvironmentArgs(
        variables=env_vars,
    ),
)

Build and deploy with:

just lambda-bootstrap    # Full pipeline: build wheel -> images -> push to ECR
just lambda-build-all    # Build all Lambda container images
just lambda-push-all     # Push all Lambda images to ECR

IAM Permissions

Each Lambda has specific IAM permissions defined in infra/compute/modules/lambda_funcs.py. Common patterns:

ECS Lambdas: ecs:RunTask, ecs:DescribeTasks, ecs:ListTasks, ecs:StopTask
DynamoDB Lambdas: dynamodb:GetItem, dynamodb:PutItem, dynamodb:Query, dynamodb:Scan
SNS Lambdas: sns:Publish
CloudWatch Lambdas: cloudwatch:PutMetricData
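
The groups above can be turned into an IAM policy document along these lines. This is a sketch only: build_policy and the ACTIONS mapping are hypothetical, how lambda_funcs.py actually assembles policies is not shown here, and the "*" resource default is a placeholder that a real policy should narrow wherever the action supports resource-level permissions.

```python
import json
from typing import Dict, Iterable, Optional

# Action groups copied from the list above.
ACTIONS: Dict[str, list] = {
    "ecs": ["ecs:RunTask", "ecs:DescribeTasks", "ecs:ListTasks", "ecs:StopTask"],
    "dynamodb": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query",
        "dynamodb:Scan",
    ],
    "sns": ["sns:Publish"],
    "cloudwatch": ["cloudwatch:PutMetricData"],
}


def build_policy(
    groups: Iterable[str],
    resources: Optional[Dict[str, str]] = None,
) -> str:
    """Return an IAM policy JSON string with one Allow statement per group."""
    resources = resources or {}
    statements = [
        {
            "Effect": "Allow",
            "Action": ACTIONS[group],
            # Placeholder default; scope to specific ARNs where possible.
            "Resource": resources.get(group, "*"),
        }
        for group in groups
    ]
    return json.dumps({"Version": "2012-10-17", "Statement": statements})
```

For example, a queue consumer that launches tasks and records state might combine the "ecs" and "dynamodb" groups, passing the state table ARN as the DynamoDB resource.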

Monitoring

CloudWatch Metrics

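All Lambdas publish custom metrics to the CloudWatch namespace TradAI/Lambda. Lambda itself also publishes metrics with the names listed here under the built-in AWS/Lambda namespace, which is the namespace the alarm example in the next section uses: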

Metric	Description
`Invocations`	Total invocations
`Errors`	Error count
`Duration`	Execution time (ms)
`ConcurrentExecutions`	Concurrent executions

CloudWatch Alarms

Critical Lambdas have CloudWatch alarms configured:

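# Example alarm for health-check failures.
# Fires when the summed Errors metric exceeds 1 in each of 3 consecutive 5-minute periods.
# No dimensions are set in this excerpt, so as written it watches Errors across all
# functions in the region; a per-function alarm also needs a FunctionName dimension.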
aws.cloudwatch.MetricAlarm(
    "health-check-alarm",
    comparison_operator="GreaterThanThreshold",
    evaluation_periods=3,
    metric_name="Errors",
    namespace="AWS/Lambda",
    period=300,
    statistic="Sum",
    threshold=1,
)