Lambda Functions

Serverless functions for TradAI platform orchestration, monitoring, and automation.

Overview

TradAI uses AWS Lambda for event-driven processing, scheduled tasks, and Step Functions workflow integration.

Deployment Model

All Lambdas are deployed as container images (not zip packages). Each Lambda has its own Dockerfile in lambdas/<name>/ that extends a shared base image containing tradai-common.

lambdas/
├── base/              # Shared base Docker image (tradai-common wheel)
├── backtest-consumer/ # Each Lambda is a separate container image
│   ├── Dockerfile
│   ├── handler.py
│   └── requirements.txt
├── drift-monitor/
│   ├── Dockerfile
│   └── handler.py
└── ...

Handler Pattern

Most Lambdas use the @lambda_handler decorator for consistent error handling and settings injection:

from tradai.common.lambda_ import lambda_handler, LambdaResponse

@lambda_handler(MySettings)
def handler(event: dict, context) -> dict:
    # Settings are injected via the decorator
    return LambdaResponse.success({"result": "data"})

Note: Some Lambdas (e.g., backtest-consumer, sqs-consumer) use direct handler implementations without the decorator for specialized SQS batch processing.
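
To make the decorator's role concrete, here is a minimal sketch of what a handler decorator of this kind might do: load settings, run the handler, and convert exceptions into error responses. It is not the tradai.common implementation. DemoSettings, demo_lambda_handler, the from_env() loader, the extra settings argument, and the mapping of ValueError to 400 are all assumptions made for illustration.

```python
import functools
import json
import os
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical settings class; the real MySettings is not shown on this page."""

    environment: str = "dev"

    @classmethod
    def from_env(cls) -> "DemoSettings":
        return cls(environment=os.environ.get("ENVIRONMENT", "dev"))


def demo_lambda_handler(settings_cls: type) -> Callable:
    """Stand-in for @lambda_handler: inject settings and shape the response."""

    def decorator(func: Callable[..., dict]) -> Callable[[dict, Any], dict]:
        @functools.wraps(func)
        def wrapper(event: dict, context: Any) -> dict:
            try:
                settings = settings_cls.from_env()
                body = func(event, context, settings)
                return {"statusCode": 200, "body": json.dumps(body)}
            except ValueError as exc:
                # Assumption: validation problems are reported as client errors.
                return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
            except Exception as exc:
                # Anything else is reported as a server error.
                return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}

        return wrapper

    return decorator


@demo_lambda_handler(DemoSettings)
def handler(event: dict, context: Any, settings: DemoSettings) -> dict:
    if "job_id" not in event:
        raise ValueError("job_id is required")
    return {"job_id": event["job_id"], "environment": settings.environment}
```

Centralizing error handling this way keeps each handler body focused on its own logic, which is presumably why most Lambdas share the decorator.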


Lambda Categories

Workflow Orchestration

Lambda             Trigger         Purpose
backtest-consumer  SQS             Launches ECS backtesting tasks from queue
sqs-consumer       SQS             Launches ECS retraining tasks from queue
update-status      Step Functions  Updates job status in DynamoDB workflow-state table
cleanup-resources  Step Functions  Cleans up orphaned ECS tasks on workflow failure

Model Lifecycle

Lambda                   Trigger           Purpose
check-retraining-needed  Step Functions    Evaluates if model needs retraining
compare-models           Step Functions    Compares champion vs challenger models
promote-model            Step Functions    Promotes model to Production in MLflow
model-rollback           CloudWatch Alarm  Rolls back model to previous version
retraining-scheduler     EventBridge       Schedules and triggers model retraining

Monitoring & Health

Lambda                   Trigger      Purpose
health-check             EventBridge  Checks ECS service health via Service Discovery
trading-heartbeat-check  EventBridge  Monitors live trading container heartbeats
drift-monitor            EventBridge  Detects model drift using PSI metrics
orphan-scanner           EventBridge  Finds and stops orphaned ECS tasks
pulumi-drift-detector    EventBridge  Detects infrastructure drift via Pulumi preview
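
The drift-monitor entry refers to PSI (Population Stability Index). For readers unfamiliar with the metric, the sketch below computes the standard formula, the sum over bins of (actual - expected) * ln(actual / expected), on per-bin proportions. It is a generic illustration, not the drift-monitor's code; the function name psi and the eps floor for empty bins are assumptions.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions over the same bins (each should sum to 1).
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor empty bins so the logarithm stays defined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


baseline = [0.5, 0.5]  # proportions observed at training time
current = [0.4, 0.6]   # proportions observed in recent data
score = psi(baseline, current)
```

A PSI of 0 means the two distributions match. A common rule of thumb treats values above roughly 0.1 as moderate shift and above 0.25 as significant, but the thresholds this platform uses are not stated here.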

Deployment & Validation

Lambda             Trigger         Purpose
validate-strategy  Step Functions  Validates strategy config before deployment
notify-completion  Step Functions  Sends SNS/Slack notifications

Data Services

Lambda                 Trigger             Purpose
data-collection-proxy  API/Step Functions  Proxy to data-collection service

Common Patterns

Environment Variables

All Lambdas use these common environment variables:

Variable              Description
ENVIRONMENT           Environment name (dev/staging/prod)
DYNAMODB_TABLE_NAME   Default state repository table
SNS_ALERTS_TOPIC_ARN  SNS topic for alerts
ECS_CLUSTER           ECS cluster name/ARN
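
As an illustration of how a handler might consume these variables, the sketch below reads them into a small typed object and fails fast when one is missing. CommonEnv and load_common_env are hypothetical names, and treating every variable as required is an assumption; the actual settings classes in tradai-common may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CommonEnv:
    """Hypothetical container for the common variables listed above."""

    environment: str
    dynamodb_table_name: str
    sns_alerts_topic_arn: str
    ecs_cluster: str


def load_common_env(env: Mapping[str, str] = os.environ) -> CommonEnv:
    """Read the common variables, raising if any is unset or empty."""
    names = ("ENVIRONMENT", "DYNAMODB_TABLE_NAME", "SNS_ALERTS_TOPIC_ARN", "ECS_CLUSTER")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return CommonEnv(
        environment=env["ENVIRONMENT"],
        dynamodb_table_name=env["DYNAMODB_TABLE_NAME"],
        sns_alerts_topic_arn=env["SNS_ALERTS_TOPIC_ARN"],
        ecs_cluster=env["ECS_CLUSTER"],
    )
```

Passing the mapping as a parameter keeps the loader easy to unit test without touching the process environment.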

Response Format

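All Lambdas build their result with LambdaResponse. Serialized with to_dict(), a response has the following shape (body is shown decoded here for readability; to_dict() emits it as a JSON string):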

# Success
{
    "statusCode": 200,
    "body": {"result": "data"}
}

# Error
{
    "statusCode": 400,
    "body": {"error": "Error message"}
}

Step Functions Integration

Lambdas invoked from Step Functions workflows return to_step_functions(), which yields the body dict itself, with environment, timestamp, and the payload fields at the top level. Other Lambdas return to_dict(), which wraps the JSON-serialized body in {"statusCode": 200, "body": "{...}"}:

# For Step Functions (e.g., validate-strategy, update-status)
return LambdaResponse.success({"decision": "PROMOTE"}).to_step_functions()
# Returns: {"environment": "dev", "timestamp": "...", "decision": "PROMOTE"}

# For API Gateway / EventBridge (e.g., health-check, drift-monitor)
return LambdaResponse.success({"healthy": True}).to_dict()
# Returns: {"statusCode": 200, "body": "{\"environment\": \"dev\", \"healthy\": true, ...}"}
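
The difference between the two serializers can be sketched with a small stand-in class. DemoResponse below is hypothetical and only mirrors the behavior described above; the real LambdaResponse in tradai.common may carry more fields or different defaults.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class DemoResponse:
    """Hypothetical stand-in for LambdaResponse, for illustration only."""

    status_code: int
    data: Dict[str, Any]
    environment: str = "dev"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @classmethod
    def success(cls, data: Dict[str, Any]) -> "DemoResponse":
        return cls(status_code=200, data=data)

    def _body(self) -> Dict[str, Any]:
        # Metadata and payload fields share the top level of the body.
        return {"environment": self.environment, "timestamp": self.timestamp, **self.data}

    def to_step_functions(self) -> Dict[str, Any]:
        # Plain dict, no statusCode wrapper and no JSON string.
        return self._body()

    def to_dict(self) -> Dict[str, Any]:
        # HTTP-style envelope with the body serialized to a JSON string.
        return {"statusCode": self.status_code, "body": json.dumps(self._body())}


resp = DemoResponse.success({"decision": "PROMOTE"})
flat = resp.to_step_functions()
wrapped = resp.to_dict()
```

Returning the flat dict means a workflow state can read a field such as decision directly from the Lambda's output, whereas the wrapped form would require decoding the body string first.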

SQS Batch Processing

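SQS-triggered Lambdas report partial batch failures: the handler returns the IDs of the messages that failed, and only those messages are retried instead of the whole batch. This requires ReportBatchItemFailures to be enabled on the SQS event source mapping: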

return {
    "batchItemFailures": [
        {"itemIdentifier": "failed-message-id"}
    ]
}
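
A handler that produces this response could look like the sketch below. The Records, messageId, and body fields are the standard SQS event shape; process_batch and launch_task are hypothetical names, and the validation inside launch_task merely stands in for the real work (for example, starting an ECS task).

```python
import json
from typing import Callable, Dict, List


def process_batch(event: dict, process: Callable[[dict], None]) -> dict:
    """Process each SQS record and report only the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            # A malformed body counts as a failure for that message only.
            process(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def launch_task(message: dict) -> None:
    """Hypothetical per-message work; raises to signal a failed message."""
    if "strategy" not in message:
        raise ValueError("strategy is required")


def handler(event: dict, context: object) -> dict:
    return process_batch(event, launch_task)
```

An empty batchItemFailures list tells Lambda that the whole batch succeeded, so the handler returns the same structure on every path.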

Deployment

Lambdas are deployed as container images via Pulumi in infra/compute/modules/lambda_funcs.py:

# Container-image deployment (actual pattern).
# name, ecr_registry, config, and env_vars are supplied by the surrounding module.
import pulumi_aws as aws

lambda_func = aws.lambda_.Function(
    f"{name}-lambda",
    package_type="Image",
    image_uri=f"{ecr_registry}/tradai/lambda-{name}:latest",
    timeout=config.timeout,
    memory_size=config.memory,
    environment=aws.lambda_.FunctionEnvironmentArgs(
        variables=env_vars,
    ),
)

Build and deploy with:

just lambda-bootstrap    # Full pipeline: build wheel -> images -> push to ECR
just lambda-build-all    # Build all Lambda container images
just lambda-push-all     # Push all Lambda images to ECR

IAM Permissions

Each Lambda has specific IAM permissions defined in infra/compute/modules/lambda_funcs.py. Common patterns:

  • ECS Lambdas: ecs:RunTask, ecs:DescribeTasks, ecs:ListTasks, ecs:StopTask
  • DynamoDB Lambdas: dynamodb:GetItem, dynamodb:PutItem, dynamodb:Query, dynamodb:Scan
  • SNS Lambdas: sns:Publish
  • CloudWatch Lambdas: cloudwatch:PutMetricData
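
To show how these action groups translate into a policy, the sketch below assembles an IAM policy document with one statement per group. ACTIONS_BY_PATTERN and build_policy are hypothetical helpers, not the code in lambda_funcs.py, and the ARN in the usage lines is a placeholder.

```python
import json
from typing import Dict, List

# Action groups copied from the list above.
ACTIONS_BY_PATTERN: Dict[str, List[str]] = {
    "ecs": ["ecs:RunTask", "ecs:DescribeTasks", "ecs:ListTasks", "ecs:StopTask"],
    "dynamodb": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query", "dynamodb:Scan"],
    "sns": ["sns:Publish"],
    "cloudwatch": ["cloudwatch:PutMetricData"],
}


def build_policy(grants: Dict[str, List[str]]) -> str:
    """Return an IAM policy JSON string with one Allow statement per pattern.

    grants maps a pattern name to the resource ARNs it may act on.
    """
    statements = [
        {"Effect": "Allow", "Action": ACTIONS_BY_PATTERN[pattern], "Resource": resources}
        for pattern, resources in grants.items()
    ]
    return json.dumps({"Version": "2012-10-17", "Statement": statements})


policy_json = build_policy({
    "sns": ["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder ARN
    "cloudwatch": ["*"],  # PutMetricData does not support resource-level scoping
})
```

Keeping one statement per group lets each set of actions be scoped to its own resources rather than sharing a single broad Resource list.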

Monitoring

CloudWatch Metrics

All Lambdas publish custom metrics to CloudWatch namespace TradAI/Lambda:

Metric                Description
Invocations           Total invocations
Errors                Error count
Duration              Execution time (ms)
ConcurrentExecutions  Concurrent executions

CloudWatch Alarms

Critical Lambdas have CloudWatch alarms configured:

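# Example alarm for health-check failures, using the built-in AWS/Lambda Errors metric
aws.cloudwatch.MetricAlarm(
    "health-check-alarm",
    comparison_operator="GreaterThanThreshold",
    evaluation_periods=3,
    metric_name="Errors",
    namespace="AWS/Lambda",
    # Scope the alarm to one function; health_check_lambda is an assumed
    # reference to the health-check aws.lambda_.Function resource.
    dimensions={"FunctionName": health_check_lambda.name},
    period=300,
    statistic="Sum",
    threshold=1,  # fires after 3 consecutive 5-minute periods with more than 1 error
)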
