Lambda Functions

Serverless functions for TradAI platform orchestration, monitoring, and automation.

Overview

TradAI uses AWS Lambda for event-driven processing, scheduled tasks, and Step Functions workflow integration. All Lambdas use the lambda_handler decorator from tradai-common for consistent error handling and metrics.

from tradai.common.lambda_ import lambda_handler, LambdaResponse

@lambda_handler
def handler(event: dict, context) -> dict:
    # Business logic here
    return LambdaResponse.success({"result": "data"})
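
For intuition, a decorator of this kind might look roughly like the sketch below. It is not the tradai-common implementation: the name `sketch_lambda_handler`, the logging call, and the 500 status code for unhandled exceptions are all assumptions made for illustration.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

Handler = Callable[[dict, Any], dict]


def sketch_lambda_handler(func: Handler) -> Handler:
    """Hypothetical stand-in for lambda_handler: logs errors and returns an error response."""

    @functools.wraps(func)
    def wrapper(event: dict, context: Any) -> dict:
        try:
            return func(event, context)
        except Exception as exc:  # top-level guard so the caller always gets a dict
            logger.exception("Unhandled error in %s", func.__name__)
            return {"statusCode": 500, "body": {"error": str(exc)}}

    return wrapper


@sketch_lambda_handler
def handler(event: dict, context: Any) -> dict:
    # Business logic: echo back a (hypothetical) "symbol" field
    if "symbol" not in event:
        raise ValueError("missing 'symbol'")
    return {"statusCode": 200, "body": {"result": event["symbol"]}}
```

Wrapping every handler this way keeps error handling in one place, so individual functions only contain business logic.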

Lambda Categories

Workflow Orchestration

| Lambda | Trigger | Purpose |
| --- | --- | --- |
| backtest-consumer | SQS | Launches ECS backtesting tasks from queue |
| sqs-consumer | SQS | Launches ECS retraining tasks from queue |
| cleanup-resources | Step Functions | Cleans up orphaned ECS tasks on workflow failure |

Model Lifecycle

| Lambda | Trigger | Purpose |
| --- | --- | --- |
| check-retraining-needed | Step Functions | Evaluates if model needs retraining |
| compare-models | Step Functions | Compares champion vs challenger models |
| promote-model | Step Functions | Promotes model to Production in MLflow |
| model-rollback | CloudWatch Alarm | Rolls back model to previous version |
| retraining-scheduler | EventBridge | Schedules and triggers model retraining |

Monitoring & Health

| Lambda | Trigger | Purpose |
| --- | --- | --- |
| health-check | EventBridge | Checks ECS service health via Service Discovery |
| trading-heartbeat-check | EventBridge | Monitors live trading container heartbeats |
| drift-monitor | EventBridge | Detects model drift using PSI metrics |
| orphan-scanner | EventBridge | Finds and stops orphaned ECS tasks |
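
The PSI (Population Stability Index) check used by drift-monitor can be illustrated with a small sketch. The binning scheme (equal-width bins over the reference range), the default of 10 bins, and the `eps` floor for empty bins are assumptions; the actual drift-monitor logic may differ.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference. Both inputs must be non-empty.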

Deployment & Validation

| Lambda | Trigger | Purpose |
| --- | --- | --- |
| validate-strategy | Step Functions | Validates strategy config before deployment |
| notify-completion | Step Functions | Sends SNS/Slack notifications |

Data Services

| Lambda | Trigger | Purpose |
| --- | --- | --- |
| data-collection-proxy | API/Step Functions | Proxy to data-collection service |

Common Patterns

Environment Variables

All Lambdas use these common environment variables:

| Variable | Description |
| --- | --- |
| ENVIRONMENT | Environment name (dev/staging/prod) |
| DYNAMODB_TABLE_NAME | Default state repository table |
| ALERT_SNS_TOPIC_ARN | SNS topic for alerts |
| ECS_CLUSTER | ECS cluster name/ARN |
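
One way to read these variables is to load them once into a typed settings object and fail fast when any is missing. The sketch below is illustrative only: `LambdaSettings` and `load_settings` are hypothetical names, and treating all four variables as required for every function is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LambdaSettings:
    environment: str
    dynamodb_table_name: str
    alert_sns_topic_arn: str
    ecs_cluster: str


def load_settings(env: Mapping[str, str] = os.environ) -> LambdaSettings:
    """Read the common variables, raising if any is unset or empty."""
    required = ("ENVIRONMENT", "DYNAMODB_TABLE_NAME", "ALERT_SNS_TOPIC_ARN", "ECS_CLUSTER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return LambdaSettings(
        environment=env["ENVIRONMENT"],
        dynamodb_table_name=env["DYNAMODB_TABLE_NAME"],
        alert_sns_topic_arn=env["ALERT_SNS_TOPIC_ARN"],
        ecs_cluster=env["ECS_CLUSTER"],
    )
```

Passing the mapping as a parameter keeps the loader easy to unit test without touching the real process environment.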

Response Format

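Lambdas return the LambdaResponse format, except inside Step Functions workflows, where the data is returned unwrapped (see Step Functions Integration):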

# Success
{
    "statusCode": 200,
    "body": {"result": "data"}
}

# Error
{
    "statusCode": 400,
    "body": {"error": "Error message"}
}

Step Functions Integration

Lambdas in Step Functions workflows return data directly, without the statusCode/body envelope, so that later states can read the fields as-is:

# For Step Functions
return LambdaResponse.success({"decision": "PROMOTE"})
# Returns: {"decision": "PROMOTE"}
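
As a rough illustration of why the unwrapped form is convenient, the sketch below shows a compare-models style handler whose result a Choice state could branch on. The input fields `champion_score` and `challenger_score`, the "KEEP" value, and the comparison rule are hypothetical and not taken from the TradAI code.

```python
from typing import Any


def handler(event: dict, context: Any) -> dict:
    """Hypothetical handler for a Step Functions Task state."""
    champion = float(event["champion_score"])
    challenger = float(event["challenger_score"])
    # Assumed rule: promote only when the challenger strictly beats the champion
    decision = "PROMOTE" if challenger > champion else "KEEP"
    # Plain dict with no statusCode wrapper, so the state machine sees the field directly
    return {"decision": decision}
```

Depending on how the Task state invokes the function, the result is available either at the top level of the state output or under a Payload key.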

SQS Batch Processing

SQS-triggered Lambdas support partial batch failure:

return {
    "batchItemFailures": [
        {"itemIdentifier": "failed-message-id"}
    ]
}
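
A handler producing this response typically processes each record independently and collects the IDs of the ones that fail. The sketch below assumes JSON message bodies; `process_batch` and the `process` callback are hypothetical names. Partial batch responses only take effect when ReportBatchItemFailures is enabled on the SQS event source mapping.

```python
import json
from typing import Callable


def process_batch(event: dict, process: Callable[[dict], None]) -> dict:
    """Run `process` on each SQS record and report only the failed messages."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is returned to the queue; the rest are deleted
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded.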

Deployment

Lambdas are deployed via Pulumi in infra/modules/lambda_funcs.py:

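# Example: Creating a Lambda function
import pulumi
import pulumi_aws as aws

lambda_func = aws.lambda_.Function(
    f"{name}-lambda",
    runtime="python3.11",
    handler="handler.handler",  # module handler.py, function handler
    # A role is required and a deployment package must be supplied.
    # lambda_role and the archive path are placeholders (assumptions),
    # not names taken from this repository.
    role=lambda_role.arn,
    code=pulumi.FileArchive("./path/to/lambda/package"),
    timeout=300,      # seconds
    memory_size=256,  # MB
    environment=aws.lambda_.FunctionEnvironmentArgs(
        variables={
            "ENVIRONMENT": environment,
            "DYNAMODB_TABLE_NAME": state_table.name,
        }
    ),
)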

IAM Permissions

Each Lambda has specific IAM permissions defined in infra/modules/iam.py. Common patterns:

  • ECS Lambdas: ecs:RunTask, ecs:DescribeTasks, ecs:ListTasks, ecs:StopTask
  • DynamoDB Lambdas: dynamodb:GetItem, dynamodb:PutItem, dynamodb:Query, dynamodb:Scan
  • SNS Lambdas: sns:Publish
  • CloudWatch Lambdas: cloudwatch:PutMetricData
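
The action lists above map onto IAM policy documents. The sketch below builds such a document as JSON; the helper name `policy_document` is hypothetical, and the contents of infra/modules/iam.py may be structured differently.

```python
import json

# Actions listed above for Lambdas that manage ECS tasks
ECS_ACTIONS = ["ecs:RunTask", "ecs:DescribeTasks", "ecs:ListTasks", "ecs:StopTask"]


def policy_document(actions: list[str], resources: list[str]) -> str:
    """Return a single-statement Allow policy as a JSON string."""
    return json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Allow", "Action": actions, "Resource": resources}
            ],
        }
    )
```

The resulting string can be passed as the policy body of an IAM policy resource. Lambdas that call ecs:RunTask usually also need iam:PassRole for the task's roles.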

Monitoring

CloudWatch Metrics

All Lambdas publish custom metrics to CloudWatch namespace TradAI/Lambda:

| Metric | Description |
| --- | --- |
| Invocations | Total invocations |
| Errors | Error count |
| Duration | Execution time (ms) |
| ConcurrentExecutions | Concurrent executions |
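
Publishing a custom metric to this namespace could look like the sketch below. The helper names and the use of a FunctionName dimension are assumptions; `put_metric_data` is the standard boto3 CloudWatch client call, and the client is passed in so the code can be exercised without AWS access.

```python
from datetime import datetime, timezone
from typing import Any

NAMESPACE = "TradAI/Lambda"


def build_metric(name: str, value: float, unit: str, function_name: str) -> dict[str, Any]:
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": name,
        # Assumed dimension, so metrics can be filtered per function
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": value,
        "Unit": unit,
    }


def publish(client: Any, metrics: list[dict[str, Any]], namespace: str = NAMESPACE) -> None:
    """Send metrics using a CloudWatch client, e.g. boto3.client("cloudwatch")."""
    client.put_metric_data(Namespace=namespace, MetricData=metrics)
```

For example, `publish(boto3.client("cloudwatch"), [build_metric("Duration", 125.0, "Milliseconds", "health-check")])` would record one duration sample.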

CloudWatch Alarms

Critical Lambdas have CloudWatch alarms configured:

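# Example alarm for health-check failures
aws.cloudwatch.MetricAlarm(
    "health-check-alarm",
    comparison_operator="GreaterThanThreshold",
    evaluation_periods=3,
    metric_name="Errors",
    namespace="AWS/Lambda",  # built-in Lambda metrics, not the custom TradAI/Lambda namespace
    # Scope the alarm to a single function; health_check_lambda is a
    # placeholder (assumption) for the Function resource defined elsewhere.
    dimensions={"FunctionName": health_check_lambda.name},
    period=300,  # seconds
    statistic="Sum",
    threshold=1,  # alarms when Sum > 1 in 3 consecutive periods
)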
