
pulumi-drift-detector

Monitors infrastructure drift by running pulumi preview --expect-no-changes against each configured Pulumi stack and alerting when drift is detected.

Overview

Property       | Value
-------------- | ---------------------------
Trigger        | EventBridge scheduled event
Runtime        | Python 3.11
Timeout        | 300 seconds
Memory         | 512 MB
Settings class | PulumiDriftSettings

How It Works

The handler runs the Pulumi CLI directly via subprocess.run() (not an S3 state check). For each configured stack it:

  1. Logs in to the S3 backend with pulumi login {PULUMI_BACKEND_URL}
  2. Runs pulumi preview --stack {stack_name} --expect-no-changes --json --non-interactive
  3. Parses the JSON output to count resource changes by operation (create, update, delete, same)
  4. Treats a non-zero exit code from pulumi preview as drift
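The counting in step 3 could look roughly like the sketch below. It assumes the JSON emitted by pulumi preview --json contains a top-level "steps" list whose entries carry an "op" field; the function name count_changes and the sample payload are illustrative and not taken from the handler. Only the four operations named above are mapped, so other operations (for example replacements) would need their own handling.

```python
import json
from collections import Counter


def count_changes(preview_json: str) -> dict:
    """Count planned operations in `pulumi preview --json` output.

    Assumption: the document has a top-level "steps" list and each
    step has an "op" string such as "create", "update", "delete", "same".
    """
    doc = json.loads(preview_json)
    ops = Counter(step.get("op", "unknown") for step in doc.get("steps", []))
    # Counter returns 0 for operations that never appeared.
    return {
        "resources_to_create": ops["create"],
        "resources_to_update": ops["update"],
        "resources_to_delete": ops["delete"],
        "resources_unchanged": ops["same"],
    }


# Illustrative payload, not real Pulumi output.
sample = json.dumps({"steps": [{"op": "same"}, {"op": "update"}, {"op": "update"}]})
counts = count_changes(sample)
```

The keys match the field names used in the Output Schema section, so the result can be merged straight into a per-stack result.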

Both commands run with a working directory of /var/task/infra and environment variables including PULUMI_CONFIG_PASSPHRASE (from Secrets Manager) and PULUMI_SKIP_UPDATE_CHECK=true.
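A minimal sketch of how the two CLI calls might be issued with subprocess.run(), using the working directory and environment variables described above. The helper names (build_env, preview_command, check_stack) and the 120-second per-command timeout are assumptions made for illustration; the actual handler may structure this differently.

```python
import os
import subprocess

INFRA_DIR = "/var/task/infra"  # working directory named in the text


def build_env(passphrase: str) -> dict:
    """Copy the process environment and add the Pulumi variables."""
    env = dict(os.environ)
    env["PULUMI_CONFIG_PASSPHRASE"] = passphrase
    env["PULUMI_SKIP_UPDATE_CHECK"] = "true"
    return env


def preview_command(stack_name: str) -> list:
    """Build the preview command line as an argument list (no shell)."""
    return [
        "pulumi", "preview",
        "--stack", stack_name,
        "--expect-no-changes", "--json", "--non-interactive",
    ]


def check_stack(stack_name: str, backend_url: str, passphrase: str,
                timeout: int = 120) -> subprocess.CompletedProcess:
    """Log in, then run the preview. timeout=120 is a hypothetical value."""
    env = build_env(passphrase)
    # check=True: a failed login raises instead of silently continuing.
    subprocess.run(["pulumi", "login", backend_url], cwd=INFRA_DIR, env=env,
                   check=True, capture_output=True, text=True, timeout=timeout)
    # No check=True here: a non-zero exit code is the drift signal.
    return subprocess.run(preview_command(stack_name), cwd=INFRA_DIR, env=env,
                          capture_output=True, text=True, timeout=timeout)
```

Passing the command as a list avoids shell quoting issues with stack names, and leaving check=True off the preview call lets the caller inspect returncode and stdout together.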

Input Schema

{
    "stacks": ["tradai-foundation-prod", "tradai-compute-prod"],
    "dry_run": false
}

Both fields are optional. stacks overrides the default stack list; dry_run, if true, skips Pulumi operations (for testing).

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "stacks_checked": 2,
            "drifted": 1,
            "errors": 0
        },
        "results": [
            {
                "stack_name": "tradai-foundation-prod",
                "status": "checked",
                "has_drift": true,
                "resources_to_create": 0,
                "resources_to_update": 2,
                "resources_to_delete": 0,
                "resources_unchanged": 45,
                "drift_details": "{...}",
                "timestamp": "2024-02-07T12:00:00+00:00"
            }
        ]
    },
    "environment": "dev"
}
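The summary block can be derived from the per-stack results. The sketch below is one plausible way to do that; the function name summarize is hypothetical, and it assumes that stacks_checked counts every result, including errored ones, which the example above does not confirm.

```python
def summarize(results: list) -> dict:
    """Build the "summary" object from per-stack result dicts.

    Assumption: error results count toward stacks_checked.
    """
    return {
        "stacks_checked": len(results),
        "drifted": sum(1 for r in results if r.get("has_drift")),
        "errors": sum(1 for r in results if r.get("status") == "error"),
    }


results = [
    {"stack_name": "tradai-foundation-prod", "status": "checked", "has_drift": True},
    {"stack_name": "tradai-compute-prod", "status": "checked", "has_drift": False},
]
summary = summarize(results)
```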

Error Results

Individual stack checks that fail return error results without halting other stacks:

{
    "stack_name": "tradai-compute-prod",
    "status": "error",
    "error": "Pulumi preview timed out",
    "timestamp": "2024-02-07T12:05:00+00:00"
}
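Per-stack isolation of this kind is commonly implemented by catching exceptions inside the loop. The sketch below illustrates the idea; check_all, error_result, and the injected check_one callable are hypothetical names, and the generic error message format is an assumption.

```python
import subprocess
from datetime import datetime, timezone


def error_result(stack_name: str, message: str) -> dict:
    """Build an error result in the shape shown above."""
    return {
        "stack_name": stack_name,
        "status": "error",
        "error": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def check_all(stacks: list, check_one) -> list:
    """Run check_one for every stack; one failure never stops the loop."""
    results = []
    for name in stacks:
        try:
            results.append(check_one(name))
        except subprocess.TimeoutExpired:
            results.append(error_result(name, "Pulumi preview timed out"))
        except Exception as exc:  # any other failure becomes an error result
            results.append(error_result(name, str(exc)))
    return results
```

Injecting check_one keeps the loop testable without the Pulumi CLI being installed.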

Settings: PulumiDriftSettings

Extends DynamoDBSettings with Pulumi-specific configuration.

Setting                             | Env Var                             | Default          | Description
----------------------------------- | ----------------------------------- | ---------------- | -----------------------------------------
pulumi_backend_url                  | PULUMI_BACKEND_URL                  | -                | S3 backend URL for Pulumi state
pulumi_config_passphrase_secret_arn | PULUMI_CONFIG_PASSPHRASE_SECRET_ARN | -                | Secrets Manager ARN for Pulumi passphrase
stacks_to_check                     | STACKS_TO_CHECK                     | dev,staging,prod | Comma-separated list of stack names
alert_on_drift                      | ALERT_ON_DRIFT                      | true             | Whether to send SNS alerts on drift
dynamodb_table_name                 | INFRA_DRIFT_STATE_TABLE             | -                | State table for drift tracking

Plus inherited LambdaSettings fields: ENVIRONMENT, SNS_ALERTS_TOPIC_ARN, LOG_LEVEL.

Stack Filtering

Stacks to check are determined by, in order of precedence:

  1. event["stacks"], if provided (overrides settings)
  2. settings.get_stacks(), which parses the comma-separated STACKS_TO_CHECK list
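The precedence above can be sketched as follows. parse_stacks stands in for what settings.get_stacks() is described as doing, and resolve_stacks is a hypothetical helper; treating an empty or missing event["stacks"] as "not provided" is an assumption.

```python
def parse_stacks(csv_value: str) -> list:
    """Split a comma-separated value, dropping blanks and whitespace."""
    return [s.strip() for s in csv_value.split(",") if s.strip()]


def resolve_stacks(event: dict, stacks_to_check: str = "dev,staging,prod") -> list:
    """Prefer event["stacks"]; otherwise fall back to the configured list.

    Assumption: an empty or missing "stacks" key means "use the default".
    """
    override = event.get("stacks")
    if override:
        return list(override)
    return parse_stacks(stacks_to_check)


from_event = resolve_stacks({"stacks": ["tradai-foundation-prod"]})
from_settings = resolve_stacks({}, "dev, staging ,prod")
```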

Pulumi Passphrase

Retrieved from AWS Secrets Manager using the ARN in PULUMI_CONFIG_PASSPHRASE_SECRET_ARN. If retrieval fails, the handler returns an error response without checking any stacks.
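A minimal sketch of the retrieval and the early return. It assumes a boto3 Secrets Manager client (boto3.client("secretsmanager")) is passed in and that the secret is stored as a plain string; load_passphrase, handle_event, and the shape of the error response are illustrative, not confirmed by the source.

```python
def load_passphrase(secrets_client, secret_arn: str) -> str:
    """Fetch the passphrase; assumes the secret is stored as SecretString."""
    response = secrets_client.get_secret_value(SecretId=secret_arn)
    return response["SecretString"]


def handle_event(secrets_client, secret_arn: str, run_checks,
                 environment: str = "dev") -> dict:
    """Return an error response before any stack is checked if retrieval fails."""
    try:
        passphrase = load_passphrase(secrets_client, secret_arn)
    except Exception as exc:
        # Hypothetical error shape; the real response format may differ.
        return {
            "success": False,
            "error": f"Failed to retrieve Pulumi passphrase: {exc}",
            "environment": environment,
        }
    return {"success": True, "data": run_checks(passphrase),
            "environment": environment}
```

Taking the client as a parameter keeps the function usable with a stub in unit tests.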

Drift State Tracking

The handler uses DynamoDBStateRepository with the InfraDriftState entity to track drift state per stack:

InfraDriftState(
    stack_name="dev",
    has_drift=True,
    resources_to_create=0,
    resources_to_update=2,
    resources_to_delete=0,
    resources_unchanged=45,
    drift_detected_at="2024-02-07T12:00:00+00:00",
    last_check="2024-02-07T12:00:00+00:00",
    drift_details="{...}"
)

Alert deduplication: an alert is sent only on the transition from no drift to drifted, which prevents repeated alerts for the same drift. If state tracking fails, the alert is still sent as a safety measure.
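The deduplication rule reduces to a small decision function. In the sketch below, should_alert and the injected load_previous callable are hypothetical; load_previous is assumed to return the stored state as a dict (or None when no state exists) and to raise if the state store is unavailable.

```python
def should_alert(current_has_drift: bool, load_previous) -> bool:
    """Alert only on a no-drift -> drift transition; fail open on state errors."""
    if not current_has_drift:
        return False
    try:
        previous = load_previous()
    except Exception:
        # State tracking failed: send the alert anyway as a safety measure.
        return True
    already_drifted = bool(previous and previous.get("has_drift"))
    return not already_drifted
```

For example, should_alert(True, lambda: {"has_drift": True}) returns False because the drift was already reported, while should_alert(True, lambda: None) returns True for a stack with no stored state.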

SNS Alert Logic

When drift is newly detected and alert_on_drift=True:

  • Subject: [{ENV}] Infrastructure Drift Detected: {stack_name}
  • Body includes resource change counts and remediation recommendations
  • Message attributes: stack, environment, drift_type=infrastructure
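The alert could be assembled as keyword arguments for the SNS publish call, as in the sketch below. The function name build_alert, the body wording, and upper-casing the environment in the subject are assumptions; the argument structure (TopicArn, Subject, Message, MessageAttributes with DataType and StringValue) follows the boto3 SNS publish API.

```python
def build_alert(topic_arn: str, environment: str, result: dict) -> dict:
    """Build kwargs for sns_client.publish(**kwargs)."""
    stack = result["stack_name"]
    subject = f"[{environment.upper()}] Infrastructure Drift Detected: {stack}"
    body = "\n".join([
        f"Drift detected in stack {stack}.",
        f"To create: {result.get('resources_to_create', 0)}",
        f"To update: {result.get('resources_to_update', 0)}",
        f"To delete: {result.get('resources_to_delete', 0)}",
        "Recommendation: review the preview output and reconcile the stack.",
    ])

    def attr(value: str) -> dict:
        return {"DataType": "String", "StringValue": value}

    return {
        "TopicArn": topic_arn,
        "Subject": subject[:100],  # SNS subjects are limited to 100 characters
        "Message": body,
        "MessageAttributes": {
            "stack": attr(stack),
            "environment": attr(environment),
            "drift_type": attr("infrastructure"),
        },
    }
```

With a boto3 client this would be sent as sns_client.publish(**build_alert(topic_arn, environment, result)).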

CloudWatch Metrics

Namespace suffix: InfraDrift

Metric            | Dimensions         | Description
----------------- | ------------------ | ------------------------------------
DriftDetected     | Stack, Environment | 1.0 if drift detected, 0.0 otherwise
ResourcesToCreate | Stack, Environment | Count of resources to create
ResourcesToUpdate | Stack, Environment | Count of resources to update
ResourcesToDelete | Stack, Environment | Count of resources to delete
CheckSuccess      | Stack, Environment | 1.0 if check succeeded, 0.0 on error
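One way to turn a stack result into these metrics is to build the MetricData list accepted by the CloudWatch put_metric_data API. The sketch below is illustrative: build_metric_data is a hypothetical helper, the "Count" unit is an assumption, and it assumes that an errored check emits only CheckSuccess.

```python
def build_metric_data(stack: str, environment: str, result: dict) -> list:
    """Build the MetricData list for cloudwatch.put_metric_data()."""
    dimensions = [
        {"Name": "Stack", "Value": stack},
        {"Name": "Environment", "Value": environment},
    ]
    succeeded = result.get("status") != "error"
    values = {"CheckSuccess": 1.0 if succeeded else 0.0}
    if succeeded:
        values.update({
            "DriftDetected": 1.0 if result.get("has_drift") else 0.0,
            "ResourcesToCreate": result.get("resources_to_create", 0),
            "ResourcesToUpdate": result.get("resources_to_update", 0),
            "ResourcesToDelete": result.get("resources_to_delete", 0),
        })
    return [
        {"MetricName": name, "Dimensions": dimensions,
         "Value": float(value), "Unit": "Count"}
        for name, value in values.items()
    ]
```

The list would then be passed as cloudwatch_client.put_metric_data(Namespace=..., MetricData=...), where the namespace ends with the InfraDrift suffix; the prefix is not stated here.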

EventBridge Schedule

{
  "ScheduleExpression": "rate(6 hours)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:pulumi-drift-detector",
    "Input": "{}"
  }]
}
