orphan-scanner

Scans for orphaned ECS tasks that have been running too long or have no state record.

Overview

Property   Value
Trigger    EventBridge scheduled event
Runtime    Python 3.11
Timeout    300 seconds
Memory     256 MB

Input Schema

{
    "dry_run": true,           # Default: true (safety default)
    "max_runtime_hours": 6     # Default: 6
}

Output Schema

{
    "summary": {
        "total_running": 15,
        "orphans_found": 2,
        "tasks_stopped": 2,    # 0 if dry_run
        "dry_run": false
    },
    "orphans": [
        {
            "task_arn": "arn:aws:ecs:...:task/abc123",
            "reason": "Running longer than 6 hours",
            "runtime_hours": 8.5
        },
        {
            "task_arn": "arn:aws:ecs:...:task/def456",
            "reason": "No matching DynamoDB state record"
        }
    ]
}

Environment Variables

Variable                 Required   Default   Description
ECS_CLUSTER              Yes        -         ECS cluster name or ARN
RETRAINING_STATE_TABLE   Yes        -         DynamoDB table holding retraining state
TRADING_STATE_TABLE      Yes        -         DynamoDB table holding trading state
MAX_TASK_RUNTIME_HOURS   No         6         Maximum expected task runtime, in hours
DRY_RUN                  No         "true"    Report orphans without stopping them (safety default)
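As a rough illustration of how the event input and the environment variables above might combine, here is a minimal sketch. The precedence (event values override environment values) and the helper names `resolve_config` and `_as_bool` are assumptions for illustration, not taken from the function's source.

```python
import os


def _as_bool(value, default):
    """Interpret a JSON bool or a string such as "true"; fall back to default if unset."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes")


def resolve_config(event, env=None):
    """Merge the invocation event with environment variables.

    Assumption: values in the event take precedence over the environment.
    """
    env = os.environ if env is None else env
    event = event or {}
    env_dry_run = _as_bool(env.get("DRY_RUN"), True)  # safety default: dry run
    env_max_hours = env.get("MAX_TASK_RUNTIME_HOURS", 6)
    return {
        "cluster": env["ECS_CLUSTER"],  # required, raises KeyError if missing
        "retraining_table": env["RETRAINING_STATE_TABLE"],
        "trading_table": env["TRADING_STATE_TABLE"],
        "dry_run": _as_bool(event.get("dry_run"), env_dry_run),
        "max_runtime_hours": float(event.get("max_runtime_hours", env_max_hours)),
    }
```

With an empty event and no optional variables set, this sketch resolves to a dry run with a 6 hour limit, matching the documented defaults.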

Orphan Detection Criteria

A task is considered orphaned if:

  1. Runtime exceeded: The task has been running longer than max_runtime_hours
  2. No state record: No matching entry exists in the DynamoDB state tables
  3. Stale state: The state record shows "completed" or "failed" but the task is still running
  4. Mismatched ARN: The state record points to a different task ARN
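The four criteria can be sketched as a single classification function. This is an illustration only: the state record keys `status` and `task_arn`, the check order, and the function name `orphan_reason` are assumptions, and the real state table schema may differ. The task dict follows the shape returned by the ECS DescribeTasks API (`taskArn`, timezone-aware `startedAt`).

```python
from datetime import datetime, timezone

TERMINAL_STATES = {"completed", "failed"}


def orphan_reason(task, state_record, max_runtime_hours, now=None):
    """Return a reason string if the running task looks orphaned, else None.

    task: dict with "taskArn" and optionally "startedAt" (tz-aware datetime).
    state_record: dict with hypothetical "status" and "task_arn" keys, or None.
    """
    now = now or datetime.now(timezone.utc)

    # 1. Runtime exceeded
    started = task.get("startedAt")
    if started is not None:
        hours = (now - started).total_seconds() / 3600
        if hours > max_runtime_hours:
            return f"Running longer than {max_runtime_hours:g} hours"

    # 2. No state record
    if state_record is None:
        return "No matching DynamoDB state record"

    # 3. Stale state: record is terminal but the task is still running
    status = state_record.get("status")
    if status in TERMINAL_STATES:
        return f'State is "{status}" but task is still running'

    # 4. Mismatched ARN
    if state_record.get("task_arn") != task["taskArn"]:
        return "State record points to a different task ARN"

    return None
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to unit test without touching ECS or DynamoDB.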

CloudWatch Metrics

Metric                Description
RunningTasksScanned   Total tasks scanned
OrphanTasksFound      Orphaned tasks detected
OrphanTasksStopped    Tasks actually stopped
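One way these metrics could be assembled from the scan summary is sketched below. The mapping from summary fields to metric names and the `Count` unit are assumptions; the entries use the `MetricData` shape accepted by CloudWatch `put_metric_data`.

```python
def build_metric_data(summary):
    """Map a scan summary (see Output Schema) to CloudWatch MetricData entries."""
    mapping = [
        ("RunningTasksScanned", "total_running"),
        ("OrphanTasksFound", "orphans_found"),
        ("OrphanTasksStopped", "tasks_stopped"),
    ]
    return [
        {"MetricName": name, "Value": summary[key], "Unit": "Count"}
        for name, key in mapping
    ]


# Publishing (requires boto3 and AWS credentials; the namespace is hypothetical):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="OrphanScanner",
#       MetricData=build_metric_data(summary),
#   )
```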

Key Features

  • Uses pagination for large task lists (100 tasks per batch)
  • Checks both retraining and trading state tables
  • Dry-run mode for safe inspection
  • Sends a detailed SNS alert listing each orphan and its reason
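The paginated scan could look roughly like the sketch below. It assumes a boto3 ECS client is passed in as `ecs`; the function name `describe_running_tasks` is hypothetical. `list_tasks` is paginated by boto3, and `describe_tasks` accepts at most 100 tasks per call, which is consistent with the batch size mentioned above.

```python
def describe_running_tasks(ecs, cluster, batch_size=100):
    """Yield a task description for every RUNNING task in the cluster.

    ecs: a boto3 ECS client, e.g. boto3.client("ecs").
    """
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=cluster, desiredStatus="RUNNING"):
        arns = page.get("taskArns", [])
        # describe_tasks is limited to 100 task ARNs per request
        for i in range(0, len(arns), batch_size):
            batch = arns[i:i + batch_size]
            response = ecs.describe_tasks(cluster=cluster, tasks=batch)
            yield from response.get("tasks", [])
```

Because the function is a generator, the caller can classify tasks as they arrive instead of holding the full task list in memory.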

EventBridge Schedule

{
  "ScheduleExpression": "rate(6 hours)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:orphan-scanner",
    "Input": "{\"dry_run\": false, \"max_runtime_hours\": 6}"
  }]
}

SNS Alert Format

{
    "subject": "Orphaned ECS Tasks Found",
    "message": {
        "total_running": 15,
        "orphans_found": 2,
        "tasks_stopped": 2,
        "orphans": [
            {
                "task_arn": "...",
                "reason": "Running longer than 6 hours",
                "runtime_hours": 8.5
            }
        ]
    }
}
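An alert in this shape could be assembled as in the sketch below. The function name `build_alert` and the JSON-encoded message body are assumptions; the returned keys match the `Subject` and `Message` parameters of SNS `publish`.

```python
import json


def build_alert(summary, orphans):
    """Build SNS publish arguments from a scan summary and orphan list."""
    message = {
        "total_running": summary["total_running"],
        "orphans_found": summary["orphans_found"],
        "tasks_stopped": summary["tasks_stopped"],
        "orphans": orphans,
    }
    return {
        "Subject": "Orphaned ECS Tasks Found",
        "Message": json.dumps(message, indent=2),
    }


# Publishing (requires boto3 and AWS credentials; topic_arn is a placeholder,
# since the source does not say where the topic ARN is configured):
#   import boto3
#   boto3.client("sns").publish(TopicArn=topic_arn, **build_alert(summary, orphans))
```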