orphan-scanner

Scans for orphaned ECS tasks: tasks that have been running too long, or whose DynamoDB state record no longer matches a running task.

Overview

| Property | Value |
| --- | --- |
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 128 MB |
| Settings class | OrphanScannerSettings |

Input Schema

{
    "dry_run": true,           # Default: true (safety default)
    "max_runtime_hours": 6     # Default: 6
}

Output Schema

{
    "success": true,
    "data": {
        "summary": {
            "total_running": 15,
            "orphans_found": 2,
            "tasks_stopped": 2,    # 0 if dry_run
            "dry_run": false
        },
        "orphans": [
            {
                "task_arn": "arn:aws:ecs:...:task/abc123",
                "reason": "Exceeded max runtime (8.5h > 6h)",
                "runtime_hours": 8.5
            },
            {
                "task_arn": "arn:aws:ecs:...:task/def456",
                "reason": "State shows 'completed' but task still running"
            }
        ]
    },
    "environment": "dev"
}

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| ECS_CLUSTER | Yes | - | ECS cluster name/ARN |
| RETRAINING_STATE_TABLE | Yes | - | DynamoDB table for retraining state (keyed by model_name) |
| TRADING_STATE_TABLE | Yes | - | DynamoDB table for trading state (keyed by strategy_id) |
| MAX_TASK_RUNTIME_HOURS | No | 6 | Max expected runtime in hours |
| DRY_RUN | No | true | Safety default: only report, do not stop tasks |
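
The table above could be loaded roughly as follows. This is a minimal sketch only: `ScannerConfig` and `_parse_bool` are hypothetical stand-ins, not the real `OrphanScannerSettings` class, and the set of strings accepted as "true" is an assumption.

```python
import os
from dataclasses import dataclass


def _parse_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False (assumed convention)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ScannerConfig:  # hypothetical stand-in for OrphanScannerSettings
    ecs_cluster: str
    retraining_state_table: str
    trading_state_table: str
    max_task_runtime_hours: float = 6.0
    dry_run: bool = True  # safety default: report only

    @classmethod
    def from_env(cls, env=None) -> "ScannerConfig":
        env = os.environ if env is None else env
        # Required variables are indexed directly, so a missing one raises KeyError.
        return cls(
            ecs_cluster=env["ECS_CLUSTER"],
            retraining_state_table=env["RETRAINING_STATE_TABLE"],
            trading_state_table=env["TRADING_STATE_TABLE"],
            max_task_runtime_hours=float(env.get("MAX_TASK_RUNTIME_HOURS", "6")),
            dry_run=_parse_bool(env.get("DRY_RUN", "true")),
        )
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise in tests.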

DynamoDB Tables

The orphan scanner cross-references running ECS tasks against two DynamoDB tables:

| Table | Env Var | Key | Used For |
| --- | --- | --- | --- |
| Retraining state | RETRAINING_STATE_TABLE | model_name | Checking retraining task state |
| Trading state | TRADING_STATE_TABLE | strategy_id | Checking trading task state |

Task-to-table mapping uses ECS task tags: tasks with a model_name tag are checked against the retraining table, and tasks with a strategy_id tag are checked against the trading table.
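
As an illustration of the tag-based routing, the sketch below picks a table and key for a task as returned by `describe_tasks` with tags included. The helper names are hypothetical, and giving `model_name` precedence when a task carries both tags is an assumption; the source does not say how that case is handled.

```python
def tags_to_dict(task: dict) -> dict:
    """Flatten the ECS tag list [{'key': ..., 'value': ...}] into a plain dict."""
    return {t["key"]: t["value"] for t in task.get("tags", [])}


def state_lookup_for(task: dict, retraining_table: str, trading_table: str):
    """Return (table_name, key) for the task, or None if it has neither tag."""
    tags = tags_to_dict(task)
    if "model_name" in tags:  # assumed to take precedence over strategy_id
        return retraining_table, {"model_name": tags["model_name"]}
    if "strategy_id" in tags:
        return trading_table, {"strategy_id": tags["strategy_id"]}
    return None
```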

Orphan Detection Criteria

A task is considered orphaned if any of these conditions is met:

  1. Runtime exceeded: The task has been running longer than max_runtime_hours (checked via startedAt from ECS)
  2. Retraining state mismatch: The state record shows completed or failed but the ECS task is still running
  3. Task ARN mismatch: The retraining state record points to a different task_arn than the running task
  4. Trading state mismatch: The trading state shows stopped, failed, or terminated but the task is still running
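
The four criteria above can be sketched as a single function. This is illustrative only: the `status` field name on the state records, the order in which the checks run, and the exact wording of reasons 3 and 4 are assumptions. `startedAt` is assumed to be a timezone-aware datetime, as boto3 returns it.

```python
from datetime import datetime, timezone

RETRAINING_DONE = {"completed", "failed"}
TRADING_DONE = {"stopped", "failed", "terminated"}


def orphan_reason(task, max_runtime_hours, retraining_state=None,
                  trading_state=None, now=None):
    """Return a reason string if the task looks orphaned, else None."""
    now = now or datetime.now(timezone.utc)

    # 1. Runtime exceeded
    started_at = task.get("startedAt")
    if started_at is not None:
        hours = (now - started_at).total_seconds() / 3600
        if hours > max_runtime_hours:
            return f"Exceeded max runtime ({hours:.1f}h > {max_runtime_hours:g}h)"

    if retraining_state is not None:
        # 2. Retraining state mismatch ("status" is an assumed field name)
        status = retraining_state.get("status")
        if status in RETRAINING_DONE:
            return f"State shows '{status}' but task still running"
        # 3. Task ARN mismatch
        recorded = retraining_state.get("task_arn")
        if recorded and recorded != task["taskArn"]:
            return "State record points to a different task_arn"

    # 4. Trading state mismatch
    if trading_state is not None:
        status = trading_state.get("status")
        if status in TRADING_DONE:
            return f"Trading state shows '{status}' but task still running"

    return None
```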

Batch Processing

  • ECS list_tasks is paginated (it returns at most 100 task ARNs per API call)
  • describe_tasks processes in batches of 100 (ECS API limit), with include=["TAGS"] to read task tags
  • Tags extracted: model_name, strategy_id
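
The pagination and batching described above might look like the following sketch. It assumes `ecs` is a boto3 ECS client (or any object with the same `get_paginator` and `describe_tasks` methods); the function names are hypothetical.

```python
def list_running_task_arns(ecs, cluster):
    """Collect all RUNNING task ARNs, following list_tasks pagination."""
    arns = []
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=cluster, desiredStatus="RUNNING"):
        arns.extend(page["taskArns"])
    return arns


def describe_in_batches(ecs, cluster, task_arns, batch_size=100):
    """Describe tasks in chunks of at most batch_size, requesting tags."""
    tasks = []
    for i in range(0, len(task_arns), batch_size):
        batch = task_arns[i:i + batch_size]
        resp = ecs.describe_tasks(cluster=cluster, tasks=batch, include=["TAGS"])
        tasks.extend(resp.get("tasks", []))
    return tasks
```

Taking the client as a parameter, rather than creating it inside the functions, makes it straightforward to substitute a stub in unit tests.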

Cleanup Logic

When dry_run=false:

  1. For each identified orphan, calls ecs.stop_task() with reason "Orphan scan: {reason}"
  2. Tracks count of successfully stopped tasks
  3. Failures to stop individual tasks are logged but do not halt processing of remaining orphans
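
A minimal sketch of this cleanup loop, assuming `ecs` is a boto3 ECS client and each orphan is a dict with `task_arn` and `reason` keys as in the output schema. The function name `stop_orphans` is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def stop_orphans(ecs, cluster, orphans, dry_run=True):
    """Stop each orphan unless dry_run; return how many were stopped."""
    if dry_run:
        return 0  # report only, nothing is stopped
    stopped = 0
    for orphan in orphans:
        try:
            ecs.stop_task(
                cluster=cluster,
                task=orphan["task_arn"],
                reason=f"Orphan scan: {orphan['reason']}",
            )
            stopped += 1
        except Exception:
            # Log and continue so one failure does not block the remaining orphans.
            logger.exception("Failed to stop task %s", orphan["task_arn"])
    return stopped
```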

CloudWatch Metrics

Namespace suffix: OrphanScanning

| Metric | Dimensions | Description |
| --- | --- | --- |
| RunningTasksScanned | Environment | Total tasks scanned |
| OrphanTasksFound | Environment | Orphaned tasks detected |
| OrphanTasksStopped | Environment | Tasks actually stopped |
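
For illustration, the three metrics could be assembled into a `MetricData` payload for CloudWatch `put_metric_data` as sketched below. The helper name, the mapping from summary fields to metric names, and the `Count` unit are assumptions; the namespace prefix that precedes `OrphanScanning` is not given in this document, so the namespace is left to the caller.

```python
def build_metric_data(summary: dict, environment: str) -> list:
    """Build the MetricData list for cloudwatch.put_metric_data (sketch)."""
    dimensions = [{"Name": "Environment", "Value": environment}]
    values = {
        "RunningTasksScanned": summary["total_running"],
        "OrphanTasksFound": summary["orphans_found"],
        "OrphanTasksStopped": summary["tasks_stopped"],
    }
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value": float(value),
            "Unit": "Count",  # assumed unit
        }
        for name, value in values.items()
    ]
```

The result would be passed as `cloudwatch.put_metric_data(Namespace=..., MetricData=build_metric_data(summary, env))`, with the namespace supplied by the deployment's configuration.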

SNS Alert

When orphans are found and alert_publisher is enabled, an SNS alert is sent:

  • Subject: [{ENV}] Orphan Tasks {Detected (DRY RUN) | Stopped}: {count} tasks
  • Body includes environment, cluster, orphan count, dry-run status
  • Lists up to 10 orphan task short IDs with their reasons
  • Severity: MEDIUM, alert type: orphan_tasks
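
The alert format above can be sketched as a small builder. The function name, the exact body layout, and treating the last path segment of the task ARN as the "short ID" are assumptions; only the subject pattern and the 10-item cap come from the description above.

```python
def build_alert(environment, cluster, orphans, dry_run, max_listed=10):
    """Return (subject, body) for the orphan-task alert (illustrative layout)."""
    action = "Detected (DRY RUN)" if dry_run else "Stopped"
    subject = f"[{environment}] Orphan Tasks {action}: {len(orphans)} tasks"

    lines = [
        f"Environment: {environment}",
        f"Cluster: {cluster}",
        f"Orphans: {len(orphans)}",
        f"Dry run: {dry_run}",
        "",
    ]
    for orphan in orphans[:max_listed]:
        # Assumed short ID: the segment after the final "/" in the ARN.
        short_id = orphan["task_arn"].rsplit("/", 1)[-1]
        lines.append(f"- {short_id}: {orphan['reason']}")
    return subject, "\n".join(lines)
```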

EventBridge Schedule

{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:orphan-scanner",
    "Input": "{\"dry_run\": false, \"max_runtime_hours\": 6}"
  }]
}