orphan-scanner
Scans for orphaned ECS tasks that have been running too long or have no corresponding DynamoDB state record.
Overview
| Property | Value |
|---|---|
| Trigger | EventBridge scheduled event |
| Runtime | Python 3.11 |
| Timeout | 60 seconds |
| Memory | 128 MB |
| Settings class | OrphanScannerSettings |
Input Schema
Output Schema
{
  "success": true,
  "data": {
    "summary": {
      "total_running": 15,
      "orphans_found": 2,
      "tasks_stopped": 2, # 0 if dry_run
      "dry_run": false
    },
    "orphans": [
      {
        "task_arn": "arn:aws:ecs:...:task/abc123",
        "reason": "Exceeded max runtime (8.5h > 6h)",
        "runtime_hours": 8.5
      },
      {
        "task_arn": "arn:aws:ecs:...:task/def456",
        "reason": "State shows 'completed' but task still running"
      }
    ]
  },
  "environment": "dev"
}
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| ECS_CLUSTER | Yes | - | ECS cluster name/ARN |
| RETRAINING_STATE_TABLE | Yes | - | DynamoDB table for retraining state (keyed by model_name) |
| TRADING_STATE_TABLE | Yes | - | DynamoDB table for trading state (keyed by strategy_id) |
| MAX_TASK_RUNTIME_HOURS | No | 6 | Max expected runtime in hours |
| DRY_RUN | No | true | Safety default: only report, do not stop tasks |
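The defaults in the table can be illustrated with a small settings loader. This is only a sketch under assumptions: the real OrphanScannerSettings class is not shown on this page, so the field names, the `from_env` helper, and the set of strings treated as false are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OrphanScannerSettings:
    """Hypothetical stand-in for the real settings class."""

    ecs_cluster: str
    retraining_state_table: str
    trading_state_table: str
    max_task_runtime_hours: float = 6.0
    dry_run: bool = True  # safety default: report only

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "OrphanScannerSettings":
        # Assumption: only an explicit false-like value turns dry-run off.
        dry_run = env.get("DRY_RUN", "true").strip().lower() not in {"false", "0", "no"}
        return cls(
            # Required variables raise KeyError when they are missing.
            ecs_cluster=env["ECS_CLUSTER"],
            retraining_state_table=env["RETRAINING_STATE_TABLE"],
            trading_state_table=env["TRADING_STATE_TABLE"],
            max_task_runtime_hours=float(env.get("MAX_TASK_RUNTIME_HOURS", "6")),
            dry_run=dry_run,
        )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the process environment.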
DynamoDB Tables
The orphan scanner cross-references running ECS tasks against two DynamoDB tables:
| Table | Env Var | Key | Used For |
|---|---|---|---|
| Retraining state | RETRAINING_STATE_TABLE | model_name | Checking retraining task state |
| Trading state | TRADING_STATE_TABLE | strategy_id | Checking trading task state |
Task-to-table mapping uses ECS task tags: tasks with a model_name tag are checked against the retraining table, tasks with a strategy_id tag are checked against the trading table.
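The tag-based mapping might look roughly like the sketch below. The function name `resolve_state_lookup` is hypothetical, and checking model_name before strategy_id when a task carries both tags is an assumption; the page does not say which takes precedence.

```python
from typing import Dict, List, Optional, Tuple


def resolve_state_lookup(
    tags: List[Dict[str, str]],
    retraining_table: str,
    trading_table: str,
) -> Optional[Tuple[str, str, str]]:
    """Return (table_name, key_name, key_value), or None if no known tag is set."""
    # ECS returns task tags as a list of {"key": ..., "value": ...} dicts.
    tag_map = {tag["key"]: tag["value"] for tag in tags}
    if "model_name" in tag_map:
        return retraining_table, "model_name", tag_map["model_name"]
    if "strategy_id" in tag_map:
        return trading_table, "strategy_id", tag_map["strategy_id"]
    # Untagged tasks have no state record to compare against.
    return None
```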
Orphan Detection Criteria
A task is considered orphaned if any of these conditions are met:
- Runtime exceeded: Running longer than max_runtime_hours (checked via startedAt from ECS)
- Retraining state mismatch: State record shows completed or failed but ECS task still running
- Task ARN mismatch: Retraining state record points to a different task_arn
- Trading state mismatch: Trading state shows stopped, failed, or terminated but task still running
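The four criteria above can be sketched as a single pure function. This is an illustration under assumptions, not the scanner's actual code: the function name, the `kind` argument, the `status` attribute on the state record, and the wording of the ARN-mismatch reason are hypothetical. The other reason strings follow the examples in the output schema.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

RETRAINING_TERMINAL = {"completed", "failed"}
TRADING_TERMINAL = {"stopped", "failed", "terminated"}


def orphan_reason(
    task: Dict[str, Any],
    state: Optional[Dict[str, Any]],
    kind: Optional[str],  # "retraining", "trading", or None (hypothetical)
    max_runtime_hours: float,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return a reason string if the task looks orphaned, else None."""
    now = now or datetime.now(timezone.utc)

    # 1. Runtime exceeded, based on the task's startedAt timestamp.
    started_at = task.get("startedAt")
    if started_at is not None:
        runtime_hours = (now - started_at).total_seconds() / 3600
        if runtime_hours > max_runtime_hours:
            return f"Exceeded max runtime ({runtime_hours:.1f}h > {max_runtime_hours:g}h)"

    if state is None:
        return None
    status = state.get("status")  # attribute name is an assumption

    if kind == "retraining":
        # 2. State is terminal but the task is still running.
        if status in RETRAINING_TERMINAL:
            return f"State shows '{status}' but task still running"
        # 3. State record points at a different task.
        recorded_arn = state.get("task_arn")
        if recorded_arn and recorded_arn != task["taskArn"]:
            return "State record points to a different task_arn"
    elif kind == "trading" and status in TRADING_TERMINAL:
        # 4. Trading state is terminal but the task is still running.
        return f"State shows '{status}' but task still running"
    return None
```

Keeping the check free of AWS calls means every branch can be tested with plain dictionaries and a fixed `now`.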
Batch Processing
- ECS list_tasks uses pagination (each API call returns at most one page of results)
- describe_tasks processes in batches of 100 (ECS API limit), with include=["TAGS"] to read task tags
- Tags extracted: model_name, strategy_id
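A sketch of the pagination and batching described above, assuming a boto3 ECS client is passed in as `ecs`. The helper names are hypothetical, and filtering on `desiredStatus="RUNNING"` is an assumption about how the scanner selects tasks.

```python
from typing import Any, Dict, Iterator, List

DESCRIBE_BATCH_SIZE = 100  # describe_tasks accepts at most 100 tasks per call


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def list_running_task_arns(ecs: Any, cluster: str) -> List[str]:
    """Collect task ARNs across all list_tasks pages."""
    arns: List[str] = []
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=cluster, desiredStatus="RUNNING"):
        arns.extend(page["taskArns"])
    return arns


def describe_with_tags(ecs: Any, cluster: str, task_arns: List[str]) -> List[Dict[str, Any]]:
    """Describe tasks in batches, requesting tags with each batch."""
    tasks: List[Dict[str, Any]] = []
    for batch in chunked(task_arns, DESCRIBE_BATCH_SIZE):
        response = ecs.describe_tasks(cluster=cluster, tasks=batch, include=["TAGS"])
        tasks.extend(response["tasks"])
    return tasks
```

In a Lambda handler the client would typically come from `boto3.client("ecs")`; passing it in as an argument keeps the helpers testable with a stub.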
Cleanup Logic
When dry_run=false:
- For each identified orphan, calls ecs.stop_task() with reason "Orphan scan: {reason}"
- Tracks count of successfully stopped tasks
- Failures to stop individual tasks are logged but do not halt processing of remaining orphans
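The cleanup loop described above could be sketched as follows, assuming a boto3 ECS client is passed in as `ecs` and each orphan is a dict shaped like the entries in the output schema. The function name `stop_orphans` is hypothetical.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def stop_orphans(ecs: Any, cluster: str, orphans: List[Dict[str, Any]], dry_run: bool) -> int:
    """Stop each orphan and return how many stop calls succeeded."""
    if dry_run:
        # Report-only mode: nothing is stopped, so the count stays at 0.
        return 0

    stopped = 0
    for orphan in orphans:
        try:
            ecs.stop_task(
                cluster=cluster,
                task=orphan["task_arn"],
                reason=f"Orphan scan: {orphan['reason']}",
            )
            stopped += 1
        except Exception:
            # A real handler would more likely catch botocore's ClientError;
            # either way the loop continues with the remaining orphans.
            logger.exception("Failed to stop task %s", orphan["task_arn"])
    return stopped
```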
CloudWatch Metrics
Namespace suffix: OrphanScanning
| Metric | Dimensions | Description |
|---|---|---|
| RunningTasksScanned | Environment | Total tasks scanned |
| OrphanTasksFound | Environment | Orphaned tasks detected |
| OrphanTasksStopped | Environment | Tasks actually stopped |
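For illustration, the three metrics could be assembled into a `MetricData` payload for CloudWatch's `put_metric_data` as sketched below. The helper name and the `Count` unit are assumptions. The namespace is left to the caller because only its suffix is documented here.

```python
from typing import Any, Dict, List


def build_metric_data(
    environment: str,
    running_scanned: int,
    orphans_found: int,
    tasks_stopped: int,
) -> List[Dict[str, Any]]:
    """Build one MetricData entry per metric, each with an Environment dimension."""
    dimensions = [{"Name": "Environment", "Value": environment}]
    values = {
        "RunningTasksScanned": running_scanned,
        "OrphanTasksFound": orphans_found,
        "OrphanTasksStopped": tasks_stopped,
    }
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value": float(value),
            "Unit": "Count",  # assumed unit
        }
        for name, value in values.items()
    ]
```

The result would be passed as `MetricData=` to a CloudWatch client's `put_metric_data` call, together with whatever full namespace the deployment uses.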
SNS Alert
When orphans are found and alert_publisher is enabled, an SNS alert is sent:
- Subject: [{ENV}] Orphan Tasks {Detected (DRY RUN) | Stopped}: {count} tasks
- Body includes environment, cluster, orphan count, dry-run status
- Lists up to 10 orphan task short IDs with their reasons
- Severity: MEDIUM, alert type: orphan_tasks
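The alert format above can be sketched as a formatting helper. Several details are assumptions rather than documented behavior: the helper name, upper-casing the environment in the subject, taking the short ID as the last path segment of the task ARN, and the exact body layout.

```python
from typing import Any, Dict, List, Tuple

MAX_LISTED_ORPHANS = 10


def build_orphan_alert(
    environment: str,
    cluster: str,
    orphans: List[Dict[str, Any]],
    dry_run: bool,
) -> Tuple[str, str]:
    """Return (subject, body) for the orphan-task alert."""
    action = "Detected (DRY RUN)" if dry_run else "Stopped"
    subject = f"[{environment.upper()}] Orphan Tasks {action}: {len(orphans)} tasks"

    lines = [
        f"Environment: {environment}",
        f"Cluster: {cluster}",
        f"Orphan count: {len(orphans)}",
        f"Dry run: {dry_run}",
        "",
    ]
    # Only the first 10 orphans are listed to keep the message short.
    for orphan in orphans[:MAX_LISTED_ORPHANS]:
        short_id = orphan["task_arn"].rsplit("/", 1)[-1]
        lines.append(f"- {short_id}: {orphan['reason']}")
    return subject, "\n".join(lines)
```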
EventBridge Schedule
{
  "ScheduleExpression": "rate(5 minutes)",
  "Targets": [{
    "Arn": "arn:aws:lambda:...:orphan-scanner",
    "Input": "{\"dry_run\": false, \"max_runtime_hours\": 6}"
  }]
}
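The target's Input arrives in the Lambda as the event payload. A handler might merge it with the environment defaults roughly as sketched below. The function name is hypothetical, and giving event values precedence over the environment defaults is an assumption that this page does not confirm.

```python
import json
from typing import Any, Dict, Optional, Tuple


def resolve_run_options(
    event: Optional[Dict[str, Any]],
    default_dry_run: bool = True,
    default_max_runtime_hours: float = 6.0,
) -> Tuple[bool, float]:
    """Return (dry_run, max_runtime_hours), preferring values from the event."""
    event = event or {}
    dry_run = bool(event.get("dry_run", default_dry_run))
    max_runtime_hours = float(event.get("max_runtime_hours", default_max_runtime_hours))
    return dry_run, max_runtime_hours


# The Input string from the schedule above, decoded as the Lambda would receive it.
scheduled_event = json.loads('{"dry_run": false, "max_runtime_hours": 6}')
options = resolve_run_options(scheduled_event)
```

Under this assumption, a schedule that sends `"dry_run": false` would stop tasks even though the DRY_RUN environment variable defaults to true.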
Related
- cleanup-resources - Immediate cleanup on workflow failure
- health-check - Service health monitoring