Debug Workflows Runbook
Procedures for debugging Step Functions workflows, tracing execution through DynamoDB, and correlating workflow state with service logs.
Step Functions Overview
TradAI uses AWS Step Functions to orchestrate multi-step workflows such as backtest execution, model retraining, and deployment pipelines. Each execution is tracked in the tradai-workflow-state-{ENV} DynamoDB table.
Tracing Workflows via DynamoDB
Key Table: tradai-workflow-state-{ENV}
This table stores the state of every workflow execution, keyed by job_id. It has two global secondary indexes (GSIs) for efficient querying:
| GSI Name | Partition Key | Sort Key | Use Case |
|---|---|---|---|
| status-created_at-index | status | created_at | Find workflows by status (e.g., all FAILED) |
| trace_id-index | trace_id | - | Trace a single execution across all services |
Look up a workflow by job ID:
aws dynamodb get-item \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--key '{"job_id": {"S": "JOB_ID"}}'
Find all failed workflows:
aws dynamodb query \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--index-name status-created_at-index \
--key-condition-expression "#status = :failed" \
--expression-attribute-names '{"#status": "status"}' \
--expression-attribute-values '{":failed": {"S": "FAILED"}}' \
--no-scan-index-forward \
--limit 10
Trace a workflow using trace_id:
# 1. Get the trace_id from the workflow state
aws dynamodb get-item \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--key '{"job_id": {"S": "JOB_ID"}}' \
--projection-expression "trace_id, job_id, #status, created_at, mlflow_run_id" \
--expression-attribute-names '{"#status": "status"}'
# 2. Use trace_id-index GSI to find all related records
aws dynamodb query \
--table-name tradai-workflow-state-${ENVIRONMENT} \
--index-name trace_id-index \
--key-condition-expression "trace_id = :tid" \
--expression-attribute-values '{":tid": {"S": "TRACE_ID_VALUE"}}'
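The two steps above can be chained into a single helper. This is a minimal sketch, not part of the existing tooling: the function name `trace_workflow` is hypothetical, and it assumes `ENVIRONMENT` is exported and that `trace_id` is stored as a string attribute, as the commands above suggest.

```shell
# Hypothetical helper: resolve a job's trace_id, then list every record sharing it.
trace_workflow() {
  local job_id="$1"
  local table="tradai-workflow-state-${ENVIRONMENT}"
  local trace_id

  # Step 1: read trace_id from the job's workflow state record.
  trace_id=$(aws dynamodb get-item \
    --table-name "$table" \
    --key "{\"job_id\": {\"S\": \"${job_id}\"}}" \
    --query 'Item.trace_id.S' \
    --output text) || return 1

  # With --output text the CLI prints "None" when the query result is null.
  if [ -z "$trace_id" ] || [ "$trace_id" = "None" ]; then
    echo "No trace_id for job ${job_id}; inspect the record by job_id instead" >&2
    return 1
  fi

  # Step 2: query the trace_id-index GSI for all related records.
  aws dynamodb query \
    --table-name "$table" \
    --index-name trace_id-index \
    --key-condition-expression "trace_id = :tid" \
    --expression-attribute-values "{\":tid\": {\"S\": \"${trace_id}\"}}"
}
```

Calling `trace_workflow JOB_ID` prints the query result, or exits non-zero with a message when the record has no trace_id.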
Step Functions Execution Debugging
List recent executions:
aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-${ENVIRONMENT}-backtest \
--status-filter FAILED \
--max-results 10
Get execution details:
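One way to fetch these details is `aws stepfunctions describe-execution`, which returns an execution's status, timestamps, and input, plus `error` and `cause` for failed runs. The sketch below assumes the ARN comes from the `list-executions` output; the wrapper name `execution_summary` and the selected fields are illustrative choices, not an established convention.

```shell
# Hypothetical wrapper: print the key fields of one execution.
execution_summary() {
  aws stepfunctions describe-execution \
    --execution-arn "$1" \
    --query '{status: status, started: startDate, stopped: stopDate, error: error, cause: cause, input: input}' \
    --output json
}

# Usage (EXECUTION_ARN taken from list-executions):
#   execution_summary "$EXECUTION_ARN"
```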
Get execution history (step-by-step):
aws stepfunctions get-execution-history \
--execution-arn "$EXECUTION_ARN" \
--reverse-order \
--max-results 20
Correlating Step Functions with DynamoDB State
Each Step Functions execution writes its trace_id to the DynamoDB workflow state. Use this to correlate:
- Get trace_id from the Step Functions execution input.
- Look up all workflow records with that trace_id.
- Cross-reference with CloudWatch logs using the same trace_id.
- Check the MLflow run (if mlflow_run_id is present in the workflow state).
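The correlation steps above might look like the following sketch. Several details are assumptions rather than facts from this runbook: the function names are hypothetical, the execution input is assumed to carry a top-level string field `trace_id`, services are assumed to log the trace_id as plain text in the `/ecs/tradai/${ENVIRONMENT}` log group, and the `mlflow` CLI is assumed to be installed with `MLFLOW_TRACKING_URI` set.

```shell
# Step 1: extract trace_id from the execution input.
# sed keeps the sketch dependency-free; jq -r '.trace_id' is more robust if available.
sfn_trace_id() {
  aws stepfunctions describe-execution \
    --execution-arn "$1" \
    --query 'input' \
    --output text |
    sed -n 's/.*"trace_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
}

# Step 2: list all workflow records that share the trace_id.
records_for_trace() {
  aws dynamodb query \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --index-name trace_id-index \
    --key-condition-expression "trace_id = :tid" \
    --expression-attribute-values "{\":tid\": {\"S\": \"$1\"}}"
}

# Step 3: find log events containing the trace_id.
# The value is double-quoted because filter terms with hyphens must be quoted.
logs_for_trace() {
  aws logs filter-log-events \
    --log-group-name "/ecs/tradai/${ENVIRONMENT}" \
    --filter-pattern "\"$1\"" \
    --max-items 50
}

# Step 4: describe the MLflow run recorded for a job, if any.
mlflow_run_for_job() {
  local run_id
  run_id=$(aws dynamodb get-item \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --key "{\"job_id\": {\"S\": \"$1\"}}" \
    --query 'Item.mlflow_run_id.S' \
    --output text) || return 1
  if [ -z "$run_id" ] || [ "$run_id" = "None" ]; then
    echo "No mlflow_run_id recorded for job $1" >&2
    return 0
  fi
  mlflow runs describe --run-id "$run_id"
}

# Usage:
#   trace_id=$(sfn_trace_id "$EXECUTION_ARN")
#   records_for_trace "$trace_id"
#   logs_for_trace "$trace_id"
#   mlflow_run_for_job JOB_ID
```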
Common Workflow Issues
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Workflow stuck in RUNNING | ECS task failed silently | Check ECS task status and logs |
| Workflow FAILED immediately | Validation error in input | Check Step Functions execution input |
| Workflow FAILED at ECS step | Container crashed or timed out | Check /ecs/tradai/{env} logs |
| DynamoDB state shows RUNNING but Step Functions shows FAILED | Status update Lambda failed | Check /aws/lambda/tradai-update-status-{env} logs |
| Missing trace_id in DynamoDB | Older workflow or bug | Query by job_id directly instead |
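The status-mismatch row in the table can be checked directly by reading both sources side by side. The helper below is a hypothetical sketch: `check_status_drift` is not an existing tool, and it assumes the DynamoDB `status` attribute uses the same vocabulary as Step Functions (for example RUNNING and FAILED), which may not hold for every state.

```shell
# Hypothetical check: compare the DynamoDB status with the Step Functions status.
# Returns 0 when they match, 1 when they differ, 2 when a lookup fails.
check_status_drift() {
  local job_id="$1"
  local execution_arn="$2"
  local ddb_status sfn_status

  ddb_status=$(aws dynamodb get-item \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --key "{\"job_id\": {\"S\": \"${job_id}\"}}" \
    --query 'Item.status.S' \
    --output text) || return 2

  sfn_status=$(aws stepfunctions describe-execution \
    --execution-arn "$execution_arn" \
    --query 'status' \
    --output text) || return 2

  echo "dynamodb=${ddb_status} stepfunctions=${sfn_status}"
  [ "$ddb_status" = "$sfn_status" ]
}
```

A non-zero exit from `check_status_drift JOB_ID "$EXECUTION_ARN"` suggests the status update did not land, which points toward the update-status Lambda logs listed in the table.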
Verification Checklist
After debugging a workflow issue:
- [ ] Root cause identified (Step Functions history reviewed)
- [ ] DynamoDB workflow state consistent with actual outcome
- [ ] trace_id correlation confirmed across services
- [ ] Any stuck workflows cancelled or updated
- [ ] CloudWatch logs reviewed for error patterns
- [ ] Incident documented with timeline
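For the "stuck workflows cancelled" item, a Standard workflow execution can be stopped with `aws stepfunctions stop-execution`. The sketch below is illustrative only: the helper name `stop_stuck_execution`, the `ManualStop` error code, and the default cause text are hypothetical, and it does not touch the DynamoDB record, which may need a separate update.

```shell
# Hypothetical helper: stop a stuck execution and record the reason.
# $1 = execution ARN, $2 = optional human-readable cause.
stop_stuck_execution() {
  aws stepfunctions stop-execution \
    --execution-arn "$1" \
    --error "ManualStop" \
    --cause "${2:-Stopped manually during workflow debugging}"
}

# Usage:
#   stop_stuck_execution "$EXECUTION_ARN" "ECS task exited without reporting status"
```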