Debug Workflows Runbook

Procedures for debugging Step Functions workflows, tracing execution through DynamoDB, and correlating workflow state with service logs.

Step Functions Overview

TradAI uses AWS Step Functions to orchestrate multi-step workflows such as backtest execution, model retraining, and deployment pipelines. Each execution is tracked in the `tradai-workflow-state-{ENV}` DynamoDB table.

Tracing Workflows via DynamoDB

Key Table: `tradai-workflow-state-{ENV}`

This table stores the state of every workflow execution. Two global secondary indexes (GSIs) support efficient querying:

| GSI Name | Partition Key | Sort Key | Use Case |
|---|---|---|---|
| `status-created_at-index` | `status` | `created_at` | Find workflows by status (e.g., all FAILED) |
| `trace_id-index` | `trace_id` | - | Trace a single execution across all services |

Look up a workflow by job ID:

aws dynamodb get-item \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --key '{"job_id": {"S": "JOB_ID"}}'

Find all failed workflows:

aws dynamodb query \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --index-name status-created_at-index \
  --key-condition-expression "#status = :failed" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":failed": {"S": "FAILED"}}' \
  --no-scan-index-forward \
  --limit 10
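The same query is often needed for other statuses during an investigation. A minimal sketch of a reusable wrapper follows; the function name `workflows_by_status` is hypothetical, and it assumes `ENVIRONMENT` is exported and that `job_id` is stored as a string (`S`) attribute, as the key in the lookup above suggests.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Prints the job IDs of the most recent workflows in a given status.
# Assumes: ENVIRONMENT is set; job_id is a string (S) attribute.
workflows_by_status() {
  local status="$1" limit="${2:-10}"
  aws dynamodb query \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --index-name status-created_at-index \
    --key-condition-expression "#status = :s" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values "{\":s\": {\"S\": \"${status}\"}}" \
    --no-scan-index-forward \
    --limit "$limit" \
    --query 'Items[].job_id.S' \
    --output text
}
```

For example, `workflows_by_status RUNNING 20` would list up to 20 job IDs, newest first, which is a quick way to spot workflows that may be stuck.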

Trace a workflow using `trace_id`:

# 1. Get the trace_id from the workflow state
aws dynamodb get-item \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --key '{"job_id": {"S": "JOB_ID"}}' \
  --projection-expression "trace_id, job_id, #status, created_at, mlflow_run_id" \
  --expression-attribute-names '{"#status": "status"}'

# 2. Use trace_id-index GSI to find all related records
aws dynamodb query \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --index-name trace_id-index \
  --key-condition-expression "trace_id = :tid" \
  --expression-attribute-values '{":tid": {"S": "TRACE_ID_VALUE"}}'
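The two steps above can be chained so the `trace_id` does not have to be copied by hand. This is a sketch under stated assumptions: the function name `trace_records_for_job` is hypothetical, `ENVIRONMENT` is assumed to be exported, and `trace_id` is assumed to be a string (`S`) attribute, matching the query above.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Resolves a job's trace_id, then lists every record sharing it.
# Assumes: ENVIRONMENT is set; trace_id is a string (S) attribute.
trace_records_for_job() {
  local job_id="$1" table trace_id
  table="tradai-workflow-state-${ENVIRONMENT}"

  # Step 1: read only the trace_id from the job's record.
  trace_id=$(aws dynamodb get-item \
    --table-name "$table" \
    --key "{\"job_id\": {\"S\": \"${job_id}\"}}" \
    --query 'Item.trace_id.S' \
    --output text) || return 1

  # The CLI prints "None" in text mode when the attribute is absent.
  if [ -z "$trace_id" ] || [ "$trace_id" = "None" ]; then
    echo "no trace_id stored for job ${job_id}" >&2
    return 1
  fi

  # Step 2: query the GSI for all records with that trace_id.
  aws dynamodb query \
    --table-name "$table" \
    --index-name trace_id-index \
    --key-condition-expression "trace_id = :tid" \
    --expression-attribute-values "{\":tid\": {\"S\": \"${trace_id}\"}}"
}
```

Returning non-zero when no `trace_id` is stored makes the missing-trace case explicit, so a caller can fall back to the `job_id` lookup.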

Step Functions Execution Debugging

List recent executions:

aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-${ENVIRONMENT}-backtest \
  --status-filter FAILED \
  --max-results 10
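The commands that follow expect an `EXECUTION_ARN` variable. One way to populate it is to take the newest failed execution from the listing, since `list-executions` returns the most recent execution first. The sketch below assumes `AWS_REGION`, `ACCOUNT_ID`, and `ENVIRONMENT` are exported; the function name is hypothetical.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Prints the ARN of the most recent FAILED backtest execution.
# Assumes: AWS_REGION, ACCOUNT_ID and ENVIRONMENT are set.
latest_failed_execution_arn() {
  aws stepfunctions list-executions \
    --state-machine-arn "arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-${ENVIRONMENT}-backtest" \
    --status-filter FAILED \
    --max-results 1 \
    --query 'executions[0].executionArn' \
    --output text
}
```

Usage: `EXECUTION_ARN=$(latest_failed_execution_arn)`. If no failed execution exists, the CLI prints `None` in text mode, so check the value before passing it on.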

Get execution details:

aws stepfunctions describe-execution \
  --execution-arn "$EXECUTION_ARN"

Get execution history (step-by-step):

aws stepfunctions get-execution-history \
  --execution-arn "$EXECUTION_ARN" \
  --reverse-order \
  --max-results 20
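Long histories are easier to read when narrowed to the events that signal a problem. The sketch below filters client-side with a JMESPath `--query`, relying on Step Functions event type names such as `TaskFailed`, `ExecutionFailed`, `TaskTimedOut`, and `ExecutionAborted`. The function name `failure_events` is hypothetical.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Prints id, type and timestamp for failure-related history events,
# newest first. Filtering happens client-side via JMESPath.
failure_events() {
  local execution_arn="$1"
  aws stepfunctions get-execution-history \
    --execution-arn "$execution_arn" \
    --reverse-order \
    --query "events[?contains(type, 'Failed') || contains(type, 'TimedOut') || contains(type, 'Aborted')].[id, type, timestamp]" \
    --output text
}
```

Usage: `failure_events "$EXECUTION_ARN"`. Because no result limit is passed, the CLI pages through the whole history, which can take a while for very long executions.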

Correlating Step Functions with DynamoDB State

Each Step Functions execution writes its `trace_id` to the DynamoDB workflow state, so the same ID links the execution to its DynamoDB records and service logs. To correlate them:

Get trace_id from Step Functions execution input:

aws stepfunctions describe-execution \
  --execution-arn "$EXECUTION_ARN" \
  --query 'input' \
  --output text | jq -r '.trace_id'

Look up all workflow records with that trace_id:

aws dynamodb query \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --index-name trace_id-index \
  --key-condition-expression "trace_id = :tid" \
  --expression-attribute-values '{":tid": {"S": "TRACE_ID_VALUE"}}'

Cross-reference with CloudWatch logs using the same trace_id:

# Log group: /ecs/tradai/{env}
fields @timestamp, @message
| filter @message like /TRACE_ID_VALUE/
| sort @timestamp asc
| limit 100

Check MLflow run (if mlflow_run_id is present in the workflow state):

curl -s "http://localhost:5001/api/2.0/mlflow/runs/get?run_id=MLFLOW_RUN_ID" | jq .
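The CloudWatch Logs Insights query from the cross-reference step can also be run from the terminal with `aws logs start-query` and `aws logs get-query-results`. The following is a sketch, not an established TradAI script: the function name `trace_logs` is hypothetical, the log group is assumed to be `/ecs/tradai/${ENVIRONMENT}`, and the default 24-hour lookback is an arbitrary choice.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Runs the trace_id Logs Insights query and prints the results as JSON.
# Assumes: ENVIRONMENT is set and maps to the {env} part of the log group.
trace_logs() {
  local trace_id="$1" hours="${2:-24}"
  local start end query_id status
  end=$(date +%s)
  start=$((end - hours * 3600))

  # Logs Insights queries are asynchronous: start-query returns an ID.
  query_id=$(aws logs start-query \
    --log-group-name "/ecs/tradai/${ENVIRONMENT}" \
    --start-time "$start" \
    --end-time "$end" \
    --query-string "fields @timestamp, @message | filter @message like /${trace_id}/ | sort @timestamp asc | limit 100" \
    --query 'queryId' \
    --output text) || return 1

  # Poll until the query reaches a terminal state.
  while :; do
    status=$(aws logs get-query-results \
      --query-id "$query_id" --query 'status' --output text) || return 1
    case "$status" in
      Complete) break ;;
      Failed|Cancelled|Timeout) echo "query ended: ${status}" >&2; return 1 ;;
    esac
    sleep 2
  done

  aws logs get-query-results --query-id "$query_id"
}
```

Usage: `trace_logs TRACE_ID_VALUE 48` searches the last 48 hours. The `trace_id` is interpolated into a regular expression, so an ID containing `/` or other regex metacharacters would need escaping first.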

Common Workflow Issues

| Symptom | Likely Cause | Investigation |
|---|---|---|
| Workflow stuck in RUNNING | ECS task failed silently | Check ECS task status and logs |
| Workflow FAILED immediately | Validation error in input | Check Step Functions execution input |
| Workflow FAILED at ECS step | Container crashed or timed out | Check `/ecs/tradai/{env}` logs |
| DynamoDB state shows RUNNING but Step Functions shows FAILED | Status update Lambda failed | Check `/aws/lambda/tradai-update-status-{env}` logs |
| Missing trace_id in DynamoDB | Older workflow or bug | Query by `job_id` directly instead |
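The mismatch case in the table (DynamoDB says RUNNING while Step Functions says FAILED) can be checked directly by reading both statuses side by side. A minimal sketch, assuming `ENVIRONMENT` is exported, `status` is a string (`S`) attribute, and the job ID and execution ARN are both already known; the function name `compare_status` is hypothetical.

```shell
# Hypothetical helper, not part of the TradAI tooling.
# Prints both statuses; returns 0 if they match, 1 if they differ,
# 2 if either lookup fails.
# Assumes: ENVIRONMENT is set; status is a string (S) attribute.
compare_status() {
  local job_id="$1" execution_arn="$2" ddb_status sfn_status

  ddb_status=$(aws dynamodb get-item \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --key "{\"job_id\": {\"S\": \"${job_id}\"}}" \
    --query 'Item.status.S' \
    --output text) || return 2

  sfn_status=$(aws stepfunctions describe-execution \
    --execution-arn "$execution_arn" \
    --query 'status' \
    --output text) || return 2

  echo "dynamodb=${ddb_status} stepfunctions=${sfn_status}"
  [ "$ddb_status" = "$sfn_status" ]
}
```

Step Functions reports RUNNING, SUCCEEDED, FAILED, TIMED_OUT, ABORTED, or PENDING_REDRIVE. The DynamoDB status values are application-defined, so a literal comparison may flag harmless naming differences; treat a mismatch as a prompt to check the status update Lambda logs rather than as proof of a fault.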

Verification Checklist

After debugging a workflow issue:

[ ] Root cause identified (Step Functions history reviewed)
[ ] DynamoDB workflow state consistent with actual outcome
[ ] trace_id correlation confirmed across services
[ ] Any stuck workflows cancelled or updated
[ ] CloudWatch logs reviewed for error patterns
[ ] Incident documented with timeline