
Debug Workflows Runbook

Procedures for debugging Step Functions workflows, tracing execution through DynamoDB, and correlating workflow state with service logs.

Step Functions Overview

TradAI uses AWS Step Functions to orchestrate multi-step workflows such as backtest execution, model retraining, and deployment pipelines. Each execution is tracked in the tradai-workflow-state-{ENV} DynamoDB table.


Tracing Workflows via DynamoDB

Key Table: tradai-workflow-state-{ENV}

This table stores the state of every workflow execution. Two global secondary indexes (GSIs) support efficient querying:

| GSI Name | Partition Key | Sort Key | Use Case |
|---|---|---|---|
| status-created_at-index | status | created_at | Find workflows by status (e.g., all FAILED) |
| trace_id-index | trace_id | - | Trace a single execution across all services |

Look up a workflow by job ID:

aws dynamodb get-item \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --key '{"job_id": {"S": "JOB_ID"}}'

Find all failed workflows:

aws dynamodb query \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --index-name status-created_at-index \
  --key-condition-expression "#status = :failed" \
  --expression-attribute-names '{"#status": "status"}' \
  --expression-attribute-values '{":failed": {"S": "FAILED"}}' \
  --no-scan-index-forward \
  --limit 10
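
The same query can be wrapped in a small helper so that other statuses (for example RUNNING, when hunting for stuck workflows) can be listed without retyping the expression. This is a minimal sketch, not part of the TradAI tooling: the function names `status_values` and `workflows_by_status` are hypothetical, and it assumes `job_id` and `trace_id` are stored as DynamoDB strings, as the other commands in this runbook suggest.

```shell
# Hypothetical helper: build the --expression-attribute-values JSON
# for a given status value.
status_values() {
  printf '{":s": {"S": "%s"}}' "$1"
}

# Hypothetical helper: list the newest workflows in a given status.
# Usage: workflows_by_status STATUS [LIMIT]
# Prints one "job_id<TAB>trace_id" row per workflow.
workflows_by_status() {
  local status="$1" limit="${2:-10}"
  aws dynamodb query \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --index-name status-created_at-index \
    --key-condition-expression "#status = :s" \
    --expression-attribute-names '{"#status": "status"}' \
    --expression-attribute-values "$(status_values "$status")" \
    --no-scan-index-forward \
    --limit "$limit" \
    --query 'Items[].[job_id.S, trace_id.S]' \
    --output text
}
```

For example, `workflows_by_status RUNNING 25` would list the 25 most recently created workflows that are still marked RUNNING. Descending order is requested with the `--no-scan-index-forward` flag, since the AWS CLI treats this option as a boolean switch rather than a value.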

Trace a workflow using trace_id:

# 1. Get the trace_id from the workflow state
aws dynamodb get-item \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --key '{"job_id": {"S": "JOB_ID"}}' \
  --projection-expression "trace_id, job_id, #status, created_at, mlflow_run_id" \
  --expression-attribute-names '{"#status": "status"}'

# 2. Use trace_id-index GSI to find all related records
aws dynamodb query \
  --table-name tradai-workflow-state-${ENVIRONMENT} \
  --index-name trace_id-index \
  --key-condition-expression "trace_id = :tid" \
  --expression-attribute-values '{":tid": {"S": "TRACE_ID_VALUE"}}'

Step Functions Execution Debugging

List recent executions:

aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:tradai-${ENVIRONMENT}-backtest \
  --status-filter FAILED \
  --max-results 10

Get execution details:

aws stepfunctions describe-execution \
  --execution-arn $EXECUTION_ARN

Get execution history (step-by-step):

aws stepfunctions get-execution-history \
  --execution-arn $EXECUTION_ARN \
  --reverse-order \
  --max-results 20
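
Execution history is verbose, and usually only the failure events matter at first. The sketch below narrows the history to events whose type ends in `Failed`, `TimedOut`, or `Aborted` (for example `TaskFailed` or `ExecutionFailed`). The function names are hypothetical, and the filter is a simple naming heuristic rather than an exhaustive list of Step Functions event types.

```shell
# Hypothetical helper: read "id<TAB>type" rows on stdin and keep only
# events whose type name indicates a failure, timeout, or abort.
filter_failure_events() {
  local id type
  while read -r id type; do
    case "$type" in
      *Failed|*TimedOut|*Aborted) printf '%s\t%s\n' "$id" "$type" ;;
    esac
  done
}

# Hypothetical helper: print failure events from the 20 most recent
# history events of an execution, newest first.
# Usage: failure_events EXECUTION_ARN
failure_events() {
  aws stepfunctions get-execution-history \
    --execution-arn "$1" \
    --reverse-order \
    --max-results 20 \
    --query 'events[].[id, type]' \
    --output text | filter_failure_events
}
```

Once `failure_events "$EXECUTION_ARN"` has identified the event id of interest, the full event (including its error and cause details) can be read from the unfiltered history output.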

Correlating Step Functions with DynamoDB State

Each Step Functions execution writes its trace_id to the DynamoDB workflow state. Use it to correlate an execution with its workflow records, service logs, and MLflow run:

  1. Get trace_id from Step Functions execution input:

    aws stepfunctions describe-execution \
      --execution-arn $EXECUTION_ARN \
      --query 'input' \
      --output text | jq -r '.trace_id'
    

  2. Look up all workflow records with that trace_id:

    aws dynamodb query \
      --table-name tradai-workflow-state-${ENVIRONMENT} \
      --index-name trace_id-index \
      --key-condition-expression "trace_id = :tid" \
      --expression-attribute-values '{":tid": {"S": "TRACE_ID_VALUE"}}'
    

  3. Cross-reference with CloudWatch logs using the same trace_id:

    # Log group: /ecs/tradai/{env}
    fields @timestamp, @message
    | filter @message like /TRACE_ID_VALUE/
    | sort @timestamp asc
    | limit 100
    

  4. Check MLflow run (if mlflow_run_id is present in the workflow state):

    curl -s "http://localhost:5001/api/2.0/mlflow/runs/get?run_id=MLFLOW_RUN_ID" | jq
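
Steps 1 and 2 above can be chained so that a single execution ARN yields the matching workflow records. The following is a sketch under assumptions: the function names are hypothetical, `jq` is assumed to be installed (as in step 1), and `status` is assumed to be stored as a DynamoDB string. It also stops early when the execution input carries no trace_id, which is the case the runbook describes for older workflows.

```shell
# Hypothetical helper: build the attribute-values JSON for a trace_id.
trace_values() {
  printf '{":tid": {"S": "%s"}}' "$1"
}

# Hypothetical helper: read trace_id from the execution input (step 1).
trace_id_for_execution() {
  aws stepfunctions describe-execution \
    --execution-arn "$1" \
    --query 'input' \
    --output text | jq -r '.trace_id'
}

# Hypothetical helper: list "job_id<TAB>status" for a trace_id (step 2).
records_for_trace() {
  aws dynamodb query \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --index-name trace_id-index \
    --key-condition-expression "trace_id = :tid" \
    --expression-attribute-values "$(trace_values "$1")" \
    --query 'Items[].[job_id.S, status.S]' \
    --output text
}

# Usage: correlate_execution EXECUTION_ARN
correlate_execution() {
  local tid
  tid="$(trace_id_for_execution "$1")"
  # jq -r prints "null" when the input has no trace_id field.
  if [ -z "$tid" ] || [ "$tid" = "null" ]; then
    echo "no trace_id in execution input; query by job_id instead" >&2
    return 1
  fi
  records_for_trace "$tid"
}
```

The trace_id that `correlate_execution "$EXECUTION_ARN"` resolves is the same value to paste into the CloudWatch Logs Insights filter in step 3.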
    


Common Workflow Issues

| Symptom | Likely Cause | Investigation |
|---|---|---|
| Workflow stuck in RUNNING | ECS task failed silently | Check ECS task status and logs |
| Workflow FAILED immediately | Validation error in input | Check Step Functions execution input |
| Workflow FAILED at ECS step | Container crashed or timed out | Check /ecs/tradai/{env} logs |
| DynamoDB state shows RUNNING but Step Functions shows FAILED | Status update Lambda failed | Check /aws/lambda/tradai-update-status-{env} logs |
| Missing trace_id in DynamoDB | Older workflow or bug | Query by job_id directly instead |
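
The mismatch between DynamoDB and Step Functions status can be checked directly once both the job ID and the execution ARN are known. This is an illustrative sketch only: `check_state_drift` is a hypothetical name, and it assumes the table stores the same status strings that Step Functions reports (RUNNING, FAILED, and so on), which should be confirmed before relying on it.

```shell
# Hypothetical helper: compare the status recorded in DynamoDB with the
# status Step Functions reports for the same workflow.
# Usage: check_state_drift JOB_ID EXECUTION_ARN
# Returns 0 when the two agree, 1 when they differ.
check_state_drift() {
  local job_id="$1" execution_arn="$2" ddb sfn
  # Simple interpolation: assumes job_id contains no quote characters.
  ddb="$(aws dynamodb get-item \
    --table-name "tradai-workflow-state-${ENVIRONMENT}" \
    --key "{\"job_id\": {\"S\": \"${job_id}\"}}" \
    --projection-expression "#status" \
    --expression-attribute-names '{"#status": "status"}' \
    --query 'Item.status.S' \
    --output text)"
  sfn="$(aws stepfunctions describe-execution \
    --execution-arn "$execution_arn" \
    --query 'status' \
    --output text)"
  if [ "$ddb" = "$sfn" ]; then
    echo "OK ${job_id}: ${sfn}"
  else
    # A missing item shows up here as dynamodb=None.
    echo "DRIFT ${job_id}: dynamodb=${ddb} stepfunctions=${sfn}"
    return 1
  fi
}
```

A DRIFT result points at the status update Lambda, whose logs are listed in the table above.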

Verification Checklist

After debugging a workflow issue:

  • [ ] Root cause identified (Step Functions history reviewed)
  • [ ] DynamoDB workflow state consistent with actual outcome
  • [ ] trace_id correlation confirmed across services
  • [ ] Any stuck workflows cancelled or updated
  • [ ] CloudWatch logs reviewed for error patterns
  • [ ] Incident documented with timeline