binex bisect

Synopsis

binex bisect <GOOD_RUN_ID> <BAD_RUN_ID> [OPTIONS]

Description

Find the divergence point between two runs. Compares runs node-by-node, classifying each as a match, status difference, or content difference. Identifies the first node where the two runs diverge — helping you pinpoint where a regression or behavior change was introduced.

The comparison uses content similarity (via difflib.SequenceMatcher) to detect subtle output differences even when both nodes completed successfully.

Arguments

Argument	Required	Description
`GOOD_RUN_ID`	Yes	The "known good" run (baseline)
`BAD_RUN_ID`	Yes	The "known bad" run (comparison)

Options

Option	Type	Default	Description
`--threshold`	`float`	`0.9`	Content similarity threshold (0.0-1.0). Nodes with similarity below this are flagged as `content_diff`
`--diff`	flag	false	Show full unified diffs instead of content preview
`--json`	flag	false	Output as JSON
`--rich / --no-rich`	flag	auto	Rich formatted output (auto-detected if `rich` is installed)

Exit Codes

Code	Meaning
0	Success
1	Run not found

Examples

# Find where two runs diverge
binex bisect run_good run_bad

# Stricter content comparison
binex bisect run_good run_bad --threshold 0.95

# Show full diffs for changed nodes
binex bisect run_good run_bad --diff

# JSON for scripting
binex bisect run_good run_bad --json

Output

Plain text (default)

Bisecting: run_good vs run_bad

  planner       match
  researcher    match
  validator     content_diff  (similarity: 0.72)
    Good: {"validated": 9, "papers": [...]}
    Bad:  {"validated": 5, "papers": [...]}
  summarizer    status_diff   (completed -> failed)

Verdict: First divergence at 'validator'
  3 of 4 nodes compared
  1 content diff, 1 status diff

Rich (`--rich`)

The rich output includes:

Verdict Card — highlights the first divergence node with status
Pipeline Tree — visual node-by-node comparison with colored icons:
Green checkmark for matches
Yellow warning for content differences
Red cross for status differences
Footer with summary statistics

JSON (`--json`)

{
  "good_run": "run_good",
  "bad_run": "run_bad",
  "threshold": 0.9,
  "verdict": {
    "node_id": "validator",
    "type": "content_diff",
    "similarity": 0.72
  },
  "nodes": [
    {
      "node_id": "planner",
      "status": "match",
      "status_good": "completed",
      "status_bad": "completed",
      "similarity": 1.0
    },
    {
      "node_id": "researcher",
      "status": "match",
      "status_good": "completed",
      "status_bad": "completed",
      "similarity": 0.98
    },
    {
      "node_id": "validator",
      "status": "content_diff",
      "status_good": "completed",
      "status_bad": "completed",
      "similarity": 0.72
    },
    {
      "node_id": "summarizer",
      "status": "status_diff",
      "status_good": "completed",
      "status_bad": "failed"
    }
  ]
}

Node Comparison Statuses

Status	Meaning
`match`	Same status and content similarity above threshold
`content_diff`	Same status but content similarity below threshold
`status_diff`	Different execution status (e.g., completed vs failed)

Use Cases

Debugging a Regression

After a workflow that was working starts failing:

# Find the last good run and the failing run
binex bisect run_last_good run_failing

The verdict tells you exactly which node started behaving differently.

Comparing Model Swaps

After replaying a run with a different model:

binex replay run_original --from summarizer --agent summarizer=llm://anthropic/claude-sonnet-4-20250514
# Produces run_new

binex bisect run_original run_new --diff

The --diff flag shows exactly how the output content changed.

CI Regression Detection

RESULT=$(binex bisect "$BASELINE_RUN" "$CURRENT_RUN" --json)
VERDICT_TYPE=$(echo "$RESULT" | jq -r '.verdict.type')

if [ "$VERDICT_TYPE" = "status_diff" ]; then
  echo "Status regression detected"
  exit 1
fi

Tips

Put the "known good" run first and the "bad" run second — the output labels use these terms.
Use --threshold 0.95 for stricter comparison when outputs should be nearly identical.
Use --threshold 0.5 for looser comparison when you only care about major changes.
Combine with binex debug to inspect the divergent node in detail.