Skip to content

Add diff and why subcommands — session comparison and causal chain tracing#12

Merged
Siddhant-K-code merged 6 commits intomainfrom
feature/diff-why
Mar 28, 2026
Merged

Add diff and why subcommands — session comparison and causal chain tracing#12
Siddhant-K-code merged 6 commits intomainfrom
feature/diff-why

Conversation

@Siddhant-K-code
Copy link
Copy Markdown
Owner

Closes #4

What

Two new subcommands for understanding why a session behaved the way it did and how it differed from another session.

Changes

diff.py (new):

  • diff_sessions(store, session_a, session_b) — aligns phases from both sessions via LCS (longest common subsequence) on phase labels, then reports per-phase differences: files only in A, files only in B, commands only in A/B, failed vs passed outcomes
  • format_diff() — shows divergence point and per-phase delta

why.py (new):

  • build_causal_chain(events, target_index) — walks backwards from a target event using: parent_id links (tool_result → tool_call), error→retry detection, path-reference matching (tool_result text containing a path referenced by a later tool_call), and user_prompt as the root cause terminator
  • format_why() — renders the chain root → target with reasons at each link

CLI additions:

  • agent-strace diff <session-a> <session-b>
  • agent-strace why [session-id] <event-number> (1-based, matches replay output)

Example output

Comparing: abc123 vs def456

Diverged at phase 2:

  Phase 2: run tests
    abc123 only:  $ python -m pytest
    def456 only:  $ uv run pytest

  abc123: 4m 12s, 47 events, 8 tools, 2 retries
  def456: 2m 05s, 31 events, 5 tools, 0 retries
Why did event #4 happen?

  #  4  tool_call: Bash  $ pytest tests/

Causal chain (root → target):

    #  1  user_prompt: "run the test suite"
       (prompt at #1 triggered this)
  ←  #  3  error: exit 1
       (retry after error at #3)
  ←  #  4  tool_call: Bash  $ pytest tests/

Tests

20 new tests (192 total passing)

Siddhant-K-code and others added 2 commits March 25, 2026 06:12
- diff <session-a> <session-b>: structural comparison of two sessions
  via LCS-based phase alignment; reports divergence point, per-phase
  file/command differences, and failed vs passed outcomes
- why [session-id] <event-number>: traces the causal chain backwards
  from a target event using parent_id links, error→retry detection,
  path-reference matching, and user_prompt as root cause
- 20 new tests (192 total passing)

Co-authored-by: Ona <no-reply@ona.com>
… fix import order

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code
Copy link
Copy Markdown
Owner Author

Review: diff and why subcommands

Two dead variables found; both fixed upstream in 2ee71c8 before this review landed.


Bug 1 — _event_paths: dead result variable (fixed in 2ee71c8)

result = data.get("result", "") or data.get("content_preview", "")
return paths  # result never added to paths

The variable was computed but never used. Removed.


Bug 2 — build_causal_chain: dead target_tool variable (fixed in 2ee71c8)

target_tool = event.data.get("tool_name", "").lower()
# target_tool never referenced again in the function

Also removed.


Design note — _lcs_indices O(m×n) space

The DP table allocates (m+1) × (n+1) cells. For typical session phase counts (< 50) this is negligible. Worth a comment if phase counts are expected to grow significantly.


Design note — causal chain completeness

_walk terminates silently when idx is already in visited. The fallback _walk(0, "session start") at the end of the backwards scan can hit this if index 0 was already visited, producing an incomplete chain with no indication. This is acceptable for the current use case but worth documenting.


No remaining issues. ✅ Ready to merge.

@Siddhant-K-code Siddhant-K-code marked this pull request as ready for review March 28, 2026 15:49
for prev_idx in range(idx - 1, -1, -1):
prev = events[prev_idx]

# Error → next tool_call is a retry
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False positive: error→retry rule fires on any ERROR before any TOOL_CALL, not just adjacent ones.

The backwards scan hits the first ERROR it finds and immediately attributes the target TOOL_CALL to it, regardless of how many unrelated events are between them. A session like:

#1 user_prompt: "run tests"
#2 tool_call: Bash $ pytest        ← fails
#3 error: exit 1
#4 tool_call: Bash $ git status    ← unrelated, but scan hits #3 first
#5 tool_call: Bash $ pytest        ← actual retry

why #5 correctly links to #3. But why #4 also links to #3 even though git status is not a retry — it's the next command after the error. The rule should only fire if the error is the immediately preceding event (or at most separated by a tool_result).

return

# file_read → file_write of same file
if (prev.event_type in (EventType.TOOL_CALL, EventType.FILE_READ)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly broad match: any TOOL_CALL with a shared path is attributed to any prior FILE_READ.

The condition event.event_type in (EventType.TOOL_CALL, EventType.FILE_WRITE) means a TOOL_CALL (e.g. Bash $ pytest) that happens to have no paths will never match here — fine. But a TOOL_CALL like Write src/foo.py will match any prior TOOL_CALL that also references src/foo.py, even if that prior call was a Bash command that merely mentioned the path in its output. The intent is read→write causality; tightening the target condition to EventType.FILE_WRITE (or tool_name in ("write", "edit")) would reduce false positives.

session_a: str,
session_b: str,
) -> SessionDiff:
result_a = explain_session(store, session_a)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain_session is called twice (once per session) and each call calls store.load_events internally. For large sessions this is fine, but store.load_meta is then called again on lines 169–170 to get tool_calls_a/b. The meta is already available inside ExplainResult (via explain_sessionstore.load_meta). Consider exposing it on ExplainResult to avoid the extra store reads, or at minimum document why the double load is acceptable.

@Siddhant-K-code
Copy link
Copy Markdown
Owner Author

Review: PR #12 — diff + why

Both modules are well-structured. The LCS alignment for diff is the right algorithm choice — it handles insertions/deletions gracefully and the implementation is correct. The why causal chain walker is a good heuristic approach with proper cycle protection via visited.

Two bugs in why.py (inline comments on lines 108 and 122):

  1. The error→retry rule fires on any preceding ERROR, not just adjacent ones — produces false positives for unrelated tool calls after an error
  2. The read→write path match is too broad — any TOOL_CALL with a shared path qualifies as a "write", not just actual write operations

One nit in diff.py (line 98): double store.load_meta call — minor but worth noting.

Test gaps:

  • test_error_causes_retry only tests the happy path (error immediately before retry). No test for the false-positive case described above — a tool call after an error that is not a retry
  • No test for diff_sessions where one session has more phases than the other (unaligned phases)
  • No test for why on a session where the target event has parent_id pointing to a non-existent event ID (should fall through to heuristic scan, not crash)
  • _lcs_indices is tested well — 5 cases covering empty, identical, no-common, partial, and insertion

What's good:

  • _lcs_indices uses bottom-up DP with correct backtracking — O(mn) time, O(mn) space, no recursion limit issues
  • _walk uses a visited set so cycles in parent_id links (malformed data) can't cause infinite recursion
  • format_why handles the empty chain case explicitly — no crash on chain.links[-1] when links is empty
  • cmd_why validates event number bounds before calling build_causal_chain
  • build_phases is reused from explain.py rather than reimplemented — good

Siddhant-K-code and others added 4 commits March 28, 2026 16:05
why.py:
- error→retry: only fire when no substantive event sits between the error
  and the tool_call (tool_result events are allowed as separators). Prevents
  unrelated tool calls after an error from being misattributed as retries.
- read→write: restrict the path-match rule to actual write operations
  (FILE_WRITE event type, or tool_name in write/edit/create). Previously
  any TOOL_CALL sharing a path with a prior read would match, including
  Bash commands that merely mentioned the path.

tests:
- error→retry false positive: unrelated call between error and retry
- write tool_call correctly linked to prior read of same path
- Bash call with shared path not linked via read→write rule
- dangling parent_id falls through to heuristic without crashing
- diff_sessions with unaligned phase counts

Co-authored-by: Ona <no-reply@ona.com>
The previous adjacency check was insufficient — git status immediately
after a failed pytest is adjacent to the error but is not a retry.
Now we look up the tool_call that caused the error and only attribute
the retry if the target tool_call uses the same tool name.

Co-authored-by: Ona <no-reply@ona.com>
Tool name alone is insufficient — git status and pytest are both Bash
calls. For Bash tool calls, require the command string to match the
failing call. For other tools, tool name match is sufficient.

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code Siddhant-K-code merged commit a724556 into main Mar 28, 2026
4 checks passed
@Siddhant-K-code Siddhant-K-code deleted the feature/diff-why branch March 28, 2026 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v0.5.0: Session diff + causal chain (why)

1 participant