Add diff and why subcommands — session comparison and causal chain tracing #12
Siddhant-K-code merged 6 commits into main
- diff <session-a> <session-b>: structural comparison of two sessions via LCS-based phase alignment; reports divergence point, per-phase file/command differences, and failed vs passed outcomes
- why [session-id] <event-number>: traces the causal chain backwards from a target event using parent_id links, error→retry detection, path-reference matching, and user_prompt as root cause
- 20 new tests (192 total passing)

Co-authored-by: Ona <no-reply@ona.com>
… fix import order Co-authored-by: Ona <no-reply@ona.com>
Review: diff and why subcommands
Two dead variables found; both fixed upstream in Bug 1 —
src/agent_trace/why.py
Outdated
    for prev_idx in range(idx - 1, -1, -1):
        prev = events[prev_idx]

        # Error → next tool_call is a retry
False positive: error→retry rule fires on any ERROR before any TOOL_CALL, not just adjacent ones.
The backwards scan hits the first ERROR it finds and immediately attributes the target TOOL_CALL to it, regardless of how many unrelated events are between them. A session like:
#1 user_prompt: "run tests"
#2 tool_call: Bash $ pytest ← fails
#3 error: exit 1
#4 tool_call: Bash $ git status ← unrelated, but scan hits #3 first
#5 tool_call: Bash $ pytest ← actual retry
why #5 correctly links to #3. But why #4 also links to #3 even though git status is not a retry — it's the next command after the error. The rule should only fire if the error is the immediately preceding event (or at most separated by a tool_result).
            return

        # file_read → file_write of same file
        if (prev.event_type in (EventType.TOOL_CALL, EventType.FILE_READ)
Overly broad match: any TOOL_CALL with a shared path is attributed to any prior FILE_READ.
The condition event.event_type in (EventType.TOOL_CALL, EventType.FILE_WRITE) means a TOOL_CALL (e.g. Bash $ pytest) that happens to have no paths will never match here — fine. But a TOOL_CALL like Write src/foo.py will match any prior TOOL_CALL that also references src/foo.py, even if that prior call was a Bash command that merely mentioned the path in its output. The intent is read→write causality; tightening the target condition to EventType.FILE_WRITE (or tool_name in ("write", "edit")) would reduce false positives.
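The tightening suggested above can be sketched roughly as follows. This is a minimal illustration, not the code in why.py: the `Event` dataclass, its `tool_name` and `paths` fields, the `WRITE_TOOLS` set, and both helper functions are assumptions made up for the example. Only the `EventType` member names and the `("write", "edit")` tool names come from the review comment.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    TOOL_CALL = "tool_call"
    FILE_READ = "file_read"
    FILE_WRITE = "file_write"


@dataclass
class Event:
    # Hypothetical event shape; the real class may differ.
    event_type: EventType
    tool_name: str = ""
    paths: frozenset = field(default_factory=frozenset)


# Tool names treated as writes, taken from the review suggestion.
WRITE_TOOLS = {"write", "edit"}


def is_write_target(event):
    """True only for events that actually write a file."""
    if event.event_type is EventType.FILE_WRITE:
        return True
    return (event.event_type is EventType.TOOL_CALL
            and event.tool_name.lower() in WRITE_TOOLS)


def read_then_write(prev, event):
    """Link prev -> event only when event is a write that shares a path."""
    return (prev.event_type in (EventType.TOOL_CALL, EventType.FILE_READ)
            and is_write_target(event)
            and bool(prev.paths & event.paths))


read = Event(EventType.FILE_READ, "Read", frozenset({"src/foo.py"}))
write = Event(EventType.TOOL_CALL, "Write", frozenset({"src/foo.py"}))
bash = Event(EventType.TOOL_CALL, "Bash", frozenset({"src/foo.py"}))

print(read_then_write(read, write))  # the Write call is linked to the read
print(read_then_write(read, bash))   # the Bash call is not
```

With the target restricted this way, a Bash call that merely mentions src/foo.py no longer qualifies as the write side of the rule.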
    session_a: str,
    session_b: str,
) -> SessionDiff:
    result_a = explain_session(store, session_a)
explain_session is called twice (once per session) and each call calls store.load_events internally. For large sessions this is fine, but store.load_meta is then called again on lines 169–170 to get tool_calls_a/b. The meta is already available inside ExplainResult (via explain_session → store.load_meta). Consider exposing it on ExplainResult to avoid the extra store reads, or at minimum document why the double load is acceptable.
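One way to read the suggestion above is to carry the already-loaded metadata on the result object. The sketch below is illustrative only: `CountingStore`, the `meta` field, the `"tool_calls"` key, and `tool_call_counts` are hypothetical names, and the real `explain_session` and `ExplainResult` almost certainly hold more than this.

```python
from dataclasses import dataclass


@dataclass
class ExplainResult:
    phases: list
    meta: dict  # hypothetical field: metadata already loaded by explain_session


class CountingStore:
    """Toy stand-in for the real store; counts metadata reads."""

    def __init__(self, metas):
        self._metas = metas
        self.meta_reads = 0

    def load_meta(self, session_id):
        self.meta_reads += 1
        return self._metas[session_id]


def explain_session(store, session_id):
    # Simplified: the real function also loads events and builds phases.
    meta = store.load_meta(session_id)
    return ExplainResult(phases=[], meta=meta)


def tool_call_counts(store, session_a, session_b):
    result_a = explain_session(store, session_a)
    result_b = explain_session(store, session_b)
    # Reuse the metadata carried on each result instead of reading it again.
    return result_a.meta["tool_calls"], result_b.meta["tool_calls"]


store = CountingStore({"a": {"tool_calls": 7}, "b": {"tool_calls": 4}})
counts = tool_call_counts(store, "a", "b")
print(counts, store.meta_reads)
```

In this toy version the store is read once per session rather than twice, which is the saving the comment is pointing at.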
Review: PR #12 — diff + why
Both modules are well-structured. The LCS alignment for Two bugs in
One nit in Test gaps:
What's good:
why.py:
- error→retry: only fire when no substantive event sits between the error and the tool_call (tool_result events are allowed as separators). Prevents unrelated tool calls after an error from being misattributed as retries.
- read→write: restrict the path-match rule to actual write operations (FILE_WRITE event type, or tool_name in write/edit/create). Previously any TOOL_CALL sharing a path with a prior read would match, including Bash commands that merely mentioned the path.

tests:
- error→retry false positive: unrelated call between error and retry
- write tool_call correctly linked to prior read of same path
- Bash call with shared path not linked via read→write rule
- dangling parent_id falls through to heuristic without crashing
- diff_sessions with unaligned phase counts

Co-authored-by: Ona <no-reply@ona.com>
The previous adjacency check was insufficient — git status immediately after a failed pytest is adjacent to the error but is not a retry. Now we look up the tool_call that caused the error and only attribute the retry if the target tool_call uses the same tool name. Co-authored-by: Ona <no-reply@ona.com>
Tool name alone is insufficient — git status and pytest are both Bash calls. For Bash tool calls, require the command string to match the failing call. For other tools, tool name match is sufficient. Co-authored-by: Ona <no-reply@ona.com>
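The retry test described in the last two commit messages might look something like the sketch below. It assumes the failing call is the nearest tool_call before the error, which the commit messages do not state, and the `Event` fields `tool_name` and `command` as well as the helpers `failing_call_for` and `is_retry` are hypothetical names. How this predicate combines with the backwards scan in why.py is not shown here.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    USER_PROMPT = "user_prompt"
    TOOL_CALL = "tool_call"
    ERROR = "error"


@dataclass
class Event:
    # Hypothetical event shape; the real class may differ.
    event_type: EventType
    tool_name: str = ""
    command: str = ""


def failing_call_for(events, error_idx):
    """Assumption: the call that caused an error is the nearest earlier tool_call."""
    for i in range(error_idx - 1, -1, -1):
        if events[i].event_type is EventType.TOOL_CALL:
            return events[i]
    return None


def is_retry(failing, candidate):
    """Same tool name is required; Bash calls must also repeat the command."""
    if failing is None or candidate.tool_name != failing.tool_name:
        return False
    if failing.tool_name == "Bash":
        return candidate.command.strip() == failing.command.strip()
    return True


# The session from the review comment, with 0-based list indices.
events = [
    Event(EventType.USER_PROMPT),
    Event(EventType.TOOL_CALL, "Bash", "pytest"),
    Event(EventType.ERROR),
    Event(EventType.TOOL_CALL, "Bash", "git status"),
    Event(EventType.TOOL_CALL, "Bash", "pytest"),
]

failing = failing_call_for(events, 2)
print(is_retry(failing, events[3]))  # git status is not a retry of pytest
print(is_retry(failing, events[4]))  # the second pytest is
```

Exact string equality on the command is the simplest reading of "require the command string to match"; a looser comparison would be a separate design decision.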
Co-authored-by: Ona <no-reply@ona.com>
Closes #4
What
Two new subcommands for understanding why a session behaved the way it did and how it differed from another session.
Changes
diff.py (new):
- diff_sessions(store, session_a, session_b): aligns phases from both sessions via LCS (longest common subsequence) on phase labels, then reports per-phase differences: files only in A, files only in B, commands only in A/B, failed vs passed outcomes
- format_diff(): shows divergence point and per-phase delta

why.py (new):
- build_causal_chain(events, target_index): walks backwards from a target event using:
  - parent_id links (tool_result → tool_call)
  - error→retry detection
  - path-reference matching (tool_result text containing a path referenced by a later tool_call)
  - user_prompt as the root cause terminator
- format_why(): renders the chain root → target with reasons at each link

CLI additions:
- agent-strace diff <session-a> <session-b>
- agent-strace why [session-id] <event-number> (1-based, matches replay output)

Example output
Tests
20 new tests (192 total passing)
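
For readers unfamiliar with the LCS-based phase alignment listed under Changes, here is a generic sketch of the technique. It is a textbook longest-common-subsequence alignment over phase labels, not the code in diff.py: `align_phases`, `first_divergence`, and the sample labels are invented for illustration.

```python
def align_phases(labels_a, labels_b):
    """Align two label sequences; unmatched phases pair with None."""
    n, m = len(labels_a), len(labels_b)
    # lcs[i][j] = LCS length of labels_a[i:] and labels_b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if labels_a[i] == labels_b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk forward through the table to recover the alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if labels_a[i] == labels_b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((i, None))  # phase only in A
            i += 1
        else:
            pairs.append((None, j))  # phase only in B
            j += 1
    pairs.extend((k, None) for k in range(i, n))
    pairs.extend((None, k) for k in range(j, m))
    return pairs


def first_divergence(pairs):
    """Position of the first unmatched phase in the alignment, or None."""
    for pos, (a, b) in enumerate(pairs):
        if a is None or b is None:
            return pos
    return None


a = ["explore", "edit", "test"]
b = ["explore", "test", "fix", "test"]
pairs = align_phases(a, b)
print(pairs)
print(first_divergence(pairs))
```

Each aligned pair can then be compared for files, commands, and outcome, while pairs containing None mark phases present in only one session.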