feat: improve capture pipeline — opencode SQLite, codex prompt history, cursor model, MCP attribution fix by codeprakhar25 · Pull Request #5 · codeprakhar25/agentdiff

codeprakhar25 · 2026-04-22T11:50:23Z

Summary

opencode: Retrieve model and prompt from SQLite DB (~/.opencode/opencode.db) instead of relying on env vars that weren't being set reliably. Falls back to JSON log lookup, then env, then defaults.
codex: Look up prompt from Codex session history files (~/.codex/history/) by session ID, enabling prompt attribution on completions where the prompt wasn't passed directly.
cursor: Improved model capture — reads model / model_name / modelName from the hook payload; falls back to cursor-unknown rather than omitting the field.
capture-claude.py: Pre-task filter removed — was incorrectly dropping PostToolUse events before the first tool call in a session.
cursor configure (src/configure/cursor.rs): Check directory existence, not file — create hooks.json if absent rather than erroring when config file doesn't exist yet.
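
The cursor model lookup in the summary above can be sketched roughly as follows. This is a minimal illustration, not the code in capture-cursor.py: the function name `extract_cursor_model` is hypothetical, and the order in which the three keys are tried is an assumption (the summary only names them).

```python
from typing import Any, Mapping

# Key names come from the PR summary; the lookup order is an assumption.
_MODEL_KEYS = ("model", "model_name", "modelName")


def extract_cursor_model(payload: Mapping[str, Any]) -> str:
    """Return the first non-empty model string in the hook payload.

    Falls back to "cursor-unknown" so the field is never omitted.
    """
    for key in _MODEL_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return "cursor-unknown"
```

Returning a sentinel rather than dropping the field keeps every entry the same shape, so downstream readers do not need a missing-key branch.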

Bug fix

prepare-ledger.py: Files with no matching session.jsonl event but present in MCP files_read were unconditionally attributed to "human". When an MCP server provides agent/model context via pending.json, those files should inherit the MCP agent — not fall back to human. This was causing the CI mcp-smoke test to fail:

RuntimeError: expected model_id=mcp-smoke-model in trace entry, got human

Fix: before falling back to human, check if the file appears in files_read from the pending MCP context and the top-level agent is non-human. If so, use the MCP agent/model.

Logging cleanup

always_log in capture-cursor.py and capture-codex.py was writing to log files on every agent event unconditionally (regardless of AGENTDIFF_DEBUG), silently filling ~/.agentdiff/logs/. Removed the function; all call sites replaced with debug_log which is gated on the env var.
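
For illustration, a logger gated on the env var might look like the sketch below. It is not the project's `debug_log`: the `log_name` and `log_dir` parameters, the timestamp format, and the rule for which values of AGENTDIFF_DEBUG count as "on" are all assumptions.

```python
import os
import time
from pathlib import Path

# Default log directory named in the PR description.
LOG_DIR = Path.home() / ".agentdiff" / "logs"


def debug_enabled() -> bool:
    # Assumption: any value other than empty, "0" or "false" enables logging.
    value = os.environ.get("AGENTDIFF_DEBUG", "").strip().lower()
    return value not in ("", "0", "false")


def debug_log(message, log_name="capture.log", log_dir=None):
    """Append one line to the log file, but only when debugging is enabled.

    Returns True if a line was written, False if the call was a no-op.
    """
    if not debug_enabled():
        return False
    target_dir = Path(log_dir) if log_dir is not None else LOG_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(target_dir / log_name, "a", encoding="utf-8") as handle:
        handle.write(f"{stamp} {message}\n")
    return True
```

Because the check happens before the directory is created, a non-debug run touches nothing under the log directory.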

Test plan

agentdiff verify on a repo with opencode traces shows correct model (not unknown)
agentdiff list on a codex-traced repo shows prompt populated from history
MCP smoke test passes: scripts/tests/mcp-smoke.txt trace has model_id = mcp-smoke-model
~/.agentdiff/logs/ does not accumulate capture-cursor.log / capture-codex.log entries during normal (non-debug) use
agentdiff configure cursor on a machine without an existing hooks.json succeeds

🤖 Generated with Claude Code

…emove duplicate codex capture

capture-codex.py: `git diff` misses brand-new untracked files. Add a `git ls-files --others --exclude-standard` pass so Codex attribution works when it creates a file from scratch. Also updated `get_dirty_file_names` to include untracked files in the pre-task snapshot for correct exclusion.

capture-claude.py: `get_model_and_prompt` was guessing the session file path from the session ID, but Claude Code organises sessions by repo path slug, not session ID. Switch to a recursive glob search across ~/.claude/projects/**/{session_id}.jsonl so the model name is always found.

codex.rs: `agentdiff configure` was writing both `notify` in config.toml AND `UserPromptSubmit`/`Stop` in hooks.json. When codex_hooks=true, Codex fires both for the same task, doubling every session.jsonl entry. Remove the `notify` key when enabling codex_hooks so only hooks.json fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
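
The recursive glob search described for capture-claude.py could look something like this sketch. The function name `find_claude_session_file`, the `projects_root` parameter, and the tie-break on modification time are assumptions made for illustration; only the ~/.claude/projects/**/{session_id}.jsonl pattern comes from the commit message.

```python
import glob
from pathlib import Path
from typing import Optional


def find_claude_session_file(
    session_id: str, projects_root: Optional[Path] = None
) -> Optional[Path]:
    """Search every project directory for <session_id>.jsonl."""
    root = projects_root if projects_root is not None else Path.home() / ".claude" / "projects"
    if not root.is_dir():
        return None
    # Escape the ID so stray glob characters are matched literally.
    pattern = glob.escape(session_id) + ".jsonl"
    matches = list(root.rglob(pattern))
    if not matches:
        return None
    # Assumption: if several copies exist, the most recently modified one wins.
    return max(matches, key=lambda p: p.stat().st_mtime)
```

Searching by file name avoids having to reproduce how the repo path is turned into a directory slug.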

…nds, update CI section

- Fix incorrect claim that configure tracks all repos globally (init is required per repo)
- Remove commands that no longer exist: stats, log, remote-status, migrate, export
- Add install-ci to commands table
- Fix example flags: --out-md/--out-annotations → --out, agentdiff stats → agentdiff report
- Replace manual CI YAML with agentdiff install-ci workflow + correct manual example
- Fix install.sh URL: master → main
- Remove stale config.toml keys (data_dir, auto_amend_ledger)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y, no session-evidence files

- prepare-ledger: preserve agent="human" as semantic token; add git_author field separately so finalize-ledger can display the real git username without losing the human/AI distinction for type checks
- prepare-ledger: explicitly attribute files with no session.jsonl evidence to human rather than inheriting the dominant AI agent. Fixes cases where AI and human edits are committed together and untracked files were incorrectly claimed by the AI
- finalize-ledger: read git_author from payload; use it for tool.name when agent=="human" so contributor.type=="human" traces show the committer name
- store: remove session.jsonl load from load_entries(), since only AgentTrace records belong in the committed view; add load_uncommitted_entries() for the --uncommitted path to avoid double-counting and copilot leakage
- list: use load_uncommitted_entries() for the uncommitted view

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
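
As a rough illustration of the agent / git_author split described in this commit, the display-name choice might reduce to something like the following. The function name `tool_name_for` and the payload being a plain dict are assumptions; only the field names `agent` and `git_author` come from the commit message.

```python
from typing import Any, Mapping


def tool_name_for(payload: Mapping[str, Any]) -> str:
    """Pick the display name for a trace entry.

    The semantic token "human" stays in `agent`; the committer name is only
    used for display, so type checks on `agent` keep working.
    """
    agent = payload.get("agent") or "human"
    if agent == "human":
        # Fall back to the token itself when no git author was recorded.
        return payload.get("git_author") or "human"
    return agent
```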

…to WSL+Windows paths

- Check ~/.cursor/ directory existence instead of hooks.json existence so the file is created when Cursor is installed but hooks.json is absent
- Extract configure_cursor_hooks_file() helper to apply the same hooks to multiple paths without duplication
- On WSL2, Cursor is a Windows app: scan /mnt/c/Users/*/.cursor and write hooks.json there alongside the WSL ~/.cursor/hooks.json so whichever path cursor-server resolves picks up the config
- Summary in print_configure_summary now checks presence_path (dir) separately from config_path (file) for all home-based tools, giving accurate output when the tool is installed but not yet configured

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… gotchas Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…QLite DB

- Added functions to retrieve the model ID and initial user prompt from the OpenCode SQLite database.
- Implemented fallback mechanisms to read from model.json if the database lookup fails.
- Updated the main capture logic to utilize the new retrieval functions for model and prompt.
- Introduced a comprehensive test script for the agentdiff pipeline, validating the entire capture, prepare, and finalize process with real and simulated agents.
- Improved cursor configuration in Rust to ensure versioning in hooks configuration.

… attribution fallback

- Drop always_log from capture-cursor.py and capture-codex.py. It was writing to log files on every agent event regardless of AGENTDIFF_DEBUG, silently filling ~/.agentdiff/logs/. All call sites replaced with debug_log (conditional on AGENTDIFF_DEBUG env var).
- Fix prepare-ledger.py: files with no session event but present in MCP files_read now correctly inherit the MCP agent/model instead of falling back to "human". Fixes CI mcp-smoke test failure: RuntimeError: expected model_id=mcp-smoke-model in trace entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-04-22T11:56:03Z

Greptile Summary

This PR improves capture fidelity and attribution correctness across four agent integrations (OpenCode, Codex, Cursor, Claude Code) and fixes a misattribution bug in prepare-ledger.py that was causing the MCP smoke test to fail.

Key changes:

OpenCode: Model and prompt are now read from SQLite (~/.local/share/opencode/opencode.db) with JSON log fallback, replacing unreliable env-var injection.
Codex: Prompt sourced from ~/.codex/history.jsonl by session ID; untracked files included in change detection; pre-task filtering removed in favour of commit-time resolution.
Cursor: Multi-key model lookup with cursor-unknown fallback; prompt falls back to agent-transcript JSONL; configure cursor now checks directory existence and creates hooks.json if absent; WSL2 dual-path support added.
capture-claude.py: Session file discovery switched to recursive glob; <synthetic> model values skipped; prompt sourced from ~/.claude/history.jsonl.
prepare-ledger.py / finalize-ledger.py: agent semantic token ("human") kept distinct from git_author display name; files with no session event but present in MCP files_read are attributed to the MCP agent rather than unconditionally falling back to human.
store.rs / list.rs: load_entries() no longer reads session.jsonl; load_uncommitted_entries() is the explicit path for pre-commit data.
Logging: Removed unconditional log writes in cursor and codex capture scripts; all logging now gated on AGENTDIFF_DEBUG.

One notable concern: the new MCP attribution logic builds files_read_set as a union of full paths and basenames. Because files_read contains absolute paths while files_touched contains repo-relative paths, the full-path branch never fires and all matching reduces to basename comparison — risking false-positive MCP attribution for files that share a common filename with any file the MCP agent read.

Confidence Score: 4/5

Safe to merge with one targeted fix — the basename-collision misattribution in prepare-ledger.py should be addressed before this hits repos with common filenames.

The majority of the PR is a clear improvement: logging cleanup, store separation, cursor configure fix, and git_author/agent separation are all correct and well-reasoned. The MCP attribution fix resolves the failing smoke test. The one concrete bug is in the new files_read_set lookup: because files_read contains absolute paths and files_touched contains repo-relative paths, all matching reduces to basename comparison, which risks misattributing unrelated files to the MCP agent whenever they share a filename. This is a real misattribution vector but limited to MCP-originated commits, so it does not break the primary single-agent path.

scripts/prepare-ledger.py — the files_read_set basename-union logic at lines 334–340 needs a path-normalisation fix to avoid false-positive MCP attribution.

Important Files Changed

Filename	Overview
scripts/prepare-ledger.py	Key MCP attribution fix — but the new files_read_set basename-union lookup can cause false-positive agent attribution for files sharing a common filename with MCP-read files.
scripts/capture-opencode.py	New SQLite-based model/prompt lookup for OpenCode; connections not guarded with context managers so exceptions can leave file locks held briefly. DB path may not cover all installation methods.
scripts/capture-codex.py	Adds prompt history lookup, untracked-file detection, removes pre-task filter and always_log; logic is sound with a minor duplicate debug-log line.
scripts/capture-cursor.py	Removes unconditional logging, adds transcript-based prompt fallback, improves model capture with multi-key lookup and cursor-unknown default; clean changes.
scripts/capture-claude.py	Replaces fragile path-guessing with a recursive glob, adds tail-read helper and history.jsonl prompt lookup, skips synthetic model values; straightforward improvement.
scripts/finalize-ledger.py	Introduces git_author field and uses it for tool.name on human-authored files, correctly separating semantic human token from display name.
src/configure/cursor.rs	Switches from file-existence to directory-existence check and creates hooks.json if absent; adds WSL2 dual-path support; clean refactor into configure_cursor_hooks_file.
src/store.rs	Adds load_uncommitted_entries() and removes session.jsonl loading from load_entries(), cleanly separating committed and uncommitted read paths.
src/commands/list.rs	Fixes run_uncommitted to use the new load_uncommitted_entries() instead of loading all entries and post-filtering.
src/configure/codex.rs	Removes legacy notify key management (now handled by codex_hooks=true), migrating existing configs cleanly by preserving forwarded tools while stripping the agentdiff entry.
src/configure/mod.rs	Splits configure summary check into presence-path vs. config-path, giving a clearer hook-missing message when tool is installed but hook file is absent.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent Hook Fires] --> B[Dispatch by agent type]
    B --> C[capture-claude.py]
    B --> D[capture-cursor.py]
    B --> E[capture-codex.py]
    B --> F[capture-opencode.py]
    C --> C1[glob ~/.claude/projects for session JSONL - skip synthetic model values]
    C --> C2[~/.claude/history.jsonl for prompt]
    D --> D1[model from payload keys - fallback cursor-unknown]
    D --> D2[cached prompt file or agent-transcript JSONL]
    E --> E1[git diff + ls-files --others including untracked files]
    E --> E2[~/.codex/history.jsonl for prompt]
    F --> F1[SQLite opencode.db - modelID from latest assistant msg]
    F --> F2[SQLite opencode.db - first user message text]
    C1 & C2 & D1 & D2 & E1 & E2 & F1 & F2 --> G[Write entry to session.jsonl]
    G --> H[pre-commit: prepare-ledger.py]
    H --> H1[File has session event]
    H --> H2[File has no session event]
    H1 --> H3[Use session agent and model]
    H2 --> H4[agent != human AND file in files_read]
    H4 -->|Yes| H5[Use MCP agent and model]
    H4 -->|No| H6[Attribute to human]
    H3 & H5 & H6 --> I[pending_ledger.json with per-file attribution and git_author]
    I --> J[post-commit: finalize-ledger.py]
    J --> K[tool.name = git_author for human or agent name for AI]
    K --> L[Signed AgentTrace appended to traces/branch.jsonl]

Comments Outside Diff (1)

scripts/capture-codex.py, lines 546-547

Duplicate adjacent debug log lines

Two consecutive debug_log calls emit almost identical information — the first added in this PR, the second pre-existing:
```
debug_log(f"event={event_name!r} turn={turn_id!r} cwd={cwd!r} model={model!r} session={session_id!r}")
debug_log(f"event_name={event_name!r} turn_id={turn_id!r} cwd_from_events={cwd!r}")
```
The second line is a strict subset of the first (omitting model and session_id). The pre-existing line can be removed to reduce log noise.

Reviews (1): Last reviewed commit: "fix: remove unconditional logging from c..."

greptile-apps · 2026-04-22T11:56:06Z

+    files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
+    for fp in files_touched:
+        if fp not in events_by_file:
+            if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):
+                attribution[fp] = {"agent": agent, "model": model}
+            else:
+                attribution[fp] = {"agent": "human", "model": "human"}


Basename collision can cause false-positive MCP attribution

files_read_set is built by unioning full paths from files_read with their basenames. The check on line 337 then tests whether the committed file appears in that set by either its repo-relative path or its basename:

    files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
    ...
    if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):

Because files_read entries are absolute paths (e.g. /home/user/project/src/utils.py) while files_touched entries are repo-relative (e.g. src/utils.py), the full-path test fp in files_read_set will almost never match. In practice the check always falls through to the basename comparison: os.path.basename(fp) in files_read_set.

This means any committed file whose bare filename (e.g. utils.py, README.md, config.py) matches the basename of any file the MCP agent happened to read will be attributed to that MCP agent instead of to human, even if the touched file is completely unrelated. Common basenames make this a realistic misattribution risk.

Consider normalising files_read to repo-relative paths before comparing:

    files_read_rel = {
        os.path.relpath(f, repo_root) if f.startswith(repo_root) else f
        for f in files_read
    }
    for fp in files_touched:
        if fp not in events_by_file:
            if agent != "human" and fp in files_read_rel:
                attribution[fp] = {"agent": agent, "model": model}
            else:
                attribution[fp] = {"agent": "human", "model": "human"}
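
A self-contained variant of the suggestion above is sketched here so the basename collision can be exercised directly. It is not the code that was merged: the function names `to_repo_relative` and `attribute_unmatched` are hypothetical, and the sketch swaps the `startswith` test for `os.path.commonpath`, since a plain prefix test would also treat a sibling directory such as /home/user/project2 as inside /home/user/project.

```python
import os
from typing import Dict, Iterable, Mapping


def to_repo_relative(path: str, repo_root: str) -> str:
    """Return `path` relative to `repo_root` when it lies inside it, else unchanged."""
    if not os.path.isabs(path):
        return os.path.normpath(path)
    root = os.path.normpath(repo_root)
    candidate = os.path.normpath(path)
    try:
        inside = os.path.commonpath([root, candidate]) == root
    except ValueError:  # mixed absolute/relative paths, or different drives
        inside = False
    return os.path.relpath(candidate, root) if inside else candidate


def attribute_unmatched(
    files_touched: Iterable[str],
    events_by_file: Mapping[str, object],
    files_read: Iterable[str],
    agent: str,
    model: str,
    repo_root: str,
) -> Dict[str, Dict[str, str]]:
    """Attribute files that have no session event.

    A file inherits the MCP agent only on an exact repo-relative path match;
    a shared basename alone is not enough.
    """
    files_read_rel = {to_repo_relative(f, repo_root) for f in files_read}
    attribution: Dict[str, Dict[str, str]] = {}
    for fp in files_touched:
        if fp in events_by_file:
            continue  # handled elsewhere from the session event
        if agent != "human" and os.path.normpath(fp) in files_read_rel:
            attribution[fp] = {"agent": agent, "model": model}
        else:
            attribution[fp] = {"agent": "human", "model": "human"}
    return attribution
```

With files_read containing /home/user/project/src/utils.py, a touched src/utils.py inherits the MCP agent while an unrelated tests/utils.py stays attributed to human.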

greptile-apps · 2026-04-22T11:56:07Z

+    try:
+        conn = sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)
+        # Get first user message for this session
+        row = conn.execute(
+            "SELECT id FROM message WHERE session_id=? "
+            "AND json_extract(data,'$.role')='user' "
+            "ORDER BY time_created ASC LIMIT 1",
+            (session_id,),
+        ).fetchone()
+        if not row:
+            conn.close()
+            return "unknown"
+        msg_id = row[0]
+        # Get text parts for this message
+        parts = conn.execute(
+            "SELECT data FROM part WHERE message_id=? ORDER BY time_created ASC",
+            (msg_id,),
+        ).fetchall()
+        conn.close()
+        for part_row in parts:
+            try:
+                part = json.loads(part_row[0])
+                if part.get("type") == "text" and part.get("text"):
+                    text = str(part["text"]).strip()
+                    debug_log(f"opencode prompt from DB: {text[:80]!r}")
+                    return text[:500]
+            except Exception:
+                continue
+    except Exception as exc:
+        debug_log(f"opencode prompt DB lookup failed: {exc}")
+    return "unknown"


SQLite connection leaked on exception before explicit conn.close()

In both get_opencode_model and get_opencode_prompt, the connection is closed with an explicit conn.close() only in the happy path. If conn.execute().fetchone() (or .fetchall()) raises an exception (e.g. schema mismatch, lock timeout, corrupt page), the except block logs and exits — but conn is never closed. Python's GC will eventually reclaim it, but the file lock can be held for up to timeout=2 seconds per invocation.

The standard fix is a context manager:

    import contextlib

    with contextlib.closing(sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)) as conn:
        row = conn.execute(...).fetchone()
        ...

This applies to both get_opencode_model (lines 60–68) and get_opencode_prompt (lines 101–118).
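
A runnable sketch of the pattern follows. The helper name `first_message_id` and its `db_path` parameter are hypothetical, and the query is a simplified form of the one in the diff (it assumes a `message` table with `id`, `session_id` and `time_created` columns, as the snippet above implies). Note that using `sqlite3.connect(...)` directly as a context manager only commits or rolls back the transaction; it does not close the connection, which is why `contextlib.closing` is needed here.

```python
import contextlib
import sqlite3
from typing import Optional


def first_message_id(db_path: str, session_id: str) -> Optional[str]:
    """Return the ID of the earliest message in a session, or None.

    The connection is opened read-only and is closed on every exit path,
    including when execute() or fetchone() raises.
    """
    try:
        with contextlib.closing(
            sqlite3.connect(f"file:{db_path}?mode=ro", uri=True, timeout=2)
        ) as conn:
            row = conn.execute(
                "SELECT id FROM message WHERE session_id=? "
                "ORDER BY time_created ASC LIMIT 1",
                (session_id,),
            ).fetchone()
            return row[0] if row else None
    except sqlite3.Error:
        # Missing file, schema mismatch, lock timeout and similar failures.
        return None
```

Returning from inside the `with` block is safe: `closing` still calls `conn.close()` before the function returns.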

greptile-apps · 2026-04-22T11:56:09Z

+_OPENCODE_DB = os.path.expanduser("~/.local/share/opencode/opencode.db")
+_OPENCODE_MODEL_JSON = os.path.expanduser("~/.local/state/opencode/model.json")


DB path in code differs from PR description

The PR description states the SQLite database is read from ~/.opencode/opencode.db, but the implementation uses ~/.local/share/opencode/opencode.db. If OpenCode stores its database at ~/.opencode/opencode.db on some platforms or installation methods, the DB lookup will silently fail and fall back to model.json / default "opencode".

Worth verifying the canonical DB path across all supported OpenCode installation methods (binary, npm, homebrew) and potentially probing both locations. Is ~/.local/share/opencode/opencode.db the correct path for all OpenCode installation methods, or does it sometimes reside at ~/.opencode/opencode.db as stated in the PR description?
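
Probing both locations could be as small as the sketch below. The two candidate paths are the ones mentioned in this thread; the function name `resolve_opencode_db`, the tuple name, and the preference order are assumptions.

```python
import os
from typing import Iterable, Optional

# Candidate locations discussed in this review; order of preference is an assumption.
_OPENCODE_DB_CANDIDATES = (
    "~/.local/share/opencode/opencode.db",
    "~/.opencode/opencode.db",
)


def resolve_opencode_db(
    candidates: Iterable[str] = _OPENCODE_DB_CANDIDATES,
) -> Optional[str]:
    """Return the first candidate that exists as a regular file, else None."""
    for candidate in candidates:
        path = os.path.expanduser(candidate)
        if os.path.isfile(path):
            return path
    return None
```

A `None` result lets the caller fall through to the model.json and default paths instead of attempting a connection that is known to fail.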

…cate debug logs

- prepare-ledger: replace basename-union files_read_set with repo-relative path normalisation; full-path match now fires correctly, eliminating false-positive MCP attribution on common filenames (e.g. utils.py)
- capture-opencode: guard both SQLite connections with contextlib.closing so the file lock is released on exception; probe both DB path candidates (~/.local/share/opencode and ~/.opencode) to cover all install methods
- capture-codex: remove 5 duplicate debug_log lines that were strict subsets of the preceding log call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-23T10:23:46Z

AgentDiff Report

Summary

Agent	Lines	%
claude-code	2049	56%
Prakhar Khatri	1641	44%

Files Modified

File	Lines	Dominant Agent
src/commands/status.rs	404	Prakhar Khatri
src/configure/mod.rs	402	claude-code
scripts/tests/test_extension.js	282	Prakhar Khatri
src/configure/antigravity.rs	272	claude-code
src/configure/codex.rs	257	claude-code
src/commands/remote_status.rs	185	claude-code
scripts/finalize-ledger.py	178	claude-code
src/commands/report.rs	175	Prakhar Khatri
scripts/tests/test_capture_prompts.py	166	Prakhar Khatri
src/configure/claude.rs	141	claude-code
src/commands/list.rs	135	Prakhar Khatri
src/configure/copilot.rs	114	claude-code
README.md	113	claude-code
scripts/capture-codex.py	102	claude-code
src/configure/windsurf.rs	100	claude-code
src/configure/cursor.rs	94	claude-code
scripts/tests/test_capture_cursor.py	60	Prakhar Khatri
src/configure/opencode.rs	59	claude-code
src/cli.rs	51	Prakhar Khatri
src/util.rs	43	claude-code

codeprakhar25 and others added 8 commits April 20, 2026 11:12

docs: add CLAUDE.md with project context, attribution invariants, and…

3026001

… gotchas Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: bump version to 0.1.25

8bff731

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps Bot reviewed Apr 22, 2026


codeprakhar25 merged commit d2c9b58 into main Apr 23, 2026
2 checks passed

codeprakhar25 merged 9 commits into main from test/comprehensive-pipeline
