feat: improve capture pipeline — opencode SQLite, codex prompt history, cursor model, MCP attribution fix #5

Merged
codeprakhar25 merged 9 commits into main from test/comprehensive-pipeline on Apr 23, 2026

@codeprakhar25
Owner

Summary

  • opencode: Retrieve model and prompt from SQLite DB (~/.opencode/opencode.db) instead of relying on env vars that weren't being set reliably. Falls back to JSON log lookup, then env, then defaults.
  • codex: Look up prompt from Codex session history files (~/.codex/history/) by session ID, enabling prompt attribution on completions where the prompt wasn't passed directly.
  • cursor: Improved model capture — reads model / model_name / modelName from the hook payload; falls back to cursor-unknown rather than omitting the field.
  • capture-claude.py: Pre-task filter removed — was incorrectly dropping PostToolUse events before the first tool call in a session.
  • cursor configure (src/configure/cursor.rs): Check directory existence, not file — create hooks.json if absent rather than erroring when config file doesn't exist yet.
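
The opencode fallback order described above (SQLite DB, then JSON log, then env, then defaults) can be sketched as a chain of lookups that returns the first non-empty value. This is a minimal sketch rather than the code in capture-opencode.py: first_non_empty and the stand-in sources are hypothetical names, and the values they return are placeholders.

```python
from typing import Callable, Iterable, Optional


def first_non_empty(sources: Iterable[Callable[[], Optional[str]]], default: str) -> str:
    """Return the first truthy value produced by the sources, else the default.

    Sources are tried in order. A source that raises is treated as a miss, so
    one broken lookup (e.g. a missing DB file) does not abort the chain.
    """
    for source in sources:
        try:
            value = source()
        except Exception:
            continue
        if value:
            return value
    return default


def _broken_db_lookup() -> Optional[str]:
    # Hypothetical stand-in for a SQLite lookup that fails.
    raise FileNotFoundError("no database")


model = first_non_empty(
    [
        _broken_db_lookup,         # 1. SQLite DB (fails in this example)
        lambda: None,              # 2. JSON log lookup (no hit)
        lambda: "model-from-env",  # 3. env var (placeholder value)
    ],
    default="opencode",
)
```

Treating an exception as a miss keeps the capture hook from failing just because one source is unavailable; the trade-off is that real errors are only visible if each source logs them.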

Bug fix

prepare-ledger.py: Files with no matching session.jsonl event but present in MCP files_read were unconditionally attributed to "human". When an MCP server provides agent/model context via pending.json, those files should inherit the MCP agent — not fall back to human. This was causing the CI mcp-smoke test to fail:

RuntimeError: expected model_id=mcp-smoke-model in trace entry, got human

Fix: before falling back to human, check if the file appears in files_read from the pending MCP context and the top-level agent is non-human. If so, use the MCP agent/model.

Logging cleanup

always_log in capture-cursor.py and capture-codex.py was writing to log files on every agent event unconditionally (regardless of AGENTDIFF_DEBUG), silently filling ~/.agentdiff/logs/. Removed the function; all call sites replaced with debug_log which is gated on the env var.
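
A logger gated on AGENTDIFF_DEBUG could look roughly like the following. This is a sketch under assumptions: the log file name, the timestamp format, and the rule that "0" or an empty value means disabled are not taken from the PR, and the real debug_log may differ.

```python
import os
import time
from pathlib import Path

# Assumed location; the PR only states that logs live under ~/.agentdiff/logs/.
LOG_DIR = Path.home() / ".agentdiff" / "logs"


def debug_enabled() -> bool:
    """True when AGENTDIFF_DEBUG is set to something other than "" or "0" (assumed convention)."""
    return os.environ.get("AGENTDIFF_DEBUG", "") not in ("", "0")


def debug_log(message: str, log_name: str = "capture.log", log_dir: Path = LOG_DIR) -> bool:
    """Append a timestamped line only when debugging is enabled.

    Returns True if a line was written, False if the call was a no-op.
    """
    if not debug_enabled():
        return False
    log_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / log_name, "a", encoding="utf-8") as fh:
        fh.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {message}\n")
    return True
```

Because the environment check happens before any filesystem access, a normal (non-debug) run does not create the log directory or touch any file.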

Test plan

  • agentdiff verify on a repo with opencode traces shows correct model (not unknown)
  • agentdiff list on a codex-traced repo shows prompt populated from history
  • MCP smoke test passes: scripts/tests/mcp-smoke.txt trace has model_id = mcp-smoke-model
  • ~/.agentdiff/logs/ does not accumulate capture-cursor.log / capture-codex.log entries during normal (non-debug) use
  • agentdiff configure cursor on a machine without an existing hooks.json succeeds

🤖 Generated with Claude Code

codeprakhar25 and others added 8 commits April 20, 2026 11:12
…emove duplicate codex capture

capture-codex.py: `git diff` misses brand-new untracked files. Add
`git ls-files --others --exclude-standard` pass so Codex attribution
works when it creates a file from scratch. Also updated `get_dirty_file_names`
to include untracked files in the pre-task snapshot for correct exclusion.
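
As an illustration of the two-pass approach, the sketch below merges the output of git diff --name-only with git ls-files --others --exclude-standard. The function names are hypothetical and this is not the code from capture-codex.py; it only shows how the untracked pass adds files that git diff alone would not report.

```python
import subprocess
from typing import List


def merge_changed_files(diff_output: str, untracked_output: str) -> List[str]:
    """Combine the two command outputs, preserving order and dropping duplicates."""
    seen = {}
    for line in (diff_output + "\n" + untracked_output).splitlines():
        path = line.strip()
        if path:
            seen.setdefault(path, None)
    return list(seen)


def changed_files(repo: str) -> List[str]:
    """Tracked modifications plus untracked, non-ignored files in repo."""

    def run(*args: str) -> str:
        return subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        ).stdout

    tracked = run("diff", "--name-only")
    # --others lists untracked files; --exclude-standard honours .gitignore.
    untracked = run("ls-files", "--others", "--exclude-standard")
    return merge_changed_files(tracked, untracked)
```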

capture-claude.py: `get_model_and_prompt` was guessing the session file
path from the session ID, but Claude Code organises sessions by repo path
slug, not session ID. Switch to a recursive glob search across
~/.claude/projects/**/{session_id}.jsonl so the model name is always found.
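
A recursive search of this kind might be sketched with pathlib as follows. find_session_file is a hypothetical name, and picking the most recently modified match when several exist is an assumption, not something the commit states.

```python
import glob
from pathlib import Path
from typing import Optional


def find_session_file(session_id: str, projects_root: Optional[Path] = None) -> Optional[Path]:
    """Search every project subdirectory for <session_id>.jsonl.

    Returns the most recently modified match, or None if nothing is found.
    """
    if projects_root is None:
        projects_root = Path.home() / ".claude" / "projects"
    if not projects_root.is_dir():
        return None
    # Escape the ID so characters such as [ or * are matched literally.
    pattern = glob.escape(session_id) + ".jsonl"
    matches = list(projects_root.rglob(pattern))
    if not matches:
        return None
    return max(matches, key=lambda p: p.stat().st_mtime)
```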

codex.rs: `agentdiff configure` was writing both `notify` in config.toml
AND `UserPromptSubmit`/`Stop` in hooks.json. When codex_hooks=true, Codex
fires both for the same task — doubling every session.jsonl entry. Remove
the `notify` key when enabling codex_hooks so only hooks.json fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nds, update CI section

- Fix incorrect claim that configure tracks all repos globally (init is required per repo)
- Remove commands that no longer exist: stats, log, remote-status, migrate, export
- Add install-ci to commands table
- Fix example flags: --out-md/--out-annotations → --out, agentdiff stats → agentdiff report
- Replace manual CI YAML with agentdiff install-ci workflow + correct manual example
- Fix install.sh URL: master → main
- Remove stale config.toml keys (data_dir, auto_amend_ledger)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y, no session-evidence files

- prepare-ledger: preserve agent="human" as semantic token; add git_author
  field separately so finalize-ledger can display the real git username
  without losing the human/AI distinction for type checks
- prepare-ledger: explicitly attribute files with no session.jsonl evidence
  to human rather than inheriting the dominant AI agent — fixes cases where
  AI and human edits are committed together and untracked files were
  incorrectly claimed by the AI
- finalize-ledger: read git_author from payload; use it for tool.name when
  agent=="human" so contributor.type=="human" traces show the committer name
- store: remove session.jsonl load from load_entries() — only AgentTrace
  records belong in the committed view; add load_uncommitted_entries() for
  the --uncommitted path to avoid double-counting and copilot leakage
- list: use load_uncommitted_entries() for the uncommitted view

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to WSL+Windows paths

- Check ~/.cursor/ directory existence instead of hooks.json existence so
  the file is created when Cursor is installed but hooks.json is absent
- Extract configure_cursor_hooks_file() helper to apply the same hooks to
  multiple paths without duplication
- On WSL2, Cursor is a Windows app — scan /mnt/c/Users/*/\.cursor and write
  hooks.json there alongside the WSL ~/.cursor/hooks.json so whichever path
  cursor-server resolves picks up the config
- Summary in print_configure_summary now checks presence_path (dir) separately
  from config_path (file) for all home-based tools, giving accurate output
  when the tool is installed but not yet configured
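
The PR implements this in Rust (src/configure/cursor.rs). Purely to illustrate the path discovery, a Python sketch of the same idea is shown here; cursor_config_dirs is a hypothetical name and the real helper may order or filter paths differently.

```python
import glob
import os
from typing import List


def cursor_config_dirs(home: str, windows_users_root: str = "/mnt/c/Users") -> List[str]:
    """Return every existing .cursor directory that should receive hooks.json.

    The WSL home is checked first, then each Windows user profile under the
    mounted C: drive. Directories that do not exist are skipped.
    """
    candidates = [os.path.join(home, ".cursor")]
    pattern = os.path.join(glob.escape(windows_users_root), "*", ".cursor")
    candidates += sorted(glob.glob(pattern))
    return [d for d in candidates if os.path.isdir(d)]
```

Checking for the directory rather than hooks.json is what allows the file to be created on first configure.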

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… gotchas

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…QLite DB

- Added functions to retrieve the model ID and initial user prompt from the OpenCode SQLite database.
- Implemented fallback mechanisms to read from model.json if the database lookup fails.
- Updated the main capture logic to utilize the new retrieval functions for model and prompt.
- Introduced a comprehensive test script for the agentdiff pipeline, validating the entire capture, prepare, and finalize process with real and simulated agents.
- Improved cursor configuration in Rust to ensure versioning in hooks configuration.
… attribution fallback

- Drop always_log from capture-cursor.py and capture-codex.py — was
  writing to log files on every agent event regardless of AGENTDIFF_DEBUG,
  silently filling ~/.agentdiff/logs/. All call sites replaced with
  debug_log (conditional on AGENTDIFF_DEBUG env var).
- Fix prepare-ledger.py: files with no session event but present in
  MCP files_read now correctly inherit the MCP agent/model instead of
  falling back to "human". Fixes CI mcp-smoke test failure:
  RuntimeError: expected model_id=mcp-smoke-model in trace entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR improves capture fidelity and attribution correctness across four agent integrations (OpenCode, Codex, Cursor, Claude Code) and fixes a misattribution bug in prepare-ledger.py that was causing the MCP smoke test to fail.

Key changes:

  • OpenCode: Model and prompt are now read from SQLite (~/.local/share/opencode/opencode.db) with JSON log fallback, replacing unreliable env-var injection.
  • Codex: Prompt sourced from ~/.codex/history.jsonl by session ID; untracked files included in change detection; pre-task filtering removed in favour of commit-time resolution.
  • Cursor: Multi-key model lookup with cursor-unknown fallback; prompt falls back to agent-transcript JSONL; configure cursor now checks directory existence and creates hooks.json if absent; WSL2 dual-path support added.
  • capture-claude.py: Session file discovery switched to recursive glob; <synthetic> model values skipped; prompt sourced from ~/.claude/history.jsonl.
  • prepare-ledger.py / finalize-ledger.py: agent semantic token ("human") kept distinct from git_author display name; files with no session event but present in MCP files_read are attributed to the MCP agent rather than unconditionally falling back to human.
  • store.rs / list.rs: load_entries() no longer reads session.jsonl; load_uncommitted_entries() is the explicit path for pre-commit data.
  • Logging: Removed unconditional log writes in cursor and codex capture scripts; all logging now gated on AGENTDIFF_DEBUG.
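
The Cursor multi-key lookup from the list above can be illustrated with a few lines of Python. The three key names and the cursor-unknown default come from the PR description; the function name and the assumption that the payload is a flat mapping are illustrative only.

```python
from typing import Any, Mapping

# Key names taken from the PR description; the payload shape is otherwise assumed.
MODEL_KEYS = ("model", "model_name", "modelName")


def cursor_model(payload: Mapping[str, Any], default: str = "cursor-unknown") -> str:
    """Return the first non-empty string found under any known model key."""
    for key in MODEL_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return default
```

Returning a sentinel instead of omitting the field keeps every entry shaped the same way, so downstream consumers do not need a missing-key branch.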

One notable concern: the new MCP attribution logic builds files_read_set as a union of full paths and basenames. Because files_read contains absolute paths while files_touched contains repo-relative paths, the full-path branch never fires and all matching reduces to basename comparison — risking false-positive MCP attribution for files that share a common filename with any file the MCP agent read.

Confidence Score: 4/5

Safe to merge with one targeted fix — the basename-collision misattribution in prepare-ledger.py should be addressed before this hits repos with common filenames.

The majority of the PR is a clear improvement: logging cleanup, store separation, cursor configure fix, and git_author/agent separation are all correct and well-reasoned. The MCP attribution fix resolves the failing smoke test. The one concrete bug is in the new files_read_set lookup: because files_read contains absolute paths and files_touched contains repo-relative paths, all matching reduces to basename comparison, which risks misattributing unrelated files to the MCP agent whenever they share a filename. This is a real misattribution vector but limited to MCP-originated commits, so it does not break the primary single-agent path.

scripts/prepare-ledger.py — the files_read_set basename-union logic at lines 334–340 needs a path-normalisation fix to avoid false-positive MCP attribution.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/prepare-ledger.py | Key MCP attribution fix, but the new files_read_set basename-union lookup can cause false-positive agent attribution for files sharing a common filename with MCP-read files. |
| scripts/capture-opencode.py | New SQLite-based model/prompt lookup for OpenCode; connections not guarded with context managers so exceptions can leave file locks held briefly. DB path may not cover all installation methods. |
| scripts/capture-codex.py | Adds prompt history lookup, untracked-file detection, removes pre-task filter and always_log; logic is sound with a minor duplicate debug-log line. |
| scripts/capture-cursor.py | Removes unconditional logging, adds transcript-based prompt fallback, improves model capture with multi-key lookup and cursor-unknown default; clean changes. |
| scripts/capture-claude.py | Replaces fragile path-guessing with a recursive glob, adds tail-read helper and history.jsonl prompt lookup, skips synthetic model values; straightforward improvement. |
| scripts/finalize-ledger.py | Introduces git_author field and uses it for tool.name on human-authored files, correctly separating semantic human token from display name. |
| src/configure/cursor.rs | Switches from file-existence to directory-existence check and creates hooks.json if absent; adds WSL2 dual-path support; clean refactor into configure_cursor_hooks_file. |
| src/store.rs | Adds load_uncommitted_entries() and removes session.jsonl loading from load_entries(), cleanly separating committed and uncommitted read paths. |
| src/commands/list.rs | Fixes run_uncommitted to use the new load_uncommitted_entries() instead of loading all entries and post-filtering. |
| src/configure/codex.rs | Removes legacy notify key management (now handled by codex_hooks=true), migrating existing configs cleanly by preserving forwarded tools while stripping the agentdiff entry. |
| src/configure/mod.rs | Splits configure summary check into presence-path vs. config-path, giving a clearer hook-missing message when tool is installed but hook file is absent. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent Hook Fires] --> B[Dispatch by agent type]
    B --> C[capture-claude.py]
    B --> D[capture-cursor.py]
    B --> E[capture-codex.py]
    B --> F[capture-opencode.py]
    C --> C1[glob ~/.claude/projects for session JSONL - skip synthetic model values]
    C --> C2[~/.claude/history.jsonl for prompt]
    D --> D1[model from payload keys - fallback cursor-unknown]
    D --> D2[cached prompt file or agent-transcript JSONL]
    E --> E1[git diff + ls-files --others including untracked files]
    E --> E2[~/.codex/history.jsonl for prompt]
    F --> F1[SQLite opencode.db - modelID from latest assistant msg]
    F --> F2[SQLite opencode.db - first user message text]
    C1 & C2 & D1 & D2 & E1 & E2 & F1 & F2 --> G[Write entry to session.jsonl]
    G --> H[pre-commit: prepare-ledger.py]
    H --> H1[File has session event]
    H --> H2[File has no session event]
    H1 --> H3[Use session agent and model]
    H2 --> H4[agent != human AND file in files_read]
    H4 -->|Yes| H5[Use MCP agent and model]
    H4 -->|No| H6[Attribute to human]
    H3 & H5 & H6 --> I[pending_ledger.json with per-file attribution and git_author]
    I --> J[post-commit: finalize-ledger.py]
    J --> K[tool.name = git_author for human or agent name for AI]
    K --> L[Signed AgentTrace appended to traces/branch.jsonl]

Comments Outside Diff (1)

  1. scripts/capture-codex.py, lines 546-547

    P2 Duplicate adjacent debug log lines

    Two consecutive debug_log calls emit almost identical information — the first added in this PR, the second pre-existing:

    debug_log(f"event={event_name!r} turn={turn_id!r} cwd={cwd!r} model={model!r} session={session_id!r}")
    debug_log(f"event_name={event_name!r} turn_id={turn_id!r} cwd_from_events={cwd!r}")

    The second line is a strict subset of the first (omitting model and session_id). The pre-existing line can be removed to reduce log noise.

Reviews (1): Last reviewed commit: "fix: remove unconditional logging from c..."

Comment thread scripts/prepare-ledger.py Outdated
Comment on lines +334 to +340
files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
for fp in files_touched:
    if fp not in events_by_file:
        if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):
            attribution[fp] = {"agent": agent, "model": model}
        else:
            attribution[fp] = {"agent": "human", "model": "human"}

P1 Basename collision can cause false-positive MCP attribution

files_read_set is built by unioning full paths from files_read with their basenames. The check on line 337 then tests whether the committed file appears in that set by either its repo-relative path or its basename:

files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
...
if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):

Because files_read entries are absolute paths (e.g. /home/user/project/src/utils.py) while files_touched entries are repo-relative (e.g. src/utils.py), the full-path test fp in files_read_set will almost never match. In practice the check always falls through to the basename comparison: os.path.basename(fp) in files_read_set.

This means any committed file whose bare filename (e.g. utils.py, README.md, config.py) matches the basename of any file the MCP agent happened to read will be attributed to that MCP agent instead of to human, even if the touched file is completely unrelated. Common basenames make this a realistic misattribution risk.

Consider normalising files_read to repo-relative paths before comparing:

files_read_rel = {
    os.path.relpath(f, repo_root) if f.startswith(repo_root) else f
    for f in files_read
}
for fp in files_touched:
    if fp not in events_by_file:
        if agent != "human" and fp in files_read_rel:
            attribution[fp] = {"agent": agent, "model": model}
        else:
            attribution[fp] = {"agent": "human", "model": "human"}
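
One caveat with a plain startswith check is that a sibling directory sharing the prefix (for example /home/user/project2 against a root of /home/user/project) would also pass. A slightly stricter normalisation, sketched here with os.path.commonpath, avoids that. to_repo_relative is a hypothetical helper, and dropping paths outside the repository is an assumption about the desired behaviour.

```python
import os
from typing import Iterable, Set


def to_repo_relative(paths: Iterable[str], repo_root: str) -> Set[str]:
    """Map absolute paths inside repo_root to repo-relative, forward-slash paths.

    Absolute paths outside the repository are dropped; already-relative paths
    are kept after normalisation.
    """
    root = os.path.realpath(repo_root)
    result = set()
    for p in paths:
        if not os.path.isabs(p):
            result.add(os.path.normpath(p).replace(os.sep, "/"))
            continue
        real = os.path.realpath(p)
        try:
            inside = os.path.commonpath([root, real]) == root
        except ValueError:  # e.g. paths on different drives on Windows
            inside = False
        if inside:
            result.add(os.path.relpath(real, root).replace(os.sep, "/"))
    return result
```

With files_read normalised this way, the membership test against files_touched can be an exact repo-relative comparison with no basename fallback.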

Comment on lines +100 to +130
try:
    conn = sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)
    # Get first user message for this session
    row = conn.execute(
        "SELECT id FROM message WHERE session_id=? "
        "AND json_extract(data,'$.role')='user' "
        "ORDER BY time_created ASC LIMIT 1",
        (session_id,),
    ).fetchone()
    if not row:
        conn.close()
        return "unknown"
    msg_id = row[0]
    # Get text parts for this message
    parts = conn.execute(
        "SELECT data FROM part WHERE message_id=? ORDER BY time_created ASC",
        (msg_id,),
    ).fetchall()
    conn.close()
    for part_row in parts:
        try:
            part = json.loads(part_row[0])
            if part.get("type") == "text" and part.get("text"):
                text = str(part["text"]).strip()
                debug_log(f"opencode prompt from DB: {text[:80]!r}")
                return text[:500]
        except Exception:
            continue
except Exception as exc:
    debug_log(f"opencode prompt DB lookup failed: {exc}")
return "unknown"

P2 SQLite connection leaked on exception before explicit conn.close()

In both get_opencode_model and get_opencode_prompt, the connection is closed with an explicit conn.close() only on the happy path. If conn.execute().fetchone() (or .fetchall()) raises an exception (e.g. schema mismatch, lock timeout, corrupt page), the except block logs and exits, but conn is never closed. Python's GC will eventually reclaim it, but until then the connection, and any lock it holds on the database file, stays open.

The standard fix is a context manager:

import contextlib

with contextlib.closing(sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)) as conn:
    row = conn.execute(...).fetchone()
    ...

This applies to both get_opencode_model (lines 60–68) and get_opencode_prompt (lines 101–118).
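
A fuller version of that pattern, written as a small helper, might look like this. query_one is a hypothetical name and is not part of the PR; the sketch assumes db_path contains no characters that need escaping inside a file: URI.

```python
import contextlib
import sqlite3
from typing import Any, Optional, Sequence


def query_one(db_path: str, sql: str, params: Sequence[Any] = ()) -> Optional[tuple]:
    """Run a read-only query and return the first row, or None on any SQLite failure.

    contextlib.closing guarantees conn.close() runs on both the success path
    and the exception path.
    """
    try:
        with contextlib.closing(
            sqlite3.connect(f"file:{db_path}?mode=ro", uri=True, timeout=2)
        ) as conn:
            return conn.execute(sql, params).fetchone()
    except sqlite3.Error:
        return None
```

Note that using a sqlite3 connection directly as a context manager (with conn:) only manages the transaction and does not close the connection, which is why contextlib.closing is used here.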

Comment thread scripts/capture-opencode.py Outdated
Comment on lines +48 to +49
_OPENCODE_DB = os.path.expanduser("~/.local/share/opencode/opencode.db")
_OPENCODE_MODEL_JSON = os.path.expanduser("~/.local/state/opencode/model.json")

P2 DB path in code differs from PR description

The PR description states the SQLite database is read from ~/.opencode/opencode.db, but the implementation uses ~/.local/share/opencode/opencode.db. If OpenCode stores its database at ~/.opencode/opencode.db on some platforms or installation methods, the DB lookup will silently fail and fall back to model.json / default "opencode".

Worth verifying the canonical DB path across all supported OpenCode installation methods (binary, npm, homebrew) and potentially probing both locations. Is ~/.local/share/opencode/opencode.db the correct path for all OpenCode installation methods, or does it sometimes reside at ~/.opencode/opencode.db as stated in the PR description?
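
Probing both locations could be as simple as the sketch below. The two candidate paths are the ones mentioned in this PR; which one applies to which installation method is not verified here, and first_existing is a hypothetical helper name.

```python
import os
from typing import Iterable, Optional

# Both locations appear in this PR; their mapping to install methods is unverified.
DB_CANDIDATES = (
    "~/.local/share/opencode/opencode.db",
    "~/.opencode/opencode.db",
)


def first_existing(candidates: Iterable[str] = DB_CANDIDATES) -> Optional[str]:
    """Return the first candidate path that exists as a file, else None."""
    for candidate in candidates:
        path = os.path.expanduser(candidate)
        if os.path.isfile(path):
            return path
    return None
```

A None result would then send the caller down the existing model.json / default fallback instead of attempting a connection that is bound to fail.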

…cate debug logs

- prepare-ledger: replace basename-union files_read_set with repo-relative
  path normalisation; full-path match now fires correctly, eliminating
  false-positive MCP attribution on common filenames (e.g. utils.py)
- capture-opencode: guard both SQLite connections with contextlib.closing so
  the file lock is released on exception; probe both DB path candidates
  (~/.local/share/opencode and ~/.opencode) to cover all install methods
- capture-codex: remove 5 duplicate debug_log lines that were strict subsets
  of the preceding log call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codeprakhar25 codeprakhar25 merged commit d2c9b58 into main Apr 23, 2026
2 checks passed
@github-actions

AgentDiff Report

Summary

| Agent | Lines | % |
| --- | --- | --- |
| claude-code | 2049 | 56% |
| Prakhar Khatri | 1641 | 44% |

Files Modified

| File | Lines | Dominant Agent |
| --- | --- | --- |
| src/commands/status.rs | 404 | Prakhar Khatri |
| src/configure/mod.rs | 402 | claude-code |
| scripts/tests/test_extension.js | 282 | Prakhar Khatri |
| src/configure/antigravity.rs | 272 | claude-code |
| src/configure/codex.rs | 257 | claude-code |
| src/commands/remote_status.rs | 185 | claude-code |
| scripts/finalize-ledger.py | 178 | claude-code |
| src/commands/report.rs | 175 | Prakhar Khatri |
| scripts/tests/test_capture_prompts.py | 166 | Prakhar Khatri |
| src/configure/claude.rs | 141 | claude-code |
| src/commands/list.rs | 135 | Prakhar Khatri |
| src/configure/copilot.rs | 114 | claude-code |
| README.md | 113 | claude-code |
| scripts/capture-codex.py | 102 | claude-code |
| src/configure/windsurf.rs | 100 | claude-code |
| src/configure/cursor.rs | 94 | claude-code |
| scripts/tests/test_capture_cursor.py | 60 | Prakhar Khatri |
| src/configure/opencode.rs | 59 | claude-code |
| src/cli.rs | 51 | Prakhar Khatri |
| src/util.rs | 43 | claude-code |
