veltiq/groundtruth

groundtruth

The human-in-the-loop for AI coding — automated.
Catch when your AI agent lies, leaves typos, or skips work — then make it prove the work before it says "done."

npx @veltiq/groundtruth setup

Your agent says "Done! I added a rateLimiter to src/server.ts, fixed the timeout, and added tests." You commit and move on. Two weeks later production breaks — the rate limiter was never written. The summary lied, and nothing checked it against the diff.

groundtruth is the reviewer that does, on every turn — deterministically, with zero LLM calls for the check:

When the summary lies — every claim here is a phantom (the whole "codebase" was one README edit):

[Image: groundtruth flags three claims the diff doesn't support]

When it's honest — the same kind of summary, each claim backed by the real diff:

[Image: groundtruth verifies four honest claims against the diff]

Why

Left unsupervised, AI agents confidently report work they never did. A 2026 study of 23,247 agentic pull requests (Gong et al., MSR'26) found that descriptions claiming changes that were never implemented are the single most common message-vs-code inconsistency (45.4%) — and those PRs were accepted 51.7% less often. Tests catch code that's wrong; nothing catches code that was simply never written but reported as done. That's the gap — and the faster agents code, the more slips through.

groundtruth closes it in two stages:

  1. Verify the claims. It reads the agent's end-of-turn summary, extracts each concrete claim, and grades it against the ground truth — which files changed, which symbols appear in the diff, whether tests or installs actually ran. Built on one rule: the diff doesn't lie.
  2. Make the agent prove it works (opt-in verify loop). Before finishing, the agent must run / screenshot / test the change against your original request, hunt for its own mistakes, and fix-and-recheck until it holds up.

→ higher-quality output you don't have to babysit. (How it compares to tests, manual review, and AI code reviewers.)

Install

Requires Node ≥ 20. One command wires the Stop hook + verify loop + status line, idempotently:

npx @veltiq/groundtruth setup

Restart Claude Code (or run /hooks) and it checks every turn automatically.

Try it in 30 seconds · manual install · plugin
# See it catch a phantom change against a canned transcript — no install, no config:
npx @veltiq/groundtruth verify --transcript examples/phantom-change.jsonl --no-git

# Check the current session without installing anything:
npx @veltiq/groundtruth verify

# Just the claim-check hook (no loop), this project or globally:
npx @veltiq/groundtruth install
npx @veltiq/groundtruth install --global

Prefer plugins?

/plugin marketplace add veltiq/groundtruth
/plugin install groundtruth

The loop can never trap you: a per-session round cap always lets a turn finish, and GROUNDTRUTH_NO_LOOP=1 instantly pauses it.

How it works

transcript ─▶ Turn ─▶ ( Evidence + Claims ) ─▶ Verdicts ─▶ Report
            summary       diff      prose       per-claim
            + tools    ground truth  parse        check

| Verdict | Meaning |
| --- | --- |
| verified | Concrete evidence in the diff backs the claim. |
| unsupported | Concretely checkable and zero matching evidence: a phantom change. |
| ⚠️ review | Vague or semantic ("fixed the bug"); shown for attention, never a failure. |

A deliberate bias toward silence: false alarms get a tool like this uninstalled, so a claim is marked unsupported only when it's unambiguously checkable and nothing in the diff supports it. Everything fuzzy becomes review. groundtruth would rather miss a claim than wrongly accuse a correct one. → docs/how-it-works.md · docs/design.md
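
The grading rule above can be sketched roughly as follows. This is a minimal illustration, not groundtruth's implementation: the `Claim` shape, the `checkable` flag, and the `evidenceCount` field are hypothetical names introduced here.

```typescript
// Hypothetical types for illustration; not groundtruth's actual API.
type Verdict = "verified" | "unsupported" | "review";

interface Claim {
  text: string;
  // True only when the claim names something concrete (a file, a symbol, ...).
  checkable: boolean;
  // Number of pieces of diff evidence that match the claim.
  evidenceCount: number;
}

// Bias toward silence: only an unambiguously checkable claim with zero
// matching evidence is graded "unsupported"; anything fuzzy is "review".
function grade(claim: Claim): Verdict {
  if (!claim.checkable) return "review";
  return claim.evidenceCount > 0 ? "verified" : "unsupported";
}

const claims: Claim[] = [
  { text: "updated src/auth.ts", checkable: true, evidenceCount: 1 },
  { text: "added a rateLimiter", checkable: true, evidenceCount: 0 },
  { text: "fixed the bug", checkable: false, evidenceCount: 0 },
];

for (const c of claims) {
  console.log(`${grade(c)}: ${c.text}`);
}
```

Note that a vague claim is graded review even when no evidence matches it, which is what keeps false accusations out of the report.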

Verify loop — make the agent prove it (opt-in)

[Image: the loop screenshots a page, catches an invisible button, fixes it, and re-verifies, with no human needed]

The claim check grades a turn's words; the loop grades its behavior. With it on (setup enables it, or GROUNDTRUTH_LOOP=1), a turn that changed something is held at the Stop event, and the agent must verify the change in a way that fits the kind of work: open the page in a browser and read a screenshot (web), run the command (CLI), hit the endpoint (API), or run the tests (library). It then checks the result against your original request, fixes any mistakes, and finishes only once it passes. The loop never judges the work itself (so it adds no false positives of its own), and a round cap means it can't loop forever. → docs/verify-loop.md
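
As a rough sketch of why a round cap keeps the loop from trapping a turn, consider the decision below. The `LoopState` shape and the `decide` function are hypothetical and only mirror the behavior described above; the value 6 is taken from the config example in this README, not from a documented default.

```typescript
// Hypothetical sketch of a round-capped verify loop; not groundtruth's code.
interface LoopState {
  round: number;     // verification rounds already used in this session
  maxRounds: number; // the cap
  noLoop: boolean;   // mirrors GROUNDTRUTH_NO_LOOP=1
}

type Decision = "hold" | "release";

// Hold the turn only while a change still needs verifying and rounds remain.
function decide(state: LoopState, changed: boolean, passed: boolean): Decision {
  if (state.noLoop) return "release"; // paused by the user
  if (!changed) return "release"; // nothing to verify
  if (passed) return "release"; // the agent proved the change
  if (state.round >= state.maxRounds) return "release"; // cap reached
  return "hold";
}

const state: LoopState = { round: 0, maxRounds: 6, noLoop: false };
console.log(decide(state, true, false)); // hold
console.log(decide({ ...state, round: 6 }, true, false)); // release
console.log(decide({ ...state, noLoop: true }, true, false)); // release
```

Every branch except the last releases the turn, so the only way to stay held is an unverified change with rounds still available.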

More

CLI usage & flags
groundtruth verify                       # check the latest session for this project
groundtruth verify --transcript x.jsonl  # a specific transcript
groundtruth verify --markdown            # markdown (great as a PR comment)
groundtruth verify --json                # machine-readable output
groundtruth verify --sarif               # SARIF, for GitHub code scanning
groundtruth verify --strict              # exit non-zero if anything is unsupported
groundtruth stats [--all]                # local tally: verified / unsupported / review
groundtruth install --events Stop,SubagentStop,SessionEnd --statusline

By default the hook is non-blocking — it prints a report and gets out of the way. --strict (or GROUNDTRUTH_STRICT=1) makes it block on unsupported claims.
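
The difference between the default and strict modes can be pictured as a mapping from verdicts to an exit status. This is a hedged sketch: `exitCodeFor` is a hypothetical helper, and the value 1 is an assumption, since the text above only says "non-zero".

```typescript
// Hypothetical helper; the exit value 1 is an assumption.
type Verdict = "verified" | "unsupported" | "review";

function exitCodeFor(verdicts: Verdict[], strict: boolean): number {
  // "review" is shown for attention and never fails the run.
  const hasUnsupported = verdicts.some((v) => v === "unsupported");
  return strict && hasUnsupported ? 1 : 0;
}

const verdicts: Verdict[] = ["verified", "review", "unsupported"];
console.log(exitCodeFor(verdicts, false)); // 0: non-blocking by default
console.log(exitCodeFor(verdicts, true)); // 1: strict blocks on unsupported
```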

What it checks
| Claim | Example | Verified when… |
| --- | --- | --- |
| file | "updated src/auth.ts" | that file was touched this turn |
| symbol | "added validateInput" | the identifier appears in the added/removed code |
| test | "added tests" | a test file changed or a test command ran |
| dependency | "installed zod" | a manifest changed or an install command ran |
| command | "ran the build" | a matching command ran via Bash (advisory) |
| action | "fixed the timeout bug" | not machine-checkable → flagged for review |

Full details in docs/claim-types.md.
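
To make the symbol row concrete, here is one way such a check could work against a unified diff. It is an illustration only: `symbolInDiff` is a hypothetical function, and groundtruth's real matcher may differ.

```typescript
// Illustrative only: does an identifier appear on an added or removed line
// of a unified diff? Context lines and file headers are ignored.
function symbolInDiff(diff: string, symbol: string): boolean {
  // Escape regex metacharacters, then require identifier boundaries so that
  // "validateInput" does not match "validateInputs".
  const escaped = symbol.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`(?<![A-Za-z0-9_$])${escaped}(?![A-Za-z0-9_$])`);
  return diff
    .split("\n")
    .filter(
      (line) =>
        (line.startsWith("+") || line.startsWith("-")) &&
        !line.startsWith("+++ ") && // skip the new-file header
        !line.startsWith("--- "), // skip the old-file header
    )
    .some((line) => pattern.test(line.slice(1)));
}

const diff = [
  "--- a/src/form.ts",
  "+++ b/src/form.ts",
  "@@ -1,2 +1,3 @@",
  " export function submit() {}",
  "+export function validateInput(s: string) { return s.length > 0; }",
].join("\n");

console.log(symbolInDiff(diff, "validateInput")); // true
console.log(symbolInDiff(diff, "rateLimiter")); // false
console.log(symbolInDiff(diff, "submit")); // false: only on a context line
```

Restricting the search to changed lines is the point: a symbol that already existed in the file is not evidence that the agent added it this turn.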

Use in CI · commit messages · pre-commit

Grade a PR description against its diff as a sticky comment (works on any PR, zero agent setup):

# .github/workflows/groundtruth.yml
name: groundtruth
on: pull_request
permissions: { contents: read, pull-requests: write }
jobs:
  claim-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with: { fetch-depth: 0 }
      - uses: veltiq/groundtruth@v0.6.1   # add  with: { strict: true }  to gate merges

Verify a commit message against the staged diff — drop in .git/hooks/commit-msg, or via pre-commit:

repos:
  - repo: https://github.com/veltiq/groundtruth
    rev: v0.6.1
    hooks:
      - id: groundtruth

→ docs/github-action.md

Other agents · config · library API

verify reads other agents' transcripts too — the claim engine is agent-neutral:

groundtruth verify --agent codex   # or: gemini, cursor, opencode, aider, auto

Optional .groundtruthrc.json (or a "groundtruth" key in package.json):

{
  "strict": false,
  "ignore": ["CHANGELOG.md", "*.generated.ts"],
  "ignoreKinds": ["command"],
  "loop": { "enabled": false, "maxRounds": 6 }
}

ignore is your escape hatch for any false positive. Use as a library:

import { runPipeline, renderMarkdown } from "@veltiq/groundtruth";
const report = runPipeline({ transcriptPath: "session.jsonl", cwd: process.cwd() });
console.log(renderMarkdown(report));

Privacy & honest limitations
  • Runs entirely locally. Reads your transcript and git, writes nothing except on install. Zero network calls, zero runtime deps. The local tally (~/.groundtruth/ledger.jsonl) stores counts only — never code or prompts.
  • It verifies claimed work exists in the diff, not that it's correct — that's what tests (and the verify loop) are for.
  • Extraction favors precision over recall: it misses vague claims rather than risk a false accusation.

Contributing

Issues and PRs welcome — especially new claim patterns, agent adapters, and false-positive reports (those are gold). See CONTRIBUTING.md.

If groundtruth ever catches your agent in a lie, a ⭐ helps others find it.

License

MIT © Veltiq
