fix: complete proof-backed public explain-runtime full-win rows by mohanagy · Pull Request #520 · mohanagy/madar

mohanagy · 2026-06-10T06:17:58Z

Summary

Completes the public benchmark-proof push on top of #519: fixes the two remaining runtime-proof blockers in retrieve.ts, fixes a source-discovery regression that gutted the dub graph, addresses all CodeRabbit review findings, and refreshes all six public TypeScript explain-runtime legacy receipts to proof-backed full_win bundles generated from the final binary.

Runtime fixes (each with a failing-first regression test)

Out-of-scope runtime-proof recovery (src/runtime/retrieve.ts): the first missing-obligation recovery loop built branch steps only from nodeById (the initial in-scope slice), silently dropping recovery candidates whose evidence exists in the graph but outside the slice scope. It now materializes those nodes via executionSliceStepFromGraph with rootPath threaded through, matching the phase-recovery loop. Regression: tests/unit/retrieve-slice-v1.test.ts.
Dangling stdio relationships after compaction (src/runtime/retrieve.ts): compactRetrievePayloadForStdioProfile sliced matched_nodes and relationships independently, so retained relationships could reference trimmed nodes. Relationships are now filtered against the retained matched-node ID set before the cap. Regression: tests/unit/retrieve.test.ts.
Top-level lib/ discovery fidelity (src/shared/source-discovery.ts): the blanket top-level lib/** hard-ignore also dropped hand-written TypeScript source (e.g. dub's apps/web/lib/** link-resolution middleware), flipping the dub row's benchmark readiness to degraded. The hard-ignore now targets compiled artifacts only (.js/.cjs/.mjs/.d.ts). Regressions: tests/unit/source-discovery.test.ts.
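
The dangling-relationship fix above can be sketched as follows. This is a minimal illustration, not the code in src/runtime/retrieve.ts: the `MatchedNode`, `Relationship`, and `RetrievePayload` shapes, the `from`/`to` field names, and the `compactPayload` helper are all assumptions made for the example. The point it shows is ordering: filter relationships against the retained node IDs first, then apply the cap.

```typescript
// Hypothetical payload shapes; the real types in src/runtime/retrieve.ts may differ.
interface MatchedNode { id: string }
interface Relationship { from: string; to: string }
interface RetrievePayload {
  matched_nodes: MatchedNode[];
  relationships: Relationship[];
}

// Hypothetical helper: trims nodes, then keeps only relationships whose
// endpoints both survived the trim, and only then applies the relationship cap.
function compactPayload(
  payload: RetrievePayload,
  maxNodes: number,
  maxRelationships: number,
): RetrievePayload {
  const matched_nodes = payload.matched_nodes.slice(0, maxNodes);
  const retainedIds = new Set(matched_nodes.map((node) => node.id));
  const relationships = payload.relationships
    .filter((rel) => retainedIds.has(rel.from) && retainedIds.has(rel.to))
    .slice(0, maxRelationships);
  return { matched_nodes, relationships };
}
```

Filtering before the cap means the cap is never spent on relationships that would be dropped anyway, which is the behavior the description above attributes to the fix.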

CodeRabbit review fixes

Thread rootPath into runtime-proof recovery so recovered branch steps emit workspace-relative source files (fixes the mixed path formats flagged in receipts).
Use real primary-path boundaries in recovery phase-coverage scoring instead of [].
Include focused bash follow-ups in prompt-contract follow-up input extraction, matching focused-call classification.
Activate preserveFinalRuntimeEntrypointContextPreview by removing its self-excluding kept-key filter.
Decide file-stem uniqueness on _makeId-normalized ids with deterministic collision suffixes (foo-bar.ts vs foo_bar.ts), with a regression test.
Include the module stem in the Express analysis cache validity check.
Fail fast when the benchmark suite is missing the built CLI.
Platform-aware path assertions for Windows CI.
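
As an illustration of the file-stem item above, here is a minimal sketch of deciding uniqueness on normalized ids with deterministic suffixes. `normalizeId` is a hypothetical stand-in for `_makeId` (its real normalization rules are not shown in this PR text), and `assignUniqueIds` is an invented helper name; the sketch also assumes the input stems are distinct strings.

```typescript
// Hypothetical stand-in for _makeId: lowercases and collapses every run of
// non-alphanumeric characters into "_". The real normalization may differ.
function normalizeId(stem: string): string {
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

// Hypothetical helper: maps each stem to an id that is unique after normalization.
function assignUniqueIds(stems: string[]): Map<string, string> {
  const taken = new Set<string>();
  const result = new Map<string, string>();
  // Sort a copy so suffix assignment does not depend on discovery order.
  for (const stem of [...stems].sort()) {
    const base = normalizeId(stem);
    let id = base;
    // Append _2, _3, ... until the normalized id is free.
    for (let n = 2; taken.has(id); n += 1) {
      id = `${base}_${n}`;
    }
    taken.add(id);
    result.set(stem, id);
  }
  return result;
}
```

Under these assumptions, `foo-bar` and `foo_bar` normalize to the same base, so the one that sorts later receives a numeric suffix regardless of the order in which files were discovered.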

Carried review follow-ups

Same-turn retrieve persistence in compare, prompt-contract targeting for same-turn follow-ups, routing/tool/latency false-win scoring, SPI cache invalidation for nested tsconfig/jsconfig, Express entrypoint proof recognition, nested Next.js src/app|pages|middleware SPI detection, claims doc relative links, duplicate-stem extraction tests.

Receipts

All six public legacy rows were rerun sequentially with the final binary after all fixes. The success criteria below apply to the warm-cache/legacy/ trial reports; any /spi/ trial in the same bundle is a separate receipt and is not claimed to meet the same criteria. Every legacy report has benchmark_outcome = "full_win", benchmark_readiness = "ready", answer_quality.madar.passed = true, prompt_contract = "followed", answer_contract.runtime_proof.missing_obligations = [], and consistent workspace-relative evidence paths:

| Repo | Receipt |
| --- | --- |
| documenso | `docs/benchmarks/suite/results/2026-06-10T07-01-18` |
| formbricks | `docs/benchmarks/suite/results/2026-06-10T07-10-44` |
| dub | `docs/benchmarks/suite/results/2026-06-10T07-24-01` |
| twenty | `docs/benchmarks/suite/results/2026-06-10T07-39-11` |
| cal-diy | `docs/benchmarks/suite/results/2026-06-10T08-08-44` |
| novu | `docs/benchmarks/suite/results/2026-06-10T08-24-49` |

docs/benchmarks/suite/README.md, docs/claims-and-evidence.md, and tests/unit/benchmark-suite-docs.test.ts all point at these bundles.
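
The legacy-report success criteria listed in this section could be checked with something like the sketch below. It is a hedged illustration only: the `LegacyReport` interface is an assumed, trimmed-down shape (the nested `outcome` and `status` fields follow the field paths quoted in the review comments, and `prompt_contract` is assumed to be a plain string), and `legacyReportViolations` is a hypothetical helper, not part of the suite. It does not cover the path-format consistency check.

```typescript
// Assumed, trimmed-down shape of a warm-cache/legacy report.json.
interface LegacyReport {
  benchmark_outcome: { outcome: string };
  benchmark_readiness: { status: string };
  answer_quality: { madar: { passed: boolean } };
  prompt_contract: string;
  answer_contract: { runtime_proof: { missing_obligations: string[] } };
}

// Hypothetical helper: returns the names of the invariants a report violates.
function legacyReportViolations(report: LegacyReport): string[] {
  const violations: string[] = [];
  if (report.benchmark_outcome.outcome !== "full_win") violations.push("benchmark_outcome");
  if (report.benchmark_readiness.status !== "ready") violations.push("benchmark_readiness");
  if (!report.answer_quality.madar.passed) violations.push("answer_quality.madar.passed");
  if (report.prompt_contract !== "followed") violations.push("prompt_contract");
  if (report.answer_contract.runtime_proof.missing_obligations.length > 0) {
    violations.push("answer_contract.runtime_proof.missing_obligations");
  }
  return violations;
}
```

An empty result means the report meets every listed field-level criterion; an SPI trial such as the one flagged in review would be expected to return several entries.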

Test plan

npm run typecheck
npm run build
CI=1 npm run test:run — 188 files, 2539 passed, 1 skipped
All regressions written failing-first, then fixed
Six isolated warm explain-runtime rows rerun sequentially with the final binary — 6/6 legacy full_win
Share-safe reports verified for all documented invariants, path-format consistency, and no local path leaks

Summary by CodeRabbit

New Features
- Stronger runtime-proof evidence matching and stricter retrieval behavior for more reliable proof-based answers
- Expanded SPI/workspace support across additional repos and languages; more consistent workspace provisioning for benchmarks
Documentation
- Updated benchmark suite with per-repo dated results, new share-safe reports, and clearer claims-and-evidence guidance
Bug Fixes
- More stable multi-file ID generation and improved execution-slice planning/confidence for accurate runtime analysis

- recover out-of-scope runtime-proof obligation evidence: the first missing-obligation recovery loop now materializes graph nodes outside the initial slice scope, matching the phase-recovery loop
- drop dangling stdio relationships: compaction now filters relationships against the retained matched-node ID set before the cap
- keep hand-written top-level lib/ TypeScript source discoverable while still hard-ignoring compiled lib output (js/cjs/mjs/d.ts), restoring dub apps/web/lib middleware evidence
- carry same-turn retrieve persistence, prompt-contract targeting, routing/tool/latency scoring, SPI cache invalidation, Express and nested Next.js SPI detection fixes from review follow-ups
- refresh all six public TypeScript explain-runtime legacy receipts (documenso, formbricks, dub, twenty, cal-diy, novu) with proof-backed full_win bundles generated sequentially from the final binary, and point suite README, claims-and-evidence, and docs tests at them

coderabbitai · 2026-06-10T06:18:12Z

📝 Walkthrough

Benchmark docs/results were refreshed and many runtime systems were extended: strict runtime-proof retrieval/slicing, trace-derived compare-pack merging, stdio compaction, file-stem extraction identity, JS/TS resolution and SPI projector changes, workspace Madar shim + .mcp.json PATH pinning, and updated tests.

Changes

Benchmark artifacts and runtime-proof infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Docs: receipts, claims, and suite README `docs/benchmarks/suite/README.md`, `docs/claims-and-evidence.md`, `docs/benchmarks/suite/repos.json`, `docs/benchmarks/suite/runtime-proof.json`, `docs/benchmarks/suite/results/*` | Updated published per-repo summary receipts, result bundles, runtime-proof obligation evidence_terms, and public-claim mapping; many new dated result artifacts and summary files were added. |
| Compare, trace merging, and prompt shaping `src/infrastructure/compare.ts`, `src/infrastructure/context-pack-command.ts` | Collects tool_result payloads from Madar traces, reconstructs/merges follow-up compare-packs into an effective pack, extends native-agent report to include answer_contract/execution_slice, and threads retrieval-ready signals into strict prompt guidance and context-pack payloads. |
| Runtime retrieval, slicing, and runtime-proof logic `src/runtime/retrieve.ts`, `src/runtime/retrieve/slicing.ts`, `src/runtime/runtime-proof.ts`, `src/runtime/stdio/tools.ts` | Adds strict runtime-proof scoring/anchor/neighborhood selection, direct-evidence gating, stdio compaction profiles, runtime-proof term matching/counting API, and stdio tool strict override wiring with strict retrieve payload projection. |
| Extraction identity and cross-file linking `src/pipeline/extract/core.ts`, `src/pipeline/extract.ts`, `src/pipeline/extract/cross-file.ts`, `src/pipeline/extract/*` | Introduces file-stem context helpers (`withExtractionFileStemContext`, `fileStemForPath`, `fileNodeIdForPath`), upgrades cache payloads to include __madarFileStem, updates many extractors and cross-file resolvers to use stem-aware IDs, and wraps extract workflow in the stem context. |
| SPI, TS project config resolution, and projector changes `src/pipeline/spi/build.ts`, `src/pipeline/spi/cache.ts`, `src/pipeline/spi/projector.ts`, `src/pipeline/spi/framework-nextjs.ts` | Nearest-project tsconfig discovery, per-file compilerOptions resolver, SPI projector file-stem disambiguation, synthetic external call symbol handling, and Next.js convention/path normalization. |
| Workspace provisioning and test updates `src/infrastructure/benchmark/suite.ts`, `tests/unit/benchmark-suite.test.ts`, `tests/unit/benchmark-suite-docs.test.ts` | Writes a local Madar shim into workspace `.claude/bin`, prepends it into `.mcp.json` server.env PATH/Path, updates tests to stub Madar CLI and assert .mcp.json contains the pinned PATH, and refreshes test expectations for updated manifests and receipts. |
| Source discovery ignores `src/shared/source-discovery.ts` | Refines hard-ignore globs/regexes to only ignore `lib/` JS/module/type files rather than the entire `lib/*` subtree. |
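
To make the source-discovery change in the last row concrete, here is a minimal sketch of a hard-ignore predicate that targets compiled `lib/` output only. The regex, the `isHardIgnoredLibFile` name, and the assumption that paths are relative to the scanned root are all illustrative; the actual globs and regexes in `src/shared/source-discovery.ts` are not shown in this PR text.

```typescript
// Assumed extension list, taken from the PR description: .js, .cjs, .mjs, .d.ts.
// The pattern is anchored at "lib/" relative to the scanned root (an assumption).
const COMPILED_LIB_FILE = /^lib\/.*\.(?:js|cjs|mjs|d\.ts)$/;

// Hypothetical predicate: true only for compiled artifacts under top-level lib/.
function isHardIgnoredLibFile(relativePath: string): boolean {
  // Normalize Windows separators so the same pattern works on every platform.
  const posixPath = relativePath.replace(/\\/g, "/");
  return COMPILED_LIB_FILE.test(posixPath);
}
```

With this shape, a hand-written `lib/middleware/link.ts` stays discoverable while `lib/index.js` and `lib/types.d.ts` are still skipped.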

Sequence Diagram(s)

sequenceDiagram
  participant TestHarness
  participant BenchmarkSuite
  participant MadarTrace
  participant CompareService
  participant RetrieveService
  participant ContextPack

  TestHarness->>BenchmarkSuite: generate run (writes artifacts, prompts, results)
  BenchmarkSuite->>MadarTrace: run madar arm (captures verbose trace + tool_result)
  MadarTrace->>CompareService: provide raw trace + tool_result payloads
  CompareService->>RetrieveService: reconstruct follow-up retrievals & merge into effective pack
  RetrieveService->>ContextPack: provide matched_nodes + execution_slice for explain payload
  ContextPack->>CompareService: preserve runtime-entry-context expandables

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

mohanagy/madar#391: Related compare/native-agent prompt and contract handling changes.
mohanagy/madar#395: Related benchmark-readiness and compare logic updates.
mohanagy/madar#411: Overlapping native-agent prompt/contract adjustments.

Poem

🐰 I hop through receipts and proofs with care,
Stems aligned and traces laid bare.
A shim, a pack, a merged-up plot,
New results written — neat and hot.
I nibble bugs and then I hop — review complete, I’ll never stop.

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (3)

tests/unit/benchmark-suite-docs.test.ts (1)
242-290: ⚡ Quick win

Remove the duplicated test block.

This it('publishes a scoped Twenty receipt when the root suite graph is too large', ...) block duplicates the earlier one verbatim, so it adds runtime and maintenance cost without new coverage.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/benchmark-suite-docs.test.ts` around lines 242 - 290, the duplicated
test block is the it(...) titled "publishes a scoped Twenty receipt when
the root suite graph is too large" — remove the entire duplicate it(...) block
(the second occurrence) so only one such test remains; search for the
it("publishes a scoped Twenty receipt when the root suite graph is too large",
...) and delete the redundant copy, then run the unit tests to confirm no
regressions.
src/runtime/retrieve/slicing.ts (1)
429-469: 💤 Low value

Consider extracting the max anchor calculation to avoid duplication.

The maxAnchors calculation logic appears twice (lines 794-796 and 860-862) with identical formulas. Consider extracting this into a helper function or computing it once at the start of buildAnchors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/retrieve/slicing.ts` around lines 429 - 469, The maxAnchors
calculation is duplicated in buildAnchors; extract the logic into a single
helper (e.g., computeMaxAnchors or getMaxAnchors) and call it from buildAnchors
instead of repeating the formula, updating both sites where the formula appears
(the duplicated blocks around lines referencing maxAnchors) to use that helper;
ensure the helper accepts the same inputs used in the original formula and
return the computed number so buildAnchors remains functionally identical.
src/runtime/stdio/tools.ts (1)
168-205: 💤 Low value

Consider simplifying the conditional type extractions.

The StrictRuntimeProofRetrievePayload interface uses nested conditional type extractions (extends infer T ? T extends { ... } ? ...) which are verbose. Since RetrieveResult is a known interface, you could directly reference the nested types or define explicit type aliases for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/stdio/tools.ts` around lines 168 - 205, The conditional type
extractions inside StrictRuntimeProofRetrievePayload (for answer_contract and
execution_slice) are overly verbose; replace them by introducing simple type
aliases from RetrieveResult (e.g., type AnswerContract =
RetrieveResult['answer_contract'] and type ExecutionSlice =
RetrieveResult['execution_slice']) and then reference
AnswerContract['confidence'], AnswerContract['runtime_proof'],
ExecutionSlice['status'], ExecutionSlice['confidence'], etc., or directly use
RetrieveResult['...'] properties without the nested infer pattern to make the
interface clearer and easier to maintain.
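
The alias-based shape suggested in this nitpick might look like the sketch below. The `RetrieveResult` interface here is a hypothetical, trimmed-down stand-in (field types and optionality are assumptions), and `projectStrictPayload` is an invented helper added only to show the types in use; `NonNullable` and `Pick` are standard TypeScript utility types.

```typescript
// Hypothetical, trimmed-down RetrieveResult; the real interface has more fields
// and may type these differently.
interface RetrieveResult {
  answer_contract?: {
    confidence: number;
    runtime_proof: { missing_obligations: string[] };
  };
  execution_slice?: { status: string; confidence: number };
}

// Plain aliases via indexed access types replace the nested `infer` pattern.
type AnswerContract = NonNullable<RetrieveResult["answer_contract"]>;
type ExecutionSlice = NonNullable<RetrieveResult["execution_slice"]>;

interface StrictRuntimeProofRetrievePayload {
  answer_contract?: Pick<AnswerContract, "confidence" | "runtime_proof">;
  execution_slice?: Pick<ExecutionSlice, "status" | "confidence">;
}

// Hypothetical projection: copies only the fields the strict payload exposes.
function projectStrictPayload(result: RetrieveResult): StrictRuntimeProofRetrievePayload {
  const payload: StrictRuntimeProofRetrievePayload = {};
  if (result.answer_contract) {
    const { confidence, runtime_proof } = result.answer_contract;
    payload.answer_contract = { confidence, runtime_proof };
  }
  if (result.execution_slice) {
    const { status, confidence } = result.execution_slice;
    payload.execution_slice = { status, confidence };
  }
  return payload;
}
```

Because the aliases are derived from `RetrieveResult`, a change to the source interface still propagates to the strict payload type, which is the property the conditional extractions were presumably protecting.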

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json`:
- Around line 170-176: The PR incorrectly implies that all trials (including
warm-cache/spi/trial-001) meet the "ready/full_win/no missing obligations"
criteria; however the SPI trial shows benchmark_outcome.outcome: "not_measured",
benchmark_readiness.status: "not_ready" and a missing runtime proof obligation
("send preparation"), whereas warm-cache/legacy shows full_win/ready/no missing
obligations (answer_contract.runtime_proof.missing_obligations: []). Update the
PR text to either explicitly exclude warm-cache/spi trials from the
"ready/full_win/no missing obligations" claim or add a clarifying sentence that
only warm-cache/legacy trials (not warm-cache/spi) meet those criteria,
referencing the fields benchmark_outcome.outcome and benchmark_readiness.status
to justify the distinction.

In
`@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json`:
- Around line 260-314: The report generator is emitting mixed source_file paths
in answer_contract.runtime_proof.obligations[].evidence[].source_file;
standardize this in the CLI emitter by choosing one format and applying it
consistently (e.g., always use path.relative(projectRoot, filePath) for relative
paths or always prefix with "<project-root>/"). Locate the reporter function
that builds evidence.source_file (search for answer_contract or runtime_proof
emission logic and the function that serializes obligations/evidence) and
replace the current ad-hoc path handling with a single utility that normalizes
paths (use a projectRoot constant and either path.relative + normalize or
prepend "<project-root>/" unconditionally), and add a unit test asserting both
example inputs (apps/... and packages/...) produce the same chosen format.

In `@src/infrastructure/benchmark/suite.ts`:
- Around line 544-546: The benchmarkWorkspaceCliPath() helper currently returns
a path to the built CLI without ensuring the file exists; update it to validate
the target file (e.g., use fs.existsSync or fs.promises.access) and either throw
a clear Error or return undefined when the file is missing so callers that
generate shims (the code that writes shim wrappers for the CLI) can fail fast;
specifically modify benchmarkWorkspaceCliPath() to compute the path, check
existence of join(findPackageRoot(), 'dist', 'src', 'cli', 'bin.js'), and on
missing file throw an Error with a descriptive message ('built CLI not found at
<path>') or adjust callers to handle the missing case and avoid writing broken
shims.

In `@src/infrastructure/compare.ts`:
- Around line 1646-1658: The focused follow-up extraction only accepts
Madar/Read tools and thus misses focused bash follow-ups; update the filter in
traceFocusedFollowUpInputs (the block using
isMadarTraceToolName/isFocusedFollowUpTraceToolName and pushing into
followUpInputs from toolInputs[index]) to also consider focused bash calls
(e.g., include isBashTraceToolName(toolName) or extend
isFocusedFollowUpTraceToolName to return true for bash). Ensure the same
normalization via normalizeAnswerQualityText and pushing into followUpInputs
remains unchanged so focused bash inputs are captured for runtime-proof
targeting.

In `@src/infrastructure/context-pack-command.ts`:
- Around line 823-845: The preview generation is a no-op because keptKeys is
built from matchedNodes and then used to reject every preview candidate; fix by
changing the rejection logic so preview candidates are not self-excluded—either
(A) build keptKeys from a different source (e.g., the set of keys that must
truly be preserved, not from matchedNodes), or (B) remove the keptKeys.has(key)
check from the filter inside the preview mapping so only existingPreviewKeys
blocks duplication; update the code around matchedNodes, keptKeys,
existingPreviewKeys and the preview mapping (the preview variable and
runtimePrimaryPathRecordKey/asJsonRecord usages) accordingly.

In `@src/pipeline/extract/core.ts`:
- Around line 25-63: buildFileStemMap currently ensures uniqueness only on raw
basenames but fileNodeIdForPath later calls _makeId(stem), causing collisions
(e.g., foo-bar vs foo_bar). Update buildFileStemMap to consider the normalized
id from _makeId when deciding uniqueness: for any basename that collides
(stemCounts > 1), compute relativeStem as you do now, then iteratively widen the
candidate stem by prepending parent segments (or otherwise adjusting) and call
_makeId(candidate) until the returned id is unique among all already-chosen
_makeId results; store the original filePath -> chosen stem (not the id) so
fileNodeIdForPath/_makeId produces consistent, collision-free ids. Ensure you
reference buildFileStemMap, fileNodeIdForPath, and _makeId in the change.

In `@src/pipeline/extract/frameworks/express.ts`:
- Around line 92-93: The cached Express analysis (exportedBindings and
routeRecords) keyed only by filePath+sourceText can become incorrect when
moduleStem(filePath) (which calls fileStemForPath) yields a different stem;
update analyzeExpressModule's caching to include the current stem (derived via
moduleStem(filePath)) in the cache key or store the stem with the cached value
and invalidate/refresh the cache when the stored stem differs from the current
moduleStem. Locate analyzeExpressModule and its cache lookup/insert logic and
ensure the cache key/value accounts for moduleStem(filePath) (or performs a stem
equality check before reusing exportedBindings/routeRecords).

In `@src/runtime/retrieve.ts`:
- Around line 3597-3611: The recovery logic in
recoverMissingRuntimeProofBranches recreates out-of-scope steps without passing
the rootPath used by buildExecutionSlice, causing raw source_file paths to leak;
fix it by threading the rootPath argument through the recovery helper(s) that
call executionSliceStepFromGraph so those calls use
executionSliceStepFromGraph(..., rootPath) instead of the no-rootPath overload,
updating all recovery passes (the second recovery pass and the similar blocks
referenced around 3710-3736 and 3903-3919) to accept and forward rootPath from
buildExecutionSlice into the branch materialization flow (ensure function
signatures and invocations for the helper(s) and
recoverMissingRuntimeProofBranches reflect the added rootPath parameter).
- Around line 3686-3693: The call to phaseCoverageForPath is passing an empty
array for primary path boundaries, causing queue to be treated as missing;
update the helper to pass the real primaryBoundaries variable instead of []
wherever phaseCoverageForPath is invoked (e.g., the assignment to
currentPhaseCoverage and the other similar calls shown) so the recovery-phase
scoring uses the actual primary-path boundaries; locate uses of
phaseCoverageForPath (e.g., currentPhaseCoverage) and replace the empty array
argument with primaryBoundaries, and apply the same change to the other
occurrences listed in the comment.

---

Nitpick comments:
In `@src/runtime/retrieve/slicing.ts`:
- Around line 429-469: The maxAnchors calculation is duplicated in buildAnchors;
extract the logic into a single helper (e.g., computeMaxAnchors or
getMaxAnchors) and call it from buildAnchors instead of repeating the formula,
updating both sites where the formula appears (the duplicated blocks around
lines referencing maxAnchors) to use that helper; ensure the helper accepts the
same inputs used in the original formula and return the computed number so
buildAnchors remains functionally identical.

In `@src/runtime/stdio/tools.ts`:
- Around line 168-205: The conditional type extractions inside
StrictRuntimeProofRetrievePayload (for answer_contract and execution_slice) are
overly verbose; replace them by introducing simple type aliases from
RetrieveResult (e.g., type AnswerContract = RetrieveResult['answer_contract']
and type ExecutionSlice = RetrieveResult['execution_slice']) and then reference
AnswerContract['confidence'], AnswerContract['runtime_proof'],
ExecutionSlice['status'], ExecutionSlice['confidence'], etc., or directly use
RetrieveResult['...'] properties without the nested infer pattern to make the
interface clearer and easier to maintain.

In `@tests/unit/benchmark-suite-docs.test.ts`:
- Around line 242-290: The duplicated test block is the it(...) titled
"publishes a scoped Twenty receipt when the root suite graph is too large" —
remove the entire duplicate it(...) block (the second occurrence) so only one
such test remains; search for the it("publishes a scoped Twenty receipt when the
root suite graph is too large", ...) and delete the redundant copy, then run the
unit tests to confirm no regressions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 872b4dbe-4d11-4a52-afec-63688e1069c5

📥 Commits

Reviewing files that changed from the base of the PR and between 92bd488 and 947812a.

📒 Files selected for processing (136)

docs/benchmarks/suite/README.md
docs/benchmarks/suite/repos.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.json
docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.md
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.json
docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.md
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.json
docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.md
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.json
docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.md
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.json
docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.md
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
docs/benchmarks/suite/runtime-proof.json
docs/claims-and-evidence.md
src/infrastructure/benchmark/suite.ts
src/infrastructure/compare.ts
src/infrastructure/context-pack-command.ts
src/pipeline/extract.ts
src/pipeline/extract/core.ts
src/pipeline/extract/cross-file.ts
src/pipeline/extract/frameworks/express.ts
src/pipeline/extract/frameworks/nest.ts
src/pipeline/extract/generic.ts
src/pipeline/extract/non-code.ts
src/pipeline/extract/python-rationale.ts
src/pipeline/spi/build.ts
src/pipeline/spi/cache.ts
src/pipeline/spi/framework-nextjs.ts
src/pipeline/spi/projector.ts
src/runtime/retrieve.ts
src/runtime/retrieve/slicing.ts
src/runtime/runtime-proof.ts
src/runtime/stdio/tools.ts
src/shared/source-discovery.ts
tests/fixtures/go-semantic-workspace/cmd/chi/main.go
tests/unit/benchmark-suite-docs.test.ts
tests/unit/benchmark-suite.test.ts
tests/unit/compare-native-agent.test.ts
tests/unit/compare.test.ts
tests/unit/extract-duplicate-file-stems.test.ts
tests/unit/extract-go.test.ts
tests/unit/extract-tsconfig-alias-runtime-proof.test.ts
tests/unit/retrieve-framework-boost-v0-19.test.ts
tests/unit/retrieve-slice-v1.test.ts
tests/unit/retrieve.test.ts
tests/unit/runtime-proof.test.ts
tests/unit/source-discovery.test.ts
tests/unit/spi-cache.test.ts
tests/unit/spi-calls.test.ts
tests/unit/spi-framework-nextjs.test.ts
tests/unit/spi-projector.test.ts
tests/unit/spi-scoped-root-tsconfig.test.ts
tests/unit/stdio-server.test.ts
tests/unit/stdio-slice-surface.test.ts

👮 Files not reviewed due to content moderation or server errors (8)

docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
src/pipeline/extract/cross-file.ts
src/pipeline/extract/frameworks/nest.ts
src/pipeline/spi/build.ts
src/pipeline/spi/cache.ts
src/pipeline/spi/framework-nextjs.ts

coderabbitai · 2026-06-10T06:42:36Z

+      "obligations": [
+        {
+          "id": "booking_request",
+          "label": "booking request",
+          "kind": "entrypoint",
+          "required": true,
+          "evidence": [
+            {
+              "label": "API /api/book/event",
+              "source_file": "apps/web/pages/api/book/event.ts",
+              "line_number": 1
+            }
+          ]
+        },
+        {
+          "id": "availability_validation",
+          "label": "availability validation",
+          "kind": "handoff",
+          "required": true,
+          "evidence": [
+            {
+              "label": "checkForConflicts()",
+              "source_file": "packages/features/bookings/lib/handleNewBooking/ensureAvailableUsers.ts",
+              "line_number": 243
+            }
+          ]
+        },
+        {
+          "id": "scheduled_event_persistence",
+          "label": "scheduled event persistence",
+          "kind": "terminal",
+          "required": true,
+          "evidence": [
+            {
+              "label": "saveBooking()",
+              "source_file": "packages/features/bookings/lib/handleNewBooking/createBooking.ts",
+              "line_number": 96
+            }
+          ]
+        },
+        {
+          "id": "notification_delivery",
+          "label": "notification delivery",
+          "kind": "terminal",
+          "required": true,
+          "evidence": [
+            {
+              "label": "sendScheduledEmailsAndSMS()",
+              "source_file": "<project-root>/packages/app-store/_utils/payments/handlePaymentSuccess.ts",
+              "line_number": 211
+            }
+          ]
+        }
+      ],
+      "missing_obligations": []


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if other report.json files in this PR have the same inconsistency
rg -n --type=json '<project-root>/' -g 'report.json' -C2

Repository: mohanagy/madar

Length of output: 50370

🏁 Script executed:

#!/bin/bash
set -euo pipefail
TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"
python3 - <<'PY'
import json, os, re
from pathlib import Path
target = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(target.read_text())
# Extract all source_file values from runtime_proof.obligations[*].evidence[*].source_file
sources = []
try:
    rp = data["runtime_proof"]
except Exception:
    rp = data.get("runtime_proof", {})
obs = rp.get("obligations", [])
for o in obs:
    for ev in o.get("evidence", []):
        sf = ev.get("source_file")
        if sf is not None:
            sources.append(sf)
print("TARGET_FILE:", str(target))
print("source_file count:", len(sources))
unique = sorted(set(sources))
print("unique formats:")
for s in unique:
    print(" -", s)
def bucket(s):
    if s.startswith("<project-root>/"):
        return "project_root_prefixed"
    return "relative_or_other"
buckets = { "project_root_prefixed":0, "relative_or_other":0 }
for s in sources:
    buckets[bucket(s)] += 1
print("bucket counts:", buckets)
PY
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    rp = data.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    for o in obs:
        for ev in o.get("evidence") or []:
            sf = ev.get("source_file")
            if sf:
                sources.append(sf)
    return sources
mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        # summarize counts
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break
print("TOTAL report.json under docs/benchmarks/suite/results:", total)
print("Found mixed-format files (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY

Repository: mohanagy/madar

Length of output: 391

🏁 Script executed:

#!/bin/bash
set -euo pipefail
TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"
python3 - <<'PY'
import json
from pathlib import Path
p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())
# 1) See top-level keys
print("TOP_LEVEL_KEYS:", sorted(data.keys()))
# 2) Recursively count all source_file occurrences
sources=[]
paths=[]
def walk(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="source_file" and isinstance(v,str):
                sources.append(v)
                paths.append(np)
            walk(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            walk(v, f"{path}[{i}]")
walk(data)
print("TOTAL_SOURCE_FILE_OCCURRENCES:", len(sources))
print("UNIQUE_SOURCE_FILE_VALUES (up to 50):")
for s in sorted(set(sources))[:50]:
    print(" -", s)
pref = sum(1 for s in sources if s.startswith("<project-root>/"))
other = len(sources)-pref
print("BUCKET_COUNTS:", {"project_root_prefixed":pref, "relative_or_other":other})
# 3) Locate any 'obligations' keys and print nearby structure clues
obligations_paths=[]
def find_obligations(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="obligations":
                obligations_paths.append(np)
            find_obligations(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            find_obligations(v, f"{path}[{i}]")
find_obligations(data)
print("OBLIGATIONS_PATHS:", obligations_paths[:20])
PY

Repository: mohanagy/madar

Length of output: 1187

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path
p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())
rp = data["answer_contract"]["runtime_proof"]
obs = rp.get("obligations", [])
print("OBLIGATIONS_COUNT:", len(obs))
if obs:
    o0 = obs[0]
    print("OBLIGATION_0_KEYS:", sorted(o0.keys()))
    # print one evidence-like field if present
    for k,v in o0.items():
        if isinstance(v, list):
            print(f" - list field: {k} len={len(v)} sample_type={type(v[0]).__name__ if v else 'n/a'}")
            if k=="evidence":
                print(" - evidence sample:", v[0] if v else None)
# gather all source_file keys within obligations objects
sources=[]
def walk(x):
    if isinstance(x, dict):
        for k,v in x.items():
            if k=="source_file" and isinstance(v,str):
                sources.append(v)
            walk(v)
    elif isinstance(x, list):
        for i in x:
            walk(i)
walk(obs)
print("OBLIGATIONS_SOURCE_FILE_COUNT:", len(sources))
print("OBLIGATIONS_UNIQUE_SOURCE_FILE_VALUES:", sorted(set(sources)))
PY
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources_in_runtime_proof_obligations(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x:
                walk(i)
    walk(obs)
    return sources
mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break
print("TOTAL_REPORT_JSON_UNDER_RESULTS:", total)
print("FILES_WITH_MIXED_FORMAT_IN_runtime_proof_obligations (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY

Repository: mohanagy/madar

Length of output: 771

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources_in_runtime_proof_obligations(path: Path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x:
                walk(i)
    walk(obs)
    return sources
mixed=[]
total_with_sources=0
for p in report_files:
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    total_with_sources += 1
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((p, pref, other, len(sources), sorted(set(sources))[:6]))
print("TOTAL report.json files under docs/benchmarks/suite/results:", len(report_files))
print("TOTAL files with runtime_proof.obligations source_file occurrences:", total_with_sources)
print("FILES with MIXED-format source_file values:", len(mixed))
for item in mixed[:20]:
    path,pref,other,count,preview = item
    print(f"- {path} (pref={pref}, other={other}, total_evidence_source_files={count})")
    print(" sample unique source_file values:", preview)
PY

Repository: mohanagy/madar

Length of output: 1138

Make answer_contract.runtime_proof.obligations[].evidence[].source_file path formatting consistent
In docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json, source_file mixes relative paths (e.g. apps/web/pages/api/book/event.ts, packages/features/...) with a <project-root>/...-prefixed path (packages/app-store/_utils/payments/handlePaymentSuccess.ts). The same mixed formatting appears in docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json (2/93 report.json files). Ensure the CLI emits one uniform format (all relative or all <project-root>/), or document the meaning of the <project-root>/ prefix.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json` around lines 260 - 314, The report generator is emitting mixed source_file paths in answer_contract.runtime_proof.obligations[].evidence[].source_file; standardize this in the CLI emitter by choosing one format and applying it consistently (e.g., always use path.relative(projectRoot, filePath) for relative paths or always prefix with "<project-root>/"). Locate the reporter function that builds evidence.source_file (search for answer_contract or runtime_proof emission logic and the function that serializes obligations/evidence) and replace the current ad-hoc path handling with a single utility that normalizes paths (use a projectRoot constant and either path.relative + normalize or prepend "<project-root>/" unconditionally), and add a unit test asserting both example inputs (apps/... and packages/...) produce the same chosen format.

Source: Learnings

Fixed in 63040dd: this was caused by runtime-proof recovery materializing out-of-scope branch steps without the workspace rootPath, so their source_file values stayed absolute and were sanitized to <project-root>/..., while in-scope steps were workspace-relative. rootPath is now threaded through recovery, all six receipts were regenerated, and a verification pass confirmed that no report mixes the two formats anymore.
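A minimal sketch of the idea behind this fix, under stated assumptions: `toWorkspaceRelative` and `hasMixedFormats` are hypothetical helper names for illustration, not functions from src/runtime/retrieve.ts, and the root path `/work/cal` is made up. It shows how passing a root path lets an absolute evidence path be reduced to the same workspace-relative form as in-scope steps, and how a receipt could be checked for the mixed formats flagged in the review.

```typescript
// Hypothetical helper (not the project's actual API): reduce a source_file
// value to a workspace-relative, forward-slash path when it lies under rootPath.
function toWorkspaceRelative(rootPath: string, sourceFile: string): string {
  // Normalize separators and drop trailing slashes so prefixes compare cleanly.
  const normalize = (p: string): string => p.replace(/\\/g, "/").replace(/\/+$/, "");
  const root = normalize(rootPath);
  const file = normalize(sourceFile);
  // Paths under the root lose the root prefix; anything else is returned unchanged.
  return file.startsWith(root + "/") ? file.slice(root.length + 1) : file;
}

// Hypothetical check mirroring the review scripts: true when some, but not all,
// source_file values carry the "<project-root>/" prefix.
function hasMixedFormats(sourceFiles: string[]): boolean {
  const prefixed = sourceFiles.filter((s) => s.startsWith("<project-root>/")).length;
  return prefixed > 0 && prefixed < sourceFiles.length;
}

// Example: one in-scope (already relative) step and one recovered (absolute) step.
const normalized = [
  "apps/web/pages/api/book/event.ts",
  "/work/cal/packages/app-store/_utils/payments/handlePaymentSuccess.ts",
].map((file) => toWorkspaceRelative("/work/cal", file));
```

With the root path supplied, both entries end up workspace-relative, so a later sanitization step has no absolute path left to rewrite with a placeholder prefix.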

@mohanagy, thanks for the detailed root cause explanation — threading rootPath through the recovery path is the right fix. Glad the verification pass confirmed all six receipts are now consistently formatted.

(ﾉ◕ヮ◕)ﾉ*:･ﾟ✧

🧠 Learnings used

Learnt from: mohanagy
Repo: mohanagy/graphify-ts
PR: 238
File: docs/benchmarks/2026-05-21-govalidate-v0-23-0-validation/report-generation-pack-only.report.share-safe.json:12-17
Timestamp: 2026-05-21T19:32:18.424Z
Learning: Benchmark artifacts under docs/benchmarks/ are published faithful snapshots of the CLI’s emitted JSON output and should not be manually edited to “correct” formatting or numeric rounding. If you need to change how token/ratio values are serialized (e.g., computeReductionRatio() rounding to one decimal place such that ratios near 1 may become 1.0), make the precision/serialization change in src/infrastructure/compare.ts so regenerated artifacts reflect the updated CLI behavior.

- thread rootPath through runtime-proof recovery so recovered branch steps emit workspace-relative source files (no mixed path formats in receipts)
- use real primary-path boundaries in recovery phase-coverage scoring instead of an empty boundary list
- include focused bash follow-ups in prompt-contract follow-up input extraction, matching focused-call classification
- activate preserveFinalRuntimeEntrypointContextPreview by removing the self-excluding kept-key filter
- decide file-stem uniqueness on normalized ids and disambiguate deterministic collisions (foo-bar.ts vs foo_bar.ts)
- include the module stem in the Express analysis cache validity check
- fail fast when the benchmark suite is missing the built CLI
- regenerate all six public explain-runtime legacy receipts with the final binary; every report is full_win/ready with consistent workspace-relative evidence paths
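The file-stem item in the commit message above can be illustrated with a small sketch. Everything here is an assumption for illustration: `normalizeStem`, `assignStemIds`, the normalization rule (lower-case, non-alphanumerics collapsed to `_`), and the `__N` suffix scheme are hypothetical and may not match what the extractor actually does. The point is only that uniqueness has to be decided after normalization, because foo-bar.ts and foo_bar.ts differ as raw stems yet collide once normalized.

```typescript
// Hypothetical normalization: take the file stem, lower-case it, and collapse
// every run of non-alphanumeric characters into a single underscore.
function normalizeStem(filePath: string): string {
  const base = filePath.split("/").pop() ?? filePath;
  const stem = base.replace(/\.[^.]+$/, "");
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

// Assign one id per file. Uniqueness is decided on the normalized stem, and
// colliding files get a suffix derived from their sorted path order, so the
// result does not depend on discovery order.
function assignStemIds(filePaths: string[]): Map<string, string> {
  const groups = new Map<string, string[]>();
  for (const file of filePaths) {
    const id = normalizeStem(file);
    const group = groups.get(id) ?? [];
    group.push(file);
    groups.set(id, group);
  }
  const ids = new Map<string, string>();
  groups.forEach((files, id) => {
    if (files.length === 1) {
      ids.set(files[0], id);
      return;
    }
    [...files].sort().forEach((file, index) => ids.set(file, `${id}__${index + 1}`));
  });
  return ids;
}

const stemIds = assignStemIds(["src/foo_bar.ts", "src/foo-bar.ts", "src/baz.ts"]);
```

In this sketch the two colliding files receive distinct, stable ids regardless of the order in which they were listed, while the non-colliding file keeps its plain normalized stem.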

coderabbitai

🧹 Nitpick comments (1)

tests/unit/benchmark-suite.test.ts (1)

21-27: 💤 Low value

Consider adding a clarifying comment.

The stub CLI scaffolding serves a non-obvious purpose (satisfying existence checks without requiring a full build). A brief comment explaining the design intent would help future maintainers, e.g.:

// Create a minimal CLI stub so tests can run without `npm run build`.
// The stub satisfies existsSync checks in benchmarkWorkspaceCliPath()
// but is never executed since tests mock all execution paths.
const cliStubDir = mkdtempSync(join(tmpdir(), 'madar-bench-cli-stub-'))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/benchmark-suite.test.ts` around lines 21 - 27, Add a short
clarifying comment above the cli stub setup (the cliStubDir and cliStubPath
initialization and the beforeAll that writes the file and sets
process.env.MADAR_BENCH_CLI_PATH) explaining that this creates a minimal CLI
stub to satisfy existence checks (e.g., benchmarkWorkspaceCliPath()) so tests
can run without an npm build, and that the stub is never executed because tests
mock execution paths.
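For context, a hedged sketch of the kind of resolution logic such a stub would satisfy. `resolveBenchCliPath` is a hypothetical stand-in, not the real benchmarkWorkspaceCliPath(); the default path and the injected `exists` callback are assumptions chosen so the example stays self-contained. It combines two behaviours described in this PR: honoring the MADAR_BENCH_CLI_PATH override and failing fast when no CLI file is present.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical resolver: prefer the MADAR_BENCH_CLI_PATH override, fall back to
// a default build output, and fail fast if the chosen file does not exist.
function resolveBenchCliPath(
  env: Env,
  defaultCliPath: string,
  exists: (path: string) => boolean,
): string {
  const override = env.MADAR_BENCH_CLI_PATH;
  const candidate = override !== undefined && override.length > 0 ? override : defaultCliPath;
  if (!exists(candidate)) {
    throw new Error(
      `Benchmark CLI not found at ${candidate}; build it first or set MADAR_BENCH_CLI_PATH.`,
    );
  }
  return candidate;
}

// Example: only the stub exists, as in a test run before any build.
const stubPath = "/tmp/madar-bench-cli-stub/cli.js"; // made-up path
const onlyStubExists = (path: string): boolean => path === stubPath;
const resolved = resolveBenchCliPath(
  { MADAR_BENCH_CLI_PATH: stubPath },
  "dist/cli.js", // assumed default, for illustration only
  onlyStubExists,
);
```

In real code the `exists` argument would typically be `existsSync` from `node:fs` and `env` would be `process.env`; injecting them here keeps the sketch testable without touching the file system.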

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/benchmark-suite.test.ts`:
- Around line 21-27: Add a short clarifying comment above the cli stub setup
(the cliStubDir and cliStubPath initialization and the beforeAll that writes the
file and sets process.env.MADAR_BENCH_CLI_PATH) explaining that this creates a
minimal CLI stub to satisfy existence checks (e.g., benchmarkWorkspaceCliPath())
so tests can run without an npm build, and that the stub is never executed
because tests mock execution paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9ab87096-a895-4794-968a-6fd27c176cd5

📥 Commits

Reviewing files that changed from the base of the PR and between 63040dd and 6b091dd.

📒 Files selected for processing (2)

src/infrastructure/benchmark/suite.ts
tests/unit/benchmark-suite.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/infrastructure/benchmark/suite.ts

test: make graph-scope and duplicate-stem path assertions platform-aware

2f02a15

coderabbitai Bot reviewed Jun 10, 2026

mohanagy added 2 commits June 10, 2026 12:39

fix: honor MADAR_BENCH_CLI_PATH override so suite tests run before build

6b091dd

coderabbitai Bot reviewed Jun 10, 2026

mohanagy merged commit bf5e70c into next Jun 10, 2026
7 checks passed

mohanagy mentioned this pull request Jun 10, 2026

Merge next into main for 0.28.0 release #521

Merged
