
fix: complete proof-backed public explain-runtime full-win rows #520

Merged. mohanagy merged 4 commits into next from benchmark-public-full-win on Jun 10, 2026.

mohanagy (Owner) commented Jun 10, 2026

Summary

Completes the public benchmark-proof push on top of #519: fixes the two remaining runtime-proof blockers in retrieve.ts, fixes a source-discovery regression that gutted the dub graph, addresses all CodeRabbit review findings, and refreshes all six public TypeScript explain-runtime legacy receipts to proof-backed full_win bundles generated from the final binary.

Runtime fixes (each with a failing-first regression test)

  1. Out-of-scope runtime-proof recovery (src/runtime/retrieve.ts): the first missing-obligation recovery loop built branch steps only from nodeById (the initial in-scope slice), silently dropping recovery candidates whose evidence exists in the graph but outside the slice scope. It now materializes those nodes via executionSliceStepFromGraph with rootPath threaded through, matching the phase-recovery loop. Regression: tests/unit/retrieve-slice-v1.test.ts.

  2. Dangling stdio relationships after compaction (src/runtime/retrieve.ts): compactRetrievePayloadForStdioProfile sliced matched_nodes and relationships independently, so retained relationships could reference trimmed nodes. Relationships are now filtered against the retained matched-node ID set before the cap. Regression: tests/unit/retrieve.test.ts.

  3. Top-level lib/ discovery fidelity (src/shared/source-discovery.ts): the blanket top-level lib/** hard-ignore also dropped hand-written TypeScript source (e.g. dub's apps/web/lib/** link-resolution middleware), flipping the dub row's benchmark readiness to degraded. The hard-ignore now targets compiled artifacts only (.js/.cjs/.mjs/.d.ts). Regressions: tests/unit/source-discovery.test.ts.
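
The ordering behind fix 2 can be illustrated with a minimal sketch. The types and the `compactPayload` helper are hypothetical stand-ins, not the actual `compactRetrievePayloadForStdioProfile` code, and the `source`/`target` field names and both caps are assumptions. The point is only the sequence: trim nodes, filter relationships against the retained IDs, then apply the relationship cap.

```typescript
// Hypothetical shapes; the real payload types in src/runtime/retrieve.ts may differ.
interface MatchedNode {
  id: string;
}

interface Relationship {
  source: string; // assumed field name
  target: string; // assumed field name
}

interface RetrievePayload {
  matched_nodes: MatchedNode[];
  relationships: Relationship[];
}

// Trim nodes first, keep only relationships whose endpoints both survived,
// and only then apply the relationship cap.
function compactPayload(
  payload: RetrievePayload,
  maxNodes: number,
  maxRelationships: number,
): RetrievePayload {
  const matched_nodes = payload.matched_nodes.slice(0, maxNodes);
  const retainedIds = new Set(matched_nodes.map((node) => node.id));
  const relationships = payload.relationships
    .filter((rel) => retainedIds.has(rel.source) && retainedIds.has(rel.target))
    .slice(0, maxRelationships);
  return { matched_nodes, relationships };
}

const compacted = compactPayload(
  {
    matched_nodes: [{ id: "a" }, { id: "b" }, { id: "c" }],
    relationships: [
      { source: "a", target: "c" }, // "c" is trimmed, so this edge is dropped
      { source: "a", target: "b" },
    ],
  },
  2,
  1,
);
```

Slicing the relationships independently with a cap of 1 would have kept the `a -> c` edge even though `c` was trimmed; filtering first keeps `a -> b` instead.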

CodeRabbit review fixes

  • Thread rootPath into runtime-proof recovery so recovered branch steps emit workspace-relative source files (fixes the mixed path formats flagged in receipts).
  • Use real primary-path boundaries in recovery phase-coverage scoring instead of [].
  • Include focused bash follow-ups in prompt-contract follow-up input extraction, matching focused-call classification.
  • Activate preserveFinalRuntimeEntrypointContextPreview by removing its self-excluding kept-key filter.
  • Decide file-stem uniqueness on _makeId-normalized ids with deterministic collision suffixes (foo-bar.ts vs foo_bar.ts), with a regression test.
  • Include the module stem in the Express analysis cache validity check.
  • Fail fast when the benchmark suite is missing the built CLI.
  • Platform-aware path assertions for Windows CI.
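
The file-stem item above can be sketched as follows. `normalizeId` is an assumed stand-in for `_makeId` (its real normalization rules are not shown in this PR), and both `buildStemMap` and the numeric suffix format are hypothetical. The sketch only shows the idea of checking uniqueness on the normalized ID rather than on the raw basename.

```typescript
// Stand-in for the project's _makeId; assumed to lowercase and collapse
// non-alphanumeric runs into "_". The real rules may differ.
function normalizeId(stem: string): string {
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

function basenameStem(filePath: string): string {
  const base = filePath.split("/").pop() ?? filePath;
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(0, dot) : base;
}

// Maps each path to a stem whose normalized ID is unique. Paths are sorted
// first so the suffix a file receives does not depend on discovery order.
function buildStemMap(filePaths: string[]): Map<string, string> {
  const takenIds = new Set<string>();
  const stems = new Map<string, string>();
  for (const filePath of [...filePaths].sort()) {
    const base = basenameStem(filePath);
    let candidate = base;
    let suffix = 2;
    while (takenIds.has(normalizeId(candidate))) {
      candidate = `${base}-${suffix}`; // hypothetical suffix format
      suffix += 1;
    }
    takenIds.add(normalizeId(candidate));
    stems.set(filePath, candidate);
  }
  return stems;
}

const stemMap = buildStemMap(["src/foo_bar.ts", "src/foo-bar.ts"]);
const fooDashId = normalizeId(stemMap.get("src/foo-bar.ts") ?? "");
const fooUnderscoreId = normalizeId(stemMap.get("src/foo_bar.ts") ?? "");
```

With raw-basename uniqueness, `foo-bar` and `foo_bar` look distinct but normalize to the same ID; here the second file gets a suffixed stem, so the two normalized IDs differ.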

Carried review follow-ups

  • Same-turn retrieve persistence in compare
  • Prompt-contract targeting for same-turn follow-ups
  • Routing/tool/latency false-win scoring
  • SPI cache invalidation for nested tsconfig/jsconfig
  • Express entrypoint proof recognition
  • Nested Next.js src/app|pages|middleware SPI detection
  • Claims doc relative links
  • Duplicate-stem extraction tests

Receipts

All six public legacy rows were rerun sequentially with the final binary after all fixes. The success criteria below apply to the warm-cache/legacy/ trial reports; any /spi/ trial in the same bundle is a separate receipt and is not claimed to meet the same criteria. Every legacy report has benchmark_outcome = "full_win", benchmark_readiness = "ready", answer_quality.madar.passed = true, prompt_contract = "followed", answer_contract.runtime_proof.missing_obligations = [], and consistent workspace-relative evidence paths:

| Repo | Receipt |
| --- | --- |
| documenso | docs/benchmarks/suite/results/2026-06-10T07-01-18 |
| formbricks | docs/benchmarks/suite/results/2026-06-10T07-10-44 |
| dub | docs/benchmarks/suite/results/2026-06-10T07-24-01 |
| twenty | docs/benchmarks/suite/results/2026-06-10T07-39-11 |
| cal-diy | docs/benchmarks/suite/results/2026-06-10T08-08-44 |
| novu | docs/benchmarks/suite/results/2026-06-10T08-24-49 |

docs/benchmarks/suite/README.md, docs/claims-and-evidence.md, and tests/unit/benchmark-suite-docs.test.ts all point at these bundles.
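
The legacy-report criteria listed above can be expressed as a small check. This is a sketch over a hypothetical, flattened view of a report: the field names follow the description, but the real report.json may nest them (the review refers to `benchmark_outcome.outcome` and `benchmark_readiness.status`). `LegacyReportView` and `receiptViolations` are illustrative names, not part of the codebase.

```typescript
// Hypothetical flattened view of a warm-cache/legacy report.
interface LegacyReportView {
  benchmark_outcome: string;
  benchmark_readiness: string;
  answer_quality: { madar: { passed: boolean } };
  prompt_contract: string;
  answer_contract: { runtime_proof: { missing_obligations: string[] } };
}

// Returns the names of violated criteria; an empty list means the report qualifies.
function receiptViolations(report: LegacyReportView): string[] {
  const violations: string[] = [];
  if (report.benchmark_outcome !== "full_win") violations.push("benchmark_outcome");
  if (report.benchmark_readiness !== "ready") violations.push("benchmark_readiness");
  if (!report.answer_quality.madar.passed) violations.push("answer_quality.madar.passed");
  if (report.prompt_contract !== "followed") violations.push("prompt_contract");
  if (report.answer_contract.runtime_proof.missing_obligations.length > 0) {
    violations.push("answer_contract.runtime_proof.missing_obligations");
  }
  return violations;
}

const passingReport: LegacyReportView = {
  benchmark_outcome: "full_win",
  benchmark_readiness: "ready",
  answer_quality: { madar: { passed: true } },
  prompt_contract: "followed",
  answer_contract: { runtime_proof: { missing_obligations: [] } },
};

// Mirrors the SPI trial values quoted in the review (not_measured / not_ready /
// one missing obligation); the other fields are filled in for illustration.
const failingReport: LegacyReportView = {
  ...passingReport,
  benchmark_outcome: "not_measured",
  benchmark_readiness: "not_ready",
  answer_contract: { runtime_proof: { missing_obligations: ["send preparation"] } },
};

const passingViolations = receiptViolations(passingReport);
const failingViolations = receiptViolations(failingReport);
```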

Test plan

  • npm run typecheck
  • npm run build
  • CI=1 npm run test:run — 188 files, 2539 passed, 1 skipped
  • All regressions written failing-first, then fixed
  • Six isolated warm explain-runtime rows rerun sequentially with the final binary — 6/6 legacy full_win
  • Share-safe reports verified for all documented invariants, path-format consistency, and no local path leaks
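
One way a local-path-leak check could be written is sketched below. This is an assumed approach, not the verification the PR actually ran: it walks a parsed JSON value and reports the location of every string that looks like an absolute path. The heuristic flags anything starting with `/` or a drive letter, so URL-style strings such as route paths would need an allowlist. The sample data is invented for illustration.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Heuristic only: POSIX absolute paths or Windows drive-letter paths.
function looksAbsolute(value: string): boolean {
  return value.startsWith("/") || /^[A-Za-z]:[\\/]/.test(value);
}

// Collects JSON-path-like locations of every string value that looks absolute.
function findAbsolutePaths(value: Json, trail: string = "$"): string[] {
  const hits: string[] = [];
  if (typeof value === "string") {
    if (looksAbsolute(value)) hits.push(trail);
  } else if (Array.isArray(value)) {
    value.forEach((item, index) => {
      hits.push(...findAbsolutePaths(item, `${trail}[${index}]`));
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      hits.push(...findAbsolutePaths(child, `${trail}.${key}`));
    }
  }
  return hits;
}

// Invented sample: one workspace-relative path and one leaked local path.
const leaks = findAbsolutePaths({
  evidence: [
    { source_file: "apps/web/lib/middleware/link.ts" },
    { source_file: "/Users/dev/checkout/packages/x.ts" },
  ],
});
```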

Summary by CodeRabbit

  • New Features

    • Stronger runtime-proof evidence matching and stricter retrieval behavior for more reliable proof-based answers
    • Expanded SPI/workspace support across additional repos and languages; more consistent workspace provisioning for benchmarks
  • Documentation

    • Updated benchmark suite with per-repo dated results, new share-safe reports, and clearer claims-and-evidence guidance
  • Bug Fixes

    • More stable multi-file ID generation and improved execution-slice planning/confidence for accurate runtime analysis

- recover out-of-scope runtime-proof obligation evidence: the first
  missing-obligation recovery loop now materializes graph nodes outside
  the initial slice scope, matching the phase-recovery loop
- drop dangling stdio relationships: compaction now filters
  relationships against the retained matched-node ID set before the cap
- keep hand-written top-level lib/ TypeScript source discoverable while
  still hard-ignoring compiled lib output (js/cjs/mjs/d.ts), restoring
  dub apps/web/lib middleware evidence
- carry same-turn retrieve persistence, prompt-contract targeting,
  routing/tool/latency scoring, SPI cache invalidation, Express and
  nested Next.js SPI detection fixes from review follow-ups
- refresh all six public TypeScript explain-runtime legacy receipts
  (documenso, formbricks, dub, twenty, cal-diy, novu) with proof-backed
  full_win bundles generated sequentially from the final binary, and
  point suite README, claims-and-evidence, and docs tests at them
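
The narrowed lib/ hard-ignore described above could look roughly like this. The regex is an assumption (the actual globs/regexes in src/shared/source-discovery.ts are not shown here), and it assumes POSIX-style paths relative to the root where the ignore is evaluated.

```typescript
// Assumed pattern: ignore compiled output under a top-level lib/ directory only.
const COMPILED_LIB_OUTPUT = /^lib\/.+\.(?:js|cjs|mjs|d\.ts)$/;

function isHardIgnoredLibFile(relativePath: string): boolean {
  return COMPILED_LIB_OUTPUT.test(relativePath);
}

const samples = [
  "lib/index.js", // compiled output: ignored
  "lib/types/index.d.ts", // declaration output: ignored
  "lib/middleware/link.ts", // hand-written TypeScript: kept
  "src/lib/util.js", // not under a top-level lib/: kept by this rule
];
const ignored = samples.filter(isHardIgnoredLibFile);
```

Under the earlier blanket lib/** rule, the third sample would also have been dropped, which is the regression the dub row hit.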

coderabbitai Bot commented Jun 10, 2026

📝 Walkthrough

Benchmark docs/results were refreshed and many runtime systems were extended: strict runtime-proof retrieval/slicing, trace-derived compare-pack merging, stdio compaction, file-stem extraction identity, JS/TS resolution and SPI projector changes, workspace Madar shim + .mcp.json PATH pinning, and updated tests.

Changes

Benchmark artifacts and runtime-proof infrastructure

Layer: Docs: receipts, claims, and suite README
File(s): docs/benchmarks/suite/README.md, docs/claims-and-evidence.md, docs/benchmarks/suite/repos.json, docs/benchmarks/suite/runtime-proof.json, docs/benchmarks/suite/results/*
Summary: Updated published per-repo summary receipts, result bundles, runtime-proof obligation evidence_terms, and public-claim mapping; many new dated result artifacts and summary files were added.

Layer: Compare, trace merging, and prompt shaping
File(s): src/infrastructure/compare.ts, src/infrastructure/context-pack-command.ts
Summary: Collects tool_result payloads from Madar traces, reconstructs/merges follow-up compare-packs into an effective pack, extends native-agent report to include answer_contract/execution_slice, and threads retrieval-ready signals into strict prompt guidance and context-pack payloads.

Layer: Runtime retrieval, slicing, and runtime-proof logic
File(s): src/runtime/retrieve.ts, src/runtime/retrieve/slicing.ts, src/runtime/runtime-proof.ts, src/runtime/stdio/tools.ts
Summary: Adds strict runtime-proof scoring/anchor/neighborhood selection, direct-evidence gating, stdio compaction profiles, runtime-proof term matching/counting API, and stdio tool strict override wiring with strict retrieve payload projection.

Layer: Extraction identity and cross-file linking
File(s): src/pipeline/extract/core.ts, src/pipeline/extract.ts, src/pipeline/extract/cross-file.ts, src/pipeline/extract/*
Summary: Introduces file-stem context helpers (withExtractionFileStemContext, fileStemForPath, fileNodeIdForPath), upgrades cache payloads to include __madarFileStem, updates many extractors and cross-file resolvers to use stem-aware IDs, and wraps extract workflow in the stem context.

Layer: SPI, TS project config resolution, and projector changes
File(s): src/pipeline/spi/build.ts, src/pipeline/spi/cache.ts, src/pipeline/spi/projector.ts, src/pipeline/spi/framework-nextjs.ts
Summary: Nearest-project tsconfig discovery, per-file compilerOptions resolver, SPI projector file-stem disambiguation, synthetic external call symbol handling, and Next.js convention/path normalization.

Layer: Workspace provisioning and test updates
File(s): src/infrastructure/benchmark/suite.ts, tests/unit/benchmark-suite.test.ts, tests/unit/benchmark-suite-docs.test.ts
Summary: Writes a local Madar shim into workspace .claude/bin, prepends it into .mcp.json server.env PATH/Path, updates tests to stub Madar CLI and assert .mcp.json contains the pinned PATH, and refreshes test expectations for updated manifests and receipts.

Layer: Source discovery ignores
File(s): src/shared/source-discovery.ts
Summary: Refines hard-ignore globs/regexes to only ignore lib/* JS/module/type files rather than the entire lib/** subtree.
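
The PATH pinning mentioned under workspace provisioning can be sketched as below. `prependToPath` is a hypothetical helper, and the summary does not say whether the real code sets both `PATH` and `Path` or only the key that exists. This version updates whichever key is present (matched case-insensitively) and takes the delimiter as a parameter rather than guessing the platform.

```typescript
// Prepends shimDir to the PATH-like variable in env. Windows environments often
// spell the key "Path", so an existing key is matched case-insensitively;
// "PATH" is used when no such key exists.
function prependToPath(
  env: Record<string, string>,
  shimDir: string,
  delimiter: string,
): Record<string, string> {
  const key = Object.keys(env).find((name) => name.toLowerCase() === "path") ?? "PATH";
  const current = env[key];
  const value = current ? `${shimDir}${delimiter}${current}` : shimDir;
  return { ...env, [key]: value };
}

// Illustrative directories only.
const posixEnv = prependToPath({ PATH: "/usr/bin" }, "/work/.claude/bin", ":");
const windowsEnv = prependToPath({ Path: "C:\\Windows" }, "C:\\work\\.claude\\bin", ";");
const emptyEnv = prependToPath({}, "/work/.claude/bin", ":");
```

Putting the shim directory first means a server launched with this env resolves the workspace shim before any other copy on the path.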

Sequence Diagram(s)

sequenceDiagram
  participant TestHarness
  participant BenchmarkSuite
  participant MadarTrace
  participant CompareService
  participant RetrieveService
  participant ContextPack

  TestHarness->>BenchmarkSuite: generate run (writes artifacts, prompts, results)
  BenchmarkSuite->>MadarTrace: run madar arm (captures verbose trace + tool_result)
  MadarTrace->>CompareService: provide raw trace + tool_result payloads
  CompareService->>RetrieveService: reconstruct follow-up retrievals & merge into effective pack
  RetrieveService->>ContextPack: provide matched_nodes + execution_slice for explain payload
  ContextPack->>CompareService: preserve runtime-entry-context expandables

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Poem

🐰 I hop through receipts and proofs with care,
Stems aligned and traces laid bare.
A shim, a pack, a merged-up plot,
New results written — neat and hot.
I nibble bugs and then I hop — review complete, I’ll never stop.


coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (3)
tests/unit/benchmark-suite-docs.test.ts (1)

242-290: ⚡ Quick win

Remove the duplicated test block.

This it('publishes a scoped Twenty receipt when the root suite graph is too large', ...) block duplicates the earlier one verbatim, so it adds runtime and maintenance cost without new coverage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/benchmark-suite-docs.test.ts` around lines 242 - 290, The test
block duplicated is the it(...) titled "publishes a scoped Twenty receipt when
the root suite graph is too large" — remove the entire duplicate it(...) block
(the second occurrence) so only one such test remains; search for the
it("publishes a scoped Twenty receipt when the root suite graph is too large",
...) and delete the redundant copy, then run the unit tests to confirm no
regressions.
src/runtime/retrieve/slicing.ts (1)

429-469: 💤 Low value

Consider extracting the max anchor calculation to avoid duplication.

The maxAnchors calculation logic appears twice (lines 794-796 and 860-862) with identical formulas. Consider extracting this into a helper function or computing it once at the start of buildAnchors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/retrieve/slicing.ts` around lines 429 - 469, The maxAnchors
calculation is duplicated in buildAnchors; extract the logic into a single
helper (e.g., computeMaxAnchors or getMaxAnchors) and call it from buildAnchors
instead of repeating the formula, updating both sites where the formula appears
(the duplicated blocks around lines referencing maxAnchors) to use that helper;
ensure the helper accepts the same inputs used in the original formula and
return the computed number so buildAnchors remains functionally identical.
src/runtime/stdio/tools.ts (1)

168-205: 💤 Low value

Consider simplifying the conditional type extractions.

The StrictRuntimeProofRetrievePayload interface uses nested conditional type extractions (extends infer T ? T extends { ... } ? ...) which are verbose. Since RetrieveResult is a known interface, you could directly reference the nested types or define explicit type aliases for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/stdio/tools.ts` around lines 168 - 205, The conditional type
extractions inside StrictRuntimeProofRetrievePayload (for answer_contract and
execution_slice) are overly verbose; replace them by introducing simple type
aliases from RetrieveResult (e.g., type AnswerContract =
RetrieveResult['answer_contract'] and type ExecutionSlice =
RetrieveResult['execution_slice']) and then reference
AnswerContract['confidence'], AnswerContract['runtime_proof'],
ExecutionSlice['status'], ExecutionSlice['confidence'], etc., or directly use
RetrieveResult['...'] properties without the nested infer pattern to make the
interface clearer and easier to maintain.
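
The suggested simplification might look like the sketch below. The `RetrieveResult` here is a hypothetical, trimmed-down stand-in (the real interface and whether these fields are optional are not shown in this review), and `projectStrictPayload` is an illustrative helper. `NonNullable` is used only because the sketch assumes the fields are optional.

```typescript
// Hypothetical, reduced RetrieveResult for illustration.
interface RetrieveResult {
  answer_contract?: {
    confidence: "low" | "medium" | "high";
    runtime_proof: { missing_obligations: string[] };
  };
  execution_slice?: {
    status: string;
    confidence: number;
  };
}

// Indexed access types replace the nested `extends infer T ? ...` pattern.
type AnswerContract = NonNullable<RetrieveResult["answer_contract"]>;
type ExecutionSlice = NonNullable<RetrieveResult["execution_slice"]>;

interface StrictRuntimeProofRetrievePayload {
  answer_contract?: {
    confidence: AnswerContract["confidence"];
    runtime_proof: AnswerContract["runtime_proof"];
  };
  execution_slice?: {
    status: ExecutionSlice["status"];
    confidence: ExecutionSlice["confidence"];
  };
}

// Copies only the strict-profile fields out of a full result.
function projectStrictPayload(result: RetrieveResult): StrictRuntimeProofRetrievePayload {
  const payload: StrictRuntimeProofRetrievePayload = {};
  if (result.answer_contract) {
    payload.answer_contract = {
      confidence: result.answer_contract.confidence,
      runtime_proof: result.answer_contract.runtime_proof,
    };
  }
  if (result.execution_slice) {
    payload.execution_slice = {
      status: result.execution_slice.status,
      confidence: result.execution_slice.confidence,
    };
  }
  return payload;
}

const strictPayload = projectStrictPayload({
  answer_contract: { confidence: "high", runtime_proof: { missing_obligations: [] } },
});
```

Because the aliases derive from `RetrieveResult`, a change to a field type there still propagates to the payload interface, as with the conditional-type version.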
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json`:
- Around line 170-176: The PR incorrectly implies that all trials (including
warm-cache/spi/trial-001) meet the "ready/full_win/no missing obligations"
criteria; however the SPI trial shows benchmark_outcome.outcome: "not_measured",
benchmark_readiness.status: "not_ready" and a missing runtime proof obligation
("send preparation"), whereas warm-cache/legacy shows full_win/ready/no missing
obligations (answer_contract.runtime_proof.missing_obligations: []). Update the
PR text to either explicitly exclude warm-cache/spi trials from the
"ready/full_win/no missing obligations" claim or add a clarifying sentence that
only warm-cache/legacy trials (not warm-cache/spi) meet those criteria,
referencing the fields benchmark_outcome.outcome and benchmark_readiness.status
to justify the distinction.

In
`@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json`:
- Around line 260-314: The report generator is emitting mixed source_file paths
in answer_contract.runtime_proof.obligations[].evidence[].source_file;
standardize this in the CLI emitter by choosing one format and applying it
consistently (e.g., always use path.relative(projectRoot, filePath) for relative
paths or always prefix with "<project-root>/"). Locate the reporter function
that builds evidence.source_file (search for answer_contract or runtime_proof
emission logic and the function that serializes obligations/evidence) and
replace the current ad-hoc path handling with a single utility that normalizes
paths (use a projectRoot constant and either path.relative + normalize or
prepend "<project-root>/" unconditionally), and add a unit test asserting both
example inputs (apps/... and packages/...) produce the same chosen format.

In `@src/infrastructure/benchmark/suite.ts`:
- Around line 544-546: The benchmarkWorkspaceCliPath() helper currently returns
a path to the built CLI without ensuring the file exists; update it to validate
the target file (e.g., use fs.existsSync or fs.promises.access) and either throw
a clear Error or return undefined when the file is missing so callers that
generate shims (the code that writes shim wrappers for the CLI) can fail fast;
specifically modify benchmarkWorkspaceCliPath() to compute the path, check
existence of join(findPackageRoot(), 'dist', 'src', 'cli', 'bin.js'), and on
missing file throw an Error with a descriptive message ('built CLI not found at
<path>') or adjust callers to handle the missing case and avoid writing broken
shims.

In `@src/infrastructure/compare.ts`:
- Around line 1646-1658: The focused follow-up extraction only accepts
Madar/Read tools and thus misses focused bash follow-ups; update the filter in
traceFocusedFollowUpInputs (the block using
isMadarTraceToolName/isFocusedFollowUpTraceToolName and pushing into
followUpInputs from toolInputs[index]) to also consider focused bash calls
(e.g., include isBashTraceToolName(toolName) or extend
isFocusedFollowUpTraceToolName to return true for bash). Ensure the same
normalization via normalizeAnswerQualityText and pushing into followUpInputs
remains unchanged so focused bash inputs are captured for runtime-proof
targeting.

In `@src/infrastructure/context-pack-command.ts`:
- Around line 823-845: The preview generation is a no-op because keptKeys is
built from matchedNodes and then used to reject every preview candidate; fix by
changing the rejection logic so preview candidates are not self-excluded—either
(A) build keptKeys from a different source (e.g., the set of keys that must
truly be preserved, not from matchedNodes), or (B) remove the keptKeys.has(key)
check from the filter inside the preview mapping so only existingPreviewKeys
blocks duplication; update the code around matchedNodes, keptKeys,
existingPreviewKeys and the preview mapping (the preview variable and
runtimePrimaryPathRecordKey/asJsonRecord usages) accordingly.

In `@src/pipeline/extract/core.ts`:
- Around line 25-63: buildFileStemMap currently ensures uniqueness only on raw
basenames but fileNodeIdForPath later calls _makeId(stem), causing collisions
(e.g., foo-bar vs foo_bar). Update buildFileStemMap to consider the normalized
id from _makeId when deciding uniqueness: for any basename that collides
(stemCounts > 1), compute relativeStem as you do now, then iteratively widen the
candidate stem by prepending parent segments (or otherwise adjusting) and call
_makeId(candidate) until the returned id is unique among all already-chosen
_makeId results; store the original filePath -> chosen stem (not the id) so
fileNodeIdForPath/_makeId produces consistent, collision-free ids. Ensure you
reference buildFileStemMap, fileNodeIdForPath, and _makeId in the change.

In `@src/pipeline/extract/frameworks/express.ts`:
- Around line 92-93: The cached Express analysis (exportedBindings and
routeRecords) keyed only by filePath+sourceText can become incorrect when
moduleStem(filePath) (which calls fileStemForPath) yields a different stem;
update analyzeExpressModule's caching to include the current stem (derived via
moduleStem(filePath)) in the cache key or store the stem with the cached value
and invalidate/refresh the cache when the stored stem differs from the current
moduleStem. Locate analyzeExpressModule and its cache lookup/insert logic and
ensure the cache key/value accounts for moduleStem(filePath) (or performs a stem
equality check before reusing exportedBindings/routeRecords).

In `@src/runtime/retrieve.ts`:
- Around line 3597-3611: The recovery logic in
recoverMissingRuntimeProofBranches recreates out-of-scope steps without passing
the rootPath used by buildExecutionSlice, causing raw source_file paths to leak;
fix it by threading the rootPath argument through the recovery helper(s) that
call executionSliceStepFromGraph so those calls use
executionSliceStepFromGraph(..., rootPath) instead of the no-rootPath overload,
updating all recovery passes (the second recovery pass and the similar blocks
referenced around 3710-3736 and 3903-3919) to accept and forward rootPath from
buildExecutionSlice into the branch materialization flow (ensure function
signatures and invocations for the helper(s) and
recoverMissingRuntimeProofBranches reflect the added rootPath parameter).
- Around line 3686-3693: The call to phaseCoverageForPath is passing an empty
array for primary path boundaries, causing queue to be treated as missing;
update the helper to pass the real primaryBoundaries variable instead of []
wherever phaseCoverageForPath is invoked (e.g., the assignment to
currentPhaseCoverage and the other similar calls shown) so the recovery-phase
scoring uses the actual primary-path boundaries; locate uses of
phaseCoverageForPath (e.g., currentPhaseCoverage) and replace the empty array
argument with primaryBoundaries, and apply the same change to the other
occurrences listed in the comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 872b4dbe-4d11-4a52-afec-63688e1069c5

📥 Commits

Reviewing files that changed from the base of the PR and between 92bd488 and 947812a.

📒 Files selected for processing (136)
  • docs/benchmarks/suite/README.md
  • docs/benchmarks/suite/repos.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.json
  • docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.md
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.json
  • docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.md
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.json
  • docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.md
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.json
  • docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.md
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.json
  • docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.md
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
  • docs/benchmarks/suite/runtime-proof.json
  • docs/claims-and-evidence.md
  • src/infrastructure/benchmark/suite.ts
  • src/infrastructure/compare.ts
  • src/infrastructure/context-pack-command.ts
  • src/pipeline/extract.ts
  • src/pipeline/extract/core.ts
  • src/pipeline/extract/cross-file.ts
  • src/pipeline/extract/frameworks/express.ts
  • src/pipeline/extract/frameworks/nest.ts
  • src/pipeline/extract/generic.ts
  • src/pipeline/extract/non-code.ts
  • src/pipeline/extract/python-rationale.ts
  • src/pipeline/spi/build.ts
  • src/pipeline/spi/cache.ts
  • src/pipeline/spi/framework-nextjs.ts
  • src/pipeline/spi/projector.ts
  • src/runtime/retrieve.ts
  • src/runtime/retrieve/slicing.ts
  • src/runtime/runtime-proof.ts
  • src/runtime/stdio/tools.ts
  • src/shared/source-discovery.ts
  • tests/fixtures/go-semantic-workspace/cmd/chi/main.go
  • tests/unit/benchmark-suite-docs.test.ts
  • tests/unit/benchmark-suite.test.ts
  • tests/unit/compare-native-agent.test.ts
  • tests/unit/compare.test.ts
  • tests/unit/extract-duplicate-file-stems.test.ts
  • tests/unit/extract-go.test.ts
  • tests/unit/extract-tsconfig-alias-runtime-proof.test.ts
  • tests/unit/retrieve-framework-boost-v0-19.test.ts
  • tests/unit/retrieve-slice-v1.test.ts
  • tests/unit/retrieve.test.ts
  • tests/unit/runtime-proof.test.ts
  • tests/unit/source-discovery.test.ts
  • tests/unit/spi-cache.test.ts
  • tests/unit/spi-calls.test.ts
  • tests/unit/spi-framework-nextjs.test.ts
  • tests/unit/spi-projector.test.ts
  • tests/unit/spi-scoped-root-tsconfig.test.ts
  • tests/unit/stdio-server.test.ts
  • tests/unit/stdio-slice-surface.test.ts
👮 Files not reviewed due to content moderation or server errors (8)
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
  • docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
  • src/pipeline/extract/cross-file.ts
  • src/pipeline/extract/frameworks/nest.ts
  • src/pipeline/spi/build.ts
  • src/pipeline/spi/cache.ts
  • src/pipeline/spi/framework-nextjs.ts

Comment on lines +260 to +314
"obligations": [
  {
    "id": "booking_request",
    "label": "booking request",
    "kind": "entrypoint",
    "required": true,
    "evidence": [
      {
        "label": "API /api/book/event",
        "source_file": "apps/web/pages/api/book/event.ts",
        "line_number": 1
      }
    ]
  },
  {
    "id": "availability_validation",
    "label": "availability validation",
    "kind": "handoff",
    "required": true,
    "evidence": [
      {
        "label": "checkForConflicts()",
        "source_file": "packages/features/bookings/lib/handleNewBooking/ensureAvailableUsers.ts",
        "line_number": 243
      }
    ]
  },
  {
    "id": "scheduled_event_persistence",
    "label": "scheduled event persistence",
    "kind": "terminal",
    "required": true,
    "evidence": [
      {
        "label": "saveBooking()",
        "source_file": "packages/features/bookings/lib/handleNewBooking/createBooking.ts",
        "line_number": 96
      }
    ]
  },
  {
    "id": "notification_delivery",
    "label": "notification delivery",
    "kind": "terminal",
    "required": true,
    "evidence": [
      {
        "label": "sendScheduledEmailsAndSMS()",
        "source_file": "<project-root>/packages/app-store/_utils/payments/handlePaymentSuccess.ts",
        "line_number": 211
      }
    ]
  }
],
"missing_obligations": []

@coderabbitai coderabbitai Bot Jun 10, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if other report.json files in this PR have the same inconsistency

rg -n --type=json '<project-root>/' -g 'report.json' -C2

Repository: mohanagy/madar

Length of output: 50370


🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"

python3 - <<'PY'
import json, os, re
from pathlib import Path

target = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(target.read_text())
# Extract all source_file values from runtime_proof.obligations[*].evidence[*].source_file
sources = []
rp = data.get("runtime_proof") or {}
obs = rp.get("obligations", [])
for o in obs:
    for ev in o.get("evidence", []):
        sf = ev.get("source_file")
        if sf is not None:
            sources.append(sf)

print("TARGET_FILE:", str(target))
print("source_file count:", len(sources))
unique = sorted(set(sources))
print("unique formats:")
for s in unique:
    print(" -", s)

def bucket(s):
    if s.startswith("<project-root>/"):
        return "project_root_prefixed"
    return "relative_or_other"
buckets = { "project_root_prefixed":0, "relative_or_other":0 }
for s in sources:
    buckets[bucket(s)] += 1
print("bucket counts:", buckets)
PY

python3 - <<'PY'
import json
from pathlib import Path

base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    rp = data.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    for o in obs:
        for ev in o.get("evidence") or []:
            sf = ev.get("source_file")
            if sf:
                sources.append(sf)
    return sources

mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        # summarize counts
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break

print("TOTAL report.json under docs/benchmarks/suite/results:", total)
print("Found mixed-format files (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY

Repository: mohanagy/madar

Length of output: 391


🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"

python3 - <<'PY'
import json
from pathlib import Path

p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())

# 1) See top-level keys
print("TOP_LEVEL_KEYS:", sorted(data.keys()))

# 2) Recursively count all source_file occurrences
sources=[]
paths=[]

def walk(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="source_file" and isinstance(v,str):
                sources.append(v)
                paths.append(np)
            walk(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            walk(v, f"{path}[{i}]")

walk(data)

print("TOTAL_SOURCE_FILE_OCCURRENCES:", len(sources))
print("UNIQUE_SOURCE_FILE_VALUES (up to 50):")
for s in sorted(set(sources))[:50]:
    print(" -", s)

pref = sum(1 for s in sources if s.startswith("<project-root>/"))
other = len(sources)-pref
print("BUCKET_COUNTS:", {"project_root_prefixed":pref, "relative_or_other":other})

# 3) Locate any 'obligations' keys and print nearby structure clues
obligations_paths=[]

def find_obligations(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="obligations":
                obligations_paths.append(np)
            find_obligations(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            find_obligations(v, f"{path}[{i}]")

find_obligations(data)
print("OBLIGATIONS_PATHS:", obligations_paths[:20])
PY

Repository: mohanagy/madar

Length of output: 1187


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path

p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())
rp = data["answer_contract"]["runtime_proof"]
obs = rp.get("obligations", [])
print("OBLIGATIONS_COUNT:", len(obs))
if obs:
    o0 = obs[0]
    print("OBLIGATION_0_KEYS:", sorted(o0.keys()))
    # print one evidence-like field if present
    for k,v in o0.items():
        if isinstance(v, list):
            print(f" - list field: {k} len={len(v)} sample_type={type(v[0]).__name__ if v else 'n/a'}")
        if k=="evidence":
            print(" - evidence sample:", v[0] if v else None)
    # gather all source_file keys within obligations objects
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x: walk(i)
    walk(obs)
    print("OBLIGATIONS_SOURCE_FILE_COUNT:", len(sources))
    print("OBLIGATIONS_UNIQUE_SOURCE_FILE_VALUES:", sorted(set(sources)))
PY

python3 - <<'PY'
import json
from pathlib import Path

base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources_in_runtime_proof_obligations(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x: walk(i)
    walk(obs)
    return sources

mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break

print("TOTAL_REPORT_JSON_UNDER_RESULTS:", total)
print("FILES_WITH_MIXED_FORMAT_IN_runtime_proof_obligations (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY

Repository: mohanagy/madar

Length of output: 771


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path

base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))

def extract_sources_in_runtime_proof_obligations(path: Path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x: walk(i)
    walk(obs)
    return sources

mixed=[]
total_with_sources=0
for p in report_files:
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    total_with_sources += 1
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((p, pref, other, len(sources), sorted(set(sources))[:6]))

print("TOTAL report.json files under docs/benchmarks/suite/results:", len(report_files))
print("TOTAL files with runtime_proof.obligations source_file occurrences:", total_with_sources)
print("FILES with MIXED-format source_file values:", len(mixed))
for item in mixed[:20]:
    path,pref,other,count,preview = item
    print(f"- {path} (pref={pref}, other={other}, total_evidence_source_files={count})")
    print("  sample unique source_file values:", preview)
PY

Repository: mohanagy/madar

Length of output: 1138


Make answer_contract.runtime_proof.obligations[].evidence[].source_file path formatting consistent
In docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json, source_file mixes relative paths (e.g. apps/web/pages/api/book/event.ts, packages/features/...) with a <project-root>/...-prefixed path (packages/app-store/_utils/payments/handlePaymentSuccess.ts). The same mixed formatting appears in docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json, so 2 of the 93 report.json files are affected. Ensure the CLI emits one uniform format (all relative or all <project-root>/-prefixed), or document the meaning of the <project-root>/ prefix.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json`
around lines 260 - 314, The report generator is emitting mixed source_file paths
in answer_contract.runtime_proof.obligations[].evidence[].source_file;
standardize this in the CLI emitter by choosing one format and applying it
consistently (e.g., always use path.relative(projectRoot, filePath) for relative
paths or always prefix with "<project-root>/"). Locate the reporter function
that builds evidence.source_file (search for answer_contract or runtime_proof
emission logic and the function that serializes obligations/evidence) and
replace the current ad-hoc path handling with a single utility that normalizes
paths (use a projectRoot constant and either path.relative + normalize or
prepend "<project-root>/" unconditionally), and add a unit test asserting both
example inputs (apps/... and packages/...) produce the same chosen format.

Source: Learnings
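
The single normalizing utility suggested in the prompt above could look roughly like the sketch below. It is only an illustration: `normalizeSourceFile`, `toPosix`, and `PROJECT_ROOT_PLACEHOLDER` are hypothetical names, not functions from this repository, and the sketch uses plain string handling on POSIX-style paths rather than the project's real path helpers.

```typescript
// Hypothetical helper, not the project's actual emitter code.
const PROJECT_ROOT_PLACEHOLDER = "<project-root>/";

/** Convert Windows separators and collapse duplicate slashes. */
function toPosix(p: string): string {
  return p.replace(/\\/g, "/").replace(/\/{2,}/g, "/");
}

/**
 * Return a workspace-relative path for `filePath`.
 * Accepts absolute paths under `projectRoot`, already-relative paths,
 * and paths that carry the "<project-root>/" placeholder.
 */
function normalizeSourceFile(projectRoot: string, filePath: string): string {
  const root = toPosix(projectRoot).replace(/\/+$/, "");
  let file = toPosix(filePath);
  if (file.startsWith(PROJECT_ROOT_PLACEHOLDER)) {
    // Strip the sanitized placeholder so the result matches in-scope paths.
    file = file.slice(PROJECT_ROOT_PLACEHOLDER.length);
  } else if (file.startsWith(root + "/")) {
    // Strip the absolute workspace root.
    file = file.slice(root.length + 1);
  }
  return file.replace(/^\.\//, "");
}

// The workspace root "/work/cal" is an invented example value.
const fromAbsolute = normalizeSourceFile(
  "/work/cal",
  "/work/cal/apps/web/pages/api/book/event.ts",
);
const fromPlaceholder = normalizeSourceFile(
  "/work/cal",
  "<project-root>/packages/app-store/_utils/payments/handlePaymentSuccess.ts",
);
console.log(fromAbsolute); // apps/web/pages/api/book/event.ts
console.log(fromPlaceholder); // packages/app-store/_utils/payments/handlePaymentSuccess.ts
```

Routing every evidence path through one function like this is what makes a unit test over both example inputs meaningful: each input form has to collapse to the same workspace-relative shape.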

Owner Author


Fixed in 63040dd: this was caused by runtime-proof recovery materializing out-of-scope branch steps without the workspace rootPath, so their source_file values stayed absolute and were sanitized to <project-root>/... while in-scope steps were workspace-relative. rootPath is now threaded through recovery, all six receipts were regenerated, and a verification pass confirmed no report mixes the two formats anymore.
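
For reference, a verification pass of this kind can be sketched in TypeScript along the lines below. This is an assumption-laden illustration, not the script that was actually run: `collectSourceFiles` and `hasMixedSourceFileFormats` are hypothetical names, and the input is assumed to be the already-parsed `answer_contract.runtime_proof.obligations` array of one report.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Collect every string stored under a "source_file" key, at any depth. */
function collectSourceFiles(node: Json, out: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectSourceFiles(item, out);
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      if (key === "source_file" && typeof value === "string") out.push(value);
      else collectSourceFiles(value, out);
    }
  }
  return out;
}

/** True when some, but not all, source_file values carry the placeholder prefix. */
function hasMixedSourceFileFormats(obligations: Json): boolean {
  const sources = collectSourceFiles(obligations);
  const prefixed = sources.filter((s) => s.startsWith("<project-root>/")).length;
  return prefixed > 0 && prefixed < sources.length;
}

// Shaped like the cal-diy excerpt quoted in the review comment.
const beforeFix: Json = [
  { id: "booking_request", evidence: [{ source_file: "apps/web/pages/api/book/event.ts" }] },
  {
    id: "notification_delivery",
    evidence: [
      { source_file: "<project-root>/packages/app-store/_utils/payments/handlePaymentSuccess.ts" },
    ],
  },
];
console.log(hasMixedSourceFileFormats(beforeFix)); // true
```

A report whose evidence paths are all relative, or all prefixed, returns `false`, which is the condition the regenerated receipts are expected to meet.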


@mohanagy, thanks for the detailed root cause explanation — threading rootPath through the recovery path is the right fix. Glad the verification pass confirmed all six receipts are now consistently formatted.

(ノ◕ヮ◕)ノ*:・゚✧


🧠 Learnings used
Learnt from: mohanagy
Repo: mohanagy/graphify-ts PR: 238
File: docs/benchmarks/2026-05-21-govalidate-v0-23-0-validation/report-generation-pack-only.report.share-safe.json:12-17
Timestamp: 2026-05-21T19:32:18.424Z
Learning: Benchmark artifacts under docs/benchmarks/ are published faithful snapshots of the CLI’s emitted JSON output and should not be manually edited to “correct” formatting or numeric rounding. If you need to change how token/ratio values are serialized (e.g., computeReductionRatio() rounding to one decimal place such that ratios near 1 may become 1.0), make the precision/serialization change in src/infrastructure/compare.ts so regenerated artifacts reflect the updated CLI behavior.

Comment thread src/infrastructure/benchmark/suite.ts
Comment thread src/infrastructure/compare.ts Outdated
Comment thread src/infrastructure/context-pack-command.ts Outdated
Comment thread src/pipeline/extract/core.ts
Comment thread src/pipeline/extract/frameworks/express.ts
Comment thread src/runtime/retrieve.ts
Comment thread src/runtime/retrieve.ts
mohanagy added 2 commits June 10, 2026 12:39
- thread rootPath through runtime-proof recovery so recovered branch
  steps emit workspace-relative source files (no mixed path formats in
  receipts)
- use real primary-path boundaries in recovery phase-coverage scoring
  instead of an empty boundary list
- include focused bash follow-ups in prompt-contract follow-up input
  extraction, matching focused-call classification
- activate preserveFinalRuntimeEntrypointContextPreview by removing the
  self-excluding kept-key filter
- decide file-stem uniqueness on normalized ids and disambiguate
  deterministic collisions (foo-bar.ts vs foo_bar.ts)
- include the module stem in the Express analysis cache validity check
- fail fast when the benchmark suite is missing the built CLI
- regenerate all six public explain-runtime legacy receipts with the
  final binary; every report is full_win/ready with consistent
  workspace-relative evidence paths
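
The file-stem item in the commit message above can be illustrated with a small sketch. The normalization rule (lowercase, non-alphanumerics collapsed to `_`) and the numeric suffix scheme are assumptions made for this example; `normalizeStem` and `assignFileIds` are hypothetical names, and the real extractor may normalize and disambiguate differently.

```typescript
/** Hypothetical normalization: lowercase, runs of non-alphanumerics become "_". */
function normalizeStem(filePath: string): string {
  const base = filePath.split("/").pop() ?? filePath;
  const stem = base.replace(/\.[^.]+$/, "");
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

/** Assign one id per file; files whose normalized ids collide get a stable suffix. */
function assignFileIds(filePaths: string[]): Map<string, string> {
  // Group on the normalized id, not the raw stem, so that
  // foo-bar.ts and foo_bar.ts are recognized as colliding.
  const groups = new Map<string, string[]>();
  for (const filePath of filePaths) {
    const id = normalizeStem(filePath);
    const group = groups.get(id) ?? [];
    group.push(filePath);
    groups.set(id, group);
  }
  const ids = new Map<string, string>();
  for (const [id, group] of groups) {
    // Sort so the suffixes do not depend on discovery order.
    const ordered = [...group].sort();
    ordered.forEach((filePath, index) => {
      ids.set(filePath, ordered.length === 1 ? id : `${id}_${index + 1}`);
    });
  }
  return ids;
}

const ids = assignFileIds(["src/foo_bar.ts", "src/foo-bar.ts", "src/baz.ts"]);
console.log(ids.get("src/foo-bar.ts")); // foo_bar_1
console.log(ids.get("src/foo_bar.ts")); // foo_bar_2
console.log(ids.get("src/baz.ts")); // baz
```

Checking uniqueness on the raw stems would treat `foo-bar` and `foo_bar` as distinct and then emit the same id for both. Sorting before numbering keeps the output identical across runs. This sketch does not guard against a suffixed id colliding with another file's natural id.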

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/unit/benchmark-suite.test.ts (1)

21-27: 💤 Low value

Consider adding a clarifying comment.

The stub CLI scaffolding serves a non-obvious purpose (satisfying existence checks without requiring a full build). A brief comment explaining the design intent would help future maintainers, e.g.:

// Create a minimal CLI stub so tests can run without `npm run build`.
// The stub satisfies existsSync checks in benchmarkWorkspaceCliPath()
// but is never executed since tests mock all execution paths.
const cliStubDir = mkdtempSync(join(tmpdir(), 'madar-bench-cli-stub-'))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/benchmark-suite.test.ts` around lines 21 - 27, Add a short
clarifying comment above the cli stub setup (the cliStubDir and cliStubPath
initialization and the beforeAll that writes the file and sets
process.env.MADAR_BENCH_CLI_PATH) explaining that this creates a minimal CLI
stub to satisfy existence checks (e.g., benchmarkWorkspaceCliPath()) so tests
can run without an npm build, and that the stub is never executed because tests
mock execution paths.
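
As a rough illustration of why the stub is enough, the sketch below shows one way a fail-fast CLI lookup could behave. It is not the body of `benchmarkWorkspaceCliPath()`: `resolveBenchmarkCliPath`, the `defaultPath` fallback, and the injected `exists` predicate are assumptions for this example. Only the `MADAR_BENCH_CLI_PATH` variable and the existence check come from the review comment.

```typescript
/**
 * Hypothetical resolver: prefer an explicit override from the environment,
 * fall back to a default build output, and throw if nothing exists on disk.
 */
function resolveBenchmarkCliPath(
  env: Record<string, string | undefined>,
  defaultPath: string,
  exists: (path: string) => boolean,
): string {
  const candidate = env["MADAR_BENCH_CLI_PATH"] ?? defaultPath;
  if (!exists(candidate)) {
    // Fail fast instead of letting a later spawn fail with a vaguer error.
    throw new Error(`Benchmark CLI not found at ${candidate}; build the CLI first.`);
  }
  return candidate;
}

// In a unit test the stub only has to exist; it is never executed.
// Both paths below are invented example values.
const stubPath = "/tmp/madar-bench-cli-stub/cli.js";
const resolved = resolveBenchmarkCliPath(
  { MADAR_BENCH_CLI_PATH: stubPath },
  "dist/cli.js",
  (path) => path === stubPath,
);
console.log(resolved === stubPath); // true
```

Injecting the existence check keeps the sketch free of filesystem access. A real implementation would more likely call `existsSync` directly, as the review comment implies.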

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9ab87096-a895-4794-968a-6fd27c176cd5

📥 Commits

Reviewing files that changed from the base of the PR and between 63040dd and 6b091dd.

📒 Files selected for processing (2)
  • src/infrastructure/benchmark/suite.ts
  • tests/unit/benchmark-suite.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/infrastructure/benchmark/suite.ts

@mohanagy mohanagy merged commit bf5e70c into next Jun 10, 2026
7 checks passed