fix: complete proof-backed public explain-runtime full-win rows #520
- Recover out-of-scope runtime-proof obligation evidence: the first missing-obligation recovery loop now materializes graph nodes outside the initial slice scope, matching the phase-recovery loop.
- Drop dangling stdio relationships: compaction now filters relationships against the retained matched-node ID set before applying the cap.
- Keep hand-written top-level lib/ TypeScript source discoverable while still hard-ignoring compiled lib output (js/cjs/mjs/d.ts), restoring dub apps/web/lib middleware evidence.
- Carry over fixes from review follow-ups: same-turn retrieve persistence, prompt-contract targeting, routing/tool/latency scoring, SPI cache invalidation, and Express and nested Next.js SPI detection.
- Refresh all six public TypeScript explain-runtime legacy receipts (documenso, formbricks, dub, twenty, cal-diy, novu) with proof-backed full_win bundles generated sequentially from the final binary, and point the suite README, claims-and-evidence, and docs tests at them.
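The relationship-compaction bullet describes an ordering fix: filter first, cap second. A minimal sketch of that ordering, assuming hypothetical `MatchedNode`/`Relationship` shapes and a hypothetical `compactRelationships` helper (the real stdio payload types are not shown on this page):

```typescript
// Hypothetical shapes standing in for the stdio payload types.
interface MatchedNode {
  id: string;
}

interface Relationship {
  source: string;
  target: string;
  kind: string;
}

// Keep only relationships whose endpoints both survived node compaction,
// then apply the cap. Filtering before capping means dangling edges never
// consume slots in the capped list.
function compactRelationships(
  retainedNodes: readonly MatchedNode[],
  relationships: readonly Relationship[],
  maxRelationships: number,
): Relationship[] {
  const retainedIds = new Set(retainedNodes.map((node) => node.id));
  return relationships
    .filter((rel) => retainedIds.has(rel.source) && retainedIds.has(rel.target))
    .slice(0, Math.max(0, maxRelationships));
}
```

Capping before filtering would either return fewer edges than the cap allows or, without the filter, hand the client edges that point at nodes it never receives.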
📝 Walkthrough
Benchmark docs/results were refreshed and many runtime systems were extended: strict runtime-proof retrieval/slicing, trace-derived compare-pack merging, stdio compaction, file-stem extraction identity, JS/TS resolution and SPI projector changes, workspace Madar shim + .mcp.json PATH pinning, and updated tests.
Changes
Benchmark artifacts and runtime-proof infrastructure
Sequence Diagram(s)
sequenceDiagram
participant TestHarness
participant BenchmarkSuite
participant MadarTrace
participant CompareService
participant RetrieveService
participant ContextPack
TestHarness->>BenchmarkSuite: generate run (writes artifacts, prompts, results)
BenchmarkSuite->>MadarTrace: run madar arm (captures verbose trace + tool_result)
MadarTrace->>CompareService: provide raw trace + tool_result payloads
CompareService->>RetrieveService: reconstruct follow-up retrievals & merge into effective pack
RetrieveService->>ContextPack: provide matched_nodes + execution_slice for explain payload
ContextPack->>CompareService: preserve runtime-entry-context expandables
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes
Actionable comments posted: 9
🧹 Nitpick comments (3)
tests/unit/benchmark-suite-docs.test.ts (1)
242-290: ⚡ Quick win
Remove the duplicated test block.
This `it('publishes a scoped Twenty receipt when the root suite graph is too large', ...)` block duplicates the earlier one verbatim, so it adds runtime and maintenance cost without new coverage.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/benchmark-suite-docs.test.ts` around lines 242 - 290, The test block duplicated is the it(...) titled "publishes a scoped Twenty receipt when the root suite graph is too large": remove the entire duplicate it(...) block (the second occurrence) so only one such test remains; search for the it("publishes a scoped Twenty receipt when the root suite graph is too large", ...) and delete the redundant copy, then run the unit tests to confirm no regressions.
src/runtime/retrieve/slicing.ts (1)
429-469: 💤 Low value
Consider extracting the max anchor calculation to avoid duplication.
The `maxAnchors` calculation logic appears twice (lines 794-796 and 860-862) with identical formulas. Consider extracting this into a helper function or computing it once at the start of `buildAnchors`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/runtime/retrieve/slicing.ts` around lines 429 - 469, The maxAnchors calculation is duplicated in buildAnchors; extract the logic into a single helper (e.g., computeMaxAnchors or getMaxAnchors) and call it from buildAnchors instead of repeating the formula, updating both sites where the formula appears (the duplicated blocks around lines referencing maxAnchors) to use that helper; ensure the helper accepts the same inputs used in the original formula and return the computed number so buildAnchors remains functionally identical.
src/runtime/stdio/tools.ts (1)
168-205: 💤 Low value
Consider simplifying the conditional type extractions.
The `StrictRuntimeProofRetrievePayload` interface uses nested conditional type extractions (`extends infer T ? T extends { ... } ? ...`), which are verbose. Since `RetrieveResult` is a known interface, you could directly reference the nested types or define explicit type aliases for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/runtime/stdio/tools.ts` around lines 168 - 205, The conditional type extractions inside StrictRuntimeProofRetrievePayload (for answer_contract and execution_slice) are overly verbose; replace them by introducing simple type aliases from RetrieveResult (e.g., type AnswerContract = RetrieveResult['answer_contract'] and type ExecutionSlice = RetrieveResult['execution_slice']) and then reference AnswerContract['confidence'], AnswerContract['runtime_proof'], ExecutionSlice['status'], ExecutionSlice['confidence'], etc., or directly use RetrieveResult['...'] properties without the nested infer pattern to make the interface clearer and easier to maintain.
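A sketch of the suggested indexed-access style. The `RetrieveResult` below is a hypothetical, trimmed-down stand-in; its field names follow the comment, but the literal unions and optionality are assumptions. If the real properties are optional, `NonNullable` is needed before indexing into them, otherwise `AnswerContract['confidence']` fails to type-check:

```typescript
// Hypothetical, trimmed-down RetrieveResult; the real interface may differ
// in fields, literal values, and optionality.
interface RetrieveResult {
  answer_contract?: {
    confidence: "high" | "medium" | "low";
    runtime_proof?: { missing_obligations: string[] };
  };
  execution_slice?: {
    status: "complete" | "partial";
    confidence: number;
  };
}

// Indexed access types replace the nested `extends infer T ? T extends {...}` chains.
// NonNullable strips `undefined` so the nested lookups below type-check.
type AnswerContract = NonNullable<RetrieveResult["answer_contract"]>;
type ExecutionSlice = NonNullable<RetrieveResult["execution_slice"]>;

interface StrictRuntimeProofRetrievePayload {
  answer_contract: {
    confidence: AnswerContract["confidence"];
    runtime_proof: AnswerContract["runtime_proof"];
  };
  execution_slice: {
    status: ExecutionSlice["status"];
    confidence: ExecutionSlice["confidence"];
  };
}

// Example value showing the aliases resolve to concrete types.
const example: StrictRuntimeProofRetrievePayload = {
  answer_contract: { confidence: "high", runtime_proof: { missing_obligations: [] } },
  execution_slice: { status: "complete", confidence: 0.9 },
};
```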
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json`:
- Around line 170-176: The PR incorrectly implies that all trials (including
warm-cache/spi/trial-001) meet the "ready/full_win/no missing obligations"
criteria; however the SPI trial shows benchmark_outcome.outcome: "not_measured",
benchmark_readiness.status: "not_ready" and a missing runtime proof obligation
("send preparation"), whereas warm-cache/legacy shows full_win/ready/no missing
obligations (answer_contract.runtime_proof.missing_obligations: []). Update the
PR text to either explicitly exclude warm-cache/spi trials from the
"ready/full_win/no missing obligations" claim or add a clarifying sentence that
only warm-cache/legacy trials (not warm-cache/spi) meet those criteria,
referencing the fields benchmark_outcome.outcome and benchmark_readiness.status
to justify the distinction.
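One way to keep the PR text honest is to derive the claim from the reports themselves. A minimal sketch, assuming a hypothetical `TrialReport` view that reads only the fields named in the comment (the `trial` label and the string type of `missing_obligations` entries are assumptions):

```typescript
// Hypothetical minimal view of a trial's report.json.
interface TrialReport {
  trial: string; // e.g. "warm-cache/legacy/trial-001" (assumed label)
  benchmark_outcome: { outcome: string };
  benchmark_readiness: { status: string };
  answer_contract: { runtime_proof: { missing_obligations: string[] } };
}

// A trial backs the "ready / full_win / no missing obligations" claim only
// if all three conditions hold.
function backsFullWinClaim(report: TrialReport): boolean {
  return (
    report.benchmark_outcome.outcome === "full_win" &&
    report.benchmark_readiness.status === "ready" &&
    report.answer_contract.runtime_proof.missing_obligations.length === 0
  );
}

// Split trials so the PR text can name exactly which ones support the claim.
function partitionTrials(reports: readonly TrialReport[]): {
  supported: string[];
  excluded: string[];
} {
  const supported: string[] = [];
  const excluded: string[] = [];
  for (const report of reports) {
    (backsFullWinClaim(report) ? supported : excluded).push(report.trial);
  }
  return { supported, excluded };
}
```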
In
`@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json`:
- Around line 260-314: The report generator is emitting mixed source_file paths
in answer_contract.runtime_proof.obligations[].evidence[].source_file;
standardize this in the CLI emitter by choosing one format and applying it
consistently (e.g., always use path.relative(projectRoot, filePath) for relative
paths or always prefix with "<project-root>/"). Locate the reporter function
that builds evidence.source_file (search for answer_contract or runtime_proof
emission logic and the function that serializes obligations/evidence) and
replace the current ad-hoc path handling with a single utility that normalizes
paths (use a projectRoot constant and either path.relative + normalize or
prepend "<project-root>/" unconditionally), and add a unit test asserting both
example inputs (apps/... and packages/...) produce the same chosen format.
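A sketch of the first option (always project-relative). `normalizeEvidenceSourceFile` is a hypothetical helper name; the `<project-root>/` placeholder string is taken from the reports. It accepts all three shapes seen so far and emits forward-slash paths so receipts match across platforms:

```typescript
import * as path from "node:path";

const PROJECT_ROOT_PLACEHOLDER = "<project-root>/";

// Normalize an evidence path to a POSIX path relative to the project root.
// Accepts absolute paths, already-relative paths, and paths carrying the
// "<project-root>/" placeholder, so every emitted source_file has one format.
function normalizeEvidenceSourceFile(projectRoot: string, filePath: string): string {
  const stripped = filePath.startsWith(PROJECT_ROOT_PLACEHOLDER)
    ? filePath.slice(PROJECT_ROOT_PLACEHOLDER.length)
    : filePath;
  // Relative inputs resolve against the root; absolute inputs pass through.
  const absolute = path.resolve(projectRoot, stripped);
  const relative = path.relative(projectRoot, absolute);
  // Report paths are shared across platforms, so always use forward slashes.
  return relative.split(path.sep).join(path.posix.sep);
}
```

Files outside the root would come back as `../...`; whether that should be an error is a policy choice left open here.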
In `@src/infrastructure/benchmark/suite.ts`:
- Around line 544-546: The benchmarkWorkspaceCliPath() helper currently returns
a path to the built CLI without ensuring the file exists; update it to validate
the target file (e.g., use fs.existsSync or fs.promises.access) and either throw
a clear Error or return undefined when the file is missing so callers that
generate shims (the code that writes shim wrappers for the CLI) can fail fast;
specifically modify benchmarkWorkspaceCliPath() to compute the path, check
existence of join(findPackageRoot(), 'dist', 'src', 'cli', 'bin.js'), and on
missing file throw an Error with a descriptive message ('built CLI not found at
<path>') or adjust callers to handle the missing case and avoid writing broken
shims.
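A fail-fast sketch of the helper. To stay self-contained, it takes the package root as a parameter instead of calling `findPackageRoot()` itself; the error text follows the comment:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Sketch of a fail-fast benchmarkWorkspaceCliPath(). The package root is
// passed in here; the real helper calls findPackageRoot() itself.
function benchmarkWorkspaceCliPath(packageRoot: string): string {
  const cliPath = join(packageRoot, "dist", "src", "cli", "bin.js");
  if (!existsSync(cliPath)) {
    // Throwing here stops shim generation before a wrapper that points at
    // a missing file is written.
    throw new Error(`built CLI not found at ${cliPath}`);
  }
  return cliPath;
}
```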
In `@src/infrastructure/compare.ts`:
- Around line 1646-1658: The focused follow-up extraction only accepts
Madar/Read tools and thus misses focused bash follow-ups; update the filter in
traceFocusedFollowUpInputs (the block using
isMadarTraceToolName/isFocusedFollowUpTraceToolName and pushing into
followUpInputs from toolInputs[index]) to also consider focused bash calls
(e.g., include isBashTraceToolName(toolName) or extend
isFocusedFollowUpTraceToolName to return true for bash). Ensure the same
normalization via normalizeAnswerQualityText and pushing into followUpInputs
remains unchanged so focused bash inputs are captured for runtime-proof
targeting.
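A sketch of the widened filter. The tool names are assumptions: `Read` comes from the comment, while `Bash` and the `mcp__madar__` prefix are hypothetical placeholders for whatever the trace actually records. The normalizer is passed in as a stand-in for `normalizeAnswerQualityText`:

```typescript
// Hypothetical trace tool names; the real trace naming may differ.
const MADAR_TOOL_PREFIX = "mcp__madar__";

function isMadarTraceToolName(name: string): boolean {
  return name.startsWith(MADAR_TOOL_PREFIX);
}

function isBashTraceToolName(name: string): boolean {
  return name === "Bash";
}

// Read and Bash follow-ups both count as focused follow-ups.
function isFocusedFollowUpTraceToolName(name: string): boolean {
  return name === "Read" || isBashTraceToolName(name);
}

// Collect follow-up inputs that runtime-proof targeting should see.
function traceFocusedFollowUpInputs(
  toolNames: readonly string[],
  toolInputs: readonly string[],
  normalize: (text: string) => string,
): string[] {
  const followUpInputs: string[] = [];
  toolNames.forEach((toolName, index) => {
    if (!isMadarTraceToolName(toolName) && !isFocusedFollowUpTraceToolName(toolName)) {
      return; // e.g. Grep or other tools stay out of targeting
    }
    const input = toolInputs[index];
    if (input !== undefined) {
      followUpInputs.push(normalize(input));
    }
  });
  return followUpInputs;
}
```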
In `@src/infrastructure/context-pack-command.ts`:
- Around line 823-845: The preview generation is a no-op because keptKeys is
built from matchedNodes and then used to reject every preview candidate; fix by
changing the rejection logic so preview candidates are not self-excluded—either
(A) build keptKeys from a different source (e.g., the set of keys that must
truly be preserved, not from matchedNodes), or (B) remove the keptKeys.has(key)
check from the filter inside the preview mapping so only existingPreviewKeys
blocks duplication; update the code around matchedNodes, keptKeys,
existingPreviewKeys and the preview mapping (the preview variable and
runtimePrimaryPathRecordKey/asJsonRecord usages) accordingly.
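A sketch of option (B). `recordKey` is a hypothetical stand-in for `runtimePrimaryPathRecordKey`, and the `limit` parameter is assumed. The exclusion set comes only from previews that already exist, so candidates drawn from `matchedNodes` are no longer rejected by their own keys:

```typescript
type JsonRecord = Record<string, unknown>;

// Hypothetical stand-in for runtimePrimaryPathRecordKey(): a stable key per record.
function recordKey(record: JsonRecord): string {
  return `${String(record.source_file)}:${String(record.line_number)}`;
}

// Only existing previews block a candidate. Building the exclusion set from
// matchedNodes and then previewing those same nodes rejects everything.
function buildPreview(
  matchedNodes: readonly JsonRecord[],
  existingPreviews: readonly JsonRecord[],
  limit: number,
): JsonRecord[] {
  const existingPreviewKeys = new Set(existingPreviews.map(recordKey));
  const preview: JsonRecord[] = [];
  for (const node of matchedNodes) {
    if (preview.length >= limit) break;
    const key = recordKey(node);
    if (existingPreviewKeys.has(key)) continue;
    existingPreviewKeys.add(key); // also de-duplicates within matchedNodes
    preview.push(node);
  }
  return preview;
}
```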
In `@src/pipeline/extract/core.ts`:
- Around line 25-63: buildFileStemMap currently ensures uniqueness only on raw
basenames but fileNodeIdForPath later calls _makeId(stem), causing collisions
(e.g., foo-bar vs foo_bar). Update buildFileStemMap to consider the normalized
id from _makeId when deciding uniqueness: for any basename that collides
(stemCounts > 1), compute relativeStem as you do now, then iteratively widen the
candidate stem by prepending parent segments (or otherwise adjusting) and call
_makeId(candidate) until the returned id is unique among all already-chosen
_makeId results; store the original filePath -> chosen stem (not the id) so
fileNodeIdForPath/_makeId produces consistent, collision-free ids. Ensure you
reference buildFileStemMap, fileNodeIdForPath, and _makeId in the change.
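A sketch of collision-aware stem selection. `makeId` is a hypothetical normalizer standing in for `_makeId` (lower-case, non-alphanumeric runs to `_`), which is exactly the kind of normalization that makes `foo-bar` and `foo_bar` collide. The map stores stems rather than ids, as the comment suggests. The numeric-suffix fallback for paths that still collide at full depth is an added assumption:

```typescript
// Hypothetical stand-in for _makeId().
function makeId(stem: string): string {
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

function stripExtension(segment: string): string {
  const dot = segment.lastIndexOf(".");
  return dot > 0 ? segment.slice(0, dot) : segment;
}

// Choose a stem per file so that makeId(stem) is unique across all files.
function buildFileStemMap(filePaths: readonly string[]): Map<string, string> {
  const idCounts = new Map<string, number>();
  const segmentsByPath = new Map<string, string[]>();
  for (const filePath of filePaths) {
    const segments = filePath.split("/").filter(Boolean);
    segments.push(stripExtension(segments.pop() ?? ""));
    segmentsByPath.set(filePath, segments);
    const baseId = makeId(segments[segments.length - 1]);
    idCounts.set(baseId, (idCounts.get(baseId) ?? 0) + 1);
  }

  const usedIds = new Set<string>();
  const stems = new Map<string, string>();
  // Sorted iteration keeps the result independent of discovery order.
  for (const filePath of [...filePaths].sort()) {
    const segments = segmentsByPath.get(filePath)!;
    const stemAt = (depth: number) => segments.slice(-depth).join("/");
    // Colliding normalized basenames start one parent segment wider.
    const baseCollides = (idCounts.get(makeId(stemAt(1))) ?? 0) > 1;
    let depth = baseCollides ? Math.min(2, segments.length) : 1;
    // Widen with parent segments while the normalized id is taken.
    while (depth < segments.length && usedIds.has(makeId(stemAt(depth)))) {
      depth += 1;
    }
    let stem = stemAt(depth);
    // Full path still collides after normalization: add a numeric suffix.
    for (let n = 2; usedIds.has(makeId(stem)); n += 1) {
      stem = `${stemAt(depth)}_${n}`;
    }
    usedIds.add(makeId(stem));
    stems.set(filePath, stem);
  }
  return stems;
}
```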
In `@src/pipeline/extract/frameworks/express.ts`:
- Around line 92-93: The cached Express analysis (exportedBindings and
routeRecords) keyed only by filePath+sourceText can become incorrect when
moduleStem(filePath) (which calls fileStemForPath) yields a different stem;
update analyzeExpressModule's caching to include the current stem (derived via
moduleStem(filePath)) in the cache key or store the stem with the cached value
and invalidate/refresh the cache when the stored stem differs from the current
moduleStem. Locate analyzeExpressModule and its cache lookup/insert logic and
ensure the cache key/value accounts for moduleStem(filePath) (or performs a stem
equality check before reusing exportedBindings/routeRecords).
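A sketch of the first option (stem in the cache key). The signature is hypothetical: `moduleStem` and the analysis step are injected so the example runs on its own, and a production version might hash `sourceText` rather than embed it in the key:

```typescript
interface ExpressAnalysis {
  stem: string;
  exportedBindings: string[];
  routeRecords: string[];
}

const analysisCache = new Map<string, ExpressAnalysis>();

function analyzeExpressModule(
  filePath: string,
  sourceText: string,
  moduleStem: (filePath: string) => string,
  analyze: (filePath: string, sourceText: string, stem: string) => Omit<ExpressAnalysis, "stem">,
): ExpressAnalysis {
  const stem = moduleStem(filePath);
  // Including the stem in the key forces re-analysis when the file-stem map
  // changes, instead of reusing bindings built from the old stem.
  const key = `${filePath}\u0000${stem}\u0000${sourceText}`;
  const cached = analysisCache.get(key);
  if (cached) return cached;
  const result: ExpressAnalysis = { stem, ...analyze(filePath, sourceText, stem) };
  analysisCache.set(key, result);
  return result;
}
```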
In `@src/runtime/retrieve.ts`:
- Around line 3597-3611: The recovery logic in
recoverMissingRuntimeProofBranches recreates out-of-scope steps without passing
the rootPath used by buildExecutionSlice, causing raw source_file paths to leak;
fix it by threading the rootPath argument through the recovery helper(s) that
call executionSliceStepFromGraph so those calls use
executionSliceStepFromGraph(..., rootPath) instead of the no-rootPath overload,
updating all recovery passes (the second recovery pass and the similar blocks
referenced around 3710-3736 and 3903-3919) to accept and forward rootPath from
buildExecutionSlice into the branch materialization flow (ensure function
signatures and invocations for the helper(s) and
recoverMissingRuntimeProofBranches reflect the added rootPath parameter).
- Around line 3686-3693: The call to phaseCoverageForPath is passing an empty
array for primary path boundaries, causing queue to be treated as missing;
update the helper to pass the real primaryBoundaries variable instead of []
wherever phaseCoverageForPath is invoked (e.g., the assignment to
currentPhaseCoverage and the other similar calls shown) so the recovery-phase
scoring uses the actual primary-path boundaries; locate uses of
phaseCoverageForPath (e.g., currentPhaseCoverage) and replace the empty array
argument with primaryBoundaries, and apply the same change to the other
occurrences listed in the comment.
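A toy model of the second retrieve.ts finding. Everything here is hypothetical, including the `Phase` union, the step shape, and the coverage rule. It assumes only that a phase counts as covered when a path step or a primary-path boundary carries it, which is enough to show why passing `[]` makes `queue` look missing:

```typescript
// Hypothetical phase model; the real phaseCoverageForPath may score differently.
type Phase = "entry" | "validation" | "queue" | "persistence";

interface PathStep {
  phase: Phase;
}

// A required phase is covered if any step or primary-path boundary carries it.
function phaseCoverageForPath(
  steps: readonly PathStep[],
  primaryBoundaries: readonly PathStep[],
  requiredPhases: readonly Phase[],
): { covered: Phase[]; missing: Phase[] } {
  const seen = new Set<Phase>([...steps, ...primaryBoundaries].map((s) => s.phase));
  return {
    covered: requiredPhases.filter((p) => seen.has(p)),
    missing: requiredPhases.filter((p) => !seen.has(p)),
  };
}
```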
---
Nitpick comments:
In `@src/runtime/retrieve/slicing.ts`:
- Around line 429-469: The maxAnchors calculation is duplicated in buildAnchors;
extract the logic into a single helper (e.g., computeMaxAnchors or
getMaxAnchors) and call it from buildAnchors instead of repeating the formula,
updating both sites where the formula appears (the duplicated blocks around
lines referencing maxAnchors) to use that helper; ensure the helper accepts the
same inputs used in the original formula and return the computed number so
buildAnchors remains functionally identical.
In `@src/runtime/stdio/tools.ts`:
- Around line 168-205: The conditional type extractions inside
StrictRuntimeProofRetrievePayload (for answer_contract and execution_slice) are
overly verbose; replace them by introducing simple type aliases from
RetrieveResult (e.g., type AnswerContract = RetrieveResult['answer_contract']
and type ExecutionSlice = RetrieveResult['execution_slice']) and then reference
AnswerContract['confidence'], AnswerContract['runtime_proof'],
ExecutionSlice['status'], ExecutionSlice['confidence'], etc., or directly use
RetrieveResult['...'] properties without the nested infer pattern to make the
interface clearer and easier to maintain.
In `@tests/unit/benchmark-suite-docs.test.ts`:
- Around line 242-290: The test block duplicated is the it(...) titled
"publishes a scoped Twenty receipt when the root suite graph is too large" —
remove the entire duplicate it(...) block (the second occurrence) so only one
such test remains; search for the it("publishes a scoped Twenty receipt when the
root suite graph is too large", ...) and delete the redundant copy, then run the
unit tests to confirm no regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 872b4dbe-4d11-4a52-afec-63688e1069c5
📒 Files selected for processing (136)
- docs/benchmarks/suite/README.md
- docs/benchmarks/suite/repos.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/raw/documenso/explain-runtime/warm-cache/spi/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.json
- docs/benchmarks/suite/results/2026-06-10T04-39-55/summary.md
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/raw/formbricks/explain-runtime/warm-cache/spi/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.json
- docs/benchmarks/suite/results/2026-06-10T04-48-49/summary.md
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/raw/dub/explain-runtime/warm-cache/spi/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.json
- docs/benchmarks/suite/results/2026-06-10T05-03-03/summary.md
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/spi/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.json
- docs/benchmarks/suite/results/2026-06-10T05-18-23/summary.md
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.json
- docs/benchmarks/suite/results/2026-06-10T05-46-15/summary.md
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-answer.txt
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/baseline-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-answer.txt
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/madar-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/native_agent-prompt.txt
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/report.share-safe.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
- docs/benchmarks/suite/runtime-proof.json
- docs/claims-and-evidence.md
- src/infrastructure/benchmark/suite.ts
- src/infrastructure/compare.ts
- src/infrastructure/context-pack-command.ts
- src/pipeline/extract.ts
- src/pipeline/extract/core.ts
- src/pipeline/extract/cross-file.ts
- src/pipeline/extract/frameworks/express.ts
- src/pipeline/extract/frameworks/nest.ts
- src/pipeline/extract/generic.ts
- src/pipeline/extract/non-code.ts
- src/pipeline/extract/python-rationale.ts
- src/pipeline/spi/build.ts
- src/pipeline/spi/cache.ts
- src/pipeline/spi/framework-nextjs.ts
- src/pipeline/spi/projector.ts
- src/runtime/retrieve.ts
- src/runtime/retrieve/slicing.ts
- src/runtime/runtime-proof.ts
- src/runtime/stdio/tools.ts
- src/shared/source-discovery.ts
- tests/fixtures/go-semantic-workspace/cmd/chi/main.go
- tests/unit/benchmark-suite-docs.test.ts
- tests/unit/benchmark-suite.test.ts
- tests/unit/compare-native-agent.test.ts
- tests/unit/compare.test.ts
- tests/unit/extract-duplicate-file-stems.test.ts
- tests/unit/extract-go.test.ts
- tests/unit/extract-tsconfig-alias-runtime-proof.test.ts
- tests/unit/retrieve-framework-boost-v0-19.test.ts
- tests/unit/retrieve-slice-v1.test.ts
- tests/unit/retrieve.test.ts
- tests/unit/runtime-proof.test.ts
- tests/unit/source-discovery.test.ts
- tests/unit/spi-cache.test.ts
- tests/unit/spi-calls.test.ts
- tests/unit/spi-framework-nextjs.test.ts
- tests/unit/spi-projector.test.ts
- tests/unit/spi-scoped-root-tsconfig.test.ts
- tests/unit/stdio-server.test.ts
- tests/unit/stdio-slice-surface.test.ts
👮 Files not reviewed due to content moderation or server errors (8)
- docs/benchmarks/suite/results/2026-06-10T06-03-05/raw/novu/explain-runtime/warm-cache/legacy/trial-001/run-state.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.json
- docs/benchmarks/suite/results/2026-06-10T06-03-05/summary.md
- src/pipeline/extract/cross-file.ts
- src/pipeline/extract/frameworks/nest.ts
- src/pipeline/spi/build.ts
- src/pipeline/spi/cache.ts
- src/pipeline/spi/framework-nextjs.ts
In docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json (lines 260-314):
"obligations": [
  {
    "id": "booking_request",
    "label": "booking request",
    "kind": "entrypoint",
    "required": true,
    "evidence": [
      {
        "label": "API /api/book/event",
        "source_file": "apps/web/pages/api/book/event.ts",
        "line_number": 1
      }
    ]
  },
  {
    "id": "availability_validation",
    "label": "availability validation",
    "kind": "handoff",
    "required": true,
    "evidence": [
      {
        "label": "checkForConflicts()",
        "source_file": "packages/features/bookings/lib/handleNewBooking/ensureAvailableUsers.ts",
        "line_number": 243
      }
    ]
  },
  {
    "id": "scheduled_event_persistence",
    "label": "scheduled event persistence",
    "kind": "terminal",
    "required": true,
    "evidence": [
      {
        "label": "saveBooking()",
        "source_file": "packages/features/bookings/lib/handleNewBooking/createBooking.ts",
        "line_number": 96
      }
    ]
  },
  {
    "id": "notification_delivery",
    "label": "notification delivery",
    "kind": "terminal",
    "required": true,
    "evidence": [
      {
        "label": "sendScheduledEmailsAndSMS()",
        "source_file": "<project-root>/packages/app-store/_utils/payments/handlePaymentSuccess.ts",
        "line_number": 211
      }
    ]
  }
],
"missing_obligations": []
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if other report.json files in this PR have the same inconsistency
rg -n --type=json '<project-root>/' -g 'report.json' -C2
Repository: mohanagy/madar
Length of output: 50370
🏁 Script executed:
#!/bin/bash
set -euo pipefail
TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"
python3 - <<'PY'
import json, os, re
from pathlib import Path
target = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(target.read_text())
# Extract all source_file values from runtime_proof.obligations[*].evidence[*].source_file
sources = []
try:
    rp = data["runtime_proof"]
except Exception:
    rp = data.get("runtime_proof", {})
obs = rp.get("obligations", [])
for o in obs:
    for ev in o.get("evidence", []):
        sf = ev.get("source_file")
        if sf is not None:
            sources.append(sf)
print("TARGET_FILE:", str(target))
print("source_file count:", len(sources))
unique = sorted(set(sources))
print("unique formats:")
for s in unique:
    print(" -", s)
def bucket(s):
    if s.startswith("<project-root>/"):
        return "project_root_prefixed"
    return "relative_or_other"
buckets = { "project_root_prefixed":0, "relative_or_other":0 }
for s in sources:
    buckets[bucket(s)] += 1
print("bucket counts:", buckets)
PY
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    rp = data.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    for o in obs:
        for ev in o.get("evidence") or []:
            sf = ev.get("source_file")
            if sf:
                sources.append(sf)
    return sources
mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        # summarize counts
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break
print("TOTAL report.json under docs/benchmarks/suite/results:", total)
print("Found mixed-format files (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY
Repository: mohanagy/madar
Length of output: 391
🏁 Script executed:
#!/bin/bash
set -euo pipefail
TARGET="docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json"
python3 - <<'PY'
import json
from pathlib import Path
p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())
# 1) See top-level keys
print("TOP_LEVEL_KEYS:", sorted(data.keys()))
# 2) Recursively count all source_file occurrences
sources=[]
paths=[]
def walk(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="source_file" and isinstance(v,str):
                sources.append(v)
                paths.append(np)
            walk(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            walk(v, f"{path}[{i}]")
walk(data)
print("TOTAL_SOURCE_FILE_OCCURRENCES:", len(sources))
print("UNIQUE_SOURCE_FILE_VALUES (up to 50):")
for s in sorted(set(sources))[:50]:
    print(" -", s)
pref = sum(1 for s in sources if s.startswith("<project-root>/"))
# count against all occurrences, not the length of the last printed string
other = len(sources)-pref
print("BUCKET_COUNTS:", {"project_root_prefixed":pref, "relative_or_other":other})
# 3) Locate any 'obligations' keys and print nearby structure clues
obligations_paths=[]
def find_obligations(obj, path=""):
    if isinstance(obj, dict):
        for k,v in obj.items():
            np = f"{path}.{k}" if path else k
            if k=="obligations":
                obligations_paths.append(np)
            find_obligations(v, np)
    elif isinstance(obj, list):
        for i,v in enumerate(obj):
            find_obligations(v, f"{path}[{i}]")
find_obligations(data)
print("OBLIGATIONS_PATHS:", obligations_paths[:20])
PY
Repository: mohanagy/madar
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path
p = Path("docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json")
data = json.loads(p.read_text())
rp = data["answer_contract"]["runtime_proof"]
obs = rp.get("obligations", [])
print("OBLIGATIONS_COUNT:", len(obs))
if obs:
    o0 = obs[0]
    print("OBLIGATION_0_KEYS:", sorted(o0.keys()))
    # print one evidence-like field if present
    for k,v in o0.items():
        if isinstance(v, list):
            print(f" - list field: {k} len={len(v)} sample_type={type(v[0]).__name__ if v else 'n/a'}")
        if k=="evidence":
            print(" - evidence sample:", v[0] if v else None)
# gather all source_file keys within obligations objects
sources=[]
def walk(x):
    if isinstance(x, dict):
        for k,v in x.items():
            if k=="source_file" and isinstance(v,str):
                sources.append(v)
            walk(v)
    elif isinstance(x, list):
        for i in x: walk(i)
walk(obs)
print("OBLIGATIONS_SOURCE_FILE_COUNT:", len(sources))
print("OBLIGATIONS_UNIQUE_SOURCE_FILE_VALUES:", sorted(set(sources)))
PY
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources_in_runtime_proof_obligations(path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x: walk(i)
    walk(obs)
    return sources
mixed=[]
total=0
for p in report_files:
    total += 1
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((str(p), pref, other, len(sources)))
        if len(mixed) >= 15:
            break
print("TOTAL_REPORT_JSON_UNDER_RESULTS:", total)
print("FILES_WITH_MIXED_FORMAT_IN_runtime_proof_obligations (first 15):", len(mixed))
for path,pref,other,total_sources in mixed:
    print(f"- {path} (pref={pref}, other={other}, total={total_sources})")
PY
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import json
from pathlib import Path
base = Path("docs/benchmarks/suite/results")
report_files = list(base.rglob("report.json"))
def extract_sources_in_runtime_proof_obligations(path: Path):
    try:
        data = json.loads(path.read_text())
    except Exception:
        return []
    ac = data.get("answer_contract") or {}
    rp = ac.get("runtime_proof") or {}
    obs = rp.get("obligations") or []
    sources=[]
    def walk(x):
        if isinstance(x, dict):
            for k,v in x.items():
                if k=="source_file" and isinstance(v,str):
                    sources.append(v)
                walk(v)
        elif isinstance(x, list):
            for i in x: walk(i)
    walk(obs)
    return sources
mixed=[]
total_with_sources=0
for p in report_files:
    sources = extract_sources_in_runtime_proof_obligations(p)
    if not sources:
        continue
    total_with_sources += 1
    has_pref = any(s.startswith("<project-root>/") for s in sources)
    has_other = any(not s.startswith("<project-root>/") for s in sources)
    if has_pref and has_other:
        pref = sum(1 for s in sources if s.startswith("<project-root>/"))
        other = len(sources)-pref
        mixed.append((p, pref, other, len(sources), sorted(set(sources))[:6]))
print("TOTAL report.json files under docs/benchmarks/suite/results:", len(report_files))
print("TOTAL files with runtime_proof.obligations source_file occurrences:", total_with_sources)
print("FILES with MIXED-format source_file values:", len(mixed))
for item in mixed[:20]:
    path,pref,other,count,preview = item
    print(f"- {path} (pref={pref}, other={other}, total_evidence_source_files={count})")
    print(" sample unique source_file values:", preview)
PY
Make answer_contract.runtime_proof.obligations[].evidence[].source_file path formatting consistent
In docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json, source_file mixes relative paths (e.g. apps/web/pages/api/book/event.ts, packages/features/...) with a <project-root>/...-prefixed path (packages/app-store/_utils/payments/handlePaymentSuccess.ts). The same mixed formatting appears in docs/benchmarks/suite/results/2026-06-10T05-18-23/raw/twenty/explain-runtime/warm-cache/legacy/trial-001/report.json (2/93 report.json files). Ensure the CLI emits one uniform format (all relative or all <project-root>/), or document the meaning of the <project-root>/ prefix.
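A unit-level version of the check the scripts above run could look like the TypeScript below. It is a sketch under assumptions: it reads only `evidence[].source_file` on each obligation (following the path in the finding title), uses the same `<project-root>/` prefix rule as the scripts, and `classifySourceFile` / `sourceFileFormats` are hypothetical helper names, not existing Madar APIs.

```typescript
// Hypothetical shapes: only the fields this check reads are modelled.
interface Evidence {
  source_file?: string;
}
interface Obligation {
  evidence?: Evidence[];
}

const PROJECT_ROOT_PREFIX = "<project-root>/";

type SourceFileFormat = "project_root_prefixed" | "relative_or_other";

// Classify one source_file value with the same prefix rule the scripts use.
function classifySourceFile(sourceFile: string): SourceFileFormat {
  return sourceFile.startsWith(PROJECT_ROOT_PREFIX)
    ? "project_root_prefixed"
    : "relative_or_other";
}

// Collect the distinct formats used across all obligation evidence.
// A receipt is consistent when the returned set has at most one entry.
function sourceFileFormats(obligations: Obligation[]): Set<SourceFileFormat> {
  const formats = new Set<SourceFileFormat>();
  for (const obligation of obligations) {
    for (const evidence of obligation.evidence ?? []) {
      if (typeof evidence.source_file === "string") {
        formats.add(classifySourceFile(evidence.source_file));
      }
    }
  }
  return formats;
}
```

A regression test would then assert `sourceFileFormats(report.answer_contract.runtime_proof.obligations).size <= 1` for every regenerated receipt.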
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@docs/benchmarks/suite/results/2026-06-10T05-46-15/raw/cal-diy/explain-runtime/warm-cache/legacy/trial-001/report.json`
around lines 260-314: the report generator emits mixed source_file paths
in answer_contract.runtime_proof.obligations[].evidence[].source_file;
standardize this in the CLI emitter by choosing one format and applying it
consistently (e.g., always use path.relative(projectRoot, filePath) for relative
paths or always prefix with "<project-root>/"). Locate the reporter function
that builds evidence.source_file (search for answer_contract or runtime_proof
emission logic and the function that serializes obligations/evidence) and
replace the current ad-hoc path handling with a single utility that normalizes
paths (use a projectRoot constant and either path.relative + normalize or
prepend "<project-root>/" unconditionally), and add a unit test asserting both
example inputs (apps/... and packages/...) produce the same chosen format.
Source: Learnings
Fixed in 63040dd: this was caused by runtime-proof recovery materializing out-of-scope branch steps without the workspace rootPath, so their source_file values stayed absolute and were sanitized to /... while in-scope steps were workspace-relative. rootPath is now threaded through recovery, all six receipts were regenerated, and a verification pass confirmed no report mixes the two formats anymore.
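The root cause described in this reply can be illustrated with a minimal sketch. Everything here is hypothetical: `stepSourceFile` stands in for the recovery step builder, `sanitizeForReceipt` for a share-safe sanitizer that (by assumption) hides leftover absolute paths behind a `<project-root>/` prefix, and POSIX paths are assumed via `path.posix` so the output is deterministic. It only shows why a missing `rootPath` produces a second path format, not how Madar implements either step.

```typescript
import { posix } from "node:path";

// Hypothetical step builder: relativize against rootPath when it is provided.
// Without rootPath the absolute path passes through unchanged (the bug).
function stepSourceFile(absoluteFile: string, rootPath?: string): string {
  return rootPath === undefined ? absoluteFile : posix.relative(rootPath, absoluteFile);
}

// Hypothetical share-safe sanitizer: workspace-relative paths pass through,
// any remaining absolute path is rewritten behind a placeholder prefix.
function sanitizeForReceipt(sourceFile: string, workspaceRoot: string): string {
  if (!posix.isAbsolute(sourceFile)) {
    return sourceFile;
  }
  return `<project-root>/${posix.relative(workspaceRoot, sourceFile)}`;
}
```

With a workspace at `/repo`, an in-scope step for `/repo/apps/web/a.ts` becomes `apps/web/a.ts`, while a recovered step built without `rootPath` for `/repo/packages/b.ts` becomes `<project-root>/packages/b.ts`; passing `rootPath` to the recovered step as well makes both workspace-relative.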
@mohanagy, thanks for the detailed root cause explanation — threading rootPath through the recovery path is the right fix. Glad the verification pass confirmed all six receipts are now consistently formatted.
(ノ◕ヮ◕)ノ*:・゚✧
🧠 Learnings used
Learnt from: mohanagy
Repo: mohanagy/graphify-ts PR: 238
File: docs/benchmarks/2026-05-21-govalidate-v0-23-0-validation/report-generation-pack-only.report.share-safe.json:12-17
Timestamp: 2026-05-21T19:32:18.424Z
Learning: Benchmark artifacts under docs/benchmarks/ are published faithful snapshots of the CLI’s emitted JSON output and should not be manually edited to “correct” formatting or numeric rounding. If you need to change how token/ratio values are serialized (e.g., computeReductionRatio() rounding to one decimal place such that ratios near 1 may become 1.0), make the precision/serialization change in src/infrastructure/compare.ts so regenerated artifacts reflect the updated CLI behavior.
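To make the rounding remark concrete, the sketch below assumes, purely for illustration, that a ratio is serialized with one-decimal rounding equivalent to JavaScript's `Number.prototype.toFixed(1)`. `formatRatio` is a hypothetical helper, not the actual `computeReductionRatio()` in `src/infrastructure/compare.ts`.

```typescript
// Hypothetical serializer: rounds a reduction ratio to one decimal place,
// standing in for whatever precision compare.ts actually applies.
function formatRatio(ratio: number): string {
  return ratio.toFixed(1);
}

// Ratios on either side of 1 collapse to the same one-decimal string.
const nearOne = [0.96, 1.04].map(formatRatio);
const wellAbove = formatRatio(1.26);
```

Under this assumption both near-1 ratios print as `1.0`, so the distinction is lost in the artifact itself; per the learning, changing that means changing the serializer and regenerating, not hand-editing the JSON.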
- thread rootPath through runtime-proof recovery so recovered branch steps emit workspace-relative source files (no mixed path formats in receipts)
- use real primary-path boundaries in recovery phase-coverage scoring instead of an empty boundary list
- include focused bash follow-ups in prompt-contract follow-up input extraction, matching focused-call classification
- activate preserveFinalRuntimeEntrypointContextPreview by removing the self-excluding kept-key filter
- decide file-stem uniqueness on normalized ids and disambiguate deterministic collisions (foo-bar.ts vs foo_bar.ts)
- include the module stem in the Express analysis cache validity check
- fail fast when the benchmark suite is missing the built CLI
- regenerate all six public explain-runtime legacy receipts with the final binary; every report is full_win/ready with consistent workspace-relative evidence paths
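The file-stem item in this commit can be sketched as follows. This is a hedged illustration, not the committed code: `normalizeId` is a hypothetical stand-in for the project's `_makeId` normalization (assumed here to lowercase and collapse runs of non-alphanumerics to `_`), and the `_2`, `_3` suffix scheme is an assumption chosen only to show deterministic disambiguation.

```typescript
import { posix } from "node:path";

// Hypothetical stand-in for _makeId: lowercase the stem and collapse every run
// of non-alphanumeric characters to a single "_".
function normalizeId(stem: string): string {
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

// Map each file path to an id that is unique across the input set.
// Uniqueness is decided on the normalized id, so foo-bar.ts and foo_bar.ts collide.
function assignStemIds(files: readonly string[]): Map<string, string> {
  const ids = new Map<string, string>();
  const used = new Set<string>();
  // Sorting first makes suffix assignment independent of discovery order.
  for (const file of [...files].sort()) {
    const stem = posix.basename(file, posix.extname(file));
    const base = normalizeId(stem);
    let id = base;
    // Append the smallest free numeric suffix on collision.
    for (let n = 2; used.has(id); n++) {
      id = `${base}_${n}`;
    }
    used.add(id);
    ids.set(file, id);
  }
  return ids;
}
```

Because the input is sorted before ids are handed out, `src/foo-bar.ts` always receives `foo_bar` and `src/foo_bar.ts` always receives `foo_bar_2`, whichever order discovery returns them in.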
🧹 Nitpick comments (1)
tests/unit/benchmark-suite.test.ts (1)
21-27: 💤 Low value
Consider adding a clarifying comment.
The stub CLI scaffolding serves a non-obvious purpose (satisfying existence checks without requiring a full build). A brief comment explaining the design intent would help future maintainers, e.g.:
// Create a minimal CLI stub so tests can run without `npm run build`.
// The stub satisfies existsSync checks in benchmarkWorkspaceCliPath()
// but is never executed since tests mock all execution paths.
const cliStubDir = mkdtempSync(join(tmpdir(), 'madar-bench-cli-stub-'))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/benchmark-suite.test.ts` around lines 21 - 27, Add a short clarifying comment above the cli stub setup (the cliStubDir and cliStubPath initialization and the beforeAll that writes the file and sets process.env.MADAR_BENCH_CLI_PATH) explaining that this creates a minimal CLI stub to satisfy existence checks (e.g., benchmarkWorkspaceCliPath()) so tests can run without an npm build, and that the stub is never executed because tests mock execution paths.
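As a hedged, framework-free sketch of the setup this comment describes: the temp-dir prefix and `MADAR_BENCH_CLI_PATH` come from the review, while `cli.js`, `createCliStub`, `removeCliStub`, and `resolveBenchCliPath` are hypothetical names. `resolveBenchCliPath` only approximates an existence check like `benchmarkWorkspaceCliPath()` and is not its real implementation.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Create a throwaway directory containing a placeholder CLI file. The file only
// needs to exist; tests that mock every execution path never run it.
function createCliStub(): { dir: string; cliPath: string } {
  const dir = mkdtempSync(join(tmpdir(), "madar-bench-cli-stub-"));
  const cliPath = join(dir, "cli.js");
  writeFileSync(cliPath, "// stub CLI: never executed by the tests\n");
  return { dir, cliPath };
}

// Delete the stub directory and everything in it.
function removeCliStub(dir: string): void {
  rmSync(dir, { recursive: true, force: true });
}

// Hypothetical existence check: fail fast unless the configured CLI path exists.
function resolveBenchCliPath(env: Record<string, string | undefined> = process.env): string {
  const cliPath = env.MADAR_BENCH_CLI_PATH;
  if (!cliPath || !existsSync(cliPath)) {
    throw new Error("benchmark CLI not found; build it first (npm run build)");
  }
  return cliPath;
}
```

In a Vitest suite, `createCliStub` could run in `beforeAll` and assign `process.env.MADAR_BENCH_CLI_PATH`, with `removeCliStub` in `afterAll`, which is the intent the suggested comment documents.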
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/benchmark-suite.test.ts`:
- Around line 21-27: Add a short clarifying comment above the cli stub setup
(the cliStubDir and cliStubPath initialization and the beforeAll that writes the
file and sets process.env.MADAR_BENCH_CLI_PATH) explaining that this creates a
minimal CLI stub to satisfy existence checks (e.g., benchmarkWorkspaceCliPath())
so tests can run without an npm build, and that the stub is never executed
because tests mock execution paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9ab87096-a895-4794-968a-6fd27c176cd5
📒 Files selected for processing (2)
src/infrastructure/benchmark/suite.ts
tests/unit/benchmark-suite.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/infrastructure/benchmark/suite.ts
Summary
Completes the public benchmark-proof push on top of #519: fixes the two remaining runtime-proof blockers in `retrieve.ts`, fixes a source-discovery regression that gutted the dub graph, addresses all CodeRabbit review findings, and refreshes all six public TypeScript `explain-runtime` legacy receipts to proof-backed `full_win` bundles generated from the final binary.
Runtime fixes (each with a failing-first regression test)
- Out-of-scope runtime-proof recovery (`src/runtime/retrieve.ts`): the first missing-obligation recovery loop built branch steps only from `nodeById` (the initial in-scope slice), silently dropping recovery candidates whose evidence exists in the graph but outside the slice scope. It now materializes those nodes via `executionSliceStepFromGraph` with `rootPath` threaded through, matching the phase-recovery loop. Regression: `tests/unit/retrieve-slice-v1.test.ts`.
- Dangling stdio relationships after compaction (`src/runtime/retrieve.ts`): `compactRetrievePayloadForStdioProfile` sliced `matched_nodes` and `relationships` independently, so retained relationships could reference trimmed nodes. Relationships are now filtered against the retained matched-node ID set before the cap. Regression: `tests/unit/retrieve.test.ts`.
- Top-level `lib/` discovery fidelity (`src/shared/source-discovery.ts`): the blanket top-level `lib/**` hard-ignore also dropped hand-written TypeScript source (e.g. dub's `apps/web/lib/**` link-resolution middleware), flipping the dub row's benchmark readiness to `degraded`. The hard-ignore now targets compiled artifacts only (`.js`/`.cjs`/`.mjs`/`.d.ts`). Regressions: `tests/unit/source-discovery.test.ts`.
CodeRabbit review fixes
- Thread `rootPath` into runtime-proof recovery so recovered branch steps emit workspace-relative source files (fixes the mixed path formats flagged in receipts).
- Use real primary-path boundaries in recovery phase-coverage scoring instead of `[]`.
- Activate `preserveFinalRuntimeEntrypointContextPreview` by removing its self-excluding kept-key filter.
- Decide file-stem uniqueness on `_makeId`-normalized ids with deterministic collision suffixes (`foo-bar.ts` vs `foo_bar.ts`), with a regression test.
Carried review follow-ups
Same-turn retrieve persistence in compare, prompt-contract targeting for same-turn follow-ups, routing/tool/latency false-win scoring, SPI cache invalidation for nested tsconfig/jsconfig, Express entrypoint proof recognition, nested Next.js `src/app|pages|middleware` SPI detection, claims doc relative links, duplicate-stem extraction tests.
Receipts
All six public legacy rows were rerun sequentially with the final binary after all fixes. The success criteria below apply to the `warm-cache/legacy/trial` reports; any `/spi/trial` in the same bundle is a separate receipt and is not claimed to meet the same criteria. Every legacy report has `benchmark_outcome = "full_win"`, `benchmark_readiness = "ready"`, `answer_quality.madar.passed = true`, `prompt_contract = "followed"`, `answer_contract.runtime_proof.missing_obligations = []`, and consistent workspace-relative evidence paths:
- `docs/benchmarks/suite/results/2026-06-10T07-01-18`
- `docs/benchmarks/suite/results/2026-06-10T07-10-44`
- `docs/benchmarks/suite/results/2026-06-10T07-24-01`
- `docs/benchmarks/suite/results/2026-06-10T07-39-11`
- `docs/benchmarks/suite/results/2026-06-10T08-08-44`
- `docs/benchmarks/suite/results/2026-06-10T08-24-49`

`docs/benchmarks/suite/README.md`, `docs/claims-and-evidence.md`, and `tests/unit/benchmark-suite-docs.test.ts` all point at these bundles.
Test plan
- `npm run typecheck`
- `npm run build`
- `CI=1 npm run test:run`: 188 files, 2539 passed, 1 skipped
- `explain-runtime` rows rerun sequentially with the final binary: 6/6 legacy `full_win`

Summary by CodeRabbit
New Features
Documentation
Bug Fixes