Fix benchmark Pages npm links, add git-backed public benchmark rows, and harden suite workspace preparation for public repos. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Publish isolated public explain-runtime receipts for documenso, formbricks, dub, cal-diy, and novu, plus a scoped Twenty benchmark receipt and the docs/tests that link to them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Downgrade non-answer or not-ready compare results to not_measured, require direct evidence for the public explain-runtime gates, surface benchmark outcomes in suite summaries, refresh the final rerun receipts, and remove the superseded invalid 12:xx receipt bundles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
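The downgrade rule described in this commit can be illustrated with a small sketch. The `CompareResult` shape and the `gateOutcome` name are hypothetical and are not taken from the repository; only the `not_measured`, `full_win`, and `ready` values appear in the commit messages.

```typescript
// Hypothetical result shape; field names are assumptions for illustration only.
interface CompareResult {
  outcome: string;    // e.g. "full_win"
  answered: boolean;  // false when the run produced a non-answer
  readiness: string;  // e.g. "ready"
}

// Downgrade any non-answer or not-ready result to "not_measured";
// results that answered and are ready pass through unchanged.
function gateOutcome(result: CompareResult): CompareResult {
  if (!result.answered || result.readiness !== "ready") {
    return { ...result, outcome: "not_measured" };
  }
  return result;
}
```

Returning a copy rather than mutating the input keeps the original result available for the receipt, which is one plausible design choice here rather than a confirmed one.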
Write native_agent-prompt.txt from the baseline prompt instead of the Madar prompt and refresh the published public-repo receipt artifacts to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ime-proof Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
fix: enforce strict runtime proof for benchmarks
Add safe relative graphRoot support to benchmark suite repos so large public monorepo rows can generate, install, warm up, and compare from scoped graph roots instead of oversized repo roots. Also reset unsafe repo-local agent config at both the copied parent workspace and scoped root to preserve benchmark isolation for scoped runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
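A minimal sketch of what "safe relative graphRoot" validation could look like, assuming the rule is that the scoped root must be a relative path that stays inside the copied repo root. The `resolveSafeGraphRoot` name and its error messages are hypothetical; the actual implementation in the suite may differ.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a suite row's graphRoot against the repo root,
// rejecting absolute paths and any path that escapes the repo root.
function resolveSafeGraphRoot(repoRoot: string, graphRoot?: string): string {
  const base = path.resolve(repoRoot);
  if (graphRoot === undefined || graphRoot === "" || graphRoot === ".") {
    return base; // no scoping requested: use the repo root itself
  }
  if (path.isAbsolute(graphRoot)) {
    throw new Error(`graphRoot must be relative: ${graphRoot}`);
  }
  const resolved = path.resolve(base, graphRoot);
  const rel = path.relative(base, resolved);
  // A relative result of ".." or one starting with "../" points outside the repo root.
  if (rel === ".." || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel)) {
    throw new Error(`graphRoot escapes the repo root: ${graphRoot}`);
  }
  return resolved;
}
```

Under these assumptions, a row scoped to `apps/web` resolves inside the copied workspace, while `../other` or an absolute path is rejected before any generate, install, or compare step runs.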
Move runtime Claude/Cursor state out of the tracked isolation fixture, fail fast when the isolated benchmark profile is not authenticated, and print the exact runtime-profile login command required for measured runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow the isolation launcher tests to override the CLI path so CI can exercise the auth-preflight branches without depending on a prebuilt dist tree. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat: support scoped benchmark repo roots
fix: harden partial runtime proof guidance
fix: harden explain-runtime proof gates
- recover out-of-scope runtime-proof obligation evidence: the first missing-obligation recovery loop now materializes graph nodes outside the initial slice scope, matching the phase-recovery loop
- drop dangling stdio relationships: compaction now filters relationships against the retained matched-node ID set before the cap
- keep hand-written top-level lib/ TypeScript source discoverable while still hard-ignoring compiled lib output (js/cjs/mjs/d.ts), restoring dub apps/web/lib middleware evidence
- carry same-turn retrieve persistence, prompt-contract targeting, routing/tool/latency scoring, SPI cache invalidation, Express and nested Next.js SPI detection fixes from review follow-ups
- refresh all six public TypeScript explain-runtime legacy receipts (documenso, formbricks, dub, twenty, cal-diy, novu) with proof-backed full_win bundles generated sequentially from the final binary, and point suite README, claims-and-evidence, and docs tests at them
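The dangling-relationship fix in the commit above hinges on ordering: filter against the retained node IDs first, then apply the cap. A minimal sketch, assuming simple node and relationship shapes (the `GraphNode`, `Relationship`, and `compactRelationships` names are hypothetical):

```typescript
// Hypothetical shapes for illustration; the real graph types are not shown in this PR page.
interface GraphNode { id: string }
interface Relationship { from: string; to: string }

// Keep only relationships whose endpoints both survive compaction,
// and only then truncate to the cap.
function compactRelationships(
  retained: GraphNode[],
  relationships: Relationship[],
  cap: number,
): Relationship[] {
  const retainedIds = new Set(retained.map((node) => node.id));
  return relationships
    .filter((rel) => retainedIds.has(rel.from) && retainedIds.has(rel.to))
    .slice(0, cap);
}
```

If the cap were applied before the filter, dangling entries could use up the budget and valid relationships further down the list would be lost.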
- thread rootPath through runtime-proof recovery so recovered branch steps emit workspace-relative source files (no mixed path formats in receipts)
- use real primary-path boundaries in recovery phase-coverage scoring instead of an empty boundary list
- include focused bash follow-ups in prompt-contract follow-up input extraction, matching focused-call classification
- activate preserveFinalRuntimeEntrypointContextPreview by removing the self-excluding kept-key filter
- decide file-stem uniqueness on normalized ids and disambiguate deterministic collisions (foo-bar.ts vs foo_bar.ts)
- include the module stem in the Express analysis cache validity check
- fail fast when the benchmark suite is missing the built CLI
- regenerate all six public explain-runtime legacy receipts with the final binary; every report is full_win/ready with consistent workspace-relative evidence paths
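The file-stem item above can be sketched as follows. The normalization rule (lowercase, non-alphanumeric runs collapsed to `_`) and the numeric suffix scheme are assumptions chosen for illustration; `normalizeStem` and `assignStemIds` are hypothetical names, not functions from the repository.

```typescript
// Assumed normalization: strip the directory and extension, lowercase,
// and collapse every run of non-alphanumeric characters to "_".
function normalizeStem(file: string): string {
  const base = file.split("/").pop() ?? file;
  const stem = base.replace(/\.[^.]+$/, "");
  return stem.toLowerCase().replace(/[^a-z0-9]+/g, "_");
}

// Assign one id per file. Uniqueness is decided on the normalized id, and
// collisions receive a numeric suffix in sorted-path order so the result
// does not depend on the order in which files were discovered.
function assignStemIds(files: string[]): Map<string, string> {
  const ids = new Map<string, string>();
  const taken = new Set<string>();
  for (const file of [...files].sort()) {
    const id = normalizeStem(file);
    let candidate = id;
    let suffix = 1;
    while (taken.has(candidate)) {
      suffix += 1;
      candidate = `${id}_${suffix}`;
    }
    taken.add(candidate);
    ids.set(file, candidate);
  }
  return ids;
}
```

With this scheme, `foo-bar.ts` and `foo_bar.ts` both normalize to the same id, and the second one in sorted order gets a suffix instead of silently sharing or overwriting the first.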
fix: complete proof-backed public explain-runtime full-win rows
Caution: Review failed. Pull request was closed or merged during review.
Walkthrough
Benchmark receipts, runtime-proof contracts, isolation behavior, and graph/extraction plumbing were updated together. The PR also refreshes explain-runtime benchmark artifacts and tests for multiple repos.
Changes
Benchmark runtime-proof and suite wiring
Sequence Diagram(s)
sequenceDiagram
participant CompareTrace
participant RuntimeProof
participant ContextPack
participant Report
CompareTrace->>RuntimeProof: buildRuntimeProofAssessment(profile, candidates)
RuntimeProof-->>CompareTrace: obligations, missing_obligations
CompareTrace->>ContextPack: buildNativeAgentPrompt(strictRuntimeProof)
CompareTrace->>Report: mergeCompareReportPackWithTraceFollowUps(...)
Report-->>CompareTrace: answer_contract, execution_slice, readiness
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes
mohanagy added a commit that referenced this pull request on Jun 10, 2026: Merge pull request #521 from mohanagy/next
Summary
next line into main for the next stable release.
full_win receipts into the stable branch.
Notes
Verification
Summary by CodeRabbit
New Features
graphRoot) for monorepo configurations.
Documentation
Bug Fixes & Improvements