fix: harden explain-runtime proof gates by mohanagy · Pull Request #519 · mohanagy/madar

mohanagy · 2026-06-08T12:38:15Z

Summary

- add deterministic runtime-proof profiles for the public explain-runtime benchmark rows
- harden compare-time readiness, prompt-contract, and answer-quality checks around strict runtime-proof obligations
- make retrieval slice-v1 completeness-first for strict runtime-proof rows and add regression coverage for the new fail-closed cases

Verification

npm run typecheck
npm run build
CI=1 npm run test:run

Benchmark status

Local non-isolated warm receipts for the six public explain-runtime rows remain fail-closed rather than tuned wins:
- dub: not_measured
- twenty: not_measured
- formbricks: not_measured
- documenso: not_measured
- cal-diy: not_measured
- novu: not_measured
Common pattern: readiness stays not_ready because required runtime-proof obligations are still missing from retrieval evidence, and answer quality also fails when the answer does not cite direct evidence for at least one required obligation.
This PR does not weaken gates, does not turn not_measured into wins, and does not hide failed rows.
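
The fail-closed readiness behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/infrastructure/compare.ts`: the `Obligation` and `Readiness` shapes, the `assessReadiness` helper, and the obligation ids are all hypothetical names chosen for the example.

```typescript
// Hypothetical, simplified shapes; the real contracts live in src/contracts/runtime-proof.ts.
type ObligationKind = 'entrypoint' | 'handoff' | 'terminal'

interface Obligation {
  id: string
  kind: ObligationKind
}

interface Readiness {
  status: 'ready' | 'not_ready'
  reasons: string[]
}

// Fail closed: any required obligation without retrieval evidence blocks readiness.
function assessReadiness(required: Obligation[], evidencedIds: Set<string>): Readiness {
  const missing = required.filter((obligation) => !evidencedIds.has(obligation.id))
  if (missing.length === 0) {
    return { status: 'ready', reasons: [] }
  }
  return {
    status: 'not_ready',
    reasons: missing.map((obligation) => `missing runtime-proof obligation: ${obligation.id}`),
  }
}

// Illustrative obligations; these ids are not taken from the benchmark suite.
const required: Obligation[] = [
  { id: 'route-entry', kind: 'entrypoint' },
  { id: 'persist-click', kind: 'terminal' },
]
const result = assessReadiness(required, new Set(['route-entry']))
console.log(result.status, result.reasons)
```

Under these assumptions a row only becomes `ready` once every required obligation has evidence, which is why the six rows above stay `not_measured` instead of being scored.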

Summary by CodeRabbit

New Features
- Added runtime-proof benchmark suite and per-question runtime-proof profiles to enforce execution-obligation checks.
- Prompts now include a "Required proof checklist" and targeted follow-up guidance when obligations are missing.
- Retrieval/answers surface a runtime_proof assessment and readiness gating that can mark a question "not_ready" if obligations lack evidence.
- Answer-quality now requires direct-evidence citations for missing runtime obligations.
Tests
- Expanded unit tests to cover runtime-proof loading, readiness behavior, prompt checklist inclusion, and answer-quality enforcement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-06-08T12:38:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 38939d81-2766-4abd-b300-21c38e6e5342

📥 Commits

Reviewing files that changed from the base of the PR and between 55be19c and ac9980f.

📒 Files selected for processing (1)

tests/unit/compare-native-agent.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

tests/unit/compare-native-agent.test.ts

📝 Walkthrough


Introduces “runtime proof”: contracts and benchmark profiles, scoring primitives, retrieval/slice selection integration, compare-time readiness/prompt constraints, answer-quality citation checks, prompt-contract follow-up validation, and tests exercising checklist prompting and follow-up targeting.

Changes

Runtime Proof Validation & Native-Agent Integration

| Layer / File(s) | Summary |
| --- | --- |
| Runtime Proof Data Contracts `src/contracts/runtime-proof.ts`, `src/contracts/context-pack.ts` | Defines obligation kinds (entrypoint/handoff/terminal), profile/obligation/evidence shapes, obligation assessments with required=true, and overall assessment including missing obligation IDs; extends context-pack answer contract with optional `runtime_proof` field. |
| Runtime Proof Scoring & Obligation Assessment `src/runtime/runtime-proof.ts` | Implements evidence-scoring by normalizing candidate text, scoring against obligation evidence terms with kind-specific regex bonuses, detecting direct terminal signals, and building `RuntimeProofAssessment` by deduplicating evidence and reporting missing obligation IDs. |
| Benchmark Profile Loading & Matching `src/infrastructure/benchmark/runtime-proof.ts` | Parses and validates runtime-proof.json configuration files: validates profile structure (prompts, flags, obligations with evidence_terms and kinds), converts to Map, and matches profiles by exact prompt text. |
| Benchmark Suite Configuration `docs/benchmarks/suite/runtime-proof.json` | Defines runtime-proof expectations for six explain-runtime benchmark entries, specifying prompts, obligations (entrypoint/handoff/terminal), and evidence terms for runtime lifecycle coverage validation. |
| Trace Tool Input Enrichment `src/infrastructure/compare.ts` (RAW_TRACE_TOOL_INPUTS symbol and helpers) | Captures and summarizes tool inputs from Claude tool-use events into Madar trace turns, enabling later prompt-contract validation to check that focused follow-ups target missing runtime obligations. |
| Benchmark Readiness with Runtime Proof `src/infrastructure/compare.ts` (readiness assessment path) | Extends readiness assessment to accept optional runtimeProofProfile, compute obligation satisfaction from retrieval results, report missing obligations as reasons, apply SPI evidence gating, and gate on strict-runtime-proof mode applicability. |
| Native Agent Orchestration `src/infrastructure/compare.ts` (executeNativeAgentCompare and helpers) | Loads profiles per question, threads through readiness prep, builds strict-runtime-proof prompt checklists and missing-obligation guidance, enforces required direct-evidence citations in answer quality, and validates that focused follow-ups target missing obligations via prompt-contract assessment. |
| Retrieval Threading & Options `src/runtime/retrieve.ts` (import and RetrieveOptions) | Extends RetrieveOptions with optional runtimeProofProfile and imports runtime-proof helper utilities. |
| Retrieval Scope & Path Scoring `src/runtime/retrieve.ts` (slice expansion, scope augmentation, path scoring) | Augments execution scope by adding graph paths targeted to missing obligations, adjusts execution path scoring to account for covered/missing obligations with entrypoint/terminal bonuses. |
| Retrieval Answer Contract & Slice Building `src/runtime/retrieve.ts` (buildRuntimeGenerationAnswerContract, buildExecutionSlice) | Computes and embeds runtime_proof assessment in answer contract, extends uncertainty notes to cover missing runtime obligations, uses runtime-proof completion for strict mode slice status. |
| Slice Anchor Selection with Runtime Proof Bonus `src/runtime/retrieve/slicing.ts` | Computes runtime-proof-driven anchor bonus from obligation match scores, introduces runtimeProofAnchors pool as fallback candidate source, gates selection based on strict-runtime-proof and broad-runtime-generation applicability. |
| Tests: Configuration & Readiness `tests/unit/benchmark-suite.test.ts`, `tests/unit/compare.test.ts` | Validates profile loading (checks obligation kinds and evidence_terms, verifies flags), and readiness assessment (ready when all obligations have evidence, not_ready when terminal obligations missing). |
| Tests: Native Agent Compare `tests/unit/compare-native-agent.test.ts` | Parametrized checklist tests, answer-quality validation (enforces direct-evidence citations, handles basename splitting), and follow-up targeting tests (validates that follow-ups target missing obligations via prompt-contract violation). |
| Tests: Slice-v1 Retrieval `tests/unit/retrieve-slice-v1.test.ts` | Graph-builder helpers for Dub/Formbricks/Twenty scenarios, validates that execution slices surface correct steps/side-effects and runtime-proof obligations with empty missing_obligations. |
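
To make the profile layer in the table more concrete, here is a rough sketch of what a profile and its exact-prompt lookup might look like. The field names `strict_runtime_proof`, `obligations`, `id`, `label`, `kind`, and `evidence_terms` appear in this PR; the single `prompt` field, the `indexProfiles` helper, and the sample prompt and obligations are assumptions made for illustration and are not copied from `docs/benchmarks/suite/runtime-proof.json`.

```typescript
type RuntimeProofKind = 'entrypoint' | 'handoff' | 'terminal'

interface RuntimeProofObligation {
  id: string
  label: string
  kind: RuntimeProofKind
  evidence_terms: string[]
}

// Assumed shape: one prompt per profile. The real file may group prompts differently.
interface RuntimeProofProfile {
  prompt: string
  strict_runtime_proof: boolean
  obligations: RuntimeProofObligation[]
}

// Hypothetical helper: index profiles by prompt so lookups match on exact text only.
function indexProfiles(profiles: RuntimeProofProfile[]): Map<string, RuntimeProofProfile> {
  return new Map(profiles.map((profile): [string, RuntimeProofProfile] => [profile.prompt, profile]))
}

// Illustrative profile; prompt, ids, and terms are invented for this sketch.
const profiles = indexProfiles([
  {
    prompt: 'Explain how a link click is processed at runtime.',
    strict_runtime_proof: true,
    obligations: [
      { id: 'entry', label: 'Request entrypoint', kind: 'entrypoint', evidence_terms: ['middleware'] },
      { id: 'persist', label: 'Click persistence', kind: 'terminal', evidence_terms: ['recordClick', 'insert'] },
    ],
  },
])

const hit = profiles.get('Explain how a link click is processed at runtime.')
const miss = profiles.get('explain how a link click is processed at runtime.') // differs only in case
console.log(hit?.obligations.length, miss)
```

Exact-text matching keeps profile selection deterministic: a reworded prompt simply gets no profile rather than a fuzzy match.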

Sequence Diagram

sequenceDiagram
  participant Executor as executeNativeAgentCompare
  participant Loader as loadBenchmarkRuntimeProofProfiles
  participant Readiness as prepareNativeAgentBenchmarkReadiness
  participant Prompt as buildNativeAgentPrompt
  participant Quality as evaluateNativeAgentAnswerQualityReport
  participant Contract as assessNativeAgentPromptContract
  Executor->>Loader: load runtime-proof profiles
  Executor->>Readiness: pass runtimeProofProfile
  Executor->>Prompt: include requiredObligations / missingObligations
  Executor->>Quality: check required direct-evidence citations
  Executor->>Contract: validate follow-up targets missing obligations

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

mohanagy/madar#515: Overlaps with strict runtime-proof enforcement changes to buildNativeAgentPrompt and readiness gating logic in src/infrastructure/compare.ts.
mohanagy/madar#518: Modifies strict runtime-proof prompt guidance generation and missing-obligation instruction logic at the same prompt-contract injection points.
mohanagy/madar#411: Related changes to buildNativeAgentPrompt and assessNativeAgentPromptContract in native-agent prompt construction and contract validation.

"I am a rabbit, I hop through code,
Checklist in paw, obligations glowed.
From entry to terminal I sniff for proof,
Citations gathered, no loose hoof.
Hooray — runtime truth, neatly stowed!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: harden explain-runtime proof gates' clearly and specifically summarizes the main change: hardening strict runtime-proof obligation enforcement for explain-runtime benchmarks, which aligns with the primary objectives of adding deterministic profiles and strengthening readiness/quality checks. |
| Description check | ✅ Passed | The pull request description covers the key objectives, verification steps, and benchmark status with appropriate detail. However, it lacks a 'Related issues' section and does not explicitly address the testing and documentation checklist items from the template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/runtime/retrieve.ts (1)

3597-3608: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Strict runtime-proof slices can still be marked complete when required obligations are missing.

When strict_runtime_proof is enabled, runtimeProofComplete === false just falls back to phaseCoverage.missing[0]. If generic phase coverage is complete but a required obligation like analytics/redirect/persistence proof is still missing, missingPhase stays undefined, so this emits status: 'complete' with no boundary reason. That breaks the fail-closed contract and disagrees with the answer_contract.runtime_proof.missing_obligations you already compute downstream.

Suggested fix

   const runtimeProof = buildRuntimeProofAssessment(
     runtimeProofProfile,
     tracedSteps.map((step) => runtimeProofCandidateFromExecutionStep(step)),
   )
-  const runtimeProofComplete = runtimeProofProfile?.strict_runtime_proof === true
-    && runtimeProof !== undefined
-    && runtimeProof.missing_obligations.length === 0
-  const missingPhase = runtimeProofComplete ? undefined : phaseCoverage.missing[0]
-  const boundaryReason = missingPhase ? missingExecutionPhaseBoundaryReason(missingPhase) : undefined
+  const missingRuntimeObligation = runtimeProofProfile?.strict_runtime_proof === true
+    ? runtimeProof?.missing_obligations[0]
+    : undefined
+  const missingPhase = missingRuntimeObligation ? undefined : phaseCoverage.missing[0]
+  const boundaryReason = missingRuntimeObligation
+    ? `missing runtime-proof obligation: ${runtimeProofProfile?.obligations.find((o) => o.id === missingRuntimeObligation)?.label ?? missingRuntimeObligation}`
+    : missingPhase
+      ? missingExecutionPhaseBoundaryReason(missingPhase)
+      : undefined
   const executionSlice: ContextPackExecutionSlice = {
-    status: missingPhase ? 'partial' : 'complete',
+    status: missingRuntimeObligation || missingPhase ? 'partial' : 'complete',

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/retrieve.ts` around lines 3597 - 3608, The code currently
computes missingPhase from phaseCoverage only, so when
runtimeProofProfile?.strict_runtime_proof is true but runtimeProofComplete is
false the slice can still be marked complete; update the logic that derives
missingPhase (used to set executionSlice.status and boundary_reason) to also
consider runtimeProofComplete: if strict_runtime_proof is true and
runtimeProofComplete === false then set missingPhase (or treat as missing) so
executionSlice becomes 'partial' and boundary_reason is populated (e.g., reuse
missingExecutionPhaseBoundaryReason or create a boundary reason reflecting
runtime proof obligations), ensuring runtimeProof, runtimeProofProfile,
runtimeProofComplete, phaseCoverage, executionSlice and
answer_contract.runtime_proof.missing_obligations remain consistent.

🧹 Nitpick comments (3)

tests/unit/retrieve-slice-v1.test.ts (1)

415-527: ⚡ Quick win

Add one fail-closed runtime-proof regression here.

These cases only cover fully satisfied proofs. Please add a variant where phase coverage is still complete but one strict obligation is absent, and assert execution_slice.status === 'partial' plus the missing obligation in answer_contract.runtime_proof. That is the edge this PR summary says should remain fail-closed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/retrieve-slice-v1.test.ts` around lines 415 - 527, Add a
fail-closed regression test variant using compactForWithRuntimeProof where
strict_runtime_proof is true but one obligation from the provided obligations
array is not satisfied by the chosen graph (e.g., use
buildFormbricksRuntimeProofGraph or buildTwentyRuntimeProofGraph but omit the
persistence/analytics evidence term), then assert that
compact.execution_slice?.status === 'partial' and that
compact.answer_contract?.runtime_proof?.missing_obligations contains the missing
obligation id (and that answer_contract.runtime_proof.obligations includes the
others); ensure the test references compactForWithRuntimeProof and the chosen
build*RuntimeProofGraph function and checks the exact missing obligation id.
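
A rough sketch of the edge this regression should pin down is shown here against a stand-in for the status derivation. `deriveSliceStatus` and its input shape are hypothetical and only mirror the suggested fix for `src/runtime/retrieve.ts`; a real test would go through `compactForWithRuntimeProof` and one of the graph builders instead.

```typescript
// Stand-in for the status derivation in the suggested fix; not the code in src/runtime/retrieve.ts.
interface SliceInput {
  strictRuntimeProof: boolean
  missingObligations: string[]
  missingPhases: string[]
}

interface SliceStatus {
  status: 'complete' | 'partial'
  boundaryReason?: string
}

function deriveSliceStatus(input: SliceInput): SliceStatus {
  // In strict mode a missing obligation takes precedence over phase coverage.
  const missingObligation = input.strictRuntimeProof ? input.missingObligations[0] : undefined
  if (missingObligation !== undefined) {
    return { status: 'partial', boundaryReason: `missing runtime-proof obligation: ${missingObligation}` }
  }
  const missingPhase = input.missingPhases[0]
  if (missingPhase !== undefined) {
    return { status: 'partial', boundaryReason: `missing execution phase: ${missingPhase}` }
  }
  return { status: 'complete' }
}

// Phase coverage is complete, but one strict obligation has no evidence.
const failClosed = deriveSliceStatus({ strictRuntimeProof: true, missingObligations: ['persistence'], missingPhases: [] })
// Without strict mode the same input is not gated on obligations.
const nonStrict = deriveSliceStatus({ strictRuntimeProof: false, missingObligations: ['persistence'], missingPhases: [] })
console.log(failClosed.status, nonStrict.status)
```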

tests/unit/compare-native-agent.test.ts (2)

2032-2036: ⚡ Quick win

Model intermediate obligations as handoff to preserve runtime-proof stage semantics.

For 3-stage checklists, mapping all non-first items to terminal skips handoff-specific behavior and can miss regressions in stage-aware scoring/gating. Use handoff for middle items and terminal only for the last item.

Suggested patch

-          kind: index === 0 ? 'entrypoint' : 'terminal',
+          kind:
+            index === 0
+              ? 'entrypoint'
+              : index === checklist.length - 1
+                ? 'terminal'
+                : 'handoff',

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/compare-native-agent.test.ts` around lines 2032 - 2036, The
checklist mapping currently sets kind to 'entrypoint' for index 0 and 'terminal'
for all others, which loses the required 'handoff' semantics for intermediate
stages; update the mapping in the checklist.map (the object with id, label,
kind, evidence_terms) so that kind is 'entrypoint' when index === 0, 'terminal'
when index === checklist.length - 1, and 'handoff' for all other middle indices,
keeping id, label, and evidence_terms generation the same.
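
The position-based mapping in the suggested patch can be pulled out into a small helper, sketched below. `kindForIndex` and the checklist labels are hypothetical names used only for this illustration.

```typescript
type ObligationKind = 'entrypoint' | 'handoff' | 'terminal'

// First item is the entrypoint, last item is terminal, everything between is a handoff.
function kindForIndex(index: number, length: number): ObligationKind {
  if (index === 0) {
    return 'entrypoint'
  }
  return index === length - 1 ? 'terminal' : 'handoff'
}

// Illustrative three-stage checklist.
const checklist = ['route handler', 'queue publish', 'database write']
const kinds = checklist.map((_, index) => kindForIndex(index, checklist.length))
console.log(kinds) // [ 'entrypoint', 'handoff', 'terminal' ]
```

With a two-item checklist this yields `entrypoint` then `terminal`, so existing two-stage cases behave as before.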

3073-3083: ⚡ Quick win

Assert baseline direct-evidence pass to avoid one-sided false positives.

These tests currently pass as long as Madar fails. Add a baseline passed: true assertion so they also catch regressions where strict runtime-proof citation checks fail both arms.

Also applies to: 3182-3192

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/compare-native-agent.test.ts` around lines 3073 - 3083, Add an
assertion that the baseline arm actually passed direct-evidence checks before
asserting Madar failed: check result.reports (the opposing report array entry)
for an answer_quality object with passed: true (e.g.,
expect(result.reports[1]?.answer_quality).toEqual(expect.objectContaining({
passed: true }))) and then keep the existing Madar failure assertion
(expect(result.reports[0]?.answer_quality).toEqual(expect.objectContaining({
madar: ... }))). Apply the same added baseline passed: true assertion to the
analogous block at lines 3182-3192.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e24fb832-a3e3-4eb1-8f42-f29f1da7b636

📥 Commits

Reviewing files that changed from the base of the PR and between 2bee5ef and 55be19c.

📒 Files selected for processing (12)

docs/benchmarks/suite/runtime-proof.json
src/contracts/context-pack.ts
src/contracts/runtime-proof.ts
src/infrastructure/benchmark/runtime-proof.ts
src/infrastructure/compare.ts
src/runtime/retrieve.ts
src/runtime/retrieve/slicing.ts
src/runtime/runtime-proof.ts
tests/unit/benchmark-suite.test.ts
tests/unit/compare-native-agent.test.ts
tests/unit/compare.test.ts
tests/unit/retrieve-slice-v1.test.ts

coderabbitai · 2026-06-08T12:48:22Z

+function summarizeTraceToolInput(input: unknown): string {
+  if (typeof input === 'string') {
+    return input.trim()
+  }
+  if (Array.isArray(input)) {
+    return input.map((entry) => summarizeTraceToolInput(entry)).filter((value) => value.length > 0).join(' ')
+  }
+  if (!isRecord(input)) {
+    return ''
+  }
+
+  const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description']
+  const collected: string[] = []
+  for (const key of prioritizedKeys) {
+    const value = input[key]
+    if (typeof value === 'string' && value.trim().length > 0) {
+      collected.push(value.trim())
+      continue
+    }
+    if (Array.isArray(value)) {
+      const summarized = value.map((entry) => summarizeTraceToolInput(entry)).filter((entry) => entry.length > 0).join(' ')
+      if (summarized.length > 0) {
+        collected.push(summarized)
+      }
+    }
+  }
+  if (collected.length > 0) {
+    return collected.join(' ')
+  }
+  return ''
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tool-input summarization is too lossy and can falsely fail strict prompt-contract checks.

This summarizer only inspects a small key whitelist, but downstream targeting validation at Line 4515 relies on these summaries. Inputs using other keys can be dropped and wrongly marked as “did not target missing runtime obligation.”

Suggested fix (add bounded recursive fallback over object values)

 function summarizeTraceToolInput(input: unknown): string {
+  return summarizeTraceToolInputInternal(input, 0)
+}
+
+function summarizeTraceToolInputInternal(input: unknown, depth: number): string {
+  if (depth > 3) {
+    return ''
+  }
   if (typeof input === 'string') {
     return input.trim()
   }
   if (Array.isArray(input)) {
-    return input.map((entry) => summarizeTraceToolInput(entry)).filter((value) => value.length > 0).join(' ')
+    return input.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((value) => value.length > 0).join(' ')
   }
   if (!isRecord(input)) {
     return ''
   }

   const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description']
   const collected: string[] = []
@@
-      const summarized = value.map((entry) => summarizeTraceToolInput(entry)).filter((entry) => entry.length > 0).join(' ')
+      const summarized = value.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((entry) => entry.length > 0).join(' ')
       if (summarized.length > 0) {
         collected.push(summarized)
       }
     }
   }
   if (collected.length > 0) {
     return collected.join(' ')
   }
-  return ''
+
+  // Fallback: capture any nested string-bearing fields so targeted follow-up
+  // checks are not coupled to a fixed key list.
+  const fallback = Object.values(input)
+    .map((value) => summarizeTraceToolInputInternal(value, depth + 1))
+    .filter((value) => value.length > 0)
+  return fallback.join(' ')
 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change


function summarizeTraceToolInput(input: unknown): string {
  return summarizeTraceToolInputInternal(input, 0)
}

function summarizeTraceToolInputInternal(input: unknown, depth: number): string {
  if (depth > 3) {
    return ''
  }
  if (typeof input === 'string') {
    return input.trim()
  }
  if (Array.isArray(input)) {
    return input.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((value) => value.length > 0).join(' ')
  }
  if (!isRecord(input)) {
    return ''
  }

  const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description']
  const collected: string[] = []
  for (const key of prioritizedKeys) {
    const value = input[key]
    if (typeof value === 'string' && value.trim().length > 0) {
      collected.push(value.trim())
      continue
    }
    if (Array.isArray(value)) {
      const summarized = value.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((entry) => entry.length > 0).join(' ')
      if (summarized.length > 0) {
        collected.push(summarized)
      }
    }
  }
  if (collected.length > 0) {
    return collected.join(' ')
  }

  // Fallback: capture any nested string-bearing fields so targeted follow-up
  // checks are not coupled to a fixed key list.
  const fallback = Object.values(input)
    .map((value) => summarizeTraceToolInputInternal(value, depth + 1))
    .filter((value) => value.length > 0)
  return fallback.join(' ')
}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@src/infrastructure/compare.ts` around lines 1267 - 1297, The summarizer function summarizeTraceToolInput currently only inspects a small prioritized key list and drops other object fields which can cause false negatives; update summarizeTraceToolInput to perform a bounded recursive fallback: after processing prioritizedKeys, if collected is empty, iterate over the remaining object values (Object.values(input)) and recursively summarize them (respecting Array handling already present), but enforce a max recursion depth and a cheap item/length cap to prevent deep or huge traversal; ensure the recursion uses a depth parameter (e.g., depth++), returns '' when depth exceeds the cap, and merges any non-empty summaries into collected before returning so non-whitelisted keys contribute to the final string.

coderabbitai · 2026-06-08T12:48:22Z

+  switch (obligation.kind) {
+    case 'entrypoint':
+      return matchedTerms >= 1
+    case 'handoff':
+      return matchedTerms >= 1 && runtimeProofObligationMatchScore(candidate, obligation) >= 4
+    case 'terminal':
+      return matchedTerms >= 2 && runtimeProofHasDirectTerminalSignal(candidate)
+  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Terminal obligations can become unsatisfiable due to an unvalidated invariant.

Terminal proof currently requires at least two matched evidence terms, but profile validation allows terminal obligations with only one term. That creates impossible obligations and hard fail-closed outcomes.

Suggested fix (enforce the invariant at profile-parse time)

diff --git a/src/infrastructure/benchmark/runtime-proof.ts b/src/infrastructure/benchmark/runtime-proof.ts
@@
-    return {
+    const evidenceTerms = parseRuntimeProofStringArray(
+      profileName,
+      `obligations.${entry.id}.evidence_terms`,
+      entry.evidence_terms,
+    )
+    const kind = parseRuntimeProofKind(profileName, entry.id.trim(), entry.kind)
+    if (kind === 'terminal' && evidenceTerms.length < 2) {
+      throw new Error(
+        `Malformed runtime proof profile "${profileName}" obligation "${entry.id}": terminal obligations require at least two evidence_terms`,
+      )
+    }
+    return {
       id: entry.id.trim(),
       label: entry.label.trim(),
-      kind: parseRuntimeProofKind(profileName, entry.id.trim(), entry.kind),
-      evidence_terms: parseRuntimeProofStringArray(profileName, `obligations.${entry.id}.evidence_terms`, entry.evidence_terms),
+      kind,
+      evidence_terms: evidenceTerms,
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/runtime-proof.ts` around lines 74 - 81, The profile parser must enforce the invariant that terminal obligations include at least two evidence terms so runtime checks in runtime-proof.ts (the switch on obligation.kind, and functions runtimeProofHasDirectTerminalSignal/runtimeProofObligationMatchScore) cannot produce unsatisfiable terminal obligations; add validation in the profile-parse/validation routine (the function that builds/validates obligation objects, e.g., parse/validateProfile or similar) to reject or normalize any obligation with obligation.kind === 'terminal' that has fewer than 2 terms, returning a clear parse/validation error so such profiles never reach runtime.
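
Why a one-term terminal obligation can never pass is easiest to see with the thresholds isolated. The sketch below copies the three thresholds from the switch quoted in this comment, but takes the match score and terminal signal as plain arguments because the real helpers are not reproduced here. `isSatisfied` and `isSatisfiable` are hypothetical names, and the sketch assumes `matchedTerms` can never exceed the number of `evidence_terms` on the obligation.

```typescript
type ObligationKind = 'entrypoint' | 'handoff' | 'terminal'

// Same thresholds as the quoted switch; score and signal are inputs rather than computed.
function isSatisfied(
  kind: ObligationKind,
  matchedTerms: number,
  matchScore: number,
  hasTerminalSignal: boolean,
): boolean {
  switch (kind) {
    case 'entrypoint':
      return matchedTerms >= 1
    case 'handoff':
      return matchedTerms >= 1 && matchScore >= 4
    case 'terminal':
      return matchedTerms >= 2 && hasTerminalSignal
  }
}

// Best case for a profile entry: every evidence term matches and all other signals are present.
// Assumes matchedTerms is bounded by the number of evidence terms.
function isSatisfiable(kind: ObligationKind, evidenceTermCount: number): boolean {
  return isSatisfied(kind, evidenceTermCount, Number.POSITIVE_INFINITY, true)
}

console.log(isSatisfiable('terminal', 1)) // false: at most one term can match
console.log(isSatisfiable('terminal', 2)) // true
console.log(isSatisfiable('handoff', 1)) // true
```

Rejecting such profiles at parse time, as the suggested fix does, turns a silent permanent `not_ready` into an explicit configuration error.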

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: harden explain-runtime proof gates

55be19c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai Bot reviewed Jun 8, 2026


test: relax implement timeout race budget

ac9980f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mohanagy merged commit 92bd488 into next Jun 8, 2026
7 checks passed

mohanagy mentioned this pull request Jun 10, 2026

fix: complete proof-backed public explain-runtime full-win rows #520

Merged

6 tasks

This was referenced Jun 10, 2026

chore: prepare 0.28.0 release #522

Merged

Merge next into main for 0.28.0 release #521

Merged

mohanagy merged 2 commits into next from benchmark-proof-completeness
