fix: harden explain-runtime proof gates#519
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughIntroduces “runtime proof”: contracts and benchmark profiles, scoring primitives, retrieval/slice selection integration, compare-time readiness/prompt constraints, answer-quality citation checks, prompt-contract follow-up validation, and tests exercising checklist prompting and follow-up targeting. ChangesRuntime Proof Validation & Native-Agent Integration
Sequence DiagramsequenceDiagram
participant Executor as executeNativeAgentCompare
participant Loader as loadBenchmarkRuntimeProofProfiles
participant Readiness as prepareNativeAgentBenchmarkReadiness
participant Prompt as buildNativeAgentPrompt
participant Quality as evaluateNativeAgentAnswerQualityReport
participant Contract as assessNativeAgentPromptContract
Executor->>Loader: load runtime-proof profiles
Executor->>Readiness: pass runtimeProofProfile
Executor->>Prompt: include requiredObligations / missingObligations
Executor->>Quality: check required direct-evidence citations
Executor->>Contract: validate follow-up targets missing obligations
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/runtime/retrieve.ts (1)
3597-3608:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStrict runtime-proof slices can still be marked complete when required obligations are missing.
When
strict_runtime_proofis enabled,runtimeProofComplete === falsejust falls back tophaseCoverage.missing[0]. If generic phase coverage is complete but a required obligation like analytics/redirect/persistence proof is still missing,missingPhasestaysundefined, so this emitsstatus: 'complete'with no boundary reason. That breaks the fail-closed contract and disagrees with theanswer_contract.runtime_proof.missing_obligationsyou already compute downstream.Suggested fix
const runtimeProof = buildRuntimeProofAssessment( runtimeProofProfile, tracedSteps.map((step) => runtimeProofCandidateFromExecutionStep(step)), ) - const runtimeProofComplete = runtimeProofProfile?.strict_runtime_proof === true - && runtimeProof !== undefined - && runtimeProof.missing_obligations.length === 0 - const missingPhase = runtimeProofComplete ? undefined : phaseCoverage.missing[0] - const boundaryReason = missingPhase ? missingExecutionPhaseBoundaryReason(missingPhase) : undefined + const missingRuntimeObligation = runtimeProofProfile?.strict_runtime_proof === true + ? runtimeProof?.missing_obligations[0] + : undefined + const missingPhase = missingRuntimeObligation ? undefined : phaseCoverage.missing[0] + const boundaryReason = missingRuntimeObligation + ? `missing runtime-proof obligation: ${runtimeProofProfile?.obligations.find((o) => o.id === missingRuntimeObligation)?.label ?? missingRuntimeObligation}` + : missingPhase + ? missingExecutionPhaseBoundaryReason(missingPhase) + : undefined const executionSlice: ContextPackExecutionSlice = { - status: missingPhase ? 'partial' : 'complete', + status: missingRuntimeObligation || missingPhase ? 'partial' : 'complete',🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/runtime/retrieve.ts` around lines 3597 - 3608, The code currently computes missingPhase from phaseCoverage only, so when runtimeProofProfile?.strict_runtime_proof is true but runtimeProofComplete is false the slice can still be marked complete; update the logic that derives missingPhase (used to set executionSlice.status and boundary_reason) to also consider runtimeProofComplete: if strict_runtime_proof is true and runtimeProofComplete === false then set missingPhase (or treat as missing) so executionSlice becomes 'partial' and boundary_reason is populated (e.g., reuse missingExecutionPhaseBoundaryReason or create a boundary reason reflecting runtime proof obligations), ensuring runtimeProof, runtimeProofProfile, runtimeProofComplete, phaseCoverage, executionSlice and answer_contract.runtime_proof.missing_obligations remain consistent.
🧹 Nitpick comments (3)
tests/unit/retrieve-slice-v1.test.ts (1)
415-527: ⚡ Quick winAdd one fail-closed runtime-proof regression here.
These cases only cover fully satisfied proofs. Please add a variant where phase coverage is still complete but one strict obligation is absent, and assert
execution_slice.status === 'partial'plus the missing obligation inanswer_contract.runtime_proof. That is the edge this PR summary says should remain fail-closed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/retrieve-slice-v1.test.ts` around lines 415 - 527, Add a fail-closed regression test variant using compactForWithRuntimeProof where strict_runtime_proof is true but one obligation from the provided obligations array is not satisfied by the chosen graph (e.g., use buildFormbricksRuntimeProofGraph or buildTwentyRuntimeProofGraph but omit the persistence/analytics evidence term), then assert that compact.execution_slice?.status === 'partial' and that compact.answer_contract?.runtime_proof?.missing_obligations contains the missing obligation id (and that answer_contract.runtime_proof.obligations includes the others); ensure the test references compactForWithRuntimeProof and the chosen build*RuntimeProofGraph function and checks the exact missing obligation id.tests/unit/compare-native-agent.test.ts (2)
2032-2036: ⚡ Quick winModel intermediate obligations as
handoffto preserve runtime-proof stage semantics.For 3-stage checklists, mapping all non-first items to
terminalskips handoff-specific behavior and can miss regressions in stage-aware scoring/gating. Usehandofffor middle items andterminalonly for the last item.Suggested patch
- kind: index === 0 ? 'entrypoint' : 'terminal', + kind: + index === 0 + ? 'entrypoint' + : index === checklist.length - 1 + ? 'terminal' + : 'handoff',🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/compare-native-agent.test.ts` around lines 2032 - 2036, The checklist mapping currently sets kind to 'entrypoint' for index 0 and 'terminal' for all others, which loses the required 'handoff' semantics for intermediate stages; update the mapping in the checklist.map (the object with id, label, kind, evidence_terms) so that kind is 'entrypoint' when index === 0, 'terminal' when index === checklist.length - 1, and 'handoff' for all other middle indices, keeping id, label, and evidence_terms generation the same.
3073-3083: ⚡ Quick winAssert baseline direct-evidence pass to avoid one-sided false positives.
These tests currently pass as long as Madar fails. Add a baseline
passed: trueassertion so they also catch regressions where strict runtime-proof citation checks fail both arms.Also applies to: 3182-3192
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/compare-native-agent.test.ts` around lines 3073 - 3083, Add an assertion that the baseline arm actually passed direct-evidence checks before asserting Madar failed: check result.reports (the opposing report array entry) for an answer_quality object with passed: true (e.g., expect(result.reports[1]?.answer_quality).toEqual(expect.objectContaining({ passed: true }))) and then keep the existing Madar failure assertion (expect(result.reports[0]?.answer_quality).toEqual(expect.objectContaining({ madar: ... }))). Apply the same added baseline passed: true assertion to the analogous block at lines 3182-3192.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/infrastructure/compare.ts`:
- Around line 1267-1297: The summarizer function summarizeTraceToolInput
currently only inspects a small prioritized key list and drops other object
fields which can cause false negatives; update summarizeTraceToolInput to
perform a bounded recursive fallback: after processing prioritizedKeys, if
collected is empty, iterate over the remaining object values
(Object.values(input)) and recursively summarize them (respecting Array handling
already present), but enforce a max recursion depth and a cheap item/length cap
to prevent deep or huge traversal; ensure the recursion uses a depth parameter
(e.g., depth++), returns '' when depth exceeds the cap, and merges any non-empty
summaries into collected before returning so non-whitelisted keys contribute to
the final string.
In `@src/runtime/runtime-proof.ts`:
- Around line 74-81: The profile parser must enforce the invariant that terminal
obligations include at least two evidence terms so runtime checks in
runtime-proof.ts (the switch on obligation.kind, and functions
runtimeProofHasDirectTerminalSignal/runtimeProofObligationMatchScore) cannot
produce unsatisfiable terminal obligations; add validation in the
profile-parse/validation routine (the function that builds/validates obligation
objects, e.g., parse/validateProfile or similar) to reject or normalize any
obligation with obligation.kind === 'terminal' that has fewer than 2 terms,
returning a clear parse/validation error so such profiles never reach runtime.
---
Outside diff comments:
In `@src/runtime/retrieve.ts`:
- Around line 3597-3608: The code currently computes missingPhase from
phaseCoverage only, so when runtimeProofProfile?.strict_runtime_proof is true
but runtimeProofComplete is false the slice can still be marked complete; update
the logic that derives missingPhase (used to set executionSlice.status and
boundary_reason) to also consider runtimeProofComplete: if strict_runtime_proof
is true and runtimeProofComplete === false then set missingPhase (or treat as
missing) so executionSlice becomes 'partial' and boundary_reason is populated
(e.g., reuse missingExecutionPhaseBoundaryReason or create a boundary reason
reflecting runtime proof obligations), ensuring runtimeProof,
runtimeProofProfile, runtimeProofComplete, phaseCoverage, executionSlice and
answer_contract.runtime_proof.missing_obligations remain consistent.
---
Nitpick comments:
In `@tests/unit/compare-native-agent.test.ts`:
- Around line 2032-2036: The checklist mapping currently sets kind to
'entrypoint' for index 0 and 'terminal' for all others, which loses the required
'handoff' semantics for intermediate stages; update the mapping in the
checklist.map (the object with id, label, kind, evidence_terms) so that kind is
'entrypoint' when index === 0, 'terminal' when index === checklist.length - 1,
and 'handoff' for all other middle indices, keeping id, label, and
evidence_terms generation the same.
- Around line 3073-3083: Add an assertion that the baseline arm actually passed
direct-evidence checks before asserting Madar failed: check result.reports (the
opposing report array entry) for an answer_quality object with passed: true
(e.g.,
expect(result.reports[1]?.answer_quality).toEqual(expect.objectContaining({
passed: true }))) and then keep the existing Madar failure assertion
(expect(result.reports[0]?.answer_quality).toEqual(expect.objectContaining({
madar: ... }))). Apply the same added baseline passed: true assertion to the
analogous block at lines 3182-3192.
In `@tests/unit/retrieve-slice-v1.test.ts`:
- Around line 415-527: Add a fail-closed regression test variant using
compactForWithRuntimeProof where strict_runtime_proof is true but one obligation
from the provided obligations array is not satisfied by the chosen graph (e.g.,
use buildFormbricksRuntimeProofGraph or buildTwentyRuntimeProofGraph but omit
the persistence/analytics evidence term), then assert that
compact.execution_slice?.status === 'partial' and that
compact.answer_contract?.runtime_proof?.missing_obligations contains the missing
obligation id (and that answer_contract.runtime_proof.obligations includes the
others); ensure the test references compactForWithRuntimeProof and the chosen
build*RuntimeProofGraph function and checks the exact missing obligation id.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e24fb832-a3e3-4eb1-8f42-f29f1da7b636
📒 Files selected for processing (12)
docs/benchmarks/suite/runtime-proof.jsonsrc/contracts/context-pack.tssrc/contracts/runtime-proof.tssrc/infrastructure/benchmark/runtime-proof.tssrc/infrastructure/compare.tssrc/runtime/retrieve.tssrc/runtime/retrieve/slicing.tssrc/runtime/runtime-proof.tstests/unit/benchmark-suite.test.tstests/unit/compare-native-agent.test.tstests/unit/compare.test.tstests/unit/retrieve-slice-v1.test.ts
| function summarizeTraceToolInput(input: unknown): string { | ||
| if (typeof input === 'string') { | ||
| return input.trim() | ||
| } | ||
| if (Array.isArray(input)) { | ||
| return input.map((entry) => summarizeTraceToolInput(entry)).filter((value) => value.length > 0).join(' ') | ||
| } | ||
| if (!isRecord(input)) { | ||
| return '' | ||
| } | ||
|
|
||
| const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description'] | ||
| const collected: string[] = [] | ||
| for (const key of prioritizedKeys) { | ||
| const value = input[key] | ||
| if (typeof value === 'string' && value.trim().length > 0) { | ||
| collected.push(value.trim()) | ||
| continue | ||
| } | ||
| if (Array.isArray(value)) { | ||
| const summarized = value.map((entry) => summarizeTraceToolInput(entry)).filter((entry) => entry.length > 0).join(' ') | ||
| if (summarized.length > 0) { | ||
| collected.push(summarized) | ||
| } | ||
| } | ||
| } | ||
| if (collected.length > 0) { | ||
| return collected.join(' ') | ||
| } | ||
| return '' | ||
| } |
There was a problem hiding this comment.
Tool-input summarization is too lossy and can falsely fail strict prompt-contract checks.
This summarizer only inspects a small key whitelist, but downstream targeting validation at Line 4515 relies on these summaries. Inputs using other keys can be dropped and wrongly marked as “did not target missing runtime obligation.”
Suggested fix (add bounded recursive fallback over object values)
function summarizeTraceToolInput(input: unknown): string {
+ return summarizeTraceToolInputInternal(input, 0)
+}
+
+function summarizeTraceToolInputInternal(input: unknown, depth: number): string {
+ if (depth > 3) {
+ return ''
+ }
if (typeof input === 'string') {
return input.trim()
}
if (Array.isArray(input)) {
- return input.map((entry) => summarizeTraceToolInput(entry)).filter((value) => value.length > 0).join(' ')
+ return input.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((value) => value.length > 0).join(' ')
}
if (!isRecord(input)) {
return ''
}
const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description']
const collected: string[] = []
@@
- const summarized = value.map((entry) => summarizeTraceToolInput(entry)).filter((entry) => entry.length > 0).join(' ')
+ const summarized = value.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((entry) => entry.length > 0).join(' ')
if (summarized.length > 0) {
collected.push(summarized)
}
}
}
if (collected.length > 0) {
return collected.join(' ')
}
- return ''
+
+ // Fallback: capture any nested string-bearing fields so targeted follow-up
+ // checks are not coupled to a fixed key list.
+ const fallback = Object.values(input)
+ .map((value) => summarizeTraceToolInputInternal(value, depth + 1))
+ .filter((value) => value.length > 0)
+ return fallback.join(' ')
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function summarizeTraceToolInput(input: unknown): string { | |
| if (typeof input === 'string') { | |
| return input.trim() | |
| } | |
| if (Array.isArray(input)) { | |
| return input.map((entry) => summarizeTraceToolInput(entry)).filter((value) => value.length > 0).join(' ') | |
| } | |
| if (!isRecord(input)) { | |
| return '' | |
| } | |
| const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description'] | |
| const collected: string[] = [] | |
| for (const key of prioritizedKeys) { | |
| const value = input[key] | |
| if (typeof value === 'string' && value.trim().length > 0) { | |
| collected.push(value.trim()) | |
| continue | |
| } | |
| if (Array.isArray(value)) { | |
| const summarized = value.map((entry) => summarizeTraceToolInput(entry)).filter((entry) => entry.length > 0).join(' ') | |
| if (summarized.length > 0) { | |
| collected.push(summarized) | |
| } | |
| } | |
| } | |
| if (collected.length > 0) { | |
| return collected.join(' ') | |
| } | |
| return '' | |
| } | |
| function summarizeTraceToolInput(input: unknown): string { | |
| return summarizeTraceToolInputInternal(input, 0) | |
| } | |
| function summarizeTraceToolInputInternal(input: unknown, depth: number): string { | |
| if (depth > 3) { | |
| return '' | |
| } | |
| if (typeof input === 'string') { | |
| return input.trim() | |
| } | |
| if (Array.isArray(input)) { | |
| return input.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((value) => value.length > 0).join(' ') | |
| } | |
| if (!isRecord(input)) { | |
| return '' | |
| } | |
| const prioritizedKeys = ['question', 'query', 'prompt', 'path', 'paths', 'label', 'source', 'target', 'description'] | |
| const collected: string[] = [] | |
| for (const key of prioritizedKeys) { | |
| const value = input[key] | |
| if (typeof value === 'string' && value.trim().length > 0) { | |
| collected.push(value.trim()) | |
| continue | |
| } | |
| if (Array.isArray(value)) { | |
| const summarized = value.map((entry) => summarizeTraceToolInputInternal(entry, depth + 1)).filter((entry) => entry.length > 0).join(' ') | |
| if (summarized.length > 0) { | |
| collected.push(summarized) | |
| } | |
| } | |
| } | |
| if (collected.length > 0) { | |
| return collected.join(' ') | |
| } | |
| // Fallback: capture any nested string-bearing fields so targeted follow-up | |
| // checks are not coupled to a fixed key list. | |
| const fallback = Object.values(input) | |
| .map((value) => summarizeTraceToolInputInternal(value, depth + 1)) | |
| .filter((value) => value.length > 0) | |
| return fallback.join(' ') | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/infrastructure/compare.ts` around lines 1267 - 1297, The summarizer
function summarizeTraceToolInput currently only inspects a small prioritized key
list and drops other object fields which can cause false negatives; update
summarizeTraceToolInput to perform a bounded recursive fallback: after
processing prioritizedKeys, if collected is empty, iterate over the remaining
object values (Object.values(input)) and recursively summarize them (respecting
Array handling already present), but enforce a max recursion depth and a cheap
item/length cap to prevent deep or huge traversal; ensure the recursion uses a
depth parameter (e.g., depth++), returns '' when depth exceeds the cap, and
merges any non-empty summaries into collected before returning so
non-whitelisted keys contribute to the final string.
| switch (obligation.kind) { | ||
| case 'entrypoint': | ||
| return matchedTerms >= 1 | ||
| case 'handoff': | ||
| return matchedTerms >= 1 && runtimeProofObligationMatchScore(candidate, obligation) >= 4 | ||
| case 'terminal': | ||
| return matchedTerms >= 2 && runtimeProofHasDirectTerminalSignal(candidate) | ||
| } |
There was a problem hiding this comment.
Terminal obligations can become unsatisfiable due to an unvalidated invariant.
Terminal proof currently requires at least two matched evidence terms, but profile validation allows terminal obligations with only one term. That creates impossible obligations and hard fail-closed outcomes.
Suggested fix (enforce the invariant at profile-parse time)
diff --git a/src/infrastructure/benchmark/runtime-proof.ts b/src/infrastructure/benchmark/runtime-proof.ts
@@
- return {
+ const evidenceTerms = parseRuntimeProofStringArray(
+ profileName,
+ `obligations.${entry.id}.evidence_terms`,
+ entry.evidence_terms,
+ )
+ const kind = parseRuntimeProofKind(profileName, entry.id.trim(), entry.kind)
+ if (kind === 'terminal' && evidenceTerms.length < 2) {
+ throw new Error(
+ `Malformed runtime proof profile "${profileName}" obligation "${entry.id}": terminal obligations require at least two evidence_terms`,
+ )
+ }
+ return {
id: entry.id.trim(),
label: entry.label.trim(),
- kind: parseRuntimeProofKind(profileName, entry.id.trim(), entry.kind),
- evidence_terms: parseRuntimeProofStringArray(profileName, `obligations.${entry.id}.evidence_terms`, entry.evidence_terms),
+ kind,
+ evidence_terms: evidenceTerms,
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/runtime/runtime-proof.ts` around lines 74 - 81, The profile parser must
enforce the invariant that terminal obligations include at least two evidence
terms so runtime checks in runtime-proof.ts (the switch on obligation.kind, and
functions runtimeProofHasDirectTerminalSignal/runtimeProofObligationMatchScore)
cannot produce unsatisfiable terminal obligations; add validation in the
profile-parse/validation routine (the function that builds/validates obligation
objects, e.g., parse/validateProfile or similar) to reject or normalize any
obligation with obligation.kind === 'terminal' that has fewer than 2 terms,
returning a clear parse/validation error so such profiles never reach runtime.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Verification
Benchmark status
Summary by CodeRabbit
New Features
Tests