eval: run calibration benchmark for FaithfulnessJudge with --thinking (v0.1.15 follow-up) #40

@silversurfer562

Context

PR #15 shipped opt-in extended thinking on FaithfulnessJudge
(v0.1.15, merged 2026-05-15). The spec's stated motivation was:

Sharper judge → better signal for the eventual
use_native_citations default-flip decision

The empirical step that validates the motivation — running the
golden query set with thinking-off vs thinking-on and comparing
verdicts — was deferred from the PR test plan and never run.
Until it runs, we don't know whether thinking actually changes
scores enough to justify the token spend, nor whether to make
it the judge's default.

What to do

Run attune-rag-benchmark --with-faithfulness twice against the
golden query set:

  1. Baseline: thinking off (current default)
  2. Comparison: thinking on (--thinking, default
    --thinking-budget 32768)
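
The two invocations can be sketched as argument lists. This uses only the flags named in this issue; how the golden query set is selected and where results are written are not specified here, so the sketch leaves them out. `build_commands` is a hypothetical helper, not part of the tool.

```python
import shlex

BENCHMARK = "attune-rag-benchmark"
DEFAULT_THINKING_BUDGET = 32768  # default quoted in this issue


def build_commands(thinking_budget: int = DEFAULT_THINKING_BUDGET) -> dict[str, list[str]]:
    """Return argv lists for the baseline and the thinking-on run."""
    baseline = [BENCHMARK, "--with-faithfulness"]
    comparison = baseline + ["--thinking", "--thinking-budget", str(thinking_budget)]
    return {"baseline": baseline, "comparison": comparison}


if __name__ == "__main__":
    # Print the commands rather than executing them; running them needs live API access.
    for name, argv in build_commands().items():
        print(f"{name}: {shlex.join(argv)}")
```

Each list can be passed to `subprocess.run` once the remaining arguments for the local setup are known.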

Capture for each query:

  • Score
  • Supported / unsupported claim lists (the verdict shape, not
    just the scalar)
  • Reasoning
  • Latency
  • Estimated cost (thinking tokens × Sonnet rate)
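
One possible record shape for the fields above, as a minimal sketch. The class and field names are assumptions chosen for illustration and do not reflect the judge's actual return type. The per-token rate is passed in by the caller because this issue does not state it.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """One judge verdict for one golden query in one mode (hypothetical shape)."""

    query_id: str
    thinking: bool
    score: float
    supported_claims: list[str] = field(default_factory=list)
    unsupported_claims: list[str] = field(default_factory=list)
    reasoning: str = ""
    latency_s: float = 0.0
    thinking_tokens: int = 0

    def estimated_cost(self, rate_per_million_tokens: float) -> float:
        """Thinking tokens times the caller-supplied rate (currency per million tokens)."""
        return self.thinking_tokens * rate_per_million_tokens / 1_000_000
```

Storing the claim lists alongside the score keeps the verdict shape available for the comparison, not only the scalar.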

Attach the side-by-side table to this issue or to a short
docs/rag/faithfulness-thinking-calibration.md.
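
A sketch of how the side-by-side table and the verdict disagreement rate could be produced. It assumes each run is a dict mapping a query id to a dict with `score` and `unsupported` keys, and it assumes two verdicts "disagree" when their unsupported-claim sets differ. Both the data shape and that definition are illustrative choices, not something the spec fixes.

```python
def verdicts_differ(base: dict, think: dict) -> bool:
    """Assumed definition: verdicts differ when the unsupported-claim sets differ."""
    return set(base["unsupported"]) != set(think["unsupported"])


def side_by_side(baseline: dict[str, dict], thinking: dict[str, dict]) -> str:
    """Render a Markdown table over the queries present in both runs."""
    lines = [
        "| Query | Score (off) | Score (on) | Unsupported (off) | Unsupported (on) | Differs |",
        "|---|---|---|---|---|---|",
    ]
    for qid in sorted(baseline.keys() & thinking.keys()):
        b, t = baseline[qid], thinking[qid]
        lines.append(
            f"| {qid} | {b['score']:.2f} | {t['score']:.2f} | "
            f"{len(b['unsupported'])} | {len(t['unsupported'])} | "
            f"{'yes' if verdicts_differ(b, t) else 'no'} |"
        )
    return "\n".join(lines)


def disagreement_rate(baseline: dict[str, dict], thinking: dict[str, dict]) -> float:
    """Fraction of shared queries whose verdicts differ; 0.0 when nothing is shared."""
    shared = baseline.keys() & thinking.keys()
    if not shared:
        return 0.0
    return sum(verdicts_differ(baseline[q], thinking[q]) for q in shared) / len(shared)
```

The table covers only queries present in both runs, so a query that failed in one mode is dropped rather than silently counted as agreement.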

Decision the data should drive

Pick exactly one:

  • A. Make --thinking the default — if it meaningfully
    improves judge agreement (verdicts disagree on ≥10% of
    golden queries AND thinking-version aligns better with
    hand-labeled ground truth)
  • B. Keep opt-in — if thinking changes < 5% of verdicts or
    the changes don't track ground truth better
  • C. Retire --thinking — if the data shows thinking-mode
    scores are noisier (cost without signal); keep the parser
    fallback code as it's already proven useful for non-thinking
    text-block responses
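
The matrix above can be encoded as a small function so the result routes mechanically. This is a sketch under stated assumptions: `decide` and its parameters are hypothetical, "aligns better" is read as strictly higher agreement with the hand-labeled ground truth, the noisiness check is supplied by the caller and evaluated first, and a disagreement rate between 5% and 10% with better alignment is returned as `"undecided"` because none of the three options covers it.

```python
def decide(
    disagreement_rate: float,
    thinking_accuracy: float,
    baseline_accuracy: float,
    thinking_noisier: bool = False,
) -> str:
    """Map calibration results to option A, B, or C (or 'undecided')."""
    # C: thinking-mode scores are noisier (cost without signal).
    if thinking_noisier:
        return "C"
    tracks_better = thinking_accuracy > baseline_accuracy
    # B: few verdicts change, or the changes do not track ground truth better.
    if disagreement_rate < 0.05 or not tracks_better:
        return "B"
    # A: at least 10% of verdicts disagree and thinking aligns better.
    if disagreement_rate >= 0.10:
        return "A"
    # Assumption: the 5% to 10% band with better alignment is not covered by the matrix.
    return "undecided"
```

For example, `decide(0.12, 0.9, 0.8)` selects A, while the same disagreement rate with equal accuracies selects B.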

Pre-committing the matrix here so the result routes cleanly
(per the project's "pre-committed decision matrices survive
contact with data" pattern).

Effort

Token spend: ~1.5x the normal --with-faithfulness run (the
thinking pass burns more tokens). Wall time: ~10 min for the
benchmark + 30 min for the write-up. Live API access required.

Acceptance

  • Side-by-side table committed to repo or attached here
  • Decision (A/B/C) recorded with rationale
  • If A: a tiny follow-up PR that flips the default; if B
    or C: changelog note for v0.1.16 explaining the calibration
    outcome
