Context
PR #15 shipped opt-in extended thinking on FaithfulnessJudge
(v0.1.15, merged 2026-05-15). The spec's stated motivation was:
"Sharper judge → better signal for the eventual
use_native_citations default-flip decision"
The empirical step that validates the motivation — running the
golden query set with thinking-off vs thinking-on and comparing
verdicts — was deferred from the PR test plan and never run.
Until it runs, we don't know whether thinking actually changes
scores enough to justify the token spend, nor whether to make
it the judge's default.
What to do
Run attune-rag-benchmark --with-faithfulness twice against the
golden query set:
- Baseline: thinking off (current default)
- Comparison: thinking on (--thinking, default
--thinking-budget 32768)
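
The two runs differ only in the thinking flags, so the invocations can be sketched as below. This is a minimal sketch: the command and flag names are the ones listed above, but passing the budget explicitly, the helper name build_commands, and the choice to print rather than execute are assumptions, not part of the benchmark tool.

```python
import shlex

# Command and flags as named in this issue.
BASE = ["attune-rag-benchmark", "--with-faithfulness"]


def build_commands(thinking_budget: int = 32768) -> dict[str, list[str]]:
    """Return argv lists for the baseline and comparison runs (hypothetical helper)."""
    return {
        # Baseline: thinking off (current default).
        "baseline": list(BASE),
        # Comparison: thinking on, with the budget spelled out explicitly.
        "thinking": BASE + ["--thinking", "--thinking-budget", str(thinking_budget)],
    }


if __name__ == "__main__":
    for name, argv in build_commands().items():
        # Printed rather than executed, since a real run needs live API access.
        print(f"{name}: {shlex.join(argv)}")
```

Each argv list could be handed to subprocess.run once API credentials are in place.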
Capture for each query:
- Score
- Supported / unsupported claim lists (the verdict shape, not
just the scalar)
- Reasoning
- Latency
- Estimated cost (thinking tokens × Sonnet rate)
Attach the side-by-side table to this issue or to a short
docs/rag/faithfulness-thinking-calibration.md.
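
One way to assemble the side-by-side table from the two runs is sketched below. The JudgeResult record and its field names are hypothetical stand-ins for whatever the benchmark actually emits, and the per-token rate is a caller-supplied placeholder rather than a quoted price.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """One judge verdict for one golden query (field names are hypothetical)."""
    query_id: str
    score: float
    supported: list[str]
    unsupported: list[str]
    reasoning: str
    latency_s: float
    thinking_tokens: int = 0


def estimated_cost(result: JudgeResult, usd_per_million_tokens: float) -> float:
    """Thinking tokens times rate; the rate must be supplied by the caller."""
    return result.thinking_tokens * usd_per_million_tokens / 1_000_000


def side_by_side(baseline, thinking, usd_per_million_tokens):
    """Pair results by query_id and return one row (dict) per golden query."""
    by_id = {r.query_id: r for r in thinking}
    rows = []
    for off in baseline:
        on = by_id[off.query_id]  # KeyError if the thinking run lacks this query
        claims_differ = (
            set(off.supported) != set(on.supported)
            or set(off.unsupported) != set(on.unsupported)
        )
        rows.append({
            "query_id": off.query_id,
            "score_off": off.score,
            "score_on": on.score,
            "unsupported_off": sorted(off.unsupported),
            "unsupported_on": sorted(on.unsupported),
            "claims_differ": claims_differ,  # verdict shape, not just the scalar
            "latency_off_s": off.latency_s,
            "latency_on_s": on.latency_s,
            "cost_on_usd": estimated_cost(on, usd_per_million_tokens),
        })
    return rows
```

The claims_differ column keeps the comparison on the verdict shape: two runs can return the same score while disagreeing on which claims are unsupported.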
Decision the data should drive
Pick exactly one:
- A. Make --thinking the default: if it meaningfully
improves judge agreement (verdicts disagree on ≥10% of
golden queries AND thinking-version aligns better with
hand-labeled ground truth)
- B. Keep opt-in: if thinking changes < 5% of verdicts or
the changes don't track ground truth better
- C. Retire --thinking: if the data shows thinking-mode
scores are noisier (cost without signal); keep the parser
fallback code as it's already proven useful for non-thinking
text-block responses
Pre-committing the matrix here so the result routes cleanly
(per the project's "pre-committed decision matrices survive
contact with data" pattern).
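
The matrix above can be encoded as a small function so the result routes mechanically. This is a sketch under several assumptions: each verdict is reduced to a pass/fail boolean per query, "aligns better" means higher agreement with the hand labels, the noisiness judgment is passed in as a flag because this issue does not define how to measure it, and option C is checked first. All function names are hypothetical.

```python
from typing import Optional


def disagreement_rate(off: dict[str, bool], on: dict[str, bool]) -> float:
    """Fraction of golden queries whose verdict differs between the two runs."""
    if not off or off.keys() != on.keys():
        raise ValueError("both runs must cover the same non-empty query set")
    return sum(off[q] != on[q] for q in off) / len(off)


def accuracy(verdicts: dict[str, bool], truth: dict[str, bool]) -> float:
    """Fraction of verdicts that match the hand-labeled ground truth."""
    return sum(verdicts[q] == truth[q] for q in truth) / len(truth)


def decide(off, on, truth, noisier: bool = False) -> Optional[str]:
    """Return "A", "B", or "C", or None where the matrix is silent."""
    rate = disagreement_rate(off, on)
    better = accuracy(on, truth) > accuracy(off, truth)
    if noisier:
        return "C"  # assumption: noisiness overrides the other two options
    if rate >= 0.10 and better:
        return "A"
    if rate < 0.05 or not better:
        return "B"
    # 5% <= rate < 10% with better alignment: the matrix names no option here.
    return None
```

Returning None for the uncovered band keeps the sketch from making a call the pre-committed matrix does not make.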
Effort
Token spend: ~1.5x the normal --with-faithfulness run (the
thinking pass burns more tokens). Wall time: ~10 min for the
benchmark + 30 min for the write-up. Live API access required.
Acceptance
or C: changelog note for v0.1.16 explaining the calibration
outcome
Related
deferred this calibration)