eval: run calibration benchmark for FaithfulnessJudge with --thinking (v0.1.15 follow-up) #40

## Context

PR #15 shipped opt-in extended thinking on `FaithfulnessJudge`
(v0.1.15, merged 2026-05-15). The spec's stated motivation was:

> Sharper judge → better signal for the eventual
> `use_native_citations` default-flip decision

The empirical step that validates the motivation — running the
golden query set with thinking-off vs thinking-on and comparing
verdicts — was deferred from the PR test plan and never run.
Until it runs, we don't know whether thinking actually changes
scores enough to justify the token spend, nor whether to make
it the judge's default.

## What to do

Run `attune-rag-benchmark --with-faithfulness` twice against the
golden query set:

1. Baseline: thinking off (current default)
2. Comparison: thinking on (`--thinking`, default
   `--thinking-budget 32768`)

Capture for each query:
- Score
- Supported / unsupported claim lists (the verdict shape, not
  just the scalar)
- Reasoning
- Latency
- Estimated cost (thinking tokens × Sonnet rate)

Attach the side-by-side table to this issue or to a short
`docs/rag/faithfulness-thinking-calibration.md`.
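A minimal sketch of how the two runs could be paired into that table. Everything here is an assumption for illustration: the `JudgeResult` shape, its field names, and the `side_by_side` helper are hypothetical and are not the benchmark's actual output format, and the per-token rate is left as a caller-supplied parameter rather than hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """One judge verdict for one golden query (hypothetical shape)."""
    query_id: str
    score: float
    supported: list = field(default_factory=list)
    unsupported: list = field(default_factory=list)
    reasoning: str = ""
    latency_s: float = 0.0
    thinking_tokens: int = 0


def estimated_cost(result, usd_per_million_tokens):
    """Thinking tokens x rate. The rate is an input, not a known constant."""
    return result.thinking_tokens * usd_per_million_tokens / 1_000_000


def side_by_side(baseline, thinking, usd_per_million_tokens):
    """Render a Markdown table pairing thinking-off and thinking-on by query_id."""
    on_by_id = {r.query_id: r for r in thinking}
    lines = [
        "| Query | Score (off) | Score (on) | Unsupported (off) "
        "| Unsupported (on) | Latency off (s) | Latency on (s) | Est. cost (USD) |",
        "|---|---|---|---|---|---|---|---|",
    ]
    for off in baseline:
        if off.query_id not in on_by_id:
            # Both runs must cover the same golden queries to be comparable.
            raise ValueError(f"query {off.query_id!r} missing from thinking run")
        on = on_by_id[off.query_id]
        cost = estimated_cost(on, usd_per_million_tokens)
        lines.append(
            f"| {off.query_id} | {off.score:.2f} | {on.score:.2f} "
            f"| {len(off.unsupported)} | {len(on.unsupported)} "
            f"| {off.latency_s:.1f} | {on.latency_s:.1f} | {cost:.4f} |"
        )
    return "\n".join(lines)
```

The table shows only claim counts to stay readable; the full supported / unsupported lists and the reasoning text would still need to be attached alongside it, since the verdict shape is part of what is being compared.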

## Decision the data should drive

Pick exactly one:

- **A. Make `--thinking` the default**: if it meaningfully
  improves the judge's agreement with ground truth (thinking-on
  and thinking-off verdicts disagree on ≥10% of golden queries
  AND the thinking-on verdicts align better with hand-labeled
  ground truth)
- **B. Keep opt-in**: if thinking changes <5% of verdicts or
  the changes don't track ground truth better
- **C. Retire `--thinking`**: if the data shows thinking-mode
  scores are noisier (cost without signal); keep the parser
  fallback code, since it has already proven useful for
  non-thinking text-block responses

Pre-committing the matrix here so the result routes cleanly
(per the project's "pre-committed decision matrices survive
contact with data" pattern).
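The matrix above could be encoded as a small routing function so the outcome is mechanical once the counts are in. This is a sketch under stated assumptions: the function name and its inputs are hypothetical, "noisier" is taken as a boolean judged elsewhere because the issue does not define a noise metric, and checking C before A and B is an ordering chosen here, not one the issue specifies.

```python
def route_decision(n_queries, n_changed, on_matches_truth,
                   off_matches_truth, thinking_noisier):
    """Map calibration counts to outcome A, B, or C from the matrix.

    n_changed: golden queries whose verdict differs between the two runs.
    on_matches_truth / off_matches_truth: verdicts matching hand labels.
    thinking_noisier: caller's judgment; no metric is defined in the issue.
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    if thinking_noisier:
        # Assumption: noise overrides the other criteria.
        return "C"
    on_tracks_better = on_matches_truth > off_matches_truth
    # Integer arithmetic avoids float rounding at the thresholds.
    changed_at_least_10pct = 10 * n_changed >= n_queries
    changed_under_5pct = 20 * n_changed < n_queries
    if changed_at_least_10pct and on_tracks_better:
        return "A"
    if changed_under_5pct or not on_tracks_better:
        return "B"
    # Between 5% and 10% changed with better tracking: the written
    # criteria do not cover this band, so report it instead of guessing.
    return "unrouted"
```

For example, `route_decision(100, 12, 90, 80, False)` returns `"A"`, while `route_decision(100, 7, 90, 80, False)` returns `"unrouted"`, which would need a human call.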

## Effort

Token spend: ~1.5x the normal `--with-faithfulness` run (the
thinking pass burns more tokens). Wall time: ~10 min for the
benchmark + 30 min for the write-up. Live API access required.

## Acceptance

- [ ] Side-by-side table committed to repo or attached here
- [ ] Decision (A/B/C) recorded with rationale
- [ ] If A: a tiny follow-up PR that flips the default; if B
  or C: changelog note for v0.1.16 explaining the calibration
  outcome

## Related

- #15 — FaithfulnessJudge extended thinking (the PR that
  deferred this calibration)

