LongMemEval judge: K-replicate variance with majority voting #35
…22)

Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote.

Returns added diagnostics for K>1:
- replicates: list of per-call results
- label_counts: dict of non-ERROR label -> count
- agreement_rate / flip_rate: fraction matching the majority
- usage: summed across all K calls (cost accounting still accurate)

Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour.

Closes M0nkeyFl0wer#22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
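The diagnostics listed in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper name `summarize_replicates`, the result keys `label` and `usage`, and the token-count key names are assumptions, as is the choice to compute `agreement_rate` over non-ERROR replicates only and to treat `flip_rate` as its complement. The sketch deliberately leaves out the choice of majority label.

```python
from collections import Counter


def summarize_replicates(results):
    """Aggregate K judge results: count non-ERROR labels, sum usage over all calls."""
    # Usage is summed over every call, ERROR included, so cost stays accurate.
    usage = Counter()
    for result in results:
        usage.update(result.get("usage", {}))

    # ERROR replicates are excluded from the vote.
    counts = Counter(r["label"] for r in results if r["label"] != "ERROR")
    voted = sum(counts.values())
    top = max(counts.values(), default=0)

    # Assumption: agreement is measured over non-ERROR replicates only,
    # and flip_rate is its complement.
    agreement_rate = top / voted if voted else 0.0
    return {
        "label_counts": dict(counts),
        "agreement_rate": agreement_rate,
        "flip_rate": 1.0 - agreement_rate,
        "usage": dict(usage),
        "replicates": results,
    }
```

With K=4 and one ERROR replicate, the ERROR call still contributes to `usage` but not to `label_counts` or the rates.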
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a K-replicate judging API to LongMemEval evaluation to measure intra-rater variance, including temperature control, majority-vote aggregation, and summed usage accounting.
Changes:
- Add `grade_answer_replicated(...)` with majority-vote aggregation (excluding `ERROR`) and replicate diagnostics.
- Extend `grade_answer(...)` / `_call_openai(...)` to accept a configurable `temperature`.
- Add pytest coverage for K=1 compatibility, voting/usage logic, and error handling; export new API in `sme.eval`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `sme/eval/longmemeval_judge.py` | Adds replicated judging + temperature plumbing and aggregation logic. |
| `tests/test_longmemeval_judge_replicates.py` | Introduces tests covering replicated judge behavior and error modes. |
| `sme/eval/__init__.py` | Exposes `grade_answer_replicated` as part of the package API. |
    label_counts = dict(counter.most_common())
    majority_label = counter.most_common(1)[0][0]
    agreement_count = label_counts[majority_label]

    # still reflects K calls.
    first = dict(results[0])
    first["usage"] = total_usage
    first["replicates"] = results

    class _Completions:
        def create(self, *, model, messages, temperature=0.0):
            outer.calls.append({"model": model, "messages": messages,
                                "temperature": temperature})
            return _fake_response(next(outer._iter))

    counter = Counter(labels)
    label_counts = dict(counter.most_common())
    majority_label = counter.most_common(1)[0][0]
K-replicate variance characterization is exactly what #22 was asking for: the agreement-rate and replicate-trace fields are the right diagnostic surface, and the K=1 → delegate-to-

One real blocker before merge:

Suggest pinning an explicit preference order before merge, e.g.,

The tie-break decision has spec implications worth flagging in the PR body: any framework reproducing your K-replicate methodology needs to match the tie-break rule or they'll report different

Following up on the tie-break point above: I want to be more explicit than my earlier comment suggested. I'd treat this as a merge blocker, not a fix-it-later.

Reasoning: The whole point of the K-replicate work is to characterize and report intra-judge stochasticity. Shipping with a non-deterministic `majority_label` means that exact stochasticity becomes a property of two unobserved variables (the replicate distribution and Python's hash seed at vote time). Two researchers running the same K=4 reading on the same questions can get different `majority_label` columns for any split-vote question, and the variance statistics computed from that label become unreproducible in a way that's invisible without running it twice with `PYTHONHASHSEED` controlled.

Concrete ask before merge: pin an explicit preference order (suggesting `CORRECT > PARTIAL > INCORRECT > ABSTAIN` since it matches the LongMemEval label hierarchy), add a tie-break test with `replicates=2` and a 1-1 split, and add a sentence to `docs/sme_spec_v8.md` naming the rule so reproducers match it.

Once that's in, the all-ERROR-shape inconsistency Copilot flagged is the natural follow-on fix in the same touch: same return-shape-depends-on-input failure mode.
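A small sketch of why a split vote is not pinned by `Counter.most_common()` alone. One caveat, stated as far as the CPython documentation goes: since Python 3.7, `most_common()` orders equal counts by first encounter, so in this sketch the tie resolves by the order in which replicates came back. Whichever mechanism applies in a given environment, the vote itself does not fix the label. The variable names are illustrative only.

```python
from collections import Counter

# The same 1-1 split, with the replicates returned in opposite orders.
run_a = ["CORRECT", "INCORRECT"]
run_b = ["INCORRECT", "CORRECT"]

winner_a = Counter(run_a).most_common(1)[0][0]
winner_b = Counter(run_b).most_common(1)[0][0]

# Identical label counts, different "majority" label.
print(winner_a, winner_b)
```

Both runs have the same `label_counts`, yet they report different majority labels, which is the reproducibility gap an explicit preference order closes.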
Counter.most_common() tie-break is arbitrary (PYTHONHASHSEED-dependent), so
split-vote majority_label was non-reproducible across runs — a measurement-
instability bug in the very work meant to characterize judge variance.
- Pin TIE_BREAK_ORDER: CORRECT > PARTIAL > INCORRECT > ABSTAIN (LongMemEval
label hierarchy); break majority-vote ties by this order, not most_common().
- Make the all-ERROR return path shape-consistent: label_counts={},
agreement_rate=0.0, flip_rate=1.0 (ERROR still excluded from the vote).
- Add tie-break tests: K=2 1-1 split, K=4 2-2 split, PARTIAL>INCORRECT.
- Document the tie-break rule in docs/sme_spec_v8.md so any framework
reproducing the K-replicate methodology matches it.
Resolves Ben's merge blocker on M0nkeyFl0wer#35
(escalated 2026-05-26); addresses M0nkeyFl0wer#22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
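The pinned tie-break described in this commit message could look roughly like the sketch below. Only `TIE_BREAK_ORDER` and its ordering come from the commit message; the function name `pick_majority_label`, the `None` return for the all-ERROR case, and the fallback for labels outside the pinned order are assumptions made for illustration.

```python
from collections import Counter

# Preference order named in the commit message; an earlier label wins a tie.
TIE_BREAK_ORDER = ("CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN")


def pick_majority_label(labels):
    """Return the majority non-ERROR label, breaking ties by TIE_BREAK_ORDER."""
    counts = Counter(label for label in labels if label != "ERROR")
    if not counts:
        # Assumption: signal "no vote took place" with None.
        return None

    top = max(counts.values())
    tied = {label for label, n in counts.items() if n == top}
    for label in TIE_BREAK_ORDER:
        if label in tied:
            return label

    # Assumption: labels outside the pinned order fall back to sorted order,
    # which is still independent of replicate arrival order.
    return min(tied)
```

The three tie cases named in the commit message (K=2 1-1 split, K=4 2-2 split, PARTIAL over INCORRECT) can each be checked with a single call, and the result no longer depends on the order of `labels`.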

Blocker resolved in 63fec8c, point by point:

CI green on 3.10/3.11/3.12. 🫏

Verified. The spec-doc paragraph in

One forward-looking thing for the calibration work (#46) and not for this PR: the chosen tie-break means a K=2 1–1 CORRECT/INCORRECT split resolves to CORRECT, which biases the headline accuracy up on split votes. That's a defensible choice (CORRECT-first follows the LongMemEval label hierarchy as documented), but the N=50 human-agreement calibration should probably surface (a) how often split votes occur on real corpora and (b) what label the human raters would assign on those same cases; otherwise the variance characterization can look healthier than it is. Worth a methodology note in the calibration writeup when that lands.
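Point (a) above, how often split votes occur, could be measured with something like the following. This is a hypothetical helper, not part of the PR: it assumes one `label_counts` dict per question (non-ERROR label to count, empty when every replicate errored) and counts a question as split when two or more labels share the top count.

```python
def split_vote_rate(label_counts_per_question):
    """Fraction of questions whose top vote count is shared by two or more labels."""
    if not label_counts_per_question:
        return 0.0

    split = 0
    for counts in label_counts_per_question:
        if not counts:
            continue  # all-ERROR question: no vote took place
        top = max(counts.values())
        if sum(1 for n in counts.values() if n == top) > 1:
            split += 1
    return split / len(label_counts_per_question)
```

Reporting this rate next to the headline accuracy would show how much of that accuracy rests on the CORRECT-first tie-break.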
Summary
Closes #22.
Adds `grade_answer_replicated()` to the LongMemEval judge wrapper for characterizing intra-rater stochasticity via K-replicate majority voting.
- `grade_answer_replicated()`: runs K independent judge calls at temperature=0.3 (configurable), aggregates via majority vote excluding ERROR replicates
- `agreement_rate`, `flip_rate`, `label_counts`, and full replicate trace
- `grade_answer()` at temperature=0.0 (backward compatible)
- `temperature` parameter to `grade_answer()` (was hardcoded 0.0)

Design decisions:

Test plan
`tests/test_longmemeval_judge_replicates.py`: 14 tests covering:

🫏 Generated with Claude Code