
LongMemEval judge: K-replicate variance with majority voting #35

Open
jphein wants to merge 2 commits into M0nkeyFl0wer:main from techempower-org:feat/judge-replicates

@jphein
Contributor

@jphein jphein commented May 25, 2026

Summary

Closes #22.

Adds grade_answer_replicated() to the LongMemEval judge wrapper for characterizing intra-rater stochasticity via K-replicate majority voting.

  • grade_answer_replicated() — runs K independent judge calls at temperature=0.3 (configurable), aggregates via majority vote excluding ERROR replicates
  • Reports agreement_rate, flip_rate, label_counts, and full replicate trace
  • K=1 delegates to grade_answer() at temperature=0.0 (backward compatible)
  • Also adds temperature parameter to grade_answer() (was hardcoded 0.0)

Design decisions:

  • Default temp=0.3 for K>1 (enough variance to measure without overwhelming the signal)
  • ERROR replicates excluded from voting (call failures are not verdicts)
  • Rationale from first majority-aligned replicate used for explainability

Test plan

  • tests/test_longmemeval_judge_replicates.py — 14 tests covering:
    • K=1 delegation, K>1 majority voting, all-ERROR handling
    • Agreement rate computation, flip rate
    • Temperature forwarding, usage aggregation
  • All existing judge tests pass unchanged

🫏 Generated with Claude Code

…22)

Adds grade_answer_replicated() to characterize intra-rater LLM-judge
stochasticity. Backward compatible: replicates=1 delegates to
grade_answer with temperature=0.0 (paper setting). For K>1, runs K
independent judge calls at temperature=0.3 by default and aggregates
via majority vote, excluding ERROR replicates from the vote.

Returns added diagnostics for K>1:
- replicates: list of per-call results
- label_counts: dict of non-ERROR label -> count
- agreement_rate / flip_rate: fraction of replicates matching / not matching the majority
- usage: summed across all K calls (cost accounting still accurate)
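
The diagnostics listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `summarize_replicates`, the per-replicate dict keys (`label`, `usage`), the choice of non-ERROR replicates as the denominator, and `flip_rate = 1 - agreement_rate` are all assumptions made for the example.

```python
from collections import Counter


def summarize_replicates(replicates):
    """Aggregate per-call judge results into K-replicate diagnostics.

    Hypothetical helper: each replicate is assumed to be a dict with a
    "label" string and a "usage" dict of integer token counts.
    """
    # ERROR replicates are call failures, not verdicts, so they do not vote.
    labels = [r["label"] for r in replicates if r["label"] != "ERROR"]
    counts = Counter(labels)

    # Usage is summed over every call, including failed ones, so cost
    # accounting still reflects all K calls.
    usage = Counter()
    for r in replicates:
        usage.update(r.get("usage", {}))

    # The top count is the same whichever tied label is chosen as majority,
    # so the rates below do not depend on a tie-break rule.
    top_count = max(counts.values(), default=0)
    agreement_rate = top_count / len(labels) if labels else 0.0

    return {
        "replicates": replicates,
        "label_counts": dict(counts),
        "agreement_rate": agreement_rate,
        "flip_rate": 1.0 - agreement_rate,  # assumed complement of agreement
        "usage": dict(usage),
    }
```

Choosing which label to report as the majority is deliberately left out of this sketch, since that requires a tie-break rule.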

Threads a temperature parameter through _call_openai and grade_answer
so K-replicate calls can opt into stochasticity without changing the
default deterministic behaviour.

Closes M0nkeyFl0wer#22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 01:08

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a K-replicate judging API to LongMemEval evaluation to measure intra-rater variance, including temperature control, majority-vote aggregation, and summed usage accounting.

Changes:

  • Add grade_answer_replicated(...) with majority-vote aggregation (excluding ERROR) and replicate diagnostics.
  • Extend grade_answer(...) / _call_openai(...) to accept a configurable temperature.
  • Add pytest coverage for K=1 compatibility, voting/usage logic, and error handling; export new API in sme.eval.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
sme/eval/longmemeval_judge.py Adds replicated judging + temperature plumbing and aggregation logic.
tests/test_longmemeval_judge_replicates.py Introduces tests covering replicated judge behavior and error modes.
sme/eval/__init__.py Exposes grade_answer_replicated as part of the package API.

Comment on lines +417 to +419
label_counts = dict(counter.most_common())
majority_label = counter.most_common(1)[0][0]
agreement_count = label_counts[majority_label]
Comment thread sme/eval/longmemeval_judge.py Outdated
Comment on lines +410 to +413
# still reflects K calls.
first = dict(results[0])
first["usage"] = total_usage
first["replicates"] = results
Comment on lines +39 to +43
class _Completions:
    def create(self, *, model, messages, temperature=0.0):
        outer.calls.append({"model": model, "messages": messages,
                            "temperature": temperature})
        return _fake_response(next(outer._iter))
Comment thread sme/eval/longmemeval_judge.py Outdated

counter = Counter(labels)
label_counts = dict(counter.most_common())
majority_label = counter.most_common(1)[0][0]
@M0nkeyFl0wer
Owner

K-replicate variance characterization is exactly what #22 was asking for — the agreement-rate and replicate-trace fields are the right diagnostic surface, and the K=1 → delegate-to-grade_answer shape preserves backward compatibility cleanly. Excluding ERROR replicates from the vote (rather than counting them as a fourth class) is the right call.

One real blocker before merge: Counter.most_common() applies no stated preference on ties; on Python 3.7+ it keeps equal counts in first-encountered order, so the winner depends on which label a replicate happened to return first. For even K, or split votes at any K, majority_label can differ between runs that produce identical label counts on the same inputs. That ships a measurement-instability bug into the variance characterization that's supposed to fix measurement instability, the same family of failure as the assert issue in #36 (silent behavior masquerading as working code).

Suggest pinning an explicit preference order before merge — e.g., CORRECT > PARTIAL > INCORRECT > ABSTAIN — with a tie-break test covering replicates=2 with a 1–1 split. Once that's in, the all-ERROR-path shape inconsistency Copilot flagged (label_counts / agreement_rate missing on the failure path) can be fixed in the same touch — both are "return-shape-depends-on-input" bugs.

The tie-break decision has spec implications worth flagging in the PR body: any framework reproducing your K-replicate methodology needs to match the tie-break rule or they'll report different majority_label on the same replicate distribution. Worth a sentence in docs/sme_spec_v8.md once pinned.
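
To make the split-vote concern concrete, here is a small sketch of how a vote built on `Counter.most_common()` behaves on a 1-1 split. On Python 3.7 and later, equal counts are returned in first-encountered order, so two runs with the same label counts can report different winners. The helper name `naive_majority` is hypothetical and only mirrors the `counter.most_common(1)[0][0]` expression quoted in the review.

```python
from collections import Counter


def naive_majority(labels):
    """Pick the top label the way counter.most_common(1)[0][0] does."""
    # Among equal counts, most_common() keeps first-encountered order,
    # so the result depends on replicate order, not only on the counts.
    return Counter(labels).most_common(1)[0][0]


# Two runs with identical label counts but different replicate order.
run_a = ["CORRECT", "INCORRECT"]
run_b = ["INCORRECT", "CORRECT"]

assert Counter(run_a) == Counter(run_b)  # same distribution
print(naive_majority(run_a))  # CORRECT
print(naive_majority(run_b))  # INCORRECT
```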

@M0nkeyFl0wer
Owner

Following up on the tie-break point above — I want to be more explicit than my earlier comment suggested: I'd treat this as a merge blocker, not a fix-it-later. Reasoning:

The whole point of the K-replicate work is to characterize and report intra-judge stochasticity. Shipping with an order-dependent `majority_label` means that exact stochasticity becomes a property of two variables, one of them unreported (the replicate distribution and the order in which the replicate labels came back, which is what `Counter.most_common()` falls back to on ties). Two researchers running the same K=4 reading on the same questions can get different `majority_label` columns for any split-vote question, even with matching `label_counts`, and the variance statistics computed from that label become unreproducible in a way that's invisible unless the ordered replicate traces are compared.

Concrete ask before merge: pin an explicit preference order (suggesting `CORRECT > PARTIAL > INCORRECT > ABSTAIN` since it matches the LongMemEval label hierarchy), add a tie-break test with `replicates=2` and a 1-1 split, and add a sentence to `docs/sme_spec_v8.md` naming the rule so reproducers match it.

Once that's in, the all-ERROR-shape inconsistency Copilot flagged is the natural follow-on fix in the same touch — same return-shape-depends-on-input failure mode.

Counter.most_common() breaks ties by first-encountered order, so split-vote
majority_label depended on replicate return order and was non-reproducible
across runs: a measurement-instability bug in the very work meant to
characterize judge variance.

- Pin TIE_BREAK_ORDER: CORRECT > PARTIAL > INCORRECT > ABSTAIN (LongMemEval
  label hierarchy); break majority-vote ties by this order, not most_common().
- Make the all-ERROR return path shape-consistent: label_counts={},
  agreement_rate=0.0, flip_rate=1.0 (ERROR still excluded from the vote).
- Add tie-break tests: K=2 1-1 split, K=4 2-2 split, PARTIAL>INCORRECT.
- Document the tie-break rule in docs/sme_spec_v8.md so any framework
  reproducing the K-replicate methodology matches it.
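
A minimal sketch of the two changes described in the list above, assuming a hypothetical `aggregate` helper that receives the raw replicate labels. The `majority_label` key and the use of `"ERROR"` as the all-ERROR verdict are assumptions for illustration; only the preference order and the `label_counts={}`, `agreement_rate=0.0`, `flip_rate=1.0` values come from the commit message.

```python
from collections import Counter

# Preference order named in the commit message; earlier entries win ties.
TIE_BREAK_ORDER = ["CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN"]


def aggregate(labels):
    """Majority vote with a pinned tie-break and a consistent return shape.

    Assumes every non-ERROR label appears in TIE_BREAK_ORDER.
    """
    votes = [lab for lab in labels if lab != "ERROR"]  # ERROR never votes
    if not votes:
        # All-ERROR path: same keys as the success path.
        # The "ERROR" majority_label here is an assumption of this sketch.
        return {"majority_label": "ERROR", "label_counts": {},
                "agreement_rate": 0.0, "flip_rate": 1.0}

    counts = Counter(votes)
    # Highest count wins; among equal counts, the label that appears
    # earliest in TIE_BREAK_ORDER wins.
    majority = max(
        counts,
        key=lambda lab: (counts[lab], -TIE_BREAK_ORDER.index(lab)),
    )
    agreement = counts[majority] / len(votes)
    return {"majority_label": majority, "label_counts": dict(counts),
            "agreement_rate": agreement, "flip_rate": 1.0 - agreement}
```

Because both paths return the same keys, callers can read `agreement_rate` or `label_counts` without first checking whether any replicate succeeded.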

Resolves Ben's merge blocker on M0nkeyFl0wer#35
(escalated 2026-05-26); addresses M0nkeyFl0wer#22.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jphein
Contributor Author

jphein commented May 27, 2026

Blocker resolved in 63fec8c — point by point:

  1. Pinned tie-break order. TIE_BREAK_ORDER = CORRECT > PARTIAL > INCORRECT > ABSTAIN (the LongMemEval hierarchy). majority_label now comes from a deterministic max on (count, -preference_index); Counter.most_common() is no longer load-bearing for the verdict.
  2. Tie-break tests. The K=2 1–1 split → CORRECT, plus a K=4 2–1–1 top-tie and a PARTIAL-vs-INCORRECT tie.
  3. Spec. Named the rule in docs/sme_spec_v8.md, including the requirement that any framework reproducing the K-replicate methodology use the same order, so majority_label and the variance stats derived from it are reproducible from the label counts alone, regardless of replicate order.
  4. All-ERROR shape. Same touch: the all-ERROR path now returns the consistent shape (label_counts / agreement_rate present); ERROR replicates remain excluded from the vote rather than counted as a label.
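
The tie-break tests in point 2 might look roughly like the pytest-style sketch below. The helper `pick_majority` is a hypothetical stand-in for the vote inside `grade_answer_replicated()`; the actual tests in the PR go through the judge wrapper and are not reproduced here.

```python
from collections import Counter

TIE_BREAK_ORDER = ["CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN"]


def pick_majority(labels):
    """Deterministic max keyed on (count, -preference_index)."""
    counts = Counter(labels)
    return max(
        counts,
        key=lambda lab: (counts[lab], -TIE_BREAK_ORDER.index(lab)),
    )


def test_k2_split_resolves_to_correct():
    # The result must not depend on which replicate came back first.
    assert pick_majority(["INCORRECT", "CORRECT"]) == "CORRECT"
    assert pick_majority(["CORRECT", "INCORRECT"]) == "CORRECT"


def test_partial_beats_incorrect_on_tie():
    assert pick_majority(["INCORRECT", "PARTIAL"]) == "PARTIAL"


def test_count_outranks_preference():
    # The preference order only applies among labels with equal counts.
    assert pick_majority(["INCORRECT", "INCORRECT", "CORRECT"]) == "INCORRECT"
```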

CI green on 3.10/3.11/3.12.

🫏

@M0nkeyFl0wer
Owner

Verified — TIE_BREAK_ORDER = [CORRECT, PARTIAL, INCORRECT, ABSTAIN] with the deterministic max keyed on (count, -index) is exactly the shape the blocker was asking for, and the all-ERROR return path now matches the success-path shape (label_counts={}, agreement_rate=0.0, flip_rate=1.0) instead of being a different artifact entirely.

The spec-doc paragraph in sme_spec_v8.md is the part I want to call out specifically — it closes the methodology loop, not just the code loop. The "any framework reproducing the K-replicate methodology must use this same tie-break order" sentence is what makes the reading reproducible across reimplementations, not just stable within this repo. That's the right level of carefulness for measurement work, and exactly the discipline #22 was asking for. Ship.

One forward-looking thing for the calibration work (#46) and not for this PR: the chosen tie-break means a K=2 1–1 CORRECT/INCORRECT split resolves to CORRECT, which biases the headline accuracy up on split votes. That's a defensible choice (CORRECT-first follows the LongMemEval label hierarchy as documented), but the N=50 human-agreement calibration should probably surface (a) how often split votes occur on real corpora and (b) what label the human raters would assign on those same cases — otherwise the variance characterization can look healthier than it is. Worth a methodology note in the calibration writeup when that lands.
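
One way the calibration writeup could surface point (a) is sketched below: measure how often the top count is tied, and bracket headline accuracy by resolving ties CORRECT-first versus CORRECT-last. The function names and the toy input in the usage note are hypothetical; the sketch assumes each question has at least one non-ERROR label drawn from the four-label order.

```python
from collections import Counter

CORRECT_FIRST = ["CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN"]


def majority(labels, order):
    """Majority vote; ties go to the label earliest in `order`."""
    counts = Counter(labels)
    return max(counts, key=lambda lab: (counts[lab], -order.index(lab)))


def is_split(labels):
    """True when two or more labels share the top count."""
    counts = Counter(labels)
    top = max(counts.values())
    return sum(1 for c in counts.values() if c == top) > 1


def split_vote_summary(questions):
    """Split-vote frequency plus accuracy under two opposite tie-breaks."""
    n = len(questions)
    correct_last = list(reversed(CORRECT_FIRST))
    return {
        "split_rate": sum(is_split(q) for q in questions) / n,
        "accuracy_correct_first": sum(
            majority(q, CORRECT_FIRST) == "CORRECT" for q in questions) / n,
        "accuracy_correct_last": sum(
            majority(q, correct_last) == "CORRECT" for q in questions) / n,
    }
```

On a made-up input of four K=2 questions, `[["CORRECT", "CORRECT"], ["CORRECT", "INCORRECT"], ["INCORRECT", "INCORRECT"], ["INCORRECT", "CORRECT"]]`, the summary gives a split rate of 0.5 and accuracies of 0.75 and 0.25. The gap between the two accuracies is the share of the headline number that rests on the tie-break rule rather than on the judge.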

Development

Successfully merging this pull request may close these issues.

Judge variance: characterize intra-rater LLM-judge stochasticity, K-replicate readings

3 participants