LongMemEval judge: K-replicate variance with majority voting by jphein · Pull Request #35 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-25T01:08:24Z

Summary

Closes #22.

Adds grade_answer_replicated() to the LongMemEval judge wrapper for characterizing intra-rater stochasticity via K-replicate majority voting.

- grade_answer_replicated(): runs K independent judge calls at temperature=0.3 (configurable) and aggregates via majority vote, excluding ERROR replicates
- Reports agreement_rate, flip_rate, label_counts, and the full replicate trace
- K=1 delegates to grade_answer() at temperature=0.0 (backward compatible)
- Also adds a temperature parameter to grade_answer() (previously hardcoded to 0.0)

Design decisions:

- Default temp=0.3 for K>1 (enough variance to measure without overwhelming the signal)
- ERROR replicates excluded from voting (call failures are not verdicts)
- Rationale taken from the first majority-aligned replicate, for explainability
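The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `grade_replicated_sketch`, the injected `grade_once(temperature)` callable, and the `"label"` / `"rationale"` / `"usage"` result keys are assumptions made for the example, as is the choice to compute `agreement_rate` over non-ERROR replicates and `flip_rate` as its complement.

```python
from collections import Counter


def grade_replicated_sketch(grade_once, replicates=1, temperature=0.3):
    """Run `grade_once` K times and majority-vote the non-ERROR labels.

    `grade_once(temperature)` is a hypothetical callable returning a dict
    with "label", "rationale" and "usage" keys; the real shape may differ.
    """
    if replicates == 1:
        # K=1 keeps the deterministic single-call behaviour.
        return grade_once(temperature=0.0)

    results = [grade_once(temperature=temperature) for _ in range(replicates)]

    # Usage is summed over every call, failed or not, so cost stays accurate.
    total_usage = Counter()
    for r in results:
        total_usage.update(r.get("usage", {}))

    # ERROR replicates are call failures, not verdicts, so they do not vote.
    counts = Counter(r["label"] for r in results if r["label"] != "ERROR")
    base = {"usage": dict(total_usage), "replicates": results}
    if not counts:
        return {**base, "label": "ERROR", "rationale": "",
                "label_counts": {}, "agreement_rate": 0.0, "flip_rate": 1.0}

    # Ties are not handled specially in this sketch.
    majority_label, agreement_count = counts.most_common(1)[0]
    # Assumption: the rate is taken over non-ERROR replicates only.
    agreement_rate = agreement_count / sum(counts.values())
    # Rationale comes from the first replicate that agrees with the majority.
    first_aligned = next(r for r in results if r["label"] == majority_label)
    return {**base, "label": majority_label,
            "rationale": first_aligned.get("rationale", ""),
            "label_counts": dict(counts.most_common()),
            "agreement_rate": agreement_rate,
            "flip_rate": 1.0 - agreement_rate}
```

Injecting the judge call as a callable keeps the sketch testable without a network client, which is also how a fake client can assert that the temperature is forwarded.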

Test plan

tests/test_longmemeval_judge_replicates.py — 14 tests covering:
- K=1 delegation, K>1 majority voting, all-ERROR handling
- Agreement rate computation, flip rate
- Temperature forwarding, usage aggregation
All existing judge tests pass unchanged

🫏 Generated with Claude Code

…22)

Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote.

Returns added diagnostics for K>1:
- replicates: list of per-call results
- label_counts: dict of non-ERROR label -> count
- agreement_rate / flip_rate: fraction matching the majority
- usage: summed across all K calls (cost accounting still accurate)

Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour.

Closes M0nkeyFl0wer#22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a K-replicate judging API to LongMemEval evaluation to measure intra-rater variance, including temperature control, majority-vote aggregation, and summed usage accounting.

Changes:

- Add grade_answer_replicated(...) with majority-vote aggregation (excluding ERROR) and replicate diagnostics.
- Extend grade_answer(...) / _call_openai(...) to accept a configurable temperature.
- Add pytest coverage for K=1 compatibility, voting/usage logic, and error handling; export new API in sme.eval.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `sme/eval/longmemeval_judge.py` | Adds replicated judging + temperature plumbing and aggregation logic. |
| `tests/test_longmemeval_judge_replicates.py` | Introduces tests covering replicated judge behavior and error modes. |
| `sme/eval/__init__.py` | Exposes `grade_answer_replicated` as part of the package API. |


+    label_counts = dict(counter.most_common())
+    majority_label = counter.most_common(1)[0][0]
+    agreement_count = label_counts[majority_label]


+        # still reflects K calls.
+        first = dict(results[0])
+        first["usage"] = total_usage
+        first["replicates"] = results


+        class _Completions:
+            def create(self, *, model, messages, temperature=0.0):
+                outer.calls.append({"model": model, "messages": messages,
+                                    "temperature": temperature})
+                return _fake_response(next(outer._iter))


+
+    counter = Counter(labels)
+    label_counts = dict(counter.most_common())
+    majority_label = counter.most_common(1)[0][0]


M0nkeyFl0wer · 2026-05-25T23:34:25Z

K-replicate variance characterization is exactly what #22 was asking for — the agreement-rate and replicate-trace fields are the right diagnostic surface, and the K=1 → delegate-to-grade_answer shape preserves backward compatibility cleanly. Excluding ERROR replicates from the vote (rather than counting them as a fourth class) is the right call.

One real blocker before merge: Counter.most_common() tie-break is documented as arbitrary. For even K, or split votes at any K, majority_label becomes non-deterministic between runs on the same inputs. That ships a measurement-instability bug into the variance characterization that's supposed to fix measurement instability — same family of failure as the assert issue in #36 (silent behavior masquerading as working code).
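A small sketch of the split-vote problem, assuming a hypothetical `naive_majority` helper that mirrors the `counter.most_common(1)[0][0]` line in the diff. On CPython 3.7 and later, `Counter.most_common()` returns equal counts in first-encountered order, so on a tie the winner follows whichever replicate happened to come back first rather than anything about the labels themselves:

```python
from collections import Counter


def naive_majority(labels):
    """Majority label via Counter.most_common(), with no explicit tie-break."""
    return Counter(labels).most_common(1)[0][0]


# The same 1-1 split, observed in two different replicate orders.
run_a = ["CORRECT", "INCORRECT"]
run_b = ["INCORRECT", "CORRECT"]

print(naive_majority(run_a))  # CORRECT
print(naive_majority(run_b))  # INCORRECT
```

Both runs have identical label counts, yet they report different majority labels, which is the instability an explicit preference order removes.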

Suggest pinning an explicit preference order before merge — e.g., CORRECT > PARTIAL > INCORRECT > ABSTAIN — with a tie-break test covering replicates=2 with a 1–1 split. Once that's in, the all-ERROR-path shape inconsistency Copilot flagged (label_counts / agreement_rate missing on the failure path) can be fixed in the same touch — both are "return-shape-depends-on-input" bugs.

The tie-break decision has spec implications worth flagging in the PR body: any framework reproducing your K-replicate methodology needs to match the tie-break rule or they'll report different majority_label on the same replicate distribution. Worth a sentence in docs/sme_spec_v8.md once pinned.

M0nkeyFl0wer · 2026-05-26T01:48:27Z

Following up on the tie-break point above — I want to be more explicit than my earlier comment suggested: I'd treat this as a merge blocker, not a fix-it-later. Reasoning:

The whole point of the K-replicate work is to characterize and report intra-judge stochasticity. Shipping with a non-deterministic `majority_label` means that exact stochasticity becomes a property of two unobserved variables (the replicate distribution and Python's hash seed at vote time). Two researchers running the same K=4 reading on the same questions can get different `majority_label` columns for any split-vote question — and the variance statistics computed from that label become unreproducible in a way that's invisible without running it twice with `PYTHONHASHSEED` controlled.

Concrete ask before merge: pin an explicit preference order (suggesting `CORRECT > PARTIAL > INCORRECT > ABSTAIN` since it matches the LongMemEval label hierarchy), add a tie-break test with `replicates=2` and a 1-1 split, and add a sentence to `docs/sme_spec_v8.md` naming the rule so reproducers match it.

Once that's in, the all-ERROR-shape inconsistency Copilot flagged is the natural follow-on fix in the same touch — same return-shape-depends-on-input failure mode.

Counter.most_common() tie-break is arbitrary (PYTHONHASHSEED-dependent), so split-vote majority_label was non-reproducible across runs, a measurement-instability bug in the very work meant to characterize judge variance.

- Pin TIE_BREAK_ORDER: CORRECT > PARTIAL > INCORRECT > ABSTAIN (LongMemEval label hierarchy); break majority-vote ties by this order, not most_common().
- Make the all-ERROR return path shape-consistent: label_counts={}, agreement_rate=0.0, flip_rate=1.0 (ERROR still excluded from the vote).
- Add tie-break tests: K=2 1-1 split, K=4 2-2 split, PARTIAL>INCORRECT.
- Document the tie-break rule in docs/sme_spec_v8.md so any framework reproducing the K-replicate methodology matches it.

Resolves Ben's merge blocker on M0nkeyFl0wer#35 (escalated 2026-05-26); addresses M0nkeyFl0wer#22.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-27T14:11:04Z

Blocker resolved in 63fec8c — point by point:

- Pinned tie-break order. TIE_BREAK_ORDER = CORRECT > PARTIAL > INCORRECT > ABSTAIN (the LongMemEval hierarchy). majority_label now comes from a deterministic max on (count, -preference_index); Counter.most_common() is no longer load-bearing for the verdict.
- Tie-break tests. The K=2 1–1 split → CORRECT, plus a K=4 2–2 split and a PARTIAL-vs-INCORRECT tie.
- Spec. Named the rule in docs/sme_spec_v8.md, including the requirement that any framework reproducing the K-replicate methodology use the same order, so majority_label and the variance stats derived from it are reproducible without pinning PYTHONHASHSEED.
- All-ERROR shape. Same touch: the all-ERROR path now returns the consistent shape (label_counts / agreement_rate present); ERROR replicates remain excluded from the vote rather than counted as a label.
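For readers reproducing the rule, the pinned tie-break can be sketched like this. `pick_majority` is a hypothetical name for illustration, and the sketch assumes every non-ERROR label appears in `TIE_BREAK_ORDER` (an unknown label would raise `ValueError`); the committed implementation may be organised differently.

```python
from collections import Counter

# Preference order used to break ties, most preferred first.
TIE_BREAK_ORDER = ["CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN"]


def pick_majority(labels):
    """Return the majority label, breaking ties by TIE_BREAK_ORDER.

    ERROR replicates are dropped before voting. Returns None when no
    replicate produced a verdict (the all-ERROR case).
    """
    counts = Counter(label for label in labels if label != "ERROR")
    if not counts:
        return None
    # Highest count wins; on equal counts the smaller preference index wins,
    # which is why the index is negated inside the max key.
    return max(counts,
               key=lambda label: (counts[label], -TIE_BREAK_ORDER.index(label)))


print(pick_majority(["INCORRECT", "CORRECT"]))  # CORRECT
print(pick_majority(["CORRECT", "INCORRECT"]))  # CORRECT
```

Because the key depends only on the counts and the fixed order, the result no longer changes with replicate order.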

CI green on 3.10/3.11/3.12.

🫏

M0nkeyFl0wer · 2026-05-27T14:56:27Z

Verified — TIE_BREAK_ORDER = [CORRECT, PARTIAL, INCORRECT, ABSTAIN] with the deterministic max keyed on (count, -index) is exactly the shape the blocker was asking for, and the all-ERROR return path now matches the success-path shape (label_counts={}, agreement_rate=0.0, flip_rate=1.0) instead of being a different artifact entirely.

The spec-doc paragraph in sme_spec_v8.md is the part I want to call out specifically — it closes the methodology loop, not just the code loop. The "any framework reproducing the K-replicate methodology must use this same tie-break order" sentence is what makes the reading reproducible across reimplementations, not just stable within this repo. That's the right level of carefulness for measurement work, and exactly the discipline #22 was asking for. Ship.

One forward-looking thing for the calibration work (#46) and not for this PR: the chosen tie-break means a K=2 1–1 CORRECT/INCORRECT split resolves to CORRECT, which biases the headline accuracy up on split votes. That's a defensible choice (CORRECT-first follows the LongMemEval label hierarchy as documented), but the N=50 human-agreement calibration should probably surface (a) how often split votes occur on real corpora and (b) what label the human raters would assign on those same cases — otherwise the variance characterization can look healthier than it is. Worth a methodology note in the calibration writeup when that lands.
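One way the calibration writeup could surface point (a) is to report the split-vote rate alongside headline accuracy under both a CORRECT-first and a CORRECT-last tie-break. The sketch below is only an illustration: `split_vote_report` and its input shape (one `label_counts` dict per question) are assumptions, not part of this PR.

```python
TIE_BREAK_ORDER = ["CORRECT", "PARTIAL", "INCORRECT", "ABSTAIN"]


def majority(counts, order):
    """Majority label for one question, ties broken by `order`."""
    return max(counts, key=lambda label: (counts[label], -order.index(label)))


def is_split(counts):
    """True when two or more labels share the top count."""
    top = max(counts.values())
    return sum(c == top for c in counts.values()) > 1


def split_vote_report(per_question_counts):
    """Split-vote rate plus accuracy under optimistic and pessimistic ties."""
    scored = [c for c in per_question_counts if c]  # skip all-ERROR questions
    n = len(scored)
    if n == 0:
        return {"split_rate": 0.0, "acc_correct_first": 0.0,
                "acc_correct_last": 0.0}
    pessimistic = list(reversed(TIE_BREAK_ORDER))
    return {
        "split_rate": sum(is_split(c) for c in scored) / n,
        "acc_correct_first":
            sum(majority(c, TIE_BREAK_ORDER) == "CORRECT" for c in scored) / n,
        "acc_correct_last":
            sum(majority(c, pessimistic) == "CORRECT" for c in scored) / n,
    }
```

The gap between the two accuracy figures bounds how much the headline number can move purely because of the tie-break choice.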

Copilot AI review requested due to automatic review settings May 25, 2026 01:08

Copilot AI reviewed May 25, 2026


M0nkeyFl0wer mentioned this pull request May 25, 2026

Cross-validate SME categories against LongMemEval / LoCoMo / MemoryBench #9

Open

6 tasks

M0nkeyFl0wer mentioned this pull request May 25, 2026

Paired bootstrap CIs and Benjamini-Hochberg FDR correction #36

Open

1 task

M0nkeyFl0wer mentioned this pull request May 26, 2026

Judge-human agreement calibration on SME corpora (N=50) #46

Open

This was referenced May 27, 2026

Random + oracle retrieval baselines (TREC bounds) #32

Open

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

jphein wants to merge 2 commits into M0nkeyFl0wer:main from techempower-org:feat/judge-replicates

3 participants
