Paired bootstrap CIs and Benjamini-Hochberg FDR correction by jphein · Pull Request #36 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-25T01:10:07Z

Summary

Closes #21.

Adds a sme/stats/ module with two standard statistical testing tools:

paired_bootstrap_ci() — non-parametric confidence interval on per-question paired differences between conditions (Efron & Tibshirani 1993). Seeded for reproducibility. Reports mean difference, CI bounds, and approximate two-tailed p-value.
benjamini_hochberg() — BH-FDR correction for multiple comparisons when testing across many categories. Returns adjusted p-values and rejection decisions.

Design decisions:

Pure numpy implementation (already a core dep) — no scipy dependency needed
Seed-deterministic bootstrap for reproducible CI reports
P-value approximation via bootstrap distribution (fraction crossing zero, two-tailed)

Test plan

tests/test_stats.py — covers:
- Identical scores (zero diff, CI contains zero)
- Clear winner (positive CI, significant p-value)
- Seed determinism
- Mismatched lengths (assertion error)
- Empty p-values, all-significant, mixed significance
- Adjusted p-value ordering

🫏 Generated with Claude Code

Adds sme/stats/ package with two standard tools for distinguishing real effects from noise across conditions: - paired_bootstrap_ci: non-parametric CI on per-question paired score differences (Efron & Tibshirani 1993) - benjamini_hochberg: FDR correction for multiple comparisons across categories or conditions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds statistical utilities under sme/stats with paired bootstrap confidence intervals and Benjamini–Hochberg FDR correction, plus coverage tests.

Changes:

Introduce paired_bootstrap_ci returning a structured CI + approximate p-value result.
Introduce benjamini_hochberg returning adjusted p-values and rejection decisions.
Add pytest coverage validating expected behavior and determinism.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File	Description
`sme/stats/bootstrap.py`	Implements paired bootstrap CI + approximate p-value result container.
`sme/stats/fdr.py`	Implements Benjamini–Hochberg correction + result container.
`sme/stats/__init__.py`	Exposes new stats utilities as package API.
`tests/test_stats.py`	Adds tests for bootstrap CI and BH correction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    Returns:
+        BootstrapCIResult with mean difference and CI bounds
+    """
+    assert len(scores_a) == len(scores_b), "Paired scores must be same length"


+    rng = np.random.RandomState(seed)
+    a = np.array(scores_a, dtype=float)
+    b = np.array(scores_b, dtype=float)
+    diffs = a - b
+    observed_mean = float(np.mean(diffs))
+    n = len(diffs)
+
+    boot_means = np.empty(n_bootstrap)
+    for i in range(n_bootstrap):
+        indices = rng.randint(0, n, size=n)
+        boot_means[i] = np.mean(diffs[indices])


+    rng = np.random.RandomState(seed)
+    a = np.array(scores_a, dtype=float)
+    b = np.array(scores_b, dtype=float)
+    diffs = a - b
+    observed_mean = float(np.mean(diffs))
+    n = len(diffs)
+
+    boot_means = np.empty(n_bootstrap)
+    for i in range(n_bootstrap):
+        indices = rng.randint(0, n, size=n)
+        boot_means[i] = np.mean(diffs[indices])


+    ci_upper: float
+    n_bootstrap: int
+    confidence_level: float
+    p_value_approx: float  # fraction of bootstrap diffs crossing zero


+    if observed_mean >= 0:
+        p_approx = float(np.mean(boot_means < 0))
+    else:
+        p_approx = float(np.mean(boot_means > 0))
+    p_approx = min(2 * p_approx, 1.0)


+    def test_identical_scores_zero_diff(self):
+        """Identical paired scores should give CI containing zero."""
+        scores = [0.8, 0.6, 0.7, 0.9, 0.5]
+        result = paired_bootstrap_ci(scores, scores)


+        """When A clearly beats B, CI should be entirely positive."""
+        a = [1.0] * 50
+        b = [0.0] * 50
+        result = paired_bootstrap_ci(a, b)


+        r1 = paired_bootstrap_ci(a, b, seed=123)
+        r2 = paired_bootstrap_ci(a, b, seed=123)


+def benjamini_hochberg(
+    p_values: list[float],
+    *,
+    alpha: float = 0.05,
+) -> FDRResult:


M0nkeyFl0wer · 2026-05-25T23:34:40Z

The right two statistical tools for this work — paired bootstrap percentile is the standard Efron & Tibshirani move for "is this per-question paired delta non-zero," and BH-FDR with the monotonicity enforcement is the correct multiple-comparisons correction for the cross-category readouts. Logic on both looks right.

Two real fixes worth doing before merge:

assert → ValueError for input validation. Asserts get stripped under python -O, so any deployed reading silently loses validation. The bootstrap function asserts len(a) == len(b); that's the exact case where bad input should raise loudly, not "raise unless someone optimized." Same fix for the BH function — currently has no input validation at all (accepts p-values outside [0,1], NaN, alpha outside (0,1]) and silently produces wrong results.
p-value docstring mismatch in bootstrap.py — comment says "fraction crossing zero" but the code does two-sided (doubling one-tail). Fix the comment.

The bootstrap-loop vectorization Copilot flagged is a real perf nit but not a merge blocker — the loop is only slow for very large n_bootstrap, which is a CI-speed concern not a correctness concern. Setting n_bootstrap=200 in tests (rather than the default 10000) is the simpler short-term fix.

Same family-of-failure note as #35: both fixes here are for the class of silent-wrong-answer bug they themselves contain (asserts that evaporate, no input validation). Once both PRs land, a pre-commit hook scanning for assert outside tests/ would prevent this class structurally — happy to add it if useful.

Replace the bootstrap length assert with an explicit ValueError so validation survives `python -O` (asserts get stripped under -O, which would silently skip the check in optimized deployments). Add the previously-absent input validation to benjamini_hochberg: reject p-values outside [0,1], NaN, and alpha outside (0,1]. Fix the p_value_approx comment in bootstrap.py to describe the two-sided computation the code actually performs (2x one-tail, capped at 1.0) instead of "fraction crossing zero". Tests cover the new ValueError paths; slow bootstrap tests pinned to n_bootstrap=200 to keep CI fast. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-27T14:11:06Z

Pushed 45fa392:

assert → ValueError. The bootstrap length-mismatch check now raises (asserts evaporate under python -O). The BH function, which had no validation, now rejects p-values outside [0, 1], NaN, and alpha outside (0, 1].
Docstring. The bootstrap p-value doc now matches the code's two-sided computation (doubled one-tail) instead of claiming "fraction crossing zero".

Added tests for the new ValueError paths (mismatched lengths; out-of-range p-values / alpha / NaN). CI green on 3.10/3.11/3.12.

🫏

M0nkeyFl0wer · 2026-05-27T14:56:30Z

Verified — assert → ValueError on the bootstrap length check, BH now rejects out-of-range p-values, NaN, and bad alpha, and the docstring matches the two-sided computation the code actually performs. The n_bootstrap=200 pinning on the slow tests is a nice touch — keeps CI fast without losing property coverage. Ship.

Copilot AI review requested due to automatic review settings May 25, 2026 01:10

Copilot AI reviewed May 25, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 25, 2026

LongMemEval judge: K-replicate variance with majority voting #35

Open

2 tasks

M0nkeyFl0wer mentioned this pull request May 26, 2026

Tier-1 standards integration: ship the five audited items #41

Open

This was referenced May 27, 2026

Random + oracle retrieval baselines (TREC bounds) #32

Open

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Paired bootstrap CIs and Benjamini-Hochberg FDR correction#36

Paired bootstrap CIs and Benjamini-Hochberg FDR correction#36
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/stats-bootstrap-ci-fdr

jphein commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		r1 = paired_bootstrap_ci(a, b, seed=123)
		r2 = paired_bootstrap_ci(a, b, seed=123)

Conversation

jphein commented May 25, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants