Paired bootstrap CIs and Benjamini-Hochberg FDR correction #36
Adds sme/stats/ package with two standard tools for distinguishing real effects from noise across conditions:
- paired_bootstrap_ci: non-parametric CI on per-question paired score differences (Efron & Tibshirani 1993)
- benjamini_hochberg: FDR correction for multiple comparisons across categories or conditions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
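For readers unfamiliar with the procedure, here is a minimal sketch of Benjamini-Hochberg adjustment with monotonicity enforcement. It is not the code in sme/stats/fdr.py: the name `bh_adjust` and its tuple return value are assumptions made for illustration, whereas the PR's function returns an FDRResult.

```python
import numpy as np


def bh_adjust(p_values, alpha=0.05):
    """Hypothetical helper: return (adjusted p-values, reject flags)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return [], []
    order = np.argsort(p)                        # indices of ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: running minimum taken from the largest rank down.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    ranked = np.minimum(ranked, 1.0)             # adjusted p-values cap at 1
    adjusted = np.empty(m)
    adjusted[order] = ranked                     # restore the input order
    return adjusted.tolist(), (adjusted <= alpha).tolist()
```

Rejecting every hypothesis whose adjusted p-value is at most alpha gives the same decisions as the classic step-up rule, which is why the sketch derives the reject flags from the adjusted values.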
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds statistical utilities under sme/stats with paired bootstrap confidence intervals and Benjamini–Hochberg FDR correction, plus coverage tests.
Changes:
- Introduce `paired_bootstrap_ci` returning a structured CI + approximate p-value result.
- Introduce `benjamini_hochberg` returning adjusted p-values and rejection decisions.
- Add pytest coverage validating expected behavior and determinism.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| sme/stats/bootstrap.py | Implements paired bootstrap CI + approximate p-value result container. |
| sme/stats/fdr.py | Implements Benjamini–Hochberg correction + result container. |
| sme/stats/__init__.py | Exposes new stats utilities as package API. |
| tests/test_stats.py | Adds tests for bootstrap CI and BH correction. |
    Returns:
        BootstrapCIResult with mean difference and CI bounds
    """
    assert len(scores_a) == len(scores_b), "Paired scores must be same length"

    rng = np.random.RandomState(seed)
    a = np.array(scores_a, dtype=float)
    b = np.array(scores_b, dtype=float)
    diffs = a - b
    observed_mean = float(np.mean(diffs))
    n = len(diffs)

    boot_means = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        indices = rng.randint(0, n, size=n)
        boot_means[i] = np.mean(diffs[indices])

    ci_upper: float
    n_bootstrap: int
    confidence_level: float
    p_value_approx: float  # fraction of bootstrap diffs crossing zero

    if observed_mean >= 0:
        p_approx = float(np.mean(boot_means < 0))
    else:
        p_approx = float(np.mean(boot_means > 0))
    p_approx = min(2 * p_approx, 1.0)

    def test_identical_scores_zero_diff(self):
        """Identical paired scores should give CI containing zero."""
        scores = [0.8, 0.6, 0.7, 0.9, 0.5]
        result = paired_bootstrap_ci(scores, scores)

        """When A clearly beats B, CI should be entirely positive."""
        a = [1.0] * 50
        b = [0.0] * 50
        result = paired_bootstrap_ci(a, b)

        r1 = paired_bootstrap_ci(a, b, seed=123)
        r2 = paired_bootstrap_ci(a, b, seed=123)

    def benjamini_hochberg(
        p_values: list[float],
        *,
        alpha: float = 0.05,
    ) -> FDRResult:

The right two statistical tools for this work — paired bootstrap percentile is the standard Efron & Tibshirani move for "is this per-question paired delta non-zero," and BH-FDR with the monotonicity enforcement is the correct multiple-comparisons correction for the cross-category readouts. Logic on both looks right. Two real fixes worth doing before merge:
The bootstrap-loop vectorization Copilot flagged is a real perf nit but not a merge blocker: the loop is only slow for very large

Same family-of-failure note as #35: both fixes here are for the class of silent-wrong-answer bug they themselves contain (asserts that evaporate, no input validation). Once both PRs land, a pre-commit hook scanning for
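For reference, the vectorization mentioned above could look roughly like the sketch below. It is an illustration rather than the merged code: the function name, the tuple return value, and the defaults for `n_bootstrap`, `confidence_level`, and `seed` are all assumptions, and the CI is the plain percentile interval.

```python
import numpy as np


def paired_bootstrap_vectorized(scores_a, scores_b, n_bootstrap=10_000,
                                confidence_level=0.95, seed=42):
    """Hypothetical helper: return (mean difference, ci_lower, ci_upper)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Paired scores must be same length")
    rng = np.random.RandomState(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    # A single (n_bootstrap, n) index matrix replaces the Python-level loop.
    indices = rng.randint(0, n, size=(n_bootstrap, n))
    boot_means = diffs[indices].mean(axis=1)
    tail = (1.0 - confidence_level) / 2.0
    ci_lower, ci_upper = np.percentile(boot_means, [100 * tail, 100 * (1 - tail)])
    return float(diffs.mean()), float(ci_lower), float(ci_upper)
```

The trade-off is memory: the index matrix holds n_bootstrap × n integers at once, so the loop may still be preferable for very large inputs. Results for a given seed are not guaranteed to match the looped version bit for bit.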
Replace the bootstrap length assert with an explicit ValueError so validation survives `python -O` (asserts get stripped under -O, which would silently skip the check in optimized deployments).

Add the previously-absent input validation to benjamini_hochberg: reject p-values outside [0,1], NaN, and alpha outside (0,1].

Fix the p_value_approx comment in bootstrap.py to describe the two-sided computation the code actually performs (2x one-tail, capped at 1.0) instead of "fraction crossing zero".

Tests cover the new ValueError paths; slow bootstrap tests pinned to n_bootstrap=200 to keep CI fast.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
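The validation described in this commit message might be sketched as follows. The helper names `validate_paired` and `validate_bh_inputs` and the error messages are hypothetical; the actual commit may inline these checks differently.

```python
import math


def validate_paired(scores_a, scores_b):
    """Hypothetical helper: length check that survives `python -O`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Paired scores must be same length")


def validate_bh_inputs(p_values, alpha):
    """Hypothetical helper: reject bad alpha, NaN, and out-of-range p-values."""
    # A NaN alpha fails the comparison, so it is rejected here as well.
    if not (0.0 < alpha <= 1.0):
        raise ValueError(f"alpha must be in (0, 1], got {alpha}")
    for i, p in enumerate(p_values):
        if math.isnan(p) or not (0.0 <= p <= 1.0):
            raise ValueError(f"p_values[{i}] must be in [0, 1], got {p}")
```

Unlike `assert`, a raised ValueError is not removed when Python runs with -O, so the check cannot silently disappear in optimized deployments.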
Pushed 45fa392:
Added tests for the new 🫏
Verified — |
Summary
Closes #21.
Adds a `sme/stats/` module with two standard statistical testing tools:
- `paired_bootstrap_ci()`: non-parametric confidence interval on per-question paired differences between conditions (Efron & Tibshirani 1993). Seeded for reproducibility. Reports mean difference, CI bounds, and approximate two-tailed p-value.
- `benjamini_hochberg()`: BH-FDR correction for multiple comparisons when testing across many categories. Returns adjusted p-values and rejection decisions.

Design decisions:
Test plan
`tests/test_stats.py` covers:

🫏 Generated with Claude Code