Skip to content

Paired bootstrap CIs and Benjamini-Hochberg FDR correction#36

Open
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/stats-bootstrap-ci-fdr
Open

Paired bootstrap CIs and Benjamini-Hochberg FDR correction#36
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/stats-bootstrap-ci-fdr

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented May 25, 2026

Summary

Closes #21.

Adds a sme/stats/ module with two standard statistical testing tools:

  • paired_bootstrap_ci() — non-parametric confidence interval on per-question paired differences between conditions (Efron & Tibshirani 1993). Seeded for reproducibility. Reports mean difference, CI bounds, and approximate two-tailed p-value.
  • benjamini_hochberg() — BH-FDR correction for multiple comparisons when testing across many categories. Returns adjusted p-values and rejection decisions.

Design decisions:

  • Pure numpy implementation (already a core dep) — no scipy dependency needed
  • Seed-deterministic bootstrap for reproducible CI reports
  • P-value approximation via bootstrap distribution (fraction crossing zero, two-tailed)

Test plan

  • tests/test_stats.py — covers:
    • Identical scores (zero diff, CI contains zero)
    • Clear winner (positive CI, significant p-value)
    • Seed determinism
    • Mismatched lengths (assertion error)
    • Empty p-values, all-significant, mixed significance
    • Adjusted p-value ordering

🫏 Generated with Claude Code

Adds sme/stats/ package with two standard tools for distinguishing
real effects from noise across conditions:

- paired_bootstrap_ci: non-parametric CI on per-question paired
  score differences (Efron & Tibshirani 1993)
- benjamini_hochberg: FDR correction for multiple comparisons
  across categories or conditions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 01:10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds statistical utilities under sme/stats with paired bootstrap confidence intervals and Benjamini–Hochberg FDR correction, plus coverage tests.

Changes:

  • Introduce paired_bootstrap_ci returning a structured CI + approximate p-value result.
  • Introduce benjamini_hochberg returning adjusted p-values and rejection decisions.
  • Add pytest coverage validating expected behavior and determinism.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File Description
sme/stats/bootstrap.py Implements paired bootstrap CI + approximate p-value result container.
sme/stats/fdr.py Implements Benjamini–Hochberg correction + result container.
sme/stats/__init__.py Exposes new stats utilities as package API.
tests/test_stats.py Adds tests for bootstrap CI and BH correction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sme/stats/bootstrap.py Outdated
Returns:
BootstrapCIResult with mean difference and CI bounds
"""
assert len(scores_a) == len(scores_b), "Paired scores must be same length"
Comment thread sme/stats/bootstrap.py
Comment on lines +45 to +55
rng = np.random.RandomState(seed)
a = np.array(scores_a, dtype=float)
b = np.array(scores_b, dtype=float)
diffs = a - b
observed_mean = float(np.mean(diffs))
n = len(diffs)

boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
indices = rng.randint(0, n, size=n)
boot_means[i] = np.mean(diffs[indices])
Comment thread sme/stats/bootstrap.py
Comment on lines +45 to +55
rng = np.random.RandomState(seed)
a = np.array(scores_a, dtype=float)
b = np.array(scores_b, dtype=float)
diffs = a - b
observed_mean = float(np.mean(diffs))
n = len(diffs)

boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
indices = rng.randint(0, n, size=n)
boot_means[i] = np.mean(diffs[indices])
Comment thread sme/stats/bootstrap.py Outdated
ci_upper: float
n_bootstrap: int
confidence_level: float
p_value_approx: float # fraction of bootstrap diffs crossing zero
Comment thread sme/stats/bootstrap.py
Comment on lines +61 to +65
if observed_mean >= 0:
p_approx = float(np.mean(boot_means < 0))
else:
p_approx = float(np.mean(boot_means > 0))
p_approx = min(2 * p_approx, 1.0)
Comment thread tests/test_stats.py
def test_identical_scores_zero_diff(self):
"""Identical paired scores should give CI containing zero."""
scores = [0.8, 0.6, 0.7, 0.9, 0.5]
result = paired_bootstrap_ci(scores, scores)
Comment thread tests/test_stats.py Outdated
"""When A clearly beats B, CI should be entirely positive."""
a = [1.0] * 50
b = [0.0] * 50
result = paired_bootstrap_ci(a, b)
Comment thread tests/test_stats.py Outdated
Comment on lines +31 to +32
r1 = paired_bootstrap_ci(a, b, seed=123)
r2 = paired_bootstrap_ci(a, b, seed=123)
Comment thread sme/stats/fdr.py
Comment on lines +19 to +23
def benjamini_hochberg(
p_values: list[float],
*,
alpha: float = 0.05,
) -> FDRResult:
@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

The right two statistical tools for this work — paired bootstrap percentile is the standard Efron & Tibshirani move for "is this per-question paired delta non-zero," and BH-FDR with the monotonicity enforcement is the correct multiple-comparisons correction for the cross-category readouts. Logic on both looks right.

Two real fixes worth doing before merge:

  1. assertValueError for input validation. Asserts get stripped under python -O, so any deployed reading silently loses validation. The bootstrap function asserts len(a) == len(b); that's the exact case where bad input should raise loudly, not "raise unless someone optimized." Same fix for the BH function — currently has no input validation at all (accepts p-values outside [0,1], NaN, alpha outside (0,1]) and silently produces wrong results.
  2. p-value docstring mismatch in bootstrap.py — comment says "fraction crossing zero" but the code does two-sided (doubling one-tail). Fix the comment.

The bootstrap-loop vectorization Copilot flagged is a real perf nit but not a merge blocker — the loop is only slow for very large n_bootstrap, which is a CI-speed concern not a correctness concern. Setting n_bootstrap=200 in tests (rather than the default 10000) is the simpler short-term fix.

Same family-of-failure note as #35: both fixes here are for the class of silent-wrong-answer bug they themselves contain (asserts that evaporate, no input validation). Once both PRs land, a pre-commit hook scanning for assert outside tests/ would prevent this class structurally — happy to add it if useful.

Replace the bootstrap length assert with an explicit ValueError so
validation survives `python -O` (asserts get stripped under -O, which
would silently skip the check in optimized deployments). Add the
previously-absent input validation to benjamini_hochberg: reject
p-values outside [0,1], NaN, and alpha outside (0,1].

Fix the p_value_approx comment in bootstrap.py to describe the
two-sided computation the code actually performs (2x one-tail, capped
at 1.0) instead of "fraction crossing zero".

Tests cover the new ValueError paths; slow bootstrap tests pinned to
n_bootstrap=200 to keep CI fast.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 27, 2026

Pushed 45fa392:

  1. assertValueError. The bootstrap length-mismatch check now raises (asserts evaporate under python -O). The BH function, which had no validation, now rejects p-values outside [0, 1], NaN, and alpha outside (0, 1].
  2. Docstring. The bootstrap p-value doc now matches the code's two-sided computation (doubled one-tail) instead of claiming "fraction crossing zero".

Added tests for the new ValueError paths (mismatched lengths; out-of-range p-values / alpha / NaN). CI green on 3.10/3.11/3.12.

🫏

@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Verified — assert → ValueError on the bootstrap length check, BH now rejects out-of-range p-values, NaN, and bad alpha, and the docstring matches the two-sided computation the code actually performs. The n_bootstrap=200 pinning on the slow tests is a nice touch — keeps CI fast without losing property coverage. Ship.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Statistical hygiene: bootstrap CIs + BH-FDR correction across all readout deltas

3 participants