
Conversation

@shntnu (Member) commented Dec 8, 2025

Summary

Implements two-stage hierarchical FDR to reduce over-correction when testing related hypotheses (e.g., multiple doses of the same compound).

  • Add hierarchical_by parameter to mean_average_precision()
  • Stage 1: Use minimum p-value per group, apply BH at group level
  • Stage 2: For significant groups, apply BH within each group

Usage

```python
result = mean_average_precision(
    ap_scores,
    sameby=["compound", "dose"],      # mAP per compound×dose
    hierarchical_by=["compound"],     # but correct at compound level
    null_size=10000,
    threshold=0.05,
    seed=42,
)
```

When hierarchical_by is specified, the result includes additional columns:

  • stage1_p_value: Group-level p-value (minimum p-value in group)
  • stage1_corrected_p_value: BH-corrected Stage 1 p-value
  • stage1_significant: Whether the group passed Stage 1
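
For downstream filtering, something like the following should work (a sketch only; it assumes `result` is the DataFrame from the usage example above, with the `compound` column from `sameby`, and that the existing `below_corrected_p` flag still marks the within-group call):

```python
# Compounds with at least one active dose (passed Stage 1)
active_compounds = result.loc[result["stage1_significant"], "compound"].unique()

# Dose-level hits: group passed Stage 1 AND the dose survives within-group BH
# (assumes below_corrected_p keeps flagging corrected_p_value < threshold)
dose_hits = result[result["stage1_significant"] & result["below_corrected_p"]]
```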

Why min p-value instead of Simes?

For dose-response data, low doses are expected to be inactive. Simes' method penalizes compounds for having inactive low doses, even though that pattern is biologically normal. Min p-value is more appropriate: a compound passes Stage 1 if ANY dose shows activity.

Example on LINCS data (4 plates, 58 compounds × 6 doses):

  • Flat BH: 26 significant doses
  • Hierarchical with Simes: 33 significant doses (but misses compounds with strong high-dose signal)
  • Hierarchical with min-p: 49 significant doses (88% power gain over flat BH)

Why hierarchical FDR matters

With 1000 compounds × 5 doses = 5000 tests, standard BH corrects across all 5000 as one family. But doses of the same compound test the same underlying hypothesis. Hierarchical FDR:

  • Stage 1: 1000 compound tests → say 50 pass (if any dose is active)
  • Stage 2: 50 × 5 = 250 dose tests, corrected in groups of 5
  • Much less harsh than correcting across 5000
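
A minimal, standalone sketch of that two-stage procedure (using statsmodels' `multipletests`; this is illustrative, not the package's internal implementation, and the column names are placeholders):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

def two_stage_bh(df: pd.DataFrame, group_col: str,
                 p_col: str = "p_value", alpha: float = 0.05) -> pd.DataFrame:
    # Stage 1: one p-value per group (its minimum), BH across groups
    stage1 = df.groupby(group_col)[p_col].min().rename("stage1_p_value").reset_index()
    stage1["stage1_significant"] = multipletests(
        stage1["stage1_p_value"], alpha=alpha, method="fdr_bh"
    )[0]
    out = df.merge(stage1, on=group_col)

    # Stage 2: BH within each group that survived Stage 1
    out["corrected_p_value"] = np.nan
    for _, idx in out[out["stage1_significant"]].groupby(group_col).groups.items():
        out.loc[idx, "corrected_p_value"] = multipletests(
            out.loc[idx, p_col], alpha=alpha, method="fdr_bh"
        )[1]
    return out
```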

Test plan

  • 8 tests for hierarchical FDR behavior
  • All tests pass
  • Ruff checks pass

Context for Broadies: See https://github.com/broadinstitute/cpg0037-oasis-broad-U2OS-data/issues/9#issuecomment-3624961526 for a real-world example of its utility

Closes #115

🤖 Generated with Claude Code

Implements two-stage hierarchical FDR (Yekutieli 2008) to reduce
over-correction when testing related hypotheses (e.g., multiple
doses of the same compound).

- Add `hierarchical_by` parameter to `mean_average_precision()`
- Stage 1: Aggregate p-values by group using Simes' method, apply BH
- Stage 2: For significant groups, apply BH within each group
- Add `simes_pvalue()` function for p-value combination
- Fix `silent_thread_map` bug (handle `leave` kwarg)
- Add comprehensive tests for hierarchical FDR

Closes #115

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Simes' method penalizes compounds for having inactive low doses, even though
inactive low doses are the expected biological behavior in dose-response data.
Min p-value is more appropriate: a compound passes Stage 1 if ANY dose shows
activity.

- Replace simes_pvalue() aggregation with simple min()
- Remove unused simes_pvalue function and tests
- Update docstrings to reflect the change
@shntnu changed the title from "feat: add hierarchical FDR correction for grouped hypotheses" to "feat: add hierarchical FDR correction for dose-response data" on Dec 8, 2025
@shntnu (Member, Author) commented Dec 8, 2025

Design note: Why min p-value instead of Simes?

The initial implementation (1ede068) used Simes' method for Stage 1 p-value aggregation, following Yekutieli's (2008) hierarchical FDR procedure. However, testing on real dose-response data (LINCS, 4 plates, 58 compounds × 6 doses) revealed a problem:

Simes penalizes compounds for having inactive low doses, even though inactive low doses are the expected biological behavior in dose-response data.

For example, compound BRD-K72414522-001-06-7:

  • 10µM dose: mAP=0.47, p=0.0029 (clearly active)
  • Low doses: p=0.03-0.14 (expected to be inactive)
  • Simes p-value: 0.012 → after BH: 0.058 (just missed 0.05 threshold!)

The compound has a strong signal at high dose but Simes dilutes it with the inactive low doses.

Min p-value is more appropriate: a compound passes Stage 1 if ANY dose is active. This matches the biological question: "Does this compound have a phenotype at any tested dose?"
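
To make the difference concrete, here is the aggregation step on a made-up set of dose-level p-values (one strong high dose, several weak low doses; these are illustrative numbers, not the actual six p-values for the compound above):

```python
import numpy as np

# Hypothetical dose-level p-values for one compound: one clearly active high
# dose plus inactive low doses (illustrative values only)
p = np.array([0.0029, 0.03, 0.06, 0.09, 0.12, 0.14])

# Simes' combined p-value: min over i of m * p_(i) / i on the sorted p-values
p_sorted = np.sort(p)
m = len(p)
simes = np.min(m * p_sorted / np.arange(1, m + 1))   # 0.0174 here

# Min-p aggregation: the compound is judged by its best dose
min_p = p.min()                                       # 0.0029 here

# Simes is always >= min-p, so the inactive low doses can only dilute
# a strong high-dose signal before the Stage 1 BH correction is applied.
```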

Results on LINCS data:

| Method | Stage 1 compounds | Significant doses |
| --- | --- | --- |
| Flat BH | – | 26 |
| Hierarchical + Simes | 10 | 33 |
| Hierarchical + min-p | 19 | 49 |

Min-p provides 88% power gain over flat BH while correctly handling the dose-response structure.

When would Simes be appropriate?

Simes would be better when you expect most or all group members to be active (e.g., testing replicates of the same condition). For dose-response, where only high doses are expected to be active, min-p is the right choice.

If there's future demand, we could add a stage1_method parameter to allow users to choose. For now, min-p is hardcoded as the biologically appropriate default for the dose-response use case.

```python
kwargs = kwargs.copy()
max_workers = kwargs.pop("max_workers", min(32, cpu_count() + 4))
chunksize = kwargs.pop("chunksize", 1)
kwargs.pop("leave", None)  # Remove tqdm-specific kwarg
```
Contributor:
Isn't this unnecessary?

Member Author:
It was annoying me; I have a separate PR for this alone :D

#113

I can remove it from this PR to avoid clutter

Contributor:
It's fine here I guess, I just don't know how it can be annoying, as it should just be ignored.

@afermg (Contributor) left a comment:
The logic seems okay to me, but (and I consulted with @alxndrkalinin) we probably don't want to add a bunch of arguments to a single function: let's do composition.

At a general level, this means:

  • Split p value calculation from statistical correction
  • Isolate hierarchical statistical correction into its own function (ideally in a new file).

I'm a bit torn as to how much to modify the original mean_average_precision function, but I think it's worth isolating the two main steps (before and after the p-value is calculated in this section) to avoid code duplication.

Then we would have composition:

  • One small function (e.g., get_map_pvalue) that covers the p-value calculation (the first section of mean_average_precision)
  • One function with hierarchical FDR correction
  • Refactor mean_average_precision into get_map_pvalue plus multipletests, to retain backwards compatibility
  • Potentially another function that wraps get_map_pvalue and either multipletests or hierarchical_fdr, if you want the convenience of mAP + hierarchical FDR in one

This minimises repetition (as it is only the call to multipletests), while maximising modularity in case we have to add a different statistical correction in the future. It should be relatively simple, the code would remain modular enough, and we wouldn't start accumulating a bunch of flags and arguments on the main functions.
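
For concreteness, a rough sketch of that composition (placeholder signatures only, not an agreed API):

```python
from statsmodels.stats.multitest import multipletests

def get_map_pvalue(ap_scores, sameby, null_size, seed):
    """mAP per group plus raw (uncorrected) p-values: the first section of
    today's mean_average_precision."""
    ...

def hierarchical_fdr(map_scores, hierarchical_by, threshold=0.05):
    """Two-stage BH over map_scores["p_value"], grouped by hierarchical_by."""
    ...

def mean_average_precision(ap_scores, sameby, null_size, threshold, seed):
    """Backwards-compatible wrapper: flat BH, same behaviour as the current API."""
    map_scores = get_map_pvalue(ap_scores, sameby, null_size, seed)
    map_scores["corrected_p_value"] = multipletests(
        map_scores["p_value"], method="fdr_bh"
    )[1]
    return map_scores
```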

  • Tests run on my side so far.

map_scores["p_value"], method="fdr_bh"
)
map_scores["corrected_p_value"] = pvals_corrected
if hierarchical_by is None:
Contributor:
The hierarchical correction section is >50% of the code in the function. Since we are now able to switch to different correction methods, I propose that the different corrections should be independent functions, and we should use map.map as an orchestrator of these.

Contributor:
This will also keep the function tidy for maintenance.

```python
# Hierarchical should generally find more or equal significant results
# (less over-correction for related hypotheses)
# Note: This is a probabilistic test, but with our setup it should hold
assert n_sig_hier >= n_sig_flat * 0.8  # Allow some tolerance
```
Contributor:
Isn't there an upper limit to the difference we should also be wary of?

- `below_p`: Boolean indicating if the p-value is below the threshold.
- `below_corrected_p`: Boolean indicating if the corrected p-value is below the threshold.
When `hierarchical_by` is used, additional columns are included:
Contributor:
Not a fan of adding yet more corrected columns, as it becomes confusing when we have different types of corrections. That said, I don't think there is an easy solution to this problem.
