This package implements the mathematical framework described in "A General Consensus Framework for Robust Statistical Locus Selection" by Connor M. Frankston.
The consensus framework integrates diverse sets of genomic loci (from multiple BED files) into a robust consensus based on statistical dependencies among selectors. It operates on three main assumptions:
- Saturation - The union of all selectors covers the target loci
- Independence of Incompetent Selection (IIS) - False loci are selected independently
- Independence of Incompetent Omission (IIO) - True loci are omitted independently
# Clone the repository
git clone https://github.com/frankston/consensus-framework.git
cd consensus-framework
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .# Basic usage
consensus-framework --location input_table.tsv --output-dir ./results
# With multiple testing correction and significance threshold
consensus-framework --location input_table.tsv --output-dir ./results --alpha 0.05 --mtc bh
# Enable verbose logging
consensus-framework --location input_table.tsv --output-dir ./results --verbose--location: Path to the input table file containing BED file references (required)--sep: Separator used in the input table (default: tab)--table-columns: Comma-separated string of column labels--output-dir: Directory to save output files (required)--skip-p-vals/--get-p-vals: Skip or compute p-values (default: compute)--alpha: Significance threshold for consensus computation--n-simulations: Number of simulations for null distribution (default: 10000)--recompute: Recompute outputs even if existing files are found--mtc: Multiple-testing correction method (raw,bonf,sidak,bh) (default:bh)--verbose: Enable verbose (debug) logging
from consensus_framework import (
load_basis_identifiers,
create_prior_localizations,
compute_identifier_selectivity,
apply_multiple_testing_correction
)
# Load identifiers
selected_loci_df, basis_identifiers_df, num_basis_identifiers = load_basis_identifiers(
"input_table.tsv", "\t", "location,sep,table_columns"
)
# Create prior localizations
prior_localization_df, atoms_df = create_prior_localizations(
selected_loci_df, num_basis_identifiers
)
# Compute selectivity
atoms_df, coverage_sums, prior_localizations, selectivity_values, n_unique = compute_identifier_selectivity(
atoms_df, num_basis_identifiers
)
# ... additional processing stepsThe input table should contain information about each BED file to be used as a basis identifier:
location sep table_columns
file1.bed \t chromosome,start,end
file2.bed \t chromosome,start,end
prior_localization_regions.bed: BED file with prior localization regions (union of all identifiers)consensus_track.bed: BED file with consensus statistics and p-valuesconsensus_regions.selection.alpha-X.mtc-Y.bed: Significant regions based on selection (if alpha is provided)consensus_regions.omission.alpha-X.mtc-Y.bed: Significant regions based on omission (if alpha is provided)
The framework is implemented in two main modules:
locus_consensus.py: Contains the core statistical consensus functionalitylocus_consensus_cli.py: Provides the command-line interface
- Data Loading: Load basis identifiers and their corresponding BED files
- Prior Localization Construction: Create atomic regions and prior localizations
- Selectivity Computation: Calculate selectivity for each basis identifier
- Consensus Statistics: Compute selection and omission statistics
- Null Distribution Simulation: Generate empirical null distributions
- Multiple Testing Correction: Apply corrections (Bonferroni, Šidák, Benjamini-Hochberg)
- Result Saving: Output consensus regions as BED files
The package implements the mathematical framework described in the manuscript, including:
- Creating prior localizations as connected components of the union of all identifiers
- Computing selectivity as the expected probability of identifier selection
- Calculating consensus statistics based on information-theoretic principles
- Testing against independence hypotheses through simulation
- Applying multiple testing correction to control error rates
Run the unit tests with pytest:
pytest tests/- The implementation uses vectorized operations where possible for improved performance
- For large datasets, consider adjusting the
--n-simulationsparameter to balance accuracy and speed - Memory usage scales with the number and size of input BED files