
Commit abda008

epriftic and claude committed
docs: Add comprehensive Methods section with 6 methodology pages
New pages covering the mathematical and algorithmic foundations:

- Model Languages & Scoring: BTR + Pow2, threshold optimization, CI, fitness
- Search Algorithms: GA, Beam, MCMC, cross-validation, feature pre-selection
- Family of Best Models: FBM definition, co-presence, jury voting
- Stability Analysis: Kuncheva, Tanimoto, CW_rel, clustering, dendrograms
- Ecosystem Network: Spearman, Louvain, taxonomy coloring, FBM overlay
- Feature Importance & Evaluation: MDA, SHAP, metrics, external validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d241562 commit abda008

10 files changed

Lines changed: 1012 additions & 3 deletions

docs/Features.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,7 +1,6 @@
 ---
 layout: default
 title: Features
-parent: Documentation
 nav_order: 3
 ---
```

docs/Installation.md

Lines changed: 1 addition & 2 deletions
```diff
@@ -1,8 +1,7 @@
 ---
 layout: default
 title: Installation
-parent: Documentation
-nav_order: 2
+nav_order: 7
 ---
 
 # Installation
```

docs/Methods-Algorithms.md

Lines changed: 166 additions & 0 deletions
---
layout: default
title: Search Algorithms
parent: Methods
nav_order: 2
---

# Search Algorithms

Predomics provides three complementary optimization algorithms for searching the model space. Each explores different trade-offs between exhaustiveness, speed, and stochasticity.
## Genetic Algorithm (GA)

The primary search strategy: a population-based evolutionary algorithm that maintains and evolves a diverse set of candidate models over multiple generations (epochs).

### Algorithm Outline

1. **Initialization**: Generate a random population of individuals, each with a random subset of k features and random coefficients (drawn from the language alphabet). Feature `prior_weight` annotations can bias initial feature selection.

2. **Evaluation**: Score each individual on the training data (compute AUC, optimize the threshold, apply penalties).

3. **Selection**: Choose parents for the next generation:
   - **Elite selection**: Top-performing individuals survive unchanged
   - **Niche selection** (`select_niche_pct`): Within each language × data-type niche, the top N% are guaranteed as parents. This prevents any single niche from dominating.
   - **Tournament/random selection**: Fill the remaining parent slots

4. **Crossover**: Combine pairs of parent models to produce offspring. Feature sets are recombined (union, intersection, or mixed strategies); coefficients are inherited or re-assigned.

5. **Mutation**: Randomly modify offspring:
   - Add or remove a feature
   - Flip a coefficient sign
   - Change the language or data type (rare, large mutations)

6. **Diversity filtration** (`forced_diversity_pct`): At periodic intervals (`forced_diversity_epochs`), remove individuals that are too similar to others (measured by signed Jaccard distance on feature sets). Binary/Ternary/Pow2 models are filtered as one group; Ratio models are filtered separately. This prevents premature convergence.

7. **Repeat** steps 2-6 for the configured number of epochs.
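
The loop above can be sketched in a few dozen lines of plain Python. This is an illustrative toy, not the Predomics implementation: `evaluate` uses a plain accuracy proxy instead of AUC with threshold optimization, selection is simple truncation plus elitism, and niches and diversity filtration are omitted.

```python
import random

def evaluate(model, X, y):
    # Toy fitness: fraction of samples where sign(score) matches the label.
    # (Predomics optimizes AUC with threshold search; accuracy is used for brevity.)
    correct = 0
    for xi, yi in zip(X, y):
        score = sum(c * xi[f] for f, c in model)
        correct += int((score > 0) == yi)
    return correct / len(y)

def random_model(n_features, k):
    # A model is a set of k features, each with a ternary coefficient (+1 or -1).
    feats = random.sample(range(n_features), k)
    return [(f, random.choice([-1, 1])) for f in feats]

def mutate(model, n_features):
    model = list(model)
    i = random.randrange(len(model))
    if random.random() < 0.5:
        model[i] = (model[i][0], -model[i][1])            # flip a coefficient sign
    else:
        model[i] = (random.randrange(n_features),         # swap in a random feature
                    random.choice([-1, 1]))
    return model

def crossover(a, b, k):
    # Recombine parents: draw k features from the union; children inherit coefficients.
    pool = list({f: c for f, c in a + b}.items())
    return random.sample(pool, min(k, len(pool)))

def ga(X, y, k=2, pop_size=16, epochs=20, n_elite=4):
    n = len(X[0])
    pop = [random_model(n, k) for _ in range(pop_size)]
    for _ in range(epochs):
        pop.sort(key=lambda m: evaluate(m, X, y), reverse=True)
        parents = pop[:pop_size // 2]                     # truncation selection
        children = pop[:n_elite]                          # elites survive unchanged
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b, k), n))
        pop = children
    return max(pop, key=lambda m: evaluate(m, X, y))
```

Because elites are carried over unchanged, the best fitness in the population never decreases from one epoch to the next.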

### Niche System

The population is partitioned into **niches** based on (language, data_type) combinations. For example, with languages {bin, ter, ratio} and data types {raw, prev, log}, there are up to 9 niches. The `select_niche_pct` parameter ensures each niche contributes parents to the next generation, maintaining structural diversity.

### When to Use

- **Default choice** for most analyses
- Best for broad exploration of the feature space
- Handles all languages and data types simultaneously
- Most robust to local optima due to population diversity

## Beam Search

A deterministic, greedy heuristic that builds models incrementally.

### Algorithm Outline

1. Start with an empty model (k=0)
2. For each candidate feature, tentatively add it to the current model
3. Score all (current model + 1 feature) combinations
4. Keep the top B candidates (the beam width)
5. Repeat until reaching the target model size k
6. Optionally, continue with **backward elimination**: try removing each feature and keep the best result
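
The forward pass above can be sketched as follows; `score` is a hypothetical callback that evaluates a feature subset, standing in for model scoring on training data.

```python
def beam_search(score, n_features, k, beam_width=5):
    """Greedy forward selection keeping the best `beam_width` partial models.

    `score` maps a tuple of feature indices to a fitness value (higher is better).
    """
    beam = [()]                                    # start from the empty model (k=0)
    for _ in range(k):
        candidates = set()
        for model in beam:
            for f in range(n_features):            # tentatively add each feature
                if f not in model:
                    candidates.add(tuple(sorted(model + (f,))))
        # keep only the top `beam_width` extended models
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]
```

With `beam_width=1` this reduces to plain forward selection; keeping every candidate instead of the top B would correspond to the limited exhaustive variant.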

### Variants

- **Limited Exhaustive**: Evaluate all subsets up to size k (exponential; only practical for small k and small feature sets)
- **Parallel Forward Selection**: Multiple starting points explored in parallel

### When to Use

- **Fast prototyping**: Deterministic and reproducible
- **Small feature sets**: When the filtered feature space is small enough for near-exhaustive search
- **Benchmark**: Compare GA results against a greedy baseline
- **Reproducibility**: The same input always produces the same output (no stochasticity)

### Limitations

- Can get trapped in local optima (greedy decisions are irreversible)
- Does not naturally explore multiple languages/data types simultaneously
- Less effective when feature interactions are complex

## MCMC (Markov Chain Monte Carlo)

A probabilistic sampler that explores the model space through random walks.

### Algorithm Outline

1. Start with a random model
2. **Propose** a modification:
   - Add a random feature
   - Remove a random feature
   - Swap a feature for another
   - Change a coefficient
3. **Accept or reject** the proposal based on the Metropolis-Hastings criterion:
   - If the new model is better: always accept
   - If worse: accept with probability exp(-delta_fitness / temperature)
4. Record the current model
5. Repeat for many iterations
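
The steps above can be sketched as a Metropolis sampler. This is an illustrative toy, not the Predomics implementation: the model size k is fixed, only swap proposals are used, and `score` is a hypothetical fitness callback.

```python
import math
import random

def mcmc(score, n_features, k, iters=2000, temperature=0.05, seed=0):
    """Metropolis sampler over fixed-size feature sets (illustrative sketch).

    `score` maps a frozenset of feature indices to a fitness (higher is better).
    Returns the visited models, from which feature posteriors can be estimated.
    """
    rng = random.Random(seed)
    current = frozenset(rng.sample(range(n_features), k))
    samples = []
    for _ in range(iters):
        # propose: swap one feature in the model for one outside it
        out_f = rng.choice(sorted(current))
        in_f = rng.choice([f for f in range(n_features) if f not in current])
        proposal = (current - {out_f}) | {in_f}
        delta = score(proposal) - score(current)
        # Metropolis criterion: always accept improvements (delta >= 0);
        # accept a worse move with probability exp(delta / temperature),
        # which equals exp(-fitness_loss / temperature) since delta is negative
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposal
        samples.append(current)
    return samples

def feature_posterior(samples, f):
    # fraction of visited models that contain feature f
    return sum(f in s for s in samples) / len(samples)
```

The recorded chain visits models in proportion to their quality, so the per-feature visit frequency approximates the posterior inclusion probability.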

### Sequential Backward Selection Variant

A deterministic variant that starts with a large feature set and iteratively removes the least important feature (by MDA permutation importance), producing one model at each sparsity level.
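
A sketch of this variant; `importance` is a hypothetical callback standing in for MDA permutation importance.

```python
def backward_selection(features, importance, min_size=1):
    """Iteratively drop the least important feature, recording each sparsity level.

    `importance` maps (feature, current_set) to an importance value; in Predomics
    this would be MDA permutation importance, here it is an arbitrary callback.
    """
    current = list(features)
    path = [tuple(current)]
    while len(current) > min_size:
        worst = min(current, key=lambda f: importance(f, current))
        current.remove(worst)
        path.append(tuple(current))            # one model per sparsity level
    return path
```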

### When to Use

- **Posterior estimation**: Estimate the probability that each feature belongs to the optimal model
- **Uncertainty quantification**: The chain samples models in proportion to their quality
- **Complementary exploration**: Can find models missed by GA and Beam

## Algorithm Comparison

| Property | GA | Beam | MCMC |
|----------|:--:|:----:|:----:|
| Stochastic | Yes | No | Yes |
| Multi-language | Yes | Per-run | Per-run |
| Global search | Strong | Weak | Moderate |
| Speed | Moderate | Fast | Slow |
| Reproducibility | Seed-dependent | Deterministic | Seed-dependent |
| Feature interactions | Captured | Limited | Moderate |
| Recommended for | General use | Quick exploration | Posterior estimation |

## Cross-Validation

All algorithms support two levels of cross-validation:

### Outer Cross-Validation (K-Fold)

The training data is split into K folds (default K=5). The algorithm is run K times in parallel, each time using K-1 folds for training and 1 fold for validation. The final population gathers the Family of Best Models (FBM) from all folds.

```
Fold 1: Train on folds {2,3,4,5}, validate on fold 1
Fold 2: Train on folds {1,3,4,5}, validate on fold 2
...
Final:  Combine FBMs, evaluate on full training data + external test set
```
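
The outer split can be sketched as follows (illustrative only; a real implementation would also shuffle samples within each class):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, preserving class proportions."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for members in by_class.values():
        for j, i in enumerate(members):        # deal samples round-robin per class
            folds[j % k].append(i)
    return folds

def outer_cv_splits(labels, k=5):
    folds = stratified_folds(labels, k)
    for v in range(k):
        train = [i for j, f in enumerate(folds) if j != v for i in f]
        yield train, folds[v]                  # train on k-1 folds, validate on 1
```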

### Inner Cross-Validation (Overfit Control)

Within each training run, the data is further split into inner folds. At each epoch, model fitness is penalized by the train-validation gap:

```
fitness = mean(train_fit - |train_fit - valid_fit| * overfit_penalty)
```

Inner folds can be re-drawn periodically (`resampling_inner_folds_epochs`) so that models do not indirectly learn the fold structure.
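
As a minimal sketch, the penalized fitness for one model over its inner folds can be computed as (hypothetical helper name):

```python
def penalized_fitness(train_fits, valid_fits, overfit_penalty=1.0):
    """Mean over inner folds of train fitness minus the penalized train/valid gap."""
    pairs = list(zip(train_fits, valid_fits))
    return sum(t - abs(t - v) * overfit_penalty for t, v in pairs) / len(pairs)
```

A model that generalizes well across inner folds thus outranks one with a higher raw training score but a larger gap.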

### Stratification

Folds are stratified by class label to preserve class proportions. Since v0.7.4, **double stratification** is supported: folds are stratified first by class, then by a metadata variable (e.g., hospital, batch, sequencing center) to ensure that confounding factors are balanced across folds.
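
Double stratification can be sketched by grouping samples on the (class, batch) pair and dealing each group across all folds (illustrative, no shuffling):

```python
from collections import defaultdict

def double_stratified_folds(labels, batches, k=5):
    """Stratify folds first by class label, then by a metadata variable."""
    folds = [[] for _ in range(k)]
    groups = defaultdict(list)
    for i, (y, b) in enumerate(zip(labels, batches)):
        groups[(y, b)].append(i)               # one group per (class, batch) pair
    for key in sorted(groups):
        for j, i in enumerate(groups[key]):
            folds[j % k].append(i)             # spread each group across all folds
    return folds
```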

## Feature Pre-selection

Before running any algorithm, features are statistically pre-filtered to reduce the search space:

1. **Prevalence filter** (`feature_minimal_prevalence_pct`): Remove features present in fewer than N% of samples in both classes
2. **Mean filter** (`feature_minimal_feature_value`): Remove features with negligible mean abundance in both classes
3. **Statistical test**, with three options:
   - **Wilcoxon rank-sum** (default): Non-parametric, robust to skewness and outliers
   - **Student's t-test**: Parametric, assumes normality
   - **Bayesian Fisher's exact**: For binary/presence-absence data, uses Bayes factors
4. **Multiple testing correction**: Benjamini-Hochberg FDR for Wilcoxon and t-test
5. **Feature ranking**: Features are ranked by significance, optionally capped at `max_features_per_class` per class
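
Steps 1, 2, and 4 above can be sketched as follows. These are illustrative helpers (`prefilter`, `benjamini_hochberg` are hypothetical names, not the Predomics API), and the statistical test itself is left to a library: `benjamini_hochberg` operates on p-values from any test.

```python
def class_prevalence(X, y, f, cls):
    # fraction of samples of class `cls` in which feature f is present (non-zero)
    rows = [x for x, yi in zip(X, y) if yi == cls]
    return sum(x[f] > 0 for x in rows) / len(rows)

def prefilter(X, y, min_prev=0.1, min_mean=0.01):
    """Keep a feature only if it passes both filters in at least one class."""
    keep = []
    for f in range(len(X[0])):
        prev_ok = any(class_prevalence(X, y, f, c) >= min_prev for c in (0, 1))
        means = [sum(x[f] for x, yi in zip(X, y) if yi == c) /
                 sum(yi == c for yi in y) for c in (0, 1)]
        if prev_ok and max(means) >= min_mean:
            keep.append(f)
    return keep

def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of features whose p-values pass BH FDR control (step-up)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    n, cutoff = len(pvals), 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / n:       # largest rank meeting the BH bound
            cutoff = rank
    return sorted(order[:cutoff])
```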

## References

- Holland, J. H. (1992). *Adaptation in Natural and Artificial Systems*. MIT Press.
- Blondel, V. D. et al. (2008). Fast unfolding of communities in large networks. *J. Stat. Mech.*, P10008.
- Metropolis, N. et al. (1953). Equation of state calculations by fast computing machines. *J. Chem. Phys.*, 21(6), 1087-1092.
