---
layout: default
title: Search Algorithms
parent: Methods
nav_order: 2
---

# Search Algorithms

Predomics provides three complementary optimization algorithms for searching the model space. Each explores different trade-offs between exhaustiveness, speed, and stochasticity.

## Genetic Algorithm (GA)

The primary search strategy: a population-based evolutionary algorithm that maintains and evolves a diverse set of candidate models over multiple generations (epochs).

### Algorithm Outline

1. **Initialization**: Generate a random population of individuals, each with a random subset of k features and random coefficients (drawn from the language alphabet). Feature `prior_weight` annotations can bias initial feature selection.

2. **Evaluation**: Score each individual on the training data (compute AUC, optimize the threshold, apply penalties).

3. **Selection**: Choose parents for the next generation:
   - **Elite selection**: Top-performing individuals survive unchanged
   - **Niche selection** (`select_niche_pct`): Within each language × data-type niche, the top N% are guaranteed as parents. This prevents any single niche from dominating.
   - **Tournament/random selection**: Fill the remaining parent slots

4. **Crossover**: Combine pairs of parent models to produce offspring. Feature sets are recombined (union, intersection, or mixed strategies). Coefficients are inherited or re-assigned.

5. **Mutation**: Randomly modify offspring:
   - Add or remove a feature
   - Flip a coefficient sign
   - Change the language or data type (rare, large mutations)

6. **Diversity filtration** (`forced_diversity_pct`): At periodic intervals (`forced_diversity_epochs`), remove individuals that are too similar to others (measured by signed Jaccard distance on feature sets). Binary/Ternary/Pow2 models are filtered as one group; Ratio models are filtered separately. This prevents premature convergence.

7. **Repeat** steps 2-6 for the configured number of epochs.
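
The loop above can be sketched in a few lines of Python. This is a toy illustration, not the Predomics implementation: the fitness function, target set, and all constants are invented stand-ins (real evaluation would compute AUC with penalties), and only elite selection, union crossover, and add/remove mutation are shown.

```python
import random

random.seed(0)

N_FEATURES = 20
TARGET = {2, 5, 11}    # hypothetical "true" signature used by the toy fitness
POP_SIZE = 30
K = 3                  # model sparsity
EPOCHS = 40
ELITE = 5

def fitness(model):
    # Stand-in for the real scoring step (AUC, threshold, penalties):
    # reward overlap with the hidden target set.
    return len(model & TARGET)

def random_model():
    return set(random.sample(range(N_FEATURES), K))

def crossover(a, b):
    # Recombine parent feature sets (union strategy), then resample to size K.
    pool = sorted(a | b)
    return set(random.sample(pool, min(K, len(pool))))

def mutate(model):
    child = set(model)
    child.remove(random.choice(sorted(child)))   # drop one feature...
    child.add(random.randrange(N_FEATURES))      # ...and add a random one
    return child

population = [random_model() for _ in range(POP_SIZE)]
for _ in range(EPOCHS):
    population.sort(key=fitness, reverse=True)
    parents = population[:ELITE]                 # elite selection
    offspring = []
    while len(offspring) < POP_SIZE - ELITE:
        a, b = random.sample(parents, 2)
        offspring.append(mutate(crossover(a, b)))
    population = parents + offspring

best = max(population, key=fitness)
```

Because the elite survive unchanged, the best fitness in the population never decreases across epochs.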

### Niche System

The population is partitioned into **niches** based on (language, data_type) combinations. For example, with languages {bin, ter, ratio} and data types {raw, prev, log}, there are up to 9 niches. The `select_niche_pct` parameter ensures each niche contributes parents to the next generation, maintaining structural diversity.
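
A minimal sketch of niche-based parent selection, assuming a population of (language, data_type, fitness) triples; the example models and the `niche_parents` helper are hypothetical, not part of the package API.

```python
from collections import defaultdict

# Hypothetical population: (language, data_type, fitness) triples.
population = [
    ("bin", "raw", 0.81), ("bin", "raw", 0.74), ("bin", "log", 0.69),
    ("ter", "raw", 0.77), ("ter", "prev", 0.70), ("ter", "prev", 0.65),
    ("ratio", "log", 0.88), ("ratio", "log", 0.60),
]

def niche_parents(population, select_niche_pct=50):
    """Guarantee the top N% of each (language, data_type) niche a parent slot."""
    niches = defaultdict(list)
    for model in population:
        niches[(model[0], model[1])].append(model)
    parents = []
    for members in niches.values():
        members.sort(key=lambda m: m[2], reverse=True)
        n_keep = max(1, round(len(members) * select_niche_pct / 100))
        parents.extend(members[:n_keep])
    return parents

parents = niche_parents(population)
```

Even the weakest niche keeps at least one parent, so no (language, data_type) combination is driven extinct by a stronger one.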

### When to Use

- **Default choice** for most analyses
- Best for broad exploration of the feature space
- Handles all languages and data types simultaneously
- Most robust to local optima due to population diversity

## Beam Search

A deterministic, greedy heuristic that builds models incrementally.

### Algorithm Outline

1. Start with an empty model (k=0)
2. For each candidate feature, tentatively add it to the model
3. Score all (current model + 1 feature) combinations
4. Keep the top B candidates (beam width)
5. Repeat until reaching the target model size k
6. Optionally, continue with **backward elimination**: try removing each feature and keep the best resulting model
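
The forward steps above can be sketched as a generic beam search; the `GAIN` table and the redundancy penalty below are invented stand-ins for real model scoring, not Predomics internals.

```python
# Toy scoring function standing in for model evaluation (higher is better);
# the GAIN table and the redundancy penalty are invented for this example.
GAIN = {0: 0.30, 1: 0.25, 2: 0.20, 3: 0.10, 4: 0.05}

def score(features):
    s = sum(GAIN[f] for f in features)
    if {0, 1} <= features:       # features 0 and 1 carry overlapping signal
        s -= 0.20
    return s

def beam_search(n_features, k, beam_width):
    beam = [frozenset()]                           # step 1: start empty
    for _ in range(k):                             # step 5: grow to size k
        candidates = set()
        for model in beam:
            for f in range(n_features):            # step 2: tentative additions
                if f not in model:
                    candidates.add(model | {f})
        # Steps 3-4: score every extension, keep the top B
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

best = beam_search(n_features=5, k=2, beam_width=3)
```

Note how the beam saves the search here: pure greedy would pair the two individually strongest features (0 and 1), but their redundancy penalty makes {0, 2} the better two-feature model, which survives because the beam keeps several candidates at each step.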

### Variants

- **Limited Exhaustive**: Evaluate all subsets up to size k (exponential, only practical for small k and feature sets)
- **Parallel Forward Selection**: Multiple starting points explored in parallel

### When to Use

- **Fast prototyping**: Deterministic and reproducible
- **Small feature sets**: When the filtered feature space is small enough for near-exhaustive search
- **Benchmark**: Compare GA results against a greedy baseline
- **Reproducibility**: Same input always produces the same output (no stochasticity)

### Limitations

- Can get trapped in local optima (greedy decisions are irreversible)
- Does not naturally explore multiple languages/data types simultaneously
- Less effective when feature interactions are complex

## MCMC (Markov Chain Monte Carlo)

A probabilistic sampler that explores the model space through random walks.

### Algorithm Outline

1. Start with a random model
2. **Propose** a modification:
   - Add a random feature
   - Remove a random feature
   - Swap a feature for another
   - Change a coefficient
3. **Accept or reject** the proposal based on the Metropolis-Hastings criterion:
   - If the new model is better: always accept
   - If worse: accept with probability exp(-delta_fitness / temperature)
4. Record the current model
5. Repeat for many iterations
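
The accept/reject rule can be illustrated on a toy one-dimensional model space; the `fitness` landscape, temperature, and iteration count here are invented for the demonstration, not Predomics defaults.

```python
import math
import random

random.seed(1)

# Toy model space: integer states 0..9 with a fitness peak at state 6.
def fitness(state):
    return 1.0 - abs(state - 6) / 10

def accept(old_fit, new_fit, temperature=0.1):
    # Metropolis-Hastings criterion: always accept improvements,
    # accept worse moves with probability exp(delta / temperature).
    if new_fit >= old_fit:
        return True
    return random.random() < math.exp((new_fit - old_fit) / temperature)

state = random.randrange(10)
visits = [0] * 10
for _ in range(20000):
    proposal = min(9, max(0, state + random.choice((-1, 1))))  # local proposal
    if accept(fitness(state), fitness(proposal)):
        state = proposal
    visits[state] += 1                                         # record current model

most_visited = visits.index(max(visits))
```

Because downhill moves are sometimes accepted, the chain can escape local regions, and the time spent in each state reflects its quality: the fittest state is visited most often.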

### Sequential Backward Selection Variant

A deterministic variant that starts with a large feature set and iteratively removes the least important feature (by MDA permutation), producing models at each sparsity level.
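
A rough sketch of that backward pass, using the score drop after removing a feature as a simplified proxy for MDA permutation importance; the dataset and the toy scoring model are fabricated for illustration.

```python
import random

random.seed(2)

# Hypothetical dataset: the label depends only on features 0 and 1.
X = [[random.random() for _ in range(5)] for _ in range(200)]
y = [1 if row[0] + row[1] > 1.0 else 0 for row in X]

def score(features):
    # Toy model: predict 1 when the mean of the kept features exceeds 0.5.
    correct = 0
    for row, label in zip(X, y):
        pred = 1 if sum(row[f] for f in features) / len(features) > 0.5 else 0
        correct += pred == label
    return correct / len(y)

features = list(range(5))
path = []                # model retained at each sparsity level
while len(features) > 1:
    # Drop the feature whose removal hurts the score least (a simple
    # score-drop proxy for MDA permutation importance).
    least_important = max(features,
                          key=lambda f: score([g for g in features if g != f]))
    features.remove(least_important)
    path.append(list(features))
```

The noise features are eliminated first, so `path` yields one model per sparsity level with the informative features surviving longest.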

### When to Use

- **Posterior estimation**: Estimate the probability that each feature belongs to the optimal model
- **Uncertainty quantification**: The chain samples models in proportion to their quality
- **Complementary exploration**: Can find models missed by GA and Beam

## Algorithm Comparison

| Property | GA | Beam | MCMC |
|----------|:--:|:----:|:----:|
| Stochastic | Yes | No | Yes |
| Multi-language | Yes | Per-run | Per-run |
| Global search | Strong | Weak | Moderate |
| Speed | Moderate | Fast | Slow |
| Reproducibility | Seed-dependent | Deterministic | Seed-dependent |
| Feature interactions | Captured | Limited | Moderate |
| Recommended for | General use | Quick exploration | Posterior estimation |

## Cross-Validation

All algorithms support two levels of cross-validation:

### Outer Cross-Validation (K-Fold)

The training data is split into K folds (default K=5). The algorithm is run K times in parallel, each time using K-1 folds for training and 1 fold for validation. The final population gathers the Family of Best Models (FBM) from all folds.

```
Fold 1: Train on folds {2,3,4,5}, validate on fold 1
Fold 2: Train on folds {1,3,4,5}, validate on fold 2
...
Final: Combine FBMs, evaluate on full training data + external test set
```
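
The fold scheme can be sketched as a plain round-robin K-fold splitter (without the stratification described below); `kfold_splits` is a hypothetical helper written for this page, not the package API.

```python
def kfold_splits(n_samples, k=5):
    """Yield (train, valid) index lists for K-fold outer cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        # Train on the k-1 remaining folds, validate on the held-out one.
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield sorted(train), folds[held_out]

splits = list(kfold_splits(10, k=5))
```

Each sample appears in exactly one validation fold, and the K runs can proceed independently, which is what makes the parallel execution straightforward.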

### Inner Cross-Validation (Overfit Control)

Within each training run, the data is further split into inner folds. At each epoch, model fitness is penalized for the train-validation gap:

```
fitness = mean(train_fit - |train_fit - valid_fit| * overfit_penalty)
```

Inner folds can be re-drawn periodically (`resampling_inner_folds_epochs`) to prevent indirect learning of fold structure.
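
A small sketch of the penalty formula above, applied per inner fold and averaged; the helper name and example scores are illustrative only.

```python
def penalized_fitness(train_fits, valid_fits, overfit_penalty=1.0):
    """Mean over inner folds of train fit minus the penalized train/valid gap."""
    terms = [t - abs(t - v) * overfit_penalty
             for t, v in zip(train_fits, valid_fits)]
    return sum(terms) / len(terms)

# A stable model beats a higher-scoring but overfit one:
stable = penalized_fitness([0.80, 0.82], [0.78, 0.80])    # small gap
overfit = penalized_fitness([0.95, 0.93], [0.70, 0.72])   # large gap
```

The second model has the better raw training fit, yet the penalty on its train/validation gap ranks it below the stable model.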

### Stratification

Folds are stratified by class label to preserve class proportions. Since v0.7.4, **double stratification** is supported: folds are stratified first by class, then by a metadata variable (e.g., hospital, batch, sequencing center) to ensure that confounding factors are balanced across folds.
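
Double stratification can be approximated by assigning samples round-robin within each (class, metadata) cell; this sketch is an assumption about the mechanism for illustration, not the package's exact fold-assignment code.

```python
from collections import defaultdict

def double_stratified_folds(labels, groups, k=3):
    """Assign samples to folds round-robin within each (class, group) cell,
    so both class proportions and the metadata variable stay balanced."""
    folds = [[] for _ in range(k)]
    cells = defaultdict(list)
    for idx, (label, group) in enumerate(zip(labels, groups)):
        cells[(label, group)].append(idx)
    for members in cells.values():
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# Hypothetical labels and a confounder (e.g., hospital of origin):
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
folds = double_stratified_folds(labels, groups, k=3)
```

Every fold ends up with the same class balance and the same mix of the metadata variable, so a batch effect cannot masquerade as the class signal in any single fold.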

## Feature Pre-selection

Before running any algorithm, features are statistically pre-filtered to reduce the search space:

1. **Prevalence filter** (`feature_minimal_prevalence_pct`): Remove features present in fewer than N% of samples in both classes
2. **Mean filter** (`feature_minimal_feature_value`): Remove features with negligible mean abundance in both classes
3. **Statistical test**, with three options:
   - **Wilcoxon rank-sum** (default): Non-parametric, robust to skewness and outliers
   - **Student's t-test**: Parametric, assumes normality
   - **Bayesian Fisher's exact**: For binary/presence-absence data, uses Bayes factors
4. **Multiple testing correction**: Benjamini-Hochberg FDR for Wilcoxon and t-test
5. **Feature ranking**: Features are ranked by significance, optionally capped at `max_features_per_class` per class
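
The first two filters can be sketched as follows; the toy abundance values and the `passes_filters` helper are hypothetical, and the reading "drop only if the feature fails the thresholds in both classes" is an assumption based on the wording above.

```python
# Toy abundance table: feature -> per-sample values, split by class.
class0 = {"featA": [0.0, 0.0, 0.0, 0.1], "featB": [0.4, 0.5, 0.0, 0.6]}
class1 = {"featA": [0.0, 0.1, 0.0, 0.0], "featB": [0.1, 0.2, 0.3, 0.0]}

def prevalence_pct(values):
    return 100 * sum(v > 0 for v in values) / len(values)

def passes_filters(name, minimal_prevalence_pct=50, minimal_value=0.05):
    # Keep a feature if it is prevalent and abundant enough in at least one
    # class; a feature failing the thresholds in both classes is removed.
    for cls in (class0, class1):
        vals = cls[name]
        if (prevalence_pct(vals) >= minimal_prevalence_pct
                and sum(vals) / len(vals) >= minimal_value):
            return True
    return False

kept = [f for f in class0 if passes_filters(f)]
```

Here `featA` is rare in both classes and is dropped, while `featB` passes on the strength of class 0 alone.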