Markov Chain Monte Carlo (MCMC) is a Bayesian approach to feature selection and model estimation in Gpredomics. Unlike the optimization-based algorithms (GA, ACO, SA, Beam), MCMC samples from the posterior distribution of models, providing probabilistic statements about feature importance and coefficient uncertainty. It is combined with Sequential Backward Selection (SBS) for automatic dimensionality reduction.
```
All pre-selected features (p features)
|
v
┌─────────────────────────────────────────────┐
│ MCMC Posterior Sampling (n_iter iterations) │
│ │
│ For each iteration: │
│ 1. Optimize betas (a, b, c) via Brent │
│ 2. Sample betas from truncated normals │
│ 3. Propose coef flips for each feature │
│ 4. Accept/reject via Metropolis-Hastings │
│ │
│ After burn-in: record posterior samples │
└─────────────────────────────────────────────┘
|
v
Compute feature probabilities: P(+1), P(0), P(-1)
|
v
Drop feature with highest P(neutral)
|
v
┌─────────────────────────────────────────────┐
│ SBS: Repeat MCMC with p-1 features │
│ then p-2, p-3, ..., nmin features │
└─────────────────────────────────────────────┘
|
v
Select optimal subset by log evidence
|
v
Final MCMC run on optimal feature subset
```
The algorithm:
- Pre-select features using statistical tests (same as other algorithms)
- Initialize a model with all selected features, random coefficients in {-1, 0, +1}, and betas (a=1, b=-1, c=1)
- Run MCMC for n_iter iterations:
  - Optimize each beta (a, b, c) using Brent's method, then sample from a truncated normal
  - For each feature, propose a coefficient change and accept/reject via Metropolis-Hastings
- Discard the first n_burn iterations (burn-in period)
- Compute posterior feature probabilities from post-burn-in samples
- SBS loop: drop the feature with highest neutral probability, repeat MCMC with fewer features
- Select the feature subset that maximizes log evidence
- Final run: recompute full MCMC on the optimal subset
The MCMC approach models the probability of class membership using logistic regression with structured coefficients:
P(y=1 | x, model) = logistic(a * sum_pos + b * sum_neg + c)
Where:
- sum_pos = sum of feature values where coefficient = +1
- sum_neg = sum of feature values where coefficient = -1
- a (positive coefficient weight): constrained to be positive via truncated normal sampling
- b (negative coefficient weight): constrained to be negative via truncated normal sampling
- c (intercept): unconstrained, sampled from a normal distribution
This structure mirrors the BTR model languages, but adds continuous scaling factors (a, b, c) on top of the discrete coefficients, whereas the BTR languages implicitly fix the scaling at 1.
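To make the scoring concrete, here is a minimal Rust sketch of how a single sample's class probability could be computed from the ternary coefficients and the betas; it is illustrative only, and the function and variable names are not the Gpredomics API.

```rust
/// Minimal sketch of the structured logistic score described above.
/// `x[j]` is the value of feature j for one sample; `coefs[j]` is its
/// ternary coefficient (-1, 0 or +1); a, b, c are the sampled betas.
fn predict_proba(x: &[f64], coefs: &[i8], a: f64, b: f64, c: f64) -> f64 {
    let mut sum_pos = 0.0; // sum of features with coefficient +1
    let mut sum_neg = 0.0; // sum of features with coefficient -1
    for (&xi, &ci) in x.iter().zip(coefs.iter()) {
        match ci {
            1 => sum_pos += xi,
            -1 => sum_neg += xi,
            _ => {} // coefficient 0: the feature does not contribute
        }
    }
    let z = a * sum_pos + b * sum_neg + c;
    1.0 / (1.0 + (-z).exp()) // logistic(z) = P(y=1 | x, model)
}
```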
The algorithm uses a two-phase sampling scheme within each iteration:
Phase 1 -- Continuous parameters (betas): For each beta (a, b, c):
- Find the mode via Brent optimization (minimizing negative log posterior)
- Compute the local curvature (sigma) at the mode
- Sample a new value from a truncated normal centered at the mode:
- Beta a: truncated to positive values
- Beta b: truncated to negative values
- Beta c: unrestricted normal
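A minimal sketch of the constrained draw for beta a, assuming the mode and local sigma have already been found by the 1-D optimization; it uses the rand and rand_distr crates, and simple rejection against the sign constraint as a stand-in for a dedicated truncated-normal sampler.

```rust
use rand_distr::{Distribution, Normal};

/// Sketch of the constrained sampling step for beta `a`.
/// Rejection against the positivity constraint stands in for a true
/// truncated-normal sampler; `mode` and `sigma` come from the 1-D optimization.
fn sample_positive_beta(mode: f64, sigma: f64) -> f64 {
    let mut rng = rand::thread_rng();
    let proposal = Normal::new(mode, sigma).expect("sigma must be positive and finite");
    loop {
        let draw = proposal.sample(&mut rng);
        if draw > 0.0 {
            return draw; // keep only draws satisfying the a > 0 constraint
        }
    }
}
```

Beta b would use the mirror-image constraint (draw < 0), and beta c needs no rejection at all.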
Phase 2 -- Discrete parameters (feature coefficients): For each feature:
- Propose a new coefficient: ((current + r) mod 3) - 1 with r drawn uniformly from {2, 3}, which always yields a value in {-1, 0, +1} different from the current one
- Compute the log posterior ratio between proposed and current states
- Accept with probability min(1, exp(log_posterior_new - log_posterior_old))
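The per-feature update can be sketched as follows; `log_posterior` is a hypothetical closure standing in for the likelihood-plus-prior computation described next, not the actual Gpredomics API, and the rand crate is assumed.

```rust
use rand::Rng;

/// Sketch of the Metropolis-Hastings update for one feature's ternary coefficient.
/// `log_posterior` is a hypothetical helper returning log P(model | data)
/// for a coefficient vector (see the log-posterior sketch below).
fn update_coefficient(coefs: &mut [i8], j: usize, log_posterior: impl Fn(&[i8]) -> f64) {
    let mut rng = rand::thread_rng();
    let current = coefs[j];

    // Propose a different value in {-1, 0, +1}: ((current + r) mod 3) - 1, r in {2, 3}.
    let r: i8 = rng.gen_range(2..=3);
    let proposed = (current + r).rem_euclid(3) - 1;

    let log_p_old = log_posterior(coefs);
    coefs[j] = proposed;
    let log_p_new = log_posterior(coefs);

    // Accept with probability min(1, exp(log_p_new - log_p_old)); otherwise revert.
    let accept_prob = (log_p_new - log_p_old).exp().min(1.0);
    if rng.gen::<f64>() >= accept_prob {
        coefs[j] = current; // rejected: restore the previous coefficient
    }
}
```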
The log posterior combines the likelihood and a Gaussian prior:
log P(model | data) = log_likelihood + log_prior
log_likelihood = sum_i [ y_i * log(sigma(z_i)) + (1 - y_i) * log(sigma(-z_i)) ]
where sigma is the logistic function and z_i = a * sum_pos_i + b * sum_neg_i + c
log_prior = -lambda * (a^2 + b^2 + c^2)
The lambda parameter controls regularization strength: higher lambda shrinks betas toward zero, preferring simpler models.
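A direct transcription of these formulas into a sketch (illustrative names, not the internal API) looks like this; it operates on the per-sample sums over the +1 and -1 features.

```rust
/// Sketch of the log posterior described above: Bernoulli log-likelihood of the
/// structured logistic model plus the Gaussian (L2) prior on the betas.
/// `sum_pos[i]` / `sum_neg[i]` are the per-sample sums over the +1 / -1 features,
/// and `y[i]` is the binary class label of sample i.
fn log_posterior(
    y: &[u8],
    sum_pos: &[f64],
    sum_neg: &[f64],
    a: f64,
    b: f64,
    c: f64,
    lambda: f64,
) -> f64 {
    let mut log_lik = 0.0;
    for i in 0..y.len() {
        let z = a * sum_pos[i] + b * sum_neg[i] + c;
        // log sigma(z) = -ln(1 + e^(-z)); log sigma(-z) = -ln(1 + e^(z))
        let log_sigmoid = if y[i] == 1 {
            -((-z).exp().ln_1p())
        } else {
            -(z.exp().ln_1p())
        };
        log_lik += log_sigmoid;
    }
    let log_prior = -lambda * (a * a + b * b + c * c);
    log_lik + log_prior
}
```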
The first n_burn iterations are discarded to allow the chain to converge to the stationary distribution. During burn-in, the sampler explores broadly; after burn-in, samples approximate the true posterior.
Only post-burn-in samples contribute to:
- Feature inclusion probabilities P(+1), P(0), P(-1)
- Beta coefficient means and variances
- Log posterior statistics
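For example, the per-feature inclusion probabilities can be estimated by simple counting over the post-burn-in trace; this is a sketch with hypothetical names, not the stored trace format.

```rust
/// Sketch: estimate P(+1), P(0) and P(-1) for one feature by counting its
/// sampled coefficient values over the post-burn-in trace.
fn inclusion_probabilities(trace: &[i8]) -> (f64, f64, f64) {
    let n = trace.len() as f64;
    let count = |v: i8| trace.iter().filter(|&&c| c == v).count() as f64 / n;
    (count(1), count(0), count(-1))
}
```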
SBS wraps the core MCMC to perform automatic feature selection:
- Run MCMC with all p pre-selected features
- Identify the feature with the highest P(neutral) -- the least important feature
- Remove that feature and run MCMC again with p-1 features
- Repeat until only nmin features remain
- For each step, compute the log evidence: log10(posterior_mean) - n_features * log10(3)
- Select the step with the highest log evidence as the optimal feature subset
- Re-run MCMC on the optimal subset to produce the final posterior
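Putting the pieces together, the SBS wrapper can be sketched as below; `run_mcmc` is a hypothetical stand-in for a full MCMC run that returns, for a given feature subset, the per-feature neutral probabilities P(0) and the posterior mean used in the log-evidence score.

```rust
/// Sketch of the SBS wrapper around the core MCMC. `run_mcmc` is a hypothetical
/// closure: for a feature subset it returns (per-feature P(coef = 0), posterior mean).
fn sbs(
    mut features: Vec<usize>,
    nmin: usize,
    run_mcmc: impl Fn(&[usize]) -> (Vec<f64>, f64),
) -> Vec<usize> {
    let mut best_subset = features.clone();
    let mut best_evidence = f64::NEG_INFINITY;
    loop {
        let (p_neutral, posterior_mean) = run_mcmc(&features);
        // Log evidence: log10(posterior mean) penalized by model size
        // (each feature has 3 possible coefficient states).
        let evidence = posterior_mean.log10() - features.len() as f64 * 3f64.log10();
        if evidence > best_evidence {
            best_evidence = evidence;
            best_subset = features.clone();
        }
        if features.len() <= nmin {
            break;
        }
        // Drop the feature whose coefficient is most often 0 (the least informative one).
        let drop_idx = p_neutral
            .iter()
            .copied()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        features.remove(drop_idx);
    }
    best_subset
}
```

The subset returned here would then be passed to the final MCMC run described above.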
| Aspect | BTR (GA/ACO/SA/Beam) | MCMC |
|---|---|---|
| Feature coefficients | Discrete: {-1, 0, +1} | Discrete: {-1, 0, +1} (sampled) |
| Scaling | None (fixed at 1) | Continuous betas (a, b, c) |
| Uncertainty | None (point estimate) | Full posterior distribution |
| Feature importance | Frequency in FBM | Posterior inclusion probability |
| Language | Binary/Ternary/Ratio/Pow2 | Generic (single language) |
The MCMC implementation is currently in beta status with the following limitations:
- Single-threaded: the MCMC chain is inherently sequential; no parallelization across chains
- Single language: only the generic MCMC language is supported (no binary/ternary/ratio/pow2 distinction during sampling)
- Single data type: only one data type per run (if multiple are specified, only the first is used)
- Memory-intensive: SBS stores the full posterior trace for each feature count, which can be large for many features or iterations
- Slow for many features: each SBS step runs a full MCMC chain; with p pre-selected features and a floor of nmin, this means about p - nmin + 1 full MCMC runs (plus the final re-run)
- No GPU support: all computation is CPU-based
- No cross-validation: MCMC does not currently integrate with the CV framework
```yaml
mcmc:
  n_iter: 1000    # Total MCMC iterations per chain
  n_burn: 500     # Burn-in iterations to discard
  lambda: 0.1     # L2 regularization strength for betas
  nmin: 5         # Minimum features to keep in SBS (0 = no SBS)
```

| Parameter | Effect of increasing | Recommended |
|---|---|---|
| n_iter | Better posterior approximation, slower | 1000--10000 |
| n_burn | More burn-in discarded, must be < n_iter | 50% of n_iter |
| lambda | Stronger regularization, smaller betas | 0.01--1.0 |
| nmin | SBS stops earlier, keeps more features (0 = skip SBS) | 3--20 |
Quick start: Use n_iter: 2000, n_burn: 1000, lambda: 0.1. Set nmin to your target model size or 0 to run MCMC without SBS.
Use MCMC when:
- You need posterior uncertainty estimates for feature importance
- You want probabilistic feature inclusion/exclusion (not just point estimates)
- You need to quantify how confidently each feature contributes to prediction
- You want an automated feature elimination procedure (SBS)
- The number of pre-selected features is moderate (< 100)
Don't use MCMC when:
- Speed is critical (MCMC is significantly slower than GA/SA/Beam)
- You have many features (> 200 pre-selected) -- SBS will be very slow
- You need multi-language comparison (binary vs ternary vs ratio)
- You need GPU acceleration
- You need cross-validation integrated into the search
```yaml
general:
  algo: mcmc
  language: ter
  data_type: prev
  seed: 42
  fit: auc
mcmc:
  n_iter: 5000
  n_burn: 2500
  lambda: 0.1
  nmin: 5
```

For a quick exploratory run without SBS:
```yaml
mcmc:
  n_iter: 2000
  n_burn: 1000
  lambda: 0.1
  nmin: 0    # Disable SBS, run MCMC on all pre-selected features
```

- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21(6), 1087--1092.
- Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57(1), 97--109.
- Robert, C. P. & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.
- O'Hara, R. B. & Sillanpää, M. J. (2009). A Review of Bayesian Variable Selection Methods. Bayesian Analysis, 4(1), 85--117.
- Prifti, E. et al. (2020). Interpretable and accurate prediction models for metagenomics data. GigaScience, 9(3), giaa010.
Last updated: v0.9.0