Problem
More MCMC iterations lead to worse test performance — classic overfitting:
| n_iter | lambda | Train AUC | Test AUC |
|--------|--------|-----------|----------|
| 200    | 0.001  | 0.925     | 0.738    |
| 1000   | 0.001  | 0.960     | 0.662    |
| 1000   | 0.01   | 0.960     | 0.687    |
Root cause
SBS eliminates features based solely on training posterior probability — no holdout or CV during feature elimination. As n_iter increases, the posterior concentrates on features that fit the training data, not features that generalize.
Proposed fixes (by complexity)
1. Inner holdout during SBS (medium)
Split train data into train/valid before SBS. At each step, run MCMC on train, evaluate eliminated candidates on valid. Drop the feature whose removal least hurts validation performance.
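A minimal sketch of this inner-holdout loop. `score_on_valid` is a hypothetical stand-in for "fit MCMC on the training split with this subset, then compute validation AUC"; the real fitting and scoring code is not shown here.

```python
def sbs_with_holdout(features, score_on_valid, n_min=2):
    """Backward selection against a held-out split: at each step, drop
    the feature whose removal hurts validation score the least.
    `score_on_valid(subset)` is a hypothetical stand-in for: run MCMC on
    the training split using `subset`, then score the validation split."""
    current = list(features)
    best_score = score_on_valid(current)
    while len(current) > n_min:
        # Score every one-feature-removed candidate on the validation split.
        candidates = [(score_on_valid([f for f in current if f != g]), g)
                      for g in current]
        cand_score, victim = max(candidates)
        if cand_score < best_score:
            break  # every possible removal hurts validation -> stop early
        current.remove(victim)
        best_score = cand_score
    return current
```

The `break` when all candidate removals lower the validation score means this variant also stops before reaching `n_min` if generalization starts degrading.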
2. Bayesian model averaging (medium)
Instead of committing to one feature subset via SBS, average predictions over multiple subsets weighted by their posterior model probability. This naturally regularizes by not over-committing.
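A sketch of the averaging step, assuming each candidate subset exposes a log marginal likelihood and a predict function (a hypothetical interface; the real model objects may differ). Weights are posterior model probabilities under a uniform model prior.

```python
import math

def bma_predict(x, models):
    """Bayesian model averaging over feature subsets.
    `models` is a list of (log_marginal_likelihood, predict_fn) pairs,
    e.g. one per subset visited during SBS (hypothetical interface).
    Weights are normalized posterior model probabilities under a
    uniform prior over models."""
    log_ml = [lm for lm, _ in models]
    m = max(log_ml)                               # stabilize the exponentials
    weights = [math.exp(l - m) for l in log_ml]
    z = sum(weights)
    return sum((w / z) * predict(x)
               for w, (_, predict) in zip(weights, models))
```

Because no single subset gets weight 1, a subset that narrowly "wins" on training posterior cannot dominate the prediction, which is the regularizing effect described above.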
3. Spike-and-slab prior with stronger shrinkage (low)
The current L2 prior (lambda) weakly regularizes coefficients. A spike-and-slab prior explicitly models P(feature included) vs P(feature excluded) and can be tuned to be more aggressive.
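For concreteness, a continuous spike-and-slab log prior in the George–McCulloch style: a mixture of a narrow "spike" Gaussian (coefficient effectively excluded) and a wide "slab" Gaussian (included). The values of `pi`, `spike_sd`, and `slab_sd` below are illustrative assumptions, not tuned settings.

```python
import math

def spike_slab_logprior(beta, pi=0.1, spike_sd=0.01, slab_sd=1.0):
    """Log density of a two-Gaussian spike-and-slab mixture prior.
    `pi` is the prior inclusion probability: lower pi = stronger
    shrinkage toward exclusion. Parameter values are illustrative."""
    def log_norm(x, sd):
        return -0.5 * (x / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
    spike = math.log1p(-pi) + log_norm(beta, spike_sd)  # excluded component
    slab = math.log(pi) + log_norm(beta, slab_sd)       # included component
    m = max(spike, slab)                                # log-sum-exp
    return m + math.log(math.exp(spike - m) + math.exp(slab - m))
```

Replacing the L2 log prior with this inside the MCMC target would let inclusion be tuned directly via `pi` rather than indirectly via lambda.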
4. Early stopping on SBS (simple)
Monitor a held-out validation metric during SBS and stop eliminating features when validation performance starts to decrease, even if `nmin` hasn't been reached.
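The stopping rule itself can be isolated from the elimination loop. A sketch with a patience window (the `patience` knob is an illustrative assumption, not a parameter from the existing code):

```python
def stop_index(valid_scores, patience=2):
    """Early stopping for SBS: given the validation metric recorded
    after each elimination step, return the step index to roll back to
    -- the best score seen so far, once `patience` consecutive steps
    fail to improve on it."""
    best, best_i, bad = float("-inf"), 0, 0
    for i, score in enumerate(valid_scores):
        if score > best:
            best, best_i, bad = score, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_i
```

Rolling back to the returned index, rather than stopping at the current step, avoids committing to the over-pruned subsets produced during the patience window.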
5. Cross-validated SBS (high, most principled)
Run K-fold CV at each SBS step — each fold runs MCMC independently, elimination decision based on average posterior across folds.
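One SBS step under this scheme might look like the sketch below. `fold_scores(subset, fold)` is a hypothetical stand-in for "run MCMC on that fold's training split with `subset` and score its validation split"; the real per-fold fitting is not shown.

```python
def cv_eliminate(features, fold_scores, k=3):
    """One cross-validated SBS step: for each candidate feature, average
    the per-fold score of the subset with that feature removed, then
    drop the feature whose removal yields the best average.
    `fold_scores(subset, fold)` is a hypothetical stand-in for fitting
    MCMC on fold `fold`'s training split and scoring its held-out split."""
    def avg_score(subset):
        return sum(fold_scores(subset, f) for f in range(k)) / k
    removal_score = {g: avg_score([f for f in features if f != g])
                     for g in features}
    drop = max(removal_score, key=removal_score.get)
    return [f for f in features if f != drop], drop
```

Averaging across independently fitted folds is what makes this the most principled (and the most expensive) option: each elimination costs K MCMC runs per remaining feature.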
References
- O'Hara, R.B. & Sillanpää, M.J. (2009). A review of Bayesian variable selection methods. Bayesian Analysis.
- Mitchell, T.J. & Beauchamp, J.J. (1988). Bayesian variable selection in linear regression. JASA.
- Ishwaran, H. & Rao, J.S. (2005). Spike and slab variable selection. Annals of Statistics.