---
editor: visual
execute:
echo: false
warning: false
message: false
cache: true
cache.lazy: false
fig-align: center
---
```{r setup}
knitr::opts_chunk$set(fig.align = "center")
library(tidyverse)
library(brms)
library(bayesplot)
library(tidybayes)
library(patchwork)
library(GGally)
library(dagitty)
library(ggdag)
library(ggrepel)
library(rethinking)
data(milk)
detach(package:rethinking, unload = TRUE)
at <- c(-3, -2, -1, 0, 1, 2, 3)
```
# TLDR
We don't necessarily care about fitting the data well; we care about making good predictions about the future. Occam's razor should shape our models so that, all else being equal, we prefer parsimonious ones. How do we quantify the "parsimoniousness" of a model? We have a grab bag of tools to hopefully steer us between the twin dangers of over-fitting and under-fitting.
What are our tools?
**(1) Regularizing the prior**
- Don't get too excited about the data
- Analogous to the frequentist use of a penalized likelihood (e.g., Lasso, Ridge)
**(2) Scoring Devices**
- Information Criteria: AIC, BIC, WAIC
- Cross Validation: PSIS-LOO
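As a concrete (non-Bayesian) illustration of the scoring idea, AIC is just the in-sample deviance plus a penalty of 2 per parameter. A minimal base-R sketch on simulated data (the names `m1`, `m2`, `d` are mine, not from anything fit later in this chapter):

```{r}
#| eval: false
# AIC = -2*logLik + 2k: fit is rewarded, parameters are taxed.
set.seed(1)
d <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)  # adds a pure-noise predictor

AIC(m1)
AIC(m2)
# The extra covariate always improves the likelihood a little, but it must
# beat its +2 penalty to lower AIC -- pure noise typically doesn't.
```

The same trade-off (fit reward vs. complexity tax) is what WAIC and PSIS-LOO estimate for Bayesian models later in the chapter, just aimed at out-of-sample predictive accuracy rather than an asymptotic penalty.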
## The Problem with Parameters
Last chapter we spent a lot of time building causal models. Causal models are fantastic because they let you see the consequences of changing a variable, i.e. the $Do(X)$ operation. If we don't care about understanding the effects of our actions, and only want to predict what $Y$ will be, it's tempting to put as many variables into our regression as possible: the more variables we add, the better our model will inevitably fit the data.
$R^2$ is a common metric that incentivizes naive scientists to keep adding variables. $R^2$ is defined as:
$$R^2 = \frac{\text{var(outcome) - var(residuals)}}{\text{var(outcome)}} = 1 - \frac{\text{var(residuals)}}{\text{var(outcome)}}$$
$R^2$ always increases as more variables are added, even when the added variables are pure random noise with no relation to the outcome.
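A quick sanity check of this definition against `lm()`, on simulated data (the names `r2_by_hand` and `fit` are mine):

```{r}
#| eval: false
# R^2 = 1 - var(residuals)/var(outcome) matches lm()'s reported value
set.seed(7)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
fit <- lm(y ~ x)

r2_by_hand <- 1 - var(residuals(fit)) / var(y)
r2_from_lm <- summary(fit)$r.squared

all.equal(r2_by_hand, r2_from_lm)  # TRUE
```

(With an intercept in the model the residuals have mean zero, so the sample variances cancel their shared $n-1$ denominators and the two quantities agree exactly.)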
```{r}
#| fig-width: 5
#| fig-height: 4
# Simulate R^2 inflation: repeatedly regress pure noise on noise covariates
simulate_r2_inflation_robust <- function(n_sims = 30, n_covariates = 20, n_iterations = 100, seed = 123) {
set.seed(seed)
# Run multiple iterations
all_results <- map_dfr(1:n_iterations, function(iter) {
# Generate new data for each iteration
y <- rnorm(n_sims, mean = 0, sd = 1)
covariate_data <- map_dfc(1:n_covariates, ~rnorm(n_sims)) %>%
set_names(paste0("x", 1:n_covariates))
simData <- bind_cols(covariate_data, y = y)
# Calculate R-squared for each number of covariates
map_dfr(1:n_covariates, function(k) {
covars <- paste0("x", 1:k, collapse = " + ")
formula_str <- paste("y ~", covars)
model <- lm(as.formula(formula_str), data = simData)
tibble(
iteration = iter,
n_covariates = k,
r_squared = summary(model)$r.squared,
adjusted_r_squared = summary(model)$adj.r.squared
)
})
})
# Summarize across iterations
summary_results <- all_results %>%
group_by(n_covariates) %>%
summarise(
mean_r_squared = mean(r_squared),
se_r_squared = sd(r_squared) / sqrt(n()),
mean_adj_r_squared = mean(adjusted_r_squared),
se_adj_r_squared = sd(adjusted_r_squared) / sqrt(n()),
.groups = "drop"
)
return(summary_results)
}
# Run the simulation: 30 data points, up to 30 random covariates, 30 iterations
robust_results <- simulate_r2_inflation_robust(n_sims = 30, n_covariates = 30, n_iterations = 30)
robust_results %>%
ggplot(aes(x = n_covariates, y = mean_r_squared)) +
geom_ribbon(aes(ymin = mean_r_squared - 1.96 * se_r_squared,
ymax = mean_r_squared + 1.96 * se_r_squared),
alpha = 0.2, fill = "#e74c3c") +
geom_line(color = "#e74c3c", linewidth = 1.2) +
geom_point(color = "#e74c3c", size = 2) +
labs(
title = "R² Inflation with Random Covariates",
subtitle = "Average across 30 simulations of 30 data points and 30 random covariates",
x = "Number of Covariates",
y = "Mean R²",
caption = "Shaded area shows 95% confidence interval"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 12, color = "gray50"),
plot.caption = element_text(size = 10, color = "gray60")
) +
scale_y_continuous(limits = c(0, 1), labels = scales::percent_format())
```
While these more complex models fit the current data better, they will often predict new data worse: we have overfit the data at that point.
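A minimal sketch of that failure mode on simulated data, separate from the primate dataset below (the model names are mine): a 9th-order polynomial necessarily wins on in-sample $R^2$, yet its out-of-sample error can be far worse than the simple model's.

```{r}
#| eval: false
# True relationship is linear; we fit both a line and a 9th-order polynomial
set.seed(11)
n <- 20
x     <- runif(n, -2, 2)
y     <- x + rnorm(n, sd = 1)
x_new <- runif(n, -2, 2)          # fresh data from the same process
y_new <- x_new + rnorm(n, sd = 1)

m_lin  <- lm(y ~ x)
m_poly <- lm(y ~ poly(x, 9))

# In-sample: the polynomial always has higher R^2 (nested models)
summary(m_lin)$r.squared < summary(m_poly)$r.squared  # TRUE

# Out-of-sample: compare mean squared prediction error on the fresh data
mse <- function(m, x, y) mean((y - predict(m, newdata = data.frame(x = x)))^2)
mse(m_lin,  x_new, y_new)
mse(m_poly, x_new, y_new)  # typically (much) larger for the polynomial
```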
### Brain Body Vignette
Plotted below is the relationship between the body size of several primate species and their corresponding brain size. There's not a strong *a priori* reason to think body and brain size are perfectly linearly related. The true relationship could be any number of polynomial or logarithmic functions. Let's go through a few to see what's gained and what's lost by fitting ever more complex functions to the data.
```{r}
#| fig-width: 5
#| fig-height: 4
speciesBrain <-
tibble(species = c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens"),
brain = c(438, 452, 612, 521, 752, 871, 1350),
mass = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5))
p1 <- ggplot(data = speciesBrain, aes(x = mass, y = brain))+
geom_point(shape = 21, color = "dodgerblue2", fill = "blue4", size =3)+
theme_minimal()+
ggrepel::geom_text_repel(aes(label = species), size = 3)+
labs(x = "body mass (kg)", y = "brain volume (cc)")
p1
p1_built <- ggplot_build(p1)
# Extract the axis limits and breaks
x_limits <- p1_built$layout$panel_params[[1]]$x.range
y_limits <- p1_built$layout$panel_params[[1]]$y.range
x_limits_z <- (x_limits - mean(speciesBrain$mass))/sd(speciesBrain$mass)
```
### Linear
The simplest model is a linear one.
$$\text{Brain Size}_i \sim \text{Normal}(\mu_i, \sigma)$$
$$\mu_i = \alpha + \beta_b\text{Body Mass}_i$$
$$\alpha \sim \text{Normal}(0.5, 1)$$
$$\beta_b \sim \text{Normal}(0, 3)$$
$$\sigma \sim \text{Log-Normal}(0, 1)$$
```{r}
speciesBrain <- speciesBrain %>%
mutate(mass_z = (mass - mean(mass))/sd(mass),
brain_z = (brain - mean(brain))/sd(brain))
R2_fun<- function(brm_fit, seed = 7, ...) {
set.seed(seed)
p <- predict(brm_fit, summary = F, ...)
# in my experience, it's more typical to define residuals as the criterion minus the predictions
r <- speciesBrain$brain_z - apply(p, 2, mean)
1 - rethinking::var2(r) / rethinking::var2(speciesBrain$brain_z )
}
```
```{r make poly models}
b7.0 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z,
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4,
backend = "cmdstanr", silent = 2, file = "fits/b07.0.6"
)
#quadratic
b7.1 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z + I(mass_z^2),
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4,
backend = "cmdstanr", silent = 2, file = "fits/b07.1.0"
)
#cubic
b7.2 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z + I(mass_z^2) + I(mass_z^3),
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4,
backend = "cmdstanr", silent = 2, file = "fits/b07.2.0"
)
b7.3 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z + I(mass_z^2) + I(mass_z^3) + I(mass_z^4),
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4,
backend = "cmdstanr", silent = 2, file = "fits/b07.3.0"
)
b7.4 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z + I(mass_z^2) + I(mass_z^3) + I(mass_z^4) + I(mass_z^5),
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4, adapt_delta = 0.99,
backend = "cmdstanr", silent = 2, file = "fits/b07.4.1",
)
b7.5 <- brm(data = speciesBrain,
family = gaussian,
brain_z ~ 1 + mass_z + I(mass_z^2) + I(mass_z^3) + I(mass_z^4) + I(mass_z^5) + I(mass_z^6),
prior = c(
prior(normal(0.5, 1), class = Intercept),
prior(normal(0, 3), class = b),
prior(lognormal(0,1), class = sigma)
),
iter = 2000, warmup = 500, seed = 4, cores = 4, adapt_delta = 0.999, max_treedepth = 15,
backend = "cmdstanr", silent = 2, file = "fits/b07.5.2"
)
```
```{r get heat map draws}
simulate_brain_model <- function(model_fit, polynomial_terms = 1) {
as_draws_df(model_fit) %>%
as_tibble() %>%
mutate(
simMass = seq(from = min(speciesBrain$mass_z), to = max(speciesBrain$mass_z), length.out = n()),
simBrainEst = if (polynomial_terms == 1) {
Intercept + (b_mass_z * simMass)
} else if (polynomial_terms == 2) {
Intercept + (b_mass_z * simMass) + (b_Imass_zE2 * simMass^2)
} else if (polynomial_terms == 3) {
Intercept + (b_mass_z * simMass) + (b_Imass_zE2 * simMass^2) + (b_Imass_zE3 * simMass^3)
} else if (polynomial_terms == 4) {
Intercept + (b_mass_z * simMass) + (b_Imass_zE2 * simMass^2) + (b_Imass_zE3 * simMass^3) + (b_Imass_zE4 * simMass^4)
} else if (polynomial_terms == 5) {
Intercept + (b_mass_z * simMass) + (b_Imass_zE2 * simMass^2) + (b_Imass_zE3 * simMass^3) + (b_Imass_zE4 * simMass^4) + (b_Imass_zE5 * simMass^5)
} else if (polynomial_terms == 6) {
Intercept + (b_mass_z * simMass) + (b_Imass_zE2 * simMass^2) + (b_Imass_zE3 * simMass^3) + (b_Imass_zE4 * simMass^4) + (b_Imass_zE5 * simMass^5) + (b_Imass_zE6 * simMass^6)
},
simBrain = rnorm(n(), simBrainEst, sd = sigma),
simMass_orig = simMass * sd(speciesBrain$mass) + mean(speciesBrain$mass),
simBrainEst_orig = simBrainEst * sd(speciesBrain$brain) + mean(speciesBrain$brain),
simBrain_orig = simBrain * sd(speciesBrain$brain) + mean(speciesBrain$brain)
)
}
b7.0_sim <- simulate_brain_model(b7.0, polynomial_terms = 1)
b7.1_sim <- simulate_brain_model(b7.1, polynomial_terms = 2)
b7.2_sim <- simulate_brain_model(b7.2, polynomial_terms = 3)
b7.3_sim <- simulate_brain_model(b7.3, polynomial_terms = 4)
b7.4_sim <- simulate_brain_model(b7.4, polynomial_terms = 5)
b7.5_sim <- simulate_brain_model(b7.5, polynomial_terms = 6)
```
```{r draw lines}
#| eval: false
new_data_sim <- tibble(mass_z = seq(from = x_limits_z[[1]], to = x_limits_z[[2]], length.out = 60))
brain_loo_lines <- function(brms_fit, row, ...) {
# # refit the model
new_fit <-
update(brms_fit,
newdata = filter(speciesBrain, row_number() != row),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 7,
refresh = 0,
...)
# pull the lines values
fitted(new_fit,
newdata = new_data_sim) %>%
data.frame() %>%
select(Estimate) %>%
bind_cols(new_data_sim)
}
b7.0_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.0, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) - (1.5 *sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 *sd(speciesBrain$brain_z)))
b7.1_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.1, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) -(1.5 * sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 *sd(speciesBrain$brain_z)))
b7.2_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.2, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) - (1.5 *sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 *sd(speciesBrain$brain_z)))
b7.3_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.3, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) - (1.5 *sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 *sd(speciesBrain$brain_z)))
b7.4_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.4, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) - (1.5 *sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 *sd(speciesBrain$brain_z)))
b7.5_fits <-
tibble(row = 1:7) %>%
mutate(post = purrr::map(row, ~brain_loo_lines(brms_fit = b7.5, row = .))) %>%
unnest(post) %>%
filter(Estimate >= min(speciesBrain$brain_z) - (1.5 * sd(speciesBrain$brain_z)) & Estimate <= max(speciesBrain$brain_z) + (1.5 * sd(speciesBrain$brain_z)))
```
```{r linear plot}
ggplot() +
labs(title = "Linear",x = "body mass (kg)", y = "brain volume (cc)") +
theme_minimal() +
guides(fill = "none")+
coord_cartesian(xlim = x_limits * c(0.99, 1.01),
ylim = y_limits * c(0.99, 1.01)) +
stat_density_2d(data = b7.0_sim,
aes(x = simMass_orig, y = simBrainEst_orig, fill = after_stat(ndensity)),
geom = "raster", contour = FALSE) +
scale_fill_viridis_c(option = "magma") +
geom_point(data = speciesBrain,
aes(x = mass, y = brain),
shape = 21, color = "white", fill = "black", lwd = 3, alpha = 1, size = 2.5) +
ggrepel::geom_text_repel(data = speciesBrain,
aes(x = mass, y = brain, label = species),
size = 4, color = "white",
box.padding = 2.5, point.padding = 1)
```
This model has an $R^2$ value of `r round(R2_fun(b7.0), 2)`.
### Polynomial
For some comparison models we can add polynomial terms of *Mass* to our regression. For example, the second-degree polynomial relating body size to brain size is a parabola of the form:
$$\mu_i = \alpha + \beta_1\text{Body Mass}_i + \beta_2(\text{Body Mass}_i)^2$$
We'll make 4 more models like this, each adding another polynomial term such as $\beta_3(\text{Body Mass}_i)^3$, $\beta_4(\text{Body Mass}_i)^4$, and so on.
Below, the posterior $\mu$ for each of our six models is plotted with the data. (Note this is not the posterior predictive distribution, since it doesn't include $\sigma$.)
In the figure below, each panel shows a heat map of where the model places the $\mu$ estimate given all of the data. Overlaid in gold are the best-fit lines obtained by refitting the model with a different data point left out each time. This is a "leave one out" (LOO) approach that shows how well the model would predict a data point it hadn't seen. There are seven such lines per model because there are seven data points we can drop.
For the simple 1st-order linear regression, we could remove any one point from the sample and get pretty much the same regression line. In contrast, the most complex, 6th-order model is very sensitive to the sample: the predicted mean changes course dramatically if any one point is removed. You can see this in the plots below. On the top left, each line is the best fit for the linear regression; the curves on the bottom right are different sixth-order polynomials. Notice that the straight lines hardly vary, while the curves fly about wildly.
```{r}
#| eval: false
base_plt2 <- ggplot() +
# labs(title = "brain ~ mass", subtitle = "Mu estimate",
# x = "body mass (kg)", y = "brain volume (cc)") +
theme_minimal() +
guides(fill = "none")+
coord_cartesian(xlim = x_limits * c(0.995, 1.005),
ylim = y_limits * c(0.995, 1.005))
add_heatmap_and_points <- function(base_plot, sim_data, fit_lines) {
base_plot +
stat_density_2d(data = sim_data,
aes(x = simMass_orig, y = simBrainEst_orig, fill = after_stat(ndensity)),
geom = "raster", contour = FALSE) +
scale_fill_viridis_c(option = "magma") +
geom_point(data = speciesBrain,
aes(x = mass, y = brain),
shape = 21, color = "white", fill = "black", lwd = 3, alpha = 1, size = 2.5) +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
) +
geom_line(data = fit_lines, aes(x = (mass_z * sd(speciesBrain$mass)) + mean(speciesBrain$mass),
y = (Estimate * sd(speciesBrain$brain)) + mean(speciesBrain$brain), group = row),
color = "gold1", alpha = .5)
}
p2.0 <- add_heatmap_and_points(base_plt2, b7.0_sim, b7.0_fits) + labs(title = "Linear") + theme(plot.title = element_text(size = 10))
p2.1 <- add_heatmap_and_points(base_plt2, b7.1_sim, b7.1_fits) + labs(title = "Quadratic") + theme(plot.title = element_text(size = 10))
p2.2 <- add_heatmap_and_points(base_plt2, b7.2_sim, b7.2_fits) + labs(title = "Cubic") + theme(plot.title = element_text(size = 10))
p2.3 <- add_heatmap_and_points(base_plt2, b7.3_sim, b7.3_fits) + labs(title = "4th order") + theme(plot.title = element_text(size = 10))
p2.4 <- add_heatmap_and_points(base_plt2, b7.4_sim, b7.4_fits) + labs(title = "5th order") + theme(plot.title = element_text(size = 10))
p2.5 <- add_heatmap_and_points(base_plt2, b7.5_sim, b7.5_fits) + labs(title = "6th order") + theme(plot.title = element_text(size = 10))
polyCompPlt <- p2.0 + p2.1 + p2.2 + p2.3 + p2.4 + p2.5
saveRDS(polyCompPlt, "plots/polyCompPlt.rds")
```
```{r}
#| fig-width: 10
#| fig-height: 6
readRDS("plots/polyCompPlt.rds")
```
The overfit polynomial models manage to fit the data extremely well, but they pay for this within-sample accuracy by making nonsensical out-of-sample predictions. In contrast, underfitting produces models that are inaccurate both within and out of sample. They have learned too little, failing to recover regular features of the sample.
Another perspective on the absurd models just above is to view model fitting as a form of data compression. Parameters summarize relationships among the data. These summaries compress the data into a simpler form, although with loss of information ("lossy" compression) about the sample. The parameters can then be used to generate new data, effectively decompressing the data.
When a model has a parameter to correspond to each datum, then there is actually no compression. The model just encodes the raw data in a different form, using parameters instead. As a result, we learn nothing about the data from such a model. Learning about the data requires using a simpler model that achieves some compression, but not too much. This view of model selection is often known as Minimum Description Length (MDL).
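The zero-compression limit is easy to demonstrate with the brain-size data itself: a 6th-order polynomial mean fit to the seven species has seven mean parameters, so it passes through every point exactly (a plain `lm()` sketch, not one of the brms fits above):

```{r}
#| eval: false
# Seven data points, seven mean parameters: the "model" re-encodes the sample
brain <- c(438, 452, 612, 521, 752, 871, 1350)
mass  <- c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)

fit_saturated <- lm(brain ~ poly(mass, 6))
summary(fit_saturated)$r.squared  # 1: zero residual variance, zero compression
```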