srcd2021_bayesian/01.Intro_Bayes.Rmd at master · agbeckner/srcd2021_bayesian · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
---
title: "Session 1: Introduction to Bayesian Inference"
author: 'Angeline Tsui & MH Tessler'
date: '`r Sys.Date()`'
output:
  tufte::tufte_html:
    toc: yes
    toc_depth: 1
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: pdflatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: pdflatex
link-citations: yes
---

```{r, echo=FALSE}
library(knitr)
library(tidyverse)
opts_chunk$set(echo=TRUE,
               warning=FALSE, message=FALSE,
               cache=FALSE)
```

This document is developed by [Angeline Tsui](mailto:angelinetsui@gmail.com) and [MH Tessler](mailto:michael.h.tessler@gmail.com), and all documents can be found [here](https://github.com/angelinetsui/srcd2021_bayesian.git).

# Introduction and installation of the brms package in R

This short tutorial covers some basic concepts of Bayesian inference.

We will use the package `brms` to illstrate some examples of running statistical modes in our workshop. `brms` is based on the probabilistic programming language Stan and you will need to install an underlying C++ compiler. Please follow the instructions [here](https://github.com/paul-buerkner/brms#faq). For Windows, you can install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). For Mac, you need to install Xcode (available from the Mac App Store).

## Learning goals

By the end of this session, you will have a basic understanding of the following:

* **Probability in the Frequestist vs Probability in the Bayesian world**
* **Bayes' Theorem**
* **Three important elements of Bayesian statistics: prior, likelihood & posterior**

# Probability in the Frequestist vs Probability in the Bayesian world

What is a probability? If you ask a mathematician, they will tell you "a probability is a number between 0 and 1". This may be satisfactory for those who care only about numbers, but what does a probability represent *in the world*?

Philosophers of mathematics and probability divide into two camps when it comes to answering this question: the Frequentists and the Bayesians (sometimes also called the *subjectivists*). The distinction is not only a philosophical one. In fact, the different definitions of probability give rise to different fields of statistics -- Frequentists Statistics and Bayesian Statisics -- which  has practical implications for our everyday lives as scientists. But first, let's stick at the philosophical level.

## Probability in the Frequestist world

For a Frequentist, a probability is the *long run relative frequency of events*. For example, if I have a coin which I say the probability of 0.2 that it lands on Heads, what a frequentist takes that to mean is: *if I were to flip this coin many many times, then the relative frequency of the coin landing on Heads will be 20%*.

This concept of repeating the events is fundamental to Frequentist Statistics. In the Frequestist world, probability is a long-run concept and it refers to the frequency of events when we repeat the trials many many times. This idea is at the core of why *optional stopping* (where one decides whether or not to collect more data based on the p-value computed on partial sample) is problematic.

Technincal Note: This assumption runs into (philosophical) problems when a Frequentist tries to describe the probability of a singular event: For example, the probability of it raining on April 8, 2021 in Palo Alto, California is a singular event in time. How do we define what counts as a sufficiently similar event? It also has trouble when describing probabilities of probabilities. For example, what is the probability that my special coin has a probability of 0.2 of landing of Heads?

A Frequentist often thinks about distributions of possible *outcomes*. If we were to repeat the process, what outcomes might we expect?

### A distribution over the number of Heads, flipping a coin of probability 0.7 15 times

```{r}
possible_outcomes = seq(0, 15)

data.frame(x = possible_outcomes,
           y = dbinom(x= possible_outcomes, size = 15, prob = 0.7) ) %>%
  ggplot(., aes( x, y))+
  geom_col()+
  labs(x = "Number of heads", y = "Probability mass")+
  theme_classic()
```


## Probability in the Bayesian world

The Bayesian perspective does not consider repeating identical trials (or an identical process) a large number of times. Instead, the Bayesian sees probabilty as a measure of uncertainty about what will come to happen. That is, because we (as scientists, or as agents in the world) do not have perfect information about all of the relevant variables in the world (including interactions of subatomical particles and the like), we cannot say with certainty what will happen in the future, but we can quantify our uncertainty using probabilities.

For example, we can estimate the probability of rain April 8, 2021 in Palo Alto, California using predictive models that encode our scientific knowledge of how whether systems evolve. These in turn might be guided by historical data (a kind of frequentist idea), but they may also include structured, theoretical knowledge about climate and weather.

While it might seem subtle, this move allows the Bayesian to consider concepts that the Frequentist does not consider. For example, the Bayesian can ask "How certain are you that the coin has probability 0.7 of coming up heads?" (For a Frequentist, this would amout to talking about *long run frequencies* of *long run frequencies*.) So while a Frequentist considers distributions over possible outcomes, the Bayesian can also consider distributions over probabilties.

### A distribution over the bias (or weight) of a coin (i.e., the probability that the coin will come up Heads)


```{r}
possible_probabilities = seq(0.01, 0.99, 0.01)

data.frame(x = possible_probabilities,
           y = dbeta(x= possible_probabilities, shape1 = 3, shape2 = 2) ) %>%
  ggplot(., aes( x, y))+
  geom_line()+
  labs(x = "Weight of a coin", y = "Probability density")+
  theme_classic()
```

**Question: What is/are the difference(s) between the two distributions shown above?**

## Bayes' theorem

First of all, let us revisit the concept of conditional probability.

### Conditional probability is the probability of A given B.

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$
For example, Let us assume for the following:

P(A) is the probability of receiving dormitory housing at the university = 0.2.
P(B) is the probability of getting accepted in the graduate school = 0.6.
P(A & B) = probability of receiving dormitory housing and being accepted in the graduate school = P(A) * P(B) = 0.2*0.6 = 0.12.
However, we know that the probability of an individual that is given a domitory housing will be higher, if the individual is accepted in the graduate school. So we want to calculate P(A | B).
P(A | B) = P(A & B)/P(B) = 0.12/0.6 = 0.2.

### Bayes' Theorem

Conditional probability forms the basis of Bayes' Theorem. Let us rearrange some equations.

This is the conditional probability of A, given B.
$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$
Thus, we can rewrite the equation and we know that the P(A & B) can be expressed in the following way:
$$
P(A \mid B) * {P(B)} = {P(A \cap B)}
$$
On the other hand, let us think of the conditional probability of B, given A
$$
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
$$
We can also rewrite the equation and we know that the P(A & B) can be expressed in the following way:
$$
P(B \mid A) * {P(A)} = {P(A \cap B)}
$$
In sum,
$$
{P(A \cap B)} = P(A \mid B) * {P(B)} = P(B \mid A) * {P(A)}
$$
We can simply rearrange the terms to arrive at Bayes' theorem:
$$
P(B \mid A) = \frac{P(A \mid B) * {P(B)}}{P(A)}
$$
In short, Bayes' Theorem is a way of calculating "inverse" conditional probabilities. It is very useful for understanding what we want to measure in statistics within the Frequestist and Bayesian frameworks.

# Overview of Statistical Analysis

In the Frequestist framework, we often run hypothesis tests and determine statistical significance by comparing p-values with an alpha level (0.05). The p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is true.

Thus, in frequestist framework, we are calculating the probability of observing some data (technically: data at least as extreme as the observed data), conditional on the null hypothesis being true. In other words, we can rewrite this as:

$$
\text{p-value}= P(data \mid h0)
$$
However, scientists are often interested in finding out whether or not a hypothesis is true (or, the probability that a hypothesis is true), conditional on the observed data: P(h | data). This is of course not what the p-value describes.

Within the Bayesian framework, we can make use of the Bayes' Theorem to find out P(h | data).

$$
P(h | data) = \frac{P(data | h) * P(h)}{P(data)}
$$

There are three key terms of Bayes Theorem, which we will go into in more detail throughout the tutorial.

- P(h) is the *prior distribution* of the hypothsis: What do we believe about the hypothesis before having observed any data?
- P(data | h) is the *likelihood distribition* of the data, conditional on the hypothesis: What kind of data would you expect to see if the hypothesis were true?
- P(h | data) is the *posterior distribution* of the hypothesis: What do we believe about the hypothesis after observing the data?

The posterior distribituion is often the statistical object that we want in order to make scientific inferences.

There is a fourth term in the equation -- P(data). For reasons that will become clear in the next section, this is termed the *marginal likelihood* of the data. It is not often looked at explicitly, except in the context of Hypothesis Testing (next session). In fact, this term is often omitted from Bayes Theorem, which can be written as a proportionality (rather than an equation):

$$
P(h | data) \propto P(data | h) * P(h)
$$

## Two flavors of data analysis: Parameter estimation and Hypothesis testing

For most applications of psychological data analysis, we are either interested in estimating the parameters of a model or deciding which, among a set of alternatives, model is the best explanation of the data. Broadly, these two tasks are referred to *parameter estimation* and *hypothesis testing* (or, *model comparison*). We will take a look at parameter estimation here. In the next session, we will take up hypothesis testing.


## An example of Parameter Estimation

Imagine we are investigating the origins of prosocial behavior: Some scientists believe humans are born with a self-interested tendency and only acquire prosociality through instruction. Others believe altruism in innate; as soon as children are physically and cognitively able to help, they will. To investigate this, we put pre-verbal infants in a context where a person is struggling to complete their goal and test whether or not the child helps the struggling individual.

A parameter estimation question: What is the underlying population’s propensity to help? (i.e., how much do children help in this context?). We call the "propensity to help" (or, the theoretical proportion of kids of this population who would help in this context) a *parameter*, since it is not directly observable. Let's call this parameter `helpingness`.

Let's imagine we collected data from 25 infants and each participant contributes a single, binary response (help or not help). Imagine that 17 out of the 25 help. Numerically, that is `r (17/25)*100`%. But that is just the proportion of kids who help in this study. What can we say about the underlying population's propensity to help? (Or, in theory, how many kids out of the population would help in this context?)


Bayes Rule tells us, mathematically, what we should believe about `helpingness` given the data we've observed.

$$
P(h \mid  d) \propto P(d \mid h) \times P(h)
$$
Let's try to rewrite that in a slightly more readable way by saying:

$$
P(\text{helpingness} \mid  \text{data}) \propto P(\text{data} \mid \text{helpingness}) \times P(\text{helpingness})
$$

### A prior distribution

The prior distribution $P(helpingness)$ encodes our expectations about the underlying parameter. What do we know about `helpingness` (kids' propensity to help in our experimental context)? One very reasonable thing to assume is that we do not know anything! This results in what is referred to as an *uninformative prior*. Since the propensity to help is itself a probability (or, proportion), it must be a number between 0 and 1, but beyond that, we do not need to assume anything about what values are more likely than others. This results in a very boring distribution.

```{r}
prior_parameters <- list(shape1 = 1, shape2 = 1) # as shape1 increases, the prior will favor higher numbers; as shape2 increases, the prior will favor lower numbers

bins <- seq(0.001, 0.999, 0.001)

data.frame(
  src = rep("prior", length(bins)),
  helpingness = bins,
  density = dbeta(x = bins, shape1 = prior_parameters$shape1,
          shape2 = prior_parameters$shape2)
) %>%
  ggplot(., aes( x = helpingness, y = density, linetype = src))+
  geom_line()+
  ggtitle("An uninformed prior distribution")+
  theme_classic()

```

Here, all values of $helpingness$ are equally likely a priori. Note that this is not the only prior distribution that we could specify. Perhaps based on the prior literature, we do have an *informed prior* expectation that kids are unlikely to help in this context. Though the numerical details would have to be justified (e.g., by a prior meta-analysis), we could encode this in the prior using a different shape.

```{r}
prior_parameters <- list(shape1 = 2, shape2 = 4) # as shape1 increases, the prior will favor higher numbers; as shape2 increases, the prior will favor lower numbers

data.frame(
  src = rep("prior", length(bins)),
  helpingness = bins,
  density = dbeta(x = bins, shape1 = prior_parameters$shape1,
          shape2 = prior_parameters$shape2)
) %>%
  ggplot(., aes( x = helpingness, y = density, linetype = src))+
  geom_line()+
  ggtitle("An informed prior distribution") +
  theme_classic()

```


Under this prior distribution, our best guess (*before having seen the data from our experiment*) is that $helpingness$ is around 0.25, though we do not have super strong beliefs about this. This kind of informed prior could be used when the scientist has indendently justified beliefs about $helpingness$ (e.g., via a meta-analysis of the prior literature).

### A likelihood distribution

The likelihood distribution encodes what data we would expect to see given a particular value of the parameter. This is sometimes called the "forward model", because it describes the process that hypothetically generates the observed data. The likelihood distribution can be quite complicated (e.g., in the case of mixed-effects regression models, as we will see later), but it can also be quite simple. In fact, in our case of binary data where each participant contributes one data point, the likeihood distribution is as simple as flipping a coin (literally).

The data we have is a forced choice (Help vs. No help). Multiple (independent) forced choice responses are samples from a Binomial distribution (i.e., they are binomially-distributed).

```{r}
hypothetical_true_population_parameter <- 0.3
data_bins <- seq(0, 25, 1)

data.frame(
  src = rep("likelihood", length(data_bins)),
  x = data_bins,
  prob = dbinom(x = data_bins, size = 25, p = hypothetical_true_population_parameter)
) %>%
  ggplot(., aes( x = x, y = prob, linetype = src))+
  geom_col()+
  xlab("hypothetical data (positive observations)")+
  ylab("probability mass")+
  ggtitle(paste("P(d | helpingness = ",hypothetical_true_population_parameter ,")", sep = ""))+
  theme_classic()
```

Here we see the likeilhood of the observed outcomes (number of kids who help) assuming that the underlying parameter $\theta$ has a value of 0.3. We can look at the likelihood distribution assuming different values of $\theta$.


```{r}
hypothetical_true_population_parameter <- 0.3
data_bins <- seq(0, 25, 1)

data.frame(
  src = paste("P(d | helpingness = ", c(
    rep(0.1, length(data_bins)),
    rep(0.3, length(data_bins)),
    rep(0.5, length(data_bins)),
    rep(0.7, length(data_bins)),
    rep(0.9, length(data_bins))
  ), ")", sep = ""),
  x = c(data_bins,data_bins,data_bins,data_bins,data_bins),
  prob = c(
    dbinom(x = data_bins, size = 25, p = 0.1),
    dbinom(x = data_bins, size = 25, p = 0.3),
    dbinom(x = data_bins, size = 25, p = 0.5),
    dbinom(x = data_bins, size = 25, p = 0.7),
    dbinom(x = data_bins, size = 25, p = 0.9)
    )
) %>%
  ggplot(., aes( x = x, y = prob))+
  geom_col()+
  xlab("hypothetical data (positive observations)")+
  ylab("probability mass")+
  facet_wrap(~src, nrow = 1)+
  theme_classic()
```

Here we see different (probabilistic) predictions about what the data should look if the underlying parameter had a particular value. You can think of the different parameter values as different hypotheses, which make more or less strong predictions about the observed data would look like. For example, notice how when theta = 0.9 or 0.1 the predictions are much less spread out in comparison to the theta = 0.5.

We have now specified a likelihood distribution (or rather, a family of likelihood distributions). We can now use Bayes Theorem to inverse the direction of inference: rather than looking at what the data should look like assuming different parameter values, we can ask what the parameter should look like assuming a particular observed data set.


### Putting the two together to make a posterior distribution


Once you have a prior distribution and a likelihood distribution, we can use Bayes Theorem to compute a posterior distribution. In practice, there are several ways to do this computation, which we will not have the chance to go into in this tutorial. In practice, we will be concerning ourselves with parameter estimation of regression models, which we will cover in Session 3. Here we illustrate what posterior distributions look like (glossing over for now the calculation of how you actually get a posterior distribution).

Assuming the uninformative (flat) prior above and the Binomial likelihood model, we arrive at the following posterior:

```{r}
prior_parameters <- list(shape1 = 1, shape2 = 1) # as shape1 increases, the prior will favor higher numbers; as shape2 increases, the prior will favor lower numbers

observed_data <- 17
number_of_observations <- 25

# this example is a classic Beta-Binomial problem, where the posterior distribution can be computed analytically
posterior_parameters <- list(
  shape1 = prior_parameters$shape1 + observed_data,
  shape2 = prior_parameters$shape2 + number_of_observations - observed_data
  )

bins <- seq(0.001, 0.999, 0.001)

data.frame(
  src = c(
    rep("posterior",length(bins)),
    rep("prior", length(bins))
    ),
  helpingness = c(bins, bins),
  density = c(
    dbeta(x = bins, shape1 = posterior_parameters$shape1,
          shape2 = posterior_parameters$shape2),
    dbeta(x = bins, shape1 = prior_parameters$shape1,
          shape2 = prior_parameters$shape2)
  )
) -> df_posterior


df_posterior %>%
  ggplot(., aes( x = helpingness, y = density, linetype = src))+
  geom_line()+
  theme_classic()

```

The posterior distribution encodes the methematically correct beliefs about the parameter given the prior, the likelihood, and the observed data.

We can interrogate the posterior to get answers to our burning scientific questions.

### What is the most likely value of the parameter given the data?

The most likely value is referred to as the *Maximum A-Posteriori* (or, MAP) value. We can find it by picking out the value of theta that corresponds to the maximum probability (or density) value.

```{r}
df_posterior %>%
  filter(src == "posterior") %>%
  summarize(
    MAP = helpingness[which.max(density)]
  )
```

### How likely is it that the parameter is greater than 0.5?

Since the posterior distribution encodes our degrees of belief about all values of the parameters, we can ask specific questions about the probabilities that correspond to ranges of values. For example, we are often interested in critical values like 0.5 (chance performance).

```{r}
df_posterior %>%
  filter(src == "posterior") %>%
  mutate(norm_prob = density / sum(density)) %>% # this calculation is technically an integral (not a sum), but we can approximate it by normalizing the density values into probabilities
  filter(helpingness > 0.5) %>%
  summarize(
    prob_gt_0.5 = sum(norm_prob)
  )

# since the posterior is analytically computed and we know (from mathematical analysis) that the shape of the posterior is a Beta distribution, we can also calculate this probability using the `pbeta` function

# 1 - pbeta(0.5, posterior_parameters$shape1, posterior_parameters$shape2)
```

Here we see that there is approximately a 96% chance that the parameter is greater than 0.5.

### What is the most likely range of values? (The *Bayesian Credible Interval*)

When describing posterior distributions, we are often interested in summarizing them in terms of a credible intervals (not to be confused with a frequentist confidence interval, though they have similar applications). There are a number of distinct credible intervals one could describe, but a simple one is looking at the X% range of the posterior (e.g., 95%) by examining the middle 95% quantile.

```{r}
qbeta(p = c(0.025, 0.975),
      shape1 = posterior_parameters$shape1,
      shape2 = posterior_parameters$shape2)
```

The proper interpretation of this credible interval is to say that there is a 95% chance that the parameter lies between 0.48 and 0.83.


## How the prior affects the posterior

Thus far in our running example, we have been assuming the uninformative prior that we articulated above, where we had no substantive (or independently justified) beliefs about the parameter *a priori*. Let's now examine how our posterior inferences would be affected if we had substantive prior beliefs. Let's assume we had meta-analytic data that gave us the prior we showed above, where lower values of the parameter were more likely a priori.

```{r}
informed_prior_parameters <- list(shape1 = 2, shape2 = 4) # as shape1 increases, the prior will favor higher numbers; as shape2 increases, the prior will favor lower numbers

observed_data <- 17
number_of_observations <- 25

# this example is a classic Beta-Binomial problem, where the posterior distribution can be computed analytically
informed_posterior_parameters <- list(
  shape1 = informed_prior_parameters$shape1 + observed_data,
  shape2 = informed_prior_parameters$shape2 + number_of_observations - observed_data
  )

bins <- seq(0.001, 0.999, 0.001)

data.frame(
  src = c(
    rep("posterior",length(bins)),
    rep("prior", length(bins))
    ),
  helpingness = c(bins, bins),
  density = c(
    dbeta(x = bins,
          shape1 = informed_posterior_parameters$shape1,
          shape2 = informed_posterior_parameters$shape2),
    dbeta(x = bins,
          shape1 = informed_prior_parameters$shape1,
          shape2 = informed_prior_parameters$shape2)
  )
) -> df_posterior_informed


df_posterior_informed %>%
  ggplot(., aes( x = helpingness, y = density, linetype = src))+
  geom_line()+
  theme_classic()
```

The results look (visually) similar. Let's look at our various summary statistics:

#### MAP value

```{r}
df_posterior_informed %>%
  filter(src == "posterior") %>%
  summarize(
    MAP = helpingness[which.max(density)]
  )
```

#### P(theta > 0.5)

```{r}
df_posterior_informed %>%
  filter(src == "posterior") %>%
  mutate(norm_prob = density / sum(density)) %>% # this calculation is technically an integral (not a sum), but we can approximate it by normalizing the density values into probabilities
  filter(helpingness > 0.5) %>%
  summarize(
    prob_gt_0.5 = sum(norm_prob)
  )
```

#### 95% credible interval

```{r}
qbeta(p = c(0.025, 0.975),
      shape1 = informed_posterior_parameters$shape1,
      shape2 = informed_posterior_parameters$shape2)
```

### Digest

As we can see from these summary statistics of the posterior distribution, our priors affected the posterior outcomes. Overall, our prior expectations that the parameter value was relatively low made our posterior beliefs about the parameter to favor slightly lower values overall. The MAP estimate is lower than it was before as are the credible interval and the probability that the parameter is > 0.5. Always remember the posterior beliefs are both a function of the prior expectations and the data / likelihood.

In the next session, we will use the Bayesian Framework to quantify evidence in support of or against a particular hypothesis.