# Conditional probabilities and expectations
* In machine learning applications, we can rarely predict outcomes perfectly. For example,
- spam detectors often miss emails that are clearly spam,
- Siri often misunderstands the words we are saying, and
- your bank at times thinks your card was stolen when it was not.
* The most common reason we cannot build perfect algorithms is that, for most problems, perfect prediction is impossible.
* To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
* Because our prediction rules are functions, equal inputs (the predictors) imply equal outputs (the predictions).
* Therefore, whenever the same predictor values are associated with different outcomes across observations, no function of the predictors can predict correctly for all these cases; the sketch below illustrates this.
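As a minimal illustration, here is a sketch with a made-up five-row dataset (the values are purely illustrative): two observations share the same predictor value but have different outcomes, so any deterministic rule must be wrong on at least one of them.

```python
import pandas as pd

# Made-up toy data: x = 1 appears twice, once with y = 0 and once with y = 1.
dat = pd.DataFrame({"x": [1, 1, 2, 2, 2],
                    "y": [0, 1, 1, 1, 0]})

# A prediction rule is a function of x, so it must return the same
# prediction for every row with x == 1; whatever it returns, it is
# wrong for at least one of those rows.
print(dat.groupby("x")["y"].agg(["mean", "count"]))
```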
## Conditional probabilities
* We use the notation $(X_1 = x_1,\dots,X_p=x_p)$ to represent the fact that we have observed values $x_1,\dots,x_p$ for features $X_1, \dots, X_p$.
* This does not imply that the outcome $Y$ will take a specific value. Instead, it implies a specific probability.
* We denote the _conditional probabilities_ for each class $k$ with:
$$
\mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K
$$
* We will use bold letters like this: $\mathbf{X} \equiv (X_1,\dots,X_p)^\top$ and $\mathbf{x} \equiv (x_1,\dots,x_p)^\top$.
* We will also use the following notation for the conditional probability of being class $k$:
$$
p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K
$$
:::{.callout-note}
Do not confuse the $p$ in $p_k(\mathbf{x})$ with the $p$ that represents the number of predictors.
:::
* These probabilities guide the construction of an algorithm that makes the best prediction: for any given $\mathbf{x}$, we will predict the class $k$ with the largest probability among $p_1(\mathbf{x}), p_2(\mathbf{x}), \dots, p_K(\mathbf{x})$.
* In mathematical notation, we write it like this:
$$\hat{Y} = \arg\max_k p_k(\mathbf{x})$$
* In machine learning, we refer to this as _Bayes' Rule_.
* But this is a theoretical rule since, in practice, we don't know $p_k(\mathbf{x}), k=1,\dots,K$.
* Estimating these conditional probabilities can be thought of as the main challenge of machine learning.
* The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor $\hat{Y}$.
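In code, this rule is just an argmax over the estimated probabilities. A minimal sketch with made-up numbers (not from the original text):

```python
import numpy as np

# Hypothetical estimated conditional probabilities for K = 3 classes
# at a given x (illustrative numbers only).
p_hat = np.array([0.2, 0.5, 0.3])

# Bayes' rule: predict the class with the largest estimated probability.
# Classes are labeled 1, ..., K, hence the + 1 after the 0-based argmax.
y_hat = np.argmax(p_hat) + 1
print(y_hat)  # 2
```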
* So how well we predict depends on two things:
1. how close $\max_k p_k(\mathbf{x})$ is to 1 (perfect certainty) and
2. how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$.
* We can't do anything about the first factor, as it is determined by the nature of the problem, so our energy goes into finding the best ways to estimate conditional probabilities.
* The first restriction does imply that we have limits as to how well even the best possible algorithm can perform.
- in some challenges, such as digit readers, we will be able to achieve almost perfect accuracy, while
- in others, such as movie recommendations, our success is restricted by the randomness of the process (the simulation sketch below illustrates this limit).
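To see the limit imposed by randomness, here is a simulation sketch (entirely made up, not from the original text): outcomes are drawn from known conditional probabilities, and even the best possible rule tops out well below perfect accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary problem: each observation has its own
# p(x) = Pr(Y = 1 | X = x), here drawn uniformly for illustration.
p = rng.uniform(0, 1, size=100_000)
y = rng.binomial(1, p)             # outcomes generated from p(x)
y_hat = (p > 0.5).astype(int)      # best possible rule: uses the true p(x)

# Even knowing p(x) exactly, accuracy is E[max(p, 1 - p)], about 0.75 here.
print((y == y_hat).mean())
```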
* It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context.
* As discussed above, sensitivity and specificity may differ in importance.
* But even in these cases, having a good estimate of the $p_k(\mathbf{x}), k=1,\dots,K$, will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish; the thresholding sketch below shows the idea.
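For a binary outcome, that control is just a choice of cutoff on the estimated probability. A minimal sketch with made-up probabilities (illustrative only):

```python
import numpy as np

# Hypothetical estimates of Pr(Y = 1 | x) for five observations.
p_hat = np.array([0.1, 0.4, 0.6, 0.8, 0.95])

# Raising the cutoff increases specificity at the cost of sensitivity,
# and lowering it does the opposite; 0.5 is not mandatory.
for cutoff in (0.3, 0.5, 0.9):
    print(cutoff, (p_hat > cutoff).astype(int))
```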
## Conditional expectations
* For binary data, you can think of the probability $\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})$ as the proportion of 1s in the stratum of the population for which $\mathbf{X}=\mathbf{x}$.
* Many of the algorithms we will learn can be applied to both categorical and continuous data due to the connection between _conditional probabilities_ and _conditional expectations_.
* Because the expectation is the average of the values $y_1,\dots,y_n$ in the population, when the $y$s are 0 or 1 we have:
$$
\mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}).
$$
* As a result, we often use the conditional expectation alone to denote both the conditional probability and the conditional expectation.
* Throughout, we assume that the outcomes we want to predict follow the same conditional distribution as the outcomes in our data.
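The binary case can be checked in one line: the average of 0/1 values in a stratum is the proportion of 1s. A tiny made-up example:

```python
import numpy as np

# Made-up binary outcomes within the stratum where X = x.
y = np.array([0, 1, 1, 0, 1])

# The mean of 0/1 values equals the proportion of 1s, so the conditional
# expectation and the conditional probability are the same number.
print(y.mean())         # 0.6
print((y == 1).mean())  # 0.6
```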
## The conditional expectation minimizes the squared loss function
* Why do we care about the conditional expectation in machine learning?
* This is because the expected value has an attractive mathematical property: it minimizes the mean squared error (MSE). Specifically, of all possible predictions $\hat{Y}$,
$$
\hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \}
$$
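* To see why (a standard one-step argument, added here for completeness), write $\mu(\mathbf{x}) = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})$; adding and subtracting $\mu(\mathbf{x})$ inside the square and noting that the cross term has conditional expectation 0 gives:
$$
\mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} = \{\hat{Y} - \mu(\mathbf{x})\}^2 + \mbox{Var}(Y \mid \mathbf{X}=\mathbf{x})
$$
* The first term is smallest (zero) exactly when $\hat{Y} = \mu(\mathbf{x})$, and the variance term does not depend on $\hat{Y}$.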
* Due to this property, a succinct description of the main task of machine learning is that we use data to estimate:
$$
f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} )
$$
for any set of features $\mathbf{x} = (x_1, \dots, x_p)^\top$.
* This is easier said than done, since this function can take any shape and $p$ can be very large.
* Consider a case in which we only have one predictor $x$. The expectation $\mbox{E}(Y \mid X=x)$ can be any function of $x$: a line, a parabola, a sine wave, a step function, anything.
* It gets even more complicated when we consider instances with large $p$, in which case $f(\mathbf{x})$ is a function of a multidimensional vector $\mathbf{x}$. For example, in our digit reader example $p = 784$!
* The main way in which competing machine learning algorithms differ is in their approach to estimating this conditional expectation.
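As a concrete (hypothetical) illustration of one such approach, here is a sketch that estimates $f(x) = \mbox{E}(Y \mid X=x)$ for a single predictor by averaging outcomes in a small window around $x$; all data are simulated, and the window width is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true f(x) is a sine wave, but in practice we
# would only observe the noisy (x, y) pairs below.
x = rng.uniform(0, 2 * np.pi, 500)
y = np.sin(x) + rng.normal(0, 0.5, 500)

def f_hat(x0, h=0.5):
    # Local average: mean outcome among observations with x within h of x0.
    return y[np.abs(x - x0) < h].mean()

print(f_hat(np.pi / 2))  # close to the true value sin(pi/2) = 1
```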