# Conditional probabilities and expectations
* In machine learning applications, we can rarely predict outcomes perfectly. For example,
- spam detectors often miss emails that are clearly spam,
- Siri often misunderstands the words we are saying, and
- your bank at times thinks your card was stolen when it was not.
* The most common reason we cannot build perfect algorithms is that, for most problems, perfect prediction is impossible.
* To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
* Because our prediction rules are functions, equal inputs (the predictors) imply equal outputs (the predictions).
* Therefore, whenever the same predictor values are associated with different outcomes across observations, no function of the predictors can predict correctly for all these cases; the sketch below illustrates this.
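As a minimal illustration, here is a sketch with a made-up five-row dataset (the values are purely illustrative): two observations share the same predictor value but have different outcomes, so any deterministic rule must be wrong on at least one of them.

```python
import pandas as pd

# Made-up toy data: x = 1 appears twice, once with y = 0 and once with y = 1.
dat = pd.DataFrame({"x": [1, 1, 2, 2, 2],
                    "y": [0, 1, 1, 1, 0]})

# A prediction rule is a function of x, so it must return the same
# prediction for every row with x == 1; whatever it returns, it is
# wrong for at least one of those rows.
print(dat.groupby("x")["y"].agg(["mean", "count"]))
```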
## Conditional probabilities
* We use the notation $(X_1 = x_1,\dots,X_p=x_p)$ to represent the fact that we have observed values $x_1,\dots,x_p$ for features $X_1, \dots, X_p$.
* This does not imply that the outcome $Y$ will take a specific value. Instead, it implies a specific probability.
* We denote the _conditional probabilities_ for each class $k$ with:
$$
\mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K
$$
* We will use bold letters like this: $\mathbf{X} \equiv (X_1,\dots,X_p)^\top$ and $\mathbf{x} \equiv (x_1,\dots,x_p)^\top$.
* We will also use the following notation for the conditional probability of being class $k$:
$$
p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K
$$
:::{.callout-note}
Do not confuse the $p$ in $p_k(\mathbf{x})$ with the $p$ that represents the number of predictors.
:::
* These probabilities guide the construction of an algorithm that makes the best prediction: for any given $\mathbf{x}$, we will predict the class $k$ with the largest probability among $p_1(\mathbf{x}), p_2(\mathbf{x}), \dots, p_K(\mathbf{x})$.
* In mathematical notation, we write it like this:
$$\hat{Y} = \arg\max_k p_k(\mathbf{x})$$
* In machine learning, we refer to this as _Bayes' Rule_.
* But this is a theoretical rule since, in practice, we don't know $p_k(\mathbf{x}), k=1,\dots,K$.
* Estimating these conditional probabilities can be thought of as the main challenge of machine learning.
* The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor $\hat{Y}$.
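In code, this rule is just an argmax over the estimated probabilities. A minimal sketch with made-up numbers (not from the original text):

```python
import numpy as np

# Hypothetical estimated conditional probabilities for K = 3 classes
# at a given x (illustrative numbers only).
p_hat = np.array([0.2, 0.5, 0.3])

# Bayes' rule: predict the class with the largest estimated probability.
# Classes are labeled 1, ..., K, hence the + 1 after the 0-based argmax.
y_hat = np.argmax(p_hat) + 1
print(y_hat)  # 2
```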
* So how well we predict depends on two things:
1. how close $\max_k p_k(\mathbf{x})$ is to 1 (perfect certainty) and
2. how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$.
* We can't do anything about the first factor, as it is determined by the nature of the problem, so our energy goes into finding the best ways to estimate conditional probabilities.
* The first restriction does imply that we have limits as to how well even the best possible algorithm can perform.
- in some challenges, such as digit readers, we will be able to achieve almost perfect accuracy, while
- in others, such as movie recommendations, our success is restricted by the randomness of the process (the simulation sketch below illustrates this limit).
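To see the limit imposed by randomness, here is a simulation sketch (entirely made up, not from the original text): outcomes are drawn from known conditional probabilities, and even the best possible rule tops out well below perfect accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary problem: each observation has its own
# p(x) = Pr(Y = 1 | X = x), here drawn uniformly for illustration.
p = rng.uniform(0, 1, size=100_000)
y = rng.binomial(1, p)             # outcomes generated from p(x)
y_hat = (p > 0.5).astype(int)      # best possible rule: uses the true p(x)

# Even knowing p(x) exactly, accuracy is E[max(p, 1 - p)], about 0.75 here.
print((y == y_hat).mean())
```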
* It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context.
* As discussed above, sensitivity and specificity may differ in importance.
* But even in these cases, having a good estimate of the $p_k(\mathbf{x}), k=1,\dots,K$, will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish; the thresholding sketch below shows the idea.
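For a binary outcome, that control is just a choice of cutoff on the estimated probability. A minimal sketch with made-up probabilities (illustrative only):

```python
import numpy as np

# Hypothetical estimates of Pr(Y = 1 | x) for five observations.
p_hat = np.array([0.1, 0.4, 0.6, 0.8, 0.95])

# Raising the cutoff increases specificity at the cost of sensitivity,
# and lowering it does the opposite; 0.5 is not mandatory.
for cutoff in (0.3, 0.5, 0.9):
    print(cutoff, (p_hat > cutoff).astype(int))
```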
## Conditional expectations
* For binary data, you can think of the probability $\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})$ as the proportion of 1s in the stratum of the population for which $\mathbf{X}=\mathbf{x}$.
* Many of the algorithms we will learn can be applied to both categorical and continuous data due to the connection between _conditional probabilities_ and _conditional expectations_.
* Because the expectation is the average of the values $y_1,\dots,y_n$ in the population, when the $y$s are 0 or 1 we have:
$$
\mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}).
$$
* As a result, we often use the conditional expectation alone to denote both the conditional probability and the conditional expectation.
* Throughout, we assume that the outcomes we want to predict follow the same conditional distribution as the outcomes in our data.
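The binary case can be checked in one line: the average of 0/1 values in a stratum is the proportion of 1s. A tiny made-up example:

```python
import numpy as np

# Made-up binary outcomes within the stratum where X = x.
y = np.array([0, 1, 1, 0, 1])

# The mean of 0/1 values equals the proportion of 1s, so the conditional
# expectation and the conditional probability are the same number.
print(y.mean())         # 0.6
print((y == 1).mean())  # 0.6
```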
## The conditional expectation minimizes the squared loss function
* Why do we care about the conditional expectation in machine learning?
* This is because the expected value has an attractive mathematical property: it minimizes the mean squared error (MSE). Specifically, of all possible predictions $\hat{Y}$,
$$
\hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \}
$$
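* To see why (a standard one-step argument, added here for completeness), write $\mu(\mathbf{x}) = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})$; adding and subtracting $\mu(\mathbf{x})$ inside the square and noting that the cross term has conditional expectation 0 gives:
$$
\mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} = \{\hat{Y} - \mu(\mathbf{x})\}^2 + \mbox{Var}(Y \mid \mathbf{X}=\mathbf{x})
$$
* The first term is smallest (zero) exactly when $\hat{Y} = \mu(\mathbf{x})$, and the variance term does not depend on $\hat{Y}$.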
* Due to this property, a succinct description of the main task of machine learning is that we use data to estimate:
$$
f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} )
$$
for any set of features $\mathbf{x} = (x_1, \dots, x_p)^\top$.
* This is easier said than done, since this function can take any shape and $p$ can be very large.
* Consider a case in which we only have one predictor $x$. The expectation $\mbox{E}(Y \mid X=x)$ can be any function of $x$: a line, a parabola, a sine wave, a step function, anything.
* It gets even more complicated when we consider instances with large $p$, in which case $f(\mathbf{x})$ is a function of a multidimensional vector $\mathbf{x}$. For example, in our digit reader example $p = 784$!
* The main way in which competing machine learning algorithms differ is in their approach to estimating this conditional expectation.
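As a concrete (hypothetical) illustration of one such approach, here is a sketch that estimates $f(x) = \mbox{E}(Y \mid X=x)$ for a single predictor by averaging outcomes in a small window around $x$; all data are simulated, and the window width is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true f(x) is a sine wave, but in practice we
# would only observe the noisy (x, y) pairs below.
x = rng.uniform(0, 2 * np.pi, 500)
y = np.sin(x) + rng.normal(0, 0.5, 500)

def f_hat(x0, h=0.5):
    # Local average: mean outcome among observations with x within h of x0.
    return y[np.abs(x - x0) < h].mean()

print(f_hat(np.pi / 2))  # close to the true value sin(pi/2) = 1
```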