---
title: "Optimal Decision Making"
author: "Adrien Osakwe"
format: html
editor: visual
---

## Decision-Making

Many statistical procedures involve some form of *decision-making*, where actions are taken given the observed data. Examples include:

-   parameter estimation

-   hypothesis testing

-   prediction/classification

-   model selection

Let $T$ denote a function of the random variable $Y$ that we use as an estimator of a quantity of interest; examples include the mean, the order statistics, and the empirical cdf. Let $\mathbb{F}$ denote the model space. A loss function $L(\cdot,\cdot)$ then measures the cost of reporting $T$ when the true data-generating distribution is $F \in \mathbb{F}$.

For example, if $Y$ has cdf $F_y$ with mean

$$
\mu = \int yF_y(dy)
$$

we can define the loss function as $L(T,F_y) = (T - \mu)^2$, which represents the loss in reporting the estimator $T$ when the value of interest is $\mu$.

The *optimal decision* is the decision which minimizes the expected loss with respect to the distribution $F_y$. In a parametric analysis indexed by a parameter $\theta$, the source of randomness depends on the setting:

1.  In a **frequentist** setting, $\theta$ is a fixed constant and the data are random
2.  In a **Bayesian** setting, the data are fixed and $\theta$ **is random**

### Frequentist calculation

Back to our previous example, in a frequentist setting the optimal decision for $T$ would be

$$
\arg\min_{T}\mathbb{E}_{F_y}[(T- \mu)^2] = \arg\min_{T}\{\mathbb{E}_{F_y}[(T -\mathbb{E}_{F_y}[T])^2] + (\mathbb{E}_{F_y}[T] - \mu)^2 \}
$$

$$
= \arg\min_{T}\{Var_{F_y}[T] + (\mathbb{E}_{F_y}[T] - \mu)^2\}
$$

This tells us that the optimal $T$ must balance two terms: the variance of $T$ and its squared bias.
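The decomposition can be checked by simulation. The sketch below is only an illustration: it assumes $F_y = N(\mu, \sigma^2)$ with made-up values for $\mu$, $\sigma$ and $n$, and uses a hypothetical shrunk sample mean (`shrunk_mean`) as a deliberately biased estimator $T$.

```python
import random
import statistics

random.seed(0)

mu, sigma, n = 2.0, 1.0, 10   # assumed true mean, standard deviation, sample size
reps = 5000                   # number of simulated datasets

def shrunk_mean(sample, c=0.8):
    """Hypothetical biased estimator T: the sample mean multiplied by c."""
    return c * statistics.fmean(sample)

# Draw T repeatedly under F_y = N(mu, sigma^2).
estimates = [
    shrunk_mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

mse = statistics.fmean((t - mu) ** 2 for t in estimates)   # E[(T - mu)^2]
variance = statistics.pvariance(estimates)                 # Var[T]
bias_sq = (statistics.fmean(estimates) - mu) ** 2          # (E[T] - mu)^2

print(f"MSE = {mse:.4f}, variance + bias^2 = {variance + bias_sq:.4f}")
```

The two printed numbers agree up to floating-point error, because the identity also holds exactly for the empirical distribution of the simulated estimates.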

### Kullback-Leibler Divergence/Loss

In Bayesian settings, we often use the Kullback-Leibler (KL) loss, which measures the discrepancy between two distributions $F_0$ and $F_1$:

$$
KL(F_0,F_1) = \int \log\{\frac{dF_0(y)}{dF_1(y)}\}dF_0(y)
$$

This is defined when $F_0$ is absolutely continuous with respect to $F_1$. That is, $F_0$ assigns zero probability to every set to which $F_1$ assigns zero probability. When the two distributions have densities $f_0$ and $f_1$, we can express the KL as an expectation under $f_0$:

$$
KL(f_0,f_1) = \mathbb{E}_{f_0}[\log\{\frac{f_0(Y)}{f_1(Y)}\}]
$$

It is important to note that:

1.  KL is always non-negative
2.  KL is not symmetric: in general, $KL(F_0,F_1) \neq KL(F_1,F_0)$
3.  KL is zero if and only if the two distributions are identical.
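These three properties can be seen on a small discrete example. The following is a minimal sketch: the function name `kl_divergence` and the two probability vectors are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p, q) = sum_y p(y) * log(p(y) / q(y)) for discrete distributions."""
    total = 0.0
    for p_y, q_y in zip(p, q):
        if p_y == 0.0:
            continue          # terms with p(y) = 0 contribute nothing
        if q_y == 0.0:
            return math.inf   # the log-ratio is infinite where q(y) = 0 < p(y)
        total += p_y * math.log(p_y / q_y)
    return total

f0 = [0.5, 0.4, 0.1]   # illustrative distribution F_0
f1 = [0.3, 0.3, 0.4]   # illustrative distribution F_1

kl_01 = kl_divergence(f0, f1)
kl_10 = kl_divergence(f1, f0)
kl_00 = kl_divergence(f0, f0)

print(f"KL(F0,F1) = {kl_01:.4f}, KL(F1,F0) = {kl_10:.4f}, KL(F0,F0) = {kl_00}")
```

Both divergences are positive but differ from each other, and the divergence of a distribution from itself is zero.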

This is applicable to both discrete and continuous distributions. In a parametric setting, we have a pdf $f(y;\theta)$, where $\theta_0$ denotes the parameter value of the true, data-generating model. We can then write KL as

$$
KL(\theta_0,\theta) = \int\log\{\frac{f(y;\theta_0)}{f(y;\theta)}\}f(y;\theta_0)dy
$$

and aim to report an estimator $\hat{\theta} = T(Y)$ for the true value $\theta_0$.
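As a worked instance of the parametric KL, assume (purely for illustration, this model is not part of the notes) a normal location model $f(y;\theta) = N(\theta,\sigma^2)$ with known variance $\sigma^2$. The log-ratio of the densities is quadratic in $y$, so the expectation under $\theta_0$ has a closed form:

$$
\log\{\frac{f(y;\theta_0)}{f(y;\theta)}\} = \frac{(y-\theta)^2 - (y-\theta_0)^2}{2\sigma^2}
$$

$$
KL(\theta_0,\theta) = \frac{\mathbb{E}_{\theta_0}[(Y-\theta)^2] - \mathbb{E}_{\theta_0}[(Y-\theta_0)^2]}{2\sigma^2} = \frac{\{\sigma^2 + (\theta_0-\theta)^2\} - \sigma^2}{2\sigma^2} = \frac{(\theta_0-\theta)^2}{2\sigma^2}
$$

Under this assumption the KL loss is proportional to the squared error $(\theta_0 - \theta)^2$ and is zero only at $\theta = \theta_0$.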

### Decision Theory Concepts

The key parts of a decision problem are:

-   a decision $d$ must be selected from a set $D$ of possible decisions.

-   there is a true state of nature, $v(\theta)$, which lies in the set $\gamma$ and is defined by the data-generating model, $F_Y(y;\theta)$.

-   there is a loss function $L(d,v)$ which records the loss incurred by decision $d$ when the state of nature is $v$.

With these components, we aim to **minimize the expected loss**.

In the case of estimation, the decision is an estimate of the parameter $\theta$ and the true state of nature is the parameter's value, that is, $v(\theta) \equiv \theta$. If data are available, the optimal decision is usually a function of the observed data. If the decision takes the form of a statistic, we have $d(y) \equiv T(y) = t_n$ with associated loss $L(t_n,\theta)$. The corresponding random variable is $T_n \equiv T(Y)$.

**Frequentist Risk (loss):** The expected loss associated with $d(Y)$ over the distribution of $Y|\theta$.

$$
R_n(d,\theta) = \mathbb{E}_{F_Y}[L(T_n,\theta)] = \int_yL(T(y),\theta)f_Y(y;\theta)dy
$$
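To make the risk definition concrete, here is a minimal sketch that assumes a Binomial model $Y \sim \text{Binomial}(n,\theta)$ with squared-error loss, so the integral over $y$ becomes a finite sum. The model, the two estimators and all function names are illustrative assumptions rather than part of the notes.

```python
import math

def binomial_pmf(y, n, theta):
    """f_Y(y; theta) for Y ~ Binomial(n, theta)."""
    return math.comb(n, y) * theta**y * (1 - theta) ** (n - y)

def frequentist_risk(estimator, theta, n):
    """R_n(d, theta): expected squared-error loss over the distribution of Y | theta."""
    return sum(
        (estimator(y, n) - theta) ** 2 * binomial_pmf(y, n, theta)
        for y in range(n + 1)
    )

def sample_proportion(y, n):
    return y / n

def shrunk_proportion(y, n):
    # Hypothetical alternative: adds one pseudo-success and one pseudo-failure.
    return (y + 1) / (n + 2)

n = 10  # assumed number of trials
risk_table = {
    theta: (
        frequentist_risk(sample_proportion, theta, n),
        frequentist_risk(shrunk_proportion, theta, n),
    )
    for theta in (0.1, 0.5, 0.9)
}

for theta, (r_prop, r_shrunk) in risk_table.items():
    print(f"theta = {theta}: y/n risk = {r_prop:.5f}, (y+1)/(n+2) risk = {r_shrunk:.5f}")
```

In this example neither estimator has smaller risk for every $\theta$: the shrunk estimator does better near $\theta = 0.5$ and worse near the boundaries, which is why the frequentist risk alone does not single out one decision.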

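**Bayes Risk (average):** The expected frequentist risk over the **prior** distribution $\pi_0(\theta)$:

$$
R_n(d) = \mathbb{E}_{\pi_0}[R_n(d,\theta)] = \mathbb{E}_{\pi_0}[\mathbb{E}_{F_Y}[L(T_n,\theta)]]
$$

$$
= \int_\Theta\{ \int_yL(t_n,\theta)f_Y(y;\theta)dy\}\pi_0(\theta)d\theta
$$

where $t_n \equiv t_n(y)$. By Bayes' theorem, $f_Y(y;\theta)\pi_0(\theta) = f_Y(y)\pi_n(\theta)$, where $f_Y(y)$ is the marginal density of the data and $\pi_n(\theta)$ is the posterior density of $\theta$ given $y$. Substituting and exchanging the order of integration gives

$$
R_n(d) = \int_\Theta\int_yL(t_n,\theta)f_Y(y)\pi_n(\theta)dyd\theta
$$

$$
= \int_y\{\int_\Theta L(t_n,\theta)\pi_n(\theta)d\theta\}f_Y(y)dy
$$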
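The two forms of the Bayes risk (risk averaged over the prior, and posterior expected loss averaged over the marginal of the data) can be compared numerically. The sketch below assumes a Binomial likelihood with a uniform prior on $\theta$ and squared-error loss; under that assumption the posterior is Beta$(y+1, n-y+1)$ and the marginal of $Y$ is uniform on $\{0,\dots,n\}$. All names and values are illustrative.

```python
import math

n = 10  # assumed number of Bernoulli trials

def binomial_pmf(y, theta):
    return math.comb(n, y) * theta**y * (1 - theta) ** (n - y)

def sample_proportion(y):
    return y / n

def posterior_mean(y):
    # Mean of the Beta(y + 1, n - y + 1) posterior under the uniform prior.
    return (y + 1) / (n + 2)

def bayes_risk_prior_form(estimator, grid_size=2000):
    """Integrate the frequentist risk against pi_0(theta) = 1 (midpoint rule)."""
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) / grid_size
        risk = sum(
            (estimator(y) - theta) ** 2 * binomial_pmf(y, theta)
            for y in range(n + 1)
        )
        total += risk / grid_size
    return total

def bayes_risk_posterior_form(estimator):
    """Average the posterior expected loss over the marginal f_Y(y) = 1 / (n + 1)."""
    total = 0.0
    for y in range(n + 1):
        a, b = y + 1, n - y + 1
        post_mean = a / (a + b)
        post_var = a * b / ((a + b) ** 2 * (a + b + 1))
        # E[(t - theta)^2 | y] = posterior variance + (t - posterior mean)^2
        posterior_expected_loss = post_var + (estimator(y) - post_mean) ** 2
        total += posterior_expected_loss / (n + 1)
    return total

for name, est in [("y/n", sample_proportion), ("posterior mean", posterior_mean)]:
    print(name, bayes_risk_prior_form(est), bayes_risk_posterior_form(est))
```

Both forms give the same value for each estimator, and the posterior mean attains the smaller Bayes risk of the two, $1/72$ against $1/60$ for $n = 10$.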

Given the prior and the observed data, the optimal Bayesian decision, called the **Bayes rule**, is

$$
\hat{d}_B = \arg\min_{d \in D} R_n(d)
$$

Interpretation:

-   Since $f_Y(y) \geq 0$, the Bayes risk (the double integral) is minimized when the inner integral is minimized for each fixed value of the observations, whatever that value is.

-   Equivalently, the optimal decision should be chosen conditionally on the observed data, rather than by averaging over all possible data values.

For estimation, the decision is a statistic $t_n = T(y)$. Because the inner integral can be minimized separately for each observed $y$, the minimization reduces to

$$
\arg\min_{t \in \Theta}\int_\Theta L(t,\theta)\pi_n(\theta)d\theta
$$

In other words, the optimal decision minimizes the **posterior expected loss**, and in doing so it also minimizes the Bayes risk.

#### Squared-error loss

-   Returns the posterior expectation (see proof)
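A brief sketch of why, assuming the posterior $\pi_n$ has a finite second moment and writing $\bar{\theta}_n = \mathbb{E}_{\pi_n}[\theta]$ (notation introduced here for illustration only). With $L(t,\theta) = (t-\theta)^2$, the posterior expected loss splits as

$$
\int_\Theta (t-\theta)^2\pi_n(\theta)d\theta = (t - \bar{\theta}_n)^2 + \int_\Theta(\theta - \bar{\theta}_n)^2\pi_n(\theta)d\theta
$$

The cross term vanishes because $\int_\Theta(\theta - \bar{\theta}_n)\pi_n(\theta)d\theta = 0$. The second term does not depend on $t$, so the posterior expected loss is minimized at $t = \bar{\theta}_n$, and its minimum value is the posterior variance.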

#### Absolute error loss

-   Returns the posterior median

#### Zero-one loss

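-   Returns the posterior mode (for continuous $\theta$, as the limit of the loss $\mathbb{1}\{|t - \theta| > \epsilon\}$ as $\epsilon \to 0$)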
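The three results above can be checked numerically by minimizing the posterior expected loss over a grid. This is a rough sketch under stated assumptions: the posterior is taken to be a hypothetical Beta(3, 7), it is discretised on a grid over $(0,1)$, and zero-one loss is applied on that grid (loss 1 unless $t$ equals $\theta$ exactly). All function names are illustrative.

```python
def posterior_grid(a, b, size=500):
    """Discretise a Beta(a, b) posterior on a midpoint grid over (0, 1)."""
    grid = [(i + 0.5) / size for i in range(size)]
    unnormalised = [t ** (a - 1) * (1 - t) ** (b - 1) for t in grid]
    total = sum(unnormalised)
    return grid, [u / total for u in unnormalised]

def bayes_estimate(loss, grid, weights):
    """Return the grid point t that minimises the posterior expected loss."""
    def expected_loss(t):
        return sum(w * loss(t, theta) for theta, w in zip(grid, weights))
    return min(grid, key=expected_loss)

def squared_error(t, theta):
    return (t - theta) ** 2

def absolute_error(t, theta):
    return abs(t - theta)

def zero_one(t, theta):
    # Discretised zero-one loss: no loss only when t hits theta exactly.
    return 0.0 if t == theta else 1.0

grid, weights = posterior_grid(a=3, b=7)   # hypothetical Beta(3, 7) posterior

estimates = {
    "squared": bayes_estimate(squared_error, grid, weights),
    "absolute": bayes_estimate(absolute_error, grid, weights),
    "zero_one": bayes_estimate(zero_one, grid, weights),
}
print(estimates)
```

For a Beta(3, 7) distribution the mean is 0.3, the mode is 0.25, and the median lies between them, so the three printed estimates should land close to those values, up to the grid resolution.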