@@ -22,7 +22,6 @@ import pandas as pd
2222from math import exp, log
2323```
2424
25-
2625## Introduction
2726
2827Consider a situation where a policymaker is trying to estimate how much revenue a proposed wealth tax
@@ -78,15 +77,16 @@ The data is derived from the
7877[Survey of Consumer Finances](https://en.wikipedia.org/wiki/Survey_of_Consumer_Finances) (SCF).
7978
8079
81- The following code imports this data and reads it into an array called ` sample ` .
80+ The following code imports this data and reads it into an array called `sample`.
8281
8483```{code-cell} ipython3
8483:tags: [hide-input]
84+
8585url = 'https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_scf_noweights/SCF_plus/SCF_plus_mini_no_weights.csv'
8686df = pd.read_csv(url)
8787df = df.dropna()
8888df = df[df['year'] == 2016]
89- df = df.loc[df['n_wealth'] > 0 ] #restrcting data to net worth > 0
89+ df = df.loc[df['n_wealth'] > 1]  # restricting data to net worth > 1
9090rv = df['n_wealth'].sample(n=n, random_state=1234)
9191rv = rv.to_numpy() / 100_000
9292sample = rv
@@ -157,22 +157,64 @@ ax.hist(ln_sample, density=True, bins=200, histtype='stepfilled', alpha=0.8)
157157plt.show()
158158```
159159
160+ +++ {"user_expressions": []}
161+
160162Now our job is to obtain the maximum likelihood estimates of $\mu$ and $\sigma$, which
161163we denote by $\hat{\mu}$ and $\hat{\sigma}$.
162164
163165These estimates can be found by maximizing the likelihood function given the
164166data.
165167
166- In our case they are
168+ The pdf of a lognormally distributed random variable $X$ is given by:
169+ $$
170+ f(x) = \frac{1}{x}\frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\ln x-\mu}{\sigma}\right)^2\right)
171+ $$
167172
173+ Since $\ln X$ is normally distributed, this is the same as
168174$$
169- \hat{\mu} = \frac{\sum_{i=1}^{n} \ln w_i}{n}
170- \quad \text{and} \quad
171- \hat{\sigma}
172- = \left( \frac{\sum_{i=1}^{n}(\ln w_i - \hat{\mu})^2}{n} \right)^{1/2}
175+ f(x) = \frac{1}{x} \phi(\ln x)
176+ $$
177+ where $\phi$ is the pdf of $\ln X$, which is normally distributed with mean $\mu$ and variance $\sigma^2$.
178+
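As a quick numerical sanity check of this identity (our own aside, not needed for what follows), we can compare the handwritten density $\frac{1}{x} \phi(\ln x)$ with SciPy's built-in lognormal pdf; the check relies on SciPy's `lognorm(s=σ, scale=exp(μ))` parameterization and on `np` (NumPy) already being imported at the top of the lecture.

```{code-cell} ipython3
from scipy.stats import norm, lognorm

# Arbitrary test values for the parameters of ln X
μ_check, σ_check = 0.5, 1.2
x_check = np.linspace(0.1, 10, 50)

# Handwritten density: (1/x) * φ(ln x), with φ the N(μ, σ²) pdf
manual = (1 / x_check) * norm(loc=μ_check, scale=σ_check).pdf(np.log(x_check))

# SciPy's lognormal pdf under the matching parameterization
builtin = lognorm(s=σ_check, scale=np.exp(μ_check)).pdf(x_check)

np.allclose(manual, builtin)
```
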
179+ For a sample $x = (x_1, x_2, \cdots, x_n)$ the _likelihood function_ is given by:
180+ $$
181+ \begin{aligned}
182+ L(\mu, \sigma \mid x) &= \prod_{i=1}^{n} f(x_i \mid \mu, \sigma) \\
183+ &= \prod_{i=1}^{n} \frac{1}{x_i} \phi(\ln x_i)
184+ \end{aligned}
185+ $$
186+
187+ Taking the log of both sides gives us the _log likelihood function_:
188+ $$
189+ \begin{aligned}
190+ \ell(\mu, \sigma \mid x) &= -\sum_{i=1}^{n} \ln x_i + \sum_{i=1}^n \ln \phi(\ln x_i) \\
191+ &= -\sum_{i=1}^{n} \ln x_i - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2
192+ - \frac{1}{2\sigma^2} \sum_{i=1}^n (\ln x_i - \mu)^2
193+ \end{aligned}
173194$$
174195
175- Let's calculate these values
196+ To find where this function is maximized, we take its partial derivatives with respect to $\mu$ and $\sigma^2$ and set them equal to $0$.
197+
198+ Let's first find the MLE of $\mu$,
199+ $$
200+ \begin{aligned}
201+ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (\ln x_i - \mu) = 0 \\
202+ \Rightarrow \sum_{i=1}^n \ln x_i - n \mu = 0 \\
203+ \Rightarrow \hat{\mu} = \frac{\sum_{i=1}^n \ln x_i}{n}
204+ \end{aligned}
205+ $$
206+
207+ Now let's find the MLE of $\sigma$,
208+ $$
209+ \begin{aligned}
210+ \frac{\partial \ell}{\partial \sigma^2} = - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (\ln x_i - \mu)^2 = 0 \\
211+ \Rightarrow \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4} \sum_{i=1}^n (\ln x_i - \mu)^2 \\
212+ \Rightarrow \hat{\sigma} = \left( \frac{\sum_{i=1}^{n}(\ln x_i - \hat{\mu})^2}{n} \right)^{1/2}
213+ \end{aligned}
214+ $$
215+
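Before doing so, here is an optional numerical cross-check of the algebra (our own addition): we maximize the log likelihood directly with `scipy.optimize.minimize` and confirm that the optimizer lands close to the closed-form answers, reusing the `ln_sample` array constructed above.

```{code-cell} ipython3
from scipy.optimize import minimize

def neg_log_likelihood(params, data=ln_sample):
    # Negative log likelihood of the lognormal model written in terms of the
    # logged data; the -Σ ln x_i term is constant in (μ, σ) and is dropped
    # because it does not affect the maximizer.
    μ, σ = params
    n = len(data)
    ll = (- n / 2 * np.log(2 * np.pi)
          - n * np.log(σ)
          - np.sum((data - μ)**2) / (2 * σ**2))
    return -ll

res = minimize(neg_log_likelihood, x0=(0.0, 1.0),
               bounds=((None, None), (1e-6, None)))

# The optimizer should land close to the sample mean and (biased) std of ln_sample
res.x, (np.mean(ln_sample), np.std(ln_sample))
```
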
216+ Now that we have derived the expressions for $\hat{\mu}$ and $\hat{\sigma}$,
217+ let's compute them for our wealth sample.
176218
177219```{code-cell} ipython3
178220μ_hat = np.mean(ln_sample)
@@ -200,7 +242,6 @@ ax.legend()
200242plt.show()
201243```
202244
203-
204245Our estimated lognormal distribution appears to be a decent fit for the overall data.
205246
206247We now use {eq}`eq:est_rev` to calculate total revenue.
@@ -268,7 +309,6 @@ tr_pareto
268309
269310The number is very different!
270311
271-
272312```{code-cell} ipython3
273313tr_pareto / tr_lognorm
274314```
@@ -280,7 +320,6 @@ We see that choosing the right distribution is extremely important.
280320Let's compare the fitted Pareto distribution to the histogram:
281321
282322```{code-cell} ipython3
283-
284323fig, ax = plt.subplots()
285324ax.set_xlim(-1, 20)
286325ax.set_ylim(0,1.75)
@@ -292,68 +331,11 @@ ax.legend()
292331plt.show()
293332```
294333
334+ +++ {"user_expressions": []}
335+
295336We observe that in this case the fit for the Pareto distribution is not very
296337good, so we can probably reject it.
297338
298-
299-
300- ## Exponential distribution
301-
302- What other distributions could we try?
303-
304- Suppose we assume that the distribution is [exponential](https://en.wikipedia.org/wiki/Exponential_distribution)
305- with parameter $\lambda > 0$.
306-
307- The maximum likelihood estimate of $\lambda$ is given by
308-
309- $$
310- \hat{\lambda} = \frac{n}{\sum_ {i=1}^n w_i}
311- $$
312-
313- Let's calculate it.
314-
315- ```{code-cell} ipython3
316- λ_hat = 1/np.mean(sample)
317- λ_hat
318- ```
319-
320- Now let's compute total revenue:
321-
322- ```{code-cell} ipython3
323- dist_exp = expon(scale = 1/λ_hat)
324- tr_expo = total_revenue(dist_exp)
325- tr_expo
326- ```
327-
328- Again, calculated revenue is very different.
329-
330- ```{code-cell} ipython3
331- tr_expo / tr_lognorm
332- ```
333-
334- But once again, when we plot the fitted distribution against the data we see it
335- is a bad fit.
336-
337-
338- ```{code-cell} ipython3
339-
340- fig, ax = plt.subplots()
341- ax.set_xlim(-1, 20)
342-
343- ax.hist(sample, density=True, bins=5000, histtype='stepfilled', alpha=0.5)
344- ax.plot(x, dist_exp.pdf(x), 'k-', lw=0.5, label='exponential pdf')
345- ax.legend()
346-
347- plt.show()
348- ```
349-
350- So we can reject this calculation.
351-
352-
353-
354-
355-
356-
357339## What is the best distribution?
358340
359341There is no "best" distribution --- every choice we make is an assumption.
@@ -370,6 +352,7 @@ We set an arbitrary threshold of $500,000 and read the data into `sample_tail`.
370352
371353```{code-cell} ipython3
372354:tags: [hide-input]
355+
373356df_tail = df.loc[df['n_wealth'] > 500_000 ]
374357df_tail.head()
375358rv_tail = df_tail['n_wealth'].sample(n=10_000, random_state=4321)
@@ -410,7 +393,6 @@ ax.legend()
410393plt.show()
411394```
412395
413-
414396While the lognormal distribution was a good fit for the entire dataset,
415397it is not a good fit for the right hand tail.
416398
@@ -435,6 +417,7 @@ ax.plot(x, dist_pareto_tail.pdf(x), 'k-', lw=0.5, label='pareto pdf')
435417plt.show()
436418```
437419
420+ +++ {"user_expressions": []}
438421
439422The Pareto distribution is a better fit for the right hand tail of our dataset.
440423
@@ -451,3 +434,73 @@ There are other more rigorous tests, such as the [Kolmogorov-Smirnov test](https
451434
452435We omit the details.
453436
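That said, here is a minimal sketch of how such a test could be run with SciPy (our addition; it assumes the `sample_tail` array and the fitted `dist_pareto_tail` object from the cells above are still in scope).

```{code-cell} ipython3
from scipy.stats import kstest

# Kolmogorov-Smirnov test of the tail sample against the fitted Pareto cdf.
# A large p-value is consistent with a good fit, but the result is only
# indicative because the Pareto parameters were estimated from this same sample.
kstest(sample_tail, dist_pareto_tail.cdf)
```
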
437+ ## Exercises
438+
439+ ```{exercise-start}
440+ :label: mle_ex1
441+ ```
442+ Suppose we assume wealth is [exponentially](https://en.wikipedia.org/wiki/Exponential_distribution)
443+ distributed with parameter $\lambda > 0$.
444+
445+ The maximum likelihood estimate of $\lambda$ is given by
446+
447+ $$
448+ \hat{\lambda} = \frac{n}{\sum_{i=1}^n w_i}
449+ $$
450+
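For reference (a quick sketch added for completeness; it is not needed to solve the exercise), this expression follows from maximizing the exponential log likelihood:

$$
\ell(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^n w_i,
\qquad
\frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^n w_i = 0
\quad \Rightarrow \quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^n w_i}
$$
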
451+ 1. Compute $\hat{\lambda}$ for our initial sample.
452+ 2. Use $\hat{\lambda}$ to find the total revenue.
453+
454+ ```{exercise-end}
455+ ```
456+
457+ ```{solution-start} mle_ex1
458+ :class: dropdown
459+ ```
460+
461+ ```{code-cell} ipython3
462+ λ_hat = 1/np.mean(sample)
463+ λ_hat
464+ ```
465+
466+ ```{code-cell} ipython3
467+ dist_exp = expon(scale = 1/λ_hat)
468+ tr_expo = total_revenue(dist_exp)
469+ tr_expo
470+ ```
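
As in the lognormal and Pareto cases, it is informative to compare this figure with the lognormal-based estimate (this assumes `tr_lognorm`, computed earlier in the lecture, is still in scope):

```{code-cell} ipython3
# Ratio of the exponential-based revenue estimate to the lognormal-based one
tr_expo / tr_lognorm
```

Once again, the calculated revenue is very different.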
471+
472+ +++ {"user_expressions": []}
473+
474+ ```{solution-end}
475+ ```
476+
477+ ```{exercise-start}
478+ :label: mle_ex2
479+ ```
480+
481+ Plot the exponential distribution against the sample and assess whether it is a good fit.
482+
483+ ```{exercise-end}
484+ ```
485+
486+ ```{solution-start} mle_ex2
487+ :class: dropdown
488+ ```
489+
490+ ```{code-cell} ipython3
491+ fig, ax = plt.subplots()
492+ ax.set_xlim(-1, 20)
493+
494+ ax.hist(sample, density=True, bins=5000, histtype='stepfilled', alpha=0.5)
495+ ax.plot(x, dist_exp.pdf(x), 'k-', lw=0.5, label='exponential pdf')
496+ ax.legend()
497+
498+ plt.show()
499+ ```
500+
501+ +++ {"user_expressions": []}
502+
503+ Clearly, this distribution is not a good fit for our data.
504+
505+ ```{solution-end}
506+ ```