@@ -12,9 +12,8 @@ form of approximate inference.
# Motivation
[motivation]: #motivation

- I have a Stan model that is too slow to sample. I would like to do something
- better than optimization. Laplace approximations are a pretty standard way
- of doing this.
+ Laplace approximations are quick to compute. If a model can be fit with a Laplace
+ approximation, that will probably be much faster than running full HMC.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
@@ -33,7 +32,7 @@ where `jac(g, u)` is the Jacobian of the function `g` at `u`. In the Laplace
approximation, we search for a mode (a maximum) of `log(p(u))`. Call this
`u_mode`. This is not the same optimization that is done in the `optimizing`
algorithm. That searches for a mode of `log(p(g(u)))` (or the equation above
- without the `log(det(jac(g, u)))` term. This is different.
+ without the `log(det(jac(g, u)))` term). This is different.
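
As a concrete illustration, here is a minimal sketch in Python (purely
illustrative; a hand-written lognormal(0, 1) density with `g(u) = exp(u)`
stands in for a real Stan model) showing that the two objectives have
different modes:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import lognorm

def log_p_constrained(u):
    # log(p(g(u))): the objective the `optimizing` algorithm climbs
    return lognorm.logpdf(np.exp(u[0]), s=1.0)

def log_p_unconstrained(u):
    # log(p(g(u))) + log(det(jac(g, u))); for g(u) = exp(u) the log-Jacobian is u
    return log_p_constrained(u) + u[0]

opt_mode = minimize(lambda u: -log_p_constrained(u), x0=[0.0]).x[0]
lap_mode = minimize(lambda u: -log_p_unconstrained(u), x0=[0.0]).x[0]
print(np.exp(opt_mode))  # ~0.37 = exp(-1), the constrained-space mode
print(np.exp(lap_mode))  # ~1.0, the point the Laplace expansion is built around
```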

We can form a second order Taylor expansion of `log(p(u))` around `u_mode`:

@@ -44,10 +43,10 @@ log(p(u)) = log(p(u_mode))
  + O(||u - u_mode||^3)
```

- where `gradient(log(p), u)` is the gradient of `log(p)` at `u` and
- `hessian(log(p), u)` is the Hessian of `log(p)` at `u`. Because the gradient
- is zero at the mode, the linear term drops out. Ignoring the third order
- terms gives us a new distribution `p_approx(u)`:
+ where `gradient(log(p), u_mode)` is the gradient of `log(p(u))` at `u_mode` and
+ `hessian(log(p), u_mode)` is the Hessian of `log(p(u))` at `u_mode`. Because the
+ gradient is zero at the mode, the linear term drops out. Ignoring the third
+ order terms gives us a new distribution `p_approx(u)`:

```
log(p_approx(u)) = K + 0.5 * (u - u_mode)^T * hessian(log(p), u_mode) * (u - u_mode)
@@ -59,9 +58,10 @@ where K is a constant to make this normalize. `u_approx` (`u` sampled from
u_approx ~ N(u_mode, -(hessian(log(p), u_mode))^{-1})
```

- Taking draws from `u_approx` gives us draws from our distribution on Stan's
- unconstrained space. Once constrained, these draws can be used in the same
- way that regular draws from the `sampling` algorithm are used.
+ Taking draws from `u_approx` gives us draws from the distribution on the
+ unconstrained space. Once constrained (via the transform `g(u)`), these draws
+ can be used in the same way that regular draws from the `sampling` algorithm
+ are used.
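
As a sketch of the whole recipe (illustrative Python on a toy quadratic log
density, so the normal approximation happens to be exact; the real
implementation would use Stan's optimizer and the Hessian computation from the
reference-level section):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
prec = np.array([[2.0, -1.0], [-1.0, 2.0]])

def neg_log_p(u):
    # stand-in for the negated log density on the unconstrained scale,
    # with the Jacobian term already included
    return 0.5 * u @ prec @ u

res = minimize(neg_log_p, x0=np.zeros(2))  # res.x plays the role of u_mode
# BFGS (the default method here) tracks an inverse-Hessian estimate of
# neg_log_p, which is the covariance -(hessian(log(p), u_mode))^{-1}
draws = rng.multivariate_normal(res.x, res.hess_inv, size=100)
# each row of `draws` would then be pushed through the constraining transform g(u)
```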

The `laplace` algorithm would take in the same arguments as the `optimize`
algorithm plus two additional ones:
@@ -79,7 +79,7 @@ A model can be called by:
./model laplace num_samples=100 data file=data.dat
```

- or with the diagonal:
+ or with a small value added to the diagonal:
```
./model laplace num_samples=100 add_diag=1e-10 data file=data.dat
```
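
The adjustment itself is simple. A sketch of the idea (the exact placement of
the jitter is an assumption here, not something this proposal pins down):

```python
import numpy as np

def laplace_covariance(hess_log_p, add_diag=0.0):
    # covariance of the approximation is -(hessian)^{-1}; adding a small
    # value to the diagonal of the negated Hessian keeps it invertible and
    # positive definite when the mode sits in a nearly flat region
    neg_hess = -hess_log_p + add_diag * np.eye(hess_log_p.shape[0])
    return np.linalg.inv(neg_hess)
```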
@@ -92,7 +92,10 @@ The three algorithm specific parameters are:
1. `log_p__` - The log density of the model itself
2. `log_g__` - The log density of the Laplace approximation
3. `rejected__` - A boolean indicating whether it was possible to
- evaluate the log density of the model at this parameter value
+ evaluate the log density of the model at this parameter value
+
+ Reporting both the model and approximate log density allows for importance
+ sampling diagnostics to be applied to the model.

For instance, the new output might look like:

@@ -106,15 +109,24 @@ log_p__, log_g__, rejected__, b.1,b.2
-3, -4, 1, 0, 0
```

+ There are two additional diagnostics for the `laplace` approximation. The
+ `log_p__` and `log_g__` outputs can be used to do the Pareto-k
+ diagnostic from "Pareto Smoothed Importance Sampling" (Vehtari et al., 2015).
+
+ There could also be a diagnostic printout for how many times it is not
+ possible to evaluate the log density of the model at a certain point from the
+ approximate posterior (this would indicate the approximation is likely
+ questionable).
+
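For example, the check could be done outside of Stan from the CSV output
alone. A sketch (the column names are from this proposal; `output.csv` is a
placeholder, and a real Pareto-k fit, as in PSIS, is what packages like ArviZ
implement):

```python
import numpy as np
import pandas as pd

draws = pd.read_csv("output.csv", comment="#")
draws = draws[draws["rejected__"] == 0]       # drop unevaluable draws first
log_w = draws["log_p__"] - draws["log_g__"]   # log importance ratios
w = np.exp(log_w - log_w.max())               # stabilize before normalizing
w /= w.sum()
ess = 1.0 / np.sum(w ** 2)                    # importance-sampling ESS
print(f"ESS {ess:.1f} of {len(w)} draws kept")
```
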
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The implementation of this would borrow heavily from the optimization code. The
difference would be that the Jacobian would be turned on for the optimization.

- We will also need to implement a way of computing Hessians with finite
- differences of gradients. Simple finite differences were not sufficiently
- accurate for an example I was working on.
+ The Hessians should be computed using finite differences of reverse mode autodiff.
+ Higher order autodiff isn't possible for general Stan models, but finite differences
+ of reverse mode autodiff should be more stable than second order finite differences.
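
A sketch of that computation (an analytic gradient stands in for the
reverse-mode autodiff gradient Stan would supply):

```python
import numpy as np

def hessian_from_gradients(grad, u_mode, eps=1e-5):
    # central finite differences of a gradient function, one Hessian column
    # per pair of perturbed gradient evaluations
    n = u_mode.size
    hess = np.empty((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        hess[:, i] = (grad(u_mode + step) - grad(u_mode - step)) / (2 * eps)
    return 0.5 * (hess + hess.T)  # symmetrize away finite-difference noise

# e.g. a standard normal log density has gradient -u and Hessian -I:
print(hessian_from_gradients(lambda u: -u, np.zeros(3)))
```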

# Drawbacks
[drawbacks]: #drawbacks
@@ -137,24 +149,9 @@ lengths of their outputs (they may compute incorrect standard errors, etc).
print zeros where usually the parameters would be.

If the user does not check the `rejected__` column, then they will be using a
- bunch of zeros in their follow-up calculations that could mislead them.
-
- Similarly to divergent transitions, we can print to the output information about
- how many draws were rejected in any case.
-
- In the earlier part of the design document I assume #3 was implemented, but I
- think #2 might be better.
-
- # Rationale and alternatives
- [rationale-and-alternatives]: #rationale-and-alternatives
-
- An alternative that seems appealing at first is printing out the mode and
- hessian so that it would be possible for people to make their own posterior
- draws. This is not reasonable because the conversion from unconstrained to
- constrained space is not exposed in all of the interfaces.
+ bunch of zeros in their follow-up calculations that could be misleading.
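
The failure mode is easy to demonstrate: a sketch of the difference between
using the output naively and checking the column first (`output.csv` and the
`b.1` column follow the example output above):

```python
import pandas as pd

draws = pd.read_csv("output.csv", comment="#")
naive_mean = draws["b.1"].mean()        # zero-filled rows drag this toward 0
clean = draws[draws["rejected__"] == 0]
print(naive_mean, clean["b.1"].mean(), len(draws) - len(clean), "rejected")
```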

# Prior art
[prior-art]: #prior-art

- This is a pretty common Bayesian approximation. It just is not implemented in
- Stan.
+ This is a common Bayesian approximation. It just is not implemented in Stan.