You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Place for me to store notes on reinforcement learning, with a
focus on the details of the derivations.
This is a draft, and will never be more than a draft.
Intended as a reference for me, may not be legible to other people.
Only posting publicly in case other people find it useful.
TODO: Add paper links for where I first encountered these proofs,
relevant papers, etc.
Table of Contents
Clobbered for auto generated table of contents
{:toc}
REINFORCE
Let $$X$$ be a random variable with known p.d.f. $$p_\theta(X)$$.
(From here on, $$\theta$$ will be omitted. In general, $$\theta$$
will be specified the first time, and will be implicit all other times.)
Let $$J(\theta)$$ be some cost function defined as
for some arbitrary function $$f(x)$$ that does not depend on $$\theta$$.
We want to compute
$$\nabla_\theta J(\theta)$$. Using the log derivative trick, we can
change the gradient of the expectation to the expectation of the gradient.
Now that this is an expectation, we can get an approximate gradient
by sampling $$x$$.
So far, none of the derivation is RL specific. Let's now suppose
the random variable is trajectories $$\tau = (s_0,a_0,s_1,a_1,s_2,\cdots)$$,
generated by policy $$\pi_\theta$$ in an MDP. Let's then have
$$f(\tau)$$ be the reward over the trajectory. $$J(\theta)$$ is
now the expected reward of the policy, and the gradient gives a
way to update the policy to increased reward.
Once we assume we're in an MDP, we get some magic cancellations.
Note that the dynamics ($$p(s_0)$$ and $$p(s_{i+1}|s_i)$$) do not depend on
$$\theta$$, so when we compute the gradient, they get cancelled out, leaving.
Any time we can move everything except the score function out of the expectation,
we can cancel that term.
The equality above is true for any distribution, but the $$a_i\vert s_i$$ distribution
is especially helpful for MDPs. By the Markov property, $$a_i\vert s_i$$
is the same as $$a_i \vert s_{0:i},a_{0:i-1}$$.
REINFORCE Variance Reduction
Now that we're married to the MDP framework, there are ways to quickly
reduce variance without adding any bias.
Observation 1: When we take an action at timestep $$t$$, it can only
affect the rewards from timesteps $$t$$ and onward. Assuming
$$f(\tau) = R(\tau)$$, let
$$
R_{a:b} = \sum_{i=a}^b r_i
$$
Note $$R(\tau) = R_{0:\infty}$$.
We'll use similar notation to denote trajectories. $$\tau_{a:b}$$
indicates states, actions, and rewards from time $$a$$ to $$b$$.
Slightly reorder the gradient as the sum of expectations
$$
g = \sum_i \mathbb{E}\left[R(\tau)\nabla_\theta \log \pi(a_i|s_i)\right]
$$
Then split $$R_{0:\infty}$$ into the sum of rewards before and after time $$i$$.
Because of this, we can subtract a state-dependent baseline from the cumulative
rewards without changing the expectation.
$$
\sum_i \left(\mathbb{E}[(R_{i:\infty} - b(s_i))\nabla_\theta \log \pi(a_i|s_i)] \right)
= g - \sum_i \left(b(s_i)\nabla_\theta \log \pi(a_i|s_i)] \right)
= g
$$
In practice, people usually try to estimate $$V^{\pi}(s_i)$$ and use that as their
baseline. Intuitively, this "centers" the scaling of
$$\nabla_\theta \log \pi(a_i|s_i)$$. If we estimated $$V^{\pi}$$ exactly, the
scaling is positive if and only if the reward for taking action $$a_i$$
is better than the average.
Q-Learning
Bellman Operators
Operators map functions to functions. We can think of each application of an
operator as one step of an optimization.
Use $$\mathcal{T}$$ to denote operators.
Proof of Convergence For Tabular Problems
Let $$\mathcal{T}^{\pi}$$ be the Bellman operator for $$\pi$$. Define this
as
$$
(\mathcal{T}^{\pi}Q)(s, a) = r(s,a) + \gamma \mathbb{E}_{\pi}[Q(s',a')]
$$
We now prove that if we repeatedly apply $$T^{\pi}$$ to any $$Q$$, we converge
to $$Q^{\pi}$$.
First, show that $$Q^{\pi}$$ is a fixed point. Follows by definitions.
Now, prove $$\mathcal{T}^{\pi}$$ is a contraction.
An operator is a contraction if
$$
| \mathcal{T}^{\pi}Q_1 - \mathcal{T}^{\pi}Q_2 | \le c |Q_1 - Q_2|
$$
for some $$0 \le c < 1$$ and some distance function.
Why do we want to prove this?
By the Banach fixed-point theorem, for any contraction,
Repeatedly applying the contraction converges to a fixed point.
That fixed point is unique.
We use max norm as the distance function. $$|f|_\infty = \sup_x |f(x)|$$.
In this case, the supremum is over state-action pairs $$(s,a)$$. For any
$$(s,a)$$,
Therefore, $$| \mathcal{T}^{\pi}Q_1 - \mathcal{T}^{\pi}Q_2 |\infty \le \gamma |Q_1 - Q_2|\infty$$,
and since $$ 0 \le \gamma < 1$$, operator $$\mathcal{T}^{\pi}$$ is a contraction.
We've now shown the Bellman operator $$\mathcal{T}^{\pi}$$ converges to $$\pi$$.
Let $$\pi^*$$ be the optimal policy. What does the Bellman operator for that
look like?
$$
(\mathcal{T}^_Q)(s, a) = r(s,a) + \gamma \mathbb{E}_{\pi^_}[Q(s',a')]
$$
The optimal policy $$\pi^*$$ will take the action $$a'$$ that maximizes
$$Q(s', a')$$, which converts the expectation into a max.
$$
(\mathcal{T}^*Q)(s, a) = r(s,a) + \gamma \max_{a'}Q(s',a')
$$
This recovers the 1-step return that appears all over the place. Since this
is a special case of $$\mathcal{T}^{\pi}$$, by earlier argument this converges
to $$Q^*(s,a)$$.
In practice, we can never apply the Bellman operator exactly, except for
very small problems where we can easily iterate over all $$(s,a)$$.
Instead, we have parametrized Q-functions $$Q_\theta(s,a)$$,
then take gradient steps to minimze the TD-error
With parametrized Q-functions, we are no longer guaranteed to converge,
TODO: add soft Q-Learning.
Natural Policy Gradient
Natural policy gradient is a special case of natural gradient, so let's
explain that first.
Suppose we have some objective function $$\eta(\theta)$$ that's
differentiable. When optimizing $$\eta(\theta)$$, why do we step in
direction $$\nabla_\theta \eta(\theta)$$? It's because the direction
of steepest descent is the gradient. Quick argument follows.
The direction of steepest descent is the $$d\theta$$ that solves
where $$\alpha$$ is the angle between the two vectors. Since the gradient norm
is constant, this is minimized when $$|d\theta| = \epsilon$$ and $$\cos\alpha = -1$$.
This gives step direction $$-\nabla \eta(\theta)$$, and update.
$$
\theta_{n+1} = \theta - \nabla \eta(\theta)
$$
In this derivation, we implicitly assumed $$|d\theta|$$ was defined as
the Euclidean norm. We could alternatively define norm as
$$
|v|_G^2 = v^TGv
$$
where $$G$$ is some symmetric positive definite matrix. The Euclidean norm is a special
case of this, where $$G$$ is the identity matrix.
(Why is it sufficient to consider symmetric positive definite matrices?
We care specifically about the quadratic form $$f(v) = v^TGv$$. For any non-symmetric
$$A$$, you get the same function if you use $$(A + A^T)/2$$ instead, so
for any quadratic form $$f(v)$$, there exists some equivalent
symmetric positive definite matrix.)
Suppose we required the step direction to be $$|d\theta|_G^2 < \epsilon$$ instead.
Now what's the optimal step direction? Note that different directions have
different maximum Euclidean norms, so the direction aligned with
$$\nabla \eta(\theta)$$ may no longer be optimal.
(Linear algebra interlude because I keep forgetting basic linear algebra.)
Lemma: Every positive definite matrix $$G$$ is invertible.
Proof: Suppose not. Then there exist $$u,v$$ such that $$Gu = Gv$$ and
$$u \neq v$$. Equivalently, $$G(u-v) = 0$$ for some non-zero vector $$u - v$$.
But then $$(u-v)^TG(u-v) = 0$$, contradicting the positive definite definition,
so $$G$$ must be invertible. $$\blacksquare$$.
Solve this constrained optimization problem through Lagrange multipliers.
To simplify notation let $$g = \nabla_\theta \eta(\theta)$$.
At the solution $$d\theta$$, we must have
In regular gradient descent, we always take small steps in parameter
space, defined by the Euclidean distance in parameter space. Natural gradient
argues that we should instead take small steps in the space of probability
distributions. We can do this by measuring distance with the
Fisher information matrix.
(I'm pretty sure this argument is wrong in some subtle way, but it's
approximately correct.)
Natural policy gradient is the idea of natural gradient, applied to the
policy gradient from REINFORCE. At a high level it's actually pretty
simple.
Decide on some learning rate $$\alpha$$.
Approximate the inverse Fisher $$F^{-1}$$ from the data.
Compute the gradient $$g$$ REINFORCE would have given.
Update with $$\theta_{n+1} \gets \theta_n + \alpha * F^{-1}g$$.
In practice, you need to use conjugate gradients to compute $$F^{-1}g$$
effeciently - computing $$F^{-1}$$ explicitly takes $$O(n^2)$$ time,
where $$n$$ is the number of parameters, and conjugate gradients
let you approximate $$F^{-1}g$$ in linear time.
(This actually gets you 80% of the way to Trust Region Policy Optimization,
since at the end of its derivation it reduces to natural policy gradient
with an adaptive learning rate to take the largest step it can within
the trust region.)
Equivalent Formulations of Policies and Reward
(Adapted from a mix of Generative Adversarial Imitation Learning
and Trust-Region Policy Optimization.)
State Visitation / Occupency Measure
The state visitation frequency is defined as the discounted
sum of probabilities of visiting a given state.
More precisely, multiplying $$\rho_\pi(s)$$ by $$(1-\gamma)$$ gives a
probability distribution.
The occupency measure $$\rho_\pi(s,a)$$ is defined similarly, except it's the
discounted sum of probabilityes of visiting state-action pairs $$(s,a)$$,
instead of just states $$s$$.
which can be interpreted as taking the average reward over all trajectories.
But alternatively, we can count how often we visit state-action pairs $$(s,a)$$,
and sum over that.
$$
\eta(\pi) = \sum_{s,a} \rho_\pi(s,a) r(s,a)
$$
Argument that works for me - imagine a giant grid of all $$(s,a)$$ pairs.
Whenever you execute an episode, when you encounter $$(s_t, a_t)$$,
add $$\gamma^t r(s_t,a_t)$$ to that square in the grid. The sum of values in
the grid is the reward of the episode. The sum of values in the average grid
is the expected reward of $$\pi$$. But equivalently, we could ask,
"What value is in cell $$(s,a)$$ in the average grid?" And it turns out
that cell will have value $$\rho_\pi(s,a)r(s,a)$$.
TODO: add proof of bijection between occupnecy measures $$\rho$$
and policies $$\pi$$.
Why bother formulating the problem this way? It gives another way to think
about the learning problem. In the trajectory view, you want to update
your policy such that episodes that give good reward happen more often.
In the state-action view, you want to update such that you visit
good $$(s,a)$$ more often. It can be easier to make arguments
in one view than in the other.
Advantage With Respect to Another Policy
The expected return of $$\pi'$$ can be defined by its expected
advantage over $$\pi$$, which is handy for thinking about the difference
between policies.
At this point, note that the last two summations are identical, except the
first starts counting from $$1$$ and the other starts counting from $$0$$.
This lets us cancel all terms except for $$t=0$$ in the second summation.
Now note that because the initial state $$s_0$$ is independent of the policy,
the first expectation changes to $$-\eta(\pi)$$. The second expectations
is $$\eta(\pi')$$ by definition. Therefore,
In REINFORCE, you can subtract a state-dependent baseline without biasing the
gradient. The Q-Prop paper gives a way to subtract an action-dependent baseline.
The argument is based on control variates, a variance reduction technique.
The Wikipedia page
is pretty good.
Suppose we want to estimate $$\mathbb{E}[f(X)]$$. We can do so by sampling
from $$X$$. Now, let $$g(X)$$ be another function, whose expectation
$$\mathbb{E}[g(X)]$$ is known / can be computed analytically. Let
$$\mathbb{E}[f(X)] = \mu$$ and $$\mathbb{E}[g(X)] = b$$. Then
$$
f(X) + c (g(X) - b)
$$
also has expectation $$\mu$$ for any $$c$$. If $$f(X)$$ and $$g(X)$$ are
correlated, a good choice of $$c$$ can make
the variance of $$f(X) + c(g(X) - b)$$ lower than
the variance of $$f(X)$$.
To apply control variates to RL, recall that we showed the policy gradient was
$$
g = \sum_i \mathbb{E}_{\tau|s_i}[R_{i:\infty}\nabla_\theta \log \pi(a_i|s_i)]
$$
We want to subtract an action dependent baseline $$f(s,a)$$. Next, we
make an observation from the MuProp paper: first-order Taylor expansions make
nice control variates. We'll see how the linear function makes the math nice.
(Or at least, nicer. It's still pretty bashy.)
Let $$f(s, a)$$ be some arbitrary function. Since each expectation is conditioned
on $$s_i$$, it suffices to consider the Taylor expansion with respect to just
$$a$$. Let $$\bar{a}$$ be some arbitrary action, depending only on $$s_i$$.
The 1st-order Taylor expansion $$\bar{f}(s,a)$$ around $$(s_i, \bar{a})$$
is
This is trivially equal to the original gradient, which leaves analytically
computing the 2nd term. Think of this as the offset we need to re-add to the gradient
to make the gradient estimator unbiased again.
In the first expectation, because $$\bar{a}$$ depends only on $$s_i$$, the
scaling factor depends only on $$s_i$$, and therefore the
expectation goes to $$0$$ (we can move $$f(s_i, \bar{a})$$ out of the expectation,
then note expectation of the score function is $$0$$.)
In the last expectation, the same is also true. $$\nabla_a f(s_i,a)|_{a=\bar{a}}\cdot \bar{a}$$
depends only on $$\bar{a}$$, which only depends on $$s_i$$, and therefore can
also move out of the expectation, and therefore that term goes to $$0$$ too.
That leaves the middle expectation. At this point, note the expectation over $$\tau\vert s_i$$
is the same as the expectation over $$a_i \vert s_i$$, since at this point no
terms depend on other states or actions.
Points of confusion I ran into: we're applying the same trick of $$p \nabla \log p = \nabla p$$.
We're multiplying a $$dim(a)$$ vector by a $$dim(a) \times dim(\theta)$$ matrix
to get a $$dim(\theta)$$ vector. (Recall we want the expectation to return
an offset for the gradient with respect to $$\theta$$.)
We've ended with multiplying the gradient w.r.t $$a$$ at $$\bar{a}$$ with the
gradient w.r.t. $$\theta$$ of the mean action of policy $$\pi$$. From here,
we can make a number of natural choices.
Let $$\pi_|theta(s) = \mathcal{N}(\mu_\theta(s), \Sigma)$$, where $$\mu_\theta(s)$$
is some learned actor (some neural net), and $$\Sigma$$ is either constant
or also learned. There's no specific reason $$\pi$$ needs to be Gaussian,
as long as the mean action has an analytic closed form.
For the $$i$$th term, let $$\bar{a} = \mu_\theta(s_i)$$.
For $$f(s, a)$$, let $$f$$ be an estimate of the Q-function $$Q_w(s,a)$$, where
$$w$$ are the weights of $$Q_w(s,a)$$. We update $$w$$ in between computing and
applying each policy gradient.
(The paper interprets $$R_{i:\infty}$$ as the empirical Q-value. Under that
interpretation, each gradient is scaled by the difference between the
empirical and estimated Q-values.)
TODO: explain extenstion to advantage estimation.
Why go through all this effort to learn an action-dependent baseline? Intuitvely,
if our baseline depends on both state and action, we can do a better job of
"centering" the gradient. we can also learn $$Q_w$$ with off-policy data saved
in a replay buffer, with standard Q-Learning.
(A common choice of state-dependent baseline is a learned value function $$V$$.
Why can't we learn $$V$$ with off-policy data? The problem is that it doesn't
have the same self-consistency objective as the TD-error in Q-Learning, it requires knowing
the action the current policy $$\pi$$ would take, which requires on-policy
data collection.)
Important note: Q-Prop is an approach for estimating the gradient. It says nothing
about how to apply it. You can directly add the gradient with SGD, or you can use
natural policy gradient, or TRPO.