---
tags:
comments: true
dg-publish: true
---
Considering finite horizons, the time-limited value of a state $s$ with a time limit of $k$ timesteps is denoted $U_k(s)$: the maximum expected utility achievable from $s$ if the episode ends in $k$ timesteps.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit to compute time-limited values until convergence, operating as follows:
- $\forall s \in S$, initialize $U_{0}(s)=0$, since no actions can be taken to acquire rewards.
- Repeat the following update rule until convergence ($B$ is called the Bellman operator):
$$
\forall s\in S,\;U_{k+1}(s)\leftarrow\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})[R(s,a,s^{\prime})+\gamma U_{k}(s^{\prime})] \quad (\text{or } U_{k+1} \leftarrow BU_{k})
$$
When convergence is reached, the Bellman equation will hold for every state: $\forall s \in S, U_{k+1}(s)=U_{k}(s) = U^*(s)$.
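As a concrete illustration, here is a minimal Python sketch of value iteration. The MDP representation is an assumption for this note, not from the source: `T[(s, a)]` lists `(next_state, probability)` pairs, `R(s, a, s2)` returns the transition reward, and the two-state MDP at the bottom is a made-up toy example.

```python
# Minimal value iteration sketch over a hypothetical tabular MDP.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    U = {s: 0.0 for s in states}  # U_0(s) = 0: no timesteps left, no reward
    while True:
        # One application of the Bellman operator: U_{k+1} <- B U_k
        U_next = {
            s: max(
                sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            for s in states
        }
        # Converged once the update no longer changes any value
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next

# Hypothetical two-state MDP: reward 1 for every transition out of "s1"
states, actions = ["s0", "s1"], ["stay", "move"]
T = {("s0", "stay"): [("s0", 1.0)], ("s0", "move"): [("s1", 1.0)],
     ("s1", "stay"): [("s1", 1.0)], ("s1", "move"): [("s0", 1.0)]}
R = lambda s, a, s2: 1.0 if s == "s1" else 0.0
print(value_iteration(states, actions, T, R))  # ≈ {'s0': 9.0, 's1': 10.0}
```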
Q-value iteration is a dynamic programming algorithm that computes time-limited Q-values. It is described in the following equation:
$$Q_{k+1}(s,a)\leftarrow\sum_{s^{\prime}}T(s,a,s^{\prime})[R(s,a,s^{\prime})+\gamma\max_{a^{\prime}}Q_k(s^{\prime},a^{\prime})]$$
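Under the same assumed representation, a sketch of Q-value iteration; the only change from value iteration is where the max over actions sits:

```python
# Q-value iteration sketch; T, R as in the hypothetical representation above
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    Q = {(s, a): 0.0 for s in states for a in actions}  # Q_0(s, a) = 0
    while True:
        # Expectation over s' first, then max over a' inside the target
        Q_next = {
            (s, a): sum(
                p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)]
            )
            for s in states for a in actions
        }
        if max(abs(Q_next[sa] - Q[sa]) for sa in Q) < tol:
            return Q_next
        Q = Q_next
```

The payoff of storing Q-values is that an optimal policy can later be read off with $\operatorname{argmax}_a Q^*(s,a)$ alone, without consulting $T$ or $R$ again.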
Policy extraction recovers the optimal policy $\pi^*$ from the optimal values or Q-values:
$$\forall s\in S,\;\pi^{*}(s)=\underset{a}{\operatorname{argmax}}\,Q^{*}(s,a)=\underset{a}{\operatorname{argmax}}\sum_{s^{\prime}}T(s,a,s^{\prime})[R(s,a,s^{\prime})+\gamma U^{*}(s^{\prime})]$$
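A one-step-lookahead sketch of policy extraction, again assuming the hypothetical `T`/`R` representation used above and values `U` such as those returned by `value_iteration`:

```python
# Extract a policy from values U by one-step lookahead in each state
def extract_policy(states, actions, T, R, U, gamma=0.9):
    return {
        s: max(
            actions,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2])
                              for s2, p in T[(s, a)]),
        )
        for s in states
    }
```

If Q-values are available instead, the model-free form is simply `max(actions, key=lambda a: Q[(s, a)])`.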
For a policy $\pi$, policy evaluation means computing $U^{\pi}(s)$ for all states $s$, where $U^{\pi}(s)$ is the expected utility of starting in state $s$ and following $\pi$:
$$U^{\pi}(s)=\sum_{s^{\prime}}T(s,\pi(s),s^{\prime})[R(s,\pi(s),s^{\prime})+\gamma U^{\pi}(s^{\prime})]$$
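A sketch of iterative policy evaluation under the same assumed representation, where `pi` is a dict mapping each state to its action:

```python
# Iterative policy evaluation: fixed-policy Bellman updates until stable
def policy_evaluation(states, T, R, pi, gamma=0.9, tol=1e-8):
    U = {s: 0.0 for s in states}
    while True:
        # No max over actions: pi already picks the action in each state
        U_next = {
            s: sum(p * (R(s, pi[s], s2) + gamma * U[s2])
                   for s2, p in T[(s, pi[s])])
            for s in states
        }
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next
```

Because the max over actions is gone, these equations are linear in the $|S|$ unknowns $U^{\pi}(s)$, so they could equally be solved exactly with a linear solver.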
Once we’ve evaluated the current policy, we use policy improvement to generate a better one.
Defining the policy at iteration $i$ of policy iteration as $\pi_{i}$, we have
$$U_{k+1}^{\pi_{i}}(s)\leftarrow\sum_{s^{\prime}}T(s,\pi_{i}(s),s^{\prime})[R(s,\pi_{i}(s),s^{\prime})+\gamma U_{k}^{\pi_{i}}(s^{\prime})]$$
Finally, improve the policy with:
$$\pi_{i+1}(s)=\underset{a}{\operatorname{argmax}}\sum_{s^{\prime}}T(s,a,s^{\prime})[R(s,a,s^{\prime})+\gamma U^{\pi_i}(s^{\prime})]$$
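Putting evaluation and improvement together, a sketch of the full policy iteration loop (reusing the hypothetical `policy_evaluation` from above and the same assumed `T`/`R` representation):

```python
# Policy iteration: alternate evaluation and improvement until stable
def policy_iteration(states, actions, T, R, pi0, gamma=0.9):
    pi = dict(pi0)  # pi_0: any initial policy
    while True:
        U = policy_evaluation(states, T, R, pi, gamma)  # U^{pi_i}
        # pi_{i+1}(s) = argmax_a sum_{s'} T(s,a,s')[R(s,a,s') + gamma U^{pi_i}(s')]
        pi_next = {
            s: max(
                actions,
                key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2])
                                  for s2, p in T[(s, a)]),
            )
            for s in states
        }
        if pi_next == pi:  # improvement changed nothing: pi is optimal
            return pi, U
        pi = pi_next
```

When improvement leaves the policy unchanged, $\pi_{i+1}=\pi_i$ satisfies the Bellman optimality condition, so the loop returns an optimal policy.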
