Commit 6b028fa

Replace with clean path-intact layout + Chapter 2 working code + CI
1 parent b217701 commit 6b028fa

27 files changed: 226 additions & 458 deletions

.github/workflows/python-tests.yml

Lines changed: 9 additions & 8 deletions
@@ -1,4 +1,4 @@
-name: Python (Chapter 2)
+name: Python (Chapters)
 
 on:
   push:
@@ -11,21 +11,22 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.9", "3.10", "3.11"]
+        python-version: ["3.10", "3.11"]
     steps:
       - uses: actions/checkout@v4
 
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-          cache: "pip"
 
-      - name: Install dependencies (Chapter 2)
+      - name: Install root deps
         run: |
           python -m pip install --upgrade pip
-          pip install -r ch2_rl_formulation/requirements.txt
+          pip install -r requirements.txt
 
-      - name: Run tests (Chapter 2)
-        run: |
-          pytest -q ch2_rl_formulation/tests
+      - name: Run Chapter 2 tests
+        run: pytest -q ch2_rl_formulation/tests
+
+      - name: Run Chapter 3 tests
+        run: pytest -q ch3_multi_armed_bandits/tests

.gitignore

Lines changed: 2 additions & 15 deletions
@@ -1,18 +1,5 @@
-# Python
+.venv/
 __pycache__/
 *.pyc
-.venv/
-.env
-.ipynb_checkpoints/
-
-# OS
-.DS_Store
-
-# Packaging / build
-build/
-dist/
-*.egg-info/
-
-# IDE
-.vscode/
 .idea/
+.DS_Store

LICENSE

Lines changed: 1 addition & 19 deletions
@@ -1,21 +1,3 @@
 MIT License
 
-Copyright (c) 2025 <Your Name>
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+Copyright (c) 2025

README.md

Lines changed: 2 additions & 32 deletions
@@ -1,32 +1,2 @@
-# Reinforcement Learning Fundamentals — Companion Code
-
-This repository contains runnable Python code accompanying the book **"Reinforcement Learning Fundamentals: From Theory to Practice"**.
-
-> Tip: Every chapter has its own folder with a short `README.md` and Python examples that mirror the book’s notation and figures.
-
-## Contents
-
-- [`ch2_rl_formulation/`](ch2_rl_formulation/README.md) — MDPs, policies, value functions, Bellman equations, policy evaluation, value iteration, grid world.
-- [`ch3_multi_armed_bandits/`](ch3_multi_armed_bandits/README.md) — ε-greedy, UCB, Thompson Sampling.
-- [`utils/`](utils/) — Shared helpers (random seeds, plotting, gridworld helpers).
-
-## Quickstart
-
-```bash
-# 1) Create and activate a virtual environment (optional but recommended)
-python -m venv .venv
-source .venv/bin/activate  # Windows: .venv\Scripts\activate
-
-# 2) Install dependencies
-pip install -r requirements.txt
-
-# 3) Run a Chapter 2 demo (GridWorld Value Iteration)
-python ch2_rl_formulation/value_iteration.py
-
-# 4) Run the Chapter 2 Random MDP demo
-python ch2_rl_formulation/demo_random_mdp.py
-
-# 5) Run Chapter 3 bandit demos
-python ch3_multi_armed_bandits/epsilon_greedy.py
-python ch3_multi_armed_bandits/ucb.py
-python ch3_multi_armed_bandits/thompson_sampling.py
+# Reinforcement Learning Fundamentals — Code
+This repo hosts chapter-wise companion code. Chapter 2 is complete and CI-tested.

ch2_rl_formulation/LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

ch2_rl_formulation/README.md

Lines changed: 0 additions & 45 deletions
This file was deleted.

ch2_rl_formulation/evaluation.py

Lines changed: 35 additions & 12 deletions
@@ -1,22 +1,45 @@
 import numpy as np
 
-def policy_evaluation(S,A,P,R,pi,gamma=1.0,theta=1e-10):
-    nS,nA,_=P.shape; V=np.zeros(nS)
+def policy_evaluation(S, A, P, R, pi, gamma=1.0, theta=1e-10):
+    """
+    Tabular policy evaluation for general R(s,a,s').
+    Inputs:
+      - S: list of states
+      - A: list of actions
+      - P: [|S|, |A|, |S'|] transition probabilities
+      - R: [|S|, |A|, |S'|] rewards
+      - pi: [|S|, |A|] policy (row-stochastic; can be deterministic one-hot)
+      - gamma: discount factor
+      - theta: convergence threshold (max delta)
+    Returns:
+      - V: np.ndarray of shape [|S|]
+    """
+    nS, nA, nSp = P.shape
+    assert nS == len(S) and nA == len(A) and nSp == nS
+    assert pi.shape == (nS, nA)
+
+    V = np.zeros(nS, dtype=float)
     while True:
-        delta=0; V_new=np.zeros_like(V)
+        delta = 0.0
+        V_new = np.zeros_like(V)
         for s in range(nS):
-            val=0
+            val = 0.0
             for a in range(nA):
-                if pi[s,a]==0: continue
-                val+=pi[s,a]*np.sum(P[s,a,:]*(R[s,a,:]+gamma*V))
-            V_new[s]=val; delta=max(delta,abs(V_new[s]-V[s]))
-        V=V_new
-        if delta<theta: break
+                p_sa = pi[s, a]
+                if p_sa == 0.0:
+                    continue
+                val += p_sa * np.sum(P[s, a, :] * (R[s, a, :] + gamma * V))
+            V_new[s] = val
+            delta = max(delta, abs(V_new[s] - V[s]))
+        V = V_new
+        if delta < theta:
+            break
     return V
 
-def q_from_v(S,A,P,R,V,gamma=1.0):
-    nS,nA,_=P.shape; Q=np.zeros((nS,nA))
+def q_from_v(S, A, P, R, V, gamma=1.0):
+    nS, nA, _ = P.shape
+    Q = np.zeros((nS, nA), dtype=float)
     for s in range(nS):
         for a in range(nA):
-            Q[s,a]=np.sum(P[s,a,:]*(R[s,a,:]+gamma*V))
+            Q[s, a] = np.sum(P[s, a, :] * (R[s, a, :] + gamma * V))
     return Q
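As a quick sanity check of the evaluation routine added in this commit, here is a self-contained sketch: the iterative sweep is restated inline (minus the repo's shape assertions) so it runs without the package, and the two-state MDP is invented purely for illustration.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=1.0, theta=1e-10):
    # Same iterative Bellman-expectation sweep as the repo version,
    # without the S/A bookkeeping arguments.
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    while True:
        V_new = np.array([
            sum(pi[s, a] * np.sum(P[s, a, :] * (R[s, a, :] + gamma * V))
                for a in range(nA))
            for s in range(nS)
        ])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

# Two-state MDP, two actions: 0 = stay, 1 = switch to the other state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # "stay" is deterministic
P[0, 1, 1] = P[1, 1, 0] = 1.0   # "switch" is deterministic
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0                # switching out of state 0 pays +1

pi = np.array([[0.0, 1.0], [0.0, 1.0]])  # policy: always switch
V = policy_evaluation(P, R, pi, gamma=0.5)
print(V)
```

Solving the two linear Bellman equations by hand, V(0) = 1 + 0.5·V(1) and V(1) = 0.5·V(0), gives V = [4/3, 2/3], which the iteration reproduces to within `theta`.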
Lines changed: 29 additions & 10 deletions
@@ -1,15 +1,34 @@
 import numpy as np
 from ch2_rl_formulation.gridworld import GridWorld4x4
 from ch2_rl_formulation.policies import greedy_toward_goal_policy
-from ch2_rl_formulation.evaluation import policy_evaluation,q_from_v
+from ch2_rl_formulation.evaluation import policy_evaluation, q_from_v
 
 def main():
-    env=GridWorld4x4(); S,A=env.states(),env.actions()
-    P,R=env.P_tensor(),env.R_tensor()
-    pi=greedy_toward_goal_policy(env)
-    V=policy_evaluation(S,A,P,R,pi)
-    print("V_pi grid:",np.array(V).reshape(4,4))
-    Q=q_from_v(S,A,P,R,V)
-    s_bl=env.state_index(3,0)
-    print("Q bottom-left:",[Q[s_bl,env.action_index(a)] for a in A])
-if __name__=='__main__': main()
+    env = GridWorld4x4(step_reward=-1, goal_reward=0, goal=(0,3))
+    S, A = env.states(), env.actions()
+    P, R = env.P_tensor(), env.R_tensor()
+    pi = greedy_toward_goal_policy(env)
+    V = policy_evaluation(S, A, P, R, pi, gamma=1.0, theta=1e-12)
+    Vgrid = np.array(V).reshape(4,4)
+    print("V_pi grid (goal top-right):\n", Vgrid)
+    expected = np.array([
+        [-4, -3, -2,  0],
+        [-5, -4, -3, -1],
+        [-6, -5, -4, -2],
+        [-7, -6, -5, -3],
+    ], dtype=float)
+    assert np.allclose(Vgrid, expected, atol=1e-12)
+    Q = q_from_v(S, A, P, R, V, gamma=1.0)
+    s_bl = env.state_index(3,0)
+    a = {name: env.action_index(name) for name in A}
+    print("\nQ at bottom-left (row=3,col=0):")
+    for name in A:
+        print(f"{name:>5}: {Q[s_bl, a[name]]:6.2f}")
+    assert abs(Q[s_bl, a["up"]] - (-7)) < 1e-12
+    assert abs(Q[s_bl, a["right"]] - (-7)) < 1e-12
+    assert abs(Q[s_bl, a["left"]] - (-8)) < 1e-12
+    assert abs(Q[s_bl, a["down"]] - (-8)) < 1e-12
+    print("\nAll checks PASS.")
+
+if __name__ == "__main__":
+    main()
Lines changed: 23 additions & 8 deletions
@@ -1,9 +1,24 @@
+def approx(x, y, tol=1e-6):
+    return abs(x - y) < tol
+
 def main():
-    print("G0 =",1+0.9*2+0.9**2*3) #5.23
-    print("v =",-1+0.9*(-1)+0.9**2*(-1)+0.9**3*10) #4.58
-    print("v_pe =",2+0.9*4) #5.6
-    print("q_pe =",1+0.9*3) #3.7
-    print("v* =",max(2+0.9*5,1+0.9*8)) #8.2
-    print("q* =",2+0.9*6) #7.4
-    print("v*_4steps =",sum((0.9**k)*(-1) for k in range(4))+0.9**4*10) #3.122
-if __name__=='__main__': main()
+    ok = True
+    g0 = 1 + 0.9*2 + (0.9**2)*3
+    print("G0 =", g0); ok &= approx(g0, 5.23)
+    v = -1 + 0.9*(-1) + (0.9**2)*(-1) + (0.9**3)*10
+    print("v =", v); ok &= approx(v, 4.58)
+    v_pe = 2 + 0.9*4
+    print("v_pe =", v_pe); ok &= approx(v_pe, 5.6, 1e-12)
+    q_pe = 1 + 0.9*3
+    print("q_pe =", q_pe); ok &= approx(q_pe, 3.7, 1e-12)
+    vopt = max(2 + 0.9*5, 1 + 0.9*8)
+    print("v* =", vopt); ok &= approx(vopt, 8.2, 1e-12)
+    qopt = 2 + 0.9*6
+    print("q* =", qopt); ok &= approx(qopt, 7.4, 1e-12)
+    v4 = sum((0.9**k)*(-1) for k in range(4)) + (0.9**4)*10
+    print("v*(4 steps) =", v4); ok &= abs(v4 - 3.122) < 1e-3
+    print("\nALL NUMERIC EXAMPLES:", "PASS" if ok else "FAIL")
+    if not ok: raise SystemExit(1)
+
+if __name__ == "__main__":
+    main()
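Every hand-computed return in the script above instantiates the same discounted-return formula, G = Σ_k γ^k r_k. A tiny helper (hypothetical, not part of the repo) makes the shared pattern explicit:

```python
def discounted_return(rewards, gamma=0.9):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Matches the G0 example above (approx. 5.23):
print(discounted_return([1, 2, 3]))
# Matches the 4-step example: four -1 costs, then a terminal 10 (approx. 3.122):
print(discounted_return([-1, -1, -1, -1, 10]))
```

Writing the examples this way separates the reward sequence from the discounting, so each checked constant in the script corresponds to one list of rewards.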
