# ch8_td_lambda · Eligibility Traces and TD(λ)

Reference implementations and experiments for **Chapter 8** of
_Reinforcement Learning Fundamentals: From Theory to Practice_.

This chapter covers the forward and backward views of TD(λ), TD(λ) prediction, SARSA(λ) control, and true-online TD(λ) with linear function approximation.

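For reference (these are the standard definitions; see Sutton & Barto, 2018), the forward view targets the λ-return

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

while the backward view implements it incrementally through eligibility traces:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}\{S_t = s\}, \qquad V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s).$$

The two views agree exactly for offline (end-of-episode) updates, which is the equivalence the unit test below checks numerically.
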

---

## Folder layout

```
ch8_td_lambda/
├─ gridworld_small.py                 # 4×4 tabular gridworld (start=(3,0), goal=(0,3))
├─ td_lambda.py                       # TD(λ) prediction (backward view; accumulating/replacing)
├─ sarsa_lambda.py                    # SARSA(λ) control with ε-greedy
├─ true_online_td_lambda.py           # True Online TD(λ) for linear FA
├─ plot_tdlambda_learning.py          # Produces learning curves for λ ∈ {0, 0.5, 1}
└─ tests/
   └─ test_forward_backward_equiv.py  # Forward ↔ backward numerical check
```

---

## Quick start

> Assumes Python ≥ 3.9 and `matplotlib`, `numpy`, `pytest`.

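If the dependencies are missing, a standard install (assuming a plain `pip` setup) is:

```bash
python -m pip install numpy matplotlib pytest
```
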
### 1) Run the unit test (forward ↔ backward equivalence)

```bash
pytest ch8_td_lambda/tests -q
```

Expected:
```
. [100%]
1 passed in ~0.02s
```

### 2) Generate learning curves (SARSA(λ) on gridworld)

```bash
python ch8_td_lambda/plot_tdlambda_learning.py
```

Artifacts written to the project directory (the figure goes under `figs/`):
- `ch8_tdlambda_learning.csv`
- `figs/ch8_tdlambda_learning.png`

The plot compares success rates for **λ ∈ {0.0, 0.5, 1.0}**; intermediate λ typically balances learning speed and stability in this task.

---

## Minimal examples

### TD(λ) prediction (tabular; backward view)
```python
import numpy as np
from ch8_td_lambda.gridworld_small import GridworldSmall
from ch8_td_lambda.td_lambda import td_lambda_prediction

env = GridworldSmall(seed=0)

def random_policy(s: int):
    return np.ones(env.n_actions) / env.n_actions  # uniform distribution over actions

V = td_lambda_prediction(env, random_policy, gamma=0.99, alpha=0.1, lam=0.9, episodes=200)
print(V.reshape(env.n_rows, env.n_cols))  # estimated state values laid out on the grid
```

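For intuition, the inner loop of `td_lambda_prediction` presumably resembles the sketch below (illustrative only, reusing `env` and `random_policy` from the example above; the `reset()`/`step()` signature is an assumption, not necessarily the API of `gridworld_small.py`):

```python
# Illustrative tabular backward-view TD(λ) episode (accumulating traces).
gamma, alpha, lam = 0.99, 0.1, 0.9
rng = np.random.default_rng(0)
V, e = np.zeros(env.n_states), np.zeros(env.n_states)
s, done = env.reset(), False                  # hypothetical reset signature
while not done:
    a = rng.choice(env.n_actions, p=random_policy(s))
    s_next, r, done = env.step(a)             # hypothetical step signature
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    e *= gamma * lam                          # decay all traces
    e[s] += 1.0                               # accumulate the trace for s
    V += alpha * delta * e                    # credit earlier states via traces
    s = s_next
```
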
### SARSA(λ) control (ε-greedy)
```python
from ch8_td_lambda.gridworld_small import GridworldSmall
from ch8_td_lambda.sarsa_lambda import sarsa_lambda_control

env = GridworldSmall(seed=0)
Q = sarsa_lambda_control(env, gamma=0.99, alpha=0.1, lam=0.8, epsilon=0.1, episodes=1000)
print(Q.argmax(axis=1).reshape(env.n_rows, env.n_cols))  # greedy action index per cell
```

### True Online TD(λ) (linear FA; one-hot features)
```python
import numpy as np
from ch8_td_lambda.gridworld_small import GridworldSmall
from ch8_td_lambda.true_online_td_lambda import true_online_td_lambda_linear

env = GridworldSmall(seed=0)

def phi(s: int):
    x = np.zeros(env.n_states, dtype=float)  # one-hot feature vector for state s
    x[s] = 1.0
    return x

w = true_online_td_lambda_linear(env, phi, gamma=0.99, alpha=0.15, lam=0.8, episodes=800, seed=0)
print(w.reshape(env.n_rows, env.n_cols))  # with one-hot features, weights equal the value estimates
```
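
For orientation, the per-step update that "true online" refers to (van Seijen & Sutton, 2014; Sutton & Barto, 2018, §12.5) is sketched below. Variable names are illustrative, not the internals of `true_online_td_lambda_linear`; `x`, `x_next`, `r`, `e`, and `v_old` are assumed to be carried between steps:

```python
# One step of the standard true-online TD(λ) update (illustrative sketch).
v, v_next = w @ x, w @ x_next                  # linear value estimates
delta = r + gamma * v_next - v                 # TD error
e = gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ x)) * x   # "dutch" trace
w = w + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
v_old, x = v_next, x_next                      # carry over to the next step
```
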

---

## Expected outputs

- **Learning curves:** `figs/ch8_tdlambda_learning.png` (success rate vs. episodes for λ = 0.0, 0.5, 1.0).
- **CSV:** `ch8_tdlambda_learning.csv` with columns `episodes, lambda_0.0, lambda_0.5, lambda_1.0` (see the re-plot sketch below).

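To re-plot from the CSV without rerunning the experiment, a minimal sketch (assuming the column order above) is:

```python
import numpy as np
import matplotlib.pyplot as plt

# Columns: episodes, lambda_0.0, lambda_0.5, lambda_1.0 (per the layout above).
data = np.genfromtxt("ch8_tdlambda_learning.csv", delimiter=",", skip_header=1)
for col, lam in zip(data[:, 1:].T, (0.0, 0.5, 1.0)):
    plt.plot(data[:, 0], col, label=f"λ = {lam}")
plt.xlabel("episodes")
plt.ylabel("success rate")
plt.legend()
plt.show()
```
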

---

## LaTeX snippet (embed figure in the book)

After generating the figure, move/commit it under `figs/` and include:

```latex
\begin{figure}[h!]
  \centering
  \includegraphics[width=0.75\linewidth]{figs/ch8_tdlambda_learning.png}
  \caption{Learning curves for TD($\lambda$) on a $4\times4$ gridworld under SARSA($\lambda$). Intermediate $\lambda$ values (e.g., $0.5$) often balance speed and stability.}
  \label{fig:tdlambda-learning}
\end{figure}
```

---

## Notes

- `sarsa_lambda_control` supports `trace_type="accumulating"` or `"replacing"`; the learning-curve script defaults to replacing traces, for stability when states repeat within an episode (see the sketch below).
- For reproducibility, seeds are set inside the scripts; α, ε, and λ can be adjusted from the script/CLI if desired.

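The difference between the two trace types comes down to one line in the update. A minimal sketch follows; `e`, `delta`, and friends are illustrative names, not the module's internals:

```python
# Per-step trace update in tabular SARSA(λ) (illustrative).
e *= gamma * lam                  # decay every trace
if trace_type == "accumulating":
    e[s, a] += 1.0                # can exceed 1 when (s, a) repeats within an episode
else:                             # "replacing"
    e[s, a] = 1.0                 # reset the trace to exactly 1
Q += alpha * delta * e            # broadcast the TD error through all traces
```
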

---

## References

- Sutton, R. S. (1988). *Learning to Predict by the Methods of Temporal Differences*. Machine Learning, 3, 9–44.
- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
- van Seijen, H., & Sutton, R. S. (2014). *True Online TD(λ)*. ICML.
- Tesauro, G. (1995). *Temporal Difference Learning and TD-Gammon*. Communications of the ACM, 38(3).
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). *High-Dimensional Continuous Control Using Generalized Advantage Estimation*. ICLR.