A comprehensive study of tabular, linear, deep, and goal-conditioned RL algorithms
This repository presents a two-part investigation into reinforcement learning, developed as part of the Statistical Planning & Reinforcement Learning module at Queen Mary University of London (MSc Artificial Intelligence).
| Part | Environment | Algorithms | Key Question |
|---|---|---|---|
| Part 1 | Frozen Lake (4×4 & 8×8) | Policy Iteration, Value Iteration, SARSA, Q-Learning, Linear SARSA, Linear Q-Learning, DQN | How do model-based, tabular, linear, and deep RL methods compare on a classic grid world? |
| Part 2 | Highway-Env Parking (parking-v0) | SAC, SAC + HER | Can Hindsight Experience Replay overcome the sparse reward problem in continuous control? |
```
.
├── Frozen Lake - Reinforcement Learning.ipynb   # Part 1: Full RL pipeline on Frozen Lake
├── Sparse Reward.ipynb                          # Part 2: SAC vs SAC+HER on Parking
├── output.txt                                   # Part 1 results (policies, values, convergence)
├── plots/                                       # Training performance visualizations
│   ├── sarsa_plot.png
│   ├── q_learning_plot.png
│   ├── linear_sarsa_plot.png
│   ├── linear_q_learning_plot.png
│   └── dqn.png
├── sac_her_comparison.png                       # Part 2 sample efficiency comparison
├── .gitignore
└── README.md
```
Frozen Lake is a grid world in which an agent navigates from a start tile (`&`) to a goal (`$`) while avoiding holes (`#`) on a slippery surface. A configurable slip parameter (0.1 here) introduces stochasticity: with some probability the agent slides in an unintended direction.
Small Lake (4×4):

```
┌───┬───┬───┬───┐
│ & │ . │ . │ . │
├───┼───┼───┼───┤
│ . │ # │ . │ # │
├───┼───┼───┼───┤
│ . │ . │ . │ # │
├───┼───┼───┼───┤
│ # │ . │ . │ $ │
└───┴───┴───┴───┘
```

Large Lake (8×8):

```
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ & │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ # │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ # │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ # │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ # │ # │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ $ │
└───┴───┴───┴───┴───┴───┴───┴───┘
```
- Policy Iteration: alternates between policy evaluation and policy improvement until convergence.
- Value Iteration: iterates the Bellman optimality backup directly until the value function converges.
- SARSA: on-policy TD control with ε-greedy exploration.
- Q-Learning: off-policy TD control that bootstraps from the greedy max over next-state actions.
- Linear SARSA / Linear Q-Learning: replace the Q-table with a linear function over a one-hot state-action encoding, enabling generalization.
- Deep Q-Network (DQN): uses a convolutional neural network to estimate Q-values from a multi-channel image representation of the lake state (agent position, start, holes, and goal layers).
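The two tabular update rules differ only in their bootstrap target. A minimal sketch, assuming a NumPy Q-table indexed by `[state, action]` (`alpha` is the learning rate η, `gamma` the discount factor):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.99):
    # On-policy: bootstrap from the action actually taken next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) next-state action
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```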
| Algorithm | Converged At | Optimal Policy Found? |
|---|---|---|
| Policy Iteration | Iteration 3 | ✅ |
| Value Iteration | Iteration 10 | ✅ |
| SARSA | Episode 50 | ✅ |
| Q-Learning | Episode 77 | ✅ |
| Linear SARSA | Episode 80 | ✅ |
| Linear Q-Learning | Episode 84 | ✅ |
| DQN | Episode 50 | ✅ |
Insight: Model-based methods converge fastest (3–10 iterations), as they have complete knowledge of the environment. Among model-free approaches, tabular methods (SARSA/Q-Learning) and DQN converge within ~50–80 episodes.
The parking-v0 environment from Highway-Env requires an agent to park a car in a designated spot. We transform the default dense reward into a sparse binary signal:
```python
# Sparse reward wrapper
if is_success:
    reward = +105.0  # large positive on success
else:
    reward = -1.0    # constant penalty otherwise
```

This makes the problem significantly harder: the agent receives almost no learning signal until it accidentally parks correctly.
| Component | Baseline | Challenger |
|---|---|---|
| Algorithm | SAC (Soft Actor-Critic) | SAC + HER (Hindsight Experience Replay) |
| Policy | MultiInputPolicy | MultiInputPolicy |
| Buffer Size | 300,000 | 3,000,000 |
| Batch Size | 256 (default) | 256 |
| Network | Default | [256, 256, 256] |
| HER Goals | ❌ | 8 future goals per transition |
| Timesteps | 200,000 | 200,000 |
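The challenger configuration maps onto Stable-Baselines3 roughly as follows. This is a configuration sketch under the hyperparameters in the table above, not the notebook's exact code; environment creation is assumed:

```python
import gymnasium as gym
import highway_env  # registers parking-v0 on import in recent versions
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make("parking-v0")

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=8,                  # 8 relabeled goals per transition
        goal_selection_strategy="future",  # goals achieved later in the episode
    ),
    buffer_size=3_000_000,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)
```

The baseline drops `replay_buffer_class`/`replay_buffer_kwargs` and uses the 300,000-transition default-architecture setup from the table.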
Why HER? In sparse reward settings, successful experiences are extremely rare. HER retroactively relabels failed trajectories with achieved goals, creating artificial successes that allow the agent to learn from every episode.
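The relabeling idea can be illustrated library-free. A simplified sketch of the "future" goal-selection strategy (`k=8` matches the table above; the transition tuple layout is illustrative):

```python
import random

def her_relabel(episode, k=8, rng=random):
    """episode: list of (obs, action, achieved_goal, desired_goal) tuples.
    Returns extra transitions whose desired goal is a goal actually achieved
    later in the same episode, so failed trajectories yield successes."""
    relabeled = []
    for t, (obs, action, achieved, _) in enumerate(episode):
        future = [step[2] for step in episode[t:]]  # achieved goals from t onward
        for _ in range(min(k, len(future))):
            new_goal = rng.choice(future)
            # Score against the sparse reward scheme above
            reward = 105.0 if new_goal == achieved else -1.0
            relabeled.append((obs, action, new_goal, reward))
    return relabeled
```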
Key Finding: SAC + HER reaches the 90% mastery threshold ~30,000 timesteps earlier than standard SAC, demonstrating dramatically improved sample efficiency in sparse reward environments.
- Python 3.10+
- PyTorch (CPU or GPU)
The Frozen Lake notebook is self-contained: no external dependencies beyond NumPy, PyTorch, and Matplotlib.
```bash
pip install numpy torch matplotlib
jupyter notebook "Frozen Lake - Reinforcement Learning.ipynb"
```

For Part 2 (Parking):

```bash
pip install highway-env "stable-baselines3[extra]" shimmy gymnasium
jupyter notebook "Sparse Reward.ipynb"
```

Note: Part 2 is designed to run on Google Colab (with TPU/GPU acceleration) and may require significant compute time (~200k timesteps × 2 models).
The notebook includes a comprehensive parameter search across learning rates (η) and exploration rates (ε):
| η \ ε | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 0.1 | 1203 | 1514 | 1855 | 1960 |
| 0.2 | 656 | 1898 | 1820 | 1952 |
| 0.5 | 968 | 1578 | 1706 | 1944 |
| 0.8 | 1057 | 1716 | 1786 | 1890 |
Convergence episodes for SARSA on the 4×4 lake (lower is better). Best: η = 0.2, ε = 0.1.
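The sweep reduces to a nested loop over the two grids. A sketch in which `train_sarsa` is a hypothetical stand-in for the notebook's training routine, returning the episode at which SARSA converged:

```python
def run_grid_search(train_sarsa,
                    etas=(0.1, 0.2, 0.5, 0.8),
                    epsilons=(0.1, 0.3, 0.5, 0.8)):
    """Evaluate every (eta, epsilon) pair; fewest episodes wins."""
    results = {(eta, eps): train_sarsa(eta, eps)
               for eta in etas for eps in epsilons}
    best = min(results, key=results.get)
    return results, best
```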
- Module: ECS7002P, Statistical Planning & Reinforcement Learning
- Program: MSc Artificial Intelligence, Queen Mary University of London
- Assessment: Assignment 2
- Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.)
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
- Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor.
- Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.
- Leurent, E. (2018). Highway-Env.
This project is for educational purposes as part of the QMUL MSc AI program.