A comprehensive study of tabular, linear, deep, and goal-conditioned RL algorithms
This repository presents a two-part investigation into reinforcement learning, developed as part of the Statistical Planning & Reinforcement Learning module at Queen Mary University of London (MSc Artificial Intelligence).
| Part | Environment | Algorithms | Key Question |
|---|---|---|---|
| Part 1 | Frozen Lake (4×4 & 8×8) | Policy Iteration, Value Iteration, SARSA, Q-Learning, Linear SARSA, Linear Q-Learning, DQN | How do model-based, tabular, linear, and deep RL methods compare on a classic grid world? |
| Part 2 | Highway-Env Parking (parking-v0) | SAC, SAC + HER | Can Hindsight Experience Replay overcome the sparse reward problem in continuous control? |
```
.
├── Frozen Lake - Reinforcement Learning.ipynb   # Part 1: Full RL pipeline on Frozen Lake
├── Sparse Reward.ipynb                          # Part 2: SAC vs SAC+HER on Parking
├── output.txt                                   # Part 1 results (policies, values, convergence)
├── plots/                                       # Training performance visualizations
│   ├── sarsa_plot.png
│   ├── q_learning_plot.png
│   ├── linear_sarsa_plot.png
│   ├── linear_q_learning_plot.png
│   └── dqn.png
├── sac_her_comparison.png                       # Part 2 sample efficiency comparison
├── .gitignore
└── README.md
```
Frozen Lake is a grid world in which an agent navigates from a start tile (`&`) to a goal (`$`) while avoiding holes (`#`) on a slippery surface. A configurable slip parameter (0.1 here) introduces stochasticity: with some probability the agent slides in an unintended direction.
Small Lake (4×4):

```
┌───┬───┬───┬───┐
│ & │ . │ . │ . │
├───┼───┼───┼───┤
│ . │ # │ . │ # │
├───┼───┼───┼───┤
│ . │ . │ . │ # │
├───┼───┼───┼───┤
│ # │ . │ . │ $ │
└───┴───┴───┴───┘
```

Large Lake (8×8):

```
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ & │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ # │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ # │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ # │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ # │ # │ . │ . │ . │ . │ . │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ . │ . │ . │ . │ $ │
└───┴───┴───┴───┴───┴───┴───┴───┘
```
- Policy Iteration: alternates between policy evaluation and policy improvement until convergence.
- Value Iteration: iterates the Bellman optimality backup directly until the value function converges.
- SARSA: on-policy TD control with ε-greedy exploration.
- Q-Learning: off-policy TD control that bootstraps from the greedy max over next-state actions.
- Linear SARSA / Linear Q-Learning: replace the Q-table with a linear function over a one-hot state-action encoding, enabling generalization.
- Deep Q-Network (DQN): uses a convolutional neural network to estimate Q-values from a multi-channel image representation of the lake state (agent position, start, holes, and goal layers).
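The two tabular update rules differ only in their bootstrap target. A minimal sketch, assuming a NumPy Q-table indexed by `[state, action]` (`alpha` is the learning rate η, `gamma` the discount factor):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.99):
    # On-policy: bootstrap from the action actually taken next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) next-state action
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```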
| Algorithm | Converged At | Optimal Policy Found? |
|---|---|---|
| Policy Iteration | Iteration 3 | ✅ |
| Value Iteration | Iteration 10 | ✅ |
| SARSA | Episode 50 | ✅ |
| Q-Learning | Episode 77 | ✅ |
| Linear SARSA | Episode 80 | ✅ |
| Linear Q-Learning | Episode 84 | ✅ |
| DQN | Episode 50 | ✅ |
Insight: Model-based methods converge fastest (3–10 iterations), as they have complete knowledge of the environment. Among model-free approaches, tabular methods (SARSA/Q-Learning) and DQN converge within ~50–80 episodes.
The parking-v0 environment from Highway-Env requires an agent to park a car in a designated spot. We transform the default dense reward into a sparse binary signal:
```python
# Sparse reward wrapper
if is_success:
    reward = +105.0  # large positive on success
else:
    reward = -1.0    # constant penalty otherwise
```

This makes the problem significantly harder: the agent receives almost no learning signal until it accidentally parks correctly.
| Component | Baseline | Challenger |
|---|---|---|
| Algorithm | SAC (Soft Actor-Critic) | SAC + HER (Hindsight Experience Replay) |
| Policy | MultiInputPolicy | MultiInputPolicy |
| Buffer Size | 300,000 | 3,000,000 |
| Batch Size | 256 (default) | 256 |
| Network | Default | [256, 256, 256] |
| HER Goals | ❌ | 8 future goals per transition |
| Timesteps | 200,000 | 200,000 |
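The challenger configuration maps onto Stable-Baselines3 roughly as follows. This is a configuration sketch under the hyperparameters in the table above, not the notebook's exact code; environment creation is assumed:

```python
import gymnasium as gym
import highway_env  # registers parking-v0 on import in recent versions
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make("parking-v0")

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=8,                  # 8 relabeled goals per transition
        goal_selection_strategy="future",  # goals achieved later in the episode
    ),
    buffer_size=3_000_000,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)
```

The baseline drops `replay_buffer_class`/`replay_buffer_kwargs` and uses the 300,000-transition default-architecture setup from the table.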
Why HER? In sparse reward settings, successful experiences are extremely rare. HER retroactively relabels failed trajectories with achieved goals, creating artificial successes that allow the agent to learn from every episode.
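The relabeling idea can be illustrated library-free. A simplified sketch of the "future" goal-selection strategy (`k=8` matches the table above; the transition tuple layout is illustrative):

```python
import random

def her_relabel(episode, k=8, rng=random):
    """episode: list of (obs, action, achieved_goal, desired_goal) tuples.
    Returns extra transitions whose desired goal is a goal actually achieved
    later in the same episode, so failed trajectories yield successes."""
    relabeled = []
    for t, (obs, action, achieved, _) in enumerate(episode):
        future = [step[2] for step in episode[t:]]  # achieved goals from t onward
        for _ in range(min(k, len(future))):
            new_goal = rng.choice(future)
            # Score against the sparse reward scheme above
            reward = 105.0 if new_goal == achieved else -1.0
            relabeled.append((obs, action, new_goal, reward))
    return relabeled
```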
Key Finding: SAC + HER reaches the 90% mastery threshold ~30,000 timesteps earlier than standard SAC, demonstrating dramatically improved sample efficiency in sparse reward environments.
- Python 3.10+
- PyTorch (CPU or GPU)
The Frozen Lake notebook is self-contained: no external dependencies beyond NumPy, PyTorch, and Matplotlib.
```bash
pip install numpy torch matplotlib
jupyter notebook "Frozen Lake - Reinforcement Learning.ipynb"
```

For Part 2 (Parking):

```bash
pip install highway-env "stable-baselines3[extra]" shimmy gymnasium
jupyter notebook "Sparse Reward.ipynb"
```

Note: Part 2 is designed to run on Google Colab (with TPU/GPU acceleration) and may require significant compute time (~200k timesteps × 2 models).
The notebook includes a comprehensive parameter search across learning rates (η) and exploration rates (ε):
| η \ ε | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 0.1 | 1203 | 1514 | 1855 | 1960 |
| 0.2 | 656 | 1898 | 1820 | 1952 |
| 0.5 | 968 | 1578 | 1706 | 1944 |
| 0.8 | 1057 | 1716 | 1786 | 1890 |
Convergence episodes for SARSA on the 4×4 lake (lower is better). Best: η = 0.2, ε = 0.1.
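The sweep reduces to a nested loop over the two grids. A sketch in which `train_sarsa` is a hypothetical stand-in for the notebook's training routine, returning the episode at which SARSA converged:

```python
def run_grid_search(train_sarsa,
                    etas=(0.1, 0.2, 0.5, 0.8),
                    epsilons=(0.1, 0.3, 0.5, 0.8)):
    """Evaluate every (eta, epsilon) pair; fewest episodes wins."""
    results = {(eta, eps): train_sarsa(eta, eps)
               for eta in etas for eps in epsilons}
    best = min(results, key=results.get)
    return results, best
```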
- Module: ECS7002P, Statistical Planning & Reinforcement Learning
- Program: MSc Artificial Intelligence, Queen Mary University of London
- Assessment: Assignment 2
- Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.)
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540).
- Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor.
- Andrychowicz, M. et al. (2017). Hindsight Experience Replay. NeurIPS.
- Leurent, E. (2018). Highway-Env.
This project is for educational purposes as part of the QMUL MSc AI program.