
# 🧊 Reinforcement Learning: From Frozen Lakes to Autonomous Parking

*A comprehensive study of tabular, linear, deep, and goal-conditioned RL algorithms*



## 📋 Overview

This repository presents a two-part investigation into reinforcement learning, developed as part of the Statistical Planning & Reinforcement Learning module at Queen Mary University of London (MSc Artificial Intelligence).

| Part | Environment | Algorithms | Key Question |
|---|---|---|---|
| Part 1 | Frozen Lake (4×4 & 8×8) | Policy Iteration, Value Iteration, SARSA, Q-Learning, Linear SARSA, Linear Q-Learning, DQN | How do model-based, tabular, linear, and deep RL methods compare on a classic grid world? |
| Part 2 | Highway-Env Parking (`parking-v0`) | SAC, SAC + HER | Can Hindsight Experience Replay overcome the sparse reward problem in continuous control? |

πŸ—οΈ Repository Structure

```
.
├── Frozen Lake - Reinforcement Learning.ipynb   # Part 1: Full RL pipeline on Frozen Lake
├── Sparse Reward.ipynb                          # Part 2: SAC vs SAC+HER on Parking
├── output.txt                                   # Part 1 results (policies, values, convergence)
├── plots/                                       # Training performance visualizations
│   ├── sarsa_plot.png
│   ├── q_learning_plot.png
│   ├── linear_sarsa_plot.png
│   ├── linear_q_learning_plot.png
│   └── dqn.png
├── sac_her_comparison.png                       # Part 2 sample efficiency comparison
├── .gitignore
└── README.md
```

## ❄️ Part 1: Frozen Lake – Classical to Deep RL

### The Environment

Frozen Lake is a grid world in which an agent navigates from a start tile (`&`) to a goal tile (`$`) while avoiding holes (`#`) on a slippery surface. A configurable slip parameter (0.1) introduces stochasticity: the agent may slide in an unintended direction.

```
Small Lake (4×4)          Large Lake (8×8)
┌───┬───┬───┬───┐         ┌───┬───┬───┬───┬───┬───┬───┬───┐
│ & │ . │ . │ . │         │ & │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┤         ├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ # │ . │ # │         │ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┤         ├───┼───┼───┼───┼───┼───┼───┼───┤
│ . │ . │ . │ # │         │ . │ . │ . │ . │ . │ . │ . │ . │
├───┼───┼───┼───┤         ├───┼───┼───┼───┼───┼───┼───┼───┤
│ # │ . │ . │ $ │         │ . │ . │ . │ # │ . │ . │ . │ . │
└───┴───┴───┴───┘         ├───┼───┼───┼───┼───┼───┼───┼───┤
                          │ . │ . │ . │ . │ . │ # │ . │ . │
                          ├───┼───┼───┼───┼───┼───┼───┼───┤
                          │ . │ . │ # │ . │ . │ . │ . │ . │
                          ├───┼───┼───┼───┼───┼───┼───┼───┤
                          │ . │ # │ # │ . │ . │ . │ . │ . │
                          ├───┼───┼───┼───┼───┼───┼───┼───┤
                          │ . │ . │ . │ . │ . │ . │ . │ $ │
                          └───┴───┴───┴───┴───┴───┴───┴───┘
```

### Algorithms Implemented

#### 1. Model-Based Methods

- **Policy Iteration** – alternates between policy evaluation and policy improvement until convergence.
- **Value Iteration** – directly iterates the Bellman optimality equation.
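For concreteness, value iteration can be sketched on a toy MDP. The transition model `P` below is hypothetical, not the notebook's Frozen Lake dynamics:

```python
import numpy as np

# Toy 2-state MDP: P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
gamma, theta = 0.9, 1e-8

def backup(V, s, a):
    # One-step Bellman backup: expected reward plus discounted next value
    return sum(p * (r + gamma * (0.0 if done else V[ns]))
               for p, ns, r, done in P[s][a])

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        best = max(backup(V, s, a) for a in P[s])
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy extraction from the converged values
policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
```

Policy iteration uses the same backup but alternates a full evaluation sweep with a separate improvement step.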

#### 2. Model-Free Tabular Methods

- **SARSA** – on-policy TD control with ε-greedy exploration.
- **Q-Learning** – off-policy TD control taking the greedy max over next-state actions.
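The two update rules differ only in the bootstrap target; a minimal sketch (all numbers below are illustrative, not taken from the notebook):

```python
import numpy as np

alpha, gamma = 0.5, 0.9
Q = np.zeros((2, 2))               # Q[state, action]
s, a, r, s_next = 0, 1, 1.0, 1     # one hypothetical transition

# Q-Learning: the TD target bootstraps from the greedy max over next actions
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# SARSA would instead bootstrap from the action the ε-greedy policy
# actually takes next, i.e. Q[s_next, a_next]
```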

#### 3. Linear Function Approximation

- **Linear SARSA / Linear Q-Learning** – replace the Q-table with a linear feature representation (one-hot state-action encoding), enabling generalization.
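With one-hot features the weight vector is exactly a flattened Q-table, so the linear update reproduces the tabular one; a sketch with illustrative sizes and numbers:

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9
w = np.zeros(n_states * n_actions)

def features(s, a):
    # One-hot state-action encoding: a single 1 at index s * n_actions + a
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def q(s, a):
    return float(w @ features(s, a))

# One gradient step of linear Q-learning on a single (s, a, r, s') transition
s, a, r, s_next = 0, 1, 1.0, 2
td_error = r + gamma * max(q(s_next, b) for b in range(n_actions)) - q(s, a)
w += alpha * td_error * features(s, a)
```

Richer feature maps (e.g. tile coding) would let nearby states share weight updates, which is where the generalization benefit appears.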

#### 4. Deep Reinforcement Learning

- **Deep Q-Network (DQN)** – uses a convolutional neural network to estimate Q-values from a multi-channel image representation of the lake state (agent position, start, holes, goal layers).
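A sketch of such a multi-channel encoding, using the small-lake layout shown above (the channel order and the helper name `encode_lake` are assumptions, not the notebook's exact implementation):

```python
import numpy as np

def encode_lake(agent_pos, start, holes, goal, size=4):
    state = np.zeros((4, size, size), dtype=np.float32)
    state[0][agent_pos] = 1.0          # channel 0: agent position
    state[1][start] = 1.0              # channel 1: start tile
    for hole in holes:
        state[2][hole] = 1.0           # channel 2: holes
    state[3][goal] = 1.0               # channel 3: goal tile
    return state                       # shape (4, size, size), CNN-ready

# Hole coordinates match the 4x4 lake diagram above
x = encode_lake(agent_pos=(0, 0), start=(0, 0),
                holes=[(1, 1), (1, 3), (2, 3), (3, 0)], goal=(3, 3))
```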

### Key Results (4×4 Lake)

| Algorithm | Converged At | Optimal Policy Found? |
|---|---|---|
| Policy Iteration | Iteration 3 | ✅ |
| Value Iteration | Iteration 10 | ✅ |
| SARSA | Episode 50 | ✅ |
| Q-Learning | Episode 77 | ✅ |
| Linear SARSA | Episode 80 | ✅ |
| Linear Q-Learning | Episode 84 | ✅ |
| DQN | Episode 50 | ✅ |

**Insight:** Model-based methods converge fastest (3–10 iterations), as they have complete knowledge of the environment dynamics. Among model-free approaches, tabular methods (SARSA/Q-Learning) and DQN converge within ~50–80 episodes.

### Training Curves (4×4 Lake)

Training curves for SARSA, Q-Learning, Linear SARSA, Linear Q-Learning, and DQN are in the `plots/` directory.


πŸ…ΏοΈ Part 2: Sparse Reward Parking β€” SAC vs SAC + HER

### The Challenge

The `parking-v0` environment from Highway-Env requires an agent to park a car in a designated spot. We transform the default dense reward into a sparse binary signal:

```python
import gymnasium as gym

# Sparse reward wrapper (sketch of the notebook's logic): replace the dense
# reward with a binary success signal read from the env's info dict.
class SparseRewardWrapper(gym.Wrapper):
    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = 105.0 if info.get("is_success") else -1.0  # +105 on success, -1 otherwise
        return obs, reward, terminated, truncated, info
```

This makes the problem significantly harder: the agent receives almost no learning signal until it accidentally parks correctly.

### Approach

| Component | Baseline | Challenger |
|---|---|---|
| Algorithm | SAC (Soft Actor-Critic) | SAC + HER (Hindsight Experience Replay) |
| Policy | MultiInputPolicy | MultiInputPolicy |
| Buffer Size | 300,000 | 3,000,000 |
| Batch Size | 256 (default) | 256 |
| Network | Default | [256, 256, 256] |
| HER Goals | N/A | 8 future goals per transition |
| Timesteps | 200,000 | 200,000 |
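In Stable-Baselines3 terms, the challenger column corresponds roughly to the following configuration (a sketch; arguments not listed in the table are left at their defaults):

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers parking-v0)
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make("parking-v0")
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=8,                   # 8 relabeled goals per transition
        goal_selection_strategy="future",
    ),
    buffer_size=3_000_000,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)
model.learn(total_timesteps=200_000)
```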

**Why HER?** In sparse reward settings, successful experiences are extremely rare. HER retroactively relabels failed trajectories with achieved goals, creating artificial successes that allow the agent to learn from every episode.
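The relabeling idea in miniature (goal names and the per-step rewards are illustrative; real HER operates on replay-buffer transitions with continuous goals):

```python
# Sparse reward as defined above: success bonus or constant penalty
def sparse_reward(achieved, desired):
    return 105.0 if achieved == desired else -1.0

# A failed trajectory: the agent wants spot_A but ends up elsewhere,
# so every real transition earns -1 and carries almost no learning signal.
trajectory = [("spot_B", "spot_A"), ("spot_C", "spot_A"), ("spot_D", "spot_A")]
real_rewards = [sparse_reward(ag, dg) for ag, dg in trajectory]

# HER relabels: pretend the goal actually achieved at the end was the
# target all along, turning the final transition into a success.
relabeled_goal = trajectory[-1][0]
her_rewards = [sparse_reward(ag, relabeled_goal) for ag, _ in trajectory]
```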

### Results

See `sac_her_comparison.png` for the SAC vs SAC + HER sample-efficiency comparison.

**Key Finding:** SAC + HER reaches the 90% mastery threshold ~30,000 timesteps earlier than standard SAC, demonstrating dramatically improved sample efficiency in sparse reward environments.


## 🚀 Getting Started

### Prerequisites

- Python 3.10+
- PyTorch (CPU or GPU)

### Part 1: Frozen Lake

The Frozen Lake notebook is self-contained: no external dependencies beyond NumPy, PyTorch, and Matplotlib.

```shell
pip install numpy torch matplotlib
jupyter notebook "Frozen Lake - Reinforcement Learning.ipynb"
```

### Part 2: Sparse Reward Parking

```shell
pip install highway-env "stable-baselines3[extra]" shimmy gymnasium
jupyter notebook "Sparse Reward.ipynb"
```

**Note:** Part 2 is designed to run on Google Colab (with TPU/GPU acceleration) and may require significant compute time (~200k timesteps × 2 models).


## 📊 Hyperparameter Sensitivity

The notebook includes a comprehensive parameter search across learning rates (η) and exploration rates (ε):

| η \ ε | 0.1 | 0.3 | 0.5 | 0.8 |
|---|---|---|---|---|
| 0.1 | 1203 | 1514 | 1855 | 1960 |
| 0.2 | 656 | 1898 | 1820 | 1952 |
| 0.5 | 968 | 1578 | 1706 | 1944 |
| 0.8 | 1057 | 1716 | 1786 | 1890 |

*Convergence episodes for SARSA on the 4×4 lake (lower is better). Best: η = 0.2, ε = 0.1.*
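Reading off the best cell programmatically (values copied from the SARSA 4×4 table above; lower is better):

```python
# Convergence episodes keyed by (learning rate η, exploration rate ε)
episodes = {
    (0.1, 0.1): 1203, (0.1, 0.3): 1514, (0.1, 0.5): 1855, (0.1, 0.8): 1960,
    (0.2, 0.1): 656,  (0.2, 0.3): 1898, (0.2, 0.5): 1820, (0.2, 0.8): 1952,
    (0.5, 0.1): 968,  (0.5, 0.3): 1578, (0.5, 0.5): 1706, (0.5, 0.8): 1944,
    (0.8, 0.1): 1057, (0.8, 0.3): 1716, (0.8, 0.5): 1786, (0.8, 0.8): 1890,
}
best = min(episodes, key=episodes.get)   # smallest episode count wins
```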


## 🎓 Academic Context

- **Module:** ECS7002P – Statistical Planning & Reinforcement Learning
- **Program:** MSc Artificial Intelligence, Queen Mary University of London
- **Assessment:** Assignment 2

## 📚 References

- Sutton, R. S. & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540).
- Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. *ICML*.
- Andrychowicz, M. et al. (2017). Hindsight Experience Replay. *NeurIPS*.
- Leurent, E. (2018). highway-env: An environment for autonomous driving decision-making. GitHub repository.

πŸ“ License

This project is for educational purposes as part of the QMUL MSc AI program.
