raphael-ph/ppo-review

PPO-Review

This repository contains a compact PyTorch implementation of Proximal Policy Optimization (PPO) for experimenting with on-policy reinforcement learning. The package provides a PPO training loop, a policy/value model interface, and simple utilities for live plotting and logging.

Contents

  • src/algorithms/ppo.py — PPO training loop with GAE, clipping, and entropy bonus.
  • src/models/ — Policy and value network implementations.
  • src/train.py — training script entry point (currently minimal; extend it to run experiments).
  • main.py — lightweight top-level entrypoint (currently a placeholder).
  • utils/ and src/utils.py — plotting, logging, and helper utilities used by training.

Quick summary of the PPO implementation

  • Uses Generalized Advantage Estimation (GAE) to compute advantages.
  • Implements clipped surrogate objective with entropy regularization.
  • Uses separate policy and value networks and separate optimizers.
  • Performs mini-batch updates over multiple epochs per episode.
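
The GAE step above can be illustrated with a short reference implementation (a minimal sketch, not the repository's exact code; tensor shapes and argument names are assumptions):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards, dones: length-T tensors; values: length T+1 (last entry is the
    bootstrap value of the final state).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # zero out the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

With gamma=1 and lam=1 and a zero value function this reduces to plain reward-to-go, which is a convenient sanity check.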

The core training logic lives in src/algorithms/ppo.py (function ppo(...)).

Requirements

  • Python 3.8+
  • PyTorch
  • OpenAI Gym (or a Gym-compatible library such as Gymnasium)

You can install common dependencies with pip (adjust versions as needed):

python -m pip install torch gym

If you manage dependencies through pyproject.toml, install into your existing environment instead.

How to run (developer notes)

This repository currently exposes the PPO trainer as a callable function. A minimal flow to run training from a script looks like:

  1. Create or import a Gym environment (make sure it returns (obs, info) from reset() if using the current trainer API).
  2. Instantiate the policy and value networks from src/models.
  3. Create optimizers for policy and value nets.
  4. Call ppo(...) with the desired hyperparameters.
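
The four steps above wire together roughly as follows (the networks here are stand-ins and the ppo(...) arguments are assumptions; check src/models/ and src/algorithms/ppo.py for the real interfaces):

```python
import torch
import torch.nn as nn

# Stand-in networks; in this repo you would import the real ones from src/models/.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # CartPole: 4 obs dims, 2 actions
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers for the two networks, as the implementation expects.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# env = gym.make("CartPole-v1")  # reset() must return (obs, info)
# from src.algorithms.ppo import ppo
# ppo(env, policy_net, value_net, policy_opt, value_opt)  # exact signature: see ppo.py
```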

Repository structure

Top-level important files and directories:

  • README.md — this file
  • Dockerfile — container build for reproducible runs
  • requirements.txt / pyproject.toml — Python dependencies
  • main.py — lightweight entrypoint (project-specific)
  • evaluate.py — evaluation runner (top-level)
  • results/ — models, training logs and plots (checkpoints under results/models/)
  • src/ — implementation
    • src/train.py — training script (experiment wiring)
    • src/algorithms/ppo.py — PPO algorithm implementation
    • src/models/policy.py — policy network
    • src/models/value.py — value network
    • src/scripts/ — helper shell scripts (run_train.sh, run_evaluate.sh)
    • src/utils/ — plotting and logging utilities

Quick start (developer)

These instructions assume macOS (zsh) and a Python 3.8+ environment.

  1. Create a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate
  2. Install dependencies:
pip install -r requirements.txt

If you prefer, use the pyproject.toml with a tool such as Poetry to manage the environment.

Running training

There are two ways to run training depending on which script you prefer.

  • Use the provided shell script (convenience wrapper):
./src/scripts/run_train.sh
  • Run the Python training script directly and pass any hyperparameters via CLI (if implemented):
python src/train.py --env CartPole-v1 --episodes 1000 --seed 42

Note: src/train.py wires the PPO trainer in src/algorithms/ppo.py with model constructors in src/models/. If the CLI flags are not present, check the top of src/train.py for usage examples and how to set hyperparameters programmatically.

Recommended starting hyperparameters (tune for your environment):

  • policy lr: 3e-4
  • value lr: 1e-3
  • gamma: 0.99
  • gae_lambda: 0.95
  • clip (epsilon): 0.2
  • epochs: 4
  • mini-batch size: 64
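
If you add CLI flags to src/train.py, a hypothetical argparse wiring with these defaults could look like this (the flag names are assumptions, not the script's confirmed interface):

```python
import argparse

parser = argparse.ArgumentParser(description="PPO training (hypothetical CLI)")
parser.add_argument("--env", default="CartPole-v1")
parser.add_argument("--episodes", type=int, default=1000)
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--policy-lr", type=float, default=3e-4)
parser.add_argument("--value-lr", type=float, default=1e-3)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--gae-lambda", type=float, default=0.95)
parser.add_argument("--clip-eps", type=float, default=0.2)
parser.add_argument("--epochs", type=int, default=4)
parser.add_argument("--minibatch-size", type=int, default=64)

args = parser.parse_args([])  # empty list -> defaults; use parser.parse_args() in a real script
```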

Running evaluation

Evaluate a trained checkpoint (the example below uses results/models/ppo_policy.pt, included in the repository):

python evaluate.py --checkpoint results/models/ppo_policy.pt --env CartPole-v1 --episodes 10

Or use the helper script:

./src/scripts/run_evaluate.sh
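
Under the hood, an evaluation run amounts to loading the checkpoint and rolling out the greedy policy. A minimal sketch, assuming the Gym >= 0.26 API where reset() returns (obs, info) and step() returns a 5-tuple (function and variable names here are illustrative, not evaluate.py's actual code):

```python
import torch

def evaluate(env, policy_net, episodes=10):
    """Run greedy rollouts and return the mean episode return."""
    returns = []
    for _ in range(episodes):
        obs, _info = env.reset()
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            action = int(torch.argmax(logits))  # greedy action
            obs, reward, terminated, truncated, _info = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)

# Typical use, mirroring the evaluate.py invocation above:
# policy_net.load_state_dict(torch.load("results/models/ppo_policy.pt"))
# print(evaluate(gym.make("CartPole-v1"), policy_net, episodes=10))
```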

Docker

To build and run the project in Docker (useful for reproducible experiments):

# build image
docker build -t ppo-review:latest .

# run training inside container (example)
docker run --gpus all -v "$(pwd)/results:/workspace/results" -it ppo-review:latest python src/train.py --env CartPole-v1

Adjust --gpus and volume mounts to your setup.

Results and checkpoints

Training artifacts are (by default) placed under results/:

  • results/models/ — saved model checkpoints (e.g. ppo_policy.pt)
  • results/cartpole-training/, results/cartpole-agent/ — example outputs/plots from experiments
  • results/plots/ — generated learning curves

If you change checkpoint file names or locations, update the src/train.py config or the scripts under src/scripts/.

Reproducibility

To reproduce results reliably:

  • Fix random seeds for Python, NumPy, Gym, and PyTorch in your training script (search for seed in src/train.py).
  • Save exact dependency versions (use pip freeze > requirements-freeze.txt after installing).
  • Record hardware (CPU/GPU) and CUDA/cuDNN versions.

Example reproducibility snippet (Python):

import random
import numpy as np
import torch

seed = 42
random.seed(seed)       # Python's built-in RNG
np.random.seed(seed)    # NumPy RNG
torch.manual_seed(seed) # PyTorch CPU RNG
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)  # all CUDA devices
# Also seed the environment, e.g. env.reset(seed=seed) in newer Gym versions.

Tests and quality

This repo currently focuses on experiment code. Recommended small tests to add:

  • Unit tests for GAE advantage calculation
  • Tests for surrogate loss and clipping behavior
  • Integration test to run a short training loop for a few steps
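
For instance, the clipping behavior admits a fully deterministic check (the loss below is a self-contained reference implementation for the test, not an import from src/algorithms/ppo.py):

```python
import torch

def clipped_surrogate(new_logp, old_logp, adv, eps=0.2):
    """PPO clipped surrogate loss (negated objective, for minimization)."""
    ratio = torch.exp(new_logp - old_logp)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()

def test_clipping_caps_positive_advantage():
    # ratio = e^1 ~ 2.72 exceeds 1 + eps, so the objective is capped at 1.2 * adv
    loss = clipped_surrogate(torch.ones(1), torch.zeros(1), torch.ones(1), eps=0.2)
    assert torch.isclose(loss, torch.tensor(-1.2))

test_clipping_caps_positive_advantage()
```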

A minimal pytest suite with a few deterministic checks would be a good first addition.

Contributing

Contributions are welcome. A good flow:

  1. Fork the repository.
  2. Create a feature branch.
  3. Add tests for new behavior.
  4. Open a pull request describing the change and training/expected behavior.

References

  • Schulman, J. et al., Proximal Policy Optimization Algorithms (2017)
