This repository contains a compact implementation of Proximal Policy Optimization (PPO) in PyTorch used for experimenting with on-policy reinforcement learning. The package provides a PPO training loop, a policy/value model interface, and simple utilities for live plotting and logging.
- `src/algorithms/ppo.py` - PPO training loop with GAE, clipping, and entropy bonus.
- `src/models/` - policy and value network implementations.
- `src/train.py` - (placeholder) training script entry point (extend to run experiments).
- `main.py` - simple entrypoint placeholder.
- `utils/` and `src/utils.py` - plotting, logging, and helper utilities used by training.
- Uses Generalized Advantage Estimation (GAE) to compute advantages.
- Implements clipped surrogate objective with entropy regularization.
- Uses separate policy and value networks and separate optimizers.
- Performs mini-batch updates over multiple epochs per episode.
The core training logic lives in `src/algorithms/ppo.py` (function `ppo(...)`).
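The two pieces listed above, GAE and the clipped surrogate objective, can be sketched as follows. This is an illustrative reference, not the actual code in `src/algorithms/ppo.py`; the function names and tensor layout here are assumptions:

```python
import torch


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: tensors of shape (T,); values: shape (T + 1,),
    where the extra entry is the bootstrap value for the final state.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        # One-step TD error, zeroing the bootstrap at episode boundaries.
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           entropy, clip_eps=0.2, ent_coef=0.01):
    """PPO clipped objective with an entropy bonus (to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Negate: optimizers minimize, but the surrogate is maximized.
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())
```

With `lam=1` and `gamma=1` the advantages reduce to plain returns minus values, which is a quick sanity check when comparing against the repo's implementation.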
- Python 3.8+
- PyTorch
- OpenAI Gym (or a Gym-compatible environment)
You can install common dependencies with pip (adjust versions as needed):
```bash
python -m pip install torch gym
```

If you manage dependencies with `pyproject.toml`, prefer using your existing environment.
This repository currently exposes the PPO trainer as a callable function. A minimal flow to run training from a script looks like:
- Create or import a Gym environment (make sure it returns `(obs, info)` from `reset()` if using the current trainer API).
- Instantiate the policy and value networks from `src/models`.
- Create optimizers for the policy and value nets.
- Call `ppo(...)` with the desired hyperparameters.
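The steps above might look like the following. The network definitions here are small stand-ins so the snippet is self-contained; the real classes live in `src/models/`, and the exact `ppo(...)` signature should be taken from `src/algorithms/ppo.py`:

```python
import torch
import torch.nn as nn


# Stand-in networks; the real implementations live in src/models/.
class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # Categorical action distribution over discrete actions.
        return torch.distributions.Categorical(logits=self.net(obs))


class Value(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


# CartPole-v1 dimensions: 4 observations, 2 actions.
policy, value = Policy(4, 2), Value(4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value.parameters(), lr=1e-3)

# With a Gym env whose reset() returns (obs, info), training is then
# roughly (argument names are hypothetical):
#   env = gym.make("CartPole-v1")
#   ppo(env, policy, value, policy_opt, value_opt, ...)
```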
A compact, research-oriented implementation of Proximal Policy Optimization (PPO) using PyTorch. This repository is intended for experimentation and teaching: it contains a PPO trainer, model implementations (policy & value), utility scripts for training and evaluation, and simple plotting/logging helpers.
- PPO training loop with GAE (Generalized Advantage Estimation), clipped surrogate objective, and entropy regularization.
- Modular policy and value networks in `src/models/`.
- Training/evaluation entrypoints and convenience scripts.
- Example results/checkpoints saved to `results/`.
Top-level important files and directories:
- `README.md` - this file
- `Dockerfile` - container build for reproducible runs
- `requirements.txt` / `pyproject.toml` - Python dependencies
- `main.py` - lightweight entrypoint (project-specific)
- `evaluate.py` - evaluation runner (top-level)
- `results/` - models, training logs, and plots (checkpoints under `results/models/`)
- `src/` - implementation
- `src/train.py` - training script (experiment wiring)
- `src/algorithms/ppo.py` - PPO algorithm implementation
- `src/models/policy.py` - policy network
- `src/models/value.py` - value network
- `src/scripts/` - helper shell scripts (`run_train.sh`, `run_evaluate.sh`)
- `src/utils/` - plotting and logging utilities
These instructions assume macOS (zsh) and a Python 3.8+ environment.
- Create a virtual environment and activate it:
```bash
python -m venv .venv
source .venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

If you prefer, use the `pyproject.toml` and a tool like pipx / poetry to manage the environment.
There are two ways to run training depending on which script you prefer.
- Use the provided shell script (convenience wrapper):
```bash
./src/scripts/run_train.sh
```

- Run the Python training script directly and pass any hyperparameters via the CLI (if implemented):

```bash
python src/train.py --env CartPole-v1 --episodes 1000 --seed 42
```

Note: `src/train.py` wires the PPO trainer in `src/algorithms/ppo.py` with the model constructors in `src/models/`. If the CLI flags are not present, check the top of `src/train.py` for usage examples and how to set hyperparameters programmatically.
Recommended starting hyperparameters (tune for your environment):
- policy lr: 3e-4
- value lr: 1e-3
- gamma: 0.99
- gae_lambda: 0.95
- clip (epsilon): 0.2
- epochs: 4
- mini-batch size: 64
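For experiments it helps to collect these defaults in one place. A minimal sketch (the dictionary name and keys are illustrative, not the repo's actual config):

```python
ppo_defaults = dict(
    policy_lr=3e-4,     # Adam step size for the policy network
    value_lr=1e-3,      # Adam step size for the value network
    gamma=0.99,         # discount factor
    gae_lambda=0.95,    # GAE bias/variance trade-off
    clip_eps=0.2,       # PPO clipping range (epsilon)
    epochs=4,           # optimization epochs per batch of rollouts
    minibatch_size=64,  # mini-batch size within each epoch
)
```

Keeping the dictionary flat makes it easy to log alongside results or pass as keyword arguments, e.g. `ppo(env, ..., **ppo_defaults)` if the trainer accepts them by name.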
Evaluate a trained checkpoint (the example uses the `results/models/ppo_policy.pt` shipped in `results/`):

```bash
python evaluate.py --checkpoint results/models/ppo_policy.pt --env CartPole-v1 --episodes 10
```

Or use the helper script:

```bash
./src/scripts/run_evaluate.sh
```

To build and run the project in Docker (useful for reproducible experiments):
```bash
# build image
docker build -t ppo-review:latest .

# run training inside container (example)
docker run --gpus all -v "$(pwd)/results:/workspace/results" -it ppo-review:latest /bin/zsh -c "python src/train.py --env CartPole-v1"
```

Adjust `--gpus` and the volume mounts to your setup.
Training artifacts are (by default) placed under results/:
- `results/models/` - saved model checkpoints (e.g. `ppo_policy.pt`)
- `results/cartpole-training/`, `results/cartpole-agent/` - example outputs/plots from experiments
- `results/plots/` - generated learning curves
If you change checkpoint file names or locations, update the `src/train.py` config or the scripts under `src/scripts/`.
To reproduce results reliably:
- Fix random seeds for Python, NumPy, Gym, and PyTorch in your training script (search for `seed` in `src/train.py`).
- Save exact dependency versions (use `pip freeze > requirements-freeze.txt` after installing).
- Record hardware (CPU/GPU) and CUDA/cuDNN versions.
Example reproducibility snippet (Python):
```python
import random

import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
```

This repo currently focuses on experiment code. Recommended small tests to add:
- Unit tests for GAE advantage calculation
- Tests for surrogate loss and clipping behavior
- Integration test to run a short training loop for a few steps
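For the first item, a deterministic pytest-style check could start from a local reference implementation like the one below; the `gae_reference` helper here is a stand-in to be swapped for the repo's actual advantage function once it is importable:

```python
import torch


def gae_reference(rewards, values, gamma, lam):
    """Reference GAE for a single rollout with no terminal resets.

    values carries one extra bootstrap entry (shape T + 1).
    """
    adv = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def test_gae_single_step():
    # One step: the advantage reduces to the TD error r + gamma*V' - V.
    adv = gae_reference(torch.tensor([1.0]), torch.tensor([0.5, 0.25]),
                        gamma=0.99, lam=0.95)
    assert torch.isclose(adv[0], torch.tensor(1.0 + 0.99 * 0.25 - 0.5))


def test_gae_lambda_zero_is_td_error():
    # With lam=0 every advantage is exactly the one-step TD error.
    rewards = torch.tensor([1.0, 2.0])
    values = torch.tensor([0.0, 1.0, 0.0])
    adv = gae_reference(rewards, values, gamma=0.5, lam=0.0)
    expected = rewards + 0.5 * values[1:] - values[:-1]
    assert torch.allclose(adv, expected)
```

Hand-computed cases like these stay stable across refactors, which makes them a good anchor for the clipping and integration tests as well.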
Contributions are welcome. A good flow:
- Fork the repository.
- Create a feature branch.
- Add tests for new behavior.
- Open a pull request describing the change and training/expected behavior.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.