This repository contains a compact implementation of Proximal Policy Optimization (PPO) in PyTorch used for experimenting with on-policy reinforcement learning. The package provides a PPO training loop, a policy/value model interface, and simple utilities for live plotting and logging.
- `src/algorithms/ppo.py` - PPO training loop with GAE, clipping, and entropy bonus.
- `src/models/` - policy and value network implementations.
- `src/train.py` - (placeholder) training script entry point (extend to run experiments).
- `main.py` - simple entrypoint placeholder.
- `utils/` and `src/utils.py` - plotting, logging, and helper utilities used by training.
- Uses Generalized Advantage Estimation (GAE) to compute advantages.
- Implements clipped surrogate objective with entropy regularization.
- Uses separate policy and value networks and separate optimizers.
- Performs mini-batch updates over multiple epochs per episode.
The core training logic lives in `src/algorithms/ppo.py` (function `ppo(...)`).
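The two pieces listed above, GAE and the clipped surrogate objective, can be sketched as follows. This is an illustrative reference, not the actual code in `src/algorithms/ppo.py`; the function names and tensor layout here are assumptions:

```python
import torch


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: tensors of shape (T,); values: shape (T + 1,),
    where the extra entry is the bootstrap value for the final state.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        # One-step TD error, zeroing the bootstrap at episode boundaries.
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           entropy, clip_eps=0.2, ent_coef=0.01):
    """PPO clipped objective with an entropy bonus (to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Negate: optimizers minimize, but the surrogate is maximized.
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())
```

With `lam=1` and `gamma=1` the advantages reduce to plain returns minus values, which is a quick sanity check when comparing against the repo's implementation.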
- Python 3.8+
- PyTorch
- OpenAI Gym (or a Gym-compatible environment)
You can install common dependencies with pip (adjust versions as needed):
```bash
python -m pip install torch gym
```

If you manage dependencies with `pyproject.toml`, prefer using your existing environment.
This repository currently exposes the PPO trainer as a callable function. A minimal flow to run training from a script looks like:
- Create or import a Gym environment (make sure it returns `(obs, info)` from `reset()` if using the current trainer API).
- Instantiate the policy and value networks from `src/models`.
- Create optimizers for the policy and value nets.
- Call `ppo(...)` with the desired hyperparameters.
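The steps above might look like the following. The network definitions here are small stand-ins so the snippet is self-contained; the real classes live in `src/models/`, and the exact `ppo(...)` signature should be taken from `src/algorithms/ppo.py`:

```python
import torch
import torch.nn as nn


# Stand-in networks; the real implementations live in src/models/.
class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # Categorical action distribution over discrete actions.
        return torch.distributions.Categorical(logits=self.net(obs))


class Value(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


# CartPole-v1 dimensions: 4 observations, 2 actions.
policy, value = Policy(4, 2), Value(4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value.parameters(), lr=1e-3)

# With a Gym env whose reset() returns (obs, info), training is then
# roughly (argument names are hypothetical):
#   env = gym.make("CartPole-v1")
#   ppo(env, policy, value, policy_opt, value_opt, ...)
```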
A compact, research-oriented implementation of Proximal Policy Optimization (PPO) using PyTorch. This repository is intended for experimentation and teaching: it contains a PPO trainer, model implementations (policy & value), utility scripts for training and evaluation, and simple plotting/logging helpers.
- PPO training loop with GAE (Generalized Advantage Estimation), clipped surrogate objective, and entropy regularization.
- Modular policy and value networks in `src/models/`.
- Training/evaluation entrypoints and convenience scripts.
- Example results/checkpoints saved to `results/`.
Top-level important files and directories:
- `README.md` - this file
- `Dockerfile` - container build for reproducible runs
- `requirements.txt` / `pyproject.toml` - Python dependencies
- `main.py` - lightweight entrypoint (project-specific)
- `evaluate.py` - evaluation runner (top-level)
- `results/` - models, training logs, and plots (checkpoints under `results/models/`)
- `src/` - implementation
- `src/train.py` - training script (experiment wiring)
- `src/algorithms/ppo.py` - PPO algorithm implementation
- `src/models/policy.py` - policy network
- `src/models/value.py` - value network
- `src/scripts/` - helper shell scripts (`run_train.sh`, `run_evaluate.sh`)
- `src/utils/` - plotting and logging utilities
These instructions assume macOS (zsh) and a Python 3.8+ environment.
- Create a virtual environment and activate it:
```bash
python -m venv .venv
source .venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

If you prefer, use the `pyproject.toml` and a tool like pipx / poetry to manage the environment.
There are two ways to run training depending on which script you prefer.
- Use the provided shell script (convenience wrapper):
```bash
./src/scripts/run_train.sh
```

- Run the Python training script directly and pass any hyperparameters via the CLI (if implemented):

```bash
python src/train.py --env CartPole-v1 --episodes 1000 --seed 42
```

Note: `src/train.py` wires the PPO trainer in `src/algorithms/ppo.py` with the model constructors in `src/models/`. If the CLI flags are not present, check the top of `src/train.py` for usage examples and how to set hyperparameters programmatically.
Recommended starting hyperparameters (tune for your environment):
- policy lr: 3e-4
- value lr: 1e-3
- gamma: 0.99
- gae_lambda: 0.95
- clip (epsilon): 0.2
- epochs: 4
- mini-batch size: 64
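For experiments it helps to collect these defaults in one place. A minimal sketch (the dictionary name and keys are illustrative, not the repo's actual config):

```python
ppo_defaults = dict(
    policy_lr=3e-4,     # Adam step size for the policy network
    value_lr=1e-3,      # Adam step size for the value network
    gamma=0.99,         # discount factor
    gae_lambda=0.95,    # GAE bias/variance trade-off
    clip_eps=0.2,       # PPO clipping range (epsilon)
    epochs=4,           # optimization epochs per batch of rollouts
    minibatch_size=64,  # mini-batch size within each epoch
)
```

Keeping the dictionary flat makes it easy to log alongside results or pass as keyword arguments, e.g. `ppo(env, ..., **ppo_defaults)` if the trainer accepts them by name.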
Evaluate a trained checkpoint (the example uses the `results/models/ppo_policy.pt` shipped in `results/`):

```bash
python evaluate.py --checkpoint results/models/ppo_policy.pt --env CartPole-v1 --episodes 10
```

Or use the helper script:

```bash
./src/scripts/run_evaluate.sh
```

To build and run the project in Docker (useful for reproducible experiments):
```bash
# build image
docker build -t ppo-review:latest .

# run training inside container (example)
docker run --gpus all -v "$(pwd)/results:/workspace/results" -it ppo-review:latest /bin/zsh -c "python src/train.py --env CartPole-v1"
```

Adjust `--gpus` and the volume mounts to your setup.
Training artifacts are (by default) placed under results/:
- `results/models/` - saved model checkpoints (e.g. `ppo_policy.pt`)
- `results/cartpole-training/`, `results/cartpole-agent/` - example outputs/plots from experiments
- `results/plots/` - generated learning curves
If you change checkpoint file names or locations, update the `src/train.py` config or the scripts under `src/scripts/`.
To reproduce results reliably:
- Fix random seeds for Python, NumPy, Gym, and PyTorch in your training script (search for `seed` in `src/train.py`).
- Save exact dependency versions (use `pip freeze > requirements-freeze.txt` after installing).
- Record hardware (CPU/GPU) and CUDA/cuDNN versions.
Example reproducibility snippet (Python):
```python
import random

import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
```

This repo currently focuses on experiment code. Recommended small tests to add:
- Unit tests for GAE advantage calculation
- Tests for surrogate loss and clipping behavior
- Integration test to run a short training loop for a few steps
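For the first item, a deterministic pytest-style check could start from a local reference implementation like the one below; the `gae_reference` helper here is a stand-in to be swapped for the repo's actual advantage function once it is importable:

```python
import torch


def gae_reference(rewards, values, gamma, lam):
    """Reference GAE for a single rollout with no terminal resets.

    values carries one extra bootstrap entry (shape T + 1).
    """
    adv = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def test_gae_single_step():
    # One step: the advantage reduces to the TD error r + gamma*V' - V.
    adv = gae_reference(torch.tensor([1.0]), torch.tensor([0.5, 0.25]),
                        gamma=0.99, lam=0.95)
    assert torch.isclose(adv[0], torch.tensor(1.0 + 0.99 * 0.25 - 0.5))


def test_gae_lambda_zero_is_td_error():
    # With lam=0 every advantage is exactly the one-step TD error.
    rewards = torch.tensor([1.0, 2.0])
    values = torch.tensor([0.0, 1.0, 0.0])
    adv = gae_reference(rewards, values, gamma=0.5, lam=0.0)
    expected = rewards + 0.5 * values[1:] - values[:-1]
    assert torch.allclose(adv, expected)
```

Hand-computed cases like these stay stable across refactors, which makes them a good anchor for the clipping and integration tests as well.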
Contributions are welcome. A good flow:
- Fork the repository.
- Create a feature branch.
- Add tests for new behavior.
- Open a pull request describing the change and training/expected behavior.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.