A from-scratch implementation of Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for text summarization.
This project implements the complete RLHF pipeline on the Reddit TL;DR dataset using Llama 3.2 1B models:
- Supervised Fine-Tuning (SFT)
- Reward Model Training
- PPO Optimization
Built as a fun learning project to understand RLHF mechanics in depth.
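To make the three stages concrete, here is a minimal PyTorch sketch of the two losses that distinguish RLHF from plain fine-tuning: the pairwise reward-model loss and the clipped PPO surrogate with a KL penalty toward the SFT model. This is an illustrative sketch, not the repository's actual code; the function names, tensor shapes, and default coefficients (`clip_eps=0.2`, `kl_coef=0.05`) are assumptions for exposition.

```python
# Illustrative sketch only (not this repo's implementation).
# Assumes scalar per-sequence rewards and per-token log-probs/advantages.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: the preferred summary should score
    higher than the rejected one. Shapes: (batch,)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def ppo_clipped_loss(
    logprobs: torch.Tensor,      # log pi_theta(a_t | s_t) under the current policy, (batch, seq)
    old_logprobs: torch.Tensor,  # log-probs from the policy that generated the rollouts
    advantages: torch.Tensor,    # advantage estimates, e.g. from GAE, (batch, seq)
    clip_eps: float = 0.2,       # illustrative default
) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper, negated for minimization."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(
    reward: torch.Tensor,        # scalar reward per sequence from the reward model, (batch,)
    logprobs: torch.Tensor,      # per-token log-probs under the current policy, (batch, seq)
    ref_logprobs: torch.Tensor,  # per-token log-probs under the frozen SFT model, (batch, seq)
    kl_coef: float = 0.05,       # illustrative default
) -> torch.Tensor:
    """Reward minus a KL penalty that keeps the policy close to the SFT model."""
    kl = (logprobs - ref_logprobs).sum(dim=-1)
    return reward - kl_coef * kl
```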
# Activate the virtual environment
source torch_env/bin/activate
# Install dependencies
python -m pip install <stuff>
alignment-lab/
├── models/ # Model architectures and components
├── training/ # Training loops for SFT, RM, and PPO
├── data/ # Dataset processing and loading
├── configs/ # Configuration files
└── utils/ # Helper functions and utilities
For an in-depth explanation of the implementation, key design decisions, and lessons learned, see my writeup (in progress).
- Training language models to follow instructions with human feedback (InstructGPT)
- Proximal Policy Optimization Algorithms
- Learning to summarize from human feedback
- The 37 Implementation Details of Proximal Policy Optimization
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization