A from-scratch implementation of Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for text summarization.
This project implements the complete RLHF pipeline on the Reddit TL;DR dataset using Llama 3.2 1B models:
- Supervised Fine-Tuning (SFT)
- Reward Model Training
- PPO Optimization
Built as a fun learning project to understand RLHF mechanics in depth.
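To make the three stages concrete, here is a minimal PyTorch sketch of the two losses that distinguish RLHF from plain fine-tuning: the pairwise reward-model loss and the clipped PPO surrogate with a KL penalty toward the SFT model. This is an illustrative sketch, not the repository's actual code; the function names, tensor shapes, and default coefficients (`clip_eps=0.2`, `kl_coef=0.05`) are assumptions for exposition.

```python
# Illustrative sketch only (not this repo's implementation).
# Assumes scalar per-sequence rewards and per-token log-probs/advantages.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: the preferred summary should score
    higher than the rejected one. Shapes: (batch,)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def ppo_clipped_loss(
    logprobs: torch.Tensor,      # log pi_theta(a_t | s_t) under the current policy, (batch, seq)
    old_logprobs: torch.Tensor,  # log-probs from the policy that generated the rollouts
    advantages: torch.Tensor,    # advantage estimates, e.g. from GAE, (batch, seq)
    clip_eps: float = 0.2,       # illustrative default
) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper, negated for minimization."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(
    reward: torch.Tensor,        # scalar reward per sequence from the reward model, (batch,)
    logprobs: torch.Tensor,      # per-token log-probs under the current policy, (batch, seq)
    ref_logprobs: torch.Tensor,  # per-token log-probs under the frozen SFT model, (batch, seq)
    kl_coef: float = 0.05,       # illustrative default
) -> torch.Tensor:
    """Reward minus a KL penalty that keeps the policy close to the SFT model."""
    kl = (logprobs - ref_logprobs).sum(dim=-1)
    return reward - kl_coef * kl
```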
# Activate the virtual environment
source torch_env/bin/activate
# Install dependencies
python -m pip install <stuff>
alignment-lab/
├── models/ # Model architectures and components
├── training/ # Training loops for SFT, RM, and PPO
├── data/ # Dataset processing and loading
├── configs/ # Configuration files
└── utils/ # Helper functions and utilities
For an in-depth explanation of the implementation, key design decisions, and lessons learned, see my writeup (in progress).
- Training language models to follow instructions with human feedback (InstructGPT)
- Proximal Policy Optimization Algorithms
- Learning to summarize from human feedback
- The 37 Implementation Details of Proximal Policy Optimization
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization