Alignment Lab: RLHF from Scratch

A from-scratch implementation of Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for text summarization.

Overview

This project implements the complete RLHF pipeline on the Reddit TL;DR dataset using Llama 3.2 1B models:

  1. Supervised Fine-Tuning (SFT)
  2. Reward Model Training
  3. PPO Optimization

Built as a fun learning project to deeply understand the mechanics of RLHF.
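For concreteness, stage 2 trains the reward model on pairs of summaries where a human preferred one over the other. Below is a minimal sketch of that objective, assuming the standard Bradley-Terry pairwise formulation; the function and variable names are illustrative placeholders, not the identifiers used in this repo.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward model training (stage 2).

    chosen_rewards / rejected_rewards are the scalar scores the reward
    model assigns to the human-preferred and dispreferred summaries.
    Minimizing this loss pushes the model to score preferred summaries
    higher than dispreferred ones.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with random scores for a batch of 4 comparison pairs.
chosen = torch.randn(4)
rejected = torch.randn(4)
loss = pairwise_reward_loss(chosen, rejected)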

Setup

# Activate the virtual environment
source torch_env/bin/activate

# Install dependencies
python -m pip install <stuff>

Usage

WIP

Project Structure

WIP
alignment-lab/
├── models/          # Model architectures and components
├── training/        # Training loops for SFT, RM, and PPO
├── data/            # Dataset processing and loading
├── configs/         # Configuration files
└── utils/           # Helper functions and utilities

Technical Details

WIP. For an in-depth explanation of the implementation, key design decisions, and lessons learned, see my writeup.
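Until the writeup is finished, the sketch below shows the clipped surrogate objective at the heart of standard PPO (stage 3). It is an illustration of the textbook formulation, not this repo's implementation, and all names in it are placeholders.

import torch

def ppo_clipped_policy_loss(logprobs: torch.Tensor,
                            old_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from standard PPO.

    logprobs:     token log-probs under the current policy
    old_logprobs: token log-probs under the policy that sampled the rollouts
    advantages:   per-token advantage estimates (e.g., from GAE)

    The probability ratio is clipped to [1 - clip_eps, 1 + clip_eps] so a
    single update cannot move the policy too far from the sampling policy.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()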

References

WIP
