This repo is inspired by https://github.com/open-thought/tiny-grpo and supports several RL variants for training LLMs on math tasks.
```bash
uv sync
cd src
uv run train.py --config config.yaml
```

Config lives in `src/config.yaml`. Key settings:
- `model.name`: base model (default: `Qwen/Qwen3-1.7B`)
- `loss.name`: algorithm: `grpo`, `dapo`, or `reinforce_pp`
- `rollout.group_size`: completions per question
- `training.lr`: learning rate
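For orientation, a sketch of what `src/config.yaml` might look like. The nesting is inferred from the dotted key names above, and the `group_size` and `lr` values are placeholders, not the repo's actual defaults:

```yaml
# Hypothetical config sketch; check src/config.yaml for the real schema.
model:
  name: Qwen/Qwen3-1.7B
loss:
  name: grpo          # grpo | dapo | reinforce_pp
rollout:
  group_size: 8       # completions sampled per question (assumed value)
training:
  lr: 1.0e-6          # learning rate (assumed value)
```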
Checkpoints are saved to `./output`. Metrics are logged to wandb.
```
src/
  train.py          # training loop + vllm rollouts
  loss.py           # grpo, dapo, reinforce++ losses
  rewards.py        # math answer extraction + reward model
  replay_buffer.py  # experience storage + batching
  config.yaml       # all hyperparameters
```
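To show how `rollout.group_size` and the GRPO loss fit together, here is a minimal sketch of group-relative advantage normalization, the core GRPO idea. This is an illustration under assumed names (`grpo_advantages` is hypothetical), not the repo's actual `loss.py`:

```python
import math

def grpo_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward by the mean/std of its
    question's rollout group (group_size completions per question)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: group_size = 4 completions for one math question,
# binary correctness rewards from the answer checker.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better completions within each group without needing a learned value function.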