This repo is inspired by https://github.com/open-thought/tiny-grpo and supports several RL variants for training LLMs on math tasks.
```bash
uv sync
cd src
uv run train.py --config config.yaml
```

Config lives in `src/config.yaml`. Key settings:
- `model.name`: base model (default: `Qwen/Qwen3-1.7B`)
- `loss.name`: algorithm: `grpo`, `dapo`, or `reinforce_pp`
- `rollout.group_size`: completions per question
- `training.lr`: learning rate
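For orientation, a sketch of what `src/config.yaml` might look like. The nesting is inferred from the dotted key names above, and the `group_size` and `lr` values are placeholders, not the repo's actual defaults:

```yaml
# Hypothetical config sketch; check src/config.yaml for the real schema.
model:
  name: Qwen/Qwen3-1.7B
loss:
  name: grpo          # grpo | dapo | reinforce_pp
rollout:
  group_size: 8       # completions sampled per question (assumed value)
training:
  lr: 1.0e-6          # learning rate (assumed value)
```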
Checkpoints are saved to `./output`. Metrics are logged to wandb.
```
src/
  train.py          # training loop + vllm rollouts
  loss.py           # grpo, dapo, reinforce++ losses
  rewards.py        # math answer extraction + reward model
  replay_buffer.py  # experience storage + batching
  config.yaml       # all hyperparameters
```
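To show how `rollout.group_size` and the GRPO loss fit together, here is a minimal sketch of group-relative advantage normalization, the core GRPO idea. This is an illustration under assumed names (`grpo_advantages` is hypothetical), not the repo's actual `loss.py`:

```python
import math

def grpo_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward by the mean/std of its
    question's rollout group (group_size completions per question)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: group_size = 4 completions for one math question,
# binary correctness rewards from the answer checker.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better completions within each group without needing a learned value function.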