10 changes: 5 additions & 5 deletions AGENTS.md
@@ -6,7 +6,7 @@ Instructions for AI coding agents (Codex, Claude Code, etc.) working in this rep

A personal study repository for reinforcement learning and its use in training LLMs. It holds:

- the original course notes from 2017 (`Archive/`)
- the original course notes from 2017 (`notes/cs294-2017/`, `notes/sutton-barto-digest/`)
- a self-study lecture series taking RL from MDPs to RLHF (`notes/lectures/`)
- worked, tested coding exercises (`exercises/`)
- a curated reading list of recent papers (`reference/papers/`)
@@ -18,8 +18,8 @@ It is a learning environment, not a library or a product. A person is working th

| Path | What it is | Editable by an agent? |
|---|---|---|
| `Archive/` | The original 2017 notes, idiosyncratic voice, kept as written | No. Reference only. Never reword. |
| `notes/lectures/` | The lecture series, `NN-topic.md` | Yes, under the rules below |
| `notes/cs294-2017/`, `notes/sutton-barto-digest/` | Original 2017 hand-written notes (CS294 student notes; Sutton & Barto digest); idiosyncratic voice, kept as written | No. Reference only. Never reword. |
| `notes/lectures/` | The 19-lecture series, `NN-topic.md` | Yes, under the rules below |
| `notes/cheat-sheets/`, `notes/diagrams/` | Quick reference | Yes |
| `exercises/NN-topic/` | A task, a starter file, tests, a reference solution, hints | Yes |
| `reference/papers/` | Reading lists. The `PAPERS.md` files are generated by the collector; the per-topic READMEs are hand notes. | READMEs yes; don't hand-edit `PAPERS.md` — re-run the collector. |
@@ -58,7 +58,7 @@ The repo has a voice: plain and direct, a little informal, written by someone le
- **No marketing voice.** Not a product launch. Don't call things "comprehensive," "powerful," "cutting-edge," "robust." Don't open a section with "Why this matters." Don't close with "the future is bright."
- **No AI-slop tells.** No emoji as bullets or in headings. No rule-of-three padding ("fast, simple, and elegant"). No "it's not just X — it's Y." No "let's dive in." Sentence-case headings. If `~/.claude/skills/anti-slop-guide` is available, follow it.
- **Be specific.** "The loss explodes around update 50 if you don't normalize the advantage" beats "this can be unstable."
- **Keep the old notes' quirks.** The 2017 archive says "quadratize (it could be a word)" and references the author being from Iraq. That stays. Don't sand it down.
- **Keep the old notes' quirks.** The 2017 hand-written notes (`notes/cs294-2017/`) say "quadratize (it could be a word)" and reference the author being from Iraq. That stays. Don't sand it down.

## Citations

@@ -95,7 +95,7 @@ An agent acting as tutor: have the student edit `starter.py`, run `pytest exerci

## Don't

- Don't reword `Archive/`.
- Don't reword the 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`).
- Don't mark your own output `reviewed`.
- Don't add a citation you haven't verified.
- Don't touch files outside the repo, shell config, or git history without being asked.
17 changes: 0 additions & 17 deletions Archive/README.md

This file was deleted.

13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,19 @@

Notable changes to the repo. Not a release log — there are no releases — just a record of what moved and why.

## 2026-05-15 — promote the hand-written notes out of Archive/

The trusted, hand-written 2017 notes had been sitting in a folder called `Archive/` — which connotes "old/dead" — while the unreviewed AI-drafted lecture series occupied `notes/`. Backwards. Fixed:

- `Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/` → `notes/cs294-2017/` (with `imgs/` intact and image links unchanged).
- `Archive/2017-Course-Notes/Elements-Of-RL/` → `notes/sutton-barto-digest/`.
- Both moved files got a `<!-- status: hand-written -->` header.
- `Archive/` directory deleted (`Archive/README.md` was a wrapper; no content lost).
- Root `readme.md` "What's here" section restructured to lead with the trusted, hand-written content (the CS294 notes, the Sutton & Barto digest, the curated talks/books/courses, the tested exercises) and clearly demote the AI-drafted lecture series as scaffold-with-skepticism. "Start here" reordered to lead with safer paths (talks/books → exercises → drafts).
- `notes/README.md` rewritten in the same spirit — hand-written content first, lecture series second with a clear caveat about what `unreviewed` means.
- `AGENTS.md` and `CLAUDE.md` updated: the layout table now points at `notes/cs294-2017/` and `notes/sutton-barto-digest/` as the trusted, frozen, never-reword material instead of `Archive/`.
- GitHub topics refreshed: dropped `guideline` and `study` (generic), added `rlhf`, `llm-alignment`, `dpo`, `grpo`, `ppo`, `rlvr`, `agentic-rl`, `lecture-notes`, `study-notes`, `deepseek-r1`, `constitutional-ai`, `policy-gradient`, `q-learning`, `sutton-barto`. Description sharpened.

## 2026-05-12 — restructure: separate the layers, set up rules

Context: the repo had grown two layers — the original 2017 notes, and a much larger newer layer added in 2025 (a 13-lecture series, scraped paper lists, a content tool). The newer layer was unmarked, wrote in a first person it hadn't earned, and shipped broken links, phantom lectures, and made-up citations. This pass separates the two so nobody has to guess what's trustworthy, and sets up conventions so they can coexist.
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -5,7 +5,7 @@ Read [`AGENTS.md`](./AGENTS.md). Everything in it applies to Claude Code working
Quick orientation:

- This is a study repo for RL and RL-for-LLMs. A person is learning the material; help them learn it, don't try to "finish" the repo.
- `Archive/` is frozen — reference it, never reword it.
- The 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`) are trusted and frozen — reference, never reword.
- Docs under `notes/` and `reference/` carry a `<!-- status: ... -->` comment. `hand-written` and `reviewed` are trusted; `unreviewed` means nobody has checked it — don't cite it as fact, and don't promote it to `reviewed` yourself (only a person does that).
- Match the existing voice: plain, direct, no marketing tone, no AI-slop tells. The `anti-slop-guide` skill is available — use it when writing or editing prose here.
- Verify every paper citation before adding it. The repo currently has some invented ones; don't add more.
60 changes: 28 additions & 32 deletions notes/README.md
@@ -1,19 +1,32 @@
<!-- status: unreviewed | last-reviewed: never -->

# Lecture series: deep RL to LLM alignment
# notes — study material

A self-study sequence that goes from MDPs and policy gradients up through RLHF, DPO, and the 2024–2025 alignment methods. The lecture bodies **haven't been reviewed yet** — useful as a structured path, but check the math, the code, and the citations against primary sources. `../CURRICULUM.md` is the same path with prerequisites and time estimates; [`../AGENTS.md`](../AGENTS.md) explains the `status:` labels.
Two layers live in this directory, mixed.

Each lecture tries to do four things: give the intuition before the math, show code that runs, point at where the method breaks in practice, and name the papers that introduced it. When a lecture has a matching exercise, it links to [`../exercises/`](../exercises/).
**Trusted, hand-written:**

## Lectures and review status
- **[`cs294-2017/`](./cs294-2017/)** — personal student notes from CS 294 Deep RL (Berkeley, Spring 2017 — Levine, Schulman, Finn). 246 lines of working notes from the field being built. Idiosyncratic, kept as written. `status: hand-written`.
- **[`sutton-barto-digest/`](./sutton-barto-digest/)** — short distillation of the four elements of an RL system (policy, reward, value function, model) from Sutton & Barto. `status: hand-written`.

These are old (2017) and informal — but they're a real person's understanding, not AI text. Trusted as starting points.

**AI-drafted, useful as scaffold (`unreviewed` — treat with skepticism):**

- **[`lectures/`](./lectures/)** — a 19-lecture series taking RL from MDPs through to RLHF / DPO / GRPO / RLVR / agentic / offline. Editorial pass has been done — broken links fixed, code bugs caught (`import gym` → `gymnasium`, missing imports, old-API `env.step` calls), citations checked or removed when they didn't resolve, fake-first-person framing stripped. **But no person has read each lecture end to end and signed off.** Cross-check the math against the cited papers; treat the code as a starting point that needs verification. Index and per-lecture review status below.
- **[`cheat-sheets/`](./cheat-sheets/)** — `RL-Math-Formulas.md` and `RL-Quick-Reference.md`. Audited (caught a wrong KL direction; fixed). Same caveat.
- **[`diagrams/`](./diagrams/)** — `RL-Algorithm-Diagrams.md`. Audited (caught and fixed a wrong DPO loss diagram and a wrong GRPO advantage diagram). Same caveat.

[`../CURRICULUM.md`](../CURRICULUM.md) is the suggested order through everything. [`../AGENTS.md`](../AGENTS.md) explains the `<!-- status: ... -->` convention every doc carries.

## Lecture series — drafts, in order

| # | Lecture | Status |
|---|---|---|
| 01 | [MDPs and Bellman equations](./lectures/01-mdps-bellman.md) — exercise: [`01-mdps`](../exercises/01-mdps/) | unreviewed (de-slopped; a fabricated value-function output was removed) |
| 02 | [Policy gradients from scratch](./lectures/02-policy-gradients.md) — exercise: [`02-policy-gradients`](../exercises/02-policy-gradients/) | unreviewed (de-slopped; a broken link and a code bug were fixed) |
| 03 | [Value functions & Q-learning](./lectures/03-value-functions-q-learning.md) — exercise: [`03-q-learning`](../exercises/03-q-learning/) | unreviewed (de-slopped; a dead `Modern-RL-Research/` path and a missing import fixed) |
| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) | unreviewed (de-slopped; a code bug fixed) |
| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) — exercise: [`04-actor-critic`](../exercises/04-actor-critic/) | unreviewed (de-slopped; a code bug fixed) |
| 05 | [Trust regions and TRPO](./lectures/05-trpo.md) | unreviewed (de-slopped; fabricated training times removed) |
| 06 | [PPO](./lectures/06-ppo.md) | unreviewed (de-slopped; `import gym` → `gymnasium` fixed) |
| 07 | [Off-policy learning: SAC and TD3](./lectures/07-off-policy-rl.md) | unreviewed (de-slopped; an old-API `env.step` call fixed) |
@@ -22,45 +35,28 @@ Each lecture tries to do four things: give the intuition before the math, show c
| 10 | [PPO for language models](./lectures/10-ppo-for-llms.md) | unreviewed (de-slopped; a broken next-lecture link + unverified compute claims fixed) |
| 11 | [Direct preference optimization](./lectures/11-dpo.md) | unreviewed (de-slopped; a fabricated paper removed) |
| 12 | [Beyond DPO: GRPO, RRHF, IPO](./lectures/12-beyond-dpo.md) | unreviewed (de-slopped; a fabricated benchmark table + a fabricated paper removed) |
| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) (related) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
| 14 | [Constitutional AI, RLAIF, self-improvement](./lectures/14-constitutional-ai-rlaif.md) | unreviewed (new draft) |
| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) | unreviewed (new draft) |
| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) | unreviewed (new draft) |
| 16 | [Agentic RL: tool use, multi-turn](./lectures/16-agentic-rl.md) | unreviewed (new draft) |
| 17 | [Online & iterative preference optimization](./lectures/17-online-iterative-preference.md) | unreviewed (new draft) |
| 18 | [Distillation of reasoning models](./lectures/18-distillation-reasoning.md) | unreviewed (new draft) |
| 19 | [Offline RL](./lectures/19-offline-rl.md) | unreviewed (new draft) |

Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND) — the one remaining foundational gap.
What "unreviewed" means here: nobody has read the lecture end-to-end and signed off on it. The editorial pass (de-slop, fix broken links, catch code bugs, verify citations) has happened — that's the parenthetical note next to each row. The next step is a person reads it and either flips it to `reviewed` (with today's date in `last-reviewed:`) or notes what's still wrong.
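
The status convention can also be checked mechanically. A minimal sketch, assuming the header is a single HTML comment of pipe-separated `key: value` fields, as in the `<!-- status: ... -->` examples in this diff. The function names `parse_status_header` and `is_trusted` are hypothetical and not part of the repo; the trusted set follows the rule that `hand-written` and `reviewed` are trusted.

```python
import re

# Assumption: the header is one HTML comment that starts with "status:".
STATUS_RE = re.compile(r"<!--\s*(status:.*?)\s*-->", re.DOTALL)

# Per the convention: only these two labels count as trusted.
TRUSTED = {"hand-written", "reviewed"}


def parse_status_header(text):
    """Return the header fields as a dict, or {} if there is no header."""
    match = STATUS_RE.search(text)
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split("|"):
        # Split on the first colon only, so values may contain colons.
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_trusted(text):
    """True only when the doc carries a trusted status label."""
    return parse_status_header(text).get("status") in TRUSTED


if __name__ == "__main__":
    doc = "<!-- status: unreviewed | last-reviewed: never -->\n\n# notes"
    print(parse_status_header(doc))
    print(is_trusted(doc))
```

A doc with no header is treated as untrusted here, which seems the safer default for a repo where unlabelled text was the original problem.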

Cheat sheets and diagrams are in [`cheat-sheets/`](./cheat-sheets/) and [`diagrams/`](./diagrams/) — also unreviewed.
Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential — see issue #2). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND).

## How to use this

Starting from scratch: do 01–05 in order, type out the code yourself, and don't move on from a lecture until you can explain its method without notes. Then 06–08, then 09 onward.
Starting from scratch: read the talks/books/courses linked in [`../readme.md`](../readme.md) — they're the trusted external material. The hand-written CS294 notes at [`cs294-2017/`](./cs294-2017/) give you one student's path through the same material.

Already know RL, here for the LLM part: skim 01–05 for notation, then go 09 → 10 → 11 → 12. Lecture 13 if you care about code generation specifically.
Already know RL, here for the LLM part: lectures 09 → 11 → 12 → 14 → 15 → 17 cover the RLHF → DPO → GRPO → constitutional AI → RLVR → iterative preference optimization arc.

Here for code generation: 02 (policy-gradient intuition), 10 (PPO for LLMs), 11–13.
Here for code generation specifically: lecture 02 (policy-gradient intuition), 10 (PPO for LLMs), 13 (RLHF for code), 15 (RLVR — the basis of modern reasoning-RL on code).

## Prerequisites

- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra. The math is explained as it comes up.
- Python at an intermediate level; PyTorch basics (the code uses PyTorch); NumPy.
- Budget a few hours per lecture including coding and debugging.

## Study notes that hold up

- Type the code out. Don't paste it.
- Break it on purpose — change a hyperparameter until it fails, then work out why.
- If you can't explain a method simply, you don't have it yet.
- After coding a method, read the original paper. It reads very differently once you've implemented it.
- Print shapes when something's wrong. Most RL bugs are shape or sign errors.

## Supplementary resources

- Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed.)
- Spinning Up in Deep RL (OpenAI) — explanations plus reference implementations
- David Silver's UCL lectures
- Recent papers, by topic, in [`../reference/papers/`](../reference/papers/)

The lectures are meant to stand on their own, but they'll make more sense alongside these.
- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra.
- Python at an intermediate level; PyTorch basics; NumPy.
- A few hours per lecture including coding and debugging.
@@ -1,3 +1,5 @@
<!-- status: hand-written | provenance: notes by Yad Konrad while taking the courses (2017); kept as written -->


## Notes taken from CS 294: Deep Reinforcement Learning, Spring 2017 (Berkeley)

@@ -1,3 +1,5 @@
<!-- status: hand-written | provenance: notes by Yad Konrad while taking the courses (2017); kept as written -->

#### Elements Of Reinforcement Learning: (Derived from Barto and Sutton '17 and Li '17)

* A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.