From a9441eed4f6b3f39c302666c3498838476f8bdf8 Mon Sep 17 00:00:00 2001
From: Yad Konrad
Date: Fri, 15 May 2026 09:07:27 -0400
Subject: [PATCH] Promote hand-written 2017 notes out of Archive/ into notes/
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The trusted, hand-written CS294 notes were sitting in a folder called Archive — which sounds like "old/dead" — while the unreviewed AI-drafted lecture series occupied notes/. Backwards from a learner's perspective.

- Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/ -> notes/cs294-2017/ (with imgs/ intact)
- Archive/2017-Course-Notes/Elements-Of-RL/ -> notes/sutton-barto-digest/
- Both files got headers
- Archive/ directory deleted (Archive/README.md was just a wrapper)

readme.md restructured: "What's here" now leads with trusted hand-written content (the CS294 notes, the Sutton & Barto digest, the curated talks/books/courses, the tested exercises). The AI-drafted lecture series is clearly demoted as "scaffold, treat with skepticism." "Start here" reordered: talks/books -> exercises -> drafts.

notes/README.md rewritten in the same spirit.

AGENTS.md and CLAUDE.md updated to point at notes/cs294-2017/ and notes/sutton-barto-digest/ as the trusted, frozen, never-reword material.

GitHub topics refreshed separately (not in this commit): dropped `guideline` and `study`; added `rlhf`, `llm-alignment`, `dpo`, `grpo`, `ppo`, `rlvr`, `agentic-rl`, `lecture-notes`, `study-notes`, `deepseek-r1`, `constitutional-ai`, `policy-gradient`, `q-learning`, `sutton-barto`. Description sharpened.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 AGENTS.md                            |  10 +--
 Archive/README.md                    |  17 -----
 CHANGELOG.md                         |  13 ++++
 CLAUDE.md                            |   2 +-
 notes/README.md                      |  60 ++++++++----------
 .../cs294-2017}/imgs/cannon.svg      |   0
 .../cs294-2017}/imgs/linear-lqr.png  | Bin
 .../cs294-2017}/imgs/nvidia-case.png | Bin
 .../imgs/rl-imitation-learning.png   | Bin
 .../cs294-2017}/readme.md            |   2 +
 .../sutton-barto-digest}/readme.md   |   2 +
 readme.md                            |  33 ++++++----
 12 files changed, 71 insertions(+), 68 deletions(-)
 delete mode 100644 Archive/README.md
 rename {Archive/2017-Course-Notes/CS294-DeepRL-Berkeley => notes/cs294-2017}/imgs/cannon.svg (100%)
 rename {Archive/2017-Course-Notes/CS294-DeepRL-Berkeley => notes/cs294-2017}/imgs/linear-lqr.png (100%)
 rename {Archive/2017-Course-Notes/CS294-DeepRL-Berkeley => notes/cs294-2017}/imgs/nvidia-case.png (100%)
 rename {Archive/2017-Course-Notes/CS294-DeepRL-Berkeley => notes/cs294-2017}/imgs/rl-imitation-learning.png (100%)
 rename {Archive/2017-Course-Notes/CS294-DeepRL-Berkeley => notes/cs294-2017}/readme.md (99%)
 rename {Archive/2017-Course-Notes/Elements-Of-RL => notes/sutton-barto-digest}/readme.md (91%)

diff --git a/AGENTS.md b/AGENTS.md
index 8aae0b5..566c83d 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -6,7 +6,7 @@ Instructions for AI coding agents (Codex, Claude Code, etc.) working in this rep

 A personal study repository for reinforcement learning and its use in training LLMs. It holds:

-- the original course notes from 2017 (`Archive/`)
+- the original course notes from 2017 (`notes/cs294-2017/`, `notes/sutton-barto-digest/`)
 - a self-study lecture series taking RL from MDPs to RLHF (`notes/lectures/`)
 - worked, tested coding exercises (`exercises/`)
 - a curated reading list of recent papers (`reference/papers/`)
@@ -18,8 +18,8 @@ It is a learning environment, not a library or a product. A person is working th

 | Path | What it is | Editable by an agent? |
 |---|---|---|
-| `Archive/` | The original 2017 notes, idiosyncratic voice, kept as written | No. Reference only. Never reword. |
-| `notes/lectures/` | The lecture series, `NN-topic.md` | Yes, under the rules below |
+| `notes/cs294-2017/`, `notes/sutton-barto-digest/` | Original 2017 hand-written notes (CS294 student notes; Sutton & Barto digest); idiosyncratic voice, kept as written | No. Reference only. Never reword. |
+| `notes/lectures/` | The 19-lecture series, `NN-topic.md` | Yes, under the rules below |
 | `notes/cheat-sheets/`, `notes/diagrams/` | Quick reference | Yes |
 | `exercises/NN-topic/` | A task, a starter file, tests, a reference solution, hints | Yes |
 | `reference/papers/` | Reading lists. The `PAPERS.md` files are generated by the collector; the per-topic READMEs are hand notes. | READMEs yes; don't hand-edit `PAPERS.md` — re-run the collector. |
@@ -58,7 +58,7 @@ The repo has a voice: plain and direct, a little informal, written by someone le
 - **No marketing voice.** Not a product launch. Don't call things "comprehensive," "powerful," "cutting-edge," "robust." Don't open a section with "Why this matters." Don't close with "the future is bright."
 - **No AI-slop tells.** No emoji as bullets or in headings. No rule-of-three padding ("fast, simple, and elegant"). No "it's not just X — it's Y." No "let's dive in." Sentence-case headings. If `~/.claude/skills/anti-slop-guide` is available, follow it.
 - **Be specific.** "The loss explodes around update 50 if you don't normalize the advantage" beats "this can be unstable."
-- **Keep the old notes' quirks.** The 2017 archive says "quadratize (it could be a word)" and references the author being from Iraq. That stays. Don't sand it down.
+- **Keep the old notes' quirks.** The 2017 hand-written notes (`notes/cs294-2017/`) say "quadratize (it could be a word)" and reference the author being from Iraq. That stays. Don't sand it down.

 ## Citations

@@ -95,7 +95,7 @@ An agent acting as tutor: have the student edit `starter.py`, run `pytest exerci

 ## Don't

-- Don't reword `Archive/`.
+- Don't reword the 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`).
 - Don't mark your own output `reviewed`.
 - Don't add a citation you haven't verified.
 - Don't touch files outside the repo, shell config, or git history without being asked.
diff --git a/Archive/README.md b/Archive/README.md
deleted file mode 100644
index e281ec9..0000000
--- a/Archive/README.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Archive — original notes (2017)
-
-Hand-written notes kept as they were. Not edited, not modernized. If something here is dated, that's the point — it's a record, not a maintained doc.
-
-## Contents
-
-### [2017-Course-Notes/CS294-DeepRL-Berkeley/](./2017-Course-Notes/CS294-DeepRL-Berkeley/)
-
-Notes from CS 294: Deep Reinforcement Learning, Berkeley, Spring 2017 (Sergey Levine, John Schulman, Chelsea Finn). Covers imitation learning and DAgger, optimal control and trajectory optimization (LQR/iLQR, MCTS), learning dynamics models, policy gradients, TRPO, actor-critic, and model-based RL. The current version of the course is [CS 285](https://rail.eecs.berkeley.edu/deeprlcourse/).
-
-### [2017-Course-Notes/Elements-Of-RL/](./2017-Course-Notes/Elements-Of-RL/)
-
-A short digest of the four elements of an RL system — policy, reward signal, value function, model — from Sutton & Barto and Li's *Deep RL: An Overview*.
-
-## Status
-
-`hand-written`. These are student notes: informal, with the occasional error or dead link, and some personal asides that stay in. For authoritative versions, go to the original course materials and papers. The newer lecture series in [`../notes/lectures/`](../notes/lectures/) covers the same foundations and continues into RLHF; the reading lists are in [`../reference/papers/`](../reference/papers/).
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3242e23..20fad36 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,19 @@

 Notable changes to the repo. Not a release log — there are no releases — just a record of what moved and why.

+## 2026-05-15 — promote the hand-written notes out of Archive/
+
+The trusted, hand-written 2017 notes had been sitting in a folder called `Archive/` — which connotes "old/dead" — while the unreviewed AI-drafted lecture series occupied `notes/`. Backwards. Fixed:
+
+- `Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/` → `notes/cs294-2017/` (with `imgs/` intact and image links unchanged).
+- `Archive/2017-Course-Notes/Elements-Of-RL/` → `notes/sutton-barto-digest/`.
+- Both moved files got a `` header.
+- `Archive/` directory deleted (`Archive/README.md` was a wrapper; no content lost).
+- Root `readme.md` "What's here" section restructured to lead with the trusted, hand-written content (the CS294 notes, the Sutton & Barto digest, the curated talks/books/courses, the tested exercises) and clearly demote the AI-drafted lecture series as scaffold-with-skepticism. "Start here" reordered to lead with safer paths (talks/books → exercises → drafts).
+- `notes/README.md` rewritten in the same spirit — hand-written content first, lecture series second with a clear caveat about what `unreviewed` means.
+- `AGENTS.md` and `CLAUDE.md` updated: the layout table now points at `notes/cs294-2017/` and `notes/sutton-barto-digest/` as the trusted, frozen, never-reword material instead of `Archive/`.
+- GitHub topics refreshed: dropped `guideline` and `study` (generic), added `rlhf`, `llm-alignment`, `dpo`, `grpo`, `ppo`, `rlvr`, `agentic-rl`, `lecture-notes`, `study-notes`, `deepseek-r1`, `constitutional-ai`, `policy-gradient`, `q-learning`, `sutton-barto`. Description sharpened.
+
 ## 2026-05-12 — restructure: separate the layers, set up rules

 Context: the repo had grown two layers — the original 2017 notes, and a much larger newer layer added in 2025 (a 13-lecture series, scraped paper lists, a content tool). The newer layer was unmarked, wrote in a first person it hadn't earned, and shipped broken links, phantom lectures, and made-up citations. This pass separates the two so nobody has to guess what's trustworthy, and sets up conventions so they can coexist.
diff --git a/CLAUDE.md b/CLAUDE.md
index d0f9cb3..6fb03f8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,7 +5,7 @@ Read [`AGENTS.md`](./AGENTS.md). Everything in it applies to Claude Code working
 Quick orientation:

 - This is a study repo for RL and RL-for-LLMs. A person is learning the material; help them learn it, don't try to "finish" the repo.
-- `Archive/` is frozen — reference it, never reword it.
+- The 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`) are trusted and frozen — reference, never reword.
 - Docs under `notes/` and `reference/` carry a `` comment. `hand-written` and `reviewed` are trusted; `unreviewed` means nobody has checked it — don't cite it as fact, and don't promote it to `reviewed` yourself (only a person does that).
 - Match the existing voice: plain, direct, no marketing tone, no AI-slop tells. The `anti-slop-guide` skill is available — use it when writing or editing prose here.
 - Verify every paper citation before adding it. The repo currently has some invented ones; don't add more.
diff --git a/notes/README.md b/notes/README.md
index 1e0b296..d5418d3 100644
--- a/notes/README.md
+++ b/notes/README.md
@@ -1,19 +1,32 @@


-# Lecture series: deep RL to LLM alignment
+# notes — study material

-A self-study sequence that goes from MDPs and policy gradients up through RLHF, DPO, and the 2024–2025 alignment methods. The lecture bodies **haven't been reviewed yet** — useful as a structured path, but check the math, the code, and the citations against primary sources. `../CURRICULUM.md` is the same path with prerequisites and time estimates; [`../AGENTS.md`](../AGENTS.md) explains the `status:` labels.
+Two layers live in this directory, mixed.

-Each lecture tries to do four things: give the intuition before the math, show code that runs, point at where the method breaks in practice, and name the papers that introduced it. When a lecture has a matching exercise, it links to [`../exercises/`](../exercises/).
+**Trusted, hand-written:**

-## Lectures and review status
+- **[`cs294-2017/`](./cs294-2017/)** — personal student notes from CS 294 Deep RL (Berkeley, Spring 2017 — Levine, Schulman, Finn). 246 lines of working notes from the field being built. Idiosyncratic, kept as written. `status: hand-written`.
+- **[`sutton-barto-digest/`](./sutton-barto-digest/)** — short distillation of the four elements of an RL system (policy, reward, value function, model) from Sutton & Barto. `status: hand-written`.
+
+These are old (2017) and informal — but they're a real person's understanding, not AI text. Trusted as starting points.
+
+**AI-drafted, useful as scaffold (`unreviewed` — treat with skepticism):**
+
+- **[`lectures/`](./lectures/)** — a 19-lecture series taking RL from MDPs through to RLHF / DPO / GRPO / RLVR / agentic / offline. Editorial pass has been done — broken links fixed, code bugs caught (`import gym` → `gymnasium`, missing imports, old-API `env.step` calls), citations checked or removed when they didn't resolve, fake-first-person framing stripped. **But no person has read each lecture end to end and signed off.** Cross-check the math against the cited papers; treat the code as a starting point that needs verification. Index and per-lecture review status below.
+- **[`cheat-sheets/`](./cheat-sheets/)** — `RL-Math-Formulas.md` and `RL-Quick-Reference.md`. Audited (caught a wrong KL direction; fixed). Same caveat.
+- **[`diagrams/`](./diagrams/)** — `RL-Algorithm-Diagrams.md`. Audited (caught and fixed a wrong DPO loss diagram and a wrong GRPO advantage diagram). Same caveat.
+
+[`../CURRICULUM.md`](../CURRICULUM.md) is the suggested order through everything. [`../AGENTS.md`](../AGENTS.md) explains the `` convention every doc carries.
+
+## Lecture series — drafts, in order

 | # | Lecture | Status |
 |---|---|---|
 | 01 | [MDPs and Bellman equations](./lectures/01-mdps-bellman.md) — exercise: [`01-mdps`](../exercises/01-mdps/) | unreviewed (de-slopped; a fabricated value-function output was removed) |
 | 02 | [Policy gradients from scratch](./lectures/02-policy-gradients.md) — exercise: [`02-policy-gradients`](../exercises/02-policy-gradients/) | unreviewed (de-slopped; a broken link and a code bug were fixed) |
 | 03 | [Value functions & Q-learning](./lectures/03-value-functions-q-learning.md) — exercise: [`03-q-learning`](../exercises/03-q-learning/) | unreviewed (de-slopped; a dead `Modern-RL-Research/` path and a missing import fixed) |
-| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) | unreviewed (de-slopped; a code bug fixed) |
+| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) — exercise: [`04-actor-critic`](../exercises/04-actor-critic/) | unreviewed (de-slopped; a code bug fixed) |
 | 05 | [Trust regions and TRPO](./lectures/05-trpo.md) | unreviewed (de-slopped; fabricated training times removed) |
 | 06 | [PPO](./lectures/06-ppo.md) | unreviewed (de-slopped; `import gym` → `gymnasium` fixed) |
 | 07 | [Off-policy learning: SAC and TD3](./lectures/07-off-policy-rl.md) | unreviewed (de-slopped; an old-API `env.step` call fixed) |
@@ -22,45 +35,28 @@ Each lecture tries to do four things: give the intuition before the math, show c
 | 10 | [PPO for language models](./lectures/10-ppo-for-llms.md) | unreviewed (de-slopped; a broken next-lecture link + unverified compute claims fixed) |
 | 11 | [Direct preference optimization](./lectures/11-dpo.md) | unreviewed (de-slopped; a fabricated paper removed) |
 | 12 | [Beyond DPO: GRPO, RRHF, IPO](./lectures/12-beyond-dpo.md) | unreviewed (de-slopped; a fabricated benchmark table + a fabricated paper removed) |
-| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
+| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) (related) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
 | 14 | [Constitutional AI, RLAIF, self-improvement](./lectures/14-constitutional-ai-rlaif.md) | unreviewed (new draft) |
-| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) | unreviewed (new draft) |
+| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) | unreviewed (new draft) |
 | 16 | [Agentic RL: tool use, multi-turn](./lectures/16-agentic-rl.md) | unreviewed (new draft) |
 | 17 | [Online & iterative preference optimization](./lectures/17-online-iterative-preference.md) | unreviewed (new draft) |
 | 18 | [Distillation of reasoning models](./lectures/18-distillation-reasoning.md) | unreviewed (new draft) |
 | 19 | [Offline RL](./lectures/19-offline-rl.md) | unreviewed (new draft) |

-Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND) — the one remaining foundational gap.
+What "unreviewed" means here: nobody has read the lecture end-to-end and signed off on it. The editorial pass (de-slop, fix broken links, catch code bugs, verify citations) has happened — that's the parenthetical note next to each row. The next step is a person reads it and either flips it to `reviewed` (with today's date in `last-reviewed:`) or notes what's still wrong.

-Cheat sheets and diagrams are in [`cheat-sheets/`](./cheat-sheets/) and [`diagrams/`](./diagrams/) — also unreviewed.
+Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential — see issue #2). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND).

 ## How to use this

-Starting from scratch: do 01–05 in order, type out the code yourself, and don't move on from a lecture until you can explain its method without notes. Then 06–08, then 09 onward.
+Starting from scratch: read the talks/books/courses linked in [`../readme.md`](../readme.md) — they're the trusted external material. The hand-written CS294 notes at [`cs294-2017/`](./cs294-2017/) give you one student's path through the same material.

-Already know RL, here for the LLM part: skim 01–05 for notation, then go 09 → 10 → 11 → 12. Lecture 13 if you care about code generation specifically.
+Already know RL, here for the LLM part: lectures 09 → 11 → 12 → 14 → 15 → 17 covers the RLHF → DPO → GRPO → constitutional AI → RLVR → iterative preference optimization arc.

-Here for code generation: 02 (policy-gradient intuition), 10 (PPO for LLMs), 11–13.
+Here for code generation specifically: lecture 02 (policy-gradient intuition), 10 (PPO for LLMs), 13 (RLHF for code), 15 (RLVR — the basis of modern reasoning-RL on code).

 ## Prerequisites

-- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra. The math is explained as it comes up.
-- Python at an intermediate level; PyTorch basics (the code uses PyTorch); NumPy.
-- Budget a few hours per lecture including coding and debugging.
-
-## Study notes that hold up
-
-- Type the code out. Don't paste it.
-- Break it on purpose — change a hyperparameter until it fails, then work out why.
-- If you can't explain a method simply, you don't have it yet.
-- After coding a method, read the original paper. It reads very differently once you've implemented it.
-- Print shapes when something's wrong. Most RL bugs are shape or sign errors.
-
-## Supplementary resources
-
-- Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed.)
-- Spinning Up in Deep RL (OpenAI) — explanations plus reference implementations
-- David Silver's UCL lectures
-- Recent papers, by topic, in [`../reference/papers/`](../reference/papers/)
-
-The lectures are meant to stand on their own, but they'll make more sense alongside these.
+- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra.
+- Python at an intermediate level; PyTorch basics; NumPy.
+- A few hours per lecture including coding and debugging.
diff --git a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/cannon.svg b/notes/cs294-2017/imgs/cannon.svg
similarity index 100%
rename from Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/cannon.svg
rename to notes/cs294-2017/imgs/cannon.svg
diff --git a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/linear-lqr.png b/notes/cs294-2017/imgs/linear-lqr.png
similarity index 100%
rename from Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/linear-lqr.png
rename to notes/cs294-2017/imgs/linear-lqr.png
diff --git a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/nvidia-case.png b/notes/cs294-2017/imgs/nvidia-case.png
similarity index 100%
rename from Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/nvidia-case.png
rename to notes/cs294-2017/imgs/nvidia-case.png
diff --git a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/rl-imitation-learning.png b/notes/cs294-2017/imgs/rl-imitation-learning.png
similarity index 100%
rename from Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/imgs/rl-imitation-learning.png
rename to notes/cs294-2017/imgs/rl-imitation-learning.png
diff --git a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/readme.md b/notes/cs294-2017/readme.md
similarity index 99%
rename from Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/readme.md
rename to notes/cs294-2017/readme.md
index d281c3a..387910c 100644
--- a/Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/readme.md
+++ b/notes/cs294-2017/readme.md
@@ -1,3 +1,5 @@
+
+
 ## Notes taken from CS 294: Deep Reinforcement Learning, Spring 2017 (Berkeley)
diff --git a/Archive/2017-Course-Notes/Elements-Of-RL/readme.md b/notes/sutton-barto-digest/readme.md
similarity index 91%
rename from Archive/2017-Course-Notes/Elements-Of-RL/readme.md
rename to notes/sutton-barto-digest/readme.md
index 24eaffb..69779ba 100644
--- a/Archive/2017-Course-Notes/Elements-Of-RL/readme.md
+++ b/notes/sutton-barto-digest/readme.md
@@ -1,3 +1,5 @@
+
+
 #### Elements Of Reinforcement Learning: (Derived from Barto and Sutton '17 and Li '17)

 * A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
diff --git a/readme.md b/readme.md
index a9bc791..35c8c5e 100644
--- a/readme.md
+++ b/readme.md
@@ -6,21 +6,28 @@ This is a personal study repo, not a library. It mixes notes a person wrote (som

 ## What's here

-- **`notes/`** — the lecture series (`notes/lectures/`), plus cheat sheets and diagrams. Currently unreviewed; see [`notes/README.md`](./notes/README.md) for the index and review status.
-- **`exercises/`** — small coding exercises with tests and reference solutions. Built to be worked through step by step, with a coding agent or on your own.
-- **`CURRICULUM.md`** — the ordered path through the lectures and exercises.
-- **`Archive/`** — the original 2017 course notes (CS294 Deep RL, Berkeley) and a short Sutton & Barto digest. Kept as written.
-- **`reference/papers/`** — reading lists of recent papers, collected from arXiv by the script in `tools/`.
-- **`tools/`** — `arxiv-collector/` (fetches arXiv papers), `lit-builder/` (conference-paper triage — a retuned copy of `iclr-lit-builder`: fetches ICLR/NeurIPS/ICML paper lists, keyword-filters, LLM-scores 0–3 with a reason), and `content-pipeline/` (drafts blog posts / threads from papers; auxiliary).
+**Trusted, hand-written:**

-## Start here
+- **[`notes/cs294-2017/`](./notes/cs294-2017/)** — personal student notes from CS 294 Deep RL (Berkeley, Spring 2017 — Levine, Schulman, Finn). 246 lines of real-time notes from the field being built. Idiosyncratic, opinionated, with the cannon-trajectory aside. Kept as written.
+- **[`notes/sutton-barto-digest/`](./notes/sutton-barto-digest/)** — short distillation of the four elements of an RL system, from Sutton & Barto.
+- **Talks, books, courses** — the curated external links below. The Pineau intro, Abbeel's deep RL talk, David Silver's UCL course, Sutton & Barto's book, CS285, Spinning Up. Here since 2015. Still the best place to start if you're new.
+- **[`exercises/`](./exercises/)** — five small coding exercises with `pytest` tests and reference solutions, verified to pass. Implement REINFORCE on CartPole, Q-learning on FrozenLake, value iteration on a gridworld, actor-critic, a tiny GRPO loop on a verifiable arithmetic task.
+
+**AI-drafted, useful as scaffold (`unreviewed` — treat with skepticism):**
+
+- **[`notes/lectures/`](./notes/lectures/)** — a 19-lecture series, MDPs through RLHF / DPO / GRPO / RLVR / agentic / offline. Editorial pass done (broken links, code bugs, made-up citations all caught and fixed) — but no person has read each one end-to-end. Cross-check the math against the cited papers before relying on it. Index and per-lecture status in [`notes/README.md`](./notes/README.md); ordered study path in [`CURRICULUM.md`](./CURRICULUM.md).
+- **`notes/cheat-sheets/`, `notes/diagrams/`** — quick reference. Same caveat. (The diagrams file caught and fixed two wrong loss diagrams during the audit, FWIW.)
+- **[`reference/papers/`](./reference/papers/)** — auto-collected paper lists from arXiv (~430 abstracts). Use as a search index, not a curated reading list.
+- **[`tools/`](./tools/)** — `arxiv-collector/` (fetches arXiv papers), `lit-builder/` (ICLR/NeurIPS/ICML triage with keyword filter + LLM scoring), `content-pipeline/` (drafts blog posts from papers; auxiliary).

-- New to RL: read [`CURRICULUM.md`](./CURRICULUM.md), then start `notes/lectures/01-mdps-bellman.md`. Do the exercises as you go.
-- Know RL, here for the LLM part: skim lectures 1–5, then 9 onward (reward modeling → PPO for LLMs → DPO → GRPO).
-- Want the original 2017 notes: `Archive/2017-Course-Notes/`.
-- Working in this repo with Claude Code or Codex? Read [`AGENTS.md`](./AGENTS.md) first.
+[`AGENTS.md`](./AGENTS.md) explains the `` convention every doc carries.
+
+## Start here

-For the foundational external material the repo has always pointed at — talks, books, courses — see below. It's still the best starting point if you want lectures from the people who built the field.
+- **New to RL?** Start with the talks/books/courses below — Pineau's intro, then Sutton & Barto for foundations, then David Silver's UCL course or CS285 (Berkeley's current version of CS294). The 2017 CS294 notes ([`notes/cs294-2017/`](./notes/cs294-2017/)) give you one student's working notes through the same material if you like that genre.
+- **Want hands-on?** Do the [`exercises/`](./exercises/). They're tested and they actually run. Five of them, a couple of hours each.
+- **Curious about modern LLM RL?** The 19-lecture series in [`notes/lectures/`](./notes/lectures/) covers RLHF, DPO, GRPO, RLVR, agentic, offline. Drafts; cross-check the claims against the cited papers.
+- **Working in this repo with Claude Code or Codex?** Read [`AGENTS.md`](./AGENTS.md) first.

 ## The landscape

@@ -151,7 +158,7 @@ The map below shows where each family fits. The lectures fill in the details; [`
 * Lecture 9: Exploration and Exploitation
 * Lecture 10: Case Study: RL in Classic Games

-* [CS 294: Deep Reinforcement Learning, Spring 2017](https://rll.berkeley.edu/deeprlcourse-fa17/) by Sergey Levine, John Schulman, Chelsea Finn. My notes are archived at [`Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/`](./Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/).
+* [CS 294: Deep Reinforcement Learning, Spring 2017](https://rll.berkeley.edu/deeprlcourse-fa17/) by Sergey Levine, John Schulman, Chelsea Finn. My notes from taking it are at [`notes/cs294-2017/`](./notes/cs294-2017/).

 * [CS 285: Deep Reinforcement Learning (Berkeley)](https://rail.eecs.berkeley.edu/deeprlcourse/) — the current version of CS294, updated each year.