0bserver07 · 0bserver07 · May 15, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -6,7 +6,7 @@ Instructions for AI coding agents (Codex, Claude Code, etc.) working in this rep
 
 A personal study repository for reinforcement learning and its use in training LLMs. It holds:
 
-- the original course notes from 2017 (`Archive/`)
+- the original course notes from 2017 (`notes/cs294-2017/`, `notes/sutton-barto-digest/`)
 - a self-study lecture series taking RL from MDPs to RLHF (`notes/lectures/`)
 - worked, tested coding exercises (`exercises/`)
 - a curated reading list of recent papers (`reference/papers/`)
@@ -18,8 +18,8 @@ It is a learning environment, not a library or a product. A person is working th
 
 | Path | What it is | Editable by an agent? |
 |---|---|---|
-| `Archive/` | The original 2017 notes, idiosyncratic voice, kept as written | No. Reference only. Never reword. |
-| `notes/lectures/` | The lecture series, `NN-topic.md` | Yes, under the rules below |
+| `notes/cs294-2017/`, `notes/sutton-barto-digest/` | Original 2017 hand-written notes (CS294 student notes; Sutton & Barto digest); idiosyncratic voice, kept as written | No. Reference only. Never reword. |
+| `notes/lectures/` | The 19-lecture series, `NN-topic.md` | Yes, under the rules below |
 | `notes/cheat-sheets/`, `notes/diagrams/` | Quick reference | Yes |
 | `exercises/NN-topic/` | A task, a starter file, tests, a reference solution, hints | Yes |
 | `reference/papers/` | Reading lists. The `PAPERS.md` files are generated by the collector; the per-topic READMEs are hand notes. | READMEs yes; don't hand-edit `PAPERS.md` — re-run the collector. |
@@ -58,7 +58,7 @@ The repo has a voice: plain and direct, a little informal, written by someone le
 - **No marketing voice.** Not a product launch. Don't call things "comprehensive," "powerful," "cutting-edge," "robust." Don't open a section with "Why this matters." Don't close with "the future is bright."
 - **No AI-slop tells.** No emoji as bullets or in headings. No rule-of-three padding ("fast, simple, and elegant"). No "it's not just X — it's Y." No "let's dive in." Sentence-case headings. If `~/.claude/skills/anti-slop-guide` is available, follow it.
 - **Be specific.** "The loss explodes around update 50 if you don't normalize the advantage" beats "this can be unstable."
-- **Keep the old notes' quirks.** The 2017 archive says "quadratize (it could be a word)" and references the author being from Iraq. That stays. Don't sand it down.
+- **Keep the old notes' quirks.** The 2017 hand-written notes (`notes/cs294-2017/`) say "quadratize (it could be a word)" and reference the author being from Iraq. That stays. Don't sand it down.
 
 ## Citations
 
@@ -95,7 +95,7 @@ An agent acting as tutor: have the student edit `starter.py`, run `pytest exerci
 
 ## Don't
 
-- Don't reword `Archive/`.
+- Don't reword the 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`).
 - Don't mark your own output `reviewed`.
 - Don't add a citation you haven't verified.
 - Don't touch files outside the repo, shell config, or git history without being asked.

diff --git a/Archive/README.md b/Archive/README.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,19 @@
 
 Notable changes to the repo. Not a release log — there are no releases — just a record of what moved and why.
 
+## 2026-05-15 — promote the hand-written notes out of Archive/
+
+The trusted, hand-written 2017 notes had been sitting in a folder called `Archive/` — which connotes "old/dead" — while the unreviewed AI-drafted lecture series occupied `notes/`. Backwards. Fixed:
+
+- `Archive/2017-Course-Notes/CS294-DeepRL-Berkeley/` → `notes/cs294-2017/` (with `imgs/` intact and image links unchanged).
+- `Archive/2017-Course-Notes/Elements-Of-RL/` → `notes/sutton-barto-digest/`.
+- Both moved files got a `<!-- status: hand-written -->` header.
+- `Archive/` directory deleted (`Archive/README.md` was a wrapper; no content lost).
+- Root `readme.md` "What's here" section restructured to lead with the trusted, hand-written content (the CS294 notes, the Sutton & Barto digest, the curated talks/books/courses, the tested exercises) and clearly demote the AI-drafted lecture series as scaffold-with-skepticism. "Start here" reordered to lead with safer paths (talks/books → exercises → drafts).
+- `notes/README.md` rewritten in the same spirit — hand-written content first, lecture series second with a clear caveat about what `unreviewed` means.
+- `AGENTS.md` and `CLAUDE.md` updated: the layout table now points at `notes/cs294-2017/` and `notes/sutton-barto-digest/` as the trusted, frozen, never-reword material instead of `Archive/`.
+- GitHub topics refreshed: dropped `guideline` and `study` (generic), added `rlhf`, `llm-alignment`, `dpo`, `grpo`, `ppo`, `rlvr`, `agentic-rl`, `lecture-notes`, `study-notes`, `deepseek-r1`, `constitutional-ai`, `policy-gradient`, `q-learning`, `sutton-barto`. Description sharpened.
+
 ## 2026-05-12 — restructure: separate the layers, set up rules
 
 Context: the repo had grown two layers — the original 2017 notes, and a much larger newer layer added in 2025 (a 13-lecture series, scraped paper lists, a content tool). The newer layer was unmarked, wrote in a first person it hadn't earned, and shipped broken links, phantom lectures, and made-up citations. This pass separates the two so nobody has to guess what's trustworthy, and sets up conventions so they can coexist.

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -5,7 +5,7 @@ Read [`AGENTS.md`](./AGENTS.md). Everything in it applies to Claude Code working
 Quick orientation:
 
 - This is a study repo for RL and RL-for-LLMs. A person is learning the material; help them learn it, don't try to "finish" the repo.
-- `Archive/` is frozen — reference it, never reword it.
+- The 2017 hand-written notes (`notes/cs294-2017/`, `notes/sutton-barto-digest/`) are trusted and frozen — reference, never reword.
 - Docs under `notes/` and `reference/` carry a `<!-- status: ... -->` comment. `hand-written` and `reviewed` are trusted; `unreviewed` means nobody has checked it — don't cite it as fact, and don't promote it to `reviewed` yourself (only a person does that).
 - Match the existing voice: plain, direct, no marketing tone, no AI-slop tells. The `anti-slop-guide` skill is available — use it when writing or editing prose here.
 - Verify every paper citation before adding it. The repo currently has some invented ones; don't add more.

diff --git a/notes/README.md b/notes/README.md
@@ -1,19 +1,32 @@
 <!-- status: unreviewed | last-reviewed: never -->
 
-# Lecture series: deep RL to LLM alignment
+# notes — study material
 
-A self-study sequence that goes from MDPs and policy gradients up through RLHF, DPO, and the 2024–2025 alignment methods. The lecture bodies **haven't been reviewed yet** — useful as a structured path, but check the math, the code, and the citations against primary sources. `../CURRICULUM.md` is the same path with prerequisites and time estimates; [`../AGENTS.md`](../AGENTS.md) explains the `status:` labels.
+Two layers live in this directory, mixed.
 
-Each lecture tries to do four things: give the intuition before the math, show code that runs, point at where the method breaks in practice, and name the papers that introduced it. When a lecture has a matching exercise, it links to [`../exercises/`](../exercises/).
+**Trusted, hand-written:**
 
-## Lectures and review status
+- **[`cs294-2017/`](./cs294-2017/)** — personal student notes from CS 294 Deep RL (Berkeley, Spring 2017 — Levine, Schulman, Finn). 246 lines of working notes from the field being built. Idiosyncratic, kept as written. `status: hand-written`.
+- **[`sutton-barto-digest/`](./sutton-barto-digest/)** — short distillation of the four elements of an RL system (policy, reward, value function, model) from Sutton & Barto. `status: hand-written`.
+
+These are old (2017) and informal — but they're a real person's understanding, not AI text. Trusted as starting points.
+
+**AI-drafted, useful as scaffold (`unreviewed` — treat with skepticism):**
+
+- **[`lectures/`](./lectures/)** — a 19-lecture series taking RL from MDPs through to RLHF / DPO / GRPO / RLVR / agentic / offline. Editorial pass has been done — broken links fixed, code bugs caught (`import gym` → `gymnasium`, missing imports, old-API `env.step` calls), citations checked or removed when they didn't resolve, fake-first-person framing stripped. **But no person has read each lecture end to end and signed off.** Cross-check the math against the cited papers; treat the code as a starting point that needs verification. Index and per-lecture review status below.
+- **[`cheat-sheets/`](./cheat-sheets/)** — `RL-Math-Formulas.md` and `RL-Quick-Reference.md`. Audited (caught a wrong KL direction; fixed). Same caveat.
+- **[`diagrams/`](./diagrams/)** — `RL-Algorithm-Diagrams.md`. Audited (caught and fixed a wrong DPO loss diagram and a wrong GRPO advantage diagram). Same caveat.
+
+[`../CURRICULUM.md`](../CURRICULUM.md) is the suggested order through everything. [`../AGENTS.md`](../AGENTS.md) explains the `<!-- status: ... -->` convention every doc carries.
+
+## Lecture series — drafts, in order
 
 | # | Lecture | Status |
 |---|---|---|
 | 01 | [MDPs and Bellman equations](./lectures/01-mdps-bellman.md) — exercise: [`01-mdps`](../exercises/01-mdps/) | unreviewed (de-slopped; a fabricated value-function output was removed) |
 | 02 | [Policy gradients from scratch](./lectures/02-policy-gradients.md) — exercise: [`02-policy-gradients`](../exercises/02-policy-gradients/) | unreviewed (de-slopped; a broken link and a code bug were fixed) |
 | 03 | [Value functions & Q-learning](./lectures/03-value-functions-q-learning.md) — exercise: [`03-q-learning`](../exercises/03-q-learning/) | unreviewed (de-slopped; a dead `Modern-RL-Research/` path and a missing import fixed) |
-| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) | unreviewed (de-slopped; a code bug fixed) |
+| 04 | [Actor-critic methods](./lectures/04-actor-critic.md) — exercise: [`04-actor-critic`](../exercises/04-actor-critic/) | unreviewed (de-slopped; a code bug fixed) |
 | 05 | [Trust regions and TRPO](./lectures/05-trpo.md) | unreviewed (de-slopped; fabricated training times removed) |
 | 06 | [PPO](./lectures/06-ppo.md) | unreviewed (de-slopped; `import gym` → `gymnasium` fixed) |
 | 07 | [Off-policy learning: SAC and TD3](./lectures/07-off-policy-rl.md) | unreviewed (de-slopped; an old-API `env.step` call fixed) |
@@ -22,45 +35,28 @@ Each lecture tries to do four things: give the intuition before the math, show c
 | 10 | [PPO for language models](./lectures/10-ppo-for-llms.md) | unreviewed (de-slopped; a broken next-lecture link + unverified compute claims fixed) |
 | 11 | [Direct preference optimization](./lectures/11-dpo.md) | unreviewed (de-slopped; a fabricated paper removed) |
 | 12 | [Beyond DPO: GRPO, RRHF, IPO](./lectures/12-beyond-dpo.md) | unreviewed (de-slopped; a fabricated benchmark table + a fabricated paper removed) |
-| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
+| 13 | [RLHF for code generation](./lectures/13-rlhf-code-generation.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) (related) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
 | 14 | [Constitutional AI, RLAIF, self-improvement](./lectures/14-constitutional-ai-rlaif.md) | unreviewed (new draft) |
-| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) | unreviewed (new draft) |
+| 15 | [RL with verifiable rewards & reasoning models](./lectures/15-rl-verifiable-rewards.md) — exercise: [`15-grpo-rlvr`](../exercises/15-grpo-rlvr/) | unreviewed (new draft) |
 | 16 | [Agentic RL: tool use, multi-turn](./lectures/16-agentic-rl.md) | unreviewed (new draft) |
 | 17 | [Online & iterative preference optimization](./lectures/17-online-iterative-preference.md) | unreviewed (new draft) |
 | 18 | [Distillation of reasoning models](./lectures/18-distillation-reasoning.md) | unreviewed (new draft) |
 | 19 | [Offline RL](./lectures/19-offline-rl.md) | unreviewed (new draft) |
 
-Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND) — the one remaining foundational gap.
+What "unreviewed" means here: nobody has read the lecture end-to-end and signed off on it. The editorial pass (de-slop, fix broken links, catch code bugs, verify citations) has happened — that's the parenthetical note next to each row. The next step is a person reads it and either flips it to `reviewed` (with today's date in `last-reviewed:`) or notes what's still wrong.
 
-Cheat sheets and diagrams are in [`cheat-sheets/`](./cheat-sheets/) and [`diagrams/`](./diagrams/) — also unreviewed.
+Planned: a curated paper layer in [`../reference/papers/`](../reference/papers/), built from `../tools/lit-builder/` once the LLM scoring step has been run (it needs a credential — see issue #2). Optionally: an exploration lecture (intrinsic motivation, count-based methods, RND).
 
 ## How to use this
 
-Starting from scratch: do 01–05 in order, type out the code yourself, and don't move on from a lecture until you can explain its method without notes. Then 06–08, then 09 onward.
+Starting from scratch: read the talks/books/courses linked in [`../readme.md`](../readme.md) — they're the trusted external material. The hand-written CS294 notes at [`cs294-2017/`](./cs294-2017/) give you one student's path through the same material.
 
-Already know RL, here for the LLM part: skim 01–05 for notation, then go 09 → 10 → 11 → 12. Lecture 13 if you care about code generation specifically.
+Already know RL, here for the LLM part: lectures 09 → 11 → 12 → 14 → 15 → 17 cover the RLHF → DPO → GRPO → constitutional AI → RLVR → iterative preference optimization arc.
 
-Here for code generation: 02 (policy-gradient intuition), 10 (PPO for LLMs), 11–13.
+Here for code generation specifically: lecture 02 (policy-gradient intuition), 10 (PPO for LLMs), 13 (RLHF for code), 15 (RLVR — the basis of modern reasoning-RL on code).
 
 ## Prerequisites
 
-- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra. The math is explained as it comes up.
-- Python at an intermediate level; PyTorch basics (the code uses PyTorch); NumPy.
-- Budget a few hours per lecture including coding and debugging.
-
-## Study notes that hold up
-
-- Type the code out. Don't paste it.
-- Break it on purpose — change a hyperparameter until it fails, then work out why.
-- If you can't explain a method simply, you don't have it yet.
-- After coding a method, read the original paper. It reads very differently once you've implemented it.
-- Print shapes when something's wrong. Most RL bugs are shape or sign errors.
-
-## Supplementary resources
-
-- Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed.)
-- Spinning Up in Deep RL (OpenAI) — explanations plus reference implementations
-- David Silver's UCL lectures
-- Recent papers, by topic, in [`../reference/papers/`](../reference/papers/)
-
-The lectures are meant to stand on their own, but they'll make more sense alongside these.
+- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra.
+- Python at an intermediate level; PyTorch basics; NumPy.
+- A few hours per lecture including coding and debugging.
diff --git a/...tes/CS294-DeepRL-Berkeley/imgs/cannon.svg → notes/cs294-2017/imgs/cannon.svg b/...tes/CS294-DeepRL-Berkeley/imgs/cannon.svg → notes/cs294-2017/imgs/cannon.svg
diff --git a/...CS294-DeepRL-Berkeley/imgs/linear-lqr.png → notes/cs294-2017/imgs/linear-lqr.png b/...CS294-DeepRL-Berkeley/imgs/linear-lqr.png → notes/cs294-2017/imgs/linear-lqr.png
diff --git a/...S294-DeepRL-Berkeley/imgs/nvidia-case.png → notes/cs294-2017/imgs/nvidia-case.png b/...S294-DeepRL-Berkeley/imgs/nvidia-case.png → notes/cs294-2017/imgs/nvidia-case.png
diff --git a/...L-Berkeley/imgs/rl-imitation-learning.png → ...cs294-2017/imgs/rl-imitation-learning.png b/...L-Berkeley/imgs/rl-imitation-learning.png → ...cs294-2017/imgs/rl-imitation-learning.png
diff --git a/...rse-Notes/CS294-DeepRL-Berkeley/readme.md → notes/cs294-2017/readme.md b/...rse-Notes/CS294-DeepRL-Berkeley/readme.md → notes/cs294-2017/readme.md
@@ -1,3 +1,5 @@
+<!-- status: hand-written | provenance: notes by Yad Konrad while taking the courses (2017); kept as written -->
+
 
 ## Notes taken from CS 294: Deep Reinforcement Learning, Spring 2017 (Berkeley)
 

diff --git a/...017-Course-Notes/Elements-Of-RL/readme.md → notes/sutton-barto-digest/readme.md b/...017-Course-Notes/Elements-Of-RL/readme.md → notes/sutton-barto-digest/readme.md
@@ -1,3 +1,5 @@
+<!-- status: hand-written | provenance: notes by Yad Konrad while taking the courses (2017); kept as written -->
+
 #### Elements Of Reinforcement Learning: (Derived from Barto and Sutton '17 and Li '17)
 
 * A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.