Commit f2a1343

Improve PREREQUISITES.md for math-averse readers

- Add reading guide callout in intro (skip formulas, focus on analogies)
- Add speedometer analogy and notation note for derivatives (§7)
- Add notation explainer and currency exchange analogy for chain rule (§8)
- Add sigma notation note in softmax section (§19)
- Add intuition callout before Adam optimizer formulas (§17)
- Add plain English paraphrase for RMSNorm (§26)
- Fix duplicate orphan header in Adam section
- Fix token ID for 'm' (15 → 14) in embeddings section
- Fix ambiguous vocab count in summary (26 letters + BOS + EOS = 28)
1 parent 3cd53c9 commit f2a1343

1 file changed: PREREQUISITES.md

Lines changed: 19 additions & 5 deletions
```diff
@@ -1,7 +1,9 @@
-# Prerequisites — What You Need to Know
+# Prerequisites — What you need to know
 
 Everything you need to understand MicroGPT, starting from scratch. No prior machine learning experience required. We assume basic secondary school math and build up from there.
 
+> **📖 How to read this document:** You'll see math formulas throughout, but **you don't need to memorize or even fully understand every formula.** Focus on the **analogies and "Why it matters" sections** — those carry the core intuition. The formulas are here so you can reference them when reading the code. If a formula feels overwhelming, skip it and keep reading — the next paragraph usually explains the same idea in plain English.
+
 ---
 
 ## Table of Contents
```
```diff
@@ -160,10 +162,14 @@ Total = 1.0
 
 The **derivative** answers: "If I nudge the input a tiny bit, how much does the output change?"
 
+**Everyday example:** Think of a car's speedometer. Your position is a function of time. The derivative of your position is your *speed* — how fast your position is changing. If you're parked, the derivative is 0. If you're on the highway, it's large. The derivative doesn't tell you *where* you are; it tells you *how quickly you're moving*.
+
 For $f(x) = x^2$:
 - At $x = 3$: $f(3) = 9$. If we nudge to $x = 3.001$: $f(3.001) = 9.006001$. The output changed by about $0.006$ for a $0.001$ nudge — that's a rate of $6$. The derivative is $f'(3) = 6$.
 - Formula: $f'(x) = 2x$. At $x = 3$: $f'(3) = 2 \times 3 = 6$. ✓
 
+> **Notation note:** $f'(x)$ is read as "f prime of x" — it just means "the derivative of $f$." When you see the prime mark ('), think "rate of change."
+
 **Common derivative rules you'll see in the code:**
 
 | Function | Derivative | In plain English |
```
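The nudge experiment in the hunk above can be reproduced in a few lines. This sketch is illustrative and not part of the patch; the function names are made up for the example.

```python
# Numerical derivative check for f(x) = x^2, matching the nudge example above.
def f(x):
    return x * x

def numerical_derivative(f, x, h=1e-3):
    # Nudge the input by h and measure how much the output changes.
    return (f(x + h) - f(x)) / h

rate = numerical_derivative(f, 3.0)  # close to 6, as in the worked example
exact = 2 * 3.0                      # the analytic rule f'(x) = 2x gives exactly 6
print(rate, exact)
```

Shrinking `h` makes the nudge estimate converge toward the analytic answer.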
```diff
@@ -186,6 +192,10 @@ What if functions are **chained** together? If $y = f(g(x))$ — that is, first
 
 $$\frac{dy}{dx} = \frac{dy}{dg} \times \frac{dg}{dx}$$
 
+> **Notation note:** $\frac{dy}{dx}$ is read as "the derivative of $y$ with respect to $x$" — it means "how much does $y$ change when I nudge $x$?" Think of it as a fraction: change in output divided by change in input.
+
+**Real-world analogy:** Suppose you earn euros, your friend converts euros to dollars, and then converts dollars to yen. If 1 euro = 1.1 dollars, and 1 dollar = 150 yen, then 1 euro = $1.1 \times 150 = 165$ yen. You just **multiplied the conversion rates**. The chain rule works the same way — multiply the rates of change at each step.
+
 **Concrete example:**
 
 Let $g(x) = 3x$ and $f(g) = g^2$. So $y = (3x)^2$.
```
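The "multiply the rates" idea from the hunk above can be checked numerically. A minimal sketch, not taken from the patch, using the same $g(x) = 3x$ and $f(g) = g^2$ example:

```python
# Chain rule as multiplied rates: y = f(g(x)) with g(x) = 3x and f(g) = g^2.
def g(x):
    return 3 * x

def f(u):
    return u * u

def dy_dx(x):
    # dy/dg = 2g and dg/dx = 3; multiply the rates like currency conversions.
    return 2 * g(x) * 3

# Numerical check: nudge x and watch y = (3x)^2 change.
h = 1e-6
x = 2.0
numeric = (f(g(x + h)) - f(g(x))) / h
print(dy_dx(x), numeric)  # both close to 36
```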
```diff
@@ -367,6 +377,8 @@ Where $t$ is the current step and $T$ is the total number of steps.
 
 **Adam** (Adaptive Moment Estimation) is a smarter version of gradient descent. Plain gradient descent uses the raw gradient directly. Adam improves on this in two ways:
 
+> **If the formulas below feel overwhelming, here's the key intuition:** Adam is like a ball rolling downhill that (1) builds up speed when it keeps going the same direction (momentum), and (2) takes smaller steps on steep slopes and larger steps on gentle ones (adaptive rate). That's it — the formulas below are how this is computed, but the idea is just "smart rolling ball."
+
 **1. Momentum (first moment, $m$)**
 
 Instead of using just the current gradient, Adam keeps a running average of recent gradients:
```
```diff
@@ -383,8 +395,6 @@ $$v_t = 0.95 \times v_{t-1} + 0.05 \times g_t^2$$
 
 Parameters with consistently large gradients get smaller updates. Parameters with small, noisy gradients get larger updates. Each parameter effectively gets its own tuned learning rate.
 
-**The update rule:**
-
 **3. Bias correction**
 
 Since $m$ and $v$ are initialized to zero, they're biased toward zero during early steps. Adam corrects this:
```
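The Adam pieces touched by the two hunks above (momentum, second moment, bias correction) fit in one small function. This is a sketch for a single scalar parameter, using the decay rates visible in the diff (0.9 and 0.95); the learning rate and parameter values are illustrative assumptions, not taken from the patch.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.95, eps=1e-8):
    # Momentum: running average of recent gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: running average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: m and v start at zero, so early averages are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive update: large v_hat (steep slope) means a smaller step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=0.5, m=m, v=v, t=1)
print(p)  # slightly below 1.0: the parameter moved against the gradient
```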
```diff
@@ -437,6 +447,8 @@ The squaring makes it smoother and more selective — it emphasizes larger value
 
 $$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
 
+> **Notation note:** The symbol $\sum$ (capital sigma) means "add them all up." So $\sum_j e^{x_j}$ means "compute $e^x$ for every element, then sum all the results." It's just shorthand for "total."
+
 **Step by step example:**
 
 Logits: $[2.0, 1.0, 0.1]$
```
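The formula and the sigma note in the hunk above translate almost directly to code. A sketch (not from the patch) using the same example logits; the max-subtraction is a standard stability trick and an addition of this example, not something the diff shows:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)  # the capital-sigma sum in the denominator
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # roughly [0.659, 0.242, 0.099]
# The outputs are valid probabilities: non-negative and summing to ~1.
```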
````diff
@@ -487,7 +499,7 @@ An **embedding** is a learned vector (list of numbers) that represents a token.
 ```
 Token 'a' (ID 2) → [0.12, -0.34, 0.56, 0.78, ...] (16 numbers)
 Token 'b' (ID 3) → [-0.23, 0.45, 0.01, -0.67, ...] (16 numbers)
-Token 'm' (ID 15) → [0.89, 0.11, -0.44, 0.33, ...] (16 numbers)
+Token 'm' (ID 14) → [0.89, 0.11, -0.44, 0.33, ...] (16 numbers)
 ```
 
 These vectors are **not designed by a human.** They start as random numbers and are adjusted during training. After training, tokens with similar roles develop similar vectors. For example, vowels might cluster together in this vector space.
````
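The table described in the hunk above is just a list of rows, one per token ID, and "embedding" a token is a row lookup. A toy sketch (not the project's code); the seed and the uniform init range are illustrative assumptions, while the sizes (28 tokens, 16 numbers) come from the document:

```python
import random

random.seed(0)
VOCAB_SIZE, EMB_DIM = 28, 16  # 26 letters + BOS + EOS; 16 numbers per token

# Start from random numbers, as the document says; training would adjust them.
embedding_table = [[random.uniform(-1, 1) for _ in range(EMB_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_id):
    # Embedding lookup is just indexing a row of the table.
    return embedding_table[token_id]

vec = embed(14)   # token 'm' has ID 14 in this commit's numbering
print(len(vec))   # 16
```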
```diff
@@ -605,6 +617,8 @@ As data flows through many layers, values can grow very large or shrink to near
 1. Compute the average squared value: $\text{ms} = \frac{1}{n}\sum x_i^2$
 2. Scale all values so the average squared magnitude is ~1: $\hat{x}_i = \frac{x_i}{\sqrt{\text{ms} + \epsilon}}$
 
+> In plain English: take all the numbers, square them, find the average, then divide every number by the square root of that average. This shrinks large values and magnifies small ones so they're all in a similar range.
+
 The $\epsilon$ (a tiny number like $10^{-5}$) prevents division by zero.
 
 **Analogy:** Think of it like an automatic volume control on a microphone. Whether someone whispers or shouts, the output level stays consistent.
```
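The two-step recipe in the hunk above is a few lines of code. A minimal sketch (not the project's implementation), using the $\epsilon = 10^{-5}$ value mentioned in the diff:

```python
import math

def rmsnorm(xs, eps=1e-5):
    # Step 1: average squared value.
    ms = sum(x * x for x in xs) / len(xs)
    # Step 2: divide every value by the root of that average;
    # eps guards against division by zero.
    scale = 1.0 / math.sqrt(ms + eps)
    return [x * scale for x in xs]

out = rmsnorm([2.0, -4.0, 6.0])
# After normalization the average squared value is ~1, whatever the input scale.
print(sum(x * x for x in out) / len(out))
```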
```diff
@@ -684,7 +698,7 @@ Here's the complete picture of what happens when you run MicroGPT:
 
 **Setup:**
 1. Download 32,000 human names
-2. Build a tokenizer (28 characters + BOS + EOS)
+2. Build a tokenizer (26 letters + BOS + EOS = 28 tokens)
 3. Initialize random weight matrices (3,648 numbers with default settings)
 
 **Training (1,000 steps):**
```
