Improve PREREQUISITES.md for math-averse readers

- Add reading guide callout in intro (skip formulas, focus on analogies)
- Add speedometer analogy and notation note for derivatives (§7)
- Add notation explainer and currency exchange analogy for chain rule (§8)
- Add sigma notation note in softmax section (§19)
- Add intuition callout before Adam optimizer formulas (§17)
- Add plain English paraphrase for RMSNorm (§26)
- Fix duplicate orphan header in Adam section
- Fix token ID for 'm' (15 → 14) in embeddings section
- Fix ambiguous vocab count in summary (26 letters + BOS + EOS = 28)
PREREQUISITES.md: 19 additions, 5 deletions
@@ -1,7 +1,9 @@
# Prerequisites — What you need to know
Everything you need to understand MicroGPT, starting from scratch. No prior machine learning experience required. We assume basic secondary school math and build up from there.
> **📖 How to read this document:** You'll see math formulas throughout, but **you don't need to memorize or even fully understand every formula.** Focus on the **analogies and "Why it matters" sections** — those carry the core intuition. The formulas are here so you can reference them when reading the code. If a formula feels overwhelming, skip it and keep reading — the next paragraph usually explains the same idea in plain English.
---
## Table of Contents
@@ -160,10 +162,14 @@ Total = 1.0
The **derivative** answers: "If I nudge the input a tiny bit, how much does the output change?"
**Everyday example:** Think of a car's speedometer. Your position is a function of time. The derivative of your position is your *speed* — how fast your position is changing. If you're parked, the derivative is 0. If you're on the highway, it's large. The derivative doesn't tell you *where* you are; it tells you *how quickly you're moving*.
For $f(x) = x^2$:
- At $x = 3$: $f(3) = 9$. If we nudge to $x = 3.001$: $f(3.001) = 9.006001$. The output changed by about $0.006$ for a $0.001$ nudge — that's a rate of $6$. The derivative is $f'(3) = 6$.
> **Notation note:** $f'(x)$ is read as "f prime of x" — it just means "the derivative of $f$." When you see the prime mark ('), think "rate of change."
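The nudge experiment above can be reproduced in a few lines. This is a minimal sketch (the helper name `numerical_derivative` is ours, for illustration, not part of MicroGPT):

```python
def numerical_derivative(f, x, h=1e-6):
    # Central difference: how much does f change per unit of nudge around x?
    return (f(x + h) - f(x - h)) / (2 * h)

square = lambda x: x ** 2
print(numerical_derivative(square, 3.0))  # ≈ 6.0, matching f'(3) = 6
```

This is exactly what the text does by hand with $x = 3.001$, just with a smaller nudge and the average of a forward and backward step for accuracy.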
**Common derivative rules you'll see in the code:**
| Function | Derivative | In plain English |
@@ -186,6 +192,10 @@ What if functions are **chained** together? If $y = f(g(x))$ — that is, first
> **Notation note:** $\frac{dy}{dx}$ is read as "the derivative of $y$ with respect to $x$" — it means "how much does $y$ change when I nudge $x$?" Think of it as a fraction: change in output divided by change in input.
**Real-world analogy:** Suppose you earn euros, your friend converts euros to dollars, and then converts dollars to yen. If 1 euro = 1.1 dollars, and 1 dollar = 150 yen, then 1 euro = $1.1 \times 150 = 165$ yen. You just **multiplied the conversion rates**. The chain rule works the same way — multiply the rates of change at each step.
**Concrete example:**
Let $g(x) = 3x$ and $f(g) = g^2$. So $y = (3x)^2$.
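Using those two functions, "multiply the rates" can be checked directly against a tiny nudge. A minimal sketch (the choice of $x = 2$ is ours, for illustration):

```python
def g(x):                # inner function; its rate of change dg/dx is 3 everywhere
    return 3 * x

def f(u):                # outer function; its rate of change df/du is 2u
    return u ** 2

x = 2.0
df_du = 2 * g(x)         # outer rate, evaluated at u = g(2) = 6, so 12.0
du_dx = 3.0              # inner rate
dy_dx = df_du * du_dx    # chain rule: multiply the rates
print(dy_dx)             # 36.0 (matches d/dx of 9x^2 = 18x at x = 2)

# Sanity check with a tiny nudge, like the derivative section did by hand
h = 1e-6
numeric = (f(g(x + h)) - f(g(x))) / h
```

The nudge-based estimate agrees with the chain-rule answer to several decimal places.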
@@ -367,6 +377,8 @@ Where $t$ is the current step and $T$ is the total number of steps.
**Adam** (Adaptive Moment Estimation) is a smarter version of gradient descent. Plain gradient descent uses the raw gradient directly. Adam improves on this in two ways:
> **If the formulas below feel overwhelming, here's the key intuition:** Adam is like a ball rolling downhill that (1) builds up speed when it keeps going the same direction (momentum), and (2) takes smaller steps on steep slopes and larger steps on gentle ones (adaptive rate). That's it — the formulas below are how this is computed, but the idea is just "smart rolling ball."
**1. Momentum (first moment, $m$)**
Instead of using just the current gradient, Adam keeps a running average of recent gradients:
Parameters with consistently large gradients get smaller updates. Parameters with small, noisy gradients get larger updates. Each parameter effectively gets its own tuned learning rate.
**3. Bias correction**
Since $m$ and $v$ are initialized to zero, they're biased toward zero during early steps. Adam corrects this:
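Putting the three ingredients together, here is a minimal sketch of one Adam step for a single scalar parameter, using the standard Adam update with its usual defaults (the function name and signature are ours, not MicroGPT's):

```python
import math

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for scalar parameter p at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # 1. momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # 2. adaptive scale: running avg of squared gradients
    m_hat = m / (1 - b1 ** t)           # 3. bias correction for the zero
    v_hat = v / (1 - b2 ** t)           #    initialization of m and v
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # the "smart rolling ball" step
    return p, m, v
```

At $t = 1$ the correction divides by $1 - 0.9 = 0.1$, so `m_hat` recovers the full gradient instead of a tenth of it; by large $t$ the correction fades to nothing.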
@@ -437,6 +447,8 @@ The squaring makes it smoother and more selective — it emphasizes larger value
> **Notation note:** The symbol $\sum$ (capital sigma) means "add them all up." So $\sum_j e^{x_j}$ means "compute $e^x$ for every element, then sum all the results." It's just shorthand for "total."
**Step by step example:**
Logits: $[2.0, 1.0, 0.1]$
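The arithmetic on these logits can be reproduced with a small sketch in plain Python (illustrative, not MicroGPT's actual implementation; subtracting the max is a standard trick that keeps `exp` from overflowing and does not change the result):

```python
import math

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]  # e^x for every element
    total = sum(exps)                          # the sigma: add them all up
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The outputs are positive, sum to 1, and keep the ordering of the logits, which is what makes them usable as probabilities.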
@@ -487,7 +499,7 @@ An **embedding** is a learned vector (list of numbers) that represents a token.
These vectors are **not designed by a human.** They start as random numbers and are adjusted during training. After training, tokens with similar roles develop similar vectors. For example, vowels might cluster together in this vector space.
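As a rough sketch of what "start as random numbers" looks like in code (the sizes, ranges, and names here are illustrative; the 28-token vocabulary and token ID 14 for 'm' follow this document's tokenizer):

```python
import random

random.seed(0)
vocab_size, dim = 28, 4  # 28 tokens; a small 4-dim vector per token for illustration

# The embedding table: one random vector per token, adjusted during training
embedding = [[random.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(vocab_size)]

token_id = 14                 # the ID this document assigns to 'm'
vector = embedding[token_id]  # an embedding "lookup" is just row indexing
```

Training nudges these rows via gradients like any other weights, which is how similar tokens drift toward similar vectors.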
@@ -605,6 +617,8 @@ As data flows through many layers, values can grow very large or shrink to near
1. Compute the average squared value: $\text{ms} = \frac{1}{n}\sum x_i^2$
2. Scale all values so the average squared magnitude is ~1: $\hat{x}_i = \frac{x_i}{\sqrt{\text{ms} + \epsilon}}$
> In plain English: take all the numbers, square them, find the average, then divide every number by the square root of that average. This shrinks large values and magnifies small ones so they're all in a similar range.
The $\epsilon$ (a tiny number like $10^{-5}$) prevents division by zero.
**Analogy:** Think of it like an automatic volume control on a microphone. Whether someone whispers or shouts, the output level stays consistent.
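The two steps above fit in a few lines. A minimal sketch (the function name is ours; MicroGPT's real version also has a learned scale):

```python
import math

def rmsnorm(x, eps=1e-5):
    ms = sum(v * v for v in x) / len(x)  # step 1: average squared value
    scale = math.sqrt(ms + eps)          # root-mean-square; eps avoids dividing by zero
    return [v / scale for v in x]        # step 2: divide every value by it

out = rmsnorm([3.0, -4.0])  # ms = (9 + 16) / 2 = 12.5
# out ≈ [0.849, -1.131]: same directions, but average squared magnitude ≈ 1
```

Whatever the input scale, the output's mean squared value lands near 1, which is the "automatic volume control" in action.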
@@ -684,7 +698,7 @@ Here's the complete picture of what happens when you run MicroGPT:
**Setup:**
1. Download 32,000 human names
2. Build a tokenizer (26 letters + BOS + EOS = 28 tokens)
3. Initialize random weight matrices (3,648 numbers with default settings)