Commit df5d900

Improve explanations inspired by Karpathy's GPT lecture
- Add 'communication vs computation' framing for transformer blocks
- Add bigram baseline motivation for why attention is needed
- Add weighted averaging intuition for attention (simple avg → learned weights)
- Add sqrt(d) variance-based scaling explanation in PREREQUISITES
- Add inline sqrt(d) scaling note in README attention section
- Add direct link to Karpathy's GPT lecture video in Further Reading
- Fix parameter count: 5360/~5000 → 3648/~3600 (verified by running)
- Fix line count: ~600 → ~400 lines of code (849 total with comments)
- Fix sample output loss values to match actual defaults
- Update MLP framing to 'gathering context' vs 'reasoning about it'
1 parent f2a1343 commit df5d900

3 files changed: 17 additions & 13 deletions

PREREQUISITES.md

Lines changed: 2 additions & 1 deletion
@@ -555,7 +555,7 @@ The attention score between two tokens is the dot product of one token's Q and a
 
 $$\text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d}}$$
 
-The division by $\sqrt{d}$ (where $d$ is the vector dimension) keeps the scores from getting too large, which would make softmax output nearly one-hot (too confident about one token).
+The division by $\sqrt{d}$ (where $d$ is the vector dimension) is crucial and has a precise mathematical reason. When Q and K vectors have entries with roughly unit variance (which they do at initialization), their dot product is a sum of $d$ random products — so the variance of the dot product grows proportionally to $d$. For large $d$, the raw scores can become very large in magnitude, which pushes softmax toward a one-hot distribution (all weight on one token, everything else near zero). Dividing by $\sqrt{d}$ rescales the scores back to unit variance, keeping softmax "diffuse" enough to attend to multiple tokens. Without this scaling, the model starts training with saturated attention patterns and learns very slowly.
 
 These scores are passed through softmax to get weights that sum to 1. The output is a weighted sum of all V vectors.
 
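To make the variance argument in the added paragraph concrete, here is a small standalone C# sketch. It is an editorial illustration, not part of the commit, and the class and variable names are invented; it estimates the variance of raw and $\sqrt{d}$-scaled dot products when Q and K entries have unit variance.

```csharp
using System;

// Illustrative only: Var(Q·K) for unit-variance entries grows like d,
// and dividing the score by sqrt(d) restores a variance of roughly 1.
class ScaledDotProductDemo
{
    // Box-Muller transform: one sample from a standard normal distribution.
    static double Gaussian(Random rng) =>
        Math.Sqrt(-2.0 * Math.Log(1.0 - rng.NextDouble())) *
        Math.Cos(2.0 * Math.PI * rng.NextDouble());

    static void Main()
    {
        var rng = new Random(42);
        int d = 64, trials = 10_000;
        double rawVar = 0, scaledVar = 0;
        for (int t = 0; t < trials; t++)
        {
            double dot = 0;
            for (int i = 0; i < d; i++)
                dot += Gaussian(rng) * Gaussian(rng); // one term of the Q·K dot product
            rawVar += dot * dot / trials;             // E[score^2], mean is ~0
            double scaled = dot / Math.Sqrt(d);
            scaledVar += scaled * scaled / trials;
        }
        Console.WriteLine($"raw score variance:    {rawVar:F1}");    // ~ 64, i.e. ~ d
        Console.WriteLine($"scaled score variance: {scaledVar:F1}"); // ~ 1
    }
}
```

A raw variance near $d$ means typical gaps between scores of several units, which is what pushes softmax toward one-hot; after scaling, the gaps shrink back to order one and softmax stays diffuse.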
@@ -729,6 +729,7 @@ Here's the complete picture of what happens when you run MicroGPT:
 
 **Neural networks from scratch:**
 - [3Blue1Brown — Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) — Excellent visual introduction
+- [Karpathy — "Let's build GPT from scratch"](https://www.youtube.com/watch?v=kCc8FmEb1nY) — The specific video lecture this project builds on
 - [Karpathy — Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Complete video series building up to GPT from scratch
 
 **Transformers and GPT:**

README.md

Lines changed: 14 additions & 11 deletions
@@ -6,7 +6,7 @@ Faithful port of [Andrej Karpathy's microgpt.py](https://gist.github.com/karpath
 
 ## What is this?
 
-This is the exact same algorithm that powers ChatGPT, in ~600 lines of C# across 4 files. No PyTorch, no TensorFlow, no NuGet packages. Just plain C# and math.
+This is the exact same algorithm that powers ChatGPT, in ~400 lines of code across 4 files (plus extensive comments explaining every piece). No PyTorch, no TensorFlow, no NuGet packages. Just plain C# and math.
 
 It trains a tiny GPT model on a list of human names, then generates new ones that sound real but never existed.
 
@@ -52,9 +52,9 @@ dotnet run -- --n_embd 32 --n_layer 2 --num_steps 2000
 
 ```
 vocab size: 28, num docs: 32033
-num params: 5360
-step 1 / 1000 | loss 3.4507
-step 2 / 1000 | loss 3.3121
+num params: 3648
+step 1 / 1000 | loss 3.3327
+step 2 / 1000 | loss 3.3090
 ...
 step 1000 / 1000 | loss 2.1844
 
@@ -103,29 +103,31 @@ Tokens with similar roles end up with similar vectors. The model also has **posi
 
 ### Step 3: The Transformer Layer
 
-Each layer has two sub-blocks: **Attention** and **MLP**.
+Each layer has two sub-blocks: **Attention** and **MLP** — or as Andrej Karpathy puts it: **communication** followed by **computation**. Attention gathers information from other tokens (communication), and the MLP processes that information (computation).
 
 #### Attention: "Which past tokens matter right now?"
 
-This is the key innovation behind GPT. For each token, the model looks at every previous token and decides which ones are relevant.
+**Why it's needed:** Without attention, each token is processed in complete isolation — the model at position 5 has no idea what tokens appeared at positions 1–4. This is the baseline "bigram" approach: each token independently predicts the next one using only a lookup table. It works, but poorly — it can learn that 'e' is often followed by 'n', but can't learn that 'e' after 'Emm' should be followed by 'a'.
+
+Attention solves this by letting each token look at all previous tokens and decide which ones are relevant. The insight is that this is really just a **weighted average**. Start with the simplest version: average all past token vectors equally. Better: weight them so recent tokens matter more. Best: let the model *learn* which tokens matter based on their content. That's what Q/K/V attention does — it computes data-dependent weights for this average.
 
 It works through three projections per token:
 
 - **Query (Q):** "What am I looking for?"
 - **Key (K):** "What do I contain?"
 - **Value (V):** "What information do I offer if selected?"
 
-The model computes a score between the current token's Query and every past token's Key (via dot product). High score = that past token is relevant. These scores become weights (via softmax), and the output is a weighted blend of all the Value vectors.
+The model computes a score between the current token's Query and every past token's Key (via dot product). High score = that past token is relevant. These scores are divided by $\sqrt{d}$ to keep them in a stable range (without this, large dimensions cause softmax to collapse to a one-hot distribution). The scaled scores become weights (via softmax), and the output is a weighted blend of all the Value vectors.
 
 **Multi-head attention** splits this into parallel "heads" — each head can learn different patterns. One might track consonant sequences, another might focus on name length.
 
 **Causality** is enforced for free in this implementation. Since tokens are processed one at a time and the KV cache only contains past tokens, the model can never look at the future.
 
 #### MLP: "Now think about it."
 
-After attention gathers information, the MLP processes it. It expands the vector to 4x width (giving the model more "thinking space"), applies an activation function (squared ReLU), then compresses back down.
+After attention gathers information (communication), the MLP processes it (computation). It expands the vector to 4x width (giving the model more "thinking space"), applies an activation function (squared ReLU), then compresses back down.
 
-Think of attention as "what info do I need?" and MLP as "what do I do with it?"
+Think of attention as "gathering relevant context" and MLP as "reasoning about it." Each token first collects information from other positions, then independently processes what it collected.
 
 #### Residual Connections
 
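To ground the weighted-average and communication/computation framing added in this hunk, here is a hedged C# sketch of one attention head followed by the 4x MLP for a single token. The class, method, and parameter names are invented for illustration and are not the repository's actual types.

```csharp
using System;
using System.Linq;

// Illustrative sketch of "communication then computation" for one token.
// All names here are hypothetical; they are not the repository's classes.
static class TransformerBlockSketch
{
    // Communication: attention is a data-dependent weighted average of past values.
    //   query:  Q vector of the current token
    //   keys:   K vectors of the current and all previous tokens (the KV cache)
    //   values: V vectors for the same positions
    public static double[] Attend(double[] query, double[][] keys, double[][] values)
    {
        int d = query.Length;

        // 1. Score every cached position: dot(Q, K_j) / sqrt(d).
        double[] scores = keys.Select(k => Dot(query, k) / Math.Sqrt(d)).ToArray();

        // 2. Softmax turns scores into weights that sum to 1 (subtract max for stability).
        double max = scores.Max();
        double[] exp = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exp.Sum();
        double[] weights = exp.Select(e => e / sum).ToArray();

        // 3. Output is the weighted average of the value vectors.
        //    Causality is implicit: the cache never contains future tokens.
        var output = new double[values[0].Length];
        for (int j = 0; j < values.Length; j++)
            for (int i = 0; i < output.Length; i++)
                output[i] += weights[j] * values[j][i];
        return output;
    }

    // Computation: expand to 4x width, apply squared ReLU, compress back down.
    public static double[] Mlp(double[] x, double[][] wUp, double[][] wDown)
    {
        double[] hidden = MatVec(wUp, x);           // d -> 4d
        for (int i = 0; i < hidden.Length; i++)
        {
            double relu = Math.Max(0.0, hidden[i]);
            hidden[i] = relu * relu;                // squared ReLU
        }
        return MatVec(wDown, hidden);               // 4d -> d
    }

    static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

    static double[] MatVec(double[][] m, double[] v) => m.Select(row => Dot(row, v)).ToArray();
}
```

Because `keys` and `values` stand in for a KV cache that only ever holds past tokens, no explicit causal mask is needed, which matches the causality note in the diff above.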
@@ -182,7 +184,7 @@ This is called **autoregressive generation**. It's how every GPT model generates
 
 | | MicroGPT | GPT-4 |
 |---|---|---|
-| Parameters | ~5,000 | ~1,800,000,000,000 |
+| Parameters | ~3,600 | ~1,800,000,000,000 |
 | Token type | Characters | Word pieces (~100K vocab) |
 | Context window | 8 tokens | ~128,000 tokens |
 | Training data | 32K names | Trillions of words |
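Since the hunk context above refers to autoregressive generation, a minimal sketch of that loop may help. Everything here is a placeholder: `nextTokenLogits` stands in for the model's forward pass, and the token parameters and class name are not the repository's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of autoregressive sampling: ask the model for logits,
// turn them into probabilities with softmax, sample one token, append, repeat.
static class GenerationSketch
{
    public static List<int> Generate(
        Func<IReadOnlyList<int>, double[]> nextTokenLogits, // placeholder forward pass
        int startToken, int endToken, int maxLength, Random rng)
    {
        var tokens = new List<int> { startToken };
        while (tokens.Count < maxLength)
        {
            double[] logits = nextTokenLogits(tokens);
            double max = logits.Max();                       // stabilize the exponentials
            double[] unnorm = logits.Select(l => Math.Exp(l - max)).ToArray();
            double sum = unnorm.Sum();

            // Sample an index proportional to its softmax probability.
            double r = rng.NextDouble() * sum, acc = 0;
            int next = unnorm.Length - 1;
            for (int i = 0; i < unnorm.Length; i++)
            {
                acc += unnorm[i];
                if (r <= acc) { next = i; break; }
            }

            if (next == endToken) break;                     // the generated name is complete
            tokens.Add(next);
        }
        return tokens;
    }
}
```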
@@ -233,7 +235,8 @@ This implementation uses more modern design choices (closer to LLaMA):
 
 - [Karpathy's original microgpt.py](https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95) — The Python source
 - [Karpathy's micrograd](https://github.com/karpathy/micrograd) — The autograd engine this builds on
-- [Karpathy's Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Video series building up to GPT from scratch
+- [Karpathy's "Let's build GPT from scratch"](https://www.youtube.com/watch?v=kCc8FmEb1nY) — The specific video lecture this project builds on
+- [Karpathy's Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Full video series building up to GPT from scratch
 - [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762) — The original transformer paper
 - [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) — The GPT-2 paper
 - [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) — Visual explanation of the architecture

src/AutogradEngine/Program.cs

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 //
 // This is the exact same algorithm behind ChatGPT, stripped to its essence.
 // Real GPTs have billions of parameters and run on GPU clusters.
-// This one has ~5,000 parameters and runs on a single CPU thread.
+// This one has ~3,600 parameters and runs on a single CPU thread.
 // But every conceptual piece is here. Everything else is "just" optimization.
 //
 // What does it do? It learns to generate fake human names by reading real ones.
