Improve explanations inspired by Karpathy's GPT lecture
- Add 'communication vs computation' framing for transformer blocks
- Add bigram baseline motivation for why attention is needed
- Add weighted averaging intuition for attention (simple avg → learned weights)
- Add sqrt(d) variance-based scaling explanation in PREREQUISITES
- Add inline sqrt(d) scaling note in README attention section
- Add direct link to Karpathy's GPT lecture video in Further Reading
- Fix parameter count: 5360/~5000 → 3648/~3600 (verified by running)
- Fix line count: ~600 → ~400 lines of code (849 total with comments)
- Fix sample output loss values to match actual defaults
- Update MLP framing to 'gathering context' vs 'reasoning about it'
-The division by $\sqrt{d}$ (where $d$ is the vector dimension) keeps the scores from getting too large, which would make softmax output nearly one-hot (too confident about one token).
+The division by $\sqrt{d}$ (where $d$ is the vector dimension) is crucial and has a precise mathematical reason. When Q and K vectors have entries with roughly unit variance (which they do at initialization), their dot product is a sum of $d$ random products — so the variance of the dot product grows proportionally to $d$. For large $d$, the raw scores can become very large in magnitude, which pushes softmax toward a one-hot distribution (all weight on one token, everything else near zero). Dividing by $\sqrt{d}$ rescales the scores back to unit variance, keeping softmax "diffuse" enough to attend to multiple tokens. Without this scaling, the model starts training with saturated attention patterns and learns very slowly.

These scores are passed through softmax to get weights that sum to 1. The output is a weighted sum of all V vectors.
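To see the effect numerically, here is a small self-contained C# demo (illustrative only, not code from this repository). It scores one random query against a few random keys, with and without the division by $\sqrt{d}$, and prints the two softmax distributions; the unscaled one is typically close to one-hot.

```csharp
using System;
using System.Linq;

// Standalone demo of why attention scores are divided by sqrt(d).
// Entries of Q and K are sampled with roughly unit variance, as at initialization.
class SqrtDScalingDemo
{
    static void Main()
    {
        var rng = new Random(42);
        int d = 256;      // vector dimension
        int context = 8;  // number of tokens being scored

        double[] q = Gaussian(rng, d);
        double[][] keys = Enumerable.Range(0, context).Select(_ => Gaussian(rng, d)).ToArray();

        // Raw dot products have variance ~ d, so their typical size here is ~ sqrt(256) = 16.
        double[] raw = keys.Select(k => Dot(q, k)).ToArray();
        double[] scaled = raw.Select(s => s / Math.Sqrt(d)).ToArray();

        // The raw scores usually produce a near one-hot distribution; the scaled ones stay diffuse.
        Console.WriteLine("softmax(raw):    " + string.Join(" ", Softmax(raw).Select(w => w.ToString("0.000"))));
        Console.WriteLine("softmax(scaled): " + string.Join(" ", Softmax(scaled).Select(w => w.ToString("0.000"))));
    }

    // Standard normal samples via the Box-Muller transform.
    static double[] Gaussian(Random rng, int n) =>
        Enumerable.Range(0, n)
                  .Select(_ => Math.Sqrt(-2.0 * Math.Log(1.0 - rng.NextDouble())) *
                               Math.Cos(2.0 * Math.PI * rng.NextDouble()))
                  .ToArray();

    static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

    static double[] Softmax(double[] scores)
    {
        double max = scores.Max();  // subtract the max for numerical stability
        double[] exps = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
```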
561
561
@@ -729,6 +729,7 @@ Here's the complete picture of what happens when you run MicroGPT:
-[Karpathy — "Let's build GPT from scratch"](https://www.youtube.com/watch?v=kCc8FmEb1nY) — The specific video lecture this project builds on
-[Karpathy — Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Complete video series building up to GPT from scratch
README.md
@@ -6,7 +6,7 @@ Faithful port of [Andrej Karpathy's microgpt.py](https://gist.github.com/karpath
## What is this?
-This is the exact same algorithm that powers ChatGPT, in ~600 lines of C# across 4 files. No PyTorch, no TensorFlow, no NuGet packages. Just plain C# and math.
+This is the exact same algorithm that powers ChatGPT, in ~400 lines of code across 4 files (plus extensive comments explaining every piece). No PyTorch, no TensorFlow, no NuGet packages. Just plain C# and math.
It trains a tiny GPT model on a list of human names, then generates new ones that sound real but never existed.
@@ -103,29 +103,31 @@ Tokens with similar roles end up with similar vectors. The model also has **posi
### Step 3: The Transformer Layer
-Each layer has two sub-blocks: **Attention** and **MLP**.
+Each layer has two sub-blocks: **Attention** and **MLP** — or as Andrej Karpathy puts it: **communication** followed by **computation**. Attention gathers information from other tokens (communication), and the MLP processes that information (computation).
#### Attention: "Which past tokens matter right now?"
-This is the key innovation behind GPT. For each token, the model looks at every previous token and decides which ones are relevant.
+**Why it's needed:** Without attention, each token is processed in complete isolation — the model at position 5 has no idea what tokens appeared at positions 1–4. This is the baseline "bigram" approach: each token independently predicts the next one using only a lookup table. It works, but poorly — it can learn that 'e' is often followed by 'n', but can't learn that 'e' after 'Emm' should be followed by 'a'.
+
+Attention solves this by letting each token look at all previous tokens and decide which ones are relevant. The insight is that this is really just a **weighted average**. Start with the simplest version: average all past token vectors equally. Better: weight them so recent tokens matter more. Best: let the model *learn* which tokens matter based on their content. That's what Q/K/V attention does — it computes data-dependent weights for this average.
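As a toy illustration of that progression (a standalone sketch with made-up vectors, not the model's real code), the snippet below blends the same past token vectors with uniform weights and with a fixed recency bias. Real attention keeps exactly this blending loop but computes the weights from the tokens' contents.

```csharp
using System;

// Toy illustration of "attention output = weighted average of past token vectors".
// The vectors and the weight choices are made up for illustration.
class WeightedAverageIntuition
{
    static void Main()
    {
        double[][] past =
        {
            new[] { 1.0, 0.0 },  // token 1
            new[] { 0.0, 1.0 },  // token 2
            new[] { 1.0, 1.0 },  // token 3 (most recent)
        };

        // Simplest: every past token counts equally.
        double[] uniform = Blend(past, new[] { 1.0 / 3, 1.0 / 3, 1.0 / 3 });

        // Better: a fixed recency bias, so newer tokens count more.
        double[] recency = Blend(past, new[] { 0.1, 0.3, 0.6 });

        // Best (real attention): the same Blend loop, but with weights computed from
        // the tokens' contents via Q·K scores and softmax, so they depend on the input.

        Console.WriteLine($"uniform: [{string.Join(", ", uniform)}]");
        Console.WriteLine($"recency: [{string.Join(", ", recency)}]");
    }

    // Weighted sum of vectors; the weights are assumed to sum to 1.
    static double[] Blend(double[][] vectors, double[] weights)
    {
        var result = new double[vectors[0].Length];
        for (int i = 0; i < vectors.Length; i++)
            for (int j = 0; j < result.Length; j++)
                result[j] += weights[i] * vectors[i][j];
        return result;
    }
}
```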
It works through three projections per token:
-**Query (Q):** "What am I looking for?"
-**Key (K):** "What do I contain?"
-**Value (V):** "What information do I offer if selected?"
-The model computes a score between the current token's Query and every past token's Key (via dot product). High score = that past token is relevant. These scores become weights (via softmax), and the output is a weighted blend of all the Value vectors.
+The model computes a score between the current token's Query and every past token's Key (via dot product). High score = that past token is relevant. These scores are divided by $\sqrt{d}$ to keep them in a stable range (without this, large dimensions cause softmax to collapse to a one-hot distribution). The scaled scores become weights (via softmax), and the output is a weighted blend of all the Value vectors.
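Put together, one head's work for a single token looks roughly like the sketch below. This is a minimal illustration, not the repository's actual API: the name `AttendOneToken` is hypothetical, and it assumes the Q/K/V projections have already produced the query for the current token and the keys and values for every visible token.

```csharp
using System;
using System.Linq;

// Minimal sketch of one attention head handling the newest token. The name
// AttendOneToken and the raw double[] representation are illustrative only.
static class AttentionSketch
{
    // query: Q vector of the current token (length d).
    // keys/values: K and V vectors of every token visible so far (including the current one).
    public static double[] AttendOneToken(double[] query, double[][] keys, double[][] values)
    {
        int d = query.Length;
        int n = keys.Length;

        // 1. Score the Query against every Key, divided by sqrt(d) to keep the variance in check.
        var scores = new double[n];
        for (int t = 0; t < n; t++)
        {
            double dot = 0.0;
            for (int j = 0; j < d; j++) dot += query[j] * keys[t][j];
            scores[t] = dot / Math.Sqrt(d);
        }

        // 2. Softmax turns the scores into weights that sum to 1.
        double max = scores.Max();
        var weights = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = weights.Sum();

        // 3. The output is the weighted blend of the Value vectors.
        var output = new double[values[0].Length];
        for (int t = 0; t < n; t++)
            for (int j = 0; j < output.Length; j++)
                output[j] += (weights[t] / sum) * values[t][j];
        return output;
    }
}
```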
**Multi-head attention** splits this into parallel "heads" — each head can learn different patterns. One might track consonant sequences, another might focus on name length.
**Causality** is enforced for free in this implementation. Since tokens are processed one at a time and the KV cache only contains past tokens, the model can never look at the future.
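A small sketch of why the KV cache gives causal masking for free (again illustrative; the class name is hypothetical and it reuses `AttendOneToken` from the sketch above): keys and values are appended one position at a time, so the newest token can only ever attend over itself and earlier positions.

```csharp
using System.Collections.Generic;

// Hypothetical per-head KV cache (not the repository's actual types). Keys and Values are
// appended one position at a time, so a query can only ever be scored against itself and
// earlier positions: the future is simply not in the cache yet.
class KvCacheSketch
{
    private readonly List<double[]> _keys = new();
    private readonly List<double[]> _values = new();

    // Called once per token, in generation order. Reuses AttendOneToken from the sketch above.
    public double[] Step(double[] query, double[] key, double[] value)
    {
        _keys.Add(key);
        _values.Add(value);
        return AttentionSketch.AttendOneToken(query, _keys.ToArray(), _values.ToArray());
    }
}
```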
#### MLP: "Now think about it."
-After attention gathers information, the MLP processes it. It expands the vector to 4x width (giving the model more "thinking space"), applies an activation function (squared ReLU), then compresses back down.
+After attention gathers information (communication), the MLP processes it (computation). It expands the vector to 4x width (giving the model more "thinking space"), applies an activation function (squared ReLU), then compresses back down.
-Think of attention as "what info do I need?" and MLP as "what do I do with it?"
+Think of attention as "gathering relevant context" and MLP as "reasoning about it." Each token first collects information from other positions, then independently processes what it collected.
#### Residual Connections
@@ -182,7 +184,7 @@ This is called **autoregressive generation**. It's how every GPT model generates
|| MicroGPT | GPT-4 |
|---|---|---|
-| Parameters |~5,000|~1,800,000,000,000 |
+| Parameters |~3,600|~1,800,000,000,000 |
| Token type | Characters | Word pieces (~100K vocab) |
| Context window | 8 tokens |~128,000 tokens |
| Training data | 32K names | Trillions of words |
@@ -233,7 +235,8 @@ This implementation uses more modern design choices (closer to LLaMA):
-[Karpathy's original microgpt.py](https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95) — The Python source
-[Karpathy's micrograd](https://github.com/karpathy/micrograd) — The autograd engine this builds on
--[Karpathy's Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Video series building up to GPT from scratch
+-[Karpathy's "Let's build GPT from scratch"](https://www.youtube.com/watch?v=kCc8FmEb1nY) — The specific video lecture this project builds on
+-[Karpathy's Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) — Full video series building up to GPT from scratch
-[Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762) — The original transformer paper
-[GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) — The GPT-2 paper
-[The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) — Visual explanation of the architecture