Full Mathematical Derivation + Working Training Code
A complete Transformer encoder that actually trains — written only in NumPy.
Every single backward pass is manually derived, mathematically proven, and matches the code 100%.
Final result after 500 epochs: MSE loss = 0.0043 → near-perfect reconstruction.
Author: @uesina15-max
Date: November 2025
| Component | Formula |
|---|---|
| Input | $X \in \mathbb{R}^{B \times T \times d}$ |
| Positional Encoding | $PE_{(pos,2i)} = \sin\!\big(pos/10000^{2i/d}\big),\quad PE_{(pos,2i+1)} = \cos\!\big(pos/10000^{2i/d}\big)$ |
| Q, K, V Projection | $Q = XW_Q,\quad K = XW_K,\quad V = XW_V$ |
| Scaled Dot-Product | $S = \frac{QK^\top}{\sqrt{d_k}}$ |
| Attention Weights | $A = \mathrm{softmax}(S)$ |
| Output per head | $H_i = A_i V_i$ |
| Multi-Head Output | $\mathrm{MHA}(X) = \mathrm{Concat}(H_1,\dots,H_h)\,W_O$ |
| Residual + LayerNorm | $Z = \mathrm{LayerNorm}\big(X + \mathrm{MHA}(X)\big)$ |
| Feed-Forward (ReLU) | $\mathrm{FFN}(Z) = \mathrm{ReLU}(ZW_1 + b_1)\,W_2 + b_2$ |
| Final Block Output | $Y = \mathrm{LayerNorm}\big(Z + \mathrm{FFN}(Z)\big)$ |
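To make the forward-pass formulas concrete, here is a minimal NumPy sketch of the scaled dot-product attention step. The function names, the toy shapes, and the max-subtraction stability trick are my own illustration under the table's notation, not necessarily the repo's exact code.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, heads, T, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # S = QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)                   # A = softmax(S)
    return weights @ V, weights                          # H = A V

# Toy shapes matching the config below: B=2, h=4, T=8, d_model=64 -> d_k=16
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8, 16))
K = rng.standard_normal((2, 4, 8, 16))
V = rng.standard_normal((2, 4, 8, 16))
H, A = scaled_dot_product_attention(Q, K, V)
print(H.shape, A.shape)  # (2, 4, 8, 16) (2, 4, 8, 8)
```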
Loss: Reconstruction MSE
$$
\mathcal{L} = \frac{1}{B\cdot T\cdot d} \|\hat{Y} - X\|_2^2 \quad\Rightarrow\quad
\frac{\partial\mathcal{L}}{\partial\hat{Y}} = \frac{2}{B\cdot T\cdot d}(\hat{Y} - X)
$$
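In NumPy the loss and its gradient are a direct transcription of these two equations. A minimal sketch (the function name is illustrative, not necessarily the repo's):

```python
import numpy as np

def mse_loss_and_grad(Y_hat, X):
    # L = (1 / (B*T*d)) * ||Y_hat - X||_2^2
    diff = Y_hat - X
    n = diff.size                 # B * T * d
    loss = np.sum(diff ** 2) / n
    grad = 2.0 * diff / n         # dL/dY_hat
    return loss, grad
```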
Optimizer: Adam with bias correction (exact implementation)
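For reference, a textbook Adam step with bias correction looks like the sketch below. The hyperparameter defaults and the function name are assumptions for illustration, not necessarily what the repo uses.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update with bias correction; t is the 1-indexed step count.
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```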
| Item | Value |
|---|---|
| Layers | 2 |
| Model dimension | 64 |
| Heads | 4 |
| Feed-forward dim | 256 |
| Task | Input reconstruction |
| Epochs | 500 |
| Final MSE Loss | 0.0043 |
| Per-token embedding error | ~0.02 |
Near-perfect reconstruction achieved with pure NumPy.
## Why Most From-Scratch Implementations Fail (and How We Fixed Them)
| Component | Common Buggy Formula (seen in many GitHub repos) | Why it fails | Correct Derivation (what our code actually uses) |
|---|---|---|---|
| LayerNorm `dx` | `dx = grad_out * gamma / std` (ignores the mean/variance terms) | Loss stays stuck around 50 and never drops | Three-term formula (see below) |
| Softmax gradient | `grad_scores = grad_attn * attn_weights` (invalid Jacobian) | Attention is completely broken | `attn * (grad - sum(grad * attn))` |
| Residual gradient | `grad_input = grad_from_norm` (skip connection missing) | Gradient vanishing → no training | `grad_input = grad_from_norm + grad_from_branch` |
| Linear `grad_W` | `grad_W = x.T @ grad_out` (batch axis ignored) | Shape error, or gradients ~100× too large | `x.transpose(0,2,1).reshape(...) @ ...` |
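The two non-obvious fixes from the table, the softmax Jacobian-vector product and the batched weight gradient, are sketched below together with a finite-difference check of the softmax gradient. The function names and the check are mine, not the repo's code; the residual fix is simply adding the gradients arriving from both paths.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_backward(grad_attn, attn):
    # Correct Jacobian-vector product for a row-wise softmax:
    # dL/ds = a * (dL/da - sum_j dL/da_j * a_j)
    return attn * (grad_attn - np.sum(grad_attn * attn, axis=-1, keepdims=True))

def linear_grad_W(x, grad_out):
    # x: (B, T, d_in), grad_out: (B, T, d_out)
    # Sum the outer products over both the batch and time axes.
    return np.einsum('bti,bto->io', x, grad_out)

# Finite-difference check of the softmax gradient on one row.
rng = np.random.default_rng(0)
s = rng.standard_normal(5)
g = rng.standard_normal(5)            # upstream gradient dL/da
a = softmax(s)
analytic = softmax_backward(g, a)
eps = 1e-6
numeric = np.array([
    (g @ softmax(s + eps * np.eye(5)[i]) - g @ softmax(s - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-10
```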
## The Exact LayerNorm Backward Derivation We Actually Used (for educational purposes)

$$ y = \gamma \hat{x} + \beta, \qquad \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

$$\boxed{ \frac{\partial\mathcal{L}}{\partial x_i} = \underbrace{\frac{g_{\hat{x},i}}{\sqrt{\sigma^2+\epsilon}}}_{\text{scale term}} \;+\; \underbrace{\frac{2\,g_{\sigma^2}(x_i-\mu)}{d}}_{\text{variance term}} \;+\; \underbrace{\frac{g_{\mu}}{d}}_{\text{mean term}} }$$

What if you didn't know this formula? Then your NumPy Transformer simply will not train.
This repository is not just "working code": it shows both the broken formulas that get copied around and the correct derivations that actually work.
Derivation steps:

- Start: $y = \gamma \hat{x} + \beta,\quad \hat{x} = \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$
- $\frac{\partial\mathcal{L}}{\partial \hat{x}} = \frac{\partial\mathcal{L}}{\partial y}\,\gamma$
- $\frac{\partial \hat{x}_i}{\partial \sigma^2} = -\frac{1}{2}(x_i-\mu)(\sigma^2+\epsilon)^{-3/2}$
- $\frac{\partial \hat{x}_i}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}$
- Combining these with the chain rule gives exactly the three-term formula used in the code.
```python
# The actual code, line by line
grad_x = (grad_x_hat / std) \
       + (grad_var * 2 * (x - mean) / N) \
       + (grad_mean / N)
```
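The snippet above assumes `grad_x_hat`, `grad_var`, and `grad_mean` have already been computed. Here is a self-contained sketch of the full backward pass following the same three-term derivation; the function name and the `gamma`/`beta` gradients are my additions, not copied from the repo.

```python
import numpy as np

def layernorm_backward(grad_out, x, gamma, eps=1e-5):
    # x, grad_out: (..., N); normalization is over the last axis.
    N = x.shape[-1]
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    std = np.sqrt(var + eps)
    x_hat = (x - mean) / std

    grad_x_hat = grad_out * gamma
    # g_{sigma^2} = sum_i g_{x_hat,i} * (x_i - mu) * (-1/2) * (sigma^2 + eps)^(-3/2)
    grad_var = np.sum(grad_x_hat * (x - mean) * -0.5 * (var + eps) ** -1.5,
                      axis=-1, keepdims=True)
    # g_mu = sum_i g_{x_hat,i} * (-1/std) + g_{sigma^2} * mean_i(-2 * (x_i - mu))
    grad_mean = np.sum(-grad_x_hat / std, axis=-1, keepdims=True) \
              + grad_var * np.mean(-2.0 * (x - mean), axis=-1, keepdims=True)

    # The three-term formula: scale term + variance term + mean term.
    grad_x = grad_x_hat / std \
           + grad_var * 2.0 * (x - mean) / N \
           + grad_mean / N

    grad_gamma = np.sum(grad_out * x_hat, axis=tuple(range(x.ndim - 1)))
    grad_beta = np.sum(grad_out, axis=tuple(range(x.ndim - 1)))
    return grad_x, grad_gamma, grad_beta
```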
## Run It Now
```bash
git clone https://github.com/uesina15-max/Transformer-algorithm-application-numpy-.git
cd Transformer-algorithm-application-numpy-
python transformer_numpy.py
```