This repository investigates how model capacity shapes the trade-off between memorization and generalization in GPT-style Transformers using small, fully controlled, character-level tasks.
Paper: https://arxiv.org/pdf/2506.09099
Code: https://github.com/Josh-ee/Too-Big-to-Think
- Install dependencies:

```shell
conda create -n tbtt python=3.11
conda activate tbtt
pip install -r requirements.txt
```
- Prepare data (choose one of `as_math`, `facts_char`, `both_math_and_facts`):

```shell
python data/<Type>/prepare.py
```
- Train (choose type: `math`, `facts`, `both`; size: `14`, `28`, `56`, `mlt`):

```shell
python train_with_eval.py config/train_<Type>_mini_<Model Size>.py
```
- Plot a single run:

```shell
python plot_results.py --model_folder <type>_mini_<Model Size>
```
- Extra evaluations:
  - Capitals: `python eval_capitals.py --out_dir=facts_mini_<Model Size>`
  - Math: `python eval_as_math.py --out_dir=math_mini_<Model Size>`
- Recreate the three grid figures in this README:

```shell
python plot_all_models.py
```

  Outputs: `grid_math.png`, `grid_facts.png`, `grid_combined.png`
- Example sampling command:

```shell
python sample.py \
  --out_dir=math_mini_14 \
  --num_samples=5 \
  --max_new_tokens=3 \
  --ckpt="ckpt_best_math.pt" \
  --start="<7-5"
```
The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and carries broader implications for the design and deployment of small language models.
- Arithmetic (Generalization): Simple addition and subtraction, formatted as `<a+b=c>` and `<a-b=c>` where `a, b ∈ {0,…,9}`. All expressions involving both 5 and 7 (`5+7`, `7+5`, `5-7`, `7-5`) are withheld from training and validation. Each line is 9 characters; block size 9; batch size 882.
- Capital Cities (Memorization): 50 statements of the form "Capital of X is Y." The same 50 examples are used for training and validation to probe memorization. Max length 24; block size 24; batch size 576.
- Combined Test: Uses the Capital Cities settings for compatibility (block size 24, batch size 576). Arithmetic examples fit cleanly into these batches since 576 is divisible by 9.
Note: Training and validation use the same data by design to reflect duplication often present in pre-training and to isolate memorization dynamics.
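The withholding scheme above can be sketched as follows. This is an illustration of the train/held-out split only; the exact line formatting, padding, and tokenization live in `data/<Type>/prepare.py` and may differ.

```python
def make_math_lines():
    """Generate all single-digit addition/subtraction lines, withholding
    every expression that mixes 5 and 7 (5+7, 7+5, 5-7, 7-5).

    Sketch only: the repo's prepare.py defines the real formatting.
    """
    lines = []
    for a in range(10):
        for b in range(10):
            if {a, b} == {5, 7}:
                continue  # withheld from both training and validation
            lines.append(f"<{a}+{b}={a + b}>")
            lines.append(f"<{a}-{b}={a - b}>")
    return lines
```

This yields 196 lines (98 digit pairs times two operations); the four withheld expressions appear only at evaluation time.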
- n-models: Single-layer, single-head Transformers with embedding sizes `n ∈ {14, 28, 56}`. An MLP expansion parameter `m` overrides the default hidden size `4n` with `mn`. We identified n14 as the smallest viable model after incrementally increasing from a minimal configuration.
- MLT (Multi-Layer Transformer): A larger GPT-style model with multiple layers and heads, following the Shakespeare configuration from the nanoGPT repository (representative of standard small-scale GPT-style models).
- We extend nanoGPT by adding evaluations and further hyperparameter customization. Evaluation runs every 250 steps and reports accuracy on the arithmetic and capital-cities tasks. Arithmetic evaluation includes the withheld `(5,7)` combinations. Each test input is evaluated 10 times to account for sampling variability, and the best combined evaluation score across training is logged and shown in the results tables.
- Hardware: single NVIDIA RTX 5090 GPU.
- Reproducibility: global seeding for Python, NumPy, and PyTorch (CPU and CUDA); deterministic cuDNN (benchmarking disabled).
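One way to implement this seeding, assuming the standard PyTorch APIs (the repo's exact helper may differ):

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 1337) -> None:
    """Seed Python, NumPy, and (if available) PyTorch for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)                    # CPU RNG
        torch.cuda.manual_seed_all(seed)           # all CUDA devices
        torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False     # disable autotuner
    except ImportError:
        pass  # torch not installed; Python/NumPy are still seeded
```

Note that even with deterministic cuDNN, generation-time sampling still introduces the run-to-run variance discussed in the reproducibility notes below.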
- Optimization (all models): `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`. Each model is trained from scratch with the hyperparameters in Table 1 below.
| Parameter | n-models | MLT |
|---|---|---|
| Embedding size (n) | 14, 28, 56 | 384 |
| Layers | 1 | 6 |
| Attention heads | 1 | 6 |
| MLP expansion (m) | 1 | 4 |
| Dropout | 0.0 | 0.2 |
| Weight decay | 0 | 0.1 |
| Learning rate | 1e-2 | 1e-5 |
| Min. learning rate | 1e-4 | 1e-6 |
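With the n-model values from Table 1 (learning rate 1e-2 decaying to 1e-4 over `lr_decay_iters = 25000`), the schedule presumably follows nanoGPT's warmup-plus-cosine decay. The sketch below assumes that shape; `warmup_iters = 100` is an assumption, so check the config files for the actual value.

```python
import math

def get_lr(it: int, learning_rate: float = 1e-2, min_lr: float = 1e-4,
           warmup_iters: int = 100, lr_decay_iters: int = 25000) -> float:
    """Cosine learning-rate schedule in the style of nanoGPT.

    Linear warmup, cosine decay to min_lr, then a flat tail.
    warmup_iters=100 is an assumed value, not taken from the repo.
    """
    if it < warmup_iters:                      # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                    # flat tail at min_lr
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Between iteration 25000 and `max_iters = 30000`, training continues at the floor of 1e-4 (1e-6 for MLT).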
Controlled Regularization: To confirm scaling effects are not due to regularization differences, we retrained all models with weight decay = 0.1 and dropout = 0.0. Results are reported alongside the main runs (see Appendices cited in the paper).
Table 2 — Addition & Subtraction Performance
| Model | Parameters | Addition | Subtraction | (5,7)* |
|---|---|---|---|---|
| n14 | 1.46k | 100.0% | 100.0% | 40/40 |
| n28 | 5.26k | 98% | 98% | 0/40 |
| n56 | 19.94k | 98% | 98% | 0/40 |
| MLT | 10.63M | 98% | 98% | 0/40 |
* Four (5,7)/(7,5) combinations, 10 attempts each.

Math Tasks Figure description: Addition and subtraction accuracy over the math evaluation, which includes the four withheld (5,7)/(7,5) cases. Red dots mark where both curves reach 100%, signifying in-distribution generalization to the unseen pairs under the evaluation protocol. Only n14 earns these red markers; larger models plateau below perfect overall accuracy because they cannot generalize to the withheld combinations.
Table 3 — Memorization Performance (Capital Cities)
| Model | Parameters | Facts Accuracy |
|---|---|---|
| n14 | 1.93k | 8.2% |
| n28 | 6.22k | 100.0% |
| n56 | 21.84k | 100.0% |
| MLT | 10.64M | 100.0% |

Capital Tasks Figure description: Accuracy over 30k iterations for the factual memorization task (Capital Cities). Only models at or above n28 achieve full memorization; n14 fails to converge. This illustrates a clear capacity threshold required for factual recall.
Table 4 — Combined Learning Performance
| Model | Parameters | Addition | Subtraction | Facts | Combined | (5,7)* |
|---|---|---|---|---|---|---|
| n14 | 2.14k | 31.2% | 39.4% | 2.0% | 28.6% | 0/40 |
| n28 | 6.64k | 95.0% | 97% | 87.6% | 94.3% | 0/40 |
| n56 | 22.68k | 98% | 98% | 100.0% | 98.4% | 0/40 |
| MLT | 10.65M | 98% | 98% | 100.0% | 98.4% | 0/40 |
* Four (5,7)/(7,5) combinations, 10 attempts each.

Joint Training Figure description: Accuracy over 30k iterations for models trained jointly on arithmetic and factual tasks. No model achieves in-distribution generalization to the held-out (5,7) arithmetic cases. The n14 model regresses relative to its arithmetic-only performance, and larger models prioritize memorization. This reveals the incompatibility of generalization and memorization under simultaneous training conditions.
- Math-only (`grid_math.png`): The smallest model (n14) cannot memorize all arithmetic mappings and instead learns the operation well enough to achieve in-distribution generalization to the withheld `(5,7)`/`(7,5)` pairs (40/40). Larger models default to memorization and fail on these held-out combinations (0/40).
- Facts-only (`grid_facts.png`): Memorization requires sufficient capacity. n14 achieves 8.2%, while n28, n56, and MLT reach 100% factual recall.
- Joint training (`grid_combined.png`): When trained on both tasks, no model gets a single `(5,7)` case correct (0/40). In this setup, learning to memorize and to generalize simultaneously does not occur at any model size.
- Code is based on nanoGPT, with added evaluations and configurability as described above.
- Default training settings: `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`.
- Evaluation every 250 steps; the arithmetic eval includes the withheld `(5,7)` cases; 10 samples per test input.
- Results are generally consistent, but there is slight run-to-run variance due to stochasticity in training and sampling.
```bibtex
@misc{barron2025bigthinkcapacitymemorization,
  title={Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers},
  author={Joshua Barron and Devin White},
  year={2025},
  eprint={2506.09099},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.09099},
}
```

- Inspired by Dr. Gerald Friedland's course and book "Information-Driven Machine Learning: Data Science as an Engineering Discipline".
- Thanks to MD Sunbeam for reviewing early drafts and offering guidance.