This repository investigates how model capacity shapes the trade-off between memorization and generalization in GPT-style Transformers using small, fully controlled, character-level tasks.
Paper: https://arxiv.org/pdf/2506.09099
Code: https://github.com/Josh-ee/Too-Big-to-Think
- Install dependencies:

```shell
conda create -n tbtt python=3.11
conda activate tbtt
pip install -r requirements.txt
```
- Prepare data (choose one of `as_math`, `facts_char`, `both_math_and_facts`):

```shell
python data/<Type>/prepare.py
```
- Train (choose type: `math`, `facts`, `both`; size: `14`, `28`, `56`, `mlt`):

```shell
python train_with_eval.py config/train_<Type>_mini_<Model Size>.py
```
- Plot a single run:

```shell
python plot_results.py --model_folder <type>_mini_<Model Size>
```
- Extra evaluations:
  - Capitals: `python eval_capitals.py --out_dir=facts_mini_<Model Size>`
  - Math: `python eval_as_math.py --out_dir=math_mini_<Model Size>`
- Recreate the three grid figures in this README:

```shell
python plot_all_models.py
```

  Outputs: `grid_math.png`, `grid_facts.png`, `grid_combined.png`
- Example sampling command:

```shell
python sample.py \
  --out_dir=math_mini_14 \
  --num_samples=5 \
  --max_new_tokens=3 \
  --ckpt="ckpt_best_math.pt" \
  --start="<7-5"
```
The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and carries broader implications for the design and deployment of small language models.
- Arithmetic (Generalization): Simple addition and subtraction, formatted as `<a+b=c>` and `<a-b=c>` where `a, b ∈ {0,…,9}`. All expressions involving both 5 and 7 (`5+7`, `7+5`, `5-7`, `7-5`) are withheld from training and validation. Each line is 9 characters; block size 9; batch size 882.
- Capital Cities (Memorization): 50 statements of the form "Capital of X is Y." The same 50 examples are used for training and validation to probe memorization. Max length 24; block size 24; batch size 576.
- Combined Test: Uses the Capital Cities settings for compatibility (block size 24, batch size 576). Arithmetic examples fit cleanly into these batches since 576 is divisible by 9.
Note: Training and validation use the same data by design to reflect duplication often present in pre-training and to isolate memorization dynamics.
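The withholding scheme above can be sketched as follows. This is an illustration of the train/held-out split only; the exact line formatting, padding, and tokenization live in `data/<Type>/prepare.py` and may differ.

```python
def make_math_lines():
    """Generate all single-digit addition/subtraction lines, withholding
    every expression that mixes 5 and 7 (5+7, 7+5, 5-7, 7-5).

    Sketch only: the repo's prepare.py defines the real formatting.
    """
    lines = []
    for a in range(10):
        for b in range(10):
            if {a, b} == {5, 7}:
                continue  # withheld from both training and validation
            lines.append(f"<{a}+{b}={a + b}>")
            lines.append(f"<{a}-{b}={a - b}>")
    return lines
```

This yields 196 lines (98 digit pairs times two operations); the four withheld expressions appear only at evaluation time.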
- n-models: Single-layer, single-head Transformers with embedding sizes `n ∈ {14, 28, 56}`. An MLP expansion parameter `m` overrides the default hidden size `4n` with `mn`. We identified n14 as the smallest viable model after incrementally increasing from a minimal configuration.
- MLT (Multi-Layer Transformer): A larger GPT-style model with multiple layers and heads, following the Shakespeare configuration from the nanoGPT repository (representative of standard small-scale GPT-style models).
- We extend nanoGPT by adding evaluations and further hyperparameter customization. Evaluation runs every 250 steps and reports accuracy on the arithmetic and capital-cities tasks. Arithmetic evaluation includes the withheld `(5,7)` combinations. Each test input is evaluated 10 times to account for sampling variability, and the best combined evaluation score across training is logged and shown in the results tables.
- Hardware: single NVIDIA RTX 5090 GPU.
- Reproducibility: global seeding for Python, NumPy, and PyTorch (CPU and CUDA); deterministic cuDNN (benchmarking disabled).
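One way to implement this seeding, assuming the standard PyTorch APIs (the repo's exact helper may differ):

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 1337) -> None:
    """Seed Python, NumPy, and (if available) PyTorch for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)                    # CPU RNG
        torch.cuda.manual_seed_all(seed)           # all CUDA devices
        torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False     # disable autotuner
    except ImportError:
        pass  # torch not installed; Python/NumPy are still seeded
```

Note that even with deterministic cuDNN, generation-time sampling still introduces the run-to-run variance discussed in the reproducibility notes below.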
- Optimization (all models): `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`. Each model is trained from scratch with the hyperparameters in Table 1 below.
| Parameter | n-models | MLT |
|---|---|---|
| Embedding size (n) | 14, 28, 56 | 384 |
| Layers | 1 | 6 |
| Attention heads | 1 | 6 |
| MLP expansion (m) | 1 | 4 |
| Dropout | 0.0 | 0.2 |
| Weight decay | 0 | 0.1 |
| Learning rate | 1e-2 | 1e-5 |
| Min. learning rate | 1e-4 | 1e-6 |
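With the n-model values from Table 1 (learning rate 1e-2 decaying to 1e-4 over `lr_decay_iters = 25000`), the schedule presumably follows nanoGPT's warmup-plus-cosine decay. The sketch below assumes that shape; `warmup_iters = 100` is an assumption, so check the config files for the actual value.

```python
import math

def get_lr(it: int, learning_rate: float = 1e-2, min_lr: float = 1e-4,
           warmup_iters: int = 100, lr_decay_iters: int = 25000) -> float:
    """Cosine learning-rate schedule in the style of nanoGPT.

    Linear warmup, cosine decay to min_lr, then a flat tail.
    warmup_iters=100 is an assumed value, not taken from the repo.
    """
    if it < warmup_iters:                      # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                    # flat tail at min_lr
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Between iteration 25000 and `max_iters = 30000`, training continues at the floor of 1e-4 (1e-6 for MLT).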
Controlled Regularization: To confirm scaling effects are not due to regularization differences, we retrained all models with weight decay = 0.1 and dropout = 0.0. Results are reported alongside the main runs (see Appendices cited in the paper).
Table 2 — Addition & Subtraction Performance
| Model | Parameters | Addition | Subtraction | (5,7)* |
|---|---|---|---|---|
| n14 | 1.46k | 100.0% | 100.0% | 40/40 |
| n28 | 5.26k | 98% | 98% | 0/40 |
| n56 | 19.94k | 98% | 98% | 0/40 |
| MLT | 10.63M | 98% | 98% | 0/40 |
* Four (5,7)/(7,5) combinations, 10 attempts each.

Math Tasks Figure description: Addition and subtraction accuracy over the math evaluation, which includes the four withheld (5,7)/(7,5) cases. Red dots mark where both curves reach 100%, signifying in-distribution generalization to the unseen pairs under the evaluation protocol. Only n14 earns these red markers; larger models plateau below perfect overall accuracy because they cannot generalize to the withheld combinations.
Table 3 — Memorization Performance (Capital Cities)
| Model | Parameters | Facts Accuracy |
|---|---|---|
| n14 | 1.93k | 8.2% |
| n28 | 6.22k | 100.0% |
| n56 | 21.84k | 100.0% |
| MLT | 10.64M | 100.0% |

Capital Tasks Figure description: Accuracy over 30k iterations for the factual memorization task (Capital Cities). Only models at or above n28 achieve full memorization; n14 fails to converge. This illustrates a clear capacity threshold required for factual recall.
Table 4 — Combined Learning Performance
| Model | Parameters | Addition | Subtraction | Facts | Combined | (5,7)* |
|---|---|---|---|---|---|---|
| n14 | 2.14k | 31.2% | 39.4% | 2.0% | 28.6% | 0/40 |
| n28 | 6.64k | 95.0% | 97% | 87.6% | 94.3% | 0/40 |
| n56 | 22.68k | 98% | 98% | 100.0% | 98.4% | 0/40 |
| MLT | 10.65M | 98% | 98% | 100.0% | 98.4% | 0/40 |
* Four (5,7)/(7,5) combinations, 10 attempts each.

Joint Training Figure description: Accuracy over 30k iterations for models trained jointly on arithmetic and factual tasks. No model achieves in-distribution generalization to the held-out (5,7) arithmetic cases. The n14 model regresses relative to its arithmetic-only performance, and larger models prioritize memorization. This reveals the incompatibility of generalization and memorization under simultaneous training conditions.
- Math-only (`grid_math.png`): The smallest model (n14) cannot memorize all arithmetic mappings and instead learns the operation well enough to achieve in-distribution generalization to the withheld `(5,7)`/`(7,5)` pairs (40/40). Larger models default to memorization and fail on these held-out combinations (0/40).
- Facts-only (`grid_facts.png`): Memorization requires sufficient capacity. n14 achieves 8.2%, while n28, n56, and MLT reach 100% factual recall.
- Joint training (`grid_combined.png`): When trained on both tasks, no model gets a single `(5,7)` case correct (0/40). In this setup, learning to memorize and to generalize simultaneously does not occur at any model size.
- Code is based on nanoGPT, with added evaluations and configurability as described above.
- Default training settings: `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`.
- Evaluation every 250 steps; the arithmetic eval includes the withheld `(5,7)` cases; 10 samples per test input.
- Results are generally consistent, but there is slight run-to-run variance due to stochasticity in training and sampling.
```bibtex
@misc{barron2025bigthinkcapacitymemorization,
  title={Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers},
  author={Joshua Barron and Devin White},
  year={2025},
  eprint={2506.09099},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.09099},
}
```

- Inspired by Dr. Gerald Friedland's course and book "Information-Driven Machine Learning: Data Science as an Engineering Discipline".
- Thanks to MD Sunbeam for reviewing early drafts and offering guidance.