# Too Big To Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

This repository investigates how model capacity shapes the trade-off between memorization and generalization in GPT-style Transformers using small, fully controlled, character-level tasks.

Paper: https://arxiv.org/pdf/2506.09099
Code: https://github.com/Josh-ee/Too-Big-to-Think

## Quickstart

- Install dependencies:

  ```bash
  conda create -n tbtt python=3.11
  conda activate tbtt
  pip install -r requirements.txt
  ```

- Prepare data (choose one of `as_math`, `facts_char`, `both_math_and_facts`):

  ```bash
  python data/<Type>/prepare.py
  ```

- Train (type: `math`, `facts`, `both`; size: `14`, `28`, `56`, `mlt`):

  ```bash
  python train_with_eval.py config/train_<Type>_mini_<Model Size>.py
  ```

- Plot a single run:

  ```bash
  python plot_results.py --model_folder <type>_mini_<Model Size>
  ```

- Run the extra evaluations:

  ```bash
  # Capitals
  python eval_capitals.py --out_dir=facts_mini_<Model Size>
  # Math
  python eval_as_math.py --out_dir=math_mini_<Model Size>
  ```

- Recreate the three grid figures in this README (outputs `grid_math.png`, `grid_facts.png`, `grid_combined.png`):

  ```bash
  python plot_all_models.py
  ```

- Sample from a trained model, for example:

  ```bash
  python sample.py \
    --out_dir=math_mini_14 \
    --num_samples=5 --max_new_tokens=3 --ckpt="ckpt_best_math.pt" \
    --start="<7-5"
  ```

## Abstract

The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and offers broader implications for the design and deployment of small language models.

## Methods

### Datasets

- **Arithmetic (Generalization):** Simple addition and subtraction, formatted as `<a+b=c>` and `<a-b=c>` with a, b ∈ {0, …, 9}. All expressions involving both 5 and 7 (5+7, 7+5, 5-7, 7-5) are withheld from training and validation. Each line is 9 characters; block size 9; batch size 882.
- **Capital Cities (Memorization):** 50 statements of the form "Capital of X is Y." The same 50 examples are used for training and validation to probe memorization. Max length 24; block size 24; batch size 576.
- **Combined Test:** Uses the Capital Cities settings for compatibility (block size 24, batch size 576). Arithmetic examples fit cleanly into these batches since 576 is divisible by 9.

Note: Training and validation use the same data by design to reflect duplication often present in pre-training and to isolate memorization dynamics.
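To make the arithmetic split concrete, here is a minimal sketch of how such a dataset could be generated with the (5,7)/(7,5) pairs withheld. This is an illustration, not the repository's actual `prepare.py`, and the exact formatting of the result digits is an assumption:

```python
# Sketch of the arithmetic dataset described above (not the repo's prepare.py).
# Produces lines like "<3+4=7>"; every expression mixing 5 and 7 goes into
# the withheld set instead of the training set. The padding of the result
# (to make each line exactly 9 characters) is omitted here for simplicity.

def make_arithmetic_lines():
    train, withheld = [], []
    for a in range(10):
        for b in range(10):
            for op in "+-":
                c = a + b if op == "+" else a - b
                line = f"<{a}{op}{b}={c}>"
                if {a, b} == {5, 7}:       # 5+7, 7+5, 5-7, 7-5
                    withheld.append(line)
                else:
                    train.append(line)
    return train, withheld
```

With single-digit operands this yields 200 expressions in total, of which exactly four are withheld.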

### Model Architectures

- **n-models:** Single-layer, single-head Transformers with embedding sizes n ∈ {14, 28, 56}. An MLP expansion parameter m overrides the default hidden size of 4n with mn. We identified n14 as the smallest viable model after incrementally increasing from a minimal configuration.
- **MLT (Multi-Layer Transformer):** A larger GPT-style model with multiple layers and heads, following the Shakespeare configuration from the nanoGPT repository (representative of standard small-scale GPT-style models).
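For orientation, the smallest n-model could be expressed as a nanoGPT-style config file roughly like the sketch below. The values come from Table 1; the variable names follow nanoGPT conventions, and `mlp_expansion` is a hypothetical name for this repo's added m parameter:

```python
# Hypothetical nanoGPT-style config for the smallest n-model (n14).
# Values taken from Table 1 in this README; anything else is an assumption.
n_layer = 1            # single Transformer layer
n_head = 1             # single attention head
n_embd = 14            # embedding size n
mlp_expansion = 1      # m: hidden size = m * n instead of the default 4n
dropout = 0.0
weight_decay = 0.0
learning_rate = 1e-2
min_lr = 1e-4
max_iters = 30000
lr_decay_iters = 25000
beta2 = 0.99
```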

### Implementation Details

- We extend nanoGPT with added evaluations and further hyperparameter customization. Evaluation runs every 250 steps and reports accuracy on the arithmetic and capital-cities tasks. The arithmetic evaluation includes the withheld (5,7) combinations. Each test input is evaluated 10 times to account for sampling variability, and the best combined evaluation score across training is logged and shown in the results tables.
- Hardware: a single NVIDIA RTX 5090 GPU.
- Reproducibility: global seeding for Python, NumPy, and PyTorch (CPU and CUDA); deterministic cuDNN (benchmarking disabled).
- Optimization (all models): `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`. Each model is trained from scratch with the hyperparameters in Table 1 below.
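The reproducibility setup described above amounts to a seeding helper along these lines (the function name is ours; the repo's exact implementation may differ):

```python
# Sketch of the global seeding described above: Python, NumPy, and PyTorch
# (CPU + CUDA) RNGs are seeded, cuDNN is made deterministic, and cuDNN
# benchmarking is disabled.
import random

import numpy as np
import torch


def seed_everything(seed: int = 1337):
    random.seed(seed)                          # Python RNG
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)           # PyTorch CUDA RNGs
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # benchmarking disabled
```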

**Table 1 — Hyperparameter comparison: n-models vs. MLT**

| Parameter          | n-models   | MLT  |
|--------------------|------------|------|
| Embedding size (n) | 14, 28, 56 | 384  |
| Layers             | 1          | 6    |
| Attention heads    | 1          | 6    |
| MLP expansion (m)  | 1          | 4    |
| Dropout            | 0.0        | 0.2  |
| Weight decay       | 0          | 0.1  |
| Learning rate      | 1e-2       | 1e-5 |
| Min. learning rate | 1e-4       | 1e-6 |

**Controlled Regularization:** To confirm that scaling effects are not due to regularization differences, we retrained all models with weight decay = 0.1 and dropout = 0.0. Results are reported alongside the main runs (see the appendices cited in the paper).

## Results

### In-Distribution Generalization on Held-Out Arithmetic Cases

**Table 2 — Addition & Subtraction Performance**

| Model | Parameters | Addition | Subtraction | (5,7)* |
|-------|------------|----------|-------------|--------|
| n14   | 1.46k      | 100.0%   | 100.0%      | 40/40  |
| n28   | 5.26k      | 98%      | 98%         | 0/40   |
| n56   | 19.94k     | 98%      | 98%         | 0/40   |
| MLT   | 10.63M     | 98%      | 98%         | 0/40   |

\* Four (5,7)/(7,5) combinations, 10 attempts each.

**Figure — Performance on math tasks:** Addition and subtraction accuracy over the math evaluation, which includes the four withheld (5,7)/(7,5) cases. Red dots mark points where both curves reach 100%, signifying in-distribution generalization to the unseen pairs within the evaluation protocol. Only n14 earns these red markers; larger models plateau below perfect overall accuracy because they cannot generalize to the withheld combinations.


### Factual Recall of Capital Cities

**Table 3 — Memorization Performance (Capital Cities)**

| Model | Parameters | Facts Accuracy |
|-------|------------|----------------|
| n14   | 1.93k      | 8.2%           |
| n28   | 6.22k      | 100.0%         |
| n56   | 21.84k     | 100.0%         |
| MLT   | 10.64M     | 100.0%         |

**Figure — Performance on capital tasks:** Accuracy over 30k iterations for the factual memorization task (Capital Cities). Only models at or above n28 achieve full memorization; n14 fails to converge, illustrating a clear capacity threshold for factual recall.


### Joint Arithmetic + Capital Cities Training

**Table 4 — Combined Learning Performance**

| Model | Parameters | Addition | Subtraction | Facts  | Combined | (5,7)* |
|-------|------------|----------|-------------|--------|----------|--------|
| n14   | 2.14k      | 31.2%    | 39.4%       | 2.0%   | 28.6%    | 0/40   |
| n28   | 6.64k      | 95.0%    | 97%         | 87.6%  | 94.3%    | 0/40   |
| n56   | 22.68k     | 98%      | 98%         | 100.0% | 98.4%    | 0/40   |
| MLT   | 10.65M     | 98%      | 98%         | 100.0% | 98.4%    | 0/40   |

\* Four (5,7)/(7,5) combinations, 10 attempts each.

**Figure — Performance on combined tasks:** Accuracy over 30k iterations for models trained jointly on arithmetic and factual tasks. No model achieves in-distribution generalization to the held-out (5,7) arithmetic cases: n14 regresses relative to its arithmetic-only performance, and larger models prioritize memorization. This reveals an incompatibility between generalization and memorization under joint training.


## Key Takeaways

- **Math-only (`grid_math.png`):** The smallest model (n14) cannot memorize all arithmetic mappings and instead learns the operations well enough to generalize in-distribution to the withheld (5,7)/(7,5) pairs (40/40). Larger models default to memorization and fail on these held-out combinations (0/40).
- **Facts-only (`grid_facts.png`):** Memorization requires sufficient capacity: n14 reaches only 8.2%, while n28, n56, and MLT all reach 100% factual recall.
- **Joint training (`grid_combined.png`):** When trained on both tasks, no model gets a single (5,7) case correct (0/40). In this setup, learning to memorize and to generalize simultaneously does not occur at any model size.

## Reproducing

- Code is based on nanoGPT, with added evaluations and configurability as described above.
- Default training settings: `max_iters = 30000`, `lr_decay_iters = 25000`, `beta2 = 0.99`.
- Evaluation every 250 steps; the arithmetic eval includes the withheld (5,7) cases; 10 samples per test input.
- Results are generally consistent, but expect slight run-to-run variance due to stochasticity in training and sampling.
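The "10 samples per test input" protocol amounts to accuracy over repeated stochastic completions. Here is a rough sketch with a hypothetical helper (not the repo's `eval_as_math.py` or `eval_capitals.py`):

```python
# Sketch of the sampling-based evaluation: each prompt is sampled `attempts`
# times, and accuracy is the fraction of completions matching the reference.
# `model_sample` is any callable mapping a prompt string to a completion string.

def eval_accuracy(model_sample, prompts_and_answers, attempts=10):
    total = correct = 0
    for prompt, answer in prompts_and_answers:
        for _ in range(attempts):
            total += 1
            correct += (model_sample(prompt) == answer)
    return correct / total
```

Under this scheme, the four withheld (5,7)/(7,5) expressions sampled 10 times each yield the "x/40" scores reported in Tables 2 and 4.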

## Citation

```bibtex
@misc{barron2025bigthinkcapacitymemorization,
      title={Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers},
      author={Joshua Barron and Devin White},
      year={2025},
      eprint={2506.09099},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.09099},
}
```

## Acknowledgments & Contact

- Inspired by Dr. Gerald Friedland's course and book "Information-Driven Machine Learning: Data Science as an Engineering Discipline".
- Thanks to MD Sunbeam for reviewing early drafts and offering guidance.
