art-test-stack · art-test-stack · May 21, 2026 · May 14, 2026 · May 14, 2026 · May 14, 2026
diff --git a/README.md b/README.md
@@ -240,25 +240,109 @@ Next sections detail the different generated components.
 
 ### Tokenization
 
-The tokenization implementation are located in [`gpt_lab.tokenizer`](./src/gpt_lab/tokenizer/tokenizer.py). The code only includes BPE tokenization for now (include sentencepiece is a TODO). The tokenizer training is only supported by huggingface implementation for now. For inference, the tiktoken implementation is the default one, as it is much faster than the huggingface one. The custom BPE implementation is still under development, and is not functional yet.
+The tokenization system lives in [`gpt_lab/tokenizer/`](./src/gpt_lab/tokenizer/) and is organized as follows:
+
+```
+tokenizer/
+├── auto.py            # tokenizer resolution and scaling-law selection
+├── base.py            # base Tokenizer class and utilities for internal logic
+├── corpus.py          # TokenizerCorpus class and training data generation logic
+├── hf.py              # HuggingFace backend wrapper and conversion
+├── serialization.py   # msgpack save/load, validation, fingerprinting
+├── tokenizer.py       # core Tokenizer class and public API
+└── truncation.py      # vocabulary truncation helpers
+```
+
+The library supports BPE tokenization only (SentencePiece is a TODO). Training uses the HuggingFace backend; encoding and decoding use tiktoken.
+
+Tokenizers are serialized with `msgpack` (binary, rank-sorted, SHA-256 fingerprinted) rather than pickle. Legacy `vocab.pkl` files are still readable with a deprecation warning.
+
+#### Using a pretrained tokenizer
+
+```python
+from gpt_lab.tokenizer import Tokenizer
+
+# Load from tiktoken (default)
+tokenizer = Tokenizer.from_pretrained("cl100k_base")
+
+# Load from a specific source
+tokenizer = Tokenizer.from_pretrained("cl100k_base", source="tiktoken")
+```
+
+#### Loading a truncated tokenizer
+
+Truncated tokenizers are identified by name suffix and handled automatically.
+If a cached version exists on disk it is loaded directly; otherwise the base
+tokenizer is loaded, truncated, saved, and returned.
+
+```python
+# Automatically builds cl100k_base truncated to 32k mergeable ranks
+# and caches it to disk for future loads
+tokenizer = Tokenizer.from_pretrained("cl100k_base_truncated_32000")
+```
+
+Truncation always preserves all 256 byte-level tokens and reassigns ranks to be contiguous from 0.
 
 #### Training a tokenizer
 
 ```python
 from gpt_lab.tokenizer import Tokenizer
 from gpt_lab.tokenizer.corpus import TokenizerCorpus
-from gpt_lab.utils.schemas import TokenizerTrainerConfig
+from gpt_lab.utils.schemas import TokenizerConfig, TokenizerTrainerConfig
 
 # uses default corpus settings (mixture of HuggingFaceFW/fineweb-edu, HuggingFaceFW/fineweb-2, HuggingFaceTB/finemath and codeparrot/codeparrot-clean)
 corpus = TokenizerCorpus.from_sources(random_seed=42)
-cfg = TokenizerTrainerConfig(
+trainer_cfg = TokenizerTrainerConfig(
+    source="huggingface", # training backend (e.g., "huggingface", "tiktoken", "bpe", "fbpe", "rbpe", "dummy")
+    to_save=True, # whether to save the trained tokenizer to disk
+)
+cfg = TokenizerConfig(
     name="my_tokenizer",
     vocab_size=32_000,
     pat_str="gpt2", # pattern for pre-tokenization (e.g., "gpt2", "cl100k-base", etc., or regex pattern for custom pre-tokenization)
+    trainer=trainer_cfg,
 )
-tokenizer = Tokenizer.train_from_iterator(cfg, iterator=corpus.iterator())
+tokenizer = Tokenizer.train_from_iterator(iterator=corpus.iterator(), config=cfg)
 ```
 
+
+#### Loading a locally saved tokenizer
+
+```python
+tokenizer = Tokenizer.from_disk("my_tokenizer")
+```
+
+Expects the directory structure:
+
+```
+<TOKENIZERS_FOLDER>/my_tokenizer/
+├── tokenizer_config.json       # includes SHA-256 fingerprint
+├── mergeable_ranks.msgpack
+└── token_bytes.pt
+```
+
+#### Automatic tokenizer resolution
+
+When using `AutoConfig`, the tokenizer is resolved automatically based on
+scaling laws. You can also call the resolver directly:
+
+```python
+from gpt_lab.tokenizer.auto import resolve_tokenizer
+
+tokenizer = resolve_tokenizer("cl100k_base", vocab_size=32_000)
+```
+
+#### Which tokenizer to use?
+
+The training script is at `scripts/train_tokenizer.py`. For encoding during
+training and inference, tiktoken is always used regardless of which backend
+trained the tokenizer.
+
+> [!NOTE]
+> HuggingFace `BpeTrainer` does not expose a random seed, so trained
+> tokenizers are not guaranteed to be bit-for-bit reproducible across runs.
+> Pretrained and truncated tokenizers are fully deterministic.
+
 #### Using a pre-trained tokenizer
 
 ```python
@@ -274,9 +358,27 @@ The tokenizer training script is located in `scripts/train_tokenizer.py`. It all
 
 Training time benchmarks for different implementations and configurations. All the tokenizers were trained on corpus generated from `gpt_lab.tokenizer.corpus.TokenizerCorpus()` with default settings, tuned with variable `vocab_size`.
 
-<!-- Implementation | Vocabulary size | Num proc | Corpus size | Training time
---- | --- | --- | --- | ---
-huggingface | 32,000 | 7 | 112.58 MB | 11.45 seconds  -->
+#### Scaling laws for tokenizer training
+
+In [docs/tokenizer_scaling.md](./docs/tokenizer_scaling.md), we analyze how ByteLevel BPE tokenization scales with different corpus sizes, vocabulary sizes, and split patterns. The goal is to understand the trade-offs between these factors and their impact on tokenization quality and efficiency.
+
+To run the tokenizer scaling experiments yourself, use the following command from the root directory of the repo:
+
+```bash
+uv run python -m scripts.benchmark.tokenizer_corpus_size \
+    --seed 42 \
+    --num-procs 16 \
+    --vocab-sizes 20000,50000,100000 \
+    --pat-strs gpt2,cl100k_base \
+    --write-corpus \
+    --corpus-sizes-gb 10,50,100,500,1000,5000,10000 
+```
+
+More details on the arguments are given in [tokenizer_corpus_size.py](./scripts/benchmark/tokenizer_corpus_size.py) or using `--help`:
+
+```bash
+uv run python -m scripts.benchmark.tokenizer_corpus_size --help
+```
 
 ### Model architecture
 

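Note on the README hunk above: the truncation rule it states (keep all 256 byte-level tokens, reassign ranks contiguously from 0) can be sketched as follows. This is a minimal illustration, assuming mergeable ranks are stored as a `dict[bytes, int]` in the tiktoken style; `truncate_mergeable_ranks` is a hypothetical helper and is not claimed to match the code in `gpt_lab/tokenizer/truncation.py`.

```python
def truncate_mergeable_ranks(
    mergeable_ranks: dict[bytes, int], target_size: int
) -> dict[bytes, int]:
    """Keep every single-byte token plus the lowest-rank merges, re-ranked from 0.

    Hypothetical helper for illustration only.
    """
    if target_size < 256:
        raise ValueError("target_size must be at least 256 to keep all byte-level tokens")

    # Sort by original rank so that earlier merges are kept first.
    by_rank = sorted(mergeable_ranks.items(), key=lambda item: item[1])
    byte_tokens = [token for token, _ in by_rank if len(token) == 1]
    merges = [token for token, _ in by_rank if len(token) > 1]

    # Byte-level tokens are always preserved; merges fill the remaining budget.
    kept = byte_tokens + merges[: target_size - len(byte_tokens)]

    # Reassign ranks so they are contiguous and start at 0.
    return {token: rank for rank, token in enumerate(kept)}


# Toy vocabulary: 256 byte tokens followed by three merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks.update({b"th": 256, b"the": 257, b"in": 258})
truncated = truncate_mergeable_ranks(toy_ranks, target_size=257)
```

Because a BPE merge can only combine tokens with lower ranks, keeping a prefix of the rank-sorted merges leaves a vocabulary in which every kept merge can still be built from kept tokens.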
diff --git a/docs/tokenizer_scaling.md b/docs/tokenizer_scaling.md
@@ -0,0 +1,112 @@
+# ByteLevel BPE Scaling Experiments
+
+This document describes the experiments conducted to analyze how ByteLevel BPE tokenization scales with different corpus sizes, vocabulary sizes, and split patterns. The goal is to understand the trade-offs between these factors and their impact on tokenization quality and efficiency.
+
+However, the results obtained were quite poor compared to the baselines, given that I could not reach the optimal memory budget for training the tokenizers.
+
+To run the experiments, use the following command from the root directory of the repo:
+```bash
+uv run python -m scripts.benchmark.tokenizer_corpus_size \
+    --seed 42 \
+    --num-procs 16 \
+    --vocab-sizes 20000,50000,100000 \
+    --pat-strs gpt2,cl100k_base \
+    --write-corpus \
+    --corpus-sizes-gb 10,50,100,500,1000,5000,10000 
+```
+
+Args:
+- `--seed`: Random seed for reproducibility. Default is 42.
+- `--num-procs`: Number of processes to use for tokenizer training. Defaults to the number of CPU cores available, capped at 32 to avoid overloading the system.
+- `--vocab-sizes`: Comma-separated list of vocabulary sizes to train tokenizers with.
+- `--pat-strs`: Comma-separated list of pattern string names to use for tokenizer training. If not specified, defaults to using the GPT-2 pattern string.
+- `--write-corpus`: Flag to indicate training mode (write corpus). If not set, the script will attempt to load an existing corpus from disk.
+- `--corpus-sizes-gb`: Comma-separated list of corpus sizes in gigabytes to use for tokenizer training. If not specified, defaults to a range of sizes based on the vocabulary size.
+- `--compare-truncated-baselines`: Whether to compare trained tokenizers with truncated versions of baseline tokenizers.
+- `--corpus-temperature-alpha`: Optional temperature parameter to control the randomness of the corpus generation. 
+
+The script allows us to train tokenizers with different configurations and evaluate them on a simple test set. It is designed to be flexible and customizable, so that new training data sources, evaluation datasets, and evaluation metrics can be added easily.
+
+## Summary
+
+This is a full recipe for training tokenizers across different corpus sizes, vocabulary sizes, and split patterns,
+and for evaluating the trained tokenizers on a simple test set to analyze the trade-offs between:
+- training corpus size,
+- vocabulary size,
+- split pattern,
+- tokenization quality (compression ratio, efficiency, etc.),
+- and cross-language generalization (if we have multilingual evaluation sets).
+
+The training data is generated by sampling from a mixture of sources, which can be customized by the user. 
+
+By default, the training data is sampled from a mixture of the following sources ([gpt_lab.tokenizer.corpus](./../src/gpt_lab/tokenizer/corpus.py#L339)):
+- English web text (from fineweb-edu)
+- Multilingual web text (from fineweb-2)
+- Maths text (from finemath-4plus)
+- Python code (from codeparrot-clean)
+
+To keep the different runs consistent, we create the training data once, store it on disk, and reuse it for all runs.
+This also allows us to analyze the impact of the corpus size without having to worry about the randomness of the data generation process.
+
+The evaluation is done on the following datasets (eval_configs in the code):
+- enwik8 (for English text)
+- fineweb-edu (for English web text)
+- fineweb-2 (for multilingual web text, with subsets for different languages)
+- finemath-4plus (for math text)
+- github-top-code (for Python code)
+
+> [!WARNING]
+> With its current implementation, the script may use the same samples for both training and evaluation,
+> which can lead to overfitting and an overestimation of the tokenizer's performance.
+> However, the results obtained were quite poor compared to the baselines, given that I could not reach the optimal memory budget for training the tokenizers.
+> Hence, in case of future runs with **exceptionally good results**, it would be important to check whether
+> the training and evaluation samples overlap, and if so, to implement a proper train/eval
+> split to get a more accurate estimate of the tokenizer's performance.
+
+> [!NOTE]
+> The code is designed to be flexible and customizable, allowing you to easily add new sources for
+training data, new evaluation datasets, and new metrics for evaluating the tokenizers.
+
+
+A similar study on the optimal corpus size for training a BPE tokenizer is [1], which finds that returns diminish after 150GB of training data for BPE tokenizers with vocabulary sizes of 40,960, 64,000, 128,000, and 256,000.
+
+However, that study focuses on BPE tokenizers with specific vocabulary sizes. Its conclusion that performance improvements become marginal beyond 150GB of training data
+suggests that there is an optimal corpus size for training BPE tokenizers.
+
+Here, we want to analyze the trade-offs between corpus size, vocabulary size, and split pattern.
+We also compare with truncated versions of baseline tokenizers to see how much of the performance can be retained with a smaller vocabulary size.
+
+This is mainly motivated by the following facts:
+- Language models have been scaled up, but tokenizer sizes have not been scaled up as much, and it is not clear how much tokenizer performance can be improved by scaling up the tokenizer training corpus and vocabulary size.
+- According to [3], language model performance is sensitive to tokenizer size, and the optimal size is often larger than the commonly used 50-100k tokens, especially for larger models.
+
+## About the Evaluation
+
+The evaluation is done on different datasets, which can be customized in the code (making it easy to add new evaluation datasets is a TODO), with the following metrics:
+- **Compression ratio**: the ratio of the number of tokens produced by the tokenizer to the number of characters in the input text.
+- **Efficiency**: the average number of characters per token, which is the inverse of the compression ratio.
+- **Rényi entropy**: proposed as a tokenizer quality measure in [2], it is a generalization of the Shannon entropy that can be used to measure the diversity of the token distribution. It is defined as:
+    $$H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{i=1}^n p_i^\alpha$$
+    where $p_i$ is the probability of the $i$-th token in the distribution, $n$ is the number of tokens in the vocabulary, and $\alpha$ is a parameter that controls the sensitivity of the entropy to the probabilities of the tokens. When $\alpha \to 1$, the Rényi entropy converges to the Shannon entropy. When $\alpha > 1$, the Rényi entropy is more sensitive to the probabilities of the most common tokens, while when $\alpha < 1$, it is more sensitive to the probabilities of the less common tokens. In our experiments, we use $\alpha = 2.5$, the value that [2] reports as correlating best with downstream translation quality.
+- **Efficient Entropy**: also introduced in [2], it is the Rényi entropy with $\alpha = 2.5$ scaled by the logarithm of the vocabulary size:
+    $$H_\alpha^{\text{eff}}(X) = \frac{H_\alpha(X)}{\log n}$$
+    where $n$ is the number of tokens in the vocabulary. This normalization allows us to compare tokenizers with different vocabulary sizes on a more equal footing.
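+
+As a small worked example of these definitions (illustrative numbers, not results from the experiments), take a 1,000-character text encoded into 250 tokens: the compression ratio is $250 / 1000 = 0.25$ and the efficiency is $1000 / 250 = 4$ characters per token. For the entropy metrics, assume a vocabulary of $n = 3$ tokens with probabilities $(1/2, 1/4, 1/4)$ and natural logarithms:
+    $$H_{2.5}(X) = \frac{1}{1-2.5} \log\left(2^{-2.5} + 2 \cdot 2^{-5}\right) \approx \frac{-1.430}{-1.5} \approx 0.953$$
+    $$H_{2.5}^{\text{eff}}(X) = \frac{0.953}{\log 3} \approx 0.868$$
+A uniform distribution over the same three tokens gives $H_{2.5}(X) = \log 3$ and therefore $H_{2.5}^{\text{eff}}(X) = 1$, the maximum.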
+
+## Acknowledgements
+
+This experiment is inspired by and has some code adapted from the following sources:
+- The Hugging Face Tokenizers library (https://github.com/huggingface/tokenizers)
+- The OpenAI tiktoken library (https://github.com/openai/tiktoken)
+- nanochat tokenizer code (https://github.com/karpathy/nanochat) for the idea of using HF-training backend + tiktoken-inference backend for efficient training and evaluation of tokenizers.
+
+## References
+1. Reddy, Varshini, et al. "How much is enough? the diminishing returns of tokenization training data." arXiv preprint arXiv:2502.20273 (2025).
+2. Zouhar, Vilém, et al. "Tokenization and the noiseless channel." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
+3. Tao, Chaofan, et al. "Scaling laws with vocabulary: Larger models deserve larger vocabularies." Advances in Neural Information Processing Systems 37 (2024): 114147-114179.
+4. Karpathy, Andrej. "Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs. A text and code version of Karpathy’s famous tokenizer video." https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers.html (2025).
+
+## Contributing
+- If you want to contribute to this project, please feel free to open an issue or a pull request. Any contributions are welcome, whether it's fixing a bug, adding a new feature, or improving the documentation.
+
+Author: Arthur Testard (arthur.testard.pro@gmail.com) \
+Please cite this work if the code is helpful to you.
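Note on docs/tokenizer_scaling.md above: the evaluation metrics it defines can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function names are hypothetical and are not taken from `scripts/benchmark/tokenizer_corpus_size.py`, and natural logarithms are an assumption, since the document does not state a base.

```python
import math
from collections import Counter
from typing import Sequence


def compression_ratio(num_tokens: int, num_chars: int) -> float:
    """Number of tokens divided by number of characters."""
    return num_tokens / num_chars


def efficiency(num_tokens: int, num_chars: int) -> float:
    """Average characters per token, the inverse of the compression ratio."""
    return num_chars / num_tokens


def renyi_entropy(token_ids: Sequence[int], alpha: float = 2.5) -> float:
    """Rényi entropy (in nats) of the empirical token distribution."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit, not handled in this sketch")
    counts = Counter(token_ids)
    total = len(token_ids)
    # Tokens that never occur contribute nothing to the sum, so only
    # observed tokens need to be counted.
    power_sum = sum((count / total) ** alpha for count in counts.values())
    return math.log(power_sum) / (1.0 - alpha)


def efficient_entropy(
    token_ids: Sequence[int], vocab_size: int, alpha: float = 2.5
) -> float:
    """Rényi entropy divided by the log of the vocabulary size."""
    return renyi_entropy(token_ids, alpha) / math.log(vocab_size)


# Toy token stream over a 3-token vocabulary with frequencies (1/2, 1/4, 1/4).
skewed = [0, 0, 1, 2]
uniform = [0, 1, 2]
skewed_eff = efficient_entropy(skewed, vocab_size=3)
uniform_eff = efficient_entropy(uniform, vocab_size=3)
```

On the toy streams, the uniform one reaches the maximum efficient entropy of 1, while the skewed one scores lower because one token dominates.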
diff --git a/pyproject.toml b/pyproject.toml
@@ -9,9 +9,11 @@ authors = [
 ]
 
 dependencies = [
+    "datasets==4.8.4",
     "gradio==3.18.0",
     "jinja2==3.1.6",
     "kernels==0.11.7",
+    "msgpack>=1.1.2",
     "numpy==2.4.3",
     "psutil==7.2.2",
     "pydantic==2.12.5",
@@ -42,7 +44,6 @@ gpu = [
 [dependency-groups]
 dev = [
     "pytest==8.0",
-    "datasets==4.8.4",
     "huggingface-hub==1.12.0",
     "matplotlib==3.10.8",
 ]

diff --git a/scripts/benchmark_dataloaders.py → scripts/benchmark/dataloaders.py b/scripts/benchmark_dataloaders.py → scripts/benchmark/dataloaders.py