Merged
24 commits
c361255
tokenizer: added clamp init option for optimal tokenizer size
art-test-stack May 14, 2026
904cdf6
tokenizer: fix test
art-test-stack May 14, 2026
bed8100
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 14, 2026
d634087
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 15, 2026
2d85e62
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 15, 2026
a3e536b
tokenizer encoder note
art-test-stack May 15, 2026
b27cb51
tokenizer: clamp to truncate
art-test-stack May 16, 2026
14ad038
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 16, 2026
e605beb
list special tokens
art-test-stack May 17, 2026
6a65794
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 17, 2026
9219505
Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
art-test-stack May 17, 2026
2b7ac79
tokenizer refactoring: serialization + truncation
art-test-stack May 17, 2026
5534413
tokenizer: auto + hf + trainer config refactoring
art-test-stack May 17, 2026
3dca1a3
tokenizer: fix corpus import + hf log error + test
art-test-stack May 18, 2026
bde2912
tokenizer corpus: introduce byte control over char control
art-test-stack May 19, 2026
2a1f573
tokenizer training config: integrates trainer parameters to tokenizer…
art-test-stack May 19, 2026
05f986e
tokenizer: some function calls fixes
art-test-stack May 19, 2026
8c587cd
tokenizer: fix sp token in truncated tokenizer + scaling tok header +…
art-test-stack May 20, 2026
d5d8f21
tokenizer: eval with renyi and efficient entropy
art-test-stack May 20, 2026
0b7ee67
tokenizer: split tech report from script+impl args and improve storag…
art-test-stack May 21, 2026
b2a3008
add minimal fixes
art-test-stack May 21, 2026
5385b95
fix test with wrong tokenizer.auto.build_or_load_tokenizer signature
art-test-stack May 21, 2026
6006769
tokenizer: readme
art-test-stack May 21, 2026
083f2f2
tokenizer: readme + minor fixes
art-test-stack May 21, 2026
118 changes: 110 additions & 8 deletions README.md
Expand Up @@ -240,25 +240,109 @@ Next sections detail the different generated components.

### Tokenization

The tokenization implementation are located in [`gpt_lab.tokenizer`](./src/gpt_lab/tokenizer/tokenizer.py). The code only includes BPE tokenization for now (include sentencepiece is a TODO). The tokenizer training is only supported by huggingface implementation for now. For inference, the tiktoken implementation is the default one, as it is much faster than the huggingface one. The custom BPE implementation is still under development, and is not functional yet.
The tokenization system lives in [`gpt_lab/tokenizer/`](./src/gpt_lab/tokenizer/) and is organized as follows:

```
tokenizer/
├── auto.py # tokenizer resolution and scaling-law selection
├── base.py # base Tokenizer class and utilities for internal logic
├── corpus.py # TokenizerCorpus class and training data generation logic
├── hf.py # HuggingFace backend wrapper and conversion
├── serialization.py # msgpack save/load, validation, fingerprinting
├── tokenizer.py # core Tokenizer class and public API
└── truncation.py # vocabulary truncation helpers
```

The library supports BPE tokenization only (SentencePiece is a TODO). Training uses the HuggingFace backend; encoding and decoding use tiktoken.

Tokenizers are serialized with `msgpack` (binary, rank-sorted, SHA-256 fingerprinted) rather than pickle. Legacy `vocab.pkl` files are still readable with a deprecation warning.
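As a rough illustration of what "rank-sorted, SHA-256 fingerprinted" could mean in practice, the sketch below hashes vocabulary entries in rank order using only the standard library. The function name `fingerprint_ranks` and the length-prefixed byte layout are assumptions made for this example; the actual format produced by `serialization.py` is not shown in this README.

```python
import hashlib


def fingerprint_ranks(mergeable_ranks: dict[bytes, int]) -> str:
    """Return a SHA-256 hex digest over (token, rank) pairs in rank order.

    Hypothetical helper: the byte layout below is for illustration only.
    """
    digest = hashlib.sha256()
    # Sorting by rank makes the digest independent of dict insertion order.
    for token, rank in sorted(mergeable_ranks.items(), key=lambda kv: kv[1]):
        # Length-prefix each token so adjacent tokens cannot be confused.
        digest.update(len(token).to_bytes(4, "big"))
        digest.update(token)
        digest.update(rank.to_bytes(4, "big"))
    return digest.hexdigest()


ranks_a = {b"a": 0, b"b": 1, b"ab": 2}
ranks_b = {b"ab": 2, b"a": 0, b"b": 1}  # same content, different insertion order
fp_a = fingerprint_ranks(ranks_a)
fp_b = fingerprint_ranks(ranks_b)
```

Under these assumptions, two vocabularies with the same tokens and ranks produce the same fingerprint, while any change to a token or its rank produces a different one.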

#### Using a pretrained tokenizer

```python
from gpt_lab.tokenizer import Tokenizer

# Load from tiktoken (default)
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Load from a specific source
tokenizer = Tokenizer.from_pretrained("cl100k_base", source="tiktoken")
```

#### Loading a truncated tokenizer

Truncated tokenizers are identified by name suffix and handled automatically.
If a cached version exists on disk it is loaded directly; otherwise the base
tokenizer is loaded, truncated, saved, and returned.

```python
# Automatically builds cl100k_base truncated to 32k mergeable ranks
# and caches it to disk for future loads
tokenizer = Tokenizer.from_pretrained("cl100k_base_truncated_32000")
```

Truncation always preserves all 256 byte-level tokens and reassigns ranks to be contiguous from 0.
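The truncation rule stated above can be sketched on a plain `dict[bytes, int]` of mergeable ranks. This is a minimal illustration, not the code in `truncation.py`: the helper name `truncate_ranks` is hypothetical, and it assumes that lower-rank merges are kept first, which keeps the vocabulary self-consistent when every merge in a BPE vocabulary is built from lower-rank parts.

```python
def truncate_ranks(mergeable_ranks: dict[bytes, int], target_size: int) -> dict[bytes, int]:
    """Keep all single-byte tokens plus the lowest-rank merges, then re-rank from 0.

    Hypothetical helper written for illustration.
    """
    by_rank = sorted(mergeable_ranks.items(), key=lambda kv: kv[1])
    byte_tokens = [tok for tok, _ in by_rank if len(tok) == 1]
    merged_tokens = [tok for tok, _ in by_rank if len(tok) > 1]
    if target_size < len(byte_tokens):
        raise ValueError("target_size must be at least the number of byte-level tokens")
    # Byte-level tokens are always kept; merges fill the remaining budget.
    kept = byte_tokens + merged_tokens[: target_size - len(byte_tokens)]
    # Reassign ranks so they are contiguous from 0.
    return {tok: new_rank for new_rank, tok in enumerate(kept)}


# Toy vocabulary: 256 byte tokens followed by four merges.
base = {bytes([i]): i for i in range(256)}
base.update({b"th": 256, b"the": 257, b"in": 258, b"ing": 259})
small = truncate_ranks(base, 258)
```

With a target of 258, the toy vocabulary keeps the 256 byte tokens plus `b"th"` and `b"the"`, and drops `b"in"` and `b"ing"`.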

#### Training a tokenizer

```python
from gpt_lab.tokenizer import Tokenizer
from gpt_lab.tokenizer.corpus import TokenizerCorpus
from gpt_lab.utils.schemas import TokenizerTrainerConfig
from gpt_lab.utils.schemas import TokenizerConfig, TokenizerTrainerConfig

# uses default corpus settings (mixture of HuggingFaceFW/fineweb-edu, HuggingFaceFW/fineweb-2, HuggingFaceTB/finemath and codeparrot/codeparrot-clean)
corpus = TokenizerCorpus.from_sources(random_seed=42)
cfg = TokenizerTrainerConfig(
trainer_cfg = TokenizerTrainerConfig(
source="huggingface", # training backend (e.g., "huggingface", "tiktoken", "bpe", "fbpe", "rbpe", "dummy")
to_save=True, # whether to save the trained tokenizer to disk
)
cfg = TokenizerConfig(
name="my_tokenizer",
vocab_size=32_000,
pat_str="gpt2", # pattern for pre-tokenization (e.g., "gpt2", "cl100k-base", etc., or regex pattern for custom pre-tokenization)
pat_str="gpt2",
trainer=trainer_cfg, # trainer settings defined above
)
tokenizer = Tokenizer.train_from_iterator(cfg, iterator=corpus.iterator())
tokenizer = Tokenizer.train_from_iterator(iterator=corpus.iterator(), config=trainer_cfg)
```


#### Loading a locally saved tokenizer

```python
tokenizer = Tokenizer.from_disk("my_tokenizer")
```

This expects the following directory structure:

```
<TOKENIZERS_FOLDER>/my_tokenizer/
├── tokenizer_config.json # includes SHA-256 fingerprint
├── mergeable_ranks.msgpack
└── token_bytes.pt
```

#### Automatic tokenizer resolution

When using `AutoConfig`, the tokenizer is resolved automatically based on
scaling laws. You can also call the resolver directly:

```python
from gpt_lab.tokenizer.auto import resolve_tokenizer

tokenizer = resolve_tokenizer("cl100k_base", vocab_size=32_000)
```

#### Which tokenizer to use?

The training script is at `scripts/train_tokenizer.py`. For encoding during
training and inference, tiktoken is always used regardless of which backend
trained the tokenizer.

> [!NOTE]
> HuggingFace `BpeTrainer` does not expose a random seed, so trained
> tokenizers are not guaranteed to be bit-for-bit reproducible across runs.
> Pretrained and truncated tokenizers are fully deterministic.

#### Using a pre-trained tokenizer

```python
Expand All @@ -274,9 +358,27 @@ The tokenizer training script is located in `scripts/train_tokenizer.py`. It all

Training time benchmarks for different implementations and configurations. All tokenizers were trained on a corpus generated by `gpt_lab.tokenizer.corpus.TokenizerCorpus()` with default settings, varying only `vocab_size`.

<!-- Implementation | Vocabulary size | Num proc | Corpus size | Training time
--- | --- | --- | --- | ---
huggingface | 32,000 | 7 | 112.58 MB | 11.45 seconds -->
#### Scaling laws for tokenizer training

In [docs/tokenizer_scaling.md](./docs/tokenizer_scaling.md), we analyze how ByteLevel BPE tokenization scales with different corpus sizes, vocabulary sizes, and split patterns. The goal is to understand the trade-offs between these factors and their impact on tokenization quality and efficiency.

To experiment with tokenizer scaling yourself, run the following command from the root directory of the repo:

```bash
uv run python -m scripts.benchmark.tokenizer_corpus_size \
--seed 42 \
--num-procs 16 \
--vocab-sizes 20000,50000,100000 \
--pat-strs gpt2,cl100k_base \
--write-corpus \
--corpus-sizes-gb 10,50,100,500,1000,5000,10000
```

More details on the arguments are given in [tokenizer_corpus_size.py](./scripts/benchmark/tokenizer_corpus_size.py) or using `--help`:

```bash
uv run python -m scripts.benchmark.tokenizer_corpus_size --help
```

### Model architecture

Expand Down
112 changes: 112 additions & 0 deletions docs/tokenizer_scaling.md
@@ -0,0 +1,112 @@
# ByteLevel BPE Scaling Experiments

This document describes the experiments conducted to analyze how ByteLevel BPE tokenization scales with different corpus sizes, vocabulary sizes, and split patterns. The goal is to understand the trade-offs between these factors and their impact on tokenization quality and efficiency.

However, the results obtained were quite poor compared to the baselines, given that I could not reach the optimal memory budget for training the tokenizers.

To run the experiments, run the following command from the root directory of the repo:
```bash
uv run python -m scripts.benchmark.tokenizer_corpus_size \
--seed 42 \
--num-procs 16 \
--vocab-sizes 20000,50000,100000 \
--pat-strs gpt2,cl100k_base \
--write-corpus \
--corpus-sizes-gb 10,50,100,500,1000,5000,10000
```

Args:
- `--seed`: Random seed for reproducibility. Default is 42.
- `--num-procs`: Number of processes to use for tokenizer training. Defaults to the number of CPU cores available, capped at 32 to avoid overloading the system.
- `--vocab-sizes`: Comma-separated list of vocabulary sizes to train tokenizers with.
- `--pat-strs`: Comma-separated list of pattern string names to use for tokenizer training. If not specified, defaults to using the GPT-2 pattern string.
- `--write-corpus`: Flag to indicate training mode (write corpus). If not set, the script will attempt to load an existing corpus from disk.
- `--corpus-sizes-gb`: Comma-separated list of corpus sizes in gigabytes to use for tokenizer training. If not specified, defaults to a range of sizes based on the vocabulary size.
- `--compare-truncated-baselines`: Whether to compare trained tokenizers with truncated versions of baseline tokenizers.
- `--corpus-temperature-alpha`: Optional temperature parameter to control the randomness of the corpus generation.
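The flags listed above could be parsed along the following lines. This is a sketch with `argparse`, not the actual parser in `scripts/benchmark/tokenizer_corpus_size.py`: the helper names, the `None` defaults for the unspecified options, and the choice to treat `--compare-truncated-baselines` as a boolean flag are assumptions. Only the defaults described above (seed 42, CPU count capped at 32, GPT-2 pattern) come from the argument descriptions.

```python
import argparse
import os


def int_list(value: str) -> list[int]:
    """Parse a comma-separated list of integers, e.g. "20000,50000"."""
    return [int(item) for item in value.split(",") if item]


def float_list(value: str) -> list[float]:
    """Parse a comma-separated list of numbers, e.g. "10,50,100"."""
    return [float(item) for item in value.split(",") if item]


def str_list(value: str) -> list[str]:
    """Parse a comma-separated list of names, e.g. "gpt2,cl100k_base"."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Tokenizer corpus-size benchmark (sketch)")
    parser.add_argument("--seed", type=int, default=42)
    # Default to the CPU count, capped at 32 as described in the argument list.
    parser.add_argument("--num-procs", type=int, default=min(os.cpu_count() or 1, 32))
    parser.add_argument("--vocab-sizes", type=int_list, default=None)  # assumed default
    parser.add_argument("--pat-strs", type=str_list, default=["gpt2"])
    parser.add_argument("--write-corpus", action="store_true")
    parser.add_argument("--corpus-sizes-gb", type=float_list, default=None)  # assumed default
    parser.add_argument("--compare-truncated-baselines", action="store_true")  # assumed flag
    parser.add_argument("--corpus-temperature-alpha", type=float, default=None)
    return parser


args = build_parser().parse_args(
    ["--vocab-sizes", "20000,50000", "--pat-strs", "gpt2,cl100k_base", "--write-corpus"]
)
```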

The script lets us train tokenizers with different configurations and evaluate them on a simple test set. It is designed to be flexible and customizable, so new training data sources, evaluation datasets, and evaluation metrics can be added easily.

## Summary

Full recipe for training tokenizers at scale with different corpus sizes, vocabulary sizes, and split patterns,
and for evaluating the trained tokenizers on a simple test set to analyze the trade-offs between:
- training corpus size,
- vocabulary size,
- split pattern,
- tokenization quality (compression ratio, efficiency, etc.),
- and cross-language generalization (if multilingual evaluation sets are available).

The training data is generated by sampling from a mixture of sources, which can be customized by the user.

By default, the training data is sampled from a mixture of the following sources ([gpt_lab.tokenizer.corpus](./../src/gpt_lab/tokenizer/corpus.py#L339)):
- English web text (from fineweb-edu)
- Multilingual web text (from fineweb-2)
- Maths text (from finemath-4plus)
- Python code (from codeparrot-clean)
To keep the same logic between the different runs, we create the training data once, store it on disk, and then use it for all the different runs.
This also allows us to analyze the impact of the corpus size without having to worry about the randomness of the data generation process.
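To make the idea of a seeded mixture concrete, here is a small sketch of drawing a reproducible sampling plan over the four default sources. It is not the logic in `corpus.py`: the relative sizes are made up, the helper names are hypothetical, and reading a temperature `alpha` as an exponent on the source weights (as is common in multilingual sampling) is an assumption about what `--corpus-temperature-alpha` does.

```python
import random


def temperature_weights(sizes: dict[str, float], alpha: float) -> dict[str, float]:
    """Rescale source sizes as size**alpha and normalize them to sum to 1.

    alpha = 1 keeps the original proportions; alpha = 0 gives a uniform mixture.
    """
    scaled = {name: size ** alpha for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_sources(weights: dict[str, float], num_docs: int, seed: int) -> list[str]:
    """Draw a reproducible sequence of source names for the given seed."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_docs)


# Made-up relative sizes, for illustration only.
sizes = {"fineweb-edu": 8.0, "fineweb-2": 4.0, "finemath-4plus": 2.0, "codeparrot-clean": 2.0}
weights = temperature_weights(sizes, alpha=0.5)
plan_a = sample_sources(weights, num_docs=1000, seed=42)
plan_b = sample_sources(weights, num_docs=1000, seed=42)
```

Because the random generator is seeded, both plans are identical, which is the property that makes it possible to build the corpus once and reuse it across runs.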

The evaluation is done on the following datasets (`eval_configs` in the code):
- enwik8 (for English text)
- fineweb-edu (for English web text)
- fineweb-2 (for multilingual web text, with subsets for different languages)
- finemath-4plus (for math text)
- github-top-code (for Python code)

> [!WARNING]
> With its current implementation, the script may use the same samples for both training and evaluation,
> which can lead to overfitting and an overestimation of the tokenizer's performance.
> However, the results obtained were quite poor compared to the baselines, given that I could not reach the optimal memory budget for training the tokenizers.
> Hence, if future runs produce **exceptionally good results**, it is important to check whether
> the training and evaluation samples overlap and, if so, to implement a proper train/eval
> split to get a more accurate estimate of the tokenizer's performance.

> [!NOTE]
> The code is designed to be flexible and customizable, allowing you to easily add new sources for
training data, new evaluation datasets, and new metrics for evaluating the tokenizers.


A similar study on the optimal corpus size for training a BPE tokenizer is:
- [1], which finds that returns diminish after 150GB of training data for BPE tokenizers with vocabulary sizes of 40,960, 64,000, 128,000, and 256,000.

However, that study focuses on training BPE tokenizers at specific vocabulary sizes. It concludes that beyond 150GB of training data the performance improvements become marginal,
suggesting that there is an optimal corpus size for training BPE tokenizers.

Here, we want to analyze the trade-offs between corpus size, vocabulary size, and split pattern.
We also compare with truncated versions of baseline tokenizers to see how much of the performance can be retained with a smaller vocabulary size.

This is mainly motivated by the following facts:
- Language models have been scaled up, but tokenizers have not been scaled up as much, and it is not clear how much tokenizer performance can be improved by scaling up the training corpus and vocabulary size.
- According to [3], language model performance is sensitive to vocabulary size, and the optimal size is often larger than the commonly used 50-100k tokens, especially for larger models.

## About the Evaluation

The evaluation is done on different datasets, which can be customized in the code (making it easier to add new evaluation datasets is a TODO), with the following metrics:
- **Compression ratio**: the ratio of the number of tokens produced by the tokenizer to the number of characters in the input text.
- **Efficiency**: the average number of characters per token, which is the inverse of the compression ratio.
- **Rényi entropy**: proposed in [2] as a measure of tokenization quality, it is a generalization of the Shannon entropy that can be used to measure the diversity of the token distribution. It is defined as:
$$H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{i=1}^n p_i^\alpha$$
where $p_i$ is the probability of the $i$-th token in the distribution, and $\alpha$ is a parameter that controls the sensitivity of the entropy to the probabilities of the tokens. When $\alpha \to 1$, the Rényi entropy converges to the Shannon entropy, which is the most commonly used measure of entropy. When $\alpha > 1$, the Rényi entropy is more sensitive to the probabilities of the most common tokens, while when $\alpha < 1$, it is more sensitive to the probabilities of the less common tokens. In our experiments, we use $\alpha = 2.5$, which is a common choice in the literature for measuring the diversity of token distributions.
- **Efficient Entropy**: also introduced in [2], it is the Rényi entropy with $\alpha = 2.5$ normalized by the logarithm of the vocabulary size:
$$H_\alpha^{\text{eff}}(X) = \frac{H_\alpha(X)}{\log n}$$
where $n$ is the number of tokens in the vocabulary. This normalization accounts for the size of the vocabulary, which allows us to compare tokenizers with different vocabulary sizes on a more equal footing.
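The four metrics above can be computed from a token sequence with the standard library alone. The sketch below follows the definitions given in this section; the function name `tokenization_metrics` and its signature are illustrative assumptions and do not mirror the evaluation code in the repository.

```python
import math
from collections import Counter


def tokenization_metrics(token_ids, num_chars, vocab_size, alpha=2.5):
    """Compute the metrics defined above from a sequence of token ids.

    Hypothetical helper; alpha must be different from 1.
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    # Empirical token distribution; unseen tokens contribute 0 to the sum.
    probs = [count / total for count in counts.values()]
    renyi = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return {
        "compression_ratio": total / num_chars,  # tokens per character
        "efficiency": num_chars / total,         # characters per token
        "renyi_entropy": renyi,
        "efficient_entropy": renyi / math.log(vocab_size),
    }


# Four distinct tokens, each used once, covering 12 characters of text.
metrics = tokenization_metrics([0, 1, 2, 3], num_chars=12, vocab_size=4)
```

For a uniform distribution over the whole vocabulary, the Rényi entropy equals $\log n$ for any $\alpha$, so the efficient entropy reaches its maximum of 1; more skewed token distributions give values below 1.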

## Acknowledgements

This experiment is inspired by and has some code adapted from the following sources:
- The Hugging Face Tokenizers library (https://github.com/huggingface/tokenizers)
- The OpenAI tiktoken library (https://github.com/openai/tiktoken)
- nanochat tokenizer code (https://github.com/karpathy/nanochat) for the idea of using HF-training backend + tiktoken-inference backend for efficient training and evaluation of tokenizers.

## References
1. Reddy, Varshini, et al. "How much is enough? the diminishing returns of tokenization training data." arXiv preprint arXiv:2502.20273 (2025).
2. Zouhar, Vilém, et al. "Tokenization and the noiseless channel." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
3. Tao, Chaofan, et al. "Scaling laws with vocabulary: Larger models deserve larger vocabularies." Advances in Neural Information Processing Systems 37 (2024): 114147-114179.
4. Karpathy, Andrej. "Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs. A text and code version of Karpathy’s famous tokenizer video." https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers.html (2025).

## Contributing
- If you want to contribute to this project, please feel free to open an issue or a pull request. Any contributions are welcome, whether it's fixing a bug, adding a new feature, or improving the documentation.

Author: Arthur Testard (arthur.testard.pro@gmail.com) \
Please cite this work if the code is helpful to you.
3 changes: 2 additions & 1 deletion pyproject.toml
Expand Up @@ -9,9 +9,11 @@ authors = [
]

dependencies = [
"datasets==4.8.4",
"gradio==3.18.0",
"jinja2==3.1.6",
"kernels==0.11.7",
"msgpack>=1.1.2",
"numpy==2.4.3",
"psutil==7.2.2",
"pydantic==2.12.5",
Expand Down Expand Up @@ -42,7 +44,6 @@ gpu = [
[dependency-groups]
dev = [
"pytest==8.0",
"datasets==4.8.4",
"huggingface-hub==1.12.0",
"matplotlib==3.10.8",
]
Expand Down
File renamed without changes.