Tokenizer refactor #10
Merged
24 commits
c361255 (art-test-stack) tokenizer: added clamp init option for optimal tokenizer size
904cdf6 (art-test-stack) tokenizer: fix test
bed8100 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
d634087 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
2d85e62 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
a3e536b (art-test-stack) tokenizer encoder note
b27cb51 (art-test-stack) tokenizer: clamp to truncate
14ad038 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
e605beb (art-test-stack) list special tokens
6a65794 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
9219505 (art-test-stack) Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…
2b7ac79 (art-test-stack) tokenizer refactoring: serialization + truncation
5534413 (art-test-stack) tokenizer: auto + hf + trainer config refactoring
3dca1a3 (art-test-stack) tokenizer: fix corpus import + hf log error + test
bde2912 (art-test-stack) tokenizer corpus: introduce byte control over char control
2a1f573 (art-test-stack) tokenizer training config: integrates trainer parameters to tokenizer…
05f986e (art-test-stack) tokenizer: some function calls fixes
8c587cd (art-test-stack) tokenizer: fix sp token in truncated tokenizer + scaling tok header +…
d5d8f21 (art-test-stack) tokenizer: eval with renyi and efficient entropy
0b7ee67 (art-test-stack) tokenizer: split tech report from script+impl args and improve storag…
b2a3008 (art-test-stack) add minimal fixes
5385b95 (art-test-stack) fix test with wrong tokenizer.auto.build_or_load_tokenizer signature
6006769 (art-test-stack) tokenizer: readme
083f2f2 (art-test-stack) tokenizer: readme + minor fixes
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| # ByteLevel BPE Scaling Experiments | ||
|
|
||
| This document describes the experiments conducted to analyze how ByteLevel BPE tokenization scales with different corpus sizes, vocabulary sizes, and split patterns. The goal is to understand the trade-offs between these factors and their impact on tokenization quality and efficiency. | ||
|
|
||
| Note that the results obtained so far are quite poor compared to the baselines, because I could not reach the memory budget needed to train the tokenizers at the intended scale. | ||
|
|
||
| To run the experiments, run the following command from the root directory of the repo: | ||
| ```bash | ||
| # --seed 42 is passed for reproducibility | ||
| uv run python -m scripts.benchmark.tokenizer_corpus_size \ | ||
| --seed 42 \ | ||
| --num-procs 16 \ | ||
| --vocab-sizes 20000,50000,100000 \ | ||
| --pat-strs gpt2,cl100k_base \ | ||
| --write-corpus \ | ||
| --corpus-sizes-gb 10,50,100,500,1000,5000,10000 | ||
| ``` | ||
|
|
||
| Args: | ||
| - `--seed`: Random seed for reproducibility. Default is 42. | ||
| - `--num-procs`: Number of processes to use for tokenizer training. Defaults to the number of CPU cores available, capped at 32 to avoid overloading the system. | ||
| - `--vocab-sizes`: Comma-separated list of vocabulary sizes to train tokenizers with. | ||
| - `--pat-strs`: Comma-separated list of pattern string names to use for tokenizer training. If not specified, defaults to using the GPT-2 pattern string. | ||
| - `--write-corpus`: Flag to indicate training mode (write corpus). If not set, the script will attempt to load an existing corpus from disk. | ||
| - `--corpus-sizes-gb`: Comma-separated list of corpus sizes in gigabytes to use for tokenizer training. If not specified, defaults to a range of sizes based on the vocabulary size. | ||
| - `--compare-truncated-baselines`: Whether to compare trained tokenizers with truncated versions of baseline tokenizers. | ||
| - `--corpus-temperature-alpha`: Optional temperature parameter to control the randomness of the corpus generation. | ||
|
|
||
| This script trains tokenizers with different configurations and evaluates them on a simple test set. It is designed to be flexible and customizable, so that new training sources, evaluation datasets, and evaluation metrics can be added easily. | ||
|
|
||
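As a rough illustration of the size of the experiment, the sketch below expands the comma-separated arguments from the example command into a run grid. It assumes that one tokenizer is trained per (corpus size, vocabulary size, pattern) combination, which the script may or may not do exactly; `parse_csv` is a hypothetical helper, not a function from the repository.

```python
from itertools import product


def parse_csv(value, cast=str):
    """Split a comma-separated CLI value such as '20000,50000' into a typed list."""
    return [cast(item.strip()) for item in value.split(",") if item.strip()]


# Values copied from the example command above.
vocab_sizes = parse_csv("20000,50000,100000", int)
pat_strs = parse_csv("gpt2,cl100k_base")
corpus_sizes_gb = parse_csv("10,50,100,500,1000,5000,10000", int)

# Assumption: one training run per combination of the three axes.
grid = list(product(corpus_sizes_gb, vocab_sizes, pat_strs))

print(len(grid))  # 7 corpus sizes * 3 vocabulary sizes * 2 patterns
print(grid[0])    # first combination in the grid
```

Under that assumption, the example command corresponds to 42 training runs, which is worth keeping in mind when choosing `--num-procs` and the largest corpus sizes.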
| ## Summary | ||
|
|
||
| A full recipe for training tokenizers across different corpus sizes, vocabulary sizes, and split patterns, | ||
| and for evaluating the trained tokenizers on a simple test set to analyze the trade-offs between: | ||
| - training corpus size, | ||
| - vocabulary size, | ||
| - split pattern, | ||
| - tokenization quality (compression ratio, efficiency, etc.), | ||
| - and cross-language generalization (if multilingual evaluation sets are available). | ||
|
|
||
| The training data is generated by sampling from a mixture of sources, which can be customized by the user. | ||
|
|
||
| By default, the training data is sampled from a mixture of the following sources ([gpt_lab.tokenizer.corpus](./../src/gpt_lab/tokenizer/corpus.py#L339)): | ||
| - English web text (from fineweb-edu) | ||
| - Multilingual web text (from fineweb-2) | ||
| - Maths text (from finemath-4plus) | ||
| - Python code (from codeparrot-clean) | ||
|
|
||
| To keep the runs comparable, we create the training data once, store it on disk, and reuse it for all runs. | ||
| This also lets us analyze the impact of the corpus size without the randomness of the data generation process acting as a confounder. | ||
|
|
||
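One way to picture the mixture is as a byte budget split across the sources. The sketch below is only an illustration: the weights are hypothetical, a gigabyte is taken as 10^9 bytes, and the exponent `alpha` is one common reading of a temperature parameter (budget proportional to `weight ** alpha`), not necessarily what `--corpus-temperature-alpha` does in the script.

```python
def byte_budgets(weights, total_bytes, alpha=1.0):
    """Split total_bytes across sources proportionally to weight ** alpha.

    alpha = 1 keeps the original proportions, alpha = 0 gives every source
    the same budget, and values in between flatten the mixture.
    """
    scaled = {name: float(w) ** alpha for name, w in weights.items()}
    norm = sum(scaled.values())
    return {name: int(total_bytes * s / norm) for name, s in scaled.items()}


# Hypothetical relative weights, chosen only for the example.
weights = {
    "fineweb-edu": 5,
    "fineweb-2": 3,
    "finemath-4plus": 1,
    "codeparrot-clean": 1,
}

total = 10 * 10**9  # a 10 GB corpus, assuming 1 GB = 10^9 bytes
budgets = byte_budgets(weights, total)
flat_budgets = byte_budgets(weights, total, alpha=0.0)

for name, size in budgets.items():
    print(f"{name}: {size / 1e9:.1f} GB")
```

Computing the budgets once and writing the resulting corpus to disk matches the idea of reusing the same data for every run.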
| The evaluation is done on the following datasets (eval_configs in the code): | ||
| - enwik8 (for English text) | ||
| - fineweb-edu (for English web text) | ||
| - fineweb-2 (for multilingual web text, with subsets for different languages) | ||
| - finemath-4plus (for math text) | ||
| - github-top-code (for Python code) | ||
|
|
||
| > [!WARNING] | ||
| > With its current implementation, the script may use the same samples for both training and evaluation, | ||
| > which can lead to overfitting and an overestimation of the tokenizer's performance. | ||
| > However, the results obtained were quite poor compared to the baselines, given that I could not reach the optimal memory budget for training the tokenizers. | ||
| > Hence, if future runs produce **exceptionally good results**, it will be important to check whether | ||
| > the training and evaluation samples overlap and, if so, to implement a proper train/eval | ||
| > split to get a more accurate estimate of the tokenizer's performance. | ||
|
|
||
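A simple way to check for the overlap mentioned in the warning is to fingerprint every sample and intersect the two sets. The sketch below is a minimal version under stated assumptions: samples are plain strings, only exact duplicates (after whitespace normalization) are detected, and `fingerprint` and `split_overlap` are hypothetical helpers rather than functions from the repository.

```python
import hashlib


def fingerprint(text):
    """Stable hash of a whitespace-normalized sample."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_overlap(train_samples, eval_samples):
    """Return the eval samples whose fingerprint also appears in the training set."""
    seen = {fingerprint(sample) for sample in train_samples}
    return [sample for sample in eval_samples if fingerprint(sample) in seen]


train = ["the cat sat", "def f(x):\n    return x"]
evals = ["the  cat sat", "a new sentence"]

leaked = split_overlap(train, evals)
print(f"{len(leaked)} of {len(evals)} eval samples also appear in training")
```

Near-duplicates would need a fuzzier comparison (for example n-gram overlap), which this sketch does not attempt.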
| > [!NOTE] | ||
| > The code is designed to be flexible and customizable, allowing you to easily add new sources for | ||
| training data, new evaluation datasets, and new metrics for evaluating the tokenizers. | ||
|
|
||
|
|
||
| A similar study investigates the optimal corpus size for training a BPE tokenizer: | ||
| - [1] finds that the returns diminish after 150GB of training data, for BPE tokenizers with vocabulary sizes of 40,960, 64,000, 128,000, and 256,000. | ||
|
|
||
| However, that study focuses on training BPE tokenizers at a few fixed vocabulary sizes. It concludes that beyond 150GB of training data the performance improvements become marginal, | ||
| suggesting that there is an optimal corpus size for training BPE tokenizers. | ||
|
|
||
| Here, we want to analyze the trade-offs between corpus size, vocabulary size, and split pattern. | ||
| We also compare with truncated versions of baseline tokenizers to see how much of the performance can be retained with a smaller vocabulary size. | ||
|
|
||
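To make the idea of a truncated baseline concrete, the sketch below cuts a byte-level BPE vocabulary down to its lowest-ranked entries. It assumes the tiktoken-style representation in which each token (as bytes) maps to a rank, ranks are contiguous from 0, the 256 single bytes occupy the first ranks, and merges are ranked in the order they were learned. `truncate_ranks` is a hypothetical helper and the vocabulary is a toy one; special tokens are ignored here and would need to be re-indexed separately.

```python
def truncate_ranks(mergeable_ranks, vocab_size):
    """Keep the vocab_size lowest-ranked tokens of a byte-level BPE vocabulary."""
    if vocab_size < 256:
        raise ValueError("a byte-level vocabulary needs at least the 256 single bytes")
    return {tok: rank for tok, rank in mergeable_ranks.items() if rank < vocab_size}


# Toy vocabulary: 256 single bytes followed by three merges in learned order.
ranks = {bytes([i]): i for i in range(256)}
ranks.update({b"th": 256, b"he": 257, b"the": 258})

small = truncate_ranks(ranks, 257)
print(len(small))       # size of the truncated vocabulary
print(b"th" in small)   # earliest merge is kept
print(b"the" in small)  # later merge is dropped
```

Because a merge can only combine tokens that were created earlier, dropping the highest ranks leaves a vocabulary that is still closed under its remaining merges.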
| This is mainly motivated by the following facts: | ||
| - Language models have been scaled up, but tokenizer sizes have not been scaled up as much, and it is not clear how much tokenizer performance can be improved by scaling up the tokenizer training corpus and the vocabulary size. | ||
| - According to [3], language model performance is sensitive to tokenizer size, and the optimal size is often larger than the commonly used 50-100k tokens, especially for larger models. | ||
|
|
||
| ## About the Evaluation | ||
|
|
||
| The evaluation is done on several datasets, which can be customized in the code (making it easier to add new evaluation datasets is a TODO), with the following metrics: | ||
| - **Compression ratio**: the ratio of the number of tokens produced by the tokenizer to the number of characters in the input text. | ||
| - **Efficiency**: the average number of characters per token, which is the inverse of the compression ratio. | ||
| - **Rényi entropy**: used in [2] to evaluate tokenizers, it is a generalization of the Shannon entropy that measures the diversity of the token distribution. It is defined as: | ||
| $$H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{i=1}^n p_i^\alpha$$ | ||
| where $p_i$ is the probability of the $i$-th token in the distribution, and $\alpha$ is a parameter that controls the sensitivity of the entropy to the token probabilities. When $\alpha \to 1$, the Rényi entropy converges to the Shannon entropy. When $\alpha > 1$, it is more sensitive to the probabilities of the most common tokens, while when $\alpha < 1$, it is more sensitive to the probabilities of the less common tokens. In our experiments, we use $\alpha = 2.5$, following [2]. | ||
| - **Efficient Entropy**: also from [2], it is the Rényi entropy with $\alpha = 2.5$ scaled by the logarithm of the vocabulary size: | ||
| $$H_\alpha^{\text{eff}}(X) = \frac{H_\alpha(X)}{\log n}$$ | ||
| where $n$ is the number of tokens in the vocabulary. This normalization takes the size of the vocabulary into account, which allows us to compare tokenizers with different vocabulary sizes on a more equal footing. | ||
|
|
||
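The four metrics above can be computed directly from a token sequence. The sketch below is a minimal reference version, assuming the token probabilities $p_i$ are estimated from empirical token frequencies on the evaluation text; the function names are illustrative and are not the ones used in the repository.

```python
import math
from collections import Counter


def compression_ratio(token_ids, text):
    """Number of tokens divided by number of characters."""
    return len(token_ids) / len(text)


def efficiency(token_ids, text):
    """Average number of characters per token (inverse of the compression ratio)."""
    return len(text) / len(token_ids)


def renyi_entropy(token_ids, alpha=2.5):
    """Rényi entropy of the empirical token distribution, in nats."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1: Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def efficient_entropy(token_ids, vocab_size, alpha=2.5):
    """Rényi entropy normalized by the log of the vocabulary size."""
    return renyi_entropy(token_ids, alpha) / math.log(vocab_size)


# Toy example: 8 characters encoded as 4 distinct tokens from a 4-token vocabulary.
text = "abcdefgh"
token_ids = [0, 1, 2, 3]

print(compression_ratio(token_ids, text))
print(efficiency(token_ids, text))
print(renyi_entropy(token_ids))
print(efficient_entropy(token_ids, vocab_size=4))
```

In this toy case the token distribution is uniform over the whole vocabulary, so the Rényi entropy equals $\log 4$ for any $\alpha$ and the efficient entropy reaches its maximum of 1. Tokens that never occur contribute nothing to the sum, so they lower the efficient entropy only through the $\log n$ denominator.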
| ## Acknowledgements | ||
|
|
||
| This experiment is inspired by and has some code adapted from the following sources: | ||
| - The Hugging Face Tokenizers library (https://github.com/huggingface/tokenizers) | ||
| - The OpenAI tiktoken library (https://github.com/openai/tiktoken) | ||
| - nanochat tokenizer code (https://github.com/karpathy/nanochat) for the idea of using HF-training backend + tiktoken-inference backend for efficient training and evaluation of tokenizers. | ||
|
|
||
| ## References | ||
| 1. Reddy, Varshini, et al. "How much is enough? the diminishing returns of tokenization training data." arXiv preprint arXiv:2502.20273 (2025). | ||
| 2. Zouhar, Vilém, et al. "Tokenization and the noiseless channel." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. | ||
| 3. Tao, Chaofan, et al. "Scaling laws with vocabulary: Larger models deserve larger vocabularies." Advances in Neural Information Processing Systems 37 (2024): 114147-114179. | ||
| 4. Karpathy, Andrej. "Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs. A text and code version of Karpathy’s famous tokenizer video." https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers.html (2025). | ||
|
|
||
| ## Contributing | ||
| - If you want to contribute to this project, please feel free to open an issue or a pull request. Any contributions are welcome, whether it's fixing a bug, adding a new feature, or improving the documentation. | ||
|
|
||
| Author: Arthur Testard (arthur.testard.pro@gmail.com) \ | ||
| Please cite this work if the code is helpful to you. |