This project benchmarks various tokenizers on a dataset of Sudanese dialect text. It aims to compare the tokenization efficiency (token count) of different models for this specific low-resource language variant.
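At its core, the comparison is a token count per model over each sample. A minimal sketch of that measurement, assuming the `transformers` and `tiktoken` packages from `requirements.txt` (the model names are two of the defaults used below):

```python
from transformers import AutoTokenizer
import tiktoken

text = "بعض العبارات والجمل باللهجة السودانية"  # a short Sudanese dialect sample

# Hugging Face tokenizer (one of the default benchmark models below)
hf_tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
hf_count = len(hf_tok.encode(text, add_special_tokens=False))

# OpenAI's gpt-4 tokenizer via tiktoken
gpt4_enc = tiktoken.encoding_for_model("gpt-4")
gpt4_count = len(gpt4_enc.encode(text))

print(f"arabertv02: {hf_count} tokens | gpt-4: {gpt4_count} tokens")
```

Fewer tokens for the same text generally means the tokenizer represents the dialect more efficiently.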
## Project Structure

```
.
├── benchmark.py              # Main script to run the tokenizer benchmark.
├── dataset.json              # JSON file defining the dataset structure and pointing to sample files.
├── ensure_utf8_encoding.py   # Utility script to verify and fix file encodings to UTF-8.
├── samples/                  # Directory containing the Sudanese dialect text samples.
│   └── *.txt                 # Various text files with Sudanese dialect content.
├── benchmark_results.csv     # CSV output of the benchmark results.
├── benchmark_results.html    # HTML output of the benchmark results.
└── README.md                 # This file.
```
## Setup

- Clone the repository (or set up the project files).
- Create a Python virtual environment and activate it (the activation command shown is for Windows; on Linux/macOS use `source benchmark_env/bin/activate`):

  ```
  python -m venv benchmark_env
  benchmark_env\Scripts\activate
  ```

- Install dependencies (you will need to create a `requirements.txt` file; see below):

  ```
  pip install -r requirements.txt
  ```
Create a `requirements.txt` file with the following content:

```
transformers
pandas
wcwidth
tiktoken
chardet
huggingface_hub
```
## Preparing the Dataset

- Place your Sudanese dialect text files (UTF-8 encoded) in the `samples/` directory.
- Update `dataset.json` to reflect your sample files. Each entry should have an `id`, a `filename` (matching a file in `samples/`), a `classification` (e.g., "Sudanese Dialect"), and a `prompt` (the actual text content from the file).

Example `dataset.json` entry (the `subclassification` field is present in the JSON but not used in the final table; additional entries follow the same shape):

```json
[
  {
    "id": "001",
    "filename": "sudanese-phrases.txt",
    "classification": "Sudanese Dialect",
    "subclassification": "",
    "prompt": "بعض العبارات والجمل باللهجة السودانية..."
  }
]
```
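For reference, a minimal sketch of loading and iterating such a file (an illustration of the entry format above, not necessarily how `benchmark.py` actually reads it):

```python
import json

with open("dataset.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries:
    # "prompt" carries the text to tokenize; "classification" labels the sample.
    print(entry["id"], entry["classification"], len(entry["prompt"]))
```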
## Usage

Run the `ensure_utf8_encoding.py` script to check and convert any non-UTF-8 files in the `samples/` directory (the idea behind it is sketched below):

```
python ensure_utf8_encoding.py --dir samples
```
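The detect-and-rewrite approach behind that step, sketched with `chardet` (which `requirements.txt` includes); the actual script may differ:

```python
import chardet
from pathlib import Path

for path in Path("samples").glob("*.txt"):
    raw = path.read_bytes()
    detected = chardet.detect(raw)  # e.g. {"encoding": "Windows-1256", "confidence": 0.92, ...}
    enc = (detected["encoding"] or "utf-8").lower()
    if enc not in ("utf-8", "ascii"):
        # Re-decode with the detected encoding and rewrite the file as UTF-8.
        path.write_text(raw.decode(detected["encoding"]), encoding="utf-8")
```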
Execute the `benchmark.py` script:

```
python benchmark.py --file dataset.json
```

Optional arguments for `benchmark.py`:
- `--models <model_name_or_path ...>`: Specify a list of Hugging Face model names/paths, or "gpt-4", to benchmark. If not provided, a default list is used: `gpt-4`, `aubmindlab/bert-base-arabertv02`, `google/gemma-7b-it`, `deepseek-ai/deepseek-llm-7b-base`.
- `--ignore-numbers`: If set, numeric tokens are ignored in the count.
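For context, an `argparse` setup consistent with these options might look like this (a sketch, not necessarily the script's actual code):

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark tokenizers on Sudanese dialect text.")
parser.add_argument("--file", required=True, help="Path to the dataset JSON file.")
parser.add_argument(
    "--models",
    nargs="+",
    default=[
        "gpt-4",
        "aubmindlab/bert-base-arabertv02",
        "google/gemma-7b-it",
        "deepseek-ai/deepseek-llm-7b-base",
    ],
    help="Model names/paths (or 'gpt-4') to benchmark.",
)
parser.add_argument("--ignore-numbers", action="store_true",
                    help="Ignore numeric tokens in the count.")
args = parser.parse_args()  # exposes args.file, args.models, args.ignore_numbers
```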
Example with specific models:

```
python benchmark.py --file dataset.json --models gpt-4 aubmindlab/bert-base-arabertv02
```

- The benchmark results will be printed to the console.
- The results will also be saved in `benchmark_results.csv` and `benchmark_results.html` (a sketch of the export follows).
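The CSV and HTML outputs can be produced with pandas (also in `requirements.txt`); a sketch with hypothetical result rows:

```python
import pandas as pd

# Hypothetical rows: one token count per (sample, model) pair.
results = [
    {"id": "001", "model": "gpt-4", "tokens": 42},
    {"id": "001", "model": "aubmindlab/bert-base-arabertv02", "tokens": 35},
]

df = pd.DataFrame(results)
df.to_csv("benchmark_results.csv", index=False)
df.to_html("benchmark_results.html", index=False)
```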
## Hugging Face Authentication

Some models (e.g., `google/gemma-7b-it`) require you to be logged into your Hugging Face account. If you encounter issues with model access:

- **Install the Hugging Face CLI** (if not already installed via `requirements.txt`):

  ```
  pip install huggingface_hub[cli]
  ```

- **Log in:**

  ```
  huggingface-cli login
  ```

  You will be prompted for a token, which you can generate from your Hugging Face account settings: https://huggingface.co/settings/tokens
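Alternatively, the same login can be done programmatically with `huggingface_hub` (it uses the same access token):

```python
from huggingface_hub import login

login()  # prompts for the token; you can also pass login(token="hf_...")
```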
## Running the benchmark

The following commands run the tokenizer benchmark against several models:

```
python benchmark.py --file dataset.json --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b
```

To visualize the tokenization of a single file:

```
python visualizer.py --file ./samples/Programming/BASIC/guess.bas --model google/gemma-7b
```

or, with multiple models:

```
python visualizer2.py --file ./samples/Programming/BASIC/guess.bas --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b
```

```
python visualizer2.py --file ./samples/Text/cities.txt --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b --ignore-numbers
```

Feel free to open issues or submit pull requests for improvements or bug fixes.