This project benchmarks various tokenizers on a dataset of Sudanese dialect text. It aims to compare the tokenization efficiency (token count) of different models for this specific low-resource language variant.
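At its core, the comparison is a token count per model over each sample. A minimal sketch of that measurement, assuming the `transformers` and `tiktoken` packages from `requirements.txt` (the model names are two of the defaults used below):

```python
from transformers import AutoTokenizer
import tiktoken

text = "بعض العبارات والجمل باللهجة السودانية"  # a short Sudanese dialect sample

# Hugging Face tokenizer (one of the default benchmark models below)
hf_tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
hf_count = len(hf_tok.encode(text, add_special_tokens=False))

# OpenAI's gpt-4 tokenizer via tiktoken
gpt4_enc = tiktoken.encoding_for_model("gpt-4")
gpt4_count = len(gpt4_enc.encode(text))

print(f"arabertv02: {hf_count} tokens | gpt-4: {gpt4_count} tokens")
```

Fewer tokens for the same text generally means the tokenizer represents the dialect more efficiently.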
## Project Structure

```
.
├── benchmark.py              # Main script to run the tokenizer benchmark.
├── dataset.json              # JSON file defining the dataset structure and pointing to sample files.
├── ensure_utf8_encoding.py   # Utility script to verify and fix file encodings to UTF-8.
├── samples/                  # Directory containing the Sudanese dialect text samples.
│   └── *.txt                 # Various text files with Sudanese dialect content.
├── benchmark_results.csv     # CSV output of the benchmark results.
├── benchmark_results.html    # HTML output of the benchmark results.
└── README.md                 # This file.
```
## Setup

- Clone the repository (or set up the project files).
- Create a Python virtual environment and activate it (the activation command shown is for Windows; on Linux/macOS use `source benchmark_env/bin/activate`):

  ```
  python -m venv benchmark_env
  benchmark_env\Scripts\activate
  ```

- Install dependencies (you will need to create a `requirements.txt` file; see below):

  ```
  pip install -r requirements.txt
  ```
Create a `requirements.txt` file with the following content:

```
transformers
pandas
wcwidth
tiktoken
chardet
huggingface_hub
```
## Preparing the Dataset

- Place your Sudanese dialect text files (UTF-8 encoded) in the `samples/` directory.
- Update `dataset.json` to reflect your sample files. Each entry should have an `id`, a `filename` (matching a file in `samples/`), a `classification` (e.g., "Sudanese Dialect"), and a `prompt` (the actual text content from the file).

Example `dataset.json` entry (the `subclassification` field is present in the JSON but not used in the final table; additional entries follow the same shape):

```json
[
  {
    "id": "001",
    "filename": "sudanese-phrases.txt",
    "classification": "Sudanese Dialect",
    "subclassification": "",
    "prompt": "بعض العبارات والجمل باللهجة السودانية..."
  }
]
```
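For reference, a minimal sketch of loading and iterating such a file (an illustration of the entry format above, not necessarily how `benchmark.py` actually reads it):

```python
import json

with open("dataset.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries:
    # "prompt" carries the text to tokenize; "classification" labels the sample.
    print(entry["id"], entry["classification"], len(entry["prompt"]))
```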
## Usage

Run the `ensure_utf8_encoding.py` script to check and convert any non-UTF-8 files in the `samples/` directory (the idea behind it is sketched below):

```
python ensure_utf8_encoding.py --dir samples
```
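The detect-and-rewrite approach behind that step, sketched with `chardet` (which `requirements.txt` includes); the actual script may differ:

```python
import chardet
from pathlib import Path

for path in Path("samples").glob("*.txt"):
    raw = path.read_bytes()
    detected = chardet.detect(raw)  # e.g. {"encoding": "Windows-1256", "confidence": 0.92, ...}
    enc = (detected["encoding"] or "utf-8").lower()
    if enc not in ("utf-8", "ascii"):
        # Re-decode with the detected encoding and rewrite the file as UTF-8.
        path.write_text(raw.decode(detected["encoding"]), encoding="utf-8")
```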
Execute the `benchmark.py` script:

```
python benchmark.py --file dataset.json
```

Optional arguments for `benchmark.py`:
- `--models <model_name_or_path ...>`: Specify a list of Hugging Face model names/paths, or "gpt-4", to benchmark. If not provided, a default list is used: `gpt-4`, `aubmindlab/bert-base-arabertv02`, `google/gemma-7b-it`, `deepseek-ai/deepseek-llm-7b-base`.
- `--ignore-numbers`: If set, numeric tokens are ignored in the count.
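For context, an `argparse` setup consistent with these options might look like this (a sketch, not necessarily the script's actual code):

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark tokenizers on Sudanese dialect text.")
parser.add_argument("--file", required=True, help="Path to the dataset JSON file.")
parser.add_argument(
    "--models",
    nargs="+",
    default=[
        "gpt-4",
        "aubmindlab/bert-base-arabertv02",
        "google/gemma-7b-it",
        "deepseek-ai/deepseek-llm-7b-base",
    ],
    help="Model names/paths (or 'gpt-4') to benchmark.",
)
parser.add_argument("--ignore-numbers", action="store_true",
                    help="Ignore numeric tokens in the count.")
args = parser.parse_args()  # exposes args.file, args.models, args.ignore_numbers
```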
Example with specific models:

```
python benchmark.py --file dataset.json --models gpt-4 aubmindlab/bert-base-arabertv02
```

- The benchmark results will be printed to the console.
- The results will also be saved in `benchmark_results.csv` and `benchmark_results.html` (a sketch of the export follows).
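The CSV and HTML outputs can be produced with pandas (also in `requirements.txt`); a sketch with hypothetical result rows:

```python
import pandas as pd

# Hypothetical rows: one token count per (sample, model) pair.
results = [
    {"id": "001", "model": "gpt-4", "tokens": 42},
    {"id": "001", "model": "aubmindlab/bert-base-arabertv02", "tokens": 35},
]

df = pd.DataFrame(results)
df.to_csv("benchmark_results.csv", index=False)
df.to_html("benchmark_results.html", index=False)
```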
## Hugging Face Authentication

Some models (e.g., `google/gemma-7b-it`) require you to be logged into your Hugging Face account. If you encounter issues with model access:

- **Install the Hugging Face CLI** (if not already installed via `requirements.txt`):

  ```
  pip install huggingface_hub[cli]
  ```

- **Log in:**

  ```
  huggingface-cli login
  ```

  You will be prompted for a token, which you can generate from your Hugging Face account settings: https://huggingface.co/settings/tokens
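Alternatively, the same login can be done programmatically with `huggingface_hub` (it uses the same access token):

```python
from huggingface_hub import login

login()  # prompts for the token; you can also pass login(token="hf_...")
```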
## Running the benchmark

The following commands run the tokenizer benchmark against several models:

```
python benchmark.py --file dataset.json --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b
```

To visualize the tokenization of a single file:

```
python visualizer.py --file ./samples/Programming/BASIC/guess.bas --model google/gemma-7b
```

or, with multiple models:

```
python visualizer2.py --file ./samples/Programming/BASIC/guess.bas --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b
```

```
python visualizer2.py --file ./samples/Text/cities.txt --models mistralai/Mistral-7B-v0.1 gpt-4 google/gemma-7b --ignore-numbers
```

Feel free to open issues or submit pull requests for improvements or bug fixes.