LLM Token Vacabulary Analyzer

Uncover what's missing in AI language models' vocabularies. This project provides tools to:

Extract the complete token vocabulary from any LLM
Analyze tokens to detect patterns of inclusion/exclusion
Visualize findings about conceptual representation

📋 Overview

The LLM Token Analyzer consists of two main components:

Token Sweeper: A Python script that extracts the complete vocabulary from any compatible LLM by systematically probing all possible token IDs.
Token Analyzer: An interactive HTML tool that allows you to search for specific terms, analyze their presence (or absence), and visualize the results.

This toolkit was created to investigate potential biases in LLM tokenization, particularly around consciousness-related concepts and AI self-reference capabilities.

🚀 Getting Started

Prerequisites

Python 3.8+
Access to a local LLM server (like llama.cpp) with OpenAI-compatible API
Modern web browser

Extracting Token Vocabularies

Clone this repository
Configure your local LLM server
Run the token sweeper:

python token_sweeper.py [MODEL_NAME] [START_ID] [END_ID]

Example:

python token_sweeper.py gemma-3-1b-it 1 50000

This will create a token_mappings_[MODEL_NAME].json file containing the complete vocabulary mapping.

Running the Analyzer

Open token_analyzer.html in any modern web browser
Upload your token mapping JSON file using the file uploader
Configure search terms or use the pre-defined categories
Click "Analyze Token Data" to generate insights

📊 Analysis Features

Category Management: Search across 23 different philosophical categories
Case Sensitivity: Find variations in capitalization and formatting
Visualization: Charts showing term frequency distribution
Missing Terms: Identify concepts absent from the vocabulary
Export Options: Save results as JSON or CSV for further analysis

🔍 Workflow Guide

Complete Token Discovery Workflow

Set Up a Local LLM Server
- Install a local inference server like llama.cpp, text-generation-webui, or vLLM
- Configure it to expose an OpenAI-compatible API (typically on port 1234)
Run Token Sweeper
- Edit the configuration section in token_sweeper.py to point to your server
- Run the script, specifying the model name and token ID range
- The script will save progress periodically, so you can interrupt and resume
- For large models, this process may take several hours
Analyze Token Mappings
- Open the HTML analyzer in your browser or at EveryoneandAI.com
- Upload the generated token mapping file
- Configure search terms and categories
- Generate visualizations and reports
Interpret Results
- Look for patterns in missing terms
- Compare occurrence rates of similar concepts
- Examine case variations
- Export data for detailed statistical analysis

How Token Sweeper Works

The script uses a clever technique to extract the complete vocabulary:

==================================================
TOKEN SWEEP COMPLETE
==================================================

[*] Statistics:
    - Elapsed time: 0:00:19.081364
    - Total tokens mapped: 230954
    - Tokens processed: 260
    - Successful: 260
    - Failed: 0
    - Skipped (already mapped): 230694
    - Processing rate: 13.63 tokens/second

It sends a request to the model with a minimal prompt
It sets an extremely high logit_bias (+100) for a specific token ID
This forces the model to output the token associated with that ID
The process repeats for each token ID in the specified range
Results are saved to a JSON file mapping each ID to its character representation

This approach provides a comprehensive view of the model's underlying vocabulary, which can reveal patterns that might be missed by examining only the tokenizer.

📝 Notes

Large models may have vocabularies of 100,000+ tokens
The extraction process can be resource-intensive but can be paused/resumed
Some token IDs may not map to valid tokens

📣 Contribute

Contributions are welcome! Feel free to:

Report bugs
Suggest new features
Add new analysis categories
Contribute token mappings for popular models

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
token_mappings		token_mappings
EXAMPLE_token_analysis_results.json		EXAMPLE_token_analysis_results.json
LICENSE		LICENSE
README.md		README.md
consciousness_terms.json		consciousness_terms.json
html-analyzer.html		html-analyzer.html
token_sweeper.py		token_sweeper.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Token Vacabulary Analyzer

📋 Overview

🚀 Getting Started

Prerequisites

Extracting Token Vocabularies

Running the Analyzer

📊 Analysis Features

🔍 Workflow Guide

Complete Token Discovery Workflow

How Token Sweeper Works

📝 Notes

📣 Contribute

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLM Token Vacabulary Analyzer

📋 Overview

🚀 Getting Started

Prerequisites

Extracting Token Vocabularies

Running the Analyzer

📊 Analysis Features

🔍 Workflow Guide

Complete Token Discovery Workflow

How Token Sweeper Works

📝 Notes

📣 Contribute

📜 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages