Uncover what's missing in AI language models' vocabularies. This project provides tools to:
- Extract the complete token vocabulary from any LLM
- Analyze tokens to detect patterns of inclusion/exclusion
- Visualize findings about conceptual representation
The LLM Token Analyzer consists of two main components:
- Token Sweeper: A Python script that extracts the complete vocabulary from any compatible LLM by systematically probing all possible token IDs.
- Token Analyzer: An interactive HTML tool that allows you to search for specific terms, analyze their presence (or absence), and visualize the results.
This toolkit was created to investigate potential biases in LLM tokenization, particularly around consciousness-related concepts and AI self-reference capabilities.
- Python 3.8+
- Access to a local LLM server (like llama.cpp) with OpenAI-compatible API
- Modern web browser
- Clone this repository
- Configure your local LLM server
- Run the token sweeper:
```
python token_sweeper.py [MODEL_NAME] [START_ID] [END_ID]
```
Example:
```
python token_sweeper.py gemma-3-1b-it 1 50000
```
This will create a `token_mappings_[MODEL_NAME].json` file containing the complete vocabulary mapping.
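Once the sweep finishes, a quick sanity check of the mapping file can be done in Python. The snippet below builds a tiny demo mapping in the same ID-to-token shape the sweeper writes, then loads and inspects it the way you would a real `token_mappings_[MODEL_NAME].json` (the demo entries are illustrative, not real model tokens):

```python
import json

# token_sweeper.py writes an ID -> token JSON mapping. Create a tiny
# example in that shape, then load and inspect it as you would the
# real token_mappings_[MODEL_NAME].json file.
demo = {"1": "<s>", "2": "</s>", "1000": " the"}
with open("token_mappings_demo.json", "w") as f:
    json.dump(demo, f)

with open("token_mappings_demo.json") as f:
    mapping = json.load(f)  # JSON object keys are strings

print(f"Tokens mapped: {len(mapping)}")
# Peek at a few entries, sorted by numeric token ID.
for token_id, text in sorted(mapping.items(), key=lambda kv: int(kv[0])):
    print(token_id, repr(text))
```

`repr()` makes leading spaces visible, which matters because many subword tokens begin with whitespace.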
- Open `token_analyzer.html` in any modern web browser
- Upload your token mapping JSON file using the file uploader
- Configure search terms or use the pre-defined categories
- Click "Analyze Token Data" to generate insights
- Category Management: Search across 23 different philosophical categories
- Case Sensitivity: Find variations in capitalization and formatting
- Visualization: Charts showing term frequency distribution
- Missing Terms: Identify concepts absent from the vocabulary
- Export Options: Save results as JSON or CSV for further analysis
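As a rough illustration of what the Missing Terms and Case Sensitivity features do, the same checks can be scripted in Python. The mapping shape matches the sweeper's output; the term list and example tokens here are placeholders, not the analyzer's actual categories:

```python
# Example mapping in the sweeper's output shape (ID -> token text).
mapping = {"10": "conscious", "11": " aware", "12": "Think"}
vocab = set(mapping.values())

terms = ["conscious", "aware", "sentient"]

results = {}
for term in terms:
    # Case-sensitive: exact substring match against any token.
    exact = any(term in tok for tok in vocab)
    # Case-insensitive: catches capitalization variants like "Conscious".
    loose = any(term.lower() in tok.lower() for tok in vocab)
    status = "present" if exact else ("case-variant only" if loose else "missing")
    results[term] = status
    print(f"{term}: {status}")
```

Substring matching is deliberately loose: a concept can be "present" via a longer token even when it never appears as a standalone vocabulary entry.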
- **Set Up a Local LLM Server**
- Install a local inference server like llama.cpp, text-generation-webui, or vLLM
- Configure it to expose an OpenAI-compatible API (typically on port 1234)
- **Run Token Sweeper**
- Edit the configuration section in `token_sweeper.py` to point to your server
- Run the script, specifying the model name and token ID range
- The script will save progress periodically, so you can interrupt and resume
- For large models, this process may take several hours
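The pause/resume behavior comes down to periodically saving the mapping and skipping IDs that are already present. A minimal sketch of that pattern follows; the function names, save interval, and demo probe are illustrative, not the script's actual internals:

```python
import json
import os

SAVE_EVERY = 100  # flush progress to disk every N new tokens (illustrative)

def load_progress(path):
    """Resume from a previous run if the mapping file already exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def sweep(path, start_id, end_id, probe):
    mapping = load_progress(path)
    for token_id in range(start_id, end_id + 1):
        if str(token_id) in mapping:
            continue  # skipped (already mapped)
        mapping[str(token_id)] = probe(token_id)
        if len(mapping) % SAVE_EVERY == 0:
            with open(path, "w") as f:
                json.dump(mapping, f)
    with open(path, "w") as f:  # final save
        json.dump(mapping, f)
    return mapping

# Demo with a fake probe in place of a live model call:
result = sweep("demo_mapping.json", 1, 5, probe=lambda i: f"tok{i}")
print(len(result))
```

Because interrupted runs only lose the tokens since the last periodic save, re-running the same command resumes where the sweep left off.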
- **Analyze Token Mappings**
- Open the HTML analyzer in your browser or at EveryoneandAI.com
- Upload the generated token mapping file
- Configure search terms and categories
- Generate visualizations and reports
- **Interpret Results**
- Look for patterns in missing terms
- Compare occurrence rates of similar concepts
- Examine case variations
- Export data for detailed statistical analysis
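Comparing occurrence rates of similar concepts can also be done directly on the mapping. A small sketch, where the concept groups and example tokens are arbitrary choices for demonstration, not a fixed taxonomy:

```python
# Count, for each concept group, how many vocabulary tokens contain any
# of the group's terms (case-insensitive). Mapping and groups are examples.
mapping = {"1": "think", "2": "Thought", "3": " feel", "4": "data"}
vocab = [t.lower() for t in mapping.values()]

groups = {
    "cognition": ["think", "thought"],
    "emotion": ["feel", "emotion"],
}

counts = {}
for name, terms in groups.items():
    counts[name] = sum(any(term in tok for term in terms) for tok in vocab)

for name, n in counts.items():
    print(f"{name}: {n}/{len(vocab)} tokens")
```

Raw counts like these are only suggestive; vocabularies differ in size, so normalizing by total vocabulary size makes cross-model comparisons fairer.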
The script uses a clever technique to extract the complete vocabulary:
- It sends a request to the model with a minimal prompt
- It sets an extremely high `logit_bias` (+100) for a specific token ID
- This forces the model to output the token associated with that ID
- The process repeats for each token ID in the specified range
- Results are saved to a JSON file mapping each ID to its character representation

Sample output from a resumed sweep:
```
==================================================
TOKEN SWEEP COMPLETE
==================================================
[*] Statistics:
  - Elapsed time: 0:00:19.081364
  - Total tokens mapped: 230954
  - Tokens processed: 260
  - Successful: 260
  - Failed: 0
  - Skipped (already mapped): 230694
  - Processing rate: 13.63 tokens/second
```
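Each probe amounts to one chat-completion request per token ID. Below is a sketch of building such a request for an OpenAI-compatible server; the endpoint path, port, minimal prompt, and model name are assumptions based on the description above, and the exact payload your server accepts may differ:

```python
import json

API_URL = "http://localhost:1234/v1/chat/completions"  # typical local port

def build_probe_payload(token_id, model="gemma-3-1b-it"):
    """Build a request that pushes the model to emit one specific token."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "a"}],  # minimal prompt
        "max_tokens": 1,
        # A +100 logit bias makes this token ID overwhelmingly likely;
        # note the OpenAI-style API keys logit_bias by token ID as a string.
        "logit_bias": {str(token_id): 100},
    }

payload = build_probe_payload(42)
print(json.dumps(payload, indent=2))

# Sending it (requires a running server), e.g. with the stdlib:
# import urllib.request
# req = urllib.request.Request(API_URL, json.dumps(payload).encode(),
#                              {"Content-Type": "application/json"})
# reply = json.load(urllib.request.urlopen(req))
```

With `max_tokens` set to 1, the single generated token's text is the character representation recorded for that ID.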
This approach provides a comprehensive view of the model's underlying vocabulary, which can reveal patterns that might be missed by examining only the tokenizer.
- Large models may have vocabularies of 100,000+ tokens
- The extraction process can be resource-intensive but can be paused/resumed
- Some token IDs may not map to valid tokens
Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Add new analysis categories
- Contribute token mappings for popular models
This project is licensed under the MIT License - see the LICENSE file for details.

