Character Prefix Conditioning with Back Tokenization

This repository implements a solution for Character Prefix Conditioning when working with language model tokenizers.

Read about the implementation in my blog post.

Overview

High-Level Approach The strategy involves four main steps:

Create a trie data structure from the tokenizer vocabulary with the ability to take a string and get all tokens that has the the string as the prefix
For a given user input, search from the right to left of the last work determined by the regex pattern. Move character by character from left to right accumulating the prefix and for each prefix, find possible token completions that have the prefix
Filter token candidates to ensure they maintain valid token boundaries
Return a list of possible completion paths for the model to consider

Trie Implementation

At the core of our solution is a trie (prefix tree) data structure optimized for byte-level searching. Each node in the trie represents a byte and the children of the node represent the next byte in the token. The trie also stores the original token that ends at that node.

Visualizing subtree for prefix 'user':

==================================================
TRIE VISUALIZATION (max_depth=4, max_children=3)
==================================================
Visualizing subtree for prefix: b'user'
* TOKENS: user
├── 'D'
│   └── 'a'
│       └── 't'
│           └── 'a'
│               * TOKENS: userData
├── 'I'
│   ├── 'D'
│   │   * TOKENS: userID
│   ├── 'd'
│   │   * TOKENS: userId
│   └── 'n'
│       └── 'f'
│           └── 'o'
│               * TOKENS: userInfo
└── 'M'
    └── 'a'
        └── 'n'
            └── 'a'
                └── 'g'
                    └── ... (more nodes, reached max depth)
└── ... (8 more children)
==================================================

So given a prefix, we can traverse the trie and collect all the tokens that have the prefix.

Token Matching Algorithm

We first take the user's input and split it into words using the regex pattern from the tokenizer library.

For example, for the user sentence: He introduced an intermediar, the last words would be intermediar.

But for explaining the process, we'll use the simpler user sentence: indivi.

For the user sentence: "indivi", we search the trie for each of these prefixes. We would initiate search for the following prefixes:

i
vi
ivi
divi
ndivi
indivi
_indivi

Token Boundary Validation

The above search yields many potential candidates, forming a superset of the combinations we want to explore. We need an effective pruning strategy to filter these combinations.

Token Validation Strategy For each candidate token, we:

Replace the last word's prefix with the candidate token
Re-encode the resulting string
Compare with expected token sequence
Keep only if the sequences match exactly

Let's walk through an example using the prefix "divi" from our search candidates above:

Example: Validating token "division" Input Analysis:

Last word: " indivi"
Prefix: "divi"
Candidate token: "division"
Resulting word: " indivision"

The re-encoded tokens don't match our expected sequence, we discard "division" as a candidate. By applying this validation process across all candidate tokens, we can effectively filter out invalid completions while preserving the ones that maintain proper token boundaries.

Examples

Consider these example inputs and how the algorithm processes them:

Testing: I bought some apple
last word  apple
Found 1424 matching tokens in 1.25 ms
Analysis completed in 19.13 ms
Combinations:
Pos 19: I bought some appl -> [b'en', b'ers', b'ere']... 474 possible tokens
Pos 18: I bought some app -> [b'led', b'lem', b'let']... 101 possible tokens
Pos 17: I bought some ap -> [b'pler']... 1 possible tokens
Pos 14: I bought some -> [b' apple', b' apples']... 2 possible tokens

Testing: https:
last word :
Found 324 matching tokens in 0.36 ms
Analysis completed in 2.42 ms
Combinations:
Pos 6: https -> [b':', b'::', b':\n']... 324 possible tokens

Testing: userNa
last word userNa
Found 2323 matching tokens in 1.90 ms
Analysis completed in 21.15 ms
Combinations:
Pos 6: userN -> [b'an', b'ar', b'al']... 2170 possible tokens
Pos 5: user -> [b'Na', b'Nam', b'Nav']... 34 possible tokens
Pos 1:  -> [b'userName']... 1 possible tokens

Testing: We found a hidden causali
last word  causali
Found 2106 matching tokens in 1.67 ms
Analysis completed in 21.13 ms
Combinations:
Pos 25: We found a hidden causal -> [b'it', b'io', b'id']... 1748 possible tokens
Pos 23: We found a hidden caus -> [b'ali', b'alin', b'alia']... 17 possible tokens

Testing: He introduced an intermediar
last word  intermediar
Found 922 matching tokens in 0.81 ms
Analysis completed in 11.40 ms
Combinations:
Pos 28: He introduced an intermedia -> [b'res', b'ress', b'resp']... 52 possible tokens
Pos 27: He introduced an intermedi -> [b'ar', b'art', b'are']... 215 possible tokens
Pos 17: He introduced an -> [b' intermediary']... 1 possible tokens

Testing: indivi
last word indivi
Found 2106 matching tokens in 1.37 ms
Analysis completed in 19.37 ms
Combinations:
Pos 6: indiv -> [b'in', b'it', b'is']... 1885 possible tokens
Pos 4: ind -> [b'ivi', b'ivil', b'ivid']... 16 possible tokens
Pos 1:  -> [b'individual']... 1 possible tokens

Files

main.py: Contains the full implementation of both the token analysis algorithm and the optimized trie data structure
benchmark.py: Tools to measure performance improvements between the original approach and the trie-based solution

Benchmarking

Run the benchmark script to compare performance:

python benchmark.py

The benchmark measures:

Simple prefix matching performance
Full token combination analysis
Amortized cost including trie initialization

Future Improvements

Potential future optimizations:

replacing encoding and decoding steps for matching with something based on mergeable_ranks
Implement a more memory-efficient trie structure
Explore parallel processing for building the trie
Add more aggressive caching strategies
Optimize the token filtering process

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.gitignore		.gitignore
.replit		.replit
README.md		README.md
benchmark.py		benchmark.py
main.ipynb		main.ipynb
main.py		main.py
prompt.txt		prompt.txt
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Character Prefix Conditioning with Back Tokenization

Overview

Trie Implementation

Token Matching Algorithm

Token Boundary Validation

Examples

Files

Benchmarking

Future Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Character Prefix Conditioning with Back Tokenization

Overview

Trie Implementation

Token Matching Algorithm

Token Boundary Validation

Examples

Files

Benchmarking

Future Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages