Commit b6fba6e

docs: add design.md for ztoken v1.0.0
1 parent d3c171d commit b6fba6e

1 file changed: +177 −0

docs/design.md

# ztoken Design Document

**Module:** `github.com/zerfoo/ztoken`
**Version:** v1.0.0
**Status:** Stable

## Overview

ztoken is a pure-Go tokenizer library for ML model inference. It provides BPE (byte-pair encoding), SentencePiece unigram, and WordPiece tokenization with two loading paths: HuggingFace `tokenizer.json` and GGUF metadata extraction. The library has a single external dependency (`golang.org/x/text` for Unicode normalization) and zero CGo.

## Architecture

```
ztoken/
  tokenizer.go   Tokenizer interface + WhitespaceTokenizer
  bpe.go         BPETokenizer (BPE merges, SentencePiece unigram, byte-level BPE)
  wordpiece.go   WordPieceTokenizer (BERT-family models)
  loader.go      HuggingFace tokenizer.json loader
  gguf/gguf.go   GGUF metadata tokenizer extraction
```
### Tokenizer Interface

All tokenizer implementations satisfy a single interface:

```go
type Tokenizer interface {
    Encode(text string) ([]int, error)
    Decode(ids []int) (string, error)
    VocabSize() int
    GetToken(id int) (string, bool)
    GetID(token string) (int, bool)
    SpecialTokens() SpecialTokens
}
```

`SpecialTokens` holds integer IDs for BOS, EOS, PAD, and UNK. All implementations populate these from the loaded model data.
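
As a concrete illustration of the contract, here is a minimal toy implementation over a whitespace vocabulary. The `toyTokenizer` type, its vocabulary, and the `SpecialTokens` field names are invented for this sketch; only the interface shape comes from the document.

```go
package main

import (
    "fmt"
    "strings"
)

// SpecialTokens mirrors the struct described above (field names assumed).
type SpecialTokens struct {
    BOS, EOS, PAD, UNK int
}

type Tokenizer interface {
    Encode(text string) ([]int, error)
    Decode(ids []int) (string, error)
    VocabSize() int
    GetToken(id int) (string, bool)
    GetID(token string) (int, bool)
    SpecialTokens() SpecialTokens
}

// toyTokenizer: whitespace splitting over a fixed vocabulary.
type toyTokenizer struct {
    vocab   map[string]int
    inverse []string
    special SpecialTokens
}

func (t *toyTokenizer) Encode(text string) ([]int, error) {
    ids := []int{} // empty input: empty slice, nil error
    for _, w := range strings.Fields(text) {
        id, ok := t.vocab[w]
        if !ok {
            id = t.special.UNK // unmapped words become UNK
        }
        ids = append(ids, id)
    }
    return ids, nil
}

func (t *toyTokenizer) Decode(ids []int) (string, error) {
    words := make([]string, 0, len(ids))
    for _, id := range ids {
        if id < 0 || id >= len(t.inverse) {
            return "", fmt.Errorf("unknown id %d", id) // unknown IDs error out
        }
        words = append(words, t.inverse[id])
    }
    return strings.Join(words, " "), nil
}

func (t *toyTokenizer) VocabSize() int { return len(t.inverse) }

func (t *toyTokenizer) GetToken(id int) (string, bool) {
    if id < 0 || id >= len(t.inverse) {
        return "", false
    }
    return t.inverse[id], true
}

func (t *toyTokenizer) GetID(tok string) (int, bool) {
    id, ok := t.vocab[tok]
    return id, ok
}

func (t *toyTokenizer) SpecialTokens() SpecialTokens { return t.special }

func main() {
    var tok Tokenizer = &toyTokenizer{
        vocab:   map[string]int{"hello": 1, "world": 2},
        inverse: []string{"<unk>", "hello", "world"},
        special: SpecialTokens{UNK: 0},
    }
    ids, _ := tok.Encode("hello world")
    text, _ := tok.Decode(ids)
    fmt.Println(ids, text) // [1 2] hello world
}
```
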
### Encode/Decode Contract

- **Encode** accepts arbitrary UTF-8 text and returns a slice of integer token IDs. An empty string returns an empty slice with no error. Text normalization (NFC, NFD, lowercase, strip) is applied first when configured.
- **Decode** accepts a slice of token IDs and returns the reconstructed UTF-8 string. Unknown IDs return an error. Special tokens are stripped during WordPiece decoding.
- **Round-trip fidelity**: `Decode(Encode(text))` reproduces the original text modulo normalization and leading-space behavior inherent to each algorithm.

## BPE Tokenizer

`BPETokenizer` is the primary production tokenizer. It supports three encoding modes selected by configuration:

### 1. Standard BPE (merge-based)

The classic byte-pair encoding algorithm. Text is pre-tokenized into words, each word is split into characters, and adjacent pairs are iteratively merged in priority order defined by the merge table.

**Encoding steps:**

1. Apply normalizer (if configured)
2. Split around registered special tokens (exact string match, longest wins)
3. Pre-tokenize into words (whitespace split, byte-level, or SentencePiece, depending on mode)
4. For each word, split into characters and iteratively merge the highest-rank adjacent pair until no more merges apply
5. Map final subword strings to vocabulary IDs (unmapped subwords become UNK)
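
The merge loop in step 4 can be sketched as follows. `mergeWord` and the rank-table shape are illustrative, not ztoken's actual internals; the sketch assumes a lower numeric rank means an earlier, higher-priority merge, as in standard BPE.

```go
package main

import "fmt"

// mergeWord applies BPE merges to one pre-tokenized word.
// ranks maps adjacent pairs to merge priority; lower rank merges first.
func mergeWord(word string, ranks map[[2]string]int) []string {
    // Split into single characters (runes).
    var parts []string
    for _, r := range word {
        parts = append(parts, string(r))
    }
    for len(parts) > 1 {
        // Find the adjacent pair with the best (lowest) rank.
        best, bestRank := -1, int(^uint(0)>>1)
        for i := 0; i < len(parts)-1; i++ {
            if r, ok := ranks[[2]string{parts[i], parts[i+1]}]; ok && r < bestRank {
                best, bestRank = i, r
            }
        }
        if best < 0 {
            break // no applicable merge remains
        }
        // Replace parts[best] and parts[best+1] with their concatenation.
        merged := parts[best] + parts[best+1]
        parts = append(parts[:best+1], parts[best+2:]...)
        parts[best] = merged
    }
    return parts
}

func main() {
    ranks := map[[2]string]int{
        {"l", "o"}:  0,
        {"lo", "w"}: 1,
        {"e", "r"}:  2,
    }
    fmt.Println(mergeWord("lower", ranks)) // [low er]
}
```
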
### 2. Byte-level BPE (GPT-2 style)

Enabled when `byteLevelBPE` is true. Every byte of the UTF-8 input is mapped to a printable Unicode character using the GPT-2 byte encoder table (printable ASCII maps to itself; other bytes map to codepoints starting at U+0100). This ensures all inputs are representable without an UNK token. Decode reverses the mapping.

**Pre-tokenization:** whitespace becomes a prefix of the following word token, preserving space information in the token stream.
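
The table construction can be sketched like this. It is a reconstruction of the well-known GPT-2 mapping, not ztoken's actual code; note the real table also keeps two Latin-1 ranges self-mapped, which the sketch reproduces.

```go
package main

import "fmt"

// byteEncoder builds the GPT-2 byte-to-rune table: bytes that are already
// "printable" map to themselves; every other byte is assigned the next
// free codepoint starting at U+0100, so all 256 byte values round-trip.
func byteEncoder() map[byte]rune {
    enc := make(map[byte]rune, 256)
    isPrintable := func(b int) bool {
        return (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF)
    }
    n := 0
    for b := 0; b < 256; b++ {
        if isPrintable(b) {
            enc[byte(b)] = rune(b)
        } else {
            enc[byte(b)] = rune(256 + n)
            n++
        }
    }
    return enc
}

func main() {
    enc := byteEncoder()
    // A space (0x20) is outside the printable ranges, so it gets a
    // substitute rune -- the familiar 'Ġ' seen in GPT-2 vocabularies.
    fmt.Printf("space -> %c\n", enc[' '])
    fmt.Printf("'A'   -> %c\n", enc['A'])
}
```
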
### 3. SentencePiece unigram (score-based)

Activated when `sentencePiece` is true and the merge table is empty but token scores are set. Spaces are replaced with the metaspace character (U+2581). Encoding uses greedy leftmost-longest match: at each position, the longest vocabulary token is selected, with ties broken by score (negative log probability). Unmatched bytes fall back to `<0xNN>` byte tokens.

This mode matches llama.cpp's `llm_tokenizer_spm::tokenize` behavior and is used by Llama, Gemma, and other GGUF models with `tokenizer.ggml.model = "llama"`.

**Leading space:** by default, SentencePiece mode prepends U+2581 to the first word. This can be overridden via `SetAddLeadingSpace(false)`, which GGUF models control through the `tokenizer.ggml.add_space_prefix` metadata key.
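
The greedy matcher can be sketched as follows. `greedyUnigram` is illustrative, not ztoken's implementation; score-based tie-breaking and the mapping of byte-fallback tokens to real IDs are elided.

```go
package main

import (
    "fmt"
    "strings"
)

// greedyUnigram sketches the leftmost-longest match described above:
// spaces become U+2581, a leading metaspace is optionally prepended, and
// at each position the longest token present in vocab is taken.
// Unmatched bytes fall back to <0xNN> pseudo-tokens.
func greedyUnigram(text string, vocab map[string]bool, addLeadingSpace bool) []string {
    const metaspace = "\u2581"
    s := strings.ReplaceAll(text, " ", metaspace)
    if addLeadingSpace {
        s = metaspace + s
    }
    var out []string
    for len(s) > 0 {
        match := ""
        for end := len(s); end > 0; end-- { // longest candidate first
            if vocab[s[:end]] {
                match = s[:end]
                break
            }
        }
        if match == "" {
            // Byte fallback: emit <0xNN> for the first unmatched byte.
            out = append(out, fmt.Sprintf("<0x%02X>", s[0]))
            s = s[1:]
            continue
        }
        out = append(out, match)
        s = s[len(match):]
    }
    return out
}

func main() {
    vocab := map[string]bool{
        "\u2581hello": true, "\u2581wor": true, "ld": true,
    }
    fmt.Println(greedyUnigram("hello world", vocab, true)) // [▁hello ▁wor ld]
}
```
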
### Special Token Handling

Special tokens (e.g., `<start_of_turn>`, `<end_of_turn>`) are registered via `SetSpecialTokenStrings`. During encoding, the input is scanned for these strings before BPE/unigram processing. Each match emits its pre-assigned ID as a single token, preventing BPE from splitting control sequences into characters.
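
The splitting pass can be sketched like this. `splitSpecial` and the `segment` type are invented for illustration; ztoken's internal representation may differ.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
)

// segment is a run of input that is either a registered special token
// (encoded as its pre-assigned ID) or plain text for BPE/unigram.
type segment struct {
    Text    string
    Special bool
}

// splitSpecial cuts the input around registered special-token strings.
// Candidates are tried longest-first so overlapping matches resolve to
// the longest token, as described above.
func splitSpecial(text string, specials []string) []segment {
    sorted := append([]string(nil), specials...)
    sort.Slice(sorted, func(i, j int) bool { return len(sorted[i]) > len(sorted[j]) })
    var out []segment
    for len(text) > 0 {
        bestIdx, bestTok := -1, ""
        for _, sp := range sorted {
            // Earliest occurrence wins; at equal positions the longer
            // token wins because it was tried first.
            if i := strings.Index(text, sp); i >= 0 && (bestIdx < 0 || i < bestIdx) {
                bestIdx, bestTok = i, sp
            }
        }
        if bestIdx < 0 {
            out = append(out, segment{Text: text})
            break
        }
        if bestIdx > 0 {
            out = append(out, segment{Text: text[:bestIdx]})
        }
        out = append(out, segment{Text: bestTok, Special: true})
        text = text[bestIdx+len(bestTok):]
    }
    return out
}

func main() {
    segs := splitSpecial("<start_of_turn>user hi<end_of_turn>",
        []string{"<start_of_turn>", "<end_of_turn>"})
    for _, s := range segs {
        fmt.Printf("%q special=%v\n", s.Text, s.Special)
    }
}
```
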
## WordPiece Tokenizer

`WordPieceTokenizer` implements the subword algorithm used by BERT-family models. Text is pre-tokenized by splitting on whitespace and punctuation boundaries, then each word is greedily matched against the vocabulary:

1. Try the full word
2. If not found, find the longest prefix in the vocabulary
3. Continue with the remainder, prefixed by `##` (continuing subword marker)
4. If no subword match is found at any position, the entire word maps to UNK
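
The steps above can be sketched as follows; `wordpiece` is illustrative, not ztoken's implementation.

```go
package main

import "fmt"

// wordpiece greedily matches the longest prefix at each position; the
// remainder continues with the "##" marker, and total failure maps the
// whole word to unk (steps 1-4 above).
func wordpiece(word string, vocab map[string]bool, unk string) []string {
    var pieces []string
    prefix := ""
    for len(word) > 0 {
        end := len(word)
        for end > 0 && !vocab[prefix+word[:end]] {
            end--
        }
        if end == 0 {
            return []string{unk} // no match at this position: whole word is UNK
        }
        pieces = append(pieces, prefix+word[:end])
        word = word[end:]
        prefix = "##" // subsequent pieces are continuing subwords
    }
    return pieces
}

func main() {
    vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
    fmt.Println(wordpiece("unaffable", vocab, "[UNK]")) // [un ##aff ##able]
}
```
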
### BERT Encoding

`EncodeForBERT` produces the standard BERT input format:

- **Single sentence:** `[CLS] tokens [SEP]`
- **Sentence pair:** `[CLS] tokens_a [SEP] tokens_b [SEP]`
- Returns `BERTEncoding` with `InputIDs`, `AttentionMask`, and `TokenTypeIDs`
- Optional padding to `maxLen`
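
The assembly can be sketched like this. The `bertEncoding` struct mirrors the documented fields, but the function name, signature, and token IDs are assumptions, not ztoken's actual `EncodeForBERT`.

```go
package main

import "fmt"

// bertEncoding mirrors the fields described above.
type bertEncoding struct {
    InputIDs      []int
    AttentionMask []int
    TokenTypeIDs  []int
}

// assembleBERT builds [CLS] a [SEP] (b [SEP]) with optional padding to maxLen.
func assembleBERT(a, b []int, clsID, sepID, padID, maxLen int) bertEncoding {
    ids := append([]int{clsID}, a...)
    ids = append(ids, sepID)
    types := make([]int, len(ids)) // segment 0 for [CLS] a [SEP]
    if b != nil {
        ids = append(ids, b...)
        ids = append(ids, sepID)
        for len(types) < len(ids) {
            types = append(types, 1) // segment 1 for b [SEP]
        }
    }
    mask := make([]int, len(ids))
    for i := range mask {
        mask[i] = 1 // real tokens are attended to
    }
    for len(ids) < maxLen { // optional padding
        ids = append(ids, padID)
        mask = append(mask, 0)
        types = append(types, 0)
    }
    return bertEncoding{InputIDs: ids, AttentionMask: mask, TokenTypeIDs: types}
}

func main() {
    e := assembleBERT([]int{7, 8}, []int{9}, 101, 102, 0, 8)
    fmt.Println(e.InputIDs)      // [101 7 8 102 9 102 0 0]
    fmt.Println(e.AttentionMask) // [1 1 1 1 1 1 0 0]
    fmt.Println(e.TokenTypeIDs)  // [0 0 0 0 1 1 0 0]
}
```
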
## HuggingFace Compatibility Layer

The `Load` and `LoadFromJSON` functions parse HuggingFace `tokenizer.json` files. This is the standard format exported by the `transformers` library and hosted on the HuggingFace Hub.

### Supported Schema

| JSON field | Purpose |
|------------|---------|
| `model.type` | Selects algorithm: `"BPE"` (or empty) routes to `BPETokenizer`, `"WordPiece"` routes to `WordPieceTokenizer` |
| `model.vocab` | Token-to-ID mapping |
| `model.merges` | Merge rules in either `["a b", ...]` or `[["a","b"], ...]` format |
| `added_tokens` | Special tokens with IDs and `special` flag |
| `pre_tokenizer` | Pre-tokenizer config; `ByteLevel` type enables byte-level BPE |
| `normalizer` | Text normalization chain (NFC, NFD, Lowercase, Strip, Sequence) |
| `decoder` | Decoder config; `Metaspace` or `Replace` with U+2581 enables SentencePiece mode |
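
A minimal `tokenizer.json` exercising this subset might look like the following; the vocabulary and merge values are illustrative, not taken from a real model.

```json
{
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0, "h": 1, "i": 2, "hi": 3 },
    "merges": ["h i"]
  },
  "added_tokens": [
    { "id": 0, "content": "<unk>", "special": true }
  ],
  "normalizer": { "type": "NFC" }
}
```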
### Auto-detection

The loader auto-detects the tokenization mode from the JSON structure:

- **Byte-level BPE:** detected when `pre_tokenizer` contains a `ByteLevel` entry (direct or inside a `Sequence`)
- **SentencePiece:** detected when `decoder` contains a `Metaspace` entry or a `Replace` rule targeting U+2581
- **WordPiece:** detected when `model.type` is `"WordPiece"`

### Merge Format Compatibility

Merges accept both the standard space-separated string format (`"a b"`) used by most models and the two-element array format (`["a", "b"]`) used by Gemma 3 tokenizers.

### Special Token Extraction

`extractSpecialTokens` maps `added_tokens` entries to BOS/EOS/PAD/UNK using both GPT-style names (`<s>`, `</s>`, `<pad>`, `<unk>`) and BERT-style names (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`), plus substring matching on `bos`, `eos`, `pad`, `unk`.
## GGUF Tokenizer Loading

The `gguf` sub-package extracts tokenizer data from GGUF file metadata without depending on a specific GGUF parser. It defines a `Metadata` interface:

```go
type Metadata interface {
    GetString(key string) (string, bool)
    GetStringArray(key string) ([]string, bool)
    GetUint32(key string) (uint32, bool)
    GetInt32Array(key string) ([]int32, bool)
    GetFloat32Array(key string) ([]float32, bool)
}
```

`ExtractTokenizer(m Metadata)` reads the following GGUF keys:

| Key | Required | Purpose |
|-----|----------|---------|
| `tokenizer.ggml.tokens` | Yes | Token vocabulary (string array) |
| `tokenizer.ggml.merges` | No | BPE merge rules (space-separated strings) |
| `tokenizer.ggml.scores` | No | SentencePiece unigram scores (float32 array) |
| `tokenizer.ggml.bos_token_id` | No | Beginning-of-sequence token ID |
| `tokenizer.ggml.eos_token_id` | No | End-of-sequence token ID |
| `tokenizer.ggml.unknown_token_id` | No | Unknown token ID |
| `tokenizer.ggml.padding_token_id` | No | Padding token ID |
| `tokenizer.ggml.model` | No | Model type; `"llama"` enables SentencePiece mode |
| `tokenizer.ggml.token_type` | No | Per-token type array; type 3 = control/special token |

### Mode Selection

- If `tokenizer.ggml.model` is `"llama"`, SentencePiece pre-tokenization is enabled
- If scores are present but merges are absent, the tokenizer uses greedy unigram encoding
- If merges are present, standard BPE merge encoding is used
- Control tokens (type 3) are registered for exact-match during encoding

### Interface Decoupling

The `Metadata` interface decouples ztoken from any specific GGUF parser. In the Zerfoo ecosystem, `zerfoo/model` implements this interface over its GGUF reader, but any implementation satisfying the five-method interface works.
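
A map-backed implementation of the interface can be sketched in a few lines; the `mapMetadata` type is invented for illustration (handy for tests or for adapting another parser's key/value store), not part of ztoken.

```go
package main

import "fmt"

// Metadata mirrors the five-method interface above.
type Metadata interface {
    GetString(key string) (string, bool)
    GetStringArray(key string) ([]string, bool)
    GetUint32(key string) (uint32, bool)
    GetInt32Array(key string) ([]int32, bool)
    GetFloat32Array(key string) ([]float32, bool)
}

// mapMetadata backs the interface with plain maps.
type mapMetadata struct {
    strs    map[string]string
    strArrs map[string][]string
    u32s    map[string]uint32
    i32Arrs map[string][]int32
    f32Arrs map[string][]float32
}

func (m *mapMetadata) GetString(k string) (string, bool)          { v, ok := m.strs[k]; return v, ok }
func (m *mapMetadata) GetStringArray(k string) ([]string, bool)   { v, ok := m.strArrs[k]; return v, ok }
func (m *mapMetadata) GetUint32(k string) (uint32, bool)          { v, ok := m.u32s[k]; return v, ok }
func (m *mapMetadata) GetInt32Array(k string) ([]int32, bool)     { v, ok := m.i32Arrs[k]; return v, ok }
func (m *mapMetadata) GetFloat32Array(k string) ([]float32, bool) { v, ok := m.f32Arrs[k]; return v, ok }

func main() {
    var md Metadata = &mapMetadata{
        strs:    map[string]string{"tokenizer.ggml.model": "llama"},
        strArrs: map[string][]string{"tokenizer.ggml.tokens": {"<unk>", "<s>", "</s>"}},
        u32s:    map[string]uint32{"tokenizer.ggml.bos_token_id": 1},
    }
    model, _ := md.GetString("tokenizer.ggml.model")
    toks, _ := md.GetStringArray("tokenizer.ggml.tokens")
    fmt.Println(model, len(toks)) // llama 3
}
```
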
## Text Normalization

Normalizers are optional functions applied before tokenization. The HuggingFace loader builds them from JSON config:

| Type | Behavior |
|------|----------|
| `NFC` | Unicode NFC normalization |
| `NFD` | Unicode NFD normalization |
| `Lowercase` | Case folding |
| `Strip` | Trim leading/trailing whitespace |
| `Sequence` | Chain of normalizers applied in order |

Both `BPETokenizer` and `WordPieceTokenizer` accept a `NormalizerFunc` internally. GGUF-loaded tokenizers do not currently carry normalizer configuration (normalization is handled at the model level).
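
The chaining behavior of `Sequence` can be sketched with stdlib functions. The `NormalizerFunc` name comes from the document, but its signature and the `sequence` combinator are assumptions; NFC/NFD would come from `golang.org/x/text/unicode/norm`, omitted here to stay dependency-free.

```go
package main

import (
    "fmt"
    "strings"
)

// NormalizerFunc is a text transform applied before tokenization
// (signature assumed).
type NormalizerFunc func(string) string

// sequence chains normalizers in order, like the Sequence config entry.
func sequence(fns ...NormalizerFunc) NormalizerFunc {
    return func(s string) string {
        for _, fn := range fns {
            s = fn(s)
        }
        return s
    }
}

func main() {
    // Lowercase + Strip, mirroring two of the table's entries.
    norm := sequence(strings.ToLower, strings.TrimSpace)
    fmt.Printf("%q\n", norm("  Hello World  ")) // "hello world"
}
```
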
