# ztoken Design Document

**Module:** `github.com/zerfoo/ztoken`
**Version:** v1.0.0
**Status:** Stable

## Overview

ztoken is a pure-Go tokenizer library for ML model inference. It provides BPE (byte-pair encoding), SentencePiece unigram, and WordPiece tokenization with two loading paths: HuggingFace `tokenizer.json` and GGUF metadata extraction. The library has a single external dependency (`golang.org/x/text` for Unicode normalization) and zero CGo.

## Architecture

```
ztoken/
  tokenizer.go   Tokenizer interface + WhitespaceTokenizer
  bpe.go         BPETokenizer (BPE merges, SentencePiece unigram, byte-level BPE)
  wordpiece.go   WordPieceTokenizer (BERT-family models)
  loader.go      HuggingFace tokenizer.json loader
  gguf/gguf.go   GGUF metadata tokenizer extraction
```

### Tokenizer Interface

All tokenizer implementations satisfy a single interface:

```go
type Tokenizer interface {
    Encode(text string) ([]int, error)
    Decode(ids []int) (string, error)
    VocabSize() int
    GetToken(id int) (string, bool)
    GetID(token string) (int, bool)
    SpecialTokens() SpecialTokens
}
```

`SpecialTokens` holds integer IDs for BOS, EOS, PAD, and UNK. All implementations populate these from the loaded model data.

### Encode/Decode Contract

- **Encode** accepts arbitrary UTF-8 text and returns a slice of integer token IDs. An empty string returns an empty slice with no error. Text normalization (NFC, NFD, lowercase, strip) is applied first when configured.
- **Decode** accepts a slice of token IDs and returns the reconstructed UTF-8 string. Unknown IDs return an error. Special tokens are stripped during WordPiece decoding.
- **Round-trip fidelity**: `Decode(Encode(text))` reproduces the original text modulo normalization and leading-space behavior inherent to each algorithm.

## BPE Tokenizer

`BPETokenizer` is the primary production tokenizer. It supports three encoding modes selected by configuration:

### 1. Standard BPE (merge-based)

The classic byte-pair encoding algorithm. Text is pre-tokenized into words, each word is split into characters, and adjacent pairs are iteratively merged in priority order defined by the merge table.

**Encoding steps:**
1. Apply normalizer (if configured)
2. Split around registered special tokens (exact string match, longest wins)
3. Pre-tokenize into words (whitespace split or byte-level or SentencePiece, depending on mode)
4. For each word, split into characters and iteratively merge the highest-rank adjacent pair until no more merges apply
5. Map final subword strings to vocabulary IDs (unmapped subwords become UNK)

### 2. Byte-level BPE (GPT-2 style)

Enabled when `byteLevelBPE` is true. Every byte of the UTF-8 input is mapped to a printable Unicode character using the GPT-2 byte encoder table (printable ASCII maps to itself; other bytes map to codepoints starting at U+0100). This ensures all inputs are representable without an UNK token. Decode reverses the mapping.

**Pre-tokenization:** whitespace becomes a prefix of the following word token, preserving space information in the token stream.
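
The byte encoder table can be reconstructed from its definition. This sketch follows the GPT-2 reference table: bytes in the printable ranges keep their own codepoint, and the rest shift into consecutive codepoints starting at U+0100:

```go
package main

import "fmt"

// byteEncoder builds the GPT-2 byte-to-unicode table: printable bytes
// map to themselves, everything else to consecutive codepoints from
// U+0100 upward, so every byte has a visible, mergeable symbol.
func byteEncoder() map[byte]rune {
    kept := func(b int) bool {
        return (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF)
    }
    enc := make(map[byte]rune, 256)
    n := 0
    for b := 0; b < 256; b++ {
        if kept(b) {
            enc[byte(b)] = rune(b)
        } else {
            enc[byte(b)] = rune(256 + n) // shifted codepoint
            n++
        }
    }
    return enc
}

func main() {
    enc := byteEncoder()
    // Space is not "kept", so it shifts to U+0120 (Ġ), the space marker
    // familiar from GPT-2 vocabularies.
    fmt.Printf("%c %c\n", enc[' '], enc['A'])
}
```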

### 3. SentencePiece unigram (score-based)

Activated when `sentencePiece` is true and the merge table is empty but token scores are set. Spaces are replaced with the metaspace character (U+2581). Encoding uses greedy leftmost-longest match: at each position, the longest vocabulary token is selected, with ties broken by score (the token's log probability). Unmatched bytes fall back to `<0xNN>` byte tokens.

This mode matches llama.cpp's `llm_tokenizer_spm::tokenize` behavior and is used by Llama, Gemma, and other GGUF models with `tokenizer.ggml.model = "llama"`.

**Leading space:** by default, SentencePiece mode prepends U+2581 to the first word. This can be overridden via `SetAddLeadingSpace(false)`, which GGUF models control through the `tokenizer.ggml.add_space_prefix` metadata key.

### Special Token Handling

Special tokens (e.g., `<start_of_turn>`, `<end_of_turn>`) are registered via `SetSpecialTokenStrings`. During encoding, the input is scanned for these strings before BPE/unigram processing. Each match emits its pre-assigned ID as a single token, preventing BPE from splitting control sequences into characters.

## WordPiece Tokenizer

`WordPieceTokenizer` implements the subword algorithm used by BERT-family models. Text is pre-tokenized by splitting on whitespace and punctuation boundaries, then each word is greedily matched against the vocabulary:

1. Try the full word
2. If not found, find the longest prefix in the vocabulary
3. Continue with the remainder, prefixed by `##` (continuing subword marker)
4. If no subword match is found at any position, the entire word maps to UNK

### BERT Encoding

`EncodeForBERT` produces the standard BERT input format:
- **Single sentence:** `[CLS] tokens [SEP]`
- **Sentence pair:** `[CLS] tokens_a [SEP] tokens_b [SEP]`
- Returns `BERTEncoding` with `InputIDs`, `AttentionMask`, and `TokenTypeIDs`
- Optional padding to `maxLen`

## HuggingFace Compatibility Layer

The `Load` and `LoadFromJSON` functions parse HuggingFace `tokenizer.json` files. This is the standard format exported by the `transformers` library and hosted on the HuggingFace Hub.

### Supported Schema

| JSON field | Purpose |
|------------|---------|
| `model.type` | Selects algorithm: `"BPE"` (or empty) routes to `BPETokenizer`, `"WordPiece"` routes to `WordPieceTokenizer` |
| `model.vocab` | Token-to-ID mapping |
| `model.merges` | Merge rules in either `["a b", ...]` or `[["a","b"], ...]` format |
| `added_tokens` | Special tokens with IDs and `special` flag |
| `pre_tokenizer` | Pre-tokenizer config; `ByteLevel` type enables byte-level BPE |
| `normalizer` | Text normalization chain (NFC, NFD, Lowercase, Strip, Sequence) |
| `decoder` | Decoder config; `Metaspace` or `Replace` with U+2581 enables SentencePiece mode |

### Auto-detection

The loader auto-detects the tokenization mode from the JSON structure:
- **Byte-level BPE:** detected when `pre_tokenizer` contains a `ByteLevel` entry (direct or inside a `Sequence`)
- **SentencePiece:** detected when `decoder` contains a `Metaspace` entry or a `Replace` rule targeting U+2581
- **WordPiece:** detected when `model.type` is `"WordPiece"`

### Merge Format Compatibility

Merges accept both the standard space-separated string format (`"a b"`) used by most models and the two-element array format (`["a", "b"]`) used by Gemma 3 tokenizers.

### Special Token Extraction

`extractSpecialTokens` maps `added_tokens` entries to BOS/EOS/PAD/UNK using both GPT-style names (`<s>`, `</s>`, `<pad>`, `<unk>`) and BERT-style names (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`), plus substring matching on `bos`, `eos`, `pad`, `unk`.

## GGUF Tokenizer Loading

The `gguf` sub-package extracts tokenizer data from GGUF file metadata without depending on a specific GGUF parser. It defines a `Metadata` interface:

```go
type Metadata interface {
    GetString(key string) (string, bool)
    GetStringArray(key string) ([]string, bool)
    GetUint32(key string) (uint32, bool)
    GetInt32Array(key string) ([]int32, bool)
    GetFloat32Array(key string) ([]float32, bool)
}
```

`ExtractTokenizer(m Metadata)` reads the following GGUF keys:

| Key | Required | Purpose |
|-----|----------|---------|
| `tokenizer.ggml.tokens` | Yes | Token vocabulary (string array) |
| `tokenizer.ggml.merges` | No | BPE merge rules (space-separated strings) |
| `tokenizer.ggml.scores` | No | SentencePiece unigram scores (float32 array) |
| `tokenizer.ggml.bos_token_id` | No | Beginning-of-sequence token ID |
| `tokenizer.ggml.eos_token_id` | No | End-of-sequence token ID |
| `tokenizer.ggml.unknown_token_id` | No | Unknown token ID |
| `tokenizer.ggml.padding_token_id` | No | Padding token ID |
| `tokenizer.ggml.model` | No | Model type; `"llama"` enables SentencePiece mode |
| `tokenizer.ggml.token_type` | No | Per-token type array; type 3 = control/special token |

### Mode Selection

- If `tokenizer.ggml.model` is `"llama"`, SentencePiece pre-tokenization is enabled
- If scores are present but merges are absent, the tokenizer uses greedy unigram encoding
- If merges are present, standard BPE merge encoding is used
- Control tokens (type 3) are registered for exact-match during encoding

### Interface Decoupling

The `Metadata` interface decouples ztoken from any specific GGUF parser. In the Zerfoo ecosystem, `zerfoo/model` implements this interface over its GGUF reader, but any implementation satisfying the five-method interface works.
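
For tests or ad-hoc loading, a map-backed implementation suffices. The sketch below copies the interface so it is self-contained and backs each method with a type assertion on a plain map; `mapMetadata` is illustration only:

```go
package main

import "fmt"

// Metadata mirrors the gguf sub-package interface (copied here so the
// sketch is self-contained).
type Metadata interface {
    GetString(key string) (string, bool)
    GetStringArray(key string) ([]string, bool)
    GetUint32(key string) (uint32, bool)
    GetInt32Array(key string) ([]int32, bool)
    GetFloat32Array(key string) ([]float32, bool)
}

// mapMetadata is a toy implementation backed by a map; a real GGUF
// reader would satisfy the same five methods over parsed file metadata.
type mapMetadata map[string]any

func (m mapMetadata) GetString(k string) (string, bool)       { v, ok := m[k].(string); return v, ok }
func (m mapMetadata) GetStringArray(k string) ([]string, bool) { v, ok := m[k].([]string); return v, ok }
func (m mapMetadata) GetUint32(k string) (uint32, bool)       { v, ok := m[k].(uint32); return v, ok }
func (m mapMetadata) GetInt32Array(k string) ([]int32, bool)  { v, ok := m[k].([]int32); return v, ok }
func (m mapMetadata) GetFloat32Array(k string) ([]float32, bool) {
    v, ok := m[k].([]float32)
    return v, ok
}

func main() {
    var md Metadata = mapMetadata{
        "tokenizer.ggml.model":  "llama",
        "tokenizer.ggml.tokens": []string{"<unk>", "<s>", "</s>"},
    }
    model, _ := md.GetString("tokenizer.ggml.model")
    toks, _ := md.GetStringArray("tokenizer.ggml.tokens")
    fmt.Println(model, len(toks)) // this value could feed ExtractTokenizer
}
```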

## Text Normalization

Normalizers are optional functions applied before tokenization. The HuggingFace loader builds them from JSON config:

| Type | Behavior |
|------|----------|
| `NFC` | Unicode NFC normalization |
| `NFD` | Unicode NFD normalization |
| `Lowercase` | Case folding |
| `Strip` | Trim leading/trailing whitespace |
| `Sequence` | Chain of normalizers applied in order |

Both `BPETokenizer` and `WordPieceTokenizer` accept a `NormalizerFunc` internally. GGUF-loaded tokenizers do not currently carry normalizer configuration (normalization is handled at the model level).