From a4416f9ed240ce72fb536e93b7e73b82c8edae18 Mon Sep 17 00:00:00 2001 From: David Ndungu Date: Mon, 30 Mar 2026 01:15:19 -0700 Subject: [PATCH] docs: add design.md for ztoken v1.0.0 --- docs/design.md | 177 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 177 insertions(+) create mode 100644 docs/design.md diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..4f5fd57 --- /dev/null +++ b/docs/design.md @@ -0,0 +1,177 @@ +# ztoken Design Document + +**Module:** `github.com/zerfoo/ztoken` +**Version:** v1.0.0 +**Status:** Stable + +## Overview + +ztoken is a pure-Go tokenizer library for ML model inference. It provides BPE (byte-pair encoding), SentencePiece unigram, and WordPiece tokenization with two loading paths: HuggingFace `tokenizer.json` and GGUF metadata extraction. The library has a single external dependency (`golang.org/x/text` for Unicode normalization) and zero CGo. + +## Architecture + +``` +ztoken/ + tokenizer.go Tokenizer interface + WhitespaceTokenizer + bpe.go BPETokenizer (BPE merges, SentencePiece unigram, byte-level BPE) + wordpiece.go WordPieceTokenizer (BERT-family models) + loader.go HuggingFace tokenizer.json loader + gguf/gguf.go GGUF metadata tokenizer extraction +``` + +### Tokenizer Interface + +All tokenizer implementations satisfy a single interface: + +```go +type Tokenizer interface { + Encode(text string) ([]int, error) + Decode(ids []int) (string, error) + VocabSize() int + GetToken(id int) (string, bool) + GetID(token string) (int, bool) + SpecialTokens() SpecialTokens +} +``` + +`SpecialTokens` holds integer IDs for BOS, EOS, PAD, and UNK. All implementations populate these from the loaded model data. + +### Encode/Decode Contract + +- **Encode** accepts arbitrary UTF-8 text and returns a slice of integer token IDs. An empty string returns an empty slice with no error. Text normalization (NFC, NFD, lowercase, strip) is applied first when configured. +- **Decode** accepts a slice of token IDs and returns the reconstructed UTF-8 string. Unknown IDs return an error. Special tokens are stripped during WordPiece decoding. +- **Round-trip fidelity**: `Decode(Encode(text))` reproduces the original text modulo normalization and leading-space behavior inherent to each algorithm. + +## BPE Tokenizer + +`BPETokenizer` is the primary production tokenizer. It supports three encoding modes selected by configuration: + +### 1. Standard BPE (merge-based) + +The classic byte-pair encoding algorithm. Text is pre-tokenized into words, each word is split into characters, and adjacent pairs are iteratively merged in priority order defined by the merge table. + +**Encoding steps:** +1. Apply normalizer (if configured) +2. Split around registered special tokens (exact string match, longest wins) +3. Pre-tokenize into words (whitespace split or byte-level or SentencePiece, depending on mode) +4. For each word, split into characters and iteratively merge the highest-rank adjacent pair until no more merges apply +5. Map final subword strings to vocabulary IDs (unmapped subwords become UNK) + +### 2. Byte-level BPE (GPT-2 style) + +Enabled when `byteLevelBPE` is true. Every byte of the UTF-8 input is mapped to a printable Unicode character using the GPT-2 byte encoder table (printable ASCII maps to itself; other bytes map to codepoints starting at U+0100). This ensures all inputs are representable without an UNK token. Decode reverses the mapping. + +**Pre-tokenization:** whitespace becomes a prefix of the following word token, preserving space information in the token stream. + +### 3. SentencePiece unigram (score-based) + +Activated when `sentencePiece` is true and the merge table is empty but token scores are set. Spaces are replaced with the metaspace character (U+2581). Encoding uses greedy leftmost-longest match: at each position, the longest vocabulary token is selected, with ties broken by score (negative log probability). Unmatched bytes fall back to `<0xNN>` byte tokens. + +This mode matches llama.cpp's `llm_tokenizer_spm::tokenize` behavior and is used by Llama, Gemma, and other GGUF models with `tokenizer.ggml.model = "llama"`. + +**Leading space:** by default, SentencePiece mode prepends U+2581 to the first word. This can be overridden via `SetAddLeadingSpace(false)`, which GGUF models control through the `tokenizer.ggml.add_space_prefix` metadata key. + +### Special Token Handling + +Special tokens (e.g., ``, ``) are registered via `SetSpecialTokenStrings`. During encoding, the input is scanned for these strings before BPE/unigram processing. Each match emits its pre-assigned ID as a single token, preventing BPE from splitting control sequences into characters. + +## WordPiece Tokenizer + +`WordPieceTokenizer` implements the subword algorithm used by BERT-family models. Text is pre-tokenized by splitting on whitespace and punctuation boundaries, then each word is greedily matched against the vocabulary: + +1. Try the full word +2. If not found, find the longest prefix in the vocabulary +3. Continue with the remainder, prefixed by `##` (continuing subword marker) +4. If no subword match is found at any position, the entire word maps to UNK + +### BERT Encoding + +`EncodeForBERT` produces the standard BERT input format: +- **Single sentence:** `[CLS] tokens [SEP]` +- **Sentence pair:** `[CLS] tokens_a [SEP] tokens_b [SEP]` +- Returns `BERTEncoding` with `InputIDs`, `AttentionMask`, and `TokenTypeIDs` +- Optional padding to `maxLen` + +## HuggingFace Compatibility Layer + +The `Load` and `LoadFromJSON` functions parse HuggingFace `tokenizer.json` files. This is the standard format exported by the `transformers` library and hosted on the HuggingFace Hub. + +### Supported Schema + +| JSON field | Purpose | +|------------|---------| +| `model.type` | Selects algorithm: `"BPE"` (or empty) routes to `BPETokenizer`, `"WordPiece"` routes to `WordPieceTokenizer` | +| `model.vocab` | Token-to-ID mapping | +| `model.merges` | Merge rules in either `["a b", ...]` or `[["a","b"], ...]` format | +| `added_tokens` | Special tokens with IDs and `special` flag | +| `pre_tokenizer` | Pre-tokenizer config; `ByteLevel` type enables byte-level BPE | +| `normalizer` | Text normalization chain (NFC, NFD, Lowercase, Strip, Sequence) | +| `decoder` | Decoder config; `Metaspace` or `Replace` with U+2581 enables SentencePiece mode | + +### Auto-detection + +The loader auto-detects the tokenization mode from the JSON structure: +- **Byte-level BPE:** detected when `pre_tokenizer` contains a `ByteLevel` entry (direct or inside a `Sequence`) +- **SentencePiece:** detected when `decoder` contains a `Metaspace` entry or a `Replace` rule targeting U+2581 +- **WordPiece:** detected when `model.type` is `"WordPiece"` + +### Merge Format Compatibility + +Merges accept both the standard space-separated string format (`"a b"`) used by most models and the two-element array format (`["a", "b"]`) used by Gemma 3 tokenizers. + +### Special Token Extraction + +`extractSpecialTokens` maps `added_tokens` entries to BOS/EOS/PAD/UNK using both GPT-style names (``, ``, ``, ``) and BERT-style names (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`), plus substring matching on `bos`, `eos`, `pad`, `unk`. + +## GGUF Tokenizer Loading + +The `gguf` sub-package extracts tokenizer data from GGUF file metadata without depending on a specific GGUF parser. It defines a `Metadata` interface: + +```go +type Metadata interface { + GetString(key string) (string, bool) + GetStringArray(key string) ([]string, bool) + GetUint32(key string) (uint32, bool) + GetInt32Array(key string) ([]int32, bool) + GetFloat32Array(key string) ([]float32, bool) +} +``` + +`ExtractTokenizer(m Metadata)` reads the following GGUF keys: + +| Key | Required | Purpose | +|-----|----------|---------| +| `tokenizer.ggml.tokens` | Yes | Token vocabulary (string array) | +| `tokenizer.ggml.merges` | No | BPE merge rules (space-separated strings) | +| `tokenizer.ggml.scores` | No | SentencePiece unigram scores (float32 array) | +| `tokenizer.ggml.bos_token_id` | No | Beginning-of-sequence token ID | +| `tokenizer.ggml.eos_token_id` | No | End-of-sequence token ID | +| `tokenizer.ggml.unknown_token_id` | No | Unknown token ID | +| `tokenizer.ggml.padding_token_id` | No | Padding token ID | +| `tokenizer.ggml.model` | No | Model type; `"llama"` enables SentencePiece mode | +| `tokenizer.ggml.token_type` | No | Per-token type array; type 3 = control/special token | + +### Mode Selection + +- If `tokenizer.ggml.model` is `"llama"`, SentencePiece pre-tokenization is enabled +- If scores are present but merges are absent, the tokenizer uses greedy unigram encoding +- If merges are present, standard BPE merge encoding is used +- Control tokens (type 3) are registered for exact-match during encoding + +### Interface Decoupling + +The `Metadata` interface decouples ztoken from any specific GGUF parser. In the Zerfoo ecosystem, `zerfoo/model` implements this interface over its GGUF reader, but any implementation satisfying the five-method interface works. + +## Text Normalization + +Normalizers are optional functions applied before tokenization. The HuggingFace loader builds them from JSON config: + +| Type | Behavior | +|------|----------| +| `NFC` | Unicode NFC normalization | +| `NFD` | Unicode NFD normalization | +| `Lowercase` | Case folding | +| `Strip` | Trim leading/trailing whitespace | +| `Sequence` | Chain of normalizers applied in order | + +Both `BPETokenizer` and `WordPieceTokenizer` accept a `NormalizerFunc` internally. GGUF-loaded tokenizers do not currently carry normalizer configuration (normalization is handled at the model level).