From 6cb60732cf2f0e2b9dc512a118fe00ac95bddbb7 Mon Sep 17 00:00:00 2001
From: David Ndungu
Date: Mon, 30 Mar 2026 06:55:29 -0700
Subject: [PATCH] docs(adr): add API stability contract for ztoken v1.0.0

---
 docs/adr/001-api-stability-v1.md | 88 ++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 docs/adr/001-api-stability-v1.md

diff --git a/docs/adr/001-api-stability-v1.md b/docs/adr/001-api-stability-v1.md
new file mode 100644
index 0000000..60b7b11
--- /dev/null
+++ b/docs/adr/001-api-stability-v1.md
@@ -0,0 +1,88 @@
+# ADR-001: API Stability Contract for ztoken v1.0.0
+
+**Status:** Accepted
+**Date:** 2026-03-29
+
+## Context
+
+The `ztoken` package (`github.com/zerfoo/ztoken`) provides BPE and WordPiece tokenization for ML inference in Go. The public API has been stable across multiple releases and is consumed by `zerfoo` and downstream applications. We need to formalize which symbols are part of the stable v1 contract and which are internal implementation details.
+
+## Decision
+
+### Stable v1 Public API
+
+The following exported symbols constitute the stable v1 API. They will not have breaking changes within the v1.x release series, following Go module compatibility guarantees.
+
+#### Root package (`github.com/zerfoo/ztoken`)
+
+**Interface:**
+
+- `Tokenizer` -- the core abstraction; all tokenizer implementations satisfy this interface
+  - `Encode(text string) ([]int, error)`
+  - `Decode(ids []int) (string, error)`
+  - `VocabSize() int`
+  - `GetToken(id int) (string, bool)`
+  - `GetID(token string) (int, bool)`
+  - `SpecialTokens() SpecialTokens`
+
+**Types:**
+
+- `SpecialTokens` -- struct with BOS, EOS, PAD, UNK fields
+- `MergePair` -- struct with Left, Right fields
+- `NormalizerFunc` -- function type `func(string) string`
+- `BERTEncoding` -- struct with InputIDs, AttentionMask, TokenTypeIDs fields
+
+**Concrete tokenizers:**
+
+- `BPETokenizer` (struct, implements `Tokenizer`)
+  - `NewBPETokenizer(vocab, merges, special, byteLevelBPE)`
+  - `Encode`, `Decode`, `VocabSize`, `GetToken`, `GetID`, `SpecialTokens`
+  - `EncodeWithSpecialTokens(text, addBOS, addEOS)`
+  - `SetScores(scores)`
+  - `SetSentencePiece(enabled)`
+  - `SetAddLeadingSpace(enabled)`
+  - `SetSpecialTokenStrings(tokens)`
+
+- `WhitespaceTokenizer` (struct, implements `Tokenizer`)
+  - `NewWhitespaceTokenizer()`
+  - `Encode`, `Decode`, `VocabSize`, `GetToken`, `GetID`, `SpecialTokens`
+  - `AddToken(token)`
+
+- `WordPieceTokenizer` (struct, implements `Tokenizer`)
+  - `NewWordPieceTokenizer(vocab, special)`
+  - `Encode`, `Decode`, `VocabSize`, `GetToken`, `GetID`, `SpecialTokens`
+  - `EncodeForBERT(textA, textB, maxLen)`
+
+**Loader functions:**
+
+- `Load(path string) (Tokenizer, error)` -- loads HuggingFace tokenizer.json, returns appropriate implementation
+- `LoadFromJSON(path string) (*BPETokenizer, error)` -- loads HuggingFace tokenizer.json as BPE specifically
+
+#### Subpackage `gguf` (`github.com/zerfoo/ztoken/gguf`)
+
+- `Metadata` -- interface for GGUF key-value access
+- `ExtractTokenizer(m Metadata) (*ztoken.BPETokenizer, error)` -- builds a BPETokenizer from GGUF metadata
+
+### Not Public API
+
+The following are explicitly **not** part of the v1 stability contract:
+
+- **Unexported fields and methods** on all types
+- **Internal implementation details** of BPE merge algorithms, SentencePiece encoding, and WordPiece subword matching
+- **Test utilities and test data** under `testdata/`
+- **File format parsing internals** within the `gguf` subpackage (only the `Metadata` interface and `ExtractTokenizer` function are stable)
+
+### Compatibility Rules
+
+1. No exported type, function, or method listed above will be removed or have its signature changed in v1.x.
+2. New methods may be added to concrete types (`BPETokenizer`, `WhitespaceTokenizer`, `WordPieceTokenizer`).
+3. New fields may be added to `SpecialTokens`, `MergePair`, `BERTEncoding`, and other struct types.
+4. The `Tokenizer` interface will not gain new methods in v1.x -- that would break external implementations.
+5. New subpackages or new exported functions may be added.
+6. Bug fixes that change tokenization output to match the reference implementation (HuggingFace, llama.cpp) are permitted.
+
+## Consequences
+
+- Downstream consumers can depend on `github.com/zerfoo/ztoken` v1.x with confidence that upgrades will not break their code.
+- The `Tokenizer` interface is frozen for v1 -- any new capabilities must be added via separate interfaces or concrete type methods.
+- Internal refactoring (merge algorithm, byte-level BPE internals, GGUF parsing helpers) can proceed freely without versioning concerns.