Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible).
TiktokenEx is a small, dependency-light implementation of the core TikToken idea:

- Split text with a Unicode-aware regex (`pat_str`)
- Encode pieces with byte-pair encoding (BPE) using `mergeable_ranks`
- Optionally recognize special tokens (e.g. `<|im_end|>`)

It focuses on matching the behavior of MoonshotAI's Kimi K2 tokenizers,
which ship a `tiktoken.model` file and a TikToken-compatible `pat_str`.
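The BPE step in the list above can be sketched in plain Elixir. This is illustrative only, not the library's internals (`BPESketch` is a hypothetical name): a real TikToken-style tokenizer merges raw byte pairs with an efficient scan, while this sketch merges graphemes for readability and simply shows the rule — repeatedly merge the adjacent pair with the lowest rank.

```elixir
defmodule BPESketch do
  @moduledoc false

  # Encode one pre-split piece: merge adjacent pairs greedily by rank,
  # then map every remaining part to its rank id.
  def encode(piece, ranks) do
    piece
    |> String.graphemes()
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  defp merge(parts, ranks) do
    candidates =
      parts
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.with_index()
      |> Enum.filter(fn {[a, b], _i} -> Map.has_key?(ranks, a <> b) end)

    case candidates do
      [] ->
        parts

      _ ->
        # Merge the single lowest-ranked pair, then rescan.
        {[a, b], i} = Enum.min_by(candidates, fn {[a, b], _i} -> Map.fetch!(ranks, a <> b) end)
        {prefix, [_, _ | rest]} = Enum.split(parts, i)
        merge(prefix ++ [a <> b | rest], ranks)
    end
  end
end

ranks = %{"He" => 0, "ll" => 1, "llo" => 2, "H" => 10, "e" => 11, "l" => 12, "o" => 13}
BPESketch.encode("Hello", ranks) # [0, 2]
```

With these ranks, `"Hello"` merges `H e` → `He` (rank 0), then `l l` → `ll` (rank 1), then `ll o` → `llo` (rank 2), yielding `[0, 2]`.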
Add `tiktoken_ex` to your dependencies:

```elixir
def deps do
  [
    {:tiktoken_ex, "~> 0.2.0"}
  ]
end
```

A quick example with a tiny hand-built vocabulary:

```elixir
alias TiktokenEx.Encoding

mergeable_ranks = %{
  "He" => 0,
  "ll" => 1,
  "llo" => 2,
  "H" => 10,
  "e" => 11,
  "l" => 12,
  "o" => 13
}

{:ok, enc} = Encoding.new(pat_str: ".+", mergeable_ranks: mergeable_ranks)
{:ok, ids} = Encoding.encode(enc, "Hello")
{:ok, text} = Encoding.decode(enc, ids)
```

Kimi provides:
- `tiktoken.model` (mergeable ranks)
- `tokenizer_config.json` (special tokens, etc.)
Load them from local paths:

```elixir
alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_files(
    tiktoken_model_path: "/path/to/tiktoken.model",
    tokenizer_config_path: "/path/to/tokenizer_config.json"
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
{:ok, decoded} = Encoding.decode(enc, ids)
```

`from_hf_repo/2` downloads and caches `tiktoken.model` and
`tokenizer_config.json` under your user cache directory.
```elixir
alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    revision: "main",
    encoding_cache: true
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
```

To test without network, inject a `:fetch_fun` (see `TiktokenEx.HuggingFace`).
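As a sketch of offline use — the exact `:fetch_fun` arity and return shape are assumptions here, as is the `test/fixtures` path; check `TiktokenEx.HuggingFace` for the real contract:

```elixir
# Hypothetical: assumes the fetch fun receives a URL and returns
# {:ok, binary}, and that fixture copies of the HF artifacts exist
# locally under test/fixtures.
fetch_fun = fn url ->
  {:ok, File.read!(Path.join("test/fixtures", Path.basename(url)))}
end

{:ok, enc} =
  TiktokenEx.Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    fetch_fun: fetch_fun
  )
```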
Special tokens are recognized by default. To treat them as plain text:

```elixir
{:ok, ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)
```

When special tokens overlap (one is a prefix of another), the matching behavior depends on the order of the regex alternatives.
- Default: `special_token_matching: :parity` (unspecified order; closer to upstream `tiktoken`).
- Optional: `special_token_matching: :longest` (deterministic "longest match wins").
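Why alternative order matters can be seen with plain PCRE alternation: at each position the first alternative that matches wins, even if a later alternative would match more text. A minimal demo with overlapping literals:

```elixir
# PCRE alternation is ordered. With "ab" listed before "abc", the
# shorter prefix wins; with "abc" first, the longer token wins.
shorter_first = Regex.run(~r/ab|abc/, "abc")
longer_first = Regex.run(~r/abc|ab/, "abc")

IO.inspect(shorter_first) # ["ab"]
IO.inspect(longer_first)  # ["abc"]
```

`:longest` sorts overlapping specials so the longer one is always tried first, which is what makes it deterministic.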
Kimi's upstream `pat_str` uses character-class intersections (`&&`), which are
not supported by Erlang's PCRE engine. `TiktokenEx.Kimi.pat_str/0` provides a
PCRE-compatible translation.
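One way such an intersection can be expressed in PCRE is with a lookahead. This is a sketch of the general technique, not necessarily the exact rewriting `pat_str/0` performs:

```elixir
# In a Java/Rust-style engine, [\p{L}&&\p{Lu}] intersects two classes.
# PCRE has no && operator inside classes, but a lookahead that checks
# the second class before consuming the first gives the same effect.
re = ~r/(?=\p{Lu})\p{L}/u

Regex.match?(re, "A") # true: a letter that is also uppercase
Regex.match?(re, "a") # false: a letter, but not uppercase
```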
- Run tests: `mix test`
- Run oracle parity tests (downloads HF artifacts): `mix test --include oracle`
- Run tests across backends: `scripts/test_backends.sh` (add `--oracle` to include parity)
- Run dialyzer: `mix dialyzer`
MIT © 2025 North-Shore-AI