
TiktokenEx

Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible).


TiktokenEx is a small, dependency-light implementation of the core TikToken idea:

  • Split text with a Unicode-aware regex (pat_str)
  • Encode pieces with byte-pair encoding (BPE) using mergeable_ranks
  • Optionally recognize special tokens (e.g. <|im_end|>)
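The merge loop behind step two can be sketched in plain Elixir. This is a hypothetical `BpeSketch` module, not the library's API; real byte-level BPE operates on raw bytes, while this sketch walks graphemes for readability and assumes every single character is present in `mergeable_ranks`:

```elixir
defmodule BpeSketch do
  # Encode one pre-split piece by repeatedly merging the adjacent pair
  # with the lowest rank, then mapping the surviving tokens to their ids.
  def encode_piece(piece, ranks) do
    piece
    |> String.graphemes()
    |> merge(ranks)
    |> Enum.map(&Map.get(ranks, &1))
  end

  defp merge(tokens, ranks) do
    # All adjacent pairs that exist in the rank table.
    candidates =
      tokens
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.map(fn [a, b] -> a <> b end)
      |> Enum.filter(&Map.has_key?(ranks, &1))

    case candidates do
      [] -> tokens
      _ -> tokens |> merge_once(Enum.min_by(candidates, &ranks[&1])) |> merge(ranks)
    end
  end

  # Merge the first occurrence of the best pair; merge/2 re-scans afterwards.
  defp merge_once([a, b | rest], best) do
    if a <> b == best, do: [best | rest], else: [a | merge_once([b | rest], best)]
  end

  defp merge_once(tokens, _best), do: tokens
end
```

With the toy rank table from the usage section below, `BpeSketch.encode_piece("Hello", ranks)` merges `He` (rank 0) first, then `ll` (rank 1), then `llo` (rank 2), yielding `[0, 2]`.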

It’s focused on matching the behavior of MoonshotAI’s Kimi K2 tokenizers that ship a tiktoken.model file and a TikToken-compatible pat_str.

Installation

Add tiktoken_ex to your dependencies in mix.exs:

def deps do
  [
    {:tiktoken_ex, "~> 0.2.0"}
  ]
end

Usage

Build an encoding directly

alias TiktokenEx.Encoding

mergeable_ranks = %{
  "He" => 0,
  "ll" => 1,
  "llo" => 2,
  "H" => 10,
  "e" => 11,
  "l" => 12,
  "o" => 13
}

{:ok, enc} = Encoding.new(pat_str: ".+", mergeable_ranks: mergeable_ranks)

{:ok, ids} = Encoding.encode(enc, "Hello")
{:ok, text} = Encoding.decode(enc, ids)

Load a Kimi K2 encoding from local HuggingFace artifacts

Kimi provides:

  • tiktoken.model (mergeable ranks)
  • tokenizer_config.json (special tokens, etc.)

alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_files(
    tiktoken_model_path: "/path/to/tiktoken.model",
    tokenizer_config_path: "/path/to/tokenizer_config.json"
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
{:ok, decoded} = Encoding.decode(enc, ids)

Load a Kimi K2 encoding from a HuggingFace repo (cached)

from_hf_repo/2 downloads and caches tiktoken.model and tokenizer_config.json under your user cache directory.

alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    revision: "main",
    encoding_cache: true
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")

To test without network, inject a :fetch_fun (see TiktokenEx.HuggingFace).

Special tokens

Special tokens are recognized by default. To treat them as plain text:

{:ok, ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)
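For contrast, a hedged sketch of both modes (actual ids depend on the loaded encoding; the variable names here are illustrative):

```elixir
# With specials recognized (the default), the marker resolves to its
# reserved special-token id; with allow_special_tokens: false the same
# string is run through ordinary byte-level BPE instead.
{:ok, special_ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>")
{:ok, plain_ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)
# Typically special_ids is a single reserved id, while plain_ids is
# several ordinary ids covering the literal characters.
```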

Special token matching

When special tokens overlap (one is a prefix of another), which token matches depends on the order of alternatives in the compiled special-token regex.

  • Default: special_token_matching: :parity (unspecified order; closer to upstream tiktoken).
  • Optional: special_token_matching: :longest (deterministic "longest match wins").
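As a sketch only: the exact call site for this option isn't shown here, so the example below assumes it is accepted alongside the other encoding options:

```elixir
# Assumption: special_token_matching is passed when building the encoding.
{:ok, enc} =
  TiktokenEx.Encoding.new(
    pat_str: ".+",
    mergeable_ranks: mergeable_ranks,
    # Deterministic: the longest overlapping special token wins.
    special_token_matching: :longest
  )
```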

Regex compatibility note

Kimi’s upstream pat_str uses character-class intersections (&&), which are not supported by Erlang’s PCRE engine. TiktokenEx.Kimi.pat_str/0 provides a PCRE-compatible translation.
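Since `TiktokenEx.Kimi.pat_str/0` returns a PCRE-compatible pattern string, it should compile directly with Elixir's `Regex` (a sketch; the upstream pattern with `[...&&...]` intersections would fail here):

```elixir
# Compile the translated pattern with the Unicode modifier and use it
# to pre-tokenize text into pieces, as tiktoken's pat_str is meant to.
{:ok, regex} = TiktokenEx.Kimi.pat_str() |> Regex.compile("u")
pieces = Regex.scan(regex, "Hello, world!")
```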

Development

  • Run tests: mix test
  • Run oracle parity tests (downloads HF artifacts): mix test --include oracle
  • Run tests across backends: scripts/test_backends.sh (add --oracle to include parity)
  • Run dialyzer: mix dialyzer

License

MIT © 2025 North-Shore-AI