
Tokenization + Encode + Decode (explanation)

Tokenization is how we turn raw text into a sequence of tokens (discrete units) that a model can process. A token can be a character, subword, word, or even a byte. Models work with integer ids, so tokenization also maps tokens to ids.

Encode & Decode — what they are and why they’re needed

Encode and decode are the two directions between human-readable text and model-ready numbers. Since the model operates only on numbers, the text is represented as numbers that can be translated back into text.

  • Encode: turn text → tokens → integer IDs.
    Neural nets operate on tensors of numbers, not strings. Encoding maps each token to a stable id so we can build batches, run embeddings, and train.
  • Decode: turn integer IDs → tokens → text.
    After a model samples ids, we must convert them back to readable text. For some schemes (e.g., word-level without space tokens), decoding also includes detokenization rules (e.g., no space before punctuation).
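The round trip above can be sketched with a character-level vocabulary (a minimal sketch; the names `stoi`, `itos`, `encode`, and `decode` are illustrative, not this repo's API):

```python
text = "Hello, world!"

# Build the vocabulary from the text's unique characters.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> string

def encode(s: str) -> list[int]:
    """Text -> tokens -> integer ids."""
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    """Integer ids -> tokens -> text."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text  # lossless round trip
```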

The same encoder and decoder that were used during the model's training must also be used at inference; otherwise the ids will not line up with what the model learned.


Types of tokenization

Character-level tokenization

Each character in the text is a token, so the vocabulary is tiny compared to other tokenization methods. The downside is that the sequences become long.

Example: "Hello, world!" → ['H','e','l','l','o',',',' ','w','o','r','l','d','!']
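In Python, the example above is just the list of characters:

```python
# Character-level tokenization: every character is its own token.
tokens = list("Hello, world!")
# ['H','e','l','l','o',',',' ','w','o','r','l','d','!']
```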

Word-level tokenization

Each word is a token; punctuation is often its own token (depending on your rules, spaces may or may not be tokens). This gives a larger vocabulary but shorter sequences. If you exclude spaces from the vocab, re-insert them at decode (e.g., no space before , . ! ? and after opening brackets).

Example: "Hello, world!" → ['Hello', ',' , 'world', '!']
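A word-level tokenizer without space tokens can be sketched with a regex, plus a detokenizer that re-inserts spaces at decode (function names and the punctuation rule are illustrative assumptions, not this repo's implementation):

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Words and punctuation marks become separate tokens; spaces are dropped.
    return re.findall(r"\w+|[^\w\s]", text)

def detokenize(tokens: list[str]) -> str:
    out = ""
    for tok in tokens:
        # No space before punctuation, and none before the first token.
        if re.fullmatch(r"[^\w\s]", tok) or not out:
            out += tok
        else:
            out += " " + tok
    return out

tokens = word_tokenize("Hello, world!")
# ['Hello', ',', 'world', '!']
assert detokenize(tokens) == "Hello, world!"
```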

Subword-level tokenization

Words are decomposed into frequent pieces. This is the dominant approach in modern LMs. It is a compromise between word- and character-level tokenization, giving a vocabulary size and a sequence length in between the two.

Example (BPE-ish): "unhappiness" → ['un', 'happiness'] or ['un', 'hap', 'piness']
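The idea of decomposing a word into known pieces can be sketched with a greedy longest-match split over a toy vocabulary (a real BPE tokenizer learns its merges from corpus statistics; the vocabulary here is a made-up assumption just to show the mechanics):

```python
# Hypothetical subword vocabulary; real ones hold tens of thousands of pieces.
vocab = {"un", "hap", "piness", "happiness", "ness"}

def subword_split(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocab piece starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(subword_split("unhappiness"))  # ['un', 'happiness']
```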


Design tips

When to pick what (quick guide)

  • Small toy LM or pedagogical bigram/trigram:
    • Char-level (simple) or word-level without spaces + detokenizer.
  • Production/large LM:
    • Subword (BPE/WordPiece/SentencePiece) or byte-level BPE.
  • Classical ML features (TF-IDF, LDA):
    • Word-level; remove punctuation; consider lowercasing/stopwords.

Key design choices (that change model behavior)

  • Spaces: keep as tokens (char/subword) vs. exclude and reinsert at decode (word-level + detokenizer).
  • Punctuation: separate tokens (common) vs. attached to words (simpler but messy).
  • Casing/normalization: lowercase, Unicode normalize (NFC/NFKC), strip accents—trades info for robustness.
  • Special tokens: BOS, EOS, PAD, UNK—reserve ids and ensure they don’t collide with real tokens (e.g., the space).
  • Vocabulary size: larger → fewer steps but more params; smaller → longer sequences but cheaper embedding layers.
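The special-token choice above can be sketched by reserving the low ids before the real vocabulary so BOS/EOS/PAD/UNK never collide with actual characters (token spellings and layout are assumptions, not this repo's convention):

```python
# Hypothetical special-token set; ids 0..3 are reserved for them.
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]

chars = sorted(set("Hello, world!"))
stoi = {tok: i for i, tok in enumerate(SPECIALS)}
stoi.update({ch: i + len(SPECIALS) for i, ch in enumerate(chars)})

UNK = stoi["<unk>"]

def encode(s: str) -> list[int]:
    # Unknown characters fall back to the UNK id instead of crashing.
    return [stoi.get(ch, UNK) for ch in s]

# Real tokens start at id 4, so even the space gets its own distinct id.
assert stoi["<pad>"] == 0 and stoi[" "] >= len(SPECIALS)
```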

Implemented

In this repo, character- and word-level tokenization are currently implemented in the tokenization.py file; this section illustrates how to use them.

To create a map from string → integer and integer → string, use the following function. It creates a JSON file with the mapping for the encoder and decoder to use.

    create_<tokenization>_tok_map()

Select an encoder:

    <Tokenization>_encode()

And select a decoder:

    <Tokenization>_decode()