FBR65/no_tokenizer

No-Tokenizer: Hierarchical Autoregressive Transformer

This project implements a Hierarchical Autoregressive Transformer (HAT): a tokenizer-free language model that combines byte-level and word-level processing for robustness and adaptability.

Overview

Traditional language models depend on a tokenizer, which can be brittle and handles out-of-vocabulary words poorly. This project instead processes text directly at the byte level and organizes it hierarchically:

  1. Byte-level processing: Works directly with raw bytes, avoiding tokenization issues
  2. Hierarchical structure: Aggregates byte representations into word-like units
  3. Autoregressive generation: Predicts the next byte from the preceding context
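
The three steps above can be illustrated on the data side with a minimal sketch. It is not taken from the repository: the helper names (`to_byte_ids`, `split_word_units`, `next_byte_targets`) are hypothetical, and splitting word-like units at whitespace is an assumption about how the hierarchy is formed.

```python
import re

def to_byte_ids(text):
    # Step 1: raw UTF-8 bytes, each an integer in 0..255 (no vocabulary lookup).
    return list(text.encode("utf-8"))

def split_word_units(text):
    # Step 2 (assumed rule): a word-like unit is a run of non-whitespace
    # characters plus any whitespace that follows it.
    return [m.group(0).encode("utf-8") for m in re.finditer(r"\S+\s*|\s+", text)]

def next_byte_targets(byte_ids):
    # Step 3: autoregressive pairs; the target at position i is the byte at i + 1.
    return byte_ids[:-1], byte_ids[1:]

text = "héllo wörld"
ids = to_byte_ids(text)
units = split_word_units(text)
inputs, targets = next_byte_targets(ids)

print(len(ids), "bytes in", len(units), "word-like units")
print(units)
```

Concatenating the units reproduces the original byte sequence, so the grouping adds structure without discarding any input.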

Key Features

  • Tokenizer-free: No need for vocabulary management or handling of out-of-vocabulary tokens
  • Multilingual by design: Raw bytes cover multiple languages and scripts without language-specific vocabularies
  • Hierarchical processing: Combines the advantages of both character and word-level models
  • Efficient implementation: Leverages PyTorch for GPU acceleration
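
As a small illustration of the tokenizer-free and multilingual points, the sketch below shows that text in several scripts maps into the same 256-value byte range and decodes back losslessly. It assumes UTF-8 input; the sample strings and helper names are hypothetical and not part of the project.

```python
def byte_ids(text):
    # Encode text as UTF-8 and expose each byte as an integer in 0..255.
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    # Inverse of byte_ids: rebuild the bytes object and decode it.
    return bytes(ids).decode("utf-8")

# Hypothetical samples covering Latin, CJK, and emoji characters.
samples = {"English": "hello", "German": "größer", "Japanese": "日本語", "Emoji": "🙂"}

for name, text in samples.items():
    ids = byte_ids(text)
    # Every ID fits the fixed 256-entry byte range, whatever the script.
    assert all(0 <= i < 256 for i in ids)
    # The round trip is lossless.
    assert from_byte_ids(ids) == text
    print(f"{name}: {len(text)} characters -> {len(ids)} bytes")
```

Non-ASCII characters expand to several bytes each, which is one reason to aggregate bytes into word-like units before modeling longer contexts.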

Getting Started

See the documentation in the hierarchical-transformer directory for installation and usage instructions.

License

This project is licensed under the MIT License.
