This project implements a Hierarchical Autoregressive Transformer (HAT) that combines byte- and word-level processing for robust and adaptable language models using a tokenizer-free approach.
Traditional language models rely heavily on tokenizers, which can be brittle and have difficulty with out-of-vocabulary words. This project takes a different approach by processing text directly at the byte level and organizing it hierarchically:
- Byte-level processing: Works directly with raw bytes, avoiding tokenization issues
- Hierarchical structure: Aggregates byte representations into word-like units
- Autoregressive generation: Predicts next bytes based on context
- Tokenizer-free: No need for vocabulary management or handling of out-of-vocabulary tokens
- Multilingual by design: Natural support for multiple languages and scripts
- Hierarchical processing: Combines the advantages of both character and word-level models
- Efficient implementation: Leverages PyTorch for GPU acceleration
See the documentation in the hierarchical-transformer directory for installation and usage instructions.
This project is licensed under the MIT License.