No-Tokenizer: Hierarchical Autoregressive Transformer

This project implements a Hierarchical Autoregressive Transformer (HAT): a tokenizer-free language model that combines byte-level and word-level processing, with the aim of being more robust and adaptable than models built on a fixed vocabulary.

Overview

Traditional language models rely heavily on tokenizers, which can be brittle and have difficulty with out-of-vocabulary words. This project takes a different approach by processing text directly at the byte level and organizing it hierarchically:

Byte-level processing: Works directly with raw bytes, avoiding tokenization issues
Hierarchical structure: Aggregates byte representations into word-like units
Autoregressive generation: Predicts next bytes based on context
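
The three ideas above can be illustrated with a small standard-library sketch. It is not the project's implementation: the function names are hypothetical, and splitting after each ASCII space is only an assumed rule for forming word-like units (the actual code in hierarchical-transformer may use a different one).

```python
def to_bytes(text: str) -> list[int]:
    """Encode text as UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def split_into_word_units(byte_ids: list[int]) -> list[list[int]]:
    """Group bytes into word-like units, closing a unit after each space.

    Assumption: splitting after the ASCII space byte (0x20) is illustrative.
    It is safe for UTF-8 because 0x20 never occurs inside a multi-byte
    character.
    """
    units: list[list[int]] = []
    current: list[int] = []
    for b in byte_ids:
        current.append(b)
        if b == 0x20:
            units.append(current)
            current = []
    if current:  # keep the trailing unit that has no closing space
        units.append(current)
    return units


def next_byte_pairs(byte_ids: list[int]) -> list[tuple[int, int]]:
    """Pair every byte with its successor: the next-byte prediction targets."""
    return list(zip(byte_ids[:-1], byte_ids[1:]))


byte_ids = to_bytes("héllo wörld")
units = split_into_word_units(byte_ids)
pairs = next_byte_pairs(byte_ids)

print(len(byte_ids))                               # number of bytes
print([bytes(u).decode("utf-8") for u in units])   # word-like units as text
print(pairs[:3])                                   # first (input, target) pairs
```

In a model along these lines, each word-like unit would be summarized into a single vector for the word-level part of the network, while the byte-level part is trained on the (input, target) pairs.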

Key Features

Tokenizer-free: No need for vocabulary management or handling of out-of-vocabulary tokens
Multilingual by design: Any text that can be encoded as bytes is representable, so multiple languages and scripts are supported without a per-language vocabulary
Hierarchical processing: Combines the advantages of both character and word-level models
Efficient implementation: Leverages PyTorch for GPU acceleration
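
The tokenizer-free and multilingual points can be checked with a short sketch, assuming UTF-8 as the byte encoding (the README does not name one). The helper names and sample strings are hypothetical and are not taken from the project.

```python
BYTE_VOCAB_SIZE = 256  # every possible byte value; no learned vocabulary


def encode(text: str) -> list[int]:
    """Map text to byte IDs. Assumes UTF-8 encoding."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Invert encode() for any complete UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")


samples = ["hello", "naïve", "日本語", "Привет", "🙂"]
lengths = {}
for s in samples:
    ids = encode(s)
    # Every ID falls inside the fixed 256-entry alphabet, so nothing is
    # ever out of vocabulary, whatever the script.
    assert all(0 <= i < BYTE_VOCAB_SIZE for i in ids)
    assert decode(ids) == s  # lossless round trip
    lengths[s] = (len(s), len(ids))
    print(s, lengths[s])  # (characters, bytes)
```

Non-ASCII scripts use several bytes per character, so byte sequences are longer than the character count. That extra sequence length is the cost a byte-level model pays for dropping the vocabulary.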

Getting Started

See the documentation in the hierarchical-transformer directory for installation and usage instructions.

License

This project is licensed under the MIT License.

