BitNet-TT

A native TT-NN implementation of Microsoft's BitNet b1.58 2B-4T on Tenstorrent Blackhole p150a.

The system targets a single-user inference setting with 2-bit ternary weights packed into the platform's BFP2 format and decoded via a captured Metal-Trace pipeline.
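As a rough illustration of why ternary weights fit in 2 bits, the sketch below packs values from {-1, 0, +1} four to a byte and unpacks them again. The code assignment (`value + 1`) and the byte layout are assumptions made for the example. The hardware BFP2 format has its own bit layout and tile/face ordering, which this sketch does not reproduce. `pack_ternary` and `unpack_ternary` are hypothetical helper names, not functions from this repository.

```python
def pack_ternary(values):
    """Pack ternary values (-1, 0, +1) into bytes, four 2-bit codes per byte.

    Illustrative encoding only (code = value + 1); not the BFP2 bit layout.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[i:i + 4]):
            if v not in (-1, 0, 1):
                raise ValueError(f"not a ternary value: {v!r}")
            byte |= (v + 1) << (2 * slot)  # slot 0 occupies the low two bits
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover the list of -1/0/+1 values."""
    return [((b >> (2 * slot)) & 0b11) - 1 for b in packed for slot in range(4)]


# A synthetic 32x32 tile of ternary weights, flattened row-major.
tile = [((i * 7) % 3) - 1 for i in range(32 * 32)]
packed = pack_ternary(tile)
print(len(packed))  # 1024 values * 2 bits / 8 = 256 bytes
```

At 2 bits per weight the payload is one eighth of a 16-bit representation, which is where an 8× reduction relative to BF16 comes from.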

Headlines (re-measured 2026-05-06)

Metric	Value
Decode throughput (packed_ternary, batch-32 internal)	73.4 ± 0.4 t/s
Speed-up vs `bitnet.cpp` (CPU pinned single-socket, t=16)	3.25×
Speed-up vs `bitnet.cpp` (CPU as-shipped default, t=2)	4.1×
Energy/token (sustained) — TT vs CPU	0.97 vs 3.79 J = 3.9× lower
Energy/token (burst) — TT vs CPU	0.80 vs 3.83 J = 4.7× lower
Pearson correlation (PCC) vs HF reference (16-prompt mean)	0.86 ± 0.10
Model footprint	~600 MB (8× smaller than BF16)

Full measurement journal in paper/data/RESULTS.md; methodology in paper/ (AdaptFM @ ICML 2026 submission).
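For readers unfamiliar with the accuracy metric, the following is a minimal sketch of a Pearson correlation coefficient between two flattened output vectors, averaged over prompts. It is not the code in bench_accuracy.py. Which tensors are compared and how the per-prompt values are aggregated there are not described here, so the inputs below are made-up stand-ins and `pearson_cc` is a hypothetical helper.

```python
import math


def pearson_cc(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    cov = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return float("nan")  # undefined when either input is constant
    return cov / norm


# Made-up (reference, device) output pairs standing in for per-prompt results.
pairs = [
    ([2.0, -1.0, 0.5, 3.0], [1.9, -0.8, 0.4, 3.2]),
    ([0.1, 0.7, -0.3, 1.2], [0.3, 0.5, -0.1, 1.0]),
]
per_prompt = [pearson_cc(ref, dev) for ref, dev in pairs]
mean_pcc = sum(per_prompt) / len(per_prompt)
print(per_prompt, mean_pcc)
```

A value of 1.0 means the two vectors agree up to a positive scale and offset, so the metric is insensitive to uniform scaling of the outputs.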

Layout

bitnet-tt/
├── README.md, LICENSE, pyproject.toml
├── main.py                  # interactive entry
├── bench_batch32.py         # production throughput bench
├── bench_accuracy.py        # PCC vs HF reference
├── src/bitnet_tt/           # the library
├── tests/                   # canonical test suite (pytest)
├── examples/demo.py         # end-to-end demo
├── scripts/
│   ├── bench/               # secondary benches (vs bitnet.cpp / HF / TPS / profile)
│   ├── pcc_localize.py      # per-op RFE harness
│   └── legacy/              # archived dev scratch
├── docs/
│   ├── STATUS.md            # consolidated reference (Sessions 3-13)
│   ├── plan_*.md            # Phase K kernel-fork future-work plans
│   └── _archive/            # session histories + working notes
└── paper/                   # AdaptFM workshop submission package

Quick start

# On the Tenstorrent p150a server
source ~/.tenstorrent-venv/bin/activate
cd ~/bitnet-tt

# Throughput benchmark (128-token greedy decode)
TT_METAL_ENABLE_L1_DATA_CACHE_RISCVS=BR,NC,TR,ER \
BITNET_TT_TRACE_REGION_SIZE=200000000 \
python bench_batch32.py --dtype packed_ternary --max-new 128

# Accuracy comparison vs HuggingFace fp32 reference
python bench_accuracy.py --dtype packed_ternary --decode-steps 16

# Interactive chat
python main.py --chat

# Comparison benches (now under scripts/bench/)
python scripts/bench/bench_vs_bitnetcpp.py --ref-logits /path/to/cpp.bin
python scripts/bench/profile_decode.py --dtype packed_ternary

Key implementation choices

2-bit BFP2_b weights packed at 256 bytes per 32×32 tile, unpacked in hardware inside the matmul compute kernel.
Fused RMSNorm + Q/K/V matmul: a single fused kernel reads the input activation once, computes the per-row sum-of-squares reduction along the matmul reduction loop, and emits the Q/K/V projections without materialising the post-norm tensor.
Trace-captured decode loop: the entire decode step (embedding lookup → 30 transformer blocks → final norm → 4-way split lm_head → in-trace multicore argmax) is captured once and replayed without host involvement.
HuggingFace weights: loads microsoft/bitnet-b1.58-2B-4T-bf16 directly; the weight loader handles the BitLinear sub-norms.
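The fused RMSNorm + matmul idea can be sketched on the host in plain Python. Because the RMS scale is a single scalar per row, it can be factored out of the matmul: the loop over the reduction dimension accumulates both the dot products and the sum of squares, and the scale is applied once at the end. This is a minimal sketch under stated assumptions, not the device kernel. It assumes a per-channel gain applied before the projection, a single weight matrix standing in for the concatenated Q/K/V weights, and an `eps` value chosen for the example; both function names are hypothetical.

```python
import math


def rmsnorm_then_matmul(x, gain, w, eps=1e-6):
    """Unfused reference: materialise the normalised row, then multiply."""
    k = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / k + eps)
    normed = [x[i] * inv_rms * gain[i] for i in range(k)]
    return [sum(normed[i] * w[i][j] for i in range(k)) for j in range(len(w[0]))]


def fused_rmsnorm_matmul(x, gain, w, eps=1e-6):
    """Fused variant: one pass over x, no post-norm row is ever stored."""
    k, n = len(x), len(w[0])
    acc = [0.0] * n
    sum_sq = 0.0
    for i in range(k):  # the matmul reduction loop
        xi = x[i]
        sum_sq += xi * xi          # sum-of-squares rides along with the matmul
        scaled = xi * gain[i]
        row = w[i]
        for j in range(n):
            acc[j] += scaled * row[j]
    inv_rms = 1.0 / math.sqrt(sum_sq / k + eps)
    return [a * inv_rms for a in acc]  # apply the scalar scale once per row


x = [0.5, -1.0, 2.0, 0.25]
gain = [1.0, 0.5, 2.0, 1.5]
w = [[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, 1]]  # ternary weights, shape 4x3
ref = rmsnorm_then_matmul(x, gain, w)
fused = fused_rmsnorm_matmul(x, gain, w)
print(max(abs(a - b) for a, b in zip(ref, fused)))  # difference is rounding only
```

The two functions agree up to floating-point rounding. The fused form reads each input element once and skips the intermediate normalised tensor.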

Limitations

Single hardware platform (Tenstorrent Blackhole p150a) — porting to other accelerators would require re-deriving the BFP2 face permutation and the multi-core RMSNorm shard geometry.
Closing the residual quality gap (PCC < 0.99) likely requires operator-level changes inside the runtime's C++ kernel library.
bench_accuracy.py has a model-cache aliasing bug for non-2-bit dtypes; the throughput sweep uses a separate path and is unaffected. See paper/data/RESULTS.md.
