A native TT-NN implementation of Microsoft's BitNet b1.58 2B-4T on Tenstorrent Blackhole p150a.
The system targets a single-user inference setting with 2-bit ternary weights packed into the platform's BFP2 format and decoded via a captured Metal-Trace pipeline.
| Metric | Value |
|---|---|
| Decode throughput (packed_ternary, batch-32 internal) | 73.4 ± 0.4 t/s |
| Speed-up vs bitnet.cpp (CPU pinned single-socket, t=16) | 3.25× |
| Speed-up vs bitnet.cpp (CPU as-shipped default, t=2) | 4.1× |
| Energy/token (sustained) — TT vs CPU | 0.97 vs 3.79 J = 3.9× lower |
| Energy/token (burst) — TT vs CPU | 0.80 vs 3.83 J = 4.7× lower |
| Pearson correlation vs HF reference (16-prompt mean) | 0.86 ± 0.10 |
| Model footprint | ~600 MB (8× smaller than BF16) |

The full measurement journal is in paper/data/RESULTS.md; the methodology is in paper/ (AdaptFM @ ICML 2026 submission).
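
The footprint ratio in the table follows directly from storing each ternary weight in 2 bits instead of 16. The sketch below checks that arithmetic by packing four ternary values per byte. The `ENCODE` table and the bit order are hypothetical choices made for illustration; the actual BFP2 layout and its face permutation are not reproduced here.

```python
# Hypothetical 2-bit code per ternary value (illustration only; this is
# NOT the platform's BFP2 bit layout).
ENCODE = {-1: 0b10, 0: 0b00, 1: 0b01}
DECODE = {code: value for value, code in ENCODE.items()}


def pack_ternary(values):
    """Pack ternary values four per byte, lowest bit pair first."""
    if len(values) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[i:i + 4]):
            byte |= ENCODE[v] << (2 * slot)
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed):
    """Inverse of pack_ternary."""
    values = []
    for byte in packed:
        for slot in range(4):
            values.append(DECODE[(byte >> (2 * slot)) & 0b11])
    return values


# One 32x32 tile of ternary weights.
tile = [(i % 3) - 1 for i in range(32 * 32)]
packed = pack_ternary(tile)

assert unpack_ternary(packed) == tile   # lossless round trip
bytes_per_tile = len(packed)            # 1024 weights * 2 bits / 8
ratio_vs_bf16 = 16 // 2                 # BF16 uses 16 bits per weight
```

Here `bytes_per_tile` comes out at 256 and `ratio_vs_bf16` at 8, matching the per-tile size and the "8× smaller than BF16" figure quoted in this README.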
bitnet-tt/
├── README.md, LICENSE, pyproject.toml
├── main.py # interactive entry
├── bench_batch32.py # production throughput bench
├── bench_accuracy.py # PCC vs HF reference
├── src/bitnet_tt/ # the library
├── tests/ # canonical test suite (pytest)
├── examples/demo.py # end-to-end demo
├── scripts/
│ ├── bench/ # secondary benches (vs bitnet.cpp / HF / TPS / profile)
│ ├── pcc_localize.py # per-op RFE harness
│ └── legacy/ # archived dev scratch
├── docs/
│ ├── STATUS.md # consolidated reference (Sessions 3-13)
│ ├── plan_*.md # Phase K kernel-fork future-work plans
│ └── _archive/ # session histories + working notes
└── paper/ # AdaptFM workshop submission package
# On the Tenstorrent p150a server
source ~/.tenstorrent-venv/bin/activate
cd ~/bitnet-tt
# Throughput benchmark (128-token greedy decode)
TT_METAL_ENABLE_L1_DATA_CACHE_RISCVS=BR,NC,TR,ER \
BITNET_TT_TRACE_REGION_SIZE=200000000 \
python bench_batch32.py --dtype packed_ternary --max-new 128
# Accuracy comparison vs HuggingFace fp32 reference
python bench_accuracy.py --dtype packed_ternary --decode-steps 16
# Interactive chat
python main.py --chat
# Comparison benches (now under scripts/bench/)
python scripts/bench/bench_vs_bitnetcpp.py --ref-logits /path/to/cpp.bin
python scripts/bench/profile_decode.py --dtype packed_ternary

- 2-bit BFP2_b weights packed at 256 bytes per 32×32 tile, unpacked in hardware inside the matmul compute kernel.
- Fused RMSNorm + Q/K/V matmul: a single fused kernel reads the input activation once, computes the per-row sum-of-squares reduction along the matmul reduction loop, and emits the Q/K/V projections without materialising the post-norm tensor.
- Trace-captured decode loop: the entire decode step (embedding lookup → 30 transformer blocks → final norm → 4-way split lm_head → in-trace multicore argmax) is captured once and replayed without host involvement.
- HuggingFace weights: load microsoft/bitnet-b1.58-2B-4T-bf16 directly; the weight loader handles the BitLinear sub-norms.
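
The fused RMSNorm + matmul item above relies on the fact that the RMS factor is a single scalar per row, so it can be applied after the matmul accumulation instead of before it. The following is a minimal pure-Python sketch of that identity, not the actual TT-NN kernel: the function names are hypothetical, and it ignores tiling, multi-core sharding, and any activation quantisation the real pipeline performs.

```python
import math


def rmsnorm_then_matmul(x, gain, w, eps=1e-6):
    """Reference path: materialise the normalised row, then multiply."""
    k = len(x)
    rms = math.sqrt(sum(v * v for v in x) / k + eps)
    normed = [v * g / rms for v, g in zip(x, gain)]
    return [sum(normed[i] * w[i][j] for i in range(k))
            for j in range(len(w[0]))]


def fused_rmsnorm_matmul(x, gain, w, eps=1e-6):
    """Fused path: read x once, accumulating sum-of-squares and dot products."""
    k, n = len(x), len(w[0])
    acc = [0.0] * n
    sumsq = 0.0
    for i in range(k):               # the matmul reduction loop
        xi = x[i]
        sumsq += xi * xi             # RMSNorm reduction rides along
        scaled = xi * gain[i]
        for j in range(n):
            acc[j] += scaled * w[i][j]
    inv_rms = 1.0 / math.sqrt(sumsq / k + eps)
    return [a * inv_rms for a in acc]  # per-row scale applied once at the end


x = [1.0, -2.0, 3.0, 0.5]
gain = [1.0, 0.5, 2.0, 1.5]
w = [[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, 1]]  # ternary weights

ref = rmsnorm_then_matmul(x, gain, w)
fused = fused_rmsnorm_matmul(x, gain, w)
```

Both paths agree up to floating-point rounding, while the fused one never builds the post-norm vector. The same output could feed Q, K, and V by treating `w` as the concatenation of the three projection matrices.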
- Single hardware platform (Tenstorrent Blackhole p150a) — porting to other accelerators would require re-deriving the BFP2 face permutation and the multi-core RMSNorm shard geometry.
- Closing the residual quality gap (PCC < 0.99) likely requires operator-level changes inside the runtime's C++ kernel library.
- bench_accuracy.py exhibits a model-cache aliasing bug for non-2-bit dtypes; the throughput sweep uses a separate path and is unaffected. See paper/data/RESULTS.md.
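
For readers unfamiliar with the PCC metric quoted in this README, the sketch below shows one common way to compute a Pearson correlation between two logit vectors and average it over prompts. How bench_accuracy.py actually computes and aggregates the metric is not shown here, so the function names and the per-prompt averaging are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def mean_pcc(device_logits, reference_logits):
    """Average the per-prompt PCC (assumed aggregation, one vector per prompt)."""
    scores = [pearson(d, r) for d, r in zip(device_logits, reference_logits)]
    return sum(scores) / len(scores)


# Toy data: the first prompt matches the reference up to scale, the second is inverted.
device = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0]]
reference = [[2.0, 4.0, 6.0, 8.0], [3.0, 2.0, 1.0]]
score = mean_pcc(device, reference)
```

Because Pearson correlation is invariant to scale and offset, a value below 0.99 points to differences in the shape of the logit distribution rather than a simple rescaling.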