deveworld/bitnet-tt

BitNet-TT

A native TT-NN implementation of Microsoft's BitNet b1.58 2B-4T on Tenstorrent Blackhole p150a.

The system targets single-user inference: ternary weights are stored at 2 bits each in the platform's BFP2 format, and token decoding runs through a captured Metal-Trace pipeline.

Headlines (re-measured 2026-05-06)

| Metric                                                   | Value                           |
| -------------------------------------------------------- | ------------------------------- |
| Decode throughput (packed_ternary, batch-32 internal)    | 73.4 ± 0.4 t/s                  |
| Speed-up vs bitnet.cpp (CPU pinned single-socket, t=16)  | 3.25×                           |
| Speed-up vs bitnet.cpp (CPU as-shipped default, t=2)     | 4.1×                            |
| Energy/token (sustained), TT vs CPU                      | 0.97 vs 3.79 J = 3.9× lower     |
| Energy/token (burst), TT vs CPU                          | 0.80 vs 3.83 J = 4.7× lower     |
| Pearson correlation vs HF reference (16-prompt mean)     | 0.86 ± 0.10                     |
| Model footprint                                          | ~600 MB (8× smaller than BF16)  |

Full measurement journal in paper/data/RESULTS.md; methodology in paper/ (AdaptFM @ ICML 2026 submission).
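
The Pearson correlation figure compares logits from the TT implementation against the HF reference. As a rough illustration of the metric only, here is a minimal plain-Python sketch. The function name `pearson_cc` and the toy logit vectors are hypothetical and are not taken from bench_accuracy.py, which may select and aggregate logits differently before averaging over its 16 prompts.

```python
import math


def pearson_cc(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical logits for one decode step (illustrative, not measured data).
ref_logits = [2.0, -1.0, 0.5, 3.0]
tt_logits = [1.8, -0.7, 0.9, 2.6]
pcc = pearson_cc(ref_logits, tt_logits)
```

A value of 1.0 means the two logit vectors are perfectly linearly related; lower values indicate that the accelerator output has drifted from the reference.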

Layout

bitnet-tt/
├── README.md, LICENSE, pyproject.toml
├── main.py                  # interactive entry
├── bench_batch32.py         # production throughput bench
├── bench_accuracy.py        # PCC vs HF reference
├── src/bitnet_tt/           # the library
├── tests/                   # canonical test suite (pytest)
├── examples/demo.py         # end-to-end demo
├── scripts/
│   ├── bench/               # secondary benches (vs bitnet.cpp / HF / TPS / profile)
│   ├── pcc_localize.py      # per-op RFE harness
│   └── legacy/              # archived dev scratch
├── docs/
│   ├── STATUS.md            # consolidated reference (Sessions 3-13)
│   ├── plan_*.md            # Phase K kernel-fork future-work plans
│   └── _archive/            # session histories + working notes
└── paper/                   # AdaptFM workshop submission package

Quick start

# On the Tenstorrent p150a server
source ~/.tenstorrent-venv/bin/activate
cd ~/bitnet-tt

# Throughput benchmark (128-token greedy decode)
TT_METAL_ENABLE_L1_DATA_CACHE_RISCVS=BR,NC,TR,ER \
BITNET_TT_TRACE_REGION_SIZE=200000000 \
python bench_batch32.py --dtype packed_ternary --max-new 128

# Accuracy comparison vs HuggingFace fp32 reference
python bench_accuracy.py --dtype packed_ternary --decode-steps 16

# Interactive chat
python main.py --chat

# Comparison benches (now under scripts/bench/)
python scripts/bench/bench_vs_bitnetcpp.py --ref-logits /path/to/cpp.bin
python scripts/bench/profile_decode.py --dtype packed_ternary
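
For scripted sweeps, the throughput benchmark invocation above can be assembled from Python. This is a minimal sketch that only reuses the environment variables and flags shown in the quick start; the helper names `build_bench_command` and `run_bench` are hypothetical and are not part of the repository.

```python
import os
import subprocess
import sys


def build_bench_command(max_new=128, dtype="packed_ternary"):
    """Return (argv, env) mirroring the quick-start throughput benchmark."""
    env = dict(os.environ)
    # Both settings are copied verbatim from the quick-start command.
    env["TT_METAL_ENABLE_L1_DATA_CACHE_RISCVS"] = "BR,NC,TR,ER"
    env["BITNET_TT_TRACE_REGION_SIZE"] = "200000000"
    argv = [
        sys.executable, "bench_batch32.py",
        "--dtype", dtype,
        "--max-new", str(max_new),
    ]
    return argv, env


def run_bench(**kwargs):
    """Run the benchmark from the repository root; raises if it exits non-zero."""
    argv, env = build_bench_command(**kwargs)
    return subprocess.run(argv, env=env, check=True)
```

Calling `run_bench(max_new=128)` from the repository root, inside the activated virtual environment, should be equivalent to the shell command above.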

Key implementation choices

  • 2-bit BFP2_b weights packed at 256 bytes per 32×32 tile, unpacked in hardware inside the matmul compute kernel.
  • Fused RMSNorm + Q/K/V matmul: a single fused kernel reads the input activation once, computes the per-row sum-of-squares reduction along the matmul reduction loop, and emits the Q/K/V projections without materialising the post-norm tensor.
  • Trace-captured decode loop: the entire decode step (embedding lookup → 30 transformer blocks → final norm → 4-way split lm_head → in-trace multicore argmax) is captured once and replayed without host involvement.
  • HuggingFace weights: the loader reads microsoft/bitnet-b1.58-2B-4T-bf16 directly and handles the BitLinear sub-norms.
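
The 256-byte tile size in the first item follows from 32 × 32 = 1,024 weights at 2 bits each. The sketch below packs a flattened ternary tile four weights per byte to show that arithmetic. The 2-bit code assignment and the row-major ordering are assumptions made for illustration; the actual BFP2_b layout, including its face permutation, is not reproduced here.

```python
TILE_DIM = 32

# Assumed 2-bit codes for illustration; the real BFP2_b encoding may differ.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}
DECODE = {code: value for value, code in ENCODE.items()}


def pack_tile(weights):
    """Pack 1,024 ternary weights (flattened, row-major) into 256 bytes."""
    if len(weights) != TILE_DIM * TILE_DIM:
        raise ValueError("expected a flattened 32x32 tile")
    packed = bytearray(len(weights) // 4)
    for i, w in enumerate(weights):
        # Four 2-bit codes share one byte, lowest bits first.
        packed[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(packed)


def unpack_tile(packed):
    """Inverse of pack_tile: recover the ternary weights."""
    weights = []
    for byte in packed:
        for slot in range(4):
            weights.append(DECODE[(byte >> (2 * slot)) & 0b11])
    return weights


# Toy ternary pattern cycling through -1, 0, +1.
tile = [(i % 3) - 1 for i in range(TILE_DIM * TILE_DIM)]
packed = pack_tile(tile)
```

Two bits per weight against 16 bits for BF16 is also where the 8× footprint ratio in the headline table comes from.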

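The fused RMSNorm + Q/K/V item relies on the fact that the per-row normalisation factor is a scalar, so it can be applied to the matmul output instead of the input. The following plain-Python sketch checks that identity for a single activation row. It assumes the norm gain `gamma` is multiplied into the activation inside the loop and that the Q/K/V weights are concatenated into one matrix `w`; the function names, `k_tile`, and `eps` are hypothetical, activation quantisation is omitted, and the on-device kernel may be organised differently.

```python
import math


def rmsnorm_then_matmul(x, gamma, w, eps=1e-5):
    """Reference path: materialise the normalised row, then multiply."""
    k, n = len(x), len(w[0])
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / k + eps)
    normed = [x[i] * inv_rms * gamma[i] for i in range(k)]
    return [sum(normed[i] * w[i][j] for i in range(k)) for j in range(n)]


def fused_rmsnorm_matmul(x, gamma, w, eps=1e-5, k_tile=4):
    """Fused path: accumulate sum-of-squares in the K loop, scale output once."""
    k, n = len(x), len(w[0])
    acc = [0.0] * n
    sum_sq = 0.0
    for k0 in range(0, k, k_tile):              # walk the reduction dim in tiles
        for i in range(k0, min(k0 + k_tile, k)):
            sum_sq += x[i] * x[i]               # reduction rides along with the matmul
            xg = x[i] * gamma[i]
            for j in range(n):
                acc[j] += xg * w[i][j]
    inv_rms = 1.0 / math.sqrt(sum_sq / k + eps)
    return [a * inv_rms for a in acc]           # post-norm tensor never materialised


# Toy data: one activation row and a ternary weight matrix (illustrative only).
x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]
gamma = [1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 1.0]
w = [[((i + j) % 3) - 1 for j in range(6)] for i in range(8)]
reference = rmsnorm_then_matmul(x, gamma, w)
fused = fused_rmsnorm_matmul(x, gamma, w)
```

Both paths read the activation row once per use, but only the reference path needs a buffer for the normalised row; the fused path needs just the running sum of squares.
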
Limitations

  • Single hardware platform (Tenstorrent Blackhole p150a) — porting to other accelerators would require re-deriving the BFP2 face permutation and the multi-core RMSNorm shard geometry.
  • Closing the residual quality gap (PCC < 0.99) likely requires operator-level changes inside the runtime's C++ kernel library.
  • bench_accuracy.py exhibits a model-cache aliasing bug for non-2-bit dtypes; the throughput sweep uses a separate code path and is unaffected. See paper/data/RESULTS.md.

About

BitNet LLM with Tenstorrent p150a (tenstorrent-korea-oss-student-program)
