Portable C11 inference engine for
BitMamba-2 models (255M and 1B parameters).
Weights are BitNet 1.58-bit ternary values {-1, 0, +1} packed 4 per byte,
reducing the linear projection matmuls to additions and subtractions.
Non-ternary components (SSM recurrence, convolution, norms) still use float32.
Edge AI inference is memory-bandwidth-bound. On a Raspberry Pi 5 (Cortex-A76, ~18 GB/s LPDDR4X), a 4-bit 1.5B Transformer already saturates the memory bus at ~30-40 tok/s. Scaling to 3.8B drops to ~5 tok/s because the CPU stalls waiting on DRAM.
BitMamba-2 attacks both bottlenecks simultaneously:
- Mamba-2 SSM: $O(1)$ per-token state update versus $O(N)$ attention decode. No KV cache; memory stays constant regardless of context length. The Structured State Space Duality (SSD) framework enables optimized matrix kernels during training while preserving linear-time recurrence at inference.
- 1.58-bit ternary weights {-1, 0, +1}: a 1B model fits in ~250 MB (versus ~600 MB at 4-bit). Weights are trained from scratch via Quantization-Aware Training (QAT), not post-training compressed. The zero value is critical: it grants explicit feature filtering that strict binary {-1, +1} models lack, and reduces weight-activation multiplication to addition and subtraction.
- Hybrid precision sanctuary: over 90% of parameters (dense projections) are ternary; the delicate SSM recurrence variables (z, B, C, dt) and the 1D convolution remain in full precision (bfloat16 during training, float32 in bitmamba.c), preventing the catastrophic forgetting that plagues uniformly quantized recurrent networks.
The result: BitMamba-2 1B achieves 52.86 tok/s on an Intel Core i3-12100F (entry-level, DDR4, no NPU), in the same range as a 4-bit Llama 3.2 1B benchmarked at ~50.7 tok/s on an AMD Ryzen AI 9 HX 375 (CPU-side AVX-512; NPU offload via vendor backends may differ). The trend suggests algorithmic efficiency can narrow the gap with hardware acceleration.
On zero-shot reasoning benchmarks, the 1B ternary model scores 63.3% on ARC-Easy (versus 43.9% for the larger 1.3B binary Bi-Mamba) and outperforms Llama 3.2 1B on BoolQ (62.35% versus 50.24%), showing that linear-time SSMs need not sacrifice reasoning under aggressive quantization.
This project provides a from-scratch C11 inference engine with T-MAC lookup-table kernels, Metal GPU dispatch on Apple Silicon, and runtime ISA detection, with no dependencies beyond a C compiler and POSIX.
- C11 with POSIX dependencies (mmap, clock_gettime), no external libraries
- Runtime kernel dispatch: scalar baseline, AVX2 (x86_64), NEON (AArch64), Metal GPU (Apple Silicon)
- mmap-based model loading (zero-copy, OS pages in weights on demand)
- T-MAC lookup-table kernel for ternary bitlinear
- Built-in GPT-2 byte-pair tokenizer
- Persistent spin-wait thread pool with `--threads N` (auto-detects P-cores on Apple Silicon)
- Per-stage profiling (`--profile` flag)
- GPU offload for batch prefill (`--gpu` flag, Apple Metal)
- Interactive chat mode (`--chat`) with multi-turn SSM state persistence
- Built-in arithmetic evaluator and comparison solver (no shell dependency)
- Query rewriting for improved base-model completions
Reference numbers from the bitmamba.cpp C++ engine on commodity CPUs (no NPU, no GPU):
| Model | Parameters | Precision | Peak RAM | Speed (i3-12100F) |
|---|---|---|---|---|
| Small | 255M | 1.58-bit | 252 MB | 146.16 tok/s |
| Medium | 1B | 1.58-bit | 621 MB | 52.86 tok/s |
Average human reading speed is ~5-7 tok/s. The 1B model runs at roughly 10x real-time conversational pace on entry-level hardware.
The T-MAC LUT kernel replaces weight-activation multiplies with table lookups: ternary weights index precomputed partial sums in L1/L2 cache, keeping the critical path off the memory bus.
Requires a C compiler (C11 or later), POSIX headers (mmap, clock_gettime),
and Python 3 for tokenizer generation (make download).
ISA-specific kernels (AVX2, NEON, Metal) are auto-detected at build time.
```
make            # Build ./bitmamba
make clean      # Remove build artifacts
make download   # Fetch model weights and tokenizer assets
```

`make download` downloads the BitMamba-2 1B model weights (bitmamba_1b.bin),
fetches the GPT-2 vocabulary, and generates tokenizer.bin.
The model and tokenizer binaries are required before running inference in tokenizer mode.
For the 255M model, download manually:
```
curl -L -o bitmamba_255m.bin \
  https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin
```

```
./bitmamba [--profile] [--gpu] [--threads N] <model.bin> <input> <mode> \
    [temp] [penalty] [min_p] [top_p] [top_k] [max_tokens] [output_mode]
```

Parameters:

- `model.bin` - path to model weights
- `input` - input text (tokenizer mode) or space-separated token IDs (raw mode)
- `mode` - `tokenizer` (text in/out) or `raw` (token IDs)
- `temp` - temperature (default: 0.8)
- `penalty` - repetition penalty (default: 1.15)
- `min_p` - Min-P sampling threshold (default: 0.05)
- `top_p` - Top-P / nucleus sampling (default: 0.90)
- `top_k` - Top-K sampling (default: 40)
- `max_tokens` - maximum tokens to generate (default: 400)
- `output_mode` - `bench` (default, shows stats) or `clean` (output only)
```
# Text input/output
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1

# Raw token IDs (greedy)
./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.0

# Per-stage profiling
./bitmamba --profile bitmamba_1b.bin "Hello, I am" tokenizer

# Metal GPU (Apple Silicon, auto-fallback to CPU for short sequences)
./bitmamba --gpu bitmamba_1b.bin "15496 11 314 716" raw 0.0
```

Interactive multi-turn conversation uses Mamba's fixed-size SSM state. There is no growing KV cache or context window limit -- memory stays constant across turns.
```
# Interactive REPL (Ctrl-D or /quit to exit)
./bitmamba --chat bitmamba_1b.bin

# Pipe mode (single prompt, no REPL)
echo "Where is Tokyo?" | ./bitmamba --chat bitmamba_1b.bin

# Custom sampling parameters
./bitmamba --chat bitmamba_1b.bin 0.8 1.15 0.05 0.9 40 400

# Save/load SSM state across sessions
./bitmamba --chat --state session.bin bitmamba_1b.bin

# Custom prompt template (%s = user input)
./bitmamba --chat --template 'Q: %s\nA:' bitmamba_1b.bin

# Raw mode (no template wrapping)
./bitmamba --chat --no-template bitmamba_1b.bin
```

Chat mode features:
- Default template `"Question: %s\nDetailed answer:"` for informative base-model completions
- Built-in arithmetic evaluator: `2+3*4` returns `14`, `3^2^2` returns `81`
- Comparison solver: `4*4 vs 3*3` returns `4*4 = 16, 3*3 = 9, so the first is larger.`
- English number words: `twenty three plus forty two` returns `65`
- Query rewriting expands short questions for better output (e.g., "Where is X?" becomes "Where is X located, and what is it known for?")
- Turn-boundary detection prevents the model from rambling into new questions
- SSM state persistence (`--state`) enables multi-session conversations
- REPL commands: `/reset`, `/save`, `/help`, `/quit` (Tab completion)
- Configurable stop conditions: `--stop-token ID`, `--newline-limit N`, `--repeat-limit N`
The docs/ directory contains detailed technical documentation:
- Background: Edge AI memory wall, Transformer bottlenecks, and state space model foundations
- Architecture: Ternary quantization, BitLinear mechanics, hybrid precision strategy, and comparative landscape
- Performance: Scaling validation, CPU inference economics, T-MAC optimization, and energy efficiency
```
src/
  bitmamba.h        Shared types, constants, API declarations
  dispatch.h/.c     Runtime kernel dispatch table
  main.c            CLI entry point, generation loop
  chat.c/.h         Interactive chat mode, math evaluator, state persistence
  model.c           mmap model loading, forward pass, sampling
  kernel.c          Scalar T-MAC bitlinear, RMS norm, activations
  block.c           Mamba-2 block (SSM + conv1d) forward pass
  quantization.c    Weight unpacking LUTs, T-MAC mask tables
  tokenizer.c       GPT-2 byte-pair tokenizer (encode/decode)
  threadpool.h/.c   Persistent spin-wait thread pool (fork-join)
  utils.c           Memory, timing, PRNG, profiling utilities
  arch/avx2.c       AVX2-optimized kernels (x86_64)
  arch/neon.c       NEON-optimized kernels (AArch64)
  metal/            Metal GPU compute shaders (Apple Silicon)
Makefile            Build system (auto ISA detection)
scripts/
  export-bin.py       JAX checkpoint to .bin converter
  export-tokenizer.py GPT-2 tokenizer export (stdlib only)
  generate-golden.py  Golden vector generation
```
Binary format with magic 0x42495432 ("BIT2").
The header contains magic, vocab_size, d_model, n_layers, and n_heads,
followed by tensors.
Float32 tensors store type, ndim, shape, and data.
BitNet tensors store type, rows, cols, scale, and packed 2-bit data.
For Raspberry Pi 5 (Cortex-A76) and similar ARMv8.2-A targets:
- The NEON kernel (`src/arch/neon.c`) uses the signed dot-product extension (SDOT via `vdotq_s32`) available on Cortex-A76 and later, with a `vmull_s8` fallback for older ARMv8-A cores. The build auto-detects via `-mcpu=native`.
- Ternary weights for the 1B model pack to ~250 MB (1B x 2 bits / 8). The `.bin` file is larger because embedding and norm tensors are float32, but the total runtime footprint stays well under 1 GB.
- The T-MAC LUT kernel indexes precomputed partial sums in L1 cache instead of multiplying by {-1, 0, +1}, sidestepping the lack of native ternary instructions. Reference T-MAC benchmarks on RPi 5 achieve 11 tok/s for a 3B model.
- Energy consumption for ternary inference can be as low as 0.028 J/token, up to 70% less than 4-bit Transformer inference. This is critical for battery-operated and thermally constrained deployments.
- For comparison, a 4-bit Qwen-2.5-1.5B on the same hardware achieves ~40 tok/s but requires ~1.1 GB. The 1B BitMamba-2 uses ~621 MB peak RAM (roughly half) with ternary weight storage at ~1/4 the bits per parameter.
This project draws from bitmamba.cpp by Jesus Salazar, a C++ inference implementation using AVX2 intrinsics and OpenMP threading. bitmamba.c rewrites the engine in plain C11 with POSIX-only dependencies, eliminating the C++ runtime and OpenMP to reduce system utilization and widen portability to bare-metal and resource-constrained targets.
bitmamba.c is freely redistributable under the MIT License.