tensorbit-core is the high-performance C++20/CUDA 12 "Surgical Engine" for Hessian-aware structured pruning. It is stage one of the Tensorbit Labs P-D-Q pipeline (Prune → Distill → Quantize).
C++20 Strict: This project targets C++20 only. `std::expected` (C++23), `std::format_string` (C++23), and `std::print` (C++23) are intentionally avoided. A custom `Result<T,E>` replaces `std::expected`, and `std::vformat` with `std::make_format_args` (both C++20) handles message formatting.
```
tensorbit-core/
├── CMakeLists.txt       # Build system (TENSORBIT_ENABLE_CUDA option)
├── .clang-format        # C++20 code style (Google-based, 4-space indent)
├── .gitignore           # Build artifacts, logs, .tb files, model weights
├── README.md            # Project overview and usage
├── format.sh            # Clang-format runner (prerequisite check)
│
├── include/
│   └── tensorbit/
│       └── core/
│           ├── common.hpp         # CUDA_CHECK, TENSORBIT_CHECK, Logger, Result<T,E>
│           ├── tensor.hpp         # TensorDense<T>, FloatingPoint/TensorType concepts,
│           │                      #   to_device()/to_host() transfer, host/device alloc
│           ├── ehap.hpp           # EHAPPruner<F> — ALL implementations inline
│           ├── coring.hpp         # CORINGPruner<F> — ALL implementations inline
│           ├── kernels.hpp        # CUDA kernel launch declarations (6 functions)
│           ├── serialization.hpp  # TBWriter/TBReader — .tb binary format
│           └── loader.hpp         # SafeTensorsFile — .safetensors metadata parser
│
├── src/
│   ├── main.cpp           # CLI entry point — full load→prune→save pipeline
│   ├── ehap.cpp           # Explicit template instantiations only
│   ├── coring.cpp         # Explicit template instantiations only
│   ├── serialization.cpp  # Explicit template instantiations (TBWriter/Reader)
│   ├── kernels.cu         # 6 CUDA kernels (compiled when CUDA enabled)
│   └── kernels_stubs.cpp  # No-op stubs for CPU-only builds
│
├── tests/
│   ├── test_ehap.cpp      # EHAP pruner unit tests (7 cases)
│   ├── test_coring.cpp    # CORING pruner unit tests (7 cases)
│   └── test_all.sh        # Test runner (prereq checks, --skip-gpu, --clean)
│
├── scripts/
│   ├── setup_cloud.sh     # Ubuntu 22.04 provisioning (CUDA 12, Eigen3, Python)
│   ├── download_model.py  # HuggingFace .safetensors downloader
│   └── verify_ubuntu.sh   # WSL/Ubuntu environment diagnostic tool
│
├── docs/
│   ├── ARCHITECTURE.md    # This file
│   ├── ALGORITHMS.md      # High-level algorithm overview
│   ├── EHAP.md            # EHAP: complete mathematical exposition
│   └── CORING.md          # CORING: complete mathematical exposition
│
└── third_party/           # Reserved for non-vcpkg dependencies
```
The CMake target link graph:

```
tb-prune (executable)
├── tensorbit-core-cuda (static lib)   // src/kernels.cu — 6 GPU kernels
│   ├── CUDA::cudart                   // cudaMalloc, cudaFree, cudaMemcpy
│   ├── CUDA::cublas                   // Reserved for future cuBLAS integration
│   └── tensorbit-core (static lib)    // Host headers + pruner implementations
│
└── tensorbit-core (static lib)        // src/ehap.cpp, src/coring.cpp
    ├── Eigen3::Eigen                             // Header-only linear algebra (reserved)
    ├── include/tensorbit/core/common.hpp         // Logging, error-checking macros
    ├── include/tensorbit/core/tensor.hpp         // TensorDense<F>, concepts, device memory
    ├── include/tensorbit/core/ehap.hpp           // EHAPPruner interface
    ├── include/tensorbit/core/coring.hpp         // CORINGPruner interface
    ├── include/tensorbit/core/kernels.hpp        // CUDA kernel launch declarations
    └── include/tensorbit/core/serialization.hpp  // .tb binary I/O
```
Header include dependencies:

```
common.hpp   (independent — macros, Logger)
    ↑
tensor.hpp   (includes common.hpp for CUDA_CHECK, TENSORBIT_CHECK)
    ↑
ehap.hpp     coring.hpp     kernels.hpp    (each includes tensor.hpp for TensorDense<F>)
    ↑            ↑              ↑
ehap.cpp     coring.cpp     kernels.cu     (implementation files)
         ↘       ↙              ↓
         main.cpp        tensorbit-core-cuda
```
`tensor.hpp` defines two core concepts:

```cpp
template <typename T>
concept FloatingPoint = std::is_floating_point_v<T>;

template <typename T>
concept TensorType = requires(T t) {
    typename T::value_type;
    { t.data() }  -> std::same_as<typename T::value_type*>;
    { t.size() }  -> std::convertible_to<std::size_t>;
    { t.shape() } -> std::convertible_to<std::span<const std::size_t>>;
    { t.rank() }  -> std::convertible_to<std::size_t>;
};
```

The `FloatingPoint` concept constrains the pruner templates (`EHAPPruner<F>`, `CORINGPruner<F>`). `TensorDense<T>` itself is unconstrained — it accepts any scalar type, including `uint8_t` for mask tensors.
`std::expected` is a C++23 type, unavailable in C++20 mode. A custom `Result<T,E>` class (union-based, 175 lines in `common.hpp`) provides the same API:

```cpp
Result<T, E>     // discriminated union of T/E with a has_value_ flag
Result<void, E>  // specialization for error-only results
Unexpected<E>    // error wrapper (analogous to std::unexpected)
unexpected(e)    // factory function returning Unexpected<decay_t<E>>
```

All pruner methods return `Result<void, ErrorEnum>` or `Result<std::size_t, ErrorEnum>`. Error propagation follows the pattern `if (!result) return unexpected(result.error());`.
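A condensed sketch of the shape such a type takes — the real `Result` in `common.hpp` is richer (converting constructors, the `Result<void,E>` specialization, etc.), so treat this as illustrative only:

```cpp
#include <new>
#include <type_traits>
#include <utility>

template <typename E>
struct Unexpected { E error; };

template <typename E>
Unexpected<std::decay_t<E>> unexpected(E&& e) {
    return {std::forward<E>(e)};
}

template <typename T, typename E>
class Result {
public:
    Result(T v) : has_value_(true) { new (&value_) T(std::move(v)); }
    Result(Unexpected<E> u) : has_value_(false) { new (&error_) E(std::move(u.error)); }
    ~Result() { has_value_ ? value_.~T() : error_.~E(); }

    explicit operator bool() const { return has_value_; }
    T& value() { return value_; }
    E& error() { return error_; }

private:
    union { T value_; E error_; };  // active member discriminated by has_value_
    bool has_value_;
};
```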
All pruner member functions are defined inline in the headers (`ehap.hpp`, `coring.hpp`). The `.cpp` files (`ehap.cpp`, `coring.cpp`) contain only explicit template instantiations:

```cpp
template class EHAPPruner<float>;
template class EHAPPruner<double>;
```

This avoids the brittle `extern template class` + explicit specialization pattern, which fails when inline methods in the class body trigger eager implicit instantiation in GCC.
The `Logger::log()` method has a simple signature:

```cpp
void log(LogLevel level, std::string_view msg,
         const std::source_location& loc = std::source_location::current());
```

All formatting happens at each LOG macro site via `std::vformat` + `std::make_format_args` (both C++20):
```cpp
#define TENSORBIT_LOG_INFO(fmt, ...)                                  \
    Logger::instance().log(LogLevel::kInfo,                           \
        std::vformat((fmt), std::make_format_args(__VA_ARGS__)),      \
        std::source_location::current())
```

This avoids the template deduction failure that occurs when a variadic parameter pack is followed by a defaulted `std::source_location` parameter. It also avoids `std::format_string` (C++23).
Critical constraint: `std::make_format_args(_Args&...)` only accepts lvalue references. All format arguments must be named variables, not temporary expressions (no literals, no ternary results).
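For example (a sketch against the macro above; the variables are illustrative):

```cpp
// OK — arguments are named lvalues:
std::size_t pruned_count = 1024;
double sparsity = 0.5;
TENSORBIT_LOG_INFO("pruned {} weights (sparsity {})", pruned_count, sparsity);

// Ill-formed with std::make_format_args(_Args&...) — these are rvalues:
// TENSORBIT_LOG_INFO("pruned {} weights", count_pruned());   // call result
// TENSORBIT_LOG_INFO("path: {}", use_cuda ? "GPU" : "CPU");  // ternary result
// TENSORBIT_LOG_INFO("pruned {} weights", 1024);             // literal
```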
The `TensorDense<F>` class owns its buffer on either host or device through a `std::unique_ptr<F[], void(*)(F*)>` with compile-time selectable deleters:

| Allocator | Deallocator | Guard |
|---|---|---|
| `new F[n]()` | `delete[] ptr` | Always available |
| `cudaMalloc` | `cudaFree(ptr)` | `#ifdef __CUDACC__` |
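A minimal sketch of this idiom under the stated constraints (helper names are illustrative, not the actual `TensorDense` internals):

```cpp
#include <cstddef>
#include <memory>

template <typename F>
void host_deleter(F* p) { delete[] p; }

#ifdef __CUDACC__
template <typename F>
void device_deleter(F* p) { cudaFree(p); }
#endif

template <typename F>
std::unique_ptr<F[], void (*)(F*)> make_buffer(std::size_t n, bool on_device) {
#ifdef __CUDACC__
    if (on_device) {
        F* p = nullptr;
        CUDA_CHECK(cudaMalloc(&p, n * sizeof(F)));  // macro from common.hpp
        return {p, &device_deleter<F>};
    }
#endif
    return {new F[n](), &host_deleter<F>};  // zero-initialized host buffer
}
```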
Memory transfer is explicit via two methods:

- `to_device()` — `cudaMemcpy` (host→device); returns a new device-owned `TensorDense`.
- `to_host()` — `cudaMemcpy` (device→host); returns a new host-owned `TensorDense`.
These are used by the CORING pruner's CPU fallback path: when double-precision tensors reside on the GPU (where only float kernels exist), the implementation transparently copies to host, processes on CPU, and copies back.
Move semantics are implemented (copy is deleted) — tensors can be efficiently returned from functions without double-free risk.
Maximum rank is 8 — covers transformer tensors up to
[batch × seq_len × num_heads × head_dim × ...].
Rather than storing the full O(N²) Hessian, EHAP uses the diagonal of the empirical Fisher information, F[i] ≈ E[g[i]²]. It is accumulated incrementally as an exponential moving average via `accumulate_fisher(gradients, alpha)`:

F[i] ← (1 − α) · F[i] + α · g[i]²
- GPU path: `fisher_accumulate_kernel` — 1 thread per element, uses the `fmaf()` fused multiply-add for 1-ULP precision. Zero shared memory, ~100% occupancy.
- CPU path: element-wise `__restrict__` loop (auto-vectorizable by GCC/Clang 12+).
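A hedged sketch of what the accumulation kernel plausibly looks like — the real `fisher_accumulate_kernel` lives in `src/kernels.cu`; this signature and the exact EMA form are assumptions based on the description above:

```cpp
#include <cstddef>

__global__ void fisher_accumulate_sketch(float* __restrict__ fisher,
                                         const float* __restrict__ grad,
                                         float alpha, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i >= n) return;
    // EMA update F[i] ← (1−α)·F[i] + α·g[i]², rewritten as F + α·(g² − F)
    // so the outer step is a single fmaf() fused multiply-add.
    float g2 = grad[i] * grad[i];
    fisher[i] = fmaf(alpha, g2 - fisher[i], fisher[i]);
}
```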
The importance score then couples weight magnitude with curvature:

s[i] = w[i]² · (F[i] + λ)

where λ is the damping factor (default 0.01) added for numerical stability.
| Stage | Method | GPU Kernel | CPU Fallback |
|---|---|---|---|
| Accumulate | `accumulate_fisher(grad, α)` | `fisher_accumulate_kernel` | `__restrict__` loop |
| Compute | `compute_importance(w, out)` | `ehap_importance_kernel` | Direct loop |
| Select | `select_pruning_mask(imp, mask)` | (host-only) | `std::nth_element` O(N) |
The selection stage is host-only by design — sorting/partitioning on the GPU requires complex warp-level reductions, and the step is dominated by the PCIe transfer cost of mask export anyway. `std::nth_element` finds the score at the (1 − sparsity_ratio)·N boundary in average-case O(N) time with cache-friendly memory access.
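A minimal host-side sketch of this step (names and exact tie handling are assumptions; the real `select_pruning_mask` returns a `Result` and operates on `TensorDense` buffers):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

void select_mask_sketch(const std::vector<float>& importance,
                        std::vector<std::uint8_t>& mask, double sparsity_ratio) {
    if (importance.empty()) return;
    std::vector<float> scratch = importance;              // nth_element mutates
    std::size_t k = static_cast<std::size_t>(sparsity_ratio * scratch.size());
    if (k >= scratch.size()) k = scratch.size() - 1;      // clamp for safety
    std::nth_element(scratch.begin(), scratch.begin() + k, scratch.end());
    float threshold = scratch[k];                         // k-th smallest score
    mask.resize(importance.size());
    for (std::size_t i = 0; i < importance.size(); ++i)
        mask[i] = importance[i] >= threshold ? 1 : 0;     // keep high-importance
}
```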
CORING enforces hardware-friendly patterns for NVIDIA Ampere Sparse Tensor Cores (the `mma.sp` instruction):

| Path | N:M Pattern | Kernel | Thread Model | Shared Memory |
|---|---|---|---|---|
| Fast | 2:4 (Ampere native) | `nm_mask_2_4_kernel` | 1 thread/group | 0 bytes |
| Generic | Any N:M, M ≤ 32 | `nm_mask_generic_kernel` | M threads/group | ~128 bytes |
2:4 fast path — One thread loads 4 importance values into registers, finds the top-2 magnitudes via a fixed comparison tree (fully unrolled, branchless after PTX compilation), and writes a packed mask byte. Near-100% theoretical occupancy on SM80/SM90.
Generic N:M path — A thread block of size M processes one group. Each thread holds one element. Cooperative ranking via shared memory: each thread counts how many others have strictly higher value. Tie-breaking is deterministic (lower index wins). Thread 0 assembles the mask byte from the rank array.
CPU fallback — For double-precision tensors or non-CUDA builds, each group is processed via `std::nth_element` operating on a `vector<pair<F, int>>` of size M. Since M ≤ 32, this is negligible overhead. A sketch follows below.
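A sketch of that per-group fallback (assumed shape, shown for the M ≤ 8 one-byte-mask case, with the deterministic lower-index tie-break described above):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

template <typename F>
std::uint8_t nm_group_mask_sketch(const F* group, int N, int M) {
    std::vector<std::pair<F, int>> v(M);
    for (int j = 0; j < M; ++j) v[j] = {group[j], j};  // (importance, index)
    // Partition so the N highest-importance elements land in v[0..N);
    // ties resolve deterministically: lower index wins.
    std::nth_element(v.begin(), v.begin() + N, v.end(),
                     [](const auto& a, const auto& b) {
                         return a.first != b.first ? a.first > b.first
                                                   : a.second < b.second;
                     });
    std::uint8_t mask = 0;
    for (int j = 0; j < N; ++j) mask |= std::uint8_t(1u << v[j].second);
    return mask;
}
```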
Mask format — Packed as 1 byte per group:

```
Byte g:  bit 0     = keep element (g·M + 0)?
         bit 1     = keep element (g·M + 1)?
         ...
         bit (M−1) = keep element (g·M + M−1)?
```
Analytical pruned count — Since `validate_config` ensures the tensor size is divisible by M, the number of pruned weights is exact:

N_pruned = (N_elems / M) · (M − N)

For example, 2:4 sparsity over 1024 elements prunes (1024/4)·(4−2) = 512 weights. This eliminates the need for atomic counters in the CUDA mask kernel.
Every pruner method checks `tensor.device()` at runtime and routes accordingly:

```
if (tensor.device() == DeviceLocation::kDevice && use_cuda)
    → launch CUDA kernel + CUDA_SYNC_CHECK()
else
    → CPU loop (host-resident data or CUDA unavailable)
```

Double-precision GPU tensors fall back to CPU transparently:

```
weights (device, double) → to_host() → CPU processing → cudaMemcpy back
```

This provides a uniform API regardless of whether CUDA is available, while still benefiting from GPU acceleration when possible.
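A hedged sketch of this dispatch inside a pruner method (the error enum, member names, and the launch-wrapper name are assumptions, not lifted from the actual headers):

```cpp
template <FloatingPoint F>
Result<void, EhapError> EHAPPruner<F>::compute_importance(
        const TensorDense<F>& w, TensorDense<F>& out) {
    const std::size_t n = w.size();
    if (w.device() == DeviceLocation::kDevice && use_cuda_) {
        // GPU route via a host launch wrapper declared in kernels.hpp
        // (wrapper name assumed for illustration).
        launch_ehap_importance(w.data(), fisher_.data(), out.data(), lambda_, n);
        CUDA_SYNC_CHECK();
    } else {
        // CPU route: auto-vectorizable element-wise loop.
        const F* __restrict__ wp = w.data();
        const F* __restrict__ fp = fisher_.data();
        F* __restrict__ op = out.data();
        for (std::size_t i = 0; i < n; ++i)
            op[i] = wp[i] * wp[i] * (fp[i] + lambda_);  // s = w² · (F + λ)
    }
    return {};
}
```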
Both `EHAPPruner` and `CORINGPruner` use explicit template instantiation (`template class EHAPPruner<float>;` etc. in the `.cpp` files) for `float` and `double`. This:

- Controls compile times — each specialization is compiled once, in its `.cpp` file.
- Prevents ODR violations — a single definition of each specialization across translation units.
- Isolates CUDA dependencies — `kernels.cu` is linked separately; the `.cpp` files only call host launch wrappers declared in `kernels.hpp`.
`common.hpp` provides a singleton `Logger` with 6 severity levels (`kTrace` → `kFatal`), timestamped output via `std::format` (C++20), and `std::mutex`-guarded write access. Convenience macros (`TENSORBIT_LOG_INFO`, etc.) include `std::source_location` for automatic file:line attribution.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | magic | 0x31304254 (bytes "TB01" when stored little-endian) |
| 4 | 4 | version | Format version (1) |
| 8 | 4 | nm_n | N in N:M sparsity pattern |
| 12 | 4 | nm_m | M in N:M sparsity pattern |
| 16 | 8 | num_weights | Total weight elements (dense count) |
| 24 | 8 | num_masks | Total mask bytes (num_weights / M groups) |
| 32 | 8 | weights_offset | Byte offset to start of pruned weight data |
| 40 | 8 | masks_offset | Byte offset to start of packed mask data |
| 48 | 1 | precision | 0=FP32, 1=FP16, 2=BF16 |
| 49 | 4047 | reserved | Padding to the 4096-byte header boundary |
| 4096 | varies | weights_data | Pruned weight buffer (dense, pruned=0.0) |
| offset | varies | masks_data | Packed N:M bitmasks (1 byte per M-sized group) |
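A hedged sketch of the header layout implied by the table above; the actual struct in `serialization.hpp` may differ in naming and padding strategy:

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct TBHeader {
    std::uint32_t magic;           // 0x31304254 — bytes "TB01" on disk (LE)
    std::uint32_t version;         // 1
    std::uint32_t nm_n;            // N of the N:M pattern
    std::uint32_t nm_m;            // M of the N:M pattern
    std::uint64_t num_weights;     // dense element count
    std::uint64_t num_masks;       // total mask bytes
    std::uint64_t weights_offset;  // byte offset of the weights blob
    std::uint64_t masks_offset;    // byte offset of the masks blob
    std::uint8_t  precision;       // 0=FP32, 1=FP16, 2=BF16
    std::uint8_t  reserved[4047];  // pad to the 4096-byte boundary
};
#pragma pack(pop)
static_assert(sizeof(TBHeader) == 4096, "header must be exactly one 4 KiB block");
```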
Each group of M weights consumes exactly 1 byte in masks_data:

```
Group g at masks_data[g]:
    Bit 0     → weight[g·M + 0] is kept (1) or pruned (0)
    Bit 1     → weight[g·M + 1] is kept (1) or pruned (0)
    ...
    Bit (M−1) → weight[g·M + M−1] is kept (1) or pruned (0)
```

For M ≤ 8, one byte per group suffices. For M = 16 or M = 32, bytes are packed accordingly (2 or 4 bytes per group respectively). The `.tb` reader uses `nm_m` from the header to determine the unpacking stride.
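A minimal sketch of the corresponding bit test on the reader side (assuming the M ≤ 8 one-byte-per-group case):

```cpp
#include <cstddef>
#include <cstdint>

// True if flat weight index w survived pruning, per the layout above.
bool is_kept(const std::uint8_t* masks_data, std::size_t w, std::size_t M) {
    return (masks_data[w / M] >> (w % M)) & 1u;
}
```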
| Kernel | Grid | BlockDim | Shared Memory | Compute Intensity | Bound |
|---|---|---|---|---|---|
| `fisher_diagonal_kernel` | ⌈N/256⌉ | 256 | 0 B | 2 FLOP/elem | Memory (HBM) |
| `fisher_accumulate_kernel` | ⌈N/256⌉ | 256 | 0 B | 2 FLOP/elem | Memory (HBM) |
| `ehap_importance_kernel` | ⌈N/256⌉ | 256 | 0 B | 3 FLOP/elem | Memory (HBM) |
| `nm_mask_2_4_kernel` | ⌈N₄/256⌉ | 256 | 0 B | 4 FLOP/elem | Compute (ALU) |
| `nm_mask_generic_kernel` | N/M | M | 256 B | O(M²)/elem | Compute (ALU) |
| `apply_mask_kernel` | ⌈N/256⌉ | 256 | 0 B | 0 FLOP/elem | Memory (HBM) |
All grid/block dimensions are computed by host launch wrappers in `kernels.hpp`. Kernels use `std::size_t` for element counts to handle models with >2³¹ parameters.
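A sketch of what such a wrapper plausibly looks like, reusing the kernel sketch from earlier (the wrapper name is illustrative; `CUDA_SYNC_CHECK` is the macro this document references):

```cpp
#include <cstddef>

void launch_fisher_accumulate_sketch(float* fisher, const float* grad,
                                     float alpha, std::size_t n) {
    constexpr unsigned kBlock = 256;
    const unsigned grid =
        static_cast<unsigned>((n + kBlock - 1) / kBlock);   // ⌈N/256⌉
    fisher_accumulate_sketch<<<grid, kBlock>>>(fisher, grad, alpha, n);
    CUDA_SYNC_CHECK();  // surfaces asynchronous launch/runtime errors
}
```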
`fisher_diagonal_kernel` sums squared gradients over the batch dimension B:

F[i] = (1/B) · Σ_b g_b[i]²

It is used for initial Fisher buffer population from a full batch of gradients. Subsequent accumulation uses `fisher_accumulate_kernel` (element-wise EMA update).
```
HuggingFace Hub
      │
      │  download_model.py
      ▼
.safetensors ───────────────────────────────────────┐
(dense FP32/FP16/BF16 weights)                      │
      │                                             │
      │  tb-prune --model <path> --sparsity 2:4     │
      ▼                                             │
┌──────────────────────────────────────────┐       │
│ EHAPPruner                               │       │
│                                          │       │
│  accumulate_fisher(gradients, α)         │       │
│   ├─ GPU: fisher_accumulate_kernel       │       │
│   └─ CPU: __restrict__ loop              │       │
│                                          │       │
│  compute_importance(weights, imp_out)    │       │
│   ├─ GPU: ehap_importance_kernel         │       │
│   │        s[i] = w[i]² · (F[i] + λ)     │       │
│   └─ CPU: direct loop                    │       │
│                                          │       │
│  select_pruning_mask(imp, mask)          │       │
│   └─ CPU: std::nth_element O(N)          │       │
│        threshold = percentile(r)         │       │
│        mask[i] = (imp[i] >= threshold)   │       │
└──────────────┬───────────────────────────┘       │
               │ importance scores                 │
               ▼                                   │
┌──────────────────────────────────────────┐       │
│ CORINGPruner                             │       │
│                                          │       │
│  validate_config(importance.size())      │       │
│   ├─ N < M, shape divisible by M         │       │
│   └─ Size divisible by M                 │       │
│                                          │       │
│  generate_nm_mask(importance, mask_out)  │       │
│   ├─ 2:4 → nm_mask_2_4_kernel (fast)     │       │
│   ├─ N:M → nm_mask_generic_kernel        │       │
│   └─ CPU: nth_element per group          │       │
│                                          │       │
│  apply_mask(weights, mask)               │       │
│   ├─ GPU: apply_mask_kernel              │       │
│   └─ CPU: bit-test + zero per group      │       │
│                                          │       │
│  N_pruned = (N_elems/M) × (M−N)          │       │
└──────────────┬───────────────────────────┘       │
               │ pruned weights + N:M masks        │
               ▼                                   │
┌──────────────────────────────────────────┐       │
│ TBWriter (serialization.hpp)             │       │
│                                          │       │
│  write(pruned_weights, masks, N, M)      │       │
│   ├─ Header: magic + version + metadata  │       │
│   ├─ Weights blob: dense float buffer    │       │
│   └─ Masks blob: packed bitmask buffer   │       │
└──────────────┬───────────────────────────┘       │
               │                                   │
               ▼                                   ▼
         output.tb                      (future) inference
  (standalone file, ready            runtime → TBReader →
      for inference)                 sparse matmul on GPU
```
| Component | Precision | Size |
|---|---|---|
| Original weights | FP32 | 28 GB |
| Fisher diagonal (during pruning) | FP32 | 28 GB |
| Importance scores (temporary) | FP32 | 28 GB |
| EHAP mask (uint8, 1 B/elem) | uint8 | 7 GB |
| N:M mask (packed, N/M bytes) | uint8 | 1.75 GB |
| Pruned weights (.tb output) | FP32 | 28 GB |
| Peak pruning memory | — | ~58 GB |
| Minimum GPU required | — | A100-80GB |
Memory is freed incrementally: after mask selection, importance scores are
released. After N:M mask generation, the EHAP mask is released. The Fisher
diagonal is kept until pruning completes, then freed via reset().
| Area | Feature | Status |
|---|---|---|
| Core | C++20 build system with CMake + CUDA 12 + Eigen3 | ✅ |
| Core | Custom `Result<T,E>` type, thread-safe Logger, `CUDA_CHECK` | ✅ |
| Core | `TensorDense<T>` — host/device memory with `to_device()`/`to_host()` | ✅ |
| Core | CPU-only build path (`TENSORBIT_ENABLE_CUDA=OFF`) | ✅ |
| EHAP | Fisher EMA accumulation + Fisher diagonal + GPU kernels | ✅ |
| EHAP | Importance scores: OBD, OBS-style, Normalized | ✅ |
| EHAP | Iterative pruning with cubic schedule (Zhu & Gupta 2017) | ✅ |
| EHAP | Blockwise exact OBS — Woodbury H⁻¹, gradient-covariance, adaptive sparsity | ✅ |
| EHAP | Weight compensation: bias, proportional redistribution | ✅ |
| CORING | N:M mask selection: top-N, optimal (C(M,N)), iterative swap-refine | ✅ |
| CORING | GPU-accelerated 2:4 kernel (0 shared mem) + generic kernel (256 B shared) | ✅ |
| CORING | Absolute-magnitude redistribution, hardware-aware Ampere 2:4 layout | ✅ |
| IO | `.tb` binary format — `TBWriter`/`TBReader` with 4096-byte header, round-trip verify | ✅ |
| IO | `.safetensors` parser — header-only `SafeTensorsFile` (F32/F16/BF16/I64) | ✅ |
| CLI | `tb-prune` — full pipeline: load → EHAP → CORING → save `.tb` | ✅ |
| CLI | Mock tensor mode for development on low-spec hardware | ✅ |
| Future | Multi-stream CUDA parallelism | 🔜 |
| Future | FP16/BF16 CUDA kernels | 🔜 |
| Future | cuSPARSELt integration for direct sparse matmul | 🔜 |
| Future | `tensorbit-run` standalone inference engine | 🔜 |