This document provides a high-level overview of BitNet-rs architecture, design patterns, and key components.
BitNet-rs is organized as a Rust workspace with 110 crates (as of this writing):
```
bitnet-tokenizers → bitnet-models (GGUF loader) → bitnet-quantization → bitnet-kernels → bitnet-inference → bitnet-server / bitnet-cli
                                                                                              ↑
bitnet-logits → bitnet-sampling → bitnet-generation ─────────────────────────────────────────┘
```

Supporting crates include `bitnet-gguf`, `bitnet-device-probe`, `bitnet-engine-core`, `bitnet-prompt-templates`, `bitnet-receipts`, `bitnet-honest-compute`, `bitnet-runtime-feature-flags`, `bitnet-simd`, and `bitnet-rope`.
- `bitnet` (root): Main library with unified public API and GGUF weight loading
- `bitnet-common`: Shared types, traits, utilities, and enhanced error types for GGUF operations
- `bitnet-models`: Enhanced model loading with real GGUF weight parsing; replaces mock tensor initialization with comprehensive transformer layer weight loading (AC1), supporting all quantization formats with device-aware placement
- `bitnet-quantization`: Real quantized computation with I2S, TL1/TL2 accuracy validation vs FP32 baselines (target thresholds defined in test fixtures); STRICT MODE enforced to prevent mock fallbacks
- `bitnet-kernels`: Device-aware quantization kernels with SIMD/CUDA acceleration, mixed-precision support (FP16/BF16), automatic CPU/GPU selection, an FFI bridge for C++ cross-validation, plus comprehensive GPU detection utilities supporting CUDA, Metal (#992), Vulkan (#993), ROCm (#995), Intel oneAPI (#986), and OpenGL/OpenCL probing
- `bitnet-inference`: Real neural network inference engine (Issue #254) with autoregressive generation, multi-head attention, quantized linear layers (I2S/TL1/TL2 GEMV), RoPE positional embeddings, GQA support, KV-cache optimization, deterministic generation, and receipt-backed performance validation; `compute_path="real"` enforced
- `bitnet-tokenizers`: Universal tokenizer with GGUF integration, automatic discovery, and a graceful fallback system
See also: Issue #254 Real Inference Specification for comprehensive real inference architecture
These small, single-responsibility crates reduce coupling in bitnet-inference and are re-exported from their original locations for zero breaking changes.
- `bitnet-logits`: Pure logit-transform functions (`apply_temperature`, `apply_top_k`, `apply_top_p`, `softmax_in_place`, `apply_repetition_penalty`, `argmax`); no external dependencies, suitable for `no_std`
- `bitnet-logits-filters`: Extended logit filtering (min-p, tail-free sampling, typical sampling)
- `bitnet-sampling`: Token-sampling strategies (greedy, temperature, top-k, top-p, repetition penalty) built on `bitnet-logits`; seeded with `ChaCha8Rng` for reproducibility
- `bitnet-generation`: Decode-loop contracts: `StopCriteria`, `check_stop` priority logic, `StopReason`, `GenerationConfig`, streaming `StreamEvent`/`GenerationStats`
- `bitnet-engine-core`: Session/orchestration contracts: `SessionConfig`, `SessionMetrics`, `BackendInfo`
- `bitnet-device-probe`: OS/GPU probing and capability snapshot (`gpu_compiled()`, `gpu_available_runtime()`, `detect_simd_level()`, `DeviceCapabilities`); respects `BITNET_GPU_FAKE` and `BITNET_STRICT_MODE`
- `bitnet-gguf`: GGUF parser surface with property and fuzz tests
- `bitnet-prompt-templates` / `bitnet-prompt-templates-core`: Chat template types (`PromptTemplate`, `TemplateType`, `ChatTurn`) and formatters
- `bitnet-receipts` / `bitnet-receipts-core`: Honest-compute receipt schema v1.0.0 and serialization
- `bitnet-honest-compute`: `compute_path=real` enforcement and kernel-ID hygiene policy (no-mock gate)
- `bitnet-runtime-feature-flags` / `bitnet-runtime-feature-flags-core`: Runtime snapshot of compiled features (`cpu`, `gpu`, `cuda` flags reported independently)
- `bitnet-simd`: Portable SIMD abstraction layer
- `bitnet-rope`: RoPE positional-embedding table generation
- `bitnet-math`: Numerically stable math primitives (log-sum-exp, softmax helpers)
- `bitnet-validation`: Production tensor shape/dtype validation
- `bitnet-warn-once`: `warn_once!` macro for rate-limited hot-path logging
- `bitnet-cpu-detect`: Runtime CPU feature detection (AVX2, AVX-512, NEON)
- `bitnet-cpu-activations`: CPU activation functions (GELU, SiLU, etc.)
- `bitnet-qk256-dispatch`: QK256 format dispatch with scalar/AVX2 runtime selection
- `bitnet-transformer`: Transformer block contracts and layer composition
- `bitnet-runtime-bootstrap`: Runtime initialization and startup sequencing
- `bitnet-runtime-context` / `bitnet-runtime-context-core`: Runtime execution context
- `bitnet-runtime-profile` / `bitnet-runtime-profile-core`: Runtime profiling framework
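The pure-function style of `bitnet-logits` can be illustrated with a dependency-free sketch. The function names mirror the list above, but the bodies are illustrative assumptions, not the crate's actual implementation.

```rust
// Illustrative, dependency-free logit transforms in the spirit of
// bitnet-logits. Bodies are assumptions, not the crate's code.
fn apply_temperature(logits: &mut [f32], temperature: f32) {
    for l in logits.iter_mut() {
        *l /= temperature; // lower temperature sharpens the distribution
    }
}

fn softmax_in_place(logits: &mut [f32]) {
    // Subtract the max first so exp() cannot overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for l in logits.iter_mut() {
        *l = (*l - max).exp();
        sum += *l;
    }
    for l in logits.iter_mut() {
        *l /= sum;
    }
}

fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let mut v = vec![1.0f32, 2.0, 3.0];
    apply_temperature(&mut v, 0.5);
    softmax_in_place(&mut v);
    println!("probs = {v:?}, argmax = {}", argmax(&v));
}
```

Keeping these transforms allocation-free and dependency-free is what makes a `no_std` build feasible.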
- `bitnet-gpu-hal`: Unified GPU hardware abstraction layer with backend selector, async runtime, checkpoint manager, and deployment manager
- `bitnet-opencl`: Intel Arc OpenCL backend with built-in kernel registry and work-size optimization (experimental; feature `opencl`)
- `bitnet-metal`: Metal GPU backend for macOS/iOS Apple Silicon (feature `metal`)
- `bitnet-vulkan` / `bitnet-vulkan-shaders`: Vulkan compute backend (feature `vulkan`)
- `bitnet-spirv`: SPIR-V shader compilation support
- `bitnet-nvidia`: NVIDIA-specific GPU utilities
- `bitnet-rocm`: AMD ROCm detection and device probe (feature `rocm`)
- `bitnet-intel-gpu-id`: Intel GPU identification utilities
- `bitnet-webgpu` / `bitnet-wgpu` / `bitnet-wgpu-runner` / `bitnet-wgpu-bench` / `bitnet-wgpu-shaders` / `bitnet-wgpu-shaders-i2s`: WebGPU/wgpu compute backends and I2_S shader kernels
- `bitnet-bdd-grid` / `bitnet-bdd-grid-core`: BDD compatibility grid with compile-coverage enforcement (`xtask grid-check`)
- `bitnet-feature-matrix` / `bitnet-feature-contract`: Feature-lattice contracts and enforcement tests
- `bitnet-testing-policy` / `bitnet-testing-policy-core` / `bitnet-testing-policy-runtime` / `bitnet-testing-policy-kit`: Test policy framework and runtime enforcement
- `bitnet-testing-scenarios` / `bitnet-testing-scenarios-core`: Test scenario definitions
- `bitnet-testing-profile` / `bitnet-runtime-profile-contract`: Testing profile primitives
- `bitnet-test-fixtures-core`: Shared test fixtures (GGUF fixtures, mock tensors)
- `bitnet-test-env` / `bitnet-test-support`: Environment isolation and test helpers
- `bitnet-bench-receipts` / `bitnet-bench-regression-core`: Benchmark receipt validation and regression detection
- `bitnet-server`: Production HTTP/REST inference server providing scalable inference endpoints with batch processing, model hot-swapping, comprehensive health monitoring (liveness/readiness/startup probes), real-time system metrics collection (CPU, memory, disk, network I/O), Prometheus metrics integration, OpenTelemetry observability, streaming inference support, and deployment-ready configurations for Docker and Kubernetes environments
- `bitnet-cli`: Command-line interface for local inference, model verification, and compatibility checking
- `bitnet-compat`: GGUF compatibility fixes and diagnostics
- `bitnet-ffi`: llama.cpp-compatible C API (validation pending)
- `bitnet-py`: Python 3.12+ bindings compatible with llama-cpp-python (PyO3 ABI3-py312)
- `bitnet-wasm`: WebAssembly bindings with enhanced browser/Node.js compatibility and optimized SIMD intrinsics
- `crossval`: Framework for testing against the C++ implementation. Tests use the `BITNET_GGUF` or `CROSSVAL_GGUF` environment variable for the model path.
The bitnet-kernels crate contains 58 source files organized into backend-specific subdirectories. See Kernel Module Reference for the complete module table.
Generic compute kernels with platform-specific SIMD accelerations:
- Core ops: `matmul`, `softmax`, `layernorm`, `attention`, `ffn`, `embedding`, `rope`, `gating`, `pooling`, `convolution`, `transpose`, `reduction`, `residual`, `loss`
- x86 SIMD (`x86.rs`): AVX2/AVX-512 paths for QK256 dequantization and GEMV; property tests in `x86_qk256_property_tests.rs`
- ARM NEON (24 modules): `neon_activations`, `neon_quantized_gemm`, `neon_quantized_matmul`, `neon_rope`, `neon_softmax`, `neon_layernorm`, `neon_kv_cache`, `neon_scatter_gather`, `neon_sliding_window_attention`, `neon_batch_norm`, `neon_elementwise`, `neon_reductions`, `neon_transpose`, `neon_inference_bridge`, among others
- Scatter/gather (`scatter_gather.rs`, `cpu/scatter_gather.rs`): Memory-layout-aware data movement
- Fusion: `layer_fusion.rs`, `fusion.rs` for fused attention and FFN passes
- Parallelism: `pipeline_parallel.rs`, `tensor_parallel.rs` for multi-device distribution
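The SIMD modules above all specialize variants of a handful of inner loops. A plain scalar GEMV makes that baseline concrete; this is a hypothetical helper for illustration, not a module from the crate.

```rust
// Scalar GEMV baseline (y = W * x, row-major W) of the kind the x86 and
// NEON modules above accelerate with platform intrinsics.
fn gemv(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum())
        .collect()
}

fn main() {
    // A 2x2 matrix times a vector of ones sums each row.
    let y = gemv(&[1.0, 2.0, 3.0, 4.0], &[1.0, 1.0], 2, 2);
    println!("{y:?}");
}
```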
GPU compute kernels targeting NVIDIA CUDA, gated behind `#[cfg(any(feature = "gpu", feature = "cuda"))]`:
- Core ops: `matmul`, `softmax`, `layernorm`, `rmsnorm`, `attention`, `fused_attention`, `multi_head_attention`, `ffn`, `embedding`, `rope`, `gating`, `linear`, `dequant`, `quantize`
- Quantized: `quantized_gemm`, `quantized_matmul`, `qk256_gemv` for 2-bit GEMV
- Memory management: `memory_pool` (device memory pooling), `kv_cache`, `kv_cache_gpu`
- Execution management: `stream_mgmt` (CUDA stream manager), `warp_ops` (warp-level primitives), `cooperative_groups`
- Optimization: `graph_exec` (CUDA graph execution), `shader_cache`, `profiling`, `sparse`
- Other ops: `batch_norm`, `conv1d`, `elementwise`, `pooling`, `residual`, `loss`, `transpose`, `fusion`
Experimental Intel Arc / OpenCL backend; each module lives at the top level of `bitnet-kernels/src/`:
- Compute: `opencl_attention`, `opencl_flash_attention`, `opencl_gqa`, `opencl_ffn`, `opencl_layer_norm`, `opencl_reductions`, `opencl_softmax_variants`, `opencl_elementwise`, `opencl_embedding`, `opencl_token_embed`
- Quantized: `opencl_quantized`, `opencl_quantized_matmul`, `opencl_matmul_variants`, `opencl_mixed_precision`
- Infrastructure: `opencl_context`, `opencl_cmd_queue`, `opencl_buffer`, `opencl_memory`, `opencl_device_caps`, `opencl_work_size`, `opencl_kernel_sources`, `opencl_program_cache`, `opencl_registry`
- Pipeline: `opencl_pipeline`, `opencl_continuous_batch`, `opencl_graph_compiler`, `opencl_layer_compose`, `opencl_transformer`, `opencl_engine_bridge`, `opencl_model_converter`
- Caching: `opencl_cache`, `opencl_prefix_cache`, `opencl_kv_cache`, `opencl_rope_cache`
- Utilities: `opencl_autotuner`, `opencl_profiling`, `opencl_telemetry`, `opencl_async_executor`, `opencl_numerical_stability`, `opencl_weight_manager`, `opencl_token_gen`
| Module | Feature gate | Purpose |
|---|---|---|
| `metal_compute` | `feature = "metal"` | Apple Metal compute kernels |
| `rocm/` (4 modules) | `feature = "rocm"` | AMD ROCm attention, QK256 GEMV, RMSNorm |
| `npu/` | `feature = "npu-backend"` | NPU bridge (C++ interop) |
| `gpu/` (mixed-backend) | `feature = "gpu"` | Shared GPU utilities, OpenCL dispatch, validation, benchmarks |
| Module | Purpose |
|---|---|
| `kernels.rs` | `KernelManager`: runtime provider selection (CUDA > CPU fallback) |
| `capability_matrix.rs` | Kernel capability reporting per backend |
| `device_aware.rs` | Device-aware kernel dispatch |
| `device_features.rs` | `gpu_compiled()`, `gpu_available_runtime()` helpers |
| `convolution.rs` | Generic convolution ops |
| `reduction.rs` / `shaped_reduction.rs` | Reduction primitives |
| `scatter_gather.rs` | Top-level scatter/gather |
| `tl_lut.rs` | Table-lookup (TL1/TL2) LUT generation |
| `simd_diagnostics.rs` | SIMD feature detection and diagnostics |
| `perf_tracker.rs` | Kernel performance tracking |
| `gpu_utils.rs` | GPU utility functions |
| `stubs.rs` | Stub implementations for disabled backends |
| `ffi.rs` / `ffi/` | C++ FFI bridge (feature `ffi`) |
BitNet-rs has comprehensive test infrastructure spanning multiple strategies:
| Strategy | Scope | Details |
|---|---|---|
| Property-based (proptest) | 63 crates | Randomized invariant testing across quantization, tokenization, KV-cache, tensor shapes |
| Snapshot (insta) | 49 crates, ~1,233 `.snap` files | Struct/output stability for serialization, CLI output, receipt schemas |
| Fuzz (cargo-fuzz) | 98 targets | Nightly (`nightly-fuzz.yml`): RoPE table gen, tokenizer encode, softmax stability, embedding lookup, memory layout, and more |
| BDD grid | `bitnet-bdd-grid` | Compile-coverage matrix (`xtask grid-check`) |
| Feature-lattice | `bitnet-feature-matrix` | Orthogonal feature-gate contracts |
| Fixture-based | `bitnet-models`, `bitnet-test-fixtures-core` | GGUF dual-flavor detection, alignment (12/12 passing) |
| Environment isolation | `EnvGuard` + `#[serial(bitnet_env)]` | Parallel-safe env mutation via `temp_env` |
| CPU golden-path E2E | `tests/` | 7 deterministic tests always in PR CI (no model download) |
| Criterion benchmarks | `benches/srp_ops.rs` | Logits pipeline, top-k, repetition penalty, argmax, RoPE, KV cache |
BitNet-rs implements a comprehensive GGUF weight loading system that replaces mock tensor initialization with real neural network model parsing. This system represents a major architectural advancement enabling meaningful neural network inference.
```rust
pub fn load_gguf(
    path: &Path,
    device: Device,
) -> Result<(BitNetConfig, HashMap<String, CandleTensor>)>
```

Pipeline Stages:
- Memory-Mapped File Access: Zero-copy GGUF file access via `MmapFile`
- Enhanced Parser Attempt: Try the comprehensive GGUF reader with full validation
- Fallback Parser: Graceful degradation to a minimal parser for backward compatibility
- Device-Aware Tensor Placement: Automatic GPU/CPU placement with fallback
- Comprehensive Validation: Security checks, tensor completeness, shape validation
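The enhanced-then-fallback stages above amount to an `or_else` chain. Here is a minimal sketch with hypothetical parser names and a toy header type; it is not the real loader.

```rust
// Minimal sketch of the "enhanced parser, then fallback parser" stage.
// `parse_enhanced`, `parse_minimal`, and `Header` are hypothetical.
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
}

fn parse_enhanced(bytes: &[u8]) -> Result<Header, String> {
    // Full validation: magic bytes plus a supported-version check.
    if bytes.len() < 8 || &bytes[..4] != b"GGUF" {
        return Err("bad magic".into());
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    if !(1..=3).contains(&version) {
        return Err(format!("unsupported version {version}"));
    }
    Ok(Header { version })
}

fn parse_minimal(bytes: &[u8]) -> Result<Header, String> {
    // Lenient fallback: only require enough bytes for a version field.
    if bytes.len() < 8 {
        return Err("truncated".into());
    }
    Ok(Header {
        version: u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
    })
}

fn parse_with_fallback(bytes: &[u8]) -> Result<Header, String> {
    parse_enhanced(bytes).or_else(|_| parse_minimal(bytes))
}

fn main() {
    let hdr = parse_with_fallback(b"GGUF\x03\x00\x00\x00").unwrap();
    println!("parsed version {}", hdr.version);
}
```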
Attention Layers (All Transformer Blocks):
- `layers.{i}.attention.wq` - Query projection weights
- `layers.{i}.attention.wk` - Key projection weights
- `layers.{i}.attention.wv` - Value projection weights
- `layers.{i}.attention.wo` - Output projection weights
Feed-Forward Layers (SwiGLU Architecture):
- `layers.{i}.feed_forward.w1` - Gate projection
- `layers.{i}.feed_forward.w2` - Down projection
- `layers.{i}.feed_forward.w3` - Up projection
Normalization Layers:
- `layers.{i}.attention_norm.weight` - Pre-attention RMSNorm
- `layers.{i}.ffn_norm.weight` - Pre-FFN RMSNorm
Embedding & Output:
- `token_embd.weight` - Token embedding matrix
- `output.weight` - Language modeling head
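Taken together, the naming scheme above is easy to enumerate. This sketch (a hypothetical helper, not the crate's API) lists every tensor a loader would expect for an n-layer model:

```rust
// Enumerate the expected tensor names from the naming scheme above.
// Hypothetical helper for illustration only.
fn required_tensor_names(n_layers: usize) -> Vec<String> {
    let mut names = vec!["token_embd.weight".to_string(), "output.weight".to_string()];
    for i in 0..n_layers {
        for suffix in [
            "attention.wq", "attention.wk", "attention.wv", "attention.wo",
            "feed_forward.w1", "feed_forward.w2", "feed_forward.w3",
            "attention_norm.weight", "ffn_norm.weight",
        ] {
            names.push(format!("layers.{i}.{suffix}"));
        }
    }
    names
}

fn main() {
    println!("{} tensors expected for 2 layers", required_tensor_names(2).len());
}
```

A completeness check can then diff this list against the tensor names actually present in the GGUF file.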
I2_S (2-bit Signed) - Two Flavors with Automatic Detection:
- BitNet Native (32-element blocks): Values [-2, -1, 1, 2] with 10 bytes/block format
- GGML QK256 (256-element blocks): Values [-2, -1, 1, 2] with 64 bytes/block format, separate scales
- Automatic Detection: Loader inspects tensor sizes to identify format
- Transparent Dispatch: Forwards automatically use appropriate kernel
- Performance: 66+ Melem/s (CPU), 200+ Melem/s (GPU)
- Accuracy: Target accuracy thresholds defined in test fixtures
- Pure-Rust Support: Both formats run without FFI dependency
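Because the two flavors imply different bytes-per-element ratios, size inspection alone can tell them apart. A sketch of that arithmetic, assuming the 10-byte/32-element and 64-byte/256-element block sizes stated above (the enum and function names are hypothetical):

```rust
// Distinguish the two I2_S flavors from element count and payload size.
#[derive(Debug, PartialEq)]
enum I2sFlavor {
    BitNet32, // 32-element blocks, 10 bytes each
    Qk256,    // 256-element blocks, 64 bytes each
}

fn detect_i2s_flavor(n_elems: usize, data_len: usize) -> Option<I2sFlavor> {
    let bitnet_len = ((n_elems + 31) / 32) * 10;  // ceil-div to blocks, 10 B each
    let qk256_len = ((n_elems + 255) / 256) * 64; // ceil-div to blocks, 64 B each
    if data_len == bitnet_len {
        Some(I2sFlavor::BitNet32)
    } else if data_len == qk256_len {
        Some(I2sFlavor::Qk256)
    } else {
        None // neither layout matches: corrupt or unknown format
    }
}

fn main() {
    println!("{:?}", detect_i2s_flavor(256, 64));
}
```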
TL1/TL2 (Table Lookup Quantization):
- TL1: Linear mapping optimized for ARM (NEON)
- TL2: Non-linear mapping optimized for x86 (AVX2/AVX-512)
- Device-aware selection for optimal performance
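The device-aware choice reduces to an architecture check. A hedged sketch (hypothetical helper; the crate's real dispatch is richer):

```rust
// Pick the table-lookup variant for a target architecture, per the
// ARM-vs-x86 split described above. Hypothetical helper for illustration.
fn select_table_lookup(target_arch: &str) -> &'static str {
    match target_arch {
        "aarch64" | "arm" => "TL1", // NEON-friendly linear mapping
        _ => "TL2",                 // AVX2/AVX-512 non-linear mapping
    }
}

fn main() {
    // In real code this would key off cfg!(target_arch = "...").
    println!("{}", select_table_lookup(std::env::consts::ARCH));
}
```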
Legacy Format Support:
- F32, F16: Full/half precision for accuracy comparison
- IQ2_S: GGML-compatible 82-byte blocks via FFI bridge
GPU Acceleration:

```rust
let cdevice = match device {
    Device::Cuda(id) => match CDevice::new_cuda(id) {
        Ok(cuda_device) => {
            tracing::info!("Using CUDA device {} for tensor placement", id);
            cuda_device
        }
        Err(e) => {
            tracing::warn!("CUDA device {} unavailable, falling back to CPU: {}", id, e);
            CDevice::Cpu
        }
    },
    // ... other device types
};
```

CPU Fallback Strategy:
- Automatic detection of GPU availability
- Graceful degradation with performance logging
- Optimal SIMD kernel selection (AVX2/AVX-512/NEON)
Pre-Loading Security Checks:
- GGUF magic byte validation ('GGUF')
- Version compatibility (1-3 supported)
- Tensor count bounds checking (< 10^6 security limit)
- KV pair count validation (< 10^5 security limit)
- File size sanity checks
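A sketch of those pre-loading checks as plain bounds tests. The limit constants follow the numbers above; the function names are hypothetical:

```rust
// Pre-loading security checks: magic bytes plus count bounds.
const MAX_TENSORS: u64 = 1_000_000; // 10^6 security limit
const MAX_KV_PAIRS: u64 = 100_000;  // 10^5 security limit

fn check_magic(bytes: &[u8]) -> bool {
    bytes.len() >= 4 && &bytes[..4] == b"GGUF"
}

fn check_counts(tensor_count: u64, kv_count: u64) -> Result<(), String> {
    if tensor_count >= MAX_TENSORS {
        return Err(format!("tensor count {tensor_count} exceeds security limit"));
    }
    if kv_count >= MAX_KV_PAIRS {
        return Err(format!("KV pair count {kv_count} exceeds security limit"));
    }
    Ok(())
}

fn main() {
    println!("magic ok: {}", check_magic(b"GGUF\x03\x00\x00\x00"));
}
```

Rejecting absurd counts before allocating anything is what keeps a malformed file from turning into an out-of-memory condition.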
Tensor Completeness Validation:
```rust
fn validate_tensor_completeness(
    tensor_infos: &HashMap<String, TensorInfo>,
    config: &BitNetConfig,
) -> Result<()>
```

- Verifies all required transformer layers are present
- Validates tensor shapes against model configuration
- Checks quantization format compatibility
- Ensures memory alignment requirements
Enhanced Error Types:
- `GgufParseError`: Detailed GGUF parsing errors with context
- `QuantizationError`: Quantization-specific errors with recovery suggestions
- `ValidationError`: Model validation failures with diagnostic information
- `SecurityError`: Security limit violations with actionable guidance
Recovery Strategies:
- Automatic fallback from enhanced to minimal parser
- Mock tensor generation for test compatibility
- CPU fallback for GPU memory failures
- Alternative quantization format suggestions
Loading Performance:
- Zero-copy operations where possible
- Memory-mapped file access for large models
- Parallel tensor loading for multi-core systems
- Device-aware placement optimization
Memory Efficiency:
- 2GB parameter models load in <1.5GB RAM
- GPU memory pooling for tensor operations
- Efficient cache management for repeated loads
- Memory-mapped model sharing across instances
Accuracy Guarantees:
- I2_S quantization: Target accuracy thresholds defined in test fixtures
- Cross-validation against C++ reference implementation
- Systematic regression testing for accuracy preservation
- Property-based testing for numerical stability
The bitnet-server crate provides an HTTP/REST inference server built on the BitNet-rs inference engine. It serves as the application layer for deploying BitNet models in production environments.
Key Components:
- Inference Engine Integration: Direct integration with `bitnet-inference` for autoregressive generation
- Model Management: Hot-swappable model loading with graceful failover and validation
- Health Monitoring: Three-tier health check system (liveness, readiness, startup)
- System Metrics: Real-time collection of CPU, memory, disk, and network I/O metrics via `sysinfo`
- Observability: Prometheus metrics and OpenTelemetry integration for distributed tracing
- Streaming Support: Server-sent events (SSE) for real-time token streaming
- Batch Processing: Request batching for improved throughput
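Token streaming over SSE boils down to the `data: ...\n\n` wire framing from the SSE specification. A minimal sketch; the payload shape is an assumption, not the server's actual schema:

```rust
// Format one server-sent event carrying a token payload.
// SSE frames are "data: <payload>" terminated by a blank line.
fn sse_frame(json_payload: &str) -> String {
    format!("data: {json_payload}\n\n")
}

fn main() {
    // Each generated token would be flushed to the client as one frame.
    print!("{}", sse_frame(r#"{"token":"hello"}"#));
}
```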
Architecture Position:
```
┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              bitnet-server (HTTP/REST)               │   │
│  │  • Axum web framework                                │   │
│  │  • Health endpoints (/health, /ready, /live)         │   │
│  │  • Inference endpoints (/v1/completions)             │   │
│  │  • Metrics endpoints (/metrics)                      │   │
│  │  • Streaming support (SSE)                           │   │
│  └──────────────────┬───────────────────────────────────┘   │
└────────────────────┼────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                   Inference Engine                          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  bitnet-inference                    │   │
│  │  • Autoregressive generation                         │   │
│  │  • Multi-head attention                              │   │
│  │  • KV-cache optimization                             │   │
│  └──────────────────┬───────────────────────────────────┘   │
└────────────────────┼────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                   Core Components                           │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐     │
│  │bitnet-models│  │bitnet-quant │  │bitnet-tokenizers │     │
│  └─────────────┘  └─────────────┘  └──────────────────┘     │
└─────────────────────────────────────────────────────────────┘
```
Integration Points:
- Model Loading: Uses `bitnet-models` for GGUF parsing and tensor loading
- Tokenization: Integrates `bitnet-tokenizers` for universal tokenizer support
- Inference: Wraps the `bitnet-inference` engine with an HTTP/REST interface
- Monitoring: Exposes internal metrics to Prometheus and OpenTelemetry

Deployment Support:
- Docker: Multi-stage builds for CPU and GPU variants (see `infra/docker/`)
- Kubernetes: Helm charts with autoscaling and health probes (see `infra/helm/bitnet/`)
- Configuration: Environment variables and TOML configuration files
- Security: Non-root execution, read-only filesystems, minimal dependencies
For detailed deployment guides, see:
- Feature-Gated Architecture: Default features are empty - always specify features explicitly
- Production GGUF Loading: Comprehensive tensor parsing replacing mock initialization with real model weights
- Zero-Copy Operations: Memory-mapped models, careful lifetime management with enhanced tensor loading
- Device-Aware Quantization: Automatic GPU acceleration with CPU fallback for all quantization formats
- SIMD Abstraction: Unified interface over platform-specific instructions with enhanced performance
- Cross-Validation: Systematic comparison with C++ for correctness using real model weights
- Enhanced Validation Framework: Comprehensive GPU/CPU validation with performance metrics and error tolerance
- Security-First Design: Input validation, bounds checking, and resource limits for production deployment
- FFI Bridge Architecture: Safe C++ kernel integration for gradual migration with comprehensive testing and error handling
- Multi-Backend GPU Detection: System-aware GPU detection with automatic fallback, supporting CUDA, Metal (#992), Vulkan (#993), ROCm (#995), Intel oneAPI (#986), and OpenGL/OpenCL probing (#984/#985)
- GPU Infrastructure Access: Low-level CUDA context and module access for advanced GPU programming (PR #199), enabling custom kernel loading and device-specific optimization
- Mixed Precision Computing: Native CUDA kernels for FP16/BF16 operations with device-aware precision selection and automatic fallback (PR #202)
- Server Architecture: Scalable HTTP/REST inference server with comprehensive health monitoring, system metrics, and deployment automation (PR #422)
BitNet-rs includes a comprehensive quality assurance system designed for production reliability:
- Real-Time System Monitoring: Comprehensive system metrics collection using the `sysinfo` crate
- Performance Correlation: Application performance metrics correlated with system resource usage
- Prometheus Integration: System metrics exposed via Prometheus endpoints for alerting and dashboards
- Resource Tracking: CPU usage, memory utilization, disk usage, and network I/O monitoring
- Health Monitoring: Service uptime tracking and performance regression detection
- GPU/CPU Parity Testing: Systematic validation between GPU and CPU implementations
- Performance Benchmarking: Built-in performance measurement with speedup calculations
- Numerical Accuracy Testing: Configurable tolerance testing for quantization operations
- Memory Leak Detection: Automatic GPU memory monitoring and leak prevention
- Error Handling Validation: Comprehensive error path testing with recovery verification
- Weight Mapper Integration: GGUF tensor validation using weight mapper for compatibility checks
- Unmapped Tensor Detection: Detailed reporting of unmapped tensors with debugging metrics
- Fixture-Based Testing: Comprehensive test coverage for both success and corruption scenarios
- Enhanced Error Reporting: `ValidationResult` metrics include `unmapped_count` and `unmapped_tensors`
- GGUF Parsing Integration: Direct model file analysis for compatibility validation
- Auto-Detection: Automatic backend selection based on GGUF model metadata
- Enhanced GGUF Integration: Direct extraction of tokenizer configuration from model files with optimized byte mapping
- O(1) Byte Lookup Performance: A `byte_to_id[256]` array replaces a HashMap for faster tokenization
- Improved UTF-8 Handling: Proper byte buffer management in decode operations for robust text processing
- BOS Token Support: Enhanced BasicTokenizer with vocab boundary checks and special token handling
- SPM Compilation Fix: Resolved critical compilation error in SentencePiece tokenizer integration
- Fallback Strategy: Graceful degradation with compatibility validation for unsupported formats
- Runtime Construction: Build tokenizers from vocabulary and merge rules without external dependencies
- Cross-Format Support: BPE, SentencePiece, and custom tokenizer formats
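The `byte_to_id[256]` optimization above replaces hashing with direct array indexing. A sketch of the idea; the struct layout and sentinel value are assumptions for illustration:

```rust
// O(1) byte-fallback lookup: a fixed 256-entry table instead of a HashMap.
struct ByteMap {
    byte_to_id: [i32; 256], // -1 marks "no byte token for this value"
}

impl ByteMap {
    fn new() -> Self {
        ByteMap { byte_to_id: [-1; 256] }
    }

    fn insert(&mut self, byte: u8, id: u32) {
        self.byte_to_id[byte as usize] = id as i32;
    }

    fn lookup(&self, byte: u8) -> Option<u32> {
        let id = self.byte_to_id[byte as usize];
        if id >= 0 { Some(id as u32) } else { None }
    }
}

fn main() {
    let mut map = ByteMap::new();
    map.insert(b'A', 7);
    println!("{:?}", map.lookup(b'A'));
}
```

On the byte-fallback hot path this trades 1 KiB of fixed storage for a branch-light indexed load, avoiding hashing entirely.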
- Gradual Migration Support: Safe C++ kernel integration enabling gradual transition to pure Rust
- Quantization Bridge: Complete FFI quantization support for I2S, TL1, and TL2 types
- Performance Comparison Framework: Built-in tools for comparing FFI vs Rust implementations
- Error Handling Integration: Enhanced C++ error propagation via the `get_last_error()` bridge
- Feature-Gated Safety: Proper conditional compilation and graceful fallback when FFI is unavailable
- Migration Decision Support: Automated recommendations based on performance and accuracy metrics
- Comprehensive Clippy Integration: Zero-tolerance policy for clippy warnings
- Type Safety Improvements: Enhanced type annotations and error handling
- Documentation Standards: Comprehensive inline documentation with examples
- Test Coverage: Extensive test suites with property-based testing
- Performance Regression Testing: Automated performance monitoring and validation
We maintain strict compatibility with llama.cpp while providing enhanced validation:
- C API functions have exact signature matches
- Python API is llama-cpp-python compatible
- We handle models that llama.cpp fails on (e.g., GPT-2 without pre-tokenizer)
- Enhanced GGUF parsing with tensor alignment validation for better error detection
- Robust handling of malformed GGUF files with detailed error messages
- See COMPATIBILITY.md for detailed guarantees
- Pre-Alpha Status: Correctness, performance, and validation are ongoing. Do not use in production.
- QK256 Performance: Scalar kernels only (~0.1 tok/s for 2B models). AVX2 nibble-LUT + FMA tiling planned for a ≥3× uplift. Limit to `--max-tokens 4-16` for validation.
- GPU Backends: Metal, Vulkan, oneAPI, ROCm, OpenCL, and WebGPU backends are scaffolded but not validated end-to-end. CUDA is the furthest along.
- OpenCL Modules: The 42 OpenCL modules in `bitnet-kernels` are experimental (Intel Arc focus); the API surface may change.
- Model Quality: microsoft-bitnet-b1.58-2B-4T-gguf produces nonsensical output in some configurations (a known model-quality issue, not an inference bug).
- Test Scaffolding: ~466 `#[ignore]` tests across the workspace, all with justification strings. Categories: real-model, CUDA, slow mock-inference, crossval, and TDD scaffolds.
- Making Changes: Always run tests for affected crates
- Before Committing: Run `cargo fmt` and `cargo clippy`
- Cross-Validation: Run `cargo xtask crossval` for inference changes
- Compatibility: Check COMPATIBILITY.md before changing public APIs
For detailed information on specific components, see: