
Commit 166f070

Update README.md docs
1 parent 295158d commit 166f070

2 files changed

Lines changed: 35 additions & 33 deletions

File tree

README.md

@@ -1,51 +1,27 @@
 # RSR-core
 
-**RSR (Redundant Segment Reduction)** algorithm.
+**RSR (Redundant Segment Reduction)** for efficient low-bit inference (matrix-vector multiplication).
 
-Reference: [UIC-InDeXLab/RSR](https://github.com/UIC-InDeXLab/RSR)
+This repository contains the core kernels, model integrations, and benchmarking code for **RSR** across CPU and CUDA backends. RSR targets fast matrix-vector multiplication when the matrix is low-bit quantized by grouping repeated column patterns, aggregating the corresponding input values once, and then scattering the result to the affected output rows.
 
-## Installation
+This is especially useful for workloads such as low-bit LLM inference, where decoding repeatedly applies quantized matvec operations. For the original algorithm, see [UIC-InDeXLab/RSR](https://github.com/UIC-InDeXLab/RSR) and [docs/ALGORITHM.md](docs/ALGORITHM.md).
 
-**Prerequisites:** Python >= 3.10, a C compiler (for CPU kernels), and optionally CUDA for GPU support.
+## Installation 🛠️
+
+**Prerequisites:** Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.
 
 ```bash
 git clone https://github.com/UIC-InDeXLab/RSR-Core.git
 cd RSR-Core
 pip install -e .
 ```
 
-## Structure
-
-```
-RSR-core/
-├── multiplier/      # Python wrappers for kernels
-│   ├── bit_1/       # 1-bit (binary) multipliers (CPU/CUDA)
-│   └── bit_1_58/    # 1.58-bit (ternary) multipliers (CPU/CUDA)
-├── kernels/         # Low-level C/CUDA kernel source
-│   ├── bit_1/
-│   │   ├── cpu/     # C kernels
-│   │   └── cuda/    # CUDA kernels (.cu)
-│   └── bit_1_58/
-│       ├── cpu/     # C kernels
-│       └── cuda/    # CUDA kernels (.cu)
-├── integrations/    # Model integrations
-│   └── hf/          # HuggingFace integration
-├── benchmarking/    # Benchmarking scripts & results
-└── tests/           # Unit and integration tests
-```
-
-
-## Demo
-
-<!-- <p align="center">
-  <a href="assets/rsr_baseline_compare.mp4">
-    <img src="assets/rsr_baseline_compare.webp" alt="Comparison of the Hugging Face baseline and RSR inference on 1.58-bit LLM inference. Click to open the MP4 version." width="900" />
-  </a>
-</p> -->
+## Demo 🎬
+Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. `HF` denotes the Hugging Face baseline running `bfloat16` on PyTorch.
 
 [![RSR vs Baseline](assets/rsr_baseline_compare.webp)](https://drive.google.com/file/d/1ub-MITJUepmfBLkyUZFb50hbJsuhgwCH/view?usp=sharing)
 
-## Benchmark Results
+## Benchmark Results 📊
 
 ### Matrix-Vector Multiplication
 

@@ -82,3 +58,29 @@ Speedup is computed against the HuggingFace `bfloat16` baseline for the same mod
 | Llama3-8B-1.58-100B-tokens | 31.9 | **59.3** | **1.9x** |
 | bitnet-b1.58-2B-4T-bf16    | 33.1 | **57.4** | **1.7x** |
 | bitnet-b1.58-2B-4T         | 41.6 | **57.1** | **1.4x** |
+
+## Updates 📝
+
+<!--
+- Add project updates here.
+-->
+
+## Project Structure 🗂️
+
+```text
+RSR-core/
+├── multiplier/      # Python wrappers for kernels
+│   ├── bit_1/       # 1-bit (binary) multipliers (CPU/CUDA)
+│   └── bit_1_58/    # 1.58-bit (ternary) multipliers (CPU/CUDA)
+├── kernels/         # Low-level C/CUDA kernel source
+│   ├── bit_1/
+│   │   ├── cpu/     # C kernels
+│   │   └── cuda/    # CUDA kernels (.cu)
+│   └── bit_1_58/
+│       ├── cpu/     # C kernels
+│       └── cuda/    # CUDA kernels (.cu)
+├── integrations/    # Model integrations
+│   └── hf/          # HuggingFace integration
+├── benchmarking/    # Benchmarking scripts & results
+└── tests/           # Unit and integration tests
+```
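The "group repeated patterns, aggregate once, scatter to output rows" idea from the new README description can be illustrated with a toy NumPy sketch. This is not the repository's C/CUDA kernels; the function name, segment width, and pure-Python grouping are illustrative assumptions, and a real kernel would precompute the groups since the matrix is fixed across decoding steps:

```python
import numpy as np

def rsr_matvec(W, x, seg=8):
    """Toy RSR-style matvec for a ternary matrix W (entries in {-1, 0, 1}).

    Columns are split into fixed-width segments. Within a segment, rows that
    share the same ternary pattern need the same partial dot product with
    that slice of x, so it is computed once per distinct pattern and then
    scattered to every output row carrying that pattern.
    """
    m, n = W.shape
    y = np.zeros(m, dtype=x.dtype)
    for start in range(0, n, seg):
        Wseg = W[:, start:start + seg]
        xseg = x[start:start + seg]
        # Group row indices by their pattern within this segment.
        groups = {}
        for row in range(m):
            groups.setdefault(tuple(Wseg[row]), []).append(row)
        # Compute each distinct pattern's partial product once, then scatter.
        for pattern, rows in groups.items():
            partial = np.dot(np.asarray(pattern, dtype=x.dtype), xseg)
            y[rows] += partial
    return y
```

With few distinct values per entry, a segment of width `seg` admits at most `3**seg` patterns, so for tall matrices many rows collide into the same group and the per-pattern dot product is amortized across them; the result matches a dense `W @ x`.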
File renamed without changes.
