Update README with Quick Start example and Blackwell benchmark; ignore KIVI/; bump CUTLASS submodule

DD-DuDa · DD-DuDa · commit 2bea5b5eb9f9 · 2025-12-18T13:10:44.000Z
diff --git a/.gitignore b/.gitignore
@@ -48,4 +48,6 @@ logs/
 
 *.so
 
-libtorch/
+libtorch/
+
+KIVI/
diff --git a/README.md b/README.md
@@ -1,8 +1,10 @@
-![overview](imgs/title.png)
+<p align="center">
+  <img src="imgs/title.png" width="400">
+</p>
 
 <div align="center">
 
-## Efficient low-bit KV cache decoding
+## Efficient LLM decoding with low-bit KV cache
 
 [![arXiv](https://img.shields.io/badge/arXiv-2410.13276-b31b1b.svg)](https://arxiv.org/abs/2503.18773)
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
@@ -16,13 +18,12 @@ cache. Achieve **3-9x speedup** than Flash-Decoding-v2.
 
 
 ## News
-* [2025.11] 🔥 BitDecoding has been accepted to HPCA 2025! 
+* [2025.11] 🔥 BitDecoding has been accepted to HPCA 2026! 
 
 ## Benchmark
-* Kernel Performance in RTX4090
-![overview](imgs/4090.png)
-* Kernel Performance in A100
-![overview](imgs/a100.png)
+* Kernel performance on a Blackwell GPU
+![overview](imgs/blackwell.jpg)
+
 
 ## Installation
 ```
@@ -34,17 +35,51 @@ python setup.py install
 ```
 
 ## Quick Start
-1. See benchmark/bench_single_decode.ipynb
-2. (Optional) Play with libtorch c++      
-    ```
-    # download libtorch 
 
+```python
+import torch
+import math
+from bit_decode import kvcache_pack_int, fwd_kvcache_int
+
+# Parameters
+batch_size, nheads, nheads_k, d = 1, 32, 32, 128
+seqlen_q, seqlen_kv = 1, 4096
+num_bits, group_size = 4, 128  # 4-bit quantization
+quant_mode = "k-channel"
+pack_nums = 16 // num_bits  # quantized values packed into each 16-bit word
+
+# Input tensors
+q = torch.randn(batch_size, seqlen_q, nheads, d, device="cuda", dtype=torch.float16)
+k_cache = torch.randn(batch_size, seqlen_kv, nheads_k, d, device="cuda", dtype=torch.float16)
+v_cache = torch.randn(batch_size, seqlen_kv, nheads_k, d, device="cuda", dtype=torch.float16)
+
+# Quantized KV cache buffers
+k_pack   = torch.zeros((batch_size, seqlen_kv // pack_nums, nheads_k, d), dtype=torch.uint16, device="cuda")
+k_params = torch.zeros((batch_size, seqlen_kv // group_size, nheads_k, d), dtype=torch.float32, device="cuda")
+v_pack   = torch.zeros((batch_size, seqlen_kv, nheads_k, d // pack_nums), dtype=torch.uint16, device="cuda")
+v_params = torch.zeros((batch_size, d // group_size, nheads_k, seqlen_kv), dtype=torch.float32, device="cuda")
+cu_seqlens_k = torch.arange(0, (batch_size + 1) * seqlen_kv, seqlen_kv, dtype=torch.int32, device="cuda")
+
+# Pack KV cache
+kvcache_pack_int(k_cache, k_pack, k_params, v_cache, v_pack, v_params,
+                 None, cu_seqlens_k, seqlen_kv, quant_mode, group_size, num_bits)
+
+# Decode with BitDecoding
+output = fwd_kvcache_int(q, k_pack, k_params, v_pack, v_params, None,
+                         1.0 / math.sqrt(d), quant_mode, group_size, num_bits)
+```
+
+## Examples
+
+- **Benchmark notebook**: See [benchmark/bench_single_decode.ipynb](benchmark/bench_single_decode.ipynb)
+- **End-to-end inference**: See [e2e branch](https://github.com/DD-DuDa/BitDecoding/tree/e2e)
+- **(Optional) LibTorch C++ build**:
+    ```bash
     cd BitDecoding/csrc/bit_decode
     mkdir build && cd build
     cmake -DCMAKE_PREFIX_PATH=<libtorch_path> ..
     make -j12
     ```
-3. End2end inference example, please see [e2e](https://github.com/DD-DuDa/BitDecoding/tree/e2e)
 
 ## Citation
 If you find BitDecoding useful or want to use in your projects, please kindly cite our paper:
diff --git a/imgs/blackwell.jpg b/imgs/blackwell.jpg
diff --git a/libs/cutlass b/libs/cutlass
@@ -1 +1 @@
-Subproject commit 3fe62887d8dd75700fdaf57f9c181878701b0802
+Subproject commit c6aeb9179c5f74a0fcdbd28527bf4b6ba8c60752
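
Note on the Quick Start snippet added to README.md: the shapes of the packed buffers follow directly from `num_bits` and `group_size`. The sketch below reproduces only that shape arithmetic in plain Python, so it can be checked without a GPU or the `bit_decode` extension. `packed_kv_shapes` and `cu_seqlens` are hypothetical helper names and are not part of the library. The divisibility check is an added assumption; the README itself simply uses floor division.

```python
def packed_kv_shapes(batch_size, seqlen_kv, nheads_k, d, num_bits=4, group_size=128):
    """Hypothetical helper: buffer shapes mirroring the README Quick Start allocations."""
    if 16 % num_bits != 0:
        raise ValueError("num_bits must divide 16")
    pack_nums = 16 // num_bits  # quantized values per 16-bit word

    # Assumption (not stated in the README): sizes divide evenly, so no rows are dropped.
    for name, size in (("seqlen_kv", seqlen_kv), ("d", d)):
        if size % pack_nums or size % group_size:
            raise ValueError(f"{name} must be a multiple of pack_nums and group_size")

    return {
        # K is packed along the sequence axis, with parameters per group of tokens.
        "k_pack": (batch_size, seqlen_kv // pack_nums, nheads_k, d),
        "k_params": (batch_size, seqlen_kv // group_size, nheads_k, d),
        # V is packed along the head dimension, with parameters per group of channels.
        "v_pack": (batch_size, seqlen_kv, nheads_k, d // pack_nums),
        "v_params": (batch_size, d // group_size, nheads_k, seqlen_kv),
    }


def cu_seqlens(batch_size, seqlen_kv):
    """Cumulative sequence offsets, as built with torch.arange in the README."""
    return list(range(0, (batch_size + 1) * seqlen_kv, seqlen_kv))


shapes = packed_kv_shapes(batch_size=1, seqlen_kv=4096, nheads_k=32, d=128)
offsets = cu_seqlens(batch_size=1, seqlen_kv=4096)
```

With the README's parameters (4-bit, group size 128), each 16-bit word holds four values, so `k_pack` has a quarter of the sequence length and `v_pack` a quarter of the head dimension.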