
Ternary LLM

Large Language Models (LLMs) require substantial computational resources, limiting their deployment on resource-constrained hardware. Ternary LLMs mitigate these demands by quantizing weights to 2-bit ternary values {-1, 0, +1}, achieving significant compression, often with 50–90% sparsity. However, existing approaches face limitations:

  • Existing CPUs and GPUs do not support native 2-bit operations, and libraries such as PyTorch and CUDA provide no dedicated compute kernels for ternary weights.
  • Existing sparse formats like Compressed Sparse Column (CSC) are not optimized for ternary values, incurring extra storage and decompression overhead.
  • Methods optimized for ternary weights, such as BitNet, RSR, and RSR++, fail to exploit sparsity structures.
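To make the quantization step above concrete, here is a minimal NumPy sketch of mapping float weights to ternary values. The magnitude threshold and per-matrix scale are illustrative assumptions, not the quantization scheme used by BitNet or the papers in this repository:

```python
import numpy as np

def ternarize(w, threshold=0.05):
    """Quantize a float weight matrix to ternary values {-1, 0, +1}.

    Weights with magnitude below `threshold` become 0 (the source of
    sparsity); the rest keep their sign. A per-matrix scale roughly
    preserves the original magnitude. Illustrative sketch only.
    """
    mask = np.abs(w) >= threshold
    t = (np.sign(w) * mask).astype(np.int8)     # values in {-1, 0, +1}
    scale = np.abs(w[mask]).mean() if mask.any() else 1.0
    return t, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 8))
t, scale = ternarize(w)
sparsity = (t == 0).mean()                       # fraction of zero weights
```

Each ternary weight fits in 2 bits, and the zero entries are what the sparse kernels in this repository skip.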

We address these challenges through novel algorithms, code optimization, and hardware accelerators. This repository contains code for three projects:

  • SSR: Sparse Segment Reduction for Ternary GEMM Acceleration (targets limitation 3)
  • Efficient Addition-Based Sparse GEMM for Fast Ternary Large Language Model Inference on Edge Devices (targets limitations 1 and 2)
  • An Accelerator for Ternary Language Models based on FPGA (targets limitation 1)
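The idea behind addition-based sparse GEMM can be sketched in a few lines: with ternary weights, a matrix–vector product needs no multiplications at all — entries where the weight is +1 are added, entries where it is -1 are subtracted, and zeros are skipped entirely. The index-list storage format below is an illustrative assumption, not the optimized layout used by the repository's kernels:

```python
import numpy as np

def ternary_matvec(pos_idx, neg_idx, x, n_rows):
    """Multiply-free ternary mat-vec: y = W @ x for ternary W.

    W is stored as two column-index lists per row: columns where
    W = +1 and columns where W = -1. Zeros are never touched, so the
    kernel performs only additions and subtractions.
    """
    y = np.zeros(n_rows, dtype=x.dtype)
    for r in range(n_rows):
        y[r] = x[pos_idx[r]].sum() - x[neg_idx[r]].sum()
    return y

# Dense reference for comparison
W = np.array([[1, 0, -1, 0],
              [0, -1, 0, 1]], dtype=np.int8)
x = np.array([2.0, 3.0, 5.0, 7.0])
pos = [np.where(W[r] == 1)[0] for r in range(W.shape[0])]
neg = [np.where(W[r] == -1)[0] for r in range(W.shape[0])]
y = ternary_matvec(pos, neg, x, W.shape[0])  # → [-3., 4.]
```

Replacing multiply–accumulate with pure addition is what makes ternary GEMM attractive on edge CPUs and FPGAs, where multipliers are the scarce resource.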

File organization and main contributors:

  • SSR: Adeline Pittet, Valerie Verdan, and Shien Zhu
  • ternaryLLM_CPU: Mila Kjoseva and Shien Zhu
  • ternaryLLM_GPU: Guanshujie Fu
  • ternaryLLM_FPGA: Gabriele Giacone

Please refer to the README inside each folder for detailed experiment setups. If you find this repository helpful, please cite the following papers:

@inproceedings{SSR_DATE_2026,
  title={SSR: Sparse Segment Reduction for Ternary GEMM Acceleration},
  author={Adeline Pittet and Shien Zhu and Valerie Verdan and Gustavo Alonso},
  booktitle={Design, Automation and Test in Europe (DATE)},
  year={2026}
}

@article{ternaryLLM_TECS_2026,
  title={Efficient Addition-Based Sparse GEMM for Fast Ternary Large Language Model Inference on Edge Devices},
  author={Zhu, Shien and Fu, Guanshujie and Kjoseva, Mila and Alonso, Gustavo},
  journal={ACM Trans. Embed. Comput. Syst.},
  issn={1539-9087},
  url={https://doi.org/10.1145/3807782},
  doi={10.1145/3807782},
  month=apr,
  year={2026}
}