README.md
Intel® Neural Compressor
===========================
<h3> An open-source Python library supporting popular model compression techniques on mainstream deep learning frameworks (PyTorch, TensorFlow, and JAX)</h3>
support AMD CPU, ARM CPU, and NVidia GPU with limited testing.
## What's New
* [2026/03] FP8 quantization support for [Keras/JAX](./docs/source/JAX.md) (experimental)
* [2026/03] FP8 KV cache/Attention static quantization with [AutoRound](./docs/source/PT_AutoRound.md) (experimental)
* [2025/12] [NVFP4 quantization](./docs/source/PT_NVFP4Quant.md) experimental support
* [2025/10] [MXFP8 / MXFP4 quantization](./docs/source/PT_MXQuant.md) experimental support
* [2025/09] FP8 dynamic quantization, including Linear, FusedMoE on Intel Gaudi AI Accelerators
## Installation
Choose the necessary framework dependencies to install based on your deployment environment.
### Install Framework for PyTorch Backend (on-demand)

Intel Neural Compressor supports PyTorch with CPU, GPU and HPU. Please install the corresponding PyTorch version based on your hardware environment.
* [Install intel_extension_for_pytorch for CPU](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)
* [Install intel_extension_for_pytorch for Intel GPU](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)
* [Use Docker Image with torch installed for HPU](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click)

  **Note**: There is a version mapping between Intel Neural Compressor and the Gaudi Software Stack; please refer to this [table](./docs/source/gaudi_version_map.md) and make sure to use a matched combination.

* [Install torch for other platforms](https://pytorch.org/get-started/locally)
```bash
# Framework extension API + JAX dependency, available since v3.8
pip install neural-compressor-jax
```
**Note**: Further installation methods can be found under [Installation Guide](./docs/source/installation_guide.md). Check out our [FAQ](./docs/source/faq.md) for more details.
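As a quick sanity check after installation, the short snippet below (a minimal sketch, assuming the base `neural-compressor` package is the one installed) imports the library and prints its version:

```python
# Verify the installation: import the package and report its version.
import neural_compressor

print(neural_compressor.__version__)
```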
docs/source/PT_AutoRound.md

AutoRound is an advanced model quantization algorithm integrated into Neural Compressor for low-bit LLM quantization. As a key algorithm component of INC, AutoRound enables efficient quantization across a wide range of models and features while consistently achieving superior accuracy. While it requires additional tuning time, it provides a robust foundation for INC's comprehensive quantization capabilities.
## Supported Features
- **Weight-Only Quantization (WoQ)** - Quantize model weights while keeping activations in full precision. See [Weight-Only Quantization](./PT_WeightOnlyQuant.md) for details, and the minimal sketch after this list.
- **Microscaling (MX) Quantization** - Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. Refer to [MX Quantization](./PT_MXQuant.md).
- **NVFP4 Quantization** - NVFP4 is a specialized 4-bit floating-point format (FP4) developed by NVIDIA for deep learning workloads. See [NVFP4 Quantization](./PT_NVFP4Quant.md).
- **Quantization-Aware Training (QAT)** - Fine-tune models during quantization to achieve better accuracy. See [Quantization-Aware Training](./PT_QAT.md) for details.
- **FP8 KV Cache and Attention Static Quantization (Experimental)** - FP8 data type support enhances inference performance by quantizing key-value cache and attention computations to FP8 precision.
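As a concrete starting point for the Weight-Only Quantization item above, here is a minimal sketch of the WoQ flow; it assumes `RTNConfig` (round-to-nearest) as the algorithm, and the toy model, bit-width, and group size are illustrative choices rather than recommendations.

```python
import torch

from neural_compressor.torch.quantization import RTNConfig, convert, prepare

# Toy stand-in network; any torch.nn.Module containing Linear layers works.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

# Round-to-nearest (RTN) weight-only quantization: 4-bit weights with one
# scale per group of 32 weight elements; activations stay in full precision.
quant_config = RTNConfig(bits=4, group_size=32)

model = prepare(model, quant_config)  # mark modules for quantization
model = convert(model)                # replace weights with quantized versions
```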
## Getting Started
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

# Load any causal LM checkpoint; the model name below is only an example.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_config = AutoRoundConfig(tokenizer=tokenizer)  # tokenizer used for calibration
model = prepare(model, quant_config)
model = convert(model)

# For more detailed usage, please refer to the [Supported Features] documentation.
```
### FP8 KV Cache and FP8 Attention support
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The import list below is completed with the same names used in the Basic
# Usage example above; the remainder of the original snippet is not shown here.
from neural_compressor.torch.quantization import (
    AutoRoundConfig,
    convert,
    prepare,
)
```
Intel Neural Compressor extends PyTorch, TensorFlow and JAX's APIs to support compression techniques.
The table below provides a quick overview of the APIs available in Intel Neural Compressor 3.X.
The project mainly focuses on quantization-related features, especially for algorithms that benefit LLM accuracy and inference.
It also provides some common modules across different frameworks. For example, auto-tune supports accuracy-driven quantization and mixed precision (a minimal sketch follows below), and benchmark aims to measure the performance of multiple instances of the quantized model.
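To make the auto-tune flow concrete, here is a minimal sketch against the PyTorch framework extension API; the toy model, the `RTNConfig` search space, and the constant evaluation function are illustrative placeholders.

```python
import torch

from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

# Toy model and evaluation function; substitute a real model and metric.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

def eval_fn(q_model) -> float:
    # Return a scalar, accuracy-like score for a candidate quantized model.
    return 1.0

# Let auto-tune try 4-bit and 8-bit RTN configs and keep the first candidate
# that meets the accuracy criterion under eval_fn.
tune_config = TuningConfig(config_set=[RTNConfig(bits=[4, 8])])
best_model = autotune(model, tune_config=tune_config, eval_fn=eval_fn)
```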
<table class="docutils">
```bash
# Framework extension API + JAX dependency, available since v3.8
pip install neural-compressor-jax
```
### Install from Source
The latest code on master branch may not be stable. Please switch to the latest release tag for better stability. Feel free to open an [issue](https://github.com/intel/neural-compressor/issues) if you encounter an error.
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/README.md
Step-by-Step (Deprecated)
============
This document describes the step-by-step instructions to run large language models (LLMs) using Smooth Quantization on 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.
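Since this deprecated example predates the 3.x framework extension API, below is a minimal sketch of the legacy 2.x-style SmoothQuant flow; the toy model, random calibration data, recipe keys, and alpha value are all assumptions, so consult the example's own scripts for the exact invocation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from neural_compressor import PostTrainingQuantConfig, quantization

# Placeholders: a toy model and random calibration pairs stand in for a real
# LLM and its tokenized calibration set.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_dataloader = DataLoader(TensorDataset(torch.randn(32, 64), torch.zeros(32)), batch_size=8)

# Enable the SmoothQuant recipe; alpha controls how much quantization
# difficulty is migrated from activations into the weights.
conf = PostTrainingQuantConfig(recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}})
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```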