Commit a9bfaf0
Merge branch 'master' into dev/bkowalsk/jax_int8_pr
2 parents 72b2e8b + acda521 commit a9bfaf0

19 files changed: 233 additions & 113 deletions


.azure-pipelines/scripts/ut/run_3x_pt.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ echo "##[section]import check pass"
 echo "##[group]set up UT env..."
 export LD_LIBRARY_PATH=${HOME}/.local/lib/:$LD_LIBRARY_PATH
 sed -i '/^deepspeed/d' /neural-compressor/test/torch/requirements.txt
-pip install -r /neural-compressor/test/torch/requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
+pip install -r /neural-compressor/test/torch/requirements.txt
 pip install pytest-cov
 pip install pytest-html
 pip install beautifulsoup4==4.13.5

.azure-pipelines/template/model-template.yml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ steps:
     dockerConfigName: "commonDockerConfig"
     repoName: "neural-compressor"
     repoTag: "py312"
-    dockerFileName: "Dockerfile"
+    dockerFileName: "ubuntu-2404"
     containerName: ${{ parameters.modelContainerName }}

 - script: |

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@ repos:
     rev: v1.7.7
     hooks:
       - id: docformatter
+        language_version: python3.13
        args: [
          --in-place,
          --wrap-summaries=0, # 0 means disable wrap
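With the interpreter pinned, the hook can be exercised in isolation; a minimal sketch, assuming pre-commit is installed and a Python 3.13 interpreter is available locally:

```Shell
# Run only the docformatter hook against the whole tree (standard pre-commit CLI)
pre-commit run docformatter --all-files
```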

README.md

Lines changed: 19 additions & 6 deletions
@@ -4,8 +4,8 @@ Intel® Neural Compressor
 ===========================
 <h3> An open-source Python library supporting popular model compression techniques on mainstream deep learning frameworks (PyTorch, TensorFlow, and JAX)</h3>

-[![python](https://img.shields.io/badge/python-3.10%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-3.7-green)](https://github.com/intel/neural-compressor/releases)
+[![python](https://img.shields.io/badge/python-3.11%2B-blue)](https://github.com/intel/neural-compressor)
+[![version](https://img.shields.io/badge/release-3.8-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
@@ -25,6 +25,8 @@ across diverse quantization techniques and low-precision data types through inte
 support AMD CPU, ARM CPU, and NVidia GPU with limited testing.

 ## What's New
+* [2026/03] FP8 quantization support for [Keras/JAX](./docs/source/JAX.md) (experimental)
+* [2026/03] FP8 KV cache/Attention static quantization with [AutoRound](./docs/source/PT_AutoRound.md) (experimental)
 * [2025/12] [NVFP4 quantization](./docs/source/PT_NVFP4Quant.md) experimental support
 * [2025/10] [MXFP8 / MXFP4 quantization](./docs/source/PT_MXQuant.md) experimental support
 * [2025/09] FP8 dynamic quantization, including Linear, FusedMoE on Intel Gaudi AI Accelerators
@@ -33,20 +35,22 @@ support AMD CPU, ARM CPU, and NVidia GPU with limited testing.

 ## Installation
 Choose the necessary framework dependencies to install based on your deploy environment.
-### Install Framework
+### Install Framework for PyTorch Backend (on-demand)
+Intel Neural Compressor supports PyTorch with CPU, GPU and HPU. Please install the corresponding PyTorch version based on your hardware environment.
 * [Install intel_extension_for_pytorch for CPU](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)
 * [Install intel_extension_for_pytorch for Intel GPU](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)
 * [Use Docker Image with torch installed for HPU](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click)
 **Note**: There is a version mapping between Intel Neural Compressor and Gaudi Software Stack, please refer to this [table](./docs/source/gaudi_version_map.md) and make sure to use a matched combination.
 * [Install torch for other platform](https://pytorch.org/get-started/locally)
-* [Install TensorFlow](https://www.tensorflow.org/install)

 ### Install Neural Compressor from pypi
 ```
 # Framework extension API + PyTorch dependency
 pip install neural-compressor-pt
 # Framework extension API + TensorFlow dependency
 pip install neural-compressor-tf
+# Framework extension API + JAX dependency, available since v3.8
+pip install neural-compressor-jax
 ```
 **Note**: Further installation methods can be found under [Installation Guide](./docs/source/installation_guide.md). check out our [FAQ](./docs/source/faq.md) for more details.

@@ -113,8 +117,7 @@ model = load(
 <td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
 <td colspan="2" align="center"><a href="./docs/source/design.md#workflows">Workflow</a></td>
 <td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
-<td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
-<td colspan="1" align="center"><a href="./examples/README.md">Examples</a></td>
+<td colspan="2" align="center"><a href="./examples/README.md">Examples</a></td>
 </tr>
 </tbody>
 <thead>
@@ -163,6 +166,16 @@ model = load(
 <td colspan="8" align="center"><a href="./docs/source/transformers_like_api.md">Overview</a></td>
 </tr>
 </tbody>
+<thead>
+<tr>
+<th colspan="8">JAX Extension APIs</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td colspan="8" align="center"><a href="./docs/source/JAX.md">Overview</a></td>
+</tr>
+</tbody>
 <thead>
 <tr>
 <th colspan="8">Other Modules</th>

docs/source/PT_AutoRound.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+
+# PyTorch AutoRound
+
+## Overview
+AutoRound is an advanced model quantization algorithm integrated into Neural Compressor for low-bit LLMs. As a key algorithm component of INC, AutoRound enables efficient quantization across a wide range of models and features while consistently achieving superior accuracy. While it requires additional tuning time, it provides a robust foundation for INC's comprehensive quantization capabilities.
+
+## Supported Features
+
+- **Weight-Only Quantization (WoQ)** - Quantize model weights while keeping activations in full precision. See [Weight-Only Quantization](./PT_WeightOnlyQuant.md) for details.
+
+- **Microscaling (MX) Quantization** - Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. Refer to [MX Quantization](./PT_MXQuant.md).
+
+- **NVFP4 Quantization** - NVFP4 is a specialized 4-bit floating-point format (FP4) developed by NVIDIA for deep learning workloads. See [NVFP4 Quantization](./PT_NVFP4Quant.md).
+
+- **Quantization-Aware Training (QAT)** - Fine-tune models during quantization to achieve better accuracy. See [Quantization-Aware Training](./PT_QAT.md) for details.
+
+- **FP8 KV Cache and Attention Static Quantization (Experimental)** - Support for the FP8 data type enhances inference performance by quantizing key-value cache and attention computations to FP8 precision.
+
+## Getting Started
+
+### Basic Usage
+
+```python
+from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig
+
+# `model` and `tokenizer` are a Hugging Face model/tokenizer loaded beforehand
+quant_config = AutoRoundConfig(tokenizer=tokenizer)  # tokenizer used for calibration
+model = prepare(model, quant_config)
+model = convert(model)
+
+# For more detailed usage, please refer to the [Supported Features] documentation.
+```
+### FP8 KV Cache and FP8 Attention Support
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from neural_compressor.torch.quantization import (
+    AutoRoundConfig,
+    convert,
+    prepare,
+)
+
+fp32_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", trust_remote_code=True)
+
+output_dir = "./saved_inc"
+quant_config = AutoRoundConfig(
+    tokenizer=tokenizer,
+    scheme="MXFP4",  # MXFP4, MXFP8, NVFP4
+    iters=0,  # rtn mode
+    seqlen=2,
+    static_kv_dtype="fp8",  # None, fp8, float16
+    static_attention_dtype=None,  # None, fp8
+    export_format="auto_round",
+    output_dir=output_dir,
+)
+
+model = prepare(model=fp32_model, quant_config=quant_config)
+model = convert(model)
+```
+
+## Reference
+
+[1] Cheng, Wenhua, et al. "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs." arXiv preprint arXiv:2309.05516 (2023).
+
+[2] NVIDIA, "Introducing NVFP4 for efficient and accurate low-precision inference," NVIDIA Developer Blog, Jun. 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
+
+[3] Intel AutoRound, https://github.com/intel/auto-round
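The FP8 example above ends at `convert`. As a quick follow-on, a minimal sketch of exercising the quantized model, assuming the object returned by `convert` still supports the standard Hugging Face `generate` flow (the prompt and `max_new_tokens` value are illustrative, not part of the commit):

```python
# Reuse `model` and `tokenizer` from the FP8 example above.
# Assumption: convert() returns a transformers-compatible causal LM.
inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```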

docs/source/get_started.md

Lines changed: 23 additions & 4 deletions
@@ -21,9 +21,9 @@ quantized_model = convert(model=prepared_model)
 ```

 ## Feature Matrix
-Intel Neural Compressor 3.X extends PyTorch and TensorFlow's APIs to support compression techniques.
+Intel Neural Compressor extends PyTorch, TensorFlow and JAX's APIs to support compression techniques.
 The below table provides a quick overview of the APIs available in Intel Neural Compressor 3.X.
-The Intel Neural Compressor 3.X mainly focuses on quantization-related features, especially for algorithms that benefit LLM accuracy and inference.
+The project mainly focuses on quantization-related features, especially for algorithms that benefit LLM accuracy and inference.
 It also provides some common modules across different frameworks. For example, Auto-tune support accuracy driven quantization and mixed precision, benchmark aimed to measure the multiple instances performance of the quantized model.

 <table class="docutils">
@@ -37,8 +37,7 @@ It also provides some common modules across different frameworks. For example, A
 <td colspan="2" align="center"><a href="design.md#architecture">Architecture</a></td>
 <td colspan="2" align="center"><a href="design.md#workflow">Workflow</a></td>
 <td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
-<td colspan="1" align="center"><a href="llm_recipes.md">LLMs Recipes</a></td>
-<td colspan="1" align="center"><a href="/examples/README.md">Examples</a></td>
+<td colspan="2" align="center"><a href="/examples/README.md">Examples</a></td>
 </tr>
 </tbody>
 <thead>
@@ -71,6 +70,26 @@ It also provides some common modules across different frameworks. For example, A
 <td colspan="2" align="center"><a href="TF_SQ.md">Smooth Quantization</a></td>
 </tr>
 </tbody>
+<thead>
+<tr>
+<th colspan="8">Transformers-like APIs</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td colspan="8" align="center"><a href="transformers_like_api.md">Overview</a></td>
+</tr>
+</tbody>
+<thead>
+<tr>
+<th colspan="8">JAX Extension APIs</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td colspan="8" align="center"><a href="JAX.md">Overview</a></td>
+</tr>
+</tbody>
 <thead>
 <tr>
 <th colspan="8">Other Modules</th>
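The `prepare`/`convert` pair named in this file's first hunk is the same two-step flow across the framework extension APIs. A minimal sketch for the PyTorch path, assuming a round-to-nearest weight-only config; `RTNConfig` and the calibration comment are illustrative assumptions, not part of the diff:

```python
from neural_compressor.torch.quantization import prepare, convert, RTNConfig

# `model` is a torch.nn.Module loaded beforehand (e.g., a Hugging Face causal LM).
quant_config = RTNConfig()  # assumed weight-only RTN configuration
prepared_model = prepare(model=model, quant_config=quant_config)
# RTN needs no calibration pass; algorithms that do would run a few batches here.
quantized_model = convert(model=prepared_model)
```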

docs/source/imgs/architecture.png

Binary file changed (-24.7 KB).

docs/source/installation_guide.md

Lines changed: 19 additions & 25 deletions
@@ -24,7 +24,8 @@ The following prerequisites and requirements must be satisfied for a successful
 > Notes:
 > - If you get some build issues, please check [frequently asked questions](faq.md) at first.

-### Install Framework
+### Install Framework for PyTorch Backend (on-demand)
+Intel Neural Compressor supports PyTorch with CPU, GPU and HPU. Please install the corresponding PyTorch version based on your hardware environment.
 #### Install torch for CPU
 ```Shell
 pip install torch --index-url https://download.pytorch.org/whl/cpu
@@ -38,19 +39,15 @@ https://intel.github.io/intel-extension-for-pytorch/index.html#installation
 #### Install torch for other platform
 https://pytorch.org/get-started/locally

-#### Install tensorflow
-```Shell
-pip install tensorflow
-```
-
 ### Install from Binary
 - Install from Pypi
 ```Shell
-# Framework extension API for PyTorch/Tensorflow
+# Framework extension API for PyTorch/Tensorflow/JAX
 pip install neural-compressor
-# Framework extension API + specific dependency
+# Framework extension API + corresponding framework dependency
 pip install neural-compressor[pt]
 pip install neural-compressor[tf]
+pip install neural-compressor[jax] # JAX support is available since v3.8
 ```
 ```Shell
 # Framework extension API + PyTorch dependency
@@ -60,12 +57,17 @@ pip install neural-compressor-pt
 # Framework extension API + TensorFlow dependency
 pip install neural-compressor-tf
 ```
+```Shell
+# Framework extension API + JAX dependency, available since v3.8
+pip install neural-compressor-jax
+```

 ### Install from Source
-The latest code on master branch may not be stable. Feel free to open an [issue](https://github.com/intel/neural-compressor/issues) if you encounter an error.
+The latest code on master branch may not be stable. Please switch to the latest release tag for better stability. Feel free to open an [issue](https://github.com/intel/neural-compressor/issues) if you encounter an error.
 ```Shell
 git clone https://github.com/intel/neural-compressor.git
 cd neural-compressor
+git fetch --tags && git checkout "$(git tag -l 'v*' --sort=-v:refname | head -n 1)"
 ```

 ```Shell
@@ -79,7 +81,7 @@ INC_TF_ONLY=1 pip install .
 ```

 ```Shell
-# JAX framework extension API + JAX dependency
+# JAX framework extension API + JAX dependency, available since v3.8
 INC_JAX_ONLY=1 pip install .
 ```

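Whichever binary or source path is used, a quick post-install smoke test can confirm the package resolves; a sketch, assuming the package exposes a `__version__` attribute (an assumption, not shown in the diff):

```Shell
# Hypothetical sanity check: import the package and print its version
python -c "import neural_compressor; print(neural_compressor.__version__)"
```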

@@ -94,43 +96,35 @@ INC_JAX_ONLY=1 pip install .

 * Intel Xeon Scalable processor (Sapphire Rapids, Emerald Rapids, Granite Rapids)
 * Intel Xeon CPU Max Series (Sapphire Rapids HBM)
-* Intel Core Ultra Processors (Meteor Lake, Lunar Lake)

 #### Intel® Neural Compressor supports GPUs built on Intel's Xe architecture:

-* Intel Data Center GPU Flex Series (Arctic Sound-M)
-* Intel Data Center GPU Max Series (Ponte Vecchio)
 * Intel® Arc™ B-Series Graphics (Battlemage)

 ### Validated Software Environment

-* OS version: CentOS 8.4, Ubuntu 24.04, MacOS Ventura 13.5, Windows 11
-* Python version: 3.10, 3.11, 3.12, 3.13
+* OS version: Ubuntu 24.04, MacOS Ventura 13.5, Windows 11
+* Python version: 3.11, 3.12, 3.13

 <table class="docutils">
 <thead>
 <tr style="vertical-align: middle; text-align: center;">
 <th>Framework</th>
 <th>TensorFlow</th>
 <th>PyTorch</th>
-<th>Intel®<br>Extension for<br>PyTorch*</th>
+<th>JAX</th>
 </tr>
 </thead>
 <tbody>
 <tr align="center">
 <th>Version</th>
 <td class="tg-7zrl">
-<a href=https://github.com/tensorflow/tensorflow/tree/v2.16.1>2.16.1</a><br>
-<a href=https://github.com/tensorflow/tensorflow/tree/v2.15.0>2.15.0</a><br>
-<a href=https://github.com/tensorflow/tensorflow/tree/v2.14.1>2.14.1</a><br></td>
+<a href=https://github.com/tensorflow/tensorflow/releases/tag/v2.19.0>2.19.0</a><br></td>
 <td class="tg-7zrl">
-<a href=https://github.com/pytorch/pytorch/tree/v2.8.0>2.8.0</a><br>
-<a href=https://github.com/pytorch/pytorch/tree/v2.7.1>2.7.1</a><br>
-<a href=https://github.com/pytorch/pytorch/tree/v2.6.0>2.6.0</a><br></td>
+<a href=https://github.com/pytorch/pytorch/releases/tag/v2.10.0>2.10.0</a><br>
+<a href=https://github.com/pytorch/pytorch/releases/tag/v2.9.1>2.9.1</a><br></td>
 <td class="tg-7zrl">
-<a href=https://github.com/intel/intel-extension-for-pytorch/tree/v2.8.0%2Bcpu>2.8.0</a><br>
-<a href=https://github.com/intel/intel-extension-for-pytorch/tree/v2.7.0%2Bcpu>2.7.0</a><br>
-<a href=https://github.com/intel/intel-extension-for-pytorch/tree/v2.6.0%2Bcpu>2.6.0</a><br></td>
+<a href=https://github.com/jax-ml/jax/releases/tag/jax-v0.9.1>0.9</a><br></td>
 </tr>
 </tbody>
 </table>

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Step-by-Step
+Step-by-Step (Deprecated)
 ============
 This document describes the step-by-step instructions to run large language models (LLMs) using Smooth Quantization on 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@ accelerate
 protobuf
 sentencepiece != 0.1.92
 datasets >= 1.1.3
-torch == 2.7.0
+torch == 2.8.0
 transformers
 pytest
 wandb
@@ -11,4 +11,4 @@ neural-compressor
 lm_eval <= 0.4.7
 peft <= 0.17.0
 optimum-intel
-intel_extension_for_pytorch == 2.7.0
+intel_extension_for_pytorch == 2.8.0
