README.md
Intel® Neural Compressor
===========================
<h3> An open-source Python library supporting popular model compression techniques on mainstream deep learning frameworks (PyTorch, TensorFlow, and JAX)</h3>
support AMD CPU, ARM CPU, and NVidia GPU with limited testing.
## What's New
* [2026/03] FP8 quantization support for [Keras/JAX](./docs/source/JAX.md) (experimental)
* [2026/03] FP8 KV cache/Attention static quantization with [AutoRound](./docs/source/PT_AutoRound.md) (experimental)
* [2025/12] [NVFP4 quantization](./docs/source/PT_NVFP4Quant.md) experimental support
* [2025/10] [MXFP8 / MXFP4 quantization](./docs/source/PT_MXQuant.md) experimental support
* [2025/09] FP8 dynamic quantization, including Linear, FusedMoE on Intel Gaudi AI Accelerators
## Installation
Choose the necessary framework dependencies to install based on your deployment environment.
### Install Framework for PyTorch Backend (on-demand)

Intel Neural Compressor supports PyTorch with CPU, GPU and HPU. Please install the corresponding PyTorch version based on your hardware environment.
* [Install intel_extension_for_pytorch for CPU](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)
* [Install intel_extension_for_pytorch for Intel GPU](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)
* [Use Docker Image with torch installed for HPU](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#bare-metal-fresh-os-single-click)

  **Note**: There is a version mapping between Intel Neural Compressor and the Gaudi Software Stack; please refer to this [table](./docs/source/gaudi_version_map.md) and make sure to use a matched combination.

* [Install torch for other platforms](https://pytorch.org/get-started/locally)
```bash
# Framework extension API + JAX dependency, available since v3.8
pip install neural-compressor-jax
```
**Note**: Further installation methods can be found under [Installation Guide](./docs/source/installation_guide.md). Check out our [FAQ](./docs/source/faq.md) for more details.
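As a quick sanity check after installation, the short snippet below (a minimal sketch, assuming the base `neural-compressor` package is the one installed) imports the library and prints its version:

```python
# Verify the installation: import the package and report its version.
import neural_compressor

print(neural_compressor.__version__)
```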
docs/source/PT_AutoRound.md

AutoRound is an advanced model quantization algorithm integrated into Neural Compressor for low-bit LLM quantization. As a key algorithm component of INC, AutoRound enables efficient quantization across a wide range of models and features while consistently achieving superior accuracy. While it requires additional tuning time, it provides a robust foundation for INC's comprehensive quantization capabilities.
## Supported Features
- **Weight-Only Quantization (WoQ)** - Quantize model weights while keeping activations in full precision. See [Weight-Only Quantization](./PT_WeightOnlyQuant.md) for details, and the minimal sketch after this list.
- **Microscaling (MX) Quantization** - Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. Refer to [MX Quantization](./PT_MXQuant.md).
- **NVFP4 Quantization** - NVFP4 is a specialized 4-bit floating-point format (FP4) developed by NVIDIA for deep learning workloads. See [NVFP4 Quantization](./PT_NVFP4Quant.md).
- **Quantization-Aware Training (QAT)** - Fine-tune models during quantization to achieve better accuracy. See [Quantization-Aware Training](./PT_QAT.md) for details.
- **FP8 KV Cache and Attention Static Quantization (Experimental)** - FP8 data type support enhances inference performance by quantizing key-value cache and attention computations to FP8 precision.
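As a concrete starting point for the Weight-Only Quantization item above, here is a minimal sketch of the WoQ flow; it assumes `RTNConfig` (round-to-nearest) as the algorithm, and the toy model, bit-width, and group size are illustrative choices rather than recommendations.

```python
import torch

from neural_compressor.torch.quantization import RTNConfig, convert, prepare

# Toy stand-in network; any torch.nn.Module containing Linear layers works.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

# Round-to-nearest (RTN) weight-only quantization: 4-bit weights with one
# scale per group of 32 weight elements; activations stay in full precision.
quant_config = RTNConfig(bits=4, group_size=32)

model = prepare(model, quant_config)  # mark modules for quantization
model = convert(model)                # replace weights with quantized versions
```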
## Getting Started
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

# Load any causal LM checkpoint; the model name below is only an example.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_config = AutoRoundConfig(tokenizer=tokenizer)  # tokenizer used for calibration
model = prepare(model, quant_config)
model = convert(model)

# For more detailed usage, please refer to the [Supported Features] documentation.
```
### FP8 KV Cache and FP8 Attention support
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The import list below is completed with the same names used in the Basic
# Usage example above; the remainder of the original snippet is not shown here.
from neural_compressor.torch.quantization import (
    AutoRoundConfig,
    convert,
    prepare,
)
```
Intel Neural Compressor extends PyTorch, TensorFlow and JAX's APIs to support compression techniques.
The table below provides a quick overview of the APIs available in Intel Neural Compressor 3.X.
The project mainly focuses on quantization-related features, especially for algorithms that benefit LLM accuracy and inference.
It also provides some common modules across different frameworks. For example, auto-tune supports accuracy-driven quantization and mixed precision (a minimal sketch follows below), and benchmark aims to measure the performance of multiple instances of the quantized model.
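To make the auto-tune flow concrete, here is a minimal sketch against the PyTorch framework extension API; the toy model, the `RTNConfig` search space, and the constant evaluation function are illustrative placeholders.

```python
import torch

from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

# Toy model and evaluation function; substitute a real model and metric.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

def eval_fn(q_model) -> float:
    # Return a scalar, accuracy-like score for a candidate quantized model.
    return 1.0

# Let auto-tune try 4-bit and 8-bit RTN configs and keep the first candidate
# that meets the accuracy criterion under eval_fn.
tune_config = TuningConfig(config_set=[RTNConfig(bits=[4, 8])])
best_model = autotune(model, tune_config=tune_config, eval_fn=eval_fn)
```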
<table class="docutils">
```bash
# Framework extension API + JAX dependency, available since v3.8
pip install neural-compressor-jax
```
### Install from Source
The latest code on master branch may not be stable. Please switch to the latest release tag for better stability. Feel free to open an [issue](https://github.com/intel/neural-compressor/issues) if you encounter an error.
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/README.md
Step-by-Step (Deprecated)
============
This document describes the step-by-step instructions to run large language models (LLMs) using Smooth Quantization on 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.
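Since this deprecated example predates the 3.x framework extension API, below is a minimal sketch of the legacy 2.x-style SmoothQuant flow; the toy model, random calibration data, recipe keys, and alpha value are all assumptions, so consult the example's own scripts for the exact invocation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from neural_compressor import PostTrainingQuantConfig, quantization

# Placeholders: a toy model and random calibration pairs stand in for a real
# LLM and its tokenized calibration set.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_dataloader = DataLoader(TensorDataset(torch.randn(32, 64), torch.zeros(32)), batch_size=8)

# Enable the SmoothQuant recipe; alpha controls how much quantization
# difficulty is migrated from activations into the weights.
conf = PostTrainingQuantConfig(recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}})
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```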