Add compressed-tensors format export support for W4A16 and W8A16 #1669
Conversation
Pull request overview
Adds llm_compressor (compressed-tensors) export support for INT weight-only schemes (W4A16, W8A16), and updates docs/tests accordingly.
Changes:
- Extend the `llm_compressor` format to accept W4A16/W8A16 and route them through a new backend path.
- Update compressed-tensors scheme construction to omit activation quantization for weight-only exports.
- Add/adjust CPU export tests and document the newly supported schemes (EN + CN).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `auto_round/formats.py` | Adds W4A16/W8A16 to `llm_compressor` support and introduces a WOQ backend selector (`wint_a16`). |
| `auto_round/export/export_to_llmcompressor/export.py` | Treats W*A16 as weight-only in compressed-tensors scheme creation; tightens dependency expectations around `compress_module`. |
| `auto_round/compressors/utils.py` | Adds a helper to detect integer weight-only quantization (WOQ). |
| `test/test_cpu/export/test_export.py` | Refactors the INT8_W8A8 export test and adds new W4A16/W8A16 `llm_compressor` export assertions. |
| `README.md` | Documents `llm_compressor` support for FP8_BLOCK, INT8_W8A8, W4A16, W8A16. |
| `README_CN.md` | Mirrors the README support-matrix update in Chinese. |
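The diff summary above mentions a WOQ-detection helper in `auto_round/compressors/utils.py` without showing it. A minimal sketch of what such a predicate might look like (the name, signature, and fields are assumptions, not the PR's actual code):

```python
def is_int_woq(bits: int, act_bits: int, data_type: str = "int") -> bool:
    """Return True for integer weight-only quantization (e.g. W4A16, W8A16).

    Hypothetical helper: the real signature in auto_round/compressors/utils.py
    may differ. "Weight-only" here means the weights are quantized to INT
    while activations stay at 16-bit (act_bits >= 16, i.e. unquantized).
    """
    return data_type == "int" and bits in (4, 8) and act_bits >= 16
```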
```python
autoround = AutoRound(
    self.model_name,
    iters=2,
    nsamples=2,
    seqlen=2,
    scheme=scheme,
```
This test constructs AutoRound with iters=2 (tuning enabled) but doesn’t pass an explicit dataset. That will fall back to the default HF dataset (e.g., "NeelNanda/pile-10k"), which can introduce unwanted network dependence/flakiness in CI. Consider passing the dataloader fixture (or setting iters=0 for RTN-only) so the test is hermetic.
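The two hermetic options above can be sketched as a small kwargs builder (the helper and its parameter names are hypothetical, not AutoRound's actual API):

```python
def hermetic_autoround_kwargs(model_name, scheme, use_tuning=False, dataloader=None):
    """Build AutoRound-style kwargs that never trigger an implicit dataset download.

    Hypothetical helper for illustration only. Either disable tuning entirely
    (iters=0, RTN-only) or force the caller to supply calibration data.
    """
    kwargs = {"model": model_name, "scheme": scheme, "nsamples": 2, "seqlen": 2}
    if use_tuning:
        if dataloader is None:
            # Tuning needs calibration data; fail loudly instead of silently
            # falling back to a hub-hosted default dataset.
            raise ValueError("tuning (iters > 0) requires an explicit dataloader")
        kwargs.update(iters=2, dataset=dataloader)
    else:
        kwargs["iters"] = 0  # RTN-only: no tuning, no calibration data needed
    return kwargs
```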
```python
    group_size=group_size,
    sym=sym,
)
quantized_model_path = "./saved"
```
Using a hard-coded relative output directory (./saved) can collide with other tests, depends on the process working directory, and risks deleting a real local folder when shutil.rmtree runs. Prefer using self.save_dir (tmp_path-based) or a tmp_path subdirectory for isolation.
```diff
-quantized_model_path = "./saved"
+quantized_model_path = os.path.join(
+    self.save_dir,
+    f"llmc_wint_a16_export_{scheme}_{bits}_{group_size}_{'sym' if sym else 'asym'}",
+)
```
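The suggestion keys the directory off `self.save_dir`. An alternative sketch using the standard library's `tempfile` for per-test isolation (the helper name is hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

def make_export_dir(tag: str) -> Path:
    """Create a unique, collision-free export directory for one test case.

    tempfile.mkdtemp avoids the problems of a hard-coded "./saved": it never
    depends on the process CWD, two tests can't collide on the same path, and
    shutil.rmtree on it cannot delete a user's real local folder.
    """
    return Path(tempfile.mkdtemp(prefix=f"llmc_export_{tag}_"))

# Usage sketch:
out = make_export_dir("W4A16_sym")
try:
    # autoround.save_quantized(out, format="llm_compressor")  # export call (assumed)
    assert out.is_dir()
finally:
    shutil.rmtree(out)
```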
```python
logger.error(
    "Unable to import compress_module from compressed_tensors. "
    "This functionality requires compressed_tensors >= 0.15.0. "
    "Please upgrade: pip install --upgrade compressed_tensors"
)
```
_compress_and_set_format now hard-requires compressed_tensors.compressors.compress_module and raises if it’s missing, but the repo doesn’t appear to pin/enforce a minimum compressed-tensors version in top-level dependencies. Either keep the previous fallback for older versions or add an explicit version check / dependency constraint so users get a deterministic install-time failure instead of a runtime error here.
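One way to get the deterministic failure this comment asks for is a small version gate at import time. A sketch (the package name and minimum version come from the PR; the helper itself is hypothetical and compares numeric tuples to avoid a third-party dependency):

```python
from importlib import metadata

def version_tuple(v: str) -> tuple:
    """Parse '0.15.0' -> (0, 15, 0); ignores non-numeric suffixes like 'rc1'."""
    parts = []
    for p in v.split("."):
        num = ""
        for ch in p:
            if not ch.isdigit():
                break  # stop at the first non-digit in this component
            num += ch
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

def require_min_version(dist: str, minimum: str) -> None:
    """Raise ImportError with an actionable message if `dist` is absent or too old."""
    try:
        installed = metadata.version(dist)
    except metadata.PackageNotFoundError as e:
        raise ImportError(
            f"{dist} >= {minimum} is required: pip install '{dist}>={minimum}'"
        ) from e
    if version_tuple(installed) < version_tuple(minimum):
        raise ImportError(
            f"{dist} {installed} found, but >= {minimum} is required: "
            f"pip install --upgrade '{dist}>={minimum}'"
        )

# Intended use (sketch): require_min_version("compressed-tensors", "0.15.0")
```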
README_CN.md
Outdated
```diff
 |**auto_awq**| W4A16 (recommended), BF16 |
 |**auto_gptq**| W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
-|**llm_compressor**| NVFP4 (recommended), `MXFP4`, `MXFP8`, `FPW8A16`, `FP8_STATIC` |
+|**llm_compressor**| NVFP4 (recommended), `MXFP4`, `MXFP8`, `FPW8A16`, `FP8_STATIC`, `FP8_BLOCK`, `INT8_W8A8`, W4A16, W8A16 |
```
In AR terminology, INT8 explicitly denotes W8A8 INT quantization. If the W/A suffix is omitted, INT8 defaults to W8A8, similar to MXFP4.
INT8_W8A8 is an existing scheme; this PR just exposes it in the docs. If we plan to change the name, we can open a new PR to update both the doc and the code.
By the way, since native INT8 without smoothing does not scale well from an accuracy perspective, maybe we should remove it from the doc until the smoothing feature is ready. What do you think? @wenhuach21
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Description
Added compressed-tensors format export support for W4A16 and W8A16.
Replaced the previous INT W8A8 support, which used the internal NaiveQuantizationCompressor interface, with the new compress_module interface (requires compressed_tensors >= 0.15.0).
Updated the PR to use a BaseCompressor class method so it stays compatible with older versions.
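The compatibility behavior described above is commonly implemented as a guarded import with a fallback path. A sketch (the helper is hypothetical, not the PR's exact code):

```python
def resolve_compress_fn():
    """Prefer the new compress_module API (compressed_tensors >= 0.15.0, per the PR).

    Hypothetical helper: returns None when the new API is unavailable so the
    caller can fall back to the older BaseCompressor class-method path instead
    of failing at runtime.
    """
    try:
        from compressed_tensors.compressors import compress_module
        return compress_module
    except ImportError:
        return None

compress_fn = resolve_compress_fn()
# If compress_fn is None, take the older BaseCompressor-based fallback path.
```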
Type of Change
Related Issues
Fixes or relates to #1567
Checklist Before Submitting