
Add compressed-tensors format export support for W4A16 and W8A16 #1669

Open
thuang6 wants to merge 11 commits into main from thuang6/int4-ct

Conversation


@thuang6 thuang6 commented Apr 9, 2026

Description

Added compressed-tensors format export support for W4A16 and W8A16.
Replaced the previous INT W8A8 support, which relied on the internal NaiveQuantizationCompressor interface, with the new compress_module interface (requires compressed_tensors >= 0.15.0).

Updated the PR to use the BaseCompressor class method to stay compatible with older versions.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1567

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

thuang6 added 2 commits April 9, 2026 11:35
@thuang6 thuang6 added this to the 0.13.0 milestone Apr 9, 2026

Copilot AI left a comment


Pull request overview

Adds llm_compressor (compressed-tensors) export support for INT weight-only schemes (W4A16, W8A16), and updates docs/tests accordingly.

Changes:

  • Extend llm_compressor format to accept W4A16/W8A16 and route them through a new backend path.
  • Update compressed-tensors scheme construction to omit activation quantization for weight-only exports.
  • Add/adjust CPU export tests and document the newly supported schemes (EN + CN).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
auto_round/formats.py Adds W4A16/W8A16 to llm_compressor support and introduces a WOQ backend selector (wint_a16).
auto_round/export/export_to_llmcompressor/export.py Treats W*A16 as weight-only in compressed-tensors scheme creation; tightens dependency expectations around compress_module.
auto_round/compressors/utils.py Adds helper to detect integer weight-only quantization (WOQ).
test/test_cpu/export/test_export.py Refactors INT8_W8A8 export test and adds new W4A16/W8A16 llm_compressor export assertions.
README.md Documents llm_compressor support for FP8_BLOCK, INT8_W8A8, W4A16, W8A16.
README_CN.md Mirrors the README support-matrix update in Chinese.
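The new backend path mentioned above can be sketched roughly as follows. This is a hypothetical illustration only: the names `WEIGHT_ONLY_INT_SCHEMES`, `select_llmcompressor_backend`, and the `"default"` fallback are assumptions for this sketch, not the actual code in auto_round/formats.py; only the scheme names and the `wint_a16` backend label come from the review summary.

```python
# Hypothetical sketch of the llm_compressor backend routing: the weight-only
# INT schemes W4A16/W8A16 are directed to the new "wint_a16" backend, while
# other schemes keep their existing path.
WEIGHT_ONLY_INT_SCHEMES = {"W4A16", "W8A16"}

def select_llmcompressor_backend(scheme: str) -> str:
    """Pick the export backend for a given quantization scheme name."""
    if scheme.upper() in WEIGHT_ONLY_INT_SCHEMES:
        return "wint_a16"
    return "default"
```

For example, `select_llmcompressor_backend("W4A16")` would yield `"wint_a16"`, while `INT8_W8A8` would stay on the existing path.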

Comment on lines +512 to +517
autoround = AutoRound(
self.model_name,
iters=2,
nsamples=2,
seqlen=2,
scheme=scheme,

Copilot AI Apr 9, 2026


This test constructs AutoRound with iters=2 (tuning enabled) but doesn’t pass an explicit dataset. That will fall back to the default HF dataset (e.g., "NeelNanda/pile-10k"), which can introduce unwanted network dependence/flakiness in CI. Consider passing the dataloader fixture (or setting iters=0 for RTN-only) so the test is hermetic.

group_size=group_size,
sym=sym,
)
quantized_model_path = "./saved"

Copilot AI Apr 9, 2026


Using a hard-coded relative output directory (./saved) can collide with other tests, depends on the process working directory, and risks deleting a real local folder when shutil.rmtree runs. Prefer using self.save_dir (tmp_path-based) or a tmp_path subdirectory for isolation.

Suggested change
quantized_model_path = "./saved"
quantized_model_path = os.path.join(
self.save_dir,
f"llmc_wint_a16_export_{scheme}_{bits}_{group_size}_{'sym' if sym else 'asym'}",
)

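The isolation the reviewer asks for can be achieved with a temporary base directory. This is a hedged sketch: `make_export_dir` is a hypothetical helper invented here, and the directory-name template is taken from the suggested change above; the real test would use the suite's `self.save_dir` fixture instead.

```python
import os
import tempfile

def make_export_dir(base_dir: str, scheme: str, bits: int, group_size: int, sym: bool) -> str:
    """Build a unique export directory under base_dir so parallel tests
    cannot collide and cleanup never touches a real local folder."""
    name = f"llmc_wint_a16_export_{scheme}_{bits}_{group_size}_{'sym' if sym else 'asym'}"
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)
    return path

# Everything created here is destroyed when the context manager exits.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = make_export_dir(tmp, "W4A16", 4, 128, True)
```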
Comment on lines 118 to 122
logger.error(
"Unable to import compress_module from compressed_tensors. "
"This functionality requires compressed_tensors >= 0.15.0. "
"Please upgrade: pip install --upgrade compressed_tensors"
)

Copilot AI Apr 9, 2026


_compress_and_set_format now hard-requires compressed_tensors.compressors.compress_module and raises if it’s missing, but the repo doesn’t appear to pin/enforce a minimum compressed-tensors version in top-level dependencies. Either keep the previous fallback for older versions or add an explicit version check / dependency constraint so users get a deterministic install-time failure instead of a runtime error here.

README_CN.md Outdated
|**auto_awq**| W4A16 (recommended), BF16 |
|**auto_gptq**| W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
|**llm_compressor**| NVFP4 (recommended), `MXFP4`, `MXFP8`, `FPW8A16`, `FP8_STATIC` |
|**llm_compressor**| NVFP4 (recommended), `MXFP4`, `MXFP8`, `FPW8A16`, `FP8_STATIC`, `FP8_BLOCK`, `INT8_W8A8`, W4A16, W8A16 |
Contributor


In AR terminology, INT8 explicitly denotes W8A8 INT quantization. If the W/A suffix is omitted, INT8 defaults to W8A8, similar to MXFP4.


@thuang6 thuang6 Apr 10, 2026


INT_W8A8 is an existing scheme; this PR just exposes it in the docs. If we plan to rename it, a new PR can update both the docs and the code.
Btw, considering that native INT8 without smoothing does not scale well from an accuracy perspective, maybe we should remove it from the docs until the smoothing feature is ready. What do you think? @wenhuach21


@yiliu30 yiliu30 left a comment


LGTM
