Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
Enable loading FP8 models on Habana HPU by faking CUDA capability checks during from_pretrained, and add an HPU-focused FP8 quantization test.
Changes:
- Add a context manager to temporarily report CUDA availability on HPU and override CUDA device capability checks.
- Wrap model loading with these context managers when HPEx is available to support FP8 model load on HPU.
- Add an HPU test validating FP8 quantization output weights and basic numerics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test/test_hpu/test_quant_fp8.py | New HPU test that quantizes small models to FP8 and validates dtype + NaN/Inf. |
| auto_round/utils/model.py | Wraps from_pretrained with HPU/CUDA-override contexts to enable FP8 loading on HPU. |
| auto_round/utils/device.py | Introduces fake_cuda_for_hpu context manager to temporarily force torch.cuda.is_available() true on HPU. |
| auto_round/compressors/base.py | Removes HPU-specific exclusion of FP8 layers to allow FP8 on HPU. |
def test_small_model_rtn_generation(self, model_name):
    ar = AutoRound(model_name, iters=0, scheme="FP8_STATIC", nsamples=16)
    model, folder = ar.quantize_and_save(output_dir=self.save_dir, format="llm_compressor")
This test will likely fail in environments without HPU/HPEx because it unconditionally runs and attempts an FP8/HPU-specific flow. Add a skipif (or importorskip) guard so the test only runs when the HPU runtime is available (e.g., based on is_hpex_available() / HPU availability).
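A minimal sketch of such a guard, assuming `is_hpex_available` can be imported from `auto_round.utils` (the exact import path is an assumption):

```python
import pytest

# Assumed import path for the availability helper; adjust to where it actually lives.
from auto_round.utils import is_hpex_available

# Skip every test in this module when the HPU/HPEx runtime is not present.
pytestmark = pytest.mark.skipif(not is_hpex_available(), reason="requires HPU/HPEx runtime")
```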
@@ -0,0 +1,35 @@
import os
os is imported but not used in this new test file. Please remove it to keep the test minimal and avoid lint warnings.
Suggested change:
- import os
    trust_remote_code=trust_remote_code,
    device_map="auto" if use_auto_mapping else None,
)
if is_hpex_available():
The CUDA-faking/capability-override is applied whenever HPEx is available, regardless of the selected device_str. This can unintentionally alter load-time behavior for non-HPU runs on machines that have HPEx installed. Consider additionally gating this block on device_str (e.g., only apply when loading for HPU) so other device paths aren’t affected.
Suggested change:
- if is_hpex_available():
+ if is_hpex_available() and device_str is not None and "hpu" in device_str:
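A hedged sketch of how that gate could wrap the load call so non-HPU paths are untouched; `fake_cuda_for_hpu` is the helper this PR adds in auto_round/utils/device.py, while the loader function, its arguments, and the `is_hpex_available` import path below are placeholders/assumptions:

```python
import contextlib

from transformers import AutoModelForCausalLM

from auto_round.utils import is_hpex_available  # assumed import path
from auto_round.utils.device import fake_cuda_for_hpu  # helper added in this PR


def load_model(model_name, device_str, trust_remote_code=False, use_auto_mapping=False):
    """Hypothetical loader illustrating the HPU-only gating."""
    with contextlib.ExitStack() as stack:
        # Only fake CUDA availability when HPEx is present and the target device is HPU.
        if is_hpex_available() and device_str is not None and "hpu" in device_str:
            stack.enter_context(fake_cuda_for_hpu())
        return AutoModelForCausalLM.from_pretrained(
            model_name,
            trust_remote_code=trust_remote_code,
            device_map="auto" if use_auto_mapping else None,
        )
```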
@@ -339,6 +339,25 @@ def __exit__(self, exc_type, exc, exc_tb):
        return False
This is a class but is named like a function (lower_snake_case). For clarity and consistency, consider either renaming it to a CapWords class name (e.g., FakeCudaForHpu) or converting it into a @contextmanager function named fake_cuda_for_hpu.
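A minimal sketch of the `@contextmanager` variant, keeping the same save-and-restore behavior as the existing class; the `is_hpex_available` import path is an assumption:

```python
import contextlib

import torch

from auto_round.utils import is_hpex_available  # assumed import path


@contextlib.contextmanager
def fake_cuda_for_hpu():
    """Temporarily report CUDA as available so FP8 checkpoints can load on HPU."""
    if not is_hpex_available():
        # No-op outside HPU environments.
        yield
        return
    orig_is_available = torch.cuda.is_available
    torch.cuda.is_available = lambda: True
    try:
        yield
    finally:
        # Always restore the original function, even if loading raises.
        torch.cuda.is_available = orig_is_available
```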
if is_hpex_available():
    self._orig_is_available = torch.cuda.is_available
This mutates a global function (torch.cuda.is_available) process-wide, which can cause surprising behavior if other threads/tasks call CUDA checks while this context is active. If possible, prefer a safer patching approach (e.g., unittest.mock.patch scoped to the smallest block) and keep the patched window as short as possible.
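One way to keep the patched window small is unittest.mock.patch.object around just the load call; this sketch uses a placeholder model id and still patches a process-wide attribute, but mock guarantees restoration on exit:

```python
from unittest import mock

import torch
from transformers import AutoModelForCausalLM

# Patch only around the from_pretrained call; mock restores torch.cuda.is_available
# on exit even if loading raises. The model id below is a placeholder.
with mock.patch.object(torch.cuda, "is_available", return_value=True):
    model = AutoModelForCausalLM.from_pretrained("some-org/some-fp8-model")
```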
Signed-off-by: yiliu30 <yi4.liu@intel.com>
for more information, see https://pre-commit.ci
Signed-off-by: yiliu30 <yi4.liu@intel.com>
for more information, see https://pre-commit.ci
Description
Please briefly describe your main changes and the motivation behind them.
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting