Problem Description
With release v0.9.7, I was able to begin quantizing google/gemma-3-27b-it on a 48GB GPU (RTX 6000 Ada) even though the model does not fit entirely into VRAM. The exact same call in v0.10.0 now fails while the model is being moved to CPU:
2026-02-12 15:00:32 INFO autoround.py L165: using MLLM mode for multimodal model.
Loading weights: 100%|████████████████| 1247/1247 [00:00<00:00, 2167.15it/s, Materializing param=model.vision_tower.vision_model.post_layernorm.weight]
2026-02-12 15:01:03 INFO base.py L486: using torch.bfloat16 for quantization tuning
2026-02-12 15:01:03 WARNING formats.py L154: some layers are skipped quantization (shape not divisible by 32).
2026-02-12 15:01:03 INFO base.py L1739: start to cache block inputs
Traceback (most recent call last):
  File "/home/daved/project/code/convert_autoround.py", line 43, in <module>
    ar.quantize_and_save(output_dir = quant_path, format = "auto_round")
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 949, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1740, in quantize
    all_inputs = self.try_cache_inter_data_gpucpu(all_first_block_names, self.nsamples, layer_names=layer_names)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 2209, in try_cache_inter_data_gpucpu
    new_max_memory[device] = max_memory[device] * 0.9
                             ~~~~~~~~~~^^^^^^^^
KeyError: 0
The error occurs here (auto-round/auto_round/compressors/base.py, line 2190 in 81c8ee4):
new_max_memory[device] = max_memory[device] * 0.9
The cause is a mismatch between the list of devices here (line 2172 in 81c8ee4):
devices = parse_available_devices(self.device_map)
and the max memory produced here (line 2174 in 81c8ee4):
max_memory = get_max_memory()
The former includes device 0 for the GPU, whereas the latter only picks up the CPU, so there is no 0 key when the failing line runs:
['cuda:0'] # devices
{'cpu': 143219032064} # max_memory
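To make the failure concrete, here is a minimal sketch with the two values above hard-coded; the cuda:0 -> 0 index conversion is my assumption, inferred from the KeyError: 0 in the traceback rather than copied from AutoRound's source:

devices = ["cuda:0"]                # parse_available_devices(self.device_map)
max_memory = {"cpu": 143219032064}  # get_max_memory() in the failing run

new_max_memory = {}
for dev in devices:
    # judging by KeyError: 0, AutoRound keys max_memory on the bare CUDA index
    device = int(dev.split(":")[1]) if dev.startswith("cuda") else dev
    new_max_memory[device] = max_memory[device] * 0.9  # raises KeyError: 0

On a healthy setup, get_max_memory() would return a 0 key for the GPU alongside 'cpu', and the lookup would succeed.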
Manually setting low_gpu_mem_usage to True avoids this KeyError, but that is suboptimal since it reduces performance. Note that patching max_memory to include the GPU key (0) led to additional, different errors.
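Until this is fixed, the workaround through the Python API is sketched below; it assumes the string-name constructor and scheme keyword of recent releases, and the output directory name is arbitrary:

from auto_round import AutoRound

ar = AutoRound(
    "google/gemma-3-27b-it",
    scheme="W4A16",
    low_gpu_mem_usage=True,  # avoids the KeyError, at the cost of slower tuning
)
ar.quantize_and_save(output_dir="gemma-3-27b-it-w4a16", format="auto_round")

The CLI should expose the same switch as --low_gpu_mem_usage.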
Reproduction Steps
- Run this command:
auto-round \
  --model model \
  --scheme "W4A16" \
  --format "auto_round"
- Where model is any model larger than available VRAM (e.g., google/gemma-3-27b-it for a 48GB GPU)
- Run the command as shown; the KeyError above is raised (a Python-API equivalent follows below)
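For completeness, the same failure also reproduces through the Python API. The sketch below reconstructs the convert_autoround.py call from the traceback; only the quantize_and_save line is verbatim, the constructor arguments are assumed:

from auto_round import AutoRound

quant_path = "gemma-3-27b-it-w4a16"  # hypothetical output directory
ar = AutoRound("google/gemma-3-27b-it", scheme="W4A16")  # assumed construction
ar.quantize_and_save(output_dir=quant_path, format="auto_round")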
Environment Information
- OS: Ubuntu 24.04
- Python version: 3.12
- AutoRound version: v0.10.0
- Hardware: RTX 6000 Ada 48GB
Error Logs
See the full traceback in the Problem Description above.
Additional Context
No response