[Bug]: cannot quantize models larger than available VRAM using v0.10.0 unless low_gpu_mem_usage = True #1451

@davedgd

Description

Problem Description

With release v0.9.7, I was able to begin quantization of google/gemma-3-27b-it on a 48GB GPU (RTX 6000 Ada) even though the model does not fit entirely into VRAM. With v0.10.0, the exact same call fails while the model is being moved to CPU:

2026-02-12 15:00:32 INFO autoround.py L165: using MLLM mode for multimodal model.
Loading weights: 100%|████████████████| 1247/1247 [00:00<00:00, 2167.15it/s, Materializing param=model.vision_tower.vision_model.post_layernorm.weight]
2026-02-12 15:01:03 INFO base.py L486: using torch.bfloat16 for quantization tuning
2026-02-12 15:01:03 WARNING formats.py L154: some layers are skipped quantization (shape not divisible by 32).
2026-02-12 15:01:03 INFO base.py L1739: start to cache block inputs
Traceback (most recent call last):
  File "/home/daved/project/code/convert_autoround.py", line 43, in <module>
    ar.quantize_and_save(output_dir = quant_path, format = "auto_round")
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 949, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1740, in quantize
    all_inputs = self.try_cache_inter_data_gpucpu(all_first_block_names, self.nsamples, layer_names=layer_names)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/daved/miniforge3/envs/autoround/lib/python3.12/site-packages/auto_round/compressors/base.py", line 2209, in try_cache_inter_data_gpucpu
    new_max_memory[device] = max_memory[device] * 0.9
                             ~~~~~~~~~~^^^^^^^^
KeyError: 0
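
For reference, a minimal sketch of the kind of call that triggers this (roughly what convert_autoround.py does; the model loading, constructor arguments, and output path are assumptions rather than the actual script):

from auto_round import AutoRound

model_name = "google/gemma-3-27b-it"
quant_path = "./gemma-3-27b-it-W4A16"   # hypothetical output directory

# AutoRound detects the multimodal model and switches to MLLM mode (see log above)
ar = AutoRound(model_name, scheme="W4A16")
ar.quantize_and_save(output_dir=quant_path, format="auto_round")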

The error occurs here:

new_max_memory[device] = max_memory[device] * 0.9

The cause is a mismatch between the list of devices here:

devices = parse_available_devices(self.device_map)

and the max_memory produced here:

max_memory = get_max_memory()

The former includes device 0 for the GPU, whereas the latter only picks up the CPU, so there is no 0 key when the failing line is reached:

['cuda:0'] # devices
{'cpu': 143219032064} # max_memory
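
A small, self-contained illustration of the mismatch using the values above; the 'cuda:0' -> 0 normalization is an assumption about how the device string ends up as the integer key, and the guard shows one way the lookup could be made defensive:

# Values copied from the failing run
devices = ["cuda:0"]                    # parse_available_devices(self.device_map)
max_memory = {"cpu": 143219032064}      # get_max_memory() only reports the CPU here

new_max_memory = {}
for device in devices:
    key = int(device.split(":")[-1]) if device.startswith("cuda") else device
    # key == 0, but max_memory has no 0 entry, so an unguarded lookup raises KeyError: 0
    if key in max_memory:
        new_max_memory[key] = max_memory[key] * 0.9

print(new_max_memory)                   # {} -- the GPU budget is simply missing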

Manually setting low_gpu_mem_usage to True avoids the KeyError (sketch below), but this is suboptimal since it reduces performance. Note that patching max_memory to include the GPU (key 0) led to additional, different errors.
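
A minimal sketch of the workaround, assuming the Python constructor exposes the same low_gpu_mem_usage switch (argument names are assumptions):

from auto_round import AutoRound

# Same call as before, but with low_gpu_mem_usage enabled, which avoids the code
# path that raises the KeyError at the cost of slower tuning.
ar = AutoRound("google/gemma-3-27b-it", scheme="W4A16", low_gpu_mem_usage=True)
ar.quantize_and_save(output_dir="./gemma-3-27b-it-W4A16", format="auto_round")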

Reproduction Steps

  1. Run this command, where model is any model larger than available VRAM (e.g., google/gemma-3-27b-it on a 48GB GPU):
     auto-round --model model --scheme "W4A16" --format "auto_round"
  2. See the error above.

Environment Information

  • OS: Ubuntu 24.04
  • Python version: 3.12
  • AutoRound version: v0.10.0
  • Hardware: RTX 6000 Ada 48GB

Additional Context

No response
