Fix meta tensor error with bitsandbytes quantization and device_map #12799
Conversation
…e_map

Fixes huggingface#12719

When loading transformers models with both bitsandbytes quantization (via `quantization_config`) and `device_map` (especially 'balanced' for multi-GPU), the combination of `low_cpu_mem_usage=True` and `device_map` causes meta tensors to be used for memory-efficient loading. However, bitsandbytes quantization state objects (containing `code` and `absmax` tensors) cannot be materialized from the meta device, resulting in: `NotImplementedError: Cannot copy out of meta tensor; no data!`

This occurs because:
1. With `low_cpu_mem_usage=True` and `device_map`, transformers uses meta tensors as placeholders for lazy weight loading
2. During quantization, bitsandbytes creates quantization state with meta tensors
3. When accelerate's `AlignDevicesHook` tries to move parameters to target devices, it calls `quant_state.to(device)`
4. The quantization state's `code`/`absmax` tensors are still meta and cannot be copied/moved

The fix: Disable `low_cpu_mem_usage` when loading transformers models with bitsandbytes quantization (llm_int8, fp4, nf4) and `device_map`. This ensures tensors are materialized during loading, not kept as meta placeholders, allowing quantization state to be properly moved to target devices.

This allows users to successfully use quantization with `device_map` strategies like 'balanced' or 'auto' for multi-GPU inference without encountering meta tensor errors.
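For orientation, here is a minimal sketch of the guard this commit describes. The helper name is hypothetical and the body folds in the `== "bitsandbytes"` comparison suggested in the review below, so treat it as an illustration rather than the exact patch; the argument names mirror those visible in the diff excerpts.

```python
from typing import Any, Dict, Optional


def _maybe_disable_low_cpu_mem_usage(
    model_quant_config: Any,
    device_map: Optional[Any],
    loading_kwargs: Dict[str, Any],
) -> None:
    """Sketch only: force full weight materialization when bitsandbytes
    quantization is combined with a device_map (hypothetical helper)."""
    is_bnb = (
        model_quant_config is not None
        and hasattr(model_quant_config, "quant_method")
        and getattr(model_quant_config.quant_method, "value", model_quant_config.quant_method)
        == "bitsandbytes"
    )
    if is_bnb and device_map is not None:
        # Meta placeholders cannot back the bitsandbytes quant_state
        # (code/absmax) tensors, so materialize weights at load time.
        loading_kwargs["low_cpu_mem_usage"] = False
```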
Pull request overview
This PR attempts to fix a critical issue where using bitsandbytes quantization with `device_map` on transformers models results in a meta tensor error (`NotImplementedError: Cannot copy out of meta tensor; no data!`). The solution disables `low_cpu_mem_usage` when loading transformers models with bitsandbytes quantization and `device_map` to ensure tensors are materialized during loading.
Key Changes:
- Added logic to detect bitsandbytes quantization with device_map combinations
- Sets `low_cpu_mem_usage=False` for these cases to avoid meta tensor issues
- Added informative logging and documentation
```python
        and hasattr(model_quant_config, "quant_method")
    ):
        quant_method = getattr(model_quant_config.quant_method, "value", model_quant_config.quant_method)
        if quant_method in ["llm_int8", "fp4", "nf4"]:  # bitsandbytes quantization methods
```
Copilot AI · Dec 6, 2025
The logic for detecting bitsandbytes quantization is incorrect. The `quant_method` attribute is set to `QuantizationMethod.BITS_AND_BYTES` (which has the value `"bitsandbytes"`), not to the specific quantization method strings `["llm_int8", "fp4", "nf4"]`.
This condition will always be False, meaning the workaround will never be applied.
The fix should check if `quant_method` equals `"bitsandbytes"` (or `QuantizationMethod.BITS_AND_BYTES`):

```python
quant_method = getattr(model_quant_config.quant_method, "value", model_quant_config.quant_method)
if quant_method == "bitsandbytes":  # or quant_method == QuantizationMethod.BITS_AND_BYTES
```

Alternatively, if you want to check the specific quantization type, you should call the `quantization_method()` method instead:

```python
if hasattr(model_quant_config, "quantization_method"):
    quant_method = model_quant_config.quantization_method()
    if quant_method in ["llm_int8", "fp4", "nf4"]:
```

Reference: The `quant_method` attribute is defined in `BitsAndBytesConfig.__init__` at line 248 of `quantization_config.py` as `self.quant_method = QuantizationMethod.BITS_AND_BYTES`. The specific method names are returned by the `quantization_method()` method (lines 365-377).
Suggested change:

```diff
-        if quant_method in ["llm_int8", "fp4", "nf4"]:  # bitsandbytes quantization methods
+        if quant_method == "bitsandbytes":  # bitsandbytes quantization
```
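For context, a short sketch contrasting the two accessors, assuming the `BitsAndBytesConfig` from the `quantization_config.py` cited above (the printed values follow the definitions referenced in the comment):

```python
from diffusers import BitsAndBytesConfig  # the config cited in the comment above

cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# The attribute names the backend, not the concrete scheme.
print(cfg.quant_method.value)     # "bitsandbytes"

# The method resolves the concrete scheme from the config flags.
print(cfg.quantization_method())  # "nf4"
```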
```python
        )
        if model_quant_config is not None:
            loading_kwargs["quantization_config"] = model_quant_config

```
Copilot AI · Dec 6, 2025
Trailing whitespace detected. Please remove the trailing spaces on this line.
What does this PR do?
Fixes #12719
This PR fixes a critical issue where using bitsandbytes quantization with `device_map='balanced'` (or other `device_map` strategies) on transformers models within diffusers pipelines results in a meta tensor error: `NotImplementedError: Cannot copy out of meta tensor; no data!`

Root Cause

When loading transformers models with both:
- `quantization_config` (bitsandbytes 4-bit/8-bit)
- `device_map` (especially 'balanced' for multi-GPU)

the combination of `low_cpu_mem_usage=True` (default) and `device_map` causes transformers to use meta tensors for memory-efficient loading. However, bitsandbytes quantization state objects cannot be materialized from the meta device.

The error occurs because:
1. With `low_cpu_mem_usage=True` and `device_map`, transformers uses meta tensors as placeholders for lazy weight loading
2. During quantization, bitsandbytes creates quantization state with meta tensors
3. Accelerate's `AlignDevicesHook` tries to move parameters to target devices via `quant_state.to(device)`
4. The quantization state's `code`/`absmax` tensors are still meta and cannot be copied or moved
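To see the failure mode in isolation, here is a tiny self-contained sketch of the underlying PyTorch behavior (not code from this PR):

```python
import torch

# A meta tensor records shape and dtype but owns no storage.
t = torch.empty(4, device="meta")

# There is no data to copy out, which is exactly what the quant_state's
# code/absmax tensors hit when AlignDevicesHook moves them:
t.to("cpu")  # NotImplementedError: Cannot copy out of meta tensor; no data!
```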
Solution

Disable `low_cpu_mem_usage` when loading transformers models with bitsandbytes quantization (`llm_int8`, `fp4`, `nf4`) and `device_map`. This ensures tensors are materialized during loading rather than kept as meta placeholders, allowing quantization state to be properly moved to target devices.

Changes
- Updated `_load_sub_model` in `pipeline_loading_utils.py` to detect bitsandbytes quantization + `device_map` combinations
- Sets `low_cpu_mem_usage=False` for these cases

Testing
This fix allows the exact code from issue #12719 to work correctly:
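The exact snippet was not captured in this page; the following is a hypothetical stand-in assuming the `PipelineQuantizationConfig` API and a FLUX model. The model id, quantized components, and kwargs are illustrative assumptions, not the precise code from issue #12719:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Hypothetical reproduction: names below are illustrative assumptions.
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

# Previously failed with "Cannot copy out of meta tensor; no data!"
# when combining quantization with a multi-GPU device_map.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
```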
Impact
cc @yiyixuxu @DN6