Kimi K2 NVFP4 ModelOpt Calibration Error #766

@pdasgup

Description

Describe the bug

Quantizing Kimi K2 Thinking to NVFP4 with the hf_ptq.py script fails: calibration runs to completion, but checkpoint export then aborts with an AssertionError ("Weight quantizer does not have attribute amax").

Steps/Code to reproduce bug

python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path /models/kimi-bf16/kimi-k2-thinking-bf16 --qformat nvfp4 --export_path /models/kimi-nvfp4-0106 --kv_cache_qformat none --calib_size 64 --trust_remote_code --dataset cnn_dailymail
...

example outputs after ptq: [' adult and a rich one at that. "I think I\'m going to have to be very, very careful about what I get up to in public," he said. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed. Associated Press contributed to this report. \n\n\n\nThe article is about Daniel Radcliffe\'s 18th birthday and his access to his £20 million fortune. The user asks: "What']
/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/unified_export_hf.py:609: UserWarning: Cannot export model to the model_config. The modelopt-optimized model state_dict can be saved with torch.save for further inspection.
  warnings.warn(
Traceback (most recent call last):
  File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1034, in <module>
    main(args)
  File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1013, in main
    quantize_main(
  File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 817, in quantize_main
    export_quantized(args, full_model, language_model, model_type, tokenizer, default_padding_side)
  File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 547, in export_quantized
    export_hf_checkpoint(
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 613, in export_hf_checkpoint
    raise e
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 582, in export_hf_checkpoint
    post_state_dict, hf_quant_config = _export_hf_checkpoint(model, dtype)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 522, in _export_hf_checkpoint
    _export_quantized_weight(sub_module, dtype)
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 300, in _export_quantized_weight
    quantizer_attrs.weight_scale, get_weight_scaling_factor(sub_module, weight_name)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/export/quant_utils.py", line 280, in get_weight_scaling_factor
    weight_scaling_factor_2 = NVFP4QTensor.get_weights_scaling_factor_2_from_quantizer(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
    assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Weight quantizer does not have attribute amax
########
GPU 0: Peak memory usage = 177.66 GB for all processes on the GPU
GPU 1: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 2: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 3: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 4: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 5: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 6: Peak memory usage = 165.08 GB for all processes on the GPU
GPU 7: Peak memory usage = 165.08 GB for all processes on the GPU
########
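
For anyone triaging: the assertion fires when a weight quantizer reaches export without a recorded amax. Below is a small diagnostic sketch of my own (not from the ModelOpt examples; it assumes the public modelopt.torch.quantization.nn.TensorQuantizer class) that lists the affected modules between quantization and export:

import torch
from modelopt.torch.quantization.nn import TensorQuantizer

def find_uncalibrated_weight_quantizers(model: torch.nn.Module) -> list[str]:
    """Return names of modules whose weight quantizer has no recorded amax."""
    missing = []
    for name, module in model.named_modules():
        quantizer = getattr(module, "weight_quantizer", None)
        if isinstance(quantizer, TensorQuantizer) and getattr(quantizer, "_amax", None) is None:
            missing.append(name)
    return missing

# Run after mtq.quantize(...) and before export_hf_checkpoint(...):
# for name in find_uncalibrated_weight_quantizers(model):
#     print("no amax:", name)

If only MoE expert projections show up in that list, it would suggest some experts were never routed to during the 64-sample calibration, rather than a bug in the exporter itself.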

To run quantization, I also replaced the original modeling file with https://huggingface.co/nvidia/Kimi-K2-Thinking-NVFP4/blob/main/modeling_deepseek.py.
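
One workaround I have not verified (an assumption on my part, not a confirmed fix): since the weight tensor is available statically, a per-tensor amax can be backfilled for any weight quantizer the calibration pass skipped, which is the value the failing export helper reads. Sketch:

import torch

def backfill_weight_amax(model: torch.nn.Module) -> None:
    """Set a per-tensor amax on weight quantizers that calibration skipped."""
    for _, module in model.named_modules():
        quantizer = getattr(module, "weight_quantizer", None)
        weight = getattr(module, "weight", None)
        if quantizer is None or weight is None:
            continue
        if getattr(quantizer, "_amax", None) is None:
            # NVFP4 derives its global scale (weights_scaling_factor_2) from
            # the weight amax, so the abs-max of the weight is the value a
            # static weight quantizer's calibration would have stored.
            quantizer._amax = weight.detach().abs().amax()

I have not validated a checkpoint produced this way, so treat it as a debugging direction rather than a fix.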

Expected behavior

  • Can quantize Kimi K2 Thinking to NVFP4.
  • Can serve the quantized Kimi K2 Thinking NVFP4 checkpoint with SGLang.

Who can help?

  • ?

System information

  • Container used (if applicable): ?
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
  • CPU architecture (x86_64, aarch64): ?
  • GPU name (e.g. H100, A100, L40S): ?
  • GPU memory size: ?
  • Number of GPUs: ?
  • Library versions (if applicable):
    • Python: ?
    • ModelOpt version or commit hash: ?
    • CUDA: ?
    • PyTorch: ?
    • Transformers: ?
    • TensorRT-LLM: ?
    • ONNXRuntime: ?
    • TensorRT: ?
  • Any other details that may help: ?
