
Conversation

@jingyu-ml
Contributor

@jingyu-ml jingyu-ml commented Jan 14, 2026

What does this PR do?

Type of change: New feature

Overview:

This PR adds support for exporting quantized diffusers models (DiT, Flux, SD3, UNet, etc.) to HuggingFace checkpoint format, enabling deployment to inference frameworks like SGLang, vLLM, and TensorRT-LLM.

Changes

New file: diffusers_utils.py

  • Dummy input generation for various diffusion models
  • Pipeline component extraction helpers
  • QKV projection detection and grouping
  • hide_quantizers_from_state_dict() context manager for clean saves (see the sketch after this list)
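Not the PR's exact implementation, but a minimal sketch of how such a context manager can temporarily detach quantizer submodules so that save_pretrained() writes a clean, quantizer-free state dict. The *_quantizer attribute names are assumptions based on modelopt's TensorQuantizer convention:

from contextlib import contextmanager

import torch.nn as nn

@contextmanager
def hide_quantizers_from_state_dict(model: nn.Module):
    """Temporarily detach *_quantizer submodules so saves omit them."""
    hidden: list[tuple[nn.Module, str, nn.Module]] = []
    # Snapshot the module list first: delattr mutates _modules during traversal.
    for module in list(model.modules()):
        for name in ("input_quantizer", "weight_quantizer", "output_quantizer"):
            child = getattr(module, name, None)
            if isinstance(child, nn.Module):
                hidden.append((module, name, child))
                delattr(module, name)
    try:
        yield model
    finally:
        # Restore quantizers even if the save raised.
        for module, name, child in hidden:
            setattr(module, name, child)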

Refactored: unified_export_hf.py

  • New _fuse_qkv_linears_diffusion() for QKV amax fusion (see the sketch after this list)
  • _export_diffusers_checkpoint() to export full pipelines (models, tokenizers, schedulers, etc.)
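The core of QKV amax fusion, sketched under the assumption that each projection is a modelopt quantized linear whose input_quantizer carries a calibrated amax buffer. The helper name below is illustrative; the PR's _fuse_qkv_linears_diffusion additionally discovers and groups the projections per attention block:

import torch.nn as nn

def fuse_qkv_input_amax(q_proj: nn.Module, k_proj: nn.Module, v_proj: nn.Module) -> None:
    """Unify the activation range across one Q/K/V group (their inputs are identical)."""
    quantizers = [
        m.input_quantizer for m in (q_proj, k_proj, v_proj) if hasattr(m, "input_quantizer")
    ]
    amaxes = [qz.amax for qz in quantizers if getattr(qz, "amax", None) is not None]
    if not amaxes:
        return  # calibration never populated amax; nothing to fuse
    fused = max(float(a.max()) for a in amaxes)  # shared worst-case range
    for qz in quantizers:
        if qz.amax is not None:
            qz.amax.fill_(fused)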

Plans

  • [1/3] Add the basic functionality to support a limited set of image models with NVFP4 + FP8, with some refactoring of the previous LLM export code and the diffusers example. PIC: @jingyu-ml
  • [2/3] Add support for more video-generation models, plus export support for SVDQuant. PIC: @jingyu-ml
  • [3/3] Add test cases and refactor the docs and all related READMEs. PIC: @jingyu-ml

Usage

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

mtq.quantize(pipe, quant_config, forward_call)
export_hf_checkpoint(pipe, export_dir=hf_ckpt_dir)
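Here, forward_call is the calibration loop that mtq.quantize() invokes; a minimal illustrative version (the prompts and step count are arbitrary assumptions, not from this PR):

def forward_call(model):
    # Run a few representative prompts so the quantizers calibrate their amax values.
    for prompt in ["a photo of an astronaut riding a horse", "a watercolor landscape"]:
        pipe(prompt, num_inference_steps=4)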

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

New Features

  • Added HuggingFace checkpoint export support for quantized diffusion models with configurable output directory
  • Introduced new --hf-ckpt-dir CLI argument for specifying checkpoint export destination
  • Extended export functionality to support selective component exports from diffusion pipelines
  • Enhanced quantized model export with improved component handling and multi-stage checkpoint generation


Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml requested review from a team as code owners January 14, 2026 03:56
@jingyu-ml jingyu-ml marked this pull request as draft January 14, 2026 03:56
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


This PR adds HuggingFace checkpoint export support for quantized diffusion models. It introduces a new CLI option and export configuration field for specifying a checkpoint directory, then extends the unified export module to route and handle diffusion pipeline exports with quantizer management and QKV fusion.

Changes

Cohort / File(s) Summary
Quantization Script Enhancement
examples/diffusers/quantization/quantize.py
Adds an hf_ckpt_dir field to ExportConfig, a new export_hf_ckpt() method on ExportManager, a --hf-ckpt-dir CLI argument for the export directory, and wiring to trigger HF checkpoint export after ONNX export and at the end of the main export flow.
Unified Export Module Extension
modelopt/torch/export/unified_export_hf.py
Introduces context manager and helper functions for quantizer handling during export (_hide_quantizers_from_state_dict, _process_quantized_modules). Adds diffusion model support via _export_diffusers_checkpoint with per-component export, model_index.json creation, and non-nn.Module component handling. Implements QKV fusion utilities (_is_qkv_projection, _get_qkv_group_key, _fuse_qkv_linears_diffusion). Adds diffusion-specific helpers (_generate_diffusion_dummy_inputs, _get_diffusers_components, _has_quantized_modules, _infer_dtype_from_model). Updates export_hf_checkpoint signature to accept components parameter and route DiffusionPipeline models to diffusion-specific export path.
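Based on that walkthrough, the routing at the public entry point plausibly looks like the following outline (a sketch, not the PR's code: _export_transformers_checkpoint is a hypothetical name for the pre-existing LLM path, and diffusers is treated as an optional dependency):

def export_hf_checkpoint(model, dtype=None, export_dir="/tmp", components=None):
    """Route diffusers pipelines and standalone diffusers models to the diffusion path."""
    try:
        from diffusers import DiffusionPipeline, ModelMixin

        if isinstance(model, (DiffusionPipeline, ModelMixin)):
            _export_diffusers_checkpoint(model, export_dir, components)
            return
    except ImportError:
        pass  # diffusers not installed: fall through to the transformers path
    _export_transformers_checkpoint(model, dtype, export_dir)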

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Config as ExportConfig
    participant Manager as ExportManager
    participant Export as export_hf_ckpt()
    participant Router as Route Logic
    participant DiffusionExp as _export_diffusers_checkpoint()
    participant Components as Component Handler

    User->>Config: Pass --hf-ckpt-dir
    Config->>Manager: Create with hf_ckpt_dir set
    Manager->>Export: Call export_hf_ckpt(pipe)
    Export->>Router: Detect model type
    Router->>DiffusionExp: Route DiffusionPipeline
    DiffusionExp->>Components: Extract & process components
    Components->>Components: Hide quantizers
    Components->>Components: Fuse QKV linears
    Components->>Components: Save per-component subdirs
    DiffusionExp->>DiffusionExp: Save model_index.json
    DiffusionExp->>Export: Export complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title '[1/3] Diffusion ckpt export for NVFP4 & FP8' directly reflects the main change: adding diffusion checkpoint export support for quantized models.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 90.48%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@copy-pr-bot

copy-pr-bot bot commented Jan 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@jingyu-ml jingyu-ml requested a review from Edwardf0t1 January 14, 2026 03:59
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `modelopt/torch/export/unified_export_hf.py`:
- Around line 1055-1057: The _get_diffusers_components currently raises for
anything not a DiffusionPipeline but _export_diffusers_checkpoint accepts
DiffusionPipeline | ModelMixin; update _get_diffusers_components to also accept
instances of ModelMixin (e.g., a standalone UNet) by detecting isinstance(model,
ModelMixin) and returning a components mapping consistent with what
_export_diffusers_checkpoint expects (for example {'unet': model} or the
appropriate single-component keys used downstream); ensure the DiffusionPipeline
branch behavior is unchanged and that callers handle the returned mapping
uniformly.
- Around line 452-463: The loop over model.named_modules() sets
fsdp_module_to_reshard for each FSDPModule but never reshards the last one
after the loop, leaving it unsharded; after the loop completes add a final check
and call to reshard on fsdp_module_to_reshard (i.e., if fsdp_module_to_reshard
is not None: fsdp_module_to_reshard.reshard()) so the last FSDPModule is
properly resharded; locate symbols model.named_modules, FSDPModule,
fsdp_module_to_reshard, and reshard() to apply the fix.
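A sketch of the suggested fix, assuming FSDP2's FSDPModule with unshard()/reshard(); the per-module weight processing is elided:

from torch.distributed.fsdp import FSDPModule  # FSDP2 (recent PyTorch)

import torch.nn as nn

def _process_with_resharding(model: nn.Module) -> None:
    fsdp_module_to_reshard = None
    for _, module in model.named_modules():
        if isinstance(module, FSDPModule):
            if fsdp_module_to_reshard is not None:
                fsdp_module_to_reshard.reshard()  # reshard the previous module
            module.unshard()
            # ... process this module's now-gathered weights ...
            fsdp_module_to_reshard = module
    # The fix: without this, the last FSDPModule is left unsharded.
    if fsdp_module_to_reshard is not None:
        fsdp_module_to_reshard.reshard()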
🧹 Nitpick comments (5)
modelopt/torch/export/unified_export_hf.py (5)

82-83: Move import to top of file with other imports.

The contextmanager import should be grouped with other imports at the top of the file (around lines 18-27) rather than inserted mid-file.

Suggested fix

Move to the imports section at the top:

from contextlib import contextmanager

954-955: Consider using logging instead of print statements.

The function uses print() for debug output, which is inconsistent with the rest of the codebase, which uses warnings.warn() (or could use a logger). This also pollutes production output.

Suggested approach

Replace print() calls with warnings.warn() for warning-level messages, or consider adding an optional logger parameter:

-            print("No quantized linear modules found for QKV fusion.")
+            warnings.warn("No quantized linear modules found for QKV fusion.")
...
-                print(f"Warning: Unknown model type '{model_class_name}', skipping QKV fusion.")
+                warnings.warn(f"Unknown model type '{model_class_name}', skipping QKV fusion.")
...
-        print(f"Warning: Failed to run dummy forward for QKV fusion: {e}")
-        print("Skipping QKV fusion. Quantization may still work but amax values won't be unified.")
+        warnings.warn(f"Failed to run dummy forward for QKV fusion: {e}. Skipping QKV fusion.")

Also applies to: 970-970, 979-980, 1014-1015, 1017-1020


1113-1125: Minor: Step numbering is inconsistent - "Step 2" is missing.

The comments jump from "Step 1" (line 1113) to "Step 3" (line 1125). Consider renumbering for clarity.


1129-1129: Consider using warnings.warn() or a logger instead of print() statements.

Multiple print() calls throughout this function for status messages. For consistency with the rest of the codebase and to allow users to control output, consider using warnings.warn() or passing in a logger.

Also applies to: 1147-1148, 1178-1178, 1190-1190, 1206-1206, 1229-1229


23-23: Unnecessary import of builtin ValueError.

ValueError is a Python builtin and doesn't need to be imported.

Suggested fix
-from builtins import ValueError
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18d9b1e and a33cf13.

📒 Files selected for processing (2)
  • examples/diffusers/quantization/quantize.py
  • modelopt/torch/export/unified_export_hf.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/diffusers/quantization/quantize.py (1)
modelopt/torch/export/unified_export_hf.py (1)
  • export_hf_checkpoint (1232-1303)
modelopt/torch/export/unified_export_hf.py (6)
modelopt/torch/export/layer_utils.py (1)
  • is_quantlinear (346-348)
modelopt/torch/export/quant_utils.py (1)
  • get_quantization_format (432-533)
modelopt/torch/quantization/utils.py (1)
  • fsdp2_aware_weight_update (689-797)
modelopt/torch/quantization/conversion.py (1)
  • set_quantizer_by_cfg_context (305-327)
modelopt/torch/export/convert_hf_config.py (1)
  • convert_hf_quant_config_format (21-117)
modelopt/torch/export/plugins/hf_spec_export.py (2)
  • spec_opt_only (107-112)
  • export_spec_ckpt_state_dict (115-153)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (13)
modelopt/torch/export/unified_export_hf.py (7)

85-119: LGTM - Context manager for hiding quantizers during save.

The implementation correctly backs up and restores quantizer attributes. Using delattr/setattr with a try/finally ensures quantizers are restored even if an exception occurs during save.


617-816: LGTM - Comprehensive dummy input generation for diffusion models.

Good implementation with defensive coding:

  • Uses isinstance checks with fallback to string matching when imports fail
  • Provides sensible defaults via getattr for missing config attributes
  • Returns None for unsupported models, which the caller handles gracefully

818-869: LGTM - QKV projection identification logic.

The pattern matching is comprehensive, covering common diffusers naming conventions (to_q, to_k, to_v, etc.) and correctly handles nested module paths.
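For illustration, name-based detection along the lines the review describes might look like this (the suffix list is an assumption drawn from common diffusers attention naming, not the PR's exact patterns):

_QKV_SUFFIXES = frozenset(
    {"to_q", "to_k", "to_v", "add_q_proj", "add_k_proj", "add_v_proj"}
)

def _is_qkv_projection(module_name: str) -> bool:
    # Match on the final path segment so nested names such as
    # "transformer_blocks.0.attn.to_q" are detected correctly.
    return module_name.rsplit(".", 1)[-1] in _QKV_SUFFIXES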


871-909: LGTM - QKV grouping logic.

Correctly groups QKV projections by parent attention block and distinguishes between main and added (cross-attention) QKV types.


1059-1072: LGTM - Simple quantization check.

Clean implementation using generator expression with any() for early termination.
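One plausible shape for that check, using is_quantlinear from layer_utils.py as referenced in the code graph above:

from modelopt.torch.export.layer_utils import is_quantlinear

def _has_quantized_modules(model) -> bool:
    # any() short-circuits on the first quantized linear it encounters.
    return any(is_quantlinear(m) for m in model.modules())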


1074-1086: LGTM - dtype inference with sensible default.

Returns the dtype of the first parameter found, with a reasonable float16 fallback for edge cases (e.g., models with no parameters).
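A sketch of that behavior:

import torch
import torch.nn as nn

def _infer_dtype_from_model(model: nn.Module) -> torch.dtype:
    # dtype of the first parameter; float16 is the fallback for
    # parameter-less edge cases.
    for param in model.parameters():
        return param.dtype
    return torch.float16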


1232-1261: LGTM - Clean routing between diffusers and transformers export.

The updated public API correctly routes to the appropriate export function based on model type. The components parameter documentation clearly states it's only for diffusers pipelines.

examples/diffusers/quantization/quantize.py (6)

69-69: LGTM - Import follows existing pattern.

The import is correctly placed with other modelopt imports.


352-352: LGTM - ExportConfig extension follows existing patterns.

The hf_ckpt_dir field and its validation mirror the existing onnx_dir handling.

Also applies to: 368-370


870-883: LGTM - Method follows existing ExportManager patterns.

Clean implementation that mirrors other export methods like save_checkpoint and export_onnx.


1016-1020: LGTM - CLI argument follows existing conventions.

The --hf-ckpt-dir argument is consistent with the existing --onnx-dir pattern.


1097-1097: LGTM - Config construction follows existing pattern.

Correctly handles the optional hf_ckpt_dir argument.


1153-1155: LGTM - HF checkpoint export integrated at appropriate point in flow.

Placed after ONNX export, following the logical export sequence.


@codecov

codecov bot commented Jan 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.23%. Comparing base (e6e4efd) to head (302e2f4).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #781   +/-   ##
=======================================
  Coverage   74.23%   74.23%           
=======================================
  Files         192      192           
  Lines       19033    19033           
=======================================
  Hits        14129    14129           
  Misses       4904     4904           


Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml self-assigned this Jan 14, 2026
@jingyu-ml jingyu-ml marked this pull request as ready for review January 14, 2026 06:02
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
jingyu-ml added a commit that referenced this pull request Jan 15, 2026
See #781

This MR includes only the refactoring of the LLM export; please ignore
the change to quantize.py in the diffusion example.


## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added `--hf-ckpt-dir` CLI option to save checkpoints in HuggingFace
format
  * Enabled support for exporting Diffusers-based pipelines
* Unified export system now handles both transformer and diffusion model
architectures



---------

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>