feat(quant): Quantizer class with BaseQuantPass pipeline (#964) #985
Merged
DingmaomaoBJTU merged 8 commits on Jun 29, 2026
5a4da8c to 519b562
xieofxie reviewed Jun 26, 2026
xieofxie reviewed Jun 26, 2026
xieofxie reviewed Jun 26, 2026
- Add passes/ sub-package with BaseQuantPass ABC
- Implement FP16Pass, RTNPass, QDQPass; each accepts WinMLQuantizationConfig and reads only the fields relevant to that pass
- Add Quantizer class: chains passes sequentially, uses tempfile for intermediates, merges QuantizeResult stats across passes
- Add expand_precision(mode, config) to map precision strings to pass lists (supports 'fp16', 'rtn', 'static', 'dynamic', 'w4a16')
- Keep quantize_onnx() as backward-compatible entry point
- Add tests/unit/test_quant_passes.py (19 tests, all passing)
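The pass pipeline described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `QuantizeResult` fields, the `TagPass` helper, and the plain `dict` config are all assumptions made so the example runs on dummy files without ONNX.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class QuantizeResult:
    # Simplified stand-in; the real QuantizeResult fields are not shown in this PR.
    success: bool
    output_path: Path | None = None
    stats: dict[str, int] = field(default_factory=dict)
    warnings: list[str] = field(default_factory=list)


class BaseQuantPass(ABC):
    def __init__(self, config: dict) -> None:
        self.config = config  # the real code passes a WinMLQuantizationConfig

    @abstractmethod
    def run(self, model_path: Path, output_path: Path) -> QuantizeResult:
        """Read model_path and write the transformed model to output_path."""


class TagPass(BaseQuantPass):
    """Hypothetical pass: copies the file and appends a tag (no real quantization)."""

    def __init__(self, config: dict, tag: str) -> None:
        super().__init__(config)
        self.tag = tag

    def run(self, model_path: Path, output_path: Path) -> QuantizeResult:
        data = Path(model_path).read_bytes() + self.tag.encode()
        Path(output_path).write_bytes(data)
        return QuantizeResult(True, Path(output_path), stats={self.tag: 1})


class Quantizer:
    def __init__(self, passes: list[BaseQuantPass]) -> None:
        if not passes:
            raise ValueError("at least one pass is required")
        self.passes = list(passes)

    def run(self, model_path: Path, output_path: Path) -> QuantizeResult:
        model_path, output_path = Path(model_path), Path(output_path)
        if len(self.passes) == 1:
            # Single pass: write straight to the final output.
            return self.passes[0].run(model_path, output_path)
        merged = QuantizeResult(True, output_path)
        with tempfile.TemporaryDirectory() as tmp:
            current = model_path
            for i, quant_pass in enumerate(self.passes):
                is_last = i == len(self.passes) - 1
                target = output_path if is_last else Path(tmp) / f"pass_{i}.onnx"
                result = quant_pass.run(current, target)
                merged.stats.update(result.stats)
                merged.warnings.extend(result.warnings)
                if not result.success:
                    # Abort the chain on the first failing pass.
                    merged.success = False
                    merged.output_path = None
                    return merged
                current = target
        return merged
```

Only the last pass writes to the caller's output path; every earlier pass writes into the temporary directory, which is cleaned up when the `with` block exits.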
timenick approved these changes Jun 29, 2026
- QDQPass.run(): forward use_external_data to final save_onnx call
- WinMLQuantizationConfig: add 'w4a16' to mode Literal; to_dict() now serialises rtn_* and fp16_* fields when mode is 'w4a16'
- quantize_onnx(): raise TypeError on unrecognised kwargs instead of silently discarding them
- Tests: add TestW4a16Config (3 cases) and TestQuantizeOnnxKwargsGuard (1 case)
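The kwargs guard mentioned above might look something like this. The signature and the `_KNOWN_KWARGS` set are hypothetical; the PR does not show which keywords `quantize_onnx()` actually recognises.

```python
# Hypothetical set of keywords the entry point still understands.
_KNOWN_KWARGS = {"mode", "symmetric"}


def quantize_onnx(model_path, output_path=None, config=None, **kwargs):
    """Sketch of the guard only; the real function runs the Quantizer pipeline."""
    unknown = sorted(set(kwargs) - _KNOWN_KWARGS)
    if unknown:
        # Fail loudly instead of silently dropping a misspelled option.
        raise TypeError(
            f"quantize_onnx() got unexpected keyword argument(s): {', '.join(unknown)}"
        )
    # Placeholder return so the sketch is observable without a real model.
    return {"model_path": model_path, "output_path": output_path, **kwargs}
```

Raising `TypeError` mirrors what Python itself does for an unexpected keyword argument, so callers see a familiar failure mode.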
- Add TYPE_CHECKING import block for Quantizer, expand_precision, and quantize_onnx so mypy resolves their types instead of falling back to Any? (fixes 'Any? not callable [misc]' in hf.py and onnx.py)
- Same TYPE_CHECKING imports satisfy CodeQL's 'Explicit export is not defined' alerts for those names in __all__
- Remove trailing ... after docstring in BaseQuantPass.run() to fix CodeQL 'Statement has no effect' alert
… onto main
w4a16 is a composite pipeline concept, not a single-pass quantization mode. Removing it from the mode Literal keeps config.py focused on atomic pass modes (static, dynamic, rtn, fp16). Multi-pass pipelines are expressed through Quantizer + expand_precision at a higher level.
Changes:
- config.py: revert mode Literal to [static, dynamic, rtn, fp16], revert to_dict() guards back to equality checks, remove w4a16 docstring example
- quantizer.py: remove w4a16 from _COMPOSITE_PRECISIONS and docstrings
- __init__.py: update module docstring example
- commands/build.py: fix stale 'single-pass' comment
- tests: remove TestW4a16Config and test_w4a16_returns_rtn_then_fp16
… pipeline
Rename:
- passes/qdq.py → passes/static.py; QDQPass → StaticPass throughout
- Update all imports, __all__, quantizer.py pass_factories, and tests
Multi-precision --precision:
- precision_option() gains multiple=True support
- quantize command accepts repeated --precision flags; len > 1 routes to
_run_multi_precision() which chains expand_precision() calls into a
single Quantizer pipeline
- Default output path for multi-pass: {stem}_{p1}_{p2}.onnx
- Calibration-unused warning emitted when no static pass is in pipeline
E2E tests (TestMultiPrecision):
- test_int4_then_fp16_pipeline: verifies MatMulNBits nodes (RTN) and
FLOAT16 initializers (FP16 pass) are both present in output
- test_pipeline_default_output_path: verifies auto-named output file
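The default output naming for a multi-pass run can be illustrated with a small helper. The function name is hypothetical, and extending the `{stem}_{p1}_{p2}.onnx` pattern to any number of precisions is an assumption; the commit only describes the two-precision case.

```python
from pathlib import Path


def default_pipeline_output(model_path, precisions):
    """Build {stem}_{p1}_{p2}.onnx next to the input model (hypothetical helper)."""
    path = Path(model_path)
    suffix = "_".join(precisions)
    return path.with_name(f"{path.stem}_{suffix}.onnx")
```

For example, `default_pipeline_output("models/bert.onnx", ["int4", "fp16"])` yields a path named `bert_int4_fp16.onnx` in the same directory as the input.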
- static.py: fix model_name -> model_id (WinMLQuantizationConfig has no model_name field; correct field is model_id); fixes mypy [attr-defined]
- cli.py: widen precision_option default type to str | tuple[str,...] | None so passing default=() for multiple=True passes mypy; fixes [arg-type]
- quantizer.py: make expand_precision mode optional, falling back to config.mode when not provided; removes redundant arg from quantize_onnx caller (addresses reviewer: 'why have mode param when config has it')
- quantize.py: remove redundant 'from typing import cast' inside _run_multi_precision (cast already imported at module level)
- passes/base.py: add note in run() docstring explaining why file-based I/O is used (addresses reviewer suggestion about in-memory model proto)
- tests: add test_no_mode_uses_config_mode to cover expand_precision(config=) path
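A rough sketch of the optional-mode fallback in expand_precision, under several assumptions: `Config` stands in for WinMLQuantizationConfig, the pass classes are empty placeholders, and raising `ValueError` for a missing config or an unknown mode is a guess at the behaviour rather than something the PR states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    """Stand-in for WinMLQuantizationConfig; only `mode` matters here."""
    mode: str = "static"


class _Pass:
    def __init__(self, config: Config) -> None:
        self.config = config


class FP16Pass(_Pass):
    pass


class RTNPass(_Pass):
    pass


class StaticPass(_Pass):
    pass


# Mode string -> pass class, following the mapping listed in the PR summary.
_PASS_FOR_MODE = {
    "fp16": FP16Pass,
    "rtn": RTNPass,
    "static": StaticPass,
    "dynamic": StaticPass,
}


def expand_precision(mode: Optional[str] = None, config: Optional[Config] = None):
    if config is None:
        raise ValueError("config is required")  # assumed behaviour
    # Fall back to config.mode only when the caller gave no explicit mode.
    effective_mode = mode if mode is not None else config.mode
    try:
        pass_cls = _PASS_FOR_MODE[effective_mode]
    except KeyError:
        raise ValueError(f"unknown precision mode: {effective_mode!r}") from None
    return [pass_cls(config)]
```

Storing classes rather than instances in the mapping means only the selected pass is ever constructed.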
0b6098a to 7ccce14
xieofxie reviewed Jun 29, 2026
xieofxie reviewed Jun 29, 2026
xieofxie reviewed Jun 29, 2026
xieofxie (Contributor) left a comment
A few correctness concerns (2 should-fix, 2 to consider) on the pass-pipeline refactor. Overall the design is clean and well-tested — these are mostly about parity with the previous single-dispatch behavior.
…ntized suffix
- expand_precision(): rename parameter mode -> precision (more accurate:
'mode' maps only to config.mode values, while 'precision' also covers
future _COMPOSITE_PRECISIONS entries like w4a16)
- Rename internal effective_mode -> effective_precision
- quantize_onnx() default output: {stem}_qdq.onnx -> {stem}_quantized.onnx
(_qdq is an implementation detail of one pass; _quantized is generic)
- Update all callers, tests, CLI help text, and docstrings
xieofxie reviewed Jun 29, 2026
xieofxie (Contributor) left a comment
Two minor nits to go with the earlier review.
… types, nits
Should-fix:
- passes/static.py: restore per-target symmetry override; weight_symmetric and activation_symmetric now fall back to config.symmetric only when None, fixing the w8a16 regression where both were collapsed to config.symmetric
- commands/quantize.py: resolve weight/activation types from the first static precision in the multi-pass pipeline via _resolve_quant_types, so '-p int16 -p fp16' correctly uses int16 instead of silently defaulting to uint8
Nits:
- quantizer.py: replace eager _pass_factories (instances) with lazy _pass_types (classes); only the selected pass is instantiated
- passes/fp16.py, passes/rtn.py: remove redundant __init__ that only calls super().__init__(config)
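The symmetry fallback from the first should-fix item could be written along these lines. `Config` is a stand-in for WinMLQuantizationConfig with only the three fields named in the commit, and `resolve_symmetry` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Config:
    """Stand-in for WinMLQuantizationConfig (only the symmetry fields)."""
    symmetric: bool = False
    weight_symmetric: Optional[bool] = None
    activation_symmetric: Optional[bool] = None


def resolve_symmetry(config: Config) -> Tuple[bool, bool]:
    """Return (weight_symmetric, activation_symmetric) after applying fallbacks."""
    # Compare against None explicitly: an explicit False is a valid override
    # and must not be replaced by config.symmetric.
    weight = config.symmetric if config.weight_symmetric is None else config.weight_symmetric
    activation = (
        config.symmetric if config.activation_symmetric is None else config.activation_symmetric
    )
    return weight, activation
```

Using `config.weight_symmetric or config.symmetric` instead would silently discard an explicit `False`, which is the kind of collapse the fix is meant to avoid.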
xieofxie approved these changes Jun 29, 2026
timenick added a commit that referenced this pull request Jun 29, 2026
Resolve conflict in src/winml/modelkit/quant/quantizer.py against main's #985 Quantizer/BaseQuantPass pipeline refactor (which replaced the old _quantize_single_pass / _quantize_qdq dispatch):
- Drop the obsolete old-architecture handlers; keep main's expand_precision + Quantizer(passes).run pipeline.
- Re-wire the disk-full input guard (_check_input_model_opset) into quantize_onnx, ordered after the kwargs TypeError check and before the model-type finalizer. Not placed in Quantizer.run() because its orchestration unit tests drive it with dummy non-ONNX inputs.
- Fix two finalizer tests that #985 silently broke (they monkeypatched the removed _quantize_qdq): retarget them to the new Quantizer seam.
- Cover the standalone multi-precision CLI path (_run_multi_precision), which drives Quantizer directly, with the same guard for parity.
Address PR review comments in onnx/persistence.py:
- ONNXSaveError.errno was always None; preserve the originating errno via a new errno_code argument so `except OSError` callers can inspect e.errno.
- Add a zero-byte stat() fast-path to _check_input_model_opset so the healthy success path avoids a full proto parse.
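The errno fix can be illustrated with a small OSError subclass. This is a sketch under assumptions: the commit names the `errno_code` argument, but the constructor shape and base class shown here are guesses, chosen so that `except OSError` handlers see a populated `errno`.

```python
import errno


class ONNXSaveError(OSError):
    """Sketch: an OSError subclass that keeps the originating errno."""

    def __init__(self, message, errno_code=None):
        if errno_code is not None:
            # The two-argument form (errno, strerror) populates e.errno.
            super().__init__(errno_code, message)
        else:
            # The one-argument form leaves e.errno as None.
            super().__init__(message)


def save_or_wrap(write):
    """Run `write()` and re-raise any OSError as ONNXSaveError, keeping errno."""
    try:
        write()
    except OSError as exc:
        raise ONNXSaveError(f"failed to save model: {exc}", errno_code=exc.errno) from exc
```

A caller can then distinguish a full disk from other failures by checking `e.errno == errno.ENOSPC` in an ordinary `except OSError as e` block.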
Summary
Refactors the quantizer from a flat dispatch function into an extensible pass-based pipeline, as tracked in #964.
Changes
New: passes/ sub-package
- passes/base.py: BaseQuantPass, with __init__(config) + abstract run(model_path, output_path) -> QuantizeResult
- passes/fp16.py: FP16Pass, reads fp16_keep_io_types, fp16_op_block_list from config
- passes/rtn.py: RTNPass, reads rtn_bits, rtn_block_size, rtn_symmetric, rtn_accuracy_level from config
- passes/static.py: StaticPass
All passes accept a single WinMLQuantizationConfig; each reads only its relevant fields.
Refactored: quantizer.py
- Quantizer(passes): chains passes sequentially; single-pass takes the direct path, multi-pass routes intermediates through a TemporaryDirectory; merges QuantizeResult stats across passes
- expand_precision(mode, config): maps mode strings to pass lists; mode is optional and falls back to config.mode:
  - "fp16" → [FP16Pass(config)]
  - "rtn" → [RTNPass(config)]
  - "static" / "dynamic" → [StaticPass(config)]
- quantize_onnx(): kept as backward-compatible entry point; now delegates to Quantizer
Updated: commands/quantize.py
- --precision now accepts multiple values to compose a pass pipeline (e.g. --precision int4 --precision fp16 runs RTN then FP16)
- Default output path for that example: {stem}_int4_fp16.onnx
Tests
tests/unit/test_quant_passes.py: 19 tests, all passing:
- TestExpandPrecision (7): mapping correctness, unknown mode, None config, mode-from-config fallback
- TestQuantizerSinglePass (4): path routing, missing model, exception handling, empty passes guard
- TestQuantizeOnnxKwargsGuard (1): unexpected kwargs raise TypeError
- TestQuantizerMultiPass (4): chaining, stat merging, abort on failure, warning accumulation
- TestFP16PassConfig (1): config field wiring
- TestRTNPassConfig (2): config field wiring, accuracy_level=0 → None
tests/e2e/test_quantize_e2e.py: TestMultiPrecision (2 tests):
- test_int4_then_fp16_pipeline: verifies MatMulNBits nodes (RTN) + FLOAT16 initializers (FP16) are both present
- test_pipeline_default_output_path: verifies auto-named output file