fix(build): surface clear disk-full error instead of misleading opset ValueError by timenick · Pull Request #987 · microsoft/winml-cli

timenick · 2026-06-26T07:52:37Z

Closes #259

Problem

When the disk fills during the optimize step of winml build, the ONNX write fails: onnx.save_model truncates the target to 0 bytes and then write() raises OSError(ENOSPC), leaving a corrupt/zero-byte optimized.onnx behind. The quantize step then loads that file — onnx.load parses the empty bytes into an empty ModelProto with no opset_import, and ORT raises the opaque:

ValueError: Failed to find proper ai.onnx domain

which surfaces as Quantization failed: .... The real cause (out of disk space) is never reported, so users chase a phantom code bug.

Fix

Root-cause fix at the ONNX write boundary, plus a defensive guard:

ONNXSaveError (onnx/persistence.py) — new OSError subclass with path and disk_full attributes. Exported from onnx/__init__.py.
save_onnx — both write paths (external-data and inline) now catch OSError, remove the partial .onnx/.data artifact, and raise ONNXSaveError. Disk-full (errno.ENOSPC / Windows ERROR_DISK_FULL 112) gets a clear "Insufficient disk space" message; other errors get a generic write-failure message.
copy_onnx_model (onnx/external_data.py) — same treatment for the copy boundary, cleaning up dst (+ sidecar).
Quantizer guard (quant/quantizer.py) — _quantize_single_pass now detects an empty/opset-less/unparseable input model up front and returns a clear QuantizeResult failure instead of letting ORT raise its opaque opset error.
Build hint (commands/build.py) — disk-space errors map to an actionable hint (free up disk / clear ~/.cache/winml).
Docs (docs/troubleshooting.md) — note on the new behavior and automatic partial-file cleanup.

Subclassing OSError keeps existing except OSError callers working and lets the build command's top-level handler surface the clear message verbatim. The write-boundary cleanup guarantees no truncated artifact is left for a later stage to misread.

Out of scope (decided during triage): no proactive free-disk pre-check before build.

When the disk fills during the optimize step, onnx.save_model truncates the target to 0 bytes then raises OSError(ENOSPC), leaving a corrupt optimized.onnx. The quantize step loaded it into an empty ModelProto and ORT raised the opaque 'Failed to find proper ai.onnx domain', hiding the real cause (issue #259). Catch OSError at every ONNX write boundary (save_onnx, copy_onnx_model), remove the partial .onnx/.data artifact, and raise a clear ONNXSaveError (subclass of OSError, attrs path/disk_full). Add a defensive guard in the quantizer that detects an empty/corrupt input model and returns a clear failure instead of ORT's opset error. Map disk-space errors to an actionable hint in the build command.

…ng-valueerror-on-disk-full

CodeQL's py/import-and-import-from flagged the quantizer guard and its tests for importing the onnx module via both 'import onnx' and 'from onnx import ...' in the same file (the guard's plain 'import onnx' collided by short name with the relative 'from ..onnx import save_onnx'). Use 'from onnx import load_model' in the guard and drop the redundant 'from onnx import' in the tests so each file uses a single, consistent import form.

Resolve conflict in tests/unit/test_quantizer.py: keep both the disk-full input-guard regression tests (this branch) and the model_type finalizer tests (main). Update the finalizer tests to write a valid ONNX model via _write_minimal_onnx_model instead of write_text("input"), since the new _check_input_model_opset guard in _quantize_single_pass now rejects a truncated/invalid input before dispatch.

DingmaomaoBJTU

The disk-full fix is well-structured and the test coverage is thorough. Two issues to address before merging:

Bug — ONNXSaveError.errno is always None, contradicting the docstring that says "errno is preserved". Any caller that checks e.errno == errno.ENOSPC will silently get the wrong answer.
Warning — _check_input_model_opset fully parses the ONNX proto on every single quantize call, even for perfectly healthy models. For large models with multi-MB proto headers, this is unnecessary latency.

Resolve conflict in src/winml/modelkit/quant/quantizer.py against main's #985 Quantizer/BaseQuantPass pipeline refactor (which replaced the old _quantize_single_pass / _quantize_qdq dispatch): - Drop the obsolete old-architecture handlers; keep main's expand_precision + Quantizer(passes).run pipeline. - Re-wire the disk-full input guard (_check_input_model_opset) into quantize_onnx, ordered after the kwargs TypeError check and before the model-type finalizer. Not placed in Quantizer.run() because its orchestration unit tests drive it with dummy non-ONNX inputs. - Fix two finalizer tests that #985 silently broke (they monkeypatched the removed _quantize_qdq): retarget them to the new Quantizer seam. - Cover the standalone multi-precision CLI path (_run_multi_precision), which drives Quantizer directly, with the same guard for parity. Address PR review comments in onnx/persistence.py: - ONNXSaveError.errno was always None; preserve the originating errno via a new errno_code argument so `except OSError` callers can inspect e.errno. - Add a zero-byte stat() fast-path to _check_input_model_opset so the healthy success path avoids a full proto parse.

DingmaomaoBJTU

Both issues from the previous review are resolved.

Bug fix confirmed (persistence.py): ONNXSaveError now accepts errno_code and sets self.errno = errno_code explicitly after super().init(message). _raise_save_error passes errno_code=error.errno, and the new tests explicitly assert err.errno == errno.ENOSPC / err.errno == errno.EACCES.

Warning addressed (quantizer.py): _check_input_model_opset now has a stat().st_size == 0 fast-path as suggested. The most common disk-full artefact (zero-byte file) is caught cheaply without a proto parse; non-zero truncated files fall through to load_model.

Minor note (non-blocking): the docstring at lines 232-233 says "the healthy success path never pays for a full proto parse" which is slightly inaccurate — a non-zero healthy model still does load_model. Consider rewording to "a zero-byte file is caught before the full parse".

timenick added 2 commits June 26, 2026 15:51

Merge remote-tracking branch 'origin/main' into timenick-fix-misleadi…

292e5b9

…ng-valueerror-on-disk-full

timenick requested a review from a team as a code owner June 26, 2026 07:52

timenick changed the title ~~Surface clear disk-full error instead of misleading opset ValueError~~ fix(build): surface clear disk-full error instead of misleading opset ValueError Jun 26, 2026

github-advanced-security AI found potential problems Jun 26, 2026

View reviewed changes

Comment thread src/winml/modelkit/quant/quantizer.py Fixed

Comment thread tests/unit/test_quantizer.py Fixed

Comment thread tests/unit/test_quantizer.py Fixed

Comment thread tests/unit/test_quantizer.py Fixed

timenick added 3 commits June 26, 2026 16:22

Merge branch 'main' into timenick-fix-misleading-valueerror-on-disk-full

4614636

DingmaomaoBJTU reviewed Jun 29, 2026

View reviewed changes

Comment thread src/winml/modelkit/onnx/persistence.py

Comment thread src/winml/modelkit/quant/quantizer.py

timenick added 2 commits June 29, 2026 17:54

Merge branch 'main' into timenick-fix-misleading-valueerror-on-disk-full

22ca4fb

DingmaomaoBJTU approved these changes Jun 30, 2026

View reviewed changes

timenick merged commit 947e5ec into main Jun 30, 2026
9 checks passed

timenick deleted the timenick-fix-misleading-valueerror-on-disk-full branch June 30, 2026 01:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(build): surface clear disk-full error instead of misleading opset ValueError#987

fix(build): surface clear disk-full error instead of misleading opset ValueError#987
timenick merged 7 commits into
mainfrom
timenick-fix-misleading-valueerror-on-disk-full

timenick commented Jun 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU left a comment

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

timenick commented Jun 26, 2026

Problem

Fix

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants