Skip to content

fix(build): surface clear disk-full error instead of misleading opset ValueError#987

Merged
timenick merged 7 commits into
mainfrom
timenick-fix-misleading-valueerror-on-disk-full
Jun 30, 2026
Merged

fix(build): surface clear disk-full error instead of misleading opset ValueError#987
timenick merged 7 commits into
mainfrom
timenick-fix-misleading-valueerror-on-disk-full

Conversation

@timenick

Copy link
Copy Markdown
Collaborator

Closes #259

Problem

When the disk fills during the optimize step of winml build, the ONNX write fails: onnx.save_model truncates the target to 0 bytes and then write() raises OSError(ENOSPC), leaving a corrupt/zero-byte optimized.onnx behind. The quantize step then loads that file — onnx.load parses the empty bytes into an empty ModelProto with no opset_import, and ORT raises the opaque:

ValueError: Failed to find proper ai.onnx domain

which surfaces as Quantization failed: .... The real cause (out of disk space) is never reported, so users chase a phantom code bug.

Fix

Root-cause fix at the ONNX write boundary, plus a defensive guard:

  • ONNXSaveError (onnx/persistence.py) — new OSError subclass with path and disk_full attributes. Exported from onnx/__init__.py.
  • save_onnx — both write paths (external-data and inline) now catch OSError, remove the partial .onnx/.data artifact, and raise ONNXSaveError. Disk-full (errno.ENOSPC / Windows ERROR_DISK_FULL 112) gets a clear "Insufficient disk space" message; other errors get a generic write-failure message.
  • copy_onnx_model (onnx/external_data.py) — same treatment for the copy boundary, cleaning up dst (+ sidecar).
  • Quantizer guard (quant/quantizer.py) — _quantize_single_pass now detects an empty/opset-less/unparseable input model up front and returns a clear QuantizeResult failure instead of letting ORT raise its opaque opset error.
  • Build hint (commands/build.py) — disk-space errors map to an actionable hint (free up disk / clear ~/.cache/winml).
  • Docs (docs/troubleshooting.md) — note on the new behavior and automatic partial-file cleanup.

Subclassing OSError keeps existing except OSError callers working and lets the build command's top-level handler surface the clear message verbatim. The write-boundary cleanup guarantees no truncated artifact is left for a later stage to misread.

Out of scope (decided during triage): no proactive free-disk pre-check before build.

timenick added 2 commits June 26, 2026 15:51
When the disk fills during the optimize step, onnx.save_model truncates the target to 0 bytes then raises OSError(ENOSPC), leaving a corrupt optimized.onnx. The quantize step loaded it into an empty ModelProto and ORT raised the opaque 'Failed to find proper ai.onnx domain', hiding the real cause (issue #259).

Catch OSError at every ONNX write boundary (save_onnx, copy_onnx_model), remove the partial .onnx/.data artifact, and raise a clear ONNXSaveError (subclass of OSError, attrs path/disk_full). Add a defensive guard in the quantizer that detects an empty/corrupt input model and returns a clear failure instead of ORT's opset error. Map disk-space errors to an actionable hint in the build command.
@timenick timenick requested a review from a team as a code owner June 26, 2026 07:52
@timenick timenick changed the title Surface clear disk-full error instead of misleading opset ValueError fix(build): surface clear disk-full error instead of misleading opset ValueError Jun 26, 2026
Comment thread src/winml/modelkit/quant/quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
Comment thread tests/unit/test_quantizer.py Fixed
timenick added 3 commits June 26, 2026 16:22
CodeQL's py/import-and-import-from flagged the quantizer guard and its tests for importing the onnx module via both 'import onnx' and 'from onnx import ...' in the same file (the guard's plain 'import onnx' collided by short name with the relative 'from ..onnx import save_onnx'). Use 'from onnx import load_model' in the guard and drop the redundant 'from onnx import' in the tests so each file uses a single, consistent import form.
Resolve conflict in tests/unit/test_quantizer.py: keep both the disk-full
input-guard regression tests (this branch) and the model_type finalizer
tests (main). Update the finalizer tests to write a valid ONNX model via
_write_minimal_onnx_model instead of write_text("input"), since the new
_check_input_model_opset guard in _quantize_single_pass now rejects a
truncated/invalid input before dispatch.

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disk-full fix is well-structured and the test coverage is thorough. Two issues to address before merging:

  1. BugONNXSaveError.errno is always None, contradicting the docstring that says "errno is preserved". Any caller that checks e.errno == errno.ENOSPC will silently get the wrong answer.

  2. Warning_check_input_model_opset fully parses the ONNX proto on every single quantize call, even for perfectly healthy models. For large models with multi-MB proto headers, this is unnecessary latency.

Comment thread src/winml/modelkit/onnx/persistence.py
Comment thread src/winml/modelkit/quant/quantizer.py
timenick added 2 commits June 29, 2026 17:54
Resolve conflict in src/winml/modelkit/quant/quantizer.py against main's
#985 Quantizer/BaseQuantPass pipeline refactor (which replaced the old
_quantize_single_pass / _quantize_qdq dispatch):

- Drop the obsolete old-architecture handlers; keep main's expand_precision
  + Quantizer(passes).run pipeline.
- Re-wire the disk-full input guard (_check_input_model_opset) into
  quantize_onnx, ordered after the kwargs TypeError check and before the
  model-type finalizer. Not placed in Quantizer.run() because its
  orchestration unit tests drive it with dummy non-ONNX inputs.
- Fix two finalizer tests that #985 silently broke (they monkeypatched the
  removed _quantize_qdq): retarget them to the new Quantizer seam.
- Cover the standalone multi-precision CLI path (_run_multi_precision), which
  drives Quantizer directly, with the same guard for parity.

Address PR review comments in onnx/persistence.py:
- ONNXSaveError.errno was always None; preserve the originating errno via a
  new errno_code argument so `except OSError` callers can inspect e.errno.
- Add a zero-byte stat() fast-path to _check_input_model_opset so the healthy
  success path avoids a full proto parse.

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both issues from the previous review are resolved.

Bug fix confirmed (persistence.py): ONNXSaveError now accepts errno_code and sets self.errno = errno_code explicitly after super().init(message). _raise_save_error passes errno_code=error.errno, and the new tests explicitly assert err.errno == errno.ENOSPC / err.errno == errno.EACCES.

Warning addressed (quantizer.py): _check_input_model_opset now has a stat().st_size == 0 fast-path as suggested. The most common disk-full artefact (zero-byte file) is caught cheaply without a proto parse; non-zero truncated files fall through to load_model.

Minor note (non-blocking): the docstring at lines 232-233 says "the healthy success path never pays for a full proto parse" which is slightly inaccurate — a non-zero healthy model still does load_model. Consider rewording to "a zero-byte file is caught before the full parse".

@timenick timenick merged commit 947e5ec into main Jun 30, 2026
9 checks passed
@timenick timenick deleted the timenick-fix-misleading-valueerror-on-disk-full branch June 30, 2026 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: winml build shows misleading ValueError (opset domain) when disk is full during quantize step

3 participants