fix(build): surface clear disk-full error instead of misleading opset ValueError #987
When the disk fills during the optimize step, onnx.save_model truncates the target to 0 bytes then raises OSError(ENOSPC), leaving a corrupt optimized.onnx. The quantize step loaded it into an empty ModelProto and ORT raised the opaque 'Failed to find proper ai.onnx domain', hiding the real cause (issue #259). Catch OSError at every ONNX write boundary (save_onnx, copy_onnx_model), remove the partial .onnx/.data artifact, and raise a clear ONNXSaveError (subclass of OSError, attrs path/disk_full). Add a defensive guard in the quantizer that detects an empty/corrupt input model and returns a clear failure instead of ORT's opset error. Map disk-space errors to an actionable hint in the build command.
…ng-valueerror-on-disk-full
CodeQL's py/import-and-import-from flagged the quantizer guard and its tests for importing the onnx module via both 'import onnx' and 'from onnx import ...' in the same file (the guard's plain 'import onnx' collided by short name with the relative 'from ..onnx import save_onnx'). Use 'from onnx import load_model' in the guard and drop the redundant 'from onnx import' in the tests so each file uses a single, consistent import form.
Resolve conflict in tests/unit/test_quantizer.py: keep both the disk-full
input-guard regression tests (this branch) and the model_type finalizer
tests (main). Update the finalizer tests to write a valid ONNX model via
_write_minimal_onnx_model instead of write_text("input"), since the new
_check_input_model_opset guard in _quantize_single_pass now rejects a
truncated/invalid input before dispatch.
DingmaomaoBJTU
left a comment
The disk-full fix is well-structured and the test coverage is thorough. Two issues to address before merging:
- Bug: `ONNXSaveError.errno` is always `None`, contradicting the docstring that says "errno is preserved". Any caller that checks `e.errno == errno.ENOSPC` will silently get the wrong answer.
- Warning: `_check_input_model_opset` fully parses the ONNX proto on every single quantize call, even for perfectly healthy models. For large models with multi-MB proto headers, this is unnecessary latency.
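The errno point can be reproduced with a minimal sketch. `SaveErrorSketch` and `is_disk_full` are hypothetical stand-ins, not the PR's code; they only assume that the pre-fix class passed a single message string to `OSError.__init__`, in which case Python leaves `errno` unset.

```python
import errno


class SaveErrorSketch(OSError):
    """Hypothetical stand-in for the pre-fix ONNXSaveError (an assumption)."""

    def __init__(self, message: str, path: str, disk_full: bool = False) -> None:
        # Only a message reaches OSError, so errno/strerror are never populated.
        super().__init__(message)
        self.path = path
        self.disk_full = disk_full


def is_disk_full(exc: OSError) -> bool:
    """The kind of caller-side check the review comment is concerned about."""
    return exc.errno == errno.ENOSPC


# The two-argument form OSError(errno, strerror) does populate errno.
original = OSError(errno.ENOSPC, "No space left on device")

# Wrapping it with a message-only subclass silently drops the errno.
wrapped = SaveErrorSketch("Insufficient disk space", "optimized.onnx", disk_full=True)

print(is_disk_full(original), is_disk_full(wrapped), wrapped.errno)
```

The wrapped error still carries `disk_full=True`, but a caller that only inspects `errno` cannot tell it apart from any other write failure.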
Resolve conflict in src/winml/modelkit/quant/quantizer.py against main's #985 Quantizer/BaseQuantPass pipeline refactor (which replaced the old _quantize_single_pass / _quantize_qdq dispatch):
- Drop the obsolete old-architecture handlers; keep main's expand_precision + Quantizer(passes).run pipeline.
- Re-wire the disk-full input guard (_check_input_model_opset) into quantize_onnx, ordered after the kwargs TypeError check and before the model-type finalizer. Not placed in Quantizer.run() because its orchestration unit tests drive it with dummy non-ONNX inputs.
- Fix two finalizer tests that #985 silently broke (they monkeypatched the removed _quantize_qdq): retarget them to the new Quantizer seam.
- Cover the standalone multi-precision CLI path (_run_multi_precision), which drives Quantizer directly, with the same guard for parity.

Address PR review comments in onnx/persistence.py:
- ONNXSaveError.errno was always None; preserve the originating errno via a new errno_code argument so `except OSError` callers can inspect e.errno.
- Add a zero-byte stat() fast-path to _check_input_model_opset so the healthy success path avoids a full proto parse.
DingmaomaoBJTU
left a comment
Both issues from the previous review are resolved.
Bug fix confirmed (persistence.py): ONNXSaveError now accepts errno_code and sets self.errno = errno_code explicitly after super().__init__(message). _raise_save_error passes errno_code=error.errno, and the new tests explicitly assert err.errno == errno.ENOSPC / err.errno == errno.EACCES.
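A rough sketch of the shape of that fix, for readers without the diff open. The class and helper below are illustrative stand-ins (`SaveErrorSketch`, `raise_save_error`); only the `errno_code` argument, the explicit `self.errno` assignment after `super().__init__(message)`, and the `path`/`disk_full` attributes come from this thread. The message wording is an assumption.

```python
import errno


class SaveErrorSketch(OSError):
    """Illustrative stand-in for ONNXSaveError after the fix."""

    def __init__(self, message, path, disk_full=False, errno_code=None):
        super().__init__(message)
        # OSError leaves errno as None for a message-only init, so set it by hand.
        self.errno = errno_code
        self.path = path
        self.disk_full = disk_full


def raise_save_error(error: OSError, path: str) -> None:
    """Wrap a low-level OSError, carrying its errno through to the new error."""
    disk_full = error.errno == errno.ENOSPC
    if disk_full:
        message = f"Insufficient disk space while writing {path}"  # assumed wording
    else:
        message = f"Failed to write {path}: {error}"  # assumed wording
    raise SaveErrorSketch(
        message, path, disk_full=disk_full, errno_code=error.errno
    ) from error
```

Because the wrapper is still an `OSError`, an existing `except OSError as e` block catches it unchanged and can now branch on `e.errno`.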
Warning addressed (quantizer.py): _check_input_model_opset now has a stat().st_size == 0 fast-path as suggested. The most common disk-full artefact (zero-byte file) is caught cheaply without a proto parse; non-zero truncated files fall through to load_model.
Minor note (non-blocking): the docstring at lines 232-233 says "the healthy success path never pays for a full proto parse", which is slightly inaccurate: a non-zero healthy model still goes through load_model. Consider rewording to "a zero-byte file is caught before the full parse".
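For reference, the ordering discussed in this review (cheap size check first, full parse only for non-empty files) might look roughly like the sketch below. `check_input_model`, its `loader` parameter, and the returned messages are hypothetical; the real `_check_input_model_opset` may differ. The default loader uses `onnx.load_model` with `load_external_data=False`, which is an assumption about what the guard needs, not something stated in the PR.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def _default_loader(path: str) -> Any:
    # Imported lazily so the zero-byte fast path works without importing onnx.
    from onnx import load_model

    return load_model(path, load_external_data=False)


def check_input_model(
    path: str, loader: Optional[Callable[[str], Any]] = None
) -> Optional[str]:
    """Return a failure message for an empty or corrupt model, else None."""
    model_path = Path(path)
    try:
        # Fast path: a zero-byte file is caught before any proto parse.
        if model_path.stat().st_size == 0:
            return f"{path} is empty (0 bytes); an earlier step may have run out of disk space"
    except OSError as exc:
        return f"cannot read {path}: {exc}"

    # Non-zero files, healthy or truncated, still pay for a full parse.
    try:
        model = (loader or _default_loader)(str(model_path))
    except Exception as exc:  # truncated protobuf bytes raise a decode error
        return f"{path} could not be parsed as ONNX: {exc}"

    if len(model.opset_import) == 0:
        return f"{path} has no opset_import; the file is likely truncated or corrupt"
    return None
```

This matches the reviewer's wording: only the zero-byte case skips the parse, so a docstring describing it should say so.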
Closes #259
Problem
When the disk fills during the optimize step of `winml build`, the ONNX write fails: `onnx.save_model` truncates the target to 0 bytes and then `write()` raises `OSError(ENOSPC)`, leaving a corrupt/zero-byte `optimized.onnx` behind. The quantize step then loads that file: `onnx.load` parses the empty bytes into an empty `ModelProto` with no `opset_import`, and ORT raises the opaque 'Failed to find proper ai.onnx domain' error, which surfaces as `Quantization failed: ...`. The real cause (out of disk space) is never reported, so users chase a phantom code bug.
Fix
Root-cause fix at the ONNX write boundary, plus a defensive guard:
- `ONNXSaveError` (`onnx/persistence.py`): new `OSError` subclass with `path` and `disk_full` attributes. Exported from `onnx/__init__.py`.
- `save_onnx`: both write paths (external-data and inline) now catch `OSError`, remove the partial `.onnx`/`.data` artifact, and raise `ONNXSaveError`. Disk-full (`errno.ENOSPC` / Windows `ERROR_DISK_FULL` 112) gets a clear "Insufficient disk space" message; other errors get a generic write-failure message.
- `copy_onnx_model` (`onnx/external_data.py`): same treatment for the copy boundary, cleaning up `dst` (+ sidecar).
- `quant/quantizer.py`: `_quantize_single_pass` now detects an empty/opset-less/unparseable input model up front and returns a clear `QuantizeResult` failure instead of letting ORT raise its opaque opset error.
- `commands/build.py`: disk-space errors map to an actionable hint (free up disk / clear `~/.cache/winml`).
- `docs/troubleshooting.md`: note on the new behavior and automatic partial-file cleanup.

Subclassing `OSError` keeps existing `except OSError` callers working and lets the build command's top-level handler surface the clear message verbatim. The write-boundary cleanup guarantees no truncated artifact is left for a later stage to misread.

Out of scope (decided during triage): no proactive free-disk pre-check before build.
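The write-boundary pattern described in the list above can be sketched in isolation as follows. This is not the PR's `save_onnx`: `guarded_save`, `SaveErrorSketch`, `build_hint`, the injected `write` callable, and the `<name>.data` sidecar naming are all assumptions made for illustration. Only the overall behavior (catch `OSError`, remove partial files, re-raise a clear `OSError` subclass, map disk-full to a hint) comes from the description.

```python
import errno
from pathlib import Path
from typing import Callable

WINDOWS_ERROR_DISK_FULL = 112  # Windows ERROR_DISK_FULL, as cited in the description


class SaveErrorSketch(OSError):
    """Illustrative OSError subclass carrying path and disk_full."""

    def __init__(self, message, path, disk_full=False, errno_code=None):
        super().__init__(message)
        self.errno = errno_code
        self.path = path
        self.disk_full = disk_full


def _is_disk_full(exc: OSError) -> bool:
    # winerror only exists on Windows, hence the getattr.
    return (
        exc.errno == errno.ENOSPC
        or getattr(exc, "winerror", None) == WINDOWS_ERROR_DISK_FULL
    )


def guarded_save(write: Callable[[Path], None], path: str) -> None:
    """Run write(path); on OSError remove partial output and raise a clear error."""
    target = Path(path)
    try:
        write(target)
    except OSError as exc:
        # Best-effort cleanup so no truncated file reaches a later stage.
        for partial in (target, target.with_name(target.name + ".data")):
            try:
                partial.unlink()
            except OSError:
                pass
        disk_full = _is_disk_full(exc)
        message = (
            f"Insufficient disk space while writing {target}"
            if disk_full
            else f"Failed to write {target}: {exc}"
        )
        raise SaveErrorSketch(
            message, str(target), disk_full=disk_full, errno_code=exc.errno
        ) from exc


def build_hint(exc: OSError) -> str:
    """Map a disk-full failure to an actionable hint, else echo the message."""
    if getattr(exc, "disk_full", False):
        return "Free up disk space or clear ~/.cache/winml, then re-run the build."
    return str(exc)
```

Injecting the writer keeps the sketch independent of `onnx`; in the real code the same try/except would wrap the actual save or copy call.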