Reject non-positive tile_size when cog=True regardless of tiled#2316
Conversation
`to_geotiff(..., cog=True, tiled=False, tile_size=-1)` previously hung the
writer. tile_size validation only ran when `tiled=True`, but the COG path
consumed tile_size for overview auto-generation regardless of the strip-
vs-tiled flag. With a negative tile_size the auto-overview loop in
`_writer.py:490` had `oh > tile_size and ow > tile_size` permanently true
once oh, ow halved to 0 while the inner guard suppressed appends, so the
loop spun forever.
- Validate tile_size at the to_geotiff boundary when tiled OR cog is
true, so a non-positive value raises ValueError up front.
- Validate tile_size at the top of `_write` when cog=True, before any
encode work runs. This converts the prior ZeroDivisionError
(tile_size=0, tiled=True) and the infinite loop (tile_size<=0,
tiled=False) into one typed ValueError for direct callers.
- Tighten the auto-overview loop to require oh, ow > 0.
Regression test exercises cog=True with tile_size in {-1, 0} for both
tiled flags, with a SIGALRM watchdog so any future regression fails the
test instead of hanging the run.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Reject non-positive tile_size when cog=True regardless of tiled
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
-
xrspatial/geotiff/_writer.py:522-526-- the inner-loop defense check doesnot isinstance(tile_size, (int, np.integer)) or tile_size <= 0but does not explicitly rejectbool. The top-of-_writegate at line 407-414 does reject bool (or isinstance(tile_size, bool)). The two defense layers are inconsistent. Either (a) drop the redundant inner check now that the top gate covers it, or (b) add the bool reject to the inner check for parity. (b) is cheap and matches the public validator wording. -
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py-- no test exercises the auto-overview-loop guard withtiled=Falsedirectly.test_writer_auto_overview_loop_rejects_non_positive_tile_sizeonly coverstiled=True, which actually trips the top-of-_writevalidator at line 407 before reaching the loop. To make the inner-loop guard reachable, monkeypatch the top gate to a no-op and then assert the loop guard at line 522 fires. Without that, the inner-loop guard is uncovered. Lower priority because the top gate is the load-bearing one for users, but worth at least one test row so a future refactor that removes the top gate is caught.
Nits (optional improvements)
-
xrspatial/geotiff/_writer.py:407-414-- the comment block is long (~10 lines) for a single isinstance/<=0 check. Consider folding the explanation into a one-line# defense in depth: #2311and let the body comment carry the detail. Pure style preference; existing geotiff code mixes both conventions. -
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py:88--import warningsinside the function body works but is non-idiomatic for this repo (the rest of the file imports at module scope). Move to module scope. -
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py-- no test coverstile_size=None,tile_size=128.0, ortile_size=True. The top-of-_writegate rejects all three; a one-row parametrization would close that gap.
What looks good
- Two-layer fix: public-boundary rejection in
to_geotiffplus inner-writer rejection in_write, with comments cross-referencing the issue. - SIGALRM watchdog on every test so a regression that reintroduces the hang fails the test instead of locking up CI.
- Existing GPU writer path already runs
_validate_tile_size_argunconditionally; no GPU-side change needed. - Sanity test confirms
cog=False, tiled=False, tile_size=-1is still accepted (matches the "ignored" warning shape) so the fix does not overcorrect into breaking existing strip-layout callers. - Test file message wording asserts both
tile_sizeandpositivesubstrings, which matches the existing_validate_tile_sizeconvention from #1776/#2301.
Checklist
- Algorithm matches reference/paper (N/A -- validation-only)
- All implemented backends produce consistent results (GPU path was already validated)
- NaN handling is correct (N/A)
- Edge cases are covered by tests (tile_size in {-1, 0}; minor gap on tile_size=None/float/bool)
- Dask chunk boundaries handled correctly (N/A)
- No premature materialization or unnecessary copies (N/A)
- Benchmark exists or is not needed (not needed -- validation)
- README feature matrix updated (N/A -- no new public function)
- Docstrings present and accurate (no signature change)
Suggestion 1 (eager.py:407 vs _writer.py:522 inconsistency): align the
inner-loop defense check at _writer.py:522 with the top-of-_write gate
by adding the bool reject. Both layers now use the same isinstance +
bool + <=0 trinity, matching the _validate_tile_size convention.
Suggestion 2 (inner-guard coverage): add a constants-inspection test
that pins the inner-overview guard's error message literal so a future
refactor that removes the loop-side defense fails the test even when
the top gate at line 407 still catches the bad input at runtime.
Nit 2 (import scope): move `import warnings` to module scope in the
test file to match the rest of the imports.
Nit 3 (non-int tile_size coverage): add parametrized test rows for
tile_size in {None, 128.0, True, False}, all of which the public
boundary rejects with a typed ValueError that names tile_size.
Nit 1 (long defense-in-depth comment) dismissed: the verbose block
matches the geotiff module's prevailing comment style (#2301, #2216,
#1987 PR 3) and removing it would diverge from the surrounding code.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (post-fixes)
Disposition of first-pass findings
- Suggestion 1 (bool parity at
_writer.py:522): fixed in commit a1767fa. Both defense layers now use the sameisinstance + bool + <=0trinity. - Suggestion 2 (inner-loop guard coverage): fixed via
test_inner_overview_loop_guard_message_is_pinned. The constants-inspection approach pins the literal without requiring an invasive monkeypatch of the top gate. Reaching the inner guard through a real_writecall requires either patching the top gate's inlineisinstance/bool/<=0check (which has no helper to monkeypatch) or duplicating most of_writein the test. The constants pin catches the case the second-line defense exists to catch (someone deletes it), without that complexity. - Nit 1 (long comment block): dismissed. The verbose comment matches the prevailing geotiff style (#2301, #2216, #1987 PR 3 all use ~10-line context comments on validation gates). Trimming would diverge from the surrounding code.
- Nit 2 (import scope): fixed.
import warningsis now at module scope. - Nit 3 (non-int tile_size rows): fixed. Added parametrized rows for
tile_size in {None, 128.0, True, False}, all passing.
Second-pass findings
None. The follow-up commit is focused and the test count is now 12 passing in 0.09s.
Closes #2311
Summary
tile_sizeat theto_geotiffboundary whenevercog=True, regardless oftiled, so non-positive values raiseValueErrorinstead of falling through to the auto-overview loop.tile_sizeat the top of the inner_writewhencog=True, before any encode work runs. This converts the priorZeroDivisionError(tile_size=0, tiled=True) and the infinite loop (tile_size<=0, tiled=False) into one typedValueErrorfor direct callers of the array-level writer._writer.pyto requireoh, ow > 0, so a future caller that bypasses both validators still cannot drive an infinite loop.Backend coverage
Writer-side change. The eager CPU writer dispatch covers numpy and dask+numpy callers. The GPU writer already runs
_validate_tile_size_argunconditionally at entry, so the hang path never reached the COG overview loop there.Test plan
pytest xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py(7 passed)pytest xrspatial/geotiff/tests/test_cog_invalid_input_errors_2286.py xrspatial/geotiff/tests/test_cog.py xrspatial/geotiff/tests/test_size_param_validation_gpu_vrt_1776.py(81 passed, no regressions)pytest xrspatial/geotiff/tests/ -k "not http and not gpu and not cupy"(4028 passed, no regressions)The new regression test uses a SIGALRM-based watchdog (POSIX) so a future regression that reintroduces the hang fails the test instead of locking up the run.