Reject non-positive tile_size when cog=True regardless of tiled by brendancol · Pull Request #2316 · xarray-contrib/xarray-spatial

brendancol · 2026-05-22T17:57:25Z

Summary

Validate tile_size at the to_geotiff boundary whenever cog=True, regardless of tiled, so non-positive values raise ValueError instead of falling through to the auto-overview loop.
Validate tile_size at the top of the inner _write when cog=True, before any encode work runs. This converts the prior ZeroDivisionError (tile_size=0, tiled=True) and the infinite loop (tile_size<=0, tiled=False) into one typed ValueError for direct callers of the array-level writer.
Tighten the auto-overview halving loop in _writer.py to require oh, ow > 0, so a future caller that bypasses both validators still cannot drive an infinite loop.

Backend coverage

Writer-side change. The eager CPU writer dispatch covers numpy and dask+numpy callers. The GPU writer already runs _validate_tile_size_arg unconditionally at entry, so the hang path never reached the COG overview loop there.

Test plan

pytest xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py (7 passed)
pytest xrspatial/geotiff/tests/test_cog_invalid_input_errors_2286.py xrspatial/geotiff/tests/test_cog.py xrspatial/geotiff/tests/test_size_param_validation_gpu_vrt_1776.py (81 passed, no regressions)
pytest xrspatial/geotiff/tests/ -k "not http and not gpu and not cupy" (4028 passed, no regressions)

The new regression test uses a SIGALRM-based watchdog (POSIX) so a future regression that reintroduces the hang fails the test instead of locking up the run.

`to_geotiff(..., cog=True, tiled=False, tile_size=-1)` previously hung the writer. tile_size validation only ran when `tiled=True`, but the COG path consumed tile_size for overview auto-generation regardless of the strip- vs-tiled flag. With a negative tile_size the auto-overview loop in `_writer.py:490` had `oh > tile_size and ow > tile_size` permanently true once oh, ow halved to 0 while the inner guard suppressed appends, so the loop spun forever. - Validate tile_size at the to_geotiff boundary when tiled OR cog is true, so a non-positive value raises ValueError up front. - Validate tile_size at the top of `_write` when cog=True, before any encode work runs. This converts the prior ZeroDivisionError (tile_size=0, tiled=True) and the infinite loop (tile_size<=0, tiled=False) into one typed ValueError for direct callers. - Tighten the auto-overview loop to require oh, ow > 0. Regression test exercises cog=True with tile_size in {-1, 0} for both tiled flags, with a SIGALRM watchdog so any future regression fails the test instead of hanging the run.

brendancol

PR Review: Reject non-positive tile_size when cog=True regardless of tiled

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

xrspatial/geotiff/_writer.py:522-526 -- the inner-loop defense check does not isinstance(tile_size, (int, np.integer)) or tile_size <= 0 but does not explicitly reject bool. The top-of-_write gate at line 407-414 does reject bool (or isinstance(tile_size, bool)). The two defense layers are inconsistent. Either (a) drop the redundant inner check now that the top gate covers it, or (b) add the bool reject to the inner check for parity. (b) is cheap and matches the public validator wording.
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py -- no test exercises the auto-overview-loop guard with tiled=False directly. test_writer_auto_overview_loop_rejects_non_positive_tile_size only covers tiled=True, which actually trips the top-of-_write validator at line 407 before reaching the loop. To make the inner-loop guard reachable, monkeypatch the top gate to a no-op and then assert the loop guard at line 522 fires. Without that, the inner-loop guard is uncovered. Lower priority because the top gate is the load-bearing one for users, but worth at least one test row so a future refactor that removes the top gate is caught.

Nits (optional improvements)

xrspatial/geotiff/_writer.py:407-414 -- the comment block is long (~10 lines) for a single isinstance/<=0 check. Consider folding the explanation into a one-line # defense in depth: #2311 and let the body comment carry the detail. Pure style preference; existing geotiff code mixes both conventions.
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py:88 -- import warnings inside the function body works but is non-idiomatic for this repo (the rest of the file imports at module scope). Move to module scope.
xrspatial/geotiff/tests/test_cog_tile_size_hang_2311.py -- no test covers tile_size=None, tile_size=128.0, or tile_size=True. The top-of-_write gate rejects all three; a one-row parametrization would close that gap.

What looks good

Two-layer fix: public-boundary rejection in to_geotiff plus inner-writer rejection in _write, with comments cross-referencing the issue.
SIGALRM watchdog on every test so a regression that reintroduces the hang fails the test instead of locking up CI.
Existing GPU writer path already runs _validate_tile_size_arg unconditionally; no GPU-side change needed.
Sanity test confirms cog=False, tiled=False, tile_size=-1 is still accepted (matches the "ignored" warning shape) so the fix does not overcorrect into breaking existing strip-layout callers.
Test file message wording asserts both tile_size and positive substrings, which matches the existing _validate_tile_size convention from #1776/#2301.

Checklist

Algorithm matches reference/paper (N/A -- validation-only)
All implemented backends produce consistent results (GPU path was already validated)
NaN handling is correct (N/A)
Edge cases are covered by tests (tile_size in {-1, 0}; minor gap on tile_size=None/float/bool)
Dask chunk boundaries handled correctly (N/A)
No premature materialization or unnecessary copies (N/A)
Benchmark exists or is not needed (not needed -- validation)
README feature matrix updated (N/A -- no new public function)
Docstrings present and accurate (no signature change)

Suggestion 1 (eager.py:407 vs _writer.py:522 inconsistency): align the inner-loop defense check at _writer.py:522 with the top-of-_write gate by adding the bool reject. Both layers now use the same isinstance + bool + <=0 trinity, matching the _validate_tile_size convention. Suggestion 2 (inner-guard coverage): add a constants-inspection test that pins the inner-overview guard's error message literal so a future refactor that removes the loop-side defense fails the test even when the top gate at line 407 still catches the bad input at runtime. Nit 2 (import scope): move `import warnings` to module scope in the test file to match the rest of the imports. Nit 3 (non-int tile_size coverage): add parametrized test rows for tile_size in {None, 128.0, True, False}, all of which the public boundary rejects with a typed ValueError that names tile_size. Nit 1 (long defense-in-depth comment) dismissed: the verbose block matches the geotiff module's prevailing comment style (#2301, #2216, #1987 PR 3) and removing it would diverge from the surrounding code.

brendancol

Follow-up review (post-fixes)

Disposition of first-pass findings

Suggestion 1 (bool parity at _writer.py:522): fixed in commit a1767fa. Both defense layers now use the same isinstance + bool + <=0 trinity.
Suggestion 2 (inner-loop guard coverage): fixed via test_inner_overview_loop_guard_message_is_pinned. The constants-inspection approach pins the literal without requiring an invasive monkeypatch of the top gate. Reaching the inner guard through a real _write call requires either patching the top gate's inline isinstance/bool/<=0 check (which has no helper to monkeypatch) or duplicating most of _write in the test. The constants pin catches the case the second-line defense exists to catch (someone deletes it), without that complexity.
Nit 1 (long comment block): dismissed. The verbose comment matches the prevailing geotiff style (#2301, #2216, #1987 PR 3 all use ~10-line context comments on validation gates). Trimming would diverge from the surrounding code.
Nit 2 (import scope): fixed. import warnings is now at module scope.
Nit 3 (non-int tile_size rows): fixed. Added parametrized rows for tile_size in {None, 128.0, True, False}, all passing.

Second-pass findings

None. The follow-up commit is focused and the test count is now 12 passing in 0.09s.

github-actions Bot added the performance PR touches performance-sensitive code label May 22, 2026

brendancol commented May 22, 2026

View reviewed changes

brendancol merged commit 35a0616 into main May 22, 2026
7 checks passed

brendancol deleted the issue-2311 branch May 27, 2026 14:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reject non-positive tile_size when cog=True regardless of tiled#2316

Reject non-positive tile_size when cog=True regardless of tiled#2316
brendancol merged 2 commits into
mainfrom
issue-2311

brendancol commented May 22, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brendancol commented May 22, 2026

Summary

Backend coverage

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: Reject non-positive tile_size when cog=True regardless of tiled

Blockers (must fix before merge)

Suggestions (should fix, not blocking)

Nits (optional improvements)

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

Follow-up review (post-fixes)

Disposition of first-pass findings

Second-pass findings

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant