Skip to content

GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2332)#2337

Merged
brendancol merged 3 commits into
mainfrom
issue-2332
May 24, 2026
Merged

GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2332)#2337
brendancol merged 3 commits into
mainfrom
issue-2332

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • Routing on startswith(('http://', 'https://')) was case-sensitive, so HTTP://169.254.169.254/... slipped past _HTTPSource and the SSRF / DNS-pinning checks and fell through to fsspec. Per RFC 3986 section 3.1, URI schemes are case-insensitive.
  • Adds a single _is_http_source(s) helper in _sources.py backed by urlparse(s).scheme.lower() in ('http', 'https') and routes every dispatch site through it: _sources.py (_is_fsspec_uri, _open_source), _reader.py, _writer.py (_is_fsspec_uri), _backends/dask.py, _backends/gpu.py (eager-via-CPU branch + GDS gate), and the _sidecar.py _is_http_url helper.
  • Backend coverage: numpy / cupy / dask+numpy / dask+cupy. The helper is on every dispatch path.

Test plan

  • New test_http_scheme_case_2321.py: helper, _open_source dispatch for HTTP / HTTPS / Http / hTTpS, both _is_fsspec_uri classifiers, the sidecar helper, and end-to-end SSRF rejection of uppercase URLs that resolve to 127.0.0.1, 169.254.169.254, 10.0.0.1, 192.168.1.1, 0.0.0.0, and localhost. socket.getaddrinfo is monkeypatched so tests stay offline.
  • test_ssrf_hardening_1664.py still passes (43 / 43).
  • Full xrspatial/geotiff/tests/ suite (excluding the golden-corpus HTTP test that pulls real fixtures): 5317 passed, 68 skipped.

Closes #2332.
Parent: #2321 sub-task 5.

Several call sites in xrspatial/geotiff/ classified HTTP(S) sources
with startswith(('http://', 'https://')). That comparison is
case-sensitive, so URLs like HTTP://169.254.169.254/... slipped past
_HTTPSource and the SSRF / DNS-pinning checks in _validate_http_url
and fell through to fsspec. RFC 3986 section 3.1 defines URI schemes
as case-insensitive.

Adds _is_http_source(s) in _sources.py, backed by
urlparse(s).scheme.lower() in ('http', 'https'), and routes every
dispatch site through it: _sources.py (_is_fsspec_uri + _open_source),
_reader.py, _writer.py (_is_fsspec_uri), _backends/dask.py,
_backends/gpu.py (eager-via-CPU branch + GDS opt-in gate), and the
_sidecar.py _is_http_url helper.

New regression tests in test_http_scheme_case_2321.py cover the
helper, _open_source dispatch for HTTP / HTTPS / Http / hTTpS, the
partner _is_fsspec_uri classifiers in both _sources and _writer, the
sidecar helper, and the end-to-end SSRF rejection of uppercase URLs
that resolve to 127.0.0.1, 169.254.169.254, 10.0.0.1, 192.168.1.1,
0.0.0.0, and localhost. socket.getaddrinfo is monkeypatched so the
tests stay offline.

Parent: #2321 sub-task 5.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 23, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2332)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • Repeated from ._sources import _is_http_source inside function bodies at _backends/dask.py:197, _backends/gpu.py:319 (top of read_geotiff_gpu), _backends/gpu.py:1057 (inside _gds_chunk_path_available), _sidecar.py:50 (inside _is_http_url), and _writer.py:1248 (inside _is_fsspec_uri) all do per-call imports. _reader.py already imports _is_http_source at module scope. Local imports add ~1us per dispatch and obscure the dependency graph. Consider promoting at least the two hot ones (_backends/dask.py:197, _writer.py:1248) to module scope, or folding them into the top-of-function import block that already exists in read_geotiff_gpu at line 315.

    Counter-argument: the gpu.py block at line 315 already does function-scope imports because the backend modules avoid eager-importing each other. Following that convention is reasonable. If you keep them, a one-line comment ("local to avoid import cycles") would help the next reader.

Nits (optional improvements)

  • urlparse import inside _is_http_source at _sources.py:1474. The function imports urlparse on every call. urllib.parse is already used at module top (line 468 in _validate_http_url). Hoist from urllib.parse import urlparse to the module top so the import cost only happens once. _is_http_source runs once per dispatch, so the perf delta is tiny.

  • Behavioural broadening note. The new helper accepts schemes-only-with-colon (http:foo returns True; previously the startswith returned False). The validator still rejects these as "no hostname" so the security posture is unchanged, but a one-line test asserting _is_http_source('http:foo') and then _open_source('http:foo') raises UnsafeURLError would lock in the new behaviour. Not a blocker; the new behaviour is more RFC-correct.

  • Writer error type for uppercase HTTP write targets. After this PR an HTTP://... write target falls through _is_fsspec_uri (correctly returns False) and then through the file-like check, landing at the local file write path on line 1297. The resulting error is OSError: [Errno 22] Invalid argument (or similar OS-specific) rather than a typed "writes not supported over HTTP". Consider adding a _is_http_source(path) check at _writer.py:1273 that raises NotImplementedError("writes are not supported over HTTP"). Out of scope for this PR; pre-existing issue, just made more reachable.

What looks good

  • The helper is small, documented, and replaces seven case-sensitive startswith call sites with one. The urlparse(...).scheme.lower() in ('http', 'https') approach handles http: (scheme without //), leading whitespace, and host case automatically.
  • The test file is thorough: helper parametrized across mixed-case schemes, end-to-end _open_source dispatch with __init__ interception (no real network), and parametrized SSRF rejection across (scheme casing) x (private IP). All offline via socket.getaddrinfo monkeypatch.
  • The _sidecar.py _is_http_url helper now delegates to the same helper, so future drift between the two is harder.
  • No public API change. Backend coverage (numpy / cupy / dask+numpy / dask+cupy) is preserved on every dispatch path.
  • Full geotiff test suite passes (5317 / 5317 non-network tests).

Checklist

  • Algorithm matches reference (RFC 3986 sect 3.1, case-insensitive schemes)
  • All implemented backends produce consistent results (single helper used everywhere)
  • NaN handling: N/A
  • Edge cases covered (mixed case, non-string, empty string, non-HTTP schemes)
  • Dask chunk boundaries: N/A
  • No premature materialization or unnecessary copies
  • Tests added (60 new test cases)
  • Docstrings present and accurate
  • README feature matrix: N/A (no new public function)

#2332)

Follow-ups on the /review-pr pass for PR #2337:

* Hoist ``from urllib.parse import urlparse`` to ``_sources.py`` module
  top so ``_is_http_source`` no longer pays the import cost per call.
* Hoist ``from ._sources import _is_http_source`` to module top in
  ``_writer.py`` and ``_sidecar.py`` (no cycle: neither is imported by
  ``_sources``).
* Keep the function-scope imports in ``_backends/dask.py`` and
  ``_backends/gpu.py``: the backend modules deliberately defer the
  reader / sources import so the package can be imported without
  urllib3 in dask-only environments. Add a one-line comment so the
  next reader does not assume it is an oversight.
* Reject HTTP(S) write targets in ``_write_bytes`` with a typed
  ``NotImplementedError`` before the local file write path interprets
  the URL as a filename. The error message lists the supported
  destinations. Covers both lowercase and uppercase scheme casing.
* Lock the behavioural broadening: ``_is_http_source('http:foo')``
  now returns True (urlparse-correct, broader than the old
  ``startswith``). Added a test asserting the classifier matches and
  that ``_open_source('http:foo')`` raises ``UnsafeURLError`` so the
  validator still rejects it as "no hostname".
* Added 7 new test cases (helper + open_source + 5 writer-rejection
  parametrizations); full geotiff suite: 5324 passed, 68 skipped.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review: PR #2337 (after commit 532d5bc)

Blockers

None.

Suggestions

None. The original Suggestion ("hoist function-scope imports") was applied for _writer.py and _sidecar.py (no cycle risk), and dismissed-with-reason for _backends/dask.py + _backends/gpu.py. The dismissal rationale is sound: the backend modules deliberately defer importing _reader / _sources so the package imports cleanly without urllib3 in dask-only environments. An explanatory comment now sits next to each remaining local import.

Nits

None. All three original Nits are now resolved:

  • urlparse is at module scope in _sources.py (no per-call import).
  • _is_http_source('http:foo') returning True is locked in by an explicit test, and _open_source('http:foo') -> UnsafeURLError is also asserted.
  • The writer rejects HTTP(S) destinations with a typed NotImplementedError in _write_bytes, covered by 5 parametrized cases across (http / https) x (lower / mixed / upper).

What looks good

  • The NotImplementedError in _write_bytes matches the existing wording in _write_streaming (lines 738-740), so the user sees a consistent message regardless of which entry point they hit.
  • Test count grew from 60 to 67 cases; full geotiff suite stays at 5324 passed, 68 skipped, no warnings introduced.
  • The remaining local imports in the backend modules now carry a one-line comment so the next reviewer does not flag them as oversight.

Checklist

All boxes still ticked; no new findings.

Disposition of original findings: 3 fixed in-PR (urlparse hoist, behavioural-broadening test, typed writer error), 1 partially fixed + partially dismissed-with-reason (imports hoisted where safe, kept function-scope with comment where backend isolation requires it).

Reconciles the case-insensitive HTTP(S) dispatch helper introduced on
both branches:

* main #2326 added ``_is_http_url`` in ``_sources.py`` and routed the
  geotiff dispatch sites through it.
* PR #2337 added the same fix as ``_is_http_source`` with broader test
  coverage, hoisted imports, and a typed ``NotImplementedError`` for
  HTTP(S) write targets in ``_write_bytes``.

Resolution keeps ``_is_http_source`` as the canonical name (used by
this PR's tests and the writer-side guard) and adds an ``_is_http_url``
alias in ``_sources.py``, plus a re-export from ``_reader.py``, so the
test file shipped with #2326 (``test_uppercase_scheme_ssrf_2323.py``)
keeps importing ``_is_http_url`` unchanged. The ``_sidecar.py`` wrapper
``_is_http_url`` now delegates to ``_is_http_source`` directly instead
of through the canonical alias.

Full geotiff suite: 5413 passed, 64 skipped, 1 xfailed, 1 xpassed.
@brendancol brendancol merged commit 1e314f8 into main May 24, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2321 sub-task 5)

1 participant