GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2332)#2337
Conversation
Several call sites in xrspatial/geotiff/ classified HTTP(S) sources
with startswith(('http://', 'https://')). That comparison is
case-sensitive, so URLs like HTTP://169.254.169.254/... slipped past
_HTTPSource and the SSRF / DNS-pinning checks in _validate_http_url
and fell through to fsspec. RFC 3986 section 3.1 defines URI schemes
as case-insensitive.
Adds _is_http_source(s) in _sources.py, backed by
urlparse(s).scheme.lower() in ('http', 'https'), and routes every
dispatch site through it: _sources.py (_is_fsspec_uri + _open_source),
_reader.py, _writer.py (_is_fsspec_uri), _backends/dask.py,
_backends/gpu.py (eager-via-CPU branch + GDS opt-in gate), and the
_sidecar.py _is_http_url helper.
New regression tests in test_http_scheme_case_2321.py cover the
helper, _open_source dispatch for HTTP / HTTPS / Http / hTTpS, the
partner _is_fsspec_uri classifiers in both _sources and _writer, the
sidecar helper, and the end-to-end SSRF rejection of uppercase URLs
that resolve to 127.0.0.1, 169.254.169.254, 10.0.0.1, 192.168.1.1,
0.0.0.0, and localhost. socket.getaddrinfo is monkeypatched so the
tests stay offline.
Parent: #2321 sub-task 5.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: GeoTIFF: case-insensitive HTTP(S) scheme routing for SSRF protection (#2332)
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
-
Repeated
from ._sources import _is_http_sourceinside function bodies at_backends/dask.py:197,_backends/gpu.py:319(top ofread_geotiff_gpu),_backends/gpu.py:1057(inside_gds_chunk_path_available),_sidecar.py:50(inside_is_http_url), and_writer.py:1248(inside_is_fsspec_uri) all do per-call imports._reader.pyalready imports_is_http_sourceat module scope. Local imports add ~1us per dispatch and obscure the dependency graph. Consider promoting at least the two hot ones (_backends/dask.py:197,_writer.py:1248) to module scope, or folding them into the top-of-function import block that already exists inread_geotiff_gpuat line 315.Counter-argument: the
gpu.pyblock at line 315 already does function-scope imports because the backend modules avoid eager-importing each other. Following that convention is reasonable. If you keep them, a one-line comment ("local to avoid import cycles") would help the next reader.
Nits (optional improvements)
-
urlparseimport inside_is_http_sourceat_sources.py:1474. The function importsurlparseon every call.urllib.parseis already used at module top (line 468 in_validate_http_url). Hoistfrom urllib.parse import urlparseto the module top so the import cost only happens once._is_http_sourceruns once per dispatch, so the perf delta is tiny. -
Behavioural broadening note. The new helper accepts schemes-only-with-colon (
http:fooreturns True; previously thestartswithreturned False). The validator still rejects these as "no hostname" so the security posture is unchanged, but a one-line test asserting_is_http_source('http:foo')and then_open_source('http:foo')raisesUnsafeURLErrorwould lock in the new behaviour. Not a blocker; the new behaviour is more RFC-correct. -
Writer error type for uppercase HTTP write targets. After this PR an
HTTP://...write target falls through_is_fsspec_uri(correctly returns False) and then through the file-like check, landing at the local file write path on line 1297. The resulting error isOSError: [Errno 22] Invalid argument(or similar OS-specific) rather than a typed "writes not supported over HTTP". Consider adding a_is_http_source(path)check at_writer.py:1273that raisesNotImplementedError("writes are not supported over HTTP"). Out of scope for this PR; pre-existing issue, just made more reachable.
What looks good
- The helper is small, documented, and replaces seven case-sensitive
startswithcall sites with one. Theurlparse(...).scheme.lower() in ('http', 'https')approach handleshttp:(scheme without//), leading whitespace, and host case automatically. - The test file is thorough: helper parametrized across mixed-case schemes, end-to-end
_open_sourcedispatch with__init__interception (no real network), and parametrized SSRF rejection across (scheme casing) x (private IP). All offline viasocket.getaddrinfomonkeypatch. - The
_sidecar.py_is_http_urlhelper now delegates to the same helper, so future drift between the two is harder. - No public API change. Backend coverage (numpy / cupy / dask+numpy / dask+cupy) is preserved on every dispatch path.
- Full geotiff test suite passes (5317 / 5317 non-network tests).
Checklist
- Algorithm matches reference (RFC 3986 sect 3.1, case-insensitive schemes)
- All implemented backends produce consistent results (single helper used everywhere)
- NaN handling: N/A
- Edge cases covered (mixed case, non-string, empty string, non-HTTP schemes)
- Dask chunk boundaries: N/A
- No premature materialization or unnecessary copies
- Tests added (60 new test cases)
- Docstrings present and accurate
- README feature matrix: N/A (no new public function)
#2332) Follow-ups on the /review-pr pass for PR #2337: * Hoist ``from urllib.parse import urlparse`` to ``_sources.py`` module top so ``_is_http_source`` no longer pays the import cost per call. * Hoist ``from ._sources import _is_http_source`` to module top in ``_writer.py`` and ``_sidecar.py`` (no cycle: neither is imported by ``_sources``). * Keep the function-scope imports in ``_backends/dask.py`` and ``_backends/gpu.py``: the backend modules deliberately defer the reader / sources import so the package can be imported without urllib3 in dask-only environments. Add a one-line comment so the next reader does not assume it is an oversight. * Reject HTTP(S) write targets in ``_write_bytes`` with a typed ``NotImplementedError`` before the local file write path interprets the URL as a filename. The error message lists the supported destinations. Covers both lowercase and uppercase scheme casing. * Lock the behavioural broadening: ``_is_http_source('http:foo')`` now returns True (urlparse-correct, broader than the old ``startswith``). Added a test asserting the classifier matches and that ``_open_source('http:foo')`` raises ``UnsafeURLError`` so the validator still rejects it as "no hostname". * Added 7 new test cases (helper + open_source + 5 writer-rejection parametrizations); full geotiff suite: 5324 passed, 68 skipped.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review: PR #2337 (after commit 532d5bc)
Blockers
None.
Suggestions
None. The original Suggestion ("hoist function-scope imports") was applied for _writer.py and _sidecar.py (no cycle risk), and dismissed-with-reason for _backends/dask.py + _backends/gpu.py. The dismissal rationale is sound: the backend modules deliberately defer importing _reader / _sources so the package imports cleanly without urllib3 in dask-only environments. An explanatory comment now sits next to each remaining local import.
Nits
None. All three original Nits are now resolved:
urlparseis at module scope in_sources.py(no per-call import)._is_http_source('http:foo')returning True is locked in by an explicit test, and_open_source('http:foo')->UnsafeURLErroris also asserted.- The writer rejects HTTP(S) destinations with a typed
NotImplementedErrorin_write_bytes, covered by 5 parametrized cases across (http / https) x (lower / mixed / upper).
What looks good
- The
NotImplementedErrorin_write_bytesmatches the existing wording in_write_streaming(lines 738-740), so the user sees a consistent message regardless of which entry point they hit. - Test count grew from 60 to 67 cases; full geotiff suite stays at 5324 passed, 68 skipped, no warnings introduced.
- The remaining local imports in the backend modules now carry a one-line comment so the next reviewer does not flag them as oversight.
Checklist
All boxes still ticked; no new findings.
Disposition of original findings: 3 fixed in-PR (urlparse hoist, behavioural-broadening test, typed writer error), 1 partially fixed + partially dismissed-with-reason (imports hoisted where safe, kept function-scope with comment where backend isolation requires it).
Reconciles the case-insensitive HTTP(S) dispatch helper introduced on both branches: * main #2326 added ``_is_http_url`` in ``_sources.py`` and routed the geotiff dispatch sites through it. * PR #2337 added the same fix as ``_is_http_source`` with broader test coverage, hoisted imports, and a typed ``NotImplementedError`` for HTTP(S) write targets in ``_write_bytes``. Resolution keeps ``_is_http_source`` as the canonical name (used by this PR's tests and the writer-side guard) and adds an ``_is_http_url`` alias in ``_sources.py``, plus a re-export from ``_reader.py``, so the test file shipped with #2326 (``test_uppercase_scheme_ssrf_2323.py``) keeps importing ``_is_http_url`` unchanged. The ``_sidecar.py`` wrapper ``_is_http_url`` now delegates to ``_is_http_source`` directly instead of through the canonical alias. Full geotiff suite: 5413 passed, 64 skipped, 1 xfailed, 1 xpassed.
Summary
startswith(('http://', 'https://'))was case-sensitive, soHTTP://169.254.169.254/...slipped past_HTTPSourceand the SSRF / DNS-pinning checks and fell through to fsspec. Per RFC 3986 section 3.1, URI schemes are case-insensitive._is_http_source(s)helper in_sources.pybacked byurlparse(s).scheme.lower() in ('http', 'https')and routes every dispatch site through it:_sources.py(_is_fsspec_uri,_open_source),_reader.py,_writer.py(_is_fsspec_uri),_backends/dask.py,_backends/gpu.py(eager-via-CPU branch + GDS gate), and the_sidecar.py_is_http_urlhelper.Test plan
test_http_scheme_case_2321.py: helper,_open_sourcedispatch forHTTP/HTTPS/Http/hTTpS, both_is_fsspec_uriclassifiers, the sidecar helper, and end-to-end SSRF rejection of uppercase URLs that resolve to127.0.0.1,169.254.169.254,10.0.0.1,192.168.1.1,0.0.0.0, andlocalhost.socket.getaddrinfois monkeypatched so tests stay offline.test_ssrf_hardening_1664.pystill passes (43 / 43).xrspatial/geotiff/tests/suite (excluding the golden-corpus HTTP test that pulls real fixtures): 5317 passed, 68 skipped.Closes #2332.
Parent: #2321 sub-task 5.