refactor(waterdata): Unify list and filter chunkers into one joint planner by thodson-usgs · Pull Request #283 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-18T15:24:44Z

Stacked on top of #280. The diff includes #280's chunker commits; rebase after #280 merges to see only the unification delta (~460 added / 470 removed = net −10 lines).

Replaces the two-decorator stack (@multi_value_chunked outside @filters.chunked) with a single planner that allocates URL byte budget across list dims and filter clauses together.

Why

The old two-layer design had a real suboptimality: the outer planner sized list chunks against the inner chunker's bail-floor (a single clause of the longest length). When both dims were long at the same time, the outer planner over-chunked the list, forcing the inner chunker to over-chunk the filter into many sub-requests. In pathological cases this inflated the sub-request count by up to ~10×.

It also carried accidental complexity from the cross-decorator coordination: _filter_aware_probe_args, _max_per_clause_encoding_ratio, _effective_filter_budget, and the bail-floor probe machinery existed only because the two layers couldn't see each other. These were the hardest-to-explain parts of the codebase.

Algorithm

Enumerate filter chunk counts k = 1, 2, 4, ..., n_clauses.
For each k, partition clauses into k balanced groups joined by OR and identify the worst (longest URL-encoded) group.
Substitute the worst group as the filter and plan list dims with greedy halving against the remaining budget.
Pick the candidate whose list_count × k is smallest.
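
The four steps above can be sketched as follows. This is a minimal illustration, not the code in chunking.py: the function names, the single list dimension, the `fixed_overhead` parameter, the `ValueError` (the PR raises `RequestTooLarge`), and the assumption that URL size is simply the sum of the `quote_plus`-encoded list and filter lengths are all simplifications made for the example.

```python
from urllib.parse import quote_plus


def joined_url_bytes(atoms, sep):
    """URL-encoded length of the atoms joined by sep (assumed sizing rule)."""
    return len(quote_plus(sep.join(atoms)))


def balanced_groups(items, k):
    """Split items into k contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


def halve_until_fits(values, budget):
    """Greedy halving: split the largest chunk until every chunk fits."""
    chunks = [list(values)]
    while True:
        worst = max(chunks, key=lambda c: joined_url_bytes(c, ","))
        if joined_url_bytes(worst, ",") <= budget:
            return chunks
        if len(worst) == 1:
            return None  # a single value already overflows the budget
        mid = len(worst) // 2
        i = chunks.index(worst)
        chunks[i:i + 1] = [worst[:mid], worst[mid:]]


def candidate_counts(n_clauses):
    """k = 1, 2, 4, ..., n_clauses (always ending on n_clauses itself)."""
    if n_clauses <= 1:
        return [1]
    ks, k = [], 1
    while k < n_clauses:
        ks.append(k)
        k *= 2
    return ks + [n_clauses]


def plan_joint(sites, clauses, url_limit, fixed_overhead=0):
    """Return (total_sub_requests, list_chunks, filter_groups)."""
    best = None
    for k in candidate_counts(len(clauses)):
        groups = balanced_groups(clauses, k) if clauses else [[]]
        worst = max(joined_url_bytes(g, " OR ") for g in groups)
        remaining = url_limit - fixed_overhead - worst
        if remaining <= 0:
            continue  # this filter chunking leaves no room for list values
        list_chunks = halve_until_fits(sites, remaining)
        if list_chunks is None:
            continue
        total = len(list_chunks) * k
        if best is None or total < best[0]:
            best = (total, list_chunks, groups)
    if best is None:
        raise ValueError("no plan fits url_limit")  # stand-in for RequestTooLarge
    return best
```

Because every candidate k is scored by the product `list_count × k`, a larger k only wins when the budget it frees lets the list dimension shrink by more than the filter fan-out costs.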

Net code change (delta vs PR #280's state)

filters.py: −110 lines (retired chunked decorator + _effective_filter_budget + _max_per_clause_encoding_ratio + _NON_FILTER_URL_HEADROOM)
chunking.py: roughly flat (joint planner adds the budget-search loop; bail-floor coordination machinery and _filter_aware_probe_args removed)
utils.py: −1 line (unstacked decorators on _fetch_once)
Tests: rewrites of cross-decorator coordination tests collapse into joint-planner tests; new URL-construction stress test added.

Total: −12 lines net, with the structurally hardest-to-explain coordination layer gone.

Regression test for URL construction

test_joint_planner_url_construction_long_filter_and_long_sites exercises the joint planner with 500 USGS site IDs + 20 datetime OR-clauses using the real _construct_api_requests builder (not a fake). The test asserts:

Every sub-request URL stays under the 8000-byte limit.
Filter partitions cover every original clause exactly once.
List partitions cover every original site exactly once.
Total sub-request count beats the bail-floor-style worst case (500 × 20 = 10,000 → joint planner reduces to <500, and in this case finds an optimal 2 sub-requests with no filter chunking needed).
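
The two "exactly once" assertions amount to a multiset comparison between the original items and the flattened partitions. A minimal sketch of such a check is below; `covers_exactly_once` is a hypothetical helper name for illustration, not one taken from the test suite.

```python
from collections import Counter


def covers_exactly_once(original, partitions):
    """True when the partitions, flattened, contain each original item
    exactly as many times as the original does (no drops, no duplicates)."""
    flat = [item for part in partitions for item in part]
    return Counter(flat) == Counter(original)
```

Comparing `Counter` objects rather than sets catches duplicated items as well as dropped ones, which a plain set-equality check would miss.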

Live API verified

The canonical doc example (Ohio Stream sites → daily discharge for P7D) runs end-to-end against the live USGS API. 2,888 sites chunk into 12 sub-requests, 1,455 rows of daily discharge returned, canonical md.url preserved (58,138 bytes), cumulative md.query_time accurate.

What was preserved

RequestTooLarge and QuotaExhausted exception shapes — same caller-facing contract.
Helper functions (_split_top_level_or, _chunk_cql_or, _is_chunkable, _check_numeric_filter_pitfall, _combine_chunk_frames, _combine_chunk_responses) stay in filters.py as primitives the joint planner uses.
Lexicographic-comparison pitfall guard still fires on chunkable filters (moved into _plan_joint).
Quota-floor safety check between sub-requests.
Canonical URL restoration so md.url always reflects the user's original query.

Coordination with PR #282 (resume)

PR #282's ChunkManifest.completed indexes into a list-only cartesian product. With the joint planner, the index space grows by a factor of len(filter_chunks). This is a mechanical update to the manifest plan representation (add a filter_chunks dim), not a design change. To be addressed in a follow-up rebase of #282.

For multi-value waterdata queries (e.g. monitoring_location_id with ~300+ sites), the GET URL produced by PR DOI-USGS#233 blows past the server's ~8 KB nginx buffer and the API returns HTTP 414. This PR adds a chunker that transparently splits long list params across sub-requests so each URL fits the byte budget.

The chunker is a decorator applied to ``_fetch_once`` outside the existing ``@filters.chunked`` (CQL chunker), so list-chunking is the outer loop and filter-chunking is the inner loop:

    @chunking.multi_value_chunked(build_request=_construct_api_requests)
    @filters.chunked(build_request=_construct_api_requests)
    def _fetch_once(args): ...

Key design points:

- ``_plan_chunks`` greedy-halves the largest chunk across all dimensions until the worst-case sub-request fits ``url_limit`` (URL + body, via ``_request_bytes``, so POST routes are sized correctly). Cartesian product of per-dim partitions becomes the sub-request set; capped at ``max_chunks=1000``.
- ``_filter_aware_probe_args`` coordinates with ``filters.chunked``: the planner probes URL length using a synthetic clause that matches the inner filter chunker's bail-floor size (longest single clause, scaled by worst-case URL encoding ratio). Without this coordination, the outer planner would raise ``RequestTooLarge`` on combinations the stacked chunkers can actually handle.
- ``QuotaExhausted`` mid-call guard reads ``x-ratelimit-remaining`` after each sub-request; if it drops below ``quota_safety_floor=50``, the wrapper raises with the partial frame, completed-chunk offset, and last observed remaining quota, letting callers salvage or resume after the rate-limit window resets rather than crash into a silent mid-pagination 429.
- ``RequestTooLarge`` is raised when the smallest reducible plan still exceeds ``url_limit`` (every multi-value param at a singleton chunk and any chunkable filter at the inner chunker's bail floor) or when the cartesian product exceeds ``max_chunks``.
- All defaults (``url_limit``, ``max_chunks``, ``quota_safety_floor``) resolve at call time, so monkey-patching ``filters._WATERDATA_URL_BYTE_LIMIT`` for tests / non-default quotas affects the decorator uniformly.

Public additions:

- ``dataretrieval.waterdata.chunking.multi_value_chunked``
- ``dataretrieval.waterdata.chunking.RequestTooLarge``
- ``dataretrieval.waterdata.chunking.QuotaExhausted`` (carries ``partial_frame``, ``partial_response``, ``completed_chunks``, ``total_chunks``, ``remaining``)

Tests (30 new):

- ``_filter_aware_probe_args`` worst-case-clause modelling
- ``_plan_chunks`` greedy halving, RequestTooLarge floor, filter-chunker coordination, ``max_chunks`` cap, lazy-default reads
- ``multi_value_chunked`` pass-through, cartesian-product shape, end-to-end with stacked filter chunker
- ``QuotaExhausted`` header parsing, mid-call abort, last-chunk no-abort, zero-floor disable
- ``RequestTooLarge`` message contents and triggering conditions

End-to-end correctness verified against the live API: identical per-site cell-for-cell output between unchunked (single call) and chunked (forced fan-out via patched limit) paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
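
The quota guard described in this commit message can be illustrated with a short sketch. It is an assumption-laden stand-in, not the wrapper in chunking.py: `QuotaExhaustedSketch`, `run_with_quota_guard`, and the `fetch` callable returning a `(payload, headers)` pair are hypothetical names, and only the header name, the default floor of 50, the zero-disables rule, and the skip on the last sub-request come from the PR text.

```python
class QuotaExhaustedSketch(RuntimeError):
    """Hypothetical stand-in for the PR's QuotaExhausted exception."""

    def __init__(self, partial, completed_chunks, total_chunks, remaining):
        super().__init__(
            f"quota floor hit after {completed_chunks}/{total_chunks} "
            f"sub-requests ({remaining} remaining)"
        )
        self.partial = partial
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining


def run_with_quota_guard(sub_requests, fetch, quota_safety_floor=50):
    """Issue sub-requests in order, stopping early when quota runs low."""
    if quota_safety_floor < 0:
        raise ValueError("quota_safety_floor must be >= 0 (0 disables the guard)")
    results = []
    total = len(sub_requests)
    for i, sub in enumerate(sub_requests):
        payload, headers = fetch(sub)
        results.append(payload)
        if i == total - 1:
            continue  # last sub-request: nothing left to protect
        raw = headers.get("x-ratelimit-remaining")
        if quota_safety_floor == 0 or raw is None:
            continue  # guard disabled, or the server sent no quota header
        remaining = int(raw)
        if remaining < quota_safety_floor:
            # Hand back what was fetched so the caller can salvage or resume.
            raise QuotaExhaustedSketch(results, i + 1, total, remaining)
    return results
```

A caller would catch the exception and read `partial` and `completed_chunks` to decide whether to keep the partial result or resume after the rate-limit window resets.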

Two correctness gaps surfaced in review:

1. ``limit`` and ``skip_geometry`` are scalars by contract (``int | None`` and ``bool | None``) but a list smuggled through type erasure (e.g. ``limit=["100","200"]`` slipping past _normalize_str_iterable when elements happen to be strings) would be picked up by ``_chunkable_params`` and fanned into multiple sub-requests with conflicting per-request caps. Add both to ``_NEVER_CHUNK`` so the chunker leaves scalar-by-contract params alone.
2. ``quota_safety_floor=0`` is the documented "disable the guard" sentinel, but negative values were accepted silently and also disabled the guard, obscuring caller intent. Reject at decoration time, parallel to ``_plan_chunks``'s ``max_chunks < 1`` check.

…anner

Replaces the two-decorator stack (@multi_value_chunked outside @filters.chunked) with a single planner that allocates URL byte budget across list dims and filter clauses together. Same correctness guarantees, fewer sub-requests when the previous design forced the inner filter chunker to bail at its singleton-clause floor while the outer list chunker held the bulk of the URL budget.

Algorithm:

- Enumerate filter chunk counts k = 1, 2, 4, ..., n_clauses.
- For each k, partition clauses into k balanced groups joined by OR and identify the worst (longest URL-encoded) group.
- Substitute the worst group as the filter and plan the list dims with greedy halving against the remaining byte budget.
- Pick the candidate whose list_count × k is smallest.

Net code shrinks: -50 lines in filters.py (retired the chunked decorator and _effective_filter_budget), +30 in chunking.py for the joint planner (offset by removing _filter_aware_probe_args and the bail-floor coordination machinery), unstack the decorator pair on _fetch_once.

Two existing cross-decorator coordination tests collapse into joint-planner tests (mismatched-clause-length probing was the hardest-to-explain artefact of the old design; it is gone now).

New regression test: ``test_joint_planner_url_construction_long_filter_and_long_sites`` exercises the planner with 500 USGS site IDs + 20 datetime OR-clauses using the real ``_construct_api_requests`` builder. Confirms every sub-request URL stays under 8000 bytes, filter partitions cover every clause exactly once, list partitions cover every site exactly once, and the total sub-request count beats the naive bail-floor-style worst case by ≥4×.

Live API verified: Ohio Stream sites (2888) → daily discharge (P7D) chunks into 12 sub-requests with canonical URL preserved and cumulative query_time accurate.

…both chunking dims

The two chunking dimensions (list values and CQL OR-clauses) shared an obvious primitive: "URL-encoded byte length of atoms joined by a separator." Extract _joined_url_bytes(atoms, sep); list dims call it with "," and filter dims call it with " OR ". _chunk_bytes collapses to a one-liner using the helper, and the inline len(quote_plus(c or "")) in the joint planner becomes _joined_url_bytes(group, " OR ").

Partition shape also unifies: _partition_clauses now returns list[list[str]] (raw atom groups) instead of pre-joined strings. The joint planner sizes candidates by _joined_url_bytes on the raw groups and joins only the winning groups for the wrapper to iterate, so discarded partition candidates never pay the join cost.

Side cleanups motivated by the /simplify review:

- Add "filter" to _NEVER_CHUNK so _chunkable_params doesn't need a k != "filter" special case alongside the frozenset check.
- Drop the redundant filter_chunkable variable in _plan_joint; derive from len(clauses) >= 2.
- Bug fix in _plan_joint: when there are no list dims to shrink and the filter alone overflows the URL limit, the planner used to pick k=1 and emit one over-limit sub-request. Now it verifies the request fits with the chosen filter chunking before accepting that k.

Dead code removal:

- _chunk_cql_or and _CQL_FILTER_CHUNK_LEN in filters.py had zero production callers after the joint planner subsumed their role. Deleted, with their 4 unit tests.
- 4 _effective_filter_budget tests (function already deleted in the unification commit) and their _build_request / _WATERDATA_URL_BYTE_LIMIT test scaffolding.

Test rewrites: the three end-to-end tests that previously mocked _effective_filter_budget (long_filter fan-out, dedup, empty-chunk GeoDataFrame preservation) now exercise the joint planner directly via a filter-size-aware fake URL builder. Same invariants, no mock of removed internals.

Net diff: -180 lines across 4 files (-72 production, -108 tests).
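
The shared primitive and the "join only the winner" idea from this commit can be sketched in a few lines. The helper below mirrors the description (URL-encoded length of atoms joined by a separator) but is written for illustration; whether the real `_joined_url_bytes` encodes separators exactly this way is an assumption, and `pick_partition` is a hypothetical name.

```python
from urllib.parse import quote_plus


def joined_url_bytes(atoms, sep):
    """URL-encoded length of the atoms joined by sep."""
    return len(quote_plus(sep.join(atoms)))


def pick_partition(candidates, budget):
    """Size each candidate on its raw groups; join only the one that wins.

    candidates is an iterable of partitions (list[list[str]]), ordered from
    fewest groups to most. Returns the joined filter strings, or None.
    """
    for groups in candidates:
        worst = max(joined_url_bytes(g, " OR ") for g in groups)
        if worst <= budget:
            return [" OR ".join(g) for g in groups]
    return None
```

For example, `joined_url_bytes(["a", "b"], ",")` measures the encoded string `a%2Cb`, so the comma costs three bytes rather than one; sizing on the encoded form is what keeps encoding-heavy clauses from slipping past the limit.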

Three small extractions and one minor optimization. No behavior change; 130 chunker/filter tests stay green.

_iter_sub_args generator yields per-sub-request args dicts; the wrapper's nested-loop-with-manual-counter collapses to ``for i, sub_args in enumerate(...)``. The "is this the last sub-request" branch in the quota-floor check flips to ``if i == total - 1: continue`` so the gate is a guard clause rather than the body of an inverse condition.

_finalize_response folds the ``_combine_chunk_responses(responses); response.url = canonical_url`` pattern (used in both the success path and the QuotaExhausted partial-state payload) into one helper.

_filter_candidates generator emits ``(filter_chunks, worst_filter)`` pairs for each candidate filter chunk count; ``_plan_joint`` then iterates candidates uniformly without the ``if filter_chunkable: ... else: ...`` fork. The redundant ``filter_chunkable`` flag is gone; ``len(clauses) >= 2`` is the single truth.

Per-iteration optimization: ``{**args, **list_overrides}`` was being recomputed for every filter chunk; now built once per outer combo and reused (or shallow-overridden when a filter substitution applies).

Module constants ``_LIST_SEP = ","`` and ``_OR_SEP = " OR "`` replace the scattered string literals, so the two chunking dimensions are now self-documenting at every call site that sizes them.
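
A generator along the lines described for _iter_sub_args might look like the sketch below. The signature and the shape of `list_plan` (a dict mapping each list-dim name to its list of chunks) are assumptions for illustration; the point shown is that the merged args dict is built once per list combo and only shallow-overridden per filter chunk.

```python
from itertools import product


def iter_sub_args(args, list_plan, filter_chunks):
    """Yield one args dict per sub-request.

    args          -- the caller's original keyword arguments
    list_plan     -- {dim_name: [chunk, chunk, ...]} for each chunked list dim
    filter_chunks -- list of filter strings, or an empty list for no filter split
    """
    for combo in product(*list_plan.values()):
        # Built once per outer combo, then reused for every filter chunk.
        base = {**args, **dict(zip(list_plan, combo))}
        if not filter_chunks:
            yield base
            continue
        for chunk in filter_chunks:
            yield {**base, "filter": chunk}
```

The wrapper can then number sub-requests with `enumerate(iter_sub_args(...))` instead of maintaining a manual counter across nested loops.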

…sub_args

Three micro-refinements after the previous pass settles. No behavior change; 130 tests stay green.

- Extract _resolve_max_chunks() so the default + validation rule for ``max_chunks`` lives in one place, called from both _plan_list_chunks and _plan_joint. The 5-line if-None/if-<1 block was duplicated verbatim.
- _iter_sub_args drops its explicit ``list_keys = list(list_plan)`` cache; iterating ``list_plan`` directly gives the same insertion-order sequence (Python 3.7+ dict guarantee), and ``zip(list_plan, combo)`` reads as "pair each list-dim name with its chunk for this combo".
- Tighten the wrapper's option resolution to the "default if None else passed" form so each line reads in argument order.
- Categorize the _NEVER_CHUNK comment so future additions land in the right category instead of a flat narrative.
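
The extracted default-plus-validation rule could take roughly the following form. This is a guess at the shape rather than the PR's code: the constant name `_DEFAULT_MAX_CHUNKS` and the error message are hypothetical, while the default of 1000 and the `max_chunks < 1` rejection come from the earlier commit messages.

```python
_DEFAULT_MAX_CHUNKS = 1000  # hypothetical constant name; value from the PR text


def resolve_max_chunks(max_chunks=None):
    """Apply the default, then validate, in one place."""
    # Read the module constant at call time so tests can monkeypatch it.
    value = _DEFAULT_MAX_CHUNKS if max_chunks is None else max_chunks
    if value < 1:
        raise ValueError(f"max_chunks must be >= 1, got {value}")
    return value
```

Both planners can then call this one helper instead of repeating the if-None / if-less-than-one block.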

…rule

After investigating: the OGC getters expose ~94 list-shaped params, all chunkable. The current 13-entry denylist captures every exception. An allowlist would be ~7x longer and would need updating every time USGS adds a column.

Reframe the comment to state the default rule first ("any list-shaped kwarg gets chunked"), then enumerate the exceptions by reason (response-shaping, structured, intervals, handled-elsewhere, scalar-by-contract). Reads as "here are the few cases the default-chunk rule doesn't apply" rather than "here is an arbitrary exclusion set."

Standalone runner (``python3 tests/stress_chunker.py``) that exercises the chunker across eight scenarios with the URL byte limit lowered well below the live API's. No live HTTP: it mocks fetch_once and uses the real _construct_api_requests for URL sizing.

Per-scenario invariants verified:

1. Every sub-request URL ≤ url_limit (primary correctness).
2. List-dim coverage: the union of distinct chunks issued for each list dim equals the input with no overlap (no data dropped, no duplicate fetches of the same atom within its dim).
3. Filter-clause coverage: the distinct filter chunks split back into clauses, concatenated in iteration order, equal the original clauses (lossless OR-disjunction).
4. Speedup vs the bail-floor-singleton baseline that the old two-decorator design would have produced in pathological cases.

Plus a greedy-search adaptation check: scanning ``url_limit`` across 1200 → 10000 confirms sub-request count is monotonically non-increasing as the budget grows (the planner adapts to the limit).

Scenarios:

A. Long sites only (pure list chunking)
B. Long filter only (pure filter chunking)
C. Long sites + long filter (joint trade-off; 1000× vs baseline)
D. 3-D list cartesian product (3000× vs baseline)
E. Lopsided clause sizes (worst-case sizing)
F. URL-encoding-heavy clauses (quote_plus inflation)
G. Very tight URL limit (singleton chunks)
H. Generous URL limit (no chunking needed)
I. url_limit sweep proving greedy adaptation

All 15 chunked calls pass every invariant.

…, trim sweep

Profile showed `_construct_api_requests` (PreparedRequest building) dominated the stress test's runtime: 421 calls / ~152ms of the ~290ms profile time. ~75 of those calls came from ``assert_urls_fit`` re-walking every captured sub-request to rebuild its URL after the chunker had already built it during planning.

Two simple changes:

- ``run_chunked`` now returns a parallel ``url_bytes_seen`` list; the mock ``fetch_once`` captures the built URL's byte count once during execution. ``assert_urls_fit`` just compares ints instead of rebuilding PreparedRequests.
- The url_limit sweep dropped from 7 points × (150 sites, 30 clauses) to 5 points × (100 sites, 20 clauses). Monotonicity reads just as clearly with the smaller grid; the curve (8 → 2 → 2 → 1 → 1) is unambiguous.

Result: 118ms → 53ms per run. 13 chunked calls, every invariant still holds.

thodson-usgs

Pay close attention to the layout: are all variables and functions placed logically into their modules, or has the logic been mixed up?

thodson-usgs · 2026-05-18T18:42:44Z

+- Chunkable dims include multi-value list params (sites, parameter
+  codes, ...) and the cql-text ``filter`` (split at top-level ``OR``
+  to keep each chunk valid CQL).
+- The planner enumerates candidate filter chunk counts


what is k here?

thodson-usgs · 2026-05-18T18:45:13Z

+class QuotaExhausted(RuntimeError):
+    """Raised mid-chunked-call when the API's reported remaining quota
+    (``x-ratelimit-remaining`` header) drops below the configured safety
+    floor. The chunker stops before issuing the next sub-request to
+    avoid a mid-call HTTP 429 that would silently truncate paginated
+    results.


This seems like a bug. A mid-call HTTP 429 should not silently truncate. If it does, fix it, then we won't need to defend against this case.

thodson-usgs · 2026-05-18T18:49:53Z

-# per-request budget from ``_WATERDATA_URL_BYTE_LIMIT``.
-_CQL_FILTER_CHUNK_LEN = 5000
-
 # Empirically the API replies HTTP 414 above ~8200 bytes of full URL —


Should this be moved to a different module now?

…r helpers, clarify docs

Three review responses bundled together:

- chunking.py module docstring: define ``k`` as the candidate filter chunk count before using it in the planner description.
- ``QuotaExhausted`` docstring: drop the stale "silently truncate" framing. PR DOI-USGS#273 / DOI-USGS#279 already raise on a mid-pagination 429, so this exception is the structured-recovery alternative (partial frames in hand) rather than a defense against silent truncation.
- Move chunker-only orphans from filters.py to chunking.py: ``_WATERDATA_URL_BYTE_LIMIT`` (the URL byte ceiling), ``_FetchOnce`` TypeVar, ``_combine_chunk_frames``, and ``_combine_chunk_responses``. filters.py was a leftover home from the pre-unification two-decorator stack; these helpers have no callers outside the chunker. Test ``test_multi_value_chunked_lazy_url_limit`` now monkeypatches the constant on its new module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three test docstrings/comments still framed their reasoning against the removed two-decorator stack (PR DOI-USGS#283 unified them). Reword to describe the current joint-planner behavior on its own terms:

- ``test_plan_joint_fans_out_filter_when_list_alone_cannot_fit``: drop the "previous two-decorator design" aside.
- ``test_chunkable_params_skips_filter_passed_as_list``: rewrite the "inner filters.chunked is the only place that may shrink filter" line to point at ``_plan_joint``.
- ``stress_chunker._bail_floor_baseline``: reframe the baseline as "degenerate singleton plan" rather than "worst case the old two-decorator design produced."

No behavioral changes; prose only. Chunker tests + offline stress test still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thodson-usgs and others added 9 commits May 17, 2026 11:44


thodson-usgs and others added 2 commits May 18, 2026 14:17

thodson-usgs wants to merge 11 commits into DOI-USGS:main from thodson-usgs:chunker-unified
