feat(waterdata): drop hash-valued ID columns by default#281
Draft
thodson-usgs wants to merge 6 commits into
Draft
feat(waterdata): drop hash-valued ID columns by default#281thodson-usgs wants to merge 6 commits into
thodson-usgs wants to merge 6 commits into
Conversation
The waterdata OGC services previously returned per-record UUID columns (``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash columns (``time_series_id``, ``parent_time_series_id``, ``field_visit_id``, ``field_measurements_series_id``) in every response. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like ``monitoring_location_id`` (AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time`` are sufficient to pin a row. Drop the hash columns by default and add ``include_hash_ids: bool = False`` to every OGC ``get_*`` function for opt-in. Implementation trims server-side via the OGC ``properties=`` query parameter (cached per service from one queryables fetch) so the payload itself is smaller, with a client-side drop as a safety net. ``monitoring_location_id`` and other AGENCY-ID / code-style identifiers are unaffected. Offline benchmark on a synthetic 30,000-row payload (mirrors the on-wire shape and per-row size of a 1-year ``get_continuous`` query): - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller) - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller) - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller) - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster) Network savings stack on top of the local speedup. For very small queries (≲1k rows) the one-time queryables fetch overhead can dominate the savings; large queries are the target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_range The OGC ``get_*`` functions in the prior commit drop hash columns through ``get_ogc_data``. The statistics services (which return JSON through ``get_stats_data`` rather than OGC features) bypassed that path, so ``get_stats_por`` and ``get_stats_date_range`` were still returning ``computation_id`` (UUID) and ``parent_time_series_id`` (hex hash) by default. This commit: - Adds ``computation_id`` to ``_HASH_ID_COLUMNS`` (``parent_time_series_id`` was already there). - Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``, ``get_stats_por``, and ``get_stats_date_range``. - Drops the hash columns at the end of ``get_stats_data``, after ``_expand_percentiles`` (which still needs ``computation_id`` as a join key while it explodes the percentile lists into rows). - Updates ``test_get_stats_por_expanded_false`` / ``test_get_stats_date_range`` to reflect the new column count and adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in. Discovered while running a live-API sweep across every public waterdata ``get_*`` function — the OGC services now pass, the stats ones used to leak, and this commit closes that gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two mocked-response tests for ``get_stats_data``: - ``test_get_stats_data_drops_hash_ids_by_default`` asserts ``computation_id`` and ``parent_time_series_id`` are removed when ``include_hash_ids=False`` (the new default). - ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the opt-in path preserves them, matching the legacy behavior. Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the fake response only needs to carry the column shape, not the full nested-percentile body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A live-API column-content scan found that the Samples CSV service returns two UUID-valued columns by default: - ``Activity_ActivityIdentifier`` (per-activity UUID) - ``Result_MeasureIdentifier`` (per-measurement UUID) These weren't covered by the prior OGC / stats commits because ``get_samples`` parses CSV directly without going through ``get_ogc_data`` or ``get_stats_data``. This commit: - Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``. - Plumbs ``include_hash_ids: bool = False`` through ``get_samples`` and drops the named columns from the parsed CSV before returning. - Updates ``test_mock_get_samples`` to reflect the new column count (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for the opt-in path. - Updates ``test_samples_results`` and ``test_samples_activity`` similarly. Stable identifiers (``Org_Identifier``, ``Location_Identifier``, ``Project_Identifier``, ``USGSpcode``, …) are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-PR-281 cleanup from a code review pass. No behavior change for users; the three code paths that drop hash columns (OGC, stats, samples) now share one helper, and a few smaller wins. - New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper replaces three near-identical drop blocks across ``_arrange_cols``, ``get_stats_data``, and ``get_samples``. Uses ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection) in place of the per-call list-comprehension + ``if drop_cols:`` truthiness guard. ``df.drop`` accepts an empty Index, so the guard was unnecessary. - ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples branch now calls the helper instead of touching the constant. - New ``_properties_unspecified(properties)`` extracts the ``properties is None or all(pd.isna(properties))`` predicate so the five call sites that need it stop drifting between ``not properties`` and ``properties is None`` variants. - ``_default_non_hash_properties`` memoizes its result by ``(service, output_id)`` via ``_default_props_cache``. The previous version rebuilt a ~30–100-item list on every OGC call after the queryables cache was warm; this saves the per-call list rebuild on the hot path. - ``get_ogc_data`` flattens its nested ``try/except + if/else`` branches into a single ``use_server_trim`` flag, removing the duplicated ``_switch_properties_id`` assignment. - The benchmark no longer clears ``_queryables_cache`` between measured rounds. Clearing per round only penalizes the default path (the legacy path doesn't consult the cache), so the previous comparison was pessimistic against the default. Real-world callers pay the queryables fetch once per process. - The benchmark's ``capture_fixtures`` now calls ``_default_non_hash_properties`` directly instead of reimplementing the filter, so the "trimmed" fixture matches what the runtime actually sends bit-for-bit. All 157 unit and mock tests pass after the refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``benchmarks/`` directory was useful during development for measuring the hash-ID-drop impact, but it's not part of the runtime or test surface and the numbers are captured in the PR description. Removing to keep the diff focused on the library change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The waterdata services previously returned hash-valued ID columns (UUIDs and hex digests) in every response across three distinct code paths. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like
monitoring_location_id(AGENCY-ID format),parameter_code,statistic_id, andtimeare sufficient to pin a row.This PR drops the hash columns by default across every public
get_*function and addsinclude_hash_ids: bool = Falsefor opt-in. Stable AGENCY-ID and code-style identifiers are unaffected.What gets dropped
get_daily,get_continuous,get_latest_*,get_peaks,get_field_measurements,get_field_measurements_metadata,get_channel,get_time_series_metadata,get_combined_metadata,get_monitoring_locations,get_reference_tabledaily_id,continuous_id,latest_*_id,peak_id,channel_measurements_id,field_measurement_id,field_visit_id,field_measurements_series_id,field_series_id,time_series_id,parent_time_series_id,combined_meta_idget_stats_por,get_stats_date_rangecomputation_id,parent_time_series_idget_samplesActivity_ActivityIdentifier,Result_MeasureIdentifierget_monitoring_locationsis unchanged — theid(USGS-…) is the AGENCY-ID format, not a hash.Implementation
A single
_drop_hash_columns(df, include_hash_ids, keep=None)helper handles the post-processing drop across all three paths viaset(df.columns) & _HASH_ID_COLUMNS - keep. The constants live indataretrieval/waterdata/utils.py:_HASH_ID_COLUMNS— the set of column names whose values are server-generated hashes._queryables_cache— per-service queryables lookup, one HTTP fetch per service per process._default_props_cache— memoized non-hash property whitelist by(service, output_id).OGC path (
get_ogc_data): trims server-side via the OGCproperties=query parameter._default_non_hash_properties(service, output_id)builds the whitelist by subtracting_HASH_ID_COLUMNSfrom the service's queryables._arrange_colscalls_drop_hash_columnsas a client-side safety net for the queryables-fetch-failed fallback path.Statistics path (
get_stats_data): server doesn't acceptproperties=, so_drop_hash_columnsruns client-side after_expand_percentiles(which still needscomputation_idas a join key while it explodes percentile lists into rows).Samples CSV path (
get_samples):_drop_hash_columnsruns after CSV parse.All three honor
include_hash_ids: bool = False.Performance
Measured against the live API on
get_continuous(USGS-02238500, 00060, 2023-01-01/2024-01-01)— ~35,000 rows, ~one server page. Methodology: warmup pair + 5 alternating rounds, queryables cache left warm across rounds (mirrors real-world callers that issue multiple queries per process).include_hash_ids=TrueThe wall-clock picture is honest but mixed. Bytes-on-wire are reliably 18% smaller and the local parse/DataFrame work is reliably 11% faster (offline measurements have tight variance: ±0.05s). But repeated bare-HTTP measurements show USGS's server takes ~16% longer to serialize a filtered response than a full one, and on a fast home connection that extra server time exceeds the byte-transfer savings. Online wall has high run-to-run variance (5–10× spread for the same query); the median favors the legacy path by ~1.2s here.
Reliable wins: memory (33% smaller DataFrame, 17–21% smaller peak), bytes-on-wire (18% smaller), and the UX cleanup of not dragging UUID columns around by default. Wall-clock: roughly neutral on a fast connection (server-side filter cost ≈ byte-transfer savings), wins on bandwidth-constrained ones.
Verification
I ran a live-API column-content sweep against every public
dataretrieval.waterdatafunction — call each, scan every column's first 50 non-null values against UUID / 32-char-hex / 16+-char-hex regexes, flag any column whose first values look hash-shaped. The sweep found two leaks beyond the OGC path (get_stats_*andget_samples), both fixed in this PR.get_*servicesget_reference_table(parameter-codes, agency-codes)get_codes(states)get_samples(1,494 rows non-empty)get_samples_summaryget_stats_por(default)get_stats_por (include_hash_ids=True)get_stats_date_rangeget_ratingsINDEP/DEP/STOR/SHIFT) — no hashes by definitionget_nearest_continuousget_continuous(verified)Migration
Default behavior changes — callers relying on hash columns need to either:
include_hash_ids=True, orproperties=(OGC services only).Existing tests that asserted hash columns were present have been updated; new tests document the opt-in path on each surface.
Test plan
waterdata_utils_test.py+ filters/nearest/ratings; 274 pass across the whole non-live tree.waterdata_test.py.get_*function — no leaks remain.include_hash_ids=Truereturns the legacy columns on every surface, verified live for stats and samples.properties=keeps it even with the default.🤖 Generated with Claude Code