Item fragments#448
Draft
bitner wants to merge 8 commits into
- Dockerfile: add plprofiler and plpgsql_check for profiling sessions
- scripts/loadsampledata: new host-facing fixture-loader; extend in-container version to load Planetary Computer NAIP, Landsat, and Sentinel-2 fixtures
- scripts/container-scripts/test: add --pgdump gate; update flag docs
- Developer docs: CLAUDE.md migration workflow and test-gate guidance; AGENTS.md persona definitions; scripts.instructions.md updated for new scripts
- CHANGELOG.md: unreleased entries for v0.10.0 split-storage changes
- .gitignore: ignore local .plans/ planning artifacts

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add 1,000-item NDJSON snapshots for landsat-c2-l2, naip, and sentinel-2-l2a under src/pgstac/tests/testdata/planetary-computer/.

Deterministic fixtures (fetched once, checked in) for reproducible disk-size measurement and benchmarking of the v0.10 split-storage schema. Each collection exercises a different data shape:
- Landsat: 25 assets with many constant sub-keys
- NAIP: 4 assets dominated by per-item Azure blob hrefs
- Sentinel-2: 23 assets with per-item varying properties

Includes a fixture-summary.json recording fetch parameters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s, ingest

Split the monolithic items.content JSONB into typed columns and a deduplicated fragment store, with server-side hydration on every read.

Schema
- items: per-item delta (assets/properties/links/extra) + ~30 promoted scalar columns (datetime, platform, eo:*, proj:*, view:*, sat:*, file:*, sci:*) with native BTREE indexes + fragment_id reference into item_fragments
- item_fragments(collection, hash bytea, content, links_template): deduplicated shared subtrees keyed by raw 32-byte sha256 (compact unique index)
- collections.fragment_config text[]: per-collection fragment paths, auto-derived from item_assets sub-keys (depth-3 paths for stable asset metadata)
- item_field_registry: tracks observed JSON paths per collection for queryable discovery and schema inference
- items_deleted_log: tombstone table for soft-delete audit

Dehydrate at ingest (items_staging_triggerfunc → items_staging_dehydrate)
- Set-based pipeline: dehydrate → fragment extract → ON CONFLICT hash dedup → strip fragment-covered keys; shared by insert/ignore/upsert branches via items_staging_dehydrate() so the enriched column list lives in one place
- Links split storage: shared link shape (rel/type/title, no href) deduped in item_fragments.links_template; per-item hrefs in items.link_hrefs
- Partition creation and stats updates queued via run_or_queue (ingest returns fast)

Hydrate at read (content_hydrate, format_item, search)
- jsonb_merge_recursive with disjoint fast-path: ingest strip removes fragment-owned keys from per-item columns, so the two sub-objects almost always have disjoint keys; merge shallow-concats and only recurses on real overlap (~2.5× faster asset merge, byte-identical output verified on 3,000 real items + depth-4/collision unit tests)
- promoted_properties_from_item: direct jsonb_strip_nulls(jsonb_build_object) mirroring content_dehydrate (~35% faster than the prior per-item defs-join)
- tstz_to_stac_text: canonical UTC serializer (trims trailing zeros)
- Net: content_hydrate 27–50% faster on the Planetary Computer fixtures

Externally reproducible content_hash
- jsonb_canonical(jsonb): RFC 8785-aligned serializer (code-point-sorted keys, compact separators, UTF-8 strings, IEEE-754 shortest-round-trip numbers)
- content_hash = sha256(jsonb_canonical(item)), verified byte-identical to a Python reference on 3,000 real items plus numeric/unicode edge cases
- Set once at ingest; items_touch_triggerfunc no longer recomputes on UPDATE

Queryables and CQL routing
- promoted_queryables_defaults() populates queryables.property_path for all promoted scalar columns; CQL2 translator bypasses JSONB cast and hits native BTREE indexes directly for promoted queryables
- Permissions for new tables/functions in 998_idempotent_post.sql

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
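The disjoint fast-path described for jsonb_merge_recursive can be illustrated outside the database. The following is a minimal Python sketch, not the PL/pgSQL implementation: the function name `merge_recursive` and the rule that the per-item value wins on a non-object conflict are assumptions made for illustration only.

```python
def merge_recursive(fragment, delta):
    """Deep-merge two JSON objects (dicts).

    Assumption for this sketch: on a non-object conflict the per-item
    delta value replaces the fragment value.
    """
    # Fast path: with no shared keys, a shallow concat equals the deep merge.
    if fragment.keys().isdisjoint(delta):
        return {**fragment, **delta}

    merged = dict(fragment)
    for key, value in delta.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            # Real overlap on an object: recurse into it.
            merged[key] = merge_recursive(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shared asset metadata lives in the fragment, the href in the per-item delta.
shared = {"assets": {"B1": {"type": "image/tiff", "roles": ["data"]}}}
per_item = {"assets": {"B1": {"href": "https://example.com/B1.tif"}}, "id": "x"}
hydrated = merge_recursive(shared, per_item)
```

Because the ingest step strips fragment-owned keys from the per-item columns, most calls at the sub-object level (here, inside `B1`) hit the fast path and never enter the loop.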
- pgtap/001a_jsonutils.sql: jsonb_merge_recursive disjoint fast-path (depth-4, collision, NULL/empty guards); jsonb_canonical key-sort, numbers, nested objects; pgstac_item_hash vector pinning the external reproducibility contract
- pgtap/002_collections.sql: fragment_config auto-derivation from item_assets
- pgtap/002a_queryables.sql: promoted_queryables_defaults, property_path routing
- pgtap/003_items.sql: split-storage round-trip (create/get/update/upsert/delete), fragment dedup, root-key fragmentation, link split storage, promoted column values, touch trigger leaves content_hash stable on direct UPDATE
- pgtap/004_search.sql: format_item hydration, CQL promoted-column routing
- pgtap/9999_readonly.sql: read-only role access checks for new tables/functions
- pgtap.sql: plan count updated to 343
- basic/hydration.sql + .sql.out: assert properties.datetime absent from stored row (promoted) and correctly rehydrated via get_item
- basic/crud_functions.sql: ORDER BY id on multi-row queries for deterministic output; .sql.out regenerated
- basic/cql2_searches.sql.out: updated for promoted-column routing output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Regenerate src/pgstac/pgstac.sql (assembled base install) and the unreleased base migration from the edited sql/ source.

The incremental migration (pgstac--0.9.11--unreleased.sql) reflects the schema delta from 0.9.11; it will be finalized and renamed when the v0.10.0 release branch is assembled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, #425)

The STAC spec requires `"datetime": null` to be explicitly present in item properties when `start_datetime`/`end_datetime` are used. Earlier pgstac versions applied jsonb_strip_nulls to the full properties object during hydration, silently dropping it and producing invalid STAC output.

The new split-storage hydration (temporal_properties_from_item) builds `jsonb_build_object('datetime', NULL)` before the jsonb_strip_nulls block that covers only the promoted scalar columns, so the explicit JSON null is preserved end-to-end.

Tests added:
- pgtap/003_items.sql: four assertions covering get_item and search: key presence (? 'datetime') and value type ('null'::jsonb) for a range item
- basic/hydration.sql: search() check alongside the existing get_item check, with regenerated .out confirming null in both paths (plan: 343 → 347)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
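The ordering that preserves the null can be sketched in Python. This is only an analogy to the SQL described above: `strip_nulls` is a flat stand-in for jsonb_strip_nulls (the Postgres function is recursive), and `hydrate_properties` and its parameters are hypothetical names, not pgstac functions.

```python
import json


def strip_nulls(obj):
    """Flat stand-in for jsonb_strip_nulls: drop keys whose value is None."""
    return {k: v for k, v in obj.items() if v is not None}


def hydrate_properties(datetime_value, promoted, per_item):
    """Build item properties, keeping an explicit null datetime."""
    # datetime is emitted first and unconditionally, even when it is None.
    props = {"datetime": datetime_value}
    # Only the promoted scalar columns pass through null stripping.
    props.update(strip_nulls(promoted))
    props.update(per_item)
    return props


promoted_columns = {
    "start_datetime": "2020-01-01T00:00:00Z",
    "end_datetime": "2020-01-02T00:00:00Z",
    "platform": "landsat-8",
    "gsd": None,  # unset promoted column
}

# Old behaviour: stripping the whole properties object loses the key.
old_output = strip_nulls({"datetime": None, **promoted_columns})

# New behaviour: the key survives, while unset promoted columns are dropped.
range_item = hydrate_properties(None, promoted_columns, {"custom:note": "kept"})
serialized = json.dumps(range_item)
```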
… private, move jsonb_field_rows
Five related schema and API improvements:
item_hash bytea (was content_hash text)
Store the canonical item digest as a raw 32-byte sha256 (bytea) instead of a
64-char hex text. Half the storage per row; direct binary comparison on the
unique index; 'octet_length(item_hash) = 32' replaces 'length = 64' in tests.
Applies to both items and items_deleted_log.
jsonb_hash(jsonb) RETURNS bytea (was pgstac_item_hash RETURNS text)
General-purpose RFC 8785-aligned canonical hash: sha256(utf8(jsonb_canonical(j))).
Returns bytea directly; call encode(..., 'hex') when a printable string is needed.
Always schema-qualified as pgstac.jsonb_hash() to avoid shadowing the pg_catalog
hash support function of the same name (which returns integer for index hashing).
The private column is intentionally excluded from this hash — it is operator
metadata outside the STAC item identity contract.
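For readers who want to compute a comparable digest outside the database, here is a rough Python sketch of sha256 over a canonical serialization. It is not the Python reference mentioned in this PR, and the names `canonical_bytes` and `item_digest` are hypothetical. It assumes the item contains only strings, integers, booleans, nulls, arrays, and objects: `json.dumps` formats some floats differently from RFC 8785, so floats would need separate handling before the output could be expected to match jsonb_canonical.

```python
import hashlib
import json


def canonical_bytes(value):
    """Serialize with sorted keys, compact separators, and UTF-8 output.

    Sketch only: float formatting is NOT RFC 8785-compliant here.
    """
    text = json.dumps(
        value,
        sort_keys=True,          # Python sorts str keys by code point
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
        allow_nan=False,         # NaN/Infinity are not valid JSON
    )
    return text.encode("utf-8")


def item_digest(item):
    """Raw 32-byte sha256 digest, analogous to a bytea value."""
    return hashlib.sha256(canonical_bytes(item)).digest()


a = {"id": "x", "properties": {"b": 1, "a": 2}}
b = {"properties": {"a": 2, "b": 1}, "id": "x"}

digest = item_digest(a)        # 32 raw bytes
printable = digest.hex()       # 64 hex characters, like encode(..., 'hex')
```

Key order in the input does not affect the digest, which is what makes the hash usable for diffing items before they are loaded.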
private jsonb on items (restored)
The old items schema had a private jsonb column for operator metadata not
returned by the STAC API. It was dropped in the v0.10 rewrite; add it back.
Not included in content_dehydrate (always NULL from ingest), not in
items_content_distinct_sql (not item content), and not in hydration output.
Operators set it via direct UPDATE, same pattern as collections.private.
jsonb_field_rows moved to 001a_jsonutils.sql
The recursive JSONB path-walker is a general utility, not items-specific.
Moving it to jsonutils makes it available earlier in the load order and
alongside jsonb_leaf_rows, jsonb_common_values, and the other JSONB helpers.
A comment in 003a_items.sql notes that it is defined in 001a.
Tests updated
- plan: 347 → 349 (two new has_function checks: jsonb_hash, jsonb_field_rows)
- 003_items.sql: content_hash → item_hash everywhere; length = 64 →
octet_length = 32; tombstone INSERT uses decode(repeat('aa',32), 'hex')
(32-byte bytea placeholder, was 64-char text)
- 004_search.sql: content_hash → item_hash in explicit INSERT column list
- 001a_jsonutils.sql: pgstac_item_hash → jsonb_hash; expected value is
decode('77f18c0a…', 'hex') (bytea, not text)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Technical Context
- jsonb_strip_nulls() in hydration functions strips datetime: null from item properties, producing invalid STAC items #425

Changes the `items` table schema. The monolithic `content jsonb` column is gone. Common STAC properties and properties from extensions currently marked as stable are promoted into actual columns. Hydration/dehydration still exists, but is now tied to a versioned item_fragments table rather than directly to item_assets on the collections table.

Description
Reworks how STAC items are stored and retrieved for the upcoming v0.10.0 breaking release of pgstac.

New storage model
- `item_fragments`: deduplicated shared subtrees (asset metadata, link shapes, root keys) stored once per collection, keyed by a 32-byte sha256 (`hash bytea`). Items reference their fragment via `fragment_id`.
- `items`: per-item delta columns (`assets`, `properties`, `links`, `extra`) plus ~30 promoted scalar columns for well-known queryables (`datetime`, `platform`, `gsd`, `eo:*`, `proj:*`, `view:*`, `sat:*`, `file:*`, `sci:*`).
- `collections.fragment_config text[]`: list of fragment paths, auto-derived from `item_assets` on collection creation. This can be overridden to further deduplicate information that is common to all items in a collection.
- `items.link_hrefs` / `item_fragments.links_template`: link split storage. The shared link shape (rel/type/title) is deduped into the fragment; per-item hrefs are stored separately.
- `item_field_registry`: tracks observed JSON paths per collection for queryable discovery. It also makes it possible to derive the full schema of a collection's data when writing to schema-requiring formats such as Parquet.

The items_staging tables have been updated to work with the schema changes.

New functions compute a canonical hash that is calculated identically inside Postgres/pgstac and externally, allowing fast lookups and diffing when loading data.
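The storage model above can be illustrated with a small in-memory Python sketch of dehydration and fragment deduplication. Everything here is hypothetical: the function `dehydrate`, the class `FragmentStore`, and the representation of fragment paths as `(root, name, field)` tuples are assumptions for illustration, not the actual `fragment_config` format or the set-based SQL pipeline.

```python
import copy
import hashlib
import json


def dehydrate(item, fragment_paths):
    """Split an item into (shared fragment, per-item delta)."""
    fragment = {}
    delta = copy.deepcopy(item)
    for root, name, field in fragment_paths:
        try:
            value = item[root][name][field]
        except (KeyError, TypeError):
            continue  # path not present on this item
        fragment.setdefault(root, {}).setdefault(name, {})[field] = value
        del delta[root][name][field]  # strip the fragment-covered key
    return fragment, delta


class FragmentStore:
    """Stores each distinct fragment once per collection, keyed by sha256."""

    def __init__(self):
        self.ids = {}      # (collection, digest) -> fragment id
        self.content = {}  # fragment id -> fragment

    def intern(self, collection, fragment):
        canon = json.dumps(fragment, sort_keys=True, separators=(",", ":"))
        key = (collection, hashlib.sha256(canon.encode("utf-8")).digest())
        if key not in self.ids:  # plays the role of ON CONFLICT dedup
            self.ids[key] = len(self.ids) + 1
            self.content[self.ids[key]] = fragment
        return self.ids[key]


paths = [("assets", "B1", "type"), ("assets", "B1", "roles")]
store = FragmentStore()
rows = []
for n in (1, 2):
    item = {
        "id": "item-%d" % n,
        "assets": {"B1": {"type": "image/tiff", "roles": ["data"],
                          "href": "https://example.com/%d/B1.tif" % n}},
    }
    fragment, delta = dehydrate(item, paths)
    rows.append((store.intern("demo", fragment), delta))
```

Both items end up referencing the same fragment id, and each stored delta keeps only the per-item `href`.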
Fixes #158 and #425: `datetime: null` round-trip
The STAC spec requires `"datetime": null` to be explicitly present when `start_datetime`/`end_datetime` are used. Earlier versions applied `jsonb_strip_nulls` to the full properties object, silently dropping it. The new `temporal_properties_from_item` builds `jsonb_build_object('datetime', NULL)` before the `jsonb_strip_nulls` block that covers only promoted scalars, so the explicit JSON null survives end-to-end through both `get_item` and `search`.

Test gate
`scripts/test --formatting --pgtap --basicsql --pgdump` is green (349 PGTap tests; pg_dump → `pgstac_restore` round-trip verified). `--pypgstac` and `--migrations` are intentionally skipped. These will be fixed in upcoming PRs prior to the pgstac v0.10.0 release. We are intentionally keeping the PR slices leading to v0.10.0 small and allowing some tests to fail, so that we can iterate until we are ready for this breaking release.

Checklist
jsonb_strip_nulls() in hydration functions strips datetime: null from item properties, producing invalid STAC items #425), jsonb_merge_recursive depth-4/collision correctness, and item_hash bytea sizing.

AI tool usage