pgstac sql changes for updated table layout by bitner · Pull Request #444 · stac-utils/pgstac

bitner · 2026-05-20T19:40:01Z

No description provided.

Copilot

Pull request overview

This PR updates PgSTAC’s SQL schema and associated tests to support the v0.10 “split” item storage layout (top-level STAC fields split into dedicated columns), adds fragment-backed deduplication and a field registry, and updates queryables/CQL translation to prefer native promoted columns for better indexing.

Changes:

Introduces split columns on items, plus item_fragments (dedup) and item_field_registry (path/type sampling).
Updates queryables + CQL generation to resolve promoted fields to native columns and generate native-column index DDL.
Updates pgTap/basic SQL fixtures and expected outputs to match the new storage model and null-handling semantics.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/pgstac/sql/003a_items.sql	Implements split `items` schema, fragment storage/dedup, field registry functions, and staging ingest changes.
src/pgstac/sql/002_collections.sql	Adds `collections.fragment_config` and derives defaults from `item_assets`.
src/pgstac/sql/002a_queryables.sql	Resolves queryables to split columns/native promoted paths; updates managed index generation and field matching.
src/pgstac/sql/002b_cql.sql	Avoids applying cast wrappers to native promoted columns to preserve index-only scans.
src/pgstac/sql/004_search.sql	Switches full-text search inputs from `content->properties` to split `properties` column.
src/pgstac/sql/004a_collectionsearch.sql	Enhances `collections_asitems` to expose collection metadata via a `properties` column for CQL2 filtering.
src/pgstac/sql/998_idempotent_post.sql	Registers promoted native-column queryables for the split schema and backfills older installs.
src/pgstac/sql/001a_jsonutils.sql	Updates `merge_jsonb`/`strip_jsonb` to preserve explicit JSON `null` values.
src/pgstac/pgstac.sql	Regenerated aggregate SQL reflecting the new schema/functions.
src/pgstac/tests/pgtap/003_items.sql	Updates items tests for split columns, fragments, registry, and JSON null preservation.
src/pgstac/tests/pgtap/004_search.sql	Updates search tests for promoted columns / split storage expectations.
src/pgstac/tests/pgtap/002a_queryables.sql	Adds assertions for promoted/native queryable resolution and index DDL.
src/pgstac/tests/pgtap.sql	Updates pgTap plan count for new tests.
src/pgstac/tests/basic/crud_functions.sql	Updates basic CRUD script to query split columns.
src/pgstac/tests/basic/crud_functions.sql.out	Updates expected output for CRUD basic test.
src/pgstac/tests/basic/cql2_searches.sql.out	Updates expected hydrated output (e.g., `stac_version` presence).
src/pgstac/tests/basic/hydration.sql	Adds basic test covering JSON null survival through dehydrate/hydrate.
src/pgstac/tests/basic/hydration.sql.out	Expected output for the new hydration test.
scripts/container-scripts/test	Updates DB setup for fragment_config; changes default test-suite selection.
scripts/benchmark_dehydrate.sql	Adds a benchmarking script for dehydrate variants and hashing costs.
scripts/benchmark_dehydrate_v2.sql	Adds an alternative benchmarking harness for dehydrate variants.
docker/pgstac/Dockerfile	Adds `plprofiler` packages and preloading; adjusts environment defaults.

Copilot · 2026-05-20T20:06:10Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

apt.postgresql.org
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
astral.sh
- Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global /opt/pipx_bin/git committer.email (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

bitner · 2026-05-20T20:06:38Z

+--   '"𒍟※"'   _a  → return SQL NULL (sentinel used by strip_jsonb to mark removed values)
+-- Objects are merged key-by-key recursively; same-length arrays are merged element-by-element.
+-- Any other type: _a wins.
 CREATE OR REPLACE FUNCTION merge_jsonb(_a jsonb, _b jsonb) RETURNS jsonb AS $$


@copilot do we still need merge_jsonb and strip_jsonb with the new fragments setup? or at least, do we need to modify things so that we aren't using the null marker any more? are there any further optimizations that we can do here?

With the v0.10 fragment layout neither function is called from any live code path — confirmed by grep across src/pgstac/sql/:

strip_jsonb is already marked "RETAINED FOR USE BY MIGRATION SCRIPTS ONLY" in its own comment. The new ingest path uses strip_fragment_col() instead, which extracts shared keys into item_fragments without the sentinel; strip_jsonb can be dropped once the migration path is finalised.

merge_jsonb is also unreachable from v0.10 SQL: content_hydrate now merges with the plain || operator (COALESCE(frag_content->'assets','{}') || COALESCE(_item.assets,'{}')) and the sentinel case in merge_jsonb will never be triggered because no new code writes '𒍟※' into the database. It can be dropped at the same time.

The null marker / sentinel (𒍟※): it only lives in strip_jsonb/merge_jsonb on the SQL side — both dead paths above. On the Python side, hydration.py's dehydrate() / hydrate_py() still use DO_NOT_MERGE_MARKER and load.py still calls dehydrate(), but that whole Python load path targets the old content column that no longer exists in the v0.10 items table, so it too is pending redesign.

Suggested cleanup when ready:

Remove merge_jsonb from 001a_jsonutils.sql (no callers in v0.10 SQL).

Remove strip_jsonb from 001a_jsonutils.sql once the migration path no longer needs it.

Drop DO_NOT_MERGE_MARKER, dehydrate(), and apply_marked_keys() from hydration.py when load.py is updated for v0.10 split-column ingest (they're only meaningful with the old content column).

No further SQL optimisations are needed here for now; the two functions are simply dead weight that will be cleaned up as part of the pypgstac/migration work.

Copilot · 2026-05-20T20:07:31Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

apt.postgresql.org
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
astral.sh
- Triggering command: /usr/bin/curl curl -LsSf REDACTED git (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Copilot · 2026-05-20T20:07:50Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

apt.postgresql.org
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http minimal_3.13.5-1--control pes_13.0.0_all.d/tmp/apt-dpkg-install-OjeaYT/75-python3-setuptools-whl_78.1.1-0.1_all.deb 6.5_all.deb 3.4.�� eadline-common_8.2-6_all.deb -common_8.2-6_all.deb ine8t64_8.2-6_amd64.deb n3.13-stdlib_3.1sed 13_3.13.5-2+deb1-e n3-stdlib_3.13.5s/postgresql\/17\/man/man/ 3.13.5-1_amd64.deb 3.15�� .0-1_amd64.deb libgdbm-compat4t64_1.24-2_amd64.deb -tree committer.name (dns block)
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
astral.sh
- Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global k/_temp/copilot-developer-action-main/dist/ripgrep/bin/linux-x64/rg user.email (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

bitner · 2026-05-20T20:11:59Z

+-- collection_fragment_config_default: Derive a fragment_config text[] from item_assets if present.
+-- Returns one serialized path per item_asset key (e.g. 'assets.thumbnail'), or NULL if no item_assets.
+-- This is used when creating/upserting a collection to auto-populate fragment_config.
+CREATE OR REPLACE FUNCTION collection_fragment_config_default(content jsonb) RETURNS text[] AS $$


@copilot we need to make sure that we have a good default here. this should include ignoring base fields like stac_extensions and stac_version and any other fields like the proj fields that are likely to be the same for every item in a collection and for things in item_assets, it should look for things that are not going to change from item to item, but we should make sure not to include things like href that will be different per item. look at various collections and items on the microsoft planetary computer (https://planetarycomputer.microsoft.com/api/stac/v1/openapi.json) and come up with a plan so that we can more smartly identify a good base set of paths to allow stripping that highly duplicated data.

Addressed in db79a72. collection_fragment_config_default now:

Always includes stac_version and stac_extensions — always identical across a collection.

When item_assets is present, generates depth-3 paths for each stable asset sub-key (e.g. assets.thumbnail.type, assets.thumbnail.roles, assets.thumbnail.title, assets.image.eo:bands) so that the fragment stores the shared per-asset metadata while href and other per-item fields stay in the per-item assets column where they belong.

The following sub-keys are excluded from fragmentation:
href, file:size, file:checksum, file:local_path, alternate, storage:path, storage:platform, storage:region, storage:requester_pays, storage:tier.

This required extending the whole fragment tooling to handle arbitrary-depth paths (see same commit) — extract_fragment, strip_fragment_col, and content_hydrate all now support depth-3+.

bitner · 2026-05-20T20:13:31Z

-
-
+-- create_collection: Insert a new collection and derive fragment_config from item_assets.
 CREATE OR REPLACE FUNCTION create_collection(data jsonb) RETURNS VOID AS $$


@copilot, let's provide an overload that defaults to null to specify the partition_trunc and another one to specify the fragment_config for this function and the other crud functions here

Added in commit 5fb916a. create_collection, update_collection, and upsert_collection now each have a no-arg overload (for backwards compatibility) plus a full-param variant with optional _partition_trunc text DEFAULT NULL and _fragment_config text[] DEFAULT NULL parameters. When either is NULL the collection-level defaults are used.

bitner · 2026-05-20T20:15:27Z




 CREATE OR REPLACE FUNCTION queryable(


@copilot does it still make sense to have this function that returns all the separate outs or should we break this into smaller functions that each only return a single output. it seems like some of the other functions introduced here might already be working towards that

The multi-output form of queryable() still makes sense for the main CQL translation path — all four outputs (path, expression, wrapper, nulled_wrapper) are consumed together in a single predicate, so a single lookup is more efficient than four separate calls.

The newer single-purpose helpers (queryable_uses_native_path, queryable_index_field, indexdef) were carved off because they serve distinct, non-translation concerns (index DDL generation, index-consistency views). That pattern is the right direction: add focused single-output wrappers only when there are callers that genuinely only need one output and the table lookup cost matters. For now, the core translation path is better served by the multi-output function.

bitner · 2026-05-20T20:16:48Z

+-- CQL2 queries and auto-created indexes will use the native column, not JSONB extraction.
+-- First block: insert new rows only. Second block below: update existing rows that have
+-- property_path=NULL (handles databases upgraded from pre-v0.10 without re-installing).
+WITH promoted_queryables(name, definition, property_path, property_wrapper) AS (


@copilot we are repeating ourselves a lot here, can we do something so that we only specify this values list once?

Consolidated in commit 5fb916a. The column/value list is now defined once in a promoted_queryables_defaults() SETOF function in 002a_queryables.sql; the INSERT … ON CONFLICT DO NOTHING in 998_idempotent_post.sql uses INSERT INTO queryables SELECT * FROM promoted_queryables_defaults() so the values are specified in exactly one place.

Copilot · 2026-05-20T20:48:12Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

apt.postgresql.org
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http git flex bison (dns block)
- Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http run --bundle dpkg-deb --fs�� /tmp/apt-dpkg-install-XQMS62/0-python3-minimal_3.13.5-1_amd64.deb (dns block)
astral.sh
- Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global ptables credential.usern/usr/bin/python3.13 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

- CI/CD: update workflows and release config; add trusted-publisher config for both PyPI projects; extend pg_cron/pg_stat_statements preload - Dockerfile: add plprofiler and plpgsql_check for profiling sessions - scripts/loadsampledata: new host-facing fixture-loader; extend in-container version to load Planetary Computer NAIP, Landsat, and Sentinel-2 fixtures - scripts/container-scripts/test: add --pgdump gate; update flag docs - Developer docs: CLAUDE.md migration workflow and test-gate guidance; AGENTS.md persona definitions; scripts.instructions.md; release-notes path documented - CHANGELOG.md: unreleased entries for v0.10.0 split-storage changes - pypgstac loader: use LIKE items INCLUDING DEFAULTS for temp table DDL so column defaults are preserved when COPY is used without an explicit column list Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add 1,000-item NDJSON snapshots for landsat-c2-l2, naip, and sentinel-2-l2a under src/pgstac/tests/testdata/planetary-computer/. Deterministic fixtures (fetched once, checked in) for reproducible disk-size measurement and benchmarking of the v0.10 split-storage schema. Each collection exercises a different data shape: Landsat (25 assets with many constant sub-keys), NAIP (4 assets dominated by per-item Azure blob hrefs), Sentinel-2 (23 assets with per-item varying properties). Includes a fixture-summary.json recording fetch parameters. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…s, ingest Split the monolithic items.content JSONB into a set of typed columns and a deduplicated fragment store: Schema - items: per-item delta (assets/properties/links/extra), ~30 promoted scalar columns (datetime, platform, eo:*, proj:*, view:*, sat:*, file:*, sci:*) with native BTREE indexes, and fragment_id reference - item_fragments(collection, hash bytea, content, links_template): deduplicated shared subtrees; hash is the raw 32-byte sha256 so the unique index is compact - collections.fragment_config text[]: per-collection fragment paths, auto-derived from item_assets on collection create - item_field_registry: tracks observed JSON paths for queryable discovery - items_deleted_log: tombstone table for soft-delete audit Dehydrate at ingest (items_staging_triggerfunc -> items_staging_dehydrate) - Set-based pipeline: dehydrate -> fragment extract -> ON CONFLICT hash dedup -> strip fragment-covered keys; shared by insert / ignore / upsert branches (enriched column list lives in one place, previously duplicated 3×) - Links split storage: shared link shape (rel/type/title, no href) deduped in item_fragments.links_template; per-item hrefs in items.link_hrefs - Partition creation and stats queued via run_or_queue (ingest returns fast) Hydrate at read (content_hydrate, format_item, search) - jsonb_merge_recursive with disjoint fast-path: because the ingest strip removes fragment-owned keys from the per-item column, the two sub-objects almost always have disjoint keys; merge shallow-concats and only recurses on real overlap (~2.5× faster asset merge, byte-identical output) - promoted_properties_from_item: direct jsonb_strip_nulls(jsonb_build_object) mirroring content_dehydrate (~35% faster than prior defs-join+agg) - tstz_to_stac_text: canonical UTC serializer (trims trailing zeros) - Net: full content_hydrate 27-50% faster on the PC fixtures Externally reproducible content_hash - jsonb_canonical(jsonb): RFC 8785-aligned serializer (code-point-sorted keys, compact separators, UTF-8 strings, IEEE-754 shortest-round-trip numbers); sha256 of that is the content_hash — verified byte-identical to a Python reference on 3,000 real items plus edge cases - items_touch_triggerfunc: no longer recomputes content_hash (it previously hashed hydrated output which never matched the ingest hash); hash is set once at ingest and stays externally reproducible Queryables and CQL - collections.fragment_config derived from item_assets sub-keys (depth-3 paths for stable asset metadata) - promoted_queryables_defaults() populates queryables.property_path for all promoted columns; CQL2 translator bypasses JSONB cast and hits native BTREE indexes directly - Permissions for new tables/functions in 998_idempotent_post.sql Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- pgtap/001a_jsonutils.sql: jsonb_merge_recursive disjoint fast-path (depth-4, collision, NULL/empty guards); jsonb_canonical key-sort, number/array/nested correctness; pgstac_item_hash vector pinning external reproducibility contract - pgtap/002_collections.sql: fragment_config auto-derivation from item_assets - pgtap/002a_queryables.sql: promoted_queryables_defaults, property_path routing - pgtap/003_items.sql: split-storage round-trip (create/get/update/upsert/delete), fragment dedup, stac_version/stac_extensions root-key fragmentation, link split storage, promoted column values, touch trigger leaves content_hash stable - pgtap/004_search.sql: format_item hydration, CQL promoted-column routing - pgtap/9999_readonly.sql: read-only role access checks for new tables - pgtap.sql: plan count updated to 343 - basic/hydration.sql + .sql.out: new suite asserting properties.datetime is absent from stored row (promoted) and correctly rehydrated via get_item - basic/crud_functions.sql: ORDER BY id added to make multi-row result sets deterministic regardless of physical insert order; .sql.out regenerated - basic/cql2_searches.sql.out: updated for promoted-column routing output Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Regenerate src/pgstac/pgstac.sql (the assembled base install) and src/pgstac/migrations/pgstac--unreleased.sql from the edited sql/ source files via scripts/stageversion. The incremental migration (pgstac--0.9.11--unreleased.sql) is the diff from the 0.9.11 base to the new unreleased base; it will be renamed and finalized when the v0.10.0 release branch is assembled (see plan data-migration-singletx slice for the data-backfill step). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bitner added 2 commits May 12, 2026 14:33

don't save _PLAN.md docs

999a74b

Merge branch 'main' of github.com:stac-utils/pgstac

9bd3c43

bitner requested a review from Copilot May 20, 2026 19:46

Copilot started reviewing on behalf of bitner May 20, 2026 19:46 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Copilot started work on behalf of bitner May 20, 2026 20:02 View session

Copilot started work on behalf of bitner May 20, 2026 20:03 View session

Copilot finished work on behalf of bitner May 20, 2026 20:05

Copilot finished work on behalf of bitner May 20, 2026 20:06

bitner commented May 20, 2026

View reviewed changes

Copilot started work on behalf of bitner May 20, 2026 20:06 View session

Copilot finished work on behalf of bitner May 20, 2026 20:07

Copilot finished work on behalf of bitner May 20, 2026 20:09

bitner commented May 20, 2026

View reviewed changes

Copilot started work on behalf of bitner May 20, 2026 20:21 View session

Copilot finished work on behalf of bitner May 20, 2026 20:48

Copilot AI mentioned this pull request May 20, 2026

Add reproducible Planetary Computer benchmark fixtures and CI/manual benchmark reporting for load-path comparison #445

Merged

bitner force-pushed the v010-pr1-load-path-core branch 2 times, most recently from 2413faa to ef9ba25 Compare May 28, 2026 13:35

bitner and others added 4 commits June 1, 2026 15:51

bitner force-pushed the v010-pr1-load-path-core branch from 3489907 to c9aff56 Compare June 1, 2026 20:55

bitner closed this Jun 1, 2026



		-- create_collection: Insert a new collection and derive fragment_config from item_assets.
		CREATE OR REPLACE FUNCTION create_collection(data jsonb) RETURNS VOID AS $$

Conversation

bitner commented May 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI commented May 20, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

bitner May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented May 20, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

Copilot AI commented May 20, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

bitner May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 20, 2026

Choose a reason for hiding this comment

Uh oh!

bitner May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 20, 2026

Choose a reason for hiding this comment

Uh oh!

bitner May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 20, 2026

Choose a reason for hiding this comment

Uh oh!

bitner May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented May 20, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants