## Problem

The `idx` JSONB column carries significant overhead from keys that are either never queried via SQL or better served by dedicated columns, yet bloat every row. Measurements on a production instance (AAF, 2.6M rows, 137k cataloged):
| Key | avg size | total est | % of idx |
|---|---|---|---|
| `@meta` (mainly `image_scales`) | 2,810 B | 368 MB | 87% |
| `object_provides` | 1,634 B | 214 MB | 50% |
| Everything else | ~400 B | ~50 MB | 12% |
The total `idx` size (TOAST-compressed) is 424 MB. Extracting these keys would reduce it to ~50–60 MB (~85% reduction), bringing the average row size under the TOAST threshold (~2 KB).
### Why these three?

- `@meta`: Contains non-JSON-native metadata (pickled via `zodb_json_codec`). The dominant entry is `image_scales` (~315 MB). This data is never queried via SQL — it's only read by brains for metadata access. It doesn't belong in the indexed JSONB.
- `object_provides`: A KeywordIndex stored as a JSON array. Queried on almost every catalog search (interface filtering). A native `TEXT[]` column with a GIN index is more efficient than `jsonb->'key' ?|` on the JSONB GIN.
- `allowedRolesAndUsers`: Already partially extracted — an `allowed_roles TEXT[]` column and GIN index exist, and queries already use them. But the key is currently copied to the column, not popped from `idx`, so it's stored twice. This plan subsumes the existing hardcoded handling into the generic mechanism and removes the redundancy.
## Proposed Solution

### Generic `ExtraIdxColumn` mechanism

Instead of hardcoding per-field extraction, introduce a declarative registry:
```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class ExtraIdxColumn:
    idx_key: str          # key in idx dict (e.g. "@meta", "object_provides")
    column_name: str      # PG column name (e.g. "meta", "object_provides")
    column_type: str      # PG type (e.g. "JSONB", "TEXT[]")
    value_expr: str       # SQL value expression for psycopg
    gin_index: bool = False
```
Registered columns are:

- Popped from `idx` at write time (in `CatalogStateProcessor.process()` and `indexing.py`)
- Written to dedicated PG columns via the existing `ExtraColumn` infrastructure in zodb-pgjsonb
- Queried directly — `_handle_keyword()` checks for dedicated columns (replaces the hardcoded `allowedRolesAndUsers` check)
- Read by brains — `_resolve_from_idx()` checks the `meta` column first, falls back to `idx["@meta"]` for pre-migration data
### Default registrations

```python
ExtraIdxColumn(idx_key="@meta", column_name="meta", column_type="JSONB", ...)
ExtraIdxColumn(idx_key="object_provides", column_name="object_provides", column_type="TEXT[]", gin_index=True)
ExtraIdxColumn(idx_key="allowedRolesAndUsers", column_name="allowed_roles", column_type="TEXT[]", gin_index=True)
```
### DDL

```sql
ALTER TABLE object_state ADD COLUMN IF NOT EXISTS meta JSONB;
ALTER TABLE object_state ADD COLUMN IF NOT EXISTS object_provides TEXT[];
CREATE INDEX IF NOT EXISTS idx_os_object_provides
    ON object_state USING gin (object_provides) WHERE object_provides IS NOT NULL;
-- allowed_roles column and idx_os_allowed_roles GIN index already exist
```
Affected Code Paths
| File |
Change |
columns.py |
ExtraIdxColumn dataclass + registry + default registrations (incl. allowedRolesAndUsers) |
schema.py |
DDL for new columns + GIN index (allowed_roles already exists) |
processor.py |
get_extra_columns() declares new ExtraColumns; process() pops keys from idx (replaces hardcoded allowed_roles extraction) |
query.py |
_handle_keyword() uses generic column lookup (replaces hardcoded if idx_key == "allowedRolesAndUsers" check) |
brain.py |
_resolve_from_idx() checks meta column first, fallback to idx["@meta"] |
search.py |
Include meta in SELECT columns |
indexing.py |
Replace hardcoded allowed_roles extraction with generic _extract_extra_columns() |
startup.py |
Remove _backfill_allowed_roles() (superseded by generic mechanism + clear_and_rebuild) |
## Migration

- DDL uses `IF NOT EXISTS` — safe for rolling deploys
- Pre-migration rows: `meta = NULL`, `object_provides = NULL`, data still in `idx`; `allowed_roles` is likely already populated by the old backfill mechanism
- Brain fallback handles both old and new layouts transparently
- After deploy: run `clear_and_rebuild` to populate the new columns for all objects and pop the keys from `idx`
- The generic mechanism makes future extractions trivial
## Detailed Implementation Plan

A full task-by-task plan with tests and code is available at:
`docs/plans/2026-04-09-pgcatalog-idx-column-extraction.md`
Expected Impact
| Metric |
Before |
After |
| idx size per row (avg) |
3,239 B |
~400 B (under TOAST threshold) |
| idx total (compressed) |
424 MB |
~50–60 MB |
object_provides query |
JSONB GIN (`idx->'key' ? |
`) |
allowedRolesAndUsers in idx |
Copied (redundant) |
Popped (no duplication) |
| Brain metadata access |
Decode from idx JSONB |
Direct from meta column (same codec) |
| Code complexity |
3 hardcoded special cases |
1 generic mechanism |