
Extract @meta and object_provides from idx JSONB into dedicated columns #98

@jensens

Description

Problem

The idx JSONB column carries significant overhead from keys that are never used for index queries but bloat every row. Measurements on a production instance (AAF, 2.6M rows, 137k cataloged):

| Key | avg size | total est. | % of idx |
|---|---|---|---|
| @meta (mainly image_scales) | 2,810 B | 368 MB | 87% |
| object_provides | 1,634 B | 214 MB | 50% |
| Everything else | ~400 B | ~50 MB | 12% |

The total idx size (TOAST-compressed) is 424 MB. Extracting these keys would reduce it to ~50–60 MB (~85% reduction), bringing the average row size under the TOAST threshold (~2 KB).

Why these three?

  • @meta: Contains non-JSON-native metadata (pickled via zodb_json_codec). The dominant entry is image_scales (~315 MB). This data is never queried via SQL — it's only read by brains for metadata access. It doesn't belong in the indexed JSONB.

  • object_provides: A KeywordIndex stored as a JSON array. Queried on almost every catalog search (interface filtering). A native TEXT[] column with GIN index is more efficient than jsonb->'key' ?| on the JSONB GIN.

  • allowedRolesAndUsers: Already partially extracted — allowed_roles TEXT[] column and GIN index exist, and queries already use them. But the key is currently copied to the column, not popped from idx, so it's stored twice. This plan subsumes the existing hardcoded handling into the generic mechanism and removes the redundancy.
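The two query shapes being compared for object_provides can be sketched as WHERE-clause fragments. This is a hedged illustration: the function name and the psycopg-style placeholder are invented here, but the operators are standard PostgreSQL (`&&` is array overlap, `?|` is JSONB "any key exists").

```python
# Illustrative sketch, not actual query.py code: the two WHERE-clause
# shapes for an "any of these interfaces" KeywordIndex query.

def keyword_any_clause(use_dedicated_column: bool) -> str:
    """Build the WHERE fragment for an 'any of these interfaces' query."""
    if use_dedicated_column:
        # TEXT[] overlap, served by the dedicated GIN index on object_provides
        return "object_provides && %(values)s"
    # JSONB "any key exists" check against the generic idx GIN index
    return "idx -> 'object_provides' ?| %(values)s"
```

The dedicated-column form lets the planner use a small, type-specific GIN index instead of the much larger index over the whole idx JSONB.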

Proposed Solution

Generic ExtraIdxColumn mechanism

Instead of hardcoding per-field extraction, introduce a declarative registry:

import dataclasses

@dataclasses.dataclass(frozen=True)
class ExtraIdxColumn:
    idx_key: str        # key in idx dict (e.g. "@meta", "object_provides")
    column_name: str    # PG column name (e.g. "meta", "object_provides")
    column_type: str    # PG type (e.g. "JSONB", "TEXT[]")
    value_expr: str     # SQL value expression for psycopg
    gin_index: bool = False

Registered columns are:

  1. Popped from idx at write time (in CatalogStateProcessor.process() and indexing.py)
  2. Written to dedicated PG columns via existing ExtraColumn infrastructure in zodb-pgjsonb
  3. Queried directly: _handle_keyword() checks for dedicated columns (replaces the hardcoded allowedRolesAndUsers check)
  4. Read by brains: _resolve_from_idx() checks the meta column first, falling back to idx["@meta"] for pre-migration data
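Steps 1–2 can be sketched in Python. This is illustrative only: extract_extra_columns is an invented name, not the actual processor.py API, and the real implementation maps values through each column's value_expr.

```python
# Hedged sketch of the write-time split: registered idx keys are popped
# out of the idx mapping (mutating it) and returned as per-column values.

def extract_extra_columns(idx: dict, registered_keys: tuple[str, ...]) -> dict:
    """Pop registered keys from idx and return them keyed for column writes."""
    extracted = {}
    for key in registered_keys:
        if key in idx:
            extracted[key] = idx.pop(key)
    return extracted
```

After this runs, the slimmed idx is what gets stored as JSONB, so extracted keys are no longer duplicated in the row.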

Default registrations

ExtraIdxColumn(idx_key="@meta", column_name="meta", column_type="JSONB", ...)
ExtraIdxColumn(idx_key="object_provides", column_name="object_provides", column_type="TEXT[]", gin_index=True)
ExtraIdxColumn(idx_key="allowedRolesAndUsers", column_name="allowed_roles", column_type="TEXT[]", gin_index=True)

DDL

ALTER TABLE object_state ADD COLUMN IF NOT EXISTS meta JSONB;
ALTER TABLE object_state ADD COLUMN IF NOT EXISTS object_provides TEXT[];
CREATE INDEX IF NOT EXISTS idx_os_object_provides
    ON object_state USING gin (object_provides) WHERE object_provides IS NOT NULL;
-- allowed_roles column and idx_os_allowed_roles GIN index already exist
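With the registry in place, DDL like the above could be generated per entry instead of hand-written. A hedged sketch; ddl_for is an invented helper, not existing schema.py code:

```python
# Illustrative DDL generator for one extracted column. Uses the same
# idempotent IF NOT EXISTS / partial-index pattern as the DDL above.

def ddl_for(column_name: str, column_type: str, gin_index: bool) -> list[str]:
    """Generate idempotent DDL statements for one extracted idx column."""
    stmts = [
        f"ALTER TABLE object_state ADD COLUMN IF NOT EXISTS "
        f"{column_name} {column_type};"
    ]
    if gin_index:
        # Partial index skips NULL rows (pre-migration or key absent)
        stmts.append(
            f"CREATE INDEX IF NOT EXISTS idx_os_{column_name} "
            f"ON object_state USING gin ({column_name}) "
            f"WHERE {column_name} IS NOT NULL;"
        )
    return stmts
```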

Affected Code Paths

| File | Change |
|---|---|
| columns.py | ExtraIdxColumn dataclass + registry + default registrations (incl. allowedRolesAndUsers) |
| schema.py | DDL for new columns + GIN index (allowed_roles already exists) |
| processor.py | get_extra_columns() declares the new ExtraColumns; process() pops keys from idx (replaces hardcoded allowed_roles extraction) |
| query.py | _handle_keyword() uses generic column lookup (replaces hardcoded idx_key == "allowedRolesAndUsers" check) |
| brain.py | _resolve_from_idx() checks the meta column first, falls back to idx["@meta"] |
| search.py | Include meta in SELECT columns |
| indexing.py | Replace hardcoded allowed_roles extraction with generic _extract_extra_columns() |
| startup.py | Remove _backfill_allowed_roles() (superseded by generic mechanism + clear_and_rebuild) |

Migration

  • DDL uses IF NOT EXISTS — safe for rolling deploys
  • Pre-migration rows: meta = NULL, object_provides = NULL, data still in idx
  • allowed_roles likely already populated by old backfill mechanism
  • Brain fallback handles both old and new layout transparently
  • After deploy: run clear_and_rebuild to populate new columns for all objects and pop keys from idx
  • The generic mechanism makes future extractions trivial
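The transparent brain fallback from the points above can be sketched as follows. The resolve_meta name and the row-dict shape are assumptions for illustration, not the actual brain.py API:

```python
# Hedged sketch of the read-side fallback: new rows carry metadata in the
# dedicated meta column; pre-migration rows still have it inside idx.

def resolve_meta(row: dict) -> dict:
    """Return brain metadata, preferring the dedicated column."""
    if row.get("meta") is not None:
        # Post-migration layout: meta column populated, key popped from idx
        return row["meta"]
    # Pre-migration layout: @meta still embedded in the idx JSONB
    return (row.get("idx") or {}).get("@meta", {})
```

Because the fallback is purely read-side, old and new rows can coexist until clear_and_rebuild has repopulated everything.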

Detailed Implementation Plan

A full task-by-task plan with tests and code is available at:
docs/plans/2026-04-09-pgcatalog-idx-column-extraction.md

Expected Impact

| Metric | Before | After |
|---|---|---|
| idx size per row (avg) | 3,239 B | ~400 B (under TOAST threshold) |
| idx total (compressed) | 424 MB | ~50–60 MB |
| object_provides query | JSONB GIN (`idx->'key' ?\|`) | TEXT[] GIN (native array operators) |
| allowedRolesAndUsers in idx | Copied (redundant) | Popped (no duplication) |
| Brain metadata access | Decode from idx JSONB | Direct from meta column (same codec) |
| Code complexity | 3 hardcoded special cases | 1 generic mechanism |
