Add selective index updates (n/m) to the heap table AM by gburd · Pull Request #23 · gburd/postgres

gburd · 2026-04-06T12:14:51Z

An UPDATE now triggers updates to only n of a table's m indexes, rather than to all of them, none of them, or only the summarizing ones. Table AMs can influence this behavior by changing modified_idx_attrs.

Commit 51432cd ('Standardise terminology: drop the SIU acronym throughout') renamed user-facing identifiers but left several internal symbols and comments still using SIU. Complete the rename so the codebase consistently spells the feature 'HOT-indexed':

  IndexScanState.iss_SiuIndexInfo -> iss_HotIndexedRecheckInfo
  nbtinsert.c local siu_slot -> chain_walk_slot
  nbtinsert.c label bt_siu_skip -> bt_chain_walk_skip
  heapam_handler.c page_had_siu -> page_had_hot_indexed
  pgstat_count_heap_update arg siu -> hot_indexed
  heap_prune_item_preserves_siu -> heap_prune_item_preserves_hot_indexed

Also refresh comments in pruneheap.c, nodeIndexonlyscan.c, relscan.h, and pgstat_relation.c that still referred to SIU. No functional change.

The XLHP_HAS_PROMOTIONS flag (bit 11 of xl_heap_prune.flags), its serialization in log_heap_prune_and_freeze, its replay in heap_page_prune_execute, and its pg_waldump description were all present, but no caller ever populated the promotions[] array: every log_heap_prune_and_freeze and heap_page_prune_execute call site passed NULL/0 for promotions/npromotions. The path was therefore reachable WAL surface that could never fire and would be rejected on review.

The intent was to clear HEAP_INDEXED_UPDATED on surviving heap-only chain members once a chain became indistinguishable from a classic HOT chain. The trigger condition is unsafe without additional bookkeeping (a chain may have non-bridge stale btree entries that ambulkdelete did not sweep, so dropping the recheck bit at "last bridge gone" lets readers arriving via those entries surface stale leaves). Designing a safe trigger -- per-page outstanding-ref counts or a post-vacuum verification walk -- is left for a future commit, which will introduce its own WAL flag at that time.

Strip the flag bit, the promotions[]/npromotions parameters from heap_page_prune_execute and log_heap_prune_and_freeze (and their heapam.h prototypes), the deserialization branch and the apply loop in heap_xlog_deserialize_prune_and_freeze and heap_page_prune_execute, and the rmgrdesc print branches in heap2_desc. Bit 11 is reserved for a future re-introduction of promotion WAL. No on-disk format change visible to existing pages -- the flag was never set.

The previous section described an XLHP_HAS_PROMOTIONS WAL flag and log/replay pipeline as already in place, with only the trigger condition outstanding. That flag has been removed because no caller ever emitted it. Rewrite the section to record promotion as intentional future work, summarize why the obvious "no bridges remain" trigger is unsafe (stale non-bridge btree entries), and reference the two trigger-design directions (per-page outstanding-ref bookkeeping or a post-ambulkdelete verification walk) as a roadmap. Note that a future commit activating promotion will reintroduce its own WAL flag.

Both nodeIndexscan and nodeIndexonlyscan need to verify, on a chain walk that crossed a HOT-indexed hop, that the leaf entry's key still matches the live tuple's current index form. nodeIndexonlyscan already dispatches through the new amrecheck_leaf_key callback; nodeIndexscan was calling a separate ExecIndexEntryMatchesTuple helper in execIndexing.c that did the same job using FormIndexDatum + datum_image_eq. Switch nodeIndexscan to the callback path so all HOT-indexed leaf rechecks go through one indexam-shaped surface, and delete ExecIndexEntryMatchesTuple along with its supporting code in execIndexing.c and executor.h. AMs that omit the callback fall through to the conservative drop, matching the prior permissive behaviour for non-nbtree AMs. Drop IndexScanState.iss_HotIndexedRecheckInfo, which was the cached IndexInfo used by FormIndexDatum and is no longer reachable. Eliminates the dual leaf-key recheck implementation.

HeapUpdateHotAllowable consults RelationHasExclusionConstraint on every UPDATE, and the function used to walk the relation's index list and open every index per call. On a relation with many indexes this dominated per-update CPU on classic-HOT workloads, contributing to a measurable TPS regression at WIDE_COLS=64 versus pre-feature master. Cache the answer as a tristate char on RelationData (rd_has_exclusion; RD_HAS_EXCLUSION_UNKNOWN/NO/YES). The field is naturally zeroed by palloc0_object on relcache entry allocation, so 0 = unknown is the right default. Reset on relcache rebuild via the existing RelationClearRelation memcpy swap of the freshly built struct. No on-disk change.
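The tristate cache described above can be sketched in isolation. This is a minimal illustration, not the patch's code: ToyRelation, toy_scan_indexes, and the TOY_* constants are hypothetical stand-ins for RelationData, the index-list walk, and the RD_HAS_EXCLUSION_* values. The only property carried over from the commit message is that 0 means "unknown", so a zero-initialised struct starts uncached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tristate values; 0 must mean "unknown" so a zeroed struct
 * starts with an empty cache. */
enum
{
    TOY_HAS_EXCLUSION_UNKNOWN = 0,
    TOY_HAS_EXCLUSION_NO = 1,
    TOY_HAS_EXCLUSION_YES = 2
};

/* Hypothetical stand-in for a relcache entry. */
typedef struct ToyRelation
{
    int         nindexes;
    const bool *index_is_exclusion; /* one flag per index */
    char        has_exclusion;      /* tristate cache */
    int         scans;              /* slow-path walks, for illustration */
} ToyRelation;

/* Slow path: walk every index, as the uncached function used to do. */
static bool
toy_scan_indexes(ToyRelation *rel)
{
    rel->scans++;
    for (int i = 0; i < rel->nindexes; i++)
    {
        if (rel->index_is_exclusion[i])
            return true;
    }
    return false;
}

/* Cached lookup: the index walk runs only while the answer is unknown. */
bool
toy_has_exclusion_constraint(ToyRelation *rel)
{
    assert(rel != NULL);
    if (rel->has_exclusion == TOY_HAS_EXCLUSION_UNKNOWN)
        rel->has_exclusion = toy_scan_indexes(rel)
            ? TOY_HAS_EXCLUSION_YES
            : TOY_HAS_EXCLUSION_NO;
    return rel->has_exclusion == TOY_HAS_EXCLUSION_YES;
}
```

In this toy, repeated calls on the same entry perform a single index walk; resetting has_exclusion to 0 models the relcache rebuild.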

The function used to call RelationGetIndexAttrBitmap up to four times per UPDATE under heavy apply-path gating: once for SUMMARIZED, twice for INDEXED + PRIMARY_KEY in the apply-path branch, and once more for INDEXED in the threshold check. Each call returns a freshly palloc'd Bitmapset that the caller bms_frees, so the per-tuple cost scales with index count. Fetch INDEXED once on the slow path and reuse it across the apply branch and threshold branch. Fetch PRIMARY_KEY at most once, lazily in the apply-path branch. Both bitmaps are bms_freed via a single out: cleanup label. SUMMARIZED is fetched only when classic-HOT fast-path applies and stays scoped to its block. No functional change. Reduces measured wide_64 classic-HOT overhead in HeapUpdateHotAllowable.

Spell out the page flag as PD_HAS_HOT_INDEXED_BRIDGES for consistency with the surrounding pd_flags constants (PD_HAS_FREE_LINES, PD_ALL_VISIBLE, PD_ALL_FROZEN), which spell out the words rather than abbreviate. Same rename for the matching WAL flag XLHP_HAS_HOT_IDX_BRIDGES -> XLHP_HAS_HOT_INDEXED_BRIDGES. Mechanical change.

Document missing items from the audit:

* Bridges under crash recovery: heap_xlog_prune replays them, FPI preserves PD_HAS_HOT_INDEXED_BRIDGES, idempotent re-replay is safe; next vacuum reclaims them.
* The chain-match invariant relaxation: heap_hot_search_buffer does not advance prev_xmax across a bridge, so the next hop's xmin/xmax check effectively skips the bridge.
* Per-index pg_stat_all_indexes columns n_tup_hot_idx_upd_skipped / matched, with the invariant that they sum to the owning table's n_tup_hot_idx_upd.
* Filter 6: write-side check_exclusion_or_unique_constraint recheck was added in 38b3ed5 and is in place; the relation-wide exemption stays for the temporal/decoding gap.

Also extend the hot_indexed.h header comment with a precise note about natts == 0 (heap tuple bodies always carry user attrs; pg_attribute is a slight source of terminological confusion), and document why both t_ctid.offnum and the payload's t_target carry the same back-pointer (one is for amcheck, the other for cheap access by readers). Tighten the bufpage.h PD_HAS_HOT_INDEXED_BRIDGES doc: heap-only producer/consumer; index pages don't carry it. No functional change.

Build a HOT-indexed chain on a wide-ish table by repeatedly UPDATEing a single non-PK indexed column, force opportunistic prune so the dead chain members convert to bridge / adjacent tombstones, then crash-restart the primary via stop('immediate'). After WAL replay the test verifies that:

* an indexscan walking the chain still returns the live tuple,
* stale btree entries through bridges are filtered by xs_hot_indexed_recheck,
* pg_amcheck (verify_heapam) reports no errors on the relation,
* after DELETE plus two VACUUM (FREEZE) passes every tombstone -- bridge or adjacent -- is reclaimed.

The two-VACUUM dance is needed because plain VACUUM does not always visit the page's prune_handle_tombstones path on the first pass once the live row is dead; the second VACUUM forces it. The audit-tracked gap that ordinary VACUUM should reclaim orphaned tombstones in a single pass is item 7.5(b) and is not addressed here. Closes audit item 7.1.

When a transaction performs a HOT-indexed update and subsequently aborts, three things end up on disk: the (live) chain-root tuple R, the aborted heap-only successor H with HEAP_INDEXED_UPDATED set, and a btree leaf entry for the aborted update's key pointing at H. ROLLBACK does not delete the btree entry; that work is deferred to ambulkdelete. Until ambulkdelete runs, the btree leaf is stale and points at an aborted tuple.

In classic HOT this is harmless: H's HEAP_HOT_UPDATED predecessor R is reachable, so the chain walk recognises H as aborted and continues. Under HOT-indexed the situation is different. H is heap-only with no HEAP_HOT_UPDATED predecessor (R does not have its HOT bit set -- the abort never committed it), so the existing nheaponly_items prune path classifies H as 'dead heap-only, no chain' and reclaims H to LP_UNUSED. An unrelated INSERT later reuses H's slot, and the stale btree leaf now resolves to a valid LP_NORMAL tuple that belongs to an unrelated row. _bt_check_unique then sees a live tuple at the matching key and raises a spurious duplicate-key violation, even though the inserter's logical row does not actually duplicate any existing one.

Symptom: stochastic create_view regress failures with ERROR: duplicate key value violates unique constraint "pg_attribute_relid_attnam_index" on the second drop of column f3 inside the test sequence around src/test/regress/sql/create_view.sql:657.

Fix: route aborted HOT-indexed heap-only tuples through the existing bridge-tombstone mechanism rather than reclaiming them. heap_prune_find_live_chain_root() walks back to the chain root across same-page LPs whose t_ctid points at the dead tuple; if a chain root is reachable, the LP is overwritten with a bridge that forwards to it. Bridges keep the slot occupied so it cannot be reused, signal hot_indexed_recheck to chain walkers, and integrate with the existing vacuum / ambulkdelete dead-TID flow that reclaims them once stale leaves are gone. When no chain root is reachable (e.g. predecessor is LP_REDIRECT) the loop falls back to LP_UNUSED as before.

Reduces measured create_view failure rate from ~10% to ~4% over 30-run loops. The residual is a separate multi-update-aborted-chain case (R -> A -> H, both A and H aborted heap-only with HEAP_HOT_UPDATED on A) that hits a pre-existing 'dead heap-only tuple ... is not linked to from any HOT chain' error in the same loop and warrants follow-up.

The previous fix (commit d9df800) handled the leaf of an aborted HOT-indexed update chain: a heap-only tuple with HEAP_INDEXED_UPDATED and !IsHotUpdated, dead due to xmin abort. It missed the case where the same aborted transaction performed two or more HOT-indexed updates in sequence on the same row, producing R (live) -> A1 (heap-only, dead, IsHotUpdated) -> A2 (heap-only, dead, !IsHotUpdated).

heap_prune_chain visits R, finds it LIVE, and stops processing the chain there -- it does not walk into the aborted tail. A2 hits the nheaponly_items branch covered by d9df800; A1 hits the 'else' branch and historically raised elog(ERROR, "dead heap-only tuple ... is not linked to from any HOT chain") even though A1 is in fact part of R's chain (R's t_ctid points at A1) -- heap_prune_chain just chose not to walk into it.

The btree leaf entry from the inner UPDATE pointed at A1, not A2, so A1 has the same stale-leaf hazard as the leaf case: reclaim A1 to LP_UNUSED and an unrelated INSERT can reuse the slot.

Treat A1 the same way as A2: convert to a bridge tombstone forwarding to the live chain root if reachable. When no chain root is reachable (the chain has already been heavily reorganised by prior pruning), fall back to the existing 'is not linked' error path as a conservative last resort.

This addresses the residual stochastic regress failures left by d9df800cff9 in the multi-statement aborted-transaction patterns exercised by create_view, create_index (REINDEX TABLE CONCURRENTLY over pg_class), and alter_table.

The bridge fix in d9df800 falls back to LP_UNUSED when no live chain root is reachable on the same page (the chain has been HOT-updated again, displacing the orphan). LP_UNUSED reuses the slot for a fresh INSERT, which is the exact failure mode the bridge fix was designed to prevent: the surviving stale btree leaf entry then resolves to an unrelated tuple at the reused slot and _bt_check_unique fires a spurious unique-violation error. Use heap_prune_record_dead in this case instead. LP_DEAD pins the slot against reuse and adds the offnum to the page's deadoffsets array so ambulkdelete sweeps the matching stale btree leaves on its next pass; a subsequent vacuum cycle then reclaims the LP via the normal LP_DEAD -> LP_UNUSED transition. Reduces measured stochastic create_view failure rate from ~8% to 0% over 80 consecutive regress runs. alter_table, compression, and create_index residuals reduce in proportion (now 0-2 each per 50 runs).
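The difference between the two fallbacks can be shown with a toy page model. This is a sketch under stated assumptions, not heap code: ToyLpState, toy_insert, and toy_vacuum_reclaim are hypothetical names, and the toy INSERT simply takes the first unused slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical line-pointer states mirroring LP_UNUSED / LP_NORMAL / LP_DEAD. */
typedef enum
{
    TOY_LP_UNUSED,
    TOY_LP_NORMAL,
    TOY_LP_DEAD
} ToyLpState;

#define TOY_PAGE_SLOTS 4

/* Toy INSERT: take the first unused slot; return its index, or -1 if full.
 * A dead slot is never chosen, which is what "pins the slot" means here. */
int
toy_insert(ToyLpState page[TOY_PAGE_SLOTS])
{
    assert(page != NULL);
    for (int i = 0; i < TOY_PAGE_SLOTS; i++)
    {
        if (page[i] == TOY_LP_UNUSED)
        {
            page[i] = TOY_LP_NORMAL;
            return i;
        }
    }
    return -1;
}

/* Toy vacuum pass, assumed to run only after the index sweep has removed
 * every leaf entry that pointed at a dead slot. */
void
toy_vacuum_reclaim(ToyLpState page[TOY_PAGE_SLOTS])
{
    assert(page != NULL);
    for (int i = 0; i < TOY_PAGE_SLOTS; i++)
    {
        if (page[i] == TOY_LP_DEAD)
            page[i] = TOY_LP_UNUSED;
    }
}
```

With the orphan's slot marked dead, a toy INSERT lands elsewhere (or fails on a full page) until the vacuum step runs, so a stale leaf can never resolve to a fresh tuple in the meantime.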

Tests that a reader holding a transaction across a concurrent prune that converts dead chain members into bridge tombstones continues to see consistent index-scan results. Two permutations exercise both orderings: reader snapshots the chain before the prune fires, and reader snapshots after a competing UPDATE has already run on the same row but before the prune+vacuum cycle materialises the bridge. Closes audit gap 7.3.

Now that nodeIndexscan and nodeIndexonlyscan dispatch through amrecheck_leaf_key (commit 3b8e628), no out-of-tree caller needs the function as a public symbol. Drop the prototype from access/nbtree.h and replace it with same-file forward declarations in nbtinsert.c (where it is defined and one early caller lives) and nbtree.c (which registers it against IndexAmRoutine.amrecheck_leaf_key). Pre-amendment audit task 4.5; mechanical change.

Extends 039_hot_indexed_apply.pl with a per-mode scenario that verifies subscriber INSERTs do not produce spurious unique-violation errors after a replicated UPDATE leaves a stale btree leaf key on the subscriber side. The publisher updates a row in tab_uk changing the indexed payload column from 0 to 999, leaving the (0, tag) btree leaf entry behind on the subscriber. The subscriber then INSERTs a fresh row with payload=0 but a unique tag. Under all three apply modes the leaf-key recheck must filter the stale entry on the chain walk and let the INSERT succeed. Closes audit gap 7.2.

Comprehensive sweep of docs and test fixtures for stale identifiers and references after the recent code changes:

* README.HOT-INDEXED: replace residual SIU mentions in the catalog enablement narrative with HOT-indexed; spell out the amrecheck_leaf_key callback path now that nodeIndexonlyscan dispatches through it.
* nodeIndexonlyscan.c: drop tepid codename mention from a comment; match the wording used elsewhere.
* AUDIT_SEQSCAN.md: append addendum noting the two indexOK=true callers found unsafe in stochastic regress investigation (AlterFKConstrEnforceabilityRecurse, RelidByRelfilenumber) and their fix commits.
* gdbinit / tepid-helpers.py: replace stale ExecIndexEntryMatchesTuple reference with the surviving _bt_heap_keys_equal_leaf path.
* hot_indexed_updates.sql / .out: rename test fixtures from siu_* to hi_* and the helper function get_siu_count -> get_hi_count to match the SIU-rename done elsewhere. Regenerate expected output for the new column widths.
* bench/tepid/README.md: fix stale /scratch/siu-bench path.

No functional change to source code.

Upstream commit 3bf6373 'Fix style in a few REPACK ereports' dropped the parentheses from REPACK CONCURRENTLY HINT messages. Refresh the expected output that two earlier tepid commits (adding disallowed-temp/unlogged/catalog scenarios) baked in.

A/B benchmark on nuc (FreeBSD/amd64, 8 cores) against upstream/master 0c025ab. WIDE_COLS=64, hot_indexed_update_threshold=100, 60s per workload, 8 clients. Headline numbers:

- wide_1: WAL -79.1%, TPS -3.9%
- wide_2..wide_48: TPS +5.6% to +11.8%, WAL -13% to -74%
- wide_64: TPS -3.3%, WAL -5.1%
- wide_0 (no indexed col changes): TPS -55.1% -- known classic-HOT overhead at WIDE_COLS=64 from per-tuple HeapUpdateHotAllowable and ExecUpdateModifiedIdxAttrs work that scales superlinearly with attribute count. Cache invalidation of RelationGetIndexedAttrs and the key-attr bitmaps remains a follow-up.

HOT-indexed hit rate stays at ~89-90% across wide_1..wide_64 with threshold=100, confirming the design lets the chain stretch as intended.

RelationGetIndexAttrBitmap returns a defensive bms_copy of the cached per-attrKind bitmap so callers may freely mutate or free it. In hot paths that only test the bitmap (bms_overlap, bms_is_subset, bms_equal, bms_num_members) the copy is pure overhead -- one bms_copy on the way in and one bms_free on the way out. On wide tables those bitmaps span 65+ bits and the copy cost shows up under pgbench-style high-TPS UPDATE workloads.

Add a borrowing variant that returns a const pointer to the cached bitmap directly. The caller treats the result as read-only and must not invoke any code that could trigger a relcache invalidation on the relation between fetch and last use. When rd_attrsvalid is not yet set we route through the existing function once to populate the cache, then return the cached pointer; this keeps the slow path identical to before.

The variant has no in-tree callers in this commit; subsequent commits in the HOT-indexed update series adopt it where the lifetime constraint is easy to verify.
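A rough sketch of the copy-versus-borrow contract follows. It uses a fixed two-word bitmap instead of a Bitmapset, and every name (ToyRel, toy_get_bitmap_copy, toy_get_bitmap_nocopy) is a hypothetical stand-in. Unlike the real variant, which per the text routes through the existing function once to fill the cache, the toy fills the cache directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_WORDS 2             /* 128 bits, enough for a 65-attribute table */

/* Hypothetical relcache entry holding one cached attribute bitmap. */
typedef struct ToyRel
{
    bool     valid;             /* plays the role of rd_attrsvalid */
    uint64_t cached[TOY_WORDS];
    int      builds;            /* how often the cache was populated */
} ToyRel;

/* Populate the cache with an arbitrary illustrative bit pattern. */
static void
toy_build_cache(ToyRel *rel)
{
    rel->cached[0] = 0x5;
    rel->cached[1] = 0x1;
    rel->valid = true;
    rel->builds++;
}

/* Copying variant: the caller owns the result and must free() it. */
uint64_t *
toy_get_bitmap_copy(ToyRel *rel)
{
    uint64_t *copy;

    assert(rel != NULL);
    if (!rel->valid)
        toy_build_cache(rel);
    copy = malloc(sizeof(rel->cached));
    if (copy != NULL)
        memcpy(copy, rel->cached, sizeof(rel->cached));
    return copy;
}

/* Borrowing variant: read-only view, valid only until the cache is rebuilt. */
const uint64_t *
toy_get_bitmap_nocopy(ToyRel *rel)
{
    assert(rel != NULL);
    if (!rel->valid)
        toy_build_cache(rel);
    return rel->cached;
}
```

The borrowed pointer saves one allocation and one free per call, at the price of the lifetime rule stated above: nothing that invalidates the entry may run between the fetch and the last use.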

The CSV writer collapsed n_tup_hot_upd (which is the SUM of classic HOT and HOT-indexed updates under tepid) and n_tup_hot_idx_upd into two columns called 'hot' and 'siu'. Reviewers reading the bench output had to subtract to get the classic-HOT and non-HOT shares. Emit four explicit columns instead:

  classic_hot_updates = n_tup_hot_upd - n_tup_hot_idx_upd
  hot_indexed_updates = n_tup_hot_idx_upd
  non_hot_updates = n_tup_upd - n_tup_hot_upd
  total_updates = n_tup_upd

The console summary line is updated to match. No data loss; the prior schema is recoverable from the new columns.
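The column arithmetic is small enough to state as code. The struct and function names below are hypothetical; only the four formulas come from the commit message.

```c
#include <assert.h>

/* Hypothetical container for the four CSV columns named above. */
typedef struct ToyUpdateCounts
{
    long classic_hot_updates;
    long hot_indexed_updates;
    long non_hot_updates;
    long total_updates;
} ToyUpdateCounts;

/* Derive the four columns from the three raw pg_stat counters. */
ToyUpdateCounts
toy_split_updates(long n_tup_upd, long n_tup_hot_upd, long n_tup_hot_idx_upd)
{
    ToyUpdateCounts c;

    /* n_tup_hot_upd already includes the HOT-indexed updates. */
    assert(n_tup_hot_idx_upd <= n_tup_hot_upd);
    assert(n_tup_hot_upd <= n_tup_upd);

    c.classic_hot_updates = n_tup_hot_upd - n_tup_hot_idx_upd;
    c.hot_indexed_updates = n_tup_hot_idx_upd;
    c.non_hot_updates = n_tup_upd - n_tup_hot_upd;
    c.total_updates = n_tup_upd;
    return c;
}
```

For illustrative inputs of 1000 updates, 900 HOT, and 600 HOT-indexed, the split is 300 classic HOT, 600 HOT-indexed, and 100 non-HOT; the old 'hot' column is recovered as classic_hot_updates + hot_indexed_updates.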

ExecUpdateModifiedIdxAttrs unconditionally fetches the relation's INDEXED attribute bitmap and walks every set attribute via ExecCompareSlotAttrs (slot_getattr x 2 + datum_image_eq). On wide tables this loop dominates the per-row cost of UPDATEs that do not touch any indexed column -- the canonical pgbench-style 'UPDATE t SET id = id WHERE id = ?' workload at WIDE_COLS=64 measured a -55% TPS regression versus master, all of it spent comparing 65 slot attribute pairs whose result is the empty bitmap. Add a fast path that compares the SQL UPDATE's target column set (ExecGetAllUpdatedCols, which folds in generated columns) against the relation's indexed-attr bitmap and returns NULL immediately when they do not intersect. The cached indexed-attr bitmap is fetched via the new RelationGetIndexAttrBitmapNoCopy variant so the fast path costs exactly one bms_overlap and one ExecGetAllUpdatedCols. The fast path must back off when a BEFORE UPDATE or INSTEAD OF UPDATE row trigger is attached to the relation. Such triggers can replace arbitrary columns of the new tuple via heap_modify_tuple() (the canonical example is tsvector_update_trigger() in tsearch.sql, which sets the indexed tsvector column without going through the executor's SET tracking). ExecGetAllUpdatedCols() does not record those mutations, so when either ri_TrigDesc->trig_update_before_row or trig_update_instead_row is set we fall through to the existing full comparison. ExecUpdateModifiedIdxAttrs now takes an EState argument so it can call ExecGetAllUpdatedCols. Adjust the three callers (nodeModifyTable.c, execReplication.c, repack.c) to thread their existing EState pointer through.
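The shape of the fast path, including the trigger back-off, can be sketched with plain 64-bit masks. This is an assumption-laden toy: ToyUpdate and toy_can_skip_comparison are hypothetical names, a single uint64_t replaces the Bitmapset (so it covers at most 64 columns), and the two booleans stand in for the trig_update_before_row and trig_update_instead_row fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one UPDATE; bit i set means column i. */
typedef struct ToyUpdate
{
    uint64_t updated_cols;          /* SET targets plus generated columns */
    uint64_t indexed_attrs;         /* columns covered by any index */
    bool     has_before_row_trigger;
    bool     has_instead_row_trigger;
} ToyUpdate;

/*
 * Return true when the per-attribute comparison loop can be skipped because
 * no indexed column can have changed.
 */
bool
toy_can_skip_comparison(const ToyUpdate *u)
{
    assert(u != NULL);

    /* Row triggers may rewrite columns the SET list never mentions. */
    if (u->has_before_row_trigger || u->has_instead_row_trigger)
        return false;

    /* One overlap test replaces the whole comparison loop. */
    return (u->updated_cols & u->indexed_attrs) == 0;
}
```

The ordering matters: the trigger test comes first because an empty overlap proves nothing once a trigger may have modified the new tuple behind the executor's back.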

lazy_vacuum_heap_page() previously walked the entire line-pointer array after the dead-item conversion loop just to decide whether any HOT-indexed bridge tombstone remained on the page; if none did, it cleared PD_HAS_HOT_INDEXED_BRIDGES. On a busy page that walk is O(maxoff) per second-pass call and adds up across a vacuum cycle. Replace the post-hoc rescan with a running counter. Before the conversion loop, count the bridges currently present on the page in a single pass. Inside the loop, every reclaim of an LP_NORMAL bridge deadoffset decrements the counter. When the loop ends, the counter shows exactly how many bridges survive on the page without a second walk: zero means clear the advisory bit, non-zero means leave it set. The pre-loop walk plus decrements is exact under the function's exclusive buffer lock. Bridges added by an intervening opportunistic prune between pass-1 and pass-2 do not appear in deadoffsets[], so they will not be decremented and the flag correctly stays set; that prune itself set the flag before releasing the buffer lock, so we never lose the hint. The visible behaviour is unchanged; only the bookkeeping shape moves from a post-hoc rescan to per-reclaim decrements, which Plageman has flagged as the preferred pattern on similar second-pass code.
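The counter bookkeeping can be sketched on a toy line-pointer array. All names here (ToyLinePointer, toy_reclaim_dead_items) are hypothetical, offsets are 0-based for brevity, and the return value stands for "keep the advisory bit set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical line pointer: in use or not, and whether it is a bridge. */
typedef struct ToyLinePointer
{
    bool in_use;
    bool is_bridge;
} ToyLinePointer;

/*
 * Reclaim every offset listed in deadoffsets[] and report whether any bridge
 * survives on the page. Bridges are counted once up front and decremented
 * per reclaim, so no second walk over the array is needed.
 */
bool
toy_reclaim_dead_items(ToyLinePointer *items, int nitems,
                       const int *deadoffsets, int ndead)
{
    int nbridges = 0;

    assert(items != NULL);

    /* Single pre-loop pass: bridges currently on the page. */
    for (int i = 0; i < nitems; i++)
    {
        if (items[i].in_use && items[i].is_bridge)
            nbridges++;
    }

    /* Conversion loop: each reclaimed bridge decrements the counter. */
    for (int j = 0; j < ndead; j++)
    {
        ToyLinePointer *lp = &items[deadoffsets[j]];

        if (lp->in_use && lp->is_bridge)
            nbridges--;
        lp->in_use = false;
        lp->is_bridge = false;
    }

    return nbridges > 0;
}
```

A bridge that is on the page but absent from deadoffsets[] is counted and never decremented, which is the toy equivalent of the flag staying set for bridges added by an intervening prune.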

Tepid (HOT-indexed updates) plants two on-page artifacts that classic HOT never produced. Adjacent tombstones carry the per-update modified-indexed-attrs bitmap next to a live HOT-indexed tuple. Bridge tombstones are written by pruneheap in place of a dead mid-chain HOT-indexed LP whose btree entries may still be stale, so chain walkers arriving via those entries find a walkable hop until vacuum reclaims them. Both items are LP_NORMAL with natts == 0 and HEAP_INDEXED_UPDATED set; the standard verify_heapam per-tuple checks see them as invisible via HEAP_XMIN_INVALID and short-circuit, leaving forged or truncated tombstones undetected.

Add an explicit structural check before the standard per-tuple flow. For both variants, validate that natts is zero, HEAP_INDEXED_UPDATED is set, both XMIN_INVALID and XMAX_INVALID are set, and t_hoff matches the fixed sentinel header size. For bridges, require t_ctid.blkno equal to the current block, t_ctid.offnum within the page's live offset range, and the LP length equal to HOT_INDEXED_BRIDGE_SIZE. For adjacent tombstones, require t_ctid.blkno == InvalidBlockNumber, the back-pointer offnum within range, the LP length equal to HotIndexedTombstoneSize for the relation's natts, the payload's t_target equal to t_ctid.offnum, and the payload's t_nbytes equal to ceil(natts/8).

Skip the regular per-tuple checks for tombstones: those checks are written for real tuples and the early visibility short-circuit makes them no-ops anyway. Continue to record the bridge's same-page forward link as a chain successor so chain validation observes the connection.

Add a regression scenario in check_heap.sql that drives single-step and multi-step HOT-indexed UPDATEs followed by VACUUM, then runs verify_heapam and asserts an empty corruption set.
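A few of the adjacent-tombstone checks can be sketched as a standalone validator. The struct below is a hypothetical, flattened view and not the on-disk layout; flag bits, t_hoff, block numbers, and LP lengths are left out. Only the natts == 0, back-pointer, and ceil(natts/8) rules are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of an adjacent tombstone. */
typedef struct ToyAdjacentTombstone
{
    int header_natts;   /* natts stored in the tuple header */
    int ctid_offnum;    /* back-pointer kept in t_ctid.offnum */
    int t_target;       /* the same back-pointer kept in the payload */
    int t_nbytes;       /* payload bitmap length in bytes */
} ToyAdjacentTombstone;

/* Bytes needed for one bit per attribute: ceil(natts / 8). */
int
toy_bitmap_bytes(int rel_natts)
{
    assert(rel_natts >= 0);
    return (rel_natts + 7) / 8;
}

/*
 * Return NULL when the tombstone passes, otherwise a short description of
 * the first failed check. Offsets are 1-based, as on a heap page.
 */
const char *
toy_check_adjacent_tombstone(const ToyAdjacentTombstone *t,
                             int rel_natts, int max_offnum)
{
    assert(t != NULL);
    if (t->header_natts != 0)
        return "natts is not zero";
    if (t->ctid_offnum < 1 || t->ctid_offnum > max_offnum)
        return "back-pointer offnum out of range";
    if (t->t_target != t->ctid_offnum)
        return "t_target does not match t_ctid.offnum";
    if (t->t_nbytes != toy_bitmap_bytes(rel_natts))
        return "t_nbytes does not match ceil(natts/8)";
    return NULL;
}
```

For a 65-attribute relation the expected bitmap length is 9 bytes, so a tombstone truncated to 8 bytes is reported rather than silently skipped.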

…input HeapUpdateDetermineLockmode is on the per-UPDATE hot path: it runs once per heapam_tuple_update() call before heap_update(). When no indexed column changed -- which is the common case for the wide_0 "UPDATE t SET id = id" workload after the executor-side fast path in ExecUpdateModifiedIdxAttrs, and also for any UPDATE that touches only non-indexed columns -- modified_idx_attrs is empty and a key column cannot have changed. Short-circuit to LockTupleNoKeyExclusive without consulting the relcache. Also switch the non-empty path to RelationGetIndexAttrBitmapNoCopy. The function tests overlap and discards the result; it never mutates or frees the bitmap, and nothing between the fetch and the bms_overlap can trigger a relcache invalidation on this relation. Together these eliminate one bms_copy, one bms_overlap-against-empty, and one bms_free per UPDATE on the hot path.

prune_handle_tombstones() decides whether each HOT-indexed tombstone collected in the main per-offnum pass survives pruning. It marks a tombstone unchanged when its target offset is still a live hot-indexed tuple, and reclaims it as LP_UNUSED otherwise. The check consults prstate->nowunused[] and prstate->nowdead[] for the chain-processing decisions; previously, targets being rewritten in place as bridge tombstones (prstate->bridges[]) were missed and the corresponding adjacent tombstones lingered on the page after chain collapse. A bridge has no use for the adjacent tombstone's modified-attrs bitmap. Stale-leaf readers landing on the bridge follow t_ctid to the live tuple and recheck the leaf key against the live tuple's current index form via amrecheck_leaf_key; the bitmap is not consulted along that path. Reclaiming the now-orphan tombstone at chain-collapse time frees the LP and slightly speeds future chain walks past the collapsed segment. Add the bridges[] check alongside the existing nowunused/nowdead checks. No new WAL infrastructure is required: the existing XLHP_HAS_NOW_UNUSED_ITEMS path carries the additional reclaimed offsets, and replay applies them through the same heap_page_prune_execute() loop that already handles tombstone reclaim.
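The survival rule with the added bridges[] check might look like the following toy. ToyPruneState and both functions are hypothetical; the real PruneState arrays hold OffsetNumbers and the real function also verifies that the target is a live HOT-indexed tuple, which the toy takes as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slice of the prune state: offsets queued for each outcome. */
typedef struct ToyPruneState
{
    const int *nowunused;
    int        nunused;
    const int *nowdead;
    int        ndead;
    const int *bridges;     /* targets being rewritten in place as bridges */
    int        nbridges;
} ToyPruneState;

/* Linear membership test over a small offset array. */
static bool
toy_offset_in(const int *offsets, int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (offsets[i] == target)
            return true;
    }
    return false;
}

/*
 * An adjacent tombstone survives only if its target is not queued to become
 * unused, dead, or (the newly added case) a bridge.
 */
bool
toy_tombstone_survives(const ToyPruneState *ps, int target)
{
    assert(ps != NULL);
    if (toy_offset_in(ps->nowunused, ps->nunused, target))
        return false;
    if (toy_offset_in(ps->nowdead, ps->ndead, target))
        return false;
    if (toy_offset_in(ps->bridges, ps->nbridges, target))
        return false;
    return true;
}
```

Dropping the third test reproduces the earlier behaviour in which a tombstone lingered beside a target that had already become a bridge.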

Upstream commit 3bf6373 'Fix style in a few REPACK ereports' restructured the HINTs to substitute the literal "REPACK (CONCURRENTLY)" via a %s parameter; the resulting message includes the parens. An earlier commit on tepid (4483529) wrote the expected output with the parens dropped, which was wrong. Refresh.

ExecUpdateModifiedIdxAttrs's fast path bails out when the SQL UPDATE's targeted columns don't intersect the relation's indexed attribute bitmap. That's correct for ordinary UPDATEs but wrong in two narrower cases:

* FOR PORTION OF UPDATE: the temporal range column changes implicitly via the FOR PORTION OF machinery, not via the SET clause. ExecGetAllUpdatedCols() doesn't see it. A short-circuit here tells heap_update no indexed column changed and the row-split that FOR PORTION OF needs never happens; the original row stays unsplit and an exclusion-constraint violation surfaces on the next overlapping UPDATE. updatable_views/uv_fpo_view is the regression gate.
* Relations carrying any exclusion constraint: temporal PRIMARY KEY ... WITHOUT OVERLAPS and similar custom range/overlap constraints can drive value mutations through paths the SQL target list doesn't capture. HeapUpdateHotAllowable already demotes such relations to non-HOT, but that decision runs after the fast path; the fast path must not empty modified_idx_attrs prematurely.

Bail from the fast path when ri_forPortionOf is set or when RelationHasExclusionConstraint() returns true. Both checks are cheap (struct-field test plus a cached relcache flag introduced in 6e79d82).

Apply the renames recommended in the reviewer pre-mortem:

  pg_subscription.subhotindexedmode -> subhotindexedonapply (matches the option name 'hot_indexed_on_apply')
  pg_stat_get_tuples_hot_idx_updated -> pg_stat_get_tuples_hot_indexed_updated
  n_tup_hot_idx_upd -> n_tup_hot_indexed_upd (Lane has historically rejected 'idx' abbreviations in user-facing identifiers)
  pg_stat_get_tuples_hot_idx_updated_skipped -> pg_stat_get_tuples_hot_indexed_updated_skipped
  pg_stat_get_tuples_hot_idx_updated_matched -> pg_stat_get_tuples_hot_indexed_updated_matched
  PgStat_Counter tuples_hot_idx_updated -> tuples_hot_indexed_updated
  PgStat_Counter tuples_hot_idx_upd_skipped -> tuples_hot_indexed_upd_skipped
  PgStat_Counter tuples_hot_idx_upd_matched -> tuples_hot_indexed_upd_matched
  pgstat_count_hot_idx_upd_skipped -> pgstat_count_hot_indexed_upd_skipped
  pgstat_count_hot_idx_upd_matched -> pgstat_count_hot_indexed_upd_matched

Catalog version was already bumped during the post-rebase catversion conflict resolution; no second bump needed.

Also expands README.HOT-INDEXED's exclusion-constraint exemption section with the precise rationale (temporal PRIMARY KEY ... WITHOUT OVERLAPS, GiST overlap semantics, two prerequisites for lifting), and refreshes hot_indexed_updates expected output to match the post-tombstone-reclaim improvement from commit 63df3b8 (one fewer tombstone after vacuum).

Two-pass A/B sweep on nuc with all 8 prioritized optimizations applied (commits 24f7177, 24ba068, c70bbd3, 9d8f92d, 63df3b8, 2465226, aad3b07 plus the bench split b9b1f53).

Pass A (threshold=100, sweet-spot):
  wide_1: TPS +10.0%, WAL -87.2% (was -3.9% / -79.1% pre-optimization)
  wide_2..8: TPS +11.9% to +14.7%, WAL -74% to -85%
  wide_16: TPS +9.1%, WAL -61.6%
  wide_64: TPS parity, WAL parity

Pass B (threshold=80, default):
  wide_0: TPS +19.5% (was -55% pre-optimization, fast path now firing)
  wide_1..16: TPS +10% to +17%, WAL -61% to -87%
  wide_64: HOT-indexed gate fires, degenerates to non-HOT at parity

The headline wide_1 result moved from -3.9% TPS / -79.1% WAL to +10.0% TPS / -87.2% WAL. TPS improved by 13.9 percentage points through the wide_0 fast path + RelationGetIndexAttrBitmap dedup + KEY-bitmap skip; WAL improved by 8 points through the bench running on the rebased upstream/master.

When heap_prune_chain collapses a partial-dead HOT-indexed chain (R live -> H1 dead -> H2 dead -> ... -> first_live), each dead intermediate chain member that was a HOT-indexed update has its own adjacent tombstone with a per-update modified-attrs bitmap. At collapse time we redirect R to first_live and bridge or reclaim each H[i]; the bridge case has no need for the H[i] tombstone (the bridge itself signals readers via its t_ctid forward link, not via a modified-attrs bitmap).

Until now those H[i] tombstones were reclaimed via the existing nowunused flow (commit 63df3b8), discarding their bitmaps. But the leaf entries described by those bitmaps still chain-walk to the surviving live tuple after collapse: any future reader that consults the bitmap (none consult it today; the leaf-key recheck via amrecheck_leaf_key is the canonical stale-leaf filter) deserves to see the cumulative "attributes that ever changed across the collapsed chain", not just the bitmap of whichever surviving tombstone happened to be adjacent to first_live.

OR-merge each discarded H[i] tombstone's bitmap into first_live's adjacent tombstone before the source LP is reclaimed. The OR is byte-by-byte; adjacent tombstones for the same relation always carry an identical t_nbytes (every per-update bitmap covers the relation's full attribute count), so the operation is well-defined. The union over-approximates -- any leaf that triggered recheck via a per-hop bitmap still triggers via the union -- so correctness is preserved.

Plumbing:

- new XLHP_HAS_TOMBSTONE_UNIONS WAL flag (bit 12) + xlhp_prune_items sub-record carrying (target, source) OffsetNumber pairs
- PruneState gains tombstone_unions[] and ntombstone_unions
- heap_prune_record_tombstone_union() helper queues the pair
- heap_prune_find_tombstone_for() helper locates a chain member's adjacent tombstone in prstate->tombstones[]
- heap_page_prune_execute() applies the byte-OR before LP_UNUSED conversion (so the source body is still readable)
- log_heap_prune_and_freeze() emits the sub-record
- heap_xlog_deserialize_prune_and_freeze() parses it
- pg_waldump heap2_desc prints nunions and the unions: array

The change is behaviorally a no-op today (no reader consults the bitmap), but lays the WAL-format groundwork for the documented follow-up consumers: apply-path index_recheck_constraint for temporal exclusion-constraint replication (README.HOT-INDEXED "exclusion-constraint exemption" lift), and any future per-index recheck that needs to know the cumulative change set across a collapsed chain. Bit 11 stays reserved for chain promotion.
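The byte-wise OR-merge itself is tiny. The function name below is hypothetical, and the sketch assumes the caller has already established that both bitmaps have the same length, which the text says holds for adjacent tombstones of one relation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * OR the source bitmap into the target, byte by byte. Afterwards every bit
 * that was set in either input is set in the target, so the result is a
 * superset of both per-hop bitmaps.
 */
void
toy_union_bitmaps(uint8_t *target, const uint8_t *source, size_t nbytes)
{
    assert(target != NULL && source != NULL);
    for (size_t i = 0; i < nbytes; i++)
        target[i] |= source[i];
}
```

Because OR only ever adds bits, merging several discarded bitmaps in any order yields the same union, and re-applying a merge leaves the target unchanged.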

github-actions bot force-pushed the master branch 30 times, most recently from 3e6d7f8 to 84ff7f1 (April 8, 2026 04:49)

Greg Burd and others added 30 commits May 13, 2026 20:19

gburd wants to merge 107 commits into master from tepid
