
Add selective index updates (n/m) to the heap table AM#23

Closed
gburd wants to merge 107 commits into master from tepid

Conversation

gburd commented Apr 6, 2026

An UPDATE now triggers n of m index updates, rather than updating all indexes, none, or only the summarizing ones. Table AMs can influence this behavior by changing modified_idx_attrs.

github-actions bot force-pushed the master branch 30 times, most recently from 3e6d7f8 to 84ff7f1, April 8, 2026 04:49
Greg Burd and others added 30 commits May 13, 2026 20:19
Commit 51432cd ('Standardise terminology: drop the SIU acronym
throughout') renamed user-facing identifiers but left several
internal symbols and comments still using SIU.  Complete the
rename so the codebase consistently spells the feature
'HOT-indexed':

  IndexScanState.iss_SiuIndexInfo -> iss_HotIndexedRecheckInfo
  nbtinsert.c local siu_slot      -> chain_walk_slot
  nbtinsert.c label bt_siu_skip   -> bt_chain_walk_skip
  heapam_handler.c page_had_siu   -> page_had_hot_indexed
  pgstat_count_heap_update arg siu -> hot_indexed
  heap_prune_item_preserves_siu   -> heap_prune_item_preserves_hot_indexed

Also refresh comments in pruneheap.c, nodeIndexonlyscan.c,
relscan.h, and pgstat_relation.c that still referred to SIU.

No functional change.

The XLHP_HAS_PROMOTIONS flag (bit 11 of xl_heap_prune.flags), its
serialization in log_heap_prune_and_freeze, its replay in
heap_page_prune_execute, and its pg_waldump description were all
present, but no caller ever populated the promotions[] array: every
log_heap_prune_and_freeze and heap_page_prune_execute call site passed
NULL/0 for promotions/npromotions.  The path was therefore reachable
WAL surface that could never fire and would be rejected on review.

The intent was to clear HEAP_INDEXED_UPDATED on surviving heap-only
chain members once a chain became indistinguishable from a classic HOT
chain.  The trigger condition is unsafe without additional bookkeeping
(a chain may have non-bridge stale btree entries that ambulkdelete did
not sweep, so dropping the recheck bit at "last bridge gone" lets
readers arriving via those entries surface stale leaves).  Designing a
safe trigger -- per-page outstanding-ref counts or a post-vacuum
verification walk -- is left for a future commit, which will introduce
its own WAL flag at that time.

Strip the flag bit, the promotions[]/npromotions parameters from
heap_page_prune_execute and log_heap_prune_and_freeze (and their
heapam.h prototypes), the deserialization branch and the apply loop
in heap_xlog_deserialize_prune_and_freeze and heap_page_prune_execute,
and the rmgrdesc print branches in heap2_desc.  Bit 11 is reserved
for a future re-introduction of promotion WAL.  No on-disk format
change visible to existing pages -- the flag was never set.

The previous section described an XLHP_HAS_PROMOTIONS WAL flag and
log/replay pipeline as already in place, with only the trigger
condition outstanding.  That flag has been removed because no caller
ever emitted it.  Rewrite the section to record promotion as
intentional future work, summarize why the obvious "no bridges remain"
trigger is unsafe (stale non-bridge btree entries), and reference the
two trigger-design directions (per-page outstanding-ref bookkeeping or
a post-ambulkdelete verification walk) as a roadmap.  Note that a
future commit activating promotion will reintroduce its own WAL flag.

Both nodeIndexscan and nodeIndexonlyscan need to verify, on a chain
walk that crossed a HOT-indexed hop, that the leaf entry's key still
matches the live tuple's current index form.  nodeIndexonlyscan
already dispatches through the new amrecheck_leaf_key callback;
nodeIndexscan was calling a separate ExecIndexEntryMatchesTuple
helper in execIndexing.c that did the same job using
FormIndexDatum + datum_image_eq.

Switch nodeIndexscan to the callback path so all HOT-indexed leaf
rechecks go through one indexam-shaped surface, and delete
ExecIndexEntryMatchesTuple along with its supporting code in
execIndexing.c and executor.h.  AMs that omit the callback fall
through to the conservative drop, matching the prior permissive
behaviour for non-nbtree AMs.

Drop IndexScanState.iss_HotIndexedRecheckInfo, which was the cached
IndexInfo used by FormIndexDatum and is no longer reachable.

Eliminates the dual leaf-key recheck implementation.

HeapUpdateHotAllowable consults RelationHasExclusionConstraint on
every UPDATE, and the function used to walk the relation's index
list and open every index per call.  On a relation with many indexes
this dominated per-update CPU on classic-HOT workloads, contributing
to a measurable TPS regression at WIDE_COLS=64 versus pre-feature
master.

Cache the answer as a tristate char on RelationData (rd_has_exclusion;
RD_HAS_EXCLUSION_UNKNOWN/NO/YES).  The field is naturally zeroed by
palloc0_object on relcache entry allocation, so 0 = unknown is the
right default.  Reset on relcache rebuild via the existing
RelationClearRelation memcpy swap of the freshly built struct.
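
The tristate cache described above can be modelled in a few lines. This is a minimal Python sketch, not the C implementation: the class `RelationEntry`, its `scans` counter, and the numeric values chosen for NO and YES are assumptions made for illustration. Only the "0 means unknown" convention comes from the text.

```python
from enum import IntEnum


class HasExclusion(IntEnum):
    """Tristate cache value; 0 must mean 'unknown' so a zeroed struct is valid."""
    UNKNOWN = 0
    NO = 1       # assumed value, for illustration only
    YES = 2      # assumed value, for illustration only


class RelationEntry:
    """Hypothetical stand-in for a relcache entry."""

    def __init__(self, index_kinds):
        self.index_kinds = list(index_kinds)        # e.g. ["btree", "exclusion"]
        self.has_exclusion = HasExclusion.UNKNOWN   # zero-initialised default
        self.scans = 0                              # counts expensive index walks

    def relation_has_exclusion_constraint(self):
        # Walk the index list only while the answer is still unknown.
        if self.has_exclusion == HasExclusion.UNKNOWN:
            self.scans += 1
            found = any(kind == "exclusion" for kind in self.index_kinds)
            self.has_exclusion = HasExclusion.YES if found else HasExclusion.NO
        return self.has_exclusion == HasExclusion.YES

    def rebuild(self, index_kinds):
        # A rebuilt entry starts from a fresh struct, so the cache is unknown again.
        self.index_kinds = list(index_kinds)
        self.has_exclusion = HasExclusion.UNKNOWN
```

Repeated calls on the same entry walk the index list once; a rebuild resets the field so the next call recomputes it.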

No on-disk change.

The function used to call RelationGetIndexAttrBitmap up to four times
per UPDATE under heavy apply-path gating: once for SUMMARIZED, twice
for INDEXED + PRIMARY_KEY in the apply-path branch, and once more
for INDEXED in the threshold check.  Each call returns a freshly
palloc'd Bitmapset that the caller bms_frees, so the per-tuple cost
scales with index count.

Fetch INDEXED once on the slow path and reuse it across the apply
branch and threshold branch.  Fetch PRIMARY_KEY at most once, lazily
in the apply-path branch.  Both bitmaps are bms_freed via a single
out: cleanup label.  SUMMARIZED is fetched only when classic-HOT
fast-path applies and stays scoped to its block.

No functional change.  Reduces measured wide_64 classic-HOT
overhead in HeapUpdateHotAllowable.

For consistency with surrounding pd_flags constants
(PD_HAS_FREE_LINES, PD_ALL_VISIBLE, PD_ALL_FROZEN) which spell out
the words rather than abbreviate.  Same rename for the matching
WAL flag XLHP_HAS_HOT_IDX_BRIDGES -> XLHP_HAS_HOT_INDEXED_BRIDGES.

Mechanical change.

Document missing items from the audit:

* Bridges under crash recovery: heap_xlog_prune replays them, FPI
  preserves PD_HAS_HOT_INDEXED_BRIDGES, idempotent re-replay is safe;
  next vacuum reclaims them.
* The chain-match invariant relaxation: heap_hot_search_buffer does
  not advance prev_xmax across a bridge, so the next hop's xmin/xmax
  check effectively skips the bridge.
* Per-index pg_stat_all_indexes columns n_tup_hot_idx_upd_skipped /
  matched, with the invariant that they sum to the owning table's
  n_tup_hot_idx_upd.
* Filter 6: write-side check_exclusion_or_unique_constraint recheck
  was added in 38b3ed5 and is in place; the relation-wide
  exemption stays for the temporal/decoding gap.

Also extend the hot_indexed.h header comment with a precise note
about natts == 0 (heap tuple bodies always carry user attrs;
the pg_attribute terminology is a slight source of confusion), and
document why both t_ctid.offnum and the payload's t_target carry
the same back-pointer (one is for amcheck, the other for cheap
access by readers).

Tighten the bufpage.h PD_HAS_HOT_INDEXED_BRIDGES doc: heap-only
producer/consumer; index pages don't carry it.

No functional change.

Build a HOT-indexed chain on a wide-ish table by repeatedly UPDATEing a
single non-PK indexed column, force opportunistic prune so the dead
chain members convert to bridge / adjacent tombstones, then crash-restart
the primary via stop('immediate').  After WAL replay the test verifies
that:

 * an indexscan walking the chain still returns the live tuple,
 * stale btree entries through bridges are filtered by xs_hot_indexed_recheck,
 * pg_amcheck (verify_heapam) reports no errors on the relation,
 * after DELETE plus two VACUUM (FREEZE) passes every tombstone -- bridge
   or adjacent -- is reclaimed.

The two-VACUUM dance is needed because plain VACUUM does not always
visit the page's prune_handle_tombstones path on the first pass once
the live row is dead; the second VACUUM forces it.  The audit-tracked
gap that ordinary VACUUM should reclaim orphaned tombstones in a single
pass is item 7.5(b) and is not addressed here.

Closes audit item 7.1.

When a HOT-indexed update runs inside a transaction that
subsequently aborts, three things end up on disk:
the (live) chain-root tuple R, the aborted heap-only successor H
with HEAP_INDEXED_UPDATED set, and a btree leaf entry for the
aborted update's key pointing at H.  ROLLBACK does not delete the
btree entry; that work is deferred to ambulkdelete.

Until ambulkdelete runs, the btree leaf is stale-but-pointing-at-an-
aborted-tuple.  In classic HOT this is harmless: H's HEAP_HOT_UPDATED
predecessor R is reachable, so the chain walk recognises H as
aborted and continues.

Under HOT-indexed the situation is different.  H is heap-only with
no HEAP_HOT_UPDATED predecessor (R does not have its HOT bit set --
the abort never committed it), so the existing nheaponly_items
prune path classifies H as 'dead heap-only, no chain' and reclaims
H to LP_UNUSED.  An unrelated INSERT later reuses H's slot, and the
stale btree leaf now resolves to a valid LP_NORMAL tuple from a
different relation.  _bt_check_unique then sees a live tuple at the
matching key and raises a spurious duplicate-key violation, even
though the inserter's logical row does not actually duplicate any
existing one.

Symptom: stochastic create_view regress failures with
  ERROR: duplicate key value violates unique constraint
         "pg_attribute_relid_attnam_index"
on the second drop of column f3 inside the test sequence around
src/test/regress/sql/create_view.sql:657.

Fix: route aborted HOT-indexed heap-only tuples through the
existing bridge-tombstone mechanism rather than reclaiming them.
heap_prune_find_live_chain_root() walks back to the chain root
across same-page LPs whose t_ctid points at the dead tuple; if a
chain root is reachable, the LP is overwritten with a bridge that
forwards to it.  Bridges keep the slot occupied so it cannot be
reused, signal hot_indexed_recheck to chain walkers, and integrate
with the existing vacuum / ambulkdelete dead-TID flow that reclaims
them once stale leaves are gone.  When no chain root is reachable
(e.g. predecessor is LP_REDIRECT) the loop falls back to LP_UNUSED
as before.

Reduces measured create_view failure rate from ~10% to ~4% over
30-run loops.  The residual is a separate multi-update-aborted-chain
case (R -> A -> H, both A and H aborted heap-only with HEAP_HOT_UPDATED
on A) that hits a pre-existing 'dead heap-only tuple ... is not
linked to from any HOT chain' error in the same loop and warrants
follow-up.

The previous fix (commit d9df800) handled the leaf of an aborted
HOT-indexed update chain: a heap-only tuple with HEAP_INDEXED_UPDATED
and !IsHotUpdated, dead due to xmin abort.  It missed the case where
the same aborted transaction performed two or more HOT-indexed
updates in sequence on the same row, producing R (live) -> A1
(heap-only, dead, IsHotUpdated) -> A2 (heap-only, dead, !IsHotUpdated).

heap_prune_chain visits R, finds it LIVE, and stops processing the
chain there -- it does not walk into the aborted tail.  A2 hits the
nheaponly_items branch covered by d9df800; A1 hits the
'else' branch and historically raised

  elog(ERROR, "dead heap-only tuple ... is not linked to from any HOT chain")

even though A1 is in fact part of R's chain (R's t_ctid points at
A1) -- heap_prune_chain just chose not to walk into it.

The btree leaf entry from the inner UPDATE pointed at A1, not A2,
so A1 has the same stale-leaf hazard as the leaf case: reclaim A1
to LP_UNUSED and an unrelated INSERT can reuse the slot.

Treat A1 the same way as A2: convert to a bridge tombstone
forwarding to the live chain root if reachable.  When no chain root
is reachable (the chain has already been heavily reorganised by
prior pruning), fall back to the existing 'is not linked' error
path as a conservative last resort.

This addresses the residual stochastic regress failures left by
d9df800cff9 in the multi-statement aborted-transaction patterns
exercised by create_view, create_index (REINDEX TABLE CONCURRENTLY
over pg_class), and alter_table.

The bridge fix in d9df800 falls back to LP_UNUSED when no live
chain root is reachable on the same page (the chain has been
HOT-updated again, displacing the orphan).  LP_UNUSED reuses the
slot for a fresh INSERT, which is the exact failure mode the bridge
fix was designed to prevent: the surviving stale btree leaf entry
then resolves to an unrelated tuple at the reused slot and
_bt_check_unique fires a spurious unique-violation error.

Use heap_prune_record_dead in this case instead.  LP_DEAD pins the
slot against reuse and adds the offnum to the page's deadoffsets
array so ambulkdelete sweeps the matching stale btree leaves on
its next pass; a subsequent vacuum cycle then reclaims the LP via
the normal LP_DEAD -> LP_UNUSED transition.
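
The difference between the two fallbacks can be seen in a toy page model. The Python sketch below is illustrative only: the list-of-states page, the helper names `first_reusable_slot` and `prune_orphan`, and the 0-based slot indexes are assumptions, not heapam structures.

```python
LP_UNUSED, LP_NORMAL, LP_DEAD = "LP_UNUSED", "LP_NORMAL", "LP_DEAD"


def first_reusable_slot(line_pointers):
    """Return the index an INSERT would reuse, or None if no slot is free."""
    for i, state in enumerate(line_pointers):
        if state == LP_UNUSED:
            return i
    return None


def prune_orphan(line_pointers, orphan, pin_slot):
    """Return a copy of the page with the orphan reclaimed or pinned.

    pin_slot=False models the old LP_UNUSED fallback;
    pin_slot=True models recording the item as LP_DEAD instead.
    """
    result = list(line_pointers)
    result[orphan] = LP_DEAD if pin_slot else LP_UNUSED
    return result
```

With a stale leaf entry still pointing at slot 1, the LP_UNUSED fallback hands slot 1 to the next INSERT, while the LP_DEAD variant leaves no reusable slot at that offset until vacuum has swept the leaf.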

Reduces measured stochastic create_view failure rate from ~8% to
0% over 80 consecutive regress runs.  alter_table, compression,
and create_index residuals reduce in proportion (now 0-2 each
per 50 runs).

Tests that a reader holding a transaction across a concurrent prune
that converts dead chain members into bridge tombstones continues
to see consistent index-scan results.  Two permutations exercise
both orderings: reader snapshots the chain before the prune fires,
and reader snapshots after a competing UPDATE has already run on
the same row but before the prune+vacuum cycle materialises the
bridge.

Closes audit gap 7.3.

Now that nodeIndexscan and nodeIndexonlyscan dispatch through
amrecheck_leaf_key (commit 3b8e628), no out-of-tree caller
needs the function as a public symbol.  Drop the prototype from
access/nbtree.h and replace it with same-file forward declarations
in nbtinsert.c (where it is defined and one early caller lives) and
nbtree.c (which registers it against IndexAmRoutine.amrecheck_leaf_key).

Pre-amendment audit task 4.5; mechanical change.

Extends 039_hot_indexed_apply.pl with a per-mode scenario that
verifies subscriber INSERTs do not produce spurious unique-violation
errors after a replicated UPDATE leaves a stale btree leaf key on
the subscriber side.

The publisher updates a row in tab_uk changing the indexed payload
column from 0 to 999, leaving the (0, tag) btree leaf entry behind
on the subscriber.  The subscriber then INSERTs a fresh row with
payload=0 but a unique tag.  Under all three apply modes the
leaf-key recheck must filter the stale entry on the chain walk and
let the INSERT succeed.

Closes audit gap 7.2.

Comprehensive sweep of docs and test fixtures for stale identifiers
and references after the recent code changes:

* README.HOT-INDEXED: replace residual SIU mentions in the catalog
  enablement narrative with HOT-indexed; spell out the
  amrecheck_leaf_key callback path now that nodeIndexonlyscan
  dispatches through it.

* nodeIndexonlyscan.c: drop tepid codename mention from a comment;
  match the wording used elsewhere.

* AUDIT_SEQSCAN.md: append addendum noting the two indexOK=true
  callers found unsafe in stochastic regress investigation
  (AlterFKConstrEnforceabilityRecurse, RelidByRelfilenumber) and
  their fix commits.

* gdbinit / tepid-helpers.py: replace stale ExecIndexEntryMatchesTuple
  reference with the surviving _bt_heap_keys_equal_leaf path.

* hot_indexed_updates.sql / .out: rename test fixtures from siu_*
  to hi_* and the helper function get_siu_count -> get_hi_count
  to match the SIU-rename done elsewhere.  Regenerate expected
  output for the new column widths.

* bench/tepid/README.md: fix stale /scratch/siu-bench path.

No functional change to source code.

Upstream commit 3bf6373 'Fix style in a few REPACK ereports'
dropped the parentheses from REPACK CONCURRENTLY HINT messages.
Refresh the expected output that two earlier tepid commits (adding
disallowed-temp/unlogged/catalog scenarios) baked in.

A/B benchmark on nuc (FreeBSD/amd64, 8 cores) against
upstream/master 0c025ab.  WIDE_COLS=64,
hot_indexed_update_threshold=100, 60s per workload, 8 clients.

Headline numbers:
- wide_1: WAL -79.1%, TPS -3.9%
- wide_2..wide_48: TPS +5.6% to +11.8%, WAL -13% to -74%
- wide_64: TPS -3.3%, WAL -5.1%
- wide_0 (no indexed col changes): TPS -55.1% -- known classic-HOT
  overhead at WIDE_COLS=64 from per-tuple HeapUpdateHotAllowable and
  ExecUpdateModifiedIdxAttrs work that scales superlinearly with
  attribute count.  Cache invalidation of RelationGetIndexedAttrs
  and the key-attr bitmaps remains a follow-up.

HOT-indexed hit rate stays at ~89-90% across wide_1..wide_64 with
threshold=100, confirming the design lets the chain stretch as
intended.

RelationGetIndexAttrBitmap returns a defensive bms_copy of the cached
per-attrKind bitmap so callers may freely mutate or free it.  In hot
paths that only test the bitmap (bms_overlap, bms_is_subset, bms_equal,
bms_num_members) the copy is pure overhead -- one bms_copy on the way
in and one bms_free on the way out.  At wide tables those bitmaps span
65+ bits and the copy cost shows up under pgbench-style high-TPS
UPDATE workloads.

Add a borrowing variant that returns a const pointer to the cached
bitmap directly.  The caller treats the result as read-only and must
not invoke any code that could trigger a relcache invalidation on the
relation between fetch and last use.  When rd_attrsvalid is not yet
set we route through the existing function once to populate the
cache, then return the cached pointer; this keeps the slow path
identical to before.
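
The copy-versus-borrow trade-off can be sketched as follows. This is a Python model under stated assumptions: `RelCacheEntry`, its `copies` counter, and the use of `None` to stand in for an unset rd_attrsvalid are hypothetical; a `frozenset` plays the role of the const pointer.

```python
class RelCacheEntry:
    """Hypothetical model of a relcache entry holding one cached attribute set."""

    def __init__(self, indexed_attrs):
        self._source = frozenset(indexed_attrs)  # what a catalog scan would find
        self._cached = None                      # None models "cache not yet valid"
        self.copies = 0                          # number of defensive copies made

    def get_index_attr_bitmap(self):
        """Copying variant: the caller owns the result and may mutate it."""
        if self._cached is None:
            self._cached = self._source
        self.copies += 1
        return set(self._cached)

    def get_index_attr_bitmap_nocopy(self):
        """Borrowing variant: returns the cached object itself (read-only)."""
        if self._cached is None:
            # Populate through the existing path once; the copy is discarded.
            self.get_index_attr_bitmap()
        return self._cached
```

After the first call populates the cache, every borrowing call returns the same object with no further copies, whereas the copying variant hands back a fresh mutable set each time.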

The variant has no in-tree callers in this commit; subsequent commits
in the HOT-indexed update series adopt it where the lifetime
constraint is easy to verify.

The CSV writer collapsed n_tup_hot_upd (which is the SUM of classic
HOT and HOT-indexed updates under tepid) and n_tup_hot_idx_upd into
two columns called 'hot' and 'siu'.  Reviewers reading the bench
output had to subtract to get the classic-HOT and non-HOT shares.

Emit four explicit columns instead:
  classic_hot_updates = n_tup_hot_upd - n_tup_hot_idx_upd
  hot_indexed_updates = n_tup_hot_idx_upd
  non_hot_updates     = n_tup_upd - n_tup_hot_upd
  total_updates       = n_tup_upd
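
The four column definitions above amount to simple arithmetic on three counters. A minimal Python sketch, assuming a hypothetical helper name `split_update_counters` (the bench script's actual code is not shown in this commit message):

```python
def split_update_counters(n_tup_upd, n_tup_hot_upd, n_tup_hot_idx_upd):
    """Derive the four explicit CSV columns from the raw pg_stat counters.

    n_tup_hot_upd is assumed to already include HOT-indexed updates,
    as the commit message states.
    """
    return {
        "classic_hot_updates": n_tup_hot_upd - n_tup_hot_idx_upd,
        "hot_indexed_updates": n_tup_hot_idx_upd,
        "non_hot_updates": n_tup_upd - n_tup_hot_upd,
        "total_updates": n_tup_upd,
    }
```

For made-up inputs such as `split_update_counters(1000, 900, 600)`, the three update classes are 300, 600, and 100 and add up to the total; the old combined HOT figure is the sum of the first two columns.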

The console summary line is updated to match.  No data loss; the
prior schema is recoverable from the new columns.

ExecUpdateModifiedIdxAttrs unconditionally fetches the relation's
INDEXED attribute bitmap and walks every set attribute via
ExecCompareSlotAttrs (slot_getattr x 2 + datum_image_eq).  On wide
tables this loop dominates the per-row cost of UPDATEs that do not
touch any indexed column -- the canonical pgbench-style 'UPDATE t SET
id = id WHERE id = ?' workload at WIDE_COLS=64 measured a -55% TPS
regression versus master, all of it spent comparing 65 slot attribute
pairs whose result is the empty bitmap.

Add a fast path that compares the SQL UPDATE's target column set
(ExecGetAllUpdatedCols, which folds in generated columns) against the
relation's indexed-attr bitmap and returns NULL immediately when they
do not intersect.  The cached indexed-attr bitmap is fetched via the
new RelationGetIndexAttrBitmapNoCopy variant so the fast path costs
exactly one bms_overlap and one ExecGetAllUpdatedCols.

The fast path must back off when a BEFORE UPDATE or INSTEAD OF UPDATE
row trigger is attached to the relation.  Such triggers can replace
arbitrary columns of the new tuple via heap_modify_tuple() (the
canonical example is tsvector_update_trigger() in tsearch.sql, which
sets the indexed tsvector column without going through the executor's
SET tracking).  ExecGetAllUpdatedCols() does not record those
mutations, so when either ri_TrigDesc->trig_update_before_row or
trig_update_instead_row is set we fall through to the existing full
comparison.
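
The fast path and its trigger back-off can be sketched as a small decision function. This is a Python model, not the executor code: the function name, the use of column-name sets instead of attribute-number bitmaps, and the convention that `None` means "run the full comparison" (while an empty set means "no indexed attribute changed") are assumptions for illustration.

```python
def modified_idx_attrs_fast_path(updated_cols, indexed_attrs,
                                 has_before_row_trigger=False,
                                 has_instead_row_trigger=False):
    """Return an empty set when no indexed column can have changed,
    or None when the caller must fall through to the full comparison."""
    # Row triggers may rewrite columns the SET clause never mentioned,
    # so the target column set is not trustworthy in that case.
    if has_before_row_trigger or has_instead_row_trigger:
        return None
    # One overlap test: disjoint sets mean nothing indexed was targeted.
    if updated_cols.isdisjoint(indexed_attrs):
        return set()
    return None
```

An UPDATE that targets only a non-indexed column returns the empty set immediately; the same UPDATE on a relation with a BEFORE UPDATE row trigger falls through.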

ExecUpdateModifiedIdxAttrs now takes an EState argument so it can
call ExecGetAllUpdatedCols.  Adjust the three callers
(nodeModifyTable.c, execReplication.c, repack.c) to thread their
existing EState pointer through.

lazy_vacuum_heap_page() previously walked the entire line-pointer
array after the dead-item conversion loop just to decide whether any
HOT-indexed bridge tombstone remained on the page; if none did, it
cleared PD_HAS_HOT_INDEXED_BRIDGES.  On a busy page that walk is
O(maxoff) per second-pass call and adds up across a vacuum cycle.

Replace the post-hoc rescan with a running counter.  Before the
conversion loop, count the bridges currently present on the page in a
single pass.  Inside the loop, every reclaim of an LP_NORMAL bridge
deadoffset decrements the counter.  When the loop ends, the counter
shows exactly how many bridges survive on the page without a second
walk: zero means clear the advisory bit, non-zero means leave it set.

The pre-loop walk plus decrements is exact under the function's
exclusive buffer lock.  Bridges added by an intervening opportunistic
prune between pass-1 and pass-2 do not appear in deadoffsets[], so
they will not be decremented and the flag correctly stays set; that
prune itself set the flag before releasing the buffer lock, so we
never lose the hint.
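
The counter bookkeeping described above can be illustrated with a toy page. In this Python sketch the page is a list of strings and `second_pass_reclaim` is a hypothetical name; offsets are 0-based list indexes rather than OffsetNumbers.

```python
def second_pass_reclaim(items, deadoffsets):
    """Reclaim the listed offsets and report whether the bridge hint stays set.

    items: per-line-pointer labels, one of "tuple", "bridge", "dead", "unused".
    deadoffsets: indexes the second vacuum pass converts to "unused".
    Returns True when at least one bridge survives on the page.
    """
    # Single pre-loop pass: count bridges currently on the page.
    bridges = sum(1 for item in items if item == "bridge")
    for off in deadoffsets:
        if items[off] == "bridge":
            bridges -= 1          # per-reclaim decrement, no rescan afterwards
        items[off] = "unused"
    return bridges > 0
```

A bridge that is not listed in `deadoffsets`, such as one added by an intervening prune, is never decremented, so the function reports that the flag should stay set.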

The visible behaviour is unchanged; only the bookkeeping shape moves
from a post-hoc rescan to per-reclaim decrements, which Plageman has
flagged as the preferred pattern on similar second-pass code.

Tepid (HOT-indexed updates) plants two on-page artifacts that classic
HOT never produced.  Adjacent tombstones carry the per-update modified-
indexed-attrs bitmap next to a live HOT-indexed tuple.  Bridge
tombstones are written by pruneheap in place of a dead mid-chain
HOT-indexed LP whose btree entries may still be stale, so chain walkers
arriving via those entries find a walkable hop until vacuum reclaims
them.  Both items are LP_NORMAL with natts == 0 and HEAP_INDEXED_UPDATED
set; the standard verify_heapam per-tuple checks see them as invisible
via HEAP_XMIN_INVALID and short-circuit, leaving forged or truncated
tombstones undetected.

Add an explicit structural check before the standard per-tuple flow.
For both variants, validate that natts is zero, HEAP_INDEXED_UPDATED is
set, both XMIN_INVALID and XMAX_INVALID are set, and t_hoff matches the
fixed sentinel header size.  For bridges, require t_ctid.blkno equal to
the current block, t_ctid.offnum within the page's live offset range,
and the LP length equal to HOT_INDEXED_BRIDGE_SIZE.  For adjacent
tombstones, require t_ctid.blkno == InvalidBlockNumber, the back-pointer
offnum within range, the LP length equal to HotIndexedTombstoneSize for
the relation's natts, the payload's t_target equal to t_ctid.offnum, and
the payload's t_nbytes equal to ceil(natts/8).  Skip the regular per-
tuple checks for tombstones: those checks are written for real tuples
and the early visibility short-circuit makes them no-ops anyway.
Continue to record the bridge's same-page forward link as a chain
successor so chain validation observes the connection.
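
A subset of the adjacent-tombstone checks listed above can be sketched in Python. The dict layout and the function names are assumptions for illustration; the t_hoff, LP-length, and block-number checks are omitted because their constants are not given in this message.

```python
def expected_bitmap_bytes(rel_natts):
    """Bytes needed for one bit per attribute: ceil(rel_natts / 8)."""
    return (rel_natts + 7) // 8


def check_adjacent_tombstone(ts, rel_natts, max_offnum):
    """Return a list of problems found in a dict describing an adjacent tombstone."""
    problems = []
    if ts["natts"] != 0:
        problems.append("natts is not zero")
    if not ts["indexed_updated"]:
        problems.append("HEAP_INDEXED_UPDATED not set")
    if not (ts["xmin_invalid"] and ts["xmax_invalid"]):
        problems.append("XMIN_INVALID/XMAX_INVALID not both set")
    if not 1 <= ts["ctid_offnum"] <= max_offnum:
        problems.append("back-pointer offnum out of range")
    if ts["t_target"] != ts["ctid_offnum"]:
        problems.append("t_target differs from t_ctid.offnum")
    if ts["t_nbytes"] != expected_bitmap_bytes(rel_natts):
        problems.append("t_nbytes is not ceil(natts/8)")
    return problems
```

For a 65-attribute relation the bitmap needs 9 bytes, so a tombstone claiming 8 bytes would be reported even though the standard visibility checks skip it.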

Add a regression scenario in check_heap.sql that drives single-step and
multi-step HOT-indexed UPDATEs followed by VACUUM, then runs
verify_heapam and asserts an empty corruption set.

…input

HeapUpdateDetermineLockmode is on the per-UPDATE hot path: it runs
once per heapam_tuple_update() call before heap_update().  When no
indexed column changed -- which is the common case for the wide_0
"UPDATE t SET id = id" workload after the executor-side fast path
in ExecUpdateModifiedIdxAttrs, and also for any UPDATE that touches
only non-indexed columns -- modified_idx_attrs is empty and a key
column cannot have changed.  Short-circuit to LockTupleNoKeyExclusive
without consulting the relcache.

Also switch the non-empty path to RelationGetIndexAttrBitmapNoCopy.
The function tests overlap and discards the result; it never mutates
or frees the bitmap, and nothing between the fetch and the
bms_overlap can trigger a relcache invalidation on this relation.
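
The short-circuit can be sketched as below. This Python model makes two assumptions that are not stated in this commit message: that the non-empty path tests overlap against the relation's key-attribute set, and that an overlap selects LockTupleExclusive. The function name and the `get_key_attrs` callback are hypothetical.

```python
def determine_lockmode(modified_idx_attrs, get_key_attrs):
    """Pick a tuple lock mode from the set of modified indexed attributes.

    get_key_attrs is a callable standing in for the relcache lookup, so the
    sketch can show that the empty case never consults it.
    """
    if not modified_idx_attrs:
        # No indexed column changed, so no key column can have changed.
        return "LockTupleNoKeyExclusive"
    key_attrs = get_key_attrs()      # borrowed, read-only view of the key attrs
    if modified_idx_attrs & key_attrs:
        return "LockTupleExclusive"  # assumed outcome when a key column changed
    return "LockTupleNoKeyExclusive"
```

Passing a counting callback shows the effect: an empty `modified_idx_attrs` returns without a lookup, and only the non-empty cases pay for one.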

Together these eliminate one bms_copy, one bms_overlap-against-empty,
and one bms_free per UPDATE on the hot path.

prune_handle_tombstones() decides whether each HOT-indexed tombstone
collected in the main per-offnum pass survives pruning.  It marks a
tombstone unchanged when its target offset is still a live hot-indexed
tuple, and reclaims it as LP_UNUSED otherwise.  The check consults
prstate->nowunused[] and prstate->nowdead[] for the chain-processing
decisions; previously, targets being rewritten in place as bridge
tombstones (prstate->bridges[]) were missed and the corresponding
adjacent tombstones lingered on the page after chain collapse.

A bridge has no use for the adjacent tombstone's modified-attrs
bitmap.  Stale-leaf readers landing on the bridge follow t_ctid to
the live tuple and recheck the leaf key against the live tuple's
current index form via amrecheck_leaf_key; the bitmap is not
consulted along that path.  Reclaiming the now-orphan tombstone at
chain-collapse time frees the LP and slightly speeds future chain
walks past the collapsed segment.

Add the bridges[] check alongside the existing nowunused/nowdead
checks.  No new WAL infrastructure is required: the existing
XLHP_HAS_NOW_UNUSED_ITEMS path carries the additional reclaimed
offsets, and replay applies them through the same
heap_page_prune_execute() loop that already handles tombstone
reclaim.

Upstream commit 3bf6373 'Fix style in a few REPACK ereports'
restructured the HINTs to substitute the literal "REPACK
(CONCURRENTLY)" via a %s parameter; the resulting message includes
the parens.  An earlier commit on tepid (4483529) wrote the
expected output with the parens dropped, which was wrong.  Refresh.

ExecUpdateModifiedIdxAttrs's fast path bails out when the SQL
UPDATE's targeted columns don't intersect the relation's indexed
attribute bitmap.  That's correct for ordinary UPDATEs but wrong
in two narrower cases:

 * FOR PORTION OF UPDATE: the temporal range column changes
   implicitly via the FOR PORTION OF machinery, not via the SET
   clause.  ExecGetAllUpdatedCols() doesn't see it.  A short-circuit
   here tells heap_update no indexed column changed and the
   row-split that FOR PORTION OF needs never happens; the original
   row stays unsplit and an exclusion-constraint violation surfaces
   on the next overlapping UPDATE.  updatable_views/uv_fpo_view
   is the regression gate.

 * Relations carrying any exclusion constraint: temporal PRIMARY
   KEY ... WITHOUT OVERLAPS and similar custom range/overlap
   constraints can drive value mutations through paths the SQL
   target list doesn't capture.  HeapUpdateHotAllowable already
   demotes such relations to non-HOT, but that decision runs after
   the fast path, so the fast path must not preemptively empty
   modified_idx_attrs.

Bail from the fast path when ri_forPortionOf is set or when
RelationHasExclusionConstraint() returns true.  Both checks are
cheap (struct-field test plus a cached relcache flag introduced
in 6e79d82).

Apply the renames recommended in the reviewer pre-mortem:

  pg_subscription.subhotindexedmode -> subhotindexedonapply
    (matches the option name 'hot_indexed_on_apply')
  pg_stat_get_tuples_hot_idx_updated -> pg_stat_get_tuples_hot_indexed_updated
  n_tup_hot_idx_upd -> n_tup_hot_indexed_upd
    (Lane has historically rejected 'idx' abbreviations in user-facing
    identifiers)
  pg_stat_get_tuples_hot_idx_updated_skipped -> pg_stat_get_tuples_hot_indexed_updated_skipped
  pg_stat_get_tuples_hot_idx_updated_matched -> pg_stat_get_tuples_hot_indexed_updated_matched
  PgStat_Counter tuples_hot_idx_updated -> tuples_hot_indexed_updated
  PgStat_Counter tuples_hot_idx_upd_skipped -> tuples_hot_indexed_upd_skipped
  PgStat_Counter tuples_hot_idx_upd_matched -> tuples_hot_indexed_upd_matched
  pgstat_count_hot_idx_upd_skipped -> pgstat_count_hot_indexed_upd_skipped
  pgstat_count_hot_idx_upd_matched -> pgstat_count_hot_indexed_upd_matched

Catalog version was already bumped during the post-rebase
catversion conflict resolution; no second bump needed.

Also expands README.HOT-INDEXED's exclusion-constraint exemption
section with the precise rationale (temporal PRIMARY KEY ...
WITHOUT OVERLAPS, GiST overlap semantics, two prerequisites for
lifting), and refreshes hot_indexed_updates expected output to
match the post-tombstone-reclaim improvement from commit
63df3b8 (one fewer tombstone after vacuum).

Two-pass A/B sweep on nuc with all 8 prioritized optimizations
applied (commits 24f7177, 24ba068, c70bbd3, 9d8f92d,
63df3b8, 2465226, aad3b07 plus the bench split b9b1f53).

Pass A (threshold=100, sweet-spot):
  wide_1: TPS +10.0%, WAL -87.2% (was -3.9% / -79.1% pre-optimization)
  wide_2..8: TPS +11.9% to +14.7%, WAL -74% to -85%
  wide_16: TPS +9.1%, WAL -61.6%
  wide_64: TPS parity, WAL parity

Pass B (threshold=80, default):
  wide_0: TPS +19.5% (was -55% pre-optimization, fast path now firing)
  wide_1..16: TPS +10% to +17%, WAL -61% to -87%
  wide_64: HOT-indexed gate fires, degenerates to non-HOT at parity

The headline wide_1 result moved from -3.9% TPS / -79.1% WAL to
+10.0% TPS / -87.2% WAL.  TPS improved by 13.9 percentage points
through the wide_0 fast path + RelationGetIndexAttrBitmap dedup
+ KEY-bitmap skip; WAL improved by 8 points through the bench
running on the rebased upstream/master.

When heap_prune_chain collapses a partial-dead HOT-indexed chain
(R live -> H1 dead -> H2 dead -> ... -> first_live), each dead
intermediate chain member that was a HOT-indexed update has its own
adjacent tombstone with a per-update modified-attrs bitmap.  At
collapse time we redirect R to first_live and bridge or reclaim each
H[i]; the bridge case has no need for the H[i] tombstone (the bridge
itself signals readers via its t_ctid forward link, not via a
modified-attrs bitmap).

Until now those H[i] tombstones were reclaimed via the existing
nowunused flow (commit 63df3b8), discarding their bitmaps.  But
the leaf entries described by those bitmaps still chain-walk to the
surviving live tuple after collapse: any future reader that consults
the bitmap (none consult it today; the leaf-key recheck via
amrecheck_leaf_key is the canonical stale-leaf filter) deserves to
see the cumulative "attributes that ever changed across the
collapsed chain", not just the bitmap of whichever surviving
tombstone happened to be adjacent to first_live.

OR-merge each discarded H[i] tombstone's bitmap into first_live's
adjacent tombstone before the source LP is reclaimed.  The OR is
byte-by-byte; adjacent tombstones for the same relation always
carry an identical t_nbytes (every per-update bitmap covers the
relation's full attribute count), so the operation is well-defined.
The union over-approximates -- any leaf that triggered recheck via a
per-hop bitmap still triggers via the union -- so correctness is
preserved.
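
The byte-by-byte OR and its over-approximation property can be sketched as follows. This is a Python illustration on `bytes` objects; the function name `or_merge_bitmaps` is hypothetical and the equal-length check mirrors the identical-t_nbytes precondition stated above.

```python
def or_merge_bitmaps(target, source):
    """Return the byte-wise OR of two equal-length attribute bitmaps."""
    if len(target) != len(source):
        raise ValueError("bitmaps for the same relation must have equal length")
    return bytes(t | s for t, s in zip(target, source))


def covers(union, bitmap):
    """True when every bit set in bitmap is also set in union."""
    return all((u & b) == b for u, b in zip(union, bitmap))
```

Merging `b"\x01\x00"` and `b"\x04\x80"` yields `b"\x05\x80"`, and `covers` confirms that each input is a subset of the union, which is the sense in which any per-hop bitmap that triggered a recheck still triggers via the merged one.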

Plumbing:

- new XLHP_HAS_TOMBSTONE_UNIONS WAL flag (bit 12) + xlhp_prune_items
  sub-record carrying (target, source) OffsetNumber pairs
- PruneState gains tombstone_unions[] and ntombstone_unions
- heap_prune_record_tombstone_union() helper queues the pair
- heap_prune_find_tombstone_for() helper locates a chain member's
  adjacent tombstone in prstate->tombstones[]
- heap_page_prune_execute() applies the byte-OR before LP_UNUSED
  conversion (so the source body is still readable)
- log_heap_prune_and_freeze() emits the sub-record
- heap_xlog_deserialize_prune_and_freeze() parses it
- pg_waldump heap2_desc prints nunions and the unions: array

The change is behaviorally a no-op today (no reader consults the
bitmap), but lays the WAL-format groundwork for the documented
follow-up consumers: apply-path index_recheck_constraint for
temporal exclusion-constraint replication (README.HOT-INDEXED
"exclusion-constraint exemption" lift), and any future per-index
recheck that needs to know the cumulative change set across a
collapsed chain.

Bit 11 stays reserved for chain promotion.