
Conversation

@ibrarahmad
Contributor

The relmetacache_flush function was calling hash_search with HASH_REMOVE while iterating through RelMetaCache using hash_seq_search. This corrupts the iterator state because removing an entry deallocates memory that the iterator references, reorganizes bucket chains that invalidate the iterator's position, and may trigger dynamic resizing. The result is segmentation faults from accessing freed memory, skipped entries when the iterator loses its position, or infinite loops from corrupted chain pointers. Fixed by collecting all relids in a first pass, then removing them in a second pass after iteration completes.

The relmetacache_prune function had the same issue when removing invalid cache entries during iteration. It was calling hash_search with HASH_REMOVE on entries marked as invalid while hash_seq_search was actively scanning the hash table. This causes identical corruption problems as the iterator's cached bucket index and chain pointers become stale when entries are physically removed from the table. Fixed using the same two-pass pattern: collect relids of invalid entries during iteration, then remove them after the scan finishes.

The reset_channel_stats function was removing all statistics entries from SpockHash while iterating with hash_seq_search. This violated the fundamental rule that hash tables must not be modified during sequential scans. The iterator maintains pointers to the current entry and the next entry to visit, both of which become dangling pointers when hash_search removes the current entry. Under concurrent load this causes crashes with hash table corrupted errors and potential loss of statistics data. Fixed by copying all keys during iteration, then removing entries in a separate loop after the iterator completes its scan.

@mason-sharp mason-sharp requested review from danolivo and rasifr January 7, 2026 17:41

danolivo commented Jan 8, 2026

In the code:

old_ctxt = MemoryContextSwitchTo(RelMetaCacheContext);
RelMetaCache = hash_create("spock relation metadata cache",
                           RELMETACACHE_INITIAL_SIZE,
                           &ctl, hash_flags);
(void) MemoryContextSwitchTo(old_ctxt);

It is not necessary to use MemoryContextSwitchTo at all - we already instructed hash_create to use a specific memory context. Hence, we may remove the old_ctxt variable altogether.

idx = 0;

/* First pass: collect all keys while iterating over stable hash table */
hash_seq_init(&hash_seq, SpockHash);
Contributor

In general, I can't agree with the approach: upstream and other extensions iterate over a HTAB and call hash_search(HASH_REMOVE) inside the loop to delete entries one by one. See postgres_fdw's InvalidateShippableCacheCallback routine as proof.

The relmetacache_invalidation_cb routine never processes the InvalidOid case. Doing nothing keeps all cache entries valid, even though the invalidation message forces a reset of this cache.

A pass through the HTAB that removes every entry (or removes entries conditionally) may happen in relmetacache_prune and relmetacache_flush.

We add an entry in relmetacache_get_relation. I'm not 100% sure, but is the relmetacache_get_relation routine safe for non-plain relations, such as matviews or indexes? Also, there is one more unnecessary switch there: old_mctx = MemoryContextSwitchTo(RelMetaCacheContext);

The RelMetaCacheContext variable is unnecessary: we never remove or reset this context, and hash_create() already creates a sub-memory context as a child of it. So we could pass no memory context with the ctl structure and have the same behaviour.

Performance improvement: we may introduce a flag 'RelcacheCallbackIsSet'. Given that, we don't need to iterate through each entry and remove them one by one. We could just drop the whole table.

Contributor Author

On your first comment, I partly agree, but the patch adopts the very safe option.

Contributor

What does 'very safe option' exactly mean?
If a bug exists, you should report it to the pgsql-bugs mailing list immediately, and we must fix it in our enterprise Postgres with the highest priority. If not, what does this approach add to the code's safety?

Contributor Author
@ibrarahmad ibrarahmad Jan 8, 2026

  • NOTE: caller may delete the returned element before continuing the scan.
  • However, deleting any other element while the scan is in progress is
  • UNDEFINED (it might be the one that curIndex is pointing at!). Also,
  • if elements are added to the table while the scan is in progress, it is
  • unspecified whether they will be visited by the scan or not.
    https://doxygen.postgresql.org/dynahash_8c_source.html#l1364-l1368

You CAN delete the CURRENT element just returned by hash_seq_search()
You CANNOT delete ANY OTHER element during iteration (undefined behavior)
You CANNOT add elements during iteration (unspecified whether they'll be visited)

Contributor

Which of these three dangers could be realised in the genuinely local spock output plugin hash table?

Contributor

More important: if you have seen the corruption message, what exactly was the case? How could someone else in the same process interfere with this iterator?

Ibrar Ahmed and others added 4 commits January 8, 2026 19:41
The previous commit added extensive documentation explaining the hash
table iterator corruption issue. While thorough, these comments were
overly verbose for production code. This commit condenses them to be
more concise while preserving the essential information: the problem
(calling hash_search HASH_REMOVE during hash_seq_search corrupts the
iterator state), the consequences (crashes, skipped entries), and the
solution (two-pass approach: collect keys, then remove).

Simplified comments in relmetacache_flush, relmetacache_prune, and
reset_channel_stats functions without changing any code logic.

The hash_create call was configured with HASH_CONTEXT flag and
ctl.hcxt = RelMetaCacheContext, which instructs the hash table API to
perform all allocations in the specified context automatically. The
explicit MemoryContextSwitchTo calls around hash_create and hash_search
were therefore redundant and have been removed.

This simplifies the code by removing 2 variable declarations and 4
lines of context switch boilerplate. The hash table operations will
continue to use RelMetaCacheContext for all allocations as configured.

1. Missing NULL checks after hash_search(HASH_ENTER) on HASH_FIXED_SIZE
   tables. When SpockHash or SpockGroupHash reach capacity, hash_search
   returns NULL instead of expanding the table. Dereferencing NULL causes
   segmentation faults. Added NULL checks in handle_stats_counter(),
   spock_group_attach(), and spock_group_progress_update().

2. Race condition in handle_stats_counter() where the function releases
   LW_SHARED, checks hash fullness without lock, then reacquires
   LW_EXCLUSIVE. Between these locks, concurrent processes can fill the
   table, causing HASH_ENTER to return NULL unexpectedly. The added NULL
   check handles this TOCTOU (time-of-check-to-time-of-use) race.

3. Lock mismatch in get_channel_stats() causing torn reads of 64-bit
   counter values. The reader held only LW_SHARED while writers used
   per-entry spinlocks (entry->mutex). Added spinlock acquisition in
   reader to ensure atomic access and prevent data races that violate
   the C memory model.

All three fixes prevent crashes from NULL pointer dereference and ensure
data consistency for statistics counters in shared memory hash tables.