Fix hash table corruption from removing entries during iteration. #307
base: main
Conversation
The relmetacache_flush function was calling hash_search with HASH_REMOVE while iterating through RelMetaCache using hash_seq_search. This corrupts the iterator state because removing an entry deallocates memory that the iterator references, reorganizes bucket chains that invalidate the iterator's position, and may trigger dynamic resizing. The result is segmentation faults from accessing freed memory, skipped entries when the iterator loses its position, or infinite loops from corrupted chain pointers. Fixed by collecting all relids in a first pass, then removing them in a second pass after iteration completes.

The relmetacache_prune function had the same issue when removing invalid cache entries during iteration. It was calling hash_search with HASH_REMOVE on entries marked as invalid while hash_seq_search was actively scanning the hash table. This causes identical corruption problems, as the iterator's cached bucket index and chain pointers become stale when entries are physically removed from the table. Fixed using the same two-pass pattern: collect relids of invalid entries during iteration, then remove them after the scan finishes.

The reset_channel_stats function was removing all statistics entries from SpockHash while iterating with hash_seq_search. This violated the fundamental rule that hash tables must not be modified during sequential scans. The iterator maintains pointers to the current entry and the next entry to visit, both of which become dangling pointers when hash_search removes the current entry. Under concurrent load this causes crashes with "hash table corrupted" errors and potential loss of statistics data. Fixed by copying all keys during iteration, then removing entries in a separate loop after the iterator completes its scan.
In the code:

```c
old_ctxt = MemoryContextSwitchTo(RelMetaCacheContext);
RelMetaCache = hash_create("spock relation metadata cache",
                           RELMETACACHE_INITIAL_SIZE,
                           &ctl, hash_flags);
(void) MemoryContextSwitchTo(old_ctxt);
```

It is not necessary to use MemoryContextSwitchTo at all: we already instructed hash_create to use a specific memory context. Hence, we may remove the old_ctxt variable altogether.
```c
idx = 0;

/* First pass: collect all keys while iterating over stable hash table */
hash_seq_init(&hash_seq, SpockHash);
```
In general, I can't agree with the approach: upstream and other extensions use this pattern: iterate over a HTAB and call hash_search(HASH_REMOVE) inside to delete entries one by one. See the postgres_fdw's InvalidateShippableCacheCallback routine as proof.
The relmetacache_invalidation_cb routine never processes the InvalidOid case. Doing nothing keeps all cache entries valid, even though the invalidation message forces a reset of this cache.
A pass through the HTAB that removes each entry (or removes entries conditionally) may happen in relmetacache_prune and relmetacache_flush.
We add an entry in relmetacache_get_relation. I’m not 100% sure, but is the relmetacache_get_relation routine safe against non-plain tables, such as matviews or indexes? Also, one more unnecessary switch exists: old_mctx = MemoryContextSwitchTo(RelMetaCacheContext);.
The RelMetaCacheContext variable is unnecessary: we never remove or reset this context. But hash_create() creates a sub-memory context as a child of this. So, we could pass no memory context with the ctl structure and have the same behaviour.
Performance improvement: we may introduce a flag 'RelcacheCallbackIsSet'. Given that, we don't need to iterate through each entry and remove them one by one. We could just drop the whole table.
On your first comment, I slightly agree, but the patch adopts the very safe option.
What does 'very safe option' exactly mean?
If a bug exists, you should report it to the pgsql-bugs mailing list immediately, and we must fix it in our enterprise Postgres with the highest priority. If not, what does this approach add to the code's safety?
- NOTE: caller may delete the returned element before continuing the scan.
- However, deleting any other element while the scan is in progress is
- UNDEFINED (it might be the one that curIndex is pointing at!). Also,
- if elements are added to the table while the scan is in progress, it is
- unspecified whether they will be visited by the scan or not.
https://doxygen.postgresql.org/dynahash_8c_source.html#l1364-l1368
You CAN delete the CURRENT element just returned by hash_seq_search()
You CANNOT delete ANY OTHER element during iteration (undefined behavior)
You CANNOT add elements during iteration (unspecified whether they'll be visited)
Which of these three dangers may be realised in the genuinely local spock output plugin hash table?
More importantly: if you have seen the corruption message, what exactly was the case? How might someone else in the same process interfere with this iterator?
The previous commit added extensive documentation explaining the hash table iterator corruption issue. While thorough, these comments were overly verbose for production code. This commit condenses them to be more concise while preserving the essential information: the problem (calling hash_search HASH_REMOVE during hash_seq_search corrupts the iterator state), the consequences (crashes, skipped entries), and the solution (two-pass approach: collect keys, then remove). Simplified comments in relmetacache_flush, relmetacache_prune, and reset_channel_stats functions without changing any code logic.
The hash_create call was configured with HASH_CONTEXT flag and ctl.hcxt = RelMetaCacheContext, which instructs the hash table API to perform all allocations in the specified context automatically. The explicit MemoryContextSwitchTo calls around hash_create and hash_search were therefore redundant and have been removed. This simplifies the code by removing 2 variable declarations and 4 lines of context switch boilerplate. The hash table operations will continue to use RelMetaCacheContext for all allocations as configured.
1. Missing NULL checks after hash_search(HASH_ENTER) on HASH_FIXED_SIZE tables. When SpockHash or SpockGroupHash reach capacity, hash_search returns NULL instead of expanding the table. Dereferencing NULL causes segmentation faults. Added NULL checks in handle_stats_counter(), spock_group_attach(), and spock_group_progress_update().

2. Race condition in handle_stats_counter() where the function releases LW_SHARED, checks hash fullness without a lock, then reacquires LW_EXCLUSIVE. Between these locks, concurrent processes can fill the table, causing HASH_ENTER to return NULL unexpectedly. The added NULL check handles this TOCTOU (time-of-check-to-time-of-use) race.

3. Lock mismatch in get_channel_stats() causing torn reads of 64-bit counter values. The reader held only LW_SHARED while writers used per-entry spinlocks (entry->mutex). Added spinlock acquisition in the reader to ensure atomic access and prevent data races that violate the C memory model.

All three fixes prevent crashes from NULL pointer dereference and ensure data consistency for statistics counters in shared memory hash tables.