fix(pvtdatastorage): close collItr before releasing purgerLock in processCollElgEvents #5424
Open
sridhar-panigrahi wants to merge 1 commit into hyperledger:main from
Conversation
fix(pvtdatastorage): close collItr before releasing purgerLock in processCollElgEvents

When processCollElgEvents flushes an oversized batch, it releases purgerLock to sleep between writes but leaves the collItr LevelDB snapshot iterator open. LevelDB snapshots freeze DB state at creation time, so while the lock is dropped the purger goroutine can wake up and delete InelgMissingData entries that collItr hasn't processed yet. When the lock is re-acquired, the stale iterator still yields those deleted keys, and the loop re-inserts them as ElgPrioMissingData. The result is that expired private data gets written back into the store and the reconciler spins trying to fetch it from peers that have already purged it.

Fix: release collItr before dropping the lock, then reopen a fresh iterator from the last-processed key after re-acquiring it. Since that key was already deleted by the flushed batch, the new iterator skips it and picks up from the next real entry, now reflecting whatever the purger cleaned up during the sleep.

Signed-off-by: Shridhar Panigrahi <sridharpanigrahi2006@gmail.com>
sridhar-panigrahi (Contributor, Author): @pfi79, please let me know your thoughts on this!
I found a bug in processCollElgEvents where a stale LevelDB iterator can cause expired private data to be silently re-added to the store, making the reconciler loop forever trying to fetch data that no longer exists.

What's happening
When converting ineligible missing data entries to eligible ones, the function batches up writes and sleeps between batches to avoid hammering the DB. Before sleeping it drops purgerLock.

The problem is that collItr, a LevelDB snapshot iterator over the ineligible missing data range, is still open when the lock is released. LevelDB snapshot iterators capture the DB state at the moment they're created and keep returning those keys regardless of what happens to the DB afterwards.

So while we're sleeping, the purger goroutine wakes up (it was blocked waiting for purgerLock), runs purgeExpiredData, and deletes both the eligible and ineligible missing data entries for any BTL-expired collections. When we re-acquire the lock and keep iterating, the stale snapshot still hands us those deleted keys, and we write them back to the DB as eligible missing data.

The reconciler then picks these up, asks every peer for the data, and every peer says no, because it's been purged everywhere. This repeats until the purger happens to run again and cleans up the re-inserted entries.
The fix

Close collItr before releasing the lock. After re-acquiring the lock, reopen a fresh iterator starting from the last-processed key. Since that key was already deleted by the batch we just flushed, the new iterator skips it and starts from the next real entry, and this time it reflects the actual DB state, including whatever the purger deleted during the sleep.

Why it's hard to notice
There's no crash and the peer keeps running normally. The only signal is the reconciler logging failures to fetch private data, which happens in normal operation too (e.g. when peers are temporarily unreachable). You'd have to specifically correlate those warnings with collection eligibility events and purge intervals to suspect this. In practice it just looks like slow or noisy reconciliation.