
fix(pvtdatastorage): close collItr before releasing purgerLock in processCollElgEvents #5424

Open

sridhar-panigrahi wants to merge 1 commit into hyperledger:main from sridhar-panigrahi:fix/pvtdata-collElg-purger-lock-iterator-race

Conversation


@sridhar-panigrahi sridhar-panigrahi commented Mar 18, 2026

I found a bug in processCollElgEvents where a stale LevelDB iterator can cause expired private data to be silently re-added to the store, making the reconciler loop forever trying to fetch data that no longer exists.

What's happening

When converting ineligible missing data entries to eligible ones, the function batches up writes and sleeps between batches to avoid hammering the DB. Before sleeping it drops purgerLock:

```go
s.db.WriteBatch(batch, true)
s.purgerLock.Unlock()
time.Sleep(sleepTime * time.Millisecond)
s.purgerLock.Lock()
```

The problem is that collItr — a LevelDB snapshot iterator over the ineligible missing data range — is still open when the lock is released. LevelDB snapshot iterators capture the DB state at the moment they're created and keep returning those keys regardless of what happens to the DB afterwards.

So while we're sleeping, the purger goroutine wakes up (it was blocked waiting for purgerLock), runs purgeExpiredData, and deletes both the eligible and ineligible missing data entries for any BTL-expired collections. When we re-acquire the lock and keep iterating, the stale snapshot still hands us those deleted keys, and we write them back to the DB as eligible missing data:

```go
batch.Delete(originalKey)                                    // already gone, no-op
batch.Put(encodeElgPrioMissingDataKey(modifiedKey), copyVal) // re-inserts expired entry
```

The reconciler then picks these up, asks every peer for the data, and every peer says no — because it's been purged everywhere. This repeats until the purger happens to run again and cleans up the re-inserted entries.

The fix

Close collItr before releasing the lock. After re-acquiring the lock, reopen a fresh iterator starting from the last-processed key. Since that key was already deleted by the batch we just flushed, the new iterator skips it and starts from the next real entry — and this time it reflects the actual DB state, including whatever the purger deleted during the sleep.

```go
nextKey := make([]byte, len(originalKey))
copy(nextKey, originalKey)
collItr.Release()
s.purgerLock.Unlock()
time.Sleep(sleepTime * time.Millisecond)
s.purgerLock.Lock()
collItr, err = s.db.GetIterator(nextKey, endKey)
if err != nil {
    return err
}
```

Why it's hard to notice

There's no crash and the peer keeps running normally. The only signal is the reconciler logging failures to fetch private data, which happens in normal operation too (e.g. when peers are temporarily unreachable). You'd have to specifically correlate those warnings with collection eligibility events and purge intervals to suspect this. In practice it just looks like slow or noisy reconciliation.

@sridhar-panigrahi requested a review from a team as a code owner on March 18, 2026 15:41
@sridhar-panigrahi force-pushed the fix/pvtdata-collElg-purger-lock-iterator-race branch from d568539 to 6c7dfac on March 18, 2026 15:45
…cessCollElgEvents

When processCollElgEvents flushes an oversized batch, it releases
purgerLock to sleep between writes but leaves the collItr LevelDB
snapshot iterator open. LevelDB snapshots freeze DB state at creation
time, so while the lock is dropped the purger goroutine can wake up and
delete InelgMissingData entries that collItr hasn't processed yet.
When the lock is re-acquired, the stale iterator still yields those
deleted keys, and the loop re-inserts them as ElgPrioMissingData. The
result is that expired private data gets written back into the store
and the reconciler spins trying to fetch it from peers that have
already purged it.

Fix: release collItr before dropping the lock, then reopen a fresh
iterator from the last-processed key after re-acquiring it. Since that
key was already deleted by the flushed batch, the new iterator skips it
and picks up from the next real entry, now reflecting whatever the
purger cleaned up during the sleep.

Signed-off-by: Shridhar Panigrahi <sridharpanigrahi2006@gmail.com>
@sridhar-panigrahi force-pushed the fix/pvtdata-collElg-purger-lock-iterator-race branch from 6c7dfac to 105a7f0 on March 18, 2026 15:46
@sridhar-panigrahi

@pfi79, please let me know your thoughts on this!
