
fix(pvtdatastorage): close collItr before releasing purgerLock in processCollElgEvents #5424

Open

sridhar-panigrahi wants to merge 1 commit into hyperledger:main from sridhar-panigrahi:fix/pvtdata-collElg-purger-lock-iterator-race

Conversation


@sridhar-panigrahi sridhar-panigrahi commented Mar 18, 2026

I found a bug in processCollElgEvents where a stale LevelDB iterator can cause expired private data to be silently re-added to the store, making the reconciler loop forever trying to fetch data that no longer exists.

What's happening

When converting ineligible missing data entries to eligible ones, the function batches up writes and sleeps between batches to avoid hammering the DB. Before sleeping it drops purgerLock:

```go
s.db.WriteBatch(batch, true)
s.purgerLock.Unlock()
time.Sleep(sleepTime * time.Millisecond)
s.purgerLock.Lock()
```

The problem is that collItr — a LevelDB snapshot iterator over the ineligible missing data range — is still open when the lock is released. LevelDB snapshot iterators capture the DB state at the moment they're created and keep returning those keys regardless of what happens to the DB afterwards.

So while we're sleeping, the purger goroutine wakes up (it was blocked waiting for purgerLock), runs purgeExpiredData, and deletes both the eligible and ineligible missing data entries for any BTL-expired collections. When we re-acquire the lock and keep iterating, the stale snapshot still hands us those deleted keys, and we write them back to the DB as eligible missing data:

```go
batch.Delete(originalKey)                                    // already gone, no-op
batch.Put(encodeElgPrioMissingDataKey(modifiedKey), copyVal) // re-inserts expired entry
```

The reconciler then picks these up, asks every peer for the data, and every peer says no — because it's been purged everywhere. This repeats until the purger happens to run again and cleans up the re-inserted entries.

The fix

Close collItr before releasing the lock. After re-acquiring the lock, reopen a fresh iterator starting from the last-processed key. Since that key was already deleted by the batch we just flushed, the new iterator skips it and starts from the next real entry — and this time it reflects the actual DB state, including whatever the purger deleted during the sleep.

```go
nextKey := make([]byte, len(originalKey))
copy(nextKey, originalKey)
collItr.Release()
s.purgerLock.Unlock()
time.Sleep(sleepTime * time.Millisecond)
s.purgerLock.Lock()
collItr, err = s.db.GetIterator(nextKey, endKey)
if err != nil {
    return err
}
```

Why it's hard to notice

There's no crash and the peer keeps running normally. The only signal is the reconciler logging failures to fetch private data, which happens in normal operation too (e.g. when peers are temporarily unreachable). You'd have to specifically correlate those warnings with collection eligibility events and purge intervals to suspect this. In practice it just looks like slow or noisy reconciliation.

@sridhar-panigrahi requested a review from a team as a code owner on March 18, 2026 15:41
@sridhar-panigrahi force-pushed the fix/pvtdata-collElg-purger-lock-iterator-race branch from d568539 to 6c7dfac on March 18, 2026 15:45
…cessCollElgEvents

When processCollElgEvents flushes an oversized batch, it releases
purgerLock to sleep between writes but leaves the collItr LevelDB
snapshot iterator open. LevelDB snapshots freeze DB state at creation
time, so while the lock is dropped the purger goroutine can wake up and
delete InelgMissingData entries that collItr hasn't processed yet.
When the lock is re-acquired, the stale iterator still yields those
deleted keys, and the loop re-inserts them as ElgPrioMissingData. The
result is that expired private data gets written back into the store
and the reconciler spins trying to fetch it from peers that have
already purged it.

Fix: release collItr before dropping the lock, then reopen a fresh
iterator from the last-processed key after re-acquiring it. Since that
key was already deleted by the flushed batch, the new iterator skips it
and picks up from the next real entry, now reflecting whatever the
purger cleaned up during the sleep.

Signed-off-by: Shridhar Panigrahi <sridharpanigrahi2006@gmail.com>
@sridhar-panigrahi force-pushed the fix/pvtdata-collElg-purger-lock-iterator-race branch from 6c7dfac to 105a7f0 on March 18, 2026 15:46
@sridhar-panigrahi

@pfi79, please let me know your thoughts on this!
