HDDS-14768. Fix lock leak during snapshot cache cleanup and handle eviction race appropriately. #9869

Open
SaketaChalamchala wants to merge 2 commits into apache:master from SaketaChalamchala:HDDS-14768
SaketaChalamchala commented Mar 5, 2026

What changes were proposed in this pull request?

Currently,

  1. Eviction race: SnapshotCache cleanup throws an IllegalStateException when it finds stale entries in pendingEvictionQueue for snapshots that have already been removed from dbMap.
    For example, SnapshotPurge invalidates the entry just before the last thread holding a reference to the snapshot closes it, which adds the snapshotID back to the eviction queue.
  2. Inconsistent bookkeeping: invalidate removes the snapshot entry from dbMap but does not remove it from pendingEvictionQueue if it is present there.
  3. Potential snapshot leak: a snapshot close failure during cleanup removes the snapshotID from the eviction queue and throws an exception. The snapshot then remains in the cache even if refCount = 0, and its entry stays in dbMap unless
    some other thread explicitly invalidates it or references it again. During this time SnapshotCache.lock() cannot acquire the write lock, because lock() expects the cache to be drained.
  4. Write lock leak: if the cache drain cleanup(true) throws an exception, the write lock is not released in SnapshotCache.lock().
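
Item 4 is the easiest to see in isolation. The following is a minimal sketch, not the Ozone code: `LockLeakSketch`, `drain()`, `leakyLock()`, `guardedLock()` and the `drainFails` hook are hypothetical names, and `drain()` merely stands in for `cleanup(true)`. It assumes that on success the write lock stays held for the caller to release later.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical miniature of the locking pattern; not the Ozone SnapshotCache code.
class LockLeakSketch {
  final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  boolean drainFails = true; // hook: simulate the cache drain throwing

  /** Stand-in for the cache drain (cleanup(true) in the description). */
  void drain() {
    if (drainFails) {
      throw new IllegalStateException("simulated drain failure");
    }
  }

  /** Leaky shape: an exception from drain() skips any unlock. */
  void leakyLock() {
    rwLock.writeLock().lock();
    drain(); // if this throws, the write lock stays held by this thread
  }

  /** Guarded shape: release the write lock before rethrowing. */
  void guardedLock() {
    rwLock.writeLock().lock();
    try {
      drain();
    } catch (RuntimeException e) {
      rwLock.writeLock().unlock();
      throw e;
    }
    // On success the lock stays held; the caller releases it later.
  }

  public static void main(String[] args) {
    LockLeakSketch s = new LockLeakSketch();
    try {
      s.leakyLock();
    } catch (IllegalStateException expected) {
      // drain failed while the write lock was held
    }
    System.out.println("after leakyLock: writeLocked=" + s.rwLock.isWriteLocked());
    s.rwLock.writeLock().unlock(); // manual recovery, only for this demo

    try {
      s.guardedLock();
    } catch (IllegalStateException expected) {
      // drain failed, but the lock was released before rethrowing
    }
    System.out.println("after guardedLock: writeLocked=" + s.rwLock.isWriteLocked());
  }
}
```

In the leaky variant the failed drain leaves the write lock held, so every later reader or writer blocks; the guarded variant releases it on the failure path only.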

Proposed solution:

  1. Handle the eviction race appropriately: log the stale snapshot entry found in the eviction queue and remove it from the queue.
  2. Remove the snapshot entry from the eviction queue upon successful invalidation.
  3. Log a snapshot close failure during cleanup but do not remove the entry from the eviction queue, so that its cleanup can be retried later.
  4. Catch any unchecked exception during cleanup and release the write lock in SnapshotCache.lock().
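
Points 1 to 3 can be sketched together. This is a simplified, single-threaded model under stated assumptions, not the patch itself: `EvictionSketch`, `Entry` and the `failOnClose` hook are hypothetical, only the names `dbMap` and `pendingEvictionQueue` come from the description, and every queued entry is assumed to have refCount 0.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the cache bookkeeping; not the Ozone code.
class EvictionSketch {
  /** Hypothetical cached snapshot handle. */
  static class Entry {
    boolean failOnClose; // hook to simulate a close failure

    void close() {
      if (failOnClose) {
        throw new IllegalStateException("simulated close failure");
      }
    }
  }

  final Map<UUID, Entry> dbMap = new ConcurrentHashMap<>();
  final Set<UUID> pendingEvictionQueue = ConcurrentHashMap.newKeySet();

  /** Point 2: a successful invalidation also clears the eviction queue entry. */
  void invalidate(UUID id) {
    Entry e = dbMap.get(id);
    if (e != null) {
      e.close();
      dbMap.remove(id);
      pendingEvictionQueue.remove(id);
    }
  }

  /** One cleanup pass; returns the number of snapshots actually evicted. */
  int cleanup() {
    int evicted = 0;
    // Key-set iterators of ConcurrentHashMap tolerate removal during iteration.
    for (UUID id : pendingEvictionQueue) {
      Entry e = dbMap.get(id);
      if (e == null) {
        // Point 1: stale ID, already gone from dbMap. Log and drop, do not throw.
        System.err.println("Stale eviction queue entry " + id + ", removing");
        pendingEvictionQueue.remove(id);
        continue;
      }
      try {
        e.close();
        dbMap.remove(id);
        pendingEvictionQueue.remove(id);
        evicted++;
      } catch (RuntimeException ex) {
        // Point 3: log and keep the ID queued so a later pass can retry.
        System.err.println("Close failed for " + id + ", will retry: " + ex.getMessage());
      }
    }
    return evicted;
  }

  public static void main(String[] args) {
    EvictionSketch cache = new EvictionSketch();
    cache.pendingEvictionQueue.add(UUID.randomUUID()); // stale: not in dbMap

    UUID failing = UUID.randomUUID();
    Entry entry = new Entry();
    entry.failOnClose = true;
    cache.dbMap.put(failing, entry);
    cache.pendingEvictionQueue.add(failing);

    System.out.println("first pass evicted: " + cache.cleanup());
    entry.failOnClose = false; // the failure clears up
    System.out.println("second pass evicted: " + cache.cleanup());
  }
}
```

The first pass drops the stale ID and leaves the failing snapshot queued; once the close succeeds, the second pass evicts it and leaves both `dbMap` and the queue empty, which is the drained state that `lock()` expects.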

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14768

How was this patch tested?

Unit tests.

jojochuang marked this pull request as ready for review March 7, 2026 00:04