fix(utils): update existing keys in-place in FifoCache::push #2065

Open
amathxbt wants to merge 2 commits into 0xMiden:next from amathxbt:fix/fifo-cache-ghost-eviction-entries

amathxbt commented May 9, 2026

Summary

FifoCache::push unconditionally appended key to the eviction queue before calling map.insert, even when the key was already present in the cache. This created a ghost eviction entry: the queue held more entries than the map had live keys, so an overwrite consumed an eviction slot without adding a map entry. When the older duplicate reached the front of the queue, map.remove dropped the still-live value for that key; when the later duplicate surfaced, map.remove found nothing and the pop was wasted. The net effect was that each overwrite reduced the cache's effective unique-entry capacity by one and could evict a still-live entry prematurely.

Root Cause

// Before (buggy)
pub fn push(&self, key: K, value: V) {
    let mut inner = self.0.lock().expect("fifo cache lock poisoned");
    if inner.eviction.len() >= inner.capacity.get() {
        if let Some(oldest) = inner.eviction.pop_front() {
            inner.map.remove(&oldest);  // removes the live entry if this key was pushed twice
        }
    }
    inner.eviction.push_back(key.clone()); // appended EVEN for existing keys
    inner.map.insert(key, value);
}

With capacity = 2:

  1. push(A, 1) → queue: [A], map: {A:1}
  2. push(A, 2) → queue: [A, A], map: {A:2} ← ghost created, capacity consumed
  3. push(B, 3) → pops the front A and removes the live A:2; queue: [A, B], map: {B:3} ← A lost prematurely
  4. push(C, 4) → pops the stale A, map.remove finds nothing; queue: [B, C], map: {B:3, C:4}
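
The trace can be checked by hand with a minimal stand-in for the pre-fix cache. The sketch below is an assumption-laden simplification: `BuggyFifo` is a hypothetical name, it is single-threaded (no mutex), and it stores the capacity as a plain `usize`. Only the body of `push` mirrors the code quoted above.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Hypothetical, lock-free stand-in for the pre-fix cache.
struct BuggyFifo<K, V> {
    eviction: VecDeque<K>,
    map: HashMap<K, V>,
    capacity: usize,
}

impl<K: Hash + Eq + Clone, V> BuggyFifo<K, V> {
    fn new(capacity: usize) -> Self {
        Self { eviction: VecDeque::new(), map: HashMap::new(), capacity }
    }

    /// Mirrors the pre-fix `push`: the key is queued even if it is already cached.
    fn push(&mut self, key: K, value: V) {
        if self.eviction.len() >= self.capacity {
            if let Some(oldest) = self.eviction.pop_front() {
                // If `oldest` was pushed twice, this removes its live value.
                self.map.remove(&oldest);
            }
        }
        self.eviction.push_back(key.clone());
        self.map.insert(key, value);
    }
}
```

With capacity 2, pushing A twice leaves two queue entries for a single map entry, and the next push of a new key pops the front A and removes A from the map.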

Fix

Check map.contains_key(&key) first. When the key exists, update the value in-place and return immediately, leaving the eviction queue unchanged:

// After (fixed)
if inner.map.contains_key(&key) {
    inner.map.insert(key, value);
    return;
}

Testing

Added two new tests:

  • overwrite_key_updates_value_in_place — verifies that overwriting a key does not consume an extra eviction slot.
  • overwrite_does_not_change_eviction_position — verifies that the overwritten key is still evicted at its original FIFO position when the cache later fills up.
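
A rough sketch of the fixed behaviour and the two scenarios the tests describe, under the same simplifications as a single-threaded stand-in. `FixedFifo` and its `get` helper are hypothetical names, not the crate's actual API; only the early return in `push` follows the fix quoted above.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Hypothetical, lock-free stand-in for the cache after the fix.
struct FixedFifo<K, V> {
    eviction: VecDeque<K>,
    map: HashMap<K, V>,
    capacity: usize,
}

impl<K: Hash + Eq + Clone, V> FixedFifo<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { eviction: VecDeque::new(), map: HashMap::new(), capacity }
    }

    fn push(&mut self, key: K, value: V) {
        // Existing key: overwrite the value and leave the eviction queue alone.
        if self.map.contains_key(&key) {
            self.map.insert(key, value);
            return;
        }
        // New key: evict the oldest entry first if the cache is full.
        if self.eviction.len() >= self.capacity {
            if let Some(oldest) = self.eviction.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.eviction.push_back(key.clone());
        self.map.insert(key, value);
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.map.get(key)
    }
}
```

In this sketch an overwritten key keeps its original queue position, so it is still the first to be evicted once the cache fills up.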

CHANGELOG

Added entry to ## v0.15.0 (TBD) section.

amathxbt added 2 commits May 9, 2026 02:37
When push() was called with an already-present key the previous
implementation unconditionally appended the key to the eviction
queue before calling map.insert(). This created a ghost entry: the
eviction queue length exceeded the number of live map entries,
effectively reducing unique-entry capacity and causing a valid value
to be prematurely dropped when the ghost surfaced as the oldest key.

Fix: check map.contains_key() first and, when the key exists, update
the value in-place without touching the eviction queue.
if inner.map.contains_key(&key) {
    inner.map.insert(key, value);
    return;
}
Collaborator

Thanks for identifying the bug. However, the FIFO behaviour is still not quite right with this fix.

Every time an entry is pushed into the FIFO cache, it should be put (or moved) into the back of the eviction queue. This still doesn't happen after this fix.

I think we have two options as to how to implement this (without an O(n) scan of the eviction queue):

  1. Use a linked hash map (requires external crate) instead of the map and vecdeque; or
  2. A tombstone mechanism which prevents prior entries in the eviction queue from removing entries that were pushed multiple times.

@Mirko-von-Leipzig any thoughts / preference?
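
One possible reading of the tombstone option, sketched under assumptions: each push stamps the key with a fresh generation, and a queue entry only evicts its key when the generations still match. `TombstoneFifo`, the generation counter, and the compaction threshold are all hypothetical choices, not something proposed verbatim in the thread.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Hypothetical tombstone-style FIFO cache; every name here is illustrative.
struct TombstoneFifo<K, V> {
    map: HashMap<K, (V, u64)>,  // value plus the generation of its latest push
    queue: VecDeque<(K, u64)>,  // may hold stale (key, generation) pairs
    next_generation: u64,
    capacity: usize,
}

impl<K: Hash + Eq + Clone, V> TombstoneFifo<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { map: HashMap::new(), queue: VecDeque::new(), next_generation: 0, capacity }
    }

    fn push(&mut self, key: K, value: V) {
        let generation = self.next_generation;
        self.next_generation += 1;
        // Re-pushing a key moves it to the back: older queue entries become stale.
        self.map.insert(key.clone(), (value, generation));
        self.queue.push_back((key, generation));

        // Evict until the number of live keys fits, skipping stale queue entries.
        while self.map.len() > self.capacity {
            let Some((oldest, oldest_generation)) = self.queue.pop_front() else { break };
            if self.map.get(&oldest).is_some_and(|(_, g)| *g == oldest_generation) {
                self.map.remove(&oldest);
            }
        }

        // Assumed safeguard: occasionally drop stale entries so the queue stays bounded.
        if self.queue.len() > 2 * self.capacity {
            let map = &self.map;
            self.queue.retain(|(k, g)| map.get(k).is_some_and(|(_, live)| live == g));
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.map.get(key).map(|(value, _)| value)
    }
}
```

The compaction step is an O(n) pass, but it runs at most once per `capacity` pushes, so the amortized cost per push stays constant in this sketch.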

Collaborator

iirc this is used for the block/proofs caching for the subscriptions?

Can we not just use a VecDeque since they should always be sequential by block number?

Collaborator

The caches are used for reads from arbitrarily connected streams. Only the starting block is "arbitrary" though. If we use just a vecdeque, we would have to peek (front()/back()) to determine whether the cache helps with the range of blocks/proofs we need.

Instead of fetch_block() it would maybe be fetch_block_range(), at least until we catch up to the tip (which should be after getting the initial range).

Do you think we should refactor it this way instead? No need for FifoCache then.

Collaborator

What I was thinking is something like:

struct FifoCache<T> {
    inner: Arc<RwLock<VecDeque<(BlockNumber, Arc<T>)>>>,
    capacity: usize,
}

impl<T> FifoCache<T> {
    async fn push(&self, number: BlockNumber, value: Arc<T>) {
        let mut fifo = self.inner.wr_lock().await;

        // Blocks must arrive strictly in order.
        if let Some((youngest, _)) = fifo.back() {
            assert_eq!(youngest.child(), number);
        }

        // Evict the oldest block once the cache is full.
        if fifo.len() == self.capacity {
            fifo.pop_front();
        }

        fifo.push_back((number, value));
    }

    async fn get(&self, number: BlockNumber) -> Option<Arc<T>> {
        let fifo = self.inner.rd_lock().await;

        let (oldest, _) = fifo.front()?;

        // Blocks are contiguous, so the distance from the oldest block is the index.
        // Assumes `offset` is usable as a `usize` index.
        let offset = number.checked_sub(oldest)?;
        fifo.get(offset).map(|(_, value)| Arc::clone(value))
    }
}

For additional safety we could even separate them into a cloneable Reader, and a single Writer on construction.
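
A compilable variant of the idea above, as a sketch only. It assumes a stand-in `BlockNumber` newtype over `u32` with `child()` and a hypothetical `offset_from()` helper, and it uses the blocking `std::sync::RwLock` instead of an async lock, so none of it should be read as the project's real types.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, RwLock};

/// Stand-in for the project's block number type (assumed for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BlockNumber(u32);

impl BlockNumber {
    fn child(self) -> Self {
        BlockNumber(self.0 + 1)
    }

    /// Hypothetical helper: distance `self - older`, or `None` if `self` precedes `older`.
    fn offset_from(self, older: BlockNumber) -> Option<usize> {
        self.0.checked_sub(older.0).map(|d| d as usize)
    }
}

struct FifoBlockCache<T> {
    inner: RwLock<VecDeque<(BlockNumber, Arc<T>)>>,
    capacity: usize,
}

impl<T> FifoBlockCache<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { inner: RwLock::new(VecDeque::with_capacity(capacity)), capacity }
    }

    /// Panics if `number` is not the child of the youngest cached block.
    fn push(&self, number: BlockNumber, value: Arc<T>) {
        let mut fifo = self.inner.write().expect("cache lock poisoned");
        if let Some((youngest, _)) = fifo.back() {
            assert_eq!(youngest.child(), number, "blocks must be pushed in order");
        }
        if fifo.len() == self.capacity {
            fifo.pop_front();
        }
        fifo.push_back((number, value));
    }

    /// O(1) lookup: the index is the distance from the oldest cached block.
    fn get(&self, number: BlockNumber) -> Option<Arc<T>> {
        let fifo = self.inner.read().expect("cache lock poisoned");
        let (oldest, _) = fifo.front()?;
        let offset = number.offset_from(*oldest)?;
        fifo.get(offset).map(|(_, value)| Arc::clone(value))
    }
}
```

Because every block can only be pushed once and in order, the duplicate-key case that caused ghost entries cannot arise in this design.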

Author

Thanks both for the detailed feedback, this is really helpful tbh

To make sure I understand the direction before I push anything:

The plan is to drop the HashMap + VecDeque design entirely and replace FifoCache with a VecDeque<(BlockNumber, Arc<T>)> that relies on block numbers being strictly sequential. push asserts that the incoming block is the child of the current back (or the queue is empty), pops the front when at capacity, and pushes to the back. get(number) computes the offset from the front (number - oldest) and indexes into the deque in O(1), returning None if the number is outside the cached range.

That keeps everything O(1), avoids the ghost-entry class of bugs entirely (each block can only be inserted once, in order), and removes the need for either a linked hash map crate or a tombstone scheme.

A few clarifying points before I start:

  1. Should this new type live in crates/utils/src/fifo_cache.rs and keep the FifoCache name (now strictly block-keyed), or should it be renamed to something more specific like BlockCache / SequentialCache since it's no longer a general-purpose FIFO?
  2. The current FifoCache is generic over K, V. The redesign hardcodes BlockNumber as the key. Is it fine to make it non-generic, or do you want it generic over a key type that exposes a child() and checked_sub() (so proofs and any future sequential cache can share it)?
  3. For the assertion assert_eq!(youngest.child(), number) — do you want a hard assert! (panic on misuse) or a soft debug_assert! plus a returned Result / silent no-op in release? Given the writer-side discipline implied by the design, I'd lean toward assert! so violations surface immediately in tests.
  4. Re: the Reader/Writer split — happy to add it. Should that go in this PR, or land the core type first and split in a follow-up?

@sergerad regarding your earlier comment about refactoring to fetch_block_range() — should I also touch the call sites in this PR, or keep this PR focused on the cache type and do the call-site changes separately? Doing both in one PR is fine with me, just want to keep the review surface manageable.

Happy to push the change as soon as you confirm the above. I'll also drop the original ghost-entry fix from this branch since the redesign makes it moot, and rewrite the CHANGELOG entry accordingly.

Collaborator

  1. I think we only need this in the store, so let's add it there and maybe FifoBlockCache as a name.
  2. I think making it generic will be a pain so let's keep it BlockNumber focused for now.
  3. assert! please, just ensure it's documented.
  4. Yeah let's add the split. So on construction it returns a tuple similar to a channel.
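
The channel-like construction from point 4 might look roughly like this. Everything named here is an assumption made for illustration: `fifo_block_cache`, `Writer`, `Reader`, a `u32` alias standing in for `BlockNumber`, and a blocking `std::sync::RwLock`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, RwLock};

/// Stand-in for the project's block number type (assumed for this sketch).
type BlockNumber = u32;

type Shared<T> = Arc<RwLock<VecDeque<(BlockNumber, Arc<T>)>>>;

/// Single write handle: deliberately not `Clone`.
struct Writer<T> {
    shared: Shared<T>,
    capacity: usize,
}

/// Read handle: cheap to clone and hand out to many consumers.
struct Reader<T> {
    shared: Shared<T>,
}

impl<T> Clone for Reader<T> {
    fn clone(&self) -> Self {
        Reader { shared: Arc::clone(&self.shared) }
    }
}

/// Hypothetical constructor returning a (writer, reader) pair, like a channel.
fn fifo_block_cache<T>(capacity: usize) -> (Writer<T>, Reader<T>) {
    assert!(capacity > 0, "capacity must be non-zero");
    let shared: Shared<T> = Arc::new(RwLock::new(VecDeque::with_capacity(capacity)));
    (Writer { shared: Arc::clone(&shared), capacity }, Reader { shared })
}

impl<T> Writer<T> {
    /// Panics if `number` does not directly follow the youngest cached block.
    fn push(&mut self, number: BlockNumber, value: Arc<T>) {
        let mut fifo = self.shared.write().expect("cache lock poisoned");
        if let Some((youngest, _)) = fifo.back() {
            assert_eq!(*youngest + 1, number, "blocks must be pushed in order");
        }
        if fifo.len() == self.capacity {
            fifo.pop_front();
        }
        fifo.push_back((number, value));
    }
}

impl<T> Reader<T> {
    fn get(&self, number: BlockNumber) -> Option<Arc<T>> {
        let fifo = self.shared.read().expect("cache lock poisoned");
        let (oldest, _) = fifo.front()?;
        let offset = number.checked_sub(*oldest)? as usize;
        fifo.get(offset).map(|(_, value)| Arc::clone(value))
    }
}
```

Taking `&mut self` in `push` and leaving `Writer` without a `Clone` impl lets the compiler enforce the single-writer discipline that the in-order assertion relies on.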

Collaborator

regarding your earlier comment about refactoring to fetch_block_range() — should I also touch the call sites in this PR, or keep this PR focused on the cache type and do the call-site changes separately? Doing both in one PR is fine with me, just want to keep the review surface manageable.

I think we should make this change in this PR. I wouldn't want us to be performing the pop + offset logic for every get when the caller could just request a range of [arbitrary starting block, highest cached block] via something like fetch_block_range(from: BlockNumber) -> Vec<T>.
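
A range read along these lines could be sketched as follows. The free function, the `u32` alias standing in for `BlockNumber`, and the `Vec<Arc<T>>` return type are assumptions made for illustration; the comment above only suggests the name and the general shape.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Stand-in for the project's block number type (assumed for this sketch).
type BlockNumber = u32;

/// Returns every cached value from `from` up to the youngest cached block.
/// An empty vector means `from` lies outside the cached range.
fn fetch_block_range<T>(
    fifo: &VecDeque<(BlockNumber, Arc<T>)>,
    from: BlockNumber,
) -> Vec<Arc<T>> {
    let Some((oldest, _)) = fifo.front() else {
        return Vec::new();
    };
    // `from` is older than anything still cached.
    let Some(offset) = from.checked_sub(*oldest) else {
        return Vec::new();
    };
    let offset = offset as usize;
    // `from` is newer than the youngest cached block.
    if offset >= fifo.len() {
        return Vec::new();
    }
    // One offset computation, then a contiguous slice of the deque.
    fifo.range(offset..).map(|(_, value)| Arc::clone(value)).collect()
}
```

The offset is computed once per call, so a caller catching up to the tip pays for one lookup instead of one per block.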

@amathxbt
Author

Thanks for digging into this. @sergerad @Mirko-von-Leipzig

The bug here is not that the key exists twice in the map; map.insert() correctly overwrites the old value. The bug is that push() also unconditionally appends the key to the eviction queue, so the queue can contain duplicate entries for a single live map entry.

Minimal repro with capacity = 2:

  1. push(1, "a")
    queue = [1]
    map = {1}

  2. push(1, "b")
    queue = [1, 1]
    map = {1}

  3. push(2, "c")
    because queue.len() == capacity, the cache evicts the front entry (1) before inserting 2
    result:
    queue = [1, 2]
    map = {2}

At this point, key 1 was dropped even though the cache only ever held 2 unique keys (1 and 2). So the duplicate queue entry consumed capacity and caused a premature eviction.

That is why the fix updates existing keys in-place and does not push the key into the eviction queue again. Capacity should track unique live entries, not the number of times push() was called for the same key.

I think part of the confusion may be the current test overwrite_key_evicts_on_next_push, because it encodes the buggy behavior as the expected behavior. The regression test should instead verify that:

  • after push(1, "a"); push(1, "b"); push(2, "c");
    both 1 and 2 are still present, and
  • only a later push(3, "d") should evict 1.

So the intended invariant is:
eviction.len() must never exceed the number of live keys in map.
