Fix OOM, sync stalls, attestations, and checkpoint safety#84

Merged
ArtiomTr merged 2 commits into grandinetech:devnet-3 from bomanaps:bug/listener
Mar 20, 2026

Conversation

@bomanaps

No description provided.

Comment on lines +775 to +788

    let finalized_slot = self.status_provider.read().finalized.slot.0;
    provider.retain(|_, b| b.message.block.slot.0 > finalized_slot);
    // Hard cap: evict lowest-slot blocks if still over limit.
    if provider.len() > MAX_BLOCK_CACHE_SIZE {
        let to_remove = provider.len() - MAX_BLOCK_CACHE_SIZE;
        let mut slots: Vec<(H256, u64)> = provider
            .iter()
            .map(|(root, b)| (*root, b.message.block.slot.0))
            .collect();
        slots.sort_by_key(|(_, slot)| *slot);
        for (root, _) in slots.into_iter().take(to_remove) {
            provider.remove(&root);
        }
    }
Collaborator

I think you shouldn't really prune blocks. Blocks are mostly lightweight, and they make it easy to reconstruct historical states. What may be pruned is the old states - those aren't needed and can be trivially reconstructed by taking a state and applying blocks on top of it.

Author

The reason we added pruning here is that signed_block_provider is lean_client's only block store - there is no RocksDB backing it. SignedBlockWithAttestation carries XMSS signatures (at least 3,112 bytes per block), so the cache is not lightweight in this protocol. Across 5 runs in our test log, the node accumulated 142,735 blocks in memory and was OOM-killed by the kernel each time, with no log entry.

Author

A longer-term fix would be to persist blocks to storage so the in-memory cache is just a working set and can be evicted freely. We can track that as a follow-up. In the meantime, should we drop the finalization pruning and keep only the hard cap, or would you prefer we revert entirely until persistence is in place?

Author

Or we could add RocksDB-backed persistence.

Collaborator

I agree that many blocks may easily inflate memory usage, but I really doubt that this is the primary source of the OOM issue. If you estimate it, 142,735 blocks at 3,112 bytes each weigh approximately 423 MiB, or roughly 0.5 GiB. This is a lot, but still manageable.

On the other hand, one ssz-encoded state weighs ~900 KiB (you can try downloading http://leanpoint.leanroadmap.org/lean/v0/states/finalized) - so 142,735 of them would weigh ~122 GiB. And if you look closely at the forkchoice store processing code, you'll see that states are cloned on each new block processed:

store.states.insert(block_root, new_state.clone());

But they are never pruned. Of course, the real memory usage will be lower than 122 GiB because we're using structural sharing, but I still think this is the main weak point causing memory issues.
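As a quick sanity check on these estimates (the constants are the numbers quoted in this thread, nothing else is assumed):

```rust
// Back-of-envelope check of the memory estimates quoted in this thread.
fn main() {
    let blocks: u64 = 142_735;
    let block_bytes: u64 = 3_112; // minimum SignedBlockWithAttestation size
    let state_bytes: u64 = 900 * 1024; // ~900 KiB per ssz-encoded state

    let blocks_mib = blocks * block_bytes / (1024 * 1024);
    let states_gib = blocks * state_bytes / (1024 * 1024 * 1024);

    println!("blocks: ~{blocks_mib} MiB"); // ~423 MiB
    println!("states: ~{states_gib} GiB"); // ~122 GiB
}
```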

I also noticed one more thing - we're currently saving both blocks and signed blocks:

store.blocks.insert(block_root, block.clone());
store.states.insert(block_root, new_state.clone());
// Also store signed block for serving BlocksByRoot requests (checkpoint sync backfill)
store.signed_blocks.insert(block_root, signed_block.clone());

There is no need to do that: a signed block contains strictly more information than a block, so block memory usage is doubled.
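A minimal sketch of the deduplication, with hypothetical, simplified stand-ins for Block and SignedBlockWithAttestation (the real types carry far more fields, and roots are H256, not u64): keep only the signed blocks and project the unsigned view out on demand.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the lean client's types.
#[derive(Clone, Debug, PartialEq)]
struct Block {
    slot: u64,
}

#[derive(Clone)]
struct SignedBlock {
    block: Block,
    signature: Vec<u8>, // XMSS signature, >= 3,112 bytes in practice
}

struct Store {
    // Only the signed blocks are stored; the separate `blocks` map is gone.
    signed_blocks: HashMap<u64, SignedBlock>,
}

impl Store {
    // Serve the unsigned view by borrowing from the signed block,
    // instead of keeping a second cloned copy in memory.
    fn block(&self, root: &u64) -> Option<&Block> {
        self.signed_blocks.get(root).map(|sb| &sb.block)
    }
}

fn main() {
    let mut store = Store { signed_blocks: HashMap::new() };
    store.signed_blocks.insert(
        1,
        SignedBlock { block: Block { slot: 5 }, signature: vec![0u8; 3112] },
    );
    assert_eq!(store.block(&1), Some(&Block { slot: 5 }));
    println!("ok");
}
```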

Regarding RocksDB - in the beacon client we're actually using libmdbx, but there's probably no need to add a persistence layer to the lean client right now; I think it's better to keep it simpler.

For now, you can keep the block pruning process, but I wouldn't recommend making it too aggressive. It's better to focus on the two primary issues:

  • Blocks are duplicated in memory cache.
  • States are never pruned.
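For the second issue, pruning states on finalization could look roughly like this. This is a sketch with hypothetical, simplified types: the real store keys states by H256 block root, the state type is far larger, and the finalized slot comes from the forkchoice store.

```rust
use std::collections::HashMap;

// Hypothetical, simplified state; the real lean-client state is much larger.
#[derive(Clone)]
struct State {
    slot: u64,
}

// Drop states strictly older than the finalized slot. The finalized state
// itself is kept, so any newer state can still be reconstructed by
// replaying blocks on top of it.
fn prune_states(states: &mut HashMap<u64, State>, finalized_slot: u64) {
    states.retain(|_, s| s.slot >= finalized_slot);
}

fn main() {
    let mut states: HashMap<u64, State> =
        (0..10).map(|i| (i, State { slot: i })).collect();
    prune_states(&mut states, 7);
    assert_eq!(states.len(), 3); // slots 7, 8, 9 remain
    println!("{}", states.len());
}
```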

@bomanaps bomanaps requested a review from ArtiomTr March 20, 2026 09:07
@ArtiomTr ArtiomTr merged commit 11ebcf8 into grandinetech:devnet-3 Mar 20, 2026
0 of 3 checks passed