Fix OOM, sync stalls, attestations, and checkpoint safety #84
ArtiomTr merged 2 commits into grandinetech:devnet-3 from
Conversation
```rust
let finalized_slot = self.status_provider.read().finalized.slot.0;
provider.retain(|_, b| b.message.block.slot.0 > finalized_slot);
// Hard cap: evict lowest-slot blocks if still over limit.
if provider.len() > MAX_BLOCK_CACHE_SIZE {
    let to_remove = provider.len() - MAX_BLOCK_CACHE_SIZE;
    let mut slots: Vec<(H256, u64)> = provider
        .iter()
        .map(|(root, b)| (*root, b.message.block.slot.0))
        .collect();
    slots.sort_by_key(|(_, slot)| *slot);
    for (root, _) in slots.into_iter().take(to_remove) {
        provider.remove(&root);
    }
}
```
I think you shouldn't really prune blocks. The blocks are mostly lightweight, and they make it easy to reconstruct historical states. What may be pruned is the old states: those aren't needed, and can be trivially reconstructed by taking an earlier state and applying blocks on top of it.
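The reconstruction idea above can be sketched roughly as follows. This is a hypothetical illustration, not the lean_client implementation: `State`, `Block`, `state_transition`, and `reconstruct_state` are toy stand-ins, and the "transition" here only advances the slot.

```rust
// Toy stand-ins for the real lean_client types (assumptions, not real APIs).
#[derive(Clone)]
struct State {
    slot: u64,
    // ... balances, justification data, etc.
}

struct Block {
    slot: u64,
}

// Apply one block's state transition (toy version: just advance the slot).
fn state_transition(mut state: State, block: &Block) -> State {
    state.slot = block.slot;
    state
}

// Rebuild the state at `target_slot` from the nearest retained ancestor
// state plus the (slot-ordered) blocks after it.
fn reconstruct_state(base: &State, blocks: &[Block], target_slot: u64) -> State {
    blocks
        .iter()
        .filter(|b| b.slot > base.slot && b.slot <= target_slot)
        .fold(base.clone(), |s, b| state_transition(s, b))
}
```

The point is that keeping one state plus the block chain is enough: any intermediate state is a fold of the transition function over the blocks, so pruned states cost only replay time, not data loss.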
The reason we added pruning here is that signed_block_provider is lean_client's only block store; there is no RocksDB backing it. SignedBlockWithAttestation carries XMSS signatures (3,112 bytes per block minimum), so the cache is not lightweight in this protocol. Across 5 runs in our test log, the node accumulated 142,735 blocks in memory and was OOM-killed by the kernel each time, with no log entry.
A better long-term fix is to persist blocks to storage, so the in-memory cache is just a working set that can be evicted freely. We can track that as a follow-up. In the meantime, should we drop the finalization pruning and keep only the hard cap, or would you prefer we revert entirely until persistence is in place?
I agree that many blocks may easily inflate memory usage, but I really doubt that this is the primary OOM source. If you estimate it: 142,735 blocks at 3,112 bytes each weigh approximately 423 MiB, or roughly 0.5 GiB. This is a lot, but still manageable.
On the other hand, one SSZ-encoded state weighs ~900 KiB (you can try downloading http://leanpoint.leanroadmap.org/lean/v0/states/finalized), so 142,735 of them would weigh ~122 GiB. And if you look closely into the fork-choice store processing code, you'll see that states are cloned for each new block processed:
lean/lean_client/fork_choice/src/handlers.rs
Line 468 in 4e08fe2
But they are never pruned. Of course, the real memory usage will be lower than 122 GiB because we use structural sharing, but I nevertheless think this is the main weak point causing the memory issues.
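A state-pruning pass at finalization could mirror the block pruning shown in the diff. This is a sketch under assumptions: `Root`, `State`, and `prune_states` are hypothetical stand-ins for whatever the fork-choice store actually uses, and the policy (keep the finalized state as the replay base, drop everything at or below the finalized slot) is one reasonable choice, not the project's decided design.

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real fork-choice store types (assumptions).
type Root = [u8; 32];

struct State {
    slot: u64,
}

// Drop states for slots at or below the finalized slot, except the
// finalized state itself, which is kept as the base for replaying blocks.
fn prune_states(states: &mut HashMap<Root, State>, finalized_root: Root, finalized_slot: u64) {
    states.retain(|root, s| *root == finalized_root || s.slot > finalized_slot);
}
```

Because any pruned state can be rebuilt from the finalized state plus retained blocks, this keeps the map bounded by the number of non-finalized forks rather than the full chain length.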
I also noticed one more thing: we currently save both blocks and signed blocks:
lean/lean_client/fork_choice/src/handlers.rs
Lines 467 to 470 in 4e08fe2
No need to do that, as a signed block contains strictly more information than the block alone, so block memory usage is doubled.
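The deduplication amounts to keeping only the signed-block map and exposing the unsigned block as a borrowed view into it. A minimal sketch, with `Root`, `Block`, `SignedBlock`, and `Store` as hypothetical stand-ins rather than the actual handlers.rs types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types (assumptions, not the real API).
type Root = [u8; 32];

struct Block {
    slot: u64,
}

struct SignedBlock {
    message: Block,
    signature: Vec<u8>, // XMSS signature bytes
}

struct Store {
    // Single source of truth: only signed blocks are stored.
    signed_blocks: HashMap<Root, SignedBlock>,
}

impl Store {
    // The unsigned block is a view into the signed one; no duplicate copy.
    fn block(&self, root: &Root) -> Option<&Block> {
        self.signed_blocks.get(root).map(|sb| &sb.message)
    }
}
```

Callers that only need the block borrow it through the accessor, so the ~3 KiB signature payload is stored once per block instead of the block data being held in two maps.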
Regarding RocksDB: in the beacon client we actually use libmdbx, but there is probably no need to add a persistence layer to the lean client right now; I think it's better to keep it simple.
For now, you can keep the block pruning process, but I wouldn't recommend making it too aggressive. It's better to focus on the two primary issues:
- Blocks are duplicated in memory cache.
- States are never pruned.