
Bloatnet - lessons learned #18576

@wmitsuda


This issue is to capture lessons learned during the creation of the bloatnet snapshotter.

I'm going to describe everything I experienced as part of it, along with some ideas that occurred to me. We can use this issue to discuss and propose concrete actionable items as sub-issues.

Disclaimer: the focus so far was to unblock construction of the bloatnet snapshotter, so I investigated most things only enough to find fast workarounds and to save references/pprofs for future analysis; some of my conclusions below may therefore be missing the full context.

Non-cacheable p2p data during initial sync

Scenario: we are syncing an existing chain, no snapshotter exists, we are doing it for the first time.

What happens: during the first cycle, it downloads headers/bodies backwards from the tip, then it starts executing them.

What happened on bloatnet: we were 4 months behind the tip, and there were only 4 bootnodes, all of them slow. The backward download process by itself was very slow and its results were not persisted, which means that if you need to restart Erigon, even after the download finished and execution started, you lose all progress, including everything downloaded via devp2p over many hours.

Why that is important: I quite often had to restart Erigon to adjust some setting because bloatnet was unexplored territory, and every time I lost all progress. This should be a very common case if you are bootstrapping a shadowfork and are very far behind.

Workaround: do NOT use an external CL; I manually applied FCU using the internal_triggerFCU RPC, which I created specifically to make my life easier.

Possible fixes: (1) cache headers/blocks and do not throw them away on stop; (2) align downloader batches with --sync.block.loop.limit batches.
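Below is a minimal sketch of fix (1), using only the standard library: persist bodies to a simple length-prefixed on-disk log as they arrive, so a restart replays the cache instead of re-downloading. All names here are hypothetical; this is not Erigon's actual downloader API.

```go
// Hypothetical sketch of persisting downloaded blocks across restarts.
package blockcache

import (
	"encoding/binary"
	"errors"
	"io"
	"os"
)

// appendBlock appends one RLP-encoded block to a length-prefixed on-disk
// log; the fsync keeps it durable across restarts.
func appendBlock(f *os.File, rlp []byte) error {
	var lenBuf [8]byte
	binary.BigEndian.PutUint64(lenBuf[:], uint64(len(rlp)))
	if _, err := f.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := f.Write(rlp); err != nil {
		return err
	}
	return f.Sync()
}

// replayBlocks streams previously downloaded blocks back on startup, so the
// sync can pick up where the p2p download stopped.
func replayBlocks(f *os.File, handle func(rlp []byte) error) error {
	var lenBuf [8]byte
	for {
		if _, err := io.ReadFull(f, lenBuf[:]); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // end of cache reached
			}
			return err
		}
		rlp := make([]byte, binary.BigEndian.Uint64(lenBuf[:]))
		if _, err := io.ReadFull(f, rlp); err != nil {
			return err
		}
		if err := handle(rlp); err != nil {
			return err
		}
	}
}
```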

More "prune-friendly" data structure for domains

I saw prunes taking hours, like:

```
[INFO] [12-18|04:11:02.285] [snapshots] PruneSmallBatches finished   took=10h48m46.601212982s stat="commitment| kv: 80.55M from steps 1854-1854; storage| kv: 165.09M from steps 1854-1854"
```

The logs are not totally clear; I think pruning is missing a ticker (I didn't investigate the code deeper), but during execution you can notice very long pauses between execution batches, which I understand to be either pruning or commit (or both).

Current domain pruning requires a full table scan because the step is stored inside the values. A specialized data model separating the latest state from "sorted by step" keys/values for archival might reduce the required I/O and disk fragmentation, at the expense of more I/O for writing the latest state.
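To make the idea concrete, here is a hedged sketch of a key layout such a split could use, assuming a generic ordered KV store (the names are hypothetical, not Erigon's actual schema): a "latest" table keyed by the plain domain key for reads, plus an archival table whose key is prefixed by the step, so pruning a step becomes a contiguous prefix delete instead of a scan over every value.

```go
// Hypothetical key layout for the proposed latest/archival split.
package domainlayout

import "encoding/binary"

// archKey builds the archival key: an 8-byte big-endian step prefix followed
// by the domain key. Big-endian keeps steps sorted, so all entries of one
// step are contiguous in an ordered store.
func archKey(step uint64, key []byte) []byte {
	out := make([]byte, 8+len(key))
	binary.BigEndian.PutUint64(out[:8], step)
	copy(out[8:], key)
	return out
}

// stepPrefix covers every archival key of a given step; pruning step S
// becomes a range delete over this prefix.
func stepPrefix(step uint64) []byte {
	var p [8]byte
	binary.BigEndian.PutUint64(p[:], step)
	return p[:]
}
```

The trade-off is the one mentioned above: every update now writes to two tables (latest plus archival), so the hot write path pays extra I/O in exchange for cheap pruning.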

The step-size reduction idea sounded like a general workaround for this, but apparently it didn't help in my earlier, separate experiment: on my local machine the DB kept increasing.

Reclaimable space not shrinking below a certain plateau

The DB inflation may make sense, but the fact that it stops shrinking after a certain point sounds strange; or maybe there is a reason I haven't discovered yet.

Issue: #18421

Memory-hungry commitment saving (not computation!)

Quite often during sync I've noticed OOMs. I've tracked them down; see the pprof below:

[Image: partial screenshot of a memory pprof]

I'm going to attach only one partial screenshot, but I've archived several memory pprofs from different moments.

The culprit seems to be the bytes.Clone inside BranchEncoder, but that's just where the allocation happens; the references are actually passed down and held in TemporalMemBatch.

Basically, bloatnet generates a huge amount of commitment changes; those changes are batched in memory, uncompressed, and that OOMs.
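A simplified illustration of the pattern described above (not Erigon's actual BranchEncoder/TemporalMemBatch code): the encoder's buffer is reused, so the batch has to clone every encoded branch update and keep the clone alive, uncompressed, until the batch commits, which makes memory grow with the amount of state touched.

```go
// Simplified illustration only; not the real Erigon code.
package membatch

import "bytes"

type memBatch struct {
	// Every encoded branch update is held, uncompressed, until the batch
	// commits, so memory grows with the amount of state touched.
	pending map[string][]byte
}

func newMemBatch() *memBatch {
	return &memBatch{pending: make(map[string][]byte)}
}

func (b *memBatch) putBranch(prefix, encoded []byte) {
	// This clone is what shows up in the pprof: the encoder reuses its
	// buffer, so the batch must take its own copy and hold the reference.
	b.pending[string(prefix)] = bytes.Clone(encoded)
}
```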

Workaround: by default --sync.block.loop.limit is 5000; for bloatnet I had to decrease it to 1024 (arrived at by experimenting with different values), effectively telling Erigon to "batch less work before committing to the DB" at the expense of computing commitment more often (which is not good, because commitment is pretty slow, so bigger batch == better). But for a 64GB machine, that was necessary.

What does it mean: it means that Erigon, when not at the chain tip, is vulnerable to OOMs if the amount of work defined by --sync.block.loop.limit surpasses the machine's memory. And that is not intuitive, because the in-memory representation depends on lots of internals.

And by "not at the chain tip" I mean not only syncing from scratch, but any case in which your node loses the chain tip, be it because you stopped it for a while, because the machine became slow after swapping for a long time, etc.

Side note: I remember experiencing OOMs on regular mainnet in the past, away from the chain tip, after resuming stopped nodes on machines with 32GB. I suspect we've found a major bottleneck here, because at the chain tip Erigon should have no problem with 32GB or even less, I guess. Solving this one should allow us to lower our minimum memory specs.

I think that case can be unfolded into more specific topics:

--sync.block.loop.limit as the unit of work

That's not a very stable unit in which to denominate work, because block size differs among chains; even on mainnet, a hard fork can bring more data into a block, and that translates into more work -> more memory required.

I think one future direction could be for us to denominate those batch sizes in some unit that better represents the amount of work those batches generate, for example gas used. That way Erigon could "auto-tune" itself to the machine's specs. Asking users to guess --sync.block.loop.limit is not ideal.
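As a sketch of what that could look like (the budget and names are illustrative, not an existing Erigon API), the batch cut-off would be driven by cumulative gas instead of a fixed block count:

```go
// Illustrative sketch: denominate execution batches in gas, not block count.
package batching

type block struct {
	Number  uint64
	GasUsed uint64
}

// cutBatch returns how many queued blocks fit into one execution batch,
// cutting on cumulative gas. A real implementation would derive gasBudget
// from available memory and a measured bytes-per-gas ratio.
func cutBatch(queue []block, gasBudget uint64) int {
	var used uint64
	for i, b := range queue {
		if used+b.GasUsed > gasBudget && i > 0 {
			return i // commit what we have; the rest goes to the next cycle
		}
		used += b.GasUsed
	}
	return len(queue)
}
```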

Find a solution for holding uncompressed references in TemporalMemBatch

A specialized in-memory compressed format? I don't know, but it feels awkward that we need to hold everything in memory when 99% of the time the tx will commit successfully. It feels like we can do more to improve this situation.
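One possible direction, sketched below with only the standard library and under the assumption that batched values compress well: deflate each value before buffering it and inflate it only at commit time, trading some CPU for a smaller in-memory footprint. This is a hypothetical sketch, not a proposal for the exact format.

```go
// Hypothetical sketch: buffer batched values compressed instead of raw.
package membuf

import (
	"bytes"
	"compress/flate"
	"io"
)

// compress deflates a value before it is buffered in the batch.
func compress(v []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(v); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress inflates a value at commit time, when it is finally written
// to the DB.
func decompress(c []byte) ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(c))
	defer r.Close()
	return io.ReadAll(r)
}
```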
