feat: Implement persistent storage #108
ArtiomTr
left a comment
looks like this is not finished? there is nothing implemented yet?
This will take some time, as it won't be a one-off PR
Ok but there is absolutely zero functionality yet? We can do this in iterations, i.e. first we can persist only blocks, but there is no point in merging dead code
bls = { git = "https://github.com/grandinetech/grandine", package = "bls", features = ["blst"], rev = "64afdee3c6be79fceffb66933dcb69a943f3f1ae" }
bytesize = { version = '2', features = ['serde'] }
clap = { version = "4", features = ["derive"] }
database = { git = "https://github.com/grandinetech/grandine", package = "database", rev = "64afdee3c6be79fceffb66933dcb69a943f3f1ae" }
the database crate from grandine is a bit flawed, so I would suggest implementing your own. Specifically, the current database impl forces snappy compression for all values. This is inefficient in some cases (e.g., if we save slot -> state_root indexes, there is no point in compressing state_root, as it is pure entropy) and suboptimal in others (e.g., for blobs you probably want something with a higher compression ratio, like zstd). Thus, forcing a single compression algorithm for all values wasn't a good idea. Also, for cases where performance matters, it looks like there are now compression algorithms that are both faster and give better compression ratios than snappy, like lz4.
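One way to read the suggestion above is to make compression a per-table setting instead of a global one. The sketch below is only an illustration of that idea: the `Table` and `Compression` names are hypothetical, no real compression library is called, and the choice of lz4 for blocks is an assumption (the comment only names "no compression" for state roots and zstd for blobs).

```rust
/// Hypothetical compression choices a table could declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    /// Store bytes as-is; suits high-entropy values such as hashes.
    None,
    /// Fast codec for performance-sensitive tables.
    Lz4,
    /// Higher-ratio codec for large, compressible values.
    Zstd,
}

/// Hypothetical tables of the client database.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Table {
    SlotToStateRoot,
    Blocks,
    Blobs,
}

impl Table {
    /// Each table picks its own codec instead of sharing one global setting.
    fn compression(self) -> Compression {
        match self {
            // State roots are effectively random bytes, so compression gains nothing.
            Table::SlotToStateRoot => Compression::None,
            // Assumption: blocks sit on a hot path, so a fast codec is preferred.
            Table::Blocks => Compression::Lz4,
            // Blobs are large, so a better ratio is worth the extra CPU.
            Table::Blobs => Compression::Zstd,
        }
    }
}
```

With a layout like this, the write path would look up `table.compression()` before encoding a value, so changing the codec for one table does not affect the others.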
okay, I will be implementing my own database crate on top of this next
yeah this is still a WIP, I was asking for review on the initial architecture (KV pairs and grandine database), if this is the correct direction
One more thing: can we remove the OOMing framing, as that issue has been resolved?
@bomanaps but since all state data is being managed through memory right now, don't you think it is bound to OOM sometime when the node is running for a long time?
On a separate note, after implementing the persistent db I'm planning to research an LRU cache implementation inside the crate itself, managed automatically, so that the db functions can be used without worrying about caching. What do you think about this?
The best person to answer this is @ArtiomTr. Also, on the side, have you tried running a node, maybe a 3-node setup or more depending on your laptop's capacity? This should give you a better feel for how lean Ethereum runs.
It is hard to tell what is going on, without seeing actual implementation :). Better to implement something first, see if it works & is performant enough, then proceed with review. To avoid wasting much time, start with smaller scope - like just saving the blocks first. The database must keep blocks, because if you have blocks, you can reconstruct any historical state, at the cost of cpu time. This way you can scaffold database structure, get early feedback on that, and then proceed on implementing everything else.
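As a rough illustration of the "blocks first" scope described above, here is a minimal sketch. All names (`BlockStore`, `MemoryBlockStore`, `reconstruct_state`) are hypothetical, the in-memory `BTreeMap` merely stands in for the real database, the state transition is a toy, and keying by slot ignores forks.

```rust
use std::collections::BTreeMap;

type Slot = u64;

/// Toy block: a real block carries roots, signatures, and so on.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Block {
    slot: Slot,
    payload: Vec<u8>,
}

/// Toy state, standing in for the real consensus state.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct State {
    slot: Slot,
    processed_bytes: u64,
}

impl State {
    /// Placeholder state transition: consumes the old state, returns the new one.
    fn apply(mut self, block: &Block) -> State {
        self.slot = block.slot;
        self.processed_bytes += block.payload.len() as u64;
        self
    }
}

/// Smallest useful storage interface: persist and read back blocks.
trait BlockStore {
    fn put_block(&mut self, block: Block);
    fn get_block(&self, slot: Slot) -> Option<&Block>;
    /// All stored blocks with slot <= `slot`, in ascending slot order.
    fn blocks_up_to(&self, slot: Slot) -> Vec<&Block>;
}

/// In-memory stand-in for a database-backed implementation.
#[derive(Default)]
struct MemoryBlockStore {
    blocks: BTreeMap<Slot, Block>,
}

impl BlockStore for MemoryBlockStore {
    fn put_block(&mut self, block: Block) {
        self.blocks.insert(block.slot, block);
    }

    fn get_block(&self, slot: Slot) -> Option<&Block> {
        self.blocks.get(&slot)
    }

    fn blocks_up_to(&self, slot: Slot) -> Vec<&Block> {
        self.blocks.range(..=slot).map(|(_, block)| block).collect()
    }
}

/// Rebuilds a historical state by replaying blocks, trading CPU time for storage.
fn reconstruct_state(store: &impl BlockStore, genesis: State, slot: Slot) -> State {
    store
        .blocks_up_to(slot)
        .into_iter()
        .fold(genesis, |state, block| state.apply(block))
}
```

Because callers only see the `BlockStore` trait, the map could later be swapped for a persistent backend without touching the replay logic.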
This is true only for some cases, e.g. during long non-finality periods - roughly speaking, a validator has to track every "branch" to be able to properly converge onto whatever branch eventually wins. However, even in those cases, I think there are clever algorithms to avoid keeping all unfinalized history in memory. During normal operation, memory consumption usually won't grow indefinitely - the node has to keep only the last finalized state, maybe some older ones, but no more. The node also has to keep some historical blocks (I believe all blocks up to the weak subjectivity period, although I don't remember exactly and may be wrong on this), but blocks usually take only a fraction of the space compared to states, so this should be a non-issue. The Grandine beacon chain existed for quite some time without a database at all, and worked really well.
It is kind of a complicated topic. Ideally, the node should operate without depending on the database at all. So in this sense, caching just wastes cpu/memory. However, sometimes you actually do want to have caches, for instance there may be cases when you need to load some state that is a bit older than the finalized point on a hot path, so loading it quickly may be desirable. However, caches won't magically make loading quicker -- instead, you pay a small performance cost once, in exchange for being able to query the same thing instantly next time. If you take the straightforward approach and cache every database query, then such caches are pointless -- it is very rare that the same object is queried from the database twice. But if you make them smart, by somehow caching intermediate values that may be needed for querying both objects A and B, then such caches will be very useful. This is the approach I take when implementing the new database layout for the grandine beacon chain (https://github.com/ArtiomTr/grandine/blob/4ec3964cf42b04b8d1ac93791a6a14ff788b2d18/fork_choice_control/src/storage.rs#L907). This requires careful benchmarking & profiling first, though, so it is probably better to think about caches after you have a working database. Also, let me give you some advice on using libmdbx:
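To make the "cache intermediate values" idea from earlier in this comment concrete, here is a toy sketch. It is an assumption-laden illustration, not the approach used in the linked grandine code: `CheckpointCache`, `load_checkpoint_from_db`, `apply_block`, and the interval of 10 slots are all hypothetical, states are plain integers, and eviction is left out. The point is that two different queries (slots 13 and 17) never repeat, yet both benefit from the cached checkpoint they share.

```rust
use std::collections::HashMap;

/// Hypothetical spacing between checkpoint states stored in the database.
const CHECKPOINT_INTERVAL: u64 = 10;

/// Stand-in for an expensive database read plus decoding of a checkpoint state.
fn load_checkpoint_from_db(slot: u64) -> u64 {
    slot * 1000
}

/// Stand-in for applying the block at `slot` to a state.
fn apply_block(state: u64, slot: u64) -> u64 {
    state + slot
}

/// Caches checkpoint states (the shared intermediate value), not query results.
#[derive(Default)]
struct CheckpointCache {
    checkpoints: HashMap<u64, u64>,
    /// Counts real database loads, to make the saving visible.
    db_loads: u32,
}

impl CheckpointCache {
    /// Returns the checkpoint state at `base`, loading it at most once.
    fn checkpoint_state(&mut self, base: u64) -> u64 {
        if let Some(state) = self.checkpoints.get(&base) {
            return *state;
        }
        self.db_loads += 1;
        let state = load_checkpoint_from_db(base);
        self.checkpoints.insert(base, state);
        state
    }

    /// Rebuilds the state at `slot` from the nearest checkpoint at or below it.
    fn state_at(&mut self, slot: u64) -> u64 {
        let base = slot - slot % CHECKPOINT_INTERVAL;
        let checkpoint = self.checkpoint_state(base);
        (base + 1..=slot).fold(checkpoint, apply_block)
    }
}
```

Calling `state_at(13)` and then `state_at(17)` on the same cache performs a single checkpoint load, whereas a cache keyed by the final query would miss both times.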
also, don't forget to change target branch to |
libmdbx in lean client for persistent DB storage
a2bca8a to 2a25e9e
rebased and changed target branch to devnet-5. continuing work now, setting up lean's own database.
This PR adds a persistent libmdbx-backed database for client data.