Complete const ledger state refactor#5141
Conversation
There was a problem hiding this comment.
Pull request overview
This PR completes the ledger state refactoring by removing BucketSnapshotManager and centralizing state management in LedgerManager. The refactor introduces LedgerStateSnapshot as a unified, thread-safe snapshot type with copy semantics, preventing accidental sharing across threads.
Changes:
- Removed
BucketSnapshotManagerand moved snapshot management toLedgerManager - Introduced
LedgerStateSnapshot(copyable value type) andApplyLedgerStateSnapshot(strong typedef for apply-time snapshots) - Implemented copy semantics with fresh file stream caches per copy for thread safety
- Fixed snapshot invariant race condition by launching it from apply thread instead of main thread
Reviewed changes
Copilot reviewed 62 out of 62 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/ledger/LedgerStateSnapshot.h/cpp |
New unified snapshot types with copy semantics |
src/ledger/LedgerManagerImpl.h/cpp |
Centralized snapshot management with thread-safe access |
src/bucket/BucketSnapshotManager.h/cpp |
Removed entirely |
src/bucket/BucketListSnapshot.h/cpp |
Added copy constructors for thread-safe copying |
src/transactions/ParallelApplyUtils.h/cpp |
Updated to use ApplyLedgerStateSnapshot |
src/invariant/* |
Updated invariant interfaces to use new snapshot types |
src/main/QueryServer.h/cpp |
Updated to use LedgerManager instead of BucketSnapshotManager |
src/overlay/OverlayManagerImpl.h/cpp |
Updated to use LedgerStateSnapshot value type |
| Test files | Updated to use new snapshot APIs |
cf31391 to
e38ed3f
Compare
e38ed3f to
99bf7d0
Compare
| LedgerManagerImpl::hasLastClosedSorobanNetworkConfig() const | ||
| { | ||
| releaseAssert(threadIsMain()); | ||
| releaseAssert(!mApp.threadIsType(Application::ThreadType::APPLY)); |
There was a problem hiding this comment.
The inconsistency between this and getLastClosedSorobanNetworkConfig is really suspicious; I would expect has and get have exactly the same access invariants.
There was a problem hiding this comment.
It doesn't seem like this has been addressed, could this be just the main thread assertion?
There was a problem hiding this comment.
Yeah this has to do with the convo below wrt the getters. The reason why hasLastClosedSorobanNetworkConfig doesn't have the main thread assert is because we return a bool. getLastClosedSorobanNetworkConfig returns a reference, so we still have the main thread assert to prevent the case where we have a dangling reference. The "correct" solution is just to use the snapshot itself. I could switch this to main thread only, but there's no safety reason to do so, unlike with getLastClosedSorobanNetworkConfig.
| LedgerManagerImpl::maybeUpdateLedgerStateSnapshot( | ||
| LedgerStateSnapshot& snapshot) const | ||
| { | ||
| SharedLockShared guard(mLedgerStateSnapshotMutex); |
There was a problem hiding this comment.
I wonder if we can avoid locking in scenarios where it's ok to be slightly stale (like query server and maybe more). It doesn't seem dangerous to me if implemented right. Not for this PR, but maybe something to think about in the future.
| // close, providing access to both the live bucket list and the hot archive | ||
| // bucket list. Note that this does not reflect changes from the classic | ||
| // apply phase, but is a snapshot of the start of the ledger. | ||
| ApplyLedgerStateSnapshot mSnapshot; |
There was a problem hiding this comment.
I think we should call this mLCLSnapshot to signify the fact that it's not up to date (the fact that it also has type ApplyLedgerStateSnapshot brings even more confusion - does it have to be that and not a regular snapshot?). This issue has existed in the previous name as well, but since we're doing the rename anyways, I think it's worth fixing.
There was a problem hiding this comment.
Yes, this does have to be ApplyLedgerStateSnapshot, and may not actually be the same as the "LCL" snapshot managed by the main thread. I think the issue is the definition of "up to date". GlobalParallelApplyLedgerState should never reference the main thread maintained LCL, since the apply thread LCL may be arbitrarily ahead of the main thread LCL (such as when replaying buffered ledgers). By "up to date" snapshot, I mean that if we are applying ledger N + 1, the snapshot is based on the ledger N state. The "out of date" snapshot in this case would be N - 1 (which the main thread may possibly still have).
Prior to this refactor, we had a bit of a confusing interface, where the canonical BucketSnapshot was updated via the apply thread, so SnapshotManager always had the apply thread LCL. Now that snapshots are managed solely by LedgerManager, we need to more clearly define the main thread lcl state from the apply thread lcl state, which is what the static types help do.
I don't have a problem renaming this to mLCLSnapshot. It's a little challenging defining "up to date" since we have snapshots at multiple ledgers, plus intra-ledger hot state as we apply TXs. In my mind, mSnapshot is "up to date" since it's the most recent snapshot and since snapshots inherently don't contain intra-ledger state, but I see how this could be confusing.
There was a problem hiding this comment.
Oh, interesting. I must admit that I have been confused as well, as I thought that ApplyLedgerStateSnapshot may change with intra-ledger state. But it's basically just 'LCL from the perspective of the apply thread' (i.e. ledger N-1). Maybe LCL is too ambiguous here and the original mSnapshot name is better. It's still somewhat confusing though; maybe we need a new term for this, like preApplySnapshot or lastSnapshotBeforeApply or something along these lines. Not sure what's the best option TBH.
|
I think the rest of @dmkozh comments are all regarding the LCL state getters, so I'll answer them all here. Previously, the LedgerManager maintained a main-thread only ledger snapshot, and had a collection of main-thread only getters into that snapshot. This was part of the fragmentation issue, where we had multiple different snapshots with multiple different interfaces depending on which subsystem was accessing the snapshot. I've done a first path of unifying these by having the main thread getters all call into the Ideally, I'd like to get rid of these functions entirely. The unified interface involves callers getting/maintaining a snapshot and calling the snapshot geters. However, this is is a lot of LOC in a PR that already is pretty big, so I'm saving it for a future refactor. This is why the interface looks kinda weird and inconsistent. It was inconsistent before too, but now at least everything calls into the same snapshot interface internally. Wrt to the transition period, we have a few options on what to do with the old main-thread only LedgerManager getters:
I went with option 2 initially because there was nothing preventing these getters from not being thread safe in the model, so I just added it for completeness sake. This may not be worthwhile if we end up getting rid of them soon anyway though |
Alright, I think it's ok to address this in the followup, as long as we don't keep this around for too long.
Yeah, this seems fine to me for the transition period. I was mostly concerned about the references, but since we're hopefully not going to keep these around for a while there isn't much risk that these get misused due to incorrect thread safety assumptions. |
Description
Resolves #5073
This is the 2nd PR in the ledger state factor. Specifically, this PR covers steps 2-4.
The change is a bit larger than I would have liked, but it was difficult to break this down into smaller chunks that were still safe for master. While the diff is large, I've tried to make the commits easy to follow along. The commits are as follows:
3476e384cRemove LedgerHeader from BucketListSnapshotDataThis reduces the duplication of
LedgerHeaderby removing it from the immutableBucketListSnapshotDatastruct. The header is moved to a centralized location inBucketSnapshotManagerto prepare for a future commit, which will centralize the header to justCompleteConstLedgerState9bec0791fConsolidated state access to LedgerStateSnapshotThis introduces
LedgerStateSnapshotas the unified, searchable ledger state snapshot struct. This is a lightweight wrapper over a centralizedCompleteConstLedgerState, which provides BucketList search functionality via local file streams.BucketSnapshotManagerhas not yet been removed yet, but basically is just a wrapper around theLedgerManagerto serveLedgerStateSnapshot.2b1957028Make LedgerStateSnapshot threadsafeImplements copy semantics for
LedgerStateSnapshotand makes the default interface a value type. This removes footguns where two threads would have a pointer to the same snapshot, resulting in race conditions.b77fb68feRemove BucketSnapshotManager and manage snapshot from LedgerManagerBucketSnapshotManageris removed entirely, and theLedgerManagermaintains the canonical lcl ledger state snapshot, as well as provides getters for that snapshot. This hardens the distinction between the "apply state" snapshot (used only during ledger apply) and the ledger snapshot published and maintained by the main thread. The apply state, and all apply time functions, must use the 'apply state" snapshot, since this may be ahead of the main thread's snapshot. The main thread snapshot is what is copied by other non-apply threads, such as the query server and overlay thread.d143cc7f3Fix minor race condition in invariant checkThis has various small cleanups. It fixes a few dangling reference issues (that I introduced by removing a couple asserts) and adds a check back to a unit test I forgot I removed during development. Finally, this launches the snapshot invariant thread from the apply thread, not the main thread. This ensures that we always have consistent state between the in-memory state and the ledger state snapshot used for the invariant.
There are still some outstanding cleanups. For example, I still want to get rid of the weird mutable header copy in
BucketSnapshotState(and maybe get rid of that class entirely). I also want to move historical snapshots outside ofCompleteConstLedgerStateand have the query server manage those directly, since it's the only subsystem that cares about it.Checklist
clang-formatv8.0.0 (viamake formator the Visual Studio extension)