Skip to content

Comments

Complete const ledger state refactor#5141

Open
SirTyson wants to merge 6 commits intostellar:masterfrom
SirTyson:complete-const-ledger-state-refactor
Open

Complete const ledger state refactor#5141
SirTyson wants to merge 6 commits intostellar:masterfrom
SirTyson:complete-const-ledger-state-refactor

Conversation

@SirTyson
Copy link
Contributor

Description

Resolves #5073

This is the 2nd PR in the ledger state factor. Specifically, this PR covers steps 2-4.

The change is a bit larger than I would have liked, but it was difficult to break this down into smaller chunks that were still safe for master. While the diff is large, I've tried to make the commits easy to follow along. The commits are as follows:

  • 3476e384c Remove LedgerHeader from BucketListSnapshotData

This reduces the duplication of LedgerHeader by removing it from the immutable BucketListSnapshotData struct. The header is moved to a centralized location in BucketSnapshotManager to prepare for a future commit, which will centralize the header to just CompleteConstLedgerState

  • 9bec0791f Consolidated state access to LedgerStateSnapshot

This introduces LedgerStateSnapshot as the unified, searchable ledger state snapshot struct. This is a lightweight wrapper over a centralized CompleteConstLedgerState, which provides BucketList search functionality via local file streams. BucketSnapshotManager has not yet been removed yet, but basically is just a wrapper around the LedgerManager to serve LedgerStateSnapshot.

  • 2b1957028 Make LedgerStateSnapshot threadsafe

Implements copy semantics for LedgerStateSnapshot and makes the default interface a value type. This removes footguns where two threads would have a pointer to the same snapshot, resulting in race conditions.

  • b77fb68fe Remove BucketSnapshotManager and manage snapshot from LedgerManager

BucketSnapshotManager is removed entirely, and the LedgerManager maintains the canonical lcl ledger state snapshot, as well as provides getters for that snapshot. This hardens the distinction between the "apply state" snapshot (used only during ledger apply) and the ledger snapshot published and maintained by the main thread. The apply state, and all apply time functions, must use the 'apply state" snapshot, since this may be ahead of the main thread's snapshot. The main thread snapshot is what is copied by other non-apply threads, such as the query server and overlay thread.

  • d143cc7f3 Fix minor race condition in invariant check

This has various small cleanups. It fixes a few dangling reference issues (that I introduced by removing a couple asserts) and adds a check back to a unit test I forgot I removed during development. Finally, this launches the snapshot invariant thread from the apply thread, not the main thread. This ensures that we always have consistent state between the in-memory state and the ledger state snapshot used for the invariant.

There are still some outstanding cleanups. For example, I still want to get rid of the weird mutable header copy in BucketSnapshotState (and maybe get rid of that class entirely). I also want to move historical snapshots outside of CompleteConstLedgerState and have the query server manage those directly, since it's the only subsystem that cares about it.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

Copilot AI review requested due to automatic review settings February 17, 2026 02:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the ledger state refactoring by removing BucketSnapshotManager and centralizing state management in LedgerManager. The refactor introduces LedgerStateSnapshot as a unified, thread-safe snapshot type with copy semantics, preventing accidental sharing across threads.

Changes:

  • Removed BucketSnapshotManager and moved snapshot management to LedgerManager
  • Introduced LedgerStateSnapshot (copyable value type) and ApplyLedgerStateSnapshot (strong typedef for apply-time snapshots)
  • Implemented copy semantics with fresh file stream caches per copy for thread safety
  • Fixed snapshot invariant race condition by launching it from apply thread instead of main thread

Reviewed changes

Copilot reviewed 62 out of 62 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/ledger/LedgerStateSnapshot.h/cpp New unified snapshot types with copy semantics
src/ledger/LedgerManagerImpl.h/cpp Centralized snapshot management with thread-safe access
src/bucket/BucketSnapshotManager.h/cpp Removed entirely
src/bucket/BucketListSnapshot.h/cpp Added copy constructors for thread-safe copying
src/transactions/ParallelApplyUtils.h/cpp Updated to use ApplyLedgerStateSnapshot
src/invariant/* Updated invariant interfaces to use new snapshot types
src/main/QueryServer.h/cpp Updated to use LedgerManager instead of BucketSnapshotManager
src/overlay/OverlayManagerImpl.h/cpp Updated to use LedgerStateSnapshot value type
Test files Updated to use new snapshot APIs

@SirTyson SirTyson force-pushed the complete-const-ledger-state-refactor branch 3 times, most recently from cf31391 to e38ed3f Compare February 17, 2026 20:19
@SirTyson SirTyson force-pushed the complete-const-ledger-state-refactor branch from e38ed3f to 99bf7d0 Compare February 17, 2026 20:59
LedgerManagerImpl::hasLastClosedSorobanNetworkConfig() const
{
releaseAssert(threadIsMain());
releaseAssert(!mApp.threadIsType(Application::ThreadType::APPLY));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inconsistency between this and getLastClosedSorobanNetworkConfig is really suspicious; I would expect has and get have exactly the same access invariants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this has been addressed, could this be just the main thread assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this has to do with the convo below wrt the getters. The reason why hasLastClosedSorobanNetworkConfig doesn't have the main thread assert is because we return a bool. getLastClosedSorobanNetworkConfig returns a reference, so we still have the main thread assert to prevent the case where we have a dangling reference. The "correct" solution is just to use the snapshot itself. I could switch this to main thread only, but there's no safety reason to do so, unlike with getLastClosedSorobanNetworkConfig.

LedgerManagerImpl::maybeUpdateLedgerStateSnapshot(
LedgerStateSnapshot& snapshot) const
{
SharedLockShared guard(mLedgerStateSnapshotMutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can avoid locking in scenarios where it's ok to be slightly stale (like query server and maybe more). It doesn't seem dangerous to me if implemented right. Not for this PR, but maybe something to think about in the future.

// close, providing access to both the live bucket list and the hot archive
// bucket list. Note that this does not reflect changes from the classic
// apply phase, but is a snapshot of the start of the ledger.
ApplyLedgerStateSnapshot mSnapshot;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should call this mLCLSnapshot to signify the fact that it's not up to date (the fact that it also has type ApplyLedgerStateSnapshot brings even more confusion - does it have to be that and not a regular snapshot?). This issue has existed in the previous name as well, but since we're doing the rename anyways, I think it's worth fixing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this does have to be ApplyLedgerStateSnapshot, and may not actually be the same as the "LCL" snapshot managed by the main thread. I think the issue is the definition of "up to date". GlobalParallelApplyLedgerState should never reference the main thread maintained LCL, since the apply thread LCL may be arbitrarily ahead of the main thread LCL (such as when replaying buffered ledgers). By "up to date" snapshot, I mean that if we are applying ledger N + 1, the snapshot is based on the ledger N state. The "out of date" snapshot in this case would be N - 1 (which the main thread may possibly still have).

Prior to this refactor, we had a bit of a confusing interface, where the canonical BucketSnapshot was updated via the apply thread, so SnapshotManager always had the apply thread LCL. Now that snapshots are managed solely by LedgerManager, we need to more clearly define the main thread lcl state from the apply thread lcl state, which is what the static types help do.

I don't have a problem renaming this to mLCLSnapshot. It's a little challenging defining "up to date" since we have snapshots at multiple ledgers, plus intra-ledger hot state as we apply TXs. In my mind, mSnapshot is "up to date" since it's the most recent snapshot and since snapshots inherently don't contain intra-ledger state, but I see how this could be confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. I must admit that I have been confused as well, as I thought that ApplyLedgerStateSnapshot may change with intra-ledger state. But it's basically just 'LCL from the perspective of the apply thread' (i.e. ledger N-1). Maybe LCL is too ambiguous here and the original mSnapshot name is better. It's still somewhat confusing though; maybe we need a new term for this, like preApplySnapshot or lastSnapshotBeforeApply or something along these lines. Not sure what's the best option TBH.

@SirTyson
Copy link
Contributor Author

I think the rest of @dmkozh comments are all regarding the LCL state getters, so I'll answer them all here. Previously, the LedgerManager maintained a main-thread only ledger snapshot, and had a collection of main-thread only getters into that snapshot. This was part of the fragmentation issue, where we had multiple different snapshots with multiple different interfaces depending on which subsystem was accessing the snapshot.

I've done a first path of unifying these by having the main thread getters all call into the LedgerStateSnapshot object. Given that we have a thread-safety model surrounding LedgerStateSnapshot, I've dropped the main thread assert from the getters and use locks instead (with the exception of the interfaces that return references. I've kept the assert and added a lock for completeness sake, which is redundant because only the main thread can update the snapshot).

Ideally, I'd like to get rid of these functions entirely. The unified interface involves callers getting/maintaining a snapshot and calling the snapshot geters. However, this is is a lot of LOC in a PR that already is pretty big, so I'm saving it for a future refactor. This is why the interface looks kinda weird and inconsistent. It was inconsistent before too, but now at least everything calls into the same snapshot interface internally.

Wrt to the transition period, we have a few options on what to do with the old main-thread only LedgerManager getters:

  1. Just keep the main thread assert with no lock. We'll have to do something about the thread safety static-analysis, but this is safe since these values can only be changed by the main thread.
  2. Remove the main thread assert and add locks to make these functions thread safe. (this is what I've already done). In my mind this keeps the outdated interface the "closest" semantically to our new snapshot interface. This may have a performance hit though. Additionally, if we want to get rid of these functions eventually, it may not be a good idea to make them more available to other threads, instead of less available.

I went with option 2 initially because there was nothing preventing these getters from not being thread safe in the model, so I just added it for completeness sake. This may not be worthwhile if we end up getting rid of them soon anyway though

@dmkozh
Copy link
Contributor

dmkozh commented Feb 24, 2026

Ideally, I'd like to get rid of these functions entirely. The unified interface involves callers getting/maintaining a snapshot and calling the snapshot geters. However, this is is a lot of LOC in a PR that already is pretty big, so I'm saving it for a future refactor.

Alright, I think it's ok to address this in the followup, as long as we don't keep this around for too long.

I went with option 2 initially because there was nothing preventing these getters from not being thread safe in the model, so I just added it for completeness sake.

Yeah, this seems fine to me for the transition period. I was mostly concerned about the references, but since we're hopefully not going to keep these around for a while there isn't much risk that these get misused due to incorrect thread safety assumptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enforce copy semantics for BucketList snapshots

2 participants