Skip to content

[Fix] Pause BFT block advancement during sync#4039

Merged
vicsn merged 42 commits intostagingfrom
fix/sync-race-conditions
Mar 25, 2026
Merged

[Fix] Pause BFT block advancement during sync#4039
vicsn merged 42 commits intostagingfrom
fix/sync-race-conditions

Conversation

@kaimast
Copy link
Copy Markdown
Contributor

@kaimast kaimast commented Dec 5, 2025

This PR aims to fix #3911 and #3636. The PR also contains changes from three smaller PRs (#4065, #4138, and #4139) that can be reviewed and merged separately.

Motivation

Recently, we added CheckBlockError as a machine-readable response to failures in block checks. The proposed changes use this new error type to differentiate between invalid blocks, and cases where the ledger has already advanced, and the proposed block is contained in the ledger already.
The snarkVM changes also ensure that calls to the ledger are atomic, but provides no transactional guarantees across multiple invocations of Ledger functions.

BFT has a lock field that aims to protect, block advancement. This lock already existed before this change, but was not accessible from outside the struct. A problem with this setup is that check_block_content and advance_to_next_block are dependent operations. Generally, one assumes that if the former succeeds, the latter will also. However, without additional locking, BFT can call advance_to_block between Sync's calls of check_block_content and advance_to_next_block. In this case, Sync second call will fail.

Consistent usage of PendingBlocks

Before this PR, pending blocks were only used during DAG sync. This PR changes it to always use pending blocks during sync, preventing a validator from syncing a block that contains a leader certificate that was skipped by the majority of the network. It also means that the two sync modes BFT are now almost identical, except that "fast" sync does not update the DAG until it has caught up with the network. This makes the other changes (next section) a lot easier to implement.

This README contains a longer write-up on how sync and block creation works in BFT. Please verify that it makes sense, and let me know if you notice inconsistencies between the write-up and the code.

Improved Coordination between Sync and Primary

The implementation implements an approached based on optimistic concurrency control to address this. Sync builds a chain of PendingBlocks without grabbing the BFT lock (otherwise, Sync would hold the lock for long periods of time).
Once a chain of pending blocks has reached the availability threshold, it will grab the lock, check their contents for correctness, and advance the ledger. In this step, check_block_content will verify the height of the block again and, if the height is below the current ledger height, Sync will abort and retry.

The main change for BFT is that update_dag does not grab the lock anymore, but callers are required to grab the lock before calling update_dag. There is a new method BFT::pause_for_block_sync that Sync calls before trying to commit blocks.
I removed the ALLOW_LEDGER_ACCESS parameter from update_dag to remove complexity from this function. This flag was only set to false for some tests, and does not seem to be needed any more even in this case, as the LedgerService abstraction allows to simply ignore writes. I would like to eventually have separate update_dag functions for sync and BFT, but this PR already reduces the complexity of the function noticeably.

Testing

[x] Pass CI
[x] Successfully run prerelease tests
[x] Successfully run prerelease tests again
[x] Run benchmarks and ensured there are no performance regressions

Notes

  • Sync on this branch is about three times as fast as on staging. I assume this is mostly due to the improvements in snarkVM and unrelated to the changes in this branch.

@kaimast kaimast changed the base branch from staging to fix/revert-revert-pending-blocks December 5, 2025 18:25
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from e8a483a to c32278a Compare December 5, 2025 18:49
Base automatically changed from fix/revert-revert-pending-blocks to staging December 5, 2025 22:03
@kaimast kaimast force-pushed the fix/sync-race-conditions branch 6 times, most recently from 648a2f7 to fe03df4 Compare December 8, 2025 16:58
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from fe03df4 to bd73f5f Compare December 8, 2025 18:40
@kaimast kaimast marked this pull request as ready for review December 8, 2025 18:41
@kaimast kaimast requested review from ljedrz, raychu86 and vicsn December 8, 2025 19:37
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from b6b59ba to 31fbc06 Compare December 8, 2025 19:44
@vicsn vicsn removed request for ljedrz, raychu86 and vicsn December 9, 2025 13:18
@vicsn vicsn marked this pull request as draft December 9, 2025 13:18
@kaimast kaimast force-pushed the fix/sync-race-conditions branch 3 times, most recently from c4f62ba to 1e37a31 Compare December 10, 2025 16:52
@vicsn
Copy link
Copy Markdown
Collaborator

vicsn commented Dec 19, 2025

Pausing this for now, the latest stress-test run prerelease-v4.4.20 had a halt. First tackling #4047

@kaimast kaimast force-pushed the fix/sync-race-conditions branch 2 times, most recently from 435cce4 to da65f53 Compare January 5, 2026 22:45
@kaimast
Copy link
Copy Markdown
Contributor Author

kaimast commented Jan 5, 2026

ProvableHQ/snarkVM#3070 needs to be merged first.

@kaimast kaimast force-pushed the fix/sync-race-conditions branch from da65f53 to 1c6e520 Compare January 11, 2026 19:50
@kaimast
Copy link
Copy Markdown
Contributor Author

kaimast commented Mar 12, 2026

I ran another set of stress tests on top of the two commits from today. Still passes.

Grafana Link

Comment thread node/bft/src/helpers/storage.rs Outdated
Comment thread node/bft/src/bft.rs Outdated
Comment thread node/bft/src/bft.rs Outdated
Comment thread node/bft/src/sync/mod.rs
Comment thread node/bft/README.md
@kaimast kaimast changed the base branch from fix/load-aborted-transmissions to staging March 13, 2026 00:41
@kaimast
Copy link
Copy Markdown
Contributor Author

kaimast commented Mar 13, 2026

I addressed all comments and resolved conflicts with mainnet.

@kaimast kaimast force-pushed the fix/sync-race-conditions branch from 0e9c9fa to ed71c56 Compare March 17, 2026 17:30
@vicsn vicsn requested a review from raychu86 March 23, 2026 18:35
@kaimast
Copy link
Copy Markdown
Contributor Author

kaimast commented Mar 23, 2026

I added the changes from #4179 to this PR, which should make merging easier.

Copy link
Copy Markdown
Collaborator

@raychu86 raychu86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks sound to me. Please run proper stress tests before merging

@vicsn
Copy link
Copy Markdown
Collaborator

vicsn commented Mar 25, 2026

Stresstest prerelease-v4.5.29 succeeded

@vicsn vicsn merged commit 2965cd6 into staging Mar 25, 2026
5 checks passed
@vicsn vicsn deleted the fix/sync-race-conditions branch March 25, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Errors during sync race conditions

4 participants