Batch transaction writes in handleTransactionList to reduce fsync count#4168
Draft
stevenvegt wants to merge 1 commit into master
Add State.AddMany() that writes multiple transactions in a single BBolt write transaction.

Previously, each transaction in a TransactionList message was added individually via State.Add(), each acquiring a write lock and triggering its own fsync. For a message with N transactions, this meant N separate fsyncs.

On network-attached storage (e.g. Azure premium SMB), fsync latency is 10-100ms compared to <1ms on local SSD. Combined with the go-stoabs read lock issue (see go-stoabs#146), this creates a compounding effect: slow fsyncs hold the write lock longer, blocking concurrent reads via the RWMutex writer-preference, which in turn blocks subsequent writes. A bootup that takes 3 minutes on local storage can take 30+ minutes on SMB because the lock contention multiplies the raw I/O penalty.

With batching, N transactions require only 1 fsync. Verification happens inside the write transaction so that later transactions in the batch can reference earlier ones. On the first failure, processing stops and all successfully added transactions are committed. The caller receives the count of added transactions and the first error, preserving the existing error handling (ErrPreviousTransactionMissing triggers state reconciliation, other errors are logged and recovered via gossip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
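The batching semantics described in the commit message can be sketched with an in-memory stand-in. This is only an illustration under stated assumptions: the Tx and State types, the verify helper and the Commits counter are hypothetical and are not the nuts-node or BBolt code. Only the names AddMany and ErrPreviousTransactionMissing and the (added, err) return shape come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPreviousTransactionMissing mirrors the error name used in this PR.
// Everything else in this file is a hypothetical stand-in.
var ErrPreviousTransactionMissing = errors.New("previous transaction missing")

// Tx is a simplified transaction: a reference plus the references it depends on.
type Tx struct {
	Ref   string
	Prevs []string
}

// State is an in-memory stand-in for the BBolt-backed state.
type State struct {
	stored  map[string]bool
	Commits int // each commit stands in for one fsync
}

func NewState() *State { return &State{stored: map[string]bool{}} }

// verify checks that every referenced transaction is visible, either already
// committed or staged earlier in the same batch.
func verify(tx Tx, committed, staged map[string]bool) error {
	for _, p := range tx.Prevs {
		if !committed[p] && !staged[p] {
			return fmt.Errorf("tx %s: %w", tx.Ref, ErrPreviousTransactionMissing)
		}
	}
	return nil
}

// AddMany stages transactions in order, stops at the first failure, and
// commits whatever was staged so far in a single commit.
func (s *State) AddMany(txs []Tx) (added int, err error) {
	staged := map[string]bool{}
	for _, tx := range txs {
		if err = verify(tx, s.stored, staged); err != nil {
			break
		}
		staged[tx.Ref] = true
		added++
	}
	for ref := range staged {
		s.stored[ref] = true
	}
	s.Commits++ // one commit (one fsync) regardless of batch size
	return added, err
}

func main() {
	s := NewState()
	added, err := s.AddMany([]Tx{
		{Ref: "a"},
		{Ref: "b", Prevs: []string{"a"}}, // references an earlier tx in the same batch
		{Ref: "d", Prevs: []string{"c"}}, // "c" is unknown, so processing stops here
		{Ref: "e", Prevs: []string{"a"}}, // never reached
	})
	fmt.Println(added, s.Commits, errors.Is(err, ErrPreviousTransactionMissing))
	// prints: 2 1 true
}
```

In this sketch the caller can read txs[added] as the transaction that failed and branch on err, which matches the error handling the commit message says is preserved.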
Coverage Impact ⬇️ Merging this pull request will decrease total coverage on Modified Files with Diff Coverage (2)
Not ready yet, see failing e2e tests. Keys from previous txs in the same batch don't become available for later txs because the notifier hasn't run yet. Needs some more thinking.

Summary
Add State.AddMany() that writes multiple transactions from a TransactionList message in a single BBolt write transaction, requiring only one fsync instead of one per transaction.

The problem
Previously, handleTransactionList called state.Add() for each transaction individually. Each Add() acquires a write lock, writes to BBolt, and triggers an fsync on commit. For a message with N transactions, that's N fsyncs.

On local SSD, fsync takes <1ms, which is barely noticeable. On network-attached storage (e.g. Azure premium SMB), fsync latency is 10-100ms per call. For 100 transactions, that's 1-10 seconds just in fsyncs per message.
This is compounded by the go-stoabs read lock issue (go-stoabs#146): slow fsyncs hold the write lock longer, and Go's sync.RWMutex writer-preference blocks new read-lock acquisitions while a writer is pending. The result is a cascading slowdown where a bootup that takes 3 minutes on local storage takes 30+ minutes on SMB. The slowdown is not proportional but multiplicative, because lock contention amplifies the raw I/O penalty.

The fix
AddMany()processes all transactions from a message in a single write transaction:(added int, err error)— the caller usesaddedto identify which transaction failed anderrto determine the action (ErrPreviousTransactionMissingtriggers state reconciliation, other errors are logged and recovered via gossip)Relationship to go-stoabs#146
This PR and go-stoabs#146 are complementary:
Together they address both sides of the lock contention that causes slow bootup on network storage.
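The read side of that contention rests on documented sync.RWMutex behaviour: once a writer is waiting in Lock, new read locks are refused until the writer has acquired and released the lock. The sketch below demonstrates only that standard-library behaviour, using TryRLock (Go 1.18+). It does not reproduce go-stoabs code, the function name is hypothetical, and the 100ms sleep is an assumption to give the writer goroutine time to start waiting.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// readerBlockedByPendingWriter reports whether a new read lock is refused
// while a writer is waiting behind an existing reader.
func readerBlockedByPendingWriter() bool {
	var mu sync.RWMutex
	mu.RLock() // an existing reader holds the lock

	started := make(chan struct{})
	done := make(chan struct{})
	go func() {
		close(started)
		mu.Lock() // the writer queues behind the existing reader
		mu.Unlock()
		close(done)
	}()
	<-started
	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	blocked := !mu.TryRLock() // a new reader cannot get in while the writer waits
	if !blocked {
		mu.RUnlock() // undo the extra read lock if it was granted
	}
	mu.RUnlock() // release the original reader so the writer can proceed
	<-done
	return blocked
}

func main() {
	fmt.Println("new reader blocked by pending writer:", readerBlockedByPendingWriter())
}
```

The longer a writer holds or waits for the lock, for example during a slow fsync, the longer every new reader stalls, which is the amplification this PR and go-stoabs#146 target from two sides.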
Test plan
TestProtocol_handleTransactionList tests updated and passing

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com