feat: simplify transaction scheduler, make it best effort #916
📝 Walkthrough

This PR restructures the scheduler. The README now documents an ExecutionCoordinator-based architecture. Internally, per-executor VecDeque queues are replaced with per-executor BinaryHeap instances prioritized by transaction ID, and many contention-tracking types are removed. Account locks move from Rc/RefCell indirection to a direct LocksCache with updated bitmask semantics and an atomic transaction ID generator. Scheduling and lock acquisition are simplified: try_acquire_locks returns the blocking executor on conflict, and blocked transactions are queued in the heaps. Tests are updated to reflect that strict FIFO ordering is no longer guaranteed.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@magicblock-processor/README.md`:
- Around lines 13-32: the fenced ASCII diagram block is missing a language
specifier. Change the opening fence from ``` to ```text so the diagram renders
and lints correctly: locate the ASCII diagram block in the README (the block
containing "TransactionScheduler" / "ExecutionCoordinator" / "Executor 0") and
update its opening fence accordingly.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 190-193: The SAFETY comment above the unsafe impl Send for
TransactionScheduler incorrectly mentions Rc<RefCell>; replace it with a brief,
accurate invariant: state that TransactionScheduler is safe to mark Send because
it is designed to run on a single dedicated thread and none of its non-Send
internal state escapes that thread or is accessed from other threads; reference
the unsafe impl Send for TransactionScheduler and ensure the comment documents
the single-threaded architectural guarantee and the requirement that no non-Send
handles are moved across threads.
- Around line 160-165: The call program_cache.write().unwrap() in
transition_to_new_slot risks panics on poisoned locks; replace the unwrap with a
poison-aware recovery (e.g., use .write().unwrap_or_else(|poison|
poison.into_inner()) or map_err to recover the guard) and optionally log the
poisoning event before continuing, then proceed to call cache.prune(...) and set
cache.latest_root_slot as before; ensure the change touches the
transition_to_new_slot function and the program_cache write acquisition to
remove the direct unwrap and handle PoisonError appropriately.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 99-122: Reduce visibility of scheduler lock-management methods
from pub(crate) to pub(super): change the signature of try_acquire_locks and the
unlock_accounts function to use pub(super) instead of pub(crate) so they remain
accessible to the parent module but not the entire crate; update both function
declarations (try_acquire_locks and unlock_accounts) accordingly and run tests
to ensure no external references rely on the wider visibility.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 83-91: The spawn function calls
Builder::new_current_thread().build().expect(...), which must not use .expect()
in production; change it to handle the error explicitly: call build() and match
or use .unwrap_or_else to log the build error via the project's logger (e.g.,
tracing::error! or processLogger equivalent) including the error details, then
panic with a clear message if you still consider it unrecoverable. Update the
code around the spawn function / runtime variable so the runtime build error is
logged before aborting (or propagate/return an error from spawn if you prefer),
and ensure the callsite behavior remains consistent with run()'s expectations.
In `@magicblock-processor/src/scheduler/tests.rs`:
- Around line 250-261: The existing comment in the test
transaction_with_duplicate_accounts is misleading because
Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.
```diff
 pub(crate) fn try_acquire_locks(
     &mut self,
     executor: ExecutorId,
-    txn: &TransactionWithId,
-) -> Result<(), BlockerId> {
-    let message = txn.txn.transaction.message();
+    txn: &ProcessableTransaction,
+) -> Result<(), ExecutorId> {
+    let message = txn.transaction.message();
     let accounts = message.account_keys();

     for (i, &acc) in accounts.iter().enumerate() {
-        // 1. Fairness Check:
-        // Before touching the lock, check if an older transaction is already waiting
-        // for this account. If so, we must wait our turn.
-        self.check_contention(acc, txn.id)?;
-
-        // 2. Lock Acquisition:
-        let lock = self.locks.entry(acc).or_default().clone();
-        let mut guard = lock.borrow_mut();
+        let lock = self.locks.entry(acc).or_default();
         let result = if message.is_writable(i) {
-            guard.write(executor)
+            lock.write(executor)
         } else {
-            guard.read(executor)
+            lock.read(executor)
         };

-        match result {
-            Ok(()) => {
-                self.acquired_locks[executor as usize].push(lock.clone())
-            }
-            Err(holder) => return Err(BlockerId::Executor(holder)),
+        if let Err(blocker) = result {
+            self.unlock_accounts(executor);
+            return Err(blocker);
         }
+        self.held_accounts[executor as usize].push(acc);
     }
     Ok(())
 }
```
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify if try_acquire_locks or unlock_accounts are accessed outside the scheduler module
rg -n 'try_acquire_locks|unlock_accounts' --type=rust -g '!magicblock-processor/src/scheduler/*'
```

Repository: magicblock-labs/magicblock-validator
Length of output: 62

🏁 Script executed:

```shell
#!/bin/bash
# Check the actual visibility of try_acquire_locks and unlock_accounts
rg -n 'pub\(crate\)\s+(fn try_acquire_locks|fn unlock_accounts)' magicblock-processor/src/scheduler/coordinator.rs -A 3
```

Repository: magicblock-labs/magicblock-validator
Length of output: 473
Narrow method visibility to pub(super) for consistency.
try_acquire_locks and unlock_accounts are not accessed outside the scheduler module and can be narrowed from pub(crate) to pub(super) to match the visibility pattern of other methods in this file.
Suggested changes

```diff
-pub(crate) fn try_acquire_locks(
+pub(super) fn try_acquire_locks(
```

```diff
-pub(crate) fn unlock_accounts(&mut self, executor: ExecutorId) {
+pub(super) fn unlock_accounts(&mut self, executor: ExecutorId) {
```

🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 99 - 122,
Reduce visibility of scheduler lock-management methods from pub(crate) to
pub(super): change the signature of try_acquire_locks and the unlock_accounts
function to use pub(super) instead of pub(crate) so they remain accessible to
the parent module but not the entire crate; update both function declarations
(try_acquire_locks and unlock_accounts) accordingly and run tests to ensure no
external references rely on the wider visibility.
```diff
 pub fn spawn(self) -> JoinHandle<()> {
-    let task = move || {
+    std::thread::spawn(move || {
         let runtime = Builder::new_current_thread()
             .thread_name("transaction-scheduler")
             .build()
             .expect("Failed to build single-threaded Tokio runtime");
         runtime.block_on(tokio::task::unconstrained(self.run()));
-    };
-    std::thread::spawn(task)
+    })
 }
```
.expect() on Tokio runtime creation in production code.
While a failure to build the runtime is indeed unrecoverable at this point, the coding guidelines mandate flagging all .expect() calls. Consider logging before panicking or documenting the invariant.
As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/mod.rs` around lines 83 - 91, The spawn
function calls Builder::new_current_thread().build().expect(...), which must not
use .expect() in production; change it to handle the error explicitly: call
build() and match or use .unwrap_or_else to log the build error via the
project's logger (e.g., tracing::error! or processLogger equivalent) including
the error details, then panic with a clear message if you still consider it
unrecoverable. Update the code around the spawn function / runtime variable so
the runtime build error is logged before aborting (or propagate/return an error
from spawn if you prefer), and ensure the callsite behavior remains consistent
with run()'s expectations.
```diff
 #[test]
-fn test_release_and_reacquire_lock() {
-    let mut coordinator = ExecutionCoordinator::new(1);
-    let (a, b) = (Pubkey::new_unique(), Pubkey::new_unique());
-
-    let txn1 = mock_txn(&[(a, true)]);
-    let txn2 = mock_txn(&[(b, true)]);
-
-    let exec = coordinator.get_ready_executor().unwrap();
-    assert!(coordinator.try_acquire_locks(exec, &txn1).is_ok());
-
-    coordinator.unlock_accounts(exec);
-    assert!(coordinator.try_acquire_locks(exec, &txn2).is_ok());
+fn transaction_with_duplicate_accounts() {
+    // Real transactions shouldn't have duplicates, but verify we don't panic
+    let mut c = ExecutionCoordinator::new(1);
+    let acc = Pubkey::new_unique();
+    let e = c.get_ready_executor().unwrap();
+
+    // Same account twice as writable - second lock attempt is no-op (already held)
+    assert!(c
+        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
+        .is_ok());
 }
```
🧹 Nitpick | 🔵 Trivial
Comment is slightly misleading — Solana message compilation deduplicates account keys.
The comment says "second lock attempt is no-op (already held)", but Transaction::new_signed_with_payer compiles the message and deduplicates account_keys, so there's only a single lock attempt. The test still correctly verifies no panic on duplicate inputs, but the comment could be more accurate.
```diff
-// Same account twice as writable - second lock attempt is no-op (already held)
+// Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
#[test]
fn transaction_with_duplicate_accounts() {
    // Real transactions shouldn't have duplicates, but verify we don't panic
    let mut c = ExecutionCoordinator::new(1);
    let acc = Pubkey::new_unique();
    let e = c.get_ready_executor().unwrap();

    // Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
    assert!(c
        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
        .is_ok());
}
```
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/tests.rs` around lines 250 - 261, The
existing comment in the test transaction_with_duplicate_accounts is misleading
because Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.
taco-paco left a comment:

LGTM! The only thing: I would keep some of the comments, as they're useful.
```diff
-while let Some(lock) = locks.pop() {
-    lock.borrow_mut().unlock(executor);
+for acc in self.held_accounts[executor as usize].drain(..) {
+    if let Some(lock) = self.locks.get_mut(&acc) {
```
I think the else case here should never happen; if it does, it's a sign of a bug. Let's add an error!() and an alert so we can catch that.
```rust
    executor: ExecutorId,
) -> Result<(), ExecutorId> {
    if self.state != 0 {
        // trailing_zeros() efficiently finds the index of the first set bit,
```
I think some comment here is needed. Maybe not as complex as the original one, but something to let the reader know: each executor gets its own bit in state, so trailing_zeros is used to map the lowest set bit back to an executor.
```rust
/// worker threads for execution or simulation.
/// The central transaction scheduler responsible for distributing work to executors.
pub struct TransactionScheduler {
    /// Manages the state of all executors, including locks and blocked transactions.
```
Let's keep the docs, I think they're quite useful. I would remove only the ones for accountsdb, shutdown, and maybe latest_block. The docs for transaction_rx are quite useful imo.
```rust
Ok(()) = block_produced.recv() => {
    self.transition_to_new_slot();
}
// A worker has finished its task and is ready for more.
```
```rust
        }
    }
    drop(self.executors);
    self.ready_rx.recv().await;
}

/// Handles a new transaction from the global queue.
fn handle_new_transaction(&mut self, txn: ProcessableTransaction) {
    // SAFETY:
```
```rust
self.coordinator.release_executor(exec);
break;
};
// Here we check whether the transaction was blocked and re-queued:
```
I think we need to keep this one, since the edge case isn't trivial.
Summary
This PR removes the complexity associated with fairness enforcement, which led to subtle bugs (deadlocks) under synthetic high loads. This is achieved by making lock acquisition a best-effort attempt that does not account for any future contention: if the lock on an account fails, the transaction is simply re-queued and retried when the executor becomes available. This simplification makes deadlocks impossible by design and, in practice, does not lead to writer/reader starvation even under unrealistic loads.
Compatibility
Testing
Checklist
Summary by CodeRabbit
Documentation
Changes
Tests