feat: simplify transaction scheduler, make it best effort #916
📝 Walkthrough

This PR restructures the scheduler. The README now documents an ExecutionCoordinator-based architecture. Internally, per-executor VecDeque queues are replaced with per-executor BinaryHeap instances prioritized by transaction ID, and many contention-tracking types are removed. Account locks move from Rc/RefCell indirection to a direct LocksCache with updated bitmask semantics and an atomic transaction ID generator. Scheduling and lock acquisition are simplified: try_acquire_locks returns the blocking executor on conflict, and blocked transactions are queued in the heaps. Tests are updated to reflect that strict FIFO ordering is no longer guaranteed.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@magicblock-processor/README.md`:
- Around lines 13-32: the fenced ASCII diagram block is missing a language
specifier. Change the opening fence from ``` to ```text so the diagram renders
and lints correctly: locate the ASCII diagram block in the README (the block
containing "TransactionScheduler" / "ExecutionCoordinator" / "Executor 0") and
update its opening fence accordingly.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 190-193: The SAFETY comment above the unsafe impl Send for
TransactionScheduler incorrectly mentions Rc<RefCell>; replace it with a brief,
accurate invariant: state that TransactionScheduler is safe to mark Send because
it is designed to run on a single dedicated thread and none of its non-Send
internal state escapes that thread or is accessed from other threads; reference
the unsafe impl Send for TransactionScheduler and ensure the comment documents
the single-threaded architectural guarantee and the requirement that no non-Send
handles are moved across threads.
- Around line 160-165: The call program_cache.write().unwrap() in
transition_to_new_slot risks panics on poisoned locks; replace the unwrap with a
poison-aware recovery (e.g., use .write().unwrap_or_else(|poison|
poison.into_inner()) or map_err to recover the guard) and optionally log the
poisoning event before continuing, then proceed to call cache.prune(...) and set
cache.latest_root_slot as before; ensure the change touches the
transition_to_new_slot function and the program_cache write acquisition to
remove the direct unwrap and handle PoisonError appropriately.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 99-122: Reduce visibility of scheduler lock-management methods
from pub(crate) to pub(super): change the signature of try_acquire_locks and the
unlock_accounts function to use pub(super) instead of pub(crate) so they remain
accessible to the parent module but not the entire crate; update both function
declarations (try_acquire_locks and unlock_accounts) accordingly and run tests
to ensure no external references rely on the wider visibility.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 83-91: The spawn function calls
Builder::new_current_thread().build().expect(...), which must not use .expect()
in production; change it to handle the error explicitly: call build() and match
or use .unwrap_or_else to log the build error via the project's logger (e.g.,
tracing::error! or processLogger equivalent) including the error details, then
panic with a clear message if you still consider it unrecoverable. Update the
code around the spawn function / runtime variable so the runtime build error is
logged before aborting (or propagate/return an error from spawn if you prefer),
and ensure the callsite behavior remains consistent with run()'s expectations.
In `@magicblock-processor/src/scheduler/tests.rs`:
- Around line 250-261: The existing comment in the test
transaction_with_duplicate_accounts is misleading because
Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.
```diff
 pub(crate) fn try_acquire_locks(
     &mut self,
     executor: ExecutorId,
-    txn: &TransactionWithId,
-) -> Result<(), BlockerId> {
-    let message = txn.txn.transaction.message();
+    txn: &ProcessableTransaction,
+) -> Result<(), ExecutorId> {
+    let message = txn.transaction.message();
     let accounts = message.account_keys();

     for (i, &acc) in accounts.iter().enumerate() {
-        // 1. Fairness Check:
-        // Before touching the lock, check if an older transaction is already waiting
-        // for this account. If so, we must wait our turn.
-        self.check_contention(acc, txn.id)?;
-
-        // 2. Lock Acquisition:
-        let lock = self.locks.entry(acc).or_default().clone();
-        let mut guard = lock.borrow_mut();
+        let lock = self.locks.entry(acc).or_default();
         let result = if message.is_writable(i) {
-            guard.write(executor)
+            lock.write(executor)
         } else {
-            guard.read(executor)
+            lock.read(executor)
         };

-        match result {
-            Ok(()) => {
-                self.acquired_locks[executor as usize].push(lock.clone())
-            }
-            Err(holder) => return Err(BlockerId::Executor(holder)),
+        if let Err(blocker) = result {
+            self.unlock_accounts(executor);
+            return Err(blocker);
         }
+        self.held_accounts[executor as usize].push(acc);
     }
     Ok(())
 }
```
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify if try_acquire_locks or unlock_accounts are accessed outside the scheduler module
rg -n 'try_acquire_locks|unlock_accounts' --type=rust -g '!magicblock-processor/src/scheduler/*'
```

Repository: magicblock-labs/magicblock-validator
Length of output: 62

🏁 Script executed:

```shell
#!/bin/bash
# Check the actual visibility of try_acquire_locks and unlock_accounts
rg -n 'pub\(crate\)\s+(fn try_acquire_locks|fn unlock_accounts)' magicblock-processor/src/scheduler/coordinator.rs -A 3
```

Repository: magicblock-labs/magicblock-validator
Length of output: 473
Narrow method visibility to pub(super) for consistency.
try_acquire_locks and unlock_accounts are not accessed outside the scheduler module and can be narrowed from pub(crate) to pub(super) to match the visibility pattern of other methods in this file.
Suggested changes

```diff
-pub(crate) fn try_acquire_locks(
+pub(super) fn try_acquire_locks(
```

```diff
-pub(crate) fn unlock_accounts(&mut self, executor: ExecutorId) {
+pub(super) fn unlock_accounts(&mut self, executor: ExecutorId) {
```

🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 99 - 122,
Reduce visibility of scheduler lock-management methods from pub(crate) to
pub(super): change the signature of try_acquire_locks and the unlock_accounts
function to use pub(super) instead of pub(crate) so they remain accessible to
the parent module but not the entire crate; update both function declarations
(try_acquire_locks and unlock_accounts) accordingly and run tests to ensure no
external references rely on the wider visibility.
```diff
 pub fn spawn(self) -> JoinHandle<()> {
-    let task = move || {
+    std::thread::spawn(move || {
         let runtime = Builder::new_current_thread()
             .thread_name("transaction-scheduler")
             .build()
             .expect("Failed to build single-threaded Tokio runtime");
         runtime.block_on(tokio::task::unconstrained(self.run()));
-    };
-    std::thread::spawn(task)
+    })
 }
```
.expect() on Tokio runtime creation in production code.
While a failure to build the runtime is indeed unrecoverable at this point, the coding guidelines mandate flagging all .expect() calls. Consider logging before panicking or documenting the invariant.
As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/mod.rs` around lines 83 - 91, The spawn
function calls Builder::new_current_thread().build().expect(...), which must not
use .expect() in production; change it to handle the error explicitly: call
build() and match or use .unwrap_or_else to log the build error via the
project's logger (e.g., tracing::error! or processLogger equivalent) including
the error details, then panic with a clear message if you still consider it
unrecoverable. Update the code around the spawn function / runtime variable so
the runtime build error is logged before aborting (or propagate/return an error
from spawn if you prefer), and ensure the callsite behavior remains consistent
with run()'s expectations.
```diff
 #[test]
-fn test_release_and_reacquire_lock() {
-    let mut coordinator = ExecutionCoordinator::new(1);
-    let (a, b) = (Pubkey::new_unique(), Pubkey::new_unique());
-
-    let txn1 = mock_txn(&[(a, true)]);
-    let txn2 = mock_txn(&[(b, true)]);
-
-    let exec = coordinator.get_ready_executor().unwrap();
-    assert!(coordinator.try_acquire_locks(exec, &txn1).is_ok());
-
-    coordinator.unlock_accounts(exec);
-    assert!(coordinator.try_acquire_locks(exec, &txn2).is_ok());
+fn transaction_with_duplicate_accounts() {
+    // Real transactions shouldn't have duplicates, but verify we don't panic
+    let mut c = ExecutionCoordinator::new(1);
+    let acc = Pubkey::new_unique();
+    let e = c.get_ready_executor().unwrap();
+
+    // Same account twice as writable - second lock attempt is no-op (already held)
+    assert!(c
+        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
+        .is_ok());
 }
```
🧹 Nitpick | 🔵 Trivial
Comment is slightly misleading — Solana message compilation deduplicates account keys.
The comment says "second lock attempt is no-op (already held)", but Transaction::new_signed_with_payer compiles the message and deduplicates account_keys, so there's only a single lock attempt. The test still correctly verifies no panic on duplicate inputs, but the comment could be more accurate.
```diff
-// Same account twice as writable - second lock attempt is no-op (already held)
+// Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
#[test]
fn transaction_with_duplicate_accounts() {
    // Real transactions shouldn't have duplicates, but verify we don't panic
    let mut c = ExecutionCoordinator::new(1);
    let acc = Pubkey::new_unique();
    let e = c.get_ready_executor().unwrap();

    // Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
    assert!(c
        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
        .is_ok());
}
```
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/tests.rs` around lines 250 - 261, The
existing comment in the test transaction_with_duplicate_accounts is misleading
because Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.
taco-paco left a comment:

LGTM! The only thing: I would keep some of the comments, as they're useful.
```diff
-while let Some(lock) = locks.pop() {
-    lock.borrow_mut().unlock(executor);
+for acc in self.held_accounts[executor as usize].drain(..) {
+    if let Some(lock) = self.locks.get_mut(&acc) {
```
I think the else case here should never happen; if it does, it's a sign of a bug. Let's add an error!() and an alert so we can catch that.
```rust
    executor: ExecutorId,
) -> Result<(), ExecutorId> {
    if self.state != 0 {
        // trailing_zeros() efficiently finds the index of the first set bit,
```
I think some comment here is needed. Maybe not as complex as the original one, but something to let the reader know: each executor gets its own bit in state, so trailing_zeros is used to map the lowest set bit back to an executor.
```rust
/// worker threads for execution or simulation.
/// The central transaction scheduler responsible for distributing work to executors.
pub struct TransactionScheduler {
    /// Manages the state of all executors, including locks and blocked transactions.
```
Let's keep the docs, I think they're quite useful. I would remove only the ones for accountsdb, shutdown, and maybe latest_block. The docs for transaction_rx are quite useful imo.
```rust
Ok(()) = block_produced.recv() => {
    self.transition_to_new_slot();
}
// A worker has finished its task and is ready for more.
```
```rust
        }
    }
    drop(self.executors);
    self.ready_rx.recv().await;
}

/// Handles a new transaction from the global queue.
fn handle_new_transaction(&mut self, txn: ProcessableTransaction) {
    // SAFETY:
```
```rust
self.coordinator.release_executor(exec);
break;
};
// Here we check whether the transaction was blocked and re-queued:
```
I think we need to keep this one, since the edge case isn't trivial.
Summary
This PR removes the complexity associated with fairness enforcement, which led to subtle bugs (deadlocks) under synthetic high loads. This is achieved by making lock acquisition a best-effort attempt that does not account for any future contention: if the lock on an account fails, the transaction is simply re-queued and retried when the executor becomes available. This simplification makes deadlocks impossible by design and, in practice, does not lead to writer/reader starvation even under unrealistic loads.
Compatibility
Testing
Checklist
Summary by CodeRabbit
Documentation
Changes
Tests