
feat: simplify transaction scheduler, make it best effort#916

Open
bmuddha wants to merge 3 commits into master from bmuddha/fix/simple-scheduler

Conversation


@bmuddha bmuddha commented Jan 29, 2026

Summary

This PR removes the complexity associated with fairness enforcement, which led to subtle bugs (deadlocks) under synthetic high load. This is achieved by making lock acquisition a best-effort attempt that does not try to resolve future contention: if a lock on an account cannot be taken, the transaction is simply re-queued and retried when the blocking executor becomes available. This simplification makes deadlocks impossible by design and, in practice, does not lead to writer/reader starvation even under unrealistic loads.
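
Purely for illustration, a minimal sketch of the best-effort flow described above, using names that appear in the diffs later in this PR (try_acquire_locks, ProcessableTransaction, ExecutorId); dispatch and requeue_on are placeholder names, not actual methods from this PR:

// Best-effort scheduling: no fairness bookkeeping, no contention tracking.
fn try_schedule(&mut self, executor: ExecutorId, txn: ProcessableTransaction) {
    match self.coordinator.try_acquire_locks(executor, &txn) {
        // All account locks acquired: hand the transaction to the executor.
        Ok(()) => self.dispatch(executor, txn),
        // An account is held by another executor: park the transaction in that
        // executor's queue and retry once it reports back as ready.
        Err(blocking_executor) => self.requeue_on(blocking_executor, txn),
    }
}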

Compatibility

  • No breaking changes, but the scheduling logic is now different

Testing

  • Tests have been rewritten to accommodate the new scheduling logic

Checklist

Summary by CodeRabbit

  • Documentation

    • README reorganized with new Architecture/Core Components sections, diagram, and a clearer ExecutionCoordinator and scheduling workflow.
  • Changes

    • Scheduling logic updated: executor assignment and lock/queue coordination changed; strict FIFO ordering is no longer guaranteed, but transaction correctness is preserved.
  • Tests

    • Tests adjusted to verify unordered completion where ordering is not guaranteed.


github-actions bot commented Jan 29, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.


coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Walkthrough

This PR restructures the scheduler: README now documents an ExecutionCoordinator-based architecture. Internals replace per-executor VecDeque queues with per-executor BinaryHeap prioritized by transaction ID, and remove many contention-tracking types. Account locks move from Rc/RefCell indirection to a direct LocksCache with updated bitmask semantics and an atomic transaction ID generator. Scheduling and lock-acquisition flow are simplified (try_acquire_locks returns a blocking executor on conflict; blocked transactions are queued in heaps). Tests updated to reflect that strict FIFO ordering is no longer guaranteed.
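
Purely as an illustration of that data-structure change (field and method names here are invented, not taken from the codebase): per-executor BinaryHeaps ordered by a monotonically increasing transaction ID, wrapped in Reverse so the oldest blocked transaction is retried first.

use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::atomic::{AtomicU64, Ordering};

type TxnId = u64;

// Atomic transaction ID generator: lower IDs correspond to older transactions.
static NEXT_TXN_ID: AtomicU64 = AtomicU64::new(0);

fn next_txn_id() -> TxnId {
    NEXT_TXN_ID.fetch_add(1, Ordering::Relaxed)
}

// One heap per executor. Reverse flips the default max-heap into a min-heap,
// so pop() yields the blocked transaction with the smallest (oldest) ID.
struct BlockedQueues {
    heaps: Vec<BinaryHeap<Reverse<TxnId>>>,
}

impl BlockedQueues {
    fn new(executors: usize) -> Self {
        Self { heaps: vec![BinaryHeap::new(); executors] }
    }

    fn block(&mut self, executor: usize, id: TxnId) {
        self.heaps[executor].push(Reverse(id));
    }

    fn next_blocked(&mut self, executor: usize) -> Option<TxnId> {
        self.heaps[executor].pop().map(|Reverse(id)| id)
    }
}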

Suggested reviewers

  • GabrielePicco
  • taco-paco

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@magicblock-processor/README.md`:
- Around line 13-32: The fenced ASCII diagram block is missing a language
specifier; update the opening fence from ``` to ```text (i.e., add the "text"
specifier for the ASCII diagram block) so the diagram renders and lints
correctly—locate the ASCII diagram block in the README (the block containing
"TransactionScheduler" / "ExecutionCoordinator" / "Executor 0") and change its
opening fence accordingly.

In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 190-193: The SAFETY comment above the unsafe impl Send for
TransactionScheduler incorrectly mentions Rc<RefCell>; replace it with a brief,
accurate invariant: state that TransactionScheduler is safe to mark Send because
it is designed to run on a single dedicated thread and none of its non-Send
internal state escapes that thread or is accessed from other threads; reference
the unsafe impl Send for TransactionScheduler and ensure the comment documents
the single-threaded architectural guarantee and the requirement that no non-Send
handles are moved across threads.
- Around line 160-165: The call program_cache.write().unwrap() in
transition_to_new_slot risks panics on poisoned locks; replace the unwrap with a
poison-aware recovery (e.g., use .write().unwrap_or_else(|poison|
poison.into_inner()) or map_err to recover the guard) and optionally log the
poisoning event before continuing, then proceed to call cache.prune(...) and set
cache.latest_root_slot as before; ensure the change touches the
transition_to_new_slot function and the program_cache write acquisition to
remove the direct unwrap and handle PoisonError appropriately.
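
As a sketch of the poison-aware recovery suggested in the last item above (write_recovering is a hypothetical helper name; only std::sync APIs are used):

use std::sync::{RwLock, RwLockWriteGuard};

// Acquire the write guard even if another thread panicked while holding the
// lock, instead of panicking ourselves via unwrap(). A log line could be added
// inside the closure before recovering, per the review suggestion.
fn write_recovering<T>(lock: &RwLock<T>) -> RwLockWriteGuard<'_, T> {
    lock.write().unwrap_or_else(|poisoned| poisoned.into_inner())
}

transition_to_new_slot would then take the guard via such a helper (or inline the unwrap_or_else) before calling cache.prune(...) and updating cache.latest_root_slot as before.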

@bmuddha bmuddha changed the title from "Simplify transaction scheduler, make it best effort" to "simplify transaction scheduler, make it best effort" on Jan 29, 2026
@bmuddha bmuddha changed the title from "simplify transaction scheduler, make it best effort" to "feat: simplify transaction scheduler, make it best effort" on Jan 29, 2026
@bmuddha bmuddha force-pushed the bmuddha/fix/simple-scheduler branch from 30714cd to d91128b on February 6, 2026 05:00
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 99-122: Reduce visibility of scheduler lock-management methods
from pub(crate) to pub(super): change the signature of try_acquire_locks and the
unlock_accounts function to use pub(super) instead of pub(crate) so they remain
accessible to the parent module but not the entire crate; update both function
declarations (try_acquire_locks and unlock_accounts) accordingly and run tests
to ensure no external references rely on the wider visibility.

In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 83-91: The spawn function calls
Builder::new_current_thread().build().expect(...), which must not use .expect()
in production; change it to handle the error explicitly: call build() and match
or use .unwrap_or_else to log the build error via the project's logger (e.g.,
tracing::error! or processLogger equivalent) including the error details, then
panic with a clear message if you still consider it unrecoverable. Update the
code around the spawn function / runtime variable so the runtime build error is
logged before aborting (or propagate/return an error from spawn if you prefer),
and ensure the callsite behavior remains consistent with run()'s expectations.

In `@magicblock-processor/src/scheduler/tests.rs`:
- Around line 250-261: The existing comment in the test
transaction_with_duplicate_accounts is misleading because
Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.

Comment on lines +99 to 122
     pub(crate) fn try_acquire_locks(
         &mut self,
         executor: ExecutorId,
-        txn: &TransactionWithId,
-    ) -> Result<(), BlockerId> {
-        let message = txn.txn.transaction.message();
+        txn: &ProcessableTransaction,
+    ) -> Result<(), ExecutorId> {
+        let message = txn.transaction.message();
         let accounts = message.account_keys();

         for (i, &acc) in accounts.iter().enumerate() {
-            // 1. Fairness Check:
-            // Before touching the lock, check if an older transaction is already waiting
-            // for this account. If so, we must wait our turn.
-            self.check_contention(acc, txn.id)?;
-
-            // 2. Lock Acquisition:
-            let lock = self.locks.entry(acc).or_default().clone();
-            let mut guard = lock.borrow_mut();
-
+            let lock = self.locks.entry(acc).or_default();
             let result = if message.is_writable(i) {
-                guard.write(executor)
+                lock.write(executor)
             } else {
-                guard.read(executor)
+                lock.read(executor)
             };

-            match result {
-                Ok(()) => {
-                    self.acquired_locks[executor as usize].push(lock.clone())
-                }
-                Err(holder) => return Err(BlockerId::Executor(holder)),
+            if let Err(blocker) = result {
+                self.unlock_accounts(executor);
+                return Err(blocker);
             }
+            self.held_accounts[executor as usize].push(acc);
         }
         Ok(())
     }

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify if try_acquire_locks or unlock_accounts are accessed outside the scheduler module
rg -n 'try_acquire_locks|unlock_accounts' --type=rust -g '!magicblock-processor/src/scheduler/*'

Repository: magicblock-labs/magicblock-validator

Length of output: 62


🏁 Script executed:

#!/bin/bash
# Check the actual visibility of try_acquire_locks and unlock_accounts
rg -n 'pub\(crate\)\s+(fn try_acquire_locks|fn unlock_accounts)' magicblock-processor/src/scheduler/coordinator.rs -A 3

Repository: magicblock-labs/magicblock-validator

Length of output: 473


Narrow method visibility to pub(super) for consistency.

try_acquire_locks and unlock_accounts are not accessed outside the scheduler module and can be narrowed from pub(crate) to pub(super) to match the visibility pattern of other methods in this file.

Suggested changes
-    pub(crate) fn try_acquire_locks(
+    pub(super) fn try_acquire_locks(
-    pub(crate) fn unlock_accounts(&mut self, executor: ExecutorId) {
+    pub(super) fn unlock_accounts(&mut self, executor: ExecutorId) {
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 99 - 122,
Reduce visibility of scheduler lock-management methods from pub(crate) to
pub(super): change the signature of try_acquire_locks and the unlock_accounts
function to use pub(super) instead of pub(crate) so they remain accessible to
the parent module but not the entire crate; update both function declarations
(try_acquire_locks and unlock_accounts) accordingly and run tests to ensure no
external references rely on the wider visibility.

Comment on lines 83 to 91
     pub fn spawn(self) -> JoinHandle<()> {
-        let task = move || {
+        std::thread::spawn(move || {
             let runtime = Builder::new_current_thread()
                 .thread_name("transaction-scheduler")
                 .build()
                 .expect("Failed to build single-threaded Tokio runtime");
             runtime.block_on(tokio::task::unconstrained(self.run()));
-        };
-        std::thread::spawn(task)
+        })
     }

⚠️ Potential issue | 🟠 Major

.expect() on Tokio runtime creation in production code.

While a failure to build the runtime is indeed unrecoverable at this point, the coding guidelines mandate flagging all .expect() calls. Consider logging before panicking or documenting the invariant.

As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/mod.rs` around lines 83 - 91, The spawn
function calls Builder::new_current_thread().build().expect(...), which must not
use .expect() in production; change it to handle the error explicitly: call
build() and match or use .unwrap_or_else to log the build error via the
project's logger (e.g., tracing::error! or processLogger equivalent) including
the error details, then panic with a clear message if you still consider it
unrecoverable. Update the code around the spawn function / runtime variable so
the runtime build error is logged before aborting (or propagate/return an error
from spawn if you prefer), and ensure the callsite behavior remains consistent
with run()'s expectations.
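
One possible shape for that fix, as a sketch only: it assumes the tracing crate is available for logging, and build_scheduler_runtime is an illustrative helper name rather than anything in this PR.

use tokio::runtime::{Builder, Runtime};

fn build_scheduler_runtime() -> Runtime {
    Builder::new_current_thread()
        .thread_name("transaction-scheduler")
        .build()
        .unwrap_or_else(|err| {
            // Log the full error before aborting so the failure shows up in the
            // logs and not only in the panic payload.
            tracing::error!(%err, "failed to build single-threaded Tokio runtime");
            panic!("transaction scheduler cannot start without a Tokio runtime: {err}")
        })
}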

Comment on lines 250 to 261
 #[test]
-fn test_release_and_reacquire_lock() {
-    let mut coordinator = ExecutionCoordinator::new(1);
-    let (a, b) = (Pubkey::new_unique(), Pubkey::new_unique());
-
-    let txn1 = mock_txn(&[(a, true)]);
-    let txn2 = mock_txn(&[(b, true)]);
-
-    let exec = coordinator.get_ready_executor().unwrap();
-    assert!(coordinator.try_acquire_locks(exec, &txn1).is_ok());
-
-    coordinator.unlock_accounts(exec);
-    assert!(coordinator.try_acquire_locks(exec, &txn2).is_ok());
+fn transaction_with_duplicate_accounts() {
+    // Real transactions shouldn't have duplicates, but verify we don't panic
+    let mut c = ExecutionCoordinator::new(1);
+    let acc = Pubkey::new_unique();
+    let e = c.get_ready_executor().unwrap();
+
+    // Same account twice as writable - second lock attempt is no-op (already held)
+    assert!(c
+        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
+        .is_ok());
 }

🧹 Nitpick | 🔵 Trivial

Comment is slightly misleading — Solana message compilation deduplicates account keys.

The comment says "second lock attempt is no-op (already held)", but Transaction::new_signed_with_payer compiles the message and deduplicates account_keys, so there's only a single lock attempt. The test still correctly verifies no panic on duplicate inputs, but the comment could be more accurate.

-    // Same account twice as writable - second lock attempt is no-op (already held)
+    // Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#[test]
-fn transaction_with_duplicate_accounts() {
-    // Real transactions shouldn't have duplicates, but verify we don't panic
-    let mut c = ExecutionCoordinator::new(1);
-    let acc = Pubkey::new_unique();
-    let e = c.get_ready_executor().unwrap();
-
-    // Same account twice as writable - second lock attempt is no-op (already held)
-    assert!(c
-        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
-        .is_ok());
-}
+#[test]
+fn transaction_with_duplicate_accounts() {
+    // Real transactions shouldn't have duplicates, but verify we don't panic
+    let mut c = ExecutionCoordinator::new(1);
+    let acc = Pubkey::new_unique();
+    let e = c.get_ready_executor().unwrap();
+
+    // Same account in two instructions - message compilation deduplicates keys, so only one lock attempt
+    assert!(c
+        .try_schedule(e, mock_txn(&[(acc, true), (acc, true)]))
+        .is_ok());
+}
🤖 Prompt for AI Agents
In `@magicblock-processor/src/scheduler/tests.rs` around lines 250 - 261, The
existing comment in the test transaction_with_duplicate_accounts is misleading
because Transaction::new_signed_with_payer compiles the message and deduplicates
account_keys, so there is only a single lock attempt; update the comment above
the test (referencing transaction_with_duplicate_accounts, mock_txn, and
try_schedule on ExecutionCoordinator) to state that Solana message compilation
deduplicates account keys and that the test simply verifies the scheduler does
not panic when given duplicate inputs rather than asserting a second lock is a
no-op.

@taco-paco taco-paco left a comment

LGTM! The only thing: I would keep some of the comments, as they're useful.

-        while let Some(lock) = locks.pop() {
-            lock.borrow_mut().unlock(executor);
+        for acc in self.held_accounts[executor as usize].drain(..) {
+            if let Some(lock) = self.locks.get_mut(&acc) {

I think the else case here should never happen; if it does, it's a sign of a bug. Let's add an error!() log and an alert so we can catch it.
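
For example, something along these lines (a sketch only, assuming the log/tracing error! macro and the unlock method shown in the diff snippet above):

for acc in self.held_accounts[executor as usize].drain(..) {
    if let Some(lock) = self.locks.get_mut(&acc) {
        lock.unlock(executor);
    } else {
        // Every held account must have an entry in the locks cache; reaching
        // this branch indicates a scheduler bug worth alerting on.
        error!("no lock entry for held account {acc} (executor {executor})");
    }
}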

        executor: ExecutorId,
    ) -> Result<(), ExecutorId> {
        if self.state != 0 {
            // trailing_zeros() efficiently finds the index of the first set bit,

I think some comment here is needed. Maybe not as complex as the original one, but something to let the reader know: each executor gets its own bit in state, so trailing_zeros() is used to map the lowest set bit back to an executor.
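
For instance (a sketch only; the state field and the Result<(), ExecutorId> return type come from the snippet above, and the cast assumes ExecutorId is a plain integer alias, as the indexing elsewhere in the diff suggests):

// Each executor owns one bit in `state`, so a non-zero `state` means the lock
// is currently held. trailing_zeros() maps the lowest set bit back to the
// ExecutorId of the holder, which is returned so the caller knows the blocker.
if self.state != 0 {
    return Err(self.state.trailing_zeros() as ExecutorId);
}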

/// worker threads for execution or simulation.
/// The central transaction scheduler responsible for distributing work to executors.
pub struct TransactionScheduler {
    /// Manages the state of all executors, including locks and blocked transactions.

Let's keep the docs; I think they're quite useful. I would remove only those for accountsdb, shutdown, and maybe latest_block. The doc for transaction_rx is quite useful, imo.

            Ok(()) = block_produced.recv() => {
                self.transition_to_new_slot();
            }
            // A worker has finished its task and is ready for more.

I'd keep those comments

            }
        }
        drop(self.executors);
        self.ready_rx.recv().await;

Need to loop here
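
If the intent is to drain readiness notifications until every executor has shut down, it could look roughly like this (assuming ready_rx is a tokio mpsc receiver whose recv() returns None once all senders are dropped):

drop(self.executors);
// Keep receiving until the channel closes (all executor senders dropped),
// rather than awaiting a single readiness message.
while self.ready_rx.recv().await.is_some() {}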


    /// Handles a new transaction from the global queue.
    fn handle_new_transaction(&mut self, txn: ProcessableTransaction) {
        // SAFETY:

Let's keep the SAFETY comment.

                self.coordinator.release_executor(exec);
                break;
            };
            // Here we check whether the transaction was blocked and re-queued:

I think we need to keep this one, since the edge case isn't trivial.
