Conversation

@arihant2math (Collaborator) commented Dec 25, 2025

Exposes better APIs for collecting and querying tasks across CPUs. Schedules in round-robin order for now; eventually we'd want work stealing from other cores and a quick way to get the current queue length for each core. This also forces timer interrupts and extends them to non-main cores.

Fixes the bug where background tasks force a hang.

Includes improvements to the EEVDF implementation which fix many bugs.

@arihant2math force-pushed the multicore-sched branch 2 times, most recently from b620a40 to a63ce3b, on December 25, 2025 08:39
@arihant2math changed the title from "schedule in round-robin fashion" to "Schedule onto multiple cores in round-robin fashion" Dec 25, 2025
@arihant2math marked this pull request as draft December 25, 2025 09:00
src/sched/mod.rs Outdated
}

// TODO: arbitrary cap.
pub static SCHED_STATES: [SyncUnsafeCell<Option<SchedState>>; 128] =
Owner

I presume you’ve ditched the per_cpu! macro here since it precludes access to state held by other cores?

I’ve been thinking about this for a while and wondering what the best approach is. I can see two options:

  • What you’ve done here: i.e., a static array which allows arbitrary access from any other core. The main downside I see is that everything must be locked, since it can be accessed by another core. I see that you’ve used SyncUnsafeCell; in that case, I think functions such as get_sched_state_by_id and with_cpu_sched_state are inherently unsafe, as they permit lock-free access to SchedState across CPUs.

  • Accessing/modifying CPU state via the CPUMessenger. In this approach, we fire off an IPI to the target CPU, which performs the read/modify locally (lock-free) and returns the result.

I think the second option is a good compromise. Updating or accessing the state of another CPU isn’t done nearly as often as accessing the current CPU’s state. With the first option, we’d need locking for every scheduler state access or update (even something as frequent as retrieving the current task). Given that, my vote would be to try the CPUMessenger approach. I’m happy to give that a go if you’d like.
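
To make the CPUMessenger idea concrete, here is a minimal userspace sketch of the pattern, using std threads and channels to stand in for cores and IPIs; every name here (CpuMessage, the inbox, etc.) is illustrative, not the actual moss API. The key property is that only the owning "CPU" ever touches its run queue, so that access stays lock-free.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

enum CpuMessage {
    // Ask the target "CPU" for its run-queue length; the reply channel
    // plays the role of the IPI response path.
    QueueLen(Sender<usize>),
}

fn main() {
    let (inbox_tx, inbox_rx) = channel::<CpuMessage>();

    // The target CPU: exclusive owner of its run queue, so it can read
    // it lock-free. Message delivery models the IPI.
    let target = thread::spawn(move || {
        let run_queue = vec!["task-a", "task-b"]; // CPU-local state
        for msg in inbox_rx {
            match msg {
                CpuMessage::QueueLen(reply) => {
                    // Only the owning CPU ever inspects run_queue.
                    let _ = reply.send(run_queue.len());
                }
            }
        }
    });

    // The requesting CPU never touches the other CPU's state directly.
    let (reply_tx, reply_rx) = channel();
    inbox_tx.send(CpuMessage::QueueLen(reply_tx)).unwrap();
    println!("remote queue length = {}", reply_rx.recv().unwrap());

    drop(inbox_tx); // close the inbox so the target thread exits
    target.join().unwrap();
}
```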

Owner

For an example of this, I've just pushed this branch, which pushes tasks onto other CPUs via IPIs with the CPUMessenger. When I run usertest, I'm currently getting a deadlock:

bash-5.3# ./bin/usertest 
Running userspace tests ...
[   35.35031303] interrupts::cpu_messenger: Putting task on CPU 1
Testing sync syscall ... OK

I suspect this is the child process finishing on CPU1: the waker sets the task on CPU0 to be runnable again, but CPU0 is stuck in the idle task. I need to implement something such that a Waker, when signalling to another CPU, triggers a reschedule IPI.
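
A minimal sketch of that fix, with hypothetical stand-ins (current_cpu, send_reschedule_ipi, and these Task fields are not the real moss types): the waker checks whether the newly runnable task belongs to another CPU and, if so, kicks that CPU with an IPI.

```rust
#[derive(Clone, Copy, PartialEq)]
struct CpuId(usize);

enum TaskState { Runnable, Sleeping }

struct Task { state: TaskState, cpu: CpuId }

fn current_cpu() -> CpuId { CpuId(0) }    // stub: would read a CPU register
fn send_reschedule_ipi(_target: CpuId) {} // stub: would raise the IPI

fn wake(task: &mut Task) {
    task.state = TaskState::Runnable;
    // If the task lives on another CPU, that CPU may be parked in its
    // idle task; kick it so it reschedules immediately instead of
    // waiting for the next timer tick.
    if task.cpu != current_cpu() {
        send_reschedule_ipi(task.cpu);
    }
}
```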

Collaborator Author

Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge, get_sched_state_by_id et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability visible to other threads is limited to run_queue, which is protected by a SpinLock.

A better approach could be to Arc the runqueue and just share that.

On the topic of using messengers, how fast are they?

Collaborator Author

I think the optimal cross-CPU task balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.

Owner

Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge, get_sched_state_by_id et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability visible to other threads is limited to run_queue, which is protected by a SpinLock.

I disagree. I see a code path where the procfs module can call find_task_by_descriptor. That call can originate from any task on any CPU, which means it will very likely result in cross-CPU data access (i.e. accessing the run_queue of another CPU). That code iterates through all active slots and walks the run_queues until it finds the task.

As far as I can see, this is highly unsafe and will lead to data races. SchedState itself is not protected by a mutex, nor are any of its fields.

If we can guarantee that the only core accessing a given piece of data is the CPU that owns it, then we can safely access it lock-free. That is precisely the guarantee provided by per_cpu!: there is no way to access data that does not belong to the current CPU. In this case, however, that guarantee does not hold. You need a different mechanism here—this is exactly the problem the CPUMessenger is trying to solve.
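
To illustrate the guarantee being described, here is a hedged sketch of a per_cpu!-style accessor; the names are illustrative, not the real libkernel implementation. The soundness argument rests entirely on the API never exposing another core's slot.

```rust
use core::cell::UnsafeCell;

const MAX_CPUS: usize = 128;

pub struct PerCpu<T> {
    slots: [UnsafeCell<T>; MAX_CPUS],
}

// Sound only because `with` never hands out another CPU's slot.
unsafe impl<T: Send> Sync for PerCpu<T> {}

fn current_cpu_id() -> usize { 0 } // stub: a kernel reads this from a register

impl<T> PerCpu<T> {
    /// Lock-free access to the *current* CPU's slot. No API exists to
    /// reach another core's data, so there is no cross-core aliasing
    /// (assuming preemption is disabled while the borrow is live).
    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(unsafe { &mut *self.slots[current_cpu_id()].get() })
    }
}
```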

A better approach could be to Arc the runqueue and just share that.

Possibly. I did consider wrapping all of this state in a large Arc, but that effectively means locking everything. Any on-CPU operation would then require taking a lock, which I think introduces too much overhead for the scheduler fast path.

On the topic of using messengers, how fast are they?

IPIs are fast. You can hit additional latency if interrupts are disabled on the target CPU, but Linux uses them extensively in practice. For example, on my machine:

$ cat /proc/interrupts

[...]
 IWI:     103089     111648      51802      50290    3480843      50846      37640     101369   IRQ work interrupts
 RTR:          0          0          0          0          0          0          0          0   APIC ICR read retries
 RES:      53258      64925      39159      37986     143873      61857      38363      49915   Rescheduling interrupts
 CAL:    1434334    1127025     974039     838055     844695     993721     932218     876858   Function call interrupts
 TLB:     684503     742799     728184     629270     646185     829325     783279     722283   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         66         65         65         65         65         65         65         65   Machine check polls
 ERR:          0

These are the various IPIs Linux uses, and as you can see, they are heavily exercised.

I've just pushed some additional code to the branch I mentioned earlier; this resolves the deadlock that I had and allows me to get through nearly all of usertest without any kind of preemption. I'm just using IPIs to wake up other CPUs when needed.

I think the optimal cross-CPU task balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.

That’s a fair point, and I agree that work-stealing is generally the better default for schedulers.

With messengers, the model shifts slightly compared to shared-memory designs, but it still maps cleanly. The key idea is that ownership never changes implicitly: the stealing CPU never directly inspects or mutates another CPU’s run queue. Instead, it asks the owning CPU to do the inspection and transfer on its behalf. Concretely, the stealing CPU raises a message to a target CPU with the necessary parameters, and the target CPU decides what (if any) tasks it can push back.
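
A hedged sketch of that steal protocol, with all names (CpuMessage::Steal, the handler, etc.) purely illustrative: the owner services the request against its own queue, lock-free, and donates only surplus work.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

struct Task(u32);

enum CpuMessage {
    // "Give me up to `max` tasks"; the reply channel carries them back.
    Steal { max: usize, reply: Sender<Vec<Task>> },
}

// Runs on the owning CPU when the steal message (delivered by IPI in a
// real kernel) is drained from its inbox.
fn handle_message(run_queue: &mut VecDeque<Task>, msg: CpuMessage) {
    match msg {
        CpuMessage::Steal { max, reply } => {
            // Donate surplus work only, keeping at least one task local.
            let surplus = run_queue.len().saturating_sub(1).min(max);
            let stolen = run_queue.split_off(run_queue.len() - surplus);
            let _ = reply.send(stolen.into());
        }
    }
}
```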

Collaborator Author

Alright, I think we can worry about the work-stealing implementation later; I'll switch what I have currently to use messengers.

@arihant2math force-pushed the multicore-sched branch 2 times, most recently from 5c890be to a25bb55, on December 25, 2025 21:04
@arihant2math marked this pull request as ready for review December 25, 2025 21:04
@hexagonal-sun (Owner) commented Dec 25, 2025

I've just tried to run the usertest and it looks like you're getting the same problem as I am on my cpu-messenger-task-migration branch:

Testing rust threads ... OK
[    5.5169154] moss: Kernel panicked at src/arch/arm64/memory/fault.rs:121:17: Page fault defered error, SIGBUS on process

I need to investigate what's going on there.

@arihant2math force-pushed the multicore-sched branch 3 times, most recently from bc729dc to 427aae4, on December 26, 2025 05:52
@hexagonal-sun (Owner) left a comment

Looks good, some minor comments - nothing major. Other than that, happy to merge.

It might be worth creating an issue for per-CPU driver initialisation logic for per-CPU devices (the armv8 timer being the primary one). That would decouple that driver from the gic code.

}
}

// TODO: arbitrary cap.
Owner

Redundant comment now.

sched_state.advance_vclock(now_inst);

let next_task = sched_state.find_next_runnable_task();
// if previous_task.tid != next_task.tid {
Owner

I presume this is useful debug code?

It'd be a shame to remove such code when it's useful for debugging. Perhaps we should make use of log's release_max_level_info feature flags or similar, so that useful code isn't removed.

Collaborator Author

I hope with a proper procfs we can remove this and rely on top instead.

.activate();

*init_task.state.lock_save_irq() = TaskState::Running;
SCHED_STATE.borrow_mut().running_task = Some(idle_task.clone());
Owner

Don't we want init to be the first task executed?

@arihant2math (Collaborator Author)

FYI still debugging some weird crashes with usertest. Also there's a lot of intermittent hanging.

@hexagonal-sun (Owner)

I found 50e5b0b helped a lot with intermittent hanging. Note that you need the prior interrupt-manager commit to help prevent a deadlock there.

@hexagonal-sun (Owner) commented Dec 26, 2025

I’m seeing some lock-ups due to the interrupt manager not being able to handle interrupts reentrantly. That should be fixed if you cherry-pick 6ae2375.

I'm also seeing a lot of lock-ups in the allocator:

#0  spinning_top::spinlock::{impl#1}::lock (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>) at src/spinlock.rs:57
#1  0xffff800000164f94 in lock_api::mutex::Mutex<spinning_top::spinlock::RawSpinlock, linked_list_allocator::Heap>::lock<spinning_top::spinlock::RawSpinlock, linked_list_allocator::Heap> (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>)
    at /home/matthew/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.14/src/mutex.rs:213
#2  0xffff800000166144 in linked_list_allocator::{impl#4}::alloc (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>, layout=...)
    at src/lib.rs:336
#3  0xffff800000151088 in moss::arch::arm64::memory::_::__rust_alloc (size=24, align=8) at src/arch/arm64/memory/mod.rs:21
#4  0xffff800000119dc8 in alloc::alloc::alloc (layout=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:95
#5  alloc::alloc::Global::alloc_impl (self=0x1, layout=..., zeroed=false)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:190
#6  0xffff800000119f04 in alloc::alloc::{impl#1}::allocate (layout=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:251
#7  alloc::alloc::exchange_malloc (size=24, align=8)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:352
#8  0xffff8000000e4790 in alloc::boxed::{impl#0}::new<moss::drivers::interrupts::arm_gic_v3::ArmGicV3InterruptContext> (x=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:265
#9  moss::drivers::interrupts::arm_gic_v3::{impl#5}::read_active_interrupt (self=0xffffb00000000ce0) at src/drivers/interrupts/arm_gic_v3.rs:395
#10 0xffff800000066360 in moss::interrupts::InterruptManager::get_active_handler (self=0xffffb00000000d08) at src/interrupts/mod.rs:152
#11 0xffff800000066140 in moss::interrupts::InterruptManager::handle_interrupt (self=0xffffb00000000d08) at src/interrupts/mod.rs:161
#12 0xffff8000000e0a2c in moss::arch::arm64::exceptions::el1_irq_spx (state=0xffffb8000001a8d0) at src/arch/arm64/exceptions/mod.rs:125
#13 0xffffe000000002f4 in ?? ()

This makes total sense: the spinlock used in the allocator does not disable interrupts. Therefore, when an ISR is running, the heap spinlock may already be held. If an allocation occurs during the ISR, this will deadlock.
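
For reference, the usual fix is the lock_save_irq pattern already used elsewhere in this PR: mask interrupts before taking the lock, and restore them only after the lock has been released. A minimal userspace sketch, with std::sync::Mutex standing in for the spinlock and stubbed IRQ primitives:

```rust
use std::sync::{Mutex, MutexGuard};

fn local_irq_save() -> bool { true } // stub: mask IRQs, return old state
fn local_irq_restore(_was: bool) {}  // stub: restore the saved state

struct RestoreIrq(bool);
impl Drop for RestoreIrq {
    fn drop(&mut self) {
        local_irq_restore(self.0);
    }
}

// Fields drop in declaration order: the lock guard is released first,
// and only then are interrupts re-enabled.
struct IrqGuard<'a, T> {
    guard: MutexGuard<'a, T>,
    _restore: RestoreIrq,
}

fn lock_save_irq<T>(lock: &Mutex<T>) -> IrqGuard<'_, T> {
    // Mask interrupts *before* spinning on the lock, so an ISR can never
    // run on this CPU while the lock is held.
    let saved = local_irq_save();
    IrqGuard { guard: lock.lock().unwrap(), _restore: RestoreIrq(saved) }
}
```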

@hexagonal-sun (Owner) commented Dec 26, 2025

Okay, I've just pushed fixes for the heap allocator and the interrupt manager to this PR, which should help. I now get a successful boot to bash pretty much every time, which means I think we've fixed the interrupt locking issues. I'm now hitting:

[    0.717370] drivers::interrupts::arm_gic_v3: GICv3: CPU interface for core 2 enabled.
[    0.718020] arch::arm64::boot::secondary: CPU 2 online.
[    0.719852] drivers::interrupts::arm_gic_v3: GICv3: Redistributor for core 3 (MPIDR=0x80000003) is awake.
[    0.720558] drivers::interrupts::arm_gic_v3: GICv3: CPU interface for core 3 enabled.
[    0.721818] arch::arm64::boot::secondary: CPU 3 online.
[    0.725946] interrupts::cpu_messenger: CPU 1 recieved ping from CPU 0
bash-5.3# [    1.1196409] moss: Kernel panicked at src/sched/mod.rs:335:13: assertion `left == right` failed
  left: Sleeping
 right: Runnable

This looks like a scheduler-related issue. It was an incorrect assertion check that I added to the sched code; I've pushed a commit to remove it.

@arihant2math (Collaborator Author)

Seems like I can spawn three `cat /dev/zero &`s before it starts to slow down. This is fine for now.

However usertest still doesn't run to completion. I think there might be a bug with respect to cross-cpu task completion.

@arihant2math (Collaborator Author)

Also, interrupts firing while the borrow is held is turning out to be a problem (i.e. insert_task is running and then the timer interrupt strikes). This seems to cause hang-ups a lot.

@hexagonal-sun force-pushed the multicore-sched branch 2 times, most recently from c6b44cb to c52b786, on December 27, 2025 21:13
@hexagonal-sun (Owner)

Fixed a bunch of issues. The thread exit logic was utterly wrong - we were sending a SIGCHLD to the parent for every thread exit; I'm surprised the threading code worked prior to this PR!

Now that we've introduced proper SMP, the fault handling code didn't handle page-fault collisions on two different CPUs: it treated an AlreadyMapped condition as a hard error. See 709845c for details.
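
A hedged sketch of that fix; the error and resolution types below are hypothetical stand-ins, not the actual types in 709845c. The idea is that losing the race to map a page is benign, so the faulting access can simply be retried.

```rust
enum MapError { AlreadyMapped, OutOfMemory }
enum FaultResolution { Retry, Fatal }

fn resolve_fault(map_result: Result<(), MapError>) -> FaultResolution {
    match map_result {
        // We mapped the page ourselves: re-run the faulting access.
        Ok(()) => FaultResolution::Retry,
        // Another CPU faulted on the same page and mapped it between our
        // fault and our mapping attempt: also benign, just retry.
        Err(MapError::AlreadyMapped) => FaultResolution::Retry,
        Err(MapError::OutOfMemory) => FaultResolution::Fatal,
    }
}
```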

I think we're getting closer to this being workable now.

@arihant2math (Collaborator Author)

Still very non-deterministic on my end; usertest does pass sometimes. I suspect there's some sort of race condition somewhere. From a black-box perspective the algorithm seems fine, so I suspect it's something else.

@arihant2math (Collaborator Author) commented Dec 28, 2025

Ah ok, I see that preempts are being scheduled but not executed (sometimes).

@arihant2math (Collaborator Author) commented Dec 28, 2025

Ok, I see the problem: we should disable interrupts when modifying the wait queue, because it might be locked during a timer interrupt.

Or something is just holding the wakeup_q and never dropping it.

Never mind, I see the actual problem: essentially, SYS_TIMER seems to be running on a random CPU, whereas sched_yield yields the current CPU's scheduler. So if the yield was meant for a different CPU, it'll never go through.

Indeed, modifying WakeupEvent::preempt to include the caller's CPU and adding the following patch:

```rust
WakeupKind::Preempt(cpu_id) => {
    if cpu_id == ArchImpl::id() {
        crate::sched::sched_yield();
    } else {
        log::warn!(
            "SysTimer: received preemption event for CPU {}, but running on CPU {}",
            cpu_id,
            ArchImpl::id()
        );
    }
}
```

Shows:

[    2.2638996] drivers::timer: SysTimer: received preemption event for CPU 1, but running on CPU 0

right before a hang.

Ideally we'd have some sort of per-CPU handling (either the wakeup queue being per-CPU or the timer interrupts being handled per-CPU). But for now I'll just send an IPI.

Still running into the timer interrupt not firing (I think it's still being deadlocked somehow).

Ah, it's a cyclic lock acquire (sched_yield -> schedule -> schedule_preempt).

@arihant2math (Collaborator Author)

Ok, fixed the hanging bug with 8a09fe0. Essentially, I've moved to interrupting the idle task every 4 ms.

However, I'm seeing a ~10x speed regression when running usertests. I'd like to resolve that before merging.

@hexagonal-sun (Owner) commented Dec 28, 2025

Do you think we need a commit like 50e5b0b to force other CPUs out of the idle task if a waker changes the state of the task? I realise that the preempt code would eventually do this with a tick interrupt, but perhaps it would reduce latency?

if let Some(task) = SCHED_STATE.borrow().run_queue.get(descriptor) {
return Some(task.clone());
}
// TODO: Ping other CPUs to find the task.
Collaborator Author

This should probably be handled by an IPI in a future PR.

Owner

Agreed, although I've got a patch in the works which makes TASK_LIST a global list of all tasks, not just task states, which should help with this.

Collaborator Author

That works as well.

// After pushing, we must update the hardware timer in case our
// new event is the earliest one.
- if let Some(next_event) = wake_q.peek() {
+ if let Some(next_event) = wakeup_q.peek() {
Collaborator Author

This might stop being accurate now that schedule() is called after this.

Collaborator Author

Although it doesn't matter, because we'll probably remove this in the near future anyhow.

}
}

pub fn schedule_force_preempt() {
Collaborator Author

We should find a way to do without this later.

Owner

Agreed, this feels like a bit of a hack.

@hexagonal-sun (Owner)

I've split out independent fixes into separate PRs, which has reduced the size of this PR. Ideally they need merging before this does. They are #108 #107 #106 #105 #104.

@arihant2math force-pushed the multicore-sched branch 2 times, most recently from 3b5c835 to 2818775, on December 29, 2025 07:50
arihant2math and others added 15 commits December 28, 2025 23:51
Use EEVDF concepts like the virtual deadline correctly: actually calculate the necessary deadline and use it to schedule.

Also dynamically preempt based on the deadline.
# Conflicts:
#	libkernel/src/sync/per_cpu.rs
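
For reference, the virtual deadline this commit refers to comes from the EEVDF paper: a task's deadline is its virtual eligible time plus its request divided by its weight, and the scheduler picks, among eligible tasks, the one with the earliest deadline. A minimal sketch with illustrative names (not the moss implementation; weight is assumed nonzero):

```rust
struct EevdfTask {
    vruntime_eligible: u64, // virtual time at which the task is eligible (e_i)
    slice_ns: u64,          // requested service time (r_i)
    weight: u64,            // share relative to other tasks (w_i), nonzero
}

impl EevdfTask {
    /// d_i = e_i + r_i / w_i: a heavier weight shrinks the deadline, so
    /// higher-share tasks are picked sooner.
    fn virtual_deadline(&self) -> u64 {
        self.vruntime_eligible + self.slice_ns / self.weight
    }
}

/// Pick the eligible task (e_i <= now) with the earliest virtual deadline.
fn pick_next(tasks: &[EevdfTask], now_virtual: u64) -> Option<&EevdfTask> {
    tasks
        .iter()
        .filter(|t| t.vruntime_eligible <= now_virtual)
        .min_by_key(|t| t.virtual_deadline())
}
```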
When switching tasks, we may well be switching away from a task which is
going to `Sleep`. Therefore the check

```rust
 debug_assert_eq!(*prev_task.state.lock_save_irq(), TaskState::Runnable);
```

is incorrect.
This prevents a bug where `sys_exit` calls `exit_group` for the thread's
process, even when there are still active threads.
Currently, a global wakeup queue is used for all CPUs on the system.
This leads to inefficient behavior regarding preemption. When the
scheduler requests a preemption event, it is inserted into a global list
alongside events from all other CPUs.

When processing IRQs, there is no guarantee which CPU will handle the
timer interrupt. If the current CPU processes a preemption event
intended for a different CPU, it must signal the target CPU via an IPI.
This causes a severe bottleneck, as one CPU may end up distributing
preemption events for the entire system.

Fix this by implementing a per-CPU wakeup queue. Preemption events are
now strictly scheduled for the current CPU, ensuring they are handled
locally by the core that scheduled them. This significantly simplifies
the preemption logic and eliminates the need for IPIs to signal
preemption events.
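
A minimal sketch of the shape this commit describes, with illustrative names: each core owns a private min-heap of timer events keyed by expiry, so preemption events are armed and handled entirely on the local core.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One of these lives on each CPU; only the owning core ever touches it,
/// so no cross-CPU locking or IPIs are needed for preemption events.
struct CpuLocalWakeupQueue {
    // Min-heap keyed on expiry time: peek() yields the next deadline to
    // program into this core's local timer.
    events: BinaryHeap<Reverse<(u64, usize)>>, // (expires_ns, event id)
}

impl CpuLocalWakeupQueue {
    fn push(&mut self, expires_ns: u64, id: usize) {
        self.events.push(Reverse((expires_ns, id)));
        // A real kernel would reprogram the local timer here if the new
        // event is now the earliest.
    }

    fn next_deadline(&self) -> Option<u64> {
        self.events.peek().map(|Reverse((t, _))| *t)
    }
}
```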
Fix formatting in the sched module to keep CI happy.
@arihant2math (Collaborator Author) commented Dec 29, 2025

I think the sub-PRs all look reasonable; I've merged them all. I also cleaned up my commit logs a little bit.
Merging this after your review, assuming there are no issues, sounds good to me.

@hexagonal-sun (Owner) left a comment

One minor comment, but happy to merge. I don't fully understand the EEVDF scheduling aspects, but it seems to work fine. I really should set some time aside and study the paper.

Use the new `CpuId` type for the `last_cpu` field in `Task`.
@hexagonal-sun merged commit 0258645 into hexagonal-sun:master Dec 29, 2025
1 check passed
@arihant2math deleted the multicore-sched branch December 29, 2025 19:56