Schedule onto multiple cores in round-robin fashion #91
Conversation
Force-pushed from b620a40 to a63ce3b.
src/sched/mod.rs (outdated)
```rust
// TODO: arbitrary cap.
pub static SCHED_STATES: [SyncUnsafeCell<Option<SchedState>>; 128] =
```
I presume you’ve ditched the per_cpu! macro here since it precludes access to state held by other cores?
I’ve been thinking about this for a while and wondering what the best approach is. I can see two options:
- What you've done here: i.e., a static array which allows arbitrary access from any other core. The main downside I see is that everything must be locked, since it can be accessed by another core. I see that you've used `SyncUnsafeCell`; in that case, I think functions such as `get_sched_state_by_id` and `with_cpu_sched_state` are inherently unsafe, as they permit lock-free access to `SchedState` across CPUs.
- Accessing/modifying CPU state via the `CPUMessenger`. In this approach, we fire off an IPI to the target CPU, which performs the read/modify locally (lock-free) and returns the result.
I think the second option is a good compromise. Updating or accessing the state of another CPU isn’t done nearly as often as accessing the current CPU’s state. With the first option, we’d need locking for every scheduler state access or update (even something as frequent as retrieving the current task). Given that, my vote would be to try the CPUMessenger approach. I’m happy to give that a go if you’d like.
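To make that concrete, here is a rough sketch of what an IPI-mediated accessor could look like. Everything here is illustrative: `CpuMessage`, `handle_message`, and the mpsc channel used as a reply path are stand-ins, not the actual CPUMessenger API in this repo (where the reply would travel back via another IPI).

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

#[allow(dead_code)]
struct Task { tid: u64 }
struct SchedState { run_queue: VecDeque<Task> }

// Hypothetical messages a CPU can receive from its peers.
enum CpuMessage {
    // Ask the target CPU to push a task onto its own run queue.
    PushTask(Task),
    // Ask the target CPU to report its run-queue length.
    QueueLen(Sender<usize>),
}

// Runs on the *target* CPU, e.g. from its IPI handler. Only the owning
// CPU ever touches `local`, so these accesses need no lock.
fn handle_message(local: &mut SchedState, msg: CpuMessage) {
    match msg {
        CpuMessage::PushTask(task) => local.run_queue.push_back(task),
        CpuMessage::QueueLen(reply) => {
            let _ = reply.send(local.run_queue.len());
        }
    }
}
```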
For an example of this, I've just pushed this branch which pushes tasks onto other CPUs via IPIs with the CPUMessenger. When I run usertest, I'm getting a deadlock at the moment:

```
bash-5.3# ./bin/usertest
Running userspace tests ...
[ 35.35031303] interrupts::cpu_messenger: Putting task on CPU 1
Testing sync syscall ... OK
```

I suspect this is the child process finishing on CPU1, the waker setting the task on CPU0 back to runnable, but CPU0 being stuck in the idle task. I need to implement something such that a Waker, when signalling to another CPU, causes a reschedule IPI message.
Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge `get_sched_state_by_id` et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability exposed to other CPUs is limited to `run_queue`, which is protected by a `SpinLock`.
A better approach could be to Arc the runqueue and just share that.
On the topic of using messengers, how fast are they?
I think the optimal cross-CPU task-balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.
> Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge `get_sched_state_by_id` et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability exposed to other CPUs is limited to `run_queue`, which is protected by a `SpinLock`.
I disagree. I see a code path where the procfs module can call `find_task_by_descriptor`. That call can originate from any task on any CPU, which means it will very likely result in cross-CPU data access (i.e. accessing the `run_queue` of another CPU). That code iterates through all active slots and walks the run queues until it finds the task.
As far as I can see, this is highly unsafe and will lead to data races. `SchedState` itself is not protected by a mutex, nor are any of its fields.
If we can guarantee that the only core accessing a given piece of data is the CPU that owns it, then we can safely access it lock-free. That is precisely the guarantee provided by `per_cpu!`: there is no way to access data that does not belong to the current CPU. In this case, however, that guarantee does not hold. You need a different mechanism here; this is exactly the problem the `CPUMessenger` is trying to solve.
> A better approach could be to Arc the runqueue and just share that.
Possibly. I did consider wrapping all of this state in a large Arc, but that effectively means locking everything. Any on-CPU operation would then require taking a lock, which I think introduces too much overhead for the scheduler fast path.
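To illustrate the trade-off, a minimal sketch follows (using `std::sync::Mutex` as a stand-in for the kernel `SpinLock`; the types and names are made up for the example). With a shared, Arc-wrapped queue, every access pays for the lock; with strictly per-CPU state, the owning CPU schedules lock-free and cross-CPU requests go through messages.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

#[allow(dead_code)]
struct Task { tid: u64 }

// Shape 1: globally shared state. Every access, even the owning CPU's
// scheduler fast path, must take the lock.
type SharedQueue = Arc<Mutex<VecDeque<Task>>>;

fn pick_next_shared(q: &SharedQueue) -> Option<Task> {
    q.lock().unwrap().pop_front() // lock taken on every schedule()
}

// Shape 2: per-CPU state. The owning CPU is the only accessor, so the
// fast path needs no lock; cross-CPU requests go via messages/IPIs.
struct PerCpuSched { run_queue: VecDeque<Task> }

fn pick_next_local(state: &mut PerCpuSched) -> Option<Task> {
    state.run_queue.pop_front() // exclusive, CPU-local access
}
```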
> On the topic of using messengers, how fast are they?
IPIs are fast. You can hit additional latency if interrupts are disabled on the target CPU, but Linux uses them extensively in practice. For example, on my machine:
```
$ cat /proc/interrupts
[...]
IWI:  103089  111648   51802   50290 3480843   50846   37640  101369   IRQ work interrupts
RTR:       0       0       0       0       0       0       0       0   APIC ICR read retries
RES:   53258   64925   39159   37986  143873   61857   38363   49915   Rescheduling interrupts
CAL: 1434334 1127025  974039  838055  844695  993721  932218  876858   Function call interrupts
TLB:  684503  742799  728184  629270  646185  829325  783279  722283   TLB shootdowns
TRM:       0       0       0       0       0       0       0       0   Thermal event interrupts
THR:       0       0       0       0       0       0       0       0   Threshold APIC interrupts
DFR:       0       0       0       0       0       0       0       0   Deferred Error APIC interrupts
MCE:       0       0       0       0       0       0       0       0   Machine check exceptions
MCP:      66      65      65      65      65      65      65      65   Machine check polls
ERR: 0
```
These are the various IPIs Linux uses, and as you can see, they are heavily exercised.
I've just pushed some additional code to the branch I mentioned earlier; this resolves the deadlock that I had and allows me to get through nearly all of usertest without any kind of preemption. I'm just using IPIs to wake up other CPUs when needed.
> I think the optimal cross-CPU task-balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.
That’s a fair point, and I agree that work-stealing is generally the better default for schedulers.
With messengers, the model shifts slightly compared to shared-memory designs, but it still maps cleanly. The key idea is that ownership never changes implicitly: the stealing CPU never directly inspects or mutates another CPU’s run queue. Instead, it asks the owning CPU to do the inspection and transfer on its behalf. Concretely, the stealing CPU raises a message to a target CPU with the necessary parameters, and the target CPU decides what (if any) tasks it can push back.
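As a sketch of that shape (all names hypothetical, with an mpsc channel standing in for the return message): the idle CPU sends a steal request, and the owning CPU alone inspects its queue and decides what to hand over.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

#[allow(dead_code)]
struct Task { tid: u64 }

// Hypothetical steal request: "give me up to `max` tasks, reply here".
struct StealRequest {
    max: usize,
    reply: Sender<Vec<Task>>,
}

// Runs on the *owning* CPU in response to the message. It alone touches
// its run queue, so no cross-CPU locking is involved; it simply decides
// how much work it is willing to give away.
fn handle_steal(local_queue: &mut VecDeque<Task>, req: StealRequest) {
    // Keep at least one task for ourselves and donate from the back.
    let spare = local_queue.len().saturating_sub(1).min(req.max);
    let stolen: Vec<Task> = (0..spare).filter_map(|_| local_queue.pop_back()).collect();
    let _ = req.reply.send(stolen);
}
```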
Alright, I think we can worry about the work-stealing implementation later; I'll switch what I have currently to use messengers.
Force-pushed from 5c890be to a25bb55.
I've just tried to run the usertest and it looks like you're getting the same problem as I am on my branch. I need to investigate what's going on there.
Force-pushed from bc729dc to 427aae4.
hexagonal-sun left a comment
Looks good, some minor comments - nothing major. Other than that, happy to merge.
It might be worth creating an issue for per-CPU driver initialisation logic for per-CPU devices (the armv8 timer being the primary one). That would decouple that driver from the GIC code.
```rust
    }
}

// TODO: arbitrary cap.
```
Redundant comment now.
```rust
sched_state.advance_vclock(now_inst);

let next_task = sched_state.find_next_runnable_task();
// if previous_task.tid != next_task.tid {
```
I presume this is useful debug code?
It'd be a shame to remove such code when it's useful for debugging. Perhaps we should make use of log's `release_max_level_info` feature flag or similar, so that useful code isn't removed.
I hope with a proper procfs we can remove this and rely on top instead.
```rust
    .activate();

*init_task.state.lock_save_irq() = TaskState::Running;
SCHED_STATE.borrow_mut().running_task = Some(idle_task.clone());
```
Don't we want init to be the first task executed?
FYI still debugging some weird crashes with usertest. Also there's a lot of intermittent hanging.
I found 50e5b0b helped a lot with the intermittent hanging. Note that you also need the prior commit to the interrupt manager, which helps prevent a deadlock there.
I'm seeing some lock-ups due to the interrupt manager not being able to handle interrupts reentrantly. That should be fixed if you cherry-pick 6ae2375. I'm also seeing a lot of lock-ups in the allocator. This makes total sense: the spinlock used in the allocator does not disable interrupts. Therefore, when an ISR runs, the heap spinlock may already be held; if an allocation occurs during the ISR, this will deadlock.
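For illustration, here is a minimal sketch of the usual fix: take the lock only with local interrupts masked, so an ISR on the same CPU can never re-enter the critical section. The arch hooks and the closure-based API are made up for the example; the kernel's own `lock_save_irq` guard is the real mechanism.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins: a real kernel would mask/unmask local interrupts here.
fn local_irq_save() -> bool { true }
fn local_irq_restore(_was_enabled: bool) {}

pub struct SpinLockIrq {
    locked: AtomicBool,
}

impl SpinLockIrq {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    // Mask interrupts *before* spinning. With a plain spinlock, an ISR
    // that allocates while the current CPU already holds the heap lock
    // spins forever (the holder can never run to release it). Masking
    // IRQs for the critical section makes that re-entry impossible.
    pub fn lock_save_irq<R>(&self, f: impl FnOnce() -> R) -> R {
        let irq_state = local_irq_save();
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        let result = f();
        self.locked.store(false, Ordering::Release);
        local_irq_restore(irq_state);
        result
    }
}
```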
Force-pushed from 70ea99c to c9d0ef4.
Okay, I've just pushed fixes for the heap allocator and the interrupt manager to this PR, which should help. I now get a successful boot to bash pretty much every time, which means I think we've fixed the interrupt locking issues. I'm now hitting:
Seems like I can spawn 3 […]. However usertest still doesn't run to completion. I think there might be a bug with respect to cross-CPU task completion.
Also, interrupts firing at times when the borrow is already acquired is turning out to be a problem (i.e. […]).
Force-pushed from c6b44cb to c52b786.
Fixed a bunch of issues. The thread exit logic was utterly wrong - we were sending a SIGCHLD to the parent for every thread exit; I'm surprised the threading code worked prior to this PR! Now that we've introduced proper SMP, the fault handling code didn't handle page-fault collisions on two different CPUs. It treated an […]. I think we're getting closer to this being workable now.
Still very non-deterministic on my end. Usertest does pass sometimes. I suspect there might be some sort of race condition somewhere. From a black-box perspective it seems the algorithm is fine, so I suspect it's something else.
Ah ok, I see that preempts are being scheduled but not executed (sometimes).
Or something is just holding the wakeup_q and never dropping it.

Nevermind, I see the problem. Indeed, modifying the handler to:

```rust
WakeupKind::Preempt(cpu_id) => {
    if cpu_id == ArchImpl::id() {
        crate::sched::sched_yield();
    } else {
        log::warn!(
            "SysTimer: received preemption event for CPU {}, but running on CPU {}",
            cpu_id,
            ArchImpl::id()
        );
    }
}
```

shows the warning right before a hang.

Ideally we'd have some sort of per-CPU handling (either the wakeup queue being per-CPU or the timer interrupts being handled per-CPU). But for now I'll just send an IPI.

Still running into the timer interrupt not firing (I think it's still being deadlocked somehow). Ah, it's a cyclic lock acquire (sched_yield -> schedule -> schedule_preempt).
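For reference, the "just send an IPI" variant would look something like the sketch below. `send_reschedule_ipi`, `current_cpu`, and `local_resched` are made-up stand-ins for whatever the CPUMessenger/arch layer actually exposes.

```rust
enum WakeupKind {
    Preempt(usize),
}

fn current_cpu() -> usize { 0 }           // stand-in for ArchImpl::id()
fn local_resched() {}                     // stand-in for sched_yield()
fn send_reschedule_ipi(_target: usize) {} // stand-in for the IPI send

fn handle_wakeup(kind: WakeupKind) {
    match kind {
        // Preemption event aimed at us: reschedule locally.
        WakeupKind::Preempt(cpu_id) if cpu_id == current_cpu() => local_resched(),
        // Aimed at another CPU: kick it with an IPI so it reschedules
        // itself, instead of just logging a warning and dropping the event.
        WakeupKind::Preempt(cpu_id) => send_reschedule_ipi(cpu_id),
    }
}
```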
Force-pushed from 9697eea to e1dadca.
Ok, fixed the hanging bug with 8a09fe0. Essentially I moved to interrupting the idle task every 4 ms. However, I'm seeing a speed regression of ~10x when running usertests. I'd probably like to resolve that before merging.
Do you think we need a commit like 50e5b0b to force other CPUs out of the idle task if a waker changes the state of a task? I realise that the preempt code would eventually do this with a tick interrupt, but perhaps it may reduce latency?
```rust
if let Some(task) = SCHED_STATE.borrow().run_queue.get(descriptor) {
    return Some(task.clone());
}
// TODO: Ping other CPUs to find the task.
```
Should probably be handled by an IPI in a future PR
Agreed, although I've got a patch in the works which makes `TASK_LIST` a global list of all tasks, not just task states, which should help with this.
That works as well.
```diff
  // After pushing, we must update the hardware timer in case our
  // new event is the earliest one.
- if let Some(next_event) = wake_q.peek() {
+ if let Some(next_event) = wakeup_q.peek() {
```
This might stop being accurate now that schedule() is called after this.
Although it doesn't matter because we'll probably remove this in the near-future anyhow.
```rust
    }
}

pub fn schedule_force_preempt() {
```
We should find a way to do without this later.
Agreed, this feels like a bit of a hack.
Force-pushed from e268746 to f6d4feb.
Force-pushed from f6d4feb to 70f0753.
Force-pushed from 3b5c835 to 2818775.
Use EEVDF concepts like the virtual deadline correctly: actually calculate the necessary deadline and use it to schedule. Also dynamically preempt based on the deadline.
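For anyone reading along, a rough sketch of the virtual-deadline idea (this glosses over EEVDF's lag/eligibility bookkeeping and uses made-up field names, so it is not the implementation in this PR): each task's deadline is its virtual runtime plus a slice scaled inversely by weight, and the scheduler picks the eligible task with the earliest deadline.

```rust
// Simplified EEVDF-style bookkeeping; field names and the base slice
// are assumptions for this sketch, not this repo's SchedState.
struct EevdfTask {
    vruntime: u64,         // virtual time consumed so far
    virtual_deadline: u64, // virtual time by which its request should finish
    weight: u64,           // larger weight => earlier deadlines per unit of service
}

const BASE_SLICE_NS: u64 = 3_000_000; // assumed base request length

impl EevdfTask {
    // deadline = eligible virtual time + request length / weight
    fn update_deadline(&mut self) {
        self.virtual_deadline = self.vruntime + BASE_SLICE_NS / self.weight.max(1);
    }
}

// Pick the eligible task (roughly: not ahead of the queue's virtual
// clock) with the earliest virtual deadline.
fn pick_next(tasks: &[EevdfTask], virtual_now: u64) -> Option<&EevdfTask> {
    tasks
        .iter()
        .filter(|t| t.vruntime <= virtual_now)
        .min_by_key(|t| t.virtual_deadline)
}
```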
# Conflicts:
#   libkernel/src/sync/per_cpu.rs
When switching tasks, we may well be switching away from a task which is going to `Sleep`. Therefore the check

```rust
debug_assert_eq!(*prev_task.state.lock_save_irq(), TaskState::Runnable);
```

is incorrect.
This prevents a bug where `sys_exit` calls `exit_group` for the thread's process, even when there are still active threads.
Currently, a global wakeup queue is used for all CPUs on the system. This leads to inefficient behavior regarding preemption. When the scheduler requests a preemption event, it is inserted into a global list alongside events from all other CPUs. When processing IRQs, there is no guarantee which CPU will handle the timer interrupt. If the current CPU processes a preemption event intended for a different CPU, it must signal the target CPU via an IPI. This causes a severe bottleneck, as one CPU may end up distributing preemption events for the entire system. Fix this by implementing a per-cpu wakeup queue. Preemption events are now strictly scheduled for the current CPU, ensuring they are handled locally by the core that scheduled them. This significantly simplifies the preemption logic and eliminates the need for IPIs to signal preemption events.
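A minimal sketch of the per-CPU wakeup-queue shape described above (a plain array indexed by CPU id stands in for the kernel's per_cpu! storage, and `MAX_CPUS` is an assumed cap):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const MAX_CPUS: usize = 4; // assumed cap for the sketch

// One wakeup/preemption queue per CPU, ordered earliest-expiry-first
// via Reverse. Each CPU only ever touches its own slot, so no
// cross-CPU locking or forwarding IPIs are needed.
struct WakeupQueues {
    per_cpu: [BinaryHeap<Reverse<u64>>; MAX_CPUS],
}

impl WakeupQueues {
    fn new() -> Self {
        Self { per_cpu: std::array::from_fn(|_| BinaryHeap::new()) }
    }

    // Called by the scheduler on `cpu` to arm its own preemption event.
    fn schedule_preempt(&mut self, cpu: usize, expires_at_ns: u64) {
        self.per_cpu[cpu].push(Reverse(expires_at_ns));
    }

    // Called from `cpu`'s timer interrupt: pop every event that is due.
    fn pop_due(&mut self, cpu: usize, now_ns: u64) -> Vec<u64> {
        let q = &mut self.per_cpu[cpu];
        let mut due = Vec::new();
        while q.peek().map_or(false, |e| e.0 <= now_ns) {
            due.push(q.pop().unwrap().0);
        }
        due
    }
}
```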
Fix formatting in the sched module to keep CI happy.
Force-pushed from 2818775 to bbde9f0.
I think the sub-PRs all look reasonable; I've merged them all. I also cleaned up my commit logs a little bit.
hexagonal-sun left a comment
One minor comment, but happy to merge. I don't fully understand the EEVDF scheduling aspects but it seems to work fine. I should set some time aside and study the paper really.
Use the new `CpuId` type for the `last_cpu` field in `Task`.
Exposes better APIs for collecting and querying tasks across CPUs. Schedules in round-robin order for now. Eventually we'd want work stealing from other cores and a quick way to get the current queue lengths for each core. Also, this forces timer interrupts and extends them to non-main cores.
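As a sketch of the round-robin selection itself (illustrative only; the counter name and `num_cpus` parameter are not from this PR): a wrapping atomic counter picks the next target core for each newly spawned task.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Round-robin target selection for newly spawned tasks: a simple
// atomic counter, wrapped modulo the number of online CPUs.
static NEXT_CPU: AtomicUsize = AtomicUsize::new(0);

fn pick_cpu_round_robin(num_cpus: usize) -> usize {
    NEXT_CPU.fetch_add(1, Ordering::Relaxed) % num_cpus
}

fn main() {
    // With 4 CPUs, successive spawns land on 0, 1, 2, 3, 0, 1, ...
    for _ in 0..6 {
        println!("spawning on CPU {}", pick_cpu_round_robin(4));
    }
}
```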
Currently fixes the bug where background tasks force a hang.
Includes improvements to the EEVDF implementation which fix many bugs.