scx_mitosis: Initial support for non-cpuset cell assignment#3270

Merged
dschatzberg merged 13 commits into sched-ext:main from dschatzberg:mitosis-overcommit
Mar 4, 2026
Conversation

@dschatzberg (Contributor)

Currently, Mitosis creates cells from BPF purely based on cpusets. We want to make Mitosis capable of creating cells without strictly CPU-pinning cgroups. This PR implements an initial naive version that:

  1. Adds a new CLI flag, --cell-parent-cgroup: when specified, direct children of this cgroup path automatically become separate cells, rather than cells being created only from cpuset cgroups.
  2. Creates a userspace cell manager (cell_manager.rs) which observes creation/destruction of cgroups and triggers reconfiguration of BPF state (as opposed to reconfiguration being triggered from BPF, as is done today).

Future work will extend this to make cell <=> cpu assignment topology aware, utilization aware, etc., and to enable allocation of CPUs through configuration.
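As a rough illustration of the child-cgroup-to-cell rule described above (the names and types here are hypothetical, not the PR's actual API), each direct child of the parent cgroup gets its own cell, while everything else stays in the root cell:

```rust
use std::collections::BTreeMap;

// Hypothetical sketch: map each direct child of the configured parent
// cgroup to a fresh cell id, starting after the reserved root cell (id 0).
const ROOT_CELL_ID: u32 = 0;

fn assign_cells(child_cgroups: &[&str]) -> BTreeMap<String, u32> {
    let mut cells = BTreeMap::new();
    let mut next_id = ROOT_CELL_ID + 1;
    for child in child_cgroups {
        cells.insert((*child).to_string(), next_id);
        next_id += 1;
    }
    cells
}
```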

@tommy-u (Contributor) previously requested changes Feb 6, 2026

I think this is a huge step toward a scheduler that's more understandable and easier to modify, fix, and extend. It's got me excited about how to design across the kernel boundary.

High level things for me are terminology and opportunities to shift code out of apply_cell_config.

  1. Terminology. Right now, terminology is a bit messy. We have "Cell 0" or the "root cell." I saw "all_cell_ids" which was only the workload cell ids (I think this went away in a later commit).

This is a little tricky because the terms we need are doing double duty between 2 views:

  • cgroups - hierarchical
  • cells - flat

So how about we have a "system" cgroup(s)/cell and multiple "workload" cgroups/cells. Not married to these words, but I'd really like to drop "Cell 0" (implementation specific) and "root" which fits for the cgroup view, but is misleading in the cell view.

  2. I really like what you're cooking in apply_cell_config(). Why not take it further, though? It seems like almost all the data you're accessing and processing in there is available for userspace to precompute and hand over for apply_cell_config to apply trivially.

I think user code is >10x easier to change and maintain than BPF, so it would be way easier to test new ideas. Making stuff up here, but e.g. if we didn't hardcode assigning child cgroups to the same cell as the workload owner, someone could play with extracting child cgroups into their own cells, maybe there's an application with virtualization... idk. Policy <-> Mechanism

let mut cell_cpus: HashMap<u32, Cpumask> = HashMap::new();

// All CPUs are distributed among cell 0 + all cells
let all_cell_ids: Vec<u32> = self.cells.values().map(|c| c.cell_id).collect();
@tommy-u (Contributor) Feb 5, 2026

If I'm getting this right, all_cell_ids means "all cell ids except the root cell." That's definitely too confusing for me. How about all_workload_cells?

Contributor (Author)

Yeah, it might be worth some additional logic just to create cell 0. I get you on the terminology point, but I don't love the workload/system distinction (nothing seems to indicate to me that we wouldn't want multiple "system" cells). I could maybe call it the default cell? But for whatever reason, root seems most natural to me. I acknowledge that it's coupled with the cgroup hierarchy, but that's also how I think about it?

Contributor

I don't mean to bikeshed; anything is better than cell 0. I can live with "root" and something to mean non-root. Let's chat offline :)


// Verify all cells have at least one CPU assigned
// Cell 0 always exists
if !cell_cpus.contains_key(&0) || cell_cpus.get(&0).map_or(true, |m| m.weight() == 0) {
@tommy-u (Contributor) Feb 5, 2026

How about a constant to keep things self-documenting:
const SYSTEM_CELL_ID: u32 = 0;

///
/// Returns a Vec of (cell_id, Cpumask), or an error if any cell would
/// receive zero CPUs (which indicates too many cells for available CPUs).
pub fn compute_cpu_assignments(&self) -> Result<Vec<(u32, Cpumask)>> {
Contributor

Love this approach!

Contributor

Personal preference, but as this fn grows longer, it would be easier for me to understand and modify if it were decomposed into separate functions.

Contributor (Author)

> Personal preference, but as this fn grows longer, it would be easier for me to understand and modify if it were decomposed into separate functions.

Yeah, I think this is just my personal preference that I like the linearity of a single function (with comments to make it easier to follow) rather than breaking it up into multiple functions that requires jumping around to understand the entire logic.

Comment thread scheds/rust/scx_mitosis/src/main.rs Outdated

while !shutdown.load(Ordering::Relaxed) && !uei_exited!(&self.skel, uei) {
let mut events = [EpollEvent::empty(); 1];
let timeout_ms = self.monitor_interval.as_millis() as u16;
Contributor

I might be reading this wrong, but would putting a monitor_interval of greater than 65 seconds cause this to roll over?
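For reference, one way to avoid the rollover being asked about (a sketch, not the PR's code) is to saturate instead of truncating with `as`, so intervals past u16::MAX milliseconds clamp rather than wrap:

```rust
use std::time::Duration;

// Sketch: clamp the epoll timeout instead of truncating. A u16 timeout in
// milliseconds rolls over past ~65.5 s, so `as u16` on a larger interval
// would silently wrap; saturating at u16::MAX avoids that.
fn epoll_timeout_ms(interval: Duration) -> u16 {
    u16::try_from(interval.as_millis()).unwrap_or(u16::MAX)
}
```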

}
}

// Phase 3: Distribute contested CPUs proportionally among claimants
Contributor

Split the baby. Very King Solomon.


/* Cell cpumask data for a single cell */
struct cell_cpumask_data {
unsigned char mask[MAX_CPUS_U8];
Contributor

Opportunity to use arenas?

* 5. Bump applied_configuration_seq to signal completion
*
* Note: This is not atomic - tasks may observe intermediate states during
* execution. On error, the scheduler may be left in a partially-configured
Contributor

This sounds bad, but probably fine in practice. Can you say another word about why we should feel chill about it?

Contributor (Author)

Are you asking about the error case or the non-atomic aspect?

Contributor

Sorry for the ambiguity in highlighting two ideas here.

My question was about the non-atomic aspect here.

I think there is an implicit statement that might help a reader, something like: "The worst case is tasks transiently running on stale CPUs while applying cell config runs."

struct cgroup *cur_cgrp, *root_cgrp_ref;
u32 i, cell_id;
int ret = 0;

Contributor

Personal preference, but I struggle to hold this many ideas in my head at a time. Maybe can procedurally decompose like on the userspace side.

Contributor (Author)

I'm curious - how do multiple functions differ from phases with block comments for you? I prefer the latter because I read the code linearly rather than needing to jump around.

Contributor

Skill issue on my part, probably. But here's my case for it not being arbitrary.

The one-big-function approach is like a cookbook that lists every ingredient for every recipe upfront, and then you read each recipe with no idea what you actually need until you're done.

One big function:

  • 20+ local variables, any of which could be used at any point
  • You can't mentally set any of them down until you've read everything

Small functions:

  • Natural summary of steps with none of the implementation detail
  • Explicit data flow (prepared by one, consumed by another)
  • The compiler complains if you pass unused variables, which helps minimize scope

Keep the summary open in one window to keep the big picture, and jump around to each fn in the other.

* exits, causing the scheduler to be unloaded.
*/
SEC("syscall")
int apply_cell_config(void *ctx)
Contributor

I really like this overall. It feels like we're moving way closer to a simpler world driven by userspace.

But can't we still move most of this function to userspace? We're almost entirely looking up and processing data that is available in userspace. Why don't we precompute all that and let this function be like 4 trivial loops that write it into the essential BPF data structures?

Also, possible opportunity to pass data via ctx?

Contributor (Author)

What parts are you suggesting could be moved to userspace?

All the cgroup lookups and walking the cgroup hierarchy would be heinously expensive in userspace. While this isn't a particularly perf sensitive path, it's pretty egregious to have to do so many filesystem syscalls compared to the in-kernel iteration.

The rest is really just atomically publishing the cpumasks - I'm not sure how that can be done from userspace.

I guess I could bump the configuration_seq from userspace but it seems cleaner to keep the parts that actually synchronize in the same code base.

Contributor

I think Phase 3 is the best candidate. I tried to sketch an example in my overall note where we might be able to remove a hard-coded policy assumption that child cgroups should be mapped to the same cell as the owning ancestor.

I was thinking of this as something that only happens on the ~hours time scale. And that you'd only have to walk cgroups in the workload path, because all the rest map to the root cell, which would be pretty cheap?


cgrp = bpf_cgroup_from_id(cgid);
if (!cgrp)
return -ENOENT;
Contributor

scx_bpf_error here?

Comment thread scheds/rust/scx_mitosis/Cargo.toml Outdated
lazy_static = "1.5.0"
libbpf-rs = "=0.26.0-beta.1"
libc = "0.2.175"
inotify = "0.10"
Contributor

layered uses 0.11, easier using that imo.

Ok(n) => {
for event in &events[..n] {
match event.data() {
INOTIFY_TOKEN => {
Contributor

IIUC, to make things simpler/less fragile, we're doing a kinda-sorta figure out the state of the world/reconcile thing here vs something more granular/delta based.

If so, I think there should be some debounce here for when things thrash/to avoid reconcile amplifying that.

Contributor (Author)

Can you expand on what your concern is regarding thrashing?

Contributor

So, we use inotify to trigger a full "figure out where things should run/what CPUs they should get".

If we have some container that keeps getting created/destroyed (say, something crashlooping and getting destroyed/created over and over), we could keep re-figuring out where things should run with different outcomes (i.e. split 4 ways, split 3 ways).

I think this issue applies even without a crashlooping container, but it'd be tolerable given spacing between create/destroy (i.e. if you care about perf, you aren't doing that already). I think the crashloop case would be the particularly bad one, because we'd have relentless migrations and task spraying, IIUC.

Contributor (Author)

@likewhatevs - sorry I missed this response. I think it's a valid concern but I don't see this "thrashing" as being so impactful to actual workloads since we'll continue to have preference for prev CPU and other idle CPU detection.

But I'm also not sure what the alternative would be - if a cell leaves the machine or joins the machine, we need to react to that.

@tommy-u dismissed their stale review February 10, 2026 02:14

We covered the points I wanted to hit. All happy

@likewhatevs (Contributor) left a comment

This looks good, but I think we need some confirmation that my understanding of thrash risk is wrong, or some mitigation of it (I think something like a 10s debounce might get the job done quick and dirty, maybe).
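A minimal cooldown-style debounce along the lines suggested here could look like this (a sketch under assumed names, not the PR's code): coalesce bursts of inotify events so a crash-looping cgroup does not trigger a full reconfiguration on every create/destroy.

```rust
use std::time::{Duration, Instant};

// Hypothetical debouncer: allow at most one reconfiguration per cooldown
// window; events arriving inside the window are suppressed.
struct Debouncer {
    cooldown: Duration,
    last_run: Option<Instant>,
}

impl Debouncer {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_run: None }
    }

    /// Returns true if enough time has passed to act on the pending events.
    fn should_run(&mut self, now: Instant) -> bool {
        match self.last_run {
            Some(t) if now.duration_since(t) < self.cooldown => false,
            _ => {
                self.last_run = Some(now);
                true
            }
        }
    }
}
```

A refinement would be to remember that events were suppressed and schedule one trailing reconciliation when the window expires, so the final state is always converged.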

@dschatzberg force-pushed the mitosis-overcommit branch 2 times, most recently from 25f2067 to 0ab4268 on February 19, 2026
rrnewton added a commit that referenced this pull request Feb 21, 2026
…ests

Generalize cgroup lookup kfuncs so all schedulers benefit, and add 6 new
mitosis integration tests modeling scenarios from PR #3270's
test_cell_isolation.sh.

Infrastructure changes:
- Wire up scx_bpf_task_cgroup, bpf_cgroup_from_id, bpf_cgroup_ancestor
  in kfuncs.rs to call the CgroupRegistry instead of returning NULL.
- Add sim_get_root_cgroup FFI declaration for root cgroup fallback.
- Remove 64 lines of redundant per-scheduler cgroup overrides from
  mitosis wrapper.c (now handled by general kfuncs).
- Add CgroupCpusetChangeEvent for runtime cpuset modifications with
  engine support (update_cpuset + cgroup_init notification).

New tests:
- test_mitosis_basic_cell_isolation: N cgroups, non-overlapping cpusets
- test_mitosis_dynamic_cell_lifecycle: runtime create/destroy/migrate
- test_mitosis_cpu_borrowing_select_cpu: rapid sleep/wake borrowing
- test_mitosis_cpu_borrowing_enqueue: pure CPU spinner borrowing
- test_mitosis_demand_rebalancing: 2x oversubscribed + idle cell
- test_mitosis_cpuset_change_detection: swap cpusets at 100ms

Test Results Summary:
- 102 unit tests: all passed
- 56 mitosis tests: all passed (50 existing + 6 new)
- 195 lavd tests: all passed
- 11 cosmos, 7 compare, 12 simple, 9 tickless, 14 interleave,
  7 errors, 6 stress, 4 noise, 2 tick, 5 rtapp, 1 perfetto,
  1 slice_boost, 2 doc-tests: all passed
- Total: 447 passed, 0 failed, 3 ignored (stress fuzzers)
u32 i, cell_id;

/* Read configuration from global struct (populated by userspace) */
struct cell_config *config = &cell_config;
@tommy-u (Contributor) Feb 24, 2026

Can't we pass cell_config as an arg via the BPF syscall's ctx parameter?

Contributor (Author)

@kkdwvd maybe has more context here, but I tried this and it failed with some verifier errors that I've since lost. To the best of my understanding, the verifier requires that offsets into any BPF_PROG_TYPE_SYSCALL context are compile-time constants, but we do a bunch of indexing into arrays (which creates variable offsets).

Comment thread scheds/rust/scx_mitosis/src/main.rs Outdated
Comment thread scheds/rust/scx_mitosis/src/bpf/intf.h
Comment thread scheds/rust/scx_mitosis/src/main.rs Outdated
Comment thread scheds/rust/scx_mitosis/src/bpf/mitosis.bpf.c
Comment thread scheds/rust/scx_mitosis/src/main.rs
Comment thread scheds/rust/scx_mitosis/src/main.rs Outdated
rebalance_cooldown_s: u64,

/// EWMA smoothing factor for demand tracking. Higher = more responsive (default: 0.3)
#[clap(long, default_value = "0.3")]
Contributor

check range or this will do weird stuff :)

#[clap(long, default_value = "0.1", value_parser = clap::value_parser!(f64).range(0.0..=1.0))]
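For reference, the usual EWMA update this alpha feeds into (assuming the conventional form, where higher alpha weights the newest sample more heavily, matching the flag's "higher = more responsive" note):

```rust
// Conventional exponentially weighted moving average update (assumed form,
// not necessarily the PR's exact code): alpha in [0, 1] blends the newest
// sample against the previous smoothed value.
fn ewma(prev: f64, sample: f64, alpha: f64) -> f64 {
    alpha * sample + (1.0 - alpha) * prev
}
```

With alpha = 1.0 the smoothed value tracks the raw samples exactly; with alpha = 0.0 it never moves, which is why an out-of-range alpha would "do weird stuff."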

@tommy-u (Contributor) left a comment

I really like this & can't wait to get to testing it.

My only concern is marking a task !all_cell_cpus_allowed when the borrowable cpuset is not a subset of the task's cpuset. This seems like another version of the multi-cpu-pinning issue. Do we just need a second variable like "can_borrow" which we disable here?

Or maybe I misunderstand the borrowable cpuset or an assumption that we'll only be here when there is no pinning. https://github.com/sched-ext/scx/pull/3270/changes#r2850203114

Anyways, looks awesome. Let's fire it up

@tommy-u (Contributor) left a comment

Cleared up my misunderstanding of how borrowing cpusets work. Looks great.

@dschatzberg force-pushed the mitosis-overcommit branch 3 times, most recently from 8d3b823 to fe14bef on March 3, 2026
@dschatzberg dismissed likewhatevs's stale review March 3, 2026 18:26

discussed offline

@dschatzberg added this pull request to the merge queue Mar 4, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 4, 2026
@dschatzberg added this pull request to the merge queue Mar 4, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 4, 2026
Add symmetric counterpart to from_cpulist() that formats a Cpumask as a
compact CPU list string like "0-7,16-23". Returns "none" if no CPUs are
set.

Includes comprehensive tests for empty masks, single CPUs, contiguous
ranges, multiple ranges, scattered CPUs, mixed patterns, and roundtrip
conversion.
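The described formatting can be sketched over a sorted list of CPU ids (an illustrative implementation, not the actual Cpumask method): merge consecutive ids into ranges, then join them with commas.

```rust
// Sketch of compact cpulist formatting like "0-7,16-23" ("none" when empty),
// assuming `cpus` is sorted ascending with no duplicates.
fn to_cpulist(cpus: &[usize]) -> String {
    if cpus.is_empty() {
        return "none".to_string();
    }
    // Collapse runs of consecutive CPU ids into (start, end) ranges.
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for &cpu in cpus {
        match ranges.last_mut() {
            Some((_, end)) if *end + 1 == cpu => *end = cpu,
            _ => ranges.push((cpu, cpu)),
        }
    }
    ranges
        .iter()
        .map(|&(s, e)| if s == e { s.to_string() } else { format!("{s}-{e}") })
        .collect::<Vec<_>>()
        .join(",")
}
```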
Replace recv_timeout blocking with an epoll-based event loop. This
change prepares the infrastructure for adding more event sources later
(like inotify for cgroup events).

Key changes:
- Add nix dependency with event feature for epoll/eventfd
- Add epoll and stats_waker fields to Scheduler
- Add stats bridge thread that writes to eventfd on stats requests
- Main loop uses epoll.wait() with timeout, handles STATS_TOKEN

The event loop currently only handles stats requests, but the
architecture supports adding additional event sources in future commits.
Adds a CellManager module that watches a parent cgroup directory and
creates cells for each direct child cgroup. This is a step towards
the --cell-parent-cgroup mode where cells are managed from userspace.

Key features:
- Uses inotify to watch for cgroup creation/destruction
- Maintains mapping from cgroup ID (inode) to cell ID
- Cell ID allocation with reuse of freed IDs
- Basic CPU assignment: divides CPUs equally among all cells
- CellInfo uses Option<PathBuf> and Option<u64> for cgroup_path and
  cgid: cell 0 has no owning cgroup, so these are None rather than
  misleading sentinel values

This commit provides the core cell management infrastructure.
Cpuset-aware CPU assignment will be added in a follow-up commit.
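The id-reuse behavior mentioned above can be sketched with a free list (illustrative, not the PR's CellManager code): freed ids are recycled lowest-first before the counter grows, and id 0 is never handed out.

```rust
use std::collections::BTreeSet;

// Hypothetical cell-id allocator: reuse freed ids before allocating new
// ones, and reserve id 0 for the root cell.
struct CellIdAllocator {
    freed: BTreeSet<u32>,
    next: u32,
}

impl CellIdAllocator {
    fn new() -> Self {
        Self { freed: BTreeSet::new(), next: 1 } // id 0 reserved for cell 0
    }

    fn alloc(&mut self) -> u32 {
        // Prefer the smallest freed id so ids stay dense.
        if let Some(id) = self.freed.pop_first() {
            return id;
        }
        let id = self.next;
        self.next += 1;
        id
    }

    fn free(&mut self, id: u32) {
        self.freed.insert(id);
    }
}
```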
Extends CellManager to read and honor cpuset.cpus from each cgroup
directory. This enables pinning-aware CPU assignment:

- CellInfo now includes an optional cpuset mask
- CPU assignment algorithm handles:
  - Exclusive CPUs: assigned to sole claimant
  - Contested CPUs: divided proportionally among claimants
  - Unclaimed CPUs: shared between cell 0 and unpinned cells

Adds comprehensive tests for cpuset parsing and overlapping scenarios.
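The three-way split above can be sketched by counting claimants per CPU (illustrative types and names, not the PR's code): a CPU with one claiming cpuset is exclusive, with several is contested, with none is unclaimed.

```rust
use std::collections::HashMap;

// Sketch: bucket CPUs by how many cells' cpusets claim them.
fn classify_cpus(
    nr_cpus: usize,
    cpusets: &HashMap<u32, Vec<usize>>, // cell id -> claimed CPU ids
) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let mut claimants = vec![0usize; nr_cpus];
    for cpus in cpusets.values() {
        for &cpu in cpus {
            claimants[cpu] += 1;
        }
    }
    let mut exclusive = Vec::new();
    let mut contested = Vec::new();
    let mut unclaimed = Vec::new();
    for (cpu, &n) in claimants.iter().enumerate() {
        match n {
            0 => unclaimed.push(cpu),
            1 => exclusive.push(cpu),
            _ => contested.push(cpu),
        }
    }
    (exclusive, contested, unclaimed)
}
```

Exclusive CPUs then go to their sole claimant, contested ones are divided proportionally, and unclaimed ones are shared with cell 0, per the commit description.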
Add BPF support for userspace-managed cell configuration. When enabled
via userspace_managed_cell_mode, the internal timer-based configuration
is disabled and userspace controls cells via apply_cell_config.

New components:
- cell_config struct: holds cgroup->cell assignments and cell cpumasks
- cell_assignment struct: maps cgroup ID to cell ID
- cell_cpumask_data struct: byte array for cell CPU masks
- apply_cell_config SEC("syscall"): applies complete configuration

The apply_cell_config program operates in five phases:
1. Mark all cells (except cell 0) as not in use
2. Apply cell-to-cgroup assignments for owner cgroups
3. Walk cgroup hierarchy to propagate cells to children
4. Apply cell cpumasks and CPU-to-cell mappings
5. Bump applied_configuration_seq to signal completion
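The sequence-number handshake in phases 1-5 can be illustrated in plain Rust atomics (the real code uses BPF global variables, and the exact bump/compare scheme here is an assumption): userspace publishes a new target seq, the apply program records it once the config is fully applied, and readers compare the two.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative seq handshake: configuration_seq is what userspace wants
// applied; applied_configuration_seq is what the apply program has finished.
struct ConfigSeq {
    configuration_seq: AtomicU64,
    applied_configuration_seq: AtomicU64,
}

impl ConfigSeq {
    fn new() -> Self {
        Self {
            configuration_seq: AtomicU64::new(0),
            applied_configuration_seq: AtomicU64::new(0),
        }
    }

    /// Userspace side: a new config has been staged.
    fn publish(&self) {
        self.configuration_seq.fetch_add(1, Ordering::Release);
    }

    /// Apply side: record that the latest config is now fully applied.
    fn mark_applied(&self) {
        let seq = self.configuration_seq.load(Ordering::Acquire);
        self.applied_configuration_seq.store(seq, Ordering::Release);
    }

    /// Has the most recently published config been applied?
    fn is_current(&self) -> bool {
        self.applied_configuration_seq.load(Ordering::Acquire)
            == self.configuration_seq.load(Ordering::Acquire)
    }
}
```

The gap between the two counters is exactly the non-atomic window discussed in the review: tasks may transiently observe the old assignments until the apply program finishes.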
Wire CellManager into the scheduler with --cell-parent-cgroup CLI flag.
When specified, the scheduler watches the parent cgroup directory for
child cgroup creation/deletion and creates cells accordingly.

Key integration points:
- Add --cell-parent-cgroup CLI option
- Set userspace_managed_cell_mode BPF variable when option is provided
- Register inotify fd with epoll instance (INOTIFY_TOKEN)
- Apply initial cell configuration on scheduler attach
- Handle INOTIFY_TOKEN events to process cgroup changes
- Build and apply cell configuration via apply_cell_config BPF program

The cell configuration is sent to BPF via the cell_config global struct,
which contains cgroup->cell assignments and cell cpumasks.
Add integration test script for verifying cell isolation behavior.
The script creates test cgroups under a parent, starts the scheduler
with --cell-parent-cgroup, and verifies CPU isolation.

Test modes:
- Basic isolation: verify processes in different cells run on separate CPUs
- Dynamic cell creation: verify new cgroups are detected and assigned cells
- Dynamic cell destruction: verify removed cgroups trigger cell cleanup
- CPU reallocation: verify CPUs are redistributed when cells change
- Cell ID reuse: verify freed cell IDs are reused for new cells

Usage:
  ./test_cell_isolation.sh [--cpuset] [--proportional] [--num-cells N]
                           [--workers N] [--test-dynamic] [--test-all]
Allow excluding specific cgroup directory names from cell creation so
they remain in cell 0. The flag accepts exact basenames, can be
specified multiple times, and requires --cell-parent-cgroup.
Extract the idle CPU search + dispatch + kick logic from select_cpu and
enqueue into a shared try_pick_idle_cpu() helper.

The helper:
- Calls pick_idle_cpu() for cell-local idle CPU search
- On success: bumps CSTAT_LOCAL, dispatches to SCX_DSQ_LOCAL
- If kick=true: kicks the idle CPU
- Returns the CPU number (>= 0 success, -1 error, -EBUSY no idle CPU)

Both select_cpu and enqueue now call try_pick_idle_cpu() instead of
duplicating the pick_idle_cpu() + stats + dispatch logic. This
centralizes the idle CPU path so future extensions (like CPU borrowing)
can be added in one place.

In enqueue, this changes the idle CPU hit path: previously the task was
dispatched to the cell DSQ with vtime scheduling, now it goes directly
to SCX_DSQ_LOCAL. This is safe because the target CPU is idle — there
is no contention to arbitrate with vtime, and the task's vtime will be
updated normally in stopping() when its slice completes.
Allow busy cells to borrow idle CPUs from other cells when
--enable-borrowing is set. This provides soft overcommit: a cell under
heavy load can temporarily use CPUs that belong to idle cells without
permanently reassigning them.

BPF changes:
- Add borrowing fallback to try_pick_idle_cpu(): when no cell-local
  idle CPU is found and borrowing is enabled, search the cell's
  borrowable cpumask for idle CPUs from other cells. Borrowed tasks
  are dispatched to SCX_DSQ_LOCAL and bump CSTAT_BORROWED.
- Add per-cell borrowable_cpumask tracking in cell_cpumask_wrapper
- Extend update_task_cpumask() to check borrowable mask coverage
  when determining all_cell_cpus_allowed
- Initialize borrowable cpumasks in mitosis_init()
- Apply borrowable cpumasks in apply_cell_config()

Since borrowing is integrated into try_pick_idle_cpu(), both the
select_cpu and enqueue paths get borrowing support automatically.

Userspace changes:
- Add compute_borrowable_cpumasks() to CellManager: for each cell,
  the borrowable mask is the union of all other cells' CPUs
- Wire borrowable cpumasks into cell_config alongside cell cpumasks
- Add --enable-borrowing CLI flag and borrowed_pct stat

Testing:
- Add --test-borrowing integration test (pipe stressor, select_cpu path)
- Add --test-enqueue-borrowing integration test (CPU stressor, enqueue
  path) to validate borrowing works for CPU-bound tasks that are
  re-enqueued without going through select_cpu
Track per-cell CPU demand metrics: util_pct, demand_borrow_pct, and
lent_pct. BPF accounts per-task running time in cpu_ctx.running_ns
indexed by the task's cell, and userspace aggregates these into
utilization, borrowed time, and lent time percentages per cell.

The previous cell_cycles field was only used within BPF and not read
by userspace, so it is replaced with running_ns which userspace reads
to compute demand metrics.

Demand tracking is only active when --cell-parent-cgroup is set (i.e.,
the cell manager is driving cell creation), since all cells are known
to userspace in that mode.

Includes an integration test (--test-usage-tracking) that validates
the metrics end-to-end using monitor JSON output.
Dynamically adjust per-cell CPU allocation based on EWMA-smoothed
utilization. When --enable-rebalancing is set, the scheduler
periodically checks if the utilization spread across cells exceeds a
threshold and redistributes CPUs proportionally to demand.

Key changes:
- Add distribute_cpus_by_weight() helper for weighted CPU distribution
  with a floor guarantee of 1 CPU per cell to prevent starvation
- Refactor compute_cpu_assignments() to share logic with the new
  compute_demand_cpu_assignments() via compute_cpu_assignments_inner()
- Add borrow budget capping to compute_borrowable_cpumasks() so cells
  only borrow when they genuinely need extra capacity
- Add EWMA smoothing in collect_demand_metrics() to track per-cell
  demand over time
- Add maybe_rebalance() triggered in the main loop after metrics
  collection, gated by cooldown and threshold checks
- Make compute_and_apply_cell_config() use demand-weighted assignment
  when rebalancing is enabled and utilization data exists, so topology
  changes (new/destroyed cells, cpuset updates) preserve the current
  demand-weighted distribution instead of resetting to equal-weight
- Seed new cells to the average smoothed_util of existing cells so
  they start with a fair share rather than zero (which would starve
  them to the 1-CPU floor on the next rebalance tick)
- Clear smoothed_util for destroyed cells to prevent stale data from
  leaking if the cell ID is reused
- Expose rebalance_count and smoothed_util_pct in metrics/stats
- Add --test-rebalancing and --test-new-cell-demand integration tests

New CLI flags (all gated behind --enable-rebalancing):
  --enable-rebalancing     Enable demand-based rebalancing
  --rebalance-threshold    Utilization spread trigger (default: 20%)
  --rebalance-cooldown-s   Min seconds between rebalances (default: 5)
  --demand-smoothing       EWMA alpha factor (default: 0.3)
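A weighted distribution with the 1-CPU floor described above might look like this sketch (illustrative, not the PR's distribute_cpus_by_weight): seed every cell with one CPU, split the rest proportionally to demand weight, and hand truncation leftovers to the heaviest cells first.

```rust
// Sketch: weighted CPU distribution with a floor guarantee of 1 CPU per
// cell to prevent starvation. Requires nr_cpus >= number of cells.
fn distribute_cpus_by_weight(nr_cpus: usize, weights: &[f64]) -> Vec<usize> {
    let n = weights.len();
    if n == 0 {
        return Vec::new();
    }
    assert!(nr_cpus >= n, "need at least one CPU per cell");
    let total: f64 = weights.iter().sum();
    // Floor guarantee: every cell starts with one CPU.
    let mut alloc = vec![1usize; n];
    let spare = nr_cpus - n;
    if total > 0.0 {
        for (i, w) in weights.iter().enumerate() {
            // Truncating share of the spare CPUs; leftovers handled below.
            alloc[i] += ((w / total) * spare as f64) as usize;
        }
    }
    // CPUs lost to truncation (or all spares, when total demand is zero)
    // go to the heaviest cells first, round-robin.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| weights[b].partial_cmp(&weights[a]).unwrap());
    let mut assigned: usize = alloc.iter().sum();
    let mut i = 0;
    while assigned < nr_cpus {
        alloc[order[i % n]] += 1;
        assigned += 1;
        i += 1;
    }
    alloc
}
```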
In userspace-managed mode, cpusets were read once at cell creation and
never re-read. If someone later writes to a cell's cpuset.cpus, the
scheduler silently ignored the change.

Add a dedicated BPF counter cpuset_seq (separate from configuration_seq)
that is bumped by fentry_cpuset_write_resmask in userspace-managed mode.
Userspace checks this counter each main-loop iteration and re-reads
cpusets only when it changes, recomputing CPU assignments if any cell's
cpuset actually differs.
@dschatzberg merged commit cd2ca2e into sched-ext:main Mar 4, 2026
22 checks passed