scx_mitosis: Initial support for non-cpuset cell assignment#3270
dschatzberg merged 13 commits into sched-ext:main from …
Conversation
tommy-u left a comment:
I think this is a huge step toward a scheduler that's more understandable and easier to modify, fix, and extend. It's got me excited about how to design across the kernel boundary.
The high-level things for me are terminology and opportunities to shift code out of apply_cell_config.
- Terminology. Right now, terminology is a bit messy. We have "Cell 0" or the "root cell." I saw "all_cell_ids" which was only the workload cell ids (I think this went away in a later commit).
This is a little tricky because the terms we need are doing double duty between 2 views:
- cgroups - hierarchical
- cells - flat
So how about we have a "system" cgroup(s)/cell and multiple "workload" cgroups/cells. Not married to these words, but I'd really like to drop "Cell 0" (implementation specific) and "root" which fits for the cgroup view, but is misleading in the cell view.
- I really like what you're cooking in apply_cell_config(). Why not take it further though? It seems like almost all the data you're accessing and processing in there is available for userspace to precompute and hand over for apply_cell_config to apply trivially.
I think user code is >10x easier to change and maintain than BPF, so it would be way easier to test new ideas. Making stuff up here, but e.g. if we didn't hardcode assigning child cgroups to the same cell as the workload owner, someone could play with extracting child cgroups into their own cells, maybe there's an application with virtualization... idk. Policy <-> mechanism.
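Roughly, I'm imagining userspace handing over something like this (hypothetical names, just a sketch of the shape, not the actual structs):

```rust
/// Hypothetical shape of a fully precomputed config: userspace resolves the
/// hierarchy and masks, and the BPF side just copies these into its maps.
struct PrecomputedCellConfig {
    /// (cgroup id, cell id) pairs, already expanded down the hierarchy
    assignments: Vec<(u64, u32)>,
    /// (cell id, cpumask bytes) ready to write into BPF as-is
    cpumasks: Vec<(u32, Vec<u8>)>,
}
```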
```rust
let mut cell_cpus: HashMap<u32, Cpumask> = HashMap::new();

// All CPUs are distributed among cell 0 + all cells
let all_cell_ids: Vec<u32> = self.cells.values().map(|c| c.cell_id).collect();
```
If I'm getting this right, all_cell_ids means "all cell ids except the root cell." That's definitely too confusing for me. How about all_workload_cells?
Yeah, it might be worth some additional logic just to create cell 0. I get you on the terminology point, but I don't love the workload/system distinction (nothing seems to indicate to me that we wouldn't want multiple "system" cells). I could maybe call it the default cell? But for whatever reason, root seems most natural to me. I acknowledge that it's coupled with the cgroup hierarchy, but that's also how I think about it?
I don't mean to bikeshed; anything is better than cell 0. I can live with "root" and something to mean non-root. Let's chat offline :)
```rust
// Verify all cells have at least one CPU assigned
// Cell 0 always exists
if !cell_cpus.contains_key(&0) || cell_cpus.get(&0).map_or(true, |m| m.weight() == 0) {
```
How about a constant to keep things self-documenting:

```rust
const SYSTEM_CELL_ID: u32 = 0;
```
```rust
///
/// Returns a Vec of (cell_id, Cpumask), or an error if any cell would
/// receive zero CPUs (which indicates too many cells for available CPUs).
pub fn compute_cpu_assignments(&self) -> Result<Vec<(u32, Cpumask)>> {
```
Personal preference, but as this fn grows longer, it would be easier for me to understand and modify if it were decomposed into separate functions.
Yeah, I think this is just my personal preference: I like the linearity of a single function (with comments to make it easier to follow) rather than breaking it up into multiple functions, which requires jumping around to understand the entire logic.
```rust
while !shutdown.load(Ordering::Relaxed) && !uei_exited!(&self.skel, uei) {
    let mut events = [EpollEvent::empty(); 1];
    let timeout_ms = self.monitor_interval.as_millis() as u16;
```
I might be reading this wrong, but would setting a monitor_interval greater than 65 seconds cause this to roll over?
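If so, a saturating conversion would at least pin the failure mode; a sketch against the snippet above:

```rust
// Clamp instead of truncating: anything over u16::MAX ms (~65.5 s)
// becomes u16::MAX rather than wrapping around.
let timeout_ms = u16::try_from(self.monitor_interval.as_millis()).unwrap_or(u16::MAX);
```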
```rust
    }
}

// Phase 3: Distribute contested CPUs proportionally among claimants
```
Split the baby. Very King Solomon.
```c
/* Cell cpumask data for a single cell */
struct cell_cpumask_data {
	unsigned char mask[MAX_CPUS_U8];
```
```c
 * 5. Bump applied_configuration_seq to signal completion
 *
 * Note: This is not atomic - tasks may observe intermediate states during
 * execution. On error, the scheduler may be left in a partially-configured
```
This sounds bad, but probably fine in practice. Can you say another word about why we should feel chill about it?
Are you asking about the error case or the non-atomic aspect?
Sorry for the ambiguity in highlighting two ideas here.
My question was about the non-atomic aspect here.
I think there is an implicit statement that might help a reader, something like: "The worst case is tasks transiently running on stale CPUs while apply_cell_config runs."
```c
struct cgroup *cur_cgrp, *root_cgrp_ref;
u32 i, cell_id;
int ret = 0;
```
Personal preference, but I struggle to hold this many ideas in my head at a time. Maybe this can be procedurally decomposed like on the userspace side.
I'm curious - how do multiple functions differ from phases with block comments for you? I prefer the latter because I read the code linearly rather than needing to jump around.
Skill issue on my part, probably. But here's my case for it not being arbitrary.
The one-big-function approach is like a cookbook that lists every ingredient for every recipe upfront; then you read each recipe with no idea what you actually need until you're done.
One big function:
- 20+ local variables, any of which could be used at any point
- You can't mentally set any of them down until you've read everything
Small functions:
- Natural summary of steps with none of the implementation detail
- Explicit data flow (prepared by one, consumed by another)
- The compiler complains about unused parameters, which helps minimize scope
Keep the summary open in one window to keep the big picture; jump around to each fn in the other.
```c
 * exits, causing the scheduler to be unloaded.
 */
SEC("syscall")
int apply_cell_config(void *ctx)
```
I really like this overall. It feels like we're moving way closer to a simpler world driven by userspace.
But can't we still move most of this function to userspace? We're almost entirely looking up and processing data that is available in userspace. Why don't we precompute all that and let this function be like 4 trivial loops that write it into the essential BPF data structures?
Also, possible opportunity to pass data via ctx?
What parts are you suggesting could be moved to userspace?
All the cgroup lookups and walking the cgroup hierarchy would be heinously expensive in userspace. While this isn't a particularly perf sensitive path, it's pretty egregious to have to do so many filesystem syscalls compared to the in-kernel iteration.
The rest is really just atomically publishing the cpumasks - I'm not sure how that can be done from userspace.
I guess I could bump the configuration_seq from userspace but it seems cleaner to keep the parts that actually synchronize in the same code base.
I think Phase 3 is the best candidate. I tried to sketch an example in my overall note where we might be able to remove a hard-coded policy assumption that child cgroups should be mapped to the same cell as the owning ancestor.
I was thinking of this as something that only happens on the ~hours time scale. And you'd only have to walk the cgroup hierarchy in the workload path, because all the rest map to the root cell, which would be pretty cheap?
```c
cgrp = bpf_cgroup_from_id(cgid);
if (!cgrp)
	return -ENOENT;
```
```toml
lazy_static = "1.5.0"
libbpf-rs = "=0.26.0-beta.1"
libc = "0.2.175"
inotify = "0.10"
```
layered uses 0.11; easier using that, IMO.
```rust
Ok(n) => {
    for event in &events[..n] {
        match event.data() {
            INOTIFY_TOKEN => {
```
IIUC, to make things simpler/less fragile, we're doing a kinda-sorta figure-out-the-state-of-the-world/reconcile thing here vs. something more granular/delta-based.
If so, I think there should be some debounce here for when things thrash, to avoid reconcile amplifying that.
Can you expand on what your concern is regarding thrashing?
So, we use inotify to trigger a full "figure out where things should run/what CPUs they should get".
If we have some container that keeps getting created/destroyed (say, something crashlooping and getting destroyed/created over and over), we could keep re-figuring out where things should run with different outcomes (i.e. split 4 ways, split 3 ways).
I think this issue applies even w/o a crashlooping container, but it'd be tolerable given spacing between create/destroy (i.e. if you care about perf, you aren't doing that already). I think the crashloop case would be the particularly bad one because we'd have relentless migrations and task spraying, IIUC.
@likewhatevs - sorry I missed this response. I think it's a valid concern but I don't see this "thrashing" as being so impactful to actual workloads since we'll continue to have preference for prev CPU and other idle CPU detection.
But I'm also not sure what the alternative would be - if a cell leaves the machine or joins the machine, we need to react to that.
We covered the points I wanted to hit. All happy.
likewhatevs left a comment:
This looks good, but I think we need some confirmation that my understanding of the thrash risk is wrong, or some mitigation of it (I think something like a 10s debounce might get the job done quick and dirty, maybe).
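Roughly the debounce I have in mind (hypothetical channel plumbing, just a sketch):

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Coalesce a burst of cgroup-change notifications into one reconcile pass.
/// `rx` is a hypothetical channel fed by the inotify handler.
fn debounced_reconcile(rx: &Receiver<()>, quiet: Duration, mut reconcile: impl FnMut()) {
    // Block until the first event of a burst arrives.
    while rx.recv().is_ok() {
        // Absorb follow-up events until `quiet` elapses with none;
        // a crashlooping container may warrant a max-delay cap on top.
        while rx.recv_timeout(quiet).is_ok() {}
        // The world has been stable for `quiet`: reconcile once.
        reconcile();
    }
}
```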
25f2067 to 0ab4268
…ests

Generalize cgroup lookup kfuncs so all schedulers benefit, and add 6 new mitosis integration tests modeling scenarios from PR #3270's test_cell_isolation.sh.

Infrastructure changes:
- Wire up scx_bpf_task_cgroup, bpf_cgroup_from_id, bpf_cgroup_ancestor in kfuncs.rs to call the CgroupRegistry instead of returning NULL.
- Add sim_get_root_cgroup FFI declaration for root cgroup fallback.
- Remove 64 lines of redundant per-scheduler cgroup overrides from mitosis wrapper.c (now handled by general kfuncs).
- Add CgroupCpusetChangeEvent for runtime cpuset modifications with engine support (update_cpuset + cgroup_init notification).

New tests:
- test_mitosis_basic_cell_isolation: N cgroups, non-overlapping cpusets
- test_mitosis_dynamic_cell_lifecycle: runtime create/destroy/migrate
- test_mitosis_cpu_borrowing_select_cpu: rapid sleep/wake borrowing
- test_mitosis_cpu_borrowing_enqueue: pure CPU spinner borrowing
- test_mitosis_demand_rebalancing: 2x oversubscribed + idle cell
- test_mitosis_cpuset_change_detection: swap cpusets at 100ms

Test Results Summary:
- 102 unit tests: all passed
- 56 mitosis tests: all passed (50 existing + 6 new)
- 195 lavd tests: all passed
- 11 cosmos, 7 compare, 12 simple, 9 tickless, 14 interleave, 7 errors, 6 stress, 4 noise, 2 tick, 5 rtapp, 1 perfetto, 1 slice_boost, 2 doc-tests: all passed
- Total: 447 passed, 0 failed, 3 ignored (stress fuzzers)
```c
u32 i, cell_id;

/* Read configuration from global struct (populated by userspace) */
struct cell_config *config = &cell_config;
```
Can't we pass cell_config as an arg via the BPF syscall ctx parameter?
@kkdwvd maybe has more context here, but I tried this and it failed with some verifier errors that I've since lost. To the best of my understanding, the verifier requires that offsets into any BPF_PROG_TYPE_SYSCALL context are compile-time constants, but we do a bunch of indexing into arrays (which creates variable offsets).
```rust
rebalance_cooldown_s: u64,

/// EWMA smoothing factor for demand tracking. Higher = more responsive (default: 0.3)
#[clap(long, default_value = "0.3")]
```
Check the range or this will do weird stuff :)

```rust
#[clap(long, default_value = "0.1", value_parser = clap::value_parser!(f64).range(0.0..=1.0))]
```
tommy-u left a comment:
I really like this & can't wait to get to testing it.
My only concern is marking a task !all_cell_cpus_allowed when the borrowable cpuset is not a subset of the task's cpuset. This seems like another version of the multi-CPU-pinning issue. Do we just need a second variable like "can_borrow" which we disable here?
Or maybe I misunderstand the borrowable cpuset, or there's an assumption that we'll only be here when there is no pinning. https://github.com/sched-ext/scx/pull/3270/changes#r2850203114
Anyways, looks awesome. Let's fire it up.
tommy-u left a comment:
Cleared up my misunderstanding of how borrowing cpusets work. Looks great.
8d3b823 to fe14bef
Add symmetric counterpart to from_cpulist() that formats a Cpumask as a compact CPU list string like "0-7,16-23". Returns "none" if no CPUs are set. Includes comprehensive tests for empty masks, single CPUs, contiguous ranges, multiple ranges, scattered CPUs, mixed patterns, and roundtrip conversion.
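A sketch of the formatting described above, assuming a plain sorted, deduplicated list of CPU ids (the real code presumably walks Cpumask bits):

```rust
fn to_cpulist(cpus: &[usize]) -> String {
    fn range_str(start: usize, end: usize) -> String {
        if start == end { format!("{start}") } else { format!("{start}-{end}") }
    }
    if cpus.is_empty() {
        return "none".to_string();
    }
    let mut parts = Vec::new();
    let (mut start, mut prev) = (cpus[0], cpus[0]);
    for &cpu in &cpus[1..] {
        if cpu == prev + 1 {
            prev = cpu; // extend the current contiguous run
        } else {
            parts.push(range_str(start, prev)); // close the run
            start = cpu;
            prev = cpu;
        }
    }
    parts.push(range_str(start, prev));
    parts.join(",")
}
```

With cpus 0..=7 and 16..=23 this yields "0-7,16-23", matching the roundtrip with from_cpulist() that the commit describes.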
Replace recv_timeout blocking with an epoll-based event loop. This change prepares the infrastructure for adding more event sources later (like inotify for cgroup events).

Key changes:
- Add nix dependency with event feature for epoll/eventfd
- Add epoll and stats_waker fields to Scheduler
- Add stats bridge thread that writes to eventfd on stats requests
- Main loop uses epoll.wait() with timeout, handles STATS_TOKEN

The event loop currently only handles stats requests, but the architecture supports adding additional event sources in future commits.
Adds a CellManager module that watches a parent cgroup directory and creates cells for each direct child cgroup. This is a step towards the --cell-parent-cgroup mode where cells are managed from userspace.

Key features:
- Uses inotify to watch for cgroup creation/destruction
- Maintains mapping from cgroup ID (inode) to cell ID
- Cell ID allocation with reuse of freed IDs
- Basic CPU assignment: divides CPUs equally among all cells
- CellInfo uses Option<PathBuf> and Option<u64> for cgroup_path and cgid: cell 0 has no owning cgroup, so these are None rather than misleading sentinel values

This commit provides the core cell management infrastructure. Cpuset-aware CPU assignment will be added in a follow-up commit.
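The ID-reuse scheme described above could be as simple as a free list; a sketch with hypothetical names:

```rust
/// Allocate cell ids, reusing ids freed by destroyed cells first.
struct CellIdAllocator {
    free: Vec<u32>, // ids returned by destroyed cells
    next: u32,      // next never-used id (0 is reserved for cell 0)
}

impl CellIdAllocator {
    fn new() -> Self {
        Self { free: Vec::new(), next: 1 }
    }

    fn alloc(&mut self) -> u32 {
        self.free.pop().unwrap_or_else(|| {
            let id = self.next;
            self.next += 1;
            id
        })
    }

    fn release(&mut self, id: u32) {
        self.free.push(id);
    }
}
```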
Extends CellManager to read and honor cpuset.cpus from each cgroup directory. This enables pinning-aware CPU assignment:
- CellInfo now includes an optional cpuset mask
- CPU assignment algorithm handles:
  - Exclusive CPUs: assigned to sole claimant
  - Contested CPUs: divided proportionally among claimants
  - Unclaimed CPUs: shared between cell 0 and unpinned cells

Adds comprehensive tests for cpuset parsing and overlapping scenarios.
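The exclusive/contested/unclaimed split can be driven by a per-CPU claim count; a sketch with hypothetical types (the real code works on Cpumasks):

```rust
use std::collections::HashMap;

/// Bucket CPUs by how many cells pin them: 0 claims = unclaimed,
/// 1 = exclusive to that claimant, 2+ = contested.
fn classify(
    cpusets: &HashMap<u32, Vec<usize>>, // cell id -> pinned CPU ids
    nr_cpus: usize,
) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let mut claims = vec![0u32; nr_cpus];
    for cpus in cpusets.values() {
        for &cpu in cpus {
            claims[cpu] += 1;
        }
    }
    let (mut exclusive, mut contested, mut unclaimed) = (Vec::new(), Vec::new(), Vec::new());
    for (cpu, &n) in claims.iter().enumerate() {
        match n {
            0 => unclaimed.push(cpu),
            1 => exclusive.push(cpu),
            _ => contested.push(cpu),
        }
    }
    (exclusive, contested, unclaimed)
}
```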
Add BPF support for userspace-managed cell configuration. When enabled
via userspace_managed_cell_mode, the internal timer-based configuration
is disabled and userspace controls cells via apply_cell_config.
New components:
- cell_config struct: holds cgroup->cell assignments and cell cpumasks
- cell_assignment struct: maps cgroup ID to cell ID
- cell_cpumask_data struct: byte array for cell CPU masks
- apply_cell_config SEC("syscall"): applies complete configuration
The apply_cell_config program operates in five phases:
1. Mark all cells (except cell 0) as not in use
2. Apply cell-to-cgroup assignments for owner cgroups
3. Walk cgroup hierarchy to propagate cells to children
4. Apply cell cpumasks and CPU-to-cell mappings
5. Bump applied_configuration_seq to signal completion
Wire CellManager into the scheduler with --cell-parent-cgroup CLI flag. When specified, the scheduler watches the parent cgroup directory for child cgroup creation/deletion and creates cells accordingly.

Key integration points:
- Add --cell-parent-cgroup CLI option
- Set userspace_managed_cell_mode BPF variable when option is provided
- Register inotify fd with epoll instance (INOTIFY_TOKEN)
- Apply initial cell configuration on scheduler attach
- Handle INOTIFY_TOKEN events to process cgroup changes
- Build and apply cell configuration via apply_cell_config BPF program

The cell configuration is sent to BPF via the cell_config global struct, which contains cgroup->cell assignments and cell cpumasks.
Add integration test script for verifying cell isolation behavior.
The script creates test cgroups under a parent, starts the scheduler
with --cell-parent-cgroup, and verifies CPU isolation.
Test modes:
- Basic isolation: verify processes in different cells run on separate CPUs
- Dynamic cell creation: verify new cgroups are detected and assigned cells
- Dynamic cell destruction: verify removed cgroups trigger cell cleanup
- CPU reallocation: verify CPUs are redistributed when cells change
- Cell ID reuse: verify freed cell IDs are reused for new cells
Usage:
./test_cell_isolation.sh [--cpuset] [--proportional] [--num-cells N]
[--workers N] [--test-dynamic] [--test-all]
Allow excluding specific cgroup directory names from cell creation so they remain in cell 0. The flag accepts exact basenames, can be specified multiple times, and requires --cell-parent-cgroup.
Extract the idle CPU search + dispatch + kick logic from select_cpu and enqueue into a shared try_pick_idle_cpu() helper. The helper:
- Calls pick_idle_cpu() for cell-local idle CPU search
- On success: bumps CSTAT_LOCAL, dispatches to SCX_DSQ_LOCAL
- If kick=true: kicks the idle CPU
- Returns the CPU number (>= 0 success, -1 error, -EBUSY no idle CPU)

Both select_cpu and enqueue now call try_pick_idle_cpu() instead of duplicating the pick_idle_cpu() + stats + dispatch logic. This centralizes the idle CPU path so future extensions (like CPU borrowing) can be added in one place.

In enqueue, this changes the idle CPU hit path: previously the task was dispatched to the cell DSQ with vtime scheduling, now it goes directly to SCX_DSQ_LOCAL. This is safe because the target CPU is idle: there is no contention to arbitrate with vtime, and the task's vtime will be updated normally in stopping() when its slice completes.
Allow busy cells to borrow idle CPUs from other cells when --enable-borrowing is set. This provides soft overcommit: a cell under heavy load can temporarily use CPUs that belong to idle cells without permanently reassigning them.

BPF changes:
- Add borrowing fallback to try_pick_idle_cpu(): when no cell-local idle CPU is found and borrowing is enabled, search the cell's borrowable cpumask for idle CPUs from other cells. Borrowed tasks are dispatched to SCX_DSQ_LOCAL and bump CSTAT_BORROWED.
- Add per-cell borrowable_cpumask tracking in cell_cpumask_wrapper
- Extend update_task_cpumask() to check borrowable mask coverage when determining all_cell_cpus_allowed
- Initialize borrowable cpumasks in mitosis_init()
- Apply borrowable cpumasks in apply_cell_config()

Since borrowing is integrated into try_pick_idle_cpu(), both the select_cpu and enqueue paths get borrowing support automatically.

Userspace changes:
- Add compute_borrowable_cpumasks() to CellManager: for each cell, the borrowable mask is the union of all other cells' CPUs
- Wire borrowable cpumasks into cell_config alongside cell cpumasks
- Add --enable-borrowing CLI flag and borrowed_pct stat

Testing:
- Add --test-borrowing integration test (pipe stressor, select_cpu path)
- Add --test-enqueue-borrowing integration test (CPU stressor, enqueue path) to validate borrowing works for CPU-bound tasks that are re-enqueued without going through select_cpu
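The union-of-others rule is cheap to compute; a sketch using plain u64 bitmasks (an assumption for machines up to 64 CPUs: the real code uses Cpumask, and the later rebalancing commit additionally caps this with a borrow budget):

```rust
use std::collections::HashMap;

/// For each cell, the borrowable mask is every *other* cell's CPUs,
/// i.e. the union of all masks minus the cell's own.
fn borrowable_masks(cell_cpus: &HashMap<u32, u64>) -> HashMap<u32, u64> {
    let all: u64 = cell_cpus.values().fold(0, |acc, m| acc | m);
    cell_cpus.iter().map(|(&id, &m)| (id, all & !m)).collect()
}
```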
Track per-cell CPU demand metrics: util_pct, demand_borrow_pct, and lent_pct. BPF accounts per-task running time in cpu_ctx.running_ns indexed by the task's cell, and userspace aggregates these into utilization, borrowed time, and lent time percentages per cell.

The previous cell_cycles field was only used within BPF and not read by userspace, so it is replaced with running_ns which userspace reads to compute demand metrics.

Demand tracking is only active when --cell-parent-cgroup is set (i.e., the cell manager is driving cell creation), since all cells are known to userspace in that mode.

Includes an integration test (--test-usage-tracking) that validates the metrics end-to-end using monitor JSON output.
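One plausible reading of util_pct, as a hypothetical helper (the real definition may differ in how it picks the CPU budget):

```rust
/// Per-cell utilization over a sampling interval: accumulated running time
/// divided by the cell's total CPU-time budget in that interval.
fn util_pct(delta_running_ns: u64, interval_ns: u64, cell_nr_cpus: u64) -> f64 {
    if interval_ns == 0 || cell_nr_cpus == 0 {
        return 0.0;
    }
    100.0 * delta_running_ns as f64 / (interval_ns as f64 * cell_nr_cpus as f64)
}
```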
Dynamically adjust per-cell CPU allocation based on EWMA-smoothed utilization. When --enable-rebalancing is set, the scheduler periodically checks if the utilization spread across cells exceeds a threshold and redistributes CPUs proportionally to demand.

Key changes:
- Add distribute_cpus_by_weight() helper for weighted CPU distribution with a floor guarantee of 1 CPU per cell to prevent starvation
- Refactor compute_cpu_assignments() to share logic with the new compute_demand_cpu_assignments() via compute_cpu_assignments_inner()
- Add borrow budget capping to compute_borrowable_cpumasks() so cells only borrow when they genuinely need extra capacity
- Add EWMA smoothing in collect_demand_metrics() to track per-cell demand over time
- Add maybe_rebalance() triggered in the main loop after metrics collection, gated by cooldown and threshold checks
- Make compute_and_apply_cell_config() use demand-weighted assignment when rebalancing is enabled and utilization data exists, so topology changes (new/destroyed cells, cpuset updates) preserve the current demand-weighted distribution instead of resetting to equal-weight
- Seed new cells to the average smoothed_util of existing cells so they start with a fair share rather than zero (which would starve them to the 1-CPU floor on the next rebalance tick)
- Clear smoothed_util for destroyed cells to prevent stale data from leaking if the cell ID is reused
- Expose rebalance_count and smoothed_util_pct in metrics/stats
- Add --test-rebalancing and --test-new-cell-demand integration tests

New CLI flags (all gated behind --enable-rebalancing):
--enable-rebalancing Enable demand-based rebalancing
--rebalance-threshold Utilization spread trigger (default: 20%)
--rebalance-cooldown-s Min seconds between rebalances (default: 5)
--demand-smoothing EWMA alpha factor (default: 0.3)
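A sketch of the weighted distribution with a 1-CPU floor (hypothetical signature; the real helper also deals with Cpumasks rather than counts):

```rust
/// Split `total_cpus` across cells proportionally to `weights`, guaranteeing
/// every cell at least 1 CPU, with largest-remainder rounding for leftovers.
fn distribute_cpus_by_weight(total_cpus: usize, weights: &[f64]) -> Vec<usize> {
    let n = weights.len();
    assert!(total_cpus >= n, "need at least one CPU per cell");
    let spare = (total_cpus - n) as f64; // what's left after the 1-CPU floor
    let sum: f64 = weights.iter().sum::<f64>().max(f64::MIN_POSITIVE);
    let shares: Vec<f64> = weights.iter().map(|w| w / sum * spare).collect();
    let mut alloc: Vec<usize> = shares.iter().map(|s| 1 + s.floor() as usize).collect();
    // Hand rounding leftovers to the largest fractional remainders.
    let leftover = total_cpus - alloc.iter().sum::<usize>();
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| shares[b].fract().partial_cmp(&shares[a].fract()).unwrap());
    for &i in order.iter().take(leftover) {
        alloc[i] += 1;
    }
    alloc
}
```

The EWMA itself is just smoothed = alpha * sample + (1 - alpha) * smoothed, which is where the --demand-smoothing alpha factor comes in.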
In userspace-managed mode, cpusets were read once at cell creation and never re-read. If someone later writes to a cell's cpuset.cpus, the scheduler silently ignored the change. Add a dedicated BPF counter cpuset_seq (separate from configuration_seq) that is bumped by fentry_cpuset_write_resmask in userspace-managed mode. Userspace checks this counter each main-loop iteration and re-reads cpusets only when it changes, recomputing CPU assignments if any cell's cpuset actually differs.
fe14bef to dc2c52b
Currently, Mitosis creates cells from BPF purely based on cpusets. We want to make mitosis capable of creating cells without strictly CPU-pinning cgroups. This PR implements an initial naive version that:
Future work here will extend this to make cell <=> CPU assignment topology-aware, utilization-aware, etc., and enable allocation of CPUs through configuration.