feat(jailer): decouple host cgroup from bwrap + default DoS limits#619
Draft
G4614 wants to merge 3 commits into
Draft
feat(jailer): decouple host cgroup from bwrap + default DoS limits#619G4614 wants to merge 3 commits into
G4614 wants to merge 3 commits into
Conversation
Host cgroup setup (create + join) lived inside BwrapSandbox, gated behind
jailer_enabled and the bwrap user-namespace preflight. On hosts where
apparmor restricts unprivileged userns (Ubuntu 24.04+ default), the
preflight fails before cgroup setup runs, so user-set resource_limits
silently never applied on Linux.
- Move cgroup create to Jailer::prepare and the cgroup-join pre_exec hook
to Jailer::command, gated only by whether cgroup limits exist. Cgroup
creation only writes /sys/fs/cgroup and needs no user namespace.
- Default DoS limits via Jailer::cgroup_config, populating ONLY the cgroup
(never the rlimit pre_exec hook, which maps to RLIMIT_AS/NPROC/CPU and
would break libkrun's mmaps or SIGKILL the VM):
pids.max = 1024 (baseline box uses ~22 host tasks)
memory.max = 2x VM RAM + 512 MiB (scales with --memory; guest RAM is
hard-capped by libkrun, so this only fires on VMM-side leaks)
- Wire remove_cgroup (previously dead code) into ShimHandler::stop with a
bounded retry: a detached shim is reaped by init, so the cgroup can be
briefly EBUSY after termination.
Verified: cgroup applied with jailer off / no bwrap / under apparmor
restriction; removed on stop; 256MB and default 2GB boxes get correct
limits and stay healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y/pids)
enable_controllers wrote the literal "+cpu +memory +pids" to
cgroup.subtree_control. cgroup v2 rejects the whole write if any named
controller isn't available, and on rootless/systemd-user hosts the session is
delegated only memory + pids (no cpu) — so the write failed with EINVAL and box
cgroups ended up with NO controllers, leaving memory.max/pids.max unwritten and
the DoS limits silently unenforced.
Enable the intersection of {cpu, memory, pids} with cgroup.controllers instead,
and run enable_controllers idempotently (not only on parent creation) so a
parent left un-delegated by an older build is repaired. Verified: a box's
cgroup now gets memory.max = 2×VM+512MiB and pids.max = 1024 under a rootless
user session.
NOTE: process placement in rootless mode is still blocked separately — the shim
starts in session-N.scope and can't be migrated across the root-owned
user.slice into user@.service/boxlite (EACCES). That needs systemd-run scope
placement; tracked as follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…a systemd scope The direct-cgroup path only works as root: a rootless process can't migrate itself from its login session-N.scope into user@.service/.../boxlite-<id> (EACCES on the root-owned user.slice common ancestor), so the shim ran unconstrained in session-N.scope and the host memory/pids limits were a no-op. Fix: gate the direct cgroup + pre_exec join to root, and for rootless adopt the already-spawned shim into a systemd *user* transient scope via StartTransientUnit(PIDs=[shim], MemoryMax, TasksMax). systemd owns the user.slice hierarchy, so it can do the placement an unprivileged process can't. Done post-spawn so the shim keeps the PID identity the watchdog/recovery rely on (no systemd-run interposition). The transient scope auto-removes when the shim exits. busctl keeps systemd a runtime, not a build, dependency. Verified two-sided: boxlite-<id>.scope reports MemoryMax = 2×VM+512MiB and TasksMax = 1024 with the adoption, and MemoryMax=infinity (unenforced) without it. Box start/exec/stop/rm unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Put each box's shim under a host cgroup with
memory.max+pids.maxso one box can't exhaust host RAM/PIDs — fixing the two bugs that made it a silent no-op rootless: the atomic+cpu +memory +pidscontroller write failing whencpuisn't delegated, and the shim being unable to migrate itself across the root-owneduser.sliceinto the cgroup (now adopted into a systemd scope instead).Test plan
shim_is_scoped_with_host_memory_and_pids_limits(integration, rootless): assertsboxlite-<id>.scopereportsMemoryMax= 2×VM + 512 MiB andTasksMax= 1024. Two-side verified.+cpuwrite EINVAL takes memory+pids down with it)memory+pidsenabledsession-N.scope(unconstrained)boxlite-<id>.scopeMemoryMaxthe shim runs underinfinity(no limit)