feat(jailer): decouple host cgroup from bwrap + default DoS limits#619

Draft
G4614 wants to merge 3 commits into boxlite-ai:main from G4614:feat/host-cgroup-dos-limits

G4614 (Contributor) commented May 28, 2026

Put each box's shim under a host cgroup with memory.max and pids.max so that one box can't exhaust host RAM or PIDs. This fixes the two bugs that made the limits a silent no-op in rootless mode:

  • The atomic "+cpu +memory +pids" controller write failed whenever cpu wasn't delegated, which took memory and pids down with it.
  • The shim could not migrate itself across the root-owned user.slice into the cgroup. It is now adopted into a systemd scope instead.

Test plan

  • shim_is_scoped_with_host_memory_and_pids_limits (integration, rootless): asserts that boxlite-<id>.scope reports MemoryMax = 2×VM + 512 MiB and TasksMax = 1024. Verified on both sides: limits are reported with the fix and absent without it.
  • Guest unit tests and clippy are clean; box start/exec/stop/rm are unaffected.

| observed (rootless) | pre-fix | post-fix |
| --- | --- | --- |
| controllers on the box cgroup | none (+cpu write EINVAL takes memory+pids down with it) | memory + pids enabled |
| shim placement | session-N.scope (unconstrained) | boxlite-<id>.scope |
| MemoryMax the shim runs under | infinity (no limit) | 2×VM + 512 MiB |

gamnaansong and others added 3 commits May 28, 2026 12:25
Host cgroup setup (create + join) lived inside BwrapSandbox, gated behind
jailer_enabled and the bwrap user-namespace preflight. On hosts where
apparmor restricts unprivileged userns (Ubuntu 24.04+ default), the
preflight fails before cgroup setup runs, so user-set resource_limits
silently never applied on Linux.

- Move cgroup create to Jailer::prepare and the cgroup-join pre_exec hook
  to Jailer::command, gated only by whether cgroup limits exist. Cgroup
  creation only writes /sys/fs/cgroup and needs no user namespace.
- Default DoS limits via Jailer::cgroup_config, populating ONLY the cgroup
  (never the rlimit pre_exec hook, which maps to RLIMIT_AS/NPROC/CPU and
  would break libkrun's mmaps or SIGKILL the VM):
    pids.max = 1024 (baseline box uses ~22 host tasks)
    memory.max = 2x VM RAM + 512 MiB (scales with --memory; guest RAM is
    hard-capped by libkrun, so this only fires on VMM-side leaks)
- Wire remove_cgroup (previously dead code) into ShimHandler::stop with a
  bounded retry: a detached shim is reaped by init, so the cgroup can be
  briefly EBUSY after termination.
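
The bounded retry described in the last bullet could look roughly like the sketch below. It is not the PR's actual code: the names `retry_while_busy` and `remove_cgroup_dir`, the attempt count, and the delay are assumptions chosen for illustration. The only behaviour taken from the commit message is that removal is retried a bounded number of times while the kernel reports EBUSY.

```rust
use std::io;
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

/// Linux errno value for EBUSY (hard-coded to avoid a libc dependency).
const EBUSY: i32 = 16;

/// Run `op` up to `attempts` times, sleeping `delay` between tries,
/// but only while it keeps failing with EBUSY. Any other error is
/// returned immediately. (Hypothetical helper, not from the PR.)
pub fn retry_while_busy<F>(attempts: u32, delay: Duration, mut op: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    let mut last_err = None;
    for attempt in 0..attempts {
        match op() {
            Ok(()) => return Ok(()),
            Err(e) if e.raw_os_error() == Some(EBUSY) => {
                last_err = Some(e);
                // Do not sleep after the final attempt.
                if attempt + 1 < attempts {
                    sleep(delay);
                }
            }
            Err(e) => return Err(e),
        }
    }
    Err(last_err
        .unwrap_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "attempts must be > 0")))
}

/// Remove a cgroup directory, tolerating a short EBUSY window while the
/// exited shim is still being reaped. A directory that is already gone
/// counts as success. The 10 x 50 ms budget is an assumed value.
pub fn remove_cgroup_dir(path: &Path) -> io::Result<()> {
    retry_while_busy(10, Duration::from_millis(50), || {
        match std::fs::remove_dir(path) {
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
            other => other,
        }
    })
}
```

Keeping the retry bounded means a cgroup that stays busy for an unrelated reason surfaces as an error instead of hanging the stop path.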

Verified: cgroup applied with jailer off / no bwrap / under apparmor
restriction; removed on stop; 256MB and default 2GB boxes get correct
limits and stay healthy.
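
As a worked example of the default limits, the sketch below computes memory.max and pids.max from the VM's memory size using the formula stated above (2x VM RAM + 512 MiB, and 1024 PIDs). The type and function names (`HostLimits`, `default_host_limits`) are assumptions for illustration and may not match what Jailer::cgroup_config actually exposes.

```rust
const MIB: u64 = 1024 * 1024;

/// Default pids.max from the commit message.
pub const DEFAULT_PIDS_MAX: u64 = 1024;

/// Host-side cgroup limits for one box (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct HostLimits {
    pub memory_max_bytes: u64,
    pub pids_max: u64,
}

/// memory.max = 2 x VM RAM + 512 MiB, pids.max = 1024.
/// Saturating arithmetic keeps an absurd input from wrapping around.
pub fn default_host_limits(vm_memory_mib: u64) -> HostLimits {
    let memory_max_mib = vm_memory_mib.saturating_mul(2).saturating_add(512);
    HostLimits {
        memory_max_bytes: memory_max_mib.saturating_mul(MIB),
        pids_max: DEFAULT_PIDS_MAX,
    }
}
```

For the two box sizes mentioned in the verification note, a 256 MiB box gets memory.max = 1024 MiB (1073741824 bytes) and a 2 GiB box gets 4608 MiB (4831838208 bytes).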

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y/pids)

enable_controllers wrote the literal "+cpu +memory +pids" to
cgroup.subtree_control. cgroup v2 rejects the whole write if any named
controller isn't available, and on rootless/systemd-user hosts the session is
delegated only memory + pids (no cpu) — so the write failed with EINVAL and box
cgroups ended up with NO controllers, leaving memory.max/pids.max unwritten and
the DoS limits silently unenforced.

Enable the intersection of {cpu, memory, pids} with cgroup.controllers instead,
and run enable_controllers idempotently (not only on parent creation) so a
parent left un-delegated by an older build is repaired. Verified: a box's
cgroup now gets memory.max = 2×VM+512MiB and pids.max = 1024 under a rootless
user session.
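
A minimal sketch of the intersection step, assuming the caller has already read the parent's cgroup.controllers file into a string. The function name `subtree_control_request` is hypothetical; the PR's enable_controllers may be structured differently.

```rust
/// Controllers the box cgroup would like to have.
const WANTED: [&str; 3] = ["cpu", "memory", "pids"];

/// Build the string to write to cgroup.subtree_control from the contents
/// of cgroup.controllers, naming only controllers that are both wanted
/// and available. Returns None when nothing can be enabled, so the
/// caller can skip the write entirely.
pub fn subtree_control_request(available: &str) -> Option<String> {
    let enabled: Vec<String> = WANTED
        .iter()
        // Whole-token match: "cpuset" must not count as "cpu".
        .filter(|want| available.split_whitespace().any(|have| have == **want))
        .map(|want| format!("+{}", want))
        .collect();

    if enabled.is_empty() {
        None
    } else {
        Some(enabled.join(" "))
    }
}
```

With a rootless session that is delegated only "memory pids", this yields "+memory +pids", which the kernel accepts, whereas the literal "+cpu +memory +pids" is rejected as a whole.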

NOTE: process placement in rootless mode is still blocked separately — the shim
starts in session-N.scope and can't be migrated across the root-owned
user.slice into user@.service/boxlite (EACCES). That needs systemd-run scope
placement; tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…a systemd scope

The direct-cgroup path only works as root: a rootless process can't migrate
itself from its login session-N.scope into user@.service/.../boxlite-<id>
(EACCES on the root-owned user.slice common ancestor), so the shim ran
unconstrained in session-N.scope and the host memory/pids limits were a no-op.

Fix: gate the direct cgroup + pre_exec join to root, and for rootless adopt the
already-spawned shim into a systemd *user* transient scope via
StartTransientUnit(PIDs=[shim], MemoryMax, TasksMax). systemd owns the
user.slice hierarchy, so it can do the placement an unprivileged process can't.
Done post-spawn so the shim keeps the PID identity the watchdog/recovery rely
on (no systemd-run interposition). The transient scope auto-removes when the
shim exits. busctl keeps systemd a runtime, not a build, dependency.

Verified two-sided: boxlite-<id>.scope reports MemoryMax = 2×VM+512MiB and
TasksMax = 1024 with the adoption, and MemoryMax=infinity (unenforced) without
it. Box start/exec/stop/rm unaffected.
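
For reference, a StartTransientUnit call through busctl might be assembled as in the sketch below. The commit message does not show the exact invocation, so the argument list is an assumption based on the documented D-Bus signature `ssa(sv)a(sa(sv))` and on busctl's convention of writing an array as a count followed by its elements. The function name `adopt_scope_args` and the "fail" job mode are illustrative choices.

```rust
/// Build the argument list for:
///   busctl --user call org.freedesktop.systemd1 ... StartTransientUnit ...
/// which asks the systemd user manager to create boxlite-<id>.scope,
/// adopt `shim_pid` into it, and apply MemoryMax / TasksMax.
/// (Hypothetical helper; the PR's real call may differ.)
pub fn adopt_scope_args(
    box_id: &str,
    shim_pid: u32,
    memory_max_bytes: u64,
    tasks_max: u64,
) -> Vec<String> {
    let unit = format!("boxlite-{}.scope", box_id);
    let pid = shim_pid.to_string();
    let mem = memory_max_bytes.to_string();
    let tasks = tasks_max.to_string();

    let args: Vec<&str> = vec![
        "--user",
        "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "StartTransientUnit",
        "ssa(sv)a(sa(sv))",
        unit.as_str(), // s: unit name
        "fail",        // s: job mode
        "3",           // a(sv): three properties follow
        "PIDs", "au", "1", pid.as_str(), // array of one already-running PID
        "MemoryMax", "t", mem.as_str(),  // uint64, bytes
        "TasksMax", "t", tasks.as_str(), // uint64
        "0",           // a(sa(sv)): no auxiliary units
    ];
    args.into_iter().map(String::from).collect()
}
```

The resulting vector can be handed to `std::process::Command::new("busctl").args(...)`. Because the PID is passed as a property of the scope, the shim is adopted after it has been spawned and keeps its PID, which matches the post-spawn design described above.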

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>