From fd6e892ffc3f4881070274c0291809c790a6ed3c Mon Sep 17 00:00:00 2001
From: Wayland Yang
Date: Fri, 29 May 2026 01:11:13 +0800
Subject: [PATCH] packaging(systemd): add cgroup to RestrictNamespaces, document each token
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes #163. Thanks to @mrvellang for catching the latent trap.

RestrictNamespaces is an allowlist — anything not enumerated returns
EPERM on unshare/clone. Two changes:

1. Add `cgroup` to the allowlist. Without it, the moment we wire up
   per-child cgroup-v2 namespaces (the natural next step for fork-out
   isolation), `unshare(CLONE_NEWCGROUP)` silently returns EPERM. As
   #163 notes, the kind of trap that costs an afternoon to debug.

2. Inline-comment each token (net / mnt / user / pid / cgroup) so a
   future well-intentioned hardening pass doesn't quietly remove one
   without noticing what it gates.
---
 packaging/systemd/forkd-controller.service | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/packaging/systemd/forkd-controller.service b/packaging/systemd/forkd-controller.service
index 7735228..2af5903 100644
--- a/packaging/systemd/forkd-controller.service
+++ b/packaging/systemd/forkd-controller.service
@@ -36,7 +36,19 @@ ProtectKernelModules=true
 LockPersonality=true
 MemoryDenyWriteExecute=false
 RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX AF_NETLINK
-RestrictNamespaces=net mnt user pid
+# Allowed namespaces (RestrictNamespaces is an allowlist; everything else
+# returns EPERM on unshare/clone). Document the reason for each so a
+# well-intentioned trim doesn't silently break a feature:
+#   net    — per-child network namespace (one tap + one bridge endpoint
+#            per fork)
+#   mnt    — per-VM mount namespace (rootfs, virtio-fs, scratch)
+#   user   — unprivileged subprocess isolation
+#   pid    — Firecracker per-VM PID namespace (so PID 1 in the guest
+#            doesn't collide with host PIDs in logs and signals)
+#   cgroup — required for per-child cgroup-v2 namespace under the
+#            delegated subtree (without this, `unshare(CLONE_NEWCGROUP)`
+#            returns EPERM — see #163)
+RestrictNamespaces=net mnt user pid cgroup
 RestrictRealtime=true
 SystemCallArchitectures=native
 SystemCallFilter=@system-service
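
Review note (not part of the patch): the inline comments guard against an accidental trim only if someone reads them. A minimal sketch of a mechanical check follows, assuming a hypothetical CI step that reads the unit file and reports required namespaces missing from `RestrictNamespaces=`. The helper name `missing_namespaces`, the `REQUIRED_NAMESPACES` set, and the parsing rules are illustrative assumptions, not anything in the repository. The sketch handles only the plain allowlist form used in this unit, rejects boolean values and the `~` inverted form, and treats repeated assignments as a union, which is an assumption to verify against systemd.exec(5).

```python
# Hypothetical lint helper; names and rules are assumptions for illustration.
REQUIRED_NAMESPACES = frozenset({"net", "mnt", "user", "pid", "cgroup"})

# Values that systemd parses as booleans rather than as a namespace list.
_BOOLEAN_VALUES = {"yes", "no", "true", "false", "on", "off", "1", "0"}


def missing_namespaces(unit_text, required=REQUIRED_NAMESPACES):
    """Return the sorted required tokens absent from RestrictNamespaces=."""
    allowed = set()
    seen = False
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and unit-file comments ("#" or ";").
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep or key.strip() != "RestrictNamespaces":
            continue
        value = value.strip()
        if value.startswith("~") or value.lower() in _BOOLEAN_VALUES:
            raise ValueError("only the plain allowlist form is supported")
        seen = True
        # Assumption: repeated assignments are merged into one allowlist.
        allowed |= set(value.split())
    if not seen:
        raise ValueError("RestrictNamespaces= is not set")
    return sorted(set(required) - allowed)


if __name__ == "__main__":
    before = "RestrictNamespaces=net mnt user pid\n"
    after = "# cgroup: see #163\nRestrictNamespaces=net mnt user pid cgroup\n"
    print(missing_namespaces(before))  # ['cgroup']
    print(missing_namespaces(after))   # []
```

Run against the pre-patch value, the helper reports `cgroup` as missing; against the patched value it reports nothing. Keeping the required set next to the code that calls `unshare` would let a hardening pass that trims a token fail in CI rather than at runtime.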