nspawn: Stop overmounting /sys and /proc when a user namespace is used by daandemeyer · Pull Request #8 · daandemeyer/systemd

daandemeyer · 2026-03-06T18:21:11Z

When the container runs in a user namespace, we don't need to protect /proc and /sys by overmounting things. In fact, this is actively harmful, as it prevents nested systemd-nspawn from working: to mount procfs and sysfs in a container, the existing instances must not be overmounted, or the kernel will refuse the mount.

To make nesting possible, let's stop overmounting parts of /proc and /sys when user namespaces are in use.

daandemeyer · 2026-03-06T18:21:22Z

@claude review this pull request

github-actions

Claude review: 1 must-fix, 5 suggestion(s)

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn.c

        const char *to;
        int r;

+        r = container_in_userns();


Claude: must-fix: setup_boot_id() skips generating a unique boot ID when in a user namespace, meaning the container will inherit the host's boot ID. This is a functional regression — containers sharing the host boot ID can confuse systemd-journald and other services that use boot ID to distinguish boots. If bind-mounting over /proc/sys/kernel/random/boot_id is not possible in a user namespace, this should at minimum be documented with a comment explaining why skipping is acceptable, or an alternative mechanism should be used.

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn.c

        const char *to;
        int r;

+        r = container_in_userns();


Claude: suggestion: In both setup_boot_id() (here) and setup_kmsg() (line 2565), the pattern if (r != 0) return r; returns true (1) on the success-skip path from functions that conventionally return 0 on success or negative on error. While the callers only check r < 0, propagating a positive value from a setup function is non-idiomatic. Consider separating the error and skip cases:

        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;
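
The effect of the split check can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: `container_in_userns_stub()`, `fake_userns_state`, and `setup_something()` are hypothetical stand-ins, not the real nspawn functions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for container_in_userns(): returns > 0 if a user
 * namespace is in use, 0 if not, and a negative errno-style value on error. */
static int fake_userns_state = 0;

static int container_in_userns_stub(void) {
        return fake_userns_state;
}

/* Hypothetical setup function following the 0-on-success / negative-on-error
 * convention. The skip path returns 0 rather than leaking the positive value. */
static int setup_something(int *did_work) {
        int r;

        assert(did_work);
        *did_work = 0;

        r = container_in_userns_stub();
        if (r < 0)
                return r;       /* propagate the error */
        if (r > 0)
                return 0;       /* nothing to do in a user namespace: still success */

        *did_work = 1;          /* the real function would do its setup here */
        return 0;
}
```

With this shape, a caller may test `r < 0`, `r != 0`, or `r > 0` and reach the same conclusion about success in every case.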

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn.c

+
+        r = namespace_is_init(NAMESPACE_USER);
+        if (r < 0 && !IN_SET(r, -EBADR, -ENOSYS))
+                return log_error_errno(r, "Failed to check if in initial user namespace: %m");


Claude: suggestion: When namespace_is_init() returns -EBADR or -ENOSYS, the error is silently swallowed and the function falls through to return r == 0, which evaluates to false ("not in a user namespace"). The -ENOSYS case (procfs not mounted / not available) could indicate a constrained environment where assuming "not in a user namespace" may be incorrect. Consider whether these error cases should default to "assume we're in a user namespace" for safety, or at minimum log a debug/warning message so the fallback behavior is observable.
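
One way to make the fallback observable is sketched below. This is only an illustration under stated assumptions: `namespace_is_init_stub()` and `stub_result` are hypothetical stand-ins for the real namespace check, plain `fprintf()` stands in for systemd's logging helpers, and only the namespace part of the decision is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stub for the namespace check: > 0 if we are in the initial
 * user namespace, 0 if we are not, negative errno-style value on failure. */
static int stub_result = 1;

static int namespace_is_init_stub(void) {
        return stub_result;
}

/* Returns > 0 if in a (non-initial) user namespace, 0 if not, < 0 on error. */
static int container_in_userns_sketch(void) {
        int r = namespace_is_init_stub();

        if (r < 0 && r != -EBADR && r != -ENOSYS) {
                fprintf(stderr, "Failed to check if in initial user namespace: %s\n", strerror(-r));
                return r;
        }

        if (r < 0) {
                /* The tolerated errors are no longer silent: state which default is taken. */
                fprintf(stderr, "Cannot determine user namespace status (%s), assuming initial user namespace.\n",
                        strerror(-r));
                return 0;
        }

        return r == 0;
}
```

Whether the tolerated errors should default to "not in a user namespace", as here, or to the opposite is exactly the open question raised above; the sketch only shows where a log message would go.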

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn.c

+         * namespace ourselves, and when nspawn is invoked from within an existing user namespace. */
+        r = container_in_userns();
+        if (r < 0)
+                return r;


Claude: suggestion: Clearing MOUNT_APPLY_APIVFS_RO disables not just read-only sysfs/procfs mounts but also all PROC_INACCESSIBLE_REG and PROC_READ_ONLY overmounts in mount_all() — entries like /proc/kallsyms, /proc/kcore, /proc/keys, /proc/sysrq-trigger, /proc/timer_list, and directories like /proc/acpi, /proc/bus, /proc/scsi are left exposed. While the kernel does restrict access to most of these in user namespaces, some entries like /proc/kallsyms (depending on kptr_restrict sysctl) may still leak kernel address information. The commit message and comments should explicitly acknowledge that these masking overmounts are also being removed, not just the read-only flags.
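
To illustrate why clearing one flag removes several protections at once, here is a minimal sketch of a flag-gated mount table. The flag values, the `SketchMount` type, and the table contents are assumptions for illustration and are not copied from mount_all(); only the paths and the flag name come from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; only the name mirrors the one discussed above. */
enum {
        SKETCH_MOUNT_FATAL           = 1 << 0,
        SKETCH_MOUNT_APPLY_APIVFS_RO = 1 << 1,
};

typedef struct SketchMount {
        const char *where;
        unsigned required;      /* entry is applied only if all of these bits are set */
} SketchMount;

static const SketchMount sketch_table[] = {
        { "/proc",               0                            },
        { "/proc/kallsyms",      SKETCH_MOUNT_APPLY_APIVFS_RO },
        { "/proc/kcore",         SKETCH_MOUNT_APPLY_APIVFS_RO },
        { "/proc/sysrq-trigger", SKETCH_MOUNT_APPLY_APIVFS_RO },
        { "/sys",                0                            },
};

/* True if every bit in 'flags' is also set in 'v'. */
static bool flags_set(unsigned v, unsigned flags) {
        return (~v & flags) == 0;
}

/* Counts how many table entries would be applied for the given settings. */
static size_t count_applied(unsigned settings) {
        size_t n = 0;

        for (size_t i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++)
                if (flags_set(settings, sketch_table[i].required))
                        n++;

        return n;
}
```

In this sketch, dropping the single `SKETCH_MOUNT_APPLY_APIVFS_RO` bit skips every masking entry in the table, which is why the commit message should name them.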

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn.c

        return 1;
 }

+static int container_in_userns(void) {


Claude: suggestion: container_in_userns() calls namespace_is_init() (a stat() syscall on /proc/self/ns/user) every time it is invoked, and it is called in 5 different places (verify_arguments, setup_boot_id, setup_kmsg, run_container, and indirectly via mount_sysfs). Since the result cannot change during the process lifetime, consider computing this once in verify_arguments() and encoding the result into a flag (e.g., in arg_mount_settings), rather than re-checking the namespace status at each call site.
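
A minimal sketch of the compute-once idea follows. Note that it uses a lazily filled static cache rather than a flag set in verify_arguments(), which is a simpler variant chosen for illustration; `probe_userns()` and `probe_calls` are hypothetical stand-ins for the real check.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the (hypothetical) expensive probe actually runs. */
static int probe_calls = 0;

/* Hypothetical stand-in for the real namespace check. */
static int probe_userns(void) {
        probe_calls++;
        return 1;
}

/* Returns > 0 if in a user namespace, 0 if not, < 0 on error.
 * The probe result is computed at most once per process on success. */
static int container_in_userns_cached(void) {
        static bool cached = false;
        static int value = 0;

        if (!cached) {
                int r = probe_userns();
                if (r < 0)
                        return r;       /* errors are not cached, so a later call may retry */

                value = r > 0;
                cached = true;
        }

        return value;
}
```

Call sites keep the same three-way return contract, but only the first successful call reaches the probe.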

github-actions · 2026-03-06T18:36:33Z

src/nspawn/nspawn-mount.c

@@ -467,6 +465,21 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
                        return 0;


Claude: suggestion: When is_mount_point is true AND the existing mount is already sysfs (checked in the block above), the function returns 0 at line 465 before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check. This means if /sys is already mounted as sysfs and we're in the user namespace path, we return without ensuring it's read-only. The comment at line 468 says "we still have to mount it read-only" but this early-return path doesn't enforce that. Is this intentional?

daandemeyer · 2026-03-06T19:10:30Z

@claude review this pr

github-actions

Claude review: 3 must-fix, 1 suggestion(s), 1 nit(s)

github-actions · 2026-03-06T19:16:52Z

src/nspawn/nspawn.c

        const char *to;
        int r;

+        r = container_in_userns();


Claude: must-fix: setup_boot_id() uses if (r != 0) return r; which returns true (1) when container_in_userns() indicates a user namespace. Functions like setup_boot_id() follow the 0-on-success / negative-on-error convention. The current callers only check r < 0 so this works by accident, but returning a positive value is non-idiomatic and fragile — any future caller checking r > 0 or r != 0 would misinterpret the result. Please split the check:

        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;

github-actions · 2026-03-06T19:16:52Z

src/nspawn/nspawn.c

        assert(fd_inner_socket >= 0);

+        r = container_in_userns();
+        if (r != 0)


Claude: must-fix: Same issue as setup_boot_id(): setup_kmsg() returns 1 (true) via if (r != 0) return r; on the user-namespace skip path. Should be:

        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;

github-actions · 2026-03-06T19:16:52Z

src/nspawn/nspawn-mount.c

@@ -467,6 +465,21 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
                        return 0;


Claude: must-fix: When /sys is already a mount point with sysfs (the is_mount_point && is_sysfs path), the function returns 0 here before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check at line 469. In the user namespace case MOUNT_APPLY_APIVFS_RO is cleared, so the intent is to mount sysfs read-only directly — but this early return bypasses that entirely. The container ends up with whatever mount flags sysfs already had, contradicting the comment at line 468 that says "we still have to mount it read-only". The early return should either be moved below the new userns check, or this path should explicitly handle the read-only remount.
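
The ordering problem can be shown with a small decision-function sketch. It is a simplified model, not the real mount_sysfs(): the `SysfsAction` labels and both functions are hypothetical, and the real code performs mounts instead of returning a label.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels for illustration only. */
typedef enum SysfsAction {
        SYSFS_LEAVE_AS_IS,      /* keep the existing sysfs mount untouched */
        SYSFS_ENSURE_READ_ONLY, /* userns path: sysfs must end up read-only */
        SYSFS_FULL_SETUP,       /* regular path with overmounts */
} SysfsAction;

/* Order as described in the review: the "already sysfs" check comes first. */
static SysfsAction sysfs_action_early_return(bool already_sysfs, bool apply_apivfs_ro) {
        if (already_sysfs)
                return SYSFS_LEAVE_AS_IS;       /* the userns check below is never reached */
        if (!apply_apivfs_ro)
                return SYSFS_ENSURE_READ_ONLY;
        return SYSFS_FULL_SETUP;
}

/* One possible reordering: handle the userns case before the early return. */
static SysfsAction sysfs_action_reordered(bool already_sysfs, bool apply_apivfs_ro) {
        if (!apply_apivfs_ro)
                return SYSFS_ENSURE_READ_ONLY;
        if (already_sysfs)
                return SYSFS_LEAVE_AS_IS;
        return SYSFS_FULL_SETUP;
}
```

The two functions differ only for the combination "already sysfs, flag cleared", which is the case this comment is about.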

github-actions · 2026-03-06T19:16:52Z

src/nspawn/nspawn.c

        return 1;
 }

+static int container_in_userns(void) {


Claude: suggestion: container_in_userns() calls namespace_is_init() (a procfs stat) on every invocation, and it is called from 4 separate sites (verify_arguments, setup_boot_id, setup_kmsg, run_container). Since the user namespace status cannot change during the process lifetime, consider computing this once in verify_arguments() and storing the result in a flag (e.g., extending arg_mount_settings or a dedicated static variable) to avoid redundant syscalls and the repeated error-handling boilerplate at each call site.

github-actions · 2026-03-06T19:16:52Z

src/nspawn/nspawn-mount.c

-                if (r < 0)
-                        return r;
-        } else {
+        is_mount_point = r > 0;


Claude: nit: bool is_mount_point is declared at the top but not initialized until several lines later. It could be declared at the point of initialization for clarity:

bool is_mount_point = r > 0;

github-actions · 2026-03-06T19:16:53Z

Claude review of PR #8 (`5070e78`)

Must fix

setup_boot_id returns positive value — src/nspawn/nspawn.c:2152 — if (r != 0) return r; returns 1 (true) to callers expecting the 0-on-success convention; should be split into separate error and skip checks
setup_kmsg returns positive value — src/nspawn/nspawn.c:2566 — Same if (r != 0) return r; issue as setup_boot_id
mount_sysfs early return bypasses userns read-only path — src/nspawn/nspawn-mount.c:465 — When sysfs is already mounted, the function returns before reaching the new !FLAGS_SET(MOUNT_APPLY_APIVFS_RO) check, so the read-only mount intended for user namespaces is never applied

Suggestions

Silent error swallowing in container_in_userns — src/nspawn/nspawn.c:1653 — -EBADR/-ENOSYS errors are silently ignored, defaulting to "not in userns" which may be incorrect; consider at least a debug log
Cache container_in_userns result — src/nspawn/nspawn.c:1645 — Called from 4 sites but result is invariant; compute once and store

Nits

Declare is_mount_point at initialization — src/nspawn/nspawn-mount.c:453 — Could be bool is_mount_point = r > 0; at point of use


daandemeyer · 2026-03-06T19:49:27Z

@claude review this pr

daandemeyer · 2026-03-06T19:51:24Z

@claude review this pr

Fix a typo which causes a segfault when processing a user record with matchHostname when it's an array instead of a simple string:

        $ echo '{"userName":"crashhostarray","perMachine":[{"matchHostname":["host1","host2"],"locked":false}]}' | userdbctl -F -
        Segmentation fault (core dumped)
        $ coredumpctl info
        ...
        Message: Process 1172301 (userdbctl) of user 1000 dumped core.

        Module libz.so.1 from rpm zlib-ng-2.3.3-1.fc43.x86_64
        Module libcrypto.so.3 from rpm openssl-3.5.4-2.fc43.x86_64
        Stack trace of thread 1172301:
        #0  0x00007fded7b3a656 __strcmp_evex (libc.so.6 + 0x159656)
        #1  0x00007fded7e95397 per_machine_hostname_match (libsystemd-shared-260.so + 0x295397)
        #2  0x00007fded7e955b5 per_machine_match (libsystemd-shared-260.so + 0x2955b5)
        #3  0x00007fded7e957c6 dispatch_per_machine (libsystemd-shared-260.so + 0x2957c6)
        #4  0x00007fded7e96c97 user_record_load (libsystemd-shared-260.so + 0x296c97)
        #5  0x000000000040572d display_user (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x572d)
        #6  0x00007fded7ea9727 dispatch_verb (libsystemd-shared-260.so + 0x2a9727)
        #7  0x000000000041077c run (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x1077c)
        #8  0x00000000004107ce main (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x107ce)
        #9  0x00007fded79e45b5 __libc_start_call_main (libc.so.6 + 0x35b5)
        #10 0x00007fded79e4668 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x3668)
        #11 0x00000000004038d5 _start (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x38d5)
        ELF object binary architecture: AMD x86-64

The fido2_hmac_salt/fido2_hmac_credential/recovery_key fields kept leaking memory, as the array itself wasn't deallocated after deallocating each of its elements' data:

        $ build-san/userdbctl -F fuzz-corpus-userdb/auth-fido2.json
        ...
        =================================================================
        ==1292840==ERROR: LeakSanitizer: detected memory leaks

        Direct leak of 112 byte(s) in 1 object(s) allocated from:
        #0  0x7f56f00e5e4b in realloc.part.0 (/lib64/libasan.so.8+0xe5e4b) (BuildId: 25975f766867e9e604dc5a71a8befeaed3301942)
        #1  0x7f56ed869e42 in greedy_realloc ../src/basic/alloc-util.c:65
        #2  0x7f56ed7ff5e9 in dispatch_fido2_hmac_salt ../src/shared/user-record.c:836
        #3  0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
        #4  0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
        #5  0x7f56ed80100b in dispatch_privileged ../src/shared/user-record.c:998
        #6  0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
        #7  0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
        #8  0x7f56ed80622c in user_record_load ../src/shared/user-record.c:1697
        #9  0x000000408c15 in display_user ../src/userdb/userdbctl.c:447
        #10 0x7f56ed83cc9a in dispatch_verb ../src/shared/verbs.c:137
        #11 0x00000041df2b in run ../src/userdb/userdbctl.c:1908
        #12 0x00000041dfbe in main ../src/userdb/userdbctl.c:1911
        #13 0x7f56ec8105b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
        #14 0x7f56ec810667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
        #15 0x000000404a44 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/userdbctl+0x404a44) (BuildId: 19e8b7e7b7038d2cea20bc18a55bea2a9e4406d5)

        Direct leak of 64 byte(s) in 1 object(s) allocated from:
        #0  0x7f56f00e5e4b in realloc.part.0 (/lib64/libasan.so.8+0xe5e4b) (BuildId: 25975f766867e9e604dc5a71a8befeaed3301942)
        #1  0x7f56ed869e42 in greedy_realloc ../src/basic/alloc-util.c:65
        #2  0x7f56ed7fe779 in dispatch_fido2_hmac_credential_array ../src/shared/user-record.c:775
        #3  0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
        #4  0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
        #5  0x7f56ed80622c in user_record_load ../src/shared/user-record.c:1697
        #6  0x000000408c15 in display_user ../src/userdb/userdbctl.c:447
        #7  0x7f56ed83cc9a in dispatch_verb ../src/shared/verbs.c:137
        #8  0x00000041df2b in run ../src/userdb/userdbctl.c:1908
        #9  0x00000041dfbe in main ../src/userdb/userdbctl.c:1911
        #10 0x7f56ec8105b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
        #11 0x7f56ec810667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
        #12 0x000000404a44 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/userdbctl+0x404a44) (BuildId: 19e8b7e7b7038d2cea20bc18a55bea2a9e4406d5)

        SUMMARY: AddressSanitizer: 176 byte(s) leaked in 2 allocation(s).
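
The class of leak described in this commit message can be sketched generically as follows. The `Salt` and `Record` types and both functions are hypothetical and are not the user-record code; the sketch only shows that a cleanup routine has to free the array as well as each element's data.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element type: one heap-allocated blob per entry. */
typedef struct Salt {
        void *data;
        size_t size;
} Salt;

/* Hypothetical owner: a growable array of entries. */
typedef struct Record {
        Salt *salts;
        size_t n_salts;
} Record;

/* Appends a copy of 'data' to the record. Returns 0 or a negative errno-style value. */
static int record_add_salt(Record *rec, const void *data, size_t size) {
        assert(rec);
        assert(data && size > 0);

        Salt *grown = realloc(rec->salts, (rec->n_salts + 1) * sizeof(Salt));
        if (!grown)
                return -ENOMEM;
        rec->salts = grown;

        void *copy = malloc(size);
        if (!copy)
                return -ENOMEM;
        memcpy(copy, data, size);

        rec->salts[rec->n_salts++] = (Salt) { .data = copy, .size = size };
        return 0;
}

/* Releases everything the record owns. */
static void record_done(Record *rec) {
        assert(rec);

        for (size_t i = 0; i < rec->n_salts; i++)
                free(rec->salts[i].data);       /* each element's data */

        free(rec->salts);                       /* the array itself, the step that is easy to forget */
        rec->salts = NULL;
        rec->n_salts = 0;
}
```

Omitting the `free(rec->salts)` call would leave one direct leak per array, matching the shape of the LeakSanitizer report above.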


github-actions bot reviewed Mar 6, 2026


daandemeyer force-pushed the main branch from 8621bcd to cfa1a73 Compare March 6, 2026 19:09

daandemeyer force-pushed the nspawn-nesting branch from 60a51a7 to 481fd14 Compare March 6, 2026 19:10

github-actions bot reviewed Mar 6, 2026


daandemeyer force-pushed the main branch from cfa1a73 to 270d976 Compare March 6, 2026 19:41

daandemeyer force-pushed the nspawn-nesting branch from 481fd14 to 97987d7 Compare March 6, 2026 19:42

daandemeyer force-pushed the main branch from 270d976 to f660aa6 Compare March 6, 2026 19:48

daandemeyer force-pushed the nspawn-nesting branch from 97987d7 to 5070e78 Compare March 6, 2026 19:49

daandemeyer force-pushed the main branch from f660aa6 to 2f31a1b Compare March 6, 2026 19:51
