nspawn: Stop overmounting /sys and /proc when a user namespace is used #8
daandemeyer wants to merge 1 commit into main from
@claude review this pull request
        const char *to;
        int r;

        r = container_in_userns();
Claude: must-fix: setup_boot_id() skips generating a unique boot ID when in a user namespace, meaning the container will inherit the host's boot ID. This is a functional regression — containers sharing the host boot ID can confuse systemd-journald and other services that use boot ID to distinguish boots. If bind-mounting over /proc/sys/kernel/random/boot_id is not possible in a user namespace, this should at minimum be documented with a comment explaining why skipping is acceptable, or an alternative mechanism should be used.
        const char *to;
        int r;

        r = container_in_userns();
Claude: suggestion: In both setup_boot_id() (here) and setup_kmsg() (line 2565), the pattern if (r != 0) return r; returns true (1) on the success-skip path from functions that conventionally return 0 on success or negative on error. While the callers only check r < 0, propagating a positive value from a setup function is non-idiomatic. Consider separating the error and skip cases:
        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;
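A minimal sketch of the convention this suggestion describes. `fake_container_in_userns()` and `setup_example()` are hypothetical stand-ins, not nspawn code; the namespace state is injected through a variable so all three outcomes can be exercised without real namespaces:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for container_in_userns(): returns < 0 on error,
 * 0 when no user namespace is involved, > 0 when one is. The value is
 * injected through this variable purely for illustration. */
static int fake_userns_state = 0;

static int fake_container_in_userns(void) {
        return fake_userns_state;
}

/* Illustrative setup function following the 0-on-success / negative-errno
 * convention. *did_work records whether the setup body actually ran. */
static int setup_example(int *did_work) {
        int r;

        assert(did_work);
        *did_work = 0;

        r = fake_container_in_userns();
        if (r < 0)
                return r;       /* propagate the error unchanged */
        if (r > 0)
                return 0;       /* skip path: still a plain success */

        *did_work = 1;          /* the real setup would happen here */
        return 0;
}
```

With this split, a caller that tests `r != 0` or `r > 0` sees the same result on the skip path as on the normal success path.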
        r = namespace_is_init(NAMESPACE_USER);
        if (r < 0 && !IN_SET(r, -EBADR, -ENOSYS))
                return log_error_errno(r, "Failed to check if in initial user namespace: %m");
Claude: suggestion: When namespace_is_init() returns -EBADR or -ENOSYS, the error is silently swallowed and the function falls through to return r == 0, which evaluates to false ("not in a user namespace"). The -ENOSYS case (procfs not mounted / not available) could indicate a constrained environment where assuming "not in a user namespace" may be incorrect. Consider whether these error cases should default to "assume we're in a user namespace" for safety, or at minimum log a debug/warning message so the fallback behavior is observable.
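One way to make the fallback observable, as a rough sketch. `userns_from_init_check()` is a hypothetical helper that takes the result of a `namespace_is_init()`-style check as a plain integer; the policy shown (keep the PR's "not in a user namespace" fallback, but log it) is an assumption for illustration, and plain `fprintf()` stands in for systemd's logging:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: maps the result of a namespace_is_init()-style check
 * (> 0: initial namespace, 0: not initial, < 0: negative errno) to
 * "are we in a user namespace?" (1/0), or a negative errno on hard errors. */
static int userns_from_init_check(int r) {
        if (r < 0 && r != -EBADR && r != -ENOSYS)
                return r;       /* hard error: let the caller report it */

        if (r < 0) {
                /* Assumed policy for this sketch: same fallback as the PR,
                 * but no longer silent. */
                fprintf(stderr,
                        "Cannot determine user namespace status (errno %d), "
                        "assuming initial user namespace.\n", -r);
                return 0;
        }

        return r == 0;
}
```

Whether the fallback should instead be "assume a user namespace" is the open question raised in the comment; switching the policy would only change the value returned in the middle branch.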
         * namespace ourselves, and when nspawn is invoked from within an existing user namespace. */
        r = container_in_userns();
        if (r < 0)
                return r;
Claude: suggestion: Clearing MOUNT_APPLY_APIVFS_RO disables not just read-only sysfs/procfs mounts but also all PROC_INACCESSIBLE_REG and PROC_READ_ONLY overmounts in mount_all() — entries like /proc/kallsyms, /proc/kcore, /proc/keys, /proc/sysrq-trigger, /proc/timer_list, and directories like /proc/acpi, /proc/bus, /proc/scsi are left exposed. While the kernel does restrict access to most of these in user namespaces, some entries like /proc/kallsyms (depending on kptr_restrict sysctl) may still leak kernel address information. The commit message and comments should explicitly acknowledge that these masking overmounts are also being removed, not just the read-only flags.
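For anyone who wants to check the `kptr_restrict` setting mentioned above on a test system, a small sketch that reads the sysctl from procfs. `read_kptr_restrict()` is a hypothetical helper name; it only reports the value and does not assess what `/proc/kallsyms` exposes:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: returns the current kptr_restrict value
 * (0, 1 or 2) or a negative errno if it cannot be read. */
static int read_kptr_restrict(void) {
        FILE *f;
        int value;

        f = fopen("/proc/sys/kernel/kptr_restrict", "r");
        if (!f)
                return -errno;

        if (fscanf(f, "%d", &value) != 1) {
                fclose(f);
                return -EINVAL;
        }

        fclose(f);
        return value;
}
```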
        return 1;
}

static int container_in_userns(void) {
Claude: suggestion: container_in_userns() calls namespace_is_init() (a stat() syscall on /proc/ns/user) every time it is invoked, and it is called in 5 different places (verify_arguments, setup_boot_id, setup_kmsg, run_container, and indirectly via mount_sysfs). Since the result cannot change during the process lifetime, consider computing this once in verify_arguments() and encoding the result into a flag (e.g., in arg_mount_settings), rather than re-checking the namespace status at each call site.
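For context, a sketch of what such a check can look like. It assumes the check compares the inode of `/proc/self/ns/user` with the fixed inode number the kernel gives the initial user namespace (`PROC_USER_INIT_INO` in the kernel's `include/linux/proc_ns.h`). `in_initial_userns()` is a hypothetical name, and this is not the implementation of `namespace_is_init()`:

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/* Inode number of the initial user namespace (PROC_USER_INIT_INO). */
#define USER_NS_INIT_INO 0xEFFFFFFDU

/* Hypothetical helper: returns 1 if the calling process is in the initial
 * user namespace, 0 if it is in a child user namespace, or a negative
 * errno if the namespace file cannot be inspected. */
static int in_initial_userns(void) {
        struct stat st;

        if (stat("/proc/self/ns/user", &st) < 0)
                return -errno;

        return st.st_ino == USER_NS_INIT_INO;
}
```

Each call costs one `stat()`, which is the per-call overhead the comment refers to.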
@@ -467,6 +465,21 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
        return 0;
Claude: suggestion: When is_mount_point is true AND the existing mount is already sysfs (checked in the block above), the function returns 0 at line 465 before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check. This means if /sys is already mounted as sysfs and we're in the user namespace path, we return without ensuring it's read-only. The comment at line 468 says "we still have to mount it read-only" but this early-return path doesn't enforce that. Is this intentional?
force-pushed from 60a51a7 to 481fd14
@claude review this pr
        const char *to;
        int r;

        r = container_in_userns();
Claude: must-fix: setup_boot_id() uses if (r != 0) return r; which returns true (1) when container_in_userns() indicates a user namespace. Functions like setup_boot_id() follow the 0-on-success / negative-on-error convention. The current callers only check r < 0 so this works by accident, but returning a positive value is non-idiomatic and fragile — any future caller checking r > 0 or r != 0 would misinterpret the result. Please split the check:
        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;

        assert(fd_inner_socket >= 0);

        r = container_in_userns();
        if (r != 0)
Claude: must-fix: Same issue as setup_boot_id(): setup_kmsg() returns 1 (true) via if (r != 0) return r; on the user-namespace skip path. Should be:
        r = container_in_userns();
        if (r < 0)
                return r;
        if (r > 0)
                return 0;

@@ -467,6 +465,21 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
        return 0;
Claude: must-fix: When /sys is already a mount point with sysfs (the is_mount_point && is_sysfs path), the function returns 0 here before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check at line 469. In the user namespace case MOUNT_APPLY_APIVFS_RO is cleared, so the intent is to mount sysfs read-only directly — but this early return bypasses that entirely. The container ends up with whatever mount flags sysfs already had, contradicting the comment at line 468 that says "we still have to mount it read-only". The early return should either be moved below the new userns check, or this path should explicitly handle the read-only remount.
        return 1;
}

static int container_in_userns(void) {
Claude: suggestion: container_in_userns() calls namespace_is_init() (a procfs stat) on every invocation, and it is called from 4 separate sites (verify_arguments, setup_boot_id, setup_kmsg, run_container). Since the user namespace status cannot change during the process lifetime, consider computing this once in verify_arguments() and storing the result in a flag (e.g., extending arg_mount_settings or a dedicated static variable) to avoid redundant syscalls and the repeated error-handling boilerplate at each call site.
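A rough sketch of the "compute once" idea using a dedicated static variable. `probe_userns()` is a hypothetical stand-in for the procfs-based check, with a call counter added only to show that the probe runs once; the choice not to cache errors is an assumption of this sketch:

```c
#include <assert.h>

static int probe_calls = 0;     /* counts how often the expensive check runs */

/* Hypothetical stand-in for the procfs-based check: > 0 means
 * "in a user namespace", 0 means not, < 0 is a negative errno. */
static int probe_userns(void) {
        probe_calls++;
        return 1;
}

/* Determines the answer on first use and reuses it afterwards.
 * Errors are not cached, so a later call retries the probe. */
static int container_in_userns_cached(void) {
        static int cached = -1;         /* -1: not determined yet */
        int r;

        if (cached >= 0)
                return cached;

        r = probe_userns();
        if (r < 0)
                return r;

        cached = r > 0;
        return cached;
}
```

Call sites still need the `r < 0` check, so this removes the repeated syscall but not the error-handling boilerplate; storing a flag in `arg_mount_settings` from `verify_arguments()`, as the comment suggests, would address both.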
                if (r < 0)
                        return r;
        } else {
                is_mount_point = r > 0;
Claude: nit: bool is_mount_point is declared at the top but not initialized until several lines later. It could be declared at the point of initialization for clarity:
bool is_mount_point = r > 0;
Claude review of PR #8 (5070e78)
Must fix
Suggestions
Nits
force-pushed from 481fd14 to 97987d7
When the container runs in a user namespace, we don't need to protect /proc and /sys by overmounting things. In fact, this is actively harmful because it prevents nested systemd-nspawn from working: to mount procfs and sysfs in a container, the existing instances must not be overmounted, or the kernel will refuse the mount. To make nesting possible, let's stop overmounting parts of /proc and /sys when user namespaces are in use.
force-pushed from 97987d7 to 5070e78
@claude review this pr
@claude review this pr
Fix a typo which causes a segfault when processing a user record
with matchHostname when it's an array instead of a simple string:
$ echo '{"userName":"crashhostarray","perMachine":[{"matchHostname":["host1","host2"],"locked":false}]}' | userdbctl -F -
Segmentation fault (core dumped)
$ coredumpctl info
...
Message: Process 1172301 (userdbctl) of user 1000 dumped core.
Module libz.so.1 from rpm zlib-ng-2.3.3-1.fc43.x86_64
Module libcrypto.so.3 from rpm openssl-3.5.4-2.fc43.x86_64
Stack trace of thread 1172301:
#0 0x00007fded7b3a656 __strcmp_evex (libc.so.6 + 0x159656)
#1 0x00007fded7e95397 per_machine_hostname_match (libsystemd-shared-260.so + 0x295397)
#2 0x00007fded7e955b5 per_machine_match (libsystemd-shared-260.so + 0x2955b5)
#3 0x00007fded7e957c6 dispatch_per_machine (libsystemd-shared-260.so + 0x2957c6)
#4 0x00007fded7e96c97 user_record_load (libsystemd-shared-260.so + 0x296c97)
#5 0x000000000040572d display_user (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x572d)
#6 0x00007fded7ea9727 dispatch_verb (libsystemd-shared-260.so + 0x2a9727)
#7 0x000000000041077c run (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x1077c)
#8 0x00000000004107ce main (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x107ce)
#9 0x00007fded79e45b5 __libc_start_call_main (libc.so.6 + 0x35b5)
#10 0x00007fded79e4668 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x3668)
#11 0x00000000004038d5 _start (/home/fsumsal/repos/@systemd/systemd/build/userdbctl + 0x38d5)
ELF object binary architecture: AMD x86-64
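The trace shows the crash inside `strcmp()` called from `per_machine_hostname_match()`; the typo itself is not shown here. As an illustration only, a sketch of matching a hostname against a value that may be either a single string or an array of strings. The `HostnameMatch` type and `hostname_match()` are hypothetical and do not mirror systemd's JSON variant API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a matchHostname value: either one string or a
 * list of strings. These types are illustrative, not systemd's. */
typedef enum { MATCH_STRING, MATCH_ARRAY } MatchKind;

typedef struct {
        MatchKind kind;
        const char *string;             /* valid when kind == MATCH_STRING */
        const char *const *array;       /* valid when kind == MATCH_ARRAY */
        size_t n_array;
} HostnameMatch;

static bool hostname_match(const HostnameMatch *m, const char *hostname) {
        assert(m);
        assert(hostname);

        switch (m->kind) {
        case MATCH_STRING:
                return m->string && strcmp(m->string, hostname) == 0;
        case MATCH_ARRAY:
                /* Compare against each element, never against the container. */
                for (size_t i = 0; i < m->n_array; i++)
                        if (m->array[i] && strcmp(m->array[i], hostname) == 0)
                                return true;
                return false;
        }

        return false;
}
```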
The fido2_hmac_salt/fido2_hmac_credential/recovery_key fields kept
leaking memory, as the array itself wasn't deallocated after deallocating
the data of each of its elements:
$ build-san/userdbctl -F fuzz-corpus-userdb/auth-fido2.json
...
=================================================================
==1292840==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 112 byte(s) in 1 object(s) allocated from:
#0 0x7f56f00e5e4b in realloc.part.0 (/lib64/libasan.so.8+0xe5e4b) (BuildId: 25975f766867e9e604dc5a71a8befeaed3301942)
#1 0x7f56ed869e42 in greedy_realloc ../src/basic/alloc-util.c:65
#2 0x7f56ed7ff5e9 in dispatch_fido2_hmac_salt ../src/shared/user-record.c:836
#3 0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
#4 0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
#5 0x7f56ed80100b in dispatch_privileged ../src/shared/user-record.c:998
#6 0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
#7 0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
#8 0x7f56ed80622c in user_record_load ../src/shared/user-record.c:1697
#9 0x000000408c15 in display_user ../src/userdb/userdbctl.c:447
#10 0x7f56ed83cc9a in dispatch_verb ../src/shared/verbs.c:137
#11 0x00000041df2b in run ../src/userdb/userdbctl.c:1908
#12 0x00000041dfbe in main ../src/userdb/userdbctl.c:1911
#13 0x7f56ec8105b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
#14 0x7f56ec810667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
#15 0x000000404a44 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/userdbctl+0x404a44) (BuildId: 19e8b7e7b7038d2cea20bc18a55bea2a9e4406d5)
Direct leak of 64 byte(s) in 1 object(s) allocated from:
#0 0x7f56f00e5e4b in realloc.part.0 (/lib64/libasan.so.8+0xe5e4b) (BuildId: 25975f766867e9e604dc5a71a8befeaed3301942)
#1 0x7f56ed869e42 in greedy_realloc ../src/basic/alloc-util.c:65
#2 0x7f56ed7fe779 in dispatch_fido2_hmac_credential_array ../src/shared/user-record.c:775
#3 0x7f56edd73cbc in sd_json_dispatch_full ../src/libsystemd/sd-json/sd-json.c:5204
#4 0x7f56edd745fc in sd_json_dispatch ../src/libsystemd/sd-json/sd-json.c:5276
#5 0x7f56ed80622c in user_record_load ../src/shared/user-record.c:1697
#6 0x000000408c15 in display_user ../src/userdb/userdbctl.c:447
#7 0x7f56ed83cc9a in dispatch_verb ../src/shared/verbs.c:137
#8 0x00000041df2b in run ../src/userdb/userdbctl.c:1908
#9 0x00000041dfbe in main ../src/userdb/userdbctl.c:1911
#10 0x7f56ec8105b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
#11 0x7f56ec810667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667) (BuildId: 2b5beec0fd24fe9c9f43eddfdd5facf0b8a1b805)
#12 0x000000404a44 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/userdbctl+0x404a44) (BuildId: 19e8b7e7b7038d2cea20bc18a55bea2a9e4406d5)
SUMMARY: AddressSanitizer: 176 byte(s) leaked in 2 allocation(s).
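The leak pattern described in this commit message (element data freed, the array holding the elements not) can be illustrated with a small sketch. The `Blob` type and `blob_array_free()` are hypothetical names, not the ones used in `user-record.c`:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element type: a heap-allocated buffer plus its size. */
typedef struct {
        void *data;
        size_t size;
} Blob;

/* Frees the data of every element and then the array itself.
 * Returns NULL so callers can reset their pointer in one statement. */
static Blob *blob_array_free(Blob *items, size_t n) {
        assert(items || n == 0);

        for (size_t i = 0; i < n; i++)
                free(items[i].data);

        free(items);    /* the step whose omission leaks the array */
        return NULL;
}
```

Usage would be along the lines of `items = blob_array_free(items, n_items);`, which leaves no dangling pointer behind.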