mode: auto-detect mode from hardware topology, remove nvrc.mode#138
mode: auto-detect mode from hardware topology, remove nvrc.mode#138zvonkok wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
5626eeb to
247e674
Compare
There was a problem hiding this comment.
Pull request overview
This PR removes the nvrc.mode kernel parameter and shifts mode selection to automatic detection based on hardware topology, wiring the detected mode directly into Fabric Manager configuration/launch logic.
Changes:
- Removes
modeandfabric_modefromNVRC, and deletes related kernel parameter parsing/tests. - Refactors init flow to dispatch behavior based on
mode::detect()(including renaming NVSwitch modes toservicevm-*). - Updates Fabric Manager startup/config generation to be driven by the detected mode string.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/nvrc.rs | Removes NVRC fields tied to manual mode selection. |
| src/main.rs | Switches from kernel-param-based dispatch to mode::detect() with explicit match arms. |
| src/lib.rs | Exposes the new mode module from the library API surface. |
| src/kernel_params.rs | Drops parsing/tests for nvrc.mode and nvrc.fm.mode. |
| src/daemon.rs | Changes Fabric Manager API to accept a mode string and derives FABRIC_MODE from it. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
247e674 to
cdf1ed1
Compare
cdf1ed1 to
a55b321
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// NVSwitch fabric manager is only needed for multi-GPU NVLink topologies. | ||
| /// Disabled by default since most VMs have single GPUs. | ||
| pub fn nv_fabricmanager(&mut self) { | ||
| self.configure_fabricmanager(FM_CONFIG); | ||
| pub fn nv_fabricmanager(&mut self, mode: &str) { | ||
| fs::copy(FM_CONFIG, FM_RUNTIME_CONFIG) | ||
| .or_panic(format_args!("copy {FM_CONFIG} to {FM_RUNTIME_CONFIG}")); | ||
| self.configure_fabricmanager(FM_RUNTIME_CONFIG, mode); | ||
| fs::set_permissions(FM_RUNTIME_CONFIG, fs::Permissions::from_mode(0o400)) | ||
| .or_panic(format_args!("set permissions {FM_RUNTIME_CONFIG}")); | ||
| self.spawn_fabricmanager("/bin/nv-fabricmanager") | ||
| } |
There was a problem hiding this comment.
nv_fabricmanager() always copies /usr/share/.../fabricmanager.cfg to /run/fabricmanager.cfg, chmods it, and spawns nv-fabricmanager even when mode is "cpu" or any unsupported string. This breaks the intent of the "skipped_for_cpu" path (and will also panic in unit tests if /usr/share/... or /run aren’t available), and can accidentally start Fabric Manager with an unconfigured config.
Suggestion: decide whether Fabric Manager should run up-front (e.g., map the passed mode to FABRIC_MODE, and if it’s not one of the supported modes, return early before touching the filesystem or spawning the daemon).
| _ => { | ||
| panic!( | ||
| "unexpected topology: {} NVSWITCH, {} GPU, {} IB_SW_MNG — cannot determine mode", | ||
| nvswitches, gpus, sw_mng | ||
| ); | ||
| } |
There was a problem hiding this comment.
detect_from() panics for any topology outside a few exact (NVSwitch, GPU, SW_MNG) count tuples. Since this PR also removes the nvrc.mode override, any minor hardware variation (different GPU/NVSwitch counts, multiple baseboards, permissions preventing VPD reads, etc.) will hard-fail init and likely prevent the VM from booting.
Suggestion: avoid panic! here—return a best-effort mode (and log a warning), or return a Result<Detection, ...> so callers can choose a safe fallback behavior.
a55b321 to
6e8ccd2
Compare
Replace the nvrc.mode kernel parameter with automatic hardware topology
detection from PCI and InfiniBand sysfs. The mode is now determined by
scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices.
- Remove mode field from NVRC struct
- Remove nvrc.mode kernel parameter parsing
- Pass detected mode directly to configure_fabricmanager
- Simplify mode_gpu bare-metal NVSwitch path
- Rename nvswitch-nvl{4,5} modes to servicevm-nvl{4,5}
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
6e8ccd2 to
77151c7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/kernel_params.rs:41
- The removal of the nvrc.mode kernel parameter eliminates the ability to manually override the operating mode. In the old code, users could set nvrc.mode=cpu to explicitly skip GPU initialization even on systems with GPUs present. With automatic detection, there's no way to force CPU-only mode if GPUs are detected in hardware. Consider whether this use case needs to be supported, or if the automatic detection is always correct. If manual override capability is needed for testing or special configurations, a kernel parameter override mechanism should be added.
for (k, v) in content.split_whitespace().filter_map(|p| p.split_once('=')) {
match k {
"nvrc.log" => nvrc_log(v, self),
"nvrc.uvm.persistence.mode" => uvm_persistenced_mode(v, self),
"nvrc.dcgm" => nvrc_dcgm(v, self),
"nvrc.smi.srs" => nvidia_smi_srs(v, self),
"nvrc.smi.lgc" => nvidia_smi_lgc(v, self),
"nvrc.smi.lmc" => nvidia_smi_lmc(v, self),
"nvrc.smi.pl" => nvidia_smi_pl(v, self),
_ => {}
}
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let fm = match mode { | ||
| "gpu" => "0", | ||
| "servicevm-nvl4" | "servicevm-nvl5" => "1", | ||
| _ => return, |
There was a problem hiding this comment.
The configure_fabricmanager method silently returns for invalid modes (line 134), but the nv_fabricmanager caller still proceeds to spawn the fabricmanager daemon (line 91) with potentially unmodified configuration. If an invalid mode is passed, the daemon will start with whatever default values exist in the original config file, which could lead to incorrect fabric manager behavior. Consider panicking for invalid modes instead of silently returning, or move the mode validation to nv_fabricmanager before copying the config file. This would fail fast and make configuration errors more obvious.
| _ => return, | |
| _ => panic!("Invalid fabric manager mode: {mode}"), |
Replace the nvrc.mode kernel parameter with automatic hardware topology detection from PCI and InfiniBand sysfs. The mode is now determined by scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices.