mode: auto-detect mode from hardware topology, remove nvrc.mode by zvonkok · Pull Request #138 · NVIDIA/nvrc

zvonkok · 2026-02-18T17:46:57Z

Replace the nvrc.mode kernel parameter with automatic hardware topology detection from PCI and InfiniBand sysfs. The mode is now determined by scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices.

Remove mode field from NVRC struct
Remove nvrc.mode kernel parameter parsing
Pass detected mode directly to configure_fabricmanager
Simplify mode_gpu bare-metal NVSwitch path
Rename nvswitch-nvl{4,5} modes to servicevm-nvl{4,5}

Copilot

Pull request overview

This PR removes the nvrc.mode kernel parameter and shifts mode selection to automatic detection based on hardware topology, wiring the detected mode directly into Fabric Manager configuration/launch logic.

Changes:

Removes mode and fabric_mode from NVRC, and deletes related kernel parameter parsing/tests.
Refactors init flow to dispatch behavior based on mode::detect() (including renaming NVSwitch modes to servicevm-*).
Updates Fabric Manager startup/config generation to be driven by the detected mode string.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
src/nvrc.rs	Removes `NVRC` fields tied to manual mode selection.
src/main.rs	Switches from kernel-param-based dispatch to `mode::detect()` with explicit match arms.
src/lib.rs	Exposes the new `mode` module from the library API surface.
src/kernel_params.rs	Drops parsing/tests for `nvrc.mode` and `nvrc.fm.mode`.
src/daemon.rs	Changes Fabric Manager API to accept a mode string and derives FABRIC_MODE from it.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/lib.rs

src/daemon.rs

src/main.rs

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-20T14:55:10Z

src/daemon.rs

    /// NVSwitch fabric manager is only needed for multi-GPU NVLink topologies.
    /// Disabled by default since most VMs have single GPUs.
-    pub fn nv_fabricmanager(&mut self) {
-        self.configure_fabricmanager(FM_CONFIG);
+    pub fn nv_fabricmanager(&mut self, mode: &str) {
+        fs::copy(FM_CONFIG, FM_RUNTIME_CONFIG)
+            .or_panic(format_args!("copy {FM_CONFIG} to {FM_RUNTIME_CONFIG}"));
+        self.configure_fabricmanager(FM_RUNTIME_CONFIG, mode);
+        fs::set_permissions(FM_RUNTIME_CONFIG, fs::Permissions::from_mode(0o400))
+            .or_panic(format_args!("set permissions {FM_RUNTIME_CONFIG}"));
        self.spawn_fabricmanager("/bin/nv-fabricmanager")
    }


nv_fabricmanager() always copies /usr/share/.../fabricmanager.cfg to /run/fabricmanager.cfg, chmods it, and spawns nv-fabricmanager even when mode is "cpu" or any unsupported string. This breaks the intent of the "skipped_for_cpu" path (and will also panic in unit tests if /usr/share/... or /run aren’t available), and can accidentally start Fabric Manager with an unconfigured config.

Suggestion: decide whether Fabric Manager should run up-front (e.g., map the passed mode to FABRIC_MODE, and if it’s not one of the supported modes, return early before touching the filesystem or spawning the daemon).

Copilot · 2026-02-20T14:55:11Z

src/mode.rs

+        _ => {
+            panic!(
+                "unexpected topology: {} NVSWITCH, {} GPU, {} IB_SW_MNG — cannot determine mode",
+                nvswitches, gpus, sw_mng
+            );
+        }


detect_from() panics for any topology outside a few exact (NVSwitch, GPU, SW_MNG) count tuples. Since this PR also removes the nvrc.mode override, any minor hardware variation (different GPU/NVSwitch counts, multiple baseboards, permissions preventing VPD reads, etc.) will hard-fail init and likely prevent the VM from booting.

Suggestion: avoid panic! here—return a best-effort mode (and log a warning), or return a Result<Detection, ...> so callers can choose a safe fallback behavior.

Replace the nvrc.mode kernel parameter with automatic hardware topology detection from PCI and InfiniBand sysfs. The mode is now determined by scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices. - Remove mode field from NVRC struct - Remove nvrc.mode kernel parameter parsing - Pass detected mode directly to configure_fabricmanager - Simplify mode_gpu bare-metal NVSwitch path - Rename nvswitch-nvl{4,5} modes to servicevm-nvl{4,5} Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/kernel_params.rs:41

The removal of the nvrc.mode kernel parameter eliminates the ability to manually override the operating mode. In the old code, users could set nvrc.mode=cpu to explicitly skip GPU initialization even on systems with GPUs present. With automatic detection, there's no way to force CPU-only mode if GPUs are detected in hardware. Consider whether this use case needs to be supported, or if the automatic detection is always correct. If manual override capability is needed for testing or special configurations, a kernel parameter override mechanism should be added.

        for (k, v) in content.split_whitespace().filter_map(|p| p.split_once('=')) {
            match k {
                "nvrc.log" => nvrc_log(v, self),
                "nvrc.uvm.persistence.mode" => uvm_persistenced_mode(v, self),
                "nvrc.dcgm" => nvrc_dcgm(v, self),

                "nvrc.smi.srs" => nvidia_smi_srs(v, self),
                "nvrc.smi.lgc" => nvidia_smi_lgc(v, self),
                "nvrc.smi.lmc" => nvidia_smi_lmc(v, self),
                "nvrc.smi.pl" => nvidia_smi_pl(v, self),
                _ => {}
            }
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-25T19:52:42Z

src/daemon.rs

+        let fm = match mode {
+            "gpu" => "0",
+            "servicevm-nvl4" | "servicevm-nvl5" => "1",
+            _ => return,


The configure_fabricmanager method silently returns for invalid modes (line 134), but the nv_fabricmanager caller still proceeds to spawn the fabricmanager daemon (line 91) with potentially unmodified configuration. If an invalid mode is passed, the daemon will start with whatever default values exist in the original config file, which could lead to incorrect fabric manager behavior. Consider panicking for invalid modes instead of silently returning, or move the mode validation to nv_fabricmanager before copying the config file. This would fail fast and make configuration errors more obvious.

Suggested change

_ => return,

_ => panic!("Invalid fabric manager mode: {mode}"),

Copilot AI review requested due to automatic review settings February 18, 2026 17:46

Copilot started reviewing on behalf of zvonkok February 18, 2026 17:47 View session

zvonkok force-pushed the nvswtich-detection branch from 5626eeb to 247e674 Compare February 18, 2026 17:49

Copilot AI reviewed Feb 18, 2026

View reviewed changes

src/lib.rs Show resolved Hide resolved

src/daemon.rs Outdated Show resolved Hide resolved

src/main.rs Show resolved Hide resolved

zvonkok force-pushed the nvswtich-detection branch from 247e674 to cdf1ed1 Compare February 18, 2026 18:19

Copilot AI review requested due to automatic review settings February 20, 2026 14:50

zvonkok force-pushed the nvswtich-detection branch from cdf1ed1 to a55b321 Compare February 20, 2026 14:50

Copilot started reviewing on behalf of zvonkok February 20, 2026 14:51 View session

Copilot AI reviewed Feb 20, 2026

View reviewed changes

zvonkok force-pushed the nvswtich-detection branch from a55b321 to 6e8ccd2 Compare February 20, 2026 21:40

zvonkok force-pushed the nvswtich-detection branch from 6e8ccd2 to 77151c7 Compare February 25, 2026 19:46

Copilot AI review requested due to automatic review settings February 25, 2026 19:46

Copilot started reviewing on behalf of zvonkok February 25, 2026 19:46 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mode: auto-detect mode from hardware topology, remove nvrc.mode#138

mode: auto-detect mode from hardware topology, remove nvrc.mode#138
zvonkok wants to merge 1 commit intoNVIDIA:mainfrom
zvonkok:nvswtich-detection

zvonkok commented Feb 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 20, 2026

Uh oh!

Copilot AI Feb 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	_ => return,
	_ => panic!("Invalid fabric manager mode: {mode}"),

Conversation

zvonkok commented Feb 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants