Skip to content

mode: auto-detect mode from hardware topology, remove nvrc.mode#138

Open
zvonkok wants to merge 1 commit intoNVIDIA:mainfrom
zvonkok:nvswtich-detection
Open

mode: auto-detect mode from hardware topology, remove nvrc.mode#138
zvonkok wants to merge 1 commit intoNVIDIA:mainfrom
zvonkok:nvswtich-detection

Conversation

@zvonkok
Copy link
Collaborator

@zvonkok zvonkok commented Feb 18, 2026

Replace the nvrc.mode kernel parameter with automatic hardware topology detection from PCI and InfiniBand sysfs. The mode is now determined by scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices.

  • Remove mode field from NVRC struct
  • Remove nvrc.mode kernel parameter parsing
  • Pass detected mode directly to configure_fabricmanager
  • Simplify mode_gpu bare-metal NVSwitch path
  • Rename nvswitch-nvl{4,5} modes to servicevm-nvl{4,5}

Copilot AI review requested due to automatic review settings February 18, 2026 17:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the nvrc.mode kernel parameter and shifts mode selection to automatic detection based on hardware topology, wiring the detected mode directly into Fabric Manager configuration/launch logic.

Changes:

  • Removes mode and fabric_mode from NVRC, and deletes related kernel parameter parsing/tests.
  • Refactors init flow to dispatch behavior based on mode::detect() (including renaming NVSwitch modes to servicevm-*).
  • Updates Fabric Manager startup/config generation to be driven by the detected mode string.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/nvrc.rs Removes NVRC fields tied to manual mode selection.
src/main.rs Switches from kernel-param-based dispatch to mode::detect() with explicit match arms.
src/lib.rs Exposes the new mode module from the library API surface.
src/kernel_params.rs Drops parsing/tests for nvrc.mode and nvrc.fm.mode.
src/daemon.rs Changes Fabric Manager API to accept a mode string and derives FABRIC_MODE from it.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 20, 2026 14:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 83 to 92
/// NVSwitch fabric manager is only needed for multi-GPU NVLink topologies.
/// Disabled by default since most VMs have single GPUs.
pub fn nv_fabricmanager(&mut self) {
self.configure_fabricmanager(FM_CONFIG);
pub fn nv_fabricmanager(&mut self, mode: &str) {
fs::copy(FM_CONFIG, FM_RUNTIME_CONFIG)
.or_panic(format_args!("copy {FM_CONFIG} to {FM_RUNTIME_CONFIG}"));
self.configure_fabricmanager(FM_RUNTIME_CONFIG, mode);
fs::set_permissions(FM_RUNTIME_CONFIG, fs::Permissions::from_mode(0o400))
.or_panic(format_args!("set permissions {FM_RUNTIME_CONFIG}"));
self.spawn_fabricmanager("/bin/nv-fabricmanager")
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nv_fabricmanager() always copies /usr/share/.../fabricmanager.cfg to /run/fabricmanager.cfg, chmods it, and spawns nv-fabricmanager even when mode is "cpu" or any unsupported string. This breaks the intent of the "skipped_for_cpu" path (and will also panic in unit tests if /usr/share/... or /run aren’t available), and can accidentally start Fabric Manager with an unconfigured config.

Suggestion: decide whether Fabric Manager should run up-front (e.g., map the passed mode to FABRIC_MODE, and if it’s not one of the supported modes, return early before touching the filesystem or spawning the daemon).

Copilot uses AI. Check for mistakes.
Comment on lines 90 to 95
_ => {
panic!(
"unexpected topology: {} NVSWITCH, {} GPU, {} IB_SW_MNG — cannot determine mode",
nvswitches, gpus, sw_mng
);
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detect_from() panics for any topology outside a few exact (NVSwitch, GPU, SW_MNG) count tuples. Since this PR also removes the nvrc.mode override, any minor hardware variation (different GPU/NVSwitch counts, multiple baseboards, permissions preventing VPD reads, etc.) will hard-fail init and likely prevent the VM from booting.

Suggestion: avoid panic! here—return a best-effort mode (and log a warning), or return a Result<Detection, ...> so callers can choose a safe fallback behavior.

Copilot uses AI. Check for mistakes.
Replace the nvrc.mode kernel parameter with automatic hardware topology
detection from PCI and InfiniBand sysfs. The mode is now determined by
scanning for NVIDIA GPUs, NVSwitches, and SW_MNG InfiniBand devices.

- Remove mode field from NVRC struct
- Remove nvrc.mode kernel parameter parsing
- Pass detected mode directly to configure_fabricmanager
- Simplify mode_gpu bare-metal NVSwitch path
- Rename nvswitch-nvl{4,5} modes to servicevm-nvl{4,5}

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Copilot AI review requested due to automatic review settings February 25, 2026 19:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/kernel_params.rs:41

  • The removal of the nvrc.mode kernel parameter eliminates the ability to manually override the operating mode. In the old code, users could set nvrc.mode=cpu to explicitly skip GPU initialization even on systems with GPUs present. With automatic detection, there's no way to force CPU-only mode if GPUs are detected in hardware. Consider whether this use case needs to be supported, or if the automatic detection is always correct. If manual override capability is needed for testing or special configurations, a kernel parameter override mechanism should be added.
        for (k, v) in content.split_whitespace().filter_map(|p| p.split_once('=')) {
            match k {
                "nvrc.log" => nvrc_log(v, self),
                "nvrc.uvm.persistence.mode" => uvm_persistenced_mode(v, self),
                "nvrc.dcgm" => nvrc_dcgm(v, self),

                "nvrc.smi.srs" => nvidia_smi_srs(v, self),
                "nvrc.smi.lgc" => nvidia_smi_lgc(v, self),
                "nvrc.smi.lmc" => nvidia_smi_lmc(v, self),
                "nvrc.smi.pl" => nvidia_smi_pl(v, self),
                _ => {}
            }
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let fm = match mode {
"gpu" => "0",
"servicevm-nvl4" | "servicevm-nvl5" => "1",
_ => return,
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configure_fabricmanager method silently returns for invalid modes (line 134), but the nv_fabricmanager caller still proceeds to spawn the fabricmanager daemon (line 91) with potentially unmodified configuration. If an invalid mode is passed, the daemon will start with whatever default values exist in the original config file, which could lead to incorrect fabric manager behavior. Consider panicking for invalid modes instead of silently returning, or move the mode validation to nv_fabricmanager before copying the config file. This would fail fast and make configuration errors more obvious.

Suggested change
_ => return,
_ => panic!("Invalid fabric manager mode: {mode}"),

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants