nvswitch: Add fabric manager modes #134
Pull request overview
Updates NVSwitch/Fabric Manager configuration to support explicit Fabric Manager modes and partition rail policy, and introduces a helper intended to simplify KEY=VALUE config file updates.
Changes:
- Replace nvrc.fabricmanager boolean with nvrc.fm.mode (0/1/2) and add nvrc.fm.rail.policy (greedy/symmetric) kernel params.
- Configure fabricmanager.cfg with FABRIC_MODE, FABRIC_MODE_RESTART, and PARTITION_RAIL_POLICY based on parsed settings.
- Add config module wiring (binary + library) to host the new update_config_file() helper (currently missing from the PR).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/nvrc.rs | Replaces Fabric Manager enable boolean with fabric_mode + rail_policy fields in NVRC state. |
| src/main.rs | Sets fabric_mode=1 in NVSwitch modes and adds mod config; module declaration. |
| src/lib.rs | Exposes config module from the library crate. |
| src/kernel_params.rs | Parses new nvrc.fm.mode and nvrc.fm.rail.policy parameters and updates tests accordingly. |
| src/daemon.rs | Writes Fabric Manager settings into fabricmanager.cfg and uses update_config_file() helper. |
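For reference, the parsing described for src/kernel_params.rs could look roughly like the sketch below. It is not the PR's code: `FmSettings`, `RailPolicy`, and `parse_fm_params` are hypothetical names, and rejecting out-of-range or unknown values with an error is an assumption about the desired behaviour. Only the parameter names and their accepted values come from the overview above.

```rust
/// Hypothetical rail policy type; variant names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
enum RailPolicy {
    Greedy,
    Symmetric,
}

/// Hypothetical container for the parsed Fabric Manager settings.
#[derive(Debug, PartialEq, Default)]
struct FmSettings {
    fabric_mode: Option<u8>,
    rail_policy: Option<RailPolicy>,
}

/// Parse `nvrc.fm.mode` and `nvrc.fm.rail.policy` from a kernel command line.
/// Tokens without `=` and unrelated keys are ignored; invalid values for the
/// two known keys are returned as errors (an assumption, not the PR's policy).
fn parse_fm_params(cmdline: &str) -> Result<FmSettings, String> {
    let mut settings = FmSettings::default();
    for token in cmdline.split_whitespace() {
        let Some((key, value)) = token.split_once('=') else {
            continue;
        };
        match key {
            "nvrc.fm.mode" => {
                let mode: u8 = value
                    .parse()
                    .map_err(|_| format!("invalid nvrc.fm.mode: {value}"))?;
                // The overview lists modes 0, 1 and 2.
                if mode > 2 {
                    return Err(format!("nvrc.fm.mode out of range: {mode}"));
                }
                settings.fabric_mode = Some(mode);
            }
            "nvrc.fm.rail.policy" => {
                settings.rail_policy = Some(match value {
                    "greedy" => RailPolicy::Greedy,
                    "symmetric" => RailPolicy::Symmetric,
                    other => return Err(format!("invalid nvrc.fm.rail.policy: {other}")),
                });
            }
            _ => {}
        }
    }
    Ok(settings)
}
```

Calling `parse_fm_params("quiet nvrc.fm.mode=1 nvrc.fm.rail.policy=symmetric")` yields a mode of 1 and the symmetric policy; a command line without either parameter leaves both fields as `None`, so the caller can decide on defaults.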
fe5e4f8 to 5fdfc25
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Replace boolean nvrc.fabricmanager with nvrc.fm.mode (0=bare metal, 1=servicevm, 2=vgpu) to properly configure FABRIC_MODE and FABRIC_MODE_RESTART in fabricmanager.cfg.
Add nvrc.fm.rail.policy (greedy|symmetric) for PARTITION_RAIL_POLICY. Symmetric policy required for Confidential Computing on Blackwell to ensure memory isolation boundaries during attestation.
Introduce generic update_config_file() helper in src/config.rs to eliminate repetitive KEY=VALUE file manipulation and enable easy addition of future configuration parameters.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
9dc7917 to 76d71f6
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
HGX Bx00 systems use CX7 bridges for NVLink management instead of direct GPU access. GPUs are passed to tenant VMs; only the CX7 IB devices are visible here.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Additionally, introduce a check between NVLSM and FM.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Improve require_root() to prevent running the entire test suite when a single root-requiring test needs to be re-executed with sudo.
- Add the --exact flag when a test filter is present to prevent substring matching (e.g., "test_log" no longer matches "test_log_debug", "test_log_info", etc.).
- Add --test-threads=1 to force serial execution under sudo, preventing race conditions between root-requiring tests.
Before: cargo test test_foo → reruns all test_foo* tests with sudo
After: cargo test test_foo → reruns only test_foo with sudo
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
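The argument handling described in this commit message can be sketched as follows. This is an illustration under assumptions, not the actual require_root() code: `sudo_rerun_args` is a hypothetical helper, and how the real function discovers the test binary path and the active filter is not shown in this conversation. The `--exact` and `--test-threads=1` flags are the standard Rust test harness options named in the commit.

```rust
/// Hypothetical helper: build the argument list for re-running the current
/// test binary under sudo. `filter` is the test name filter, if one was given.
fn sudo_rerun_args(test_binary: &str, filter: Option<&str>) -> Vec<String> {
    let mut args = vec![test_binary.to_string()];
    if let Some(name) = filter {
        args.push(name.to_string());
        // With --exact the harness runs only the test whose full name equals
        // the filter, rather than every test whose name contains it.
        args.push("--exact".to_string());
    }
    // Serialise execution so root-requiring tests cannot race each other.
    args.push("--test-threads=1".to_string());
    args
}
```

The resulting vector could then be handed to `std::process::Command::new("sudo").args(...)`. Note that `--exact` compares against the full test path, so a filter for a test inside a module needs the module prefix.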
Replace boolean nvrc.fabricmanager with nvrc.fm.mode (0=bare metal, 1=servicevm) to properly configure FABRIC_MODE and FABRIC_MODE_RESTART in fabricmanager.cfg.
Add nvrc.fm.rail.policy (greedy|symmetric) for PARTITION_RAIL_POLICY. Symmetric policy required for Confidential Computing on Blackwell to ensure memory isolation boundaries during attestation.
Introduce generic update_config_file() helper in src/config.rs to eliminate repetitive KEY=VALUE file manipulation and enable easy addition of future configuration parameters.
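As a rough idea of what a generic KEY=VALUE helper of this kind might do, here is a minimal sketch. The signature and behaviour are assumptions: the real update_config_file() in src/config.rs is not shown in this conversation, and `update_config_text` is a hypothetical pure function split out here so the logic can be exercised without touching the filesystem.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical core logic: rewrite every existing `KEY=...` line whose key
/// appears in `updates`, and append `KEY=VALUE` for keys that were not found.
/// Comments and unrelated lines are kept unchanged.
fn update_config_text(content: &str, updates: &[(&str, &str)]) -> String {
    let mut seen = vec![false; updates.len()];
    let mut out: Vec<String> = Vec::new();
    for line in content.lines() {
        // A commented-out line such as `#KEY=1` has key `#KEY` and is left alone.
        let key = line.split_once('=').map(|(k, _)| k.trim());
        match updates.iter().position(|(k, _)| Some(*k) == key) {
            Some(i) => {
                seen[i] = true;
                out.push(format!("{}={}", updates[i].0, updates[i].1));
            }
            None => out.push(line.to_string()),
        }
    }
    // Append any keys that did not already exist in the file.
    for (i, (k, v)) in updates.iter().enumerate() {
        if !seen[i] {
            out.push(format!("{k}={v}"));
        }
    }
    let mut text = out.join("\n");
    text.push('\n');
    text
}

/// Assumed shape of the file-level helper: read, update, write back.
fn update_config_file(path: &Path, updates: &[(&str, &str)]) -> io::Result<()> {
    let content = fs::read_to_string(path)?;
    fs::write(path, update_config_text(&content, updates))
}
```

Under these assumptions, a caller would pass pairs such as `("FABRIC_MODE", "1")` and `("PARTITION_RAIL_POLICY", "symmetric")` in one call; the concrete values written for each nvrc.fm.mode are defined by the PR, not by this sketch.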