Merged
3 changes: 2 additions & 1 deletion .github/workflows/coverage.yaml
@@ -62,8 +62,9 @@ jobs:
# Run as root: some tests require privileged operations (modprobe, /proc writes).
# The #[cfg(coverage)] paths in require_root() panic if not run as root,
# ensuring coverage accurately reflects execution with proper permissions.
# --test-threads=1 forces serial execution to avoid /dev/log socket conflicts.
- name: Generate coverage
run: sudo -E env "PATH=$PATH" cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info -- --include-ignored
run: sudo -E env "PATH=$PATH" cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info -- --include-ignored --test-threads=1

- name: Upload coverage to Coveralls
uses: coverallsapp/github-action@cfd0633edbd2411b532b808ba7a8b5e04f76d2c8 # v2.3.4
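
The comment in this hunk explains that serial execution avoids conflicts on the shared `/dev/log` socket. As a point of comparison, the same effect can be scoped to only the tests that touch a shared resource by having them take a process-wide lock. The sketch below is an illustration of that alternative, not what this PR does; `SERIAL`, `with_serial_lock`, and `max_concurrency` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical process-wide lock standing in for "exclusive use of /dev/log".
static SERIAL: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the lock. A poisoned lock (an earlier holder
/// panicked) is still usable because the guarded value is `()`.
pub fn with_serial_lock<T>(f: impl FnOnce() -> T) -> T {
    let _guard = SERIAL.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    f()
}

/// Sends `workers` threads through the lock and reports the highest number
/// of threads that were inside the critical section at the same time.
pub fn max_concurrency(workers: usize) -> usize {
    let active = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| {
                with_serial_lock(|| {
                    let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    active.fetch_sub(1, Ordering::SeqCst);
                })
            });
        }
    });
    peak.load(Ordering::SeqCst)
}
```

A lock like this only serializes threads inside one test binary, and every test that touches the resource has to remember to take it. `--test-threads=1` is coarser but needs no changes to the tests.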
73 changes: 37 additions & 36 deletions README.md
@@ -21,36 +21,36 @@ recovery mechanisms—if GPU initialization fails, the VM powers off. This
## Architecture

```text
┌────────────────────────────────────────────────────────────────────────────────────┐
│                                    NVRC (PID 1)                                    │
│                                                                                    │
│  1. Set panic hook (power off VM on panic)                                         │
│  2. Mount filesystems (/proc, /dev, /sys, /dev/shm)                                │
│  3. Initialize kernel message logging                                              │
│  4. Start syslog daemon                                                            │
│  5. Remount / as read-only (security hardening)                                    │
│  6. Parse kernel parameters (/proc/cmdline)                                        │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                          Mode Selection (nvrc.mode)                          │  │
│  │  ┌────────────────┐  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐  │  │
│  │  │ GPU (default)  │  │   CPU Mode   │  │ NVSwitch-NVL4  │  │ NVSwitch-NVL5│  │  │
│  │  │ • nvidia.ko    │  │ • Skip GPU   │  │ (H100/H200)    │  │ (B200/B300)  │  │  │
│  │  │ • nvidia-uvm   │  │ • Jump to    │  │ • nvidia.ko    │  │ • ib_umad    │  │  │
│  │  │ • Lock clocks  │  │   kata-agent │  │ • fabric-mgr   │  │ • fabric-mgr │  │  │
│  │  │ • Lock memory  │  │              │  │ • Check daemons│  │ • NVLSM auto │  │  │
│  │  │ • Power limit  │  │              │  │ • Jump agent   │  │ • Jump agent │  │  │
│  │  │ • Daemons      │  │              │  │                │  │              │  │  │
│  │  │ • CDI spec     │  │              │  │                │  │              │  │  │
│  │  │ • SRS config   │  │              │  │                │  │              │  │  │
│  │  └────────────────┘  └──────────────┘  └────────────────┘  └──────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
│  7. Check daemon health (fail if any crashed)                                      │
│  8. Disable kernel module loading (lockdown)                                       │
│  9. Fork kata-agent (handoff control)                                              │
│ 10. Poll syslog forever (keep PID 1 alive)                                         │
└────────────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│                          NVRC (PID 1)                          │
│                                                                │
│  1. Set panic hook (power off VM on panic)                     │
│  2. Mount filesystems (/proc, /dev, /sys, /dev/shm)            │
│  3. Initialize kernel message logging                          │
│  4. Start syslog daemon                                        │
│  5. Parse kernel parameters (/proc/cmdline)                    │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                Mode Selection (nvrc.mode)                │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │  │
│  │  │GPU (default)│  │  CPU Mode   │  │NVSwitch-NVL4│  ...  │  │
│  │  │• nvidia.ko  │  │• Skip GPU   │  │(H100/H200)  │       │  │
│  │  │• nvidia-uvm │  │             │  │• nvidia.ko  │       │  │
│  │  │• Lock clocks│  │             │  │• fabric-mgr │       │  │
│  │  │• Lock memory│  │             │  │• Health chk │       │  │
│  │  │• Power limit│  │             │  │             │       │  │
│  │  │• Daemons    │  │             │  │             │       │  │
│  │  │• CDI spec   │  │             │  │             │       │  │
│  │  │• SRS config │  │             │  │             │       │  │
│  │  │• Health chk │  │             │  │             │       │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  6. Remount / as read-only (security hardening)                │
│  7. Disable kernel module loading (lockdown)                   │
│  8. Fork kata-agent (handoff control)                          │
│  9. Poll syslog forever (keep PID 1 alive)                     │
└────────────────────────────────────────────────────────────────┘
```
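
The mode-selection step in the diagram can be pictured as a small parser over the kernel command line. The following is a minimal sketch, not NVRC's actual implementation. Only `nvrc.mode=gpu` and the GPU default come from this README; the spellings `cpu` and `nvswitch-nvl4`, the `Mode` enum, the `parse_mode` helper, and the last-occurrence-wins rule are all assumptions made for illustration.

```rust
/// Hypothetical representation of the modes shown in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Gpu,
    Cpu,
    NvSwitchNvl4,
}

/// Picks the mode from a kernel command line such as the contents of
/// /proc/cmdline. If `nvrc.mode=` appears more than once, the last one wins
/// (an assumption, not documented NVRC behavior).
pub fn parse_mode(cmdline: &str) -> Result<Mode, String> {
    let value = cmdline
        .split_whitespace()
        .filter_map(|token| token.strip_prefix("nvrc.mode="))
        .last();

    match value {
        // The diagram marks GPU as the default mode.
        None | Some("gpu") => Ok(Mode::Gpu),
        // Assumed spellings; the README excerpt only shows `gpu`.
        Some("cpu") => Ok(Mode::Cpu),
        Some("nvswitch-nvl4") => Ok(Mode::NvSwitchNvl4),
        Some(other) => Err(format!("unknown nvrc.mode: {other}")),
    }
}
```

Returning a `Result` leaves the reaction to an unknown value up to the caller; how NVRC itself treats that case is not stated in this excerpt.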

## Kernel Parameters
@@ -77,11 +77,12 @@ configuration doesn't exist yet.

### Daemon Control

| Parameter | Values | Default | Description |
| --------------------------- | --------------------------------------- | ------- | ------------------------------------------------------------------------------ |
| `nvrc.uvm.persistence.mode` | `on/off`, `true/false`, `1/0`, `yes/no` | `true` | UVM persistence mode keeps unified memory state across CUDA context teardowns. |
| `nvrc.dcgm` | `on/off`, `true/false`, `1/0`, `yes/no` | `false` | Enable DCGM (Data Center GPU Manager) for telemetry and health monitoring. |
| `nvrc.fabricmanager` | `on/off`, `true/false`, `1/0`, `yes/no` | `false` | Enable Fabric Manager for NVLink/NVSwitch multi-GPU communication. |
| Parameter | Values | Default | Description |
| --------------------------- | --------------------------------------- | -------- | -------------------------------------------------------------------------------------------------- |
| `nvrc.uvm.persistence.mode` | `on/off`, `true/false`, `1/0`, `yes/no` | `true` | UVM persistence mode keeps unified memory state across CUDA context teardowns. |
| `nvrc.dcgm` | `on/off`, `true/false`, `1/0`, `yes/no` | `false` | Enable DCGM (Data Center GPU Manager) for telemetry and health monitoring. |
| `nvrc.fm.mode` | `0`, `1` | - | Fabric Manager mode: 0=bare metal, 1=servicevm (shared nvswitch). Auto-set in nvswitch modes. |
| `nvrc.fm.rail.policy` | `greedy`, `symmetric` | `greedy` | Partition rail policy. Symmetric required for Confidential Computing on Blackwell. |
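
To make the value columns concrete, here is a minimal sketch of how the boolean spellings and the rail policy values listed in the table could be mapped to typed settings. `parse_switch`, `RailPolicy`, and `parse_rail_policy` are hypothetical names, and the exact-match (case-sensitive) behavior is an assumption; the table does not say how NVRC treats other spellings.

```rust
/// Maps the boolean spellings from the table to `bool`.
/// Anything else yields `None` so the caller can decide how to react.
pub fn parse_switch(value: &str) -> Option<bool> {
    match value {
        "on" | "true" | "1" | "yes" => Some(true),
        "off" | "false" | "0" | "no" => Some(false),
        _ => None,
    }
}

/// Hypothetical type for `nvrc.fm.rail.policy`; `greedy` is the documented default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RailPolicy {
    #[default]
    Greedy,
    Symmetric,
}

/// Maps the two documented policy names to `RailPolicy`.
pub fn parse_rail_policy(value: &str) -> Option<RailPolicy> {
    match value {
        "greedy" => Some(RailPolicy::Greedy),
        "symmetric" => Some(RailPolicy::Symmetric),
        _ => None,
    }
}
```

For example, `parse_switch("yes")` gives `Some(true)`, and a missing `nvrc.fm.rail.policy` parameter would fall back to `RailPolicy::default()`, which is `Greedy`.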

### Example Configurations

@@ -124,7 +125,7 @@ nvrc.mode=gpu nvrc.dcgm=on nvrc.log=info
**Multi-GPU with NVLink:**

```text
nvrc.mode=gpu nvrc.fabricmanager=on nvrc.log=debug
nvrc.mode=gpu nvrc.fm.mode=0 nvrc.log=debug
```
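
A parameter line like the one above is a whitespace-separated list of `key=value` tokens. The sketch below shows one way such a line could be reduced to only its `nvrc.*` settings; `nvrc_params` is a hypothetical helper and not part of NVRC.

```rust
use std::collections::BTreeMap;

/// Collects the `nvrc.*` key/value pairs from a kernel command line.
/// Tokens without `=` and keys outside the `nvrc.` namespace are skipped;
/// if a key repeats, the later value replaces the earlier one.
pub fn nvrc_params(cmdline: &str) -> BTreeMap<&str, &str> {
    cmdline
        .split_whitespace()
        .filter_map(|token| token.split_once('='))
        .filter(|(key, _)| key.starts_with("nvrc."))
        .collect()
}
```

Applied to `nvrc.mode=gpu nvrc.fm.mode=0 nvrc.log=debug`, the map holds three entries, with `nvrc.fm.mode` mapped to `0`.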

## Build
175 changes: 175 additions & 0 deletions src/config.rs
@@ -0,0 +1,175 @@
// SPDX-License-Identifier: Apache-2.0
// Copyright (c) NVIDIA CORPORATION

//! Generic KEY=VALUE configuration file utilities.

use crate::macros::ResultExt;
use log::debug;
use std::collections::HashSet;
use std::fs;

/// Updates KEY=VALUE pairs in a config file, adding them if missing.
/// Existing keys are updated in place; new keys are appended to the end.
pub fn update_config_file(path: &str, updates: &[(&str, &str)]) {
    let content = fs::read_to_string(path).or_panic(format_args!("read {path}"));

    let mut lines: Vec<String> = content.lines().map(String::from).collect();
    let mut found_keys: HashSet<&str> = HashSet::new();

    // Update existing lines
    for line in &mut lines {
        let trimmed = line.trim();
        for (key, value) in updates {
            if trimmed.starts_with(&format!("{}=", key)) {
                *line = format!("{}={}", key, value);
                found_keys.insert(key);
                debug!("{}: {}={}", path, key, value);
                break;
            }
        }
    }

    // Add missing keys
    for (key, value) in updates {
        if !found_keys.contains(key) {
            lines.push(format!("{}={}", key, value));
            debug!("{}: {}={}", path, key, value);
        }
    }

    let updated = lines.join("\n") + "\n";
    fs::write(path, updated).or_panic(format_args!("write {path}"));
}
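
`update_config_file` combines file I/O, the crate-internal `or_panic` helper, and the text rewrite. To look at the rewrite rule on its own, the sketch below restates it as a pure function over strings. `apply_updates` is a hypothetical name and this is an illustration of the same update-or-append behavior, not code from the PR.

```rust
use std::collections::HashSet;

/// Rewrites `KEY=VALUE` lines in `content`: a line whose trimmed text starts
/// with `KEY=` is replaced by `KEY=<new value>`, all other lines are kept
/// verbatim, and keys that never matched are appended at the end.
pub fn apply_updates(content: &str, updates: &[(&str, &str)]) -> String {
    let mut found: HashSet<&str> = HashSet::new();
    let mut lines: Vec<String> = Vec::new();

    for line in content.lines() {
        let trimmed = line.trim();
        // Require the `=` right after the key so that FABRIC_MODE does not
        // match FABRIC_MODE_RESTART.
        let hit = updates.iter().find(|(key, _)| {
            trimmed
                .strip_prefix(*key)
                .is_some_and(|rest| rest.starts_with('='))
        });
        match hit {
            Some((key, value)) => {
                found.insert(*key);
                lines.push(format!("{key}={value}"));
            }
            None => lines.push(line.to_string()),
        }
    }

    // Append keys that were not present in the original text.
    for (key, value) in updates {
        if !found.contains(*key) {
            lines.push(format!("{key}={value}"));
        }
    }

    lines.join("\n") + "\n"
}
```

Keeping the transformation free of I/O means it can be exercised without temporary files, while a thin wrapper remains responsible for reading and writing the path.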

#[cfg(test)]
mod tests {
    use super::*;
    use std::fs;
    use tempfile::NamedTempFile;

    #[test]
    fn test_update_config_file_add_new_keys() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        // Start with empty file
        fs::write(path, "").unwrap();

        update_config_file(path, &[("KEY1", "value1"), ("KEY2", "value2")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1=value1"));
        assert!(content.contains("KEY2=value2"));
    }

    #[test]
    fn test_update_config_file_update_existing_keys() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        // Start with existing content
        fs::write(path, "KEY1=oldvalue\nKEY2=oldvalue\n").unwrap();

        update_config_file(path, &[("KEY1", "newvalue"), ("KEY2", "newvalue")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1=newvalue"));
        assert!(content.contains("KEY2=newvalue"));
        assert!(!content.contains("oldvalue"));
    }

    #[test]
    fn test_update_config_file_mixed_update_and_add() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        // Start with one existing key
        fs::write(path, "KEY1=oldvalue\n").unwrap();

        update_config_file(path, &[("KEY1", "updated"), ("KEY2", "new")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1=updated"));
        assert!(content.contains("KEY2=new"));
        assert!(!content.contains("oldvalue"));
    }

    #[test]
    fn test_update_config_file_preserves_other_lines() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        // Start with mixed content
        fs::write(path, "# Comment\nKEY1=old\nOTHER=unchanged\n").unwrap();

        update_config_file(path, &[("KEY1", "new")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("# Comment"));
        assert!(content.contains("KEY1=new"));
        assert!(content.contains("OTHER=unchanged"));
    }

    #[test]
    fn test_update_config_file_with_spaces() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        fs::write(path, " KEY1=old \n").unwrap();

        update_config_file(path, &[("KEY1", "new")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1=new"));
    }

    #[test]
    fn test_update_config_file_empty_value() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        fs::write(path, "").unwrap();

        update_config_file(path, &[("KEY1", "")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1="));
    }

    #[test]
    fn test_update_config_file_multiple_updates_same_key() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        fs::write(path, "KEY1=old\n").unwrap();

        // Update twice
        update_config_file(path, &[("KEY1", "first")]);
        update_config_file(path, &[("KEY1", "second")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("KEY1=second"));
        assert!(!content.contains("first"));
    }

    #[test]
    fn test_update_config_file_similar_key_names() {
        let tmpfile = NamedTempFile::new().unwrap();
        let path = tmpfile.path().to_str().unwrap();

        // Test that FABRIC_MODE_RESTART doesn't match FABRIC_MODE
        fs::write(path, "FABRIC_MODE=0\nFABRIC_MODE_RESTART=0\n").unwrap();

        update_config_file(path, &[("FABRIC_MODE", "1")]);

        let content = fs::read_to_string(path).unwrap();
        assert!(content.contains("FABRIC_MODE=1"));
        assert!(content.contains("FABRIC_MODE_RESTART=0"));
    }

    #[test]
    #[should_panic(expected = "read")]
    fn test_update_config_file_nonexistent_file() {
        update_config_file("/nonexistent/path/file.cfg", &[("KEY", "value")]);
    }
}
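
The tests above cover leading and trailing whitespace and keys that share a prefix. One case they do not cover is whitespace around the `=` itself: because the match is a `starts_with("KEY=")` on the trimmed line, a line written as `KEY1 = old` would not match, and `KEY1` would be appended as a new line instead. Whether the target config files ever use that spelling is not stated here. If they did, a more tolerant key extraction could look like the sketch below; `line_key` is a hypothetical helper, not part of this PR.

```rust
/// Returns the key of a `KEY=VALUE` line, tolerating whitespace around `=`.
/// Comment lines (starting with `#`) and lines without `=` yield `None`.
pub fn line_key(line: &str) -> Option<&str> {
    let (key, _value) = line.split_once('=')?;
    let key = key.trim();
    if key.is_empty() || key.starts_with('#') {
        None
    } else {
        Some(key)
    }
}
```

Comparing the extracted key for equality, rather than testing a prefix, also keeps `FABRIC_MODE` and `FABRIC_MODE_RESTART` distinct.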