feat(podvm): opt-in NVIDIA 595.71.05 driver for B200 multi-GPU CC by alhassankhedr-cohere · Pull Request #29 · cohere-ai/cloud-api-adaptor

alhassankhedr-cohere · 2026-05-08T13:30:27Z

Summary

Adds a new b200_cc_drivers workflow_dispatch flag (default false) to the Build PodVM Image (Cohere) workflow. When enabled, the build swaps the in-image NVIDIA stack from the default 580 LTS to 595.71.05 (open-kernel branch) — the minimum driver / fabricmanager required to enable Confidential Computing on multi-GPU B200 hosts (NVSwitch fabric requires the 595.x series with TDISP/CC support).

When the flag is enabled the workflow:

sed-replaces the four NVIDIA package pins in mkosi.presets/system/mkosi.conf.d/ubuntu.conf before mkosi runs:
- nvidia-driver-580-open → nvidia-driver-open=595.71.05-1ubuntu1 (the 595 branch only ships the unversioned metapackage in NVIDIA's CUDA repo — no nvidia-driver-595-open).
- nvidia-persistenced=595.71.05-1ubuntu1
- nvidia-fabricmanager=595.71.05-1ubuntu1
- libnvidia-nscq=595.71.05-1ubuntu1
- Patterns are anchored on the package name (580.x.y is matched by =.*) so this survives future bumps to the 580 baseline (e.g. fix(podvm): bump NVIDIA driver pins to 580.159.03 #28).
Suffixes tag / OCI tag / GCP image names with -cc595 so a CC build never silently overwrites the standard cohere-latest artifact (e.g. podvm-ubuntu-tdx-release-cohere-latest-cc595).
Records the resolved driver version (extracted from ubuntu.conf at build time, so it's accurate for both standard and CC builds) in measurements.json as nvidia_driver and as a com.cohere.nvidia.driver annotation on the published OCI artifact.
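
The pin rewrite described in the list above can be sketched in Python. This is a minimal illustration rather than the workflow's actual sed step: `PIN_OVERRIDES` and `override_pins` are hypothetical names, and the sample conf text is invented for the example.

```python
import re

CC_VERSION = "595.71.05-1ubuntu1"

# (pattern anchored on the package name, replacement pin).
# Hypothetical table; the real workflow uses sed expressions.
PIN_OVERRIDES = [
    (r"nvidia-driver-580-open(?:=\S*)?", f"nvidia-driver-open={CC_VERSION}"),
    (r"nvidia-persistenced=\S*", f"nvidia-persistenced={CC_VERSION}"),
    (r"nvidia-fabricmanager=\S*", f"nvidia-fabricmanager={CC_VERSION}"),
    (r"libnvidia-nscq=\S*", f"libnvidia-nscq={CC_VERSION}"),
]


def override_pins(conf_text: str) -> str:
    """Rewrite the four NVIDIA pins, preserving each line's indentation."""
    for pattern, replacement in PIN_OVERRIDES:
        regex = re.compile(rf"^([ \t]*){pattern}[ \t]*$", re.MULTILINE)
        # A callable replacement avoids backslash-escape surprises.
        conf_text = regex.sub(lambda m, r=replacement: m.group(1) + r, conf_text)
    return conf_text


sample = (
    "Packages=\n"
    "    nvidia-driver-580-open=580.126.20-1ubuntu1\n"
    "    nvidia-persistenced=580.126.20-1ubuntu1\n"
    "    nvidia-fabricmanager=580.126.20-1ubuntu1\n"
    "    libnvidia-nscq=580.126.20-1ubuntu1\n"
)
print(override_pins(sample))
```

Because each pattern matches on the package name and accepts any version after `=`, the rewrite does not depend on which 580.x.y release the baseline currently pins.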

All four 595.71.05-1ubuntu1 packages were verified present in NVIDIA's CUDA apt repo for ubuntu2404/x86_64 (already wired up and pinned to priority 1001 in Dockerfile.mkosi.ubuntu).

Default behaviour is unchanged: the flag is off, push events on the cohere branch and podvm-v* tags continue to ship 580.x.

Attestation impact

CC-driver builds report a different x-nvidia-gpu-driver-version (595.71.05) and a different RTMR2 (different kernel modules in initrd / driver firmware). Any ITA appraisal policies need a parallel 595.71.05 profile before the CC image is rolled into a prod attestation flow. Standard builds are unaffected.

Test plan

Trigger Build PodVM Image (Cohere) via workflow_dispatch with b200_cc_drivers = false and confirm:
- Output image tag stays …-cohere-latest-{release,debug} (no -cc595 suffix).
- measurements.json nvidia_driver matches whatever ubuntu.conf pins on cohere.
Trigger Build PodVM Image (Cohere) via workflow_dispatch with b200_cc_drivers = true and confirm:
- "Override NVIDIA driver to 595.71.05" step runs and grep output shows the four pins flipped.
- mkosi successfully resolves nvidia-driver-open=595.71.05-1ubuntu1 and friends from the CUDA repo.
- Output image tag is suffixed with -cc595 and OCI artifact has com.cohere.nvidia.driver=595.71.05 annotation.
Boot the resulting image on a multi-GPU B200 instance and confirm nvidia-fabricmanager.service reaches active and CC mode is reported by nvidia-smi conf-compute -f.
Update ITA appraisal policies with a 595.71.05 profile before promoting the image.
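
The `nvidia_driver` check in the test plan could be automated along these lines. This is a sketch under assumptions: it assumes `measurements.json` stores the bare upstream version (for example `595.71.05`, as in the annotation value above), and the function names are hypothetical.

```python
import json
import re

# Matches either the branch-suffixed or the unversioned open-driver metapackage.
DRIVER_PIN = re.compile(
    r"^[ \t]*nvidia-driver(?:-\d+)?-open=(\d+(?:\.\d+)+)", re.MULTILINE
)


def resolved_driver_version(conf_text: str) -> str:
    """Extract the upstream driver version from the pinned package line."""
    match = DRIVER_PIN.search(conf_text)
    if match is None:
        raise ValueError("no nvidia-driver-*-open pin found in conf")
    return match.group(1)


def check_measurements(conf_text: str, measurements_json: str) -> bool:
    """True when measurements.json records the version that the conf pins."""
    recorded = json.loads(measurements_json).get("nvidia_driver")
    return recorded == resolved_driver_version(conf_text)
```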

Notes

Independent of fix(podvm): bump NVIDIA driver pins to 580.159.03 #28 — this PR is based on cohere, not stacked on the 580 bump branch.

Note

Medium Risk
Modifies PodVM build pipeline and Ubuntu image contents (NVIDIA driver stack, systemd units, and low-level boot/partition sizing), which can impact GPU initialization and image reproducibility. Default behavior is mostly unchanged, but enabling the new flag produces materially different artifacts and attestation measurements.

Overview
Adds an opt-in b200_cc_drivers toggle to the Cohere PodVM GitHub Actions workflow to build a separate -cc595 image variant, record the resolved NVIDIA driver version in measurements.json, and publish it as an OCI annotation.

Updates the Ubuntu mkosi image to support B200 multi-GPU confidential compute by wiring an additional NVIDIA DOCA apt repo (pinned as low-priority fallback), expanding/pinning the NVIDIA package set around the R595 stack (incl. Fabric Manager dev headers, IMEX/NVLink components, RDMA/DCGM tooling), and increasing the debug root partition sizing to avoid build/boot space failures.

Hardens runtime/boot behavior by adding a post-install Ubuntu-only guard, baking an fmctl-probe binary into the image during mkosi postinst, ensuring ib_umad loads for Fabric Manager, gating nvidia-imex on GPU presence with a bounded startup timeout, and adding a pre-start wait (wait-nvlink-fabric.sh) before nvidia-persistenced to avoid NVLink fabric readiness races.

Reviewed by Cursor Bugbot for commit 077f479.

Adds a `b200_cc_drivers` workflow_dispatch flag (default false) to `Build PodVM Image (Cohere)` that swaps the in-image NVIDIA stack from the default 580 LTS to the 595.71.05 open-kernel branch. This driver is required to enable Confidential Computing on multi-GPU B200 hosts (NVSwitch fabric requires fabricmanager 595.x with TDISP/CC support).

When enabled the workflow:

* sed-replaces the four NVIDIA package pins in `mkosi.presets/system/mkosi.conf.d/ubuntu.conf` before mkosi runs (`nvidia-driver-580-open` -> unversioned `nvidia-driver-open=595.71.05-1ubuntu1`, plus `nvidia-persistenced`, `nvidia-fabricmanager`, `libnvidia-nscq` pinned to `595.71.05-1ubuntu1`). Patterns are package-name anchored so they survive future 580.x.y baseline bumps.
* suffixes `tag` / image names with `-cc595` so a CC build never silently overwrites the standard `cohere-latest` artifact.
* records the resolved driver version (extracted from the conf at build time, so it stays accurate for both standard and CC builds) in `measurements.json` and as a `com.cohere.nvidia.driver` OCI annotation on the published artifact.

All four packages were verified present in NVIDIA's CUDA repo for ubuntu2404/x86_64 (already pinned to priority 1001 in `Dockerfile.mkosi.ubuntu`). The 595 branch only ships the unversioned `nvidia-driver-open` metapackage; pinning by version selects the correct branch.

Default behaviour is unchanged: the flag is off and standard builds continue to ship 580.x.

The attestation-agent was compiled with --features nvidia-attester but the nv-attestation-sdk-sys crate requires libnvat (the NVIDIA Attestation SDK C++ library) to be pre-built. Without it, the build panics or the feature is silently excluded, producing an AA binary that cannot collect GPU attestation evidence.

Changes:
- Install cmake, libclang-dev, and NVAT runtime deps in gc_builder stage
- Clone and build libnvat from NVIDIA/attestation-sdk before cargo build
- Set NVAT_USE_SYSTEM_LIB=1 so the sys crate links against the installed lib
- Copy libnvat.so into the final PodVM image tree
- Add libcurl4t64, libxml2, libxmlsec1-openssl, pciutils to mkosi packages (runtime deps for libnvat and lspci for Fabric Manager NVL5 detection)

cursor · 2026-05-12T04:28:47Z

+# which must also be present in the final PodVM image at runtime.
+RUN set -e; \
+    if echo "${AA_FEATURES}" | grep -q "nvidia-attester"; then \
+      git clone --depth 1 --branch "${NVAT_TAG}" "${NVAT_REPO}" /build/nvat && \


🔒 Agentic Security Review
Severity: MEDIUM

The new build step clones nvidia-attestation-sdk from a mutable Git tag (NVAT_TAG) and builds it directly without immutable pinning or integrity verification. This weakens the supply-chain trust boundary for PodVM artifacts.

Impact: If the upstream tag is retargeted or the source repo is compromised, malicious code could be compiled into libnvat and shipped in the resulting image.
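
One common mitigation for this class of finding, which this PR does not implement, is to accept only full commit SHAs for build-time refs. The sketch below illustrates such a gate; `is_immutable_ref` and `require_immutable` are hypothetical helpers, not part of the repository.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_immutable_ref(ref: str) -> bool:
    """True only for a full 40-hex-digit commit SHA-1; tags and branches can move."""
    return FULL_SHA.fullmatch(ref) is not None


def require_immutable(name: str, ref: str) -> str:
    """Return ref unchanged, or raise if it is a tag, branch, or short SHA."""
    if not is_immutable_ref(ref):
        raise ValueError(f"{name}={ref!r} is not a full commit SHA; refusing to build")
    return ref
```

A gate like this would reject both a tag name passed as `NVAT_TAG` and an abbreviated SHA, since neither identifies a commit unambiguously and permanently.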

…olves

The Cargo.lock on the guest-components cohere branch was generated without the nvidia-attester feature, so nv-attestation-sdk was absent. Building with --locked silently skipped the dependency, resulting in an attestation-agent binary with NvAttester symbols (from libnvat linkage) but no runtime nvidia detection code compiled in.

Add `cargo update --workspace` before the locked build so new optional feature dependencies are resolved into the lockfile first.

Two fixes for B200 multi-GPU CC builds:

1. Patch detect_platform() after cloning guest-components to accept multi-GPU systems (count >= 1 instead of count == 1). Without this, the nvidia-attester silently skips registration on systems with more than one GPU. Temporary until cohere-ai/guest-components#7 merges.
2. Increase debug image root partition from Minimize=guess to a fixed 12G. The NVIDIA 595 drivers make the root filesystem too large for systemd-repart's size estimation, causing "No space left on device" during mkfs.ext4.

cursor · 2026-05-12T19:01:22Z

+      # Refresh lockfile so optional feature deps (e.g. nv-attestation-sdk
+      # for nvidia-attester) are resolved even if the checked-in Cargo.lock
+      # was generated without them.
+      cargo update --workspace


🔒 Agentic Security Review
Severity: HIGH

The new cargo update --workspace step rewrites dependency resolution from live registries during image builds, then cargo build --locked only enforces that freshly-updated lock state. This removes the protection of building from a pre-reviewed, committed dependency graph.

Impact: A malicious or compromised transitive crate release could be silently pulled into attestation-agent at build time and shipped in PodVM artifacts without an explicit dependency-pin change in this repository.
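
An alternative consistent with this concern is to keep the committed Cargo.lock authoritative and fail the build when a required crate is absent, instead of refreshing the lockfile at build time. The following is a sketch only: it relies on the `name = "..."` lines that Cargo.lock uses inside `[[package]]` tables, and the function names are hypothetical.

```python
import re

# Each [[package]] table in Cargo.lock carries a line of the form: name = "crate"
PACKAGE_NAME = re.compile(r'^name = "([^"]+)"$', re.MULTILINE)


def locked_packages(lock_text: str) -> set[str]:
    """Return the set of crate names recorded in the lockfile text."""
    return set(PACKAGE_NAME.findall(lock_text))


def missing_from_lock(lock_text: str, required: list[str]) -> list[str]:
    """List the required crates that the committed lockfile does not contain."""
    present = locked_packages(lock_text)
    return [name for name in required if name not in present]
```

A build step could call `missing_from_lock(lock_text, ["nv-attestation-sdk"])` and stop with an explicit error when the list is non-empty, so that the lockfile is regenerated and reviewed in the guest-components fork rather than inside the image build.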

The nvidia-attester count==1 bug is now fixed upstream via guest-components PR #9 (sync main → cohere), which brings in the full NVAT SDK rewrite. The Dockerfile sed workaround is no longer needed.

The new nv-attestation-sdk-sys build.rs expects nvat.h at /usr/include/nvat.h when NVAT_USE_SYSTEM_LIB=1 is set. Set CMAKE_INSTALL_PREFIX=/usr so headers and libs install to /usr/include and /usr/lib instead of /usr/local.

The previous commit changed CMAKE_INSTALL_PREFIX to /usr, but the COPY step still looked for libnvat in /usr/local/lib. Also install to /usr/lib in the guest image so ldconfig finds it without extra configuration.

On Debian, CMAKE_INSTALL_LIBDIR defaults to lib/x86_64-linux-gnu, so libnvat.so ends up at /usr/lib/x86_64-linux-gnu/ instead of /usr/lib/. The COPY step and runtime linker then can't find it. Force CMAKE_INSTALL_LIBDIR=lib so the library installs to /usr/lib/ consistently.

1. NVAT build step now installs its own deps (cmake, libcurl, etc.) independently of the guest-components block, so nvidia-attester works even if CUSTOM_GC_BINARIES is not set.
2. Replace COPY --from glob (which fails when no files match) with RUN --mount for libnvat. This makes non-GPU builds safe: the mount always succeeds, and the if-ls check handles the empty case.

Mirror the change just made on the kata UVM workflow (build-kata-uvm-cohere.yaml) so both PodVM build paths default to the same guest-components branch and produce binaries with a working multi-GPU nvidia-attester out of the box.

The `cohere` branch's nvidia-attester::detect_platform() has a `count == 1` guard that silently disables the attester on 2+ GPU systems. Upstream main's NVAT-SDK-based rewrite (synced into the fork by PR #9, head = alhassankhedr/sync-main-to-cohere) drops the guard and handles multi-GPU enumeration via GpuEvidenceSource::collect(). Switch back to `cohere` once PR #9 merges.

cursor · 2026-05-15T04:07:02Z

+            --annotation "com.cohere.nvidia.driver=${NVIDIA_DRIVER}" \
            --format json > oras-output.json

          cat oras-output.json


🔒 Agentic Security Review
Severity: HIGH

GC_REF now defaults to a mutable personal branch (alhassankhedr/sync-main-to-cohere) for push/tag-driven PodVM builds instead of an immutable, reviewed ref. That expands the build trust boundary to branch-head state that can change outside this repository’s review path.

Impact: If that branch is updated maliciously (or compromised), unreviewed guest-components code can be pulled into release artifacts and published as trusted PodVM images.

The default `Minimize=guess` for `mkosi.repart-debug/10-root.conf` chronically under-sizes the debug image once the 595 NVIDIA stack lands in /usr. We hit two failure modes during the B200 multi-GPU work, both reproducible from a clean checkout:

1. mkosi `systemd-repart` step fails with "no space left on device" mid-build because the guessed size doesn't account for libnvat, the open-driver kernel modules, fabricmanager, and nscq landing on top of the standard ubuntu base.
2. When the build does squeeze through, the resulting qcow2 boots but runs out of root-fs space the first time anything writes into /var (apt cache, journald, attestation-agent's `/run/aa`).

Fixed locally on the B200 host's checkout (the one that produced the working /mnt/vms/guest-nvat-debug.qcow2 that demonstrated the 8-GPU evidence path) but never made it back into the branch. Closing the gap now: pin `Minimize=off`, `SizeMinBytes=12G`, `SizeMaxBytes=12G`. 12 GiB is empirically enough for the full B200 CC userspace with comfortable headroom for ad-hoc debugging.

Release variant is unaffected (mkosi.repart/ uses verity-sized partitions independently).

cursor · 2026-05-15T04:16:25Z

+# the debug variant has headroom for the full B200 CC userspace.
+Minimize=off
+SizeMinBytes=12G
+SizeMaxBytes=12G


Debug partition permanently enlarged for all builds, not just CC595

Medium Severity

The committed 10-root.conf permanently replaces Minimize=guess with Minimize=off and a fixed 12 GiB size for ALL debug builds, not just CC595 builds. This contradicts the PR's claim that "default behaviour is unchanged." The conditional workflow step "Increase debug root partition for CC595 drivers" (guarded by b200_cc_drivers == 'true') writes the exact same content that's already in the committed file, making it a no-op. Standard 580 debug images will now be a fixed 12 GiB instead of minimized.

Additional Locations (1)

.github/workflows/build-podvm-cohere.yaml#L237-L249

Reviewed by Cursor Bugbot for commit 7a46e77.

…or NVL5+

- Pin nvidia-driver-open / persistenced / fabricmanager / nscq / fabricmanager-dev to 595.71.05-1ubuntu1; add nvlsm and infiniband-diags packages required by the B200 Shared NVSwitch path. Also pulls in docker.io and a small python/curl/pciutils baseline used by the in-VM attestation + tenant-setup flow.
- modules-load.d/nvlink-fabric.conf: add ib_umad. The B200 nvidia-fabricmanager-start.sh checks lsmod for ib_umad and exits if missing (NVL5+ subnet management path goes over the CX7 bridge umad interface).
- mkosi.repart-debug/10-root.conf: pin the debug rootfs at a fixed 12G (Minimize=off, SizeMin/Max=12G) instead of guess-minimized so the systemd-firstboot resize does not run out of room when the in-VM setup writes the NVIDIA driver state during early boot.

Tested end-to-end on a B200 bare-metal host: SVM + 4 x 2-GPU tenants (partitions 4/5/6/7) come up clean, fabric.state=Completed/Success on all 8 GPUs, ITA composite TDX+NVGPU attestation passes for all 4.

On HGX B200 in confidential-compute mode, the Service VM Fabric Manager programs NVSwitch routing for a partition asynchronously. The handshake with the guest GPU happens over in-band NVLink MAD (Probe Request -> Probe Response) AFTER fmActivateFabricPartition() returns success, so guest userspace can see the GPU on the PCI bus before that GPU has actually finished registering with the fabric.

nvidia-persistenced races that handshake on guest boot. If it tries to register a GPU before that GPU's fabric.state hits Completed, NVML returns 0x81 (NVLINK_FABRIC_NOT_READY) and the daemon SILENTLY falls back to non-UVM persistence. Per the NVIDIA Secure AI Operations Guide, the only way to recover from a missed SPDM/UVM session in CC mode is an FLR, i.e. the affected GPU is permanently unable to do NVLink P2P until the VM is fully reset, with no easy diagnostic.

Symptom we hit in the field: vLLM crashes deep into NCCL init with cudaErrorSystemNotReady (CUDA error 802) on one GPU of a 2-GPU tenant, and Xid 170 / 145 cascades on the rest of the partition.

The new wait-nvlink-fabric.sh polls nvidia-smi --query-gpu=fabric.state,fabric.status until every visible GPU reports Completed/Success, then exits 0. It is wired into the nvidia-persistenced systemd drop-in as ExecStartPre. On timeout (default 180s, knob: WAIT_FABRIC_TIMEOUT) it exits non-zero and the daemon fails loud rather than silently degrading, which is much easier to operate than the old failure mode and converts an unrecoverable silent corruption into a recoverable explicit failure.

Verified on a B200 bare-metal host (Stage C, FABRIC_MODE=1 Service VM): SVM + 4 x 2-GPU tenants started in parallel, gate logs "all GPUs fabric ready (attempt N)" on each tenant, persistenced successfully enables UVM Persistence on all 8 GPUs, ITA composite TDX+NVGPU attestation passes (nvgpu_overall=true) on all 4, zero Xid in dmesg.
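
The gate logic described in this commit message can be sketched as a poll-until-ready loop. This is not the contents of wait-nvlink-fabric.sh: it is a Python illustration that assumes the query is run with `--format=csv,noheader` and yields one `state, status` row per GPU, and the `probe`, `sleep`, and `clock` parameters are hypothetical hooks added so the loop can be exercised without a GPU.

```python
import time
from typing import Callable

READY = ("Completed", "Success")


def all_fabric_ready(csv_output: str) -> bool:
    """True when there is at least one row and every row is 'Completed, Success'."""
    rows = [line for line in csv_output.splitlines() if line.strip()]
    if not rows:
        return False
    return all(tuple(f.strip() for f in row.split(",")) == READY for row in rows)


def wait_for_fabric(probe: Callable[[], str], timeout_s: float = 180.0,
                    interval_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> int:
    """Return 0 once all GPUs are ready, 1 on timeout (usable as an exit code)."""
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        attempt += 1
        if all_fabric_ready(probe()):
            print(f"all GPUs fabric ready (attempt {attempt})")
            return 0
        if clock() >= deadline:
            return 1
        sleep(interval_s)
```

In real use `probe` would wrap a `subprocess.run` call to nvidia-smi; returning a non-zero code on timeout is what lets an ExecStartPre gate fail the unit explicitly.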

cursor · 2026-05-19T22:31:11Z

+    nvidia-driver-open=595.71.05-1ubuntu1
+    nvidia-persistenced=595.71.05-1ubuntu1
+    nvidia-fabricmanager=595.71.05-1ubuntu1
+    libnvidia-nscq=595.71.05-1ubuntu1


Base config permanently ships 595 drivers, breaking opt-in intent

High Severity

The base ubuntu.conf has been permanently changed from nvidia-driver-580-open=580.126.20-1ubuntu1 to nvidia-driver-open=595.71.05-1ubuntu1 (and the other three packages likewise). This means all builds — including the default b200_cc_drivers=false path and push-triggered builds — will ship the 595 driver stack, directly contradicting the PR's stated intent that "default behaviour is unchanged" and "push events… continue to ship 580.x." Additionally, the sed override step's first pattern (nvidia-driver-580-open=.*) is now dead code since that string no longer exists in the file. The base file needs to retain the 580 packages so the dynamic sed replacement has something to match when the flag is enabled.

Additional Locations (1)

.github/workflows/build-podvm-cohere.yaml#L228-L234

Reviewed by Cursor Bugbot for commit 85f075d.

cursor · 2026-05-19T22:31:11Z

+    libcurl4t64
+    libxml2
+    libxmlsec1-openssl
+    pciutils


Duplicate pciutils entry in package list

Low Severity

pciutils appears twice in the Packages= list (lines 41 and 45). While the package manager handles duplicates gracefully, the repetition is unnecessary and suggests a copy-paste oversight.
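
A duplicate like this can be caught mechanically. The sketch below is a simplified reader for an indented `Packages=` block, not a full parser for the mkosi configuration format, and the helper names are hypothetical.

```python
from collections import Counter


def package_entries(conf_text: str) -> list[str]:
    """Collect package names from a Packages= block, dropping any =version pin."""
    names, in_block = [], False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("Packages="):
            in_block = True
            line = line[len("Packages="):].strip()
        elif in_block and line and not raw[0].isspace():
            in_block = False  # an unindented line ends the continuation block
        if in_block and line and not line.startswith("#"):
            names.append(line.split("=", 1)[0])
    return names


def duplicate_packages(conf_text: str) -> list[str]:
    """Return the sorted names that appear more than once."""
    counts = Counter(package_entries(conf_text))
    return sorted(name for name, n in counts.items() if n > 1)
```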

Additional Locations (1)

src/cloud-api-adaptor/podvm-mkosi/mkosi.presets/system/mkosi.conf.d/ubuntu.conf#L40-L41

Reviewed by Cursor Bugbot for commit 85f075d.

cursor · 2026-05-19T22:31:12Z

 ExecCondition=/usr/local/bin/check-nvidia-gpu
+# Block daemon startup until every visible GPU has fabric.state=Completed.
+# See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
+ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh


Unconditional fabric wait blocks persistenced on non-NVLink GPUs

Medium Severity

The new ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh is added unconditionally to the skeleton (all images, not just CC builds). On GPUs without NVLink fabric, nvidia-smi returns N/A for fabric.state/fabric.status, which never matches Completed/Success, so the script loops for 180 seconds and exits non-zero. Unlike the pre-existing ExecStartPost (non-fatal), a failed ExecStartPre is fatal in systemd — nvidia-persistenced will never start on non-NVLink GPU systems. This is a behavioral regression from the prior override, which allowed persistenced to run even when CC-specific post-start commands failed.
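
One way to address this, shown here only as a sketch, is to distinguish "no fabric at all" from "fabric not ready yet" before deciding whether to wait. It assumes, as the comment above states, that GPUs without NVLink fabric report N/A in both columns; `classify_fabric` is a hypothetical name, and mixed output is deliberately treated as still waiting.

```python
def classify_fabric(csv_output: str) -> str:
    """Return 'ready', 'waiting', or 'no-fabric' for state,status CSV rows."""
    rows = [tuple(f.strip() for f in line.split(","))
            for line in csv_output.splitlines() if line.strip()]
    if not rows:
        return "waiting"
    # Assumption: a GPU with no NVLink fabric reports N/A in both columns.
    if all(row == ("N/A", "N/A") for row in rows):
        return "no-fabric"
    if all(row == ("Completed", "Success") for row in rows):
        return "ready"
    # Anything else, including a mix of N/A and Completed rows, keeps waiting.
    return "waiting"
```

A gate built on this could exit 0 immediately for `no-fabric`, so nvidia-persistenced still starts on non-NVLink systems, while keeping the timeout for the `waiting` case.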

Additional Locations (1)

src/cloud-api-adaptor/podvm-mkosi/mkosi.skeleton/usr/local/bin/wait-nvlink-fabric.sh#L39-L58

Reviewed by Cursor Bugbot for commit 85f075d.

…ata-plane

Per the NVIDIA Fabric Manager User Guide audit on 2026-05-19 (fortress scratch/oci-b200/.../docs/KNOWN-ISSUES.md §7), the SVM image was missing eight B200-specific Shared-NVSwitch packages that the deprecated `nvlink5-<branch>` metapackage used to pull in. Bake them into the podvm-mkosi image build so a fresh SVM boot starts with everything provisioned under proper dpkg management, with no runtime `apt-get download` + `dpkg -x` workarounds needed.

Added (all version-pinned to match the in-image NVIDIA driver where the apt repo offers a 595-series build):

- libnvsdm=595.71.05-1ubuntu1: NVSwitch Device Manager telemetry library; replaces the SXID error path on B200/B300 and is required for FM/NVLSM/DCGM to surface NVSwitch errors (FM Guide §1259).
- nvidia-imex=595.71.05-1ubuntu1: Internode Memory Exchange daemon. Brokers cross-OS-instance shared CUDA memory channels over NVLink; required for multi-tenant Shared-NVSwitch workloads where NCCL crosses partition boundaries. Strongest hypothesis for the `Xid 170 SECURE Fatal CROSS_CONTAIN` storm on small partitions (KNOWN-ISSUES §1 mode C).
- collectx-bringup, mft, mft-oem, mft-autocomplete: CX7 telemetry / firmware tools (mst, flint, mlxconfig, mlxlink). Not strictly load-bearing but required for B200 LPF triage.
- rdma-core, ibverbs-utils: OFED data-plane (libibverbs1, librdmacm1, ibv_devices, ibv_devinfo). FM Guide "NVIDIA Software Packages" mandates "OFED or MOFED package is required" on B200/B300; the image already had the management plane (libibmad/libibumad/ibstatus from infiniband-diags) but not the data plane.

Verification gates for all of these are wired into fortress scratch/oci-b200/stacks/qemu-shared-nvswitch/service-vm/scripts/verify-svc-vm.sh (eight hard-fail gates as of 2026-05-19).

cursor · 2026-05-20T01:45:42Z

+    nvidia-driver-open=595.71.05-1ubuntu1
+    nvidia-persistenced=595.71.05-1ubuntu1
+    nvidia-fabricmanager=595.71.05-1ubuntu1
+    libnvidia-nscq=595.71.05-1ubuntu1


Config files unconditionally ship 595 drivers, breaking opt-in intent

High Severity

The committed ubuntu.conf already has 595.71.05 packages and B200-specific dependencies (e.g. nvidia-fabricmanager-dev, nvlsm, nvidia-imex, libnvsdm), making the b200_cc_drivers flag a no-op for image content. The workflow's conditional sed step tries to match nvidia-driver-580-open=.* which no longer exists in the file (now nvidia-driver-open=…), so it silently matches nothing. Similarly, 10-root.conf is already committed with Minimize=off / 12G fixed size. Both b200_cc_drivers=false and true builds produce identical images — only the tag suffix differs. The PR description states "default behaviour is unchanged… continue to ship 580.x" which contradicts the committed source.

Additional Locations (2)

.github/workflows/build-podvm-cohere.yaml#L228-L234

src/cloud-api-adaptor/podvm-mkosi/mkosi.presets/system/mkosi.repart-debug/10-root.conf#L10-L13

Reviewed by Cursor Bugbot for commit bb64832.

The previous commit (bb64832) added libnvsdm, nvidia-imex, collectx-bringup, mft, mft-oem, mft-autocomplete, rdma-core, ibverbs-utils to ubuntu.conf. The 2026-05-19 podvm build attempt on the B200 host failed at the mkosi apt-resolve phase with:

    collectx-bringup : Depends: ucx but it is not installable

The `ucx` (Unified Communication X) package and the matching MOFED userspace stack live in NVIDIA's DOCA-Host networking apt repo (linux.mellanox.com/public/repo/doca/...), which the CAA mkosi configuration was not sourcing from. The CUDA repo at developer.download.nvidia.com does ship `collectx-bringup`, `mft`, `mft-oem`, `mft-autocomplete`, but does NOT ship `ucx`.

Fix in two parts:

1. Dockerfile.mkosi.ubuntu: wire the DOCA-Host repo into mkosi.skeleton's apt sources alongside the existing CUDA and nvidia-container-toolkit repos. Pin DOCA to priority 100 ("install only when explicitly requested or to satisfy a dep, never replace a higher-priority candidate"). This makes:
   - ucx (only in DOCA) install from DOCA ✓
   - rdma-core / libibumad3 (also in DOCA but in universe @500) install from universe ✓ (keeps inbox OFED, not MOFED)
   - collectx-bringup / mft* install from CUDA repo (origin developer.download.nvidia.com, default 500 > 100)
2. ubuntu.conf: keep collectx-bringup, mft, mft-oem, mft-autocomplete in Packages= (they are nvlink5-595 metapackage components per the FM User Guide §"Installing Fabric Manager / Systems Using Fourth Generation NVSwitches"). Add a comment explaining the cross-repo dependency chain.

Also folded in this commit (separate but discovered during the same FM Guide audit):

- libibumad3: §"Other NVIDIA Software Packages" (§366) calls it out by name; pin explicitly so it stops being a transitive-only dep.
- nvidia-utils-595: provides /usr/bin/nvidia-smi which verify-svc-vm.sh invokes directly. Was transitive via nvidia-driver-open; pin.
- nvidia-modprobe: SUID helper for /dev/nvidia* device nodes; required by nvidia-imex and any non-root NVML caller. Was transitive; pin.
- datacenter-gpu-manager-4-cuda12, datacenter-gpu-manager-4-config: DCGM v4. FM Guide §"NVSwitch Errors On DGX B200/B300" specifies that DCGM is the consumer that surfaces NVSwitch errors via libnvsdm. We had libnvsdm but no DCGM, so half the error path was present. Adding DCGM closes that gap.
- lshw: Provides `vpddecode` per FM Guide §"Additional Steps for NVIDIA HGX B200/B300 Systems" for CX7 bridge VPD identification.
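
The repo-selection outcome described in this commit message can be modelled with a deliberately simplified rule: among the repos that offer a package, the one with the highest pin priority wins. Real apt also compares versions among candidates of equal priority, so treat this as an illustration only; the repo labels and tables are hypothetical and use the priorities quoted in the message above.

```python
from typing import Optional

# Priorities as quoted in the commit message; repo labels are illustrative.
REPO_PRIORITY = {"doca": 100, "universe": 500, "cuda": 500}

# Which repos offer which package, per the commit message.
AVAILABILITY = {
    "ucx": ["doca"],
    "rdma-core": ["doca", "universe"],
    "libibumad3": ["doca", "universe"],
    "collectx-bringup": ["cuda"],
}


def pick_repo(package: str) -> Optional[str]:
    """Choose the highest-priority repo that offers the package, if any."""
    candidates = AVAILABILITY.get(package, [])
    if not candidates:
        return None  # corresponds to apt's "Unable to locate package"
    return max(candidates, key=lambda repo: REPO_PRIORITY[repo])


for name in ("ucx", "rdma-core", "collectx-bringup"):
    print(name, "->", pick_repo(name))
```

The low DOCA priority matters only for packages that exist in more than one repo: `ucx` still installs from DOCA because no other repo offers it.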

cursor · 2026-05-20T02:05:31Z

+            -e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \
+            -e 's|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=595.71.05-1ubuntu1|' \
+            -e 's|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=595.71.05-1ubuntu1|' \
+            "$CONF"


Opt-in flag is non-functional; config already hardcodes 595

High Severity

The b200_cc_drivers flag is supposed to opt-in to the 595 driver by sed-replacing 580 package pins, but ubuntu.conf was directly committed with 595 packages (nvidia-driver-open=595.71.05-1ubuntu1), making the sed a no-op. Specifically, the first sed pattern looks for nvidia-driver-580-open= which doesn't exist in the committed file (it has nvidia-driver-open= without 580), and patterns 2–4 match but replace with the same already-present 595 values. This means ALL builds use the 595 driver regardless of the flag, contradicting the stated "Default behaviour is unchanged" and the PR's "opt-in" design. The flag's only actual effect is appending -cc595 to the image tag.
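
Both failure modes named here (a pattern that matches nothing, and a pattern that replaces a pin with the same value) are silent with plain sed. The sketch below shows a substitution helper that surfaces them; `apply_checked` is a hypothetical name and the patterns in the usage note are illustrative.

```python
import re


def apply_checked(conf_text: str, substitutions: list[tuple[str, str]]) -> str:
    """Apply each (pattern, replacement); raise if any leaves the text unchanged."""
    for pattern, replacement in substitutions:
        new_text, count = re.subn(
            pattern, lambda m, r=replacement: r, conf_text, flags=re.MULTILINE
        )
        if count == 0:
            raise LookupError(f"pattern matched nothing: {pattern}")
        if new_text == conf_text:
            raise LookupError(f"pattern changed nothing: {pattern}")
        conf_text = new_text
    return conf_text
```

Run against a conf that already pins 595, `nvidia-driver-580-open=.*` raises "matched nothing" and `nvidia-persistenced=.*` raises "changed nothing", which would have turned the no-op override into a visible build failure.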

Additional Locations (1)

src/cloud-api-adaptor/podvm-mkosi/mkosi.presets/system/mkosi.conf.d/ubuntu.conf#L27-L31

Reviewed by Cursor Bugbot for commit fd50ec5.

cursor · 2026-05-20T02:05:31Z

 ExecCondition=/usr/local/bin/check-nvidia-gpu
+# Block daemon startup until every visible GPU has fabric.state=Completed.
+# See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
+ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh


Missing chmod for new wait-nvlink-fabric.sh script

High Severity

The new wait-nvlink-fabric.sh is wired as ExecStartPre in the nvidia-persistenced service override, but the mkosi.finalize.chroot script only runs chmod +x on check-nvidia-gpu (line 23) — there is no matching chmod +x /usr/local/bin/wait-nvlink-fabric.sh. If the file's executable bit isn't preserved through the skeleton copy, ExecStartPre will fail with a permission error and nvidia-persistenced won't start, leaving GPUs without persistence mode on B200 hosts.

Reviewed by Cursor Bugbot for commit fd50ec5.

The 2026-05-20 build attempt at fd50ec5 confirmed that the NVIDIA DOCA-Host repo wiring works (apt cleanly fetched the DOCA Release/Packages indexes and the `ucx`/collectx-bringup chain resolved). It also surfaced two package names from the prior commit that do not exist in the NVIDIA cuda repo:

    E: Unable to locate package nvidia-utils-595
    E: Unable to locate package datacenter-gpu-manager-4-config

Root cause:

1. nvidia-utils-595: the `nvidia-utils-<branch>` suffix series in the cuda repo (developer.download.nvidia.com/.../ubuntu2404) stops at -580. Starting with R595, the open-driver branch bundles nvidia-smi and the other userspace tools inside nvidia-driver-open=595.71.05-1ubuntu1 itself rather than shipping a separate -utils-<branch> package. Verify gates already pass on prior 595-stack images for exactly that reason; no explicit pin is needed.
2. datacenter-gpu-manager-4-config: does not exist as a separate package. The real DCGM 4 layout (confirmed via `apt-cache search datacenter-gpu-manager` against the cuda repo Packages.gz) is:

       datacenter-gpu-manager-4-core
       datacenter-gpu-manager-4-cuda{11,12,13}
       datacenter-gpu-manager-4-cuda-all
       datacenter-gpu-manager-4-dev
       datacenter-gpu-manager-4-multinode{,-cuda12,-cuda13}
       datacenter-gpu-manager-4-proprietary{,-cuda11,-cuda12,-cuda13}

   The systemd unit (nvidia-dcgm.service), the dcgmi CLI, and the default config files in /etc/nvidia-dcgm/ all ship inside datacenter-gpu-manager-4-cuda12. No separate -config package is required or available.

Drop both names; keep nvidia-modprobe=595.71.05-1ubuntu1 (which DID resolve cleanly) and datacenter-gpu-manager-4-cuda12 (which likewise resolves on its own). Add lengthy comments capturing the package-layout finding so this trap doesn't get re-hit on the next NVIDIA-stack rev.

Previously fmctl-probe (the FM SDK client used by the host's {activate,deactivate}-partition-by-bdfs.sh wrappers via SSH-into-SVM) had no production build path; operators were expected to scp the source and g++ it inside a running SVM. The image's "/usr/local/bin/fmctl-probe is baked in" claim was aspirational, and the binary's libnvfm ABI was only guaranteed by accident (whatever libnvfm happened to be on the build host).

Vendor the source at mkosi.skeleton/usr/src/fmctl-probe/fmctl-probe.cpp (byte-identical to the canonical copy in cohere-ai/fortress at scratch/oci-b200/stacks/qemu-shared-nvswitch/orchestration/scripts/fmctl-probe.cpp) and add a postinst block that chroots into ${BUILDROOT} and compiles against the rootfs's own libnvfm (from nvidia-fabricmanager-dev=595.71.05-1ubuntu1), then strips the source tree. Result: every podvm qcow2 ships /usr/local/bin/fmctl-probe with guaranteed ABI parity to the libnvfm.so that loads at runtime, and the SVM never compiles anything during bring-up.

Also add g++ to Packages=: Ubuntu's gcc package does not pull the C++ front-end, so the postinst would otherwise skip the bake silently and fall back to the runtime g++ workaround.

The fortress-side verify-svc-vm.sh gate 7 fails closed if /usr/local/bin/fmctl-probe is missing or the 'resolve' subcommand isn't there, catching both forgot-to-rebuild and forgot-to-resync-vendored-copy regressions before any tenant ExecStartPre runs.

The upstream Makefile defaults to PODVM_DISTRO=fedora, so a bare `make image-debug` silently produces a Fedora qcow2 lacking every NVIDIA package the B200 Service VM stack pins in mkosi.presets/system/mkosi.conf.d/ubuntu.conf. The build emits no error and the broken image only fails at SVM bring-up time, after the operator has copied a 5 GiB qcow2 into place and restarted the VM.

Hit this on 2026-05-19 evening: re-built after pushing the fmctl-probe bake (f4b67ed), invoked `make binaries && make image-debug` directly under nohup (mistakenly skipping /tmp/run-podvm-build.sh which exports PODVM_DISTRO=ubuntu). The build ran 5 min into the systemd-repart phase and surfaced as `mkfs binary for ext4 is not available`, a misleading symptom whose actual cause was Fedora 43 e2fsprogs vs Oracular's systemd-repart. fmctl-probe also silently skipped its bake because nv_fm_agent.h doesn't exist in Fedora's NVIDIA repos.

Add a guard at the very top of mkosi.postinst that reads ${BUILDROOT}/etc/os-release and exits non-zero if ID != ubuntu. This fails the mkosi build BEFORE finalize, with a message naming all three wrapper scripts (run-podvm-build.sh, fortress 04-build-podvm-locally.sh, fortress run-podvm-build.host.sh) so the operator knows the canonical fix. Works regardless of entry point: CI, ad-hoc make, or wrapper.
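
The guard's decision logic can be sketched as follows. This is a Python illustration of the check, not the shell code added to mkosi.postinst; the function names and the error wording are hypothetical.

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines from os-release text, stripping optional quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def distro_guard(os_release_text: str, expected: str = "ubuntu") -> int:
    """Return 0 when ID matches the expected distro, 1 otherwise."""
    distro = parse_os_release(os_release_text).get("ID", "")
    if distro != expected:
        print(f"refusing to build: ID={distro!r}, expected {expected!r}; "
              "set PODVM_DISTRO=ubuntu")
        return 1
    return 0
```

Checking the os-release inside the build root, rather than the PODVM_DISTRO variable, is what makes the guard independent of how the build was started.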

The qemu-shared-nvswitch SVM is GPU-less by design (only the four CX7 LPFs are passed through; the eight B200 GPUs go straight to tenant VMs), so nvidia-imex always fails at startup with NV_ERR_OPERATING_SYSTEM ("Failed to allocate handle to NVIDIA GPU driver") because /dev/nvidiactl doesn't exist. Worse, the upstream unit is Type=forking + TimeoutStartSec=infinity, so the failed-init parent hangs in sigtimedwait() forever and `systemctl start nvidia-imex.service` never returns, observed as a 5+ minute boot stall in svc-vm-bootstrap.sh.

Add a drop-in override that mirrors the existing pattern used by nvidia-fabricmanager.service / nvidia-persistenced.service / nvidia-cdi-refresh.service:

    ExecCondition=/usr/local/bin/check-nvidia-gpu
    TimeoutStartSec=120

The check-nvidia-gpu predicate skips the unit cleanly on any VM whose lspci -n doesn't show a 10de:* device, so on the SVM the unit becomes a no-op (same way nvidia-fabricmanager already does). Tenant VMs (which DO have GPUs) still start nvidia-imex normally.

The TimeoutStartSec=120 is belt-and-suspenders: even on a real GPU node, an indefinite hang in the forking handshake would mask a real config error (e.g. malformed nodes_config.cfg) and stall the unit dependency graph. 120 s is generous enough for the legitimate path (driver init + IMEX cluster bootstrap) without being unbounded.
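
The predicate amounts to scanning `lspci -n` output for vendor ID 10de. The sketch below is a Python rendering of that idea, not the check-nvidia-gpu script itself; the sample device IDs in the comment are invented.

```python
import re

# In `lspci -n` output the vendor:device pair follows the class code, e.g.
#   "89:00.0 0302: 10de:abcd (rev a1)"   (device ID invented for illustration)
VENDOR_DEVICE = re.compile(
    r"^\S+\s+[0-9a-f]{4}:\s+([0-9a-f]{4}):[0-9a-f]{4}", re.MULTILINE
)


def has_nvidia_device(lspci_n_output: str) -> bool:
    """True when any PCI function reports NVIDIA's vendor ID (10de)."""
    return any(vendor == "10de" for vendor in VENDOR_DEVICE.findall(lspci_n_output))
```

Matching the vendor field specifically, instead of searching for the string `10de` anywhere in the line, avoids false positives from a device ID or address that happens to contain the same digits.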

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 8 total unresolved issues (including 7 from previous reviews).

Reviewed by Cursor Bugbot for commit de75f12.

cursor · 2026-05-20T04:59:57Z

+          # "No space left on device" during the build.
+          printf '[Partition]\nType=root\nFormat=ext4\nCopyFiles=/\nMinimize=off\nSizeMinBytes=12G\nSizeMaxBytes=12G\n' > "$CONF"
+          echo "----- Updated repart config -----"
+          cat "$CONF"


Debug root partition override is redundant dead code

Low Severity

The "Increase debug root partition for CC595 drivers" workflow step (conditional on b200_cc_drivers == 'true') writes Minimize=off / SizeMinBytes=12G / SizeMaxBytes=12G to 10-root.conf. However, the base 10-root.conf was already directly changed in this commit from Minimize=guess to the identical Minimize=off + 12G content. The conditional override is redundant dead code that writes exactly what's already in the file.

Additional Locations (1)

src/cloud-api-adaptor/podvm-mkosi/mkosi.presets/system/mkosi.repart-debug/10-root.conf#L1-L13

…t pollution

The previous `resolve <bdf,bdf,...>` subcommand matches partitions by fmGetSupportedFabricPartitions().gpuInfo[].pciBusId, which is correct on single-host non-FABRIC_MODE setups but ALWAYS empty in our qemu-shared-nvswitch SVM topology: FM runs in FABRIC_MODE=1 with no GPUs in its OS instance (the GPUs are passed through to tenant VMs), so pciBusId never populates. Every `fmctl-probe resolve <bdf>` therefore returned "no supported partition matches BDF set" and tenant ExecStartPre failed before QEMU launched.

physicalId, by contrast, IS populated by FM in shared-NVSwitch mode -- it's a baseboard-fixed property reported by NVSwitch firmware regardless of who owns the GPUs. Add a parallel `resolve-by-physids <id,id,...>` subcommand that matches against gpuInfo[].physicalId. The host-side activate-/deactivate-partition-by-bdfs.sh wrappers compute the BDF -> physicalId map from /sys/bus/pci on the host (sort 10de:* + class 0x030200 by PCI address; canonical B200 HGX baseboard convention) and call the new subcommand.

Also fix a latent stdout-pollution bug: the post-fmConnect "[fmctl] connected to %s" banner was on stdout, which got concatenated with the machine-readable partition id by the calling shell `pid=$(fmctl-probe resolve...)`. Move it to stderr -- only the partition id stays on stdout.

Verified end-to-end on a live b200-cc-test SVM: rebuilt fmctl-probe in place against the rootfs's libnvfm, ran activate-partition-by-bdfs.sh 89:00.0,a8:00.0, the script auto-detected physicalIds 5,6 and resolved to partition 6 which fmctl-probe list confirmed active. Then launched all four mid-{a,b,c,d} tenants (8 GPUs split 2+2+2+2 across partitions 4,5,6,7) -- all systemd units came up active, all four FM partitions active=1.

Source remains byte-identical with the fortress canonical copy (scratch/oci-b200/.../orchestration/scripts/fmctl-probe.cpp); SHA256 matches.
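The stdout-pollution fix is easy to illustrate in isolation. The snippet below is a sketch only: `fake_resolve` is a hypothetical stand-in for `fmctl-probe resolve-by-physids`, and the address in the banner is a placeholder. It shows why a banner written to stderr does not end up in a `$(...)` capture.

```shell
#!/usr/bin/env bash
# fake_resolve is a hypothetical stand-in for `fmctl-probe resolve-by-physids`.
fake_resolve() {
    # Diagnostics go to stderr, so command substitution never captures them.
    echo "[fmctl] connected to example-address" >&2
    # Only the machine-readable partition id is written to stdout.
    echo "6"
}

# Command substitution captures stdout only; the banner still reaches the
# terminal (or the journal) through stderr.
pid="$(fake_resolve)"
echo "partition id: ${pid}"
```

Had the banner been written to stdout, `pid` would have held both lines and any later numeric use of it in the wrapper scripts would have broken.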

alhassankhedr-cohere added 2 commits May 8, 2026 09:29

cursor Bot reviewed May 12, 2026

Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu

alhassankhedr-cohere added 2 commits May 12, 2026 09:00

alhassankhedr-cohere force-pushed the alhassankhedr/podvm-b200-cc-driver branch from da5003f to 6361ef8 Compare May 12, 2026 18:55

cursor Bot reviewed May 12, 2026

alhassankhedr-cohere added 2 commits May 13, 2026 10:38

chore: remove temporary detect_platform sed patch

0f10cc0

The nvidia-attester count==1 bug is now fixed upstream via guest-components PR #9 (sync main → cohere), which brings in the full NVAT SDK rewrite. The Dockerfile sed workaround is no longer needed.

fix: install libnvat to /usr instead of /usr/local

c174616

The new nv-attestation-sdk-sys build.rs expects nvat.h at /usr/include/nvat.h when NVAT_USE_SYSTEM_LIB=1 is set. Set CMAKE_INSTALL_PREFIX=/usr so headers and libs install to /usr/include and /usr/lib instead of /usr/local.

cursor Bot reviewed May 13, 2026

Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu Outdated

fix: copy libnvat from /usr/lib to match CMAKE_INSTALL_PREFIX

92af4d5

The previous commit changed CMAKE_INSTALL_PREFIX to /usr, but the COPY step still looked for libnvat in /usr/local/lib. Also install to /usr/lib in the guest image so ldconfig finds it without extra configuration.

cursor Bot reviewed May 13, 2026

Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu

alhassankhedr-cohere added 4 commits May 13, 2026 11:32

fix: add libssl-dev to NVAT build deps (prevents OpenSSL source build)

9f839bf

cursor Bot reviewed May 15, 2026

alhassankhedr-cohere added 2 commits May 19, 2026 18:22

cursor Bot reviewed May 19, 2026

cursor Bot reviewed May 20, 2026

alhassankhedr-cohere added 3 commits May 19, 2026 22:16

cursor Bot reviewed May 20, 2026

Conversation

alhassankhedr-cohere commented May 8, 2026 • edited by cursor Bot

cursor Bot May 15, 2026

Debug partition permanently enlarged for all builds, not just CC595

cursor Bot May 19, 2026

Base config permanently ships 595 drivers, breaking opt-in intent

cursor Bot May 19, 2026

Duplicate pciutils entry in package list

cursor Bot May 19, 2026

Unconditional fabric wait blocks persistenced on non-NVLink GPUs

cursor Bot May 20, 2026

Config files unconditionally ship 595 drivers, breaking opt-in intent

cursor Bot May 20, 2026

Opt-in flag is non-functional; config already hardcodes 595

cursor Bot May 20, 2026

Missing chmod for new wait-nvlink-fabric.sh script

cursor Bot May 20, 2026

Debug root partition override is redundant dead code
