feat(podvm): opt-in NVIDIA 595.71.05 driver for B200 multi-GPU CC #29

Open
alhassankhedr-cohere wants to merge 21 commits into cohere from alhassankhedr/podvm-b200-cc-driver


alhassankhedr-cohere commented May 8, 2026

Summary

Adds a new b200_cc_drivers workflow_dispatch flag (default false) to the Build PodVM Image (Cohere) workflow. When enabled, the build swaps the in-image NVIDIA stack from the default 580 LTS to 595.71.05 (open-kernel branch) — the minimum driver / fabricmanager required to enable Confidential Computing on multi-GPU B200 hosts (NVSwitch fabric requires the 595.x series with TDISP/CC support).

When the flag is enabled the workflow:

  • sed-replaces the four NVIDIA package pins in mkosi.presets/system/mkosi.conf.d/ubuntu.conf before mkosi runs:
    • nvidia-driver-580-open -> nvidia-driver-open=595.71.05-1ubuntu1 (the 595 branch ships only the unversioned metapackage in NVIDIA's CUDA repo; there is no nvidia-driver-595-open).
    • nvidia-persistenced=595.71.05-1ubuntu1
    • nvidia-fabricmanager=595.71.05-1ubuntu1
    • libnvidia-nscq=595.71.05-1ubuntu1
    • Patterns are anchored on the package name (580.x.y is matched by =.*) so this survives future bumps to the 580 baseline (e.g. fix(podvm): bump NVIDIA driver pins to 580.159.03 #28).
  • Suffixes tag / OCI tag / GCP image names with -cc595 so a CC build never silently overwrites the standard cohere-latest artifact (e.g. podvm-ubuntu-tdx-release-cohere-latest-cc595).
  • Records the resolved driver version (extracted from ubuntu.conf at build time, so it's accurate for both standard and CC builds) in measurements.json as nvidia_driver and as a com.cohere.nvidia.driver annotation on the published OCI artifact.
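
A minimal sketch of the override step described in the list above, assuming the pins sit one per line in the conf file with optional leading whitespace. The helper name `override_nvidia_pins` and the first expression (the 580 metapackage rename) are illustrative; the final `grep` is an added safeguard, not something the PR states the workflow does.

```bash
#!/usr/bin/env bash
# Sketch: rewrite the four NVIDIA pins in an mkosi conf file in place.
# override_nvidia_pins is a hypothetical helper name, not the workflow's real step.
override_nvidia_pins() {
  local conf=$1 ver=${2:-595.71.05-1ubuntu1}
  # Patterns anchor on the package name and match any pinned version (=.*),
  # so a bump of the 580 baseline does not break them.
  sed -E \
    -e "s|^([[:space:]]*)nvidia-driver-580-open=.*|\1nvidia-driver-open=${ver}|" \
    -e "s|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=${ver}|" \
    -e "s|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=${ver}|" \
    -e "s|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=${ver}|" \
    "$conf" > "${conf}.new" && mv "${conf}.new" "$conf"
  # Added safeguard (assumption): fail loudly if the driver pin did not end up
  # at the requested version, instead of silently matching nothing.
  grep -Eq "^[[:space:]]*nvidia-driver-open=${ver}\$" "$conf"
}
```

Calling `override_nvidia_pins mkosi.presets/system/mkosi.conf.d/ubuntu.conf` before mkosi runs would then return non-zero if the driver pin is missing after the rewrite.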

All four 595.71.05-1ubuntu1 packages were verified present in NVIDIA's CUDA apt repo for ubuntu2404/x86_64 (already wired up and pinned to priority 1001 in Dockerfile.mkosi.ubuntu).

Default behaviour is unchanged: the flag is off, and push events on the cohere branch and podvm-v* tags continue to ship 580.x.

Attestation impact

CC-driver builds report a different x-nvidia-gpu-driver-version (595.71.05) and a different RTMR2 (different kernel modules in initrd / driver firmware). Any ITA appraisal policies need a parallel 595.71.05 profile before the CC image is rolled into a prod attestation flow. Standard builds are unaffected.

Test plan

  • Trigger Build PodVM Image (Cohere) via workflow_dispatch with b200_cc_drivers = false and confirm:
    • Output image tag stays …-cohere-latest-{release,debug} (no -cc595 suffix).
    • measurements.json nvidia_driver matches whatever ubuntu.conf pins on cohere.
  • Trigger Build PodVM Image (Cohere) via workflow_dispatch with b200_cc_drivers = true and confirm:
    • "Override NVIDIA driver to 595.71.05" step runs and grep output shows the four pins flipped.
    • mkosi successfully resolves nvidia-driver-open=595.71.05-1ubuntu1 and friends from the CUDA repo.
    • Output image tag is suffixed with -cc595 and OCI artifact has com.cohere.nvidia.driver=595.71.05 annotation.
  • Boot the resulting image on a multi-GPU B200 instance and confirm nvidia-fabricmanager.service reaches active and CC mode is reported by nvidia-smi conf-compute -f.
  • Update ITA appraisal policies with a 595.71.05 profile before promoting the image.
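
The tag and driver-version checks in the test plan could be scripted roughly as follows. This is a sketch: `resolve_driver_version` and `image_tag` are hypothetical helper names, the extraction assumes the driver pin is a single `nvidia-driver-open=<version>` or `nvidia-driver-<branch>-open=<version>` line, and dropping the package revision (`-1ubuntu1`) is an assumption made to match the `595.71.05` form of the annotation.

```bash
#!/usr/bin/env bash
# Print the upstream driver version pinned for the open driver metapackage.
# Matches both nvidia-driver-580-open=... and nvidia-driver-open=...;
# everything from the first hyphen of the version onward is dropped.
resolve_driver_version() {
  sed -n -E 's|^[[:space:]]*nvidia-driver-([0-9]+-)?open=([^-[:space:]]+).*|\2|p' "$1" | head -n 1
}

# Append -cc595 to a base tag only when the CC flag is the string "true".
image_tag() {
  local base=$1 cc_flag=$2
  if [ "$cc_flag" = "true" ]; then
    printf '%s-cc595\n' "$base"
  else
    printf '%s\n' "$base"
  fi
}
```

A check script could compare the output of `resolve_driver_version` against the `nvidia_driver` field of measurements.json and the output of `image_tag` against the published tag.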

Notes

Medium Risk
Modifies PodVM build pipeline and Ubuntu image contents (NVIDIA driver stack, systemd units, and low-level boot/partition sizing), which can impact GPU initialization and image reproducibility. Default behavior is mostly unchanged, but enabling the new flag produces materially different artifacts and attestation measurements.

Overview
Adds an opt-in b200_cc_drivers toggle to the Cohere PodVM GitHub Actions workflow to build a separate -cc595 image variant, record the resolved NVIDIA driver version in measurements.json, and publish it as an OCI annotation.

Updates the Ubuntu mkosi image to support B200 multi-GPU confidential compute by wiring an additional NVIDIA DOCA apt repo (pinned as low-priority fallback), expanding/pinning the NVIDIA package set around the R595 stack (incl. Fabric Manager dev headers, IMEX/NVLink components, RDMA/DCGM tooling), and increasing the debug root partition sizing to avoid build/boot space failures.

Hardens runtime/boot behavior by adding a post-install Ubuntu-only guard, baking an fmctl-probe binary into the image during mkosi postinst, ensuring ib_umad loads for Fabric Manager, gating nvidia-imex on GPU presence with a bounded startup timeout, and adding a pre-start wait (wait-nvlink-fabric.sh) before nvidia-persistenced to avoid NVLink fabric readiness races.

Reviewed by Cursor Bugbot for commit 077f479. Bugbot is set up for automated code reviews on this repo.

Adds a `b200_cc_drivers` workflow_dispatch flag (default false) to
`Build PodVM Image (Cohere)` that swaps the in-image NVIDIA stack from
the default 580 LTS to the 595.71.05 open-kernel branch. This driver
is required to enable Confidential Computing on multi-GPU B200 hosts
(NVSwitch fabric requires fabricmanager 595.x with TDISP/CC support).

When enabled the workflow:

* sed-replaces the four NVIDIA package pins in
  `mkosi.presets/system/mkosi.conf.d/ubuntu.conf` before mkosi runs
  (`nvidia-driver-580-open` -> unversioned `nvidia-driver-open=595.71.05-1ubuntu1`,
  plus `nvidia-persistenced`, `nvidia-fabricmanager`, `libnvidia-nscq`
  pinned to `595.71.05-1ubuntu1`). Patterns are package-name anchored
  so they survive future 580.x.y baseline bumps.
* suffixes `tag` / image names with `-cc595` so a CC build never
  silently overwrites the standard `cohere-latest` artifact.
* records the resolved driver version (extracted from the conf at
  build time, so it stays accurate for both standard and CC builds)
  in `measurements.json` and as a `com.cohere.nvidia.driver` OCI
  annotation on the published artifact.

All four packages were verified present in NVIDIA's CUDA repo for
ubuntu2404/x86_64 (already pinned to priority 1001 in
`Dockerfile.mkosi.ubuntu`). The 595 branch only ships the
unversioned `nvidia-driver-open` metapackage; pinning by version
selects the correct branch.

Default behaviour is unchanged: the flag is off and standard builds
continue to ship 580.x.

The attestation-agent was compiled with --features nvidia-attester but
the nv-attestation-sdk-sys crate requires libnvat (the NVIDIA Attestation
SDK C++ library) to be pre-built. Without it, the build panics or the
feature is silently excluded, producing an AA binary that cannot collect
GPU attestation evidence.

Changes:
- Install cmake, libclang-dev, and NVAT runtime deps in gc_builder stage
- Clone and build libnvat from NVIDIA/attestation-sdk before cargo build
- Set NVAT_USE_SYSTEM_LIB=1 so the sys crate links against the installed lib
- Copy libnvat.so into the final PodVM image tree
- Add libcurl4t64, libxml2, libxmlsec1-openssl, pciutils to mkosi packages
  (runtime deps for libnvat and lspci for Fabric Manager NVL5 detection)
# which must also be present in the final PodVM image at runtime.
RUN set -e; \
if echo "${AA_FEATURES}" | grep -q "nvidia-attester"; then \
git clone --depth 1 --branch "${NVAT_TAG}" "${NVAT_REPO}" /build/nvat && \

🔒 Agentic Security Review
Severity: MEDIUM

The new build step clones nvidia-attestation-sdk from a mutable Git tag (NVAT_TAG) and builds it directly without immutable pinning or integrity verification. This weakens the supply-chain trust boundary for PodVM artifacts.

Impact: If the upstream tag is retargeted or the source repo is compromised, malicious code could be compiled into libnvat and shipped in the resulting image.
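
One way to address this finding is to check the cloned tree against a pinned commit before building. The sketch below assumes a hypothetical build argument `NVAT_COMMIT` holding a full commit SHA; neither that name nor the helper `require_pinned_commit` appears in this PR.

```bash
#!/usr/bin/env bash
# Sketch: refuse to build unless the cloned tree is at the expected commit.
# NVAT_COMMIT (see usage below) is a hypothetical build argument, not part of this PR.
require_pinned_commit() {
  local expected=$1 actual=$2
  # An empty expectation counts as a failure so an unset pin cannot pass silently.
  if [ -z "$expected" ] || [ "$expected" != "$actual" ]; then
    echo "commit mismatch: expected '${expected}', got '${actual}'" >&2
    return 1
  fi
}

# Intended use right after the clone in the RUN step (not executed here):
#   require_pinned_commit "$NVAT_COMMIT" "$(git -C /build/nvat rev-parse HEAD)"
```

Because a tag can be retargeted while a commit SHA cannot, the comparison keeps the convenience of cloning by tag but stops the build if the tag has moved.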

Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu
…olves

The Cargo.lock on the guest-components cohere branch was generated
without the nvidia-attester feature, so nv-attestation-sdk was absent.
Building with --locked silently skipped the dependency, resulting in an
attestation-agent binary with NvAttester symbols (from libnvat linkage)
but no runtime nvidia detection code compiled in.

Add `cargo update --workspace` before the locked build so new optional
feature dependencies are resolved into the lockfile first.

Two fixes for B200 multi-GPU CC builds:

1. Patch detect_platform() after cloning guest-components to accept
   multi-GPU systems (count >= 1 instead of count == 1). Without this,
   the nvidia-attester silently skips registration on systems with more
   than one GPU. Temporary until cohere-ai/guest-components#7 merges.

2. Increase debug image root partition from Minimize=guess to a fixed
   12G. The NVIDIA 595 drivers make the root filesystem too large for
   systemd-repart's size estimation, causing "No space left on device"
   during mkfs.ext4.
alhassankhedr-cohere force-pushed the alhassankhedr/podvm-b200-cc-driver branch from da5003f to 6361ef8 on May 12, 2026 18:55
# Refresh lockfile so optional feature deps (e.g. nv-attestation-sdk
# for nvidia-attester) are resolved even if the checked-in Cargo.lock
# was generated without them.
cargo update --workspace

🔒 Agentic Security Review
Severity: HIGH

The new cargo update --workspace step rewrites dependency resolution from live registries during image builds, then cargo build --locked only enforces that freshly-updated lock state. This removes the protection of building from a pre-reviewed, committed dependency graph.

Impact: A malicious or compromised transitive crate release could be silently pulled into attestation-agent at build time and shipped in PodVM artifacts without an explicit dependency-pin change in this repository.
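
An alternative that keeps `--locked` meaningful is to commit a Cargo.lock generated with the feature enabled and have the build only verify that the crate is present. A minimal sketch, assuming the standard Cargo.lock layout in which each package has a `name = "<crate>"` line; `lockfile_has_crate` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if the committed lockfile already lists the given crate.
lockfile_has_crate() {
  local lockfile=$1 crate=$2
  # -F: fixed string, -x: whole-line match, so "foo" does not match "foo-sys".
  grep -Fxq "name = \"${crate}\"" "$lockfile"
}

# Intended use before the existing locked build (not executed here):
#   lockfile_has_crate Cargo.lock nv-attestation-sdk || {
#     echo "Cargo.lock lacks nv-attestation-sdk; regenerate and commit it" >&2
#     exit 1
#   }
```

With this guard the dependency graph changes only through a reviewed lockfile commit, and a stale lockfile fails the build instead of silently dropping the feature.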

The nvidia-attester count==1 bug is now fixed upstream via
guest-components PR #9 (sync main → cohere), which brings in
the full NVAT SDK rewrite. The Dockerfile sed workaround is
no longer needed.

The new nv-attestation-sdk-sys build.rs expects nvat.h at
/usr/include/nvat.h when NVAT_USE_SYSTEM_LIB=1 is set.
Set CMAKE_INSTALL_PREFIX=/usr so headers and libs install
to /usr/include and /usr/lib instead of /usr/local.
Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu Outdated
The previous commit changed CMAKE_INSTALL_PREFIX to /usr, but the
COPY step still looked for libnvat in /usr/local/lib. Also install
to /usr/lib in the guest image so ldconfig finds it without extra
configuration.
Comment thread src/cloud-api-adaptor/podvm/Dockerfile.podvm_binaries.ubuntu
On Debian, CMAKE_INSTALL_LIBDIR defaults to lib/x86_64-linux-gnu,
so libnvat.so ends up at /usr/lib/x86_64-linux-gnu/ instead of
/usr/lib/. The COPY step and runtime linker then can't find it.

Force CMAKE_INSTALL_LIBDIR=lib so the library installs to /usr/lib/
consistently.

1. NVAT build step now installs its own deps (cmake, libcurl, etc.)
   independently of the guest-components block, so nvidia-attester
   works even if CUSTOM_GC_BINARIES is not set.

2. Replace COPY --from glob (which fails when no files match) with
   RUN --mount for libnvat. This makes non-GPU builds safe — the
   mount always succeeds, and the if-ls check handles the empty case.
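
The "if-ls check" in item 2 can be sketched as the shell body of such a `RUN --mount` step. This is an illustration rather than the PR's actual Dockerfile content: `copy_libnvat_if_present` is a hypothetical helper, and the source and destination directories are passed in by the caller.

```bash
#!/usr/bin/env bash
# Sketch: copy libnvat into the image tree only when it was actually built.
copy_libnvat_if_present() {
  local src_dir=$1 dest_dir=$2
  # ls fails when the glob matches nothing, which is the non-GPU build case.
  if ls "$src_dir"/libnvat.so* >/dev/null 2>&1; then
    mkdir -p "$dest_dir"
    cp -a "$src_dir"/libnvat.so* "$dest_dir"/
  else
    echo "libnvat not built; skipping copy" >&2
  fi
}
```

Unlike a `COPY --from` glob, the empty case here is an explicit, logged branch and the step still exits 0.
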
Mirror the change just made on the kata UVM workflow
(build-kata-uvm-cohere.yaml) so both PodVM build paths default to the
same guest-components branch and produce binaries with a working
multi-GPU nvidia-attester out of the box.

The `cohere` branch's nvidia-attester::detect_platform() has a
`count == 1` guard that silently disables the attester on 2+ GPU
systems. Upstream main's NVAT-SDK-based rewrite (synced into the fork
by PR #9, head = alhassankhedr/sync-main-to-cohere) drops the guard
and handles multi-GPU enumeration via GpuEvidenceSource::collect().

Switch back to `cohere` once PR #9 merges.
--annotation "com.cohere.nvidia.driver=${NVIDIA_DRIVER}" \
--format json > oras-output.json

cat oras-output.json

🔒 Agentic Security Review
Severity: HIGH

GC_REF now defaults to a mutable personal branch (alhassankhedr/sync-main-to-cohere) for push/tag-driven PodVM builds instead of an immutable, reviewed ref. That expands the build trust boundary to branch-head state that can change outside this repository’s review path.

Impact: If that branch is updated maliciously (or compromised), unreviewed guest-components code can be pulled into release artifacts and published as trusted PodVM images.
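
A guard along these lines could reject mutable refs for release builds. It is a sketch under the assumption that release builds should accept only a full 40-character commit SHA in GC_REF; `is_commit_sha` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: true only for a full lowercase 40-hex-digit commit SHA.
is_commit_sha() {
  [[ $1 =~ ^[0-9a-f]{40}$ ]]
}

# Intended use in the workflow before checkout (not executed here):
#   is_commit_sha "$GC_REF" || { echo "GC_REF must be a commit SHA" >&2; exit 1; }
```

Branch names and abbreviated SHAs both fail the check, so a push- or tag-driven build cannot pick up whatever a branch head happens to point at.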

The default `Minimize=guess` for `mkosi.repart-debug/10-root.conf`
chronically under-sizes the debug image once the 595 NVIDIA stack
lands in /usr. We hit two failure modes during the B200 multi-GPU
work, both reproducible from a clean checkout:

1. mkosi `systemd-repart` step fails with
     "no space left on device"
   mid-build because the guessed size doesn't account for libnvat,
   the open-driver kernel modules, fabricmanager, and nscq landing
   on top of the standard ubuntu base.
2. When the build does squeeze through, the resulting qcow2 boots
   but runs out of root-fs space the first time anything writes
   into /var (apt cache, journald, attestation-agent's `/run/aa`).

Fixed locally on the B200 host's checkout (the one that produced
the working /mnt/vms/guest-nvat-debug.qcow2 that demonstrated the
8-GPU evidence path) but never made it back into the branch.
Closing the gap now: pin `Minimize=off`, `SizeMinBytes=12G`,
`SizeMaxBytes=12G`. 12 GiB is empirically enough for the full
B200 CC userspace with comfortable headroom for ad-hoc debugging.

Release variant is unaffected (mkosi.repart/ uses verity-sized
partitions independently).
# the debug variant has headroom for the full B200 CC userspace.
Minimize=off
SizeMinBytes=12G
SizeMaxBytes=12G

Debug partition permanently enlarged for all builds, not just CC595

Medium Severity

The committed 10-root.conf permanently replaces Minimize=guess with Minimize=off and a fixed 12 GiB size for ALL debug builds, not just CC595 builds. This contradicts the PR's claim that "default behaviour is unchanged." The conditional workflow step "Increase debug root partition for CC595 drivers" (guarded by b200_cc_drivers == 'true') writes the exact same content that's already in the committed file, making it a no-op. Standard 580 debug images will now be a fixed 12 GiB instead of minimized.
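
One possible shape for the fix is to leave `Minimize=guess` in the committed file and apply the fixed size only when the flag is set. The sketch below assumes the partition file carries a single `Minimize=` line; `pin_debug_root_size` is a hypothetical helper, not the workflow's existing step.

```bash
#!/usr/bin/env bash
# Sketch: pin the debug root partition size only for CC builds.
pin_debug_root_size() {
  local conf=$1 cc_flag=$2 size=${3:-12G}
  # Standard builds keep the committed Minimize=guess untouched.
  [ "$cc_flag" = "true" ] || return 0
  awk -v size="$size" '
    /^Minimize=/ {
      print "Minimize=off"
      print "SizeMinBytes=" size
      print "SizeMaxBytes=" size
      next
    }
    /^Size(Min|Max)Bytes=/ { next }  # drop stale size lines, if any
    { print }
  ' "$conf" > "${conf}.new" && mv "${conf}.new" "$conf"
}
```

With the committed file back at `Minimize=guess`, the conditional workflow step stops being a no-op and standard 580 debug images stay minimized.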

Reviewed by Cursor Bugbot for commit 7a46e77.

…or NVL5+

- Pin nvidia-driver-open / persistenced / fabricmanager / nscq /
  fabricmanager-dev to 595.71.05-1ubuntu1; add nvlsm and infiniband-diags
  packages required by the B200 Shared NVSwitch path. Also pulls in
  docker.io and a small python/curl/pciutils baseline used by the in-VM
  attestation + tenant-setup flow.
- modules-load.d/nvlink-fabric.conf: add ib_umad. The B200
  nvidia-fabricmanager-start.sh checks lsmod for ib_umad and exits if
  missing (NVL5+ subnet management path goes over the CX7 bridge umad
  interface).
- mkosi.repart-debug/10-root.conf: pin the debug rootfs at a fixed 12G
  (Minimize=off, SizeMin/Max=12G) instead of guess-minimized so the
  systemd-firstboot resize does not run out of room when the in-VM
  setup writes the NVIDIA driver state during early boot.

Tested end-to-end on a B200 bare-metal host: SVM + 4 x 2-GPU tenants
(partitions 4/5/6/7) come up clean, fabric.state=Completed/Success on
all 8 GPUs, ITA composite TDX+NVGPU attestation passes for all 4.

On HGX B200 in confidential-compute mode, the Service VM Fabric Manager
programs NVSwitch routing for a partition asynchronously. The handshake
with the guest GPU happens over in-band NVLink MAD (Probe Request ->
Probe Response) AFTER fmActivateFabricPartition() returns success, so
guest userspace can see the GPU on the PCI bus before that GPU has
actually finished registering with the fabric.

nvidia-persistenced races that handshake on guest boot. If it tries to
register a GPU before that GPU's fabric.state hits Completed, NVML
returns 0x81 (NVLINK_FABRIC_NOT_READY) and the daemon SILENTLY falls
back to non-UVM persistence. Per the NVIDIA Secure AI Operations Guide,
the only way to recover from a missed SPDM/UVM session in CC mode is an
FLR -- i.e. the affected GPU is permanently unable to do NVLink P2P
until the VM is fully reset, with no easy diagnostic. Symptom we hit
in the field: vLLM crashes deep into NCCL init with
cudaErrorSystemNotReady (CUDA error 802) on one GPU of a 2-GPU tenant,
and Xid 170 / 145 cascades on the rest of the partition.

The new wait-nvlink-fabric.sh polls nvidia-smi --query-gpu=
fabric.state,fabric.status until every visible GPU reports
Completed/Success, then exits 0. It is wired into the
nvidia-persistenced systemd drop-in as ExecStartPre. On timeout
(default 180s, knob: WAIT_FABRIC_TIMEOUT) it exits non-zero and the
daemon fails loud rather than silently degrading -- which is much
easier to operate than the old failure mode and converts an
unrecoverable silent corruption into a recoverable explicit failure.
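
The real wait-nvlink-fabric.sh is not reproduced on this page; the following is only a sketch of the polling gate described above. It assumes `nvidia-smi --format=csv,noheader` prints one `Completed, Success` line per ready GPU. `FABRIC_QUERY_CMD` and `WAIT_FABRIC_INTERVAL` are hypothetical knobs added so the loop can be exercised without a GPU.

```bash
#!/usr/bin/env bash
# Sketch of a fabric-readiness gate; not the script shipped in the image.

# Overridable query command (hypothetical knob, for testing without hardware).
FABRIC_QUERY_CMD=${FABRIC_QUERY_CMD:-"nvidia-smi --query-gpu=fabric.state,fabric.status --format=csv,noheader"}

# Reads one line per GPU on stdin. Returns 0 only if at least one GPU was
# listed and every line is exactly "Completed, Success".
all_fabric_ready() {
  local line seen=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    seen=1
    [ "$line" = "Completed, Success" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Polls until all GPUs are ready or WAIT_FABRIC_TIMEOUT seconds have elapsed.
wait_nvlink_fabric() {
  local timeout=${WAIT_FABRIC_TIMEOUT:-180} interval=${WAIT_FABRIC_INTERVAL:-2}
  local deadline=$((SECONDS + timeout)) attempt=0
  while :; do
    attempt=$((attempt + 1))
    if $FABRIC_QUERY_CMD 2>/dev/null | all_fabric_ready; then
      echo "all GPUs fabric ready (attempt $attempt)"
      return 0
    fi
    [ "$SECONDS" -lt "$deadline" ] || break
    sleep "$interval"
  done
  echo "timed out after ${timeout}s waiting for NVLink fabric" >&2
  return 1
}

# As an ExecStartPre script, the last line would simply be: wait_nvlink_fabric
```

Returning non-zero on timeout is what makes systemd fail the unit visibly instead of letting the daemon start against a fabric that is not ready.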

Verified on a B200 bare-metal host (Stage C, FABRIC_MODE=1 Service
VM): SVM + 4 x 2-GPU tenants started in parallel, gate logs
"all GPUs fabric ready (attempt N)" on each tenant, persistenced
successfully enables UVM Persistence on all 8 GPUs, ITA composite
TDX+NVGPU attestation passes (nvgpu_overall=true) on all 4, zero Xid
in dmesg.
nvidia-driver-open=595.71.05-1ubuntu1
nvidia-persistenced=595.71.05-1ubuntu1
nvidia-fabricmanager=595.71.05-1ubuntu1
libnvidia-nscq=595.71.05-1ubuntu1

Base config permanently ships 595 drivers, breaking opt-in intent

High Severity

The base ubuntu.conf has been permanently changed from nvidia-driver-580-open=580.126.20-1ubuntu1 to nvidia-driver-open=595.71.05-1ubuntu1 (and the other three packages likewise). This means all builds — including the default b200_cc_drivers=false path and push-triggered builds — will ship the 595 driver stack, directly contradicting the PR's stated intent that "default behaviour is unchanged" and "push events… continue to ship 580.x." Additionally, the sed override step's first pattern (nvidia-driver-580-open=.*) is now dead code since that string no longer exists in the file. The base file needs to retain the 580 packages so the dynamic sed replacement has something to match when the flag is enabled.

Reviewed by Cursor Bugbot for commit 85f075d.

libcurl4t64
libxml2
libxmlsec1-openssl
pciutils

Duplicate pciutils entry in package list

Low Severity

pciutils appears twice in the Packages= list (lines 41 and 45). While the package manager handles duplicates gracefully, the repetition is unnecessary and suggests a copy-paste oversight.

Reviewed by Cursor Bugbot for commit 85f075d.

ExecCondition=/usr/local/bin/check-nvidia-gpu
# Block daemon startup until every visible GPU has fabric.state=Completed.
# See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh

Unconditional fabric wait blocks persistenced on non-NVLink GPUs

Medium Severity

The new ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh is added unconditionally to the skeleton (all images, not just CC builds). On GPUs without NVLink fabric, nvidia-smi returns N/A for fabric.state/fabric.status, which never matches Completed/Success, so the script loops for 180 seconds and exits non-zero. Unlike the pre-existing ExecStartPost (non-fatal), a failed ExecStartPre is fatal in systemd — nvidia-persistenced will never start on non-NVLink GPU systems. This is a behavioral regression from the prior override, which allowed persistenced to run even when CC-specific post-start commands failed.
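
A gate that distinguishes "no NVLink fabric" from "fabric not ready yet" might look like the sketch below. The exact text nvidia-smi prints for an unsupported field is an assumption here: the pattern accepts any line where both fields contain `N/A`. `classify_fabric_state` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: classify per-GPU "state, status" lines read from stdin.
# Prints: skip  (every GPU reports N/A for both fields, so there is no fabric)
#         ready (every GPU reports "Completed, Success")
#         wait  (anything else, including no GPUs listed)
classify_fabric_state() {
  local line total=0 ready=0 na=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    total=$((total + 1))
    case $line in
      "Completed, Success") ready=$((ready + 1)) ;;
      *N/A*,*N/A*)          na=$((na + 1)) ;;
    esac
  done
  if [ "$total" -eq 0 ]; then
    echo wait
  elif [ "$na" -eq "$total" ]; then
    echo skip
  elif [ "$ready" -eq "$total" ]; then
    echo ready
  else
    echo wait
  fi
}
```

The pre-start script could exit 0 on `skip` as well as `ready`, so nvidia-persistenced still starts on hosts without NVLink fabric while B200 partitions keep the strict wait.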

Reviewed by Cursor Bugbot for commit 85f075d.

…ata-plane

Per the NVIDIA Fabric Manager User Guide audit on 2026-05-19
(fortress scratch/oci-b200/.../docs/KNOWN-ISSUES.md §7), the SVM
image was missing eight B200-specific Shared-NVSwitch packages
that the deprecated `nvlink5-<branch>` metapackage used to pull
in. Bake them into the podvm-mkosi image build so a fresh SVM
boot starts with everything provisioned under proper dpkg
management — no runtime `apt-get download` + `dpkg -x`
workarounds needed.

Added (all version-pinned to match the in-image NVIDIA driver
where the apt repo offers a 595-series build):

  libnvsdm=595.71.05-1ubuntu1
    NVSwitch Device Manager telemetry library; replaces the SXID
    error path on B200/B300 and is required for FM/NVLSM/DCGM
    to surface NVSwitch errors (FM Guide §1259).

  nvidia-imex=595.71.05-1ubuntu1
    Internode Memory Exchange daemon. Brokers cross-OS-instance
    shared CUDA memory channels over NVLink; required for
    multi-tenant Shared-NVSwitch workloads where NCCL crosses
    partition boundaries. Strongest hypothesis for the
    `Xid 170 SECURE Fatal CROSS_CONTAIN` storm on small
    partitions (KNOWN-ISSUES §1 mode C).

  collectx-bringup, mft, mft-oem, mft-autocomplete
    CX7 telemetry / firmware tools (mst, flint, mlxconfig,
    mlxlink). Not strictly load-bearing but required for B200
    LPF triage.

  rdma-core, ibverbs-utils
    OFED data-plane (libibverbs1, librdmacm1, ibv_devices,
    ibv_devinfo). FM Guide "NVIDIA Software Packages" mandates
    "OFED or MOFED package is required" on B200/B300; the image
    already had the management plane (libibmad/libibumad/
    ibstatus from infiniband-diags) but not the data plane.

Verification gates for all of these are wired into
fortress scratch/oci-b200/stacks/qemu-shared-nvswitch/
service-vm/scripts/verify-svc-vm.sh (eight hard-fail gates as
of 2026-05-19).
nvidia-driver-open=595.71.05-1ubuntu1
nvidia-persistenced=595.71.05-1ubuntu1
nvidia-fabricmanager=595.71.05-1ubuntu1
libnvidia-nscq=595.71.05-1ubuntu1

Config files unconditionally ship 595 drivers, breaking opt-in intent

High Severity

The committed ubuntu.conf already has 595.71.05 packages and B200-specific dependencies (e.g. nvidia-fabricmanager-dev, nvlsm, nvidia-imex, libnvsdm), making the b200_cc_drivers flag a no-op for image content. The workflow's conditional sed step tries to match nvidia-driver-580-open=.* which no longer exists in the file (now nvidia-driver-open=…), so it silently matches nothing. Similarly, 10-root.conf is already committed with Minimize=off / 12G fixed size. Both b200_cc_drivers=false and true builds produce identical images — only the tag suffix differs. The PR description states "default behaviour is unchanged… continue to ship 580.x" which contradicts the committed source.

Reviewed by Cursor Bugbot for commit bb64832.

The previous commit (bb64832) added libnvsdm, nvidia-imex,
collectx-bringup, mft, mft-oem, mft-autocomplete, rdma-core,
ibverbs-utils to ubuntu.conf. The 2026-05-19 podvm build attempt
on the B200 host failed at the mkosi apt-resolve phase with:

  collectx-bringup : Depends: ucx but it is not installable

The `ucx` (Unified Communication X) package and the matching
MOFED userspace stack live in NVIDIA's DOCA-Host networking apt
repo (linux.mellanox.com/public/repo/doca/...), which the CAA
mkosi configuration was not sourcing from. The CUDA repo at
developer.download.nvidia.com does ship `collectx-bringup`, `mft`,
`mft-oem`, `mft-autocomplete`, but does NOT ship `ucx`.

Fix in two parts:

1. Dockerfile.mkosi.ubuntu — wire the DOCA-Host repo into
   mkosi.skeleton's apt sources alongside the existing CUDA and
   nvidia-container-toolkit repos. Pin DOCA to priority 100 ("install
   only when explicitly requested or to satisfy a dep, never replace
   a higher-priority candidate"). This makes:
     - ucx (only in DOCA)               install from DOCA  ✓
     - rdma-core / libibumad3 (also in
       DOCA but in universe @500)       install from universe ✓
                                        (keeps inbox OFED, not MOFED)
     - collectx-bringup / mft*          install from CUDA repo
                                        (origin developer.download
                                        .nvidia.com, default 500 > 100)

2. ubuntu.conf — keep collectx-bringup, mft, mft-oem,
   mft-autocomplete in Packages= (they are nvlink5-595 metapackage
   components per the FM User Guide §"Installing Fabric Manager /
   Systems Using Fourth Generation NVSwitches"). Add a comment
   explaining the cross-repo dependency chain.
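
The priority-100 pin from part 1 could be written into the skeleton with something like the sketch below. It uses the standard apt_preferences `Pin: origin` form; the helper name `write_doca_pin` and the destination file name in the usage comment are assumptions, and the actual Dockerfile.mkosi.ubuntu content is not shown here.

```bash
#!/usr/bin/env bash
# Sketch: write an apt preferences stanza that demotes the DOCA-Host repo
# to priority 100, so it only wins when no other repo offers the package.
write_doca_pin() {
  local dest=$1
  mkdir -p "$(dirname "$dest")"
  cat > "$dest" <<'EOF'
Package: *
Pin: origin "linux.mellanox.com"
Pin-Priority: 100
EOF
}

# Hypothetical destination inside the mkosi skeleton (not executed here):
#   write_doca_pin mkosi.skeleton/etc/apt/preferences.d/doca-host
```

Running `apt-cache policy ucx rdma-core` inside the build environment is one way to confirm which repo each candidate version resolves from.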

Also folded in this commit (separate but discovered during the
same FM Guide audit):
  - libibumad3              §"Other NVIDIA Software Packages" (§366)
                            calls it out by name; pin explicitly so
                            it stops being a transitive-only dep.
  - nvidia-utils-595        provides /usr/bin/nvidia-smi which
                            verify-svc-vm.sh invokes directly. Was
                            transitive via nvidia-driver-open; pin.
  - nvidia-modprobe         SUID helper for /dev/nvidia* device
                            nodes; required by nvidia-imex and any
                            non-root NVML caller. Was transitive; pin.
  - datacenter-gpu-manager-4-cuda12
    datacenter-gpu-manager-4-config
                            DCGM v4. FM Guide §"NVSwitch Errors On
                            DGX B200/B300" specifies that DCGM is
                            the consumer that surfaces NVSwitch
                            errors via libnvsdm. We had libnvsdm but
                            no DCGM — half the error path was
                            present. Adding DCGM closes that gap.
  - lshw                    Provides `vpddecode` per FM Guide
                            §"Additional Steps for NVIDIA HGX
                            B200/B300 Systems" for CX7 bridge VPD
                            identification.
-e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \
-e 's|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=595.71.05-1ubuntu1|' \
-e 's|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=595.71.05-1ubuntu1|' \
"$CONF"

Opt-in flag is non-functional; config already hardcodes 595

High Severity

The b200_cc_drivers flag is supposed to opt-in to the 595 driver by sed-replacing 580 package pins, but ubuntu.conf was directly committed with 595 packages (nvidia-driver-open=595.71.05-1ubuntu1), making the sed a no-op. Specifically, the first sed pattern looks for nvidia-driver-580-open= which doesn't exist in the committed file (it has nvidia-driver-open= without 580), and patterns 2–4 match but replace with the same already-present 595 values. This means ALL builds use the 595 driver regardless of the flag, contradicting the stated "Default behaviour is unchanged" and the PR's "opt-in" design. The flag's only actual effect is appending -cc595 to the image tag.

Reviewed by Cursor Bugbot for commit fd50ec5.

ExecCondition=/usr/local/bin/check-nvidia-gpu
# Block daemon startup until every visible GPU has fabric.state=Completed.
# See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh

Missing chmod for new wait-nvlink-fabric.sh script

High Severity

The new wait-nvlink-fabric.sh is wired as ExecStartPre in the nvidia-persistenced service override, but the mkosi.finalize.chroot script only runs chmod +x on check-nvidia-gpu (line 23) — there is no matching chmod +x /usr/local/bin/wait-nvlink-fabric.sh. If the file's executable bit isn't preserved through the skeleton copy, ExecStartPre will fail with a permission error and nvidia-persistenced won't start, leaving GPUs without persistence mode on B200 hosts.
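
Besides adding the missing `chmod +x`, the finalize script could assert that every helper referenced by a unit file is executable. A minimal sketch, with `require_executables` as a hypothetical helper taking the image root followed by absolute in-image paths:

```bash
#!/usr/bin/env bash
# Sketch: fail if any listed in-image path is missing or not executable.
require_executables() {
  local root=$1 rel missing=0
  shift
  for rel in "$@"; do
    if [ ! -x "${root}${rel}" ]; then
      echo "not executable: ${rel}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use near the end of the finalize step (not executed here):
#   require_executables "$BUILDROOT" \
#     /usr/local/bin/check-nvidia-gpu /usr/local/bin/wait-nvlink-fabric.sh
```

This turns a lost executable bit into a build failure rather than a boot-time ExecStartPre error.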

Reviewed by Cursor Bugbot for commit fd50ec5.

The 2026-05-20 build attempt at fd50ec5 confirmed that the
NVIDIA DOCA-Host repo wiring works (apt cleanly fetched the
DOCA Release/Packages indexes and the `ucx`/collectx-bringup
chain resolved). It also surfaced two package names from the
prior commit that do not exist in the NVIDIA cuda repo:

  E: Unable to locate package nvidia-utils-595
  E: Unable to locate package datacenter-gpu-manager-4-config

Root cause:

1. nvidia-utils-595 — the `nvidia-utils-<branch>` suffix series
   in the cuda repo (developer.download.nvidia.com/.../ubuntu2404)
   stops at -580. Starting with R595, the open-driver branch
   bundles nvidia-smi and the other userspace tools inside
   nvidia-driver-open=595.71.05-1ubuntu1 itself rather than
   shipping a separate -utils-<branch> package. Verify gates
   already pass on prior 595-stack images for exactly that
   reason; no explicit pin is needed.

2. datacenter-gpu-manager-4-config — does not exist as a
   separate package. The real DCGM 4 layout (confirmed via
   `apt-cache search datacenter-gpu-manager` against the cuda
   repo Packages.gz) is:
     datacenter-gpu-manager-4-core
     datacenter-gpu-manager-4-cuda{11,12,13}
     datacenter-gpu-manager-4-cuda-all
     datacenter-gpu-manager-4-dev
     datacenter-gpu-manager-4-multinode{,-cuda12,-cuda13}
     datacenter-gpu-manager-4-proprietary{,-cuda11,-cuda12,-cuda13}
   The systemd unit (nvidia-dcgm.service), the dcgmi CLI, and
   the default config files in /etc/nvidia-dcgm/ all ship inside
   datacenter-gpu-manager-4-cuda12. No separate -config package
   is required or available.

Drop both names; keep nvidia-modprobe=595.71.05-1ubuntu1 (which
DID resolve cleanly) and datacenter-gpu-manager-4-cuda12 (which
likewise resolves on its own). Add lengthy comments capturing the
package-layout finding so this trap doesn't get re-hit on the next
NVIDIA-stack rev.
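
A pre-flight check along these lines could catch nonexistent package names before mkosi reaches the apt-resolve phase. It is a sketch that assumes a decompressed copy of the repo's Packages index is available locally; `missing_from_index` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: print each requested package name that has no "Package: <name>"
# stanza in a decompressed apt Packages index.
missing_from_index() {
  local index=$1 pkg
  shift
  for pkg in "$@"; do
    # -F fixed string, -x whole line: "foo" must not match "foo-dev".
    grep -Fxq "Package: ${pkg}" "$index" || printf '%s\n' "$pkg"
  done
}

# Intended use (not executed here), with Packages being the gunzipped index:
#   missing_from_index Packages nvidia-modprobe datacenter-gpu-manager-4-cuda12
```

Any output from the function is a name to drop or correct in ubuntu.conf before starting a build.
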
Previously fmctl-probe (the FM SDK client used by the host's
{activate,deactivate}-partition-by-bdfs.sh wrappers via SSH-into-SVM) had
no production build path — operators were expected to scp the source and
g++ it inside a running SVM. The image's "/usr/local/bin/fmctl-probe is
baked in" claim was aspirational, and the binary's libnvfm ABI was only
guaranteed by accident (whatever libnvfm happened to be on the build
host).

Vendor the source at mkosi.skeleton/usr/src/fmctl-probe/fmctl-probe.cpp
(byte-identical to the canonical copy in cohere-ai/fortress at
scratch/oci-b200/stacks/qemu-shared-nvswitch/orchestration/scripts/
fmctl-probe.cpp) and add a postinst block that chroots into ${BUILDROOT}
and compiles against the rootfs's own libnvfm (from
nvidia-fabricmanager-dev=595.71.05-1ubuntu1), then strips the source
tree. Result: every podvm qcow2 ships /usr/local/bin/fmctl-probe with
guaranteed ABI parity to the libnvfm.so that loads at runtime, and the
SVM never compiles anything during bring-up.

Also add g++ to Packages= — Ubuntu's gcc package does not pull the C++
front-end, so the postinst would otherwise skip the bake silently and
fall back to the runtime g++ workaround. The fortress-side
verify-svc-vm.sh gate 7 fails closed if /usr/local/bin/fmctl-probe is
missing or the 'resolve' subcommand isn't there, catching both
forgot-to-rebuild and forgot-to-resync-vendored-copy regressions before
any tenant ExecStartPre runs.
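The bake step described above could look roughly like the following. This is a sketch under stated assumptions, not the postinst block from this PR: the function name `bake_fmctl_probe` and the compiler flags are illustrative, and it deliberately fails when g++ is missing rather than skipping the bake.

```shell
# Sketch only; the real postinst block may differ. Compiles the vendored
# source inside the image rootfs so it links against that rootfs's libnvfm.
bake_fmctl_probe() {
    buildroot="$1"
    src="/usr/src/fmctl-probe/fmctl-probe.cpp"
    out="/usr/local/bin/fmctl-probe"

    if [ ! -f "${buildroot}${src}" ]; then
        echo "fmctl-probe: vendored source missing at ${src}" >&2
        return 1
    fi
    # ASSUMPTION: fail loudly when g++ is absent instead of skipping.
    if [ ! -e "${buildroot}/usr/bin/g++" ] && [ ! -L "${buildroot}/usr/bin/g++" ]; then
        echo "fmctl-probe: g++ not in rootfs (add it to Packages=)" >&2
        return 1
    fi
    mkdir -p "${buildroot}/usr/local/bin"
    # Run the compiler inside the rootfs so the headers and libnvfm come
    # from the image's nvidia-fabricmanager-dev, not from the build host.
    chroot "$buildroot" g++ -O2 -o "$out" "$src" -lnvfm || return 1
    # Strip the source tree so the image ships only the binary.
    rm -rf "${buildroot}/usr/src/fmctl-probe"
}

# Example call from a postinst script:
#   bake_fmctl_probe "${BUILDROOT}" || exit 1
```

Compiling through chroot is what gives the ABI guarantee: the link step can only see the libnvfm that will also be loaded at runtime.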

The upstream Makefile defaults to PODVM_DISTRO=fedora, so a bare
`make image-debug` silently produces a Fedora qcow2 lacking every NVIDIA
package the B200 Service VM stack pins in
mkosi.presets/system/mkosi.conf.d/ubuntu.conf. The build emits no error
and the broken image only fails at SVM bring-up time -- after the
operator has copied a 5 GiB qcow2 into place and restarted the VM.

Hit this on 2026-05-19 evening: re-built after pushing the fmctl-probe
bake (f4b67ed), invoked `make binaries && make image-debug` directly
under nohup (mistakenly skipping /tmp/run-podvm-build.sh which exports
PODVM_DISTRO=ubuntu). The build ran 5 min into the systemd-repart phase
and surfaced as `mkfs binary for ext4 is not available` -- a misleading
symptom whose actual cause was Fedora 43 e2fsprogs vs Oracular's
systemd-repart. fmctl-probe also silently skipped its bake because
nv_fm_agent.h doesn't exist in Fedora's NVIDIA repos.

Add a guard at the very top of mkosi.postinst that reads
${BUILDROOT}/etc/os-release and exits non-zero if ID != ubuntu. This
fails the mkosi build BEFORE finalize, with a message naming all three
wrapper scripts (run-podvm-build.sh, fortress 04-build-podvm-locally.sh,
fortress run-podvm-build.host.sh) so the operator knows the canonical
fix. Works regardless of entry point -- CI, ad-hoc make, or wrapper.
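A minimal sketch of such a guard, assuming it is wrapped in a function for readability. The name `require_ubuntu_rootfs` and the exact message wording are hypothetical; the guard in mkosi.postinst may be written inline.

```shell
# Hypothetical guard: returns 0 only if the rootfs under $1 reports ID=ubuntu.
require_ubuntu_rootfs() {
    buildroot="$1"
    os_release="${buildroot}/etc/os-release"

    if [ ! -r "$os_release" ]; then
        echo "ERROR: ${os_release} not readable; cannot determine distro" >&2
        return 1
    fi
    # Parse the ID= line instead of sourcing the file, so nothing from the
    # rootfs is executed on the build host. Quotes around the value are optional.
    distro_id="$(sed -n 's/^ID=//p' "$os_release" | tr -d "\"'" | head -n 1)"

    if [ "$distro_id" != "ubuntu" ]; then
        echo "ERROR: rootfs is ID=${distro_id:-unknown}, expected ubuntu." >&2
        echo "Build via run-podvm-build.sh, 04-build-podvm-locally.sh or" >&2
        echo "run-podvm-build.host.sh, which export PODVM_DISTRO=ubuntu." >&2
        return 1
    fi
}

# At the very top of the postinst script:
#   require_ubuntu_rootfs "${BUILDROOT}" || exit 1
```

Checking the rootfs rather than the PODVM_DISTRO variable is what makes the guard independent of the entry point: it tests what was actually built, not what the caller intended.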

The qemu-shared-nvswitch SVM is GPU-less by design (only the four CX7
LPFs are passed through; the eight B200 GPUs go straight to tenant VMs),
so nvidia-imex always fails at startup with NV_ERR_OPERATING_SYSTEM
("Failed to allocate handle to NVIDIA GPU driver") because /dev/nvidiactl
doesn't exist. Worse, the upstream unit is Type=forking +
TimeoutStartSec=infinity, so the failed-init parent hangs in
sigtimedwait() forever and `systemctl start nvidia-imex.service` never
returns -- observed as a 5+ minute boot stall in svc-vm-bootstrap.sh.

Add a drop-in override that mirrors the existing pattern used by
nvidia-fabricmanager.service / nvidia-persistenced.service /
nvidia-cdi-refresh.service:

    ExecCondition=/usr/local/bin/check-nvidia-gpu
    TimeoutStartSec=120

The check-nvidia-gpu predicate skips the unit cleanly on any VM whose
lspci -n doesn't show a 10de:* device, so on the SVM the unit becomes
a no-op (same way nvidia-fabricmanager already does). Tenant VMs (which
DO have GPUs) still start nvidia-imex normally.

The TimeoutStartSec=120 is belt-and-suspenders: even on a real GPU
node, an indefinite hang in the forking handshake would mask a real
config error (e.g. malformed nodes_config.cfg) and stall the unit
dependency graph. 120 s is generous enough for the legitimate path
(driver init + IMEX cluster bootstrap) without being unbounded.
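For reference, a predicate in the spirit of check-nvidia-gpu can be sketched as below. This is not the script shipped in the image: the function name `has_nvidia_device` is hypothetical, and it reads `lspci -n` output from stdin so the matching logic can be exercised without hardware.

```shell
# Hypothetical sketch: succeeds if the `lspci -n` output on stdin lists at
# least one device with PCI vendor ID 10de (NVIDIA).
has_nvidia_device() {
    # `lspci -n` lines look like "89:00.0 0302: 10de:2901 (rev a1)"
    # (slot, class, vendor:device; the device ID here is illustrative).
    # An optional leading PCI domain such as "0000:" is also accepted.
    grep -Eiq '^[0-9a-f:.]+ [0-9a-f]{4}: 10de:[0-9a-f]{4}'
}

# As an ExecCondition= script body:
#   lspci -n | has_nvidia_device
```

With ExecCondition=, exit status 0 lets the unit start, while a status from 1 through 254 skips it without marking it failed. grep's exit status 1 on no match therefore yields the clean skip described above.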

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 8 total unresolved issues (including 7 from previous reviews).

Reviewed by Cursor Bugbot for commit de75f12.

# "No space left on device" during the build.
printf '[Partition]\nType=root\nFormat=ext4\nCopyFiles=/\nMinimize=off\nSizeMinBytes=12G\nSizeMaxBytes=12G\n' > "$CONF"
echo "----- Updated repart config -----"
cat "$CONF"

Debug root partition override is redundant dead code

Low Severity

The "Increase debug root partition for CC595 drivers" workflow step (conditional on b200_cc_drivers == 'true') writes Minimize=off / SizeMinBytes=12G / SizeMaxBytes=12G to 10-root.conf. However, the base 10-root.conf was already directly changed in this commit from Minimize=guess to the identical Minimize=off + 12G content. The conditional override is redundant dead code that writes exactly what's already in the file.

…t pollution

The previous `resolve <bdf,bdf,...>` subcommand matches partitions by
fmGetSupportedFabricPartitions().gpuInfo[].pciBusId. That field is
populated on single-host non-FABRIC_MODE setups but is ALWAYS empty in our
qemu-shared-nvswitch SVM topology: FM runs in FABRIC_MODE=1 with no GPUs
in its OS instance (the GPUs are passed through to tenant VMs), so
pciBusId never populates. Every `fmctl-probe resolve <bdf>` therefore
returned "no supported partition matches BDF set" and tenant ExecStartPre
failed before QEMU launched.

physicalId, by contrast, IS populated by FM in shared-NVSwitch mode --
it's a baseboard-fixed property reported by NVSwitch firmware regardless
of who owns the GPUs. Add a parallel `resolve-by-physids <id,id,...>`
subcommand that matches against gpuInfo[].physicalId. The host-side
activate-/deactivate-partition-by-bdfs.sh wrappers compute the
BDF -> physicalId map from /sys/bus/pci on the host (sort 10de:* +
class 0x030200 by PCI address; canonical B200 HGX baseboard convention)
and call the new subcommand.
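The host-side mapping can be sketched as follows. This is an illustration, not the wrapper code: the function name `gpu_physid_map` is hypothetical, and the numbering rule (physicalId equals the 1-based rank in the address sort) is an ASSUMPTION inferred from the verification run below, not something this PR states.

```shell
# Hypothetical sketch: prints "<bdf> <physicalId>" for every NVIDIA GPU
# found under <sysfs_root>/bus/pci/devices, ordered by PCI address.
gpu_physid_map() {
    sysfs_root="${1:-/sys}"
    for dev in "$sysfs_root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        # Keep only vendor 10de with class 0x030200 (3D controller).
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        [ "$(cat "$dev/class")" = "0x030200" ] || continue
        basename "$dev"
    done |
        # Fixed-width lowercase hex sorts correctly as plain bytes.
        LC_ALL=C sort |
        # ASSUMPTION: physicalId is the 1-based position in this ordering.
        awk '{ print $1, NR }'
}

# Example: gpu_physid_map /sys
```

Taking the sysfs root as a parameter keeps the sketch testable against a fake device tree; on a real host it defaults to /sys.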

Also fix a latent stdout-pollution bug: the post-fmConnect "[fmctl]
connected to %s" banner was on stdout, which got concatenated with the
machine-readable partition id by the calling shell `pid=$(fmctl-probe
resolve...)`. Move it to stderr -- only the partition id stays on stdout.
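The effect on the calling shell can be reproduced with two stand-in functions. Both `probe_before` and `probe_after` are hypothetical and only mimic the output streams of the old and new binary; the `<addr>` text is a placeholder.

```shell
# Hypothetical stand-ins for fmctl-probe before and after the fix.
probe_before() {
    echo "[fmctl] connected to <addr>"      # banner on stdout (old behaviour)
    echo "6"                                # partition id
}
probe_after() {
    echo "[fmctl] connected to <addr>" >&2  # banner on stderr (new behaviour)
    echo "6"                                # partition id stays on stdout
}

# Command substitution captures stdout only.
pid_before="$(probe_before)"   # banner and id, joined by a newline
pid_after="$(probe_after)"     # just the id; the banner still reaches the log
```

Only `pid_after` is usable as a bare partition id by the wrapper scripts, while the banner remains visible to anyone watching stderr.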

Verified end-to-end on a live b200-cc-test SVM: rebuilt fmctl-probe in
place against the rootfs's libnvfm, ran activate-partition-by-bdfs.sh
89:00.0,a8:00.0, the script auto-detected physicalIds 5,6 and resolved
to partition 6 which fmctl-probe list confirmed active. Then launched
all four mid-{a,b,c,d} tenants (8 GPUs split 2+2+2+2 across partitions
4,5,6,7) -- all systemd units came up active, all four FM partitions
active=1.

Source remains byte-identical with the fortress canonical copy
(scratch/oci-b200/.../orchestration/scripts/fmctl-probe.cpp); SHA256
matches.