feat(podvm): opt-in NVIDIA 595.71.05 driver for B200 multi-GPU CC #29
alhassankhedr-cohere wants to merge 21 commits
Adds a `b200_cc_drivers` workflow_dispatch flag (default false) to `Build PodVM Image (Cohere)` that swaps the in-image NVIDIA stack from the default 580 LTS to the 595.71.05 open-kernel branch. This driver is required to enable Confidential Computing on multi-GPU B200 hosts (NVSwitch fabric requires fabricmanager 595.x with TDISP/CC support).
When enabled the workflow:
* sed-replaces the four NVIDIA package pins in `mkosi.presets/system/mkosi.conf.d/ubuntu.conf` before mkosi runs (`nvidia-driver-580-open` -> unversioned `nvidia-driver-open=595.71.05-1ubuntu1`, plus `nvidia-persistenced`, `nvidia-fabricmanager`, `libnvidia-nscq` pinned to `595.71.05-1ubuntu1`). Patterns are package-name anchored so they survive future 580.x.y baseline bumps.
* suffixes `tag` / image names with `-cc595` so a CC build never silently overwrites the standard `cohere-latest` artifact.
* records the resolved driver version (extracted from the conf at build time, so it stays accurate for both standard and CC builds) in `measurements.json` and as a `com.cohere.nvidia.driver` OCI annotation on the published artifact.
All four packages were verified present in NVIDIA's CUDA repo for ubuntu2404/x86_64 (already pinned to priority 1001 in `Dockerfile.mkosi.ubuntu`). The 595 branch only ships the unversioned `nvidia-driver-open` metapackage; pinning by version selects the correct branch.
Default behaviour is unchanged: the flag is off and standard builds continue to ship 580.x.
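The pin swap and the version lookup described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual step: the function names `apply_cc595_pins` and `resolved_driver_version` and the temp-file handling are assumptions; only the package names, the version string, and the `=.*` anchoring come from the description.

```shell
# Sketch only: function names are hypothetical, not taken from the workflow.

# Rewrite the four NVIDIA pins in an mkosi package list to the 595 stack.
# Patterns are anchored on the package name, so any 580.x.y pin matches.
apply_cc595_pins() {
    conf="$1"
    ver="595.71.05-1ubuntu1"
    tmp="$(mktemp)" || return 1
    sed -E \
        -e "s|^([[:space:]]*)nvidia-driver-580-open=.*|\1nvidia-driver-open=${ver}|" \
        -e "s|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=${ver}|" \
        -e "s|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=${ver}|" \
        -e "s|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=${ver}|" \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Print the upstream driver version (e.g. 595.71.05) from whichever
# driver pin is present, versioned (-580-) or unversioned.
resolved_driver_version() {
    sed -nE 's|^[[:space:]]*nvidia-driver(-[0-9]+)?-open=([0-9.]+).*|\2|p' "$1" | head -n 1
}
```

Reading the version back out of the file after the rewrite, rather than hard-coding it, is what keeps the recorded value accurate for both the standard and the CC path.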
The attestation-agent was compiled with --features nvidia-attester but the nv-attestation-sdk-sys crate requires libnvat (the NVIDIA Attestation SDK C++ library) to be pre-built. Without it, the build panics or the feature is silently excluded, producing an AA binary that cannot collect GPU attestation evidence.
Changes:
- Install cmake, libclang-dev, and NVAT runtime deps in gc_builder stage
- Clone and build libnvat from NVIDIA/attestation-sdk before cargo build
- Set NVAT_USE_SYSTEM_LIB=1 so the sys crate links against the installed lib
- Copy libnvat.so into the final PodVM image tree
- Add libcurl4t64, libxml2, libxmlsec1-openssl, pciutils to mkosi packages (runtime deps for libnvat and lspci for Fabric Manager NVL5 detection)
    # which must also be present in the final PodVM image at runtime.
    RUN set -e; \
    if echo "${AA_FEATURES}" | grep -q "nvidia-attester"; then \
    git clone --depth 1 --branch "${NVAT_TAG}" "${NVAT_REPO}" /build/nvat && \
🔒 Agentic Security Review
Severity: MEDIUM
The new build step clones nvidia-attestation-sdk from a mutable Git tag (NVAT_TAG) and builds it directly without immutable pinning or integrity verification. This weakens the supply-chain trust boundary for PodVM artifacts.
Impact: If the upstream tag is retargeted or the source repo is compromised, malicious code could be compiled into libnvat and shipped in the resulting image.
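One way to address the mutable-ref concern is to accept only full commit SHAs and to check the checkout against the expected value before building. The sketch below is illustrative: `is_pinned_commit` and `verify_checkout` are hypothetical helper names, and the repository does not necessarily do this.

```shell
# Sketch only: helper names are hypothetical.

# Return 0 only for a full 40-hex-digit commit SHA. Tags and branch
# names, which can be retargeted upstream, are rejected.
is_pinned_commit() {
    case "$1" in
        *[!0-9a-f]*) return 1 ;;
    esac
    [ "${#1}" -eq 40 ]
}

# Compare the checked-out HEAD of a clone against the expected SHA.
# Requires git; fails before touching git if the ref is not a SHA.
verify_checkout() {
    dir="$1"
    expected="$2"
    if ! is_pinned_commit "$expected"; then
        echo "ref is not a full commit SHA: $expected" >&2
        return 1
    fi
    actual="$(git -C "$dir" rev-parse HEAD)" || return 1
    [ "$actual" = "$expected" ]
}
```

A build step could call `verify_checkout /build/nvat "$EXPECTED_SHA"` right after the clone, where `EXPECTED_SHA` is an assumed build argument holding the reviewed commit.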
…olves
The Cargo.lock on the guest-components cohere branch was generated without the nvidia-attester feature, so nv-attestation-sdk was absent. Building with --locked silently skipped the dependency, resulting in an attestation-agent binary with NvAttester symbols (from libnvat linkage) but no runtime nvidia detection code compiled in. Add `cargo update --workspace` before the locked build so new optional feature dependencies are resolved into the lockfile first.
Two fixes for B200 multi-GPU CC builds:
1. Patch detect_platform() after cloning guest-components to accept multi-GPU systems (count >= 1 instead of count == 1). Without this, the nvidia-attester silently skips registration on systems with more than one GPU. Temporary until cohere-ai/guest-components#7 merges.
2. Increase debug image root partition from Minimize=guess to a fixed 12G. The NVIDIA 595 drivers make the root filesystem too large for systemd-repart's size estimation, causing "No space left on device" during mkfs.ext4.
da5003f to 6361ef8
    # Refresh lockfile so optional feature deps (e.g. nv-attestation-sdk
    # for nvidia-attester) are resolved even if the checked-in Cargo.lock
    # was generated without them.
    cargo update --workspace
🔒 Agentic Security Review
Severity: HIGH
The new cargo update --workspace step rewrites dependency resolution from live registries during image builds, then cargo build --locked only enforces that freshly-updated lock state. This removes the protection of building from a pre-reviewed, committed dependency graph.
Impact: A malicious or compromised transitive crate release could be silently pulled into attestation-agent at build time and shipped in PodVM artifacts without an explicit dependency-pin change in this repository.
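The drift the review describes can be made visible by snapshotting the lockfile before a command runs and comparing afterwards. This is a minimal sketch of that idea; `run_with_lock_guard` is a hypothetical helper, not something in the repository.

```shell
# Sketch only: run a command and fail if it changed the given lockfile.
run_with_lock_guard() {
    lockfile="$1"
    shift
    snapshot="$(mktemp)" || return 1
    cp "$lockfile" "$snapshot" || { rm -f "$snapshot"; return 1; }

    "$@"
    cmd_rc=$?

    # cmp -s is silent and exits non-zero when the files differ.
    if ! cmp -s "$lockfile" "$snapshot"; then
        echo "lockfile drifted: $lockfile no longer matches the committed copy" >&2
        rm -f "$snapshot"
        return 1
    fi
    rm -f "$snapshot"
    return "$cmd_rc"
}
```

Wrapping the refresh as `run_with_lock_guard Cargo.lock cargo update --workspace` would turn a silent re-resolution into a build failure, which forces the lockfile change to go through review instead.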
The nvidia-attester count==1 bug is now fixed upstream via guest-components PR #9 (sync main → cohere), which brings in the full NVAT SDK rewrite. The Dockerfile sed workaround is no longer needed.
The new nv-attestation-sdk-sys build.rs expects nvat.h at /usr/include/nvat.h when NVAT_USE_SYSTEM_LIB=1 is set. Set CMAKE_INSTALL_PREFIX=/usr so headers and libs install to /usr/include and /usr/lib instead of /usr/local.
The previous commit changed CMAKE_INSTALL_PREFIX to /usr, but the COPY step still looked for libnvat in /usr/local/lib. Also install to /usr/lib in the guest image so ldconfig finds it without extra configuration.
On Debian, CMAKE_INSTALL_LIBDIR defaults to lib/x86_64-linux-gnu, so libnvat.so ends up at /usr/lib/x86_64-linux-gnu/ instead of /usr/lib/. The COPY step and runtime linker then can't find it. Force CMAKE_INSTALL_LIBDIR=lib so the library installs to /usr/lib/ consistently.
1. NVAT build step now installs its own deps (cmake, libcurl, etc.) independently of the guest-components block, so nvidia-attester works even if CUSTOM_GC_BINARIES is not set.
2. Replace COPY --from glob (which fails when no files match) with RUN --mount for libnvat. This makes non-GPU builds safe: the mount always succeeds, and the if-ls check handles the empty case.
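The "if-ls check" pattern mentioned in item 2 can be sketched as below. The function name and the directory arguments are illustrative assumptions; the point is only that an unmatched glob takes the skip branch instead of failing the build.

```shell
# Sketch only: copy libnvat into the image tree when it was built,
# and do nothing (successfully) when it was not.
copy_libnvat_if_present() {
    src_dir="$1"
    dest_dir="$2"
    # When the glob matches nothing it stays literal and ls fails,
    # so the else branch handles the non-GPU case.
    if ls "$src_dir"/libnvat.so* >/dev/null 2>&1; then
        mkdir -p "$dest_dir" && cp -a "$src_dir"/libnvat.so* "$dest_dir"/
    else
        echo "libnvat not built; skipping copy" >&2
    fi
}
```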
Mirror the change just made on the kata UVM workflow (build-kata-uvm-cohere.yaml) so both PodVM build paths default to the same guest-components branch and produce binaries with a working multi-GPU nvidia-attester out of the box. The `cohere` branch's nvidia-attester::detect_platform() has a `count == 1` guard that silently disables the attester on 2+ GPU systems. Upstream main's NVAT-SDK-based rewrite (synced into the fork by PR #9, head = alhassankhedr/sync-main-to-cohere) drops the guard and handles multi-GPU enumeration via GpuEvidenceSource::collect(). Switch back to `cohere` once PR #9 merges.
    --annotation "com.cohere.nvidia.driver=${NVIDIA_DRIVER}" \
    --format json > oras-output.json

    cat oras-output.json
🔒 Agentic Security Review
Severity: HIGH
GC_REF now defaults to a mutable personal branch (alhassankhedr/sync-main-to-cohere) for push/tag-driven PodVM builds instead of an immutable, reviewed ref. That expands the build trust boundary to branch-head state that can change outside this repository’s review path.
Impact: If that branch is updated maliciously (or compromised), unreviewed guest-components code can be pulled into release artifacts and published as trusted PodVM images.
The default `Minimize=guess` for `mkosi.repart-debug/10-root.conf`
chronically under-sizes the debug image once the 595 NVIDIA stack
lands in /usr. We hit two failure modes during the B200 multi-GPU
work, both reproducible from a clean checkout:
1. mkosi `systemd-repart` step fails with
"no space left on device"
mid-build because the guessed size doesn't account for libnvat,
the open-driver kernel modules, fabricmanager, and nscq landing
on top of the standard ubuntu base.
2. When the build does squeeze through, the resulting qcow2 boots
but runs out of root-fs space the first time anything writes
into /var (apt cache, journald, attestation-agent's `/run/aa`).
Fixed locally on the B200 host's checkout (the one that produced
the working /mnt/vms/guest-nvat-debug.qcow2 that demonstrated the
8-GPU evidence path) but never made it back into the branch.
Closing the gap now: pin `Minimize=off`, `SizeMinBytes=12G`,
`SizeMaxBytes=12G`. 12 GiB is empirically enough for the full
B200 CC userspace with comfortable headroom for ad-hoc debugging.
Release variant is unaffected (mkosi.repart/ uses verity-sized
partitions independently).
    # the debug variant has headroom for the full B200 CC userspace.
    Minimize=off
    SizeMinBytes=12G
    SizeMaxBytes=12G
Debug partition permanently enlarged for all builds, not just CC595
Medium Severity
The committed 10-root.conf permanently replaces Minimize=guess with Minimize=off and a fixed 12 GiB size for ALL debug builds, not just CC595 builds. This contradicts the PR's claim that "default behaviour is unchanged." The conditional workflow step "Increase debug root partition for CC595 drivers" (guarded by b200_cc_drivers == 'true') writes the exact same content that's already in the committed file, making it a no-op. Standard 580 debug images will now be a fixed 12 GiB instead of minimized.
Reviewed by Cursor Bugbot for commit 7a46e77.
…or NVL5+
- Pin nvidia-driver-open / persistenced / fabricmanager / nscq / fabricmanager-dev to 595.71.05-1ubuntu1; add nvlsm and infiniband-diags packages required by the B200 Shared NVSwitch path. Also pulls in docker.io and a small python/curl/pciutils baseline used by the in-VM attestation + tenant-setup flow.
- modules-load.d/nvlink-fabric.conf: add ib_umad. The B200 nvidia-fabricmanager-start.sh checks lsmod for ib_umad and exits if missing (NVL5+ subnet management path goes over the CX7 bridge umad interface).
- mkosi.repart-debug/10-root.conf: pin the debug rootfs at a fixed 12G (Minimize=off, SizeMin/Max=12G) instead of guess-minimized so the systemd-firstboot resize does not run out of room when the in-VM setup writes the NVIDIA driver state during early boot.
Tested end-to-end on a B200 bare-metal host: SVM + 4 x 2-GPU tenants (partitions 4/5/6/7) come up clean, fabric.state=Completed/Success on all 8 GPUs, ITA composite TDX+NVGPU attestation passes for all 4.
On HGX B200 in confidential-compute mode, the Service VM Fabric Manager programs NVSwitch routing for a partition asynchronously. The handshake with the guest GPU happens over in-band NVLink MAD (Probe Request -> Probe Response) AFTER fmActivateFabricPartition() returns success, so guest userspace can see the GPU on the PCI bus before that GPU has actually finished registering with the fabric.
nvidia-persistenced races that handshake on guest boot. If it tries to register a GPU before that GPU's fabric.state hits Completed, NVML returns 0x81 (NVLINK_FABRIC_NOT_READY) and the daemon SILENTLY falls back to non-UVM persistence. Per the NVIDIA Secure AI Operations Guide, the only way to recover from a missed SPDM/UVM session in CC mode is an FLR -- i.e. the affected GPU is permanently unable to do NVLink P2P until the VM is fully reset, with no easy diagnostic.
Symptom we hit in the field: vLLM crashes deep into NCCL init with cudaErrorSystemNotReady (CUDA error 802) on one GPU of a 2-GPU tenant, and Xid 170 / 145 cascades on the rest of the partition.
The new wait-nvlink-fabric.sh polls nvidia-smi --query-gpu=fabric.state,fabric.status until every visible GPU reports Completed/Success, then exits 0. It is wired into the nvidia-persistenced systemd drop-in as ExecStartPre. On timeout (default 180s, knob: WAIT_FABRIC_TIMEOUT) it exits non-zero and the daemon fails loud rather than silently degrading -- which is much easier to operate than the old failure mode and converts an unrecoverable silent corruption into a recoverable explicit failure.
Verified on a B200 bare-metal host (Stage C, FABRIC_MODE=1 Service VM): SVM + 4 x 2-GPU tenants started in parallel, gate logs "all GPUs fabric ready (attempt N)" on each tenant, persistenced successfully enables UVM Persistence on all 8 GPUs, ITA composite TDX+NVGPU attestation passes (nvgpu_overall=true) on all 4, zero Xid in dmesg.
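The polling gate described in this commit might look roughly like the sketch below. It is not the contents of wait-nvlink-fabric.sh: the exact CSV row format ("Completed, Success"), the `QUERY_CMD` override hook, and the `WAIT_FABRIC_INTERVAL` knob are assumptions made so the loop can be exercised without a GPU; only `WAIT_FABRIC_TIMEOUT`, its 180 s default, and the queried fields come from the commit message.

```shell
# Sketch only: not the real wait-nvlink-fabric.sh.

# Read "state, status" rows on stdin. Succeed only if there is at
# least one row and every row is exactly "Completed, Success".
all_fabric_ready() {
    awk -F', *' '
        { rows++ }
        $1 != "Completed" || $2 != "Success" { bad++ }
        END { exit !(rows > 0 && bad == 0) }
    '
}

# Poll until every GPU is ready or the timeout expires.
# QUERY_CMD and WAIT_FABRIC_INTERVAL are hypothetical knobs.
wait_for_fabric() {
    timeout="${WAIT_FABRIC_TIMEOUT:-180}"
    interval="${WAIT_FABRIC_INTERVAL:-2}"
    query="${QUERY_CMD:-nvidia-smi --query-gpu=fabric.state,fabric.status --format=csv,noheader}"
    elapsed=0
    attempt=1
    while [ "$elapsed" -lt "$timeout" ]; do
        # $query is intentionally unquoted so it splits into a command line.
        if $query 2>/dev/null | all_fabric_ready; then
            echo "all GPUs fabric ready (attempt $attempt)"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
        attempt=$((attempt + 1))
    done
    echo "timed out after ${timeout}s waiting for NVLink fabric" >&2
    return 1
}
```

Because systemd treats a failed `ExecStartPre` as fatal for the unit, returning non-zero on timeout is what makes the daemon fail loudly instead of degrading.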
    nvidia-driver-open=595.71.05-1ubuntu1
    nvidia-persistenced=595.71.05-1ubuntu1
    nvidia-fabricmanager=595.71.05-1ubuntu1
    libnvidia-nscq=595.71.05-1ubuntu1
Base config permanently ships 595 drivers, breaking opt-in intent
High Severity
The base ubuntu.conf has been permanently changed from nvidia-driver-580-open=580.126.20-1ubuntu1 to nvidia-driver-open=595.71.05-1ubuntu1 (and the other three packages likewise). This means all builds — including the default b200_cc_drivers=false path and push-triggered builds — will ship the 595 driver stack, directly contradicting the PR's stated intent that "default behaviour is unchanged" and "push events… continue to ship 580.x." Additionally, the sed override step's first pattern (nvidia-driver-580-open=.*) is now dead code since that string no longer exists in the file. The base file needs to retain the 580 packages so the dynamic sed replacement has something to match when the flag is enabled.
Reviewed by Cursor Bugbot for commit 85f075d.
    libcurl4t64
    libxml2
    libxmlsec1-openssl
    pciutils
Duplicate pciutils entry in package list
Low Severity
pciutils appears twice in the Packages= list (lines 41 and 45). While the package manager handles duplicates gracefully, the repetition is unnecessary and suggests a copy-paste oversight.
Reviewed by Cursor Bugbot for commit 85f075d.
    ExecCondition=/usr/local/bin/check-nvidia-gpu
    # Block daemon startup until every visible GPU has fabric.state=Completed.
    # See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
    ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh
Unconditional fabric wait blocks persistenced on non-NVLink GPUs
Medium Severity
The new ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh is added unconditionally to the skeleton (all images, not just CC builds). On GPUs without NVLink fabric, nvidia-smi returns N/A for fabric.state/fabric.status, which never matches Completed/Success, so the script loops for 180 seconds and exits non-zero. Unlike the pre-existing ExecStartPost (non-fatal), a failed ExecStartPre is fatal in systemd — nvidia-persistenced will never start on non-NVLink GPU systems. This is a behavioral regression from the prior override, which allowed persistenced to run even when CC-specific post-start commands failed.
Reviewed by Cursor Bugbot for commit 85f075d.
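One possible mitigation for the regression described in this review is to decide first whether the fabric wait applies at all. The sketch below is an illustration of that idea, not the PR's fix; `fabric_wait_applicable` is a hypothetical helper and the "N/A" row format is taken from the review comment, not verified against nvidia-smi.

```shell
# Sketch only: decide whether a fabric wait makes sense on this system.

# Read "state, status" rows on stdin. Succeed if at least one row
# reports a real fabric state; fail if every row is N/A or there are
# no rows (GPUs without NVLink fabric).
fabric_wait_applicable() {
    awk -F', *' '
        $1 != "N/A" && $1 != "" { real++ }
        END { exit !(real > 0) }
    '
}

# Possible use at the top of a pre-start script:
#   if ! nvidia-smi --query-gpu=fabric.state,fabric.status \
#          --format=csv,noheader | fabric_wait_applicable; then
#       exit 0   # nothing to wait for; let the daemon start
#   fi
```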
…ata-plane
Per the NVIDIA Fabric Manager User Guide audit on 2026-05-19
(fortress scratch/oci-b200/.../docs/KNOWN-ISSUES.md §7), the SVM
image was missing eight B200-specific Shared-NVSwitch packages
that the deprecated `nvlink5-<branch>` metapackage used to pull
in. Bake them into the podvm-mkosi image build so a fresh SVM
boot starts with everything provisioned under proper dpkg
management — no runtime `apt-get download` + `dpkg -x`
workarounds needed.
Added (all version-pinned to match the in-image NVIDIA driver
where the apt repo offers a 595-series build):
libnvsdm=595.71.05-1ubuntu1
NVSwitch Device Manager telemetry library; replaces the SXID
error path on B200/B300 and is required for FM/NVLSM/DCGM
to surface NVSwitch errors (FM Guide §1259).
nvidia-imex=595.71.05-1ubuntu1
Internode Memory Exchange daemon. Brokers cross-OS-instance
shared CUDA memory channels over NVLink; required for
multi-tenant Shared-NVSwitch workloads where NCCL crosses
partition boundaries. Strongest hypothesis for the
`Xid 170 SECURE Fatal CROSS_CONTAIN` storm on small
partitions (KNOWN-ISSUES §1 mode C).
collectx-bringup, mft, mft-oem, mft-autocomplete
CX7 telemetry / firmware tools (mst, flint, mlxconfig,
mlxlink). Not strictly load-bearing but required for B200
LPF triage.
rdma-core, ibverbs-utils
OFED data-plane (libibverbs1, librdmacm1, ibv_devices,
ibv_devinfo). FM Guide "NVIDIA Software Packages" mandates
"OFED or MOFED package is required" on B200/B300; the image
already had the management plane (libibmad/libibumad/
ibstatus from infiniband-diags) but not the data plane.
Verification gates for all of these are wired into
fortress scratch/oci-b200/stacks/qemu-shared-nvswitch/
service-vm/scripts/verify-svc-vm.sh (eight hard-fail gates as
of 2026-05-19).
    nvidia-driver-open=595.71.05-1ubuntu1
    nvidia-persistenced=595.71.05-1ubuntu1
    nvidia-fabricmanager=595.71.05-1ubuntu1
    libnvidia-nscq=595.71.05-1ubuntu1
Config files unconditionally ship 595 drivers, breaking opt-in intent
High Severity
The committed ubuntu.conf already has 595.71.05 packages and B200-specific dependencies (e.g. nvidia-fabricmanager-dev, nvlsm, nvidia-imex, libnvsdm), making the b200_cc_drivers flag a no-op for image content. The workflow's conditional sed step tries to match nvidia-driver-580-open=.* which no longer exists in the file (now nvidia-driver-open=…), so it silently matches nothing. Similarly, 10-root.conf is already committed with Minimize=off / 12G fixed size. Both b200_cc_drivers=false and true builds produce identical images — only the tag suffix differs. The PR description states "default behaviour is unchanged… continue to ship 580.x" which contradicts the committed source.
Reviewed by Cursor Bugbot for commit bb64832.
The previous commit (bb64832) added libnvsdm, nvidia-imex, collectx-bringup, mft, mft-oem, mft-autocomplete, rdma-core, ibverbs-utils to ubuntu.conf. The 2026-05-19 podvm build attempt on the B200 host failed at the mkosi apt-resolve phase with:
    collectx-bringup : Depends: ucx but it is not installable
The `ucx` (Unified Communication X) package and the matching MOFED userspace stack live in NVIDIA's DOCA-Host networking apt repo (linux.mellanox.com/public/repo/doca/...), which the CAA mkosi configuration was not sourcing from. The CUDA repo at developer.download.nvidia.com does ship `collectx-bringup`, `mft`, `mft-oem`, `mft-autocomplete`, but does NOT ship `ucx`.
Fix in two parts:
1. Dockerfile.mkosi.ubuntu: wire the DOCA-Host repo into mkosi.skeleton's apt sources alongside the existing CUDA and nvidia-container-toolkit repos. Pin DOCA to priority 100 ("install only when explicitly requested or to satisfy a dep, never replace a higher-priority candidate"). This makes:
   - ucx (only in DOCA) install from DOCA ✓
   - rdma-core / libibumad3 (also in DOCA but in universe @500) install from universe ✓ (keeps inbox OFED, not MOFED)
   - collectx-bringup / mft* install from CUDA repo (origin developer.download.nvidia.com, default 500 > 100)
2. ubuntu.conf: keep collectx-bringup, mft, mft-oem, mft-autocomplete in Packages= (they are nvlink5-595 metapackage components per the FM User Guide §"Installing Fabric Manager / Systems Using Fourth Generation NVSwitches"). Add a comment explaining the cross-repo dependency chain.
Also folded in this commit (separate but discovered during the same FM Guide audit):
- libibumad3
  §"Other NVIDIA Software Packages" (§366) calls it out by name; pin explicitly so it stops being a transitive-only dep.
- nvidia-utils-595
  provides /usr/bin/nvidia-smi which verify-svc-vm.sh invokes directly. Was transitive via nvidia-driver-open; pin.
- nvidia-modprobe
  SUID helper for /dev/nvidia* device nodes; required by nvidia-imex and any non-root NVML caller. Was transitive; pin.
- datacenter-gpu-manager-4-cuda12, datacenter-gpu-manager-4-config
  DCGM v4. FM Guide §"NVSwitch Errors On DGX B200/B300" specifies that DCGM is the consumer that surfaces NVSwitch errors via libnvsdm. We had libnvsdm but no DCGM, so half the error path was present. Adding DCGM closes that gap.
- lshw
  Provides `vpddecode` per FM Guide §"Additional Steps for NVIDIA HGX B200/B300 Systems" for CX7 bridge VPD identification.
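The low-priority pin described in part 1 could be expressed as an apt preferences stanza. The sketch below writes one from a shell function; the file name `doca-host` and the helper name are assumptions, and the origin host is taken from the repo URL quoted in the commit message. The actual Dockerfile may wire this differently.

```shell
# Sketch only: write an apt preferences entry that ranks the DOCA-Host
# repo below the default priority (500), so it only satisfies
# dependencies no other configured repo can provide.
write_doca_pin() {
    prefs_dir="$1"    # e.g. <skeleton>/etc/apt/preferences.d
    mkdir -p "$prefs_dir" || return 1
    cat > "$prefs_dir/doca-host" <<'EOF'
Package: *
Pin: origin linux.mellanox.com
Pin-Priority: 100
EOF
}
```

Inside the built image, `apt-cache policy ucx` would be one way to confirm which repo a candidate resolves from.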
    -e 's|^([[:space:]]*)nvidia-persistenced=.*|\1nvidia-persistenced=595.71.05-1ubuntu1|' \
    -e 's|^([[:space:]]*)nvidia-fabricmanager=.*|\1nvidia-fabricmanager=595.71.05-1ubuntu1|' \
    -e 's|^([[:space:]]*)libnvidia-nscq=.*|\1libnvidia-nscq=595.71.05-1ubuntu1|' \
    "$CONF"
Opt-in flag is non-functional; config already hardcodes 595
High Severity
The b200_cc_drivers flag is supposed to opt-in to the 595 driver by sed-replacing 580 package pins, but ubuntu.conf was directly committed with 595 packages (nvidia-driver-open=595.71.05-1ubuntu1), making the sed a no-op. Specifically, the first sed pattern looks for nvidia-driver-580-open= which doesn't exist in the committed file (it has nvidia-driver-open= without 580), and patterns 2–4 match but replace with the same already-present 595 values. This means ALL builds use the 595 driver regardless of the flag, contradicting the stated "Default behaviour is unchanged" and the PR's "opt-in" design. The flag's only actual effect is appending -cc595 to the image tag.
Reviewed by Cursor Bugbot for commit fd50ec5.
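The silent no-op described in this review arises because sed exits 0 even when a pattern matches nothing. A guard that checks for the expected baseline pin first would surface the problem at build time. This is a hedged sketch with a hypothetical helper name, not code from the workflow.

```shell
# Sketch only: fail loudly if the pin the sed step expects to rewrite
# is absent, instead of letting sed silently match nothing.
require_pin() {
    conf="$1"
    pkg="$2"
    if ! grep -Eq "^[[:space:]]*${pkg}=" "$conf"; then
        echo "expected pin '${pkg}=' not found in ${conf}; refusing to continue" >&2
        return 1
    fi
}

# Possible use before the rewrite step:
#   require_pin "$CONF" nvidia-driver-580-open || exit 1
```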
    ExecCondition=/usr/local/bin/check-nvidia-gpu
    # Block daemon startup until every visible GPU has fabric.state=Completed.
    # See /usr/local/bin/wait-nvlink-fabric.sh for the rationale (B200 CC race).
    ExecStartPre=/usr/local/bin/wait-nvlink-fabric.sh
Missing chmod for new wait-nvlink-fabric.sh script
High Severity
The new wait-nvlink-fabric.sh is wired as ExecStartPre in the nvidia-persistenced service override, but the mkosi.finalize.chroot script only runs chmod +x on check-nvidia-gpu (line 23) — there is no matching chmod +x /usr/local/bin/wait-nvlink-fabric.sh. If the file's executable bit isn't preserved through the skeleton copy, ExecStartPre will fail with a permission error and nvidia-persistenced won't start, leaving GPUs without persistence mode on B200 hosts.
Reviewed by Cursor Bugbot for commit fd50ec5.
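A finalize-time check along these lines would cover both helper scripts rather than only check-nvidia-gpu. The helper name is hypothetical and the sketch does not reproduce the actual mkosi.finalize.chroot script.

```shell
# Sketch only: make sure every listed helper exists and is executable.
ensure_executable() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing helper: $f" >&2
            return 1
        fi
        # Set the bit only when it is not already there.
        [ -x "$f" ] || chmod +x "$f" || return 1
    done
}

# Possible use inside the image root:
#   ensure_executable /usr/local/bin/check-nvidia-gpu \
#                     /usr/local/bin/wait-nvlink-fabric.sh
```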
The 2026-05-20 build attempt at fd50ec5 confirmed that the NVIDIA DOCA-Host repo wiring works (apt cleanly fetched the DOCA Release/Packages indexes and the `ucx`/collectx-bringup chain resolved). It also surfaced two package names from the prior commit that do not exist in the NVIDIA cuda repo:
    E: Unable to locate package nvidia-utils-595
    E: Unable to locate package datacenter-gpu-manager-4-config
Root cause:
1. nvidia-utils-595: the `nvidia-utils-<branch>` suffix series in the cuda repo (developer.download.nvidia.com/.../ubuntu2404) stops at -580. Starting with R595, the open-driver branch bundles nvidia-smi and the other userspace tools inside nvidia-driver-open=595.71.05-1ubuntu1 itself rather than shipping a separate -utils-<branch> package. Verify gates already pass on prior 595-stack images for exactly that reason; no explicit pin is needed.
2. datacenter-gpu-manager-4-config: does not exist as a separate package. The real DCGM 4 layout (confirmed via `apt-cache search datacenter-gpu-manager` against the cuda repo Packages.gz) is:
    datacenter-gpu-manager-4-core
    datacenter-gpu-manager-4-cuda{11,12,13}
    datacenter-gpu-manager-4-cuda-all
    datacenter-gpu-manager-4-dev
    datacenter-gpu-manager-4-multinode{,-cuda12,-cuda13}
    datacenter-gpu-manager-4-proprietary{,-cuda11,-cuda12,-cuda13}
   The systemd unit (nvidia-dcgm.service), the dcgmi CLI, and the default config files in /etc/nvidia-dcgm/ all ship inside datacenter-gpu-manager-4-cuda12. No separate -config package is required or available.
Drop both names; keep nvidia-modprobe=595.71.05-1ubuntu1 (which DID resolve cleanly) and datacenter-gpu-manager-4-cuda12 (which likewise resolves on its own). Add lengthy comments capturing the package-layout finding so this trap doesn't get re-hit on the next NVIDIA-stack rev.
Previously fmctl-probe (the FM SDK client used by the host's
{activate,deactivate}-partition-by-bdfs.sh wrappers via SSH-into-SVM) had
no production build path — operators were expected to scp the source and
g++ it inside a running SVM. The image's "/usr/local/bin/fmctl-probe is
baked in" claim was aspirational, and the binary's libnvfm ABI was only
guaranteed by accident (whatever libnvfm happened to be on the build
host).
Vendor the source at mkosi.skeleton/usr/src/fmctl-probe/fmctl-probe.cpp
(byte-identical to the canonical copy in cohere-ai/fortress at
scratch/oci-b200/stacks/qemu-shared-nvswitch/orchestration/scripts/
fmctl-probe.cpp) and add a postinst block that chroots into ${BUILDROOT}
and compiles against the rootfs's own libnvfm (from
nvidia-fabricmanager-dev=595.71.05-1ubuntu1), then strips the source
tree. Result: every podvm qcow2 ships /usr/local/bin/fmctl-probe with
guaranteed ABI parity to the libnvfm.so that loads at runtime, and the
SVM never compiles anything during bring-up.
Also add g++ to Packages= — Ubuntu's gcc package does not pull the C++
front-end, so the postinst would otherwise skip the bake silently and
fall back to the runtime g++ workaround. The fortress-side
verify-svc-vm.sh gate 7 fails closed if /usr/local/bin/fmctl-probe is
missing or the 'resolve' subcommand isn't there, catching both
forgot-to-rebuild and forgot-to-resync-vendored-copy regressions before
any tenant ExecStartPre runs.
The upstream Makefile defaults to PODVM_DISTRO=fedora, so a bare `make image-debug` silently produces a Fedora qcow2 lacking every NVIDIA package the B200 Service VM stack pins in mkosi.presets/system/mkosi.conf.d/ubuntu.conf. The build emits no error and the broken image only fails at SVM bring-up time -- after the operator has copied a 5 GiB qcow2 into place and restarted the VM.
Hit this on 2026-05-19 evening: re-built after pushing the fmctl-probe bake (f4b67ed), invoked `make binaries && make image-debug` directly under nohup (mistakenly skipping /tmp/run-podvm-build.sh which exports PODVM_DISTRO=ubuntu). The build ran 5 min into the systemd-repart phase and surfaced as `mkfs binary for ext4 is not available` -- a misleading symptom whose actual cause was Fedora 43 e2fsprogs vs Oracular's systemd-repart. fmctl-probe also silently skipped its bake because nv_fm_agent.h doesn't exist in Fedora's NVIDIA repos.
Add a guard at the very top of mkosi.postinst that reads ${BUILDROOT}/etc/os-release and exits non-zero if ID != ubuntu. This fails the mkosi build BEFORE finalize, with a message naming all three wrapper scripts (run-podvm-build.sh, fortress 04-build-podvm-locally.sh, fortress run-podvm-build.host.sh) so the operator knows the canonical fix. Works regardless of entry point -- CI, ad-hoc make, or wrapper.
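A distro guard of the kind described could be sketched as follows. The function name and the error wording are assumptions; the real mkosi.postinst guard names the three wrapper scripts in its message, which this sketch does not reproduce.

```shell
# Sketch only: refuse to continue unless the build root is Ubuntu.
require_ubuntu_buildroot() {
    root="${1:-${BUILDROOT:-}}"
    os_release="${root}/etc/os-release"
    if [ ! -r "$os_release" ]; then
        echo "cannot read ${os_release}" >&2
        return 1
    fi
    # os-release is a shell-compatible KEY=value file; source it in a
    # subshell so its variables do not leak into the caller.
    id="$( . "$os_release" && printf '%s' "${ID:-}" )"
    if [ "$id" != "ubuntu" ]; then
        echo "build root is '${id}', expected 'ubuntu'; set PODVM_DISTRO=ubuntu" >&2
        return 1
    fi
}
```

Running the check at the very top of the post-install script means the build stops before any NVIDIA-specific step can be silently skipped.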
The qemu-shared-nvswitch SVM is GPU-less by design (only the four CX7
LPFs are passed through; the eight B200 GPUs go straight to tenant VMs),
so nvidia-imex always fails at startup with NV_ERR_OPERATING_SYSTEM
("Failed to allocate handle to NVIDIA GPU driver") because /dev/nvidiactl
doesn't exist. Worse, the upstream unit is Type=forking +
TimeoutStartSec=infinity, so the failed-init parent hangs in
sigtimedwait() forever and `systemctl start nvidia-imex.service` never
returns -- observed as a 5+ minute boot stall in svc-vm-bootstrap.sh.
Add a drop-in override that mirrors the existing pattern used by
nvidia-fabricmanager.service / nvidia-persistenced.service /
nvidia-cdi-refresh.service:
ExecCondition=/usr/local/bin/check-nvidia-gpu
TimeoutStartSec=120
The check-nvidia-gpu predicate skips the unit cleanly on any VM whose
lspci -n doesn't show a 10de:* device, so on the SVM the unit becomes
a no-op (same way nvidia-fabricmanager already does). Tenant VMs (which
DO have GPUs) still start nvidia-imex normally.
The TimeoutStartSec=120 is belt-and-suspenders: even on a real GPU
node, an indefinite hang in the forking handshake would mask a real
config error (e.g. malformed nodes_config.cfg) and stall the unit
dependency graph. 120 s is generous enough for the legitimate path
(driver init + IMEX cluster bootstrap) without being unbounded.
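The predicate that the drop-in relies on can be approximated as below. This is not the contents of /usr/local/bin/check-nvidia-gpu; it is a sketch that reads `lspci -n` style output from stdin so it can be tried without hardware, and the function name is hypothetical.

```shell
# Sketch only: succeed if any PCI device has NVIDIA's vendor id (10de).
# Expects `lspci -n` lines such as "01:00.0 0302: 10de:2901 (rev a1)".
has_nvidia_device() {
    grep -Eq '^[^ ]+ [0-9a-f]{4}: 10de:[0-9a-f]{4}'
}

# Possible use as an ExecCondition helper:
#   lspci -n | has_nvidia_device
```

With `ExecCondition=`, a non-zero exit from such a check makes systemd skip the unit rather than mark it failed, which matches the clean no-op behaviour described for the GPU-less SVM.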
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 8 total unresolved issues (including 7 from previous reviews).
Reviewed by Cursor Bugbot for commit de75f12.
    # "No space left on device" during the build.
    printf '[Partition]\nType=root\nFormat=ext4\nCopyFiles=/\nMinimize=off\nSizeMinBytes=12G\nSizeMaxBytes=12G\n' > "$CONF"
    echo "----- Updated repart config -----"
    cat "$CONF"
Debug root partition override is redundant dead code
Low Severity
The "Increase debug root partition for CC595 drivers" workflow step (conditional on b200_cc_drivers == 'true') writes Minimize=off / SizeMinBytes=12G / SizeMaxBytes=12G to 10-root.conf. However, the base 10-root.conf was already directly changed in this commit from Minimize=guess to the identical Minimize=off + 12G content. The conditional override is redundant dead code that writes exactly what's already in the file.
Reviewed by Cursor Bugbot for commit de75f12.
…t pollution
The previous `resolve <bdf,bdf,...>` subcommand matches partitions by
fmGetSupportedFabricPartitions().gpuInfo[].pciBusId, which is correct on
single-host non-FABRIC_MODE setups but ALWAYS empty in our
qemu-shared-nvswitch SVM topology: FM runs in FABRIC_MODE=1 with no GPUs
in its OS instance (the GPUs are passed through to tenant VMs), so
pciBusId never populates. Every `fmctl-probe resolve <bdf>` therefore
returned "no supported partition matches BDF set" and tenant ExecStartPre
failed before QEMU launched.
physicalId, by contrast, IS populated by FM in shared-NVSwitch mode --
it's a baseboard-fixed property reported by NVSwitch firmware regardless
of who owns the GPUs. Add a parallel `resolve-by-physids <id,id,...>`
subcommand that matches against gpuInfo[].physicalId. The host-side
activate-/deactivate-partition-by-bdfs.sh wrappers compute the
BDF -> physicalId map from /sys/bus/pci on the host (sort 10de:* +
class 0x030200 by PCI address; canonical B200 HGX baseboard convention)
and call the new subcommand.
Also fix a latent stdout-pollution bug: the post-fmConnect "[fmctl]
connected to %s" banner was on stdout, which got concatenated with the
machine-readable partition id by the calling shell `pid=$(fmctl-probe
resolve...)`. Move it to stderr -- only the partition id stays on stdout.
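The stdout-pollution fix can be illustrated with a small stand-in. `resolve_partition` below is a hypothetical shell function, not fmctl-probe itself, and the address and partition id are made-up values; it only shows why a banner on stderr keeps command substitution clean.

```shell
# Sketch only: a stand-in for a tool that prints a human-readable
# banner and a machine-readable id.
resolve_partition() {
    echo "[fmctl] connected to $1" >&2   # diagnostics go to stderr
    echo "6"                             # only the id goes to stdout
}

# Command substitution captures stdout only, so pid is exactly the id
# even though the banner is still shown on the terminal.
pid="$(resolve_partition 127.0.0.1)"
echo "activating partition ${pid}"
```

Had the banner been written to stdout, `pid` would contain both lines and any later numeric use of it would break.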
Verified end-to-end on a live b200-cc-test SVM: rebuilt fmctl-probe in
place against the rootfs's libnvfm, ran activate-partition-by-bdfs.sh
89:00.0,a8:00.0, the script auto-detected physicalIds 5,6 and resolved
to partition 6 which fmctl-probe list confirmed active. Then launched
all four mid-{a,b,c,d} tenants (8 GPUs split 2+2+2+2 across partitions
4,5,6,7) -- all systemd units came up active, all four FM partitions
active=1.
Source remains byte-identical with the fortress canonical copy
(scratch/oci-b200/.../orchestration/scripts/fmctl-probe.cpp); SHA256
matches.


Summary
Adds a new `b200_cc_drivers` `workflow_dispatch` flag (default false) to the `Build PodVM Image (Cohere)` workflow. When enabled, the build swaps the in-image NVIDIA stack from the default 580 LTS to 595.71.05 (open-kernel branch), the minimum driver / fabricmanager required to enable Confidential Computing on multi-GPU B200 hosts (NVSwitch fabric requires the 595.x series with TDISP/CC support).
When the flag is enabled the workflow:
* Sed-replaces the four NVIDIA package pins in `mkosi.presets/system/mkosi.conf.d/ubuntu.conf` before mkosi runs:
  * `nvidia-driver-580-open` → `nvidia-driver-open=595.71.05-1ubuntu1` (the 595 branch only ships the unversioned metapackage in NVIDIA's CUDA repo; there is no `nvidia-driver-595-open`).
  * `nvidia-persistenced=595.71.05-1ubuntu1`
  * `nvidia-fabricmanager=595.71.05-1ubuntu1`
  * `libnvidia-nscq=595.71.05-1ubuntu1`
  Patterns are package-name anchored (any `580.x.y` is matched by `=.*`) so this survives future bumps to the 580 baseline (e.g. fix(podvm): bump NVIDIA driver pins to 580.159.03 #28).
* Suffixes `tag` / OCI tag / GCP image names with `-cc595` so a CC build never silently overwrites the standard `cohere-latest` artifact (e.g. `podvm-ubuntu-tdx-release-cohere-latest-cc595`).
* Records the resolved driver version (extracted from `ubuntu.conf` at build time, so it's accurate for both standard and CC builds) in `measurements.json` as `nvidia_driver` and as a `com.cohere.nvidia.driver` annotation on the published OCI artifact.
All four `595.71.05-1ubuntu1` packages were verified present in NVIDIA's CUDA apt repo for `ubuntu2404/x86_64` (already wired up and pinned to priority 1001 in `Dockerfile.mkosi.ubuntu`).
Default behaviour is unchanged: the flag is off, push events on the `cohere` branch and `podvm-v*` tags continue to ship 580.x.
Attestation impact
CC-driver builds report a different `x-nvidia-gpu-driver-version` (595.71.05) and a different RTMR2 (different kernel modules in initrd / driver firmware). Any ITA appraisal policies need a parallel `595.71.05` profile before the CC image is rolled into a prod attestation flow. Standard builds are unaffected.
Test plan
* … `Build PodVM Image (Cohere)` via `workflow_dispatch` with `b200_cc_drivers = false` and confirm:
  * … `…-cohere-latest-{release,debug}` (no `-cc595` suffix).
  * `measurements.json` `nvidia_driver` matches whatever `ubuntu.conf` pins on `cohere`.
* … `Build PodVM Image (Cohere)` via `workflow_dispatch` with `b200_cc_drivers = true` and confirm:
  * … `grep` output shows the four pins flipped.
  * … `nvidia-driver-open=595.71.05-1ubuntu1` and friends from the CUDA repo.
  * … `-cc595` and OCI artifact has `com.cohere.nvidia.driver=595.71.05` annotation.
* … `nvidia-fabricmanager.service` reaches `active` and CC mode is reported by `nvidia-smi conf-compute -f`.
* … `595.71.05` profile before promoting the image.
Notes
* … `cohere`, not stacked on the 580 bump branch.
Note
Medium Risk
Modifies PodVM build pipeline and Ubuntu image contents (NVIDIA driver stack, systemd units, and low-level boot/partition sizing), which can impact GPU initialization and image reproducibility. Default behavior is mostly unchanged, but enabling the new flag produces materially different artifacts and attestation measurements.
Overview
Adds an opt-in `b200_cc_drivers` toggle to the Cohere PodVM GitHub Actions workflow to build a separate `-cc595` image variant, record the resolved NVIDIA driver version in `measurements.json`, and publish it as an OCI annotation.
Updates the Ubuntu mkosi image to support B200 multi-GPU confidential compute by wiring an additional NVIDIA DOCA apt repo (pinned as low-priority fallback), expanding/pinning the NVIDIA package set around the R595 stack (incl. Fabric Manager dev headers, IMEX/NVLink components, RDMA/DCGM tooling), and increasing the debug root partition sizing to avoid build/boot space failures.
Hardens runtime/boot behavior by adding a post-install Ubuntu-only guard, baking an `fmctl-probe` binary into the image during mkosi postinst, ensuring `ib_umad` loads for Fabric Manager, gating `nvidia-imex` on GPU presence with a bounded startup timeout, and adding a pre-start wait (`wait-nvlink-fabric.sh`) before `nvidia-persistenced` to avoid NVLink fabric readiness races.
Reviewed by Cursor Bugbot for commit 077f479. Bugbot is set up for automated code reviews on this repo.