ci: build NVIDIA GPU confidential Kata UVM image from source by alhassankhedr-cohere · Pull Request #31 · cohere-ai/cloud-api-adaptor

alhassankhedr-cohere · 2026-05-14T18:22:14Z

Summary

Add a workflow that builds the kata-containers nvidia-gpu-confidential UVM image with our cohere-fork guest-components (attestation-agent + api-server-rest) baked in at compile time, instead of post-hoc patching the stock NVIDIA image with losetup + veritysetup format (which is what fortress/scratch/oci-b200/k8s/06-patch-uvm.sh has been doing).

Mechanics

1. Check out kata-containers @ inputs.kata_ref (default 3.30.0).
2. Rewrite versions.yaml: point externals.coco-guest-components.url and .version at cohere-ai/guest-components @ <gc_ref> (resolved to a SHA via git ls-remote so the build is reproducible).
3. make rootfs-image-nvidia-gpu-confidential-tarball: kata's existing build infrastructure clones our fork into the coco-guest-components builder container, statically builds AA + api-server-rest + CDH, and nvidia_rootfs.sh::coco_guest_components() copies them into the rootfs at /usr/local/bin/. From there the standard rootfs assembly + dm-verity formatting runs unchanged.
4. Extract the .image and root_hash file from the tarball, surface dm-verity params (root_hash, salt, data_blocks, block sizes) and the image sha256 as a measurements.json layer.
5. zstd -19 the .image, push to GHCR via oras as a 3-layer artifact with annotations covering build provenance + verity params.
6. SLSA build provenance attestation.

Output

ghcr.io/cohere-ai/cloud-api-adaptor/kata-uvm-nvidia-gpu-confidential:<tag>

where <tag> is cohere-latest for branch pushes, kata-${kata_ref}-gc-${gc_ref} for workflow_dispatch, or the literal tag for kata-uvm-v* tag pushes.
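The tag rule above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function and argument names are hypothetical, and the `sanitize` helper (mapping characters that are not valid in an OCI tag, such as `/`, to `-`) is an assumption rather than something taken from the workflow.

```shell
#!/usr/bin/env bash
# sanitize VALUE: map anything outside [A-Za-z0-9_.-] to '-'.
# Assumption: the real workflow may normalise refs differently.
sanitize() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9_.-' '-'
}

# uvm_tag EVENT_NAME GIT_REF KATA_REF GC_REF (hypothetical helper)
uvm_tag() {
  local event="$1" ref="$2" kata_ref="$3" gc_ref="$4"
  case "$event:$ref" in
    workflow_dispatch:*)
      # Manual runs encode both pins in the tag.
      printf 'kata-%s-gc-%s\n' "$(sanitize "$kata_ref")" "$(sanitize "$gc_ref")" ;;
    push:refs/tags/kata-uvm-v*)
      # Tag pushes publish under the literal git tag.
      printf '%s\n' "${ref#refs/tags/}" ;;
    push:refs/heads/*)
      # Branch pushes move the mutable tag.
      printf 'cohere-latest\n' ;;
    *)
      echo "unsupported trigger: $event $ref" >&2
      return 1 ;;
  esac
}
```

For example, `uvm_tag workflow_dispatch refs/heads/cohere 3.30.0 cohere` would print `kata-3.30.0-gc-cohere`.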

Companion host-side script

fortress/scratch/oci-b200/k8s/08-install-uvm.sh (already pushed) consumes this artifact: it pulls, verifies sha256 against measurements.json, and rewrites kernel_verity_params in the kata config from the manifest. No host veritysetup needed. This replaces 06-patch-uvm.sh for production.

Why this is strictly better than the patch path

| | `06-patch-uvm.sh` (patch) | `build-kata-uvm-cohere` + `08-install-uvm.sh` (build) |
| --- | --- | --- |
| Where binaries come from | Built locally, copied in via `losetup` + bind mount | Built from source by the same kata machinery as upstream, baked into the rootfs |
| dm-verity correctness | Re-run `veritysetup format` on the host, write the new hash chain in place, then update kata config | Verity computed once at build time, surfaced as annotations; host just pastes the values |
| Reproducibility | Depends on what's in `/tmp/attestation-agent` on the host | Pinned to `kata_ref` + `gc_ref` (resolved to SHA at build time) |
| Provenance | None | SLSA attestation + GHCR-signed |
| Host failure modes | "ContainerCreating context deadline exceeded" if config gets out of sync; needs `05-fix-uvm-verity.sh` recovery | sha256 mismatch fails the install before touching anything live |

Pre-merge cleanup required

The on.push.branches block temporarily includes alhassankhedr/build-kata-uvm-cohere so the workflow can auto-trigger on this PR's push for end-to-end validation. Please remove that branch entry before merging — only cohere should remain in the final form.

Test plan

actionlint clean
YAML parses
Workflow run on this PR completes successfully and produces a valid OCI artifact at ghcr.io/cohere-ai/cloud-api-adaptor/kata-uvm-nvidia-gpu-confidential:<tag>
08-install-uvm.sh on a B200 host pulls the artifact, swaps the UVM image, and a Kata TDX pod boots
07-test-ita-attestation.sh against that pod returns nvgpu_overall: true

Note

Medium Risk
Introduces a new release workflow that builds and publishes a bootable Kata UVM + paired kernel to GHCR; failures or misconfiguration could block releases or publish incompatible images, but it does not change runtime application code.

Overview
Adds a new GitHub Actions workflow (.github/workflows/build-kata-uvm-cohere.yaml) to build the Kata nvidia-gpu-confidential UVM image from source with Cohere’s guest-components baked in, instead of patching a stock image.

The workflow checks out kata-containers, rewrites versions.yaml to point at a pinned guest-components ref (optionally overriding the NVIDIA driver and NVAT SDK pins), builds rootfs-image-nvidia-gpu-confidential-tarball, and extracts the .image + dm-verity root_hash.

It then stages the paired kernel artifacts, generates a measurements.json (verity params + image/kernel hashes), compresses the image, pushes everything to GHCR as an OCI artifact via oras with annotations, and publishes SLSA provenance attestation.

^{Reviewed by Cursor Bugbot for commit a55b3ac.}

Add a workflow that builds the kata-containers nvidia-gpu-confidential UVM image with our cohere-fork guest-components (attestation-agent + api-server-rest) baked in *at compile time*, instead of post-hoc patching the stock NVIDIA image with `losetup` + `veritysetup format` (which is what fortress/scratch/oci-b200/k8s/06-patch-uvm.sh has been doing).

Mechanics:

1. Check out kata-containers @ inputs.kata_ref (default 3.30.0).
2. Rewrite versions.yaml: point externals.coco-guest-components.url and .version at cohere-ai/guest-components @ <gc_ref> (resolved to a SHA via git ls-remote so the build is reproducible).
3. `make rootfs-image-nvidia-gpu-confidential-tarball`: kata's existing build infrastructure clones our fork into the coco-guest-components builder container, statically builds AA + api-server-rest + CDH, and nvidia_rootfs.sh::coco_guest_components() copies them into the rootfs at /usr/local/bin/. From there the standard rootfs assembly + dm-verity formatting runs unchanged.
4. Extract the .image and root_hash file from the tarball, surface dm-verity params (root_hash, salt, data_blocks, block sizes) and the image sha256 as a measurements.json layer.
5. zstd -19 the .image, push to GHCR via oras as a 3-layer artifact with annotations covering build provenance + verity params.
6. SLSA build provenance attestation.

Output: ghcr.io/cohere-ai/cloud-api-adaptor/kata-uvm-nvidia-gpu-confidential:<tag> where <tag> is `cohere-latest` for branch pushes, `kata-${kata_ref}-gc-${gc_ref}` for workflow_dispatch, or the literal tag for `kata-uvm-v*` tag pushes.

Companion host-side install script lives at fortress/scratch/oci-b200/k8s/08-install-uvm.sh: it pulls this artifact, verifies sha256 against measurements.json, and rewrites kernel_verity_params in the kata config from the manifest. No host veritysetup needed.

NOTE: this commit also temporarily adds `alhassankhedr/build-kata-uvm-cohere` to `on.push.branches` so we can validate end-to-end on the PR branch before merge. That entry must be removed before this lands on cohere.

github-advanced-security

zizmor found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

cursor · 2026-05-14T18:27:00Z

+      # TEMPORARY: enable end-to-end validation of the workflow on the
+      # feature branch before merge. Remove this entry as part of the
+      # final review; only `cohere` should remain.
+      - "alhassankhedr/build-kata-uvm-cohere"


Temporary feature branch trigger left in workflow

Medium Severity

The branch alhassankhedr/build-kata-uvm-cohere is included in the on.push.branches trigger for end-to-end validation during the PR. The PR description explicitly states "Please remove that branch entry before merging — only cohere should remain." If merged as-is, every push to that feature branch will trigger a full ~3-hour UVM build and push an artifact to GHCR.

^{Reviewed by Cursor Bugbot for commit 2d80833.}

cursor · 2026-05-14T18:27:00Z

+          set -eux
+          git clone --depth 1 --branch "${{ needs.meta.outputs.kata_ref }}" \
+            "${{ inputs.kata_repo || 'https://github.com/kata-containers/kata-containers.git' }}" \
+            /tmp/kata


SHA input for kata_ref breaks shallow clone

Medium Severity

The kata_ref input is documented as accepting "tag, branch, or SHA", but git clone --depth 1 --branch only accepts branch and tag names — not commit SHAs. Providing a SHA causes git to error with "Remote branch not found in upstream origin", failing the entire build.
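One way to accept tags, branches, and full commit SHAs alike is to replace the clone with an explicit init and fetch. The following is a minimal sketch, not the workflow's actual step: `checkout_ref` is a hypothetical helper name, and fetching by SHA relies on the server permitting it (GitHub generally does for reachable commits).

```shell
#!/usr/bin/env bash
# checkout_ref REPO_URL REF DEST (hypothetical helper)
# Shallow-fetch REF, which may be a tag, a branch, or a full commit SHA,
# into DEST and leave the work tree on a detached HEAD.
checkout_ref() {
  local repo="$1" ref="$2" dest="$3"
  git init -q "$dest"
  git -C "$dest" remote add origin "$repo"
  # `git fetch` takes a full SHA where `git clone --branch` does not.
  git -C "$dest" fetch -q --depth 1 origin "$ref"
  git -C "$dest" checkout -q --detach FETCH_HEAD
}
```

A call such as `checkout_ref "$KATA_REPO" "$KATA_REF" /tmp/kata` would then behave the same for all three kinds of ref, assuming both variables are supplied through the step's environment.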

Additional Locations (1)

.github/workflows/build-kata-uvm-cohere.yaml#L43-L44

^{Reviewed by Cursor Bugbot for commit 2d80833.}

cursor · 2026-05-14T18:27:00Z

+          jq -n \
+            --arg kata_ref       "${{ needs.meta.outputs.kata_ref }}" \
+            --arg gc_repo        "${{ needs.meta.outputs.gc_repo }}" \
+            --arg gc_ref         "${{ needs.meta.outputs.gc_ref }}" \


Resolved guest-components SHA missing from provenance metadata

Medium Severity

The gc_ref is resolved to an immutable SHA via git ls-remote in the "Override coco-guest-components" step for build reproducibility, but this resolved SHA is never written to $GITHUB_OUTPUT. Both measurements.json and the OCI annotations record the original mutable ref (e.g., cohere) instead of the pinned SHA, undermining the reproducibility goal stated in the code comments.
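A possible shape for the fix is to resolve the ref once and write the result to `$GITHUB_OUTPUT`, so later steps can record it. This is a sketch under assumptions: the output name `gc_sha` and both function names are hypothetical, and the parser only expects the usual `<sha><TAB><ref>` lines that `git ls-remote` prints.

```shell
#!/usr/bin/env bash
# first_sha: read `git ls-remote` output on stdin and print the SHA from
# the first line, insisting on exactly 40 lowercase hex characters.
first_sha() {
  local sha
  sha="$(awk 'NR == 1 { print $1 }')"
  [[ "$sha" =~ ^[0-9a-f]{40}$ ]] || { echo "could not resolve ref" >&2; return 1; }
  printf '%s\n' "$sha"
}

# record_gc_sha REPO REF: resolve REF and expose it to later steps as the
# (hypothetical) step output `gc_sha`.
record_gc_sha() {
  local sha
  sha="$(git ls-remote "$1" "$2" | first_sha)" || return 1
  printf 'gc_sha=%s\n' "$sha" >> "$GITHUB_OUTPUT"
}
```

measurements.json and the OCI annotations could then read the pinned value from that output instead of the mutable ref.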

Additional Locations (2)

.github/workflows/build-kata-uvm-cohere.yaml#L339-L340

.github/workflows/build-kata-uvm-cohere.yaml#L195-L201

^{Reviewed by Cursor Bugbot for commit 2d80833.}

cursor · 2026-05-14T18:28:02Z

+      # TEMPORARY: enable end-to-end validation of the workflow on the
+      # feature branch before merge. Remove this entry as part of the
+      # final review; only `cohere` should remain.
+      - "alhassankhedr/build-kata-uvm-cohere"


🔒 Agentic Security Review
Severity: HIGH

The workflow still triggers on a temporary feature branch and branch pushes can publish to the stable cohere-latest image tag. That expands artifact publish authority beyond the intended protected branch and makes it possible to overwrite a trusted mutable tag from non-release branch pushes.

Impact: if an attacker can push to this branch, they can publish a malicious UVM image under cohere-latest, creating a supply-chain compromise risk for downstream consumers.

Companion to fortress's k8s/ script reordering. The CI workflow's header comments and the GHCR step summary now point at the new numbering (05-install-uvm.sh) and reference the legacy patch path (08-patch-uvm.sh) by its new number too.

kata 3.30+ nvidia_chroot.sh runs with set -u and only assigns driver_version when NVIDIA_GPU_STACK contains a literal `driver=<ver>` component. Without it the rootfs-assembly stage dies at the very last step with `driver_version: unbound variable`, after the runner has already done ~45 minutes of work (agent, busybox, pause-image, coco-guest-components, kernel-nvidia-gpu). This is exactly how run 25877534335 failed. Fix: derive the driver pin from .assets.nvidia.driver.version in kata's own versions.yaml and prepend driver=<ver> to NVIDIA_GPU_STACK in the build step. Auto-tracks kata_ref.
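The prepend step described in this commit could look roughly like the sketch below. It assumes NVIDIA_GPU_STACK is a comma-separated list of components, which is an assumption here, and `with_driver_pin` is a hypothetical helper name; the version itself would be read from kata's versions.yaml by the build step.

```shell
#!/usr/bin/env bash
# with_driver_pin STACK VERSION (hypothetical helper)
# Prepend driver=<ver> to NVIDIA_GPU_STACK unless a driver= component is
# already present. Assumption: components are separated by commas.
with_driver_pin() {
  local stack="$1" ver="$2"
  [ -n "$ver" ] || { echo "empty driver version" >&2; return 1; }
  case ",$stack," in
    *,driver=*) printf '%s\n' "$stack" ;;              # already pinned
    ,,)         printf 'driver=%s\n' "$ver" ;;         # empty stack
    *)          printf 'driver=%s,%s\n' "$ver" "$stack" ;;
  esac
}
```

Failing on an empty version keeps the `unbound variable` class of error at the start of the job rather than after the long build stages.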

cursor · 2026-05-14T19:30:18Z

+            git make curl ca-certificates jq python3 python3-pip
+          # Ensure yq is present (kata's build scripts rely on it).
+          if ! command -v yq >/dev/null 2>&1; then
+            sudo curl -fsSL -o /usr/local/bin/yq \


🔒 Agentic Security Review
Severity: HIGH

This workflow downloads executable binaries (yq and oras) from GitHub release URLs and runs them without any integrity verification (checksum/signature/provenance check).

Impact: a compromised upstream release asset could execute arbitrary code in a job with packages: write and id-token: write, enabling malicious image publication or credential/token abuse.
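A common mitigation is to pin an expected digest for each downloaded binary and refuse to install on mismatch. The sketch below shows only the comparison; `verify_sha256` is a hypothetical helper, and the pinned digests would have to be recorded in the workflow by whoever vets each release.

```shell
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX (hypothetical helper)
# Succeeds only when FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  local file="$1" expected="$2" actual
  [ -n "$expected" ] || { echo "no expected digest given" >&2; return 1; }
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  if [ "$actual" != "$expected" ]; then
    echo "sha256 mismatch for $file: got '$actual', want '$expected'" >&2
    return 1
  fi
}
```

In a step, the binary would be downloaded to a temporary path, checked with `verify_sha256`, and only then moved into `/usr/local/bin` and made executable.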

cursor · 2026-05-14T19:35:19Z

+      - name: Checkout kata-containers @ ${{ needs.meta.outputs.kata_ref }}
+        run: |
+          set -eux
+          git clone --depth 1 --branch "${{ needs.meta.outputs.kata_ref }}" \


🔒 Agentic Security Review
Severity: HIGH

workflow_dispatch inputs are injected directly into a shell run script via GitHub expression interpolation at clone time (${{ needs.meta.outputs.kata_ref }} and ${{ inputs.kata_repo }}). Because expressions are rendered before shell parsing, crafted values can trigger command substitution and execute arbitrary commands in this privileged job.

Impact: a caller who can dispatch this workflow can run attacker-controlled commands and publish malicious artifacts with trusted GHCR + provenance permissions (packages: write, id-token: write).
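The usual mitigation is to map the expression into the step's `env:` block and reference it as a quoted shell variable, so the value is never rendered into the script text. A validation pass on top of that can reject odd values early. The sketch below is illustrative: `validate_ref` and the accepted character set are assumptions, not part of the workflow.

```shell
#!/usr/bin/env bash
# validate_ref VALUE (hypothetical helper)
# Accept only characters that plausibly appear in a tag, branch, or SHA,
# and reject a leading '-' so the value cannot be read as a git option.
validate_ref() {
  local value="$1"
  [[ "$value" =~ ^[A-Za-z0-9._/-]+$ ]] && [[ "$value" != -* ]]
}

# Intended use inside a step whose env: map sets KATA_REF from the
# workflow expression:
#   validate_ref "$KATA_REF" || { echo "invalid kata_ref" >&2; exit 1; }
#   git clone --depth 1 --branch "$KATA_REF" "$KATA_REPO" /tmp/kata
```

The same pattern would apply to `kata_repo`, `gc_repo`, and `gc_ref`, with a URL-specific check for the repository inputs.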

Two bugs in the "Extract" / "Surface verity params" steps that together caused the workflow to abort with `jq: error ... Expected JSON value (while parsing '')` (exit 5) and would also have produced a junk artifact even if jq had not failed:

1. root_hash.txt is a SINGLE comma-separated line written by kata's osbuilder, not five newline-separated key=value lines. The previous `awk -F'=' '/^salt=/ {print $2}'` parsers therefore returned empty strings for everything except root_hash (and even that came out with a trailing ",salt"), which crashed jq's `tonumber` on data_blocks. Replace with a single comma-split + case dispatch, plus regex sanity checks so a future format change fails loudly.
2. The .img inside the tarball is a symlink to the versioned .image alongside it. The previous `mv` only relocated the symlink, then `rm -rf opt/` deleted the underlying file. Resolve via `readlink -f` and `cp` the real file before tearing the directory down. Add a minimum-size assertion (>100 MiB) so a dangling symlink is caught immediately rather than producing measurements.json with bytes=57.

Also tightens the shell with `set -euxo pipefail` and a `jq -e .` validation of the produced measurements.json.
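The comma-split plus case dispatch described in this commit might look like the following sketch. It is not the workflow's code: the function name is hypothetical, the five key names are assumed from the parameters listed in the PR description, and the 64-hex-character root hash check assumes a SHA-256 verity hash.

```shell
#!/usr/bin/env bash
# parse_verity_line LINE (hypothetical helper)
# Split one comma-separated line of key=value pairs and set ROOT_HASH,
# SALT, DATA_BLOCKS, DATA_BLOCK_SIZE and HASH_BLOCK_SIZE.
parse_verity_line() {
  local line="$1" kv n
  local -a fields
  ROOT_HASH="" SALT="" DATA_BLOCKS="" DATA_BLOCK_SIZE="" HASH_BLOCK_SIZE=""
  IFS=',' read -r -a fields <<< "$line"
  for kv in "${fields[@]}"; do
    case "$kv" in
      root_hash=*)       ROOT_HASH="${kv#*=}" ;;
      salt=*)            SALT="${kv#*=}" ;;
      data_blocks=*)     DATA_BLOCKS="${kv#*=}" ;;
      data_block_size=*) DATA_BLOCK_SIZE="${kv#*=}" ;;
      hash_block_size=*) HASH_BLOCK_SIZE="${kv#*=}" ;;
      *) echo "unexpected field: $kv" >&2; return 1 ;;
    esac
  done
  # Sanity checks so a format change fails loudly instead of reaching jq.
  [[ "$ROOT_HASH" =~ ^[0-9a-f]{64}$ ]] || { echo "bad root_hash" >&2; return 1; }
  [[ "$SALT" =~ ^[0-9a-f]+$ ]] || { echo "bad salt" >&2; return 1; }
  for n in "$DATA_BLOCKS" "$DATA_BLOCK_SIZE" "$HASH_BLOCK_SIZE"; do
    [[ "$n" =~ ^[0-9]+$ ]] || { echo "bad numeric field" >&2; return 1; }
  done
}
```

Because every field is validated before use, a line that is missing a key or carries a non-numeric `data_blocks` stops the step with a clear message.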

…iver

Kata 3.30.0 pins driver=595.58.03 in versions.yaml, but on 8x B200 OCI hosts that driver hits a fabric-probe race where RmGpuFabricProbe times out and fail-stops GPU init. The fix landed in 595.71.05 (which is also the version present in the working mkosi-built images).

This adds an optional workflow_dispatch input `kata_nvidia_driver_ver`. When set (e.g. to 595.71.05), the build:

- Rewrites .externals.nvidia.driver.version in kata's versions.yaml before the rootfs build, so the pin flows through to both open-gpu-kernel-modules (cloned from the GitHub tag) and the nvidia-driver-pinning-<ver> apt package.
- Surfaces the override in the OCI tag (kata-...-drv-<ver>), the com.cohere.kata-uvm.nvidia-driver annotation, measurements.json's new nvidia_driver.version field, and the job summary.

When unset, the build behaves exactly as before. measurements.json always reflects the *actually baked-in* driver (read from the post-rewrite versions.yaml) rather than the requested input, so it stays truthful when the override is empty.

Mirrors the same mechanic in fortress/scratch/oci-b200/k8s/04-build-uvm-locally.sh.

cursor · 2026-05-15T03:23:37Z

+            --annotation "org.opencontainers.image.created=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
+            --annotation "com.cohere.caa.commit=${GITHUB_SHA}" \
+            --annotation "com.cohere.kata.ref=${{ needs.meta.outputs.kata_ref }}" \
+            --annotation "com.cohere.guest-components.repo=${{ needs.meta.outputs.gc_repo }}" \


🔒 Agentic Security Review
Severity: HIGH

workflow_dispatch inputs are interpolated directly into this shell run script via GitHub expressions. If a caller provides a value containing shell substitution syntax (for example $(...)) in gc_repo or gc_ref, it is rendered into the script before execution and can execute attacker-controlled commands.

Impact: a user able to dispatch this workflow can run arbitrary commands in a job with packages: write and id-token: write, enabling malicious image publication and provenance abuse.

The plain `cohere` branch of guest-components has a `count == 1` guard in `nvidia-attester::detect_platform()` that silently disables the attester on multi-GPU systems. Multi-GPU pods on 8x B200 boot fine but `/aa/additional_evidence` returns empty, which looks like a build issue but is actually the userspace attester refusing to register. Upstream main has a complete rewrite of nvidia-attester on top of the NVAT SDK (no `count == 1` check). PR #9 in cohere-ai/guest-components syncs that rewrite into our fork. Until PR #9 merges into `cohere`, default `gc_ref` to `alhassankhedr/sync-main-to-cohere` so kata UVM builds out of this workflow have a working multi-GPU attester. Switch back to `cohere` once PR #9 is merged.

Adds a `kata_nvat_ver` workflow_dispatch input (default 2026.03.02) that rewrites `.externals.nvidia.nvat.{version,url,desc}` in kata's versions.yaml before the rootfs build.

Why this matters: kata's tools/packaging/static-build/coco-guest-components/build.sh forwards NVAT_VERSION from versions.yaml to the GC builder Dockerfile. The Dockerfile gates the entire libnvat clone+cmake+install behind `if [ -n "${NVAT_VERSION}" ]`, and upstream kata 3.30.0 ships *without* that key set. Net effect on the cohere fork's UVM:

* libnvat is never built into the GC builder image.
* build-static-coco-guest-components.sh's second AA build pass (the one that compiles `attestation-agent` with `nvidia-attester` against /usr/local/lib/libnvat.so and installs the result as /usr/local/bin/attestation-agent-nv) silently no-ops because the required system lib is missing.
* The rootfs ends up with only the standard, non-NVIDIA AA. Symbol fingerprint of the installed UVM confirms it: zero `nvmlDeviceGetCount`, zero `nv_attestation_sdk`, zero `libnvat`.
* `/aa/additional_evidence` returns empty on multi-GPU pods regardless of which guest-components branch we baked. ITA appraisal can never see `nvgpu_overall: true`.

Pins 2026.03.02 to match the version the podvm-mkosi side already builds against (NVAT_TAG in cloud-api-adaptor's Dockerfile.podvm_binaries.ubuntu). Tag, measurements.json, and OCI annotations all surface the pin so the binding is inspectable from the registry (`-nvat-<ver>` tag suffix, `nvat_sdk.version` field, `com.cohere.kata-uvm.nvat-sdk` annotation).
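Since the failure mode here was a silent no-op, a post-build guard on an unpacked rootfs tree is one way to catch a recurrence. The sketch below is an assumption about how such a check could be written, not something the workflow is known to contain; `require_in_rootfs` is a hypothetical helper, and the path used in the usage line is the install location named in this commit message.

```shell
#!/usr/bin/env bash
# require_in_rootfs ROOTFS_DIR PATH... (hypothetical helper)
# Fail if any expected file is missing from an unpacked rootfs tree,
# reporting every missing path before returning.
require_in_rootfs() {
  local rootfs="$1" path missing=0
  shift
  for path in "$@"; do
    if [ ! -e "$rootfs/${path#/}" ]; then
      echo "missing from rootfs: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `require_in_rootfs "$ROOTFS_DIR" /usr/local/bin/attestation-agent-nv` would fail the job when the NVIDIA-enabled build pass did not run, assuming `ROOTFS_DIR` points at the extracted tree.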

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

^{Reviewed by Cursor Bugbot for commit d5e0166.}

cursor · 2026-05-15T04:33:01Z

+          GC_REPO:        ${{ inputs.gc_repo  || 'https://github.com/cohere-ai/guest-components.git' }}
+          GC_REF:         ${{ inputs.gc_ref   || 'alhassankhedr/sync-main-to-cohere' }}
+          DRIVER_VER:     ${{ inputs.kata_nvidia_driver_ver || '' }}
+          NVAT_VER:       ${{ inputs.kata_nvat_ver || '2026.03.02' }}


Empty kata_nvat_ver input silently overridden by fallback

Medium Severity

The input description for kata_nvat_ver says "Set "" to leave nvat unpinned" but the || operator on line 142 (${{ inputs.kata_nvat_ver || '2026.03.02' }}) treats empty strings as falsy, so an explicit "" gets replaced with '2026.03.02'. This makes it impossible to disable the NVAT SDK pin via workflow_dispatch, contradicting the documented behavior.
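One way to keep an explicit empty value meaningful is to stop using `||` as the default and instead branch on the trigger, applying the pin only when no inputs exist. The sketch below illustrates that decision in shell; it assumes the input declares its default in the `workflow_dispatch.inputs` block and that a cleared field arrives as an empty string. `effective_nvat_ver` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Pin applied on triggers that carry no inputs (value from this PR).
DEFAULT_NVAT_VER="2026.03.02"

# effective_nvat_ver EVENT_NAME INPUT_VALUE (hypothetical helper)
# On workflow_dispatch, pass the input through unchanged so "" really
# means "unpinned"; on any other trigger, fall back to the default pin.
effective_nvat_ver() {
  local event="$1" input="$2"
  if [ "$event" = "workflow_dispatch" ]; then
    printf '%s\n' "$input"
  else
    printf '%s\n' "$DEFAULT_NVAT_VER"
  fi
}
```

With this split, `effective_nvat_ver workflow_dispatch ""` yields an empty string, while a push-triggered run still receives `2026.03.02`.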

Additional Locations (1)

.github/workflows/build-kata-uvm-cohere.yaml#L96-L98

^{Reviewed by Cursor Bugbot for commit d5e0166.}

kata's kernel-nvidia-gpu build emits a fresh random certs/signing_key.pem per invocation; the NVIDIA modules baked into kata-static-kernel-nvidia-gpu-modules.tar.zst (and therefore into the rootfs) are signed against THAT key. If the host launches our UVM against a kernel from a different build (e.g. the kata-deploy-bundled one), every NVIDIA .ko is rejected at first modprobe with "Loading of unsigned module is rejected", NVRC panics in src/execute.rs:24:9, the guest powers down, and pods sit in Pending forever. Verified end-to-end on the B200 host on 2026-05-15 (README "Bug F").

The host-side fix lives in fortress's 05-install-uvm.sh, which atomically installs both the rootfs symlink and the kernel binary. For that to work, the OCI artifact has to ship the kernel. Mirror the local build pipeline (04-build-uvm-locally.sh) here:

* Force a clean kernel + modules + rootfs rebuild whenever kata_nvidia_driver_ver is overridden, so kata's make can't reuse a cached kernel-nvidia-gpu builddir whose embedded signing key doesn't match the new modules tarball.
* After "Build rootfs", stage the locally-built vmlinuz (+ vmlinux, System.map, config) into /tmp/uvm-out alongside the rootfs and write kernel.basename as a single source of truth for the install side.
* Add a defensive signing-key sanity check that extracts the SKID from kernel-nvidia-gpu/builddir/.../certs/signing_key.x509 and confirms it appears in the trailing PKCS#7 signature of nvidia.ko. Fails the build if the modules tarball is signed by a different key than the kernel embeds.
* Extend measurements.json with .kernel.{filename,sha256} so 05-install-uvm.sh can validate the kernel post-pull.
* Push the kernel files (vmlinuz/vmlinux/System.map/config and kernel.basename) into the OCI artifact with media type application/vnd.cohere.kata-uvm.kernel+octet-stream, and surface the kernel-basename + kernel-sha256 as OCI annotations.

After this, the UVM artifact is self-contained: pulling and installing it places a kernel and rootfs that share a signing key, so guest modprobe of nvidia.ko / nvidia-uvm.ko / nvidia-modeset.ko / nvidia-drm.ko / nvidia-peermem.ko succeeds and NVRC boots cleanly.

github-advanced-security AI found potential problems May 14, 2026

cursor Bot reviewed May 14, 2026

alhassankhedr-cohere added 2 commits May 14, 2026 14:50

alhassankhedr-cohere force-pushed the alhassankhedr/build-kata-uvm-cohere branch from 426dcd9 to fccd4d7 May 14, 2026 19:24

cursor Bot reviewed May 14, 2026

alhassankhedr-cohere added 2 commits May 14, 2026 23:08

cursor Bot reviewed May 15, 2026

alhassankhedr-cohere added 2 commits May 14, 2026 23:58

cursor Bot reviewed May 15, 2026

alhassankhedr-cohere wants to merge 8 commits into cohere from alhassankhedr/build-kata-uvm-cohere
