Commit 13239b6

feat(sandboxes): add nvidia gpu sandbox image (#72)
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
1 parent 36c558e commit 13239b6

6 files changed

Lines changed: 262 additions & 18 deletions

.github/workflows/build-sandboxes.yml

Lines changed: 42 additions & 18 deletions
@@ -176,33 +176,47 @@ jobs:
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}

+      - name: Determine parent sandbox
+        id: parent
+        run: |
+          set -euo pipefail
+          DOCKERFILE="sandboxes/${{ matrix.sandbox }}/Dockerfile"
+          DEFAULT_BASE=""
+          if grep -q '^ARG BASE_IMAGE=' "$DOCKERFILE"; then
+            DEFAULT_BASE=$(grep '^ARG BASE_IMAGE=' "$DOCKERFILE" | head -1 | cut -d= -f2-)
+          fi
+
+          PARENT=""
+          if [ -n "$DEFAULT_BASE" ]; then
+            PARENT=$(echo "$DEFAULT_BASE" | sed -n 's|.*/sandboxes/\([^:]*\).*|\1|p')
+            if [ -z "$PARENT" ]; then
+              PARENT="base"
+            fi
+          fi
+
+          echo "sandbox=$PARENT" >> "$GITHUB_OUTPUT"
+          if [ -n "$PARENT" ]; then
+            echo "Parent for ${{ matrix.sandbox }}: $PARENT"
+          else
+            echo "${{ matrix.sandbox }} is a standalone sandbox image"
+          fi
+
       # On PRs the base image is not in GHCR. Build it locally, push to the
-      # local registry, and override BASE_IMAGE to point there.
+      # local registry, and override BASE_IMAGE to point there for dependent
+      # sandbox images. Standalone images do not need this step.
       - name: Build base image locally (PR only)
-        if: github.ref != 'refs/heads/main'
+        if: github.ref != 'refs/heads/main' && steps.parent.outputs.sandbox != ''
         uses: docker/build-push-action@v6
         with:
           context: sandboxes/base
           push: true
           tags: localhost:5000/sandboxes/base:latest
           cache-from: type=gha,scope=base

-      - name: Determine parent sandbox
-        id: parent
-        run: |
-          set -euo pipefail
-          DEFAULT_BASE=$(grep '^ARG BASE_IMAGE=' "sandboxes/${{ matrix.sandbox }}/Dockerfile" | head -1 | cut -d= -f2-)
-          PARENT=$(echo "$DEFAULT_BASE" | sed -n 's|.*/sandboxes/\([^:]*\).*|\1|p')
-          if [ -z "$PARENT" ]; then
-            PARENT="base"
-          fi
-          echo "sandbox=$PARENT" >> "$GITHUB_OUTPUT"
-          echo "Parent for ${{ matrix.sandbox }}: $PARENT"
-
       # When a sandbox depends on another sandbox (not base), build that
       # intermediate parent locally so it is available to the buildx build.
       - name: Build parent sandbox locally (PR only)
-        if: github.ref != 'refs/heads/main' && steps.parent.outputs.sandbox != 'base'
+        if: github.ref != 'refs/heads/main' && steps.parent.outputs.sandbox != '' && steps.parent.outputs.sandbox != 'base'
         uses: docker/build-push-action@v6
         with:
           context: sandboxes/${{ steps.parent.outputs.sandbox }}
@@ -216,7 +230,9 @@ jobs:
         id: base
         run: |
           PARENT="${{ steps.parent.outputs.sandbox }}"
-          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
+          if [ -z "$PARENT" ]; then
+            echo "image=" >> "$GITHUB_OUTPUT"
+          elif [ "${{ github.ref }}" = "refs/heads/main" ]; then
             echo "image=${{ env.REGISTRY }}/${{ steps.repo.outputs.image_prefix }}/sandboxes/${PARENT}:latest" >> "$GITHUB_OUTPUT"
           else
             echo "image=localhost:5000/sandboxes/${PARENT}:latest" >> "$GITHUB_OUTPUT"
@@ -231,11 +247,20 @@ jobs:
             type=sha,prefix=
             type=raw,value=latest,enable={{is_default_branch}}

+      - name: Set build platforms
+        id: platforms
+        run: |
+          if [ "${{ github.ref }}" = "refs/heads/main" ] || [ "${{ matrix.sandbox }}" = "nvidia-gpu" ]; then
+            echo "value=linux/amd64,linux/arm64" >> "$GITHUB_OUTPUT"
+          else
+            echo "value=linux/amd64" >> "$GITHUB_OUTPUT"
+          fi
+
       - name: Build and push
         uses: docker/build-push-action@v6
         with:
           context: sandboxes/${{ matrix.sandbox }}
-          platforms: ${{ github.ref == 'refs/heads/main' && 'linux/amd64,linux/arm64' || 'linux/amd64' }}
+          platforms: ${{ steps.platforms.outputs.value }}
           push: ${{ github.ref == 'refs/heads/main' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
@@ -244,4 +269,3 @@ jobs:
           cache-from: type=gha,scope=${{ matrix.sandbox }}
           cache-to: type=gha,mode=max,scope=${{ matrix.sandbox }}

-

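The reworked "Determine parent sandbox" step now has three outcomes: a named parent, the `base` fallback, and an empty string for standalone images. The sketch below copies the step's grep/sed pipeline into a function so it can be tried outside CI. It is illustrative only: the function name `detect_parent` and the sample image references are assumptions, not part of the commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# detect_parent DOCKERFILE
# Hypothetical helper (not in the commit) that mirrors the workflow step:
#   - no "ARG BASE_IMAGE=" line          -> prints "" (standalone image)
#   - default contains /sandboxes/NAME   -> prints NAME
#   - any other default                  -> prints "base"
detect_parent() {
  local dockerfile="$1" default_base="" parent=""
  if grep -q '^ARG BASE_IMAGE=' "$dockerfile"; then
    default_base=$(grep '^ARG BASE_IMAGE=' "$dockerfile" | head -1 | cut -d= -f2-)
  fi
  if [ -n "$default_base" ]; then
    parent=$(echo "$default_base" | sed -n 's|.*/sandboxes/\([^:]*\).*|\1|p')
    if [ -z "$parent" ]; then
      parent="base"
    fi
  fi
  echo "$parent"
}

# Demo with throwaway Dockerfiles; the image references are made up.
demo_dir=$(mktemp -d)
printf 'ARG BASE_IMAGE=ghcr.io/example/sandboxes/ollama:latest\n' > "$demo_dir/child"
printf 'FROM nvidia/cuda:12.8.1-base-ubuntu22.04\n' > "$demo_dir/standalone"
echo "child parent:      '$(detect_parent "$demo_dir/child")'"
echo "standalone parent: '$(detect_parent "$demo_dir/standalone")'"
rm -rf "$demo_dir"
```

One behavior worth noting: a Dockerfile whose `BASE_IMAGE` default does not contain `/sandboxes/` still resolves to `base`, so only Dockerfiles without any `ARG BASE_IMAGE=` line are treated as standalone.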
README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ This repo is the community ecosystem around OpenShell -- a hub for contributed s
2626
| `sandboxes/ollama/` | Ollama for local and cloud LLMs with Claude Code, Codex, OpenCode pre-installed |
2727
| `sandboxes/sdg/` | Synthetic data generation workflows |
2828
| `sandboxes/openclaw/` | OpenClaw -- open agent manipulation and control |
29+
| `sandboxes/nvidia-gpu/` | GPU-enabled VM sandbox image with NVIDIA userspace tooling |
2930

3031
## Getting Started
3132

THIRD-PARTY-NOTICES

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,10 @@ Image: docker/dockerfile:1.4 (BuildKit frontend)
1919
License: Apache-2.0
2020
URL: https://github.com/moby/buildkit
2121

22+
Image: nvidia/cuda:12.8.1-base-ubuntu22.04
23+
License: NVIDIA CUDA Toolkit End User License Agreement and Ubuntu component licenses
24+
URL: https://hub.docker.com/r/nvidia/cuda
25+
2226
================================================================================
2327
System Packages (APT — Ubuntu 24.04)
2428
================================================================================

sandboxes/nvidia-gpu/Dockerfile

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# syntax=docker/dockerfile:1
2+
3+
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
4+
# SPDX-License-Identifier: Apache-2.0
5+
6+
# GPU-enabled sandbox image for OpenShell VM driver.
7+
#
8+
# Provides userspace GPU tooling (nvidia-smi, NVML, CUDA driver libs, kmod)
9+
# on top of a minimal Ubuntu base with the full NVIDIA driver userspace
10+
# installed via the official .run installer (no kernel modules -- those are
11+
# injected at rootfs preparation time by the VM driver).
12+
#
13+
# Usage:
14+
# openshell sandbox create --gpu --from ./sandboxes/nvidia-gpu/Dockerfile
15+
# openshell sandbox create --gpu --from nvidia-gpu # once published
16+
#
17+
# Build-time args:
18+
# CUDA_VERSION - CUDA toolkit version (default: 12.8.1)
19+
# UBUNTU_VERSION - Ubuntu release (default: 22.04)
20+
# NVIDIA_DRIVER_VERSION - Must match the kernel modules built by
21+
# `mise run vm:nvidia-modules` in the OpenShell
22+
# core repo (default: 580.159.03)
23+
# TARGETARCH - Set automatically by BuildKit (amd64 or arm64)
24+
25+
ARG CUDA_VERSION=12.8.1
26+
ARG UBUNTU_VERSION=22.04
27+
28+
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${UBUNTU_VERSION}
29+
30+
ARG CUDA_VERSION
31+
ARG UBUNTU_VERSION
32+
# Must match NVIDIA_DRIVER_VERSION in sandboxes/nvidia-gpu/versions.env
33+
# and NVIDIA_OPEN_VERSION in the OpenShell core VM module build.
34+
ARG NVIDIA_DRIVER_VERSION=580.159.03
35+
ARG TARGETARCH
36+
37+
# ── System packages required by the sandbox init script ──────────────
38+
RUN apt-get update && apt-get install -y --no-install-recommends \
39+
bash \
40+
busybox-static \
41+
ca-certificates \
42+
curl \
43+
iproute2 \
44+
iptables \
45+
kmod \
46+
pciutils \
47+
&& rm -rf /var/lib/apt/lists/*
48+
49+
RUN mkdir -p /usr/share/udhcpc && ln -sf /bin/busybox /sbin/udhcpc
50+
51+
# ── NVIDIA driver userspace ──────────────────────────────────────────
52+
# The nvidia/cuda base image does NOT include the driver (nvidia-smi,
53+
# libcuda.so, libnvidia-ml.so). It relies on the NVIDIA Container
54+
# Runtime to mount them from the host. In a VM there is no container
55+
# runtime, so we install the driver userspace via the .run installer
56+
# with --no-kernel-module (kernel modules are injected separately).
57+
# TODO(gpu): Pin SHA-256 checksum for reproducible builds. Compute with:
58+
# curl -fsSL <url> | sha256sum
59+
RUN set -eux; \
60+
case "${TARGETARCH}" in \
61+
amd64) nvidia_arch="x86_64" ;; \
62+
arm64) nvidia_arch="aarch64" ;; \
63+
*) echo "unsupported TARGETARCH=${TARGETARCH}" >&2; exit 1 ;; \
64+
esac; \
65+
curl -fsSL \
66+
"https://download.nvidia.com/XFree86/Linux-${nvidia_arch}/${NVIDIA_DRIVER_VERSION}/NVIDIA-Linux-${nvidia_arch}-${NVIDIA_DRIVER_VERSION}.run" \
67+
-o /tmp/nvidia.run \
68+
&& chmod +x /tmp/nvidia.run \
69+
&& /tmp/nvidia.run \
70+
--silent \
71+
--no-kernel-module \
72+
--no-drm \
73+
--no-x-check \
74+
--no-systemd \
75+
--no-nvidia-modprobe \
76+
--no-distro-scripts \
77+
&& rm -f /tmp/nvidia.run
78+
79+
# Ensure library paths are indexed for dlopen.
80+
RUN set -eux; \
81+
case "${TARGETARCH}" in \
82+
amd64) deb_arch="x86_64-linux-gnu" ;; \
83+
arm64) deb_arch="aarch64-linux-gnu" ;; \
84+
*) echo "unsupported TARGETARCH=${TARGETARCH}" >&2; exit 1 ;; \
85+
esac; \
86+
mkdir -p /etc/ld.so.conf.d; \
87+
printf "/usr/local/cuda/lib64\n/usr/lib/%s\n" "${deb_arch}" > /etc/ld.so.conf.d/cuda.conf; \
88+
ldconfig 2>/dev/null || true
89+
90+
# ── Kernel modules ───────────────────────────────────────────────────
91+
# NVIDIA kernel modules (.ko) must match the guest VM kernel (libkrunfw).
92+
# They are NOT in this image -- the VM driver injects them at rootfs
93+
# preparation time via `inject_gpu_modules`.
94+
#
95+
# GSP firmware (.bin) IS provided by the .run installer above. The VM
96+
# driver detects its presence and skips firmware injection, avoiding
97+
# version mismatches when the host driver differs from this image's.
98+
RUN mkdir -p /lib/modules
99+
100+
LABEL org.opencontainers.image.title="OpenShell GPU Sandbox" \
101+
org.opencontainers.image.description="GPU-enabled sandbox for OpenShell VM driver with CUDA support" \
102+
org.opencontainers.image.version="${NVIDIA_DRIVER_VERSION}" \
103+
io.openshell.sandbox.cuda-version="${CUDA_VERSION}" \
104+
io.openshell.sandbox.ubuntu-version="${UBUNTU_VERSION}" \
105+
io.openshell.sandbox.nvidia-driver-version="${NVIDIA_DRIVER_VERSION}" \
106+
io.openshell.sandbox.gpu="true"
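
The Dockerfile leaves checksum pinning as a TODO. The sketch below shows one way the download step could be guarded, assuming `sha256sum` from GNU coreutils is available. The arch mapping and URL pattern are copied from the Dockerfile; the function names and the idea of passing the pinned digest as an argument are assumptions, and no real installer digest is given here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# nvidia_arch TARGETARCH: map BuildKit's arch name to NVIDIA's, as the Dockerfile does.
nvidia_arch() {
  case "$1" in
    amd64) echo "x86_64" ;;
    arm64) echo "aarch64" ;;
    *) echo "unsupported TARGETARCH=$1" >&2; return 1 ;;
  esac
}

# nvidia_run_url TARGETARCH VERSION: build the installer URL used in the Dockerfile.
nvidia_run_url() {
  local arch
  arch=$(nvidia_arch "$1") || return 1
  echo "https://download.nvidia.com/XFree86/Linux-${arch}/$2/NVIDIA-Linux-${arch}-$2.run"
}

# verify_sha256 FILE EXPECTED_HEX: succeed only if FILE hashes to EXPECTED_HEX.
# Hypothetical helper; the commit itself does not verify the download.
verify_sha256() {
  echo "$2  $1" | sha256sum -c - >/dev/null 2>&1
}

nvidia_run_url amd64 580.159.03
nvidia_run_url arm64 580.159.03

# Demonstrate the check on a local stand-in file rather than the real installer.
sample=$(mktemp)
printf 'not the real installer\n' > "$sample"
pinned=$(sha256sum "$sample" | cut -d' ' -f1)
if verify_sha256 "$sample" "$pinned"; then
  echo "checksum ok"
fi
rm -f "$sample"
```

In a Dockerfile the same check would sit between the `curl` and `chmod` commands. Because the amd64 and arm64 installers are different files, each architecture would need its own pinned digest.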

sandboxes/nvidia-gpu/README.md

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
2+
<!-- SPDX-License-Identifier: Apache-2.0 -->
3+
4+
# GPU Sandbox Image
5+
6+
GPU-enabled sandbox image for the OpenShell VM driver. Provides NVIDIA
7+
userspace tooling (nvidia-smi, NVML, CUDA driver libraries) on top of a
8+
minimal Ubuntu base. Kernel modules are injected separately by the VM
9+
driver at sandbox creation time. The image publishes for `linux/amd64` and
10+
`linux/arm64`.
11+
12+
## Architecture
13+
14+
The GPU sandbox splits responsibility between the container image and the
15+
VM driver:
16+
17+
| Layer | Source | Contents |
18+
|-------|--------|----------|
19+
| **Userspace** | This Dockerfile | nvidia-smi, libcuda.so, libnvidia-ml.so, kmod, iproute2 |
20+
| **Kernel modules** | OpenShell VM driver injection | nvidia.ko, nvidia_uvm.ko, nvidia_modeset.ko (built for guest kernel 6.12.76) |
21+
| **GSP firmware** | `.run` installer in image OR host fallback | gsp_ga10x.bin, gsp_tu10x.bin |
22+
23+
The kernel modules must be compiled against the exact guest kernel version
24+
used by libkrunfw. The VM driver injects them into each sandbox's rootfs
25+
at creation time via `inject_gpu_modules()`.
26+
27+
## Prerequisites
28+
29+
- Linux host with an NVIDIA GPU
30+
- IOMMU enabled (for VFIO GPU passthrough)
31+
- Docker (for building the sandbox image)
32+
- OpenShell core checkout for VM runtime/module tasks
33+
- Guest kernel built with `CONFIG_MODULES=y` in the OpenShell core checkout (`mise run vm:setup`)
34+
35+
## Quick Start
36+
37+
```shell
38+
# 1. In the OpenShell core repo: build the VM runtime
39+
mise run vm:setup
40+
41+
# 2. In the OpenShell core repo: build NVIDIA modules for the guest kernel
42+
mise run vm:nvidia-modules
43+
44+
# 3. Start the gateway with GPU support
45+
sudo mise run gateway:vm -- --gpu
46+
47+
# 4. Create a GPU sandbox from the published community image
48+
openshell sandbox create --gpu --from nvidia-gpu
49+
```
50+
51+
## Version Coupling
52+
53+
The NVIDIA driver version must match across the image and the VM guest
54+
kernel modules:
55+
56+
| Component | Variable | Default |
57+
|-----------|----------|---------|
58+
| Dockerfile userspace | `NVIDIA_DRIVER_VERSION` | `580.159.03` |
59+
| Image version reference | `sandboxes/nvidia-gpu/versions.env` | `580.159.03` |
60+
| OpenShell core module build | `NVIDIA_OPEN_VERSION` | `580.159.03` |
61+
62+
A mismatch causes `modprobe` "version magic" errors or nvidia-smi ABI
63+
failures at sandbox boot time.
64+
65+
## Customization
66+
67+
### Changing the CUDA version
68+
69+
```shell
70+
docker build \
71+
--platform linux/amd64 \
72+
--build-arg CUDA_VERSION=12.6.0 \
73+
--build-arg UBUNTU_VERSION=22.04 \
74+
-t my-gpu-sandbox:latest \
75+
./sandboxes/nvidia-gpu/
76+
```
77+
78+
To build an arm64 variant locally, use `--platform linux/arm64`. Published
79+
images include both `linux/amd64` and `linux/arm64` manifests.
80+
81+
### Changing the NVIDIA driver version
82+
83+
Update the image version reference and rebuild matching VM guest modules in
84+
the OpenShell core repo:
85+
86+
1. `sandboxes/nvidia-gpu/versions.env`
87+
2. `sandboxes/nvidia-gpu/Dockerfile` ARG `NVIDIA_DRIVER_VERSION`
88+
3. In the OpenShell core repo, rebuild kernel modules:
89+
`NVIDIA_OPEN_VERSION=<version> mise run vm:nvidia-modules`
90+
91+
### Adding packages
92+
93+
Add packages to the `apt-get install` line in the Dockerfile. The image
94+
must retain `bash`, `kmod`, `iproute2`, and `busybox-static` — the VM
95+
driver validates these at rootfs preparation time.
96+
97+
## Troubleshooting
98+
99+
| Symptom | Cause | Fix |
100+
|---------|-------|-----|
101+
| "No GPU kernel modules found" | Modules not built | `mise run vm:nvidia-modules` |
102+
| "kmod not found in rootfs" | Image missing kmod package | Add `kmod` to Dockerfile `apt-get install` |
103+
| `modprobe nvidia` fails | Kernel version mismatch | Rebuild modules after `mise run vm:setup` |
104+
| nvidia-smi "driver/library mismatch" | Userspace/module version mismatch | Ensure Dockerfile and module versions match |
105+
| "kernel version mismatch: expected X, got Y" | Guest kernel was rebuilt | Rebuild modules: `mise run vm:nvidia-modules` |

sandboxes/nvidia-gpu/versions.env

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
# NVIDIA driver userspace version for the GPU sandbox image.
2+
# Must match the NVIDIA_OPEN_VERSION used by OpenShell core's
3+
# `mise run vm:nvidia-modules` workflow.
4+
NVIDIA_DRIVER_VERSION=580.159.03
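
The commit keeps the driver version in two places in this repo (the Dockerfile `ARG` default and `versions.env`) and relies on comments to keep them aligned. The sketch below compares the two values. It is not part of the commit: the function names are hypothetical, and it cannot check `NVIDIA_OPEN_VERSION`, which lives in the OpenShell core repo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helpers, not part of the commit.

# Print the default of "ARG NVIDIA_DRIVER_VERSION=..." from a Dockerfile.
dockerfile_driver_version() {
  sed -n 's/^ARG NVIDIA_DRIVER_VERSION=\(.*\)$/\1/p' "$1" | head -1
}

# Print the value of "NVIDIA_DRIVER_VERSION=..." from versions.env.
env_driver_version() {
  sed -n 's/^NVIDIA_DRIVER_VERSION=\(.*\)$/\1/p' "$1" | head -1
}

# check_driver_versions DOCKERFILE VERSIONS_ENV
# Returns 0 when both files agree, 1 on mismatch, 2 when a value is missing.
check_driver_versions() {
  local from_dockerfile from_env
  from_dockerfile=$(dockerfile_driver_version "$1")
  from_env=$(env_driver_version "$2")
  if [ -z "$from_dockerfile" ] || [ -z "$from_env" ]; then
    echo "driver version not found in $1 or $2" >&2
    return 2
  fi
  if [ "$from_dockerfile" != "$from_env" ]; then
    echo "mismatch: Dockerfile=$from_dockerfile versions.env=$from_env" >&2
    return 1
  fi
  echo "driver version $from_env is consistent"
}

# Run against the repo files when invoked from the repository root.
if [ -f sandboxes/nvidia-gpu/Dockerfile ] && [ -f sandboxes/nvidia-gpu/versions.env ]; then
  check_driver_versions sandboxes/nvidia-gpu/Dockerfile sandboxes/nvidia-gpu/versions.env
fi
```

A check like this could run as a CI step before the image build, so that a bump in one file without the other fails early instead of surfacing as a `modprobe` error at sandbox boot.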
