Commit 707e362

refactor(build): unify image build graph for cache reuse
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
1 parent 48cb689 commit 707e362

12 files changed: +482 -809 lines changed

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
 | Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted |
 | gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
-| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` stage. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
+| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
 | `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |

 ## Full Diagnostic Dump
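
The push-mode check in the troubleshooting row above (the rendered `openshell-helmchart.yaml` should use the pushed refs, not `:latest`) could be approximated with a small script. This is a minimal sketch and not part of the commit: the function name `find_latest_refs`, the regular expression, and the sample manifest text are illustrative assumptions, and the real rendered manifest may be laid out differently.

```python
import re

# Matches an image-like reference whose tag is exactly "latest".
# The negative lookahead rejects longer tags such as "latest-gpu".
LATEST_REF = re.compile(r"[\w./-]+:latest(?![\w.-])")


def find_latest_refs(rendered: str) -> list[str]:
    """Return every reference in the rendered text that is tagged :latest."""
    return LATEST_REF.findall(rendered)


# Hypothetical rendered values; registry host and tags are made up.
SAMPLE = """\
server:
  image: example.invalid/openshell/server:dev-abc123
sandbox:
  image: example.invalid/openshell/sandbox:latest
"""

if __name__ == "__main__":
    for ref in find_latest_refs(SAMPLE):
        print(f"still pinned to latest: {ref}")
```

An empty result suggests no `:latest` refs remain in the text; it does not confirm that the refs present are the expected pushed ones.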

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files

 ## Cluster Infrastructure Changes

-- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `Dockerfile.cluster`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.
+- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `deploy/docker/Dockerfile.images`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.

 ## Documentation

architecture/build-containers.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ OpenShell produces two container images, both published for `linux/amd64` and `l

 The gateway runs the control plane API server. It is deployed as a StatefulSet inside the cluster container via a bundled Helm chart.

-- **Dockerfile**: `deploy/docker/Dockerfile.gateway`
+- **Docker target**: `gateway` in `deploy/docker/Dockerfile.images`
 - **Registry**: `ghcr.io/nvidia/openshell/gateway:latest`
 - **Pulled when**: Cluster startup (the Helm chart triggers the pull)
 - **Entrypoint**: `openshell-server --port 8080` (gRPC + HTTP, mTLS)
@@ -15,11 +15,11 @@ The gateway runs the control plane API server. It is deployed as a StatefulSet i

 The cluster image is a single-container Kubernetes distribution that bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary needed to bootstrap the control plane.

-- **Dockerfile**: `deploy/docker/Dockerfile.cluster`
+- **Docker target**: `cluster` in `deploy/docker/Dockerfile.images`
 - **Registry**: `ghcr.io/nvidia/openshell/cluster:latest`
 - **Pulled when**: `openshell gateway start`

-The supervisor binary (`openshell-sandbox`) is cross-compiled in a build stage and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.
+The supervisor binary (`openshell-sandbox`) is built by the shared `supervisor-builder` stage in `deploy/docker/Dockerfile.images` and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.

 ## Sandbox Images

@@ -42,7 +42,7 @@ The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes
 | Changed files | Rebuild triggered |
 |---|---|
 | Cargo manifests, proto definitions, cross-build script | Gateway + supervisor |
-| `crates/openshell-server/*`, `Dockerfile.gateway` | Gateway |
+| `crates/openshell-server/*`, `deploy/docker/Dockerfile.images` | Gateway |
 | `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor |
 | `deploy/helm/openshell/*` | Helm upgrade |
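
As a rough illustration of the table above, the path-to-rebuild mapping can be sketched in Python. This is not how `cluster-deploy-fast.sh` is implemented. The glob patterns for Cargo manifests and proto definitions, the target names, and the function `rebuild_targets` are assumptions. The cross-build script is omitted because its path is not given here.

```python
from fnmatch import fnmatchcase

# Each rule pairs glob patterns with the rebuilds they trigger.
# Crate and deploy paths come from the table; the Cargo and proto
# patterns are hypothetical stand-ins for "Cargo manifests" and
# "proto definitions".
RULES = [
    (("Cargo.toml", "Cargo.lock", "*/Cargo.toml", "proto/*"),
     {"gateway", "supervisor"}),
    (("crates/openshell-server/*", "deploy/docker/Dockerfile.images"),
     {"gateway"}),
    (("crates/openshell-sandbox/*", "crates/openshell-policy/*"),
     {"supervisor"}),
    (("deploy/helm/openshell/*",),
     {"helm-upgrade"}),
]


def rebuild_targets(changed_files):
    """Return the set of rebuilds implied by a list of changed paths."""
    targets = set()
    for path in changed_files:
        for patterns, rule_targets in RULES:
            # fnmatchcase's "*" also matches "/", so nested files match.
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                targets |= rule_targets
    return targets


if __name__ == "__main__":
    changed = ["deploy/docker/Dockerfile.images", "README.md"]
    print(sorted(rebuild_targets(changed)))
```

Under these assumed rules, a change to `deploy/docker/Dockerfile.images` alone maps only to the gateway rebuild, which mirrors the row this commit edits.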

architecture/gateway-single-node.md

Lines changed: 4 additions & 4 deletions
@@ -29,7 +29,7 @@ Out of scope:
 - `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd.
 - `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution.
 - `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming).
-- `deploy/docker/Dockerfile.cluster`: Container image definition (k3s base + Helm charts + manifests + entrypoint).
+- `deploy/docker/Dockerfile.images` (target `cluster`): Container image definition (k3s base + Helm charts + manifests + entrypoint).
 - `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection).
 - `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script.
 - Docker daemon(s):
@@ -226,7 +226,7 @@ After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway nam

 ## Container Image

-The gateway image is defined in `deploy/docker/Dockerfile.cluster`:
+The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`:

 ```
 Base: rancher/k3s:v1.35.2-k3s1
@@ -296,7 +296,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa

 - `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
 - When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
-- `deploy/docker/Dockerfile.cluster` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
+- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
 - `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
 - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
@@ -452,7 +452,7 @@ openshell/
 - `crates/openshell-cli/src/main.rs` -- CLI command definitions
 - `crates/openshell-cli/src/run.rs` -- CLI command implementations
 - `crates/openshell-cli/src/bootstrap.rs` -- auto-bootstrap from sandbox create
-- `deploy/docker/Dockerfile.cluster` -- container image definition
+- `deploy/docker/Dockerfile.images` -- shared image build definition (cluster target)
 - `deploy/docker/cluster-entrypoint.sh` -- container entrypoint script
 - `deploy/docker/cluster-healthcheck.sh` -- Docker HEALTHCHECK script
 - `deploy/kube/manifests/openshell-helmchart.yaml` -- OpenShell Helm chart manifest
