
Commit a912848

refactor(build): unify image build graph for cache reuse (#390)
1 parent e45d415 commit a912848

29 files changed (+1374, -931 lines changed)

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
 | Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted |
 | gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
-| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` stage. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
+| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
 | `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |

 ## Full Diagnostic Dump

.github/workflows/release-auto-tag.yml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ name: Release Auto-Tag
 on:
   workflow_dispatch: {}
   schedule:
-    - cron: "0 2 * * *" # 7 PM PDT
+    - cron: "0 14 * * *" # 7 AM PDT

 permissions:
   contents: write
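
GitHub Actions evaluates `schedule` cron expressions in UTC, so the comments on the old and new lines can be checked with a little arithmetic. The sketch below is illustrative only and not part of the commit: `cron_hour_in_pdt` is a hypothetical helper, and the fixed UTC-7 offset assumes daylight time is in effect (under PST, UTC-8, the same job fires one hour earlier on the wall clock).

```python
from datetime import datetime, timedelta, timezone

# PDT is modeled as a fixed UTC-7 offset. This is an assumption that only
# holds while daylight saving time is in effect.
PDT = timezone(timedelta(hours=-7), name="PDT")


def cron_hour_in_pdt(utc_hour: int) -> str:
    """Return the PDT wall-clock time for a cron job scheduled at utc_hour:00 UTC."""
    # The calendar date is arbitrary; only the hour conversion matters here.
    utc_time = datetime(2024, 7, 1, utc_hour, 0, tzinfo=timezone.utc)
    return utc_time.astimezone(PDT).strftime("%H:%M")


print(cron_hour_in_pdt(2))   # old schedule, "0 2 * * *"
print(cron_hour_in_pdt(14))  # new schedule, "0 14 * * *"
```

The old expression lands in the evening of the previous Pacific calendar day, while the new one lands in the morning of the same day.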

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files

 ## Cluster Infrastructure Changes

-- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `Dockerfile.cluster`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.
+- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `deploy/docker/Dockerfile.images`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.

 ## Documentation

CONTRIBUTING.md

Lines changed: 21 additions & 21 deletions
@@ -63,24 +63,24 @@ This project ships with [agent skills](#agent-skills-for-contributors) that can

 Skills live in `.agents/skills/`. Your agent's harness can discover and load them natively. Here is the full inventory:

-| Category | Skill | Purpose |
-|----------|-------|---------|
-| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
-| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues |
-| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
-| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
-| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
-| Contributing | `create-github-issue` | Create well-structured GitHub issues |
-| Contributing | `create-github-pr` | Create pull requests with proper conventions |
-| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
-| Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
-| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
-| Triage | `triage-issue` | Assess, classify, and route community-filed issues |
-| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
-| Platform | `tui-development` | Development guide for the ratatui-based terminal UI |
-| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes |
-| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files |
-| Reference | `sbom` | Generate SBOMs and resolve dependency licenses |
+| Category | Skill | Purpose |
+| --------------- | ------------------------- | --------------------------------------------------------------------------------------------------- |
+| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
+| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues |
+| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
+| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
+| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
+| Contributing | `create-github-issue` | Create well-structured GitHub issues |
+| Contributing | `create-github-pr` | Create pull requests with proper conventions |
+| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
+| Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
+| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
+| Triage | `triage-issue` | Assess, classify, and route community-filed issues |
+| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
+| Platform | `tui-development` | Development guide for the ratatui-based terminal UI |
+| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes |
+| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files |
+| Reference | `sbom` | Generate SBOMs and resolve dependency licenses |

 ### Workflow Chains

@@ -148,10 +148,10 @@ openshell sandbox create -- codex

 Two additional scripts in `scripts/bin/` provide gateway-aware wrappers for cluster debugging:

-| Script | What it does |
-|--------|-------------|
+| Script | What it does |
+| --------- | ------------------------------------------------------------------------------------ |
 | `kubectl` | Runs `kubectl` inside the active gateway's k3s container via `openshell doctor exec` |
-| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` |
+| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` |

 These work for both local and remote gateways (SSH is handled automatically). Examples:

Cargo.lock

Lines changed: 9 additions & 9 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 3 deletions
@@ -6,7 +6,7 @@ resolver = "2"
 members = ["crates/*"]

 [workspace.package]
-version = "0.1.0"
+version = "0.0.0"
 edition = "2024"
 rust-version = "1.88"
 license = "Apache-2.0"
@@ -124,8 +124,6 @@ ref_option = "allow" # Common pattern for optional references
 missing_fields_in_debug = "allow" # Manual Debug impls often intentionally omit fields

 [profile.release]
-lto = "thin"
-codegen-units = 1
 strip = true

 [profile.dev]

architecture/build-containers.md

Lines changed: 17 additions & 4 deletions
@@ -6,7 +6,7 @@ OpenShell produces two container images, both published for `linux/amd64` and `l

 The gateway runs the control plane API server. It is deployed as a StatefulSet inside the cluster container via a bundled Helm chart.

-- **Dockerfile**: `deploy/docker/Dockerfile.gateway`
+- **Docker target**: `gateway` in `deploy/docker/Dockerfile.images`
 - **Registry**: `ghcr.io/nvidia/openshell/gateway:latest`
 - **Pulled when**: Cluster startup (the Helm chart triggers the pull)
 - **Entrypoint**: `openshell-server --port 8080` (gRPC + HTTP, mTLS)
@@ -15,11 +15,11 @@ The gateway runs the control plane API server. It is deployed as a StatefulSet i

 The cluster image is a single-container Kubernetes distribution that bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary needed to bootstrap the control plane.

-- **Dockerfile**: `deploy/docker/Dockerfile.cluster`
+- **Docker target**: `cluster` in `deploy/docker/Dockerfile.images`
 - **Registry**: `ghcr.io/nvidia/openshell/cluster:latest`
 - **Pulled when**: `openshell gateway start`

-The supervisor binary (`openshell-sandbox`) is cross-compiled in a build stage and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.
+The supervisor binary (`openshell-sandbox`) is built by the shared `supervisor-builder` stage in `deploy/docker/Dockerfile.images` and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.

 ## Sandbox Images

@@ -42,7 +42,7 @@ The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes
 | Changed files | Rebuild triggered |
 |---|---|
 | Cargo manifests, proto definitions, cross-build script | Gateway + supervisor |
-| `crates/openshell-server/*`, `Dockerfile.gateway` | Gateway |
+| `crates/openshell-server/*`, `deploy/docker/Dockerfile.images` | Gateway |
 | `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor |
 | `deploy/helm/openshell/*` | Helm upgrade |

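The "Changed files / Rebuild triggered" table in this hunk amounts to a path-to-target routing rule. A minimal sketch of that idea follows. It is not the logic in `cluster-deploy-fast.sh`, which this diff does not show: `RULES` and `targets_for` are hypothetical names, the glob patterns for the shared row (`Cargo.toml`, `Cargo.lock`, `crates/*/Cargo.toml`, `proto/*`) are assumptions, and so is the `gateway` target name (only `supervisor`, `chart`, and `all` appear as `mise run cluster --` arguments in this diff).

```python
from fnmatch import fnmatch

# Hypothetical routing rules modeled on the table above. Each entry maps a
# tuple of glob patterns to the rebuild targets those paths would trigger.
RULES = [
    # Shared inputs (patterns are assumptions): rebuild gateway and supervisor.
    (("Cargo.toml", "Cargo.lock", "crates/*/Cargo.toml", "proto/*"),
     {"gateway", "supervisor"}),
    (("crates/openshell-server/*", "deploy/docker/Dockerfile.images"),
     {"gateway"}),
    (("crates/openshell-sandbox/*", "crates/openshell-policy/*"),
     {"supervisor"}),
    (("deploy/helm/openshell/*",), {"chart"}),
]


def targets_for(changed_files):
    """Return the set of rebuild targets implied by a list of changed paths."""
    targets = set()
    for path in changed_files:
        for patterns, rule_targets in RULES:
            # fnmatch's "*" also matches "/", so nested files are covered.
            if any(fnmatch(path, pattern) for pattern in patterns):
                targets |= rule_targets
    return targets


print(sorted(targets_for(["crates/openshell-server/src/main.rs"])))
print(sorted(targets_for(["deploy/helm/openshell/values.yaml", "README.md"])))
```

Returning a set keeps the result independent of file order and lets an unrelated change (such as `README.md` here) contribute nothing, matching the "unrelated" scenario the test harness in this file is said to check.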
@@ -58,3 +58,16 @@ mise run cluster -- supervisor # rebuild supervisor only
 mise run cluster -- chart # helm upgrade only
 mise run cluster -- all # rebuild everything
 ```
+
+To validate incremental routing and BuildKit cache reuse locally, run:
+
+```bash
+mise run cluster:test:fast-deploy-cache
+```
+
+The harness runs isolated scenarios in temporary git worktrees, keeps its own state and cache under `.cache/cluster-deploy-fast-test/`, and writes a Markdown summary with:
+
+- auto-detection checks for gateway-only, supervisor-only, shared, Helm-only, unrelated, and explicit-target changes
+- cold vs warm rebuild comparisons for gateway and supervisor code changes
+- container-ID invalidation coverage to verify gateway + Helm are retriggered when the cluster container changes
+

architecture/gateway-single-node.md

Lines changed: 4 additions & 4 deletions
@@ -29,7 +29,7 @@ Out of scope:
 - `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd.
 - `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution.
 - `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming).
-- `deploy/docker/Dockerfile.cluster`: Container image definition (k3s base + Helm charts + manifests + entrypoint).
+- `deploy/docker/Dockerfile.images` (target `cluster`): Container image definition (k3s base + Helm charts + manifests + entrypoint).
 - `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection).
 - `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script.
 - Docker daemon(s):
@@ -228,7 +228,7 @@ After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway nam

 ## Container Image

-The gateway image is defined in `deploy/docker/Dockerfile.cluster`:
+The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`:

 ```
 Base: rancher/k3s:v1.35.2-k3s1
@@ -298,7 +298,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa

 - `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
 - When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
-- `deploy/docker/Dockerfile.cluster` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
+- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
 - `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
 - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
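
The `DeviceRequests` bullet above can be made concrete with a small sketch of the JSON fragment a client would place under `HostConfig` when creating a container through the Docker Engine API. This is an illustration, not the code in `crates/openshell-bootstrap/src/docker.rs`; `gpu_host_config` is a hypothetical helper, and the fragment assumes the usual shape of a "give me all GPUs" request (`Count` of `-1` with the `gpu` capability).

```python
import json


def gpu_host_config(gpu_enabled: bool) -> dict:
    """Build a HostConfig fragment for a Docker Engine API container-create call."""
    host_config: dict = {}
    if gpu_enabled:
        # Count = -1 asks for every available device; the nested list is a
        # list of capability sets, here a single set containing "gpu".
        host_config["DeviceRequests"] = [
            {"Count": -1, "Capabilities": [["gpu"]]}
        ]
    return host_config


# Without the flag, no device request is attached at all.
print(json.dumps(gpu_host_config(False)))
print(json.dumps(gpu_host_config(True)))
```

Keeping the GPU request as an optional key mirrors the boolean deploy option described above: the non-GPU path produces exactly the same payload as before the feature existed.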
@@ -454,7 +454,7 @@ openshell/
 - `crates/openshell-cli/src/main.rs` -- CLI command definitions
 - `crates/openshell-cli/src/run.rs` -- CLI command implementations
 - `crates/openshell-cli/src/bootstrap.rs` -- auto-bootstrap from sandbox create
-- `deploy/docker/Dockerfile.cluster` -- container image definition
+- `deploy/docker/Dockerfile.images` -- shared image build definition (cluster target)
 - `deploy/docker/cluster-entrypoint.sh` -- container entrypoint script
 - `deploy/docker/cluster-healthcheck.sh` -- Docker HEALTHCHECK script
 - `deploy/kube/manifests/openshell-helmchart.yaml` -- OpenShell Helm chart manifest

crates/openshell-core/build.rs

Lines changed: 4 additions & 1 deletion
@@ -17,7 +17,10 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
     }

     // --- Protobuf compilation ---
-    // Use bundled protoc from protobuf-src
+    // Use bundled protoc from protobuf-src. The system protoc (from apt-get)
+    // does not bundle the well-known type includes (google/protobuf/struct.proto
+    // etc.), so we must use protobuf-src which ships both the binary and the
+    // include tree.
     // SAFETY: This is run at build time in a single-threaded build script context.
     // No other threads are reading environment variables concurrently.
     #[allow(unsafe_code)]
