
Commit 7a72e33

Merge remote-tracking branch 'origin/main' into feat/device-plugin-cdi-injection
2 parents 9555515 + 047de66 commit 7a72e33

51 files changed

Lines changed: 2949 additions & 236 deletions


.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -104,6 +104,7 @@ Look for:
 - k3s startup errors (certificate issues, port binding failures)
 - Manifest copy errors from `/opt/openshell/manifests/`
 - `iptables` or `cgroup` errors (privilege/capability issues)
+- `Warning: br_netfilter does not appear to be loaded` — this is advisory only; many kernels work without the explicit module. Only act on it if you also see DNS failures or pod-to-service connectivity problems (see Common Failure Patterns).
 
 ### Step 2: Check k3s Cluster Health
 
@@ -344,6 +345,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | Port conflict | Another service on the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `openshell gateway start` to pick a different host port |
 | gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
 | DNS failures inside container | Entrypoint DNS detection failed | `openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf` and `openshell doctor logs --lines 20` |
+| Pods can't reach kube-dns / ClusterIP services | `br_netfilter` not loaded; bridge traffic bypasses iptables DNAT rules | `sudo modprobe br_netfilter` on the host, then `echo br_netfilter \| sudo tee /etc/modules-load.d/br_netfilter.conf` to persist. Known to be required on Jetson Linux 5.15-tegra; other kernels (e.g. standard x86/aarch64 Linux) may have bridge netfilter built in and work without the module. The entrypoint logs a warning when `/proc/sys/net/bridge/bridge-nf-call-iptables` is absent but does not abort — only act on it if DNS or service connectivity is actually broken. |
 | Node DiskPressure / MemoryPressure / PIDPressure | Insufficient disk, memory, or PIDs on host | Free disk (`docker system prune -a --volumes`), increase memory, or expand host resources. Bootstrap auto-detects via `HEALTHCHECK_NODE_PRESSURE` marker |
 | Pods evicted with "The node had condition: [DiskPressure]" | Host disk full, kubelet evicting pods | Free disk space on host, then `openshell gateway destroy <name> && openshell gateway start` |
 | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
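The advisory check described in the new table row can be approximated in a few lines. The sketch below is illustrative, not the actual entrypoint code — the function name and the `proc_root` parameter are invented for testability:

```python
import os

def bridge_netfilter_status(proc_root="/proc"):
    """Classify bridge-netfilter state the way the advisory warning describes.

    Returns "absent" when the sysctl file is missing (br_netfilter not
    loaded — advisory only), otherwise "enabled"/"disabled" by its value.
    """
    path = os.path.join(proc_root, "sys/net/bridge/bridge-nf-call-iptables")
    try:
        with open(path) as f:
            return "enabled" if f.read().strip() == "1" else "disabled"
    except FileNotFoundError:
        return "absent"
```

An `absent` result alone is not actionable; combine it with an observed DNS or ClusterIP failure before reaching for `modprobe br_netfilter`.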

.github/workflows/docs-preview-pr.yml

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ jobs:
           find _build -name .buildinfo -exec rm {} \;
 
       - name: Deploy preview
+        if: github.event.pull_request.head.repo.full_name == github.repository
         uses: rossjrw/pr-preview-action@v1
         with:
           source-dir: ./_build/docs/
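The guard added here keeps the deploy step from running on PRs opened from forks, which don't receive the repository secrets the deploy needs. The condition reduces to a string comparison; a trivial Python sketch of the same predicate (names are illustrative):

```python
def should_deploy_preview(head_repo: str, base_repo: str) -> bool:
    """Mirror of the workflow's `if:` guard: deploy only when the PR's
    head branch lives in the same repository as the workflow itself."""
    return head_repo == base_repo
```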

.github/workflows/issue-triage.yml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ jobs:
           // The template placeholder starts with "Example:" — if that's still
           // there or the section is empty, the reporter didn't fill it in.
           const diagnosticMatch = body.match(
-            /## Agent Diagnostic\s*\n([\s\S]*?)(?=\n## |\n$)/
+            /### Agent Diagnostic\s*\n([\s\S]*?)(?=\n### |\n$)/
           );
 
           const hasSubstantiveDiagnostic = diagnosticMatch
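The updated pattern targets an H3 heading instead of H2. Its behavior can be checked with a Python translation of the same regex — the sample issue body below is made up for illustration:

```python
import re

# Same pattern as the workflow's updated regex, in Python `re` syntax.
DIAGNOSTIC_RE = re.compile(r"### Agent Diagnostic\s*\n([\s\S]*?)(?=\n### |\n$)")

body = (
    "### Agent Diagnostic\n"
    "Gateway pod crashloops with a cert error.\n"
    "### Steps to Reproduce\n"
    "1. openshell gateway start\n"
)
match = DIAGNOSTIC_RE.search(body)
section = match.group(1) if match else ""
# The workflow then treats a section still starting with "Example:" (or an
# empty one) as not substantive.
substantive = bool(section.strip()) and not section.lstrip().startswith("Example:")
```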

CONTRIBUTING.md

Lines changed: 5 additions & 0 deletions
@@ -186,9 +186,14 @@ These are the primary `mise` tasks for day-to-day development:
 | `tasks/` | `mise` task definitions and build scripts |
 | `deploy/` | Dockerfiles, Helm chart, Kubernetes manifests |
 | `architecture/` | Architecture docs and plans |
+| `rfc/` | Request for Comments proposals |
 | `docs/` | User-facing documentation (Sphinx/MyST) |
 | `.agents/` | Agent skills and persona definitions |
 
+## RFCs
+
+For cross-cutting architectural decisions, API contract changes, or process proposals that need broad consensus, use the RFC process. RFCs live in `rfc/` — copy the template, fill it in, and open a PR for discussion. See [rfc/README.md](rfc/README.md) for the full lifecycle and guidelines on when to write an RFC versus a spike issue or architecture doc.
+
 ## Documentation
 
 If your change affects user-facing behavior (new flags, changed defaults, new features, bug fixes that contradict existing docs), update the relevant pages under `docs/` in the same PR.

README.md

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ OpenShell can pass host GPUs into sandboxes for local inference, fine-tuning, or
 openshell sandbox create --gpu --from [gpu-enabled-sandbox] -- claude
 ```
 
-The CLI auto-bootstraps a GPU-enabled gateway on first use. GPU intent is also inferred automatically for community images with `gpu` in the name.
+The CLI auto-bootstraps a GPU-enabled gateway on first use, auto-selecting CDI when available and otherwise falling back to Docker's NVIDIA GPU request path (`--gpus all`). GPU intent is also inferred automatically for community images with `gpu` in the name.
 
 **Requirements:** NVIDIA drivers and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) must be installed on the host. The sandbox image itself must include the appropriate GPU drivers and libraries for your workload — the default `base` image does not. See the [BYOC example](https://github.com/NVIDIA/OpenShell/tree/main/examples/bring-your-own-container) for building a custom sandbox image with GPU support.

architecture/gateway-single-node.md

Lines changed: 12 additions & 6 deletions
@@ -296,8 +296,10 @@ When environment variables are set, the entrypoint modifies the HelmChart manife
 
 GPU support is part of the single-node gateway bootstrap path rather than a separate architecture.
 
-- `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
-- When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
+- `openshell gateway start --gpu` threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
+- When enabled, the cluster container is created with Docker `DeviceRequests`. The injection mechanism is selected based on whether CDI is enabled on the daemon (`SystemInfo.CDISpecDirs` via `GET /info`):
+  - **CDI enabled** (daemon reports non-empty `CDISpecDirs`): CDI device injection — `driver="cdi"` with `nvidia.com/gpu=all`. Specs are expected to be pre-generated on the host (e.g. automatically by `nvidia-cdi-refresh.service` or manually via `nvidia-ctk generate`).
+  - **CDI not enabled**: `--gpus all` device request — `driver="nvidia"`, `count=-1`, which relies on the NVIDIA Container Runtime hook.
 - `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
 - `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels. The chart is configured with `deviceListStrategy: cdi-cri` so the device plugin injects devices via direct CDI device requests in the CRI.
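The CDI-or-fallback selection above amounts to inspecting the daemon's `CDISpecDirs` and building one of two `DeviceRequests` payloads. A minimal sketch — field names follow Docker's DeviceRequest JSON, but the helper itself is illustrative, not the actual bootstrap code:

```python
def gpu_device_request(cdi_spec_dirs):
    """Pick the GPU injection mechanism from `SystemInfo.CDISpecDirs`.

    A non-empty list means the daemon has CDI enabled; otherwise fall back
    to the `--gpus all` equivalent handled by the NVIDIA runtime hook.
    """
    if cdi_spec_dirs:
        # CDI device injection: driver "cdi" with a fully-qualified device name.
        return {"Driver": "cdi", "DeviceIDs": ["nvidia.com/gpu=all"], "Count": 0}
    # Fallback: the API equivalent of `docker run --gpus all`.
    return {"Driver": "nvidia", "Count": -1, "Capabilities": [["gpu"]]}
```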
@@ -308,15 +310,19 @@ The runtime chain is:
 
 ```text
 Host GPU drivers & NVIDIA Container Toolkit
-└─ Docker: --gpus all (DeviceRequests in bollard API)
+└─ Docker: DeviceRequests (CDI when enabled, --gpus all otherwise)
   └─ k3s/containerd: nvidia-container-runtime on PATH -> auto-detected
     └─ k8s: nvidia-device-plugin DaemonSet advertises nvidia.com/gpu
       └─ Pods: request nvidia.com/gpu in resource limits (CDI injection — no runtimeClassName needed)
 ```
 
-Device injection uses CDI (`deviceListStrategy: cdi-cri`): the device plugin injects devices via direct CDI device requests in the CRI. Sandbox pods only need `nvidia.com/gpu: 1` in their resource limits — no `runtimeClassName` field is set on GPU pods.
+### `--gpu` flag
 
-The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` (without `runtimeClassName`) and running `nvidia-smi`.
+The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.
+
+Device injection uses CDI (`deviceListStrategy: cdi-cri`): the device plugin injects devices via direct CDI device requests in the CRI. Sandbox pods only need `nvidia.com/gpu: 1` in their resource limits, and GPU pods do not set `runtimeClassName`.
+
+The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` without `runtimeClassName` and running `nvidia-smi`.
 
 ## Remote Image Transfer
 
@@ -383,7 +389,7 @@ When `openshell sandbox create` cannot connect to a gateway (connection refused,
 1. `should_attempt_bootstrap()` in `crates/openshell-cli/src/bootstrap.rs` checks the error type. It returns `true` for connectivity errors and missing default TLS materials, but `false` for TLS handshake/auth errors.
 2. If running in a terminal, the user is prompted to confirm.
 3. `run_bootstrap()` deploys a gateway named `"openshell"`, sets it as active, and returns fresh `TlsOptions` pointing to the newly-written mTLS certs.
-4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation.
+4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation, using the same CDI-or-fallback selection as `gateway start --gpu`.
 
 ## Container Environment Variables
 

architecture/inference-routing.md

Lines changed: 14 additions & 9 deletions
@@ -92,10 +92,10 @@ File: `proto/inference.proto`
 
 Key messages:
 
-- `SetClusterInferenceRequest` -- `provider_name` + `model_id` + optional `no_verify` override, with verification enabled by default
-- `SetClusterInferenceResponse` -- `provider_name` + `model_id` + `version`
+- `SetClusterInferenceRequest` -- `provider_name` + `model_id` + `timeout_secs` + optional `no_verify` override, with verification enabled by default
+- `SetClusterInferenceResponse` -- `provider_name` + `model_id` + `timeout_secs` + `version`
 - `GetInferenceBundleResponse` -- `repeated ResolvedRoute routes` + `revision` + `generated_at_ms`
-- `ResolvedRoute` -- `name`, `base_url`, `protocols`, `api_key`, `model_id`, `provider_type`
+- `ResolvedRoute` -- `name`, `base_url`, `protocols`, `api_key`, `model_id`, `provider_type`, `timeout_secs`
 
 ## Data Plane (Sandbox)
 
@@ -106,7 +106,7 @@ Files:
 - `crates/openshell-sandbox/src/lib.rs` -- inference context initialization, route refresh
 - `crates/openshell-sandbox/src/grpc_client.rs` -- `fetch_inference_bundle()`
 
-In cluster mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS` override) and uses the bundle revision hash to skip no-op cache writes.
+In cluster mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS` override) and uses the bundle revision hash to skip no-op cache writes. The revision hash covers all route fields including `timeout_secs`, so any configuration change (provider, model, or timeout) triggers a cache update on the next poll.
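A revision hash covering every route field can be modeled as a stable digest over the serialized routes. This is illustrative only — the actual hash construction lives in the gateway, and the route dicts here just mirror a subset of `ResolvedRoute`:

```python
import hashlib
import json

def bundle_revision(routes):
    """Stable digest over all route fields, so a change to any field
    (including `timeout_secs`) yields a new revision."""
    canonical = json.dumps(routes, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

r1 = [{"name": "default", "model_id": "llama-3.1-8b", "timeout_secs": 60}]
r2 = [{"name": "default", "model_id": "llama-3.1-8b", "timeout_secs": 120}]
# A timeout-only change still produces a different revision, so the
# refresh loop rewrites its cache on the next poll.
```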

 ### Interception flow

@@ -143,7 +143,7 @@ If no pattern matches, the proxy returns `403 Forbidden` with `{"error": "connec
 ### Route cache
 
 - `InferenceContext` holds a `Router`, the pattern list, and an `Arc<RwLock<Vec<ResolvedRoute>>>` route cache.
-- In cluster mode, `spawn_route_refresh()` polls `GetInferenceBundle` every 30 seconds (`ROUTE_REFRESH_INTERVAL_SECS`). On failure, stale routes are kept.
+- In cluster mode, `spawn_route_refresh()` polls `GetInferenceBundle` every 5 seconds (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS`). On failure, stale routes are kept.
 - In file mode (`--inference-routes`), routes load once at startup from YAML. No refresh task is spawned.
 - In cluster mode, an empty initial bundle still enables the inference context so the refresh task can pick up later configuration.
 
@@ -209,9 +209,11 @@ File: `crates/openshell-router/src/mock.rs`
 
 Routes with `mock://` scheme endpoints return canned responses without making HTTP requests. Mock responses are protocol-aware (OpenAI chat completion, OpenAI completion, Anthropic messages, or generic JSON). Mock routes include an `x-openshell-mock: true` response header.
 
-### HTTP client
+### Per-request timeout
 
-The router uses a `reqwest::Client` with a 60-second timeout. Timeouts and connection failures map to `RouterError::UpstreamUnavailable`.
+Each `ResolvedRoute` carries a `timeout` field (`Duration`). The `reqwest::Client` has no global timeout; instead, each outgoing request applies `.timeout(route.timeout)` on the request builder. When `timeout_secs` is `0` in the proto message, the default of 60 seconds is used (defined as `DEFAULT_ROUTE_TIMEOUT` in `config.rs`). Timeouts and connection failures map to `RouterError::UpstreamUnavailable`.
+
+Timeout changes propagate dynamically to running sandboxes. The bundle revision hash includes `timeout_secs`, so when the timeout is updated via `openshell inference update --timeout`, the refresh loop detects the revision change and updates the route cache within one polling interval (5 seconds by default).
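The zero-means-default rule can be stated in a few lines — a Python sketch of the resolution logic, where the 60-second constant mirrors `DEFAULT_ROUTE_TIMEOUT` and the function itself is illustrative:

```python
DEFAULT_ROUTE_TIMEOUT_SECS = 60  # mirrors DEFAULT_ROUTE_TIMEOUT in config.rs

def effective_timeout_secs(timeout_secs: int) -> int:
    """Proto3 scalar fields default to 0, so 0 doubles as "unset":
    fall back to the 60-second default rather than a zero timeout."""
    return timeout_secs if timeout_secs > 0 else DEFAULT_ROUTE_TIMEOUT_SECS
```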
 
 ## Standalone Route File
 
@@ -297,13 +299,16 @@ The system route is stored as a separate `InferenceRoute` record in the gateway
 
 Cluster inference commands:
 
-- `openshell inference set --provider <name> --model <id>` -- configures user-facing cluster inference
-- `openshell inference set --system --provider <name> --model <id>` -- configures system inference
+- `openshell inference set --provider <name> --model <id> [--timeout <secs>]` -- configures user-facing cluster inference
+- `openshell inference set --system --provider <name> --model <id> [--timeout <secs>]` -- configures system inference
+- `openshell inference update [--provider <name>] [--model <id>] [--timeout <secs>]` -- updates individual fields without resetting others
 - `openshell inference get` -- displays both user and system inference configuration
 - `openshell inference get --system` -- displays only the system inference configuration
 
 The `--provider` flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (`openai`, `anthropic`, or `nvidia`).
 
+The `--timeout` flag sets the per-request timeout in seconds for upstream inference calls. When omitted or set to `0`, the default of 60 seconds applies. Timeout changes propagate to running sandboxes within the route refresh interval (5 seconds by default).
+
 Inference writes verify by default. `--no-verify` is the explicit opt-out for endpoints that are not up yet.
 
 ## Provider Discovery

architecture/sandbox.md

Lines changed: 12 additions & 6 deletions
@@ -431,15 +431,21 @@ Landlock restricts the child process's filesystem access to an explicit allowlis
 1. Build path lists from `filesystem.read_only` and `filesystem.read_write`
 2. If `include_workdir` is true, add the working directory to `read_write`
 3. If both lists are empty, skip Landlock entirely (no-op)
-4. Create a Landlock ruleset targeting ABI V1:
+4. Create a Landlock ruleset targeting ABI V2:
    - Read-only paths receive `AccessFs::from_read(abi)` rights
    - Read-write paths receive `AccessFs::from_all(abi)` rights
-5. Call `ruleset.restrict_self()` -- this applies to the calling process and all descendants
+5. For each path, attempt `PathFd::new()`. If it fails:
+   - `BestEffort`: Log a warning with the error classification (not found, permission denied, symlink loop, etc.) and skip the path. Continue building the ruleset from remaining valid paths.
+   - `HardRequirement`: Return a fatal error, aborting the sandbox.
+6. If all paths failed (zero rules applied), return an error rather than calling `restrict_self()` on an empty ruleset (which would block all filesystem access)
+7. Call `ruleset.restrict_self()` -- this applies to the calling process and all descendants
 
-Error behavior depends on `LandlockCompatibility`:
+Kernel-level error behavior (e.g., Landlock ABI unavailable) depends on `LandlockCompatibility`:
 - `BestEffort`: Log a warning and continue without filesystem isolation
 - `HardRequirement`: Return a fatal error, aborting the sandbox
 
+**Baseline path filtering**: System-injected baseline paths (e.g., `/app`) are pre-filtered by `enrich_proto_baseline_paths()` / `enrich_sandbox_baseline_paths()` using `Path::exists()` before they reach Landlock. User-specified paths are not pre-filtered -- they are evaluated at Landlock apply time so misconfigurations surface as warnings or errors.
+
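The per-path fallback in steps 5–6 can be modeled without Landlock itself. A Python sketch of the decision logic — the mode constants are invented for the sketch, and `os.path.exists` stands in for opening a `PathFd`:

```python
import os

BEST_EFFORT, HARD_REQUIREMENT = "best_effort", "hard_requirement"

def collect_rule_paths(paths, mode):
    """Apply the best-effort/hard-requirement policy to a path list.

    Returns the usable paths; raises if a path fails under HARD_REQUIREMENT,
    or if every path failed (an empty ruleset would block all FS access).
    """
    usable = []
    for p in paths:
        if os.path.exists(p):  # stand-in for `PathFd::new()` succeeding
            usable.append(p)
        elif mode == HARD_REQUIREMENT:
            raise RuntimeError(f"landlock: required path unavailable: {p}")
        # BEST_EFFORT: warn-and-skip (warning elided in this sketch)
    if paths and not usable:
        raise RuntimeError("landlock: zero rules applied; refusing empty ruleset")
    return usable
```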
443449
### Seccomp syscall filtering
444450

445451
**File:** `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`
@@ -962,7 +968,7 @@ flowchart LR
962968
| `EnforcementMode` | `Audit`, `Enforce` | What to do on L7 deny (log-only vs block) |
963969
| `L7EndpointConfig` | `{ protocol, tls, enforcement }` | Per-endpoint L7 configuration |
964970
| `L7Decision` | `{ allowed, reason, matched_rule }` | Result of L7 evaluation |
965-
| `L7RequestInfo` | `{ action, target }` | HTTP method + path for policy evaluation |
971+
| `L7RequestInfo` | `{ action, target, query_params }` | HTTP method, path, and decoded query multimap for policy evaluation |
966972

967973
### Access presets
968974

@@ -1041,7 +1047,7 @@ This enables credential injection on all HTTPS endpoints automatically, without
 
 Implements `L7Provider` for HTTP/1.1:
 
-- **`parse_request()`**: Reads up to 16 KiB of headers, parses the request line (method, path), determines body framing from `Content-Length` or `Transfer-Encoding: chunked` headers. Returns `L7Request` with raw header bytes (may include overflow body bytes).
+- **`parse_request()`**: Reads up to 16 KiB of headers, parses the request line (method, path), decodes query parameters into a multimap, determines body framing from `Content-Length` or `Transfer-Encoding: chunked` headers. Returns `L7Request` with raw header bytes (may include overflow body bytes).
 
 - **`relay()`**: Forwards request headers and body to upstream (handling Content-Length, chunked, and no-body cases), then reads and relays the full response back to the client.
 
@@ -1054,7 +1060,7 @@ Implements `L7Provider` for HTTP/1.1:
 `relay_with_inspection()` in `crates/openshell-sandbox/src/l7/relay.rs` is the main relay loop:
 
 1. Parse one HTTP request from client via the provider
-2. Build L7 input JSON with `request.method`, `request.path`, plus the CONNECT-level context (host, port, binary, ancestors, cmdline)
+2. Build L7 input JSON with `request.method`, `request.path`, `request.query_params`, plus the CONNECT-level context (host, port, binary, ancestors, cmdline)
 3. Evaluate `data.openshell.sandbox.allow_request` and `data.openshell.sandbox.request_deny_reason`
 4. Log the L7 decision (tagged `L7_REQUEST`)
 5. If allowed (or audit mode): relay request to upstream and response back to client, then loop
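Decoding the query string into a multimap — repeated keys kept, values percent-decoded — is the standard treatment. A Python sketch of what `request.query_params` might look like (the Rust implementation is the source of truth; the helper name is invented):

```python
from urllib.parse import urlsplit, parse_qs

def query_multimap(target: str) -> dict:
    """Split the request target and decode its query string into a
    key -> [values] multimap, preserving repeated keys."""
    return parse_qs(urlsplit(target).query, keep_blank_values=True)

params = query_multimap("/v1/search?q=gpu&tag=cdi&tag=nvidia")
# params["tag"] holds both values, so a policy can match either one.
```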
