Skip to content

feat(server): out-of-process compute driver via --compute-driver-socket#1604

Closed
st-gr wants to merge 3 commits into
NVIDIA:mainfrom
st-gr:upstream-pr/external-compute-driver-socket
Closed

feat(server): out-of-process compute driver via --compute-driver-socket#1604
st-gr wants to merge 3 commits into
NVIDIA:mainfrom
st-gr:upstream-pr/external-compute-driver-socket

Conversation

@st-gr
Copy link
Copy Markdown

@st-gr st-gr commented May 28, 2026

Summary

Adds an out-of-process compute driver mechanism: when the gateway is started with --compute-driver-socket=PATH (or OPENSHELL_COMPUTE_DRIVER_SOCKET=PATH), it connects a tonic gRPC channel over the supplied Unix domain socket and uses that as the compute driver. The remote driver speaks the existing `compute_driver.proto` contract (no proto changes).

This unblocks third-party compute drivers (e.g., custom Kubernetes flavors, on-cluster managed services like SAP BTP Kyma) that ship as separate binaries without having to be merged into this workspace.

What's in the patch (3 commits)

  1. `feat(core): add External(PathBuf) variant to ComputeDriverKind` — the enum drops `Copy` (PathBuf isn't Copy), gains an `External(PathBuf)` variant, `Display` round-trips as `external:`, and `FromStr` accepts `external:` (case-insensitive prefix), rejecting bare `external` with a clear error pointing at the new flag. 9 unit tests, including a Display↔FromStr round-trip for all variants.

  2. `feat(server): add --compute-driver-socket CLI flag` — top-level flag in `RunArgs` with env binding. When set, the resolved single driver is `External(path)`, overriding both `--drivers` and auto-detection. Updates both `effective_single_driver` (used by mTLS/OIDC pre-checks) and `with_compute_drivers` (used by the runtime factory) so the two paths cannot diverge. 3 unit tests including env-var binding and override behavior.

  3. `feat(server): wire ComputeDriverKind::External via tonic UDS channel` — `build_compute_runtime` connects a tonic `Channel` to the configured UDS using `hyper-util`'s `TokioIo` connector and constructs a `ComputeRuntime` via a new `new_remote_external` constructor (sibling to `new_remote_vm`, takes only the channel and not a `ManagedDriverProcess` since the operator owns the external driver's lifecycle).

No new dependencies (`hyper-util`, `tower`, `tonic`, `tokio` already in-tree). All commits DCO-signed.

Test plan

  • `cargo test -p openshell-core compute_driver_kind` — 9 tests passing (5 new External-variant cases + 1 round-trip across all variants + 3 pre-existing).
  • `cargo test -p openshell-server --lib cli::` — 37 tests, 3 new for the flag (presence, override-of-drivers-list, env-var binding).
  • Full `cargo test -p openshell-core` and `cargo test -p openshell-server --lib` — 144 + 680 tests, no regressions.
  • Container smoke test: built `Dockerfile.gateway` (fork-only, not in this PR), ran with `--compute-driver-socket /tmp/no.sock`, confirmed clean error trace pointing at the missing socket.
  • CI build of upstream main + this PR (will run automatically once opened).

Why upstream

The companion driver lives at https://github.com/st-gr/openshell-driver-kyma (Apache-2.0). The author of the OpenShift compute driver maintains an analogous fork at https://github.com/zanetworker/OpenShell with the same flag — they noted in their docs that they would like to upstream the patch when stable. Two independent forks both carrying this same ~150-line change is a reasonable signal that the feature is generally useful.

Compatibility

  • The default driver dispatch is unchanged: `--drivers kubernetes`, auto-detection, etc. all behave exactly as before.
  • `ComputeDriverKind` no longer derives `Copy` because `PathBuf` isn't `Copy`. Three call sites in `openshell-server` were updated to clone instead. No public API change for downstream consumers — the enum was already `Clone`.
  • `as_str` signature changed from `const fn (self) -> &'static str` to `fn (&self) -> &'static str` for the same reason. No const-context callers exist in the workspace.

Open questions for reviewers

  • Should the CLI flag's name be `--compute-driver-socket` or something like `--external-driver-socket`? Happy to bikeshed.
  • Should `External(PathBuf)` carry an optional auth/identity field (mTLS cert path, bearer token) for the driver-side connection, or is that out of scope for v1? The current patch assumes UDS file permissions are the auth boundary, which is the same trust model as the existing in-tree drivers.

Signed-off-by: st-gr 38470677+st-gr@users.noreply.github.com

st-gr and others added 3 commits May 27, 2026 20:24
Carries the UDS path supplied by --compute-driver-socket. Drops Copy
from the enum derive (PathBuf is not Copy); existing callers use Clone
or owned values. FromStr accepts 'external:<path>' and rejects bare
'external' with a message pointing at the CLI flag.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds an opt-in `--compute-driver-socket=PATH` flag (env
`OPENSHELL_COMPUTE_DRIVER_SOCKET`) on the gateway's `RunArgs`. When
set, the gateway pins `ComputeDriverKind::External(<path>)` and
skips both the `--drivers` list and the auto-detection probe. This
lets out-of-tree driver binaries (Kyma, custom backends) connect
to a stock gateway without a rebuild.

`effective_single_driver` and the `Config.compute_drivers` payload
both honour the new flag so pre-runtime checks and the runtime
factory dispatch agree on the configured driver. The companion
dispatch arm in `lib.rs::build_compute_runtime` is wired in the
follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
`build_compute_runtime` now connects a tonic `Channel` to the
configured Unix domain socket using `hyper-util`'s `TokioIo`
connector and wraps it in `RemoteComputeDriver` -- the same proxy
used by the VM driver. Replaces the placeholder
`External(_) => Err(...)` arm.

Adds two helpers in `compute/mod.rs`:

* `connect_external_compute_driver(socket_path)` -- a small
  tonic-Endpoint + tower::service_fn + UnixStream connector,
  parallel to the one in `compute::vm` but with no VM-specific
  logging or capability probing. Out-of-tree drivers manage their
  own readiness; the gateway just dials.
* `ComputeRuntime::new_remote_external(channel, ...)` -- mirrors
  `new_remote_vm` but takes no `ManagedDriverProcess`. The
  external driver's lifecycle is the operator's responsibility
  (systemd unit, sidecar container, etc.).

Smoke-tested: `--compute-driver-socket /tmp/nonexistent.sock`
now starts the gateway, logs "Connecting to external compute
driver", and fails with a clear "failed to connect to external
compute driver socket '<path>': transport error" message that
points at the new arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

Thank you for your interest in contributing to OpenShell, @st-gr.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

  1. Open a Vouch Request discussion.
  2. Describe what you want to change and why.
  3. Write in your own words — do not have an AI generate the request.
  4. A maintainer will comment /vouch if approved.
  5. Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

@github-actions github-actions Bot closed this May 28, 2026
@github-actions
Copy link
Copy Markdown

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant