Commit 887f050

feat(sandbox): add gpu sandbox scheduling support (#257)
* feat(sandbox): add gpu sandbox scheduling support

Allow sandbox creation to request GPU resources explicitly or infer them from GPU image names. This wires GPU intent through bootstrap, validates gateway support, and adds dedicated GPU E2E coverage for follow-up cluster testing.
1 parent a359f1d commit 887f050

17 files changed, +684 -29 lines changed

architecture/gateway-single-node.md

Lines changed: 2 additions & 0 deletions
@@ -299,6 +299,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
 - `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
 - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
+- The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
 
 The runtime chain is:
 
@@ -377,6 +378,7 @@ When `openshell sandbox create` cannot connect to a gateway (connection refused,
 1. `should_attempt_bootstrap()` in `crates/openshell-cli/src/bootstrap.rs` checks the error type. It returns `true` for connectivity errors and missing default TLS materials, but `false` for TLS handshake/auth errors.
 2. If running in a terminal, the user is prompted to confirm.
 3. `run_bootstrap()` deploys a gateway named `"openshell"`, sets it as active, and returns fresh `TlsOptions` pointing to the newly-written mTLS certs.
+4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation.
 
 ## Container Environment Variables
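Reviewer sketch for the two behaviors documented in `architecture/gateway-single-node.md`: the bootstrap decision and the GPU admission check. This is a minimal illustration under stated assumptions, not the code from this commit. Only the name `should_attempt_bootstrap` comes from the document; its real signature is not shown there, and `ConnectError`, `ClusterView`, and `check_gpu_admission` are hypothetical names introduced for the example.

```rust
/// Hypothetical error categories; the real error type is not shown in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectError {
    ConnectionRefused,
    MissingDefaultTlsMaterials,
    TlsHandshake,
    AuthRejected,
}

/// Follows the documented rule: bootstrap on connectivity errors and missing
/// default TLS materials, but not on TLS handshake or auth errors.
fn should_attempt_bootstrap(err: ConnectError) -> bool {
    matches!(
        err,
        ConnectError::ConnectionRefused | ConnectError::MissingDefaultTlsMaterials
    )
}

/// Hypothetical snapshot of what the gateway can read from the cluster.
struct ClusterView {
    /// Names of the RuntimeClass objects visible to the gateway.
    runtime_classes: Vec<String>,
    /// Allocatable GPU count reported by each node.
    allocatable_gpus_per_node: Vec<u64>,
}

/// Rejects a GPU sandbox unless the `nvidia` RuntimeClass exists and at least
/// one node reports allocatable GPU capacity.
fn check_gpu_admission(cluster: &ClusterView) -> Result<(), String> {
    if !cluster.runtime_classes.iter().any(|rc| rc == "nvidia") {
        return Err("RuntimeClass `nvidia` not found".to_string());
    }
    // `all` is true for an empty list, so a cluster with no nodes is rejected too.
    if cluster.allocatable_gpus_per_node.iter().all(|&n| n == 0) {
        return Err("no node reports allocatable GPU capacity".to_string());
    }
    Ok(())
}
```

Returning a `Result` with a message, rather than a bare `bool`, lets the caller tell the user which of the two preconditions failed.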

architecture/sandbox-custom-containers.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ The CLI classifies the value in this order:
 
 The community registry prefix defaults to `ghcr.io/nvidia/openshell-community/sandboxes` and can be overridden with the `OPENSHELL_COMMUNITY_REGISTRY` environment variable.
 
+### GPU image-name detection
+
+`sandbox create` also infers GPU intent from the final image name. The current rule matches when the last image name component contains `gpu` (for example `ghcr.io/nvidia/openshell-community/sandboxes/nvidia-gpu:latest` or `registry.example.com/team/my-gpu-image:latest`). When that rule matches, the sandbox request is treated the same as passing `--gpu`.
+
 ### Dockerfile build flow
 
 When `--from` points to a Dockerfile or directory, the CLI:
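Reviewer sketch of the image-name rule added in this hunk. The implementation is not part of the excerpt shown here, so this is one plausible reading of "the last image name component contains `gpu`". The function name `image_name_implies_gpu` is hypothetical, and two details are assumptions: the tag or digest is stripped before matching, and the match is case-insensitive.

```rust
/// Hypothetical helper: returns true when the final image name component
/// contains "gpu". Tag/digest stripping and case-insensitivity are assumptions.
fn image_name_implies_gpu(image: &str) -> bool {
    // Final path component, e.g. "nvidia-gpu:latest".
    let last = image.rsplit('/').next().unwrap_or(image);
    // Drop a digest ("@sha256:...") first, then a tag (":latest"), if present.
    let name = last.split('@').next().unwrap_or(last);
    let name = name.split(':').next().unwrap_or(name);
    name.to_ascii_lowercase().contains("gpu")
}

/// The document states that a match is treated the same as passing `--gpu`,
/// so the effective intent is the flag OR the inferred value.
fn effective_gpu(flag: bool, image: Option<&str>) -> bool {
    flag || image.map_or(false, image_name_implies_gpu)
}
```

Under these assumptions, both example images from the added paragraph match, while a registry host that happens to contain `gpu` (such as `gpu-registry.example.com/team/base:latest`) does not, because only the final component is inspected.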

crates/openshell-cli/src/bootstrap.rs

Lines changed: 2 additions & 0 deletions
@@ -122,6 +122,7 @@ fn resolve_bootstrap_name() -> String {
 pub async fn run_bootstrap(
     remote: Option<&str>,
     ssh_key: Option<&str>,
+    gpu: bool,
 ) -> Result<(TlsOptions, String, String)> {
     let gateway_name = resolve_bootstrap_name();
     let location = if remote.is_some() { "remote" } else { "local" };
@@ -159,6 +160,7 @@ pub async fn run_bootstrap(
     {
         options = options.with_registry_token(token);
     }
+    options = options.with_gpu(gpu);
 
     let handle = deploy_gateway_with_panel(options, &gateway_name, location).await?;
     let server = handle.gateway_endpoint().to_string();
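Reviewer sketch of the builder-style call that this hunk adds. The options type is not visible in the diff, so `DeployOptions`, its fields, and `build_options` are hypothetical stand-ins; only the method names `with_registry_token` and `with_gpu` come from the hunk. The sketch assumes consuming builder methods that take `self` and return it, which is consistent with the `options = options.with_gpu(gpu);` reassignment.

```rust
/// Hypothetical stand-in for the deploy options type; the fields are assumptions.
#[derive(Debug, Default, Clone, PartialEq)]
struct DeployOptions {
    registry_token: Option<String>,
    gpu: bool,
}

impl DeployOptions {
    /// Consuming setter, matching the `options = options.with_...(..)` usage.
    fn with_registry_token(mut self, token: impl Into<String>) -> Self {
        self.registry_token = Some(token.into());
        self
    }

    /// Consuming setter for the GPU flag threaded in from `run_bootstrap`.
    fn with_gpu(mut self, gpu: bool) -> Self {
        self.gpu = gpu;
        self
    }
}

/// Same shape as the hunk: the token is applied only when present,
/// while the GPU flag is applied unconditionally.
fn build_options(token: Option<&str>, gpu: bool) -> DeployOptions {
    let mut options = DeployOptions::default();
    if let Some(token) = token {
        options = options.with_registry_token(token);
    }
    options.with_gpu(gpu)
}
```

Applying `with_gpu` unconditionally means `gpu = false` is also written explicitly, so the result does not depend on the default value of the field.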

crates/openshell-cli/src/main.rs

Lines changed: 11 additions & 0 deletions
@@ -1051,6 +1051,14 @@ enum SandboxCommands {
         #[arg(long, value_enum, conflicts_with = "no_keep")]
         editor: Option<CliEditor>,
 
+        /// Request GPU resources for the sandbox.
+        ///
+        /// When no gateway is running, auto-bootstrap starts a GPU-enabled
+        /// gateway. GPU intent is also inferred automatically for known
+        /// GPU-designated image names such as `nvidia-gpu`.
+        #[arg(long)]
+        gpu: bool,
+
         /// SSH destination for remote bootstrap (e.g., user@hostname).
         /// Only used when no cluster exists yet; ignored if a cluster is
         /// already active.
@@ -1791,6 +1799,7 @@ async fn main() -> Result<()> {
                 keep,
                 no_keep,
                 editor,
+                gpu,
                 remote,
                 ssh_key,
                 providers,
@@ -1868,6 +1877,7 @@ async fn main() -> Result<()> {
                         &ctx.name,
                         upload_spec.as_ref(),
                         keep,
+                        gpu,
                         editor,
                         remote.as_deref(),
                         ssh_key.as_deref(),
@@ -1889,6 +1899,7 @@ async fn main() -> Result<()> {
                         from.as_deref(),
                         upload_spec.as_ref(),
                         keep,
+                        gpu,
                         editor,
                         remote.as_deref(),
                         ssh_key.as_deref(),
