Skip to content

Commit f6fc373

Browse files
committed
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles
Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: 1. k3s used the Docker container ID as its node name. When the container was recreated (e.g. after an image upgrade), a new node was registered and clean_stale_nodes() deleted all PVCs with node affinity for the old node — including the server's own SQLite database PVC. 2. Sandbox pods had no PersistentVolumeClaims, so their filesystem data lived in the ephemeral container layer and was lost on any pod reschedule. Fix (1) with a three-layer approach to ensure a deterministic k3s node name: a. Set the Docker container hostname to a deterministic name derived from the gateway name. k3s uses the hostname as its default node name, and it persists across container stop/start cycles. b. Pass --node-name to k3s via the cluster entrypoint as a belt-and- suspenders measure. c. Update clean_stale_nodes() to prefer the deterministic name, with a fallback to the container hostname for backward compatibility with older cluster images. Fix (2) by injecting a default workspace volumeClaimTemplate into every sandbox pod spec (1Gi PVC mounted at /home/user). Users who supply their own volumeClaimTemplates are unaffected — the workspace mount is only added when the default VCT is used, preventing dangling volume references. Fixes #738
1 parent dd8dd8a commit f6fc373

File tree

7 files changed

+315
-22
lines changed

7 files changed

+315
-22
lines changed

architecture/gateway-single-node.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -185,7 +185,7 @@ For the target daemon (local or remote):
185185

186186
After the container starts:
187187

188-
1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
188+
1. **Clean stale nodes**: `clean_stale_nodes()` finds nodes whose name does not match the deterministic k3s `--node-name` and deletes them. That node name is derived from the gateway name but normalized to a Kubernetes-safe lowercase form so existing gateway names that contain `_`, `.`, or uppercase characters still produce a valid node identity. This cleanup is needed when a container is recreated but reuses the persistent volume -- old node entries can persist in etcd. Non-fatal on error; returns the count of removed nodes.
189189
2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
190190
3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
191191
- Container exits during polling: error includes recent log lines.

architecture/gateway.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -501,7 +501,7 @@ The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`
501501

502502
`SandboxClient` (`crates/openshell-server/src/sandbox/mod.rs`) manages `agents.x-k8s.io/v1alpha1/Sandbox` CRDs.
503503

504-
- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.).
504+
- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.). When callers do not provide custom `volumeClaimTemplates`, the server injects a default `workspace` PVC and mounts it at `/sandbox` so the default sandbox home/workdir survives pod rescheduling.
505505
- **Delete**: Calls the Kubernetes API to delete the CRD by name. Returns `false` if already gone (404).
506506
- **Pod IP resolution**: `agent_pod_ip()` fetches the agent pod and reads `status.podIP`.
507507

crates/openshell-bootstrap/src/constants.rs

Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,100 @@ pub const SERVER_CLIENT_CA_SECRET_NAME: &str = "openshell-server-client-ca";
1313
pub const CLIENT_TLS_SECRET_NAME: &str = "openshell-client-tls";
1414
/// K8s secret holding the SSH handshake HMAC secret (shared by gateway and sandbox pods).
1515
pub const SSH_HANDSHAKE_SECRET_NAME: &str = "openshell-ssh-handshake";
16+
const NODE_NAME_PREFIX: &str = "openshell-";
17+
const NODE_NAME_FALLBACK_SUFFIX: &str = "gateway";
18+
const KUBERNETES_MAX_NAME_LEN: usize = 253;
1619

1720
pub fn container_name(name: &str) -> String {
1821
format!("openshell-cluster-{name}")
1922
}
2023

24+
/// Deterministic k3s node name derived from the gateway name.
25+
///
26+
/// k3s defaults to using the container hostname (= Docker container ID) as
27+
/// the node name. When the container is recreated (e.g. after an image
28+
/// upgrade), the container ID changes, creating a new k3s node. The
29+
/// `clean_stale_nodes` function then deletes PVCs whose backing PVs have
30+
/// node affinity for the old node — wiping the server database and any
31+
/// sandbox persistent volumes.
32+
///
33+
/// By passing a deterministic `--node-name` to k3s, the node identity
34+
/// survives container recreation, and PVCs are never orphaned.
35+
///
36+
/// Gateway names allow Docker-friendly separators and uppercase characters,
37+
/// but Kubernetes node names must be DNS-safe. Normalize the gateway name into
38+
/// a single lowercase RFC 1123 label so previously accepted names such as
39+
/// `prod_us` or `Prod.US` still deploy successfully.
40+
pub fn node_name(name: &str) -> String {
41+
format!("{NODE_NAME_PREFIX}{}", normalize_node_name_suffix(name))
42+
}
43+
44+
fn normalize_node_name_suffix(name: &str) -> String {
45+
let mut normalized = String::with_capacity(name.len());
46+
let mut last_was_separator = false;
47+
48+
for ch in name.chars() {
49+
if ch.is_ascii_alphanumeric() {
50+
normalized.push(ch.to_ascii_lowercase());
51+
last_was_separator = false;
52+
} else if !last_was_separator {
53+
normalized.push('-');
54+
last_was_separator = true;
55+
}
56+
}
57+
58+
let mut normalized = normalized.trim_matches('-').to_string();
59+
if normalized.is_empty() {
60+
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
61+
}
62+
63+
let max_suffix_len = KUBERNETES_MAX_NAME_LEN.saturating_sub(NODE_NAME_PREFIX.len());
64+
if normalized.len() > max_suffix_len {
65+
normalized.truncate(max_suffix_len);
66+
normalized.truncate(normalized.trim_end_matches('-').len());
67+
}
68+
69+
if normalized.is_empty() {
70+
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
71+
}
72+
73+
normalized
74+
}
75+
2176
pub fn volume_name(name: &str) -> String {
2277
format!("openshell-cluster-{name}")
2378
}
2479

2580
pub fn network_name(name: &str) -> String {
2681
format!("openshell-cluster-{name}")
2782
}
83+
84+
#[cfg(test)]
85+
mod tests {
86+
use super::*;
87+
88+
#[test]
89+
fn node_name_normalizes_uppercase_and_underscores() {
90+
assert_eq!(node_name("Prod_US"), "openshell-prod-us");
91+
}
92+
93+
#[test]
94+
fn node_name_collapses_and_trims_separator_runs() {
95+
assert_eq!(node_name("._Prod..__-Gateway-."), "openshell-prod-gateway");
96+
}
97+
98+
#[test]
99+
fn node_name_falls_back_when_gateway_name_has_no_alphanumerics() {
100+
assert_eq!(node_name("...___---"), "openshell-gateway");
101+
}
102+
103+
#[test]
104+
fn node_name_truncates_to_kubernetes_name_limit() {
105+
let gateway_name = "A".repeat(400);
106+
let node_name = node_name(&gateway_name);
107+
108+
assert!(node_name.len() <= KUBERNETES_MAX_NAME_LEN);
109+
assert!(node_name.starts_with(NODE_NAME_PREFIX));
110+
assert!(node_name.ends_with('a'));
111+
}
112+
}

crates/openshell-bootstrap/src/docker.rs

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
// SPDX-License-Identifier: Apache-2.0
33

44
use crate::RemoteOptions;
5-
use crate::constants::{container_name, network_name, volume_name};
5+
use crate::constants::{container_name, network_name, node_name, volume_name};
66
use crate::image::{self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, parse_image_ref};
77
use bollard::API_DEFAULT_VERSION;
88
use bollard::Docker;
@@ -684,6 +684,11 @@ pub async fn ensure_container(
684684
format!("REGISTRY_HOST={registry_host}"),
685685
format!("REGISTRY_INSECURE={registry_insecure}"),
686686
format!("IMAGE_REPO_BASE={image_repo_base}"),
687+
// Deterministic k3s node name so the node identity survives container
688+
// recreation (e.g. after an image upgrade). Without this, k3s uses
689+
// the container ID as the hostname/node name, which changes on every
690+
// container recreate and triggers stale-node PVC cleanup.
691+
format!("OPENSHELL_NODE_NAME={}", node_name(name)),
687692
];
688693
if let Some(endpoint) = registry_endpoint {
689694
env_vars.push(format!("REGISTRY_ENDPOINT={endpoint}"));
@@ -753,6 +758,14 @@ pub async fn ensure_container(
753758

754759
let config = ContainerCreateBody {
755760
image: Some(image_ref.to_string()),
761+
// Set the container hostname to the deterministic node name.
762+
// k3s uses the container hostname as its default node name. Without
763+
// this, Docker defaults to the container ID (first 12 hex chars),
764+
// which changes on every container recreation and can cause
765+
// `clean_stale_nodes` to delete the wrong node on resume. The
766+
// hostname persists across container stop/start cycles, ensuring a
767+
// stable node identity.
768+
hostname: Some(node_name(name)),
756769
cmd: Some(cmd),
757770
env,
758771
exposed_ports: Some(exposed_ports),

crates/openshell-bootstrap/src/runtime.rs

Lines changed: 32 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
// SPDX-License-Identifier: Apache-2.0
33

4-
use crate::constants::{KUBECONFIG_PATH, container_name};
4+
use crate::constants::{KUBECONFIG_PATH, container_name, node_name};
55
use bollard::Docker;
66
use bollard::container::LogOutput;
77
use bollard::exec::CreateExecOptions;
@@ -385,11 +385,19 @@ pub async fn clean_stale_nodes(docker: &Docker, name: &str) -> Result<usize> {
385385
let container_name = container_name(name);
386386
let mut stale_nodes: Vec<String> = Vec::new();
387387

388+
// Determine the current node name. With the deterministic `--node-name`
389+
// entrypoint change the k3s node is `openshell-{gateway}`. However, older
390+
// cluster images (built before that change) still use the container hostname
391+
// (= Docker container ID) as the node name. We must handle both:
392+
//
393+
// 1. If the expected deterministic name appears in the node list, use it.
394+
// 2. Otherwise fall back to the container hostname (old behaviour).
395+
//
396+
// This ensures backward compatibility during upgrades where the bootstrap
397+
// CLI is newer than the cluster image.
398+
let deterministic_node = node_name(name);
399+
388400
for attempt in 1..=MAX_ATTEMPTS {
389-
// List ALL node names and the container's own hostname. Any node that
390-
// is not the current container is stale — we cannot rely on the Ready
391-
// condition because k3s may not have marked the old node NotReady yet
392-
// when this runs shortly after container start.
393401
let (output, exit_code) = exec_capture_with_exit(
394402
docker,
395403
&container_name,
@@ -406,16 +414,27 @@ pub async fn clean_stale_nodes(docker: &Docker, name: &str) -> Result<usize> {
406414
.await?;
407415

408416
if exit_code == 0 {
409-
// Determine the current node name (container hostname).
410-
let (hostname_out, _) =
411-
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
412-
.await?;
413-
let current_hostname = hostname_out.trim().to_string();
414-
415-
stale_nodes = output
417+
let all_nodes: Vec<&str> = output
416418
.lines()
417419
.map(str::trim)
418-
.filter(|l| !l.is_empty() && *l != current_hostname)
420+
.filter(|l| !l.is_empty())
421+
.collect();
422+
423+
// Pick the current node identity: prefer the deterministic name,
424+
// fall back to the container hostname for older cluster images.
425+
let current_node = if all_nodes.contains(&deterministic_node.as_str()) {
426+
deterministic_node.clone()
427+
} else {
428+
// Older cluster image without --node-name: read hostname.
429+
let (hostname_out, _) =
430+
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
431+
.await?;
432+
hostname_out.trim().to_string()
433+
};
434+
435+
stale_nodes = all_nodes
436+
.into_iter()
437+
.filter(|n| *n != current_node)
419438
.map(ToString::to_string)
420439
.collect();
421440
break;

0 commit comments

Comments
 (0)