Skip to content

Commit e9f1071

Browse files
authored
fix(security): harden sandbox SSH with mandatory HMAC secret, NetworkPolicy, and nonce replay detection (#127)
* fix(security): harden sandbox SSH with mandatory HMAC secret, NetworkPolicy, and nonce replay detection Closes NVIDIA#25 - Make NEMOCLAW_SSH_HANDSHAKE_SECRET mandatory: server and sandbox both refuse to start if the secret is empty/unset. Cluster deployments auto-generate it via openssl rand in the entrypoint script. - Add Kubernetes NetworkPolicy restricting sandbox port 2222 ingress to the gateway pod only, preventing lateral movement from other cluster workloads. - Add NSSH1 nonce replay detection with a TTL-bounded cache, rejecting replayed handshakes within the timestamp validity window. - Add unit tests for verify_preface (valid, replay, expired, bad HMAC, malformed) and env injection. * fix(deploy): pass sshHandshakeSecret in fast deploy helm upgrade --------- Co-authored-by: John Myers <johntmyers@users.noreply.github.com>
1 parent 780731f commit e9f1071

File tree

11 files changed

+260
-12
lines changed

11 files changed

+260
-12
lines changed

architecture/sandbox-connect.md

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -345,7 +345,13 @@ NSSH1 <token> <timestamp> <nonce> <hmac>\n
345345
3. Reject if skew exceeds `ssh_handshake_skew_secs` (default: 300 seconds)
346346
4. Recompute HMAC-SHA256 over `token|timestamp|nonce` with the shared secret
347347
5. Compare computed signature against the received signature (constant-time via `hmac` crate)
348-
6. Respond with `OK\n` on success or `ERR\n` on failure
348+
6. Check nonce against the replay cache; reject if the nonce has been seen before within the skew window
349+
7. Insert the nonce into the replay cache on success
350+
8. Respond with `OK\n` on success or `ERR\n` on failure
351+
352+
### Nonce replay detection
353+
354+
The SSH server maintains a per-process `NonceCache` (`HashMap<String, Instant>` behind `Arc<Mutex<...>>`) that tracks nonces seen within the handshake skew window. A background tokio task reaps expired entries every 60 seconds. If a valid preface is presented with a previously-seen nonce, the handshake is rejected. This prevents replay attacks within the timestamp validity window.
349355

350356
### HMAC computation
351357

@@ -517,7 +523,12 @@ This function is shared between the CLI and TUI via the `navigator-core::forward
517523

518524
1. **mTLS (transport layer)** -- when TLS is configured, the CLI authenticates to the gateway using client certificates. The `ssh-proxy` subprocess inherits TLS options from the parent CLI process.
519525
2. **Session token (application layer)** -- the gateway validates the session token against the persistence layer. Tokens are scoped to a specific sandbox and can be revoked.
520-
3. **NSSH1 handshake (gateway-to-sandbox)** -- the shared handshake secret proves the connection originated from an authorized gateway. The timestamp + nonce prevent replay attacks within the skew window.
526+
3. **NSSH1 handshake (gateway-to-sandbox)** -- the shared handshake secret proves the connection originated from an authorized gateway. The timestamp + nonce prevent replay attacks within the skew window. The nonce replay cache rejects duplicates.
527+
4. **Kubernetes NetworkPolicy** -- a Helm-managed `NetworkPolicy` restricts ingress to sandbox pods on port 2222 to only the gateway pod, preventing lateral movement from other in-cluster workloads. Controlled by `networkPolicy.enabled` in the Helm values (default: `true`).
528+
529+
### Mandatory handshake secret
530+
531+
The NSSH1 handshake secret (`NEMOCLAW_SSH_HANDSHAKE_SECRET`) is required. Both the server and sandbox will refuse to start if the secret is empty or unset. For cluster deployments the secret is auto-generated by the entrypoint script (`deploy/docker/cluster-entrypoint.sh`) via `openssl rand -hex 32` and injected into the Helm values.
521532

522533
### What SSH auth does NOT enforce
523534

@@ -542,7 +553,7 @@ The sandbox generates a fresh Ed25519 host key on every startup. The CLI disable
542553
| `ssh_gateway_port` | `8080` | Public port for gateway connections (0 = use bind port) |
543554
| `ssh_connect_path` | `/connect/ssh` | HTTP path for CONNECT requests |
544555
| `sandbox_ssh_port` | `2222` | SSH listen port inside sandbox pods |
545-
| `ssh_handshake_secret` | (empty) | Shared HMAC key for NSSH1 handshake |
556+
| `ssh_handshake_secret` | (required) | Shared HMAC key for NSSH1 handshake (server fails to start if empty) |
546557
| `ssh_handshake_skew_secs` | `300` | Maximum allowed clock skew (seconds) |
547558

548559
### Sandbox environment variables

crates/navigator-sandbox/src/lib.rs

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -404,7 +404,14 @@ pub async fn run_sandbox(
404404
let addr: SocketAddr = listen_addr.parse().into_diagnostic()?;
405405
let policy_clone = policy.clone();
406406
let workdir_clone = workdir.clone();
407-
let secret = ssh_handshake_secret.unwrap_or_default();
407+
let secret = ssh_handshake_secret
408+
.filter(|s| !s.is_empty())
409+
.ok_or_else(|| {
410+
miette::miette!(
411+
"NEMOCLAW_SSH_HANDSHAKE_SECRET is required when SSH is enabled.\n\
412+
Set --ssh-handshake-secret or the NEMOCLAW_SSH_HANDSHAKE_SECRET env var."
413+
)
414+
})?;
408415
let proxy_url = ssh_proxy_url;
409416
let netns_fd = ssh_netns_fd;
410417
let ca_paths = ca_file_paths.clone();

crates/navigator-sandbox/src/ssh.rs

Lines changed: 149 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,14 +21,19 @@ use std::net::SocketAddr;
2121
use std::os::fd::{AsRawFd, RawFd};
2222
use std::path::PathBuf;
2323
use std::process::Command;
24-
use std::sync::{Arc, mpsc};
25-
use std::time::{Duration, SystemTime, UNIX_EPOCH};
24+
use std::sync::{Arc, Mutex, mpsc};
25+
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};
2626
use tokio::io::{AsyncReadExt, AsyncWriteExt};
2727
use tokio::net::TcpListener;
2828
use tracing::{info, warn};
2929

3030
const PREFACE_MAGIC: &str = "NSSH1";
3131

32+
/// A time-bounded set of nonces used to detect replayed NSSH1 handshakes.
33+
/// Each entry records the `Instant` it was inserted; a background reaper task
34+
/// periodically evicts entries older than the handshake skew window.
35+
type NonceCache = Arc<Mutex<HashMap<String, Instant>>>;
36+
3237
/// Perform SSH server initialization: generate a host key, build the config,
3338
/// and bind the TCP listener. Extracted so that startup errors can be forwarded
3439
/// through the readiness channel rather than being silently logged.
@@ -85,6 +90,23 @@ pub async fn run_ssh_server(
8590
}
8691
};
8792

93+
// Nonce cache for replay detection. Entries are evicted by a background
94+
// reaper once they exceed the handshake skew window.
95+
let nonce_cache: NonceCache = Arc::new(Mutex::new(HashMap::new()));
96+
97+
// Background task that periodically purges expired nonces.
98+
let reaper_cache = nonce_cache.clone();
99+
let ttl = Duration::from_secs(handshake_skew_secs);
100+
tokio::spawn(async move {
101+
let mut interval = tokio::time::interval(Duration::from_secs(60));
102+
loop {
103+
interval.tick().await;
104+
if let Ok(mut cache) = reaper_cache.lock() {
105+
cache.retain(|_, inserted| inserted.elapsed() < ttl);
106+
}
107+
}
108+
});
109+
88110
loop {
89111
let (stream, peer) = listener.accept().await.into_diagnostic()?;
90112
stream.set_nodelay(true).into_diagnostic()?;
@@ -95,6 +117,7 @@ pub async fn run_ssh_server(
95117
let proxy_url = proxy_url.clone();
96118
let ca_paths = ca_paths.clone();
97119
let provider_env = provider_env.clone();
120+
let nonce_cache = nonce_cache.clone();
98121

99122
tokio::spawn(async move {
100123
if let Err(err) = handle_connection(
@@ -109,6 +132,7 @@ pub async fn run_ssh_server(
109132
proxy_url,
110133
ca_paths,
111134
provider_env,
135+
&nonce_cache,
112136
)
113137
.await
114138
{
@@ -131,12 +155,13 @@ async fn handle_connection(
131155
proxy_url: Option<String>,
132156
ca_file_paths: Option<Arc<(PathBuf, PathBuf)>>,
133157
provider_env: HashMap<String, String>,
158+
nonce_cache: &NonceCache,
134159
) -> Result<()> {
135160
info!(peer = %peer, "SSH connection: reading handshake preface");
136161
let mut line = String::new();
137162
read_line(&mut stream, &mut line).await?;
138163
info!(peer = %peer, preface_len = line.len(), "SSH connection: preface received, verifying");
139-
if !verify_preface(&line, secret, handshake_skew_secs)? {
164+
if !verify_preface(&line, secret, handshake_skew_secs, nonce_cache)? {
140165
warn!(peer = %peer, "SSH connection: handshake verification failed");
141166
let _ = stream.write_all(b"ERR\n").await;
142167
return Ok(());
@@ -178,7 +203,12 @@ async fn read_line(stream: &mut tokio::net::TcpStream, buf: &mut String) -> Resu
178203
Ok(())
179204
}
180205

181-
fn verify_preface(line: &str, secret: &str, handshake_skew_secs: u64) -> Result<bool> {
206+
fn verify_preface(
207+
line: &str,
208+
secret: &str,
209+
handshake_skew_secs: u64,
210+
nonce_cache: &NonceCache,
211+
) -> Result<bool> {
182212
let parts: Vec<&str> = line.split_whitespace().collect();
183213
if parts.len() != 5 || parts[0] != PREFACE_MAGIC {
184214
return Ok(false);
@@ -202,7 +232,22 @@ fn verify_preface(line: &str, secret: &str, handshake_skew_secs: u64) -> Result<
202232

203233
let payload = format!("{token}|{timestamp}|{nonce}");
204234
let expected = hmac_sha256(secret.as_bytes(), payload.as_bytes());
205-
Ok(signature == expected)
235+
if signature != expected {
236+
return Ok(false);
237+
}
238+
239+
// Reject replayed nonces. The cache is bounded by the reaper task which
240+
// evicts entries older than `handshake_skew_secs`.
241+
let mut cache = nonce_cache
242+
.lock()
243+
.map_err(|_| miette::miette!("nonce cache lock poisoned"))?;
244+
if cache.contains_key(nonce) {
245+
warn!(nonce = nonce, "NSSH1 nonce replay detected");
246+
return Ok(false);
247+
}
248+
cache.insert(nonce.to_string(), Instant::now());
249+
250+
Ok(true)
206251
}
207252

208253
fn hmac_sha256(key: &[u8], data: &[u8]) -> String {
@@ -1085,4 +1130,103 @@ mod tests {
10851130
"expected all 100 KiB delivered before EOF"
10861131
);
10871132
}
1133+
1134+
// -----------------------------------------------------------------------
1135+
// verify_preface tests
1136+
// -----------------------------------------------------------------------
1137+
1138+
/// Build a valid NSSH1 preface line with the given parameters.
1139+
fn build_preface(token: &str, secret: &str, nonce: &str, timestamp: i64) -> String {
1140+
let payload = format!("{token}|{timestamp}|{nonce}");
1141+
let signature = hmac_sha256(secret.as_bytes(), payload.as_bytes());
1142+
format!("{PREFACE_MAGIC} {token} {timestamp} {nonce} {signature}")
1143+
}
1144+
1145+
fn fresh_nonce_cache() -> NonceCache {
1146+
Arc::new(Mutex::new(HashMap::new()))
1147+
}
1148+
1149+
fn current_timestamp() -> i64 {
1150+
i64::try_from(
1151+
SystemTime::now()
1152+
.duration_since(UNIX_EPOCH)
1153+
.unwrap()
1154+
.as_secs(),
1155+
)
1156+
.unwrap()
1157+
}
1158+
1159+
#[test]
1160+
fn verify_preface_accepts_valid_preface() {
1161+
let secret = "test-secret-key";
1162+
let nonce = "unique-nonce-1";
1163+
let ts = current_timestamp();
1164+
let line = build_preface("tok1", secret, nonce, ts);
1165+
let cache = fresh_nonce_cache();
1166+
1167+
assert!(verify_preface(&line, secret, 300, &cache).unwrap());
1168+
}
1169+
1170+
#[test]
1171+
fn verify_preface_rejects_replayed_nonce() {
1172+
let secret = "test-secret-key";
1173+
let nonce = "replay-nonce";
1174+
let ts = current_timestamp();
1175+
let line = build_preface("tok1", secret, nonce, ts);
1176+
let cache = fresh_nonce_cache();
1177+
1178+
// First attempt should succeed.
1179+
assert!(verify_preface(&line, secret, 300, &cache).unwrap());
1180+
// Second attempt with the same nonce should be rejected.
1181+
assert!(!verify_preface(&line, secret, 300, &cache).unwrap());
1182+
}
1183+
1184+
#[test]
1185+
fn verify_preface_rejects_expired_timestamp() {
1186+
let secret = "test-secret-key";
1187+
let nonce = "expired-nonce";
1188+
// Timestamp 600 seconds in the past, with a 300-second skew window.
1189+
let ts = current_timestamp() - 600;
1190+
let line = build_preface("tok1", secret, nonce, ts);
1191+
let cache = fresh_nonce_cache();
1192+
1193+
assert!(!verify_preface(&line, secret, 300, &cache).unwrap());
1194+
}
1195+
1196+
#[test]
1197+
fn verify_preface_rejects_invalid_hmac() {
1198+
let secret = "test-secret-key";
1199+
let nonce = "hmac-nonce";
1200+
let ts = current_timestamp();
1201+
// Build with the correct secret, then verify with the wrong one.
1202+
let line = build_preface("tok1", secret, nonce, ts);
1203+
let cache = fresh_nonce_cache();
1204+
1205+
assert!(!verify_preface(&line, "wrong-secret", 300, &cache).unwrap());
1206+
}
1207+
1208+
#[test]
1209+
fn verify_preface_rejects_malformed_input() {
1210+
let cache = fresh_nonce_cache();
1211+
1212+
// Too few parts.
1213+
assert!(!verify_preface("NSSH1 tok1 123", "s", 300, &cache).unwrap());
1214+
// Wrong magic.
1215+
assert!(!verify_preface("NSSH2 tok1 123 nonce sig", "s", 300, &cache).unwrap());
1216+
// Empty string.
1217+
assert!(!verify_preface("", "s", 300, &cache).unwrap());
1218+
}
1219+
1220+
#[test]
1221+
fn verify_preface_distinct_nonces_both_accepted() {
1222+
let secret = "test-secret-key";
1223+
let ts = current_timestamp();
1224+
let cache = fresh_nonce_cache();
1225+
1226+
let line1 = build_preface("tok1", secret, "nonce-a", ts);
1227+
let line2 = build_preface("tok1", secret, "nonce-b", ts);
1228+
1229+
assert!(verify_preface(&line1, secret, 300, &cache).unwrap());
1230+
assert!(verify_preface(&line2, secret, 300, &cache).unwrap());
1231+
}
10881232
}

crates/navigator-server/src/lib.rs

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,11 @@ pub async fn run_server(config: Config, tracing_log_bus: TracingLogBus) -> Resul
9292
if database_url.is_empty() {
9393
return Err(Error::config("database_url is required"));
9494
}
95+
if config.ssh_handshake_secret.is_empty() {
96+
return Err(Error::config(
97+
"ssh_handshake_secret is required. Set --ssh-handshake-secret or NEMOCLAW_SSH_HANDSHAKE_SECRET",
98+
));
99+
}
95100

96101
let store = Store::connect(database_url).await?;
97102
let sandbox_client = SandboxClient::new(

crates/navigator-server/src/sandbox/mod.rs

Lines changed: 26 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -915,9 +915,7 @@ fn apply_required_env(
915915
if !ssh_listen_addr.is_empty() {
916916
upsert_env(env, "NEMOCLAW_SSH_LISTEN_ADDR", ssh_listen_addr);
917917
}
918-
if !ssh_handshake_secret.is_empty() {
919-
upsert_env(env, "NEMOCLAW_SSH_HANDSHAKE_SECRET", ssh_handshake_secret);
920-
}
918+
upsert_env(env, "NEMOCLAW_SSH_HANDSHAKE_SECRET", ssh_handshake_secret);
921919
upsert_env(
922920
env,
923921
"NEMOCLAW_SSH_HANDSHAKE_SKEW_SECS",
@@ -1219,4 +1217,29 @@ mod tests {
12191217

12201218
assert_eq!(derive_phase(&status, false), SandboxPhase::Ready);
12211219
}
1220+
1221+
#[test]
1222+
fn apply_required_env_always_injects_ssh_handshake_secret() {
1223+
let mut env = Vec::new();
1224+
apply_required_env(
1225+
&mut env,
1226+
"sandbox-1",
1227+
"my-sandbox",
1228+
"https://endpoint:8080",
1229+
"0.0.0.0:2222",
1230+
"my-secret-value",
1231+
300,
1232+
);
1233+
1234+
let secret_entry = env
1235+
.iter()
1236+
.find(|e| {
1237+
e.get("name").and_then(|v| v.as_str()) == Some("NEMOCLAW_SSH_HANDSHAKE_SECRET")
1238+
})
1239+
.expect("NEMOCLAW_SSH_HANDSHAKE_SECRET must be present in env");
1240+
assert_eq!(
1241+
secret_entry.get("value").and_then(|v| v.as_str()),
1242+
Some("my-secret-value")
1243+
);
1244+
}
12221245
}

deploy/docker/cluster-entrypoint.sh

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -249,6 +249,11 @@ if [ -n "${IMAGE_PULL_POLICY:-}" ] && [ -f "$HELMCHART" ]; then
249249
sed -i "s|pullPolicy: Always|pullPolicy: ${IMAGE_PULL_POLICY}|" "$HELMCHART"
250250
fi
251251

252+
# Generate a random SSH handshake secret for the NSSH1 HMAC handshake between
253+
# the gateway and sandbox SSH servers. This is required — the server will refuse
254+
# to start without it.
255+
SSH_HANDSHAKE_SECRET="${SSH_HANDSHAKE_SECRET:-$(openssl rand -hex 32)}"
256+
252257
# Inject SSH gateway host/port into the HelmChart manifest so the navigator
253258
# server returns the correct address to CLI clients for SSH proxy CONNECT.
254259
if [ -f "$HELMCHART" ]; then
@@ -266,6 +271,8 @@ if [ -f "$HELMCHART" ]; then
266271
# Clear the placeholder so the default (8080) is used
267272
sed -i "s|sshGatewayPort: __SSH_GATEWAY_PORT__|sshGatewayPort: 0|g" "$HELMCHART"
268273
fi
274+
echo "Setting SSH handshake secret"
275+
sed -i "s|__SSH_HANDSHAKE_SECRET__|${SSH_HANDSHAKE_SECRET}|g" "$HELMCHART"
269276
fi
270277

271278
# Inject chart checksum into the HelmChart manifest so that a changed chart
Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
{{- if .Values.networkPolicy.enabled }}
2+
# NetworkPolicy restricting SSH ingress on sandbox pods to the gateway pod.
3+
# Sandbox pods are dynamically created by the server and labelled with
4+
# navigator.ai/managed-by=navigator. This policy ensures only the gateway
5+
# (navigator server) pod can reach the sandbox SSH port (2222), blocking
6+
# lateral movement from other in-cluster workloads.
7+
apiVersion: networking.k8s.io/v1
8+
kind: NetworkPolicy
9+
metadata:
10+
name: {{ include "navigator.fullname" . }}-sandbox-ssh
11+
namespace: {{ .Values.server.sandboxNamespace }}
12+
labels:
13+
{{- include "navigator.labels" . | nindent 4 }}
14+
spec:
15+
podSelector:
16+
matchLabels:
17+
navigator.ai/managed-by: navigator
18+
policyTypes:
19+
- Ingress
20+
ingress:
21+
- from:
22+
- namespaceSelector:
23+
matchLabels:
24+
kubernetes.io/metadata.name: {{ .Release.Namespace }}
25+
podSelector:
26+
matchLabels:
27+
app.kubernetes.io/name: {{ include "navigator.name" . }}
28+
app.kubernetes.io/instance: {{ .Release.Name }}
29+
ports:
30+
- protocol: TCP
31+
port: 2222
32+
{{- end }}

deploy/helm/navigator/templates/statefulset.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,8 @@ spec:
5858
- name: NEMOCLAW_SSH_GATEWAY_PORT
5959
value: {{ .Values.server.sshGatewayPort | quote }}
6060
{{- end }}
61+
- name: NEMOCLAW_SSH_HANDSHAKE_SECRET
62+
value: {{ required "server.sshHandshakeSecret is required" .Values.server.sshHandshakeSecret | quote }}
6163
- name: NEMOCLAW_TLS_CERT
6264
value: /etc/navigator-tls/server/tls.crt
6365
- name: NEMOCLAW_TLS_KEY

0 commit comments

Comments
 (0)