
Commit 2d85338

feat(platform): cleanup api surface area and mtls flows (!39)
Closes NVIDIA#48, NVIDIA#52

## Summary

- Replace the envoy-gateway-based TLS setup with inline PKI generation during cluster bootstrap, generating CA, server, and client certificates directly in the `navigator-bootstrap` crate
- Remove all envoy gateway Helm templates (`gateway.yaml`, `gatewayclass.yaml`, `grpcroute.yaml`, PKI job, traffic policies) and the `Dockerfile.pki-job`
- Add native mTLS support to the navigator server with `tokio-rustls`, mounting client TLS certs as volumes into sandbox pods
- Update cluster entrypoint, healthcheck, and deploy scripts to work with the new direct-TLS architecture
- Add TLS security e2e test and fix formatting/clippy warnings

## Test Plan

- All unit tests pass (`cargo test --workspace`)
- Clippy clean (`cargo clippy --workspace --all-targets`)
- Format clean (`cargo fmt --all -- --check`)
- Python tests pass (`uv run pytest python/`)
- Full `mise run pre-commit` passes
1 parent f88aecf commit 2d85338

59 files changed

Lines changed: 2150 additions & 1145 deletions


.agent/skills/debug-navigator-cluster/SKILL.md

Lines changed: 26 additions & 23 deletions
@@ -18,17 +18,20 @@ Diagnose why a navigator cluster failed to start after `nav cluster admin deploy
 5. Wait for k3s to generate kubeconfig (up to 60s)
 6. **Clean stale nodes**: Remove any `NotReady` k3s nodes left over from previous container instances that reused the same persistent volume
 7. **Prepare local images** (if `NAVIGATOR_PUSH_IMAGES` is set): In `internal` registry mode, bootstrap waits for the in-cluster registry and pushes tagged images there. In `external` mode, bootstrap uses legacy `ctr -n k8s.io images import` push-mode behavior.
-8. Wait for cluster health checks to pass (up to 6 min):
+7. **Reconcile TLS PKI**: Load existing TLS secrets from the cluster; if missing, incomplete, or malformed, generate fresh PKI (CA + server + client certs). Apply secrets to cluster. If rotation happened and the navigator workload is already running, rollout restart and wait for completion (failed rollout aborts deploy).
+8. **Store CLI mTLS credentials**: Persist client cert/key/CA locally for CLI authentication.
+9. Wait for cluster health checks to pass (up to 6 min):
 - k3s API server readiness (`/readyz`)
 - `navigator` statefulset ready in `navigator` namespace
 - `navigator-gateway` Gateway programmed in `navigator` namespace
-- If TLS enabled: `navigator-cli-client` secret exists with cert data
-9. Extract mTLS credentials if TLS is enabled (up to 3 min)
+- TLS secrets `navigator-server-tls` and `navigator-client-tls` exist
 
 For local deploys, metadata endpoint selection now depends on Docker connectivity:
 
-- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1`
-- TCP Docker daemon (`DOCKER_HOST=tcp://<host>:<port>`): `https://<host>` for non-loopback hosts
+- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1:{port}` (default port 8080)
+- TCP Docker daemon (`DOCKER_HOST=tcp://<host>:<port>`): `https://<host>:{port}` for non-loopback hosts
+
+The host port is configurable via `--port` on `nav cluster admin deploy` (default 8080) and is stored in `ClusterMetadata.gateway_port`.
 
 The TCP host is also added as an extra gateway TLS SAN so mTLS hostname validation succeeds.
 
@@ -161,23 +164,17 @@ The Envoy Gateway provides HTTP/gRPC ingress:
 # Gateway status
 docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get gateway/navigator-gateway'
 
-# Envoy Gateway system pods
-docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n envoy-gateway-system get pods -o wide'
-
-# Envoy Gateway Helm install job
-docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n kube-system logs -l job-name=helm-install-envoy-gateway --tail=200'
-
 # Check port bindings on the host
 docker port navigator-cluster-<name>
 ```
 
-Expected ports: `6443/tcp`, `80/tcp`, `443/tcp`, `8080/tcp` (mapped to host 30051).
+Expected ports: `6443/tcp`, `30051/tcp` (mapped to configurable host port, default 8080; set via `--port` on deploy).
 
 If ports are missing or conflicting, another process may be using them. Check with:
 
 ```bash
 # On the host (or remote host)
-ss -tlnp | grep -E ':(6443|80|443|30051)\s'
+ss -tlnp | grep -E ':(6443|8080)\s'
 ```
 
 If using Docker-in-Docker (`DOCKER_HOST=tcp://docker:2375`), verify metadata points at `https://docker` (not `https://127.0.0.1`).
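
The endpoint selection rule that this file's first hunk documents (loopback for the local Docker socket, the daemon's host for a non-loopback TCP `DOCKER_HOST`) can be sketched as below. This is only an illustration of the documented behavior, not the code in `navigator-bootstrap`: the function name `metadata_endpoint` and the `LOOPBACK_HOSTS` set are hypothetical, and falling back to `127.0.0.1` for a loopback TCP host is an assumption the diff does not confirm.

```python
from typing import Optional
from urllib.parse import urlparse

# Hosts treated as loopback in this sketch (assumption, not from the source).
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def metadata_endpoint(docker_host: Optional[str], port: int = 8080) -> str:
    """Return the gateway endpoint to store in cluster metadata (hypothetical helper)."""
    if docker_host and docker_host.startswith("tcp://"):
        host = urlparse(docker_host).hostname
        # Only a non-loopback TCP daemon changes the advertised host.
        if host and host not in LOOPBACK_HOSTS:
            return f"https://{host}:{port}"
    # Local Docker socket, unset DOCKER_HOST, or loopback TCP host.
    return f"https://127.0.0.1:{port}"
```

With `DOCKER_HOST=tcp://docker:2375` and the default port, this sketch yields `https://docker:8080`, which matches the Docker-in-Docker check described above.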
@@ -223,20 +220,25 @@ If `registries.yaml` is missing or has wrong values, verify env wiring (`NAVIGAT
 
 ### Step 7: Check mTLS / PKI
 
-If TLS is enabled, the health check requires the `navigator-cli-client` secret:
+TLS certificates are generated by the `navigator-bootstrap` crate (using `rcgen`) and stored as K8s secrets before the Helm release installs. There is no PKI job or cert-manager — certificates are applied directly via `kubectl apply`.
 
 ```bash
-# Check if the secret exists
-docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-cli-client'
+# Check if the three TLS secrets exist
+docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls navigator-server-client-ca navigator-client-tls'
 
-# PKI job logs (this job creates the certificates)
-docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator logs -l job-name=navigator-gateway-pki --tail=200'
+# Inspect server cert expiry (if openssl is available in the container)
+docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -dates 2>/dev/null || echo "openssl not available"'
 
-# Check cert-manager pods (PKI depends on cert-manager)
-docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n cert-manager get pods'
+# Check if CLI-side mTLS files exist locally
+ls -la ~/.config/navigator/clusters/<name>/mtls/
 ```
 
-The PKI job often fails due to DNS issues or registry auth problems (it needs to pull its image from the distribution registry). If the job failed, check registry config (Step 6) and DNS (Step 9).
+On redeploy, bootstrap reuses existing secrets if they are valid PEM. If secrets are missing or malformed, fresh PKI is generated and the navigator workload is automatically restarted. If the rollout restart fails after rotation, the deploy aborts and CLI-side certs are not updated. Certificates use rcgen defaults (effectively never expire).
+
+Common mTLS issues:
+- **Secrets missing**: The `navigator` namespace may not have been created yet (Helm controller race). Bootstrap waits up to 2 minutes for the namespace.
+- **mTLS mismatch after manual secret deletion**: Delete all three secrets and redeploy — bootstrap will regenerate and restart the workload.
+- **CLI can't connect after redeploy**: Check that `~/.config/navigator/clusters/<name>/mtls/` contains `ca.crt`, `tls.crt`, `tls.key` and that they were updated at deploy time.
 
 ### Step 8: Check Kubernetes Events
 
@@ -293,11 +295,12 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | Image import fails (`k3s ctr` exit code != 0) | Corrupt tar stream or containerd not ready | Retry after k3s is fully started; check container logs |
 | Push mode images not found by kubelet | Imported into wrong containerd namespace | Must use `k3s ctr -n k8s.io images import`, not `k3s ctr images import` |
 | Gateway not `Programmed` | Envoy Gateway not ready | Check `envoy-gateway-system` pods and Helm install logs |
-| mTLS secret missing | PKI job failed (often DNS) | Check PKI job logs and DNS resolution (Step 8) |
+| mTLS secrets missing | Bootstrap couldn't apply secrets (namespace not ready, kubectl exec failure) | Check deploy logs and verify `navigator` namespace exists (Step 7) |
+| mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the navigator pod restarted after cert rotation (Step 7) |
 | Helm install job failed | Chart values error or dependency issue | Check `helm-install-navigator` job logs in `kube-system` |
 | Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
 | SSH connection failed (remote) | SSH key/host/Docker issues | Test `ssh <host> docker ps` manually |
-| Port conflict | Another service on 6443/80/443/30051 | Stop conflicting service or change port mapping |
+| Port conflict | Another service on 6443 or the configured gateway host port (default 8080) | Stop conflicting service or use `--port` to pick a different host port |
 | gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
 | DNS failures inside container | Entrypoint DNS detection failed | Check `/etc/rancher/k3s/resolv.conf` and container startup logs |
 | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
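
The reuse-or-rotate decision described in Step 7 (reuse existing secrets if they are valid PEM, otherwise generate fresh PKI) might look roughly like the sketch below. It is a simplified model written in Python for readability, not the Rust logic in `navigator-bootstrap`. The three secret names come from the diff, but the function names, the input shape (secret name mapped to its base64-encoded `data` entries), and the marker-based PEM check are assumptions.

```python
import base64
import binascii

# Secret names taken from the updated SKILL.md; everything else is hypothetical.
SECRET_NAMES = (
    "navigator-server-tls",
    "navigator-server-client-ca",
    "navigator-client-tls",
)


def looks_like_pem(b64_value: str) -> bool:
    """Cheap structural check: the value decodes and carries PEM BEGIN/END markers."""
    try:
        text = base64.b64decode(b64_value, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return False
    return "-----BEGIN " in text and "-----END " in text


def needs_rotation(secrets: dict) -> bool:
    """Return True if any secret is missing, empty, or holds non-PEM data."""
    for name in SECRET_NAMES:
        data = secrets.get(name)
        if not data:
            return True
        if not all(looks_like_pem(value) for value in data.values()):
            return True
    return False
```

In this model a `True` result corresponds to the path where bootstrap regenerates the CA, server, and client certificates and then restarts the navigator workload. A real check would presumably parse the certificates rather than look for markers.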

.env.example

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# Navigator local development environment
+# Copy to .env and customise. Mise loads .env automatically.
+#
+# To run multiple Navigator clusters concurrently, give each worktree (or
+# checkout) a unique CLUSTER_NAME and GATEWAY_PORT. For example:
+#
+# Worktree A (.env): CLUSTER_NAME=nav-a GATEWAY_PORT=8080
+# Worktree B (.env): CLUSTER_NAME=nav-b GATEWAY_PORT=8090
+
+# ---------- Cluster identity ----------
+
+# Name used for the Docker container, k3s volume, TLS secrets, and the
+# navigator CLI's active-cluster bookmark. Defaults to the repo directory
+# basename (e.g. "navigator-c").
+#CLUSTER_NAME=navigator-c
+
+# ---------- Ports ----------
+
+# Host port mapped to the k3s NodePort (30051) where the Navigator gateway
+# listens. The CLI connects here. Must be unique per cluster.
+#GATEWAY_PORT=8080
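
The defaults described in these comments (cluster name falls back to the repo directory basename, gateway port falls back to 8080) can be illustrated with a small sketch. The helper name `resolve_cluster_settings` and its signature are hypothetical. The commit does not show how the mise tasks actually resolve these variables.

```python
from pathlib import Path
from typing import Mapping, Tuple


def resolve_cluster_settings(env: Mapping[str, str], repo_dir: str) -> Tuple[str, int]:
    """Apply the documented defaults for CLUSTER_NAME and GATEWAY_PORT (hypothetical helper)."""
    # Unset or empty CLUSTER_NAME falls back to the repo directory basename.
    name = env.get("CLUSTER_NAME") or Path(repo_dir).name
    # Unset or empty GATEWAY_PORT falls back to the documented default, 8080.
    port = int(env.get("GATEWAY_PORT") or 8080)
    return name, port
```

For two worktrees to run concurrently, both returned values need to differ between them, as the `nav-a`/`nav-b` example in the file shows.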

CONTRIBUTING.md

Lines changed: 0 additions & 1 deletion
@@ -209,7 +209,6 @@ mise run cluster # Build and deploy local k3s cluster with Navigator
 mise run cluster:deploy # Fast deploy: rebuild changed components and skip unnecessary helm work
 mise run cluster:push:server # Push local server image to configured pull registry
 mise run cluster:push:sandbox # Push local sandbox image to configured pull registry
-mise run cluster:push:pki-job # Push local pki-job image to configured pull registry
 mise run cluster:deploy:pull # Force full pull-mode deploy flow
 mise run cluster:push # Legacy image-import fallback workflow
 ```

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

architecture/README.md

Lines changed: 1 addition & 0 deletions
@@ -286,6 +286,7 @@ This opens an interactive SSH session into the sandbox, with all provider creden
 |---|---|
 | [Cluster Bootstrap](cluster-single-node.md) | How the platform bootstraps a Kubernetes cluster from a single Docker container, for local and remote targets. |
 | [Gateway Architecture](gateway.md) | The control plane gateway: API multiplexing, gRPC services, persistence, TLS, and sandbox orchestration. |
+| [Gateway Security](gateway-security.md) | mTLS enforcement, PKI bootstrap, certificate hierarchy, and the gateway trust model. |
 | [Sandbox Architecture](sandbox.md) | The sandbox execution environment: policy enforcement, Landlock, seccomp, network namespaces, and the network proxy. |
 | [Container Management](build-containers.md) | Container images, Dockerfiles, Helm charts, build tasks, and CI/CD. |
 | [Sandbox Connect](sandbox-connect.md) | SSH tunneling into sandboxes through the gateway. |
