feat(platform): cleanup api surface area and mtls flows (!39)
Closes NVIDIA#48, NVIDIA#52
## Summary
- Replace the envoy-gateway-based TLS setup with inline PKI generation during cluster bootstrap, generating CA, server, and client certificates directly in the `navigator-bootstrap` crate
- Remove all envoy gateway Helm templates (`gateway.yaml`, `gatewayclass.yaml`, `grpcroute.yaml`, PKI job, traffic policies) and the `Dockerfile.pki-job`
- Add native mTLS support to the navigator server with `tokio-rustls`, mounting client TLS certs as volumes into sandbox pods
- Update cluster entrypoint, healthcheck, and deploy scripts to work with the new direct-TLS architecture
- Add TLS security e2e test and fix formatting/clippy warnings
## Test Plan
- All unit tests pass (`cargo test --workspace`)
- Clippy clean (`cargo clippy --workspace --all-targets`)
- Format clean (`cargo fmt --all -- --check`)
- Python tests pass (`uv run pytest python/`)
- Full `mise run pre-commit` passes
`.agent/skills/debug-navigator-cluster/SKILL.md` (26 additions & 23 deletions)
@@ -18,17 +18,20 @@ Diagnose why a navigator cluster failed to start after `nav cluster admin deploy
5. Wait for k3s to generate kubeconfig (up to 60s)
6. **Clean stale nodes**: Remove any `NotReady` k3s nodes left over from previous container instances that reused the same persistent volume
7. **Prepare local images** (if `NAVIGATOR_PUSH_IMAGES` is set): In `internal` registry mode, bootstrap waits for the in-cluster registry and pushes tagged images there. In `external` mode, bootstrap uses legacy `ctr -n k8s.io images import` push-mode behavior.
- 8. Wait for cluster health checks to pass (up to 6 min):
+ 7. **Reconcile TLS PKI**: Load existing TLS secrets from the cluster; if missing, incomplete, or malformed, generate fresh PKI (CA + server + client certs). Apply secrets to cluster. If rotation happened and the navigator workload is already running, rollout restart and wait for completion (failed rollout aborts deploy).
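The decision flow of the reconcile step can be sketched as a small shell function. This is only an illustration of the logic described in the step, not the `navigator-bootstrap` implementation (which lives in a Rust crate). The function name `pki_reconcile_plan` and its string arguments are hypothetical, and the sketch assumes secrets are applied whether they were reused or regenerated.

```bash
# Hypothetical helper (not part of navigator): print the actions the
# reconcile step would take, one per line.
#   $1: state of the existing TLS secrets: valid | missing | incomplete | malformed
#   $2: whether the navigator workload is already running: yes | no
pki_reconcile_plan() {
  local secrets_state="$1" workload_running="$2"
  if [ "$secrets_state" = "valid" ]; then
    echo "reuse existing PKI"
    # Assumption: secrets are applied in both the reuse and the rotation case.
    echo "apply secrets"
    return 0
  fi
  # Missing, incomplete, or malformed secrets all trigger rotation.
  echo "generate fresh PKI"
  echo "apply secrets"
  if [ "$workload_running" = "yes" ]; then
    echo "rollout restart and wait (abort deploy on failure)"
  fi
}

# Example: secrets were malformed and the workload is already running.
pki_reconcile_plan malformed yes
```

A restart is only planned when rotation happened and the workload is already running, which matches the step's wording that a fresh cluster needs no rollout.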
- Expected ports: `6443/tcp`, `80/tcp`, `443/tcp`, `8080/tcp` (mapped to host 30051).
+ Expected ports: `6443/tcp`, `30051/tcp` (mapped to configurable host port, default 8080; set via `--port` on deploy).

If ports are missing or conflicting, another process may be using them. Check with:

```bash
# On the host (or remote host)
- ss -tlnp | grep -E ':(6443|80|443|30051)\s'
+ ss -tlnp | grep -E ':(6443|8080)\s'
```
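Because the gateway host port is configurable via `--port`, the hard-coded `8080` in the check above only covers the default. A minimal sketch of a parameterized pattern follows. The helper name `port_pattern` is hypothetical, and it uses the POSIX class `[[:space:]]` in place of the GNU-specific `\s`.

```bash
# Hypothetical helper: build a grep -E pattern matching the k3s API port (6443)
# plus the gateway host port (first argument, default 8080).
port_pattern() {
  local gateway_port="${1:-8080}"
  printf ':(6443|%s)[[:space:]]' "$gateway_port"
}

# Usage on the host, assuming the cluster was deployed with --port 9090:
#   ss -tlnp | grep -E "$(port_pattern 9090)"
```

The trailing whitespace match keeps a port such as `8080` from also matching `80800` or similar longer values in the `ss` output.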
If using Docker-in-Docker (`DOCKER_HOST=tcp://docker:2375`), verify metadata points at `https://docker` (not `https://127.0.0.1`).
@@ -223,20 +220,25 @@ If `registries.yaml` is missing or has wrong values, verify env wiring (`NAVIGAT

### Step 7: Check mTLS / PKI

- If TLS is enabled, the health check requires the `navigator-cli-client` secret:
+ TLS certificates are generated by the `navigator-bootstrap` crate (using `rcgen`) and stored as K8s secrets before the Helm release installs. There is no PKI job or cert-manager; certificates are applied directly via `kubectl apply`.

```bash
- # Check if the secret exists
- docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-cli-client'
+ # Check if the three TLS secrets exist
+ docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls navigator-server-client-ca navigator-client-tls'
+ # Inspect server cert expiry (if openssl is available in the container)
+ docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -dates 2>/dev/null || echo "openssl not available"'

- # Check cert-manager pods (PKI depends on cert-manager)
- docker exec navigator-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n cert-manager get pods'
+ # Check if CLI-side mTLS files exist locally
+ ls -la ~/.config/navigator/clusters/<name>/mtls/
```

- The PKI job often fails due to DNS issues or registry auth problems (it needs to pull its image from the distribution registry). If the job failed, check registry config (Step 6) and DNS (Step 9).
+ On redeploy, bootstrap reuses existing secrets if they are valid PEM. If secrets are missing or malformed, fresh PKI is generated and the navigator workload is automatically restarted. If the rollout restart fails after rotation, the deploy aborts and CLI-side certs are not updated. Certificates use rcgen defaults (effectively never expire).
+
+ Common mTLS issues:
+ - **Secrets missing**: The `navigator` namespace may not have been created yet (Helm controller race). Bootstrap waits up to 2 minutes for the namespace.
+ - **mTLS mismatch after manual secret deletion**: Delete all three secrets and redeploy; bootstrap will regenerate them and restart the workload.
+ - **CLI can't connect after redeploy**: Check that `~/.config/navigator/clusters/<name>/mtls/` contains `ca.crt`, `tls.crt`, `tls.key` and that they were updated at deploy time.
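The last check in the list above can be scripted. The sketch below reports which of the three expected files are absent or empty in a given directory. The function name `check_mtls_dir` is hypothetical, and the check only covers presence, not whether the certificates match the cluster's current PKI.

```bash
# Hypothetical helper: report expected CLI-side mTLS files that are missing
# or empty in the directory passed as $1. Returns non-zero if any is missing.
check_mtls_dir() {
  local dir="$1" missing=0 f
  for f in ca.crt tls.crt tls.key; do
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (replace <name> with the cluster name):
#   check_mtls_dir ~/.config/navigator/clusters/<name>/mtls
```

To confirm the files were refreshed at deploy time, compare their modification times (as shown by `ls -la`) with the time of the last deploy.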

### Step 8: Check Kubernetes Events

@@ -293,11 +295,12 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
| Image import fails (`k3s ctr` exit code != 0) | Corrupt tar stream or containerd not ready | Retry after k3s is fully started; check container logs |
| Push mode images not found by kubelet | Imported into wrong containerd namespace | Must use `k3s ctr -n k8s.io images import`, not `k3s ctr images import` |
| Gateway not `Programmed` | Envoy Gateway not ready | Check `envoy-gateway-system` pods and Helm install logs |
+ | mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the navigator pod restarted after cert rotation (Step 7) |
| Helm install job failed | Chart values error or dependency issue | Check `helm-install-navigator` job logs in `kube-system` |
| Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
- | Port conflict | Another service on 6443/80/443/30051 | Stop conflicting service or change port mapping |
+ | Port conflict | Another service on 6443 or the configured gateway host port (default 8080) | Stop conflicting service or use `--port` to pick a different host port |
| gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
| DNS failures inside container | Entrypoint DNS detection failed | Check `/etc/rancher/k3s/resolv.conf` and container startup logs |
| `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign; look for the actual failing health check component |
`architecture/README.md` (1 addition & 0 deletions)
@@ -286,6 +286,7 @@ This opens an interactive SSH session into the sandbox, with all provider creden
|---|---|
| [Cluster Bootstrap](cluster-single-node.md) | How the platform bootstraps a Kubernetes cluster from a single Docker container, for local and remote targets. |
| [Gateway Architecture](gateway.md) | The control plane gateway: API multiplexing, gRPC services, persistence, TLS, and sandbox orchestration. |
+ | [Gateway Security](gateway-security.md) | mTLS enforcement, PKI bootstrap, certificate hierarchy, and the gateway trust model. |
| [Sandbox Architecture](sandbox.md) | The sandbox execution environment: policy enforcement, Landlock, seccomp, network namespaces, and the network proxy. |