feat: add K8s operator and harden sandbox #327

Open: clementblaise wants to merge 1 commit into main from clems/add-k8s-operator-screener

@clementblaise
Collaborator

Summary

  • Add a Kubernetes operator that manages EvaluatorPool StatefulSets, autoscaling screener pods based on queue depth from the Ridges API.
  • Add a screener queue-depth API endpoint (GET /screener/queue-depth) and backing DB queries.
  • Harden evaluation sandboxes: non-root execution, read-only root FS, dropped capabilities, PID limits, and no-new-privileges.
  • Rework the sandbox nginx proxy to run as non-root, restrict proxied paths to an explicit allow-list with rate limiting, and return 403 for everything else.
  • Add graceful SIGTERM shutdown to the validator with a health server (/healthz, /readyz) for Kubernetes probes.
  • Add Dockerfiles for the screener image and operator image.
  • Add a GitHub Actions CI workflow to build and push both images to GHCR.

Details

Operator and CRD

Adds an EvaluatorPool CRD (API group/version ridges.ai/v1alpha1) with min/max replicas and scale-down stabilization settings. The controller runs preflight checks (secret, API reachability, labeled nodes), manages a StatefulSet, a PodDisruptionBudget (PDB), and a NetworkPolicy, and autoscales by polling queue depth, using stepwise scale-down and stabilization windows. It exposes Prometheus metrics for queue depth, desired replicas, preflight status, drift corrections, and scaling errors.
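To make the scaling behaviour easier to review, here is a minimal Python sketch of one way a "scale up immediately, scale down stepwise after a stabilization window" decision could work. It is not the operator's Go code. The `ScaleState` type, the `desired_replicas` function, the one-pod-per-evaluation target, the 300-second window, and the step size of 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleState:
    # Time (seconds) at which the target first dropped below the current count.
    below_since: Optional[float] = None


def desired_replicas(
    queue_depth: int,
    active: int,
    current: int,
    min_replicas: int,
    max_replicas: int,
    now: float,
    state: ScaleState,
    stabilization_seconds: float = 300.0,  # assumed default
    scale_down_step: int = 1,              # assumed step size
) -> int:
    """Return the replica count to apply on this reconcile tick."""
    # Assumption: one pod per queued or running evaluation, clamped to bounds.
    target = max(min_replicas, min(max_replicas, queue_depth + active))

    if target >= current:
        # Scale up (or hold) right away and reset the scale-down timer.
        state.below_since = None
        return target

    # Target is lower: start or continue the stabilization window.
    if state.below_since is None:
        state.below_since = now
        return current
    if now - state.below_since < stabilization_seconds:
        return current

    # Window elapsed: remove at most `scale_down_step` pods, then re-arm.
    state.below_since = now
    return max(target, current - scale_down_step)
```

With these assumptions, a burst of work scales the pool up in a single tick, while an empty queue shrinks it by one pod per elapsed window, which avoids flapping when the queue briefly drains.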

Screener queue-depth endpoint

GET /screener/queue-depth?stage={screener_1,screener_2,validator} returns queue depth and active evaluation count. Backed by new get_queue_depth() and get_active_evaluation_count() queries.
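As a rough illustration of the endpoint's contract, the handler logic might look like the framework-free sketch below. The response field names, the `queue_depth_response` helper, and the assumption that both queries take the stage as their only argument are hypothetical; only the stage values and the two query names come from this PR.

```python
from typing import Callable, Dict, Union

# Stage values accepted by the `stage` query parameter.
VALID_STAGES = ("screener_1", "screener_2", "validator")


def queue_depth_response(
    stage: str,
    get_queue_depth: Callable[[str], int],
    get_active_evaluation_count: Callable[[str], int],
) -> Dict[str, Union[str, int]]:
    """Build the JSON-serializable body for GET /screener/queue-depth."""
    if stage not in VALID_STAGES:
        # A real handler would map this to a 4xx response.
        raise ValueError(f"unknown stage: {stage!r}")
    return {
        "stage": stage,                                            # hypothetical field
        "queue_depth": get_queue_depth(stage),                     # hypothetical field
        "active_evaluations": get_active_evaluation_count(stage),  # hypothetical field
    }
```

Passing the queries in as callables keeps the sketch independent of the database layer, so it can be exercised with simple stubs.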

Sandbox hardening

Sandbox containers now run as UID 65534 (nobody) with a read-only root filesystem, a 64 MB tmpfs at /tmp, a PID limit of 256, all capabilities dropped, and no-new-privileges. Temp directory permissions are set before mounting. HOME=/tmp and git safe.directory are configured via environment. The sandbox image Dockerfile fixes NVM directory permissions for non-root access.
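For reviewers who want to see how these settings map onto container options, the sketch below builds keyword arguments in the shape accepted by the Docker SDK for Python's `containers.run()`. It is an approximation, not the code in this PR: the `hardened_run_kwargs` helper, the `/sandbox` mount point, the `0o777` mode, and the use of `GIT_CONFIG_COUNT`-style variables with `safe.directory=*` are assumptions.

```python
import os
from typing import Any, Dict


def hardened_run_kwargs(workdir: str) -> Dict[str, Any]:
    """Return container options matching the hardening described above."""
    # Open up the host temp dir before it is mounted, so the unprivileged
    # container user can write to it (the exact mode is an assumption).
    os.chmod(workdir, 0o777)
    return {
        "user": "65534",                        # nobody
        "read_only": True,                      # read-only root filesystem
        "tmpfs": {"/tmp": "size=64m"},          # 64 MB writable scratch space
        "pids_limit": 256,
        "cap_drop": ["ALL"],
        "security_opt": ["no-new-privileges:true"],
        "environment": {
            "HOME": "/tmp",
            # Git reads config from these variables (Git 2.31+); the
            # wildcard value is an assumption.
            "GIT_CONFIG_COUNT": "1",
            "GIT_CONFIG_KEY_0": "safe.directory",
            "GIT_CONFIG_VALUE_0": "*",
        },
        # Hypothetical mount point for the evaluation working directory.
        "volumes": {workdir: {"bind": "/sandbox", "mode": "rw"}},
    }
```

The dictionary could then be passed as `client.containers.run(image, command, **hardened_run_kwargs(path))`, where `client`, `image`, and `command` are placeholders.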

Sandbox proxy hardening

The nginx proxy runs as the nginx user on port 8080. Only /api/inference, /api/embedding, and /api/usage are proxied to the gateway (rate-limited at 30 req/s, burst 10); all other paths return 403. Temp/pid files are written to a writable /sandbox-proxy directory.
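The allow-list rule can be modelled in a few lines of Python, which may help when writing tests against the proxy. This is a model of the intended behaviour, not the nginx configuration itself. It assumes exact path matching with the query string ignored, and it does not model rate limiting. The `is_proxied` name is hypothetical.

```python
from urllib.parse import urlsplit

# Paths forwarded to the gateway; everything else should get a 403.
ALLOWED_PATHS = frozenset({"/api/inference", "/api/embedding", "/api/usage"})


def is_proxied(request_target: str) -> bool:
    """Return True if the request path is on the allow-list."""
    # Drop the query string and fragment; compare the path exactly
    # (exact matching is an assumption about the nginx locations).
    path = urlsplit(request_target).path
    return path in ALLOWED_PATHS
```

Under this model, `/api/usage?since=1` is forwarded, while `/api/admin` and traversal attempts such as `/api/inference/../admin` are rejected.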

Validator graceful shutdown

A SIGTERM handler marks the health server as shutting down, wakes idle sleep loops via asyncio.Event, finishes the current evaluation if one is running, then disconnects from the platform. A lightweight async health server serves /healthz (liveness) and /readyz (readiness) on port 8080.
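The shutdown flow described above can be sketched with `asyncio` as follows. This is a simplified illustration rather than the validator's actual code: the `Lifecycle` class, the `run` loop, and the choice to return 503 from /readyz during shutdown while /healthz stays at 200 are assumptions.

```python
import asyncio
import signal
from typing import Awaitable, Callable, Optional

Job = Callable[[], Awaitable[None]]


class Lifecycle:
    def __init__(self) -> None:
        self.shutdown = asyncio.Event()

    def install(self, loop: asyncio.AbstractEventLoop) -> None:
        # On SIGTERM, flip the event; this also wakes any idle sleep.
        loop.add_signal_handler(signal.SIGTERM, self.shutdown.set)

    def health_status(self, path: str) -> int:
        """HTTP status the health server would return for `path`."""
        if path == "/healthz":
            return 200  # process is alive
        if path == "/readyz":
            return 503 if self.shutdown.is_set() else 200
        return 404

    async def sleep(self, seconds: float) -> bool:
        """Sleep up to `seconds`; return True if shutdown woke us early."""
        try:
            await asyncio.wait_for(self.shutdown.wait(), timeout=seconds)
            return True
        except asyncio.TimeoutError:
            return False


async def run(
    lifecycle: Lifecycle,
    next_evaluation: Callable[[], Awaitable[Optional[Job]]],
    idle_seconds: float = 5.0,
) -> None:
    while not lifecycle.shutdown.is_set():
        job = await next_evaluation()
        if job is None:
            await lifecycle.sleep(idle_seconds)
            continue
        # The job is awaited to completion even if SIGTERM arrives meanwhile.
        await job()
    # Disconnecting from the platform would happen here.
```

Waiting on the event with a timeout, instead of calling `asyncio.sleep`, is what lets an idle validator exit promptly instead of finishing its sleep interval first.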

CI and Dockerfiles

Multi-stage Dockerfile for the screener image (Python 3.12, Docker CLI, shallow .git, non-root user UID 1000). Multi-stage Dockerfile for the operator (static Go binary on distroless/nonroot). GitHub Actions workflow builds both images in a matrix and pushes to GHCR on main and version tags.

@clementblaise requested review from adamridges and camfairchild and removed the request for adamridges (March 11, 2026 15:06)