feat: add K8s operator and harden sandbox by clementblaise · Pull Request #327 · ridgesai/ridges

clementblaise · 2026-03-11T15:03:10Z

Summary

Add a Kubernetes operator that manages EvaluatorPool StatefulSets, autoscaling screener pods based on queue depth from the Ridges API.
Add a screener queue-depth API endpoint (GET /screener/queue-depth) and backing DB queries.
Harden evaluation sandboxes: non-root execution, read-only root FS, dropped capabilities, PID limits, and no-new-privileges.
Rework the sandbox nginx proxy to run as non-root, restrict proxied paths to an explicit allow-list with rate limiting, and return 403 for everything else.
Add graceful SIGTERM shutdown to the validator with a health server (/healthz, /readyz) for Kubernetes probes.
Add Dockerfiles for the screener image and operator image.
Add a GitHub Actions CI workflow to build and push both images to GHCR.

Details

Operator and CRD

Adds an EvaluatorPool CRD (apiVersion ridges.ai/v1alpha1) with min/max replicas and scale-down stabilization settings. The controller runs preflight checks (secret, API reachability, labeled nodes), manages a StatefulSet, a PodDisruptionBudget, and a NetworkPolicy, and autoscales by polling queue depth, using stepwise scale-down and stabilization windows. It exposes Prometheus metrics for queue depth, desired replicas, preflight status, drift corrections, and scaling errors.
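To make the scaling behavior concrete, here is a minimal sketch of one way such a decision could work. It is written in Python for illustration only and is not the operator's code. The names `ScaleState`, `desired_replicas`, `next_replicas`, the `per_pod` capacity parameter, the choice to count active evaluations as load, and the one-replica step are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleState:
    current: int                         # replicas currently requested
    below_since: Optional[float] = None  # when desired first dropped below current


def desired_replicas(queue_depth: int, active: int, per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Clamp the replica count needed for the observed load to [min, max]."""
    # Assumption: queued plus active evaluations both count as load.
    needed = math.ceil((queue_depth + active) / per_pod)
    return max(min_replicas, min(max_replicas, needed))


def next_replicas(state: ScaleState, desired: int, now: float,
                  stabilization_s: float, step: int = 1) -> int:
    """Scale up immediately; scale down one step per stabilization window."""
    if desired >= state.current:
        state.below_since = None
        state.current = desired
        return state.current
    if state.below_since is None:
        # First observation below the current size: start the window.
        state.below_since = now
        return state.current
    if now - state.below_since >= stabilization_s:
        state.current = max(desired, state.current - step)
        state.below_since = now  # restart the window for the next step
    return state.current
```

With a 300-second window, a pool that scaled up to 4 replicas and then sees an empty queue would shrink to 3, then 2, then 1 over successive windows instead of dropping at once, which avoids evicting pods during short lulls.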

Screener queue-depth endpoint

GET /screener/queue-depth?stage={screener_1,screener_2,validator} returns queue depth and active evaluation count. Backed by new get_queue_depth() and get_active_evaluation_count() queries.
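As an illustration of the endpoint's contract, the sketch below validates the `stage` parameter and combines the two queries into one payload. The response field names, the `queue_depth_payload` helper, and the single-argument query signatures are assumptions; the PR does not spell out the actual response shape or how the queries are called.

```python
from typing import Callable, Dict

# The three stages accepted by the endpoint, per the description above.
VALID_STAGES = ("screener_1", "screener_2", "validator")


def queue_depth_payload(
    stage: str,
    get_queue_depth: Callable[[str], int],
    get_active_evaluation_count: Callable[[str], int],
) -> Dict[str, object]:
    """Build a response body for GET /screener/queue-depth (hypothetical shape)."""
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return {
        "stage": stage,
        "queue_depth": get_queue_depth(stage),
        "active_evaluations": get_active_evaluation_count(stage),
    }
```

Passing the queries in as callables keeps the sketch independent of the database layer; in the real handler they would be the new `get_queue_depth()` and `get_active_evaluation_count()` queries.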

Sandbox hardening

Sandbox containers now run as UID 65534 (nobody) with a read-only root filesystem, a 64 MB tmpfs at /tmp, a PID limit of 256, all capabilities dropped, and no-new-privileges. Temp directory permissions are set before mounting. HOME=/tmp and git safe.directory are configured via environment. The sandbox image Dockerfile fixes NVM directory permissions for non-root access.
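The settings above map fairly directly onto keyword arguments of the Docker SDK for Python's `containers.run`. The sketch below only builds the keyword dictionary; the `hardened_run_kwargs` helper and the `/sandbox` working directory are hypothetical, and passing `safe.directory` through `GIT_CONFIG_COUNT`/`GIT_CONFIG_KEY_0`/`GIT_CONFIG_VALUE_0` is one way to configure git via environment, not necessarily the one the PR uses.

```python
def hardened_run_kwargs(workdir: str = "/sandbox") -> dict:
    """Keyword arguments for docker-py's containers.run() (illustrative only)."""
    return {
        "user": "65534",                    # nobody
        "read_only": True,                  # read-only root filesystem
        "tmpfs": {"/tmp": "size=64m"},      # 64 MB writable scratch space
        "pids_limit": 256,                  # cap the number of processes
        "cap_drop": ["ALL"],                # drop every Linux capability
        "security_opt": ["no-new-privileges:true"],
        "environment": {
            "HOME": "/tmp",
            # Assumption: git config injected via environment (git >= 2.31).
            "GIT_CONFIG_COUNT": "1",
            "GIT_CONFIG_KEY_0": "safe.directory",
            "GIT_CONFIG_VALUE_0": workdir,
        },
    }
```

A caller would then use something like `docker.from_env().containers.run(image, command, **hardened_run_kwargs())`, where `image` and `command` are whatever the evaluation needs.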

Sandbox proxy hardening

The nginx proxy runs as the nginx user on port 8080. Only /api/inference, /api/embedding, and /api/usage are proxied to the gateway (rate-limited at 30 req/s, burst 10); all other paths return 403. Temp/pid files are written to a writable /sandbox-proxy directory.
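The routing rule can be modeled in a few lines. This is a simplified Python model of the behavior described above, not the nginx configuration itself. The prefix-matching rule, the leaky-bucket accounting, and the 503 status for rate-limited requests (nginx's default for `limit_req`) are assumptions; only the three paths, the 30 req/s rate, the burst of 10, and the 403 come from the description.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED = ("/api/inference", "/api/embedding", "/api/usage")


def is_allowed(path: str) -> bool:
    """True if the path (query string ignored) is on the allow-list."""
    path = path.split("?", 1)[0]
    # Assumption: exact match or a sub-path; the real location rules may differ.
    return any(path == p or path.startswith(p + "/") for p in ALLOWED)


@dataclass
class LeakyBucket:
    rate: float = 30.0            # requests drained per second
    burst: float = 10.0           # excess requests tolerated above the rate
    excess: float = 0.0
    last: Optional[float] = None  # timestamp of the last admitted request

    def admit(self, now: float) -> bool:
        if self.last is None:
            self.last = now
            return True
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False          # over the burst allowance: reject
        self.excess, self.last = excess, now
        return True


def handle(path: str, bucket: LeakyBucket, now: float) -> Optional[int]:
    """Return a rejection status, or None if the request would be proxied."""
    if not is_allowed(path):
        return 403
    if not bucket.admit(now):
        return 503                # assumed status for rate-limited requests
    return None
```

Checking the allow-list before the limiter means forbidden paths never consume rate-limit budget in this model.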

Validator graceful shutdown

A SIGTERM handler marks the health server as shutting down, wakes idle sleep loops via asyncio.Event, finishes the current evaluation if one is running, then disconnects from the platform. A lightweight async health server serves /healthz (liveness) and /readyz (readiness) on port 8080.
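The shutdown flow can be sketched with the standard library alone. The sketch below is an assumption-laden outline, not the validator's code: `HealthState`, `interruptible_sleep`, `run_validator`, the `evaluate_once` callback, the 10-second idle interval, and the choice to return 503 from /readyz during shutdown while /healthz stays 200 are all hypothetical.

```python
import asyncio
import signal


class HealthState:
    def __init__(self) -> None:
        self.shutting_down = asyncio.Event()

    def status(self, path: str) -> int:
        if path == "/healthz":
            return 200  # liveness: the process is up
        if path == "/readyz":
            return 503 if self.shutting_down.is_set() else 200
        return 404


async def interruptible_sleep(state: HealthState, seconds: float) -> bool:
    """Sleep up to `seconds`; return True if shutdown woke us early."""
    try:
        await asyncio.wait_for(state.shutting_down.wait(), timeout=seconds)
        return True
    except asyncio.TimeoutError:
        return False


async def handle_probe(state: HealthState, reader, writer) -> None:
    """Answer a single HTTP probe based on the request line only."""
    parts = (await reader.readline()).decode("latin-1").split()
    code = state.status(parts[1] if len(parts) >= 2 else "/")
    reason = {200: "OK", 404: "Not Found", 503: "Service Unavailable"}[code]
    writer.write(
        f"HTTP/1.1 {code} {reason}\r\nContent-Length: 0\r\n"
        "Connection: close\r\n\r\n".encode()
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def run_validator(evaluate_once, port: int = 8080) -> None:
    state = HealthState()
    loop = asyncio.get_running_loop()
    # SIGTERM only sets the event; nothing is cancelled (Unix event loops only).
    loop.add_signal_handler(signal.SIGTERM, state.shutting_down.set)

    async def on_conn(reader, writer):
        await handle_probe(state, reader, writer)

    server = await asyncio.start_server(on_conn, "0.0.0.0", port)
    try:
        while not state.shutting_down.is_set():
            did_work = await evaluate_once()  # runs to completion once started
            if not did_work:
                await interruptible_sleep(state, 10.0)
    finally:
        server.close()
        await server.wait_closed()
        # Disconnecting from the platform would happen here.
```

Because the handler only sets an event, an in-flight evaluation is never interrupted; the loop notices the flag after the evaluation returns or as soon as an idle sleep is woken.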

CI and Dockerfiles

Multi-stage Dockerfile for the screener image (Python 3.12, Docker CLI, shallow .git, non-root user UID 1000). Multi-stage Dockerfile for the operator (static Go binary on distroless/nonroot). GitHub Actions workflow builds both images in a matrix and pushes to GHCR on main and version tags.

Add K8s operator for screener autoscaling, harden sandboxes, add graceful shutdown and CI

26fb718

clementblaise requested review from adamridges and camfairchild and removed request for adamridges March 11, 2026 15:06

clementblaise wants to merge 1 commit into main from clems/add-k8s-operator-screener
