This document defines the runtime contract for the four production container roles used by OpenHive:
- Gateway containers
- Dashboard containers
- Agent containers
- Sandbox containers
These contracts are the baseline for issue #84 and the Kubernetes
productization work that followed. They are intentionally explicit about
startup behavior, health checks, writable paths, and isolation expectations so
later work such as `ContainerAgentPool` and reconciliation can build on a
stable operational model.
OpenHive currently has two containerization layers:
- the current deployable baseline in `deploy/k8s/`
- the evolving container orchestration path built around `ContainerAgentPool`
This document describes the contract that operators should rely on now, while also identifying which responsibilities remain future work.
For clarity:
- the supported `preview_local` onboarding path is still backend + dashboard from source
- the Kubernetes productized control-plane path now adds a standalone dashboard container next to the full Gateway runtime
- the matching local operator entrypoint for sandbox workflows is `make run-sandbox` / `uv run uvicorn hive.container.sandbox_entrypoint:app --port 8091` with `HIVE_SANDBOX_URL=http://127.0.0.1:8091`
- local image-parity checks can instead use `make run-sandbox-container`, which starts the profiled Docker Compose `sandbox` service built from `Dockerfile.sandbox` on the same local port
- sandbox-backed dev-task workflows now have a real operator-facing surface through the authenticated Gateway and dashboard dev-task review APIs, but this remains an optional local operator workflow rather than a required `preview_local` onboarding step
- the gateway relay and short-lived task-token contract exist for model-backed isolated runtimes; the isolated agent runtime now uses a shared task relay client that future sandbox relay backends should reuse
- the canonical current sandbox coding path is the default `HIVE_SANDBOX_CODING_BACKEND=codex_cli` mode, which runs the local `codex exec` CLI as a governed subprocess with Codex's `workspace-write` sandbox mode so task-local edits can be proposed without full unrestricted access
- the sandbox container also exposes an explicit opt-in proof mode, `HIVE_SANDBOX_CODING_BACKEND=relay_helper`, which swaps that default local CLI execution for a local relay helper process that consumes gateway-issued task tokens
- the sandbox container also exposes `HIVE_SANDBOX_CODING_BACKEND=deterministic_proof` for Docker-local control-loop verification only; it does not prove provider-backed coding quality, but it does prove archive seeding, runtime attempts, checkpoints, artifacts, and in-place resume without depending on a live LLM provider
- the Docker Compose sandbox profile bind-mounts the source `hive`, `packages`, and `scripts` directories so local verification can exercise the current checkout without rebuilding the sandbox image for every Python-only control-loop change
- the default `codex_cli` adapter now scrubs inherited secret env before launch, but it can also opt in to a narrow provider-env allowlist from the sandbox runtime env when operators need real `codex_cli` execution
- subprocess env construction is centralized for governed runtime paths: inherited provider keys, database credentials, gateway/internal secrets, and app credentials are scrubbed unless a documented direct-env allowlist names the key (see the sketch after this list)
- sandbox command status includes bounded stdout/stderr previews with sizes, omitted counts, state, exit code, command id, and retrieval hints; full logs remain available only through explicit log retrieval
- sandbox-backed Work Runs publish durable progress checkpoints through the sandbox API and persist runtime attempts separately from the logical task id; this lets the operator resume a stale or failed Work Run in place instead of treating timeout extension as the long-term scaling mechanism
- checkpoint runtime events expose a typed `checkpoint_sequence_no`, `durable` marker, and `checkpoint_payload` so dashboards can show friendly progress and recovery context without relying on untyped lifecycle-event extras
- in-place Work Run resume is a new sandbox attempt under the same logical task: it keeps the task id, current workspace, artifacts/log visibility, and latest checkpoint context, while `requeue` remains the fallback that creates a new logical task when the original workspace is unavailable or should be forked
- the governed process-session lifecycle is list, poll, log, and cancel over existing sandbox command ids, not an arbitrary terminal or shell-creation API
- when an operator also supplies sandbox runtime provider metadata such as a provider id and base URL, the sandbox entrypoint can materialize a minimal runtime `~/.codex/config.toml` under the writable sandbox home so the default `codex_cli` path can target an OpenAI-compatible provider without baking that config into the image
- the Kubernetes sandbox pod now sets `HIVE_SANDBOX_CODEX_SANDBOX_MODE=danger-full-access` for the inner Codex CLI because the pod itself is already the outer governed workspace boundary and container runtimes such as Docker Desktop block Codex's inner bubblewrap namespace sandbox
- the same pod also mounts a dedicated writable home at `/home/codex` instead of reusing `/tmp`, because real non-ephemeral Codex CLI runs may stall when their managed home directory lives directly under a temporary root
- when a remote sandbox is seeded from `workspace_archive_b64`, approval first tries to apply the patch back to the original requested workspace path; if that path is not visible inside the sandbox pod, operators may configure an explicit host-side apply relay with `HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL` and `HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN`
- dev-task workspace state now has a source-tested `WorkspaceManifest` shape that records task id, repo path class, artifact manifest path, target paths, source/requested workspace path presence, apply-back mode, and relay configuration without storing raw archives, patches, or secrets in run state
- that env-auth path is still not the same as a relay-backed sandbox runtime; it is a direct provider-env bridge for the local CLI, not a token flow
- that means vendor-secret residency claims only apply to explicitly relay-backed sandbox or agent runtimes, not to every current `preview_local` dev-task execution path or the default `codex_cli` sandbox path
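To make the centralized scrub-unless-allowlisted rule concrete, here is a minimal Python sketch; the helper name, key patterns, and example allowlist are illustrative assumptions, not the actual OpenHive implementation.

```python
import os

# Hypothetical illustration of the scrub-unless-allowlisted rule described
# above; the helper name and key patterns are assumptions, not OpenHive source.
SENSITIVE_PATTERNS = ("API_KEY", "SECRET", "TOKEN", "DATABASE_URL", "PASSWORD")

def build_subprocess_env(allowlist: frozenset[str] = frozenset()) -> dict[str, str]:
    """Drop inherited provider/app credentials unless explicitly allowlisted."""
    env: dict[str, str] = {}
    for key, value in os.environ.items():
        if key in allowlist:
            env[key] = value  # a documented direct-env allowlist names this key
        elif any(pattern in key for pattern in SENSITIVE_PATTERNS):
            continue  # scrub inherited secrets before launching the subprocess
        else:
            env[key] = value
    return env

# Example: forward only OPENAI_API_KEY when an env-auth mode is enabled.
child_env = build_subprocess_env(allowlist=frozenset({"OPENAI_API_KEY"}))
```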
All OpenHive runtime containers share these rules:
| Area | Contract |
|---|---|
| Health endpoint | Each role must expose a lightweight HTTP probe endpoint. Gateway, Agent, and Sandbox use `GET /healthz`; Dashboard uses `GET /dashboard-healthz`. |
| Probe behavior | Health endpoints must return quickly and without external side effects. |
| Logging | Containers write structured logs to stdout/stderr. |
| Shutdown | Containers must tolerate SIGTERM and allow the orchestrator to stop them without manual cleanup steps. |
| Secrets | Secrets come from environment or mounted secret sources. They must never be baked into images or persisted in logs. |
| Filesystem policy | Any path that must be writable at runtime must be explicitly mounted or documented. Read-only roots are preferred where practical. |
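As a shape reference, a probe handler that satisfies these shared rules can be as small as the following FastAPI sketch (assuming the entrypoints are FastAPI apps, as the `module:app` entrypoint strings suggest); role-specific response fields are described in the sections below.

```python
from fastapi import FastAPI

app = FastAPI()

# Illustrative probe: answers quickly, touches no external dependency, and
# identifies its role. Real entrypoints add role-specific readiness fields.
@app.get("/healthz")
def healthz() -> dict[str, str]:
    return {"status": "ok", "role": "gateway"}
```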
Gateway containers

Primary responsibility
- Run the platform control plane
- Serve the platform API and session/auth endpoints consumed by the dashboard
- Own long-lived coordination such as auth, scheduling, routing, and credential proxying
Current entrypoints
- Baseline K8s deployment: `hive.container.gateway_entrypoint:app`
- Full application runtime: `hive.main:app`
Startup contract
- The baseline deployment may use the minimal gateway entrypoint so the image is probeable before the full runtime stack is configured.
- A full production gateway must only be considered healthy after FastAPI startup wiring completes, including DB setup, scheduler startup, and route mounting.
- The Kubernetes full-runtime overlay at `deploy/k8s/overlays/full-gateway-runtime` enables `HIVE_STRICT_STARTUP=true`, which makes startup fail fast when required secrets are still on placeholder defaults or when DB initialization and migrations cannot complete (see the sketch after this list).
- The same overlay also sets `HIVE_STARTUP_MIGRATION_MODE=check` and `HIVE_METADATA_CREATE_ON_STARTUP=false`, so the Kubernetes path expects the schema to already be at the current Alembic head instead of mutating it implicitly during control-plane startup.
- The same overlay adds a `startupProbe` so Kubernetes waits for the real control-plane startup path before liveness checks restart the pod.
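The fail-fast expectation can be pictured as a startup check along the following lines; the helper name and the placeholder-detection rule are hypothetical, and the real wiring lives in the `hive.main` startup path.

```python
import os

# Hypothetical sketch of the HIVE_STRICT_STARTUP fail-fast contract described
# above; names and placeholder values are assumptions, not OpenHive source.
PLACEHOLDER_VALUES = {"", "changeme", "placeholder"}

def assert_strict_startup() -> None:
    if os.environ.get("HIVE_STRICT_STARTUP") != "true":
        return
    for key in ("DATABASE_URL", "DASHBOARD_SESSION_SECRET", "HIVE_INTERNAL_SECRET"):
        value = os.environ.get(key, "")
        if value in PLACEHOLDER_VALUES:
            # Refuse to report healthy: the orchestrator should see a crash,
            # not a gateway silently running on placeholder secrets.
            raise RuntimeError(f"strict startup: {key} is unset or a placeholder")
```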
Health contract
- Port: `8080`
- Endpoint: `GET /healthz`
- Baseline response identifies role `gateway`
- Full runtime response also reflects DB and active-agent health via `hive.main`
Filesystem and volume contract
- No persistent workspace volume is required for the baseline gateway container.
- The full runtime may read project workspace and config paths, but gateway-owned state must remain reconstructable from DB plus configured workspace mounts.
- The current full-runtime overlay mounts `/data/hive` as a writable `emptyDir`. That is enough for control-plane bring-up and route reachability, but operators should treat it as non-durable until a persistent storage story is wired in.
Isolation contract
- Namespace: `openhive`
- Gateway is the intended long-term role that holds real vendor secrets in process memory.
- In the current source-based `preview_local` deployment, Keeper and Scout often still run in the same process as the gateway via `LocalAgentPool`, so this boundary is only fully enforced today for relay-backed sandbox model access and encrypted-at-rest channel credentials.
- Gateway may talk to DB, Agent runtime, Sandbox runtime, and approved IM / LLM upstreams.
Required environment
- Full runtime requires the normal server settings used by `hive.main`, especially database connectivity, dashboard session secret, and platform credentials.
- The current full-runtime overlay treats these as the minimum required env surface: `DATABASE_URL`, `DASHBOARD_SESSION_SECRET`, `HIVE_INTERNAL_SECRET`, `HIVE_WORKSPACE`, `HIVE_STRICT_STARTUP=true`, `HIVE_STARTUP_MIGRATION_MODE=check`, `HIVE_METADATA_CREATE_ON_STARTUP=false`
- Pool selection stays in the composition root and currently uses these modes (see the sketch after this list):
  - `HIVE_POOL_BACKEND=local` keeps the in-process `LocalAgentPool` path used by source-based preview flows.
  - `HIVE_POOL_BACKEND=container` selects `ContainerAgentPool` and requires `HIVE_CONTAINER_RUNTIME_BACKEND` to choose the isolated-runtime backend.
  - The first supported Kubernetes deployment path uses `HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes`. Local source-based experiments can still use `local_subprocess` to start one isolated HTTP runtime per agent on the same host.
- Baseline health-only deployment does not require the full control-plane env surface.
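The pool-selection modes above imply a composition-root factory roughly like this sketch; the stub classes stand in for the real OpenHive pools, and their constructor signatures are assumptions, while the backend and env names come from this contract.

```python
import os

# Hedged sketch of composition-root pool selection. The stubs below stand in
# for the real OpenHive classes; their signatures are assumptions.
class LocalAgentPool:  # stub for the real in-process pool
    ...

class ContainerAgentPool:  # stub for the real container-backed pool
    def __init__(self, runtime_backend: str) -> None:
        self.runtime_backend = runtime_backend

def build_agent_pool():
    backend = os.environ.get("HIVE_POOL_BACKEND", "local")
    if backend == "local":
        return LocalAgentPool()  # source-based preview path
    if backend == "container":
        # Container mode requires an explicit isolated-runtime backend choice.
        runtime = os.environ["HIVE_CONTAINER_RUNTIME_BACKEND"]
        if runtime not in {"kubernetes", "local_subprocess"}:
            raise ValueError(f"unsupported container runtime backend: {runtime}")
        return ContainerAgentPool(runtime_backend=runtime)
    raise ValueError(f"unsupported pool backend: {backend}")
```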
Dashboard containers

Primary responsibility
- Serve the operator-facing Next.js dashboard
- Keep browser traffic same-origin while proxying dashboard-originated API calls to Gateway
- Preserve cookie-based auth behavior for self-hosted deployments
Current entrypoint
- Standalone Next.js runtime generated from `Dockerfile.web`
Startup contract
- The production image is built from Next.js standalone output and starts the generated `server.js` runtime.
- The runtime must listen on port `3000` and tolerate the standard `PORT` and `HOSTNAME` environment variables used by the standalone server.
- Dashboard API traffic must not depend on local-dev-only assumptions such as a hardcoded `http://localhost:8080` rewrite target.
- Same-origin browser requests to `/api/*` and `/healthz` are proxied by the Next.js `proxy.ts` layer to the configured in-cluster Gateway origin.
Health contract
- Port: `3000`
- Endpoint: `GET /dashboard-healthz`
- Response identifies role `dashboard`
Filesystem and volume contract
- No writable volume is required for the current standalone dashboard runtime.
- Static assets and the generated server bundle ship inside the image.
Isolation contract
- Namespace: `openhive`
- Dashboard may reach the in-cluster Gateway service.
- Dashboard does not require direct DB access.
- Browser-visible API traffic should remain same-origin to the dashboard host; the in-cluster hop to Gateway happens server-side through the dashboard proxy.
Required environment
- `HIVE_GATEWAY_INTERNAL_URL` for runtime proxying to Gateway
- Optional `PORT` / `HOSTNAME` overrides supported by the standalone Next.js server
- `NEXT_PUBLIC_API_URL` remains optional for special non-proxied deployments, but the supported Kubernetes path uses the same-origin proxy instead
Agent containers

Primary responsibility
- Host exactly one agent runtime instance for the Kubernetes-backed `ContainerAgentPool` path
- Expose a probeable runtime boundary that can be started, stopped, and replaced independently
Current entrypoint
- `hive.container.agent_entrypoint:app`
Startup contract
- The container must start without mutating bundled defaults inside the image.
- Config bootstrap must copy default files into the writable config volume without overwriting operator edits.
- The agent process must read its writable workspace from `HIVE_WORKSPACE`.
- When `HIVE_AGENT_CONFIG_JSON` is supplied, the entrypoint must bootstrap one real `HiveAgent` runtime inside the container and expose it through the existing `/run` and `/flush-memory` HTTP contract.
- The first Kubernetes runtime backend now creates one Pod plus one Service per managed agent identity, with deterministic runtime naming and Service-DNS base URL resolution. Exact `agent_id` and controller ownership values are carried in Pod annotations rather than lossy label-safe rewrites.
- The isolated runtime must route model calls back through `HIVE_GATEWAY_URL` using the gateway relay flow (see the sketch after this list). It must not require a long-lived vendor API key in the agent container environment.
- That vendor-secret boundary is narrower than a blanket "no secrets at all" claim: the current local separated-process runtime may still receive `DATABASE_URL` when DB-backed runtime features are enabled, so operators should interpret the current proof as provider/app credential isolation rather than total operational-secret elimination.
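Read as code, the relay expectation looks roughly like the sketch below; the relay endpoint path, payload shape, and token handling are hypothetical placeholders, since this document pins down only the direction of the flow: agent to gateway, short-lived task token, no vendor key.

```python
import os

import httpx

# Hypothetical sketch of the relay expectation described above: model calls go
# to the gateway with a short-lived task token, never with a vendor API key.
# The relay path and payload shape are assumptions, not the real contract.
def relay_model_call(task_token: str, prompt: str) -> str:
    gateway = os.environ["HIVE_GATEWAY_URL"]
    response = httpx.post(
        f"{gateway}/relay/model",  # hypothetical relay endpoint path
        headers={"Authorization": f"Bearer {task_token}"},
        json={"prompt": prompt},
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()["output"]
```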
Health contract
- Port: `8090`
- Endpoint: `GET /healthz`
- Response includes: `status`, `role=agent`, `workspace`, `runtime_ready`, `agent_id`, `project_id`, `controller_id`, `deployment_backend`, `readiness_reason`
Filesystem and volume contract
- Writable config/workspace mount is required.
- Current baseline mount path: `/data/config`
- Current baseline env: `HIVE_WORKSPACE=/data/config`
- The init container writes defaults from `/app/defaults/agent` into the writable volume without overwriting operator edits (see the bootstrap sketch after this list).
- The first Kubernetes-backed lifecycle slice uses an `emptyDir` workspace per managed agent Pod. That keeps the runtime contract explicit without claiming durable per-agent storage yet.
- Agent runtime data must live in the mounted workspace, not in the image filesystem.
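A copy-without-overwrite bootstrap consistent with that rule might look like this sketch; the paths come from the baseline contract, while the helper itself is illustrative.

```python
import shutil
from pathlib import Path

# Hedged sketch of the config bootstrap rule above: copy bundled defaults into
# the writable volume, but never overwrite a file the operator already edited.
# Paths follow the baseline contract; the function itself is illustrative.
def bootstrap_defaults(
    defaults_dir: Path = Path("/app/defaults/agent"),
    config_dir: Path = Path("/data/config"),
) -> None:
    config_dir.mkdir(parents=True, exist_ok=True)
    for source in defaults_dir.rglob("*"):
        if source.is_dir():
            continue
        target = config_dir / source.relative_to(defaults_dir)
        if target.exists():
            continue  # preserve operator edits across restarts
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
```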
Isolation contract
- Namespace: `hive-agents`
- Agent containers should not require arbitrary external egress.
- Agent containers may reach:
  - Gateway API
  - Sandbox API
- Agent containers must not rely on direct access to another agent container.
- State-changing agent-runtime HTTP endpoints require the gateway/controller shared secret (`X-Internal-Secret` or Bearer auth); a hedged sketch follows this list. `GET /healthz` remains the probe endpoint and does not execute runtime work.
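A hedged FastAPI sketch of that auth split follows; the dependency shape is an assumption, and the real endpoints live in `hive.container.agent_entrypoint`.

```python
import hmac
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Hedged sketch of the shared-secret rule above: state-changing endpoints
# require X-Internal-Secret, while GET /healthz stays unauthenticated. The
# dependency shape is illustrative, not the actual agent entrypoint code.
def require_internal_secret(
    x_internal_secret: str | None = Header(default=None),
) -> None:
    expected = os.environ.get("HIVE_INTERNAL_SECRET", "")
    if not (expected and x_internal_secret and hmac.compare_digest(x_internal_secret, expected)):
        raise HTTPException(status_code=403, detail="missing or invalid internal secret")

@app.post("/run", dependencies=[Depends(require_internal_secret)])
def run() -> dict[str, str]:
    return {"status": "accepted"}  # placeholder body; real work happens in the runtime

@app.get("/healthz")
def healthz() -> dict[str, str]:
    return {"status": "ok", "role": "agent"}  # probe does no runtime work
```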
Required environment
- `HIVE_WORKSPACE`
- `HIVE_AGENT_CONFIG_JSON` for the real isolated-runtime mode
- `HIVE_GATEWAY_URL` so the agent runtime can reach the gateway relay
- `HIVE_INTERNAL_SECRET` for local relay-token issuance until a narrower per-agent token bootstrap path lands, and for authenticating gateway/controller calls into `/run`, `/resume-run`, and `/flush-memory`
- Any per-agent runtime metadata supplied by the future pool implementation
- Long-lived vendor API secrets must not be injected into agent containers; the supported isolated mode uses the gateway relay instead
Sandbox containers

Primary responsibility
- Execute sandbox-local development/runtime tasks
- Persist task-local logs and artifacts in a bounded writable area
- Remain isolated from agent runtime and arbitrary network destinations
Current entrypoint
- `hive.container.sandbox_entrypoint:app`
Startup contract
- Sandbox API startup requires DB connectivity because task metadata and events are persisted.
- The sandbox entrypoint is no longer a placeholder-only probe target: it is the current runtime for the governed dev-task lane used by the project-scoped Gateway facade.
- The container root filesystem may remain read-only if writable task storage is mounted separately.
Health contract
- Port: `8091`
- Endpoint: `GET /healthz`
- Response identifies role `sandbox`
- Dev-task status responses include an optional `runtime` block with `backend_run_id`, `execution_class`, `artifact_root`, `log_root`, and heartbeat metadata so operators can map OpenHive tasks back to sandbox execution state
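Read as a typed shape, that optional `runtime` block could be modeled as follows; the class name and field types are assumptions, and only the field names come from this contract.

```python
from pydantic import BaseModel

# Hypothetical typed shape for the optional `runtime` block listed above; the
# class name and field types are assumptions, only the field names come from
# this contract, and the heartbeat metadata fields are left deliberately loose.
class DevTaskRuntime(BaseModel):
    backend_run_id: str
    execution_class: str
    artifact_root: str
    log_root: str
    last_heartbeat_at: str | None = None  # exact heartbeat fields unspecified here
```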
Filesystem and volume contract
- Sandbox task-local writable roots: `/sandbox/commands` and `/sandbox/tasks`
- The current runtime also benefits from a writable `/tmp` mount for subprocess and tool behavior.
- Task storage under `/sandbox/tasks/<task_id>/` is split into `repo/`, `artifacts/`, `logs/`, and `scratch/` (see the layout sketch after this list).
- Task-local writable storage must be ephemeral or policy-controlled; artifacts that need to survive task completion must be persisted through the sandbox API contract.
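A small helper that materializes this layout might look like the following sketch; the directory names come from this contract, while the helper itself is hypothetical.

```python
from pathlib import Path

# Hedged sketch of the task-local layout described above; the helper is
# illustrative, and only the directory names come from this contract.
SANDBOX_TASKS_ROOT = Path("/sandbox/tasks")

def ensure_task_layout(task_id: str) -> dict[str, Path]:
    task_root = SANDBOX_TASKS_ROOT / task_id
    layout = {name: task_root / name for name in ("repo", "artifacts", "logs", "scratch")}
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)
    return layout
```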
Isolation contract
- Namespace: `hive-sandbox`
- Sandbox containers must not require agent-to-agent communication.
- Sandbox containers may reach Gateway for control-plane interaction.
- Sandbox containers must not talk directly to agent runtime.
- External egress is deny-by-default and should only be opened through explicit allowlists.
- The reusable `/commands` API is intentionally narrower than unrestricted shell access: it only accepts governed argv-based commands, rejects shell entrypoints and env overrides, and requires explicit allowlisted registries for networked package-install flows (see the validation sketch below).
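The argv-governance rule can be sketched as a validation step like this; the checks in the real `/commands` API are broader, and the names here are illustrative.

```python
# Hedged sketch of the governed-argv rule above: reject shell entrypoints and
# env overrides before anything executes. The real /commands API performs
# broader checks; these names are illustrative assumptions.
SHELL_ENTRYPOINTS = {"sh", "bash", "zsh", "dash", "fish"}

def validate_command(argv: list[str], env_overrides: dict[str, str] | None) -> None:
    """Refuse shells and env overrides before a governed command is accepted."""
    if not argv:
        raise ValueError("empty argv")
    if env_overrides:
        raise ValueError("env overrides are rejected by the governed /commands API")
    program = argv[0].rsplit("/", 1)[-1]
    if program in SHELL_ENTRYPOINTS:
        raise ValueError(f"shell entrypoint rejected: {program}")
```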
Required environment
- `DATABASE_URL` for the current sandbox API runtime
- optional `HIVE_SANDBOX_CODING_BACKEND`; omit it or set `codex_cli` for the default governed local CLI path, or set `relay_helper` for the explicit gateway-relay-backed helper proof path
- optional `HIVE_SANDBOX_CODEX_AUTH_MODE`; leave unset or `scrubbed` for the default secret-scrubbed `codex_cli` child env, or set `env` to allow the governed `codex` subprocess to receive explicitly allowlisted provider env vars
- optional `HIVE_SANDBOX_CODEX_ENV_ALLOWLIST`; comma-separated provider env keys such as `OPENAI_API_KEY` or `QWEN_API_KEY` that may be forwarded only when `HIVE_SANDBOX_CODEX_AUTH_MODE=env`
- optional `HIVE_SANDBOX_CODEX_MODEL`; defaults to `qwen3-max` but may be overridden when the configured Codex provider only supports a different model
- optional `HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL` and `HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN` for archive-seeded tasks whose original workspace lives outside the sandbox filesystem. This is an operator-local apply-back channel, not a model-token relay; the sandbox sends the approved patch to that URL only after PM approval and only when the original requested workspace path is not locally reachable.
- Any future sandbox execution settings required by backend-specific task runners
Until dedicated readiness endpoints are introduced, OpenHive uses the following rule:
- Gateway, Agent, and Sandbox use `GET /healthz` for both liveness and readiness probes
- Dashboard uses `GET /dashboard-healthz` for both liveness and readiness probes
That means:
- handlers must stay lightweight
- baseline entrypoints must not depend on slow external calls inside health probes
- probe responses may still include lightweight ownership and readiness metadata when it helps distinguish startup delays from hard bootstrap failures
The manifests under `deploy/k8s/base/` and the full-runtime overlays map these contracts as follows:

| Role | Namespace | Port | Probe | Writable paths |
|---|---|---|---|---|
| Gateway | `openhive` | `8080` | `/healthz` | none required in baseline; `/data/hive` in full runtime |
| Dashboard | `openhive` | `3000` | `/dashboard-healthz` | none required |
| Agent | `hive-agents` | `8090` | `/healthz` | `/data/config` |
| Sandbox | `hive-sandbox` | `8091` | `/healthz` | `/sandbox/commands`, `/sandbox/tasks`, `/tmp` |
This document does not claim that the following are already implemented:
- per-agent production relay APIs
- Kubernetes-native autoscaling behavior
- final HA topology for the dashboard or Gateway control plane
Those concerns should build on this contract rather than redefining the runtime surface.