
OpenHive Container Runtime Contracts

This document defines the runtime contract for the four production container roles used by OpenHive:

  • Gateway containers
  • Dashboard containers
  • Agent containers
  • Sandbox containers

These contracts are the baseline for issue #84 and the Kubernetes productization work that followed. They are intentionally explicit about startup behavior, health checks, writable paths, and isolation expectations so later work such as ContainerAgentPool and reconciliation can build on a stable operational model.

Contract Status

OpenHive currently has two containerization layers:

  • the current deployable baseline in deploy/k8s/
  • the evolving container orchestration path built around ContainerAgentPool

This document describes the contract that operators should rely on now, while also identifying which responsibilities remain future work.

For clarity:

  • the supported preview_local onboarding path is still backend + dashboard from source
  • the Kubernetes productized control-plane path now adds a standalone dashboard container next to the full Gateway runtime
  • the matching local operator entrypoint for sandbox workflows is make run-sandbox / uv run uvicorn hive.container.sandbox_entrypoint:app --port 8091 with HIVE_SANDBOX_URL=http://127.0.0.1:8091
  • local image-parity checks can instead use make run-sandbox-container, which starts the profiled Docker Compose sandbox service built from Dockerfile.sandbox on the same local port
  • sandbox-backed dev-task workflows now have a real operator-facing surface through the authenticated Gateway and dashboard dev-task review APIs, but this remains an optional local operator workflow rather than a required preview_local onboarding step
  • the gateway relay and short-lived task-token contract exist for model-backed isolated runtimes; the isolated agent runtime now uses a shared task relay client that future sandbox relay backends should reuse
  • the canonical current sandbox coding path is the default HIVE_SANDBOX_CODING_BACKEND=codex_cli mode, which runs the local codex exec CLI as a governed subprocess with Codex’s workspace-write sandbox mode so task-local edits can be proposed without full unrestricted access
  • the sandbox container also exposes an explicit opt-in proof mode, HIVE_SANDBOX_CODING_BACKEND=relay_helper, which swaps that default local CLI execution for a local relay helper process that consumes gateway-issued task tokens
  • the sandbox container also exposes HIVE_SANDBOX_CODING_BACKEND=deterministic_proof for Docker-local control loop verification only; it does not prove provider-backed coding quality, but it does prove archive seeding, runtime attempts, checkpoints, artifacts, and in-place resume without depending on a live LLM provider
  • the Docker Compose sandbox profile bind-mounts source hive, packages, and scripts directories so local verification can exercise the current checkout without rebuilding the sandbox image for every Python-only control-loop change
  • the default codex_cli adapter now scrubs inherited secret env before launch, but it can also opt in to a narrow provider-env allowlist sourced from the sandbox runtime env when operators need real codex_cli execution
  • subprocess env construction is centralized for governed runtime paths: inherited provider keys, database credentials, gateway/internal secrets, and app credentials are scrubbed unless a documented direct-env allowlist names the key
  • sandbox command status includes bounded stdout/stderr previews with sizes, omitted counts, state, exit code, command id, and retrieval hints; full logs remain available only through explicit log retrieval
  • sandbox-backed Work Runs publish durable progress checkpoints through the sandbox API and persist runtime attempts separately from the logical task id; this lets the operator resume a stale or failed Work Run in place instead of treating timeout extension as the long-term scaling mechanism
  • checkpoint runtime events expose a typed checkpoint_sequence_no, durable marker, and checkpoint_payload so dashboards can show friendly progress and recovery context without relying on untyped lifecycle-event extras
  • in-place Work Run resume is a new sandbox attempt under the same logical task: it keeps the task id, current workspace, artifacts/log visibility, and latest checkpoint context, while requeue remains the fallback that creates a new logical task when the original workspace is unavailable or should be forked
  • the governed process-session lifecycle is list, poll, log, and cancel over existing sandbox command ids, not an arbitrary terminal or shell creation API
  • when an operator also supplies sandbox runtime provider metadata such as a provider id and base URL, the sandbox entrypoint can materialize a minimal runtime ~/.codex/config.toml under the writable sandbox home so the default codex_cli path can target an OpenAI-compatible provider without baking that config into the image
  • the Kubernetes sandbox pod now sets HIVE_SANDBOX_CODEX_SANDBOX_MODE=danger-full-access for the inner Codex CLI because the pod itself is already the outer governed workspace boundary and container runtimes such as Docker Desktop block Codex's inner bubblewrap namespace sandbox
  • the same pod also mounts a dedicated writable home at /home/codex instead of reusing /tmp, because real non-ephemeral Codex CLI runs may stall when their managed home directory lives directly under a temporary root
  • when a remote sandbox is seeded from workspace_archive_b64, approval first tries to apply the patch back to the original requested workspace path; if that path is not visible inside the sandbox pod, operators may configure an explicit host-side apply relay with HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL and HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN
  • dev-task workspace state now has a source-tested WorkspaceManifest shape that records task id, repo path class, artifact manifest path, target paths, source/requested workspace path presence, apply-back mode, and relay configuration without storing raw archives, patches, or secrets in run state
  • that env-auth path is still not the same as a relay-backed sandbox runtime; it is a direct provider-env bridge for the local CLI, not a token flow
  • that means vendor-secret residency claims only apply to explicitly relay-backed sandbox or agent runtimes, not to every current preview_local dev-task execution path or the default codex_cli sandbox path
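
The scrub-then-allowlist behavior described for governed subprocess env construction can be sketched as follows. This is an illustrative sketch only: the key prefixes and the function name are assumptions, not the actual OpenHive implementation.

```python
# Hypothetical illustration of scrub-then-allowlist child-env construction.
# The prefix list below is an assumption; the real scrub rules may differ.
SENSITIVE_PREFIXES = ("OPENAI_", "ANTHROPIC_", "QWEN_", "HIVE_", "DATABASE_", "DASHBOARD_")

def build_child_env(parent_env: dict, allowlist: frozenset = frozenset()) -> dict:
    """Drop inherited provider/app secrets unless a documented allowlist names the key."""
    child = {}
    for key, value in parent_env.items():
        if key.startswith(SENSITIVE_PREFIXES) and key not in allowlist:
            continue  # scrubbed: secret-bearing key not named by the allowlist
        child[key] = value
    return child
```

With this shape, `build_child_env({"PATH": "/usr/bin", "OPENAI_API_KEY": "sk-test"})` keeps only `PATH`, while naming `OPENAI_API_KEY` in the allowlist forwards it, mirroring the HIVE_SANDBOX_CODEX_AUTH_MODE=env opt-in.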

Common Contract Rules

All OpenHive runtime containers share these rules:

  • Health endpoint: Each role must expose a lightweight HTTP probe endpoint. Gateway, Agent, and Sandbox use GET /healthz; Dashboard uses GET /dashboard-healthz.
  • Probe behavior: Health endpoints must return quickly and without external side effects.
  • Logging: Containers write structured logs to stdout/stderr.
  • Shutdown: Containers must tolerate SIGTERM and allow the orchestrator to stop them without manual cleanup steps.
  • Secrets: Secrets come from environment or mounted secret sources. They must never be baked into images or persisted in logs.
  • Filesystem policy: Any path that must be writable at runtime must be explicitly mounted or documented. Read-only roots are preferred where practical.
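
A probe client that enforces these rules from the caller side might look like the sketch below. The loopback base URLs and the timeout value are assumptions for a local setup; only the ports and endpoint paths come from this contract.

```python
import json
import urllib.request

# Role -> (base URL, probe path). Base URLs assume a local port-forwarded setup.
PROBES = {
    "gateway": ("http://127.0.0.1:8080", "/healthz"),
    "dashboard": ("http://127.0.0.1:3000", "/dashboard-healthz"),
    "agent": ("http://127.0.0.1:8090", "/healthz"),
    "sandbox": ("http://127.0.0.1:8091", "/healthz"),
}

def probe(role: str, timeout: float = 2.0) -> dict:
    """Hit a role's probe endpoint; a tight timeout enforces 'return quickly'."""
    base, path = PROBES[role]
    with urllib.request.urlopen(base + path, timeout=timeout) as resp:
        return json.loads(resp.read())
```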

Role Contracts

1. Gateway Container

Primary responsibility

  • Run the platform control plane
  • Serve the platform API and session/auth endpoints consumed by the dashboard
  • Own long-lived coordination such as auth, scheduling, routing, and credential proxying

Current entrypoints

  • Baseline K8s deployment: hive.container.gateway_entrypoint:app
  • Full application runtime: hive.main:app

Startup contract

  • The baseline deployment may use the minimal gateway entrypoint so the image is probeable before the full runtime stack is configured.
  • A full production gateway must only be considered healthy after FastAPI startup wiring completes, including DB setup, scheduler startup, and route mounting.
  • The Kubernetes full-runtime overlay at deploy/k8s/overlays/full-gateway-runtime enables HIVE_STRICT_STARTUP=true, which makes startup fail fast when required secrets are still on placeholder defaults or when DB initialization and migrations cannot complete.
  • The same overlay also sets HIVE_STARTUP_MIGRATION_MODE=check and HIVE_METADATA_CREATE_ON_STARTUP=false, so the Kubernetes path expects the schema to already be at the current Alembic head instead of mutating it implicitly during control-plane startup.
  • The same overlay adds a startupProbe so Kubernetes waits for the real control-plane startup path before liveness checks restart the pod.
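
A startupProbe of the shape described above might look like the following fragment. The period and threshold numbers here are illustrative examples, not the committed values in deploy/k8s/overlays/full-gateway-runtime.

```yaml
# Illustrative probe wiring for the full-runtime gateway pod.
# Numbers are examples only, not the committed overlay values.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # allow ample time for DB setup and the migration check
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```

Because liveness checks only begin once the startupProbe succeeds, a slow control-plane bring-up does not trigger restart loops.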

Health contract

  • Port: 8080
  • Endpoint: GET /healthz
  • Baseline response identifies role gateway
  • Full runtime response also reflects DB and active-agent health via hive.main

Filesystem and volume contract

  • No persistent workspace volume is required for the baseline gateway container.
  • The full runtime may read project workspace and config paths, but gateway-owned state must remain reconstructable from DB plus configured workspace mounts.
  • The current full-runtime overlay mounts /data/hive as a writable emptyDir. That is enough for control-plane bring-up and route reachability, but operators should treat it as non-durable until a persistent storage story is wired in.

Isolation contract

  • Namespace: openhive
  • Gateway is the intended long-term role that holds real vendor secrets in process memory.
  • In the current source-based preview_local deployment, Keeper and Scout often still run in the same process as the gateway via LocalAgentPool, so this boundary is only fully enforced today for relay-backed sandbox model access and encrypted-at-rest channel credentials.
  • Gateway may talk to DB, Agent runtime, Sandbox runtime, and approved IM / LLM upstreams.

Required environment

  • Full runtime requires the normal server settings used by hive.main, especially database connectivity, dashboard session secret, and platform credentials.
  • The current full-runtime overlay treats these as the minimum required env surface:
    • DATABASE_URL
    • DASHBOARD_SESSION_SECRET
    • HIVE_INTERNAL_SECRET
    • HIVE_WORKSPACE
    • HIVE_STRICT_STARTUP=true
    • HIVE_STARTUP_MIGRATION_MODE=check
    • HIVE_METADATA_CREATE_ON_STARTUP=false
  • Pool selection stays in the composition root and currently uses these modes:
    • HIVE_POOL_BACKEND=local keeps the in-process LocalAgentPool path used by source-based preview flows.
    • HIVE_POOL_BACKEND=container selects ContainerAgentPool and requires HIVE_CONTAINER_RUNTIME_BACKEND to choose the isolated-runtime backend.
    • The first supported Kubernetes deployment path uses HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes. Local source-based experiments can still use local_subprocess to start one isolated HTTP runtime per agent on the same host.
  • Baseline health-only deployment does not require the full control-plane env surface.
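
The pool-selection modes listed above can be sketched as a composition-root decision. The mode names and env vars come from this document; the function itself and its return values are hypothetical.

```python
# Hypothetical composition-root pool selection mirroring the documented modes.
VALID_RUNTIME_BACKENDS = {"kubernetes", "local_subprocess"}

def select_pool_backend(env: dict) -> str:
    pool = env.get("HIVE_POOL_BACKEND", "local")
    if pool == "local":
        # In-process LocalAgentPool path used by source-based preview flows.
        return "LocalAgentPool"
    if pool == "container":
        runtime = env.get("HIVE_CONTAINER_RUNTIME_BACKEND")
        if runtime not in VALID_RUNTIME_BACKENDS:
            raise ValueError(
                "HIVE_POOL_BACKEND=container requires HIVE_CONTAINER_RUNTIME_BACKEND"
            )
        return f"ContainerAgentPool[{runtime}]"
    raise ValueError(f"unknown pool backend: {pool}")
```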

2. Dashboard Container

Primary responsibility

  • Serve the operator-facing Next.js dashboard
  • Keep browser traffic same-origin while proxying dashboard-originated API calls to Gateway
  • Preserve cookie-based auth behavior for self-hosted deployments

Current entrypoint

  • Standalone Next.js runtime generated from Dockerfile.web

Startup contract

  • The production image is built from Next.js standalone output and starts the generated server.js runtime.
  • The runtime must listen on port 3000 and tolerate the standard PORT and HOSTNAME environment variables used by the standalone server.
  • Dashboard API traffic must not depend on local-dev-only assumptions such as a hardcoded http://localhost:8080 rewrite target.
  • Same-origin browser requests to /api/* and /healthz are proxied by the Next.js proxy.ts layer to the configured in-cluster Gateway origin.

Health contract

  • Port: 3000
  • Endpoint: GET /dashboard-healthz
  • Response identifies role dashboard

Filesystem and volume contract

  • No writable volume is required for the current standalone dashboard runtime.
  • Static assets and the generated server bundle ship inside the image.

Isolation contract

  • Namespace: openhive
  • Dashboard may reach the in-cluster Gateway service.
  • Dashboard does not require direct DB access.
  • Browser-visible API traffic should remain same-origin to the dashboard host; the in-cluster hop to Gateway happens server-side through the dashboard proxy.

Required environment

  • HIVE_GATEWAY_INTERNAL_URL for runtime proxying to Gateway
  • Optional PORT / HOSTNAME overrides supported by the standalone Next.js server
  • NEXT_PUBLIC_API_URL remains optional for special non-proxied deployments, but the supported Kubernetes path uses the same-origin proxy instead

3. Agent Container

Primary responsibility

  • Host exactly one agent runtime instance for the Kubernetes-backed ContainerAgentPool path
  • Expose a probeable runtime boundary that can be started, stopped, and replaced independently

Current entrypoint

  • hive.container.agent_entrypoint:app

Startup contract

  • The container must start without mutating bundled defaults inside the image.
  • Config bootstrap must copy default files into the writable config volume without overwriting operator edits.
  • The agent process must read its writable workspace from HIVE_WORKSPACE.
  • When HIVE_AGENT_CONFIG_JSON is supplied, the entrypoint must bootstrap one real HiveAgent runtime inside the container and expose it through the existing /run and /flush-memory HTTP contract.
  • The first Kubernetes runtime backend now creates one Pod plus one Service per managed agent identity, with deterministic runtime naming and Service-DNS base URL resolution. Exact agent_id and controller ownership values are carried in Pod annotations rather than lossy label-safe rewrites.
  • The isolated runtime must route model calls back through HIVE_GATEWAY_URL using the gateway relay flow. It must not require a long-lived vendor API key in the agent container environment.
  • That vendor-secret boundary is narrower than a blanket "no secrets at all" claim: the current local separated-process runtime may still receive DATABASE_URL when DB-backed runtime features are enabled, so operators should interpret the current proof as provider/app credential isolation rather than total operational-secret elimination.

Health contract

  • Port: 8090
  • Endpoint: GET /healthz
  • Response includes:
    • status
    • role=agent
    • workspace
    • runtime_ready
    • agent_id
    • project_id
    • controller_id
    • deployment_backend
    • readiness_reason
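
The response field list above can be sketched as a payload builder. This is an illustrative shape only; how the real agent entrypoint sources each field is an implementation detail this sketch glosses over.

```python
# Illustrative agent /healthz payload builder; field names come from the
# contract above, the `state` dict shape is an assumption.
def agent_health(state: dict) -> dict:
    return {
        "status": "ok" if state.get("runtime_ready") else "starting",
        "role": "agent",
        "workspace": state.get("workspace"),
        "runtime_ready": bool(state.get("runtime_ready")),
        "agent_id": state.get("agent_id"),
        "project_id": state.get("project_id"),
        "controller_id": state.get("controller_id"),
        "deployment_backend": state.get("deployment_backend"),
        "readiness_reason": state.get("readiness_reason"),
    }
```

A readiness_reason string alongside runtime_ready=False lets operators distinguish ordinary startup delay from a hard bootstrap failure.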

Filesystem and volume contract

  • Writable config/workspace mount is required.
  • Current baseline mount path:
    • /data/config
  • Current baseline env:
    • HIVE_WORKSPACE=/data/config
  • The init container writes defaults from /app/defaults/agent into the writable volume.
  • The first Kubernetes-backed lifecycle slice uses an emptyDir workspace per managed agent Pod. That keeps the runtime contract explicit without claiming durable per-agent storage yet.
  • Agent runtime data must live in the mounted workspace, not in the image filesystem.

Isolation contract

  • Namespace: hive-agents
  • Agent containers should not require arbitrary external egress.
  • Agent containers may reach:
    • Gateway API
    • Sandbox API
  • Agent containers must not rely on direct access to another agent container.
  • State-changing agent-runtime HTTP endpoints require the gateway/controller shared secret (X-Internal-Secret or Bearer auth). GET /healthz remains the probe endpoint and does not execute runtime work.

Required environment

  • HIVE_WORKSPACE
  • HIVE_AGENT_CONFIG_JSON for the real isolated-runtime mode
  • HIVE_GATEWAY_URL so the agent runtime can reach the gateway relay
  • HIVE_INTERNAL_SECRET for local relay-token issuance until a narrower per-agent token bootstrap path lands, and for authenticating gateway/controller calls into /run, /resume-run, and /flush-memory
  • Any per-agent runtime metadata supplied by the future pool implementation
  • Long-lived vendor API secrets must not be injected into agent containers; the supported isolated mode uses the gateway relay instead

4. Sandbox Container

Primary responsibility

  • Execute sandbox-local development/runtime tasks
  • Persist task-local logs and artifacts in a bounded writable area
  • Remain isolated from agent runtime and arbitrary network destinations

Current entrypoint

  • hive.container.sandbox_entrypoint:app

Startup contract

  • Sandbox API startup requires DB connectivity because task metadata and events are persisted.
  • The sandbox entrypoint is no longer a placeholder-only probe target: it is the current runtime for the governed dev-task lane used by the project-scoped Gateway facade.
  • The container root filesystem may remain read-only if writable task storage is mounted separately.

Health contract

  • Port: 8091
  • Endpoint: GET /healthz
  • Response identifies role sandbox
  • Dev-task status responses include an optional runtime block with backend_run_id, execution_class, artifact_root, log_root, and heartbeat metadata so operators can map OpenHive tasks back to sandbox execution state

Filesystem and volume contract

  • Sandbox task-local writable root:
    • /sandbox/commands
    • /sandbox/tasks
  • The current runtime also benefits from a writable /tmp mount for subprocess and tool behavior.
  • Task storage under /sandbox/tasks/<task_id>/ is split into:
    • repo/
    • artifacts/
    • logs/
    • scratch/
  • Task-local writable storage must be ephemeral or policy-controlled; artifacts that need to survive task completion must be persisted through the sandbox API contract.

Isolation contract

  • Namespace: hive-sandbox
  • Sandbox containers must not require agent-to-agent communication.
  • Sandbox containers may reach Gateway for control-plane interaction.
  • Sandbox containers must not talk directly to agent runtime.
  • External egress is deny-by-default and should only be opened through explicit allowlists.
  • The reusable /commands API is intentionally narrower than unrestricted shell access: it only accepts governed argv-based commands, rejects shell entrypoints and env overrides, and requires explicit allowlisted registries for networked package-install flows.
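
The governed /commands gate can be sketched as a request validator. The rejected shell names and error wording here are assumptions; only the governed-argv, no-shell, no-env-override rules come from the contract.

```python
# Hypothetical validator for the governed argv-only /commands surface.
SHELL_ENTRYPOINTS = {"sh", "bash", "zsh", "dash", "fish"}  # assumed reject list

def validate_command(request: dict) -> list:
    if "env" in request:
        raise ValueError("env overrides are not accepted")
    argv = request.get("argv")
    if not isinstance(argv, list) or not argv:
        raise ValueError("argv-based commands only")
    # Reject shell entrypoints whether given bare or as an absolute path.
    if argv[0].rsplit("/", 1)[-1] in SHELL_ENTRYPOINTS:
        raise ValueError("shell entrypoints are rejected")
    return argv
```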

Required environment

  • DATABASE_URL for the current sandbox API runtime
  • optional HIVE_SANDBOX_CODING_BACKEND; omit it or set it to codex_cli for the default governed local CLI path, or set it to relay_helper for the explicit gateway-relay-backed helper proof path
  • optional HIVE_SANDBOX_CODEX_AUTH_MODE; leave it unset or set it to scrubbed for the default secret-scrubbed codex_cli child env, or set it to env to allow the governed codex subprocess to receive explicitly allowlisted provider env vars
  • optional HIVE_SANDBOX_CODEX_ENV_ALLOWLIST; comma-separated provider env keys such as OPENAI_API_KEY or QWEN_API_KEY that may be forwarded only when HIVE_SANDBOX_CODEX_AUTH_MODE=env
  • optional HIVE_SANDBOX_CODEX_MODEL; defaults to qwen3-max but may be overridden when the configured Codex provider only supports a different model
  • optional HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL and HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN for archive-seeded tasks whose original workspace lives outside the sandbox filesystem. This is an operator-local apply-back channel, not a model-token relay; the sandbox sends the approved patch to that URL only after PM approval and only when the original requested workspace path is not locally reachable.
  • Any future sandbox execution settings required by backend-specific task runners
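
Backend selection from HIVE_SANDBOX_CODING_BACKEND can be sketched as below. The mode names and the codex_cli default come from this document; the function itself is illustrative.

```python
# Hypothetical resolver for the sandbox coding backend; mode names are the
# documented ones, with codex_cli as the documented default.
VALID_BACKENDS = {"codex_cli", "relay_helper", "deterministic_proof"}

def resolve_coding_backend(env: dict) -> str:
    backend = env.get("HIVE_SANDBOX_CODING_BACKEND", "codex_cli")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported sandbox coding backend: {backend}")
    return backend
```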

Probe and Readiness Semantics

Until dedicated readiness endpoints are introduced, OpenHive uses the following rule:

  • Gateway, Agent, and Sandbox use GET /healthz for both liveness and readiness probes
  • Dashboard uses GET /dashboard-healthz for both liveness and readiness probes

That means:

  • handlers must stay lightweight
  • baseline entrypoints must not depend on slow external calls inside health probes
  • probe responses may still include lightweight ownership and readiness metadata when it helps distinguish startup delays from hard bootstrap failures

Kubernetes Baseline Mapping

The manifests under deploy/k8s/base/ and the full-runtime overlays map these contracts as follows:

Role        Namespace      Port   Probe                 Writable paths
Gateway     openhive       8080   /healthz              none required in baseline; /data/hive in full runtime
Dashboard   openhive       3000   /dashboard-healthz    none required
Agent       hive-agents    8090   /healthz              /data/config
Sandbox     hive-sandbox   8091   /healthz              /sandbox/commands, /sandbox/tasks, /tmp

Non-Goals for This Contract

This document does not claim that the following are already implemented:

  • per-agent production relay APIs
  • Kubernetes-native autoscaling behavior
  • final HA topology for the dashboard or Gateway control plane

Those concerns should build on this contract rather than redefining the runtime surface.