
OpenHive — Operations Guide

Process Management

Gateway auto-restart

The Gateway service is configured with restart: always in docker-compose.yml. Docker will automatically restart it after crashes, OOM kills, or host reboots.
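The relevant stanza in docker-compose.yml looks roughly like this (a sketch; only the restart policy is the point here, the real service definition carries more keys):

```yaml
services:
  gateway:
    restart: always   # Docker restarts the container after crashes, OOM kills, and host reboots
```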

# Start all services
docker compose up -d

# Check service status
docker compose ps

# View Gateway logs (last 100 lines, follow)
docker compose logs -f --tail=100 gateway

# Manual restart
docker compose restart gateway

Kubernetes baseline

The preview-era containerization baseline lives under deploy/k8s/. Use docs/deploy-k8s.md for namespace layout, NetworkPolicy checks, and init-container bootstrap verification. Use docs/container-runtime-contracts.md for the explicit startup, health, volume, and isolation contract of the Gateway, Agent, and Sandbox runtime roles.

For the current Kubernetes productization slice:

  • the supported DB mode is operator-managed external PostgreSQL
  • OpenHive owns the in-cluster migration Job contract, not PostgreSQL lifecycle
  • operators own PostgreSQL backups, restore, retention, and major-version upgrades
  • the recommended operator-facing deployment is the preview installer driven by deploy/k8s/preview-installer/values.env.example and make k8s-preview-install env_file=/path/to/env
  • the full platform overlay adds the standalone dashboard, same-origin API proxying, and a combined ingress example for dashboard plus API traffic
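A minimal operator walkthrough of that installer path (paths and the make target are from this guide; the env values themselves are cluster-specific):

```shell
# Copy the example env file and fill in cluster-specific values
cp deploy/k8s/preview-installer/values.env.example /tmp/openhive-preview.env
"${EDITOR:-vi}" /tmp/openhive-preview.env

# Drive the preview install from that env file
make k8s-preview-install env_file=/tmp/openhive-preview.env
```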

Graceful shutdown

# Stop all services (data volumes preserved)
docker compose stop

# Stop and remove containers (data volumes preserved)
docker compose down

Backup & Recovery

Automated daily backup

The backup script (scripts/backup.sh) handles:

| What | Where |
| --- | --- |
| PostgreSQL full dump | /var/backups/hive/db/hive-TIMESTAMP.sql.gz |
| projects/ directory | /var/backups/hive/projects/projects-TIMESTAMP.tar.gz |

Retention: last 7 days. Older files are deleted automatically.
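Internally, the retention pass presumably reduces to a find-and-delete over each backup subdirectory; a minimal sketch (the actual logic lives in scripts/backup.sh and may differ):

```shell
# prune_backups: delete files older than 7 days under the given directory
# (hedged sketch of the retention pass; paths are the defaults from this guide)
prune_backups() {
  dir=$1
  [ -d "$dir" ] || return 0            # nothing to do if the directory is absent
  find "$dir" -type f -mtime +7 -delete
}

prune_backups /var/backups/hive/db
prune_backups /var/backups/hive/projects
```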

Setting up the cron job

# 1. Make the script executable (done once)
chmod +x /opt/openhive/scripts/backup.sh

# 2. Open the crontab editor
crontab -e

# 3. Add this line (runs at 02:15 local time daily)
15 2 * * * /opt/openhive/scripts/backup.sh >> /var/log/hive-backup.log 2>&1

The script reads credentials from environment variables. Set them in /etc/environment or a cron-specific env file:

DB_HOST=localhost
DB_PORT=5432
DB_NAME=hive
DB_USER=hive
DB_PASSWORD=<your-password>
HIVE_PROJECTS_DIR=/opt/openhive/.runtime/projects
HIVE_BACKUP_DIR=/var/backups/hive

Manual backup

# Run immediately (uses defaults from env)
./scripts/backup.sh

# Override backup directory
./scripts/backup.sh /mnt/external/backups

Restoring from backup

Restore database:

# Stop the Gateway first to prevent writes during restore
docker compose stop gateway

# Decompress and restore
gunzip -c /var/backups/hive/db/hive-TIMESTAMP.sql.gz | \
  PGPASSWORD=$DB_PASSWORD psql \
    --host=localhost --port=5432 \
    --username=hive hive

# Restart Gateway
docker compose start gateway

Restore projects directory:

# Decompress the archive into the correct location
tar --extract --gzip \
    --file=/var/backups/hive/projects/projects-TIMESTAMP.tar.gz \
    --directory=/opt/openhive/.runtime/

Health Checks

/healthz and /dashboard-healthz endpoints

The Gateway exposes a health endpoint at GET /healthz:

{
  "status": "ok",
  "db": "healthy",
  "agents": { "active": 2 }
}
| Field | Values |
| --- | --- |
| status | "ok" or "degraded" |
| db | "healthy" or "unreachable" |
| agents.active | Number of currently active agent instances |

# Quick Gateway check
curl http://localhost:8080/healthz | jq .

# Dashboard container probe when running the standalone web server
curl http://localhost:3000/dashboard-healthz | jq .

Agent runtime pods and the sandbox API also expose GET /healthz for probe use:

# Agent runtime probe
curl http://localhost:8090/healthz | jq .

# Sandbox probe
curl http://localhost:8091/healthz | jq .

Agent runtime readiness now distinguishes startup from bootstrap failures:

{
  "status": "error",
  "role": "agent",
  "runtime_ready": false,
  "agent_id": "keeper:proj_a",
  "project_id": "proj_a",
  "controller_id": "gateway",
  "deployment_backend": "kubernetes",
  "readiness_reason": "RuntimeError: relay unavailable"
}

Readiness guidance:

| Field | Meaning |
| --- | --- |
| status=ready | Runtime is serving work and the readiness probe should pass |
| status=starting | Runtime has not finished bootstrap yet |
| status=error | Bootstrap failed; inspect readiness_reason before restarting blindly |
| agent_id, project_id, controller_id | Pod-to-agent ownership mapping for operator triage |
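For scripted triage, the payload can be classified with a small helper (a convenience sketch for this guide, not part of OpenHive; field names match the example payload above):

```shell
# classify_readiness: summarise a runtime /healthz payload read from stdin
classify_readiness() {
  payload=$(cat)
  status=$(printf '%s\n' "$payload" \
    | sed -n 's/.*"status": *"\([^"]*\)".*/\1/p' | head -n 1)
  case "$status" in
    ready)    echo "ready: probe should pass" ;;
    starting) echo "starting: bootstrap not finished" ;;
    error)
      reason=$(printf '%s\n' "$payload" \
        | sed -n 's/.*"readiness_reason": *"\([^"]*\)".*/\1/p')
      echo "error: $reason"
      ;;
    *)        echo "unknown status: $status" ;;
  esac
}

# Typical use against a live agent runtime:
#   curl -s http://localhost:8090/healthz | classify_readiness
```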

Kubernetes diagnostics flow

For Kubernetes-backed preview deployments, the quickest operator loop is:

  1. inspect the failing pod: kubectl get pods -A -o wide
  2. query the pod probe payload: kubectl exec ... -- wget -qO- http://localhost:8090/healthz
  3. map the pod back to OpenHive ownership through annotations such as openhive.io/agent-id, openhive.io/project-id, openhive.io/agent-role, and openhive.io/controller-id
  4. for Keeper dev-task investigations, fetch the task through /dev-tasks/{task_id} and inspect the nested runtime block for backend_run_id, execution_class, artifact_root, and log_root
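The ownership-mapping step can be done with a jsonpath query (annotation keys are from this guide; the pod name and namespace are placeholders):

```shell
# Print the OpenHive ownership annotations for a pod
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.openhive\.io/agent-id}{"  "}{.metadata.annotations.openhive\.io/project-id}{"\n"}'
```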

Request tracing

Every HTTP request and Feishu WebSocket event is assigned a trace_id. All log lines within a request chain carry the same trace_id field.

# Correlate all log lines for a single request
docker compose logs gateway | grep '"trace_id": "abc123def456"'

HTTP responses include the trace ID in the X-Trace-Id header for easy correlation from client logs.
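For scripted correlation, the header can be peeled off with a small helper (a convenience sketch, not part of OpenHive):

```shell
# extract_trace_id: read raw HTTP response headers on stdin and print the
# X-Trace-Id value
extract_trace_id() {
  awk 'tolower($1) == "x-trace-id:" { print $2 }' | tr -d '\r'
}

# Typical use against a live Gateway:
#   curl -s -D - -o /dev/null http://localhost:8080/healthz | extract_trace_id
```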


Log Management

Structlog writes JSON-formatted lines to stdout, which Docker captures via its logging driver.

# Stream all logs
docker compose logs -f gateway

# Filter by log level
docker compose logs gateway 2>&1 | grep '"log_level": "error"'

# Last 1000 lines
docker compose logs --tail=1000 gateway

For production, consider shipping logs to a centralised store (Loki, Datadog, etc.) by configuring the Docker logging driver in docker-compose.yml.
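Even without a central store, capping local log growth is worth doing. A hedged docker-compose.yml fragment using Docker's built-in json-file driver options (size and file count are illustrative):

```yaml
services:
  gateway:
    logging:
      driver: json-file
      options:
        max-size: "50m"   # rotate each log file at 50 MB
        max-file: "5"     # keep at most 5 rotated files
```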


Disk Space

Monitor the backup directory and Docker volumes:

# Backup sizes
du -sh /var/backups/hive/db/* /var/backups/hive/projects/*

# Docker volume (PostgreSQL data)
docker system df -v | grep pgdata

# Projects directory
du -sh .runtime/projects/