Distributed lease coordination service for Kubernetes multi-cluster workloads.
Berth provides TTL-based distributed leases that coordinate exclusive or shared
access to resources across Kubernetes clusters. Leases are expressed as
Kubernetes custom resources (BerthLease), managed via an API server, and
reconciled by an operator that can suspend or resume workloads in response to
lease state transitions.
Berth ships three binaries:
| Binary | Purpose |
|---|---|
| apiserver | HTTPS API server for lease operations |
| operator | Kubernetes controller that reconciles BerthLease resources |
| berth | CLI client for interacting with the API server |
A lease moves through these states:

```
acquire ──► held ──► released
             └──► expired (TTL without heartbeat)
```
- A holder acquires a lease by creating a `BerthLease` resource with its identity and desired TTL.
- While held, the holder sends periodic heartbeats to reset the TTL clock.
- The holder releases the lease explicitly, or it expires when the TTL elapses without a heartbeat.
Each lease declares an acquisition mode:
- `at-most-once` — guarantees that at most one holder can hold the lease at any time. Use this for leader election and exclusive resource access.
- `at-least-once` — permits concurrent holders. Use this for availability-oriented coordination where brief overlap is acceptable.
A lease can optionally reference a Kubernetes workload via `target`. When
configured, the operator applies `acquireAction` and `releaseAction` to the
target in response to lease transitions. Two action shapes are supported, and
at most one may be set per action:

- `suspend` — toggles `spec.suspend` on the target. Use for CronJob.
- `scale` — patches the target's scale subresource. Use for Deployment, StatefulSet, or ReplicaSet. A typical singleton wires `acquireAction.scale.replicas` to the desired running count and `releaseAction.scale.replicas: 0`.
CronJob singleton (suspend action):

```yaml
apiVersion: berth.skaphos.io/v1alpha1
kind: BerthLease
metadata:
  name: ingest-coordinator
  namespace: pipeline
spec:
  leaseName: "ingest-coordinator"
  holderIdentity: "worker-east-1"
  ttlSeconds: 30
  heartbeatIntervalSeconds: 10
  semantics: "at-most-once"
  target:
    apiVersion: batch/v1
    kind: CronJob
    name: ingest-pipeline
  acquireAction:
    suspend: false
  releaseAction:
    suspend: true
```

Cross-cluster Deployment singleton (scale action):
```yaml
apiVersion: berth.skaphos.io/v1alpha1
kind: BerthLease
metadata:
  name: ingest-worker
  namespace: pipeline
spec:
  leaseName: "ingest-worker"
  holderIdentity: "ignored-when-operator-runs-with-cluster-id"
  ttlSeconds: 30
  heartbeatIntervalSeconds: 10
  semantics: "at-most-once"
  target:
    apiVersion: apps/v1
    kind: Deployment
    name: ingest-worker
  acquireAction:
    scale:
      replicas: 3
  releaseAction:
    scale:
      replicas: 0
```

Apply the same manifest unchanged to every cluster. Each cluster's operator
must run with a distinct `--cluster-id`:

```sh
# cluster-east
operator --berth-api-server https://berth.example.com:8443 --cluster-id cluster-east

# cluster-west
operator --berth-api-server https://berth.example.com:8443 --cluster-id cluster-west
```

`--cluster-id`, when set, overrides `spec.holderIdentity` and is used as the
holder identity for every Acquire call. Only one cluster's operator will hold
the lease at a time and scale its Deployment to 3 replicas; the others scale
to 0. When --cluster-id is not set, the operator falls back to
spec.holderIdentity — useful when an external client manages identity
itself.
Failover RTO is bounded by `ttlSeconds` + reacquire interval. With the
example above (`ttlSeconds: 30`, `heartbeatIntervalSeconds: 10`):
- The holder cluster heartbeats every 10 seconds.
- If the holder dies or is partitioned, the lease becomes reclaimable 30 seconds after its last successful heartbeat.
- Standby operators retry Acquire every `min(heartbeatIntervalSeconds, ttlSeconds/3)` — 10 seconds in this case — so within ~40 seconds total a standby cluster acquires and scales its Deployment up.
Tune `ttlSeconds` to trade off failover speed against tolerance for
transient API-server unreachability. A 30-second TTL is a reasonable
default; shorter TTLs (≤10s) make the system jittery under network
hiccups, while longer ones (≥60s) extend failover time.
Split-brain window. When the holder loses connectivity to the API
server, its Deployment continues running until that operator next
reconciles and observes its Acquire return `Acquired=false`. During the
window spanning (a) server-side TTL expiry, (b) the standby successfully
reacquiring and scaling up, and (c) the original holder noticing it lost
the lease and scaling down, both clusters can be running their
Deployment. Two mitigations:
- Short TTL + short heartbeat narrow the window. With the defaults above the worst case is ~10 seconds (one reconcile cycle).
- Fencing tokens are returned by Acquire/Renew. The Berth API server itself rejects writes from a stale holder (the operator can't accidentally Release/Renew a lease it has lost). True end-to-end fencing — where the Deployment's downstream calls are also rejected when stale — requires the workload to validate the token, which is out of scope for the operator-as-holder pattern.
For workloads where momentary overlap is unacceptable, run with a very short TTL or use the application-level pattern where the workload itself acquires the lease (and exits when it loses it).
The `pkg/client` package provides a Go client for the API server:

```go
import "github.com/skaphos/berth/pkg/client"

c := client.New("https://berth.example.com:8443",
    client.WithAPIKey("my-api-key"),
    client.WithTLSConfig(tlsCfg),
)
if err := c.Ping(ctx); err != nil {
    log.Fatal(err)
}
```

```sh
# List all leases
berth --api-server https://berth.example.com:8443 --api-key $BERTH_KEY lease list

# Get a specific lease
berth lease get ingest-coordinator

# Release a lease
berth lease release ingest-coordinator
```

- Kubernetes 1.28+
- Helm 3
```sh
kubectl apply -f config/crd/berthlease.yaml
```

```sh
# API server
helm install berth-apiserver deploy/helm/berth-apiserver \
  --set image.repository=your-registry/berth-apiserver \
  --set image.tag=latest

# Operator
helm install berth-operator deploy/helm/berth-operator \
  --set image.repository=your-registry/berth-operator \
  --set image.tag=latest
```

```
--listen-addr              Listen address (default ":8443")
--tls-cert-file            Path to TLS certificate (required)
--tls-key-file             Path to TLS private key (required)
--coordination-kubeconfig  Path to a kubeconfig pointing at the coordination
                           cluster (empty = in-cluster config)
--coordination-namespace   Namespace in the coordination cluster where Berth
                           lease objects are stored. When empty, the API
                           server falls back to an in-memory store (dev only
                           — state is lost on restart and HA is not possible).
--auth-mode                'none', 'static-keys', or 'oidc'. Defaults to
                           'static-keys' when --coordination-namespace is set;
                           defaults to 'none' otherwise. Use 'none' only for
                           dev — the server logs a loud warning at startup.
--api-keys-file            Path to a file of '<key-id>:<sha256-hex>' entries.
                           Required when --auth-mode=static-keys. SIGHUP
                           reloads the file in place (no restart needed).
--oidc-issuer-url          OIDC issuer URL (e.g. https://your-org.okta.com/oauth2/default,
                           https://pingfed.example.com). Required when --auth-mode=oidc.
--oidc-audience            Expected JWT 'aud' claim. Required when --auth-mode=oidc.
--oidc-required-claim      Repeatable key=value claim that must be present
                           (string or string-array). Example: groups=berth-clients.
--oidc-username-claim      JWT claim copied into the identity holder field
                           (default 'sub').
--oidc-tenant-claim        JWT claim copied into the identity tenant field
                           (default 'sub'); array-valued claims use the first element.
--oidc-jwks-url            Override the JWKS URL discovered from the issuer
                           (rarely needed).
```
The API server accepts bearer-token auth on the `/v1alpha1/*` endpoints when
`--auth-mode=static-keys` is set (the default in production). `/healthz`
remains unauthenticated.
The `--api-keys-file` is a plain-text file with one entry per line:

```
# Berth API keys — comments and blank lines ignored.
team-a:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
team-b:fedcba9876543210fedcba9876543210fedcba9876543210fedcba9876543210
```
The hash is the SHA-256 of the raw token. The API server stores only hashes; the raw token lives only on the client side (operator-mounted Secret). Generate a key like this:

```sh
RAW=$(openssl rand -hex 32)
HASH=$(printf '%s' "$RAW" | sha256sum | awk '{print $1}')
echo "team-a:$HASH"  # add to the keys file
echo "$RAW"          # distribute via the operator's --berth-api-key Secret
```

Rotate keys by editing the file and sending SIGHUP to the API server pod.
The current key set is replaced atomically; if the new file is malformed,
the previous key set is preserved.
For production deployments where you want short-lived, IdP-issued tokens instead of long-lived static keys, run the API server with OIDC:

```sh
berth-apiserver \
  --auth-mode=oidc \
  --oidc-issuer-url=https://your-org.okta.com/oauth2/default \
  --oidc-audience=berth-api \
  --oidc-required-claim=groups=berth-clients
```

For PingFederate, swap the issuer URL: `--oidc-issuer-url=https://pingfed.example.com`.
For Entra (Azure AD): `https://login.microsoftonline.com/<tenant-id>/v2.0`.
Berth fetches `<issuer>/.well-known/openid-configuration` at startup,
validates the JWT signature against the JWKS, and rejects tokens with the
wrong `iss`/`aud`/`exp` or missing required claims.
The operator side then uses a sidecar token broker. Berth ships a
reference broker as `berth-oidc-broker`:

```yaml
# operator pod sketch
spec:
  containers:
  - name: token-broker
    image: ghcr.io/skaphos/berth-oidc-broker:latest
    args:
    - --oidc-issuer-url=https://your-org.okta.com/oauth2/default
    - --oidc-client-id=$(OIDC_CLIENT_ID)
    - --oidc-client-secret-file=/etc/berth-oidc/secret
    - --oidc-audience=berth-api
    - --output=/var/run/berth/token
    env:
    - { name: OIDC_CLIENT_ID, valueFrom: { secretKeyRef: { name: berth-oidc, key: client-id } } }
    volumeMounts:
    - { name: token, mountPath: /var/run/berth }
    - { name: oidc-secret, mountPath: /etc/berth-oidc, readOnly: true }
  - name: operator
    image: ghcr.io/skaphos/berth-operator:latest
    args:
    - --berth-api-server=https://berth.example.com:8443
    - --berth-api-key-file=/var/run/berth/token
    - --cluster-id=cluster-east
    volumeMounts:
    - { name: token, mountPath: /var/run/berth, readOnly: true }
  volumes:
  - { name: token, emptyDir: { medium: Memory } }
  - { name: oidc-secret, secret: { secretName: berth-oidc } }
```

The broker performs the OAuth2 client-credentials flow against the IdP,
writes the access token atomically to the shared Memory-backed volume, and
refreshes it well before expiry. The operator picks up rotations via its
`--berth-api-key-file` watcher (1-second cache TTL).
For Entra/Azure AD, AWS Cognito, Google Cloud, and other IdPs that need
extra parameters on the token request, the broker accepts
`--oidc-audience` (passed as the `audience` form parameter, which is what
Auth0 and some Okta authorization servers require) and `--oidc-scopes`.
For more exotic flows (token exchange, certificate-bound tokens) you can
substitute your own broker — the operator only cares that the file at
`--berth-api-key-file` contains a valid bearer token.
```
--metrics-bind-address       Metrics endpoint (default ":8080")
--health-probe-bind-address  Health probe endpoint (default ":8081")
--berth-api-server           Berth API server base URL (required)
--berth-api-key              Static bearer token. Mutually exclusive with
                             --berth-api-key-file.
--berth-api-key-file         Path to a file containing the bearer token.
                             Re-read on each request (cached briefly), so an
                             external sidecar (typically the OIDC broker)
                             can rotate it without restarting the operator.
                             Mutually exclusive with --berth-api-key.
--cluster-id                 Cluster-distinct holder identity. When set,
                             overrides spec.holderIdentity on every Acquire
                             call. Required for the cross-cluster singleton
                             pattern; leave empty to fall back to
                             spec.holderIdentity.
```
The API server's lease state is authoritative for at-most-once semantics across clusters. Two backends are available:
| Backend | When | Durability | HA |
|---|---|---|---|
| K8s coordination cluster (default in production) | `--coordination-namespace` is set | State persists in `coordination.k8s.io/v1` Lease objects in the named namespace | API server can be scaled to multiple replicas; they share state via the kube-apiserver |
| In-memory (dev/demo only) | `--coordination-namespace` is empty | None — state is lost on restart | Single replica only |
For the production backend, point `--coordination-kubeconfig` at a small
dedicated cluster — not at one of the tenant clusters that Berth
coordinates Deployments on, since losing that cluster would also lose the
lease store. A managed control plane (EKS/GKE/AKS) is fine. Berth pools all
leases for all tenants under `--coordination-namespace`; the coordination
cluster does not need per-tenant namespaces.
Requires Go 1.26+.
```sh
make build         # Build all binaries to bin/
make test          # Run tests
make lint          # Run golangci-lint and go vet
make generate      # Regenerate deepcopy code
make manifests     # Regenerate CRD manifests
make docker-build  # Build Docker images
```

```
api/v1alpha1/       Kubernetes CRD types (BerthLease, BerthLeaseList)
cmd/apiserver/      API server entrypoint
cmd/operator/       Operator entrypoint
cmd/berth/          CLI entrypoint
internal/api/       HTTP server, routes, middleware
internal/auth/      Authentication (Authenticator interface, static keys)
internal/lease/     Lease state, Store interface, Manager, TTL enforcement
internal/operator/  Kubernetes reconciler (BerthLeaseReconciler)
internal/tenant/    Tenant resolution (Resolver interface)
internal/console/   Web console server (placeholder)
internal/k8s/       Kubernetes client initialization
pkg/client/         Public Go client library
config/crd/         Generated CRD manifests
config/rbac/        RBAC manifests
deploy/helm/        Helm charts for API server and operator
```
See LICENSE.