feat(lakekeeper): pin pod resources + Prometheus scrape annotations on the CR #684

Merged
benben merged 1 commit into main from ben/lakekeeper-cr-resources-scrape
Jun 5, 2026

@benben benben commented Jun 5, 2026

What

EnsureCR (per-org Lakekeeper provisioner) now sets two previously-missing parts of the CR spec:

  • spec.resources: requests == limits (Guaranteed QoS), 500m CPU / 512Mi memory, via the lakekeeperPodCPU/Memory consts. Lakekeeper is a light Rust REST catalog, so a modest fixed shape is plenty; bump the consts if a tenant needs more.
  • spec.podMetadata.annotations: prometheus.io/scrape|port|path (port 9000), so the managed-warehouse clusters' annotation-based vmagent (kubernetes_sd, no prometheus-operator) scrapes the catalog pods.
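
As a rough illustration of the two blocks above, the sketch below builds the corresponding spec fragment as a plain map and prints it as JSON. The const names come from this description and the values mirror the stated 500m / 512Mi shape; metricsPath, the "true" scrape value, and the helper names resourceList and podShapeSpec are assumptions for illustration, not the provisioner's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Const names are taken from the PR text; metricsPath is an assumption
// (the PR states the port but not the path value).
const (
	lakekeeperPodCPU    = "500m"
	lakekeeperPodMemory = "512Mi"
	metricsPort         = "9000"
	metricsPath         = "/metrics"
)

// resourceList returns a fresh map on every call so that requests and
// limits never alias the same underlying map.
func resourceList() map[string]any {
	return map[string]any{"cpu": lakekeeperPodCPU, "memory": lakekeeperPodMemory}
}

// podShapeSpec (hypothetical helper) builds only the two spec blocks
// described above: resources and podMetadata.annotations.
func podShapeSpec() map[string]any {
	return map[string]any{
		"resources": map[string]any{
			"requests": resourceList(),
			"limits":   resourceList(), // requests == limits -> Guaranteed QoS
		},
		"podMetadata": map[string]any{
			"annotations": map[string]any{
				"prometheus.io/scrape": "true", // conventional value, assumed
				"prometheus.io/port":   metricsPort,
				"prometheus.io/path":   metricsPath,
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(map[string]any{"spec": podShapeSpec()}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Annotation values are strings on purpose: Kubernetes annotations are string-to-string, so the port is "9000" rather than a number.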

Why

Per-org Lakekeeper pods previously ran BestEffort (no requests/limits → first evicted under node pressure) and were never scraped (no scrape annotation; the CRD exposed no pod-metadata hook).
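
For readers less familiar with QoS classes, here is a simplified single-container sketch of how Kubernetes derives them from requests and limits. The Resources type and qosClass function are hypothetical, and quantities are compared as raw strings, which the real API server does not do (it parses them, so "0.5" and "500m" would be equal there).

```go
package main

import "fmt"

// Resources is a hypothetical, simplified view of one container's
// resource block: resource name -> quantity string.
type Resources struct {
	Requests map[string]string
	Limits   map[string]string
}

// qosClass approximates the Kubernetes QoS rules for a single-container pod:
//   - nothing set at all                              -> BestEffort
//   - cpu and memory limits set, requests equal/unset -> Guaranteed
//   - anything else                                   -> Burstable
func qosClass(r Resources) string {
	if len(r.Requests) == 0 && len(r.Limits) == 0 {
		return "BestEffort"
	}
	for _, name := range []string{"cpu", "memory"} {
		lim, ok := r.Limits[name]
		if !ok {
			return "Burstable"
		}
		// An unset request defaults to the limit, so only a differing
		// explicit request breaks the Guaranteed condition.
		if req, set := r.Requests[name]; set && req != lim {
			return "Burstable"
		}
	}
	return "Guaranteed"
}

func main() {
	shape := map[string]string{"cpu": "500m", "memory": "512Mi"}
	fmt.Println("before this PR:", qosClass(Resources{}))
	fmt.Println("after this PR: ", qosClass(Resources{Requests: shape, Limits: shape}))
}
```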

Dependency / ordering

spec.podMetadata needs the passthrough added to PostHog's operator fork (PostHog/lakekeeper-operator branch posthog/serviceaccountname, merged via #1) and shipped in operator image 9e1e0fb. On an operator without it the CRD prunes the field and the annotations are silently dropped — a safe no-op, so this can merge in any order relative to the charts image bump.

Tests

Extends TestEnsureCR_CreateAndShape to assert both blocks. All provisioner tests pass (go test -tags kubernetes ./controlplane/provisioner/).

🤖 Generated with Claude Code

…n the CR

The per-org Lakekeeper CRs the control plane provisions had no resource
requests/limits (pods ran BestEffort, first to evict under node pressure) and
no Prometheus scrape annotations (the managed-warehouse clusters discover
targets by pod annotation via vmagent kubernetes_sd — no prometheus-operator —
so the catalog pods were never scraped).

EnsureCR now sets:
- spec.resources with requests == limits (Guaranteed QoS) — 500m CPU / 512Mi,
  tunable via the lakekeeperPodCPU/Memory consts. Lakekeeper is a light Rust
  REST catalog, so a modest fixed shape is plenty.
- spec.podMetadata.annotations {prometheus.io/scrape,port,path} so vmagent
  scrapes the operator's metrics container port (default 9000).

spec.podMetadata requires the passthrough added to PostHog's operator fork
(PostHog/lakekeeper-operator branch posthog/serviceaccountname). On an operator
without it the CRD prunes the field and the annotations are dropped — a safe
no-op until the new operator image ships, so this can merge in either order.

Extends TestEnsureCR_CreateAndShape to cover both blocks.
@benben benben merged commit b4412de into main Jun 5, 2026
24 checks passed
@benben benben deleted the ben/lakekeeper-cr-resources-scrape branch June 5, 2026 10:58
benben added a commit that referenced this pull request Jun 5, 2026
…isioned orgs (#686)

reconcileLakekeeper previously early-returned for any org with
LakekeeperEndpoint already set, on the assumption that the operator's reconcile
loop would carry future drift. That holds for fields already in the CR (e.g.
image), but the operator can't add fields that aren't there — so a new field in
the desired CR spec (resources, podMetadata, ...) never reaches existing
per-org Lakekeepers.

The Ready loop now re-applies just the CR spec for already-provisioned orgs:
- buildCRSpec(w, in): single source of truth for the desired CR spec, shared by
  EnsureForOrg and the new path so the two never diverge.
- EnsureCRSpec(ctx, w, in): lightweight drift correction — calls only
  k8s.EnsureCR (create-or-update), skipping the DB/Secret/REST pipeline.
  Idempotent; preserves the operator-owned status.
- controller: the LakekeeperEndpoint-set branch resolves inputs and calls
  EnsureCRSpec instead of returning. The operator rolls the Deployment only when
  the spec actually changes.
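
A minimal sketch of the shape described in the list above, using an in-memory map in place of the Kubernetes API. Every type and signature here (Inputs, CRSpec, fakeCluster, and the simplified buildCRSpec/ensureCRSpec arguments) is a hypothetical stand-in; the real functions take (w, in) and (ctx, w, in) and talk to the cluster.

```go
package main

import "fmt"

// Inputs and CRSpec are hypothetical stand-ins for the provisioner's types.
type Inputs struct{ Image string }

type CRSpec struct {
	Image  string
	CPU    string
	Memory string
}

// buildCRSpec is the single place the desired spec is computed, so the
// first-time path and the drift path cannot disagree.
func buildCRSpec(in Inputs) CRSpec {
	return CRSpec{Image: in.Image, CPU: "500m", Memory: "512Mi"}
}

// fakeCluster stands in for the Kubernetes API: CR name -> stored spec.
type fakeCluster map[string]CRSpec

// ensureCR is create-or-update and reports whether the stored spec changed.
func (c fakeCluster) ensureCR(name string, want CRSpec) bool {
	if got, ok := c[name]; ok && got == want {
		return false // already converged: nothing for the operator to roll
	}
	c[name] = want
	return true
}

// ensureForOrg models the full first-time pipeline.
func ensureForOrg(c fakeCluster, name string, in Inputs) bool {
	// The DB / Secret / REST bootstrap steps would run here.
	return c.ensureCR(name, buildCRSpec(in))
}

// ensureCRSpec models the lightweight drift path: the CR spec only.
func ensureCRSpec(c fakeCluster, name string, in Inputs) bool {
	return c.ensureCR(name, buildCRSpec(in))
}

func main() {
	// An org provisioned before the resources fields existed (name and
	// image are made up for the example).
	c := fakeCluster{"org-a": {Image: "lakekeeper:v1"}}
	in := Inputs{Image: "lakekeeper:v1"}
	fmt.Println("first tick changed: ", ensureCRSpec(c, "org-a", in))
	fmt.Println("second tick changed:", ensureCRSpec(c, "org-a", in))
}
```

The second call reporting no change is the idempotence the commit message relies on: re-applying every tick is cheap once the spec has converged.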

This is what makes the resources + podMetadata additions (PR #684) land on
existing catalog pods rather than only new ones. Off main; complementary to
#684 — until that merges, EnsureCR writes the same spec it does today, so this
is a no-op on existing CRs.

Rewrites TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned ->
_DriftCorrectsWhenAlreadyProvisioned to assert the CR is re-applied (image only,
no pipeline side effects). All provisioner tests pass.
benben added a commit that referenced this pull request Jun 5, 2026
Three fixes to how the provisioner converges per-org Lakekeeper pods, found
when prod pods stayed BestEffort/unscraped after the resources+podMetadata
rollout (#684/#686):

1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name
   via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632
   that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS
   pod-identity, all derived from the no-hyphen Duckling XR name — keeps the
   de-hyphenated name. So #686 created duplicate hyphenated CRs that had no
   Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched.
   PatchPodShape now lists CRs by the duckgres/active-org label and patches
   whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates.

2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race
   to the operator's frequent status writes under multiple control-plane
   replicas ("object has been modified", every tick). PatchPodShape uses a JSON
   merge patch (no resourceVersion). It also needs no inputs, so the per-tick
   "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed
   CRs go away.

3. Pod shape per request: replicas 2, and resources requests-only (no limits →
   Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in
   the patch to strip any stale limit block. EnsureCR's default replicas is now
   2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and
   PatchPodShape (drift) so they can't diverge.
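
To make fix 1 concrete, the sketch below contrasts lookup by recomputed name with selection by label over an in-memory list. The label key is the one named above; the CR type, the example names, the org ID, and the assumption that the label value is the org ID are all hypothetical.

```go
package main

import "fmt"

// activeOrgLabel is the label key named in the commit message. That its
// value is the org ID is an assumption made for this sketch.
const activeOrgLabel = "duckgres/active-org"

// CR is a hypothetical, minimal view of a Lakekeeper custom resource.
type CR struct {
	Name   string
	Labels map[string]string
}

// findByName models the #686 approach: look up one recomputed name.
func findByName(crs []CR, name string) (CR, bool) {
	for _, cr := range crs {
		if cr.Name == name {
			return cr, true
		}
	}
	return CR{}, false
}

// selectByOrg models the label approach: return every CR labelled for
// orgID, whatever naming scheme produced its name.
func selectByOrg(crs []CR, orgID string) []CR {
	var out []CR
	for _, cr := range crs {
		if cr.Labels[activeOrgLabel] == orgID {
			out = append(out, cr)
		}
	}
	return out
}

func main() {
	// A legacy org whose CR kept a de-hyphenated name (made-up values).
	crs := []CR{{
		Name:   "lakekeeper-acmeeu1",
		Labels: map[string]string{activeOrgLabel: "acme-eu-1"},
	}}

	// The recomputed, hyphen-preserving name misses the real CR, which is
	// how a create-or-update ends up creating a duplicate.
	_, found := findByName(crs, "lakekeeper-acme-eu-1")
	fmt.Println("found by recomputed name:", found)

	for _, cr := range selectByOrg(crs, "acme-eu-1") {
		fmt.Println("found by label:", cr.Name)
	}
}
```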

Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it
with no inputs. Tests: rewrote the drift test to assert label-matched patching +
no duplicate + no inputs; added PatchPodShape coverage (two names, limits
stripped); updated CreateAndShape for requests-only + 2 replicas. All
provisioner tests pass.
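
The merge-patch behaviour in fixes 2 and 3 can be sketched without a cluster. The example below builds a patch body with limits set to null and applies it with a small JSON Merge Patch (RFC 7386) routine to show that the stale limits block is removed while untouched fields survive. podShapePatch and applyMergePatch are hypothetical helpers, and the field values and the "live" object are assumptions; the real PatchPodShape sends the body to the API server instead of merging locally.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podShapePatch (hypothetical) builds the merge-patch body: replicas 2,
// requests only, and an explicit null to strip any existing limits.
func podShapePatch() map[string]any {
	return map[string]any{"spec": map[string]any{
		"replicas": 2,
		"resources": map[string]any{
			"requests": map[string]any{"cpu": "500m", "memory": "512Mi"},
			"limits":   nil, // marshals to null: "delete this key"
		},
		"podMetadata": map[string]any{"annotations": map[string]any{
			"prometheus.io/scrape": "true",
			"prometheus.io/port":   "9000",
		}},
	}}
}

// applyMergePatch follows RFC 7386 for object patches: null deletes a key,
// nested objects merge recursively, everything else replaces the target.
func applyMergePatch(target, patch map[string]any) map[string]any {
	if target == nil {
		target = map[string]any{}
	}
	for k, v := range patch {
		if v == nil {
			delete(target, k)
			continue
		}
		if pm, ok := v.(map[string]any); ok {
			tm, _ := target[k].(map[string]any) // non-object target is treated as empty
			target[k] = applyMergePatch(tm, pm)
			continue
		}
		target[k] = v
	}
	return target
}

func main() {
	body, err := json.Marshal(podShapePatch())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // note: no resourceVersion anywhere in the body

	// A made-up live CR that still carries a limits block.
	live := map[string]any{"spec": map[string]any{
		"image": "lakekeeper:v1",
		"resources": map[string]any{
			"limits": map[string]any{"cpu": "500m", "memory": "512Mi"},
		},
	}}
	merged := applyMergePatch(live, podShapePatch())
	res := merged["spec"].(map[string]any)["resources"].(map[string]any)
	_, hasLimits := res["limits"]
	fmt.Println("limits still present:", hasLimits)
}
```

Because the body never carries a resourceVersion, the write does not compete with the operator's status updates the way a Get+Update does.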
benben added a commit that referenced this pull request Jun 5, 2026
#687)

* fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas

* docs(agents): forbid exposing customer/internal data in this public repo