feat(lakekeeper): pin pod resources + Prometheus scrape annotations on the CR #684
Merged
The per-org Lakekeeper CRs the control plane provisions had no resource
requests/limits (pods ran BestEffort, first to evict under node pressure) and
no Prometheus scrape annotations (the managed-warehouse clusters discover
targets by pod annotation via vmagent kubernetes_sd — no prometheus-operator —
so the catalog pods were never scraped).
EnsureCR now sets:
- spec.resources with requests == limits (Guaranteed QoS) — 500m CPU / 512Mi,
tunable via the lakekeeperPodCPU/Memory consts. Lakekeeper is a light Rust
REST catalog, so a modest fixed shape is plenty.
- spec.podMetadata.annotations {prometheus.io/scrape,port,path} so vmagent
scrapes the operator's metrics container port (default 9000).
spec.podMetadata requires the passthrough added to PostHog's operator fork
(PostHog/lakekeeper-operator branch posthog/serviceaccountname). On an operator
without it the CRD prunes the field and the annotations are dropped — a safe
no-op until the new operator image ships, so this can merge in either order.
Extends TestEnsureCR_CreateAndShape to cover both blocks.
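A minimal sketch of the two blocks described above, written as plain Go maps rather than the provisioner's real types. The function names, the lakekeeperPodMemory, metricsPort, and metricsPath constant names, and the /metrics path are assumptions for illustration. Only the 500m / 512Mi values, port 9000, and the prometheus.io/scrape, port, and path annotation keys come from the message above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical constants; values follow the commit message.
const (
	lakekeeperPodCPU    = "500m"
	lakekeeperPodMemory = "512Mi"
	metricsPort         = "9000"     // operator's default metrics container port
	metricsPath         = "/metrics" // assumption: the path is not stated in the source
)

// podShape returns a fresh cpu/memory map so requests and limits never alias.
func podShape() map[string]any {
	return map[string]any{"cpu": lakekeeperPodCPU, "memory": lakekeeperPodMemory}
}

// podResources builds spec.resources with requests == limits (Guaranteed QoS).
func podResources() map[string]any {
	return map[string]any{"requests": podShape(), "limits": podShape()}
}

// podMetadata builds spec.podMetadata with the annotations that
// annotation-based discovery looks for on the pod.
func podMetadata() map[string]any {
	return map[string]any{
		"annotations": map[string]any{
			"prometheus.io/scrape": "true",
			"prometheus.io/port":   metricsPort,
			"prometheus.io/path":   metricsPath,
		},
	}
}

func main() {
	spec := map[string]any{
		"resources":   podResources(),
		"podMetadata": podMetadata(),
	}
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Annotation values are strings because Kubernetes annotations are string-to-string maps, so the port is written as "9000" rather than as a number.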
benben added a commit that referenced this pull request on Jun 5, 2026
…isioned orgs (#686)

reconcileLakekeeper previously early-returned for any org with LakekeeperEndpoint already set, on the assumption that the operator's reconcile loop would carry future drift. That holds for fields already in the CR (e.g. image), but the operator can't add fields that aren't there, so a new field in the desired CR spec (resources, podMetadata, ...) never reaches existing per-org Lakekeepers.

The Ready loop now re-applies just the CR spec for already-provisioned orgs:
- buildCRSpec(w, in): single source of truth for the desired CR spec, shared by EnsureForOrg and the new path so the two never diverge.
- EnsureCRSpec(ctx, w, in): lightweight drift correction. Calls only k8s.EnsureCR (create-or-update), skipping the DB/Secret/REST pipeline. Idempotent; preserves the operator-owned status.
- controller: the LakekeeperEndpoint-set branch resolves inputs and calls EnsureCRSpec instead of returning.

The operator rolls the Deployment only when the spec actually changes. This is what makes the resources + podMetadata additions (PR #684) land on existing catalog pods rather than only new ones.

Off main; complementary to #684. Until that merges, EnsureCR writes the same spec it does today, so this is a no-op on existing CRs.

Rewrites TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned -> _DriftCorrectsWhenAlreadyProvisioned to assert the CR is re-applied (image only, no pipeline side effects). All provisioner tests pass.
benben added a commit that referenced this pull request on Jun 5, 2026
Three fixes to how the provisioner converges per-org Lakekeeper pods, found when prod pods stayed BestEffort/unscraped after the resources+podMetadata rollout (#684/#686):

1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632 that preserves hyphens, but a legacy org's CR (and its Secret, SA, and EKS pod-identity, all derived from the no-hyphen Duckling XR name) keeps the de-hyphenated name. So #686 created duplicate hyphenated CRs that had no Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched. PatchPodShape now lists CRs by the duckgres/active-org label and patches whatever name exists (de-hyphenated legacy, hyphenated new), so no duplicates are created.

2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race to the operator's frequent status writes under multiple control-plane replicas ("object has been modified", every tick). PatchPodShape uses a JSON merge patch (no resourceVersion). It also needs no inputs, so the per-tick "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed CRs go away.

3. Pod shape per request: replicas 2, and resources requests-only (no limits means Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in the patch to strip any stale limit block. EnsureCR's default replicas is now 2 as well.

Resource/podMetadata maps are shared between EnsureCR (create) and PatchPodShape (drift) so they can't diverge. Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it with no inputs.

Tests: rewrote the drift test to assert label-matched patching + no duplicate + no inputs; added PatchPodShape coverage (two names, limits stripped); updated CreateAndShape for requests-only + 2 replicas. All provisioner tests pass.
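The merge-patch behaviour in points 2 and 3 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the provisioner's code: podShapePatch and applyMergePatch are hypothetical names, the request values are assumed to be the 500m / 512Mi figures from #684, and the image and status fields in main are placeholders. In a real cluster the API server applies the patch; applyMergePatch only mimics the JSON Merge Patch rules (RFC 7386) locally to show why a null limits value strips a stale block and why no resourceVersion is involved.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podShapePatch builds the merge-patch body (hypothetical helper).
// It carries no metadata.resourceVersion, so it cannot lose an
// optimistic-concurrency race against the operator's status writes.
func podShapePatch() map[string]any {
	return map[string]any{
		"spec": map[string]any{
			"replicas": 2,
			"resources": map[string]any{
				// Assumed values, carried over from #684.
				"requests": map[string]any{"cpu": "500m", "memory": "512Mi"},
				// nil marshals to JSON null, which means "delete this key".
				"limits": nil,
			},
		},
	}
}

// applyMergePatch applies RFC 7386 semantics to decoded JSON objects,
// mutating and returning target: null deletes a key, objects merge
// recursively, and any other value replaces the target's value.
func applyMergePatch(target, patch map[string]any) map[string]any {
	if target == nil {
		target = map[string]any{}
	}
	for key, value := range patch {
		if value == nil {
			delete(target, key)
			continue
		}
		patchObj, isObj := value.(map[string]any)
		if !isObj {
			target[key] = value
			continue
		}
		// A non-object target value is treated as an empty object.
		targetObj, _ := target[key].(map[string]any)
		target[key] = applyMergePatch(targetObj, patchObj)
	}
	return target
}

func main() {
	// Placeholder CR from the earlier rollout, with a stale limits block.
	existing := map[string]any{
		"spec": map[string]any{
			"image":    "example/lakekeeper:tag",
			"replicas": 1,
			"resources": map[string]any{
				"requests": map[string]any{"cpu": "500m", "memory": "512Mi"},
				"limits":   map[string]any{"cpu": "500m", "memory": "512Mi"},
			},
		},
		"status": map[string]any{"ready": true},
	}

	body, err := json.Marshal(podShapePatch())
	if err != nil {
		panic(err)
	}
	fmt.Println("patch body:", string(body))

	patched := applyMergePatch(existing, podShapePatch())
	out, err := json.MarshalIndent(patched, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Fields the patch does not mention (image and status here) are left alone, which is why this approach does not need the resolved inputs that a full create-or-update requires.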
benben added a commit that referenced this pull request on Jun 5, 2026
#687)
* fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas
* docs(agents): forbid exposing customer/internal data in this public repo
What
EnsureCR (per-org Lakekeeper provisioner) now sets two previously-missing parts of the CR spec:
- spec.resources: requests == limits (Guaranteed QoS), 500m CPU / 512Mi mem, via the lakekeeperPodCPU/Memory consts. Lakekeeper is a light Rust REST catalog, so a modest fixed shape is plenty; bump the consts if a tenant needs more.
- spec.podMetadata.annotations: prometheus.io/scrape|port|path (port 9000) so the managed-warehouse clusters' annotation-based vmagent (kubernetes_sd, no prometheus-operator) scrapes the catalog pods.
Why
Per-org Lakekeeper pods previously ran BestEffort (no requests/limits, so first evicted under node pressure) and were never scraped (no scrape annotation; the CRD exposed no pod-metadata hook).
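For readers less familiar with the QoS classes named on this page (BestEffort, Guaranteed, Burstable), the sketch below shows how the class follows from a pod's requests and limits. It is a simplified illustration, not Kubernetes source: the container and resourceList types and the qosClass function are hypothetical, and quantities are compared as literal strings, whereas Kubernetes compares parsed quantities.

```go
package main

import "fmt"

// resourceList maps a resource name ("cpu", "memory") to a quantity string.
type resourceList map[string]string

// container holds only the fields relevant to QoS classification.
type container struct {
	requests resourceList
	limits   resourceList
}

// qosClass returns the QoS class for a pod's containers:
//   - BestEffort: no container sets any cpu/memory request or limit
//   - Guaranteed: every container has cpu and memory limits, with
//     requests equal to them (a missing request defaults to the limit)
//   - Burstable: everything else
func qosClass(containers []container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		for _, name := range []string{"cpu", "memory"} {
			req, hasReq := c.requests[name]
			lim, hasLim := c.limits[name]
			if hasReq || hasLim {
				anySet = true
			}
			if hasLim && !hasReq {
				req, hasReq = lim, true // request defaults to the limit
			}
			if !hasLim || !hasReq || req != lim {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	shape := resourceList{"cpu": "500m", "memory": "512Mi"}

	before := []container{{}}                                // no requests or limits
	pinned := []container{{requests: shape, limits: shape}}  // requests == limits
	requestsOnly := []container{{requests: shape}}           // requests without limits

	fmt.Println(qosClass(before))
	fmt.Println(qosClass(pinned))
	fmt.Println(qosClass(requestsOnly))
}
```

Under node pressure, BestEffort pods are the first candidates for eviction, which is the motivation given above for setting requests at all.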
Dependency / ordering
spec.podMetadata needs the passthrough added to PostHog's operator fork (PostHog/lakekeeper-operator branch posthog/serviceaccountname, merged via #1) and shipped in operator image 9e1e0fb. On an operator without it the CRD prunes the field and the annotations are silently dropped, which is a safe no-op, so this can merge in any order relative to the charts image bump.
Tests
Extends TestEnsureCR_CreateAndShape to assert both blocks. All provisioner tests pass (go test -tags kubernetes ./controlplane/provisioner/).
🤖 Generated with Claude Code