feat(lakekeeper): drift-correct existing CRs instead of skipping provisioned orgs#686
Merged
Merged
Conversation
…isioned orgs reconcileLakekeeper previously early-returned for any org with LakekeeperEndpoint already set, on the assumption that the operator's reconcile loop would carry future drift. That holds for fields already in the CR (e.g. image), but the operator can't add fields that aren't there — so a new field in the desired CR spec (resources, podMetadata, ...) never reaches existing per-org Lakekeepers. The Ready loop now re-applies just the CR spec for already-provisioned orgs: - buildCRSpec(w, in): single source of truth for the desired CR spec, shared by EnsureForOrg and the new path so the two never diverge. - EnsureCRSpec(ctx, w, in): lightweight drift correction — calls only k8s.EnsureCR (create-or-update), skipping the DB/Secret/REST pipeline. Idempotent; preserves the operator-owned status. - controller: the LakekeeperEndpoint-set branch resolves inputs and calls EnsureCRSpec instead of returning. The operator rolls the Deployment only when the spec actually changes. This is what makes the resources + podMetadata additions (PR #684) land on existing catalog pods rather than only new ones. Off main; complementary to #684 — until that merges, EnsureCR writes the same spec it does today, so this is a no-op on existing CRs. Rewrites TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned -> _DriftCorrectsWhenAlreadyProvisioned to assert the CR is re-applied (image only, no pipeline side effects). All provisioner tests pass.
benben
added a commit
that referenced
this pull request
Jun 5, 2026
Three fixes to how the provisioner converges per-org Lakekeeper pods, found when prod pods stayed BestEffort/unscraped after the resources+podMetadata rollout (#684/#686): 1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632 that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS pod-identity, all derived from the no-hyphen Duckling XR name — keeps the de-hyphenated name. So #686 created duplicate hyphenated CRs that had no Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched. PatchPodShape now lists CRs by the duckgres/active-org label and patches whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates. 2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race to the operator's frequent status writes under multiple control-plane replicas ("object has been modified", every tick). PatchPodShape uses a JSON merge patch (no resourceVersion). It also needs no inputs, so the per-tick "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed CRs go away. 3. Pod shape per request: replicas 2, and resources requests-only (no limits → Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in the patch to strip any stale limit block. EnsureCR's default replicas is now 2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and PatchPodShape (drift) so they can't diverge. Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it with no inputs. Tests: rewrote the drift test to assert label-matched patching + no duplicate + no inputs; added PatchPodShape coverage (two names, limits stripped); updated CreateAndShape for requests-only + 2 replicas. All provisioner tests pass.
benben
added a commit
that referenced
this pull request
Jun 5, 2026
#687) * fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas Three fixes to how the provisioner converges per-org Lakekeeper pods, found when prod pods stayed BestEffort/unscraped after the resources+podMetadata rollout (#684/#686): 1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632 that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS pod-identity, all derived from the no-hyphen Duckling XR name — keeps the de-hyphenated name. So #686 created duplicate hyphenated CRs that had no Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched. PatchPodShape now lists CRs by the duckgres/active-org label and patches whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates. 2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race to the operator's frequent status writes under multiple control-plane replicas ("object has been modified", every tick). PatchPodShape uses a JSON merge patch (no resourceVersion). It also needs no inputs, so the per-tick "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed CRs go away. 3. Pod shape per request: replicas 2, and resources requests-only (no limits → Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in the patch to strip any stale limit block. EnsureCR's default replicas is now 2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and PatchPodShape (drift) so they can't diverge. Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it with no inputs. Tests: rewrote the drift test to assert label-matched patching + no duplicate + no inputs; added PatchPodShape coverage (two names, limits stripped); updated CreateAndShape for requests-only + 2 replicas. All provisioner tests pass. * docs(agents): forbid exposing customer/internal data in this public repo
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
The provisioning controller's Ready loop now re-applies the Lakekeeper CR spec for already-provisioned orgs, instead of early-returning.
Why
reconcileLakekeeperskipped any org withLakekeeperEndpointset, assuming the operator's reconcile loop carries future drift. That's true for fields already in the CR (e.g. image), but the operator can't add fields that aren't there — so a new field in the desired CR spec (resources,podMetadata, …) never reaches existing per-org Lakekeepers. They'd stay BestEffort + unscraped forever.This is the piece that makes #684's
resources+podMetadataland on existing catalog pods, not just newly-provisioned ones.How
buildCRSpec(w, in)— single source of truth for the desired CR spec, shared byEnsureForOrgand the new path so they never diverge.EnsureCRSpec(ctx, w, in)— lightweight drift correction: calls onlyk8s.EnsureCR(create-or-update), skipping the DB/Secret/REST pipeline. Idempotent; preserves operator-owned status.LakekeeperEndpoint-set branch resolves inputs and callsEnsureCRSpecinstead ofreturn. The operator rolls the Deployment only when the spec actually changes.Ordering
Branched off
main, complementary to #684. Until #684 merges,EnsureCRwrites the same spec it does today → this is a safe no-op on existing CRs. Once both land, the next reconcile tick converges every existing CR onto resources + scrape annotations and the operator rolls the pods.Tests
TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned→_DriftCorrectsWhenAlreadyProvisioned: asserts the CR is re-applied (image) with no pipeline side effects (warehouse row untouched). All provisioner tests pass.🤖 Generated with Claude Code