fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas#687
Merged
Conversation
Three fixes to how the provisioner converges per-org Lakekeeper pods, found when prod pods stayed BestEffort/unscraped after the resources+podMetadata rollout (#684/#686): 1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632 that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS pod-identity, all derived from the no-hyphen Duckling XR name — keeps the de-hyphenated name. So #686 created duplicate hyphenated CRs that had no Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched. PatchPodShape now lists CRs by the duckgres/active-org label and patches whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates. 2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race to the operator's frequent status writes under multiple control-plane replicas ("object has been modified", every tick). PatchPodShape uses a JSON merge patch (no resourceVersion). It also needs no inputs, so the per-tick "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed CRs go away. 3. Pod shape per request: replicas 2, and resources requests-only (no limits → Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in the patch to strip any stale limit block. EnsureCR's default replicas is now 2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and PatchPodShape (drift) so they can't diverge. Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it with no inputs. Tests: rewrote the drift test to assert label-matched patching + no duplicate + no inputs; added PatchPodShape coverage (two names, limits stripped); updated CreateAndShape for requests-only + 2 replicas. All provisioner tests pass.
c26e599 to
e2fd5fe
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
After #684/#686 rolled out, per-org Lakekeeper pods in prod stayed BestEffort + unscraped (dev worked). Investigation found three issues — the main one is a naming bug #686 exposed.
1. Drift-correct by label, not by recomputed name (the real bug)
#686re-derived the CR name viaLakekeeperResourceName(orgID)for already-provisioned orgs. Post-#632that preserves hyphens, but a legacy org's CR — and its Secret, ServiceAccount, and EKS pod-identity, all derived from the no-hyphen Duckling XR name — keep the de-hyphenated name. So#686created duplicate hyphenated CRs with no Secret/SA/pod-identity (hence no Deployment), while the real pods stayed unpatched.PatchPodShapenow lists CRs by theduckgres/active-orglabel and patches whatever name exists (de-hyphenated legacy or hyphenated new) — no duplicates, no orphans.2. Conflict-free merge patch
EnsureCR's Get+Update lost aresourceVersionrace to the operator's frequent status writes under multiple control-plane replicas (the object has been modified, every tick). The drift path now uses a JSON merge patch (no resourceVersion). It also needs no inputs, so the per-tickfailed to resolve lakekeeper inputswarnings (for non-provisioner-managed CRs) disappear.3. Pod shape per request
replicas: 2(EnsureCR default is 2 too).limitsis nulled in the patch to strip any stale block.EnsureCR(create) andPatchPodShape(drift) so they can't diverge.Changes
EnsureCRSpecwithPatchPodShape(label-matched, merge-patch, no inputs); controller Ready branch calls it.PatchPodShapecoverage (two names, limits stripped);CreateAndShapeupdated for requests-only + 2 replicas. All provisioner tests pass.Follow-up (not in this PR)
<org-A>,<org-B>) — they have no deployments. Safekubectl deleteonce this ships.#632hyphenated names vs no-hyphen Duckling XR) is worth reconciling separately.