fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas #687

Merged: benben merged 2 commits into main from ben/lakekeeper-podshape-driftfix on Jun 5, 2026

@benben benben commented Jun 5, 2026

Why

After #684/#686 rolled out, per-org Lakekeeper pods in prod stayed BestEffort + unscraped (dev worked). Investigation found three issues — the main one is a naming bug #686 exposed.

1. Drift-correct by label, not by recomputed name (the real bug)

#686 re-derived the CR name via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632 that function preserves hyphens, but a legacy org's CR (along with its Secret, ServiceAccount, and EKS pod-identity, all derived from the no-hyphen Duckling XR name) keeps the de-hyphenated name. So #686 created duplicate hyphenated CRs with no Secret/SA/pod-identity (hence no Deployment), while the real pods stayed unpatched.

PatchPodShape now lists CRs by the duckgres/active-org label and patches whatever name exists (de-hyphenated legacy or hyphenated new) — no duplicates, no orphans.
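
To illustrate the difference between the two lookups, here is a minimal, stdlib-only sketch. It is not the provisioner's code: the `cr` type, the `lakekeeper-` name prefix, both naming helpers, and the use of the org ID as the label value are assumptions made for the example. Only the `duckgres/active-org` label key comes from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// activeOrgLabel is the label key named in this PR. Using the org ID as its
// value is an assumption made for this sketch.
const activeOrgLabel = "duckgres/active-org"

// cr is a hypothetical, stripped-down stand-in for a Lakekeeper custom resource.
type cr struct {
	Name   string
	Labels map[string]string
}

// legacyName mimics the de-hyphenated naming of older orgs (assumed format).
func legacyName(orgID string) string {
	return "lakekeeper-" + strings.ReplaceAll(orgID, "-", "")
}

// currentName mimics hyphen-preserving naming (assumed format).
func currentName(orgID string) string {
	return "lakekeeper-" + orgID
}

// matchByLabel returns the names of every CR labelled for orgID, regardless of
// which naming scheme produced the name.
func matchByLabel(all []cr, orgID string) []string {
	var names []string
	for _, c := range all {
		if c.Labels[activeOrgLabel] == orgID {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	orgID := "acme-prod"
	cluster := []cr{
		{Name: legacyName(orgID), Labels: map[string]string{activeOrgLabel: orgID}},
		{Name: currentName("other-org"), Labels: map[string]string{activeOrgLabel: "other-org"}},
	}

	// Recomputing the name yields a name that does not exist in the cluster,
	// which is how a duplicate CR ends up being created.
	fmt.Println("recomputed:", currentName(orgID))
	// Listing by label finds the CR under whatever name it already has.
	fmt.Println("by label:", matchByLabel(cluster, orgID))
}
```

The point of the sketch is that the label lookup never has to guess which naming scheme an org was provisioned under.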

2. Conflict-free merge patch

EnsureCR's Get+Update lost a resourceVersion race to the operator's frequent status writes under multiple control-plane replicas ("the object has been modified", every tick). The drift path now uses a JSON merge patch, which carries no resourceVersion. It also needs no inputs, so the per-tick "failed to resolve lakekeeper inputs" warnings (for non-provisioner-managed CRs) disappear.
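
The race can be reproduced with a toy in-memory model. This is only a sketch of optimistic concurrency, not a real Kubernetes client: the `store` type and all of its methods are hypothetical and exist only to show why a versioned Update fails where an unversioned patch does not.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "the object has been modified" error.
var errConflict = errors.New("conflict: the object has been modified")

// object is a hypothetical stored resource with a version counter.
type object struct {
	ResourceVersion int
	Replicas        int
	Status          string
}

// store is a toy in-memory API server.
type store struct{ obj object }

// Get returns a copy of the stored object, including its current version.
func (s *store) Get() object { return s.obj }

// Update replaces the whole object, but only if the caller's copy is current.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// PatchReplicas changes one field and carries no version precondition.
func (s *store) PatchReplicas(n int) {
	s.obj.Replicas = n
	s.obj.ResourceVersion++
}

// writeStatus plays the role of the operator's frequent status writes.
func (s *store) writeStatus(status string) {
	s.obj.Status = status
	s.obj.ResourceVersion++
}

func main() {
	s := &store{obj: object{ResourceVersion: 1, Replicas: 1}}

	stale := s.Get()
	s.writeStatus("Ready") // the operator writes between our Get and Update
	stale.Replicas = 2
	fmt.Println("update:", s.Update(stale))

	// The patch touches only the field it names and leaves the status alone.
	s.PatchReplicas(2)
	fmt.Println("replicas:", s.Get().Replicas, "status:", s.Get().Status)
}
```

Retrying the Update on conflict would also work, but with a writer that updates status on nearly every tick, a patch without a version precondition avoids the retry loop entirely.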

3. Pod shape per request

  • replicas: 2 (EnsureCR default is 2 too).
  • resources requests-only, no limits → Burstable (a CPU limit would CFS-throttle the catalog). limits is nulled in the patch to strip any stale block.
  • Resource/podMetadata maps shared between EnsureCR (create) and PatchPodShape (drift) so they can't diverge.
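
For readers unfamiliar with how a null strips a stale block, here is a hedged sketch of the patch body and of JSON merge patch (RFC 7386) semantics. The field paths under `spec` and the request quantities are illustrative assumptions, not the actual Lakekeeper CRD schema or the values used in this PR, and `mergePatch` is a local helper written for the example rather than a library call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podShapePatch builds a merge-patch body. Field paths and quantities are
// illustrative assumptions, not the real CRD schema.
func podShapePatch() map[string]any {
	return map[string]any{
		"spec": map[string]any{
			"replicas": 2,
			"resources": map[string]any{
				"requests": map[string]any{"cpu": "100m", "memory": "256Mi"},
				"limits":   nil, // encodes as JSON null, which deletes the key
			},
		},
	}
}

// mergePatch applies RFC 7386 semantics for an object-valued patch: null
// deletes a key, nested objects merge recursively, anything else replaces.
func mergePatch(target, patch map[string]any) map[string]any {
	if target == nil {
		target = map[string]any{}
	}
	for k, v := range patch {
		switch pv := v.(type) {
		case nil:
			delete(target, k)
		case map[string]any:
			// A non-object target value is replaced by a fresh object.
			child, _ := target[k].(map[string]any)
			target[k] = mergePatch(child, pv)
		default:
			target[k] = v
		}
	}
	return target
}

func main() {
	body, err := json.Marshal(podShapePatch())
	if err != nil {
		panic(err)
	}
	fmt.Println("patch body:", string(body))

	// A hypothetical live object that still carries a stale limits block.
	liveJSON := `{"spec":{"image":"x","replicas":1,"resources":{"limits":{"cpu":"1"}}}}`
	var live, patch map[string]any
	if err := json.Unmarshal([]byte(liveJSON), &live); err != nil {
		panic(err)
	}
	if err := json.Unmarshal(body, &patch); err != nil {
		panic(err)
	}
	merged, err := json.Marshal(mergePatch(live, patch))
	if err != nil {
		panic(err)
	}
	fmt.Println("after patch:", string(merged))
}
```

Simply omitting `limits` from the patch would leave an existing limits block in place, since a merge patch only touches the keys it names; the explicit null is what removes it.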

Changes

  • Replaces EnsureCRSpec with PatchPodShape (label-matched, merge-patch, no inputs); controller Ready branch calls it.
  • Tests: drift test now asserts label-matched patch + no duplicate + no inputs; added PatchPodShape coverage (two names, limits stripped); CreateAndShape updated for requests-only + 2 replicas. All provisioner tests pass.

Follow-up (not in this PR)

  • Delete the duplicate orphan hyphenated CRs already created in prod (<org-A>, <org-B>) — they have no deployments. Safe kubectl delete once this ships.
  • The duckgres↔crossplane name divergence (#632 hyphenated names vs no-hyphen Duckling XR) is worth reconciling separately.

Three fixes to how the provisioner converges per-org Lakekeeper pods, found
when prod pods stayed BestEffort/unscraped after the resources+podMetadata
rollout (#684/#686):

1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name
   via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632
   that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS
   pod-identity, all derived from the no-hyphen Duckling XR name — keeps the
   de-hyphenated name. So #686 created duplicate hyphenated CRs that had no
   Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched.
   PatchPodShape now lists CRs by the duckgres/active-org label and patches
   whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates.

2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race
   to the operator's frequent status writes under multiple control-plane
   replicas ("object has been modified", every tick). PatchPodShape uses a JSON
   merge patch (no resourceVersion). It also needs no inputs, so the per-tick
   "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed
   CRs go away.

3. Pod shape per request: replicas 2, and resources requests-only (no limits →
   Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in
   the patch to strip any stale limit block. EnsureCR's default replicas is now
   2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and
   PatchPodShape (drift) so they can't diverge.

Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it
with no inputs. Tests: rewrote the drift test to assert label-matched patching +
no duplicate + no inputs; added PatchPodShape coverage (two names, limits
stripped); updated CreateAndShape for requests-only + 2 replicas. All
provisioner tests pass.
@benben benben force-pushed the ben/lakekeeper-podshape-driftfix branch from c26e599 to e2fd5fe June 5, 2026 15:23
@benben benben merged commit 17e9861 into main Jun 5, 2026
24 checks passed
@benben benben deleted the ben/lakekeeper-podshape-driftfix branch June 5, 2026 15:37