Skip to content

feat(lakekeeper): drift-correct existing CRs instead of skipping provisioned orgs#686

Merged
benben merged 1 commit into
mainfrom
ben/lakekeeper-cr-drift-correction
Jun 5, 2026
Merged

feat(lakekeeper): drift-correct existing CRs instead of skipping provisioned orgs#686
benben merged 1 commit into
mainfrom
ben/lakekeeper-cr-drift-correction

Conversation

@benben
Copy link
Copy Markdown
Member

@benben benben commented Jun 5, 2026

What

The provisioning controller's Ready loop now re-applies the Lakekeeper CR spec for already-provisioned orgs, instead of early-returning.

Why

reconcileLakekeeper skipped any org with LakekeeperEndpoint set, assuming the operator's reconcile loop carries future drift. That's true for fields already in the CR (e.g. image), but the operator can't add fields that aren't there — so a new field in the desired CR spec (resources, podMetadata, …) never reaches existing per-org Lakekeepers. They'd stay BestEffort + unscraped forever.

This is the piece that makes #684's resources + podMetadata land on existing catalog pods, not just newly-provisioned ones.

How

  • buildCRSpec(w, in) — single source of truth for the desired CR spec, shared by EnsureForOrg and the new path so they never diverge.
  • EnsureCRSpec(ctx, w, in) — lightweight drift correction: calls only k8s.EnsureCR (create-or-update), skipping the DB/Secret/REST pipeline. Idempotent; preserves operator-owned status.
  • controller — the LakekeeperEndpoint-set branch resolves inputs and calls EnsureCRSpec instead of return. The operator rolls the Deployment only when the spec actually changes.

Ordering

Branched off main, complementary to #684. Until #684 merges, EnsureCR writes the same spec it does today → this is a safe no-op on existing CRs. Once both land, the next reconcile tick converges every existing CR onto resources + scrape annotations and the operator rolls the pods.

Tests

TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned_DriftCorrectsWhenAlreadyProvisioned: asserts the CR is re-applied (image) with no pipeline side effects (warehouse row untouched). All provisioner tests pass.

🤖 Generated with Claude Code

…isioned orgs

reconcileLakekeeper previously early-returned for any org with
LakekeeperEndpoint already set, on the assumption that the operator's reconcile
loop would carry future drift. That holds for fields already in the CR (e.g.
image), but the operator can't add fields that aren't there — so a new field in
the desired CR spec (resources, podMetadata, ...) never reaches existing
per-org Lakekeepers.

The Ready loop now re-applies just the CR spec for already-provisioned orgs:
- buildCRSpec(w, in): single source of truth for the desired CR spec, shared by
  EnsureForOrg and the new path so the two never diverge.
- EnsureCRSpec(ctx, w, in): lightweight drift correction — calls only
  k8s.EnsureCR (create-or-update), skipping the DB/Secret/REST pipeline.
  Idempotent; preserves the operator-owned status.
- controller: the LakekeeperEndpoint-set branch resolves inputs and calls
  EnsureCRSpec instead of returning. The operator rolls the Deployment only when
  the spec actually changes.

This is what makes the resources + podMetadata additions (PR #684) land on
existing catalog pods rather than only new ones. Off main; complementary to
#684 — until that merges, EnsureCR writes the same spec it does today, so this
is a no-op on existing CRs.

Rewrites TestReconcileLakekeeper_SkipsWhenAlreadyProvisioned ->
_DriftCorrectsWhenAlreadyProvisioned to assert the CR is re-applied (image only,
no pipeline side effects). All provisioner tests pass.
@benben benben merged commit e2c67ec into main Jun 5, 2026
23 checks passed
@benben benben deleted the ben/lakekeeper-cr-drift-correction branch June 5, 2026 12:07
benben added a commit that referenced this pull request Jun 5, 2026
Three fixes to how the provisioner converges per-org Lakekeeper pods, found
when prod pods stayed BestEffort/unscraped after the resources+podMetadata
rollout (#684/#686):

1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name
   via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632
   that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS
   pod-identity, all derived from the no-hyphen Duckling XR name — keeps the
   de-hyphenated name. So #686 created duplicate hyphenated CRs that had no
   Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched.
   PatchPodShape now lists CRs by the duckgres/active-org label and patches
   whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates.

2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race
   to the operator's frequent status writes under multiple control-plane
   replicas ("object has been modified", every tick). PatchPodShape uses a JSON
   merge patch (no resourceVersion). It also needs no inputs, so the per-tick
   "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed
   CRs go away.

3. Pod shape per request: replicas 2, and resources requests-only (no limits →
   Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in
   the patch to strip any stale limit block. EnsureCR's default replicas is now
   2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and
   PatchPodShape (drift) so they can't diverge.

Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it
with no inputs. Tests: rewrote the drift test to assert label-matched patching +
no duplicate + no inputs; added PatchPodShape coverage (two names, limits
stripped); updated CreateAndShape for requests-only + 2 replicas. All
provisioner tests pass.
benben added a commit that referenced this pull request Jun 5, 2026
#687)

* fix(lakekeeper): drift-correct CRs by label; requests-only; 2 replicas

Three fixes to how the provisioner converges per-org Lakekeeper pods, found
when prod pods stayed BestEffort/unscraped after the resources+podMetadata
rollout (#684/#686):

1. Drift-correct by label, not by recomputed name. #686 re-derived the CR name
   via LakekeeperResourceName(orgID) for already-provisioned orgs. Post-#632
   that preserves hyphens, but a legacy org's CR — and its Secret, SA, and EKS
   pod-identity, all derived from the no-hyphen Duckling XR name — keeps the
   de-hyphenated name. So #686 created duplicate hyphenated CRs that had no
   Secret/SA/pod-identity (no deployment), while the real pods stayed unpatched.
   PatchPodShape now lists CRs by the duckgres/active-org label and patches
   whatever name exists (de-hyphenated legacy, hyphenated new) — no duplicates.

2. Conflict-free merge patch. EnsureCR's Get+Update lost a resourceVersion race
   to the operator's frequent status writes under multiple control-plane
   replicas ("object has been modified", every tick). PatchPodShape uses a JSON
   merge patch (no resourceVersion). It also needs no inputs, so the per-tick
   "failed to resolve lakekeeper inputs" warnings for non-provisioner-managed
   CRs go away.

3. Pod shape per request: replicas 2, and resources requests-only (no limits →
   Burstable; a CPU limit would CFS-throttle the catalog). limits is nulled in
   the patch to strip any stale limit block. EnsureCR's default replicas is now
   2 as well. Resource/podMetadata maps are shared between EnsureCR (create) and
   PatchPodShape (drift) so they can't diverge.

Replaces EnsureCRSpec with PatchPodShape; controller's Ready branch calls it
with no inputs. Tests: rewrote the drift test to assert label-matched patching +
no duplicate + no inputs; added PatchPodShape coverage (two names, limits
stripped); updated CreateAndShape for requests-only + 2 replicas. All
provisioner tests pass.

* docs(agents): forbid exposing customer/internal data in this public repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant