fix: add clusterCapacity support for drive-sharing clusters (OP-329) by kristina-solovyova · Pull Request #2586 · weka/weka-operator

kristina-solovyova · 2026-06-10T17:00:49Z

No description provided.

github-actions · 2026-06-10T17:01:01Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

kristina-solovyova · 2026-06-10T17:01:03Z

fix: add clusterCapacity support for drive-sharing clusters (OP-329) #2586 👈 (View in Graphite)
main

How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has required the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-06-10T17:01:11Z

Claude finished @kristina-solovyova's task in 6m 50s —— View job

Claude PR Review — clusterCapacity for drive-sharing (OP-329)

Reviewed the full diff (37 files, ~4070 LOC). This is a well-structured, heavily-tested addition — the planner is split cleanly into driver/ephemeral/explicit/assignment modules, error propagation is generally correct, and edge cases (zero nodes, zero QLC FDs, pure-QLC clusters) are mostly guarded. Findings below, ordered by severity. The submodule pkg/weka-k8s-api wasn't checked out here so I couldn't go build/go test — please ensure CI is green.

🔴 HIGH

1. Error swallowing in buildNodeInventory — comment promises a log that never happens
internal/controllers/wekacluster/funcs_fd_planning.go:76-80

info, err := nodeInfoGetter(ctx, nodeName)
if err != nil {
    // Log and skip — a missing annotation is not fatal for planning.
    continue
}

The comment says "Log and skip," but nothing is logged, and nodeInfoGetter does not error on a missing annotation (it returns empty SharedDrives). It only errors on a real client.Get failure (transient API error, node mid-deletion) or malformed annotation JSON. So this silently drops a node from the capacity inventory on genuine errors → the plan can be under-sized or reported infeasible ("not enough usable TLC failure domains") with a misleading message, and nothing surfaces. Violates the repo rule "never swallow errors unless explicitly asked."
Suggested: actually log (the comment already promises it), and distinguish apierrors.IsNotFound (legitimately skip a vanishing node) from other errors (return/surface). Fix this →

2. Pool-assignment regression for drive-sharing clusters with no cluster-level DriveTypesRatio
internal/controllers/wekacontainer/funcs_drives.go:92-99

allSameType := false
if cluster.Spec.Dynamic != nil && cluster.Spec.Dynamic.DriveTypesRatio != nil &&
    (cluster.Spec.Dynamic.DriveTypesRatio.Qlc == 0 || cluster.Spec.Dynamic.DriveTypesRatio.Tlc == 0) {
    allSameType = true
}

The allSameType default flipped from effectively true (old per-container logic) to false. Any pre-existing drive-sharing cluster where Spec.Dynamic == nil or DriveTypesRatio == nil now computes allSameType = false, so drives route to iubig/iu4k by vd.Type instead of the legacy pool they previously used. Spec.Dynamic.DriveTypesRatio is user-set and is not auto-defaulted from the DriveSharing env config, so clusters that left it unset are affected — this looks like a silent pool migration for already-running clusters.
Please confirm this is intended for existing clusters. If legacy should stay the default when the cluster-level ratio is absent, default allSameType = true and only set false when both Tlc > 0 and Qlc > 0.

🟡 MEDIUM

3. Explicit-mode anchor recognition breaks when a pinned node falls out of inventory
internal/controllers/allocator/cluster_capacity_explicit.go:100-112
Existing anchors are recognized only via nodeFD[e.PinnedNode]. buildNodeInventory drops a node when it's fully consumed by other clusters (funcs_fd_planning.go:91-93) or its FD label is missing. If an existing anchor's pinned node is dropped, nodeFD[e.PinnedNode] == "", the anchor isn't recognized, its FD is treated as unanchored, and the planner can emit a duplicate anchor on a different node in the same FD group → double-provisioning + broken idempotency.
Suggested: also recognize existing anchors by their known FD identity (AllocatedFD / BoundFailureDomain from status), not just the live nodeFD lookup.

4. Wrong QLC denominator in covered fast-path PerFDGiB
internal/controllers/allocator/cluster_capacity.go:241

PerFDGiB: util.CeilDiv(tlcRaw, numTLC) + util.CeilDiv(qlcRaw, numTLC),

QLC is divided by numTLC (TLC container count), but QLC spreads across QLC-bearing FDs. The full driver computes the QLC term over numFDTarget / len(fdQlcNodes), never numTLC. When numQLC != numTLC the fast-path reports a different PerFDGiB than the full planner for the same cluster. This field is observability-only (doesn't drive allocation), so no capacity is mis-provisioned — but the reported value is inconsistent/misleading.
Suggested: divide the QLC term by the QLC count (guard 0), or recompute consistently with the driver's per-FD definition.

5. Three new unguarded Spec.Dynamic.UsesClusterCapacity() calls
funcs_upgrade.go:408, steps_cluster_creation.go:281,559
Spec.Dynamic is nillable (handled as nil elsewhere, not defaulted in admission). The established pattern guards it — hugepages.go:39: cluster.Spec.Dynamic != nil && cluster.Spec.Dynamic.UsesClusterCapacity(). These three new sites omit the != nil guard. Whether it panics depends on the receiver type (couldn't verify without the submodule), but HandleSpecUpdates runs for every cluster including non-clusterCapacity ones, so this is a latent risk worth closing cheaply by matching the existing guard pattern.

🟢 LOW / confirm

Doc/code mismatch: funcs_fd_planning.go:160 sums NumCores + ExtraCores, but the two comments (lines 116-117, 132-136) say only NumCores; ExtraCores is untested. Update comments + add a test case.
Covered fast-path doesn't refresh Status.NumFailureDomains despite the "always refresh" comment at steps_cluster_creation.go:457 — the fast path returns before the status block. Either drop "always" or refresh NumFD (it's populated as len(uniqueFDs)).
Zero per-core capacity silently degrades to 1 core: TlcCapacityPerCoreGiB/QlcCapacityPerCoreGiB (env.go:531-533) feed CeilDiv divisors wrapped in max(1, …). A …PER_CORE_GIB=0 misconfig sizes every drive container to 1 core instead of erroring. Validate these are > 0 at config time.
Inline role derivation has no nil-Ratio fallback at steps_cluster_creation.go:476-486 — role stays "" → malformed container name. Not currently reachable (planner always sets Ratio), but consider reusing allocator.RatioType(pc.Ratio).
Migration frozen-chunk seed (funcs_fd_planning.go:214-224) runs for explicit mode too, before the if explicit dispatch. Confirm the seed is meant for explicit-mode clusters with pre-existing anchors; if ephemeral-only, gate with if !explicit.
Dead guard: cluster_capacity_explicit.go:105 if e.PinnedNode != "" is already guaranteed non-empty by the enclosing nodeFD lookup. Drop it.

✅ Verified correct (no action)

funcs_cluster_joining.go:101-109 swallow of GetClusterCreationTime is fine — its only error is in-memory "not found," no transient path.
flow_active_state.go step reorder (setJoinIpsIfStuckInStemMode before applyCurrentImage) is inert for founding/non-clusterCapacity containers.
hugepages.go / templates.go clusterCapacity branches are correctly nil-guarded and propagate errors.
deriveComputeCountAndCores keeps cores >= 1 across all branches; RawCapacityGiB divides by sw only after validateCapacityInputs enforces sw >= 3.
Validators (cluster_capacity_chunk_feasibility.go, cluster_capacity_protection.go) registered correctly; registry.go/defaults.go pairing enforced at startup.
No concurrency concerns — planner state is per-call local.

Bottom line: Findings #1 (error swallow) and #2 (pool default) are the two I'd want resolved/confirmed before merge; #3 is a real idempotency edge case worth fixing; the rest are confirm/cleanup. Nice test coverage overall.
· 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

claude · 2026-06-10T17:50:21Z

Claude finished @kristina-solovyova's task in 1m 50s —— View job

Claude PR Review (re-review) — clusterCapacity for drive-sharing (OP-329)

The branch was updated after my first review (commit ca5ac92b, pushed ~17:49 UTC, ~48 min after the original review). I diffed the prior findings against the current code. 4 of the 5 HIGH/MEDIUM findings are resolved. One MEDIUM remains.

Diff prior-reviewed findings against current code
Review the updated changes
Post updated review

✅ Resolved since last review

#1 (HIGH) Error swallowing in buildNodeInventory — fixed.
funcs_fd_planning.go:79-89 now logs and skips on apierrors.IsNotFound (vanishing node) and returns the error otherwise (transient API error / malformed annotation), with a descriptive wrap. Matches the repo "never swallow errors" rule. 👍

#2 (HIGH) Pool-assignment default for drive-sharing — fixed.
funcs_drives.go:94-106 now defaults allSameType = true when Spec.Dynamic == nil or DriveTypesRatio == nil, falling back to the per-container drive-type scan (preserving the old legacy-pool behavior for pre-existing clusters). Only routes to type pools when both Tlc > 0 and Qlc > 0. This removes the silent pool-migration regression.

#4 (MEDIUM) PerFDGiB QLC denominator — fixed.
cluster_capacity.go:244 now divides both the TLC and QLC terms by max(1, numFD) (with numFD = len(uniqueFDs)), instead of the TLC container count. Zero-guarded and consistent with "QLC neighbors share their anchor's AllocatedFD."

#5 (MEDIUM) Unguarded Spec.Dynamic.UsesClusterCapacity() — fixed.
All three sites (funcs_upgrade.go:408, steps_cluster_creation.go:281,559) now have the cluster.Spec.Dynamic != nil && guard, matching the established hugepages.go:39 pattern.

🟡 Still open

#3 (MEDIUM) Explicit-mode anchor recognition still keyed only on live nodeFD[e.PinnedNode].
cluster_capacity_explicit.go:100-112

for _, e := range existing {
    fd := nodeFD[e.PinnedNode]
    if fd == "" {
        continue
    }
    ...
}

The ephemeral planner was updated to key idempotency on the persisted e.AllocatedFD (cluster_capacity_ephemeral.go:69-77), but the explicit planner still resolves the FD via the live nodeFD map. buildNodeInventory drops a node when it's fully consumed by other clusters (funcs_fd_planning.go:100-102) or its FD label is missing — and our own anchor's node can be floored to 0 once other clusters claim it (our own container's capacity isn't added back here). When that node drops out, nodeFD[e.PinnedNode] == "" → the existing anchor isn't recognized → its FD looks unanchored → the planner can emit a duplicate anchor on a different node in the same FD group (double-provisioning + broken idempotency).

The fields needed for the robust fix already exist on the existing element (AllocatedFD, plus PinnedNode). Suggested: fall back to e.AllocatedFD when the live nodeFD lookup is empty, mirroring what ephemeral mode now does.
Fix this →

Minor (still present): the dead guard at cluster_capacity_explicit.go:105 — if e.PinnedNode != "" is already guaranteed non-empty by the enclosing nodeFD[e.PinnedNode] lookup succeeding (an empty key can't produce a non-empty fd). Once #3 is addressed with an AllocatedFD fallback, this guard becomes meaningful again, so it's worth resolving together.

Notes

The other 🟢 LOW items from the first review (doc/code comment mismatch on ExtraCores, covered fast-path not refreshing Status.NumFailureDomains, zero-per-core-capacity → 1-core degradation, inline role-derivation nil-Ratio fallback, migration frozen-chunk seed gating) were not part of these updates — they remain as confirm/cleanup, not blockers.
I couldn't run go build/go test here (build needs approval in this environment, and the pkg/weka-k8s-api submodule isn't checked out). Please ensure CI is green.

Bottom line: Strong iteration — the two HIGH findings and two of the MEDIUMs are cleanly resolved. #3 (explicit-mode anchor recognition) is the one remaining correctness edge case I'd want closed before merge; the dead-guard cleanup pairs with it.
· 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

fix: add clusterCapacity support for drive-sharing clusters (OP-329)

ca5ac92

kristina-solovyova force-pushed the 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_ branch from 21b73b5 to ca5ac92 Compare June 10, 2026 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add clusterCapacity support for drive-sharing clusters (OP-329)#2586

fix: add clusterCapacity support for drive-sharing clusters (OP-329)#2586
kristina-solovyova wants to merge 1 commit into
mainfrom
06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

kristina-solovyova commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

kristina-solovyova commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kristina-solovyova commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

Scanned Files

Uh oh!

kristina-solovyova commented Jun 10, 2026

How to use the Graphite Merge Queue

Uh oh!

claude Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude PR Review — clusterCapacity for drive-sharing (OP-329)

🔴 HIGH

🟡 MEDIUM

🟢 LOW / confirm

✅ Verified correct (no action)

Uh oh!

claude Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude PR Review (re-review) — clusterCapacity for drive-sharing (OP-329)

✅ Resolved since last review

🟡 Still open

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 10, 2026 •

edited

Loading

claude Bot commented Jun 10, 2026 •

edited

Loading

claude Bot commented Jun 10, 2026 •

edited

Loading