Skip to content

fix: add clusterCapacity support for drive-sharing clusters (OP-329)#2586

Draft
kristina-solovyova wants to merge 1 commit into
mainfrom
06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_
Draft

fix: add clusterCapacity support for drive-sharing clusters (OP-329)#2586
kristina-solovyova wants to merge 1 commit into
mainfrom
06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

Conversation

@kristina-solovyova

Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copy link
Copy Markdown
Collaborator Author

How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @kristina-solovyova's task in 6m 50s —— View job


Claude PR Review — clusterCapacity for drive-sharing (OP-329)

Reviewed the full diff (37 files, ~4070 LOC). This is a well-structured, heavily-tested addition — the planner is split cleanly into driver/ephemeral/explicit/assignment modules, error propagation is generally correct, and edge cases (zero nodes, zero QLC FDs, pure-QLC clusters) are mostly guarded. Findings below, ordered by severity. The submodule pkg/weka-k8s-api wasn't checked out here so I couldn't go build/go test — please ensure CI is green.


🔴 HIGH

1. Error swallowing in buildNodeInventory — comment promises a log that never happens
internal/controllers/wekacluster/funcs_fd_planning.go:76-80

info, err := nodeInfoGetter(ctx, nodeName)
if err != nil {
    // Log and skip — a missing annotation is not fatal for planning.
    continue
}

The comment says "Log and skip," but nothing is logged, and nodeInfoGetter does not error on a missing annotation (it returns empty SharedDrives). It only errors on a real client.Get failure (transient API error, node mid-deletion) or malformed annotation JSON. So this silently drops a node from the capacity inventory on genuine errors → the plan can be under-sized or reported infeasible ("not enough usable TLC failure domains") with a misleading message, and nothing surfaces. Violates the repo rule "never swallow errors unless explicitly asked."
Suggested: actually log (the comment already promises it), and distinguish apierrors.IsNotFound (legitimately skip a vanishing node) from other errors (return/surface). Fix this →

2. Pool-assignment regression for drive-sharing clusters with no cluster-level DriveTypesRatio
internal/controllers/wekacontainer/funcs_drives.go:92-99

allSameType := false
if cluster.Spec.Dynamic != nil && cluster.Spec.Dynamic.DriveTypesRatio != nil &&
    (cluster.Spec.Dynamic.DriveTypesRatio.Qlc == 0 || cluster.Spec.Dynamic.DriveTypesRatio.Tlc == 0) {
    allSameType = true
}

The allSameType default flipped from effectively true (old per-container logic) to false. Any pre-existing drive-sharing cluster where Spec.Dynamic == nil or DriveTypesRatio == nil now computes allSameType = false, so drives route to iubig/iu4k by vd.Type instead of the legacy pool they previously used. Spec.Dynamic.DriveTypesRatio is user-set and is not auto-defaulted from the DriveSharing env config, so clusters that left it unset are affected — this looks like a silent pool migration for already-running clusters.
Please confirm this is intended for existing clusters. If legacy should stay the default when the cluster-level ratio is absent, default allSameType = true and only set false when both Tlc > 0 and Qlc > 0.


🟡 MEDIUM

3. Explicit-mode anchor recognition breaks when a pinned node falls out of inventory
internal/controllers/allocator/cluster_capacity_explicit.go:100-112
Existing anchors are recognized only via nodeFD[e.PinnedNode]. buildNodeInventory drops a node when it's fully consumed by other clusters (funcs_fd_planning.go:91-93) or its FD label is missing. If an existing anchor's pinned node is dropped, nodeFD[e.PinnedNode] == "", the anchor isn't recognized, its FD is treated as unanchored, and the planner can emit a duplicate anchor on a different node in the same FD group → double-provisioning + broken idempotency.
Suggested: also recognize existing anchors by their known FD identity (AllocatedFD / BoundFailureDomain from status), not just the live nodeFD lookup.

4. Wrong QLC denominator in covered fast-path PerFDGiB
internal/controllers/allocator/cluster_capacity.go:241

PerFDGiB: util.CeilDiv(tlcRaw, numTLC) + util.CeilDiv(qlcRaw, numTLC),

QLC is divided by numTLC (TLC container count), but QLC spreads across QLC-bearing FDs. The full driver computes the QLC term over numFDTarget / len(fdQlcNodes), never numTLC. When numQLC != numTLC the fast-path reports a different PerFDGiB than the full planner for the same cluster. This field is observability-only (doesn't drive allocation), so no capacity is mis-provisioned — but the reported value is inconsistent/misleading.
Suggested: divide the QLC term by the QLC count (guard 0), or recompute consistently with the driver's per-FD definition.

5. Three new unguarded Spec.Dynamic.UsesClusterCapacity() calls
funcs_upgrade.go:408, steps_cluster_creation.go:281,559
Spec.Dynamic is nillable (handled as nil elsewhere, not defaulted in admission). The established pattern guards it — hugepages.go:39: cluster.Spec.Dynamic != nil && cluster.Spec.Dynamic.UsesClusterCapacity(). These three new sites omit the != nil guard. Whether it panics depends on the receiver type (couldn't verify without the submodule), but HandleSpecUpdates runs for every cluster including non-clusterCapacity ones, so this is a latent risk worth closing cheaply by matching the existing guard pattern.


🟢 LOW / confirm

  • Doc/code mismatch: funcs_fd_planning.go:160 sums NumCores + ExtraCores, but the two comments (lines 116-117, 132-136) say only NumCores; ExtraCores is untested. Update comments + add a test case.
  • Covered fast-path doesn't refresh Status.NumFailureDomains despite the "always refresh" comment at steps_cluster_creation.go:457 — the fast path returns before the status block. Either drop "always" or refresh NumFD (it's populated as len(uniqueFDs)).
  • Zero per-core capacity silently degrades to 1 core: TlcCapacityPerCoreGiB/QlcCapacityPerCoreGiB (env.go:531-533) feed CeilDiv divisors wrapped in max(1, …). A …PER_CORE_GIB=0 misconfig sizes every drive container to 1 core instead of erroring. Validate these are > 0 at config time.
  • Inline role derivation has no nil-Ratio fallback at steps_cluster_creation.go:476-486role stays "" → malformed container name. Not currently reachable (planner always sets Ratio), but consider reusing allocator.RatioType(pc.Ratio).
  • Migration frozen-chunk seed (funcs_fd_planning.go:214-224) runs for explicit mode too, before the if explicit dispatch. Confirm the seed is meant for explicit-mode clusters with pre-existing anchors; if ephemeral-only, gate with if !explicit.
  • Dead guard: cluster_capacity_explicit.go:105 if e.PinnedNode != "" is already guaranteed non-empty by the enclosing nodeFD lookup. Drop it.

✅ Verified correct (no action)

  • funcs_cluster_joining.go:101-109 swallow of GetClusterCreationTime is fine — its only error is in-memory "not found," no transient path.
  • flow_active_state.go step reorder (setJoinIpsIfStuckInStemMode before applyCurrentImage) is inert for founding/non-clusterCapacity containers.
  • hugepages.go / templates.go clusterCapacity branches are correctly nil-guarded and propagate errors.
  • deriveComputeCountAndCores keeps cores >= 1 across all branches; RawCapacityGiB divides by sw only after validateCapacityInputs enforces sw >= 3.
  • Validators (cluster_capacity_chunk_feasibility.go, cluster_capacity_protection.go) registered correctly; registry.go/defaults.go pairing enforced at startup.
  • No concurrency concerns — planner state is per-call local.

Bottom line: Findings #1 (error swallow) and #2 (pool default) are the two I'd want resolved/confirmed before merge; #3 is a real idempotency edge case worth fixing; the rest are confirm/cleanup. Nice test coverage overall.
· 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

@kristina-solovyova kristina-solovyova force-pushed the 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_ branch from 21b73b5 to ca5ac92 Compare June 10, 2026 17:49
@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @kristina-solovyova's task in 1m 50s —— View job


Claude PR Review (re-review) — clusterCapacity for drive-sharing (OP-329)

The branch was updated after my first review (commit ca5ac92b, pushed ~17:49 UTC, ~48 min after the original review). I diffed the prior findings against the current code. 4 of the 5 HIGH/MEDIUM findings are resolved. One MEDIUM remains.

  • Diff prior-reviewed findings against current code
  • Review the updated changes
  • Post updated review

✅ Resolved since last review

#1 (HIGH) Error swallowing in buildNodeInventory — fixed.
funcs_fd_planning.go:79-89 now logs and skips on apierrors.IsNotFound (vanishing node) and returns the error otherwise (transient API error / malformed annotation), with a descriptive wrap. Matches the repo "never swallow errors" rule. 👍

#2 (HIGH) Pool-assignment default for drive-sharing — fixed.
funcs_drives.go:94-106 now defaults allSameType = true when Spec.Dynamic == nil or DriveTypesRatio == nil, falling back to the per-container drive-type scan (preserving the old legacy-pool behavior for pre-existing clusters). Only routes to type pools when both Tlc > 0 and Qlc > 0. This removes the silent pool-migration regression.

#4 (MEDIUM) PerFDGiB QLC denominator — fixed.
cluster_capacity.go:244 now divides both the TLC and QLC terms by max(1, numFD) (with numFD = len(uniqueFDs)), instead of the TLC container count. Zero-guarded and consistent with "QLC neighbors share their anchor's AllocatedFD."

#5 (MEDIUM) Unguarded Spec.Dynamic.UsesClusterCapacity() — fixed.
All three sites (funcs_upgrade.go:408, steps_cluster_creation.go:281,559) now have the cluster.Spec.Dynamic != nil && guard, matching the established hugepages.go:39 pattern.


🟡 Still open

#3 (MEDIUM) Explicit-mode anchor recognition still keyed only on live nodeFD[e.PinnedNode].
cluster_capacity_explicit.go:100-112

for _, e := range existing {
    fd := nodeFD[e.PinnedNode]
    if fd == "" {
        continue
    }
    ...
}

The ephemeral planner was updated to key idempotency on the persisted e.AllocatedFD (cluster_capacity_ephemeral.go:69-77), but the explicit planner still resolves the FD via the live nodeFD map. buildNodeInventory drops a node when it's fully consumed by other clusters (funcs_fd_planning.go:100-102) or its FD label is missing — and our own anchor's node can be floored to 0 once other clusters claim it (our own container's capacity isn't added back here). When that node drops out, nodeFD[e.PinnedNode] == "" → the existing anchor isn't recognized → its FD looks unanchored → the planner can emit a duplicate anchor on a different node in the same FD group (double-provisioning + broken idempotency).

The fields needed for the robust fix already exist on the existing element (AllocatedFD, plus PinnedNode). Suggested: fall back to e.AllocatedFD when the live nodeFD lookup is empty, mirroring what ephemeral mode now does.
Fix this →

Minor (still present): the dead guard at cluster_capacity_explicit.go:105if e.PinnedNode != "" is already guaranteed non-empty by the enclosing nodeFD[e.PinnedNode] lookup succeeding (an empty key can't produce a non-empty fd). Once #3 is addressed with an AllocatedFD fallback, this guard becomes meaningful again, so it's worth resolving together.


Notes

  • The other 🟢 LOW items from the first review (doc/code comment mismatch on ExtraCores, covered fast-path not refreshing Status.NumFailureDomains, zero-per-core-capacity → 1-core degradation, inline role-derivation nil-Ratio fallback, migration frozen-chunk seed gating) were not part of these updates — they remain as confirm/cleanup, not blockers.
  • I couldn't run go build/go test here (build needs approval in this environment, and the pkg/weka-k8s-api submodule isn't checked out). Please ensure CI is green.

Bottom line: Strong iteration — the two HIGH findings and two of the MEDIUMs are cleanly resolved. #3 (explicit-mode anchor recognition) is the one remaining correctness edge case I'd want closed before merge; the dead-guard cleanup pairs with it.
· 06-10-fix_add_clustercapacity_support_for_drive-sharing_clusters_op-329_

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant