VMDistributed CR: orchestrate VMCluster upgrades #1556

Merged
AndrewChubatiuk merged 297 commits into VictoriaMetrics:master from vrutkovs:vmdistributed-cluster
Jan 26, 2026

Conversation

@vrutkovs
Collaborator

@vrutkovs vrutkovs commented Oct 21, 2025

Add a new CR, VMDistributed, so that multiple VMClusters can be upgraded in an orchestrated fashion: reads through VMAuth are disabled for a cluster before it is upgraded, and the VMAgent (if available) must have no pending bytes to send.

Fixes #1515

This CR can refer to VMClusters using one of two possible ways:

  • Existing VMClusters can be referred to using ref property and changes applied using spec
  • Entirely new VMClusters can be created with name and spec properties

Either way, settings in VMDistributed would be applied to target VMClusters, overriding their existing settings if necessary.

Current implementation scope:

  • VMDistributed will create a VMAgent instance to proxy writes and vmauth LB to proxy reads
  • VMDistributed can create new VMCluster instances when name and spec are specified
  • VMDistributed can update existing VMCluster objects when ref and overrideSpec are set
  • Before a cluster is updated, vmauth LB is updated to disable reads from this cluster
  • VMClusters are updated one by one, waiting for them to change status to "operational" again
  • Time to wait for the cluster to become ready can be configured
  • After the VMCluster update is complete, the controller waits for VMAgent to flush its collected data, which is verified by checking VMAgent metrics
  • VMAuth LB is updated to enable reads from this cluster
  • Optionally, the controller can wait a configurable amount of time before proceeding to the next cluster
  • Process is repeated for all remaining VMClusters
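
The sequence in the list above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the operator's implementation: the `Steps` type, its field names, and `RollZones` are hypothetical stand-ins for the vmauth, VMCluster, and VMAgent interactions described in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Steps bundles the per-zone actions from the list above as plain functions.
// The type and its field names are hypothetical, not the operator's API.
type Steps struct {
	SetReads         func(zone string, enabled bool) error // toggle reads in the vmauth LB
	ApplySpec        func(zone string) error               // create or update the VMCluster
	WaitOperational  func(zone string) error               // block until "operational" or a deadline
	WaitAgentFlushed func(zone string) error               // block until VMAgent has no pending data
}

// RollZones updates zones strictly one at a time and stops at the first error,
// so zones that were not reached keep their previous configuration.
func RollZones(s Steps, zones []string, zoneDelay time.Duration) error {
	for i, zone := range zones {
		if err := s.SetReads(zone, false); err != nil {
			return fmt.Errorf("disable reads for zone %q: %w", zone, err)
		}
		if err := s.ApplySpec(zone); err != nil {
			return fmt.Errorf("update zone %q: %w", zone, err)
		}
		if err := s.WaitOperational(zone); err != nil {
			return fmt.Errorf("wait for zone %q to become operational: %w", zone, err)
		}
		if err := s.WaitAgentFlushed(zone); err != nil {
			return fmt.Errorf("wait for VMAgent flush after zone %q: %w", zone, err)
		}
		if err := s.SetReads(zone, true); err != nil {
			return fmt.Errorf("enable reads for zone %q: %w", zone, err)
		}
		// Optional pause between zones; not needed after the last one.
		if zoneDelay > 0 && i < len(zones)-1 {
			time.Sleep(zoneDelay)
		}
	}
	return nil
}
```

Note that this sketch leaves reads disabled for a zone whose update fails; whether the real controller should do the same is a design decision the sketch does not settle.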

See #1515 (comment) for agreed limitations for v1alpha1 version:

  • All objects must belong to the same namespace as VMDistributed
  • Referenced VMClusters are not actively watched for changes; they are only reconciled periodically
  • All objects must be referred to by name, label selectors are not supported
  • Only VMClusters are supported; VMSingle support is deferred to later versions
  • Two delays are tweakable:
    • vmclusterWaitReadyDeadline
    • delay between zone updates
  • No additional metric to indicate that the cluster is being upgraded to silence possible alerts

TODO:

  • Add changelog entry
  • Fix flaking tests
  • Set ownerRefs to managed VMClusters
  • Add high-level description of VMDistributed and problem space
  • Description-less CRD should be applied for development only. Rephrase descriptions in existing parts to make them fit for production
  • Squash commits
    Keeping original commits for review, as it's useful to show how the feature was developed
  • Update existing documentation to mention VMDistributed and describe its target architecture and existing shortcomings

@f41gh7 f41gh7 self-assigned this Oct 21, 2025
Comment thread config/rbac/role.yaml
Comment thread config/rbac/operator_vmdistributedcluster_editor_role.yaml Outdated
@AndrewChubatiuk
Contributor

initially thought distributed CR is needed for full distributed setup management, but looks like it only performs version upgrade. In this case just curious why we need different CRs for VM, VT and VL?

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from eaeacd4 to c02b24c Compare October 21, 2025 09:01
@vrutkovs
Collaborator Author

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility IMO - and we didn't get a request for other actions so far.

In this case just curious why we need different CRs for VM, VT and VL?

VL and VT don't have agents (yet) so their specs would be different. However we can reuse the same approach and probably even some helper functions

Member

@Haleygo Haleygo left a comment

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility IMO - and we didn't get a request for other actions so far.

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.
And since vmclusterSpec.ClusterVersion is optional, users could specify component versions inside vmclusterSpec which overrides the vmclusterSpec.ClusterVersion.

And currently, it seems VMDistributedCluster only covers a limited scenario where resources like vmcluster, vmuser, vmauth are defined and configured as needed.
Could you please provide an example of how to configure them to achieve a topology similar to the one described in the victoria-metrics-distributed chart? I expect VMDistributedCluster to be supported there when released.

Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
@vrutkovs
Collaborator Author

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.

Yup, setting generic overrideParams would be more flexible and, along with upgrades, would cover other maintenance tasks, e.g., adding replicas or setting flags

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 2 times, most recently from 280b2e6 to 04b44f9 Compare October 30, 2025 08:56
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 04b44f9 to 1336f73 Compare November 3, 2025 12:43
Comment thread ginkgo1.txt Outdated
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 1336f73 to c3b3e24 Compare November 3, 2025 12:49
Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread api/operator/v1alpha1/vmdistributedcluster_types.go Outdated
Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
Comment thread internal/controller/operator/factory/vmdistributedcluster/vmdistributedcluster.go Outdated
Comment thread api/operator/v1alpha1/vmdistributed_types.go Outdated
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 4 times, most recently from a5c6693 to 43e7344 Compare November 10, 2025 09:23
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 5 times, most recently from 4efee35 to 5a92268 Compare November 13, 2025 13:03
vrutkovs and others added 3 commits January 23, 2026 15:28
Add config/examples/vmdistributed-with-label-selector.yaml to demonstrate
how to use labelSelector in VMDistributed spec to target existing VMAgent.
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="config/examples/vmdistributed-overrides-version.yaml">

<violation number="1" location="config/examples/vmdistributed-overrides-version.yaml:53">
P2: This example references `vmcluster-$ZONE`, which doesn’t exist in the manifest and won’t resolve to a real VMCluster name. Since the example is meant to be copy‑paste runnable, use explicit names (or keep per‑zone refs) instead of an unresolved placeholder.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread config/examples/vmdistributed-overrides-version.yaml Outdated
@AndrewChubatiuk
Contributor

@cubic-dev-ai review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Jan 23, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

7 issues found across 73 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/finalize/vmdistributed.go">

<violation number="1" location="internal/controller/operator/factory/finalize/vmdistributed.go:18">
P2: The function comment is misleading. This function **preserves** referenced VMClusters by removing owner references (disowning them), not "removes all objects". Someone reading this comment would expect the opposite behavior.

(Based on your team's feedback about documenting exported structs and public methods with accurate descriptions.) [FEEDBACK_USED]</violation>
</file>

<file name="api/operator/v1alpha1/vmdistributed_types.go">

<violation number="1" location="api/operator/v1alpha1/vmdistributed_types.go:55">
P2: Inconsistent json tag: `CommonZone` is marked `+optional` but lacks `omitempty` in the json tag, unlike all other optional fields in this struct. This causes the empty struct to always be serialized.</violation>

<violation number="2" location="api/operator/v1alpha1/vmdistributed_types.go:443">
P2: The `Validate` method doesn't check for duplicate zone names. Since zone names can be used for `%ZONE%` substitution in commonZone spec, duplicates could cause ambiguous behavior. Consider adding a uniqueness check.</violation>
</file>
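
The uniqueness check suggested for `Validate` could look roughly like the following. This is a sketch only: the `Zone` type here is a simplified, hypothetical stand-in for the zone entries in the VMDistributed spec, and `validateUniqueZoneNames` is an illustrative name, not a function from the PR.

```go
package main

import "fmt"

// Zone is a simplified, hypothetical stand-in for a VMDistributed zone entry.
type Zone struct {
	Name string
}

// validateUniqueZoneNames returns an error that points at the first zone
// whose name was already used by an earlier entry.
func validateUniqueZoneNames(zones []Zone) error {
	seen := make(map[string]struct{}, len(zones))
	for i, z := range zones {
		if _, dup := seen[z.Name]; dup {
			return fmt.Errorf("spec.zones[%d]: duplicate zone name %q", i, z.Name)
		}
		seen[z.Name] = struct{}{}
	}
	return nil
}
```

Reporting the index of the offending entry makes the error actionable when the zone list is long.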

<file name="config/examples/vmdistributed-with-label-selector.yaml">

<violation number="1" location="config/examples/vmdistributed-with-label-selector.yaml:6">
P2: This example isn’t self-contained: it references a VMAgent (via labelSelector) and a VMAuth (by name) that aren’t defined in the manifest. The team’s docs feedback requires examples to be complete and runnable as copy‑paste demos. Please add the referenced resources (or inline them) so the example can be applied without extra missing objects.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) [FEEDBACK_USED]</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmcluster.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmcluster.go:88">
P2: Comment is misleading - this function waits for a configurable status, not specifically `UpdateStatusOperational`. Update the comment to accurately reflect the generic behavior.

(Based on your team's feedback about adding accurate comments when logic is non-obvious.) [FEEDBACK_USED]</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmdistributed.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmdistributed.go:241">
P2: Using environment variable to conditionally ignore errors is risky. If `E2E_TEST=true` is accidentally set in production, critical errors would be silently ignored. Consider using build tags, a dedicated test mode configuration, or dependency injection for testability instead.

(Based on your team's feedback about adding comments when logic is non-obvious.) [FEEDBACK_USED]</violation>
</file>

<file name="config/examples/vmdistributed-vmagent.yaml">

<violation number="1" location="config/examples/vmdistributed-vmagent.yaml:4">
P2: The example defines VMClusters separately but references them in VMDistributed using `name`, which the API reserves for creating new clusters. Use `ref` to point to the existing VMCluster resources (or drop the standalone VMCluster manifests) so the example is consistent and runnable.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/controller/operator/factory/finalize/vmdistributed.go Outdated
Comment thread api/operator/v1alpha1/vmdistributed_types.go Outdated
Comment thread api/operator/v1alpha1/vmdistributed_types.go
metadata:
  name: with-label-selector
spec:
  vmagent:
Contributor

@cubic-dev-ai cubic-dev-ai Bot Jan 23, 2026

P2: This example isn’t self-contained: it references a VMAgent (via labelSelector) and a VMAuth (by name) that aren’t defined in the manifest. The team’s docs feedback requires examples to be complete and runnable as copy‑paste demos. Please add the referenced resources (or inline them) so the example can be applied without extra missing objects.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.)

<file context>
@@ -0,0 +1,25 @@
+metadata:
+  name: with-label-selector
+spec:
+  vmagent:
+    labelSelector:
+      matchLabels:
</file context>

Comment thread internal/controller/operator/factory/vmdistributed/vmcluster.go Outdated
Comment thread internal/controller/operator/factory/vmdistributed/vmdistributed.go Outdated
Comment thread config/examples/vmdistributed-vmagent.yaml Outdated
@AndrewChubatiuk
Contributor

@cubic-dev-ai review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Jan 23, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 74 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="test/e2e/vmdistributed_test.go">

<violation number="1" location="test/e2e/vmdistributed_test.go:331">
P2: Duplicate Eventually block: This check for VMDistributed becoming operational is identical to the one at lines 281-284 that already succeeded. This redundant code adds unnecessary test execution time and appears to be a copy-paste error or incomplete refactoring.</violation>
</file>

<file name="internal/controller/operator/factory/finalize/vmdistributed.go">

<violation number="1" location="internal/controller/operator/factory/finalize/vmdistributed.go:23">
P1: Potential nil pointer dereference: `zone.VMCluster` can be nil (it's a pointer type with `omitempty`), but the code accesses `zone.VMCluster.Ref` without a nil check. This will panic when a zone has no VMCluster defined. Use the existing `isRefSet()` helper method which safely handles nil receivers.</violation>
</file>
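
The nil-receiver pattern mentioned in the finding above can be illustrated as follows. The types and the body of `isRefSet` are assumptions made for this sketch; only the helper's name and the `zone.VMCluster.Ref` access path come from the review comment.

```go
package main

// ClusterRef, ZoneCluster, and Zone are simplified, hypothetical types.
type ClusterRef struct {
	Name string
}

type ZoneCluster struct {
	Ref *ClusterRef
}

type Zone struct {
	VMCluster *ZoneCluster // optional, may be nil
}

// isRefSet is safe to call on a nil *ZoneCluster because it checks the
// receiver before touching any field.
func (c *ZoneCluster) isRefSet() bool {
	return c != nil && c.Ref != nil && c.Ref.Name != ""
}

// referencedClusterNames collects the names of referenced clusters and
// skips zones without a VMCluster or without a ref instead of panicking.
func referencedClusterNames(zones []Zone) []string {
	var names []string
	for _, z := range zones {
		if !z.VMCluster.isRefSet() {
			continue
		}
		names = append(names, z.VMCluster.Ref.Name)
	}
	return names
}
```

Calling a method with a pointer receiver on a nil pointer is legal in Go; the panic only occurs when a field is dereferenced, which the guard prevents.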

<file name="internal/controller/operator/factory/vmdistributed/vmagent.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmagent.go:340">
P1: New URLs appended during VMAgent update don't preserve auth config from `writeSpecMap`. When a new cluster is added to an existing VMAgent, any authentication configuration (BasicAuth, OAuth2, TLS, etc.) specified in the CR spec for that URL will be lost. This is inconsistent with the creation case (lines 210-218) and existing URL handling (lines 323-333), both of which check `writeSpecMap`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/controller/operator/factory/finalize/vmdistributed.go Outdated
Comment thread internal/controller/operator/factory/vmdistributed/vmagent.go Outdated
Comment thread test/e2e/vmdistributed_test.go
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 6 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/vmdistributed/vmagent.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmagent.go:242">
P1: Bug: Spec comparison always returns equal because it compares after assignment. The line `vmagentObj.Spec = vmagentSpec` assigns the new spec before this comparison, so `vmagentObj.Spec` IS `vmagentSpec` - they will always be DeepEqual. Spec changes to existing VMAgents will never be applied.

Store the original spec before assignment and compare against that, or move the comparison before the assignment.</violation>
</file>
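
The compare-after-assign problem described above is easy to reproduce in isolation. In this sketch, `AgentSpec`, `Agent`, and both function names are hypothetical; it only demonstrates why the comparison has to happen before the assignment.

```go
package main

import "reflect"

// AgentSpec and Agent are simplified, hypothetical stand-ins.
type AgentSpec struct {
	RemoteWriteURLs []string
}

type Agent struct {
	Spec AgentSpec
}

// buggyNeedsUpdate assigns first, so the comparison sees two identical
// values and always reports "no change".
func buggyNeedsUpdate(obj *Agent, desired AgentSpec) bool {
	obj.Spec = desired
	return !reflect.DeepEqual(obj.Spec, desired)
}

// needsUpdate compares the existing spec with the desired one before
// overwriting it, so real changes are detected.
func needsUpdate(obj *Agent, desired AgentSpec) bool {
	changed := !reflect.DeepEqual(obj.Spec, desired)
	obj.Spec = desired
	return changed
}
```

Either ordering the comparison first, as here, or keeping a copy of the original spec addresses the issue; the review comment names both options.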

<file name="api/operator/v1alpha1/vmdistributed_types.go">

<violation number="1" location="api/operator/v1alpha1/vmdistributed_types.go:76">
P2: Missing `omitempty` in JSON tag for optional field. The `RemoteWrite` field is marked `+optional` but lacks `omitempty`, causing nil values to serialize as `null` instead of being omitted. This is inconsistent with other optional pointer fields in this file.</violation>
</file>
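
The effect of a missing `omitempty` on a nil pointer field can be reproduced with `encoding/json` alone. The struct names below are simplified stand-ins chosen for this sketch, not the operator's types.

```go
package main

import "encoding/json"

// RemoteWrite is a simplified, hypothetical payload type.
type RemoteWrite struct {
	URL string `json:"url"`
}

// withoutOmitEmpty serializes a nil pointer as an explicit null.
type withoutOmitEmpty struct {
	RemoteWrite *RemoteWrite `json:"remoteWrite"`
}

// withOmitEmpty drops the key entirely when the pointer is nil.
type withOmitEmpty struct {
	RemoteWrite *RemoteWrite `json:"remoteWrite,omitempty"`
}

// marshal returns the JSON encoding of v as a string.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}
```

With these definitions, `marshal(withoutOmitEmpty{})` yields `{"remoteWrite":null}`, while `marshal(withOmitEmpty{})` yields `{}`.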

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/controller/operator/factory/vmdistributed/vmagent.go
Comment thread api/operator/v1alpha1/vmdistributed_types.go Outdated
@AndrewChubatiuk
Contributor

@cubic-dev-ai review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Jan 23, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 74 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="test/e2e/vmdistributed_test.go">

<violation number="1" location="test/e2e/vmdistributed_test.go:1263">
P2: Inconsistent pattern for checking NotFound errors. Use the same `MatchError(k8serrors.IsNotFound, "IsNotFound")` pattern used elsewhere in this file (lines 56, 1250, 1255, 1270, etc.) for consistency and cleaner code.</violation>
</file>

<file name="config/examples/vmdistributed-vmagent.yaml">

<violation number="1" location="config/examples/vmdistributed-vmagent.yaml:35">
P2: The example manifest references VMClusters that are not defined anywhere in the file, so it is not runnable as a copy‑paste demo. Include sample VMCluster resources (or inline specs under each zone) so the example is self‑contained.</violation>
</file>

<file name="test/e2e/suite/suite.go">

<violation number="1" location="test/e2e/suite/suite.go:160">
P3: Remove the duplicate "VMAUTHDEFAULT" entry in resourceEnvsPrefixes to avoid redundant env var setup and clarify the intended list.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmagent.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmagent.go:203">
P1: Index mismatch: `vmClusters` is sorted by `observedGeneration` in `fetchVMClusters`, but this loop assumes indices correspond to `cr.Spec.Zones` order. This causes URLs to be assigned to wrong remoteWrite configurations, potentially routing writes to incorrect clusters.

Consider either: (1) not sorting vmClusters in `fetchVMClusters`, (2) maintaining a mapping between zone index and cluster, or (3) matching clusters to remoteWrites by a stable identifier rather than index.</violation>
</file>
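
Option (3) above, matching by a stable identifier instead of by index, might look like this. It is a sketch under assumptions: the `Zone` and `Cluster` types, the `ClusterName` field, and the `urlFor` callback are hypothetical and only model the idea of looking clusters up by name.

```go
package main

import "fmt"

// Cluster and Zone are simplified, hypothetical stand-ins.
type Cluster struct {
	Name               string
	ObservedGeneration int64
}

type Zone struct {
	ClusterName string
}

// remoteWriteURLsByZone returns one URL per zone, in zone order, no matter
// how the clusters slice has been sorted.
func remoteWriteURLsByZone(zones []Zone, clusters []Cluster, urlFor func(Cluster) string) ([]string, error) {
	byName := make(map[string]Cluster, len(clusters))
	for _, c := range clusters {
		byName[c.Name] = c
	}
	urls := make([]string, 0, len(zones))
	for i, z := range zones {
		c, ok := byName[z.ClusterName]
		if !ok {
			return nil, fmt.Errorf("spec.zones[%d]: no cluster named %q", i, z.ClusterName)
		}
		urls = append(urls, urlFor(c))
	}
	return urls, nil
}
```

Because the lookup is keyed by name, sorting the clusters for update ordering no longer affects which URL ends up in which remoteWrite entry.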

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

		return nil, fmt.Errorf("failed to unmarshal.spec.zones[*].remoteWrite of VMDistributed=%s/%s: %w", cr.Name, cr.Namespace, err)
	}

	for i := range vmClusters {
Contributor

@cubic-dev-ai cubic-dev-ai Bot Jan 23, 2026

P1: Index mismatch: vmClusters is sorted by observedGeneration in fetchVMClusters, but this loop assumes indices correspond to cr.Spec.Zones order. This causes URLs to be assigned to wrong remoteWrite configurations, potentially routing writes to incorrect clusters.

Consider either: (1) not sorting vmClusters in fetchVMClusters, (2) maintaining a mapping between zone index and cluster, or (3) matching clusters to remoteWrites by a stable identifier rather than index.

<file context>
@@ -0,0 +1,274 @@
+		return nil, fmt.Errorf("failed to unmarshal.spec.zones[*].remoteWrite of VMDistributed=%s/%s: %w", cr.Name, cr.Namespace, err)
+	}
+
+	for i := range vmClusters {
+		vmagentSpec.RemoteWrite[i].URL = remoteWriteURL(vmClusters[i])
+	}
</file context>

Comment thread test/e2e/vmdistributed_test.go
Comment thread config/examples/vmdistributed-vmagent.yaml
Comment thread test/e2e/suite/suite.go
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="test/e2e/vmdistributed_test.go">

<violation number="1" location="test/e2e/vmdistributed_test.go:1375">
P2: The global cleanup flags are set to true but never reset, so later tests will incorrectly expect NotFound on VMAgent/VMAuth deletion and fail. Reset them after this test to avoid cross-test pollution.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread test/e2e/vmdistributed_test.go
The `expectedVMAgentToBeRemoved` and `expectedVMAuthToBeRemoved` flags were not being reset after
the `VMDistributed` deletion test. This commit adds a `DeferCleanup` function to ensure these flags
are reset to `false` after the test, preventing potential interference with subsequent tests.
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="test/e2e/vmdistributed_test.go">

<violation number="1" location="test/e2e/vmdistributed_test.go:124">
P3: The new comment in createVMAgent refers to VMAuth, but the code resets expectedVMAgentToBeRemoved. Update the comment to match the VMAgent flag being reset.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

			},
		},
	}
	// Reset expectedVMAuthToBeRemoved so that it would not leak into other tests
Contributor

@cubic-dev-ai cubic-dev-ai Bot Jan 23, 2026

P3: The new comment in createVMAgent refers to VMAuth, but the code resets expectedVMAgentToBeRemoved. Update the comment to match the VMAgent flag being reset.

<file context>
@@ -121,6 +121,10 @@ func createVMAuth(ctx context.Context, k8sClient client.Client, name, namespace
 			},
 		},
 	}
+	// Reset expectedVMAuthToBeRemoved so that it would not leak into other tests
+	DeferCleanup(func() {
+		expectedVMAuthToBeRemoved = false
</file context>

@AndrewChubatiuk
Contributor

@cubic-dev-ai review this PR

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Jan 24, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

7 issues found across 74 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/vmdistributed/vmagent.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmagent.go:229">
P2: Sorting logic for new remote write URLs defaults to index 0, causing unstable ordering. When a new URL is added that doesn't exist in the current VMAgent spec, the map lookup returns 0, making new entries sort to the beginning with undefined relative order among themselves. Consider using `ok` idiom to detect missing keys and assign them indices after existing entries.</violation>
</file>
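
The `ok` idiom mentioned above can be sketched as follows. The function name `orderURLs` and the choice to rank unknown URLs after all existing ones are assumptions for illustration; the PR's actual sorting code is not shown in this thread.

```go
package main

import "sort"

// orderURLs returns desired reordered so that URLs already present in
// current keep their existing relative order, and URLs that are new are
// placed after them in the order they appear in desired.
func orderURLs(current, desired []string) []string {
	pos := make(map[string]int, len(current))
	for i, u := range current {
		pos[u] = i
	}
	// rank distinguishes "found at index 0" from "not found" via ok.
	rank := func(u string) int {
		if i, ok := pos[u]; ok {
			return i
		}
		return len(current)
	}
	out := append([]string(nil), desired...)
	// SliceStable keeps the input order among entries with equal rank.
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i]) < rank(out[j])
	})
	return out
}
```

Without the `ok` check, a missing key yields the zero value 0 and is indistinguishable from the first existing entry, which is the instability the finding describes.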

<file name="internal/controller/operator/factory/vmdistributed/util_test.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/util_test.go:31">
P3: Swap assert.Equal argument order so failure messages correctly show expected vs. actual values.</violation>
</file>

<file name="internal/controller/operator/vmdistributed_controller_test.go">

<violation number="1" location="internal/controller/operator/vmdistributed_controller_test.go:46">
P2: Handle non-NotFound errors from the Get call; otherwise test failures (e.g., RBAC/connection errors) are silently ignored and the test can pass with an invalid state.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmdistributed.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmdistributed.go:248">
P1: Logic bug: when there's only one VMCluster, reads are disabled before update (line 164-175) but never re-enabled because this condition requires `len(vmClusters) > 1`. Single-cluster deployments would have reads permanently disabled after updates.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmcluster.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmcluster.go:58">
P2: Silently continuing to poll when VMCluster is deleted may hide real issues. If the cluster is not found during the wait, consider returning an immediate error rather than waiting until timeout, which would provide clearer feedback that the cluster was unexpectedly deleted.</violation>
</file>
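
A fail-fast variant of the wait loop could be structured like this. The sketch uses only the standard library: `ErrNotFound` is a hypothetical stand-in for a Kubernetes NotFound error (real code would more likely use `k8serrors.IsNotFound`, which is mentioned elsewhere in this review), and the attempt-based loop is a simplification of a deadline-based wait.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var ErrNotFound = errors.New("not found")

// waitForStatus polls get up to attempts times, sleeping interval between
// polls. It returns immediately when the object is gone, keeps polling on
// other errors, and reports a timeout when the wanted status never appears.
func waitForStatus(attempts int, interval time.Duration, want string, get func() (string, error)) error {
	for i := 0; i < attempts; i++ {
		status, err := get()
		if errors.Is(err, ErrNotFound) {
			return fmt.Errorf("object deleted while waiting for status %q: %w", want, err)
		}
		if err == nil && status == want {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("timed out waiting for status %q after %d attempts", want, attempts)
}
```

Wrapping the sentinel with `%w` lets callers distinguish "deleted" from "timed out" with `errors.Is`.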

<file name="docs/resources/vmdistributed.md">

<violation number="1" location="docs/resources/vmdistributed.md:105">
P2: This example references pre-existing VMCluster objects but doesn’t define them or note that they must already exist, so it isn’t runnable as a copy‑paste demo. Consider adding the VMCluster manifests (or an explicit note) alongside the example.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) [FEEDBACK_USED]</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmcluster_test.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmcluster_test.go:178">
P3: These assertions assume a deterministic order even though the sort is unstable when ObservedGeneration values are equal (both are 0 here), which can make the test flaky. Prefer order-insensitive assertions or explicitly pick the inline cluster before asserting its spec.</violation>
</file>
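
One way to make the order deterministic when generations are equal is a stable sort, which preserves input order among ties. This is an alternative to the order-insensitive assertions suggested above, shown as a sketch: the `Cluster` type is a hypothetical stand-in, and ascending order by `ObservedGeneration` is an assumption.

```go
package main

import "sort"

// Cluster is a simplified, hypothetical stand-in for a fetched VMCluster.
type Cluster struct {
	Name               string
	ObservedGeneration int64
}

// sortByGeneration orders clusters by ascending ObservedGeneration.
// sort.SliceStable guarantees that clusters with equal generations keep
// the relative order they had in the input slice.
func sortByGeneration(cs []Cluster) {
	sort.SliceStable(cs, func(i, j int) bool {
		return cs[i].ObservedGeneration < cs[j].ObservedGeneration
	})
}
```

With `sort.Slice` the relative order of equal elements is unspecified, which is what makes index-based assertions on the result fragile.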

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/controller/operator/factory/vmdistributed/vmdistributed.go
Comment thread internal/controller/operator/factory/vmdistributed/vmagent.go
Comment thread internal/controller/operator/vmdistributed_controller_test.go
Comment thread internal/controller/operator/factory/vmdistributed/vmcluster.go
Comment thread docs/resources/vmdistributed.md
Comment thread internal/controller/operator/factory/vmdistributed/util_test.go
	got, err := fetchVMClusters(context.Background(), rclient, cr)
	assert.NoError(t, err)
	assert.Len(t, got, 2)
	assert.Equal(t, "ref", got[0].Name)
Contributor

@cubic-dev-ai cubic-dev-ai Bot Jan 24, 2026

P3: These assertions assume a deterministic order even though the sort is unstable when ObservedGeneration values are equal (both are 0 here), which can make the test flaky. Prefer order-insensitive assertions or explicitly pick the inline cluster before asserting its spec.

<file context>
@@ -0,0 +1,234 @@
+	got, err := fetchVMClusters(context.Background(), rclient, cr)
+	assert.NoError(t, err)
+	assert.Len(t, got, 2)
+	assert.Equal(t, "ref", got[0].Name)
+	assert.Equal(t, "inline", got[1].Name)
+	assert.Equal(t, inlineSpec.ClusterVersion, got[1].Spec.ClusterVersion)
</file context>

@AndrewChubatiuk

@cubic-dev-ai review this PR

@cubic-dev-ai

cubic-dev-ai Bot commented Jan 24, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.


@cubic-dev-ai cubic-dev-ai Bot left a comment

6 issues found across 74 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="test/e2e/vmdistributed_test.go">

<violation number="1" location="test/e2e/vmdistributed_test.go:124">
P2: Misleading comment: says `expectedVMAuthToBeRemoved` but code resets `expectedVMAgentToBeRemoved`. This copy-paste error could confuse maintainers.</violation>
</file>

<file name="config/examples/vmdistributed-with-label-selector.yaml">

<violation number="1" location="config/examples/vmdistributed-with-label-selector.yaml:7">
P2: This example isn’t runnable as-is: it selects an existing VMAgent via labelSelector and references a VMAuth by name, but the file only defines the VMDistributed resource. Please include the referenced VMAgent/VMAuth manifests (or inline them in the example) so users can apply the example without having to guess missing resources.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) [FEEDBACK_USED]</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmdistributed.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmdistributed.go:241">
P2: Test-specific logic (`E2E_TEST` env check) embedded in production code. This silently ignores errors during E2E tests, which could mask real issues. Consider instead making the VMAgent metrics check configurable via the CR spec or using a skip annotation, rather than an environment variable.</violation>
</file>

<file name="internal/controller/operator/factory/vmdistributed/vmcluster.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/vmcluster.go:77">
P2: Non-stable sort may cause non-deterministic VMCluster ordering. When multiple VMClusters have the same `ObservedGeneration`, their relative order will be undefined across reconciliations. Consider using `sort.SliceStable` or adding a secondary sort key (e.g., by name) to ensure consistent ordering for rolling updates.</violation>
</file>

<file name="docs/resources/vmdistributed.md">

<violation number="1" location="docs/resources/vmdistributed.md:52">
P2: The “inline VMCluster specifications” example isn’t runnable because vmagent/vmauth are only named (no inline spec or accompanying manifests). Add inline specs or define the referenced VMAgent/VMAuth resources so users can apply the example as-is.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) [FEEDBACK_USED]</violation>

<violation number="2" location="docs/resources/vmdistributed.md:60">
P2: The VMCluster example uses a non-existent `spec` subfield under `vmstorage/vmselect/vminsert`, so it won’t validate. Remove the extra `spec` nesting to match the actual VMCluster schema.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
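
For the non-stable sort flagged in `vmcluster.go`, a deterministic ordering could look roughly like the sketch below. It assumes ascending order by `ObservedGeneration` (the direction used in the PR is not shown here) and breaks ties by name; `clusterStatus` and `sortClusters` are hypothetical names used only for illustration.

```go
package main

import "sort"

// clusterStatus is a hypothetical stand-in holding the two fields that
// matter for ordering.
type clusterStatus struct {
	Name               string
	ObservedGeneration int64
}

// sortClusters orders clusters by ObservedGeneration (ascending, as an
// assumption) and falls back to Name when generations are equal, so the
// result is the same on every reconciliation.
func sortClusters(cs []clusterStatus) {
	sort.SliceStable(cs, func(i, j int) bool {
		if cs[i].ObservedGeneration != cs[j].ObservedGeneration {
			return cs[i].ObservedGeneration < cs[j].ObservedGeneration
		}
		return cs[i].Name < cs[j].Name
	})
}
```

With the name tie-break in place the comparison defines a total order over distinct names, so `sort.Slice` would give the same result; `sort.SliceStable` is kept here to mirror the reviewer's suggestion.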

},
},
}
// Reset expectedVMAuthToBeRemoved so that it would not leak into other tests

@cubic-dev-ai cubic-dev-ai Bot Jan 24, 2026

P2: Misleading comment: says expectedVMAuthToBeRemoved but code resets expectedVMAgentToBeRemoved. This copy-paste error could confuse maintainers.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At test/e2e/vmdistributed_test.go, line 124:

<comment>Misleading comment: says `expectedVMAuthToBeRemoved` but code resets `expectedVMAgentToBeRemoved`. This copy-paste error could confuse maintainers.</comment>

<file context>
@@ -0,0 +1,1391 @@
+			},
+		},
+	}
+	// Reset expectedVMAuthToBeRemoved so that it would not leak into other tests
+	DeferCleanup(func() {
+		expectedVMAuthToBeRemoved = false
</file context>

Comment thread config/examples/vmdistributed-with-label-selector.yaml
for _, vmAgentObj := range vmAgentObjs {
	if err := waitForVMClusterVMAgentMetrics(ctx, httpClient, vmAgentObj, vmAgentFlushDeadlineDeadline, defaultVMAgentCheckInterval, rclient); err != nil {
		// Ignore this error when running e2e tests - vmagent pods will be unreachable from outside of the cluster
		if os.Getenv("E2E_TEST") == "true" {

@cubic-dev-ai cubic-dev-ai Bot Jan 24, 2026

P2: Test-specific logic (E2E_TEST env check) embedded in production code. This silently ignores errors during E2E tests, which could mask real issues. Consider instead making the VMAgent metrics check configurable via the CR spec or using a skip annotation, rather than an environment variable.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/controller/operator/factory/vmdistributed/vmdistributed.go, line 241:

<comment>Test-specific logic (`E2E_TEST` env check) embedded in production code. This silently ignores errors during E2E tests, which could mask real issues. Consider instead making the VMAgent metrics check configurable via the CR spec or using a skip annotation, rather than an environment variable.</comment>

<file context>
@@ -0,0 +1,280 @@
+		for _, vmAgentObj := range vmAgentObjs {
+			if err := waitForVMClusterVMAgentMetrics(ctx, httpClient, vmAgentObj, vmAgentFlushDeadlineDeadline, defaultVMAgentCheckInterval, rclient); err != nil {
+				// Ignore this error when running e2e tests - vmagent pods will be unreachable from outside of the cluster
+				if os.Getenv("E2E_TEST") == "true" {
+					continue
+				}
</file context>

Comment thread internal/controller/operator/factory/vmdistributed/vmcluster.go
Comment thread docs/resources/vmdistributed.md
metadata:
  name: my-distributed-cluster
spec:
  vmagent:

@cubic-dev-ai cubic-dev-ai Bot Jan 24, 2026

P2: The “inline VMCluster specifications” example isn’t runnable because vmagent/vmauth are only named (no inline spec or accompanying manifests). Add inline specs or define the referenced VMAgent/VMAuth resources so users can apply the example as-is.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/resources/vmdistributed.md, line 52:

<comment>The “inline VMCluster specifications” example isn’t runnable because vmagent/vmauth are only named (no inline spec or accompanying manifests). Add inline specs or define the referenced VMAgent/VMAuth resources so users can apply the example as-is.

(Based on your team's feedback about ensuring documentation examples are complete and runnable.) </comment>

<file context>
@@ -0,0 +1,195 @@
+metadata:
+  name: my-distributed-cluster
+spec:
+  vmagent:
+    name: my-distributed-vmagent
+  vmauth:
</file context>

Development

Successfully merging this pull request may close these issues.

Add support for a distributed deployment

6 participants