Skip to content

CCXDEV-15790: extractor controller#1280

Open
opokornyy wants to merge 5 commits into
openshift:masterfrom
opokornyy:CCXDEV-15790-extractor-controller
Open

CCXDEV-15790: extractor controller#1280
opokornyy wants to merge 5 commits into
openshift:masterfrom
opokornyy:CCXDEV-15790-extractor-controller

Conversation

@opokornyy
Copy link
Copy Markdown
Contributor

@opokornyy opokornyy commented Apr 29, 2026

Implement controller that manages runtime-extractor resources.

Categories

  • Bugfix
  • Data Enhancement
  • Feature
  • Backporting
  • Others (CI, Infrastructure, Documentation)

Sample Archive

  • None

Documentation

  • pkg/controller/runtimeextractor/README.md

Unit Tests

  • pkg/controller/runtimeextractor/informer_test.go
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • pkg/controller/runtimeextractor/resources/retry_test.go

Privacy

Yes. There are no sensitive data in the newly collected information.

Changelog

  • None

Breaking Changes

No

References

https://redhat.atlassian.net/browse/CCXDEV-15790

Summary by CodeRabbit

  • New Features

    • Added a runtime-extractor controller with informer-driven reconciliation and resource management for the runtime-extractor deployment.
    • Enhanced operator status behavior to detect version changes and trigger reconciliations.
  • Configuration

    • Expanded RBAC permissions for daemonset and event management.
    • Deployment now includes new environment variables to pin related component images.
  • Documentation

    • Added detailed runtime-extractor controller documentation and usage guidance.
  • Tests

    • Added unit tests for informer, resource management, retry-on-conflict, and daemonset behavior.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Apr 29, 2026

@opokornyy: This pull request references CCXDEV-15790 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Implement controller that manages runtime-extractor resources.

Categories

  • Bugfix
  • Data Enhancement
  • Feature
  • Backporting
  • Others (CI, Infrastructure, Documentation)

Sample Archive

  • None

Documentation

  • pkg/controller/runtimeextractor/README.md

Unit Tests

  • pkg/controller/runtimeextractor/informer_test.go
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • pkg/controller/runtimeextractor/resources/retry_test.go

Privacy

Yes. There are no sensitive data in the newly collected information.

Changelog

  • None

Breaking Changes

No

References

https://redhat.atlassian.net/browse/CCXDEV-15790

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

Adds a runtime-extractor controller, informer, and ResourceManager; embeds and manages a DaemonSet with env-pinned images and retry-on-conflict; wires operator startup and status to signal updates; updates RBAC and Deployment envs/indentation; and adds tests and documentation.

Changes

Runtime Extractor feature

Layer / File(s) Summary
RBAC & Deployment edits
manifests/03-clusterrole.yaml, manifests/06-deployment.yaml
Adds ClusterRole rule for apps/daemonsets, expands openshift-insights Role events verbs, fixes Deployment YAML indentation, and adds RELATED_IMAGE_INSIGHTS_RUNTIME_EXTRACTOR, RELATED_IMAGE_INSIGHTS_RUNTIME_EXPORTER, RELATED_IMAGE_KUBE_RBAC_PROXY env vars.
Operator initialization & wiring
pkg/controller/operator.go
Moves config-observer startup earlier, creates buffered updateCh, starts namespace-scoped informer factory and runtime-extractor controller, and passes updateCh into status controller.
Status controller changes & tests
pkg/controller/status/controller.go, pkg/controller/status/controller_test.go
Adds updateCh to status controller, replaces prior semver helper with checkVersionChanges, signals updateCh on any version change, and refactors tests to assert both version-change flags.
RuntimeExtractor controller core & package doc
pkg/controller/runtimeextractor/controller.go, pkg/controller/runtimeextractor/doc.go
Adds ConfigNotifier/ResourceManager/ResourceInformer interfaces and runtimeExtractorController with a Run loop reconciling on config changes, cluster-version updates, and informer-detected resource drift.
Resource informer & tests
pkg/controller/runtimeextractor/informer.go, pkg/controller/runtimeextractor/informer_test.go
Adds informer targeting insights-runtime-extractor DaemonSet in openshift-insights, emitting non-blocking notifications on generation changes and deletions; adds tests for update, deletion, and namespace isolation.
DaemonSet ResourceManager, manifest & tests
pkg/controller/runtimeextractor/resources/*.go, pkg/controller/runtimeextractor/resources/manifests/runtime-extractor-daemonset.yaml, pkg/controller/runtimeextractor/resources/*_test.go
Embeds DaemonSet YAML, substitutes container images from env vars (with defaults), applies via server-side apply with retry-on-conflict, provides delete/get/exists helpers, exposes ResourceManager facade, and adds comprehensive unit tests including retry-on-conflict verification.
Documentation & README
pkg/controller/runtimeextractor/README.md, pkg/controller/runtimeextractor/doc.go
Adds README and package doc describing controller architecture, reconciliation triggers, drift handling, server-side apply behavior, usage examples, and tests.

Sequence Diagram(s)

sequenceDiagram
    participant Op as Operator.Run()
    participant RC as RuntimeExtractorController
    participant CN as ConfigNotifier
    participant RM as ResourceManager
    participant RI as ResourceInformer
    participant K8s as Kubernetes API

    Op->>RC: NewRuntimeExtractorController(configNotifier, updateCh, ...)
    Op->>RC: Run(ctx)

    activate RC
    rect rgba(100, 150, 200, 0.5)
    Note over RC: Initial Config Check
    RC->>CN: Config()
    CN-->>RC: InsightsConfiguration
    alt DisableRuntimeExtractor = false
        RC->>RM: ApplyRuntimeExtractorResources(ctx)
        RM->>K8s: Apply DaemonSet
        K8s-->>RM: Success
    end
    end

    rect rgba(150, 100, 200, 0.5)
    Note over RC: Reconciliation Triggers
    par Config Change Event
        CN->>CN: ConfigChanged() <- signal
        CN-->>RC: Config change notification
        RC->>CN: Config()
        CN-->>RC: Updated InsightsConfiguration
        alt DisableRuntimeExtractor = true
            RC->>RM: DeleteRuntimeExtractorResources(ctx)
        else
            RC->>RM: ApplyRuntimeExtractorResources(ctx)
        end
    and Cluster Version Update
        Op->>RC: Send on updateCh
        RC->>CN: Config()
        CN-->>RC: InsightsConfiguration
        alt DisableRuntimeExtractor = false
            RC->>RM: ApplyRuntimeExtractorResources(ctx)
        end
    and External Resource Drift
        K8s->>RI: DaemonSet updated/deleted
        RI->>RC: ResourceModified()
        RC->>CN: Config()
        CN-->>RC: InsightsConfiguration
        alt DisableRuntimeExtractor = false
            RC->>RM: ApplyRuntimeExtractorResources(ctx)
        else
            RC->>RM: DeleteRuntimeExtractorResources(ctx)
        end
    end
    end

    deactivate RC
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning WaitForCacheSync return values unchecked at lines 68, 115, 159. Assertion messages missing throughout daemonset_test.go, inconsistent with codebase. Check WaitForCacheSync return and fail if false; add meaningful messages to assertions matching controller_test.go pattern.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title 'CCXDEV-15790: extractor controller' directly references the Jira ticket and clearly identifies the main change: implementing an extractor controller.
Description check ✅ Passed The description follows the repository template structure with all major sections completed, including feature category selection, documentation, unit tests, privacy considerations, and references.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR uses standard Go testing (testing.T), not Ginkgo. The custom check is specific to Ginkgo test names and is not applicable to this codebase. Standard Go test names are stable and deterministic.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests found in PR. All tests are standard Go unit tests using testing.T, which are outside the scope of this MicroShift compatibility check.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added; only standard Go unit tests (testing package) are included. SNO compatibility check applies only to Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed Runtime-extractor DaemonSet uses only linux nodeSelector with no topology-breaking constraints. DaemonSet scales appropriately for all OpenShift topologies (HA, SNO, TNF, TNA, HyperShift).
Ote Binary Stdout Contract ✅ Passed No OTE stdout violations found. All logging uses klog configured to stderr via logs.InitLogs() and alsologtostderr=true. No fmt.Print/Println/os.Stdout writes in non-test code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. All test files are standard Go unit tests (using testing.T) with fake Kubernetes clients - no network calls or IPv4 assumptions are made.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from katushiik11 and ncaak April 29, 2026 13:55
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opokornyy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
pkg/controller/runtimeextractor/informer_test.go (1)

68-68: Assert informer cache sync success to reduce flakiness.

WaitForCacheSync return values are ignored; if sync fails, tests may fail nondeterministically later. Assert all returned values are true before proceeding.

✅ Minimal hardening pattern
-	informerFactory.WaitForCacheSync(ctx.Done())
+	for informerType, synced := range informerFactory.WaitForCacheSync(ctx.Done()) {
+		if !synced {
+			t.Fatalf("Informer cache failed to sync: %v", informerType)
+		}
+	}

Also applies to: 115-115, 159-159

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/informer_test.go` at line 68, The call to
informerFactory.WaitForCacheSync currently ignores its boolean return values,
causing flaky tests; change the test to capture the returned map (e.g., synced
:= informerFactory.WaitForCacheSync(ctx.Done())) and assert that all values in
that map are true (fail the test if any are false) before proceeding—apply the
same change for other occurrences of WaitForCacheSync in this test (lines
referencing informerFactory.WaitForCacheSync at the other noted locations).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifests/03-clusterrole.yaml`:
- Around line 129-140: The ClusterRole currently grants create/delete/update on
apiGroups "apps" resources "daemonsets" which is overbroad; remove the mutating
verbs (create, delete, update) for apps/daemonsets from the ClusterRole and
leave only read verbs (get, list, watch) there, then add those mutating verbs to
the existing namespaced Role that targets the openshift-insights namespace (the
namespaced Role that the runtime-extractor controller uses), and update/confirm
the RoleBinding so the controller's ServiceAccount is bound to that namespaced
Role (ensuring the controller retains get/list/watch at cluster or namespace
scope as intended and only gains create/delete/update within
openshift-insights).

In `@pkg/controller/runtimeextractor/controller.go`:
- Around line 147-179: The createDeployment, deleteDeployment and
updateDeployment functions currently log errors from
resourceManager.ApplyRuntimeExtractorResources / DeleteRuntimeExtractorResources
and swallow them; change each to return an error (e.g., func (re
*runtimeExtractorController) createDeployment(ctx context.Context) error) and
propagate the error to the caller so the reconciler can requeue/retry, or if you
prefer keep void signatures then explicitly trigger a reschedule (e.g.,
requeue/reconcile helper) on failure; also return early on isCreated checks as
before but ensure deleteDeployment returns nil when nothing to delete and
non-nil when DeleteRuntimeExtractorResources fails, and updateDeployment returns
the Apply error so transient API failures bubble up rather than being silently
logged.

In `@pkg/controller/runtimeextractor/informer.go`:
- Around line 103-121: In handleResourceDeletion (type resourceInformer) detect
and unwrap cache.DeletedFinalStateUnknown tombstones before calling
meta.Accessor: if obj is a tombstone, extract the tombstone.Obj (or underlying
runtime.Object) and reassign it to obj so meta.Accessor succeeds; then continue
the existing namespace/name checks and call ri.notifyModification() when
runtimeExtractorName is deleted. Ensure the tombstone handling uses the
cache.DeletedFinalStateUnknown type assertion and preserves existing logging
behavior.

In `@pkg/controller/runtimeextractor/README.md`:
- Around line 54-141: Update the README fenced code blocks that lack language
tags by adding appropriate languages: change the ASCII flow diagrams under the
"Reconciliation Flow", "Initial Deployment", "Configuration Change", and
"Resource Drift" headings to use a language tag such as text or mermaid (e.g.,
replace ``` with ```text or ```mermaid), keep the "Watch patterns" Go block as
```go and ensure the resource labels block remains ```yaml; this will satisfy
markdownlint MD040 without altering the content.

In `@pkg/controller/runtimeextractor/resources/daemonset_test.go`:
- Around line 104-107: The test currently only asserts images changed; replace
the NotEqual checks with exact equality checks against the expected env-derived
image strings so the test fails on incorrect mappings: in
pkg/controller/runtimeextractor/resources/daemonset_test.go update the
assertions for ds.Spec.Template.Spec.Containers[0].Image, [1].Image and
[2].Image to assert Equal against the precise expected image values (either
compute them from the same environment variables used by the code under test or
hard-code the exact expected strings used in the test setup) instead of using
assert.NotEqual.

In `@pkg/controller/runtimeextractor/resources/daemonset.go`:
- Around line 151-162: daemonSetExists currently swallows all get errors and
returns false, which treats API/auth/transport failures as "not found"; change
the signature to return (bool, error) (e.g., daemonSetExists(ctx
context.Context) (bool, error)), call rm.getDaemonSet(ctx) and if
apierrors.IsNotFound(err) return (false, nil), if err != nil return (false,
err), otherwise return (true, nil); update all callers of daemonSetExists to
handle the error return and only treat NotFound as absence while surfacing other
errors so reconciliation isn't skipped on temporary API failures.

In `@pkg/controller/status/controller.go`:
- Around line 205-208: The direct blocking send to c.updateCh in the status
reconciliation path can deadlock merge() if updateCh is nil or full; change it
to a nil-safe, non-blocking signal: check c.updateCh != nil and then use a
non-blocking send (select with a case c.updateCh <- struct{}{} and a default) so
you only attempt to signal when the channel exists and won't block if it's full.

---

Nitpick comments:
In `@pkg/controller/runtimeextractor/informer_test.go`:
- Line 68: The call to informerFactory.WaitForCacheSync currently ignores its
boolean return values, causing flaky tests; change the test to capture the
returned map (e.g., synced := informerFactory.WaitForCacheSync(ctx.Done())) and
assert that all values in that map are true (fail the test if any are false)
before proceeding—apply the same change for other occurrences of
WaitForCacheSync in this test (lines referencing
informerFactory.WaitForCacheSync at the other noted locations).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5cc38019-b1ed-410e-8cff-a4fd86c65c10

📥 Commits

Reviewing files that changed from the base of the PR and between 50525ba and b2636bb.

📒 Files selected for processing (15)
  • manifests/03-clusterrole.yaml
  • manifests/06-deployment.yaml
  • pkg/controller/operator.go
  • pkg/controller/runtimeextractor/README.md
  • pkg/controller/runtimeextractor/controller.go
  • pkg/controller/runtimeextractor/doc.go
  • pkg/controller/runtimeextractor/informer.go
  • pkg/controller/runtimeextractor/informer_test.go
  • pkg/controller/runtimeextractor/resources/daemonset.go
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • pkg/controller/runtimeextractor/resources/manager.go
  • pkg/controller/runtimeextractor/resources/manifests/runtime-extractor-daemonset.yaml
  • pkg/controller/runtimeextractor/resources/retry_test.go
  • pkg/controller/status/controller.go
  • pkg/controller/status/controller_test.go

Comment on lines +129 to 140
- apiGroups:
- "apps"
resources:
- daemonsets
verbs:
- get
- list
- watch
- create
- delete
- update
---
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Scope DaemonSet write permissions to namespace-level Role (least privilege).

Granting create/delete/update on apps/daemonsets in a ClusterRole allows mutating DaemonSets cluster-wide. The runtime-extractor controller manages a namespaced DaemonSet (openshift-insights), so this should be moved to the existing namespaced Role in openshift-insights.

🔧 Suggested RBAC tightening
-  - apiGroups:
-      - "apps"
-    resources:
-      - daemonsets
-    verbs:
-      - get
-      - list
-      - watch
-      - create
-      - delete
-      - update
 # In Role `insights-operator` (namespace: openshift-insights), add:
+  - apiGroups:
+      - apps
+    resources:
+      - daemonsets
+    verbs:
+      - get
+      - list
+      - watch
+      - create
+      - delete
+      - update
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/03-clusterrole.yaml` around lines 129 - 140, The ClusterRole
currently grants create/delete/update on apiGroups "apps" resources "daemonsets"
which is overbroad; remove the mutating verbs (create, delete, update) for
apps/daemonsets from the ClusterRole and leave only read verbs (get, list,
watch) there, then add those mutating verbs to the existing namespaced Role that
targets the openshift-insights namespace (the namespaced Role that the
runtime-extractor controller uses), and update/confirm the RoleBinding so the
controller's ServiceAccount is bound to that namespaced Role (ensuring the
controller retains get/list/watch at cluster or namespace scope as intended and
only gains create/delete/update within openshift-insights).

Comment on lines +147 to +179
func (re *runtimeExtractorController) createDeployment(ctx context.Context) {
klog.Info("Creating runtime extractor resources")

if err := re.resourceManager.ApplyRuntimeExtractorResources(ctx); err != nil {
klog.Errorf("Failed to apply runtime extractor resources: %v", err)
}
}

func (re *runtimeExtractorController) deleteDeployment(ctx context.Context) {
klog.Info("Deleting runtime extractor resources")

if !re.isCreated(ctx) {
klog.Info("Runtime extractor resources do not exist, nothing to delete")
return
}

if err := re.resourceManager.DeleteRuntimeExtractorResources(ctx); err != nil {
klog.Errorf("Failed to delete runtime extractor resources: %v", err)
}
}

func (re *runtimeExtractorController) updateDeployment(ctx context.Context) {
klog.Info("Updating runtime extractor resources")

// Avoid creating it when the cluster version is updated
if !re.isCreated(ctx) {
klog.Info("Runtime extractor resources not found, skipping update")
return
}

if err := re.resourceManager.ApplyRuntimeExtractorResources(ctx); err != nil {
klog.Errorf("Failed to apply runtime extractor resources: %v", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Surface reconcile failures instead of logging them away.

These helpers swallow apply/delete errors after logging. Because the controller is event-driven, a transient API failure can leave the runtime-extractor resources out of sync indefinitely if no new event arrives.

Please bubble the error into a retry/requeue path, or explicitly reschedule reconciliation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/controller.go` around lines 147 - 179, The
createDeployment, deleteDeployment and updateDeployment functions currently log
errors from resourceManager.ApplyRuntimeExtractorResources /
DeleteRuntimeExtractorResources and swallow them; change each to return an error
(e.g., func (re *runtimeExtractorController) createDeployment(ctx
context.Context) error) and propagate the error to the caller so the reconciler
can requeue/retry, or if you prefer keep void signatures then explicitly trigger
a reschedule (e.g., requeue/reconcile helper) on failure; also return early on
isCreated checks as before but ensure deleteDeployment returns nil when nothing
to delete and non-nil when DeleteRuntimeExtractorResources fails, and
updateDeployment returns the Apply error so transient API failures bubble up
rather than being silently logged.

Comment on lines +103 to +121
// handleResourceDeletion processes resource deletion events
func (ri *resourceInformer) handleResourceDeletion(obj interface{}, resourceType string) {
metadata, err := meta.Accessor(obj)
if err != nil {
klog.Errorf("Failed to get metadata from deleted runtime extractor %s: %v", resourceType, err)
return
}

// Only care about our specific resources
if metadata.GetNamespace() != runtimeExtractorNamespace {
return
}

name := metadata.GetName()
if name == runtimeExtractorName {
klog.Infof("Runtime extractor %s %s/%s was deleted externally, triggering reconciliation",
resourceType, metadata.GetNamespace(), name)
ri.notifyModification()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle tombstones in delete events.

DeleteFunc can receive cache.DeletedFinalStateUnknown tombstones from client-go. meta.Accessor(obj) will fail on that wrapper, so deletions of the target DaemonSet can be missed instead of triggering reconciliation.

Please unwrap the tombstone before filtering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/informer.go` around lines 103 - 121, In
handleResourceDeletion (type resourceInformer) detect and unwrap
cache.DeletedFinalStateUnknown tombstones before calling meta.Accessor: if obj
is a tombstone, extract the tombstone.Obj (or underlying runtime.Object) and
reassign it to obj so meta.Accessor succeeds; then continue the existing
namespace/name checks and call ri.notifyModification() when runtimeExtractorName
is deleted. Ensure the tombstone handling uses the
cache.DeletedFinalStateUnknown type assertion and preserves existing logging
behavior.

Comment on lines +54 to +141
```go
// Watch patterns
DaemonSet: openshift-insights/insights-runtime-extractor
```

### Detection Criteria

The informer uses **generation-based filtering** to minimize unnecessary reconciliations:

**DaemonSet**:
- Checks if `Generation` changed (only increments when spec changes, not status)
- Filters out ~90% of update events (status updates, controller metadata changes)
- Deletion events always trigger reconciliation

**Why generation-based?**
- **Simple**: Single field check instead of deep comparison
- **Efficient**: Filters out 90% of noise (status updates, reconciliation loops)
- **Reliable**: Kubernetes guarantees generation increments on spec changes
- **Performant**: `resourceapply` handles detailed comparison and actual updates

### Reconciliation Flow

```
External Change → Informer Detects → Notification Sent → Controller Reconciles → State Restored
```

## Resource Protection

### Server-Side Apply
The controller uses `resourceapply` functions from `library-go`, which provide:
- **Three-way merge**: Preserves fields managed by other controllers
- **Generation tracking**: Detects meaningful changes
- **Conflict resolution**: Handles concurrent modifications gracefully

### Retry Logic
All resource apply operations use `retry.RetryOnConflict` with exponential backoff:
- **Automatic retry**: Conflicts are automatically retried (up to 5 attempts by default)
- **Exponential backoff**: Delays increase between retries (10ms, 20ms, 40ms, etc.)
- **Jitter**: Random delay variation to prevent thundering herd
- **Success guarantee**: Eventually consistent even under concurrent modifications

### Ownership
Resources are labeled with:
```yaml
labels:
app.kubernetes.io/managed-by: insights-operator
app.kubernetes.io/name: insights-runtime-extractor
```

## Event Flow

### Initial Deployment
```
Operator Starts
Read Configuration
DisableRuntimeExtractor == false?
↓ (yes)
Create/Update Resources
Monitor for Changes
```

### Configuration Change
```
Config Updated
ConfigNotifier Sends Event
Controller Handles Config Change
Enable: Apply Resources
Disable: Delete Resources
```

### Resource Drift
```
User/Process Modifies Resource
Informer Detects Change
Notification Sent to modifiedCh
Controller Reapplies Desired State
Resource Restored
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language tags to the fenced blocks.

The plain fences in this section are triggering markdownlint MD040. Please mark them as text, mermaid, or an equivalent language so the docs pass lint cleanly.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 76-76: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 106-106: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 119-119: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/README.md` around lines 54 - 141, Update the
README fenced code blocks that lack language tags by adding appropriate
languages: change the ASCII flow diagrams under the "Reconciliation Flow",
"Initial Deployment", "Configuration Change", and "Resource Drift" headings to
use a language tag such as text or mermaid (e.g., replace ``` with ```text or
```mermaid), keep the "Watch patterns" Go block as ```go and ensure the resource
labels block remains ```yaml; this will satisfy markdownlint MD040 without
altering the content.

Comment on lines +104 to +107
// Verify images were updated from environment variables
assert.NotEqual(t, "old-image", ds.Spec.Template.Spec.Containers[0].Image)
assert.NotEqual(t, "old-image", ds.Spec.Template.Spec.Containers[1].Image)
assert.NotEqual(t, "old-image", ds.Spec.Template.Spec.Containers[2].Image)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assert the expected images, not just a change.

These checks would still pass if the wrong image were loaded. Compare against the exact env-derived values so the test catches mapping regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/resources/daemonset_test.go` around lines 104
- 107, The test currently only asserts images changed; replace the NotEqual
checks with exact equality checks against the expected env-derived image strings
so the test fails on incorrect mappings: in
pkg/controller/runtimeextractor/resources/daemonset_test.go update the
assertions for ds.Spec.Template.Spec.Containers[0].Image, [1].Image and
[2].Image to assert Equal against the precise expected image values (either
compute them from the same environment variables used by the code under test or
hard-code the exact expected strings used in the test setup) instead of using
assert.NotEqual.

Comment on lines +151 to +162
// daemonSetExists checks if the runtime extractor DaemonSet exists
func (rm *ResourceManager) daemonSetExists(ctx context.Context) bool {
_, err := rm.getDaemonSet(ctx)
if err != nil {
if apierrors.IsNotFound(err) {
return false
}
klog.Errorf("Failed to get runtime extractor DaemonSet %s/%s: %v", daemonSetNamespace, daemonSetName, err)
return false
}
return true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't treat API failures as resource absence.

daemonSetExists() returns false for every get error, not just NotFound. That makes the controller skip reconciliation during temporary API/auth/transport failures, which can leave the runtime-extractor state stale.

Please return the error separately and only map NotFound to false.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/runtimeextractor/resources/daemonset.go` around lines 151 -
162, daemonSetExists currently swallows all get errors and returns false, which
treats API/auth/transport failures as "not found"; change the signature to
return (bool, error) (e.g., daemonSetExists(ctx context.Context) (bool, error)),
call rm.getDaemonSet(ctx) and if apierrors.IsNotFound(err) return (false, nil),
if err != nil return (false, err), otherwise return (true, nil); update all
callers of daemonSetExists to handle the error return and only treat NotFound as
absence while surfacing other errors so reconciliation isn't skipped on
temporary API failures.

Comment on lines +205 to +208
// Update runtime-extractor images on any version change
if versionChanged {
c.updateCh <- struct{}{}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Avoid blocking send on updateCh in status reconciliation path.

The direct send can deadlock merge() if updateCh is nil or already full. Use a non-blocking send (and nil-safe guard) to keep status updates resilient.

🔒 Non-blocking, nil-safe signal send
-		if versionChanged {
-			c.updateCh <- struct{}{}
-		}
+		if versionChanged && c.updateCh != nil {
+			select {
+			case c.updateCh <- struct{}{}:
+			default:
+				// signal already pending; avoid blocking status reconciliation
+			}
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Update runtime-extractor images on any version change
if versionChanged {
c.updateCh <- struct{}{}
}
// Update runtime-extractor images on any version change
if versionChanged && c.updateCh != nil {
select {
case c.updateCh <- struct{}{}:
default:
// signal already pending; avoid blocking status reconciliation
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/status/controller.go` around lines 205 - 208, The direct
blocking send to c.updateCh in the status reconciliation path can deadlock
merge() if updateCh is nil or full; change it to a nil-safe, non-blocking
signal: check c.updateCh != nil and then use a non-blocking send (select with a
case c.updateCh <- struct{}{} and a default) so you only attempt to signal when
the channel exists and won't block if it's full.

@opokornyy
Copy link
Copy Markdown
Contributor Author

/retest

opokornyy added 3 commits May 18, 2026 10:17
Move runtime-extractor manifests to dedicated
directory and update clusterrole permissions
for controller resource management.

Signed-off-by: Ondrej Pokorny <opokorny@redhat.com>
Implement controller that manages runtime-extractor resources:
  - ConfigMap for extractor configuration
  - DaemonSet for node-level data collection
  - Service for pod communication
  - Resource lifecycle management

Signed-off-by: Ondrej Pokorny <opokorny@redhat.com>
Add comprehensive tests for runtime-extractor resource management:
  - ConfigMap creation and updates
  - DaemonSet lifecycle and configuration
  - Service resource handling
  - Common test utilities

Signed-off-by: Ondrej Pokorny <opokorny@redhat.com>
@opokornyy opokornyy force-pushed the CCXDEV-15790-extractor-controller branch from b2636bb to 25ff9a1 Compare May 18, 2026 12:24
@opokornyy
Copy link
Copy Markdown
Contributor Author

/test tls-scanner

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/runtimeextractor/informer_test.go (1)

17-17: ⚡ Quick win

Consider upgrading to fake.NewClientset() now that apply configurations are available.

fake.NewSimpleClientset is deprecated. The codebase has apply configurations available in the vendored k8s.io/client-go library, making an upgrade to fake.NewClientset feasible with minimal effort.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/informer_test.go` at line 17, The test uses
the deprecated fake.NewSimpleClientset; replace its usage by calling
fake.NewClientset() when constructing fakeClient (update the instantiation where
fakeClient := fake.NewSimpleClientset() appears) to leverage apply-configuration
support; ensure any test setup that relied on SimpleClientset behavior still
works with the new fake.NewClientset and adjust imports or apply-config usage if
necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/runtimeextractor/resources/retry_test.go`:
- Around line 19-84: Convert Test_applyDaemonSet_RetryOnConflict into a
table-driven test: create a test cases slice (even if it contains one case) with
fields like name, initialDaemonSet, reactorBehavior (to simulate conflict then
success), and wantErr; iterate over cases using t.Run(case.name, func(t
*testing.T) { ... }); inside each subtest set up the fakeClient, recorder,
updateAttempts counter and PrependReactor as before, construct the
ResourceManager via NewResourceManager and call applyDaemonSet, then assert
results against case.wantErr and verify updateAttempts; keep the existing logic
but move per-case setup into the loop and reference
Test_applyDaemonSet_RetryOnConflict, NewResourceManager and applyDaemonSet to
locate where to change.

---

Nitpick comments:
In `@pkg/controller/runtimeextractor/informer_test.go`:
- Line 17: The test uses the deprecated fake.NewSimpleClientset; replace its
usage by calling fake.NewClientset() when constructing fakeClient (update the
instantiation where fakeClient := fake.NewSimpleClientset() appears) to leverage
apply-configuration support; ensure any test setup that relied on
SimpleClientset behavior still works with the new fake.NewClientset and adjust
imports or apply-config usage if necessary.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 636e3849-9a58-4315-a611-71b7b299921b

📥 Commits

Reviewing files that changed from the base of the PR and between b2636bb and 25ff9a1.

📒 Files selected for processing (15)
  • manifests/03-clusterrole.yaml
  • manifests/06-deployment.yaml
  • pkg/controller/operator.go
  • pkg/controller/runtimeextractor/README.md
  • pkg/controller/runtimeextractor/controller.go
  • pkg/controller/runtimeextractor/doc.go
  • pkg/controller/runtimeextractor/informer.go
  • pkg/controller/runtimeextractor/informer_test.go
  • pkg/controller/runtimeextractor/resources/daemonset.go
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • pkg/controller/runtimeextractor/resources/manager.go
  • pkg/controller/runtimeextractor/resources/manifests/runtime-extractor-daemonset.yaml
  • pkg/controller/runtimeextractor/resources/retry_test.go
  • pkg/controller/status/controller.go
  • pkg/controller/status/controller_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/controller/runtimeextractor/doc.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/controller/runtimeextractor/resources/manager.go
  • pkg/controller/status/controller_test.go
  • pkg/controller/status/controller.go
  • manifests/06-deployment.yaml
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • manifests/03-clusterrole.yaml
  • pkg/controller/runtimeextractor/controller.go
  • pkg/controller/operator.go
  • pkg/controller/runtimeextractor/resources/daemonset.go

Comment on lines +19 to +84
func Test_applyDaemonSet_RetryOnConflict(t *testing.T) {
ctx := context.Background()

// Create initial DaemonSet
initialDS := &appsv1.DaemonSet{
ObjectMeta: metav1.ObjectMeta{
Name: "insights-runtime-extractor",
Namespace: "openshift-insights",
ResourceVersion: "1",
},
Spec: appsv1.DaemonSetSpec{
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{"app": "test"},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{"app": "test"},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{Name: "extractor", Image: "old-image:v1"},
{Name: "exporter", Image: "old-image:v1"},
{Name: "kube-rbac-proxy", Image: "old-image:v1"},
},
},
},
},
}

fakeClient := fake.NewSimpleClientset(initialDS)
recorder := events.NewInMemoryRecorder("test", clock.RealClock{})

// Track update attempts
updateAttempts := 0

// Add reactor to simulate conflict on first attempt, then succeed
fakeClient.PrependReactor("update", "daemonsets", func(action k8stesting.Action) (handled bool, ret runtime.Object, err error) {
updateAttempts++
if updateAttempts == 1 {
// First attempt: return conflict error
return true, nil, apierrors.NewConflict(
appsv1.Resource("daemonsets"),
"insights-runtime-extractor",
nil,
)
}
// Second attempt: succeed with default behavior
return false, nil, nil
})

rm := NewResourceManager(
fakeClient.AppsV1(),
recorder,
)

// This should succeed after retry
_, err := rm.applyDaemonSet(ctx)
if err != nil {
t.Fatalf("Expected applyDaemonSet to succeed after retry, got error: %v", err)
}

// Verify it retried (should have 2 attempts: 1 conflict + 1 success)
if updateAttempts < 2 {
t.Errorf("Expected at least 2 update attempts (conflict + retry), got %d", updateAttempts)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Convert this to a table-driven test to match repo test standards.

This test is currently single-case; please wrap it in a table-driven structure (even with one case) for consistency and easier expansion.

Proposed refactor sketch
-func Test_applyDaemonSet_RetryOnConflict(t *testing.T) {
-    ctx := context.Background()
-    ...
-}
+func Test_applyDaemonSet_RetryOnConflict(t *testing.T) {
+    tests := []struct {
+        name string
+    }{
+        {name: "retries once on conflict and then succeeds"},
+    }
+
+    for _, tt := range tests {
+        t.Run(tt.name, func(t *testing.T) {
+            ctx := context.Background()
+            // existing setup/assertions
+        })
+    }
+}

As per coding guidelines, Use table-driven tests in test implementations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/resources/retry_test.go` around lines 19 -
84, Convert Test_applyDaemonSet_RetryOnConflict into a table-driven test: create
a test cases slice (even if it contains one case) with fields like name,
initialDaemonSet, reactorBehavior (to simulate conflict then success), and
wantErr; iterate over cases using t.Run(case.name, func(t *testing.T) { ... });
inside each subtest set up the fakeClient, recorder, updateAttempts counter and
PrependReactor as before, construct the ResourceManager via NewResourceManager
and call applyDaemonSet, then assert results against case.wantErr and verify
updateAttempts; keep the existing logic but move per-case setup into the loop
and reference Test_applyDaemonSet_RetryOnConflict, NewResourceManager and
applyDaemonSet to locate where to change.

@opokornyy opokornyy force-pushed the CCXDEV-15790-extractor-controller branch from 25ff9a1 to 8518be5 Compare May 18, 2026 12:34
@opokornyy
Copy link
Copy Markdown
Contributor Author

/test tls-scanner

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/controller/runtimeextractor/README.md (1)

76-76: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to fenced code blocks.

The four plain-text flow diagrams (lines 76, 106, 119, 131) are still missing language identifiers, which causes markdownlint MD040 warnings. Add text tags to each fence.

📝 Proposed fix
-```
+```text
 External Change → Informer Detects → Notification Sent → Controller Reconciles → State Restored

Apply the same pattern to the fences at lines 106, 119, and 131.

</details>


Also applies to: 106-106, 119-119, 131-131

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @pkg/controller/runtimeextractor/README.md at line 76, Update the fenced code
blocks that contain the plain-text flow diagrams so they include the language
tag "text" instead of a bare fence; for example change totext for the
fence wrapping the flow "External Change → Informer Detects → Notification Sent
→ Controller Reconciles → State Restored" and apply the same change to the other
three flow-diagram fences in the README (the ones around the other two/three
plain-text diagrams). This fixes the MD040 markdownlint warnings by explicitly
marking those blocks as text.


</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @pkg/controller/runtimeextractor/README.md:

  • Line 76: Update the fenced code blocks that contain the plain-text flow
    diagrams so they include the language tag "text" instead of a bare fence; for
    example change totext for the fence wrapping the flow "External Change →
    Informer Detects → Notification Sent → Controller Reconciles → State Restored"
    and apply the same change to the other three flow-diagram fences in the README
    (the ones around the other two/three plain-text diagrams). This fixes the MD040
    markdownlint warnings by explicitly marking those blocks as text.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Repository YAML (base), Central YAML (inherited)

**Review profile**: CHILL

**Plan**: Enterprise

**Run ID**: `b3544c28-21f2-41f9-80f2-7f5edc2bf16b`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 25ff9a17854af50c1dc15cb80a125fa7d5059607 and 8518be5d65cd51f39c83f8734d2e81d1af205579.

</details>

<details>
<summary>📒 Files selected for processing (10)</summary>

* `pkg/controller/operator.go`
* `pkg/controller/runtimeextractor/README.md`
* `pkg/controller/runtimeextractor/controller.go`
* `pkg/controller/runtimeextractor/informer.go`
* `pkg/controller/runtimeextractor/informer_test.go`
* `pkg/controller/runtimeextractor/resources/daemonset.go`
* `pkg/controller/runtimeextractor/resources/daemonset_test.go`
* `pkg/controller/runtimeextractor/resources/manager.go`
* `pkg/controller/runtimeextractor/resources/retry_test.go`
* `pkg/controller/status/controller.go`

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (9)</summary>

* pkg/controller/runtimeextractor/resources/retry_test.go
* pkg/controller/runtimeextractor/resources/manager.go
* pkg/controller/runtimeextractor/informer_test.go
* pkg/controller/runtimeextractor/resources/daemonset_test.go
* pkg/controller/runtimeextractor/informer.go
* pkg/controller/runtimeextractor/controller.go
* pkg/controller/operator.go
* pkg/controller/runtimeextractor/resources/daemonset.go
* pkg/controller/status/controller.go

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@opokornyy
Copy link
Copy Markdown
Contributor Author

/retest

@opokornyy
Copy link
Copy Markdown
Contributor Author

/test tls-scanner

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 20, 2026

@opokornyy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/tls-scanner 8518be5 link false /test tls-scanner

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

opokornyy added 2 commits May 20, 2026 13:45
This commit is implementing the Kubernetes
informer-based drift detection and reconciliation
for runtime-extractor resources (DaemonSet,
Service, ConfigMap)

Signed-off-by: Ondrej Pokorny <opokorny@redhat.com>
Signed-off-by: Ondrej Pokorny <opokorny@redhat.com>
@opokornyy opokornyy force-pushed the CCXDEV-15790-extractor-controller branch from 8518be5 to 847d7e6 Compare May 20, 2026 11:45
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (5)
pkg/controller/status/controller.go (1)

205-208: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use a non-blocking, nil-safe send for updateCh in reconciliation.

Line 207 performs a blocking send on the status merge path; if the channel is full (or unset in a caller), reconciliation can stall.

🔒 Suggested fix
-		if versionChanged {
-			c.updateCh <- struct{}{}
-		}
+		if versionChanged && c.updateCh != nil {
+			select {
+			case c.updateCh <- struct{}{}:
+			default:
+				// signal already pending; avoid blocking reconciliation
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/status/controller.go` around lines 205 - 208, The current
blocking send to c.updateCh when versionChanged can stall reconciliation; change
it to a nil-safe, non-blocking send by checking c.updateCh != nil and using a
select with a default case (e.g., select { case c.updateCh <- struct{}{}:
default: }) so the reconcile path never blocks if the channel is full or unset;
update the code around the versionChanged branch in the status merge/reconcile
function to perform this check and non-blocking send.
pkg/controller/runtimeextractor/informer.go (1)

96-101: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle tombstones in delete events.

Delete handlers may receive cache.DeletedFinalStateUnknown; without unwrapping it first, metadata extraction can fail and external deletions may not trigger reconciliation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/informer.go` around lines 96 - 101, The
delete handler in resourceInformer.handleResourceDeletion must handle
cache.DeletedFinalStateUnknown tombstones before calling meta.Accessor: detect
if obj is a cache.DeletedFinalStateUnknown, extract the actual object via
tombstone.Obj (or log+return if unwrap fails), then proceed to call
meta.Accessor; update the function to unwrap tombstones and handle unwrap
failures so external deletions correctly trigger reconciliation.
pkg/controller/runtimeextractor/resources/daemonset_test.go (1)

104-107: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert exact image values, not just “changed from old-image.”

These checks can still pass if images are mapped incorrectly. Compare each container image against the expected env-derived value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/resources/daemonset_test.go` around lines 104
- 107, The test currently uses assert.NotEqual on
ds.Spec.Template.Spec.Containers[0..2].Image which only verifies they changed
from "old-image"; replace these with exact comparisons to the expected
environment-derived image strings (the values your test derives from env vars or
helper like expectedImageX) by asserting equality against
ds.Spec.Template.Spec.Containers[0].Image, [1].Image and [2].Image respectively;
locate the expected values where the test sets/reads the envs and use those
variables (or a helper function used in the test) to assert assert.Equal(t,
expectedImage0, ds.Spec.Template.Spec.Containers[0].Image) etc.
pkg/controller/runtimeextractor/controller.go (1)

151-183: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate reconcile errors instead of only logging them.

createDeployment, deleteDeployment, and updateDeployment log failures but always continue. That can leave runtime-extractor resources out of sync indefinitely if no new event arrives.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/controller.go` around lines 151 - 183, The
createDeployment, deleteDeployment, and updateDeployment functions currently
only log failures and swallow errors; change each function (createDeployment,
deleteDeployment, updateDeployment) to return error instead of void, propagate
and return the error from resourceManager.ApplyRuntimeExtractorResources /
DeleteRuntimeExtractorResources (and any isCreated checks as needed) rather than
only klog.Errorf, and update the caller(s) (the reconcile loop that invokes
these methods) to handle returned errors (retry/requeue or surface them to
controller-runtime) so reconcile failures are not silently ignored.
pkg/controller/runtimeextractor/resources/daemonset.go (1)

151-160: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not treat all Get failures as “does not exist.”

daemonSetExists returns false for non-NotFound errors too, which can hide API/auth/transport failures and skip needed reconciliation paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/resources/daemonset.go` around lines 151 -
160, daemonSetExists currently swallows all getDaemonSet errors and returns
false for non-NotFound failures; change its behavior to surface
API/auth/transport errors by converting its signature from daemonSetExists(ctx
context.Context) bool to daemonSetExists(ctx context.Context) (bool, error) and
implement: call getDaemonSet(ctx), if apierrors.IsNotFound(err) return false,
nil; if err != nil return false, err; otherwise return true, nil. Update callers
of daemonSetExists to handle the (bool, error) return and avoid treating
non-NotFound errors as “does not exist.” Ensure references to
ResourceManager.daemonSetExists and getDaemonSet are updated accordingly.
🧹 Nitpick comments (1)
pkg/controller/runtimeextractor/README.md (1)

228-228: 💤 Low value

Clarify the distinction between existing and planned exponential backoff.

Lines 89-93 describe exponential backoff as already implemented for retry-on-conflict. This future enhancement suggests implementing exponential backoff for "reconciliation failures." If this refers to a different backoff mechanism (e.g., between full reconciliation attempts rather than within retry-on-conflict), please clarify the distinction to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/README.md` at line 228, The checklist item is
ambiguous about whether it duplicates the existing retry-on-conflict backoff;
update the README entry to clearly distinguish the two: state that
retry.OnConflict (used inside the Reconciler/reconcile loop for update
conflicts) already has exponential backoff, and change the TODO to explicitly
target implementing exponential backoff between full reconciliation attempts
(e.g., backoff on Reconciler errors/failures that abort a reconcile cycle),
mentioning the terms "retry.OnConflict" and "Reconciler/reconcile loop" so
readers can locate the existing implementation and understand this is a separate
mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/runtimeextractor/informer_test.go`:
- Around line 68-69: The three informerFactory.WaitForCacheSync calls in
informer_test.go currently ignore their boolean return; update each call (the
ones that invoke informerFactory.WaitForCacheSync(ctx.Done())) to capture the
returned map/boolean and assert it succeeded—e.g., if sync returns false fail
the test (t.Fatalf or require.True) with a clear message so tests fail fast when
caches didn't initialize before further actions like updates/deletes; locate the
calls by the WaitForCacheSync invocations in the test and add the boolean check
and test failure handling.

In `@pkg/controller/runtimeextractor/resources/daemonset.go`:
- Around line 25-33: The current defaults extractorDefaultImage,
exporterDefaultImage and proxyDefaultImage use :latest which is unsafe; remove
the :latest fallbacks and change the image resolution logic (the code that reads
exporterImageEnv/proxyImageEnv and the extractor env lookup around where
envImageErrMsg is used and also in the block referenced at lines ~109-125) to
fail fast: when os.LookupEnv for RELATED_IMAGE_* returns missing or empty,
return/propagate an error or log a clear fatal and do not substitute a hardcoded
:latest; update envImageErrMsg to indicate a missing RELATED_IMAGE and include
the env var name so callers receive a deterministically failing error instead of
silently using :latest.

---

Duplicate comments:
In `@pkg/controller/runtimeextractor/controller.go`:
- Around line 151-183: The createDeployment, deleteDeployment, and
updateDeployment functions currently only log failures and swallow errors;
change each function (createDeployment, deleteDeployment, updateDeployment) to
return error instead of void, propagate and return the error from
resourceManager.ApplyRuntimeExtractorResources / DeleteRuntimeExtractorResources
(and any isCreated checks as needed) rather than only klog.Errorf, and update
the caller(s) (the reconcile loop that invokes these methods) to handle returned
errors (retry/requeue or surface them to controller-runtime) so reconcile
failures are not silently ignored.

In `@pkg/controller/runtimeextractor/informer.go`:
- Around line 96-101: The delete handler in
resourceInformer.handleResourceDeletion must handle
cache.DeletedFinalStateUnknown tombstones before calling meta.Accessor: detect
if obj is a cache.DeletedFinalStateUnknown, extract the actual object via
tombstone.Obj (or log+return if unwrap fails), then proceed to call
meta.Accessor; update the function to unwrap tombstones and handle unwrap
failures so external deletions correctly trigger reconciliation.

In `@pkg/controller/runtimeextractor/resources/daemonset_test.go`:
- Around line 104-107: The test currently uses assert.NotEqual on
ds.Spec.Template.Spec.Containers[0..2].Image which only verifies they changed
from "old-image"; replace these with exact comparisons to the expected
environment-derived image strings (the values your test derives from env vars or
helper like expectedImageX) by asserting equality against
ds.Spec.Template.Spec.Containers[0].Image, [1].Image and [2].Image respectively;
locate the expected values where the test sets/reads the envs and use those
variables (or a helper function used in the test) to assert assert.Equal(t,
expectedImage0, ds.Spec.Template.Spec.Containers[0].Image) etc.

In `@pkg/controller/runtimeextractor/resources/daemonset.go`:
- Around line 151-160: daemonSetExists currently swallows all getDaemonSet
errors and returns false for non-NotFound failures; change its behavior to
surface API/auth/transport errors by converting its signature from
daemonSetExists(ctx context.Context) bool to daemonSetExists(ctx
context.Context) (bool, error) and implement: call getDaemonSet(ctx), if
apierrors.IsNotFound(err) return false, nil; if err != nil return false, err;
otherwise return true, nil. Update callers of daemonSetExists to handle the
(bool, error) return and avoid treating non-NotFound errors as “does not exist.”
Ensure references to ResourceManager.daemonSetExists and getDaemonSet are
updated accordingly.

In `@pkg/controller/status/controller.go`:
- Around line 205-208: The current blocking send to c.updateCh when
versionChanged can stall reconciliation; change it to a nil-safe, non-blocking
send by checking c.updateCh != nil and using a select with a default case (e.g.,
select { case c.updateCh <- struct{}{}: default: }) so the reconcile path never
blocks if the channel is full or unset; update the code around the
versionChanged branch in the status merge/reconcile function to perform this
check and non-blocking send.

---

Nitpick comments:
In `@pkg/controller/runtimeextractor/README.md`:
- Line 228: The checklist item is ambiguous about whether it duplicates the
existing retry-on-conflict backoff; update the README entry to clearly
distinguish the two: state that retry.OnConflict (used inside the
Reconciler/reconcile loop for update conflicts) already has exponential backoff,
and change the TODO to explicitly target implementing exponential backoff
between full reconciliation attempts (e.g., backoff on Reconciler
errors/failures that abort a reconcile cycle), mentioning the terms
"retry.OnConflict" and "Reconciler/reconcile loop" so readers can locate the
existing implementation and understand this is a separate mechanism.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4c938da1-bd84-4417-8026-99d91354aa1a

📥 Commits

Reviewing files that changed from the base of the PR and between 8518be5 and 847d7e6.

📒 Files selected for processing (10)
  • pkg/controller/operator.go
  • pkg/controller/runtimeextractor/README.md
  • pkg/controller/runtimeextractor/controller.go
  • pkg/controller/runtimeextractor/informer.go
  • pkg/controller/runtimeextractor/informer_test.go
  • pkg/controller/runtimeextractor/resources/daemonset.go
  • pkg/controller/runtimeextractor/resources/daemonset_test.go
  • pkg/controller/runtimeextractor/resources/manager.go
  • pkg/controller/runtimeextractor/resources/retry_test.go
  • pkg/controller/status/controller.go

Comment on lines +68 to +69
informerFactory.WaitForCacheSync(ctx.Done())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
fd -t f "informer_test.go" | grep "runtimeextractor"

Repository: openshift/insights-operator

Length of output: 120


🏁 Script executed:

# Read the file to check the specific lines
cat -n pkg/controller/runtimeextractor/informer_test.go

Repository: openshift/insights-operator

Length of output: 6462


Check WaitForCacheSync return value to validate cache initialization.

These three WaitForCacheSync calls ignore the boolean return value. If cache sync fails (returns false), the informer won't have synced state before the test proceeds with updates/deletes, causing confusing test timeouts instead of explicit failures. Check the return value and fail the test if sync incomplete.

Affects: lines 68, 115, 159

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/informer_test.go` around lines 68 - 69, The
three informerFactory.WaitForCacheSync calls in informer_test.go currently
ignore their boolean return; update each call (the ones that invoke
informerFactory.WaitForCacheSync(ctx.Done())) to capture the returned
map/boolean and assert it succeeded—e.g., if sync returns false fail the test
(t.Fatalf or require.True) with a clear message so tests fail fast when caches
didn't initialize before further actions like updates/deletes; locate the calls
by the WaitForCacheSync invocations in the test and add the boolean check and
test failure handling.

Comment on lines +25 to +33
extractorDefaultImage = "quay.io/openshift/origin-insights-runtime-extractor:latest"

exporterImageEnv = "RELATED_IMAGE_INSIGHTS_RUNTIME_EXPORTER"
exporterDefaultImage = "quay.io/openshift/origin-insights-runtime-exporter:latest"

proxyImageEnv = "RELATED_IMAGE_KUBE_RBAC_PROXY"
proxyDefaultImage = "quay.io/openshift/origin-kube-rbac-proxy:latest"

envImageErrMsg = "Failed to get image from environment variable %s, using default image %s"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid :latest fallback for missing RELATED_IMAGE env vars.

Falling back to hardcoded :latest images makes deployments non-deterministic and risky in disconnected/pinned-image environments. Prefer failing fast and surfacing a clear error.

Also applies to: 109-125

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/runtimeextractor/resources/daemonset.go` around lines 25 - 33,
The current defaults extractorDefaultImage, exporterDefaultImage and
proxyDefaultImage use :latest which is unsafe; remove the :latest fallbacks and
change the image resolution logic (the code that reads
exporterImageEnv/proxyImageEnv and the extractor env lookup around where
envImageErrMsg is used and also in the block referenced at lines ~109-125) to
fail fast: when os.LookupEnv for RELATED_IMAGE_* returns missing or empty,
return/propagate an error or log a clear fatal and do not substitute a hardcoded
:latest; update envImageErrMsg to indicate a missing RELATED_IMAGE and include
the env var name so callers receive a deterministically failing error instead of
silently using :latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants