Skip to content

LOG-9356: Implement Collector ServiceMonitor Changes#3270

Open
Clee2691 wants to merge 1 commit into
openshift:masterfrom
Clee2691:LOG-9356
Open

LOG-9356: Implement Collector ServiceMonitor Changes#3270
Clee2691 wants to merge 1 commit into
openshift:masterfrom
Clee2691:LOG-9356

Conversation

@Clee2691
Copy link
Copy Markdown
Contributor

@Clee2691 Clee2691 commented May 5, 2026

Description

This PR makes use of metrics collection profiles.

The CMO supports metrics collection profiles that select ServiceMonitors by the monitoring.openshift.io/collection-profile label. This change adds that label to all ServiceMonitors created by CLO, enabling cluster admins to reduce metrics cardinality by switching to the minimal profile.

Because the minimal ServiceMonitor has a different resource name (e.g., minimal-collector) but targets the same backing service, a serviceName parameter was added to ensure TLS ServerName verification uses the correct service DNS name.

Minimal Metrics - Collector:

  • logcollector_component_event_unmatched_count
  • vector_buffer_byte_size
  • vector_buffer_discarded_events_total
  • vector_buffer_events
  • vector_buffer_sent_events_total
  • vector_component_discarded_events_total
  • vector_component_errors_total
  • vector_component_received_bytes_total
  • vector_component_received_event_bytes_total
  • vector_component_received_events_total
  • vector_component_sent_bytes_total
  • vector_events_in_total
  • vector_http_client_errors_total
  • vector_http_client_requests_sent_total
  • vector_http_client_responses_total
  • vector_open_files

Minimal Metrics - LFME:

  • log_logged_bytes_total

/cc @vparfonov
/assign @jcantrill

Links

Summary by CodeRabbit

  • New Features

    • Metrics scraping now creates separate "full" and "minimal" ServiceMonitors, tags them with a collection-profile label, and applies configurable relabeling and per-service TLS server names.
    • Added a Prometheus/Thanos test helper for runtime metric queries.
  • Tests

    • Unit and e2e tests expanded to validate profile labeling, relabel rule ordering/content, TLS server name usage, and that full/minimal profiles expose expected metrics.

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 5, 2026

@Clee2691: This pull request references LOG-9356 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.8.0" version, but no target version was set.

Details

In response to this:

Description

  • Add MetricRelabelConfigs to the collector ServiceMonitor to keep only the metrics needed for alerts, dashboards, and telemetry
  • Drop transform component metrics except errors and discarded events to further reduce cardinality

Allowed Metrics:

  • logcollector_component_event_unmatched_count
  • vector_buffer_byte_size
  • vector_buffer_discarded_events_total
  • vector_buffer_events
  • vector_buffer_sent_events_total
  • vector_component_discarded_events_total
  • vector_component_errors_total
  • vector_component_received_bytes_total
  • vector_component_received_event_bytes_total
  • vector_component_received_events_total
  • vector_component_sent_bytes_total
  • vector_events_in_total
  • vector_http_client_errors_total
  • vector_http_client_requests_sent_total
  • vector_http_client_responses_total
  • vector_open_files

/cc @vparfonov
/assign @jcantrill

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci openshift-ci Bot requested a review from vparfonov May 5, 2026 21:16
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add metric filtering to collector ServiceMonitor

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add metric filtering to ServiceMonitor with keep/drop relabel configs
• Keep only essential metrics for alerts, dashboards, telemetry
• Drop transform component metrics except errors and discarded events
• Refactor relabel config structure for improved readability
Diagram
flowchart LR
  SM["ServiceMonitor<br/>MetricRelabelConfigs"]
  R1["Relabel 1:<br/>Replace - with _"]
  R2["Relabel 2:<br/>Keep allowed metrics"]
  R3["Relabel 3:<br/>Drop transform metrics"]
  SM --> R1
  SM --> R2
  SM --> R3
  R2 --> Filter["Filtered metrics<br/>for monitoring"]
  R3 --> Filter
Loading

Grey Divider

File Changes

1. internal/metrics/service_monitor.go ✨ Enhancement +14/-9

Add metric filtering relabel configurations

• Removed comment explaining label replacement logic
• Reformatted first relabel config for consistency
• Added second relabel config with keep action to filter metrics by name
• Added third relabel config with drop action to exclude transform component metrics

internal/metrics/service_monitor.go


2. internal/metrics/service_monitor_test.go 🧪 Tests +14/-0

Add MetricRelabelConfigs verification tests

• Added test section verifying MetricRelabelConfigs structure
• Validates three relabel configs are present
• Verifies first config replaces hyphens with underscores
• Verifies second config keeps only allowed metrics
• Verifies third config drops transform component metrics

internal/metrics/service_monitor_test.go


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 5, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used

Grey Divider


Action required

1. LFME metrics filtered out 🐞 Bug ≡ Correctness
Description
newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including
LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus.
This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Code

internal/metrics/service_monitor.go[R40-44]

+				{
+					Action:       "keep",
+					SourceLabels: []monitoringv1.LabelName{"__name__"},
+					Regex:        "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files",
+				},
Evidence
The new metric relabel "keep" allowlist in newServiceMonitor() only allows vector_* and one
logcollector_* metric name, so LogFileMetricsExporter metrics (e.g., log_logged_bytes_total) will be
filtered out. The LogFileMetricsExporter controller reuses metrics.ReconcileServiceMonitor(), so it
will receive the same filtering; meanwhile, repository-shipped PrometheusRule recording and
dashboard queries depend on log_logged_bytes_total being present in Prometheus.

internal/metrics/service_monitor.go[33-50]
internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
config/prometheus/collector_alerts.yaml[97-107]
internal/metrics/dashboard/openshift-logging-dashboard.json[585-593]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.

## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.

## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]

### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
 - Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
 - Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Reconciler now creates two ServiceMonitors (full and minimal) per component, parameterizing TLS serverName, metric relabel configs, and a metrics-collection-profile label. New relabel builder exposes precomputed config sets; unit and e2e tests updated to validate relabel rules and Prometheus scraping.

Changes

Metrics relabel & ServiceMonitor flow

Layer / File(s) Summary
Constants / Labels
internal/constants/..., internal/constants/labels.go
Added LabelMetricsCollectionProfile = "monitoring.openshift.io/collection-profile" and reorganized metrics-related constants.
Relabel Data & Builders
internal/metrics/relabel.go
Introduce metricAllowlistConfig/metricDropConfig, buildRelabelConfigs, and exported prebuilt sets: CollectorMinimalRelabelConfigs, LFMEMinimalRelabelConfigs, FullRelabelConfigs.
Relabel Tests
internal/metrics/relabel_test.go
New tests covering ordering and exact contents of generated relabel rules and verifying prebuilt sets.
ServiceMonitor API
internal/metrics/service_monitor.go
newServiceMonitor and ReconcileServiceMonitor signatures extended to accept serviceName, metricRelabelConfigs, and profile; endpoint TLSConfig.ServerName set from serviceName; MetricRelabelConfigs and profile label applied.
ServiceMonitor Tests
internal/metrics/service_monitor_test.go
Added full/minimal-profile tests asserting profile label, TLS serverName (service-name-based), and exact MetricRelabelConfigs ordering/fields; re-reconcile test updated.
Controller Wiring
internal/controller/observability/collector.go
Reconcile two ServiceMonitors for collector: full (uses FullRelabelConfigs, full profile) and minimal (distinct name, CollectorMinimalRelabelConfigs, minimal profile).
Exporter Wiring
internal/metrics/logfilemetricexporter/metric_exporter.go
Reconcile two ServiceMonitors for LFME: full (FullRelabelConfigs, full profile) and minimal (named with minimal profile prefix, LFMEMinimalRelabelConfigs, minimal profile).
Controller & Exporter Tests
internal/controller/observability/collector_test.go, internal/metrics/logfilemetricexporter/metric_exporter_test.go
Tests updated to expect both full and minimal ServiceMonitors and validate endpoints and relabel config counts/actions.
E2E Prometheus Helpers
test/helpers/prometheus/prometheus.go
New helper to obtain Prometheus token/Thanos host via oc, perform HTTPS /api/v1/query with bearer token, return parsed JSON and HasResults.
E2E Tests Updated/Added
test/e2e/...
Added metrics suite and profile e2e validating monitoring.openshift.io/collection-profile labels and that allowlisted and non-allowlisted metrics are scraped; updated LFME and operator e2e tests to query Prometheus.
Test Utils
test/e2e/flowcontrol/utils.go
Metric polling simplified to use test/helpers/prometheus.Query and HasResults; removed prior direct token/HTTP helpers.

Sequence Diagram

sequenceDiagram
    participant Controller as Controller
    participant Relabel as Relabel Builder
    participant K8s as Kubernetes API
    participant PromOp as Prometheus Operator

    Controller->>Relabel: Request relabel configs (full/minimal)
    Relabel-->>Controller: Return relabel config sets
    Controller->>K8s: Create/Update ServiceMonitor (endpoints, TLS ServerName, metricRelabelConfigs, profile label)
    K8s->>PromOp: Prometheus Operator observes ServiceMonitor
    PromOp->>K8s: Begin scraping targets per ServiceMonitor spec
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 7 | ❌ 5

❌ Failed checks (5 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test assertions lack meaningful failure messages. Most Expect() calls throughout test files are missing diagnostic messages per requirement 4. Add failure message strings to Expect() assertions: Expect(x).To(Equal(y), "descriptive message about what failed"). This helps diagnose test failures in CI/CD environments.
Microshift Test Compatibility ⚠️ Warning New e2e tests query Prometheus/Thanos (unavailable on MicroShift). No protection: missing [Skipped:MicroShift], [apigroup:...] tags, or IsMicroShiftCluster() checks. Add [Skipped:MicroShift] labels to test names or guard with IsMicroShiftCluster() checks in profile_test.go, lfme_test.go, e2e_test.go Prometheus query tests.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning New e2e test in profile_test.go lacks SNO protection: no [Skipped:SingleReplicaTopology] label or runtime topology checks. Add [Skipped:SingleReplicaTopology] label to test name on line 27, or implement runtime topology check using exutil.IsSingleNode() to skip on SNO.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning New e2e test uses hardcoded external registry image that fails in disconnected IPv6 environments. Replace registry.access.redhat.com/ubi9/ubi with cluster-internal registry or add [Skipped:Disconnected] tag to the test.
✅ Passed checks (7 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: implementing Collector ServiceMonitor changes related to metrics collection profiles, which is the core focus of this PR.
Description check ✅ Passed The description provides good context about metrics collection profiles, lists allowed minimal metrics, explains the serviceName parameter addition, and includes the JIRA link and reviewer assignments.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test titles are static and deterministic. No dynamic values like pod names, timestamps, UUIDs, node names, or namespace suffixes found in test names.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints. Changes are limited to ServiceMonitor (Prometheus CRD) configuration, metric relabel configs, and tests—not Pod/Deployment scheduling.
Ote Binary Stdout Contract ✅ Passed No stdout violations detected. TestSuite properly calls RunSpecs, Ginkgo specs registered via var _ pattern, logging uses stderr, test output via GinkgoWriter only.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread internal/metrics/service_monitor.go Outdated
Comment on lines +40 to +44
{
Action: "keep",
SourceLabels: []monitoringv1.LabelName{"__name__"},
Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files",
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Lfme metrics filtered out 🐞 Bug ≡ Correctness

newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including
LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus.
This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Agent Prompt
## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.

## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.

## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]

### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
  - Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
  - Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/metrics/service_monitor.go (1)

43-43: 💤 Low value

Consider extracting the keep-list regex to a named constant.

The allowlist is the core deliverable of this PR and a maintenance hotspot (new metrics are added, removed, or renamed). Inlining it as a 500+ character string makes diffs noisy and typos hard to spot.

♻️ Suggested refactor
+const collectorMetricsKeepRegex = "logcollector_component_event_unmatched_count|" +
+    "vector_buffer_byte_size|vector_buffer_discarded_events_total|" +
+    "vector_buffer_events|vector_buffer_sent_events_total|" +
+    "vector_component_discarded_events_total|vector_component_errors_total|" +
+    "vector_component_received_bytes_total|vector_component_received_event_bytes_total|" +
+    "vector_component_received_events_total|vector_component_sent_bytes_total|" +
+    "vector_events_in_total|vector_http_client_errors_total|" +
+    "vector_http_client_requests_sent_total|vector_http_client_responses_total|" +
+    "vector_open_files"

 // in newServiceMonitor:
-                Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|...",
+                Regex: collectorMetricsKeepRegex,

The constant can then be referenced in the test as well, removing duplication.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/service_monitor.go` at line 43, Extract the long inline
allowlist string used in the Regex field into a clearly named package-level
constant (e.g., allowedMetricsRegex or keepListRegex) in
internal/metrics/service_monitor.go, replace the inlined string in the struct
literal's Regex field with that constant, and update any tests to
import/reference the same constant instead of duplicating the string so the
allowlist is a single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 83-84: The test currently omits asserting the allowlist regex so
changes silently pass; update the test in service_monitor_test.go to assert
relabelConfigs[1].Regex equals the expected allowlist regex (or, better, compare
it to the constant defined in service_monitor.go if you extract the regex
there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.

---

Nitpick comments:
In `@internal/metrics/service_monitor.go`:
- Line 43: Extract the long inline allowlist string used in the Regex field into
a clearly named package-level constant (e.g., allowedMetricsRegex or
keepListRegex) in internal/metrics/service_monitor.go, replace the inlined
string in the struct literal's Regex field with that constant, and update any
tests to import/reference the same constant instead of duplicating the string so
the allowlist is a single source of truth.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0367dba-620a-4f1b-abb3-d1fbd24ba9de

📥 Commits

Reviewing files that changed from the base of the PR and between ff06c7a and 962597a.

📒 Files selected for processing (2)
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go

Comment thread internal/metrics/service_monitor_test.go
@Clee2691
Copy link
Copy Markdown
Contributor Author

Clee2691 commented May 6, 2026

/retest

@jcantrill
Copy link
Copy Markdown
Contributor

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2026
Comment thread internal/metrics/service_monitor.go Outdated
Comment thread internal/metrics/service_monitor.go Outdated
@simonpasquier
Copy link
Copy Markdown

simonpasquier commented May 6, 2026

Have you considered implementing collection profiles?

The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:

  • full (default) => collects all metrics available
  • minimal => collects only metrics relevant for alerting rules, dashboards and telemetry
  • telemetry => collects only metrics required for telemetry

We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum.

cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!)

@Clee2691 Clee2691 changed the title LOG-9356: Implement Collector ServiceMontitor Changes LOG-9356: Implement Collector ServiceMonitor Changes May 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/metrics/relabel.go (2)

13-17: 💤 Low value

excludeMetrics is semantically ambiguous — consider droppedMetrics.

The field name excludeMetrics can be read as "metrics excluded from the drop action" (i.e., kept), but the implementation puts them inside the drop rule regex, meaning they are the metrics that get dropped for the given label combination. droppedMetrics or metricsToDropForLabel would remove this ambiguity for future maintainers.

♻️ Proposed rename
 type metricDropConfig struct {
 	labelName      string
 	labelValue     string
-	excludeMetrics []string
+	droppedMetrics []string
 }
 var collectorMinimalDropConfigs = []metricDropConfig{
 	{
 		labelName:  "component_kind",
 		labelValue: "transform",
-		excludeMetrics: []string{
+		droppedMetrics: []string{

And in buildRelabelConfigs:

-		Regex: drop.labelValue + ";(" + strings.Join(drop.excludeMetrics, "|") + ")",
+		Regex: drop.labelValue + ";(" + strings.Join(drop.droppedMetrics, "|") + ")",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/relabel.go` around lines 13 - 17, Rename the ambiguous
struct field excludeMetrics in metricDropConfig to a clearer name like
droppedMetrics (or metricsToDropForLabel) and update all references accordingly;
specifically change the field declaration in metricDropConfig and adjust any
code that reads or writes metricDropConfig (e.g., the buildRelabelConfigs
function and any callers or JSON/YAML tags) so they use the new symbol name to
reflect that these are the metrics that will be dropped for the given label
combination.

89-95: 💤 Low value

Add a length guard for excludeMetrics to match the allowlist pattern.

The allowlist block guards len(allowlist.allowedMetrics) > 0 before appending a keep rule, but the drop loop has no equivalent guard. An empty excludeMetrics slice produces the regex labelValue;() with action: drop. While Prometheus's anchored full-string matching makes this effectively a no-op for real metrics (would only match a sample with an empty __name__), it emits a spurious relabel rule and is inconsistent with the allowlist guard style.

♻️ Proposed fix
 	for _, drop := range dropConfigs {
+		if len(drop.excludeMetrics) == 0 {
+			continue
+		}
 		configs = append(configs, &monitoringv1.RelabelConfig{
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/relabel.go` around lines 89 - 95, The drop loop appends a
"drop" relabel rule even when drop.excludeMetrics is empty, producing a spurious
regex like "labelValue;()"; add the same length guard used for the allowlist:
check len(drop.excludeMetrics) > 0 before creating/appending the
monitoringv1.RelabelConfig in the loop (the loop iterating over dropConfigs that
currently builds SourceLabels/Regex), so only append the drop rule when
excludeMetrics is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/metrics/relabel.go`:
- Around line 13-17: Rename the ambiguous struct field excludeMetrics in
metricDropConfig to a clearer name like droppedMetrics (or
metricsToDropForLabel) and update all references accordingly; specifically
change the field declaration in metricDropConfig and adjust any code that reads
or writes metricDropConfig (e.g., the buildRelabelConfigs function and any
callers or JSON/YAML tags) so they use the new symbol name to reflect that these
are the metrics that will be dropped for the given label combination.
- Around line 89-95: The drop loop appends a "drop" relabel rule even when
drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()";
add the same length guard used for the allowlist: check len(drop.excludeMetrics)
> 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the
loop iterating over dropConfigs that currently builds SourceLabels/Regex), so
only append the drop rule when excludeMetrics is non-empty.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da0fa976-0000-4188-84d0-d85035bde888

📥 Commits

Reviewing files that changed from the base of the PR and between 962597a and ef5f21d.

📒 Files selected for processing (10)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
✅ Files skipped from review due to trivial changes (2)
  • internal/constants/labels.go
  • internal/constants/constants.go

@Clee2691
Copy link
Copy Markdown
Contributor Author

Clee2691 commented May 7, 2026

Have you considered implementing collection profiles?

The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:

* full (default) => collects all metrics available

* minimal => collects only metrics relevant for alerting rules, dashboards and telemetry

* telemetry => collects only metrics required for telemetry

We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum.

cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!)

@simonpasquier I've made changes to this to implement profiles. Can someone on the monitoring team also review?

Copy link
Copy Markdown
Contributor

@jcantrill jcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Comment thread internal/metrics/relabel.go
Comment thread internal/metrics/relabel.go
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Clee2691, jcantrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 42-74: In QueryPrometheus replace the no-timeout http.Client with
one that has a 30s Timeout, stop using Fail(...) on request/response errors and
instead log the error and return an empty map[string]interface{} (so callers can
retry), validate response.StatusCode (expect 200) before attempting to
read/unmarshal the body and handle non-200 by logging and returning empty map,
and ensure response.Body is always closed; update references in this function
(QueryPrometheus, the http.Client/Request handling, and the json.Unmarshal
logic) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9624b7e9-9df4-450c-918c-851df0a88fa1

📥 Commits

Reviewing files that changed from the base of the PR and between ef5f21d and 99d21e3.

📒 Files selected for processing (16)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
  • test/e2e/collection/metrics/profile_test.go
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/operator/metrics/e2e_test.go
  • test/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
  • internal/constants/labels.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/controller/observability/collector.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/service_monitor.go

Comment thread test/helpers/prometheus/prometheus.go
@jcantrill
Copy link
Copy Markdown
Contributor

/refresh

Comment thread test/e2e/collection/metrics/profile_test.go Outdated
Comment thread test/e2e/collection/metrics/profile_test.go
Comment thread test/helpers/prometheus/prometheus.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/helpers/prometheus/prometheus.go (1)

46-50: 💤 Low value

Optional: Consider adding TLS MinVersion for defense in depth.

While InsecureSkipVerify is acceptable for test code (and properly annotated with nolint:gosec), adding MinVersion: tls.VersionTLS12 would provide defense in depth against downgrade attacks in the test environment.

🔒 Optional TLS hardening
 tr := &http.Transport{
 	TLSClientConfig: &tls.Config{
 		InsecureSkipVerify: true, //nolint:gosec
+		MinVersion:         tls.VersionTLS12,
 	},
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/helpers/prometheus/prometheus.go` around lines 46 - 50, The TLS config
for the http.Transport instance (variable tr) sets InsecureSkipVerify in
TLSClientConfig; add a MinVersion field (tls.VersionTLS12) to the tls.Config to
harden tests against downgrade attacks—update the tls.Config literal used by tr
(where TLSClientConfig is constructed) to include MinVersion: tls.VersionTLS12
while keeping the existing InsecureSkipVerify and nolint annotation.
internal/metrics/service_monitor_test.go (1)

106-114: 💤 Low value

Consider asserting the keep rule's metric allowlist regex.

The test validates the keep rule's Action and SourceLabels but doesn't check the Regex field (line 112). While relabel_test.go validates the regex content for CollectorMinimalRelabelConfigs, asserting it here would make the test more robust against changes.

📋 Optional: Add regex validation
 Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
 Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
+Expect(relabelConfigs[1].Regex).To(ContainSubstring("vector_open_files"))
+Expect(relabelConfigs[1].Regex).To(ContainSubstring("vector_component_errors_total"))

Alternatively, compare against CollectorMinimalRelabelConfigs[1].Regex directly for exact validation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/service_monitor_test.go` around lines 106 - 114, Add an
assertion to validate the keep rule's Regex on the ServiceMonitor relabel
configs: in the test that inspects sm.Spec.Endpoints[0].MetricRelabelConfigs
(variable relabelConfigs), assert relabelConfigs[1].Regex equals the expected
allowlist pattern (or compare it directly to
CollectorMinimalRelabelConfigs[1].Regex) so the keep rule's regex is explicitly
validated alongside its Action and SourceLabels.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 106-114: Add an assertion to validate the keep rule's Regex on the
ServiceMonitor relabel configs: in the test that inspects
sm.Spec.Endpoints[0].MetricRelabelConfigs (variable relabelConfigs), assert
relabelConfigs[1].Regex equals the expected allowlist pattern (or compare it
directly to CollectorMinimalRelabelConfigs[1].Regex) so the keep rule's regex is
explicitly validated alongside its Action and SourceLabels.

In `@test/helpers/prometheus/prometheus.go`:
- Around line 46-50: The TLS config for the http.Transport instance (variable
tr) sets InsecureSkipVerify in TLSClientConfig; add a MinVersion field
(tls.VersionTLS12) to the tls.Config to harden tests against downgrade
attacks—update the tls.Config literal used by tr (where TLSClientConfig is
constructed) to include MinVersion: tls.VersionTLS12 while keeping the existing
InsecureSkipVerify and nolint annotation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0d6fcde3-8d4b-4e8f-8fd1-90f648e1910c

📥 Commits

Reviewing files that changed from the base of the PR and between 99d21e3 and 6283020.

📒 Files selected for processing (16)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
  • test/e2e/collection/metrics/profile_test.go
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/operator/metrics/e2e_test.go
  • test/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
  • internal/constants/labels.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • test/e2e/collection/metrics/suite_test.go
  • internal/controller/observability/collector_test.go
  • internal/controller/observability/collector.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go

@Clee2691 Clee2691 force-pushed the LOG-9356 branch 2 times, most recently from 69c2342 to 5527936 Compare May 11, 2026 18:36
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e/collection/metrics/profile_test.go (1)

106-121: 🏗️ Heavy lift

Add an e2e path that validates minimal-profile filtering behavior.

Current scrape assertions prove full-profile visibility, but they do not verify the minimal-profile selection effect (e.g., non-allowlisted metrics becoming absent). Adding a minimal-profile scenario would protect the core cardinality-reduction contract end-to-end.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/collection/metrics/profile_test.go` around lines 106 - 121, Add a
new e2e test case (parallel to "Full profile scraping") that switches to or
targets the minimal profile and asserts minimal-profile behavior: use
prometheus.Query against fmt.Sprintf(`%s{namespace="%s"}`, allowlistedMetric,
constants.OpenshiftNS) and assert with Eventually that the allowlistedMetric is
present, then query the same way for nonAllowlistedMetric and assert with
Eventually that prometheus.HasResults(response) is false (or that the query
returns no results) within appropriate timeouts; reuse the existing helper
patterns (Eventually, Gomega assertions, prometheus.HasResults) so the test
verifies non-allowlisted metrics are absent under the minimal profile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 39-42: The transport currently sets
TLSClientConfig.InsecureSkipVerify = true (see tr and TLSClientConfig) which
exposes bearer tokens to MITM; change the TLS config to not skip verification:
set MinVersion to tls.VersionTLS12 and set ServerName to the target host (use
the host passed into the helper) or, preferably, load and append the cluster CA
bundle into TLSClientConfig.RootCAs when available so certificate validation is
enforced instead of disabling it.
- Around line 21-23: The helper executeCmd currently runs exec.Command("bash",
"-c", cmd).Output() with no deadline and a shell wrapper; change executeCmd to
use a context with timeout (e.g., context.WithTimeout(ctx, 30s)) and
exec.CommandContext to enforce a deadline, and remove the shell wrapper by
invoking the command name and args directly (e.g., exec.CommandContext(ctx,
name, args...)). Update the function signature or its callers so you pass the
command name and argument slice/variadic params to executeCmd, then call
cmd.Output() via the context and return the output and error.

---

Nitpick comments:
In `@test/e2e/collection/metrics/profile_test.go`:
- Around line 106-121: Add a new e2e test case (parallel to "Full profile
scraping") that switches to or targets the minimal profile and asserts
minimal-profile behavior: use prometheus.Query against
fmt.Sprintf(`%s{namespace="%s"}`, allowlistedMetric, constants.OpenshiftNS) and
assert with Eventually that the allowlistedMetric is present, then query the
same way for nonAllowlistedMetric and assert with Eventually that
prometheus.HasResults(response) is false (or that the query returns no results)
within appropriate timeouts; reuse the existing helper patterns (Eventually,
Gomega assertions, prometheus.HasResults) so the test verifies non-allowlisted
metrics are absent under the minimal profile.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba13a1c4-1491-46a2-96c7-78b90100e4c2

📥 Commits

Reviewing files that changed from the base of the PR and between 6283020 and 5527936.

📒 Files selected for processing (16)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
  • test/e2e/collection/metrics/profile_test.go
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/operator/metrics/e2e_test.go
  • test/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
  • internal/constants/labels.go
  • internal/constants/constants.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/flowcontrol/utils.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/controller/observability/collector_test.go
  • internal/controller/observability/collector.go
  • internal/metrics/service_monitor.go

Comment on lines +21 to +23
func executeCmd(cmd string) (string, error) {
result, err := exec.Command("bash", "-c", cmd).Output()
return string(result), err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n test/helpers/prometheus/prometheus.go

Repository: openshift/cluster-logging-operator

Length of output: 3454


Add timeout to command execution and eliminate unnecessary shell wrapper.

Line 22 uses exec.Command("bash", "-c", cmd).Output() without a deadline. If the oc command stalls (e.g., API server becomes unavailable), polling-based tests will hang. Additionally, the shell wrapper is unnecessary for these fixed command strings and unnecessarily widens the risk surface.

Proposed change
+import "context"
@@
-func executeCmd(cmd string) (string, error) {
-	result, err := exec.Command("bash", "-c", cmd).Output()
+func executeCmd(name string, args ...string) (string, error) {
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	result, err := exec.CommandContext(ctx, name, args...).Output()
 	return string(result), err
 }
@@
 func GetToken() (string, error) {
-	token, err := executeCmd(tokenCmd)
+	token, err := executeCmd("oc", "create", "token", "prometheus-k8s", "-n", "openshift-monitoring")
 	return strings.TrimSpace(token), err
 }
@@
 func GetThanosHost() (string, error) {
-	host, err := executeCmd(thanosHostCmd)
+	host, err := executeCmd("oc", "get", "route", "thanos-querier", "-n", "openshift-monitoring", "-o", "jsonpath={.spec.host}")
 	return strings.TrimSpace(host), err
 }
🧰 Tools
🪛 OpenGrep (1.20.0)

[ERROR] 22-22: Dynamic command passed to exec.Command with a shell invocation. Pass arguments directly to exec.Command without a shell wrapper.

(coderabbit.command-injection.go-exec-command)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/helpers/prometheus/prometheus.go` around lines 21 - 23, The helper
executeCmd currently runs exec.Command("bash", "-c", cmd).Output() with no
deadline and a shell wrapper; change executeCmd to use a context with timeout
(e.g., context.WithTimeout(ctx, 30s)) and exec.CommandContext to enforce a
deadline, and remove the shell wrapper by invoking the command name and args
directly (e.g., exec.CommandContext(ctx, name, args...)). Update the function
signature or its callers so you pass the command name and argument
slice/variadic params to executeCmd, then call cmd.Output() via the context and
return the output and error.

Comment on lines +39 to +42
tr := &http.Transport{
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true, //nolint:gosec
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the prometheus.go file to see the context
cat -n test/helpers/prometheus/prometheus.go | head -60

Repository: openshift/cluster-logging-operator

Length of output: 2011


🏁 Script executed:

# Let's check if there are other similar patterns in the codebase
rg "InsecureSkipVerify" --type go

Repository: openshift/cluster-logging-operator

Length of output: 4471


Avoid disabling TLS verification when sending bearer credentials.

Line 41 disables certificate verification in a function that sends real OpenShift service account bearer tokens (obtained at line 26-29 via oc create token). This allows MITM interception of the token in test environments.

Even though this is test helper code, the tokens are real cluster credentials sent to a real endpoint. Set MinVersion: tls.VersionTLS12 and ServerName: host at minimum, or load the cluster CA bundle if available in the test environment.

🧰 Tools
🪛 ast-grep (0.42.2)

[warning] 39-41: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 OpenGrep (1.20.0)

[ERROR] 40-42: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.

(coderabbit.tls.go-insecure-skip-verify)


[ERROR] 40-42: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.

(coderabbit.tls.go-insecure-skip-verify)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/helpers/prometheus/prometheus.go` around lines 39 - 42, The transport
currently sets TLSClientConfig.InsecureSkipVerify = true (see tr and
TLSClientConfig) which exposes bearer tokens to MITM; change the TLS config to
not skip verification: set MinVersion to tls.VersionTLS12 and set ServerName to
the target host (use the host passed into the helper) or, preferably, load and
append the cluster CA bundle into TLSClientConfig.RootCAs when available so
certificate validation is enforced instead of disabling it.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 11, 2026

@Clee2691: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few telemetry metrics associated to the logging team: https://github.com/openshift/cluster-monitoring-operator/blob/c9f25136c1cc6f9f7facbf1d4f128849fbb8e0f1/manifests/0000_50_cluster-monitoring-operator_04-config.yaml#L1024-L1080

We'd need to make sure that these metrics are available in all profiles.

Comment thread internal/constants/constants.go Outdated
MetricsPortName = "metrics"
MetricsPort = int32(24231)
MetricsCollectionProfileFull = "full"
MetricsCollectionProfileMinimal = "minimal"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're adding a new telemetry profile called telemetry which collects only metrics required for Red Hat Telemetry. It would be good to account for it already.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimal profile already accounts for those identified metrics. Some of them are operator only metrics and we are not going to be implementing profiles for the operator. These service monitors are for the collector and the LFME itself not the operator.

I don't think we need a telemetry profile as the minimal is sufficient. I think we'd be down to 1 metric for the telemetry only profile. The LFME already has 1 metric in the minimal set anyways

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a telemetry profile as the minimal is sufficient. I think we'd be down to 1 metric for the telemetry only profile. The LFME already has 1 metric in the minimal set anyways

What if the profile chosen is 'telemetry'? Would that mean we would provide nothing since we are dependent upon the client to use the desired servicemonitor?

Copy link
Copy Markdown
Contributor Author

@Clee2691 Clee2691 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier
The LFME SM would basically be duplicated if we have both minimal and telemetry. Is there a way around this?

@jcantrill I think it would be the "full" metrics if there isn't a matching label for telemetry
For the collector, I think we could provide a telemetry only SM which would include only 1 metric

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. release/6.6

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants