LOG-9356: Implement Collector ServiceMonitor Changes#3270
Conversation
|
@Clee2691: This pull request references LOG-9356 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.8.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Review Summary by QodoAdd metric filtering to collector ServiceMonitor
WalkthroughsDescription• Add metric filtering to ServiceMonitor with keep/drop relabel configs • Keep only essential metrics for alerts, dashboards, telemetry • Drop transform component metrics except errors and discarded events • Refactor relabel config structure for improved readability Diagramflowchart LR
SM["ServiceMonitor<br/>MetricRelabelConfigs"]
R1["Relabel 1:<br/>Replace - with _"]
R2["Relabel 2:<br/>Keep allowed metrics"]
R3["Relabel 3:<br/>Drop transform metrics"]
SM --> R1
SM --> R2
SM --> R3
R2 --> Filter["Filtered metrics<br/>for monitoring"]
R3 --> Filter
File Changes1. internal/metrics/service_monitor.go
|
Code Review by Qodo
Context used 1. LFME metrics filtered out
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReconciler now creates two ServiceMonitors (full and minimal) per component, parameterizing TLS serverName, metric relabel configs, and a metrics-collection-profile label. New relabel builder exposes precomputed config sets; unit and e2e tests updated to validate relabel rules and Prometheus scraping. ChangesMetrics relabel & ServiceMonitor flow
Sequence DiagramsequenceDiagram
participant Controller as Controller
participant Relabel as Relabel Builder
participant K8s as Kubernetes API
participant PromOp as Prometheus Operator
Controller->>Relabel: Request relabel configs (full/minimal)
Relabel-->>Controller: Return relabel config sets
Controller->>K8s: Create/Update ServiceMonitor (endpoints, TLS ServerName, metricRelabelConfigs, profile label)
K8s->>PromOp: Prometheus Operator observes ServiceMonitor
PromOp->>K8s: Begin scraping targets per ServiceMonitor spec
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 7 | ❌ 5❌ Failed checks (5 warnings)
✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
| { | ||
| Action: "keep", | ||
| SourceLabels: []monitoringv1.LabelName{"__name__"}, | ||
| Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files", | ||
| }, |
There was a problem hiding this comment.
1. Lfme metrics filtered out 🐞 Bug ≡ Correctness
newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus. This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Agent Prompt
## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.
## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.
## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]
### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
- Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
- Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/metrics/service_monitor.go (1)
43-43: 💤 Low valueConsider extracting the keep-list regex to a named constant.
The allowlist is the core deliverable of this PR and a maintenance hotspot (new metrics are added, removed, or renamed). Inlining it as a 500+ character string makes diffs noisy and typos hard to spot.
♻️ Suggested refactor
+const collectorMetricsKeepRegex = "logcollector_component_event_unmatched_count|" + + "vector_buffer_byte_size|vector_buffer_discarded_events_total|" + + "vector_buffer_events|vector_buffer_sent_events_total|" + + "vector_component_discarded_events_total|vector_component_errors_total|" + + "vector_component_received_bytes_total|vector_component_received_event_bytes_total|" + + "vector_component_received_events_total|vector_component_sent_bytes_total|" + + "vector_events_in_total|vector_http_client_errors_total|" + + "vector_http_client_requests_sent_total|vector_http_client_responses_total|" + + "vector_open_files" // in newServiceMonitor: - Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|...", + Regex: collectorMetricsKeepRegex,The constant can then be referenced in the test as well, removing duplication.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/service_monitor.go` at line 43, Extract the long inline allowlist string used in the Regex field into a clearly named package-level constant (e.g., allowedMetricsRegex or keepListRegex) in internal/metrics/service_monitor.go, replace the inlined string in the struct literal's Regex field with that constant, and update any tests to import/reference the same constant instead of duplicating the string so the allowlist is a single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 83-84: The test currently omits asserting the allowlist regex so
changes silently pass; update the test in service_monitor_test.go to assert
relabelConfigs[1].Regex equals the expected allowlist regex (or, better, compare
it to the constant defined in service_monitor.go if you extract the regex
there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.
---
Nitpick comments:
In `@internal/metrics/service_monitor.go`:
- Line 43: Extract the long inline allowlist string used in the Regex field into
a clearly named package-level constant (e.g., allowedMetricsRegex or
keepListRegex) in internal/metrics/service_monitor.go, replace the inlined
string in the struct literal's Regex field with that constant, and update any
tests to import/reference the same constant instead of duplicating the string so
the allowlist is a single source of truth.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c0367dba-620a-4f1b-abb3-d1fbd24ba9de
📒 Files selected for processing (2)
internal/metrics/service_monitor.gointernal/metrics/service_monitor_test.go
|
/retest |
|
/hold |
|
Have you considered implementing collection profiles? The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:
We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum. cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!) |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/metrics/relabel.go (2)
13-17: 💤 Low value
excludeMetricsis semantically ambiguous — considerdroppedMetrics.The field name
excludeMetricscan be read as "metrics excluded from the drop action" (i.e., kept), but the implementation puts them inside thedroprule regex, meaning they are the metrics that get dropped for the given label combination.droppedMetricsormetricsToDropForLabelwould remove this ambiguity for future maintainers.♻️ Proposed rename
type metricDropConfig struct { labelName string labelValue string - excludeMetrics []string + droppedMetrics []string }var collectorMinimalDropConfigs = []metricDropConfig{ { labelName: "component_kind", labelValue: "transform", - excludeMetrics: []string{ + droppedMetrics: []string{And in
buildRelabelConfigs:- Regex: drop.labelValue + ";(" + strings.Join(drop.excludeMetrics, "|") + ")", + Regex: drop.labelValue + ";(" + strings.Join(drop.droppedMetrics, "|") + ")",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/relabel.go` around lines 13 - 17, Rename the ambiguous struct field excludeMetrics in metricDropConfig to a clearer name like droppedMetrics (or metricsToDropForLabel) and update all references accordingly; specifically change the field declaration in metricDropConfig and adjust any code that reads or writes metricDropConfig (e.g., the buildRelabelConfigs function and any callers or JSON/YAML tags) so they use the new symbol name to reflect that these are the metrics that will be dropped for the given label combination.
89-95: 💤 Low valueAdd a length guard for
excludeMetricsto match the allowlist pattern.The allowlist block guards
len(allowlist.allowedMetrics) > 0before appending a keep rule, but the drop loop has no equivalent guard. An emptyexcludeMetricsslice produces the regexlabelValue;()withaction: drop. While Prometheus's anchored full-string matching makes this effectively a no-op for real metrics (would only match a sample with an empty__name__), it emits a spurious relabel rule and is inconsistent with the allowlist guard style.♻️ Proposed fix
for _, drop := range dropConfigs { + if len(drop.excludeMetrics) == 0 { + continue + } configs = append(configs, &monitoringv1.RelabelConfig{🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/relabel.go` around lines 89 - 95, The drop loop appends a "drop" relabel rule even when drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()"; add the same length guard used for the allowlist: check len(drop.excludeMetrics) > 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the loop iterating over dropConfigs that currently builds SourceLabels/Regex), so only append the drop rule when excludeMetrics is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/metrics/relabel.go`:
- Around line 13-17: Rename the ambiguous struct field excludeMetrics in
metricDropConfig to a clearer name like droppedMetrics (or
metricsToDropForLabel) and update all references accordingly; specifically
change the field declaration in metricDropConfig and adjust any code that reads
or writes metricDropConfig (e.g., the buildRelabelConfigs function and any
callers or JSON/YAML tags) so they use the new symbol name to reflect that these
are the metrics that will be dropped for the given label combination.
- Around line 89-95: The drop loop appends a "drop" relabel rule even when
drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()";
add the same length guard used for the allowlist: check len(drop.excludeMetrics)
> 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the
loop iterating over dropConfigs that currently builds SourceLabels/Regex), so
only append the drop rule when excludeMetrics is non-empty.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: da0fa976-0000-4188-84d0-d85035bde888
📒 Files selected for processing (10)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.go
✅ Files skipped from review due to trivial changes (2)
- internal/constants/labels.go
- internal/constants/constants.go
@simonpasquier I've made changes to this to implement profiles. Can someone on the monitoring team also review? |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Clee2691, jcantrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 42-74: In QueryPrometheus replace the no-timeout http.Client with
one that has a 30s Timeout, stop using Fail(...) on request/response errors and
instead log the error and return an empty map[string]interface{} (so callers can
retry), validate response.StatusCode (expect 200) before attempting to
read/unmarshal the body and handle non-200 by logging and returning empty map,
and ensure response.Body is always closed; update references in this function
(QueryPrometheus, the http.Client/Request handling, and the json.Unmarshal
logic) accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9624b7e9-9df4-450c-918c-851df0a88fa1
📒 Files selected for processing (16)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.gotest/e2e/collection/metrics/profile_test.gotest/e2e/collection/metrics/suite_test.gotest/e2e/flowcontrol/utils.gotest/e2e/logfilesmetricexporter/lfme_test.gotest/e2e/operator/metrics/e2e_test.gotest/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
- internal/constants/labels.go
- internal/metrics/logfilemetricexporter/metric_exporter_test.go
- internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/controller/observability/collector.go
- internal/metrics/logfilemetricexporter/metric_exporter.go
- internal/controller/observability/collector_test.go
- internal/metrics/service_monitor.go
|
/refresh |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/helpers/prometheus/prometheus.go (1)
46-50: 💤 Low valueOptional: Consider adding TLS MinVersion for defense in depth.
While
InsecureSkipVerifyis acceptable for test code (and properly annotated withnolint:gosec), addingMinVersion: tls.VersionTLS12would provide defense in depth against downgrade attacks in the test environment.🔒 Optional TLS hardening
tr := &http.Transport{ TLSClientConfig: &tls.Config{ InsecureSkipVerify: true, //nolint:gosec + MinVersion: tls.VersionTLS12, }, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/helpers/prometheus/prometheus.go` around lines 46 - 50, The TLS config for the http.Transport instance (variable tr) sets InsecureSkipVerify in TLSClientConfig; add a MinVersion field (tls.VersionTLS12) to the tls.Config to harden tests against downgrade attacks—update the tls.Config literal used by tr (where TLSClientConfig is constructed) to include MinVersion: tls.VersionTLS12 while keeping the existing InsecureSkipVerify and nolint annotation.internal/metrics/service_monitor_test.go (1)
106-114: 💤 Low valueConsider asserting the keep rule's metric allowlist regex.
The test validates the keep rule's Action and SourceLabels but doesn't check the
Regexfield (line 112). Whilerelabel_test.govalidates the regex content forCollectorMinimalRelabelConfigs, asserting it here would make the test more robust against changes.📋 Optional: Add regex validation
Expect(string(relabelConfigs[1].Action)).To(Equal("keep")) Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"})) +Expect(relabelConfigs[1].Regex).To(ContainSubstring("vector_open_files")) +Expect(relabelConfigs[1].Regex).To(ContainSubstring("vector_component_errors_total"))Alternatively, compare against
CollectorMinimalRelabelConfigs[1].Regexdirectly for exact validation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/service_monitor_test.go` around lines 106 - 114, Add an assertion to validate the keep rule's Regex on the ServiceMonitor relabel configs: in the test that inspects sm.Spec.Endpoints[0].MetricRelabelConfigs (variable relabelConfigs), assert relabelConfigs[1].Regex equals the expected allowlist pattern (or compare it directly to CollectorMinimalRelabelConfigs[1].Regex) so the keep rule's regex is explicitly validated alongside its Action and SourceLabels.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 106-114: Add an assertion to validate the keep rule's Regex on the
ServiceMonitor relabel configs: in the test that inspects
sm.Spec.Endpoints[0].MetricRelabelConfigs (variable relabelConfigs), assert
relabelConfigs[1].Regex equals the expected allowlist pattern (or compare it
directly to CollectorMinimalRelabelConfigs[1].Regex) so the keep rule's regex is
explicitly validated alongside its Action and SourceLabels.
In `@test/helpers/prometheus/prometheus.go`:
- Around line 46-50: The TLS config for the http.Transport instance (variable
tr) sets InsecureSkipVerify in TLSClientConfig; add a MinVersion field
(tls.VersionTLS12) to the tls.Config to harden tests against downgrade
attacks—update the tls.Config literal used by tr (where TLSClientConfig is
constructed) to include MinVersion: tls.VersionTLS12 while keeping the existing
InsecureSkipVerify and nolint annotation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0d6fcde3-8d4b-4e8f-8fd1-90f648e1910c
📒 Files selected for processing (16)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.gotest/e2e/collection/metrics/profile_test.gotest/e2e/collection/metrics/suite_test.gotest/e2e/flowcontrol/utils.gotest/e2e/logfilesmetricexporter/lfme_test.gotest/e2e/operator/metrics/e2e_test.gotest/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
- internal/constants/labels.go
- internal/metrics/logfilemetricexporter/metric_exporter_test.go
- internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (6)
- test/e2e/collection/metrics/suite_test.go
- internal/controller/observability/collector_test.go
- internal/controller/observability/collector.go
- test/e2e/flowcontrol/utils.go
- test/e2e/logfilesmetricexporter/lfme_test.go
- internal/metrics/logfilemetricexporter/metric_exporter.go
69c2342 to
5527936
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e/collection/metrics/profile_test.go (1)
106-121: 🏗️ Heavy liftAdd an e2e path that validates minimal-profile filtering behavior.
Current scrape assertions prove full-profile visibility, but they do not verify the minimal-profile selection effect (e.g., non-allowlisted metrics becoming absent). Adding a minimal-profile scenario would protect the core cardinality-reduction contract end-to-end.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/collection/metrics/profile_test.go` around lines 106 - 121, Add a new e2e test case (parallel to "Full profile scraping") that switches to or targets the minimal profile and asserts minimal-profile behavior: use prometheus.Query against fmt.Sprintf(`%s{namespace="%s"}`, allowlistedMetric, constants.OpenshiftNS) and assert with Eventually that the allowlistedMetric is present, then query the same way for nonAllowlistedMetric and assert with Eventually that prometheus.HasResults(response) is false (or that the query returns no results) within appropriate timeouts; reuse the existing helper patterns (Eventually, Gomega assertions, prometheus.HasResults) so the test verifies non-allowlisted metrics are absent under the minimal profile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 39-42: The transport currently sets
TLSClientConfig.InsecureSkipVerify = true (see tr and TLSClientConfig) which
exposes bearer tokens to MITM; change the TLS config to not skip verification:
set MinVersion to tls.VersionTLS12 and set ServerName to the target host (use
the host passed into the helper) or, preferably, load and append the cluster CA
bundle into TLSClientConfig.RootCAs when available so certificate validation is
enforced instead of disabling it.
- Around line 21-23: The helper executeCmd currently runs exec.Command("bash",
"-c", cmd).Output() with no deadline and a shell wrapper; change executeCmd to
use a context with timeout (e.g., context.WithTimeout(ctx, 30s)) and
exec.CommandContext to enforce a deadline, and remove the shell wrapper by
invoking the command name and args directly (e.g., exec.CommandContext(ctx,
name, args...)). Update the function signature or its callers so you pass the
command name and argument slice/variadic params to executeCmd, then call
cmd.Output() via the context and return the output and error.
---
Nitpick comments:
In `@test/e2e/collection/metrics/profile_test.go`:
- Around line 106-121: Add a new e2e test case (parallel to "Full profile
scraping") that switches to or targets the minimal profile and asserts
minimal-profile behavior: use prometheus.Query against
fmt.Sprintf(`%s{namespace="%s"}`, allowlistedMetric, constants.OpenshiftNS) and
assert with Eventually that the allowlistedMetric is present, then query the
same way for nonAllowlistedMetric and assert with Eventually that
prometheus.HasResults(response) is false (or that the query returns no results)
within appropriate timeouts; reuse the existing helper patterns (Eventually,
Gomega assertions, prometheus.HasResults) so the test verifies non-allowlisted
metrics are absent under the minimal profile.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ba13a1c4-1491-46a2-96c7-78b90100e4c2
📒 Files selected for processing (16)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.gotest/e2e/collection/metrics/profile_test.gotest/e2e/collection/metrics/suite_test.gotest/e2e/flowcontrol/utils.gotest/e2e/logfilesmetricexporter/lfme_test.gotest/e2e/operator/metrics/e2e_test.gotest/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
- internal/constants/labels.go
- internal/constants/constants.go
- internal/metrics/logfilemetricexporter/metric_exporter_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
- test/e2e/collection/metrics/suite_test.go
- test/e2e/logfilesmetricexporter/lfme_test.go
- test/e2e/flowcontrol/utils.go
- internal/metrics/logfilemetricexporter/metric_exporter.go
- internal/controller/observability/collector_test.go
- internal/controller/observability/collector.go
- internal/metrics/service_monitor.go
| func executeCmd(cmd string) (string, error) { | ||
| result, err := exec.Command("bash", "-c", cmd).Output() | ||
| return string(result), err |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n test/helpers/prometheus/prometheus.goRepository: openshift/cluster-logging-operator
Length of output: 3454
Add timeout to command execution and eliminate unnecessary shell wrapper.
Line 22 uses exec.Command("bash", "-c", cmd).Output() without a deadline. If the oc command stalls (e.g., API server becomes unavailable), polling-based tests will hang. Additionally, the shell wrapper is unnecessary for these fixed command strings and unnecessarily widens the risk surface.
Proposed change
+import "context"
@@
-func executeCmd(cmd string) (string, error) {
- result, err := exec.Command("bash", "-c", cmd).Output()
+func executeCmd(name string, args ...string) (string, error) {
+ ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+ defer cancel()
+ result, err := exec.CommandContext(ctx, name, args...).Output()
return string(result), err
}
@@
func GetToken() (string, error) {
- token, err := executeCmd(tokenCmd)
+ token, err := executeCmd("oc", "create", "token", "prometheus-k8s", "-n", "openshift-monitoring")
return strings.TrimSpace(token), err
}
@@
func GetThanosHost() (string, error) {
- host, err := executeCmd(thanosHostCmd)
+ host, err := executeCmd("oc", "get", "route", "thanos-querier", "-n", "openshift-monitoring", "-o", "jsonpath={.spec.host}")
return strings.TrimSpace(host), err
}🧰 Tools
🪛 OpenGrep (1.20.0)
[ERROR] 22-22: Dynamic command passed to exec.Command with a shell invocation. Pass arguments directly to exec.Command without a shell wrapper.
(coderabbit.command-injection.go-exec-command)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/helpers/prometheus/prometheus.go` around lines 21 - 23, The helper
executeCmd currently runs exec.Command("bash", "-c", cmd).Output() with no
deadline and a shell wrapper; change executeCmd to use a context with timeout
(e.g., context.WithTimeout(ctx, 30s)) and exec.CommandContext to enforce a
deadline, and remove the shell wrapper by invoking the command name and args
directly (e.g., exec.CommandContext(ctx, name, args...)). Update the function
signature or its callers so you pass the command name and argument
slice/variadic params to executeCmd, then call cmd.Output() via the context and
return the output and error.
| tr := &http.Transport{ | ||
| TLSClientConfig: &tls.Config{ | ||
| InsecureSkipVerify: true, //nolint:gosec | ||
| }, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the prometheus.go file to see the context
cat -n test/helpers/prometheus/prometheus.go | head -60Repository: openshift/cluster-logging-operator
Length of output: 2011
🏁 Script executed:
# Let's check if there are other similar patterns in the codebase
rg "InsecureSkipVerify" --type goRepository: openshift/cluster-logging-operator
Length of output: 4471
Avoid disabling TLS verification when sending bearer credentials.
Line 41 disables certificate verification in a function that sends real OpenShift service account bearer tokens (obtained at line 26-29 via oc create token). This allows MITM interception of the token in test environments.
Even though this is test helper code, the tokens are real cluster credentials sent to a real endpoint. Set MinVersion: tls.VersionTLS12 and ServerName: host at minimum, or load the cluster CA bundle if available in the test environment.
🧰 Tools
🪛 ast-grep (0.42.2)
[warning] 39-41: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
🪛 OpenGrep (1.20.0)
[ERROR] 40-42: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.
(coderabbit.tls.go-insecure-skip-verify)
[ERROR] 40-42: TLS certificate verification is disabled via InsecureSkipVerify. This allows man-in-the-middle attacks. Remove InsecureSkipVerify or set it to false.
(coderabbit.tls.go-insecure-skip-verify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/helpers/prometheus/prometheus.go` around lines 39 - 42, The transport
currently sets TLSClientConfig.InsecureSkipVerify = true (see tr and
TLSClientConfig) which exposes bearer tokens to MITM; change the TLS config to
not skip verification: set MinVersion to tls.VersionTLS12 and set ServerName to
the target host (use the host passed into the helper) or, preferably, load and
append the cluster CA bundle into TLSClientConfig.RootCAs when available so
certificate validation is enforced instead of disabling it.
|
@Clee2691: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
simonpasquier
left a comment
There was a problem hiding this comment.
There are quite a few telemetry metrics associated to the logging team: https://github.com/openshift/cluster-monitoring-operator/blob/c9f25136c1cc6f9f7facbf1d4f128849fbb8e0f1/manifests/0000_50_cluster-monitoring-operator_04-config.yaml#L1024-L1080
We'd need to make sure that these metrics are available in all profiles.
| MetricsPortName = "metrics" | ||
| MetricsPort = int32(24231) | ||
| MetricsCollectionProfileFull = "full" | ||
| MetricsCollectionProfileMinimal = "minimal" |
There was a problem hiding this comment.
we're adding a new telemetry profile called telemetry which collects only metrics required for Red Hat Telemetry. It would be good to account for it already.
There was a problem hiding this comment.
The minimal profile already accounts for those identified metrics. Some of them are operator only metrics and we are not going to be implementing profiles for the operator. These service monitors are for the collector and the LFME itself not the operator.
I don't think we need a telemetry profile as the minimal is sufficient. I think we'd be down to 1 metric for the telemetry only profile. The LFME already has 1 metric in the minimal set anyways
There was a problem hiding this comment.
I don't think we need a telemetry profile as the minimal is sufficient. I think we'd be down to 1 metric for the telemetry only profile. The LFME already has 1 metric in the minimal set anyways
What if the profile chosen is 'telemetry'? Would that mean we would provide nothing since we are dependent upon the client to use the desired servicemonitor?
There was a problem hiding this comment.
@simonpasquier
The LFME SM would basically be duplicated if we have both minimal and telemetry. Is there a way around this?
@jcantrill I think it would be the "full" metrics if there isn't a matching label for telemetry
For the collector, I think we could provide a telemetry only SM which would include only 1 metric
Description
This PR makes use of metrics collection profiles.
The CMO supports metrics collection profiles that select
ServiceMonitorsby themonitoring.openshift.io/collection-profilelabel. This change adds that label to allServiceMonitorscreated by CLO, enabling cluster admins to reduce metrics cardinality by switching to the minimal profile.Because the minimal
ServiceMonitorhas a different resource name (e.g.,minimal-collector) but targets the same backing service, a serviceNameparameter was added to ensure TLS ServerName verification uses the correct service DNS name.Minimal Metrics - Collector:
logcollector_component_event_unmatched_countvector_buffer_byte_sizevector_buffer_discarded_events_totalvector_buffer_eventsvector_buffer_sent_events_totalvector_component_discarded_events_totalvector_component_errors_totalvector_component_received_bytes_totalvector_component_received_event_bytes_totalvector_component_received_events_totalvector_component_sent_bytes_totalvector_events_in_totalvector_http_client_errors_totalvector_http_client_requests_sent_totalvector_http_client_responses_totalvector_open_filesMinimal Metrics - LFME:
log_logged_bytes_total/cc @vparfonov
/assign @jcantrill
Links
Summary by CodeRabbit
New Features
Tests