NETOBSERV-2365: Recording rules support #1163

leandroberetta · 2025-12-09T15:37:19Z

Description

This PR adds support for recording rules in the Network Health view. Recording rules are Prometheus rules that pre-compute health metrics and store the results as new time series, complementing the existing alerting functionality.

Recording Rules Feature

Recording rules appear alongside alerts in the Network Health view with the following capabilities:

Display recording rule violations organized by global, namespace, and node scopes
Show severity levels (critical, warning, info) based on configured thresholds
Include direction indicators (Src/Dst) when metrics are directional
Integrate with the health summary to reflect overall network status
Provide direct navigation to the query browser for metric exploration
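A minimal sketch of how a current value could be mapped to the info, warning, and critical levels configured on a health rule. It assumes a level is reached when the value is strictly greater than its threshold, and that an empty threshold means that level is not configured. severity_for is a hypothetical helper for illustration, not the logic actually implemented in the console plugin.

```shell
# severity_for VALUE INFO WARNING CRITICAL  (hypothetical helper)
# Prints the highest severity whose threshold VALUE exceeds, or "none".
# Assumes "strictly greater than" semantics; empty thresholds are skipped.
severity_for() {
  awk -v v="$1" -v i="$2" -v w="$3" -v c="$4" '
    # A level is reached only if its threshold is set and the value exceeds it.
    function reached(t) { return (t != "") && (v + 0 > t + 0) }
    BEGIN {
      if (reached(c)) print "critical"
      else if (reached(w)) print "warning"
      else if (reached(i)) print "info"
      else print "none"
    }'
}
```

With the PacketDropsByKernel thresholds used in the Testing section (info 0.5, warning 2, critical 5), severity_for 3 0.5 2 5 prints warning. Checking critical first means a value only ever reports its most severe level.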

Implementation

UI Components

Recording rule cards are displayed in the same gallery as alert cards and share the same selection behavior
Details table shows template name, severity, current value, threshold, and direction
Kebab menu provides quick access to view metrics in the query browser
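The kebab-menu navigation can be approximated by building a console link from a PromQL query. This sketch assumes the OpenShift console Metrics page is served at /monitoring/query-browser and reads the query from a query0 parameter; query_browser_url, urlencode, and the example host are hypothetical, and the plugin may build its links differently.

```shell
# urlencode STRING: percent-encodes everything except RFC 3986 unreserved characters.
# The input is passed through the environment so awk does not interpret backslashes.
urlencode() {
  URLENC_INPUT="$1" LC_ALL=C awk 'BEGIN {
    s = ENVIRON["URLENC_INPUT"]
    # Build a byte -> code lookup table (POSIX awk has no ord()).
    for (n = 1; n < 256; n++) ord[sprintf("%c", n)] = n
    out = ""
    for (k = 1; k <= length(s); k++) {
      ch = substr(s, k, 1)
      if (ch ~ /[A-Za-z0-9._~-]/) out = out ch
      else out = out sprintf("%%%02X", ord[ch])
    }
    print out
  }'
}

# query_browser_url CONSOLE_BASE_URL PROMQL  (hypothetical helper)
# Drops a trailing slash from the base URL and appends the encoded query.
query_browser_url() {
  printf '%s/monitoring/query-browser?query0=%s\n' "${1%/}" "$(urlencode "$2")"
}
```

For example, query_browser_url https://console.example.com 'rate(x[5m])' prints https://console.example.com/monitoring/query-browser?query0=rate%28x%5B5m%5D%29. Encoding matters because PromQL routinely contains brackets, braces, and quotes.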

Data Flow

Fetches recording rules from the Prometheus API, filtered by the netobserv label
Queries current metric values for each recording rule
Processes metrics using health rule metadata from FlowCollector configuration
Groups rules by resource (global, namespace, node) and severity
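The grouping step can be sketched as follows, assuming each violation has already been reduced to one line of the form SCOPE RESOURCE SEVERITY, with - as a placeholder resource for the global scope. The input format and the group_by_resource name are illustrative assumptions; the real processing presumably happens in the plugin's TypeScript health helpers.

```shell
# group_by_resource: reads "SCOPE RESOURCE SEVERITY" lines on stdin (hypothetical format)
# and prints "SCOPE/RESOURCE SEVERITY COUNT", sorted byte-wise for stable output.
group_by_resource() {
  awk 'NF == 3 { n[$1 "/" $2 " " $3]++ }
       END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}
```

Feeding it two warning lines and one critical line for namespace dns-test yields namespace/dns-test critical 1 and namespace/dns-test warning 2, which is the shape a per-namespace card needs.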

Health Summary

Aggregates recording rule counts across all scopes
Contributes to overall health status determination
Displays alongside alert counts in the network health summary
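One way the summary could combine counts is to total violations per severity across all scopes and report the worst severity present as the overall status. Both that worst-severity rule and the health_summary helper are assumptions for illustration. It reads one violation per line as SCOPE RESOURCE SEVERITY (a hypothetical format), so alert and recording rule violations can be fed in together.

```shell
# health_summary: reads "SCOPE RESOURCE SEVERITY" lines on stdin (hypothetical format),
# prints per-severity totals and an overall status equal to the worst severity seen.
health_summary() {
  awk '
    NF == 3 { total[$3]++ }
    END {
      status = "healthy"
      # Later checks override earlier ones, so the most severe level wins.
      if (total["info"] > 0) status = "info"
      if (total["warning"] > 0) status = "warning"
      if (total["critical"] > 0) status = "critical"
      printf "critical=%d warning=%d info=%d status=%s\n",
        total["critical"], total["warning"], total["info"], status
    }'
}
```

Deriving the status from the worst severity keeps a single critical violation visible even when many info-level ones are present.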

Configuration

Recording rules are configured in the FlowCollector CR under processor.metrics.healthRules with mode: recording. The operator generates the corresponding PrometheusRule resources with the appropriate metric names and evaluation rules.

Testing

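To test this feature with both an alert and a recording rule, apply the following test configurations. Note that --type=merge applies a JSON merge patch, which replaces lists wholesale, so step 1 overwrites any existing agent.ebpf.features and processor.metrics.healthRules entries.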

# 1. Configure FlowCollector with alert + recording rule
kubectl patch flowcollector cluster --type=merge --patch '
spec:
  agent:
    ebpf:
      privileged: true
      features:
      - "PacketDrop"
      - "DNSTracking"
  processor:
    advanced:
      env:
        EXPERIMENTAL_ALERTS_HEALTH: "true"
    metrics:
      healthRules:
      - template: DNSNxDomain
        mode: alert
        variants:
        - groupBy: Namespace
          thresholds:
            info: "10"
            warning: "50"
            critical: "80"
      - template: PacketDropsByKernel
        mode: recording
        variants:
        - thresholds:
            info: "0.5"
            warning: "2"
            critical: "5"
'

# 2. Generate DNS errors (for alert)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: dns-test
---
apiVersion: v1
kind: Pod
metadata:
  name: dns-error-generator
  namespace: dns-test
spec:
  containers:
  - name: dns-client
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      echo "Starting DNS error generator..."
      while true; do
        for i in {1..20}; do
          nslookup "nonexistent-domain-\${RANDOM}.invalid" || true
          nslookup "fake-\${RANDOM}.test" || true
          nslookup "does-not-exist-\${RANDOM}.local" || true
        done
        echo "Generated 60 DNS NXDOMAIN errors"
        sleep 5
      done
  restartPolicy: Always
EOF

# 3. Generate packet drops (for recording rule)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: packet-drop-test
---
apiVersion: v1
kind: Service
metadata:
  name: udp-sink
  namespace: packet-drop-test
spec:
  selector:
    app: udp-sink
  ports:
  - port: 9999
    protocol: UDP
---
apiVersion: v1
kind: Pod
metadata:
  name: udp-sink
  namespace: packet-drop-test
  labels:
    app: udp-sink
spec:
  containers:
  - name: sink
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      while true; do
        nc -ul -p 9999 > /dev/null 2>&1
      done
    resources:
      limits:
        memory: "64Mi"
        cpu: "100m"
---
apiVersion: v1
kind: Pod
metadata:
  name: packet-drop-generator
  namespace: packet-drop-test
spec:
  containers:
  - name: flood-gen
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      sleep 10
      while true; do
        for i in {1..50}; do
          (
            for j in {1..5000}; do
              echo "DATA" | nc -u -w 0 udp-sink.packet-drop-test.svc.cluster.local 9999 2>/dev/null
            done
          ) &
        done
        wait
        echo "Sent 250k packets"
        sleep 10
      done
    resources:
      limits:
        memory: "256Mi"
        cpu: "1000m"
  restartPolicy: Always
EOF

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci-robot · 2025-12-09T15:37:24Z

openshift-ci · 2025-12-09T15:37:24Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2025-12-09T15:37:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mffiedler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2025-12-11T15:37:35Z

Codecov Report

❌ Patch coverage is 4.83871% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.68%. Comparing base (d5e51a4) to head (cbf034b).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
web/src/components/health/health-helper.ts	4.83%	59 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
- Coverage   52.95%   52.68%   -0.28%     
==========================================
  Files         209      209              
  Lines       10950    11010      +60     
  Branches     1391     1409      +18     
==========================================
+ Hits         5799     5801       +2     
- Misses       4602     4660      +58     
  Partials      549      549

Flag	Coverage Δ
uitests	`54.57% <4.83%> (-0.38%)`	⬇️
unittests	`47.27% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
pkg/config/config.go	`47.32% <ø> (ø)`
web/src/model/config.ts	`100.00% <ø> (ø)`
web/src/components/health/health-helper.ts	`21.14% <4.83%> (-6.40%)`	⬇️

openshift-ci-robot · 2026-01-05T15:34:24Z

@leandroberetta: This pull request references NETOBSERV-2365 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).

Does this PR require product documentation?

If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.

Does this PR require a product release notes entry?

If so, fill in "Release Note Text" in the JIRA.

Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.

If so, make sure it is described in the JIRA ticket.

QE requirements (check 1 from the list):

Standard QE validation, with pre-merge tests unless stated otherwise.

Regression tests only (e.g. refactoring with no user-facing change).

No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-01-05T17:17:01Z

@leandroberetta: This pull request references NETOBSERV-2365 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

leandroberetta self-assigned this Dec 9, 2025

openshift-ci-robot added the jira/valid-reference label Dec 9, 2025

openshift-ci bot added the do-not-merge/work-in-progress label Dec 9, 2025

leandroberetta mentioned this pull request Dec 9, 2025

NETOBSERV-2365: Add the ability to create recording rules instead of alerts for the Network Health feature netobserv/network-observability-operator#2112

Open

10 tasks

openshift-merge-robot added the needs-rebase label Dec 13, 2025

leandroberetta force-pushed the netobserv-2365 branch from a68382f to 076d872 December 17, 2025 14:38

openshift-merge-robot removed the needs-rebase label Dec 17, 2025

leandroberetta force-pushed the netobserv-2365 branch from 075f495 to 6f8b3a1 December 17, 2025 14:39

leandroberetta added 2 commits December 22, 2025 17:32

margins added, minor polish

037c94c

several improvements

29656b7

leandroberetta force-pushed the netobserv-2365 branch from 12caef3 to 29656b7 January 5, 2026 15:28

leandroberetta added 2 commits January 5, 2026 12:37

fix linting

fa04324

fix linting and testing

cbf034b

leandroberetta marked this pull request as ready for review January 5, 2026 16:03

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 5, 2026

leandroberetta requested a review from jotak January 5, 2026 16:28
