@leandroberetta leandroberetta commented Dec 9, 2025

Description

This PR adds support for recording rules in the Network Health view. These are Prometheus recording rules: they pre-compute health metrics and store the results as new time series, complementing the existing alerting functionality.

Recording Rules Feature

Recording rules appear alongside alerts in the Network Health view with the following capabilities:

  • Display recording rule violations organized by global, namespace, and node scopes
  • Show severity levels (critical, warning, info) based on configured thresholds
  • Include direction indicators (Src/Dst) when metrics are directional
  • Integrate with the health summary to reflect overall network status
  • Provide direct navigation to the query browser for metric exploration
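
As an illustration of how a severity level could be derived from the configured thresholds, here is a minimal sketch. The `Thresholds` type and the `severityFor` function are hypothetical names, not taken from this PR, and the comparison rule (a threshold is violated when the value is strictly greater than it) is an assumption.

```typescript
// Hypothetical types for illustration; not the PR's actual code.
type Severity = 'critical' | 'warning' | 'info';

// Thresholds are strings in the FlowCollector CR (e.g. info: "0.5"), so they are parsed here.
interface Thresholds {
  info?: string;
  warning?: string;
  critical?: string;
}

// Returns the highest severity whose threshold is exceeded, or undefined if none is.
// Assumption: a threshold is violated when value > threshold.
function severityFor(value: number, thresholds: Thresholds): Severity | undefined {
  const order: Severity[] = ['critical', 'warning', 'info'];
  for (const sev of order) {
    const raw = thresholds[sev];
    if (raw === undefined) continue;
    const limit = Number(raw);
    if (!isNaN(limit) && value > limit) return sev;
  }
  return undefined;
}

// Thresholds copied from the PacketDropsByKernel rule in the Testing section.
const packetDrops: Thresholds = { info: '0.5', warning: '2', critical: '5' };
console.log(severityFor(3.1, packetDrops)); // warning
console.log(severityFor(0.2, packetDrops)); // undefined
```

Checking the levels from most to least severe means a value that exceeds several thresholds is reported once, at its worst level.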

Implementation

UI Components

  • Recording rule cards display in the same gallery as alerts with unified selection behavior
  • Details table shows template name, severity, current value, threshold, and direction
  • Kebab menu provides quick access to view metrics in the query browser

Data Flow

  • Fetches recording rules from the Prometheus API, filtered by the netobserv label
  • Queries current metric values for each recording rule
  • Processes metrics using health rule metadata from FlowCollector configuration
  • Groups rules by resource (global, namespace, node) and severity
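
The grouping step above could look roughly like the following sketch. The `RuleSample` and `ScopedGroup` shapes, the `groupByResource` function, and the label names `namespace` and `node` are all assumptions made for illustration; the actual logic lives in the PR's health helper and may differ.

```typescript
// Hypothetical shapes; the label names "namespace" and "node" are assumptions.
type Scope = 'global' | 'namespace' | 'node';
type Severity = 'critical' | 'warning' | 'info';

interface RuleSample {
  template: string;
  severity: Severity;
  value: number;
  labels: { [name: string]: string };
}

interface ScopedGroup {
  scope: Scope;
  name: string; // namespace or node name; empty for the global scope
  rules: RuleSample[];
}

const severityRank: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

// Buckets samples by resource, then orders each bucket from most to least severe.
function groupByResource(samples: RuleSample[]): ScopedGroup[] {
  const groups: { [key: string]: ScopedGroup } = {};
  for (const s of samples) {
    let scope: Scope = 'global';
    let name = '';
    if (s.labels['namespace']) {
      scope = 'namespace';
      name = s.labels['namespace'];
    } else if (s.labels['node']) {
      scope = 'node';
      name = s.labels['node'];
    }
    const key = `${scope}/${name}`;
    if (!groups[key]) groups[key] = { scope, name, rules: [] };
    groups[key].rules.push(s);
  }
  return Object.keys(groups).map(key => {
    const g = groups[key];
    g.rules.sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);
    return g;
  });
}

// Made-up samples: two global results and one scoped to a namespace.
const grouped = groupByResource([
  { template: 'PacketDropsByKernel', severity: 'info', value: 0.7, labels: {} },
  { template: 'PacketDropsByKernel', severity: 'critical', value: 6, labels: {} },
  { template: 'PacketDropsByKernel', severity: 'warning', value: 3, labels: { namespace: 'packet-drop-test' } },
]);
console.log(grouped.map(g => `${g.scope}/${g.name}: ${g.rules.length} rule(s)`));
```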

Health Summary

  • Aggregates recording rule counts across all scopes
  • Contributes to overall health status determination
  • Displays alongside alert counts in the network health summary
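
To make the aggregation concrete, here is a hedged sketch of how recording rule counts might be combined with alert counts to produce an overall status. `countBySeverity`, `overallStatus`, and the "worst severity present wins" rule are assumptions for illustration, not the PR's confirmed behavior.

```typescript
// Hypothetical types and rule; the PR may determine the overall status differently.
type Severity = 'critical' | 'warning' | 'info';
type Status = Severity | 'healthy';

interface SeverityCounts {
  critical: number;
  warning: number;
  info: number;
}

// Counts items per severity, regardless of which scope they came from.
function countBySeverity(items: { severity: Severity }[]): SeverityCounts {
  const counts: SeverityCounts = { critical: 0, warning: 0, info: 0 };
  for (const item of items) counts[item.severity] += 1;
  return counts;
}

// Assumption: the overall status is the worst severity seen in either source.
function overallStatus(alerts: SeverityCounts, recordingRules: SeverityCounts): Status {
  const order: Severity[] = ['critical', 'warning', 'info'];
  for (const sev of order) {
    if (alerts[sev] + recordingRules[sev] > 0) return sev;
  }
  return 'healthy';
}

const alertCounts = countBySeverity([{ severity: 'warning' }]);
const recordingCounts = countBySeverity([{ severity: 'info' }, { severity: 'critical' }]);
console.log(overallStatus(alertCounts, recordingCounts)); // critical
```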

Configuration

Recording rules are configured in the FlowCollector CR under processor.metrics.healthRules with mode: recording. The operator generates the corresponding PrometheusRule resources with the appropriate metric names and evaluation rules.
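
For readers who think in types, the configuration shape can be sketched as follows. The field names mirror the YAML patch in the Testing section of this description; the interface names and the `recordingRules` helper are hypothetical and are not the types defined in the PR.

```typescript
// Hypothetical types modeled on the healthRules YAML in the Testing section.
type HealthRuleMode = 'alert' | 'recording';

interface HealthRuleVariant {
  groupBy?: string;
  thresholds: { info?: string; warning?: string; critical?: string };
}

interface HealthRule {
  template: string;
  mode: HealthRuleMode;
  variants: HealthRuleVariant[];
}

// Keeps only the rules that should be handled as recording rules rather than alerts.
function recordingRules(rules: HealthRule[]): HealthRule[] {
  return rules.filter(r => r.mode === 'recording');
}

// Same two rules as in the test patch.
const healthRules: HealthRule[] = [
  {
    template: 'DNSNxDomain',
    mode: 'alert',
    variants: [{ groupBy: 'Namespace', thresholds: { info: '10', warning: '50', critical: '80' } }],
  },
  {
    template: 'PacketDropsByKernel',
    mode: 'recording',
    variants: [{ thresholds: { info: '0.5', warning: '2', critical: '5' } }],
  },
];

console.log(recordingRules(healthRules).map(r => r.template)); // [ 'PacketDropsByKernel' ]
```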

Screenshot From 2026-01-05 12-22-51

Testing

To test this feature with both alerts and recording rules, apply the test configurations below in order.

# 1. Configure FlowCollector with alert + recording rule
kubectl patch flowcollector cluster --type=merge --patch '
spec:
  agent:
    ebpf:
      privileged: true
      features:
      - "PacketDrop"
      - "DNSTracking"
  processor:
    advanced:
      env:
        EXPERIMENTAL_ALERTS_HEALTH: "true"
    metrics:
      healthRules:
      - template: DNSNxDomain
        mode: alert
        variants:
        - groupBy: Namespace
          thresholds:
            info: "10"
            warning: "50"
            critical: "80"
      - template: PacketDropsByKernel
        mode: recording
        variants:
        - thresholds:
            info: "0.5"
            warning: "2"
            critical: "5"
'

# 2. Generate DNS errors (for alert)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: dns-test
---
apiVersion: v1
kind: Pod
metadata:
  name: dns-error-generator
  namespace: dns-test
spec:
  containers:
  - name: dns-client
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      echo "Starting DNS error generator..."
      while true; do
        for i in {1..20}; do
          nslookup "nonexistent-domain-\${RANDOM}.invalid" || true
          nslookup "fake-\${RANDOM}.test" || true
          nslookup "does-not-exist-\${RANDOM}.local" || true
        done
        echo "Generated 60 DNS NXDOMAIN errors"
        sleep 5
      done
  restartPolicy: Always
EOF

# 3. Generate packet drops (for recording rule)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: packet-drop-test
---
apiVersion: v1
kind: Service
metadata:
  name: udp-sink
  namespace: packet-drop-test
spec:
  selector:
    app: udp-sink
  ports:
  - port: 9999
    protocol: UDP
---
apiVersion: v1
kind: Pod
metadata:
  name: udp-sink
  namespace: packet-drop-test
  labels:
    app: udp-sink
spec:
  containers:
  - name: sink
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      while true; do
        nc -ul -p 9999 > /dev/null 2>&1
      done
    resources:
      limits:
        memory: "64Mi"
        cpu: "100m"
---
apiVersion: v1
kind: Pod
metadata:
  name: packet-drop-generator
  namespace: packet-drop-test
spec:
  containers:
  - name: flood-gen
    image: nicolaka/netshoot:latest
    command:
    - /bin/bash
    - -c
    - |
      sleep 10
      while true; do
        for i in {1..50}; do
          (
            for j in {1..5000}; do
              echo "DATA" | nc -u -w 0 udp-sink.packet-drop-test.svc.cluster.local 9999 2>/dev/null
            done
          ) &
        done
        wait
        echo "Sent 250k packets"
        sleep 10
      done
    resources:
      limits:
        memory: "256Mi"
        cpu: "1000m"
  restartPolicy: Always
EOF

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket, especially if they bring user-facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing (e.g. configuration changes, environment setup)?
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

openshift-ci-robot commented Dec 9, 2025

@leandroberetta: This pull request references NETOBSERV-2365 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot commented Dec 9, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Dec 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mffiedler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Dec 11, 2025

Codecov Report

❌ Patch coverage is 4.83871% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.68%. Comparing base (d5e51a4) to head (cbf034b).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| web/src/components/health/health-helper.ts | 4.83% | 59 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
- Coverage   52.95%   52.68%   -0.28%     
==========================================
  Files         209      209              
  Lines       10950    11010      +60     
  Branches     1391     1409      +18     
==========================================
+ Hits         5799     5801       +2     
- Misses       4602     4660      +58     
  Partials      549      549              

| Flag | Coverage Δ |
|---|---|
| uitests | 54.57% <4.83%> (-0.38%) ⬇️ |
| unittests | 47.27% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/config/config.go | 47.32% <ø> (ø) |
| web/src/model/config.ts | 100.00% <ø> (ø) |
| web/src/components/health/health-helper.ts | 21.14% <4.83%> (-6.40%) ⬇️ |

openshift-ci-robot commented Jan 5, 2026

@leandroberetta: This pull request references NETOBSERV-2365 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@leandroberetta leandroberetta marked this pull request as ready for review January 5, 2026 16:03
@leandroberetta leandroberetta requested a review from jotak January 5, 2026 16:28