Conversation

@JoukoVirtanen JoukoVirtanen commented Sep 5, 2025

Recently there were changes to the berserker load for the long running cluster. One of the changes was an increase in the number of berserker pods. Currently Prometheus scrapes metrics from all of the pods in the long running cluster, including berserker pods. With the increased number of berserker pods, Prometheus began to OOM. To fix this, Prometheus is switched to a different config that filters out some, but not all, of the metrics for berserker pods. The memory requests and limits are also increased.

The changes to the berserker load are already merged and were used for testing this PR: stackrox/stackrox#15886

Changes were made here after testing the above PR. A new tag, 0.0.51, was created based on a recent master and was used to test the changes here.
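The filtering approach described above can be sketched as a Prometheus metric-relabeling rule. This is a hypothetical fragment, not the PR's actual prometheus.yaml; the `app` label and the `process_.*` metric pattern are assumptions for illustration:

```yaml
# Hypothetical sketch: keep scraping berserker pods, but drop their
# high-cardinality metrics so Prometheus memory usage stays bounded.
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the pod's app label so it can be matched below (assumed label).
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    metric_relabel_configs:
      # Drop a subset of metrics from berserker pods (assumed pattern).
      - source_labels: [app, __name__]
        regex: "berserker;process_.*"
        action: drop
```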

@JoukoVirtanen JoukoVirtanen requested a review from a team as a code owner September 5, 2025 15:07
@JoukoVirtanen JoukoVirtanen marked this pull request as draft September 13, 2025 18:28
@JoukoVirtanen JoukoVirtanen changed the title Jv rox 28976 optimize berserker load in long running cluster ROX-28976: Optimize berserker load in long running cluster Sep 29, 2025
@JoukoVirtanen JoukoVirtanen marked this pull request as ready for review September 29, 2025 04:39
"name": "prometheus",
"resources": {
"requests": {
"memory": "2Gi",
Collaborator

How about we request 8Gi to prevent OOMKills caused by the node running out of memory if it is overcommitted?

If you change this here, please verify in another long-running cluster that Prometheus is starting (as I am unsure if there are 8Gi available on each node with ACS' requirements too).

Contributor Author

Done


# Replace the prometheus ConfigMap with one that doesn't scrape as much info from berserker containers
kubectl -n stackrox delete configmap prometheus
kubectl create -f "${SCRIPT_DIR}"/prometheus.yaml
Collaborator

I would prefer if we can override the offending values in the monitoring chart. Can you check if that is possible? The same goes for the update to the monitoring deployment.

Contributor Author

Do you mean you want me to make the changes in stackrox/stackrox? I feel like 8Gi is too much for every case in which the monitoring pod is used. I could make the changes to prometheus.yaml in stackrox/stackrox, but I didn't want to pollute it with references to berserker.

Contributor Author

I now use yq to set the memory limit and request to 8Gi.
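The yq override described here amounts to setting two nested keys in the monitoring chart's values file. A minimal stdlib-Python sketch of the same operation follows; the `.prometheus.resources` key path is an assumption about the chart's values layout, and the yq command in the comment is a reconstruction, not the PR's exact invocation:

```python
# Sketch of what the yq override does, roughly equivalent to:
#   yq -i '.prometheus.resources.requests.memory = "8Gi" |
#          .prometheus.resources.limits.memory   = "8Gi"' values.yaml
# (key path assumed for illustration)

def set_path(tree: dict, path: list, value) -> None:
    """Walk `path`, creating intermediate dicts as needed, and set the leaf."""
    node = tree
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value

values = {"prometheus": {"resources": {"requests": {"memory": "2Gi"}}}}
set_path(values, ["prometheus", "resources", "requests", "memory"], "8Gi")
set_path(values, ["prometheus", "resources", "limits", "memory"], "8Gi")
print(values["prometheus"]["resources"])
# → {'requests': {'memory': '8Gi'}, 'limits': {'memory': '8Gi'}}
```

Note that `setdefault` creates the `limits` mapping on the fly, which is why the request can be bumped and the limit introduced in one pass.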

Contributor

I would simply copy ../charts/monitoring/values.yaml and create a berserker-values.yaml to use directly later, but the approach with yq is okay for me.

Contributor Author

@tommartensen requested the use of yq in a private conversation, and it is okay with me too.

@JoukoVirtanen JoukoVirtanen force-pushed the jv-ROX-28976-optimize-berserker-load-in-long-running-cluster branch from 7d3f49d to 04a2cb7 Compare September 30, 2025 04:37
@JoukoVirtanen JoukoVirtanen requested review from tommartensen and vikin91 and removed request for vikin91 October 1, 2025 03:00
sed '/captureStart/d' "${KUBE_BURNER_METRICS_FILE}" > "$temp_metrics_file"
kubectl create configmap --from-file="$temp_metrics_file" kube-burner-metrics-config -n kube-burner

kubectl create configmap --from-file="$KUBE_BURNER_METRICS_FILE" kube-burner-metrics-config -n kube-burner
Contributor

Why don't we need that anymore?

Contributor Author

We didn't need it in the first place. It is redundant with the line above it.

@JoukoVirtanen JoukoVirtanen requested a review from vikin91 October 1, 2025 18:19
@tommartensen
Collaborator

I believe we're running in circles here and on Slack, so here is my (untested) counterproposal:

What I would like to avoid is duplicating much of the Prometheus config (it might change) and have changes to the configuration obscured. I believe that it is cleaner to use the Helm built-in capabilities for overrides than to update values files with yq or deleting and re-creating Configmaps outside of the Helm cycle.

@JoukoVirtanen
Contributor Author

JoukoVirtanen commented Oct 2, 2025

I believe we're running in circles here and on Slack, so here is my (untested) counterproposal:

What I would like to avoid is duplicating much of the Prometheus config (it might change) and have changes to the configuration obscured. I believe that it is cleaner to use the Helm built-in capabilities for overrides than to update values files with yq or deleting and re-creating Configmaps outside of the Helm cycle.

Thank you for putting together this counterproposal. I really like it when reviewers make the effort to create new branches and PRs. However, I think there might be a flaw in your proposal: it requires changes in both stackrox/stackrox and here. The problem with that is that long running clusters are sometimes created for the release branches. If we go with your approach, we would also need backports to the release branches in stackrox/stackrox; otherwise the code here would be incompatible with the Helm charts for the monitoring deployment in stackrox/stackrox. Perhaps we can merge this as is and create a ticket to go with your approach. I will test your approach and let you know how it goes.

@tommartensen
Collaborator

tommartensen commented Oct 2, 2025

Your proposal requires changes in both stackrox/stackrox and here.

I think that is a fundamental flaw in how this berserker thing is set up and should be addressed separately.

The problem with that is sometimes long running clusters are created for the release branches.

That shouldn't happen. Can you point me to instances where this led to a problem?

Perhaps we can merge this as is and create a ticket to go with your approach.

I will be out of office in the next days. Please create a ticket to follow up on this PR with the values improvements in ROX-10657 and work with someone in @stackrox/release-improvers to get this PR merged.

@JoukoVirtanen
Contributor Author

JoukoVirtanen commented Oct 2, 2025

Your proposal requires changes in both stackrox/stackrox and here.

I think that is a fundamental flaw in how this berserker thing is set up and should be addressed separately.

I agree. I am not sure what the solution is, though I have a mini design proposal with some ideas on preventing breaking changes to the long running cluster: https://docs.google.com/document/d/1Vfq1piBebKwAE9EvjNdb7P_TTO1kAgZvCzcPTqx-Nvg/edit?usp=sharing

It hasn't gotten much attention and hasn't gone anywhere.

The problem with that is sometimes long running clusters are created for the release branches.

That shouldn't happen. Can you point me to instances where this led to a problem?

Sometimes long running clusters are needed for patch releases. In one case there was a release that had a memory leak, which was fixed in a patch. The long running cluster was helpful with that.

One case in which problems occurred for patch releases was when metrics were sent to OpenSearch. There were changes in stackrox/stackrox that had to be made to get that to work. Initially, changes were made in stackrox/actions to detect which version of stackrox/stackrox was in use and run different scripts based on that: an old script was used for old release branches and a new script for the current release. The changes to stackrox/stackrox were backported to older release branches. In the time between when the backports were merged and when the script-selection logic in stackrox/actions was updated, a long running cluster for an old release branch was created. That long running cluster failed. stackrox/actions had to be updated, and then another long running cluster was successfully created for the patch release.

Perhaps we can merge this as is and create a ticket to go with your approach.

I will be out of office in the next days. Please create a ticket to follow up on this PR with the values improvements in ROX-10657 and work with someone in @stackrox/release-improvers to get this PR merged.

I have created the following ticket https://issues.redhat.com/browse/ROX-31149

@JoukoVirtanen
Contributor Author

@tommartensen I have tested your PRs. They didn't work. I think your idea is sound; there might just be some small mistake somewhere. I ran the following test.

I created a branch from stackrox/stackrox#17105 and tagged it 0.0.52. I then created a new branch in stackrox/test-gh-actions. In that branch I changed the refs in .github/workflows/create-clusters.yml to tm/override-helm-values-berserker. I then created a new long running cluster at https://github.com/stackrox/test-gh-actions/actions/workflows/create-clusters.yml with the branch set to the one I created in stackrox/test-gh-actions and the version set to 0.0.52. When I checked the long running cluster, the monitoring pod was in a crash loop. I checked the logs and found the following:

$ ks logs monitoring-5b657c9fc5-6kstf -c prometheus
ts=2025-10-03T03:30:11.087Z caller=main.go:438 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="parsing YAML file /etc/prometheus/prometheus.yml: yaml: unmarshal errors:\n  line 14: cannot unmarshal !!str `- job_n...` into []*config.ScrapeConfig"

The Prometheus ConfigMap had the following:

$ ks get configmap prometheus -o yaml
apiVersion: v1
data:
  prometheus.yml: "global:\n  scrape_interval: 30s\n\nalerting:\n  alertmanagers:\n
    \   - static_configs:\n        - targets:\n            - stackrox-monitoring-alertmanager:9093\n\nrule_files:\n
    \ - /etc/prometheus/rules_*.yml\n\nscrape_configs:\n  |-\n    - job_name: \"kubernetes-pods\"\n
    \     tls_config:\n        insecure_skip_verify: false\n      kubernetes_sd_configs:\n
    \       - role: pod\n          namespaces:\n            own_namespace: true\n
    \     relabel_configs:\n        - action: labelmap\n          regex: __meta_kubernetes_pod_label_(.+)\n
    \       - source_labels: [__meta_kubernetes_namespace]\n          action: replace\n
    \         target_label: namespace\n        - source_labels: [__meta_kubernetes_pod_name]\n
    \         action: replace\n          target_label: pod\n        - source_labels:
    [__meta_kubernetes_pod_node_name]\n          action: replace\n          target_label:
    node_name\n  \n    - job_name: \"kubernetes-cadvisor\"\n      scheme: https\n
    \     metrics_path: /metrics/cadvisor\n      tls_config:\n        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n
    \       insecure_skip_verify: true\n      authorization:\n        credentials_file:
    /var/run/secrets/kubernetes.io/serviceaccount/token\n      kubernetes_sd_configs:\n
    \       - role: node\n      relabel_configs:\n        - action: labelmap\n          regex:
    __meta_kubernetes_node_label_(.+)\n  \n    - job_name: stackrox\n      tls_config:\n
    \       insecure_skip_verify: false\n      kubernetes_sd_configs:\n        - role:
    endpoints\n          namespaces:\n            own_namespace: true\n      relabel_configs:\n
    \       - source_labels: [__meta_kubernetes_endpoint_port_name]\n          action:
    keep\n          regex: monitoring\n        - source_labels: [__meta_kubernetes_endpoints_name]\n
    \         action: replace\n          target_label: job\n        - source_labels:
    [__meta_kubernetes_namespace]\n          action: replace\n          target_label:
    namespace\n        - source_labels: [__meta_kubernetes_endpoint_node_name]\n          action:
    replace\n          target_label: node_name"
  rules_custom.yml: |-
    |-
      groups:
      - name: Kubernetes
        rules:
          - alert: KubernetesContainerOomKiller
            expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes container OOM killer (pod {{ "{{" }} .exported_pod {{ "}}" }})
              description: "Container {{ "{{" }} .container {{ "}}" }} in pod {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .exported_pod {{ "}}" }} has been OOMKilled {{ "{{" }}  {{ "}}" }} times in the last 10 minutes."

          - alert: KubernetesPodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod crash looping (pod {{ "{{" }} .exported_pod {{ "}}" }})
              description: "Pod {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .exported_pod {{ "}}" }} is crash looping."

          - alert: KubernetesReplicaSetMismatch
            expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes ReplicaSet mismatch (replicaset {{ "{{" }} .replicaset {{ "}}" }})
              description: "Replicas mismatch in ReplicaSet {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .replicaset {{ "}}" }}"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: stackrox-monitoring
    meta.helm.sh/release-namespace: stackrox
  creationTimestamp: "2025-10-03T00:00:19Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
  name: prometheus
  namespace: stackrox
  resourceVersion: "1759449619961263014"
  uid: e2acf091-fe60-4ba3-9c91-b49dc5096827
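The unmarshal error above is consistent with `scrape_configs` having been rendered as a block-scalar string rather than a YAML list: a stray `|-` indicator leaked into the template output, so the parser produced `!!str` where Prometheus expects `[]*config.ScrapeConfig`. This reading of the dump is an inference about the templating, not a confirmed diagnosis. Simplified, the broken vs. expected shapes are:

```yaml
# Broken shape: the stray `|-` block-scalar indicator turns everything
# beneath it into a single string, so Prometheus sees !!str instead of
# a list of scrape configs.
scrape_configs:
  |-
    - job_name: "kubernetes-pods"
---
# Expected shape: scrape_configs is a YAML sequence of mappings.
scrape_configs:
  - job_name: "kubernetes-pods"
```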

Contributor

@vikin91 vikin91 left a comment


The changes look good!

I tried testing this branch with stackrox/stackrox#17146, but in the action execution I saw that the code from the actions repository is hardcoded to main:

Download action repository 'stackrox/actions@main' (SHA:cc8a9579ba6de41d3b77b32403bf0f5e8809649d)

See: https://github.com/stackrox/stackrox/actions/runs/18283349491/job/52051757197

I will then give you a token of trust that this has been tested and works correctly.

@JoukoVirtanen
Contributor Author

The changes look good!

I tried testing this branch with stackrox/stackrox#17146, but in the action execution I saw that the code from the actions repository is hardcoded to main:

Download action repository 'stackrox/actions@main' (SHA:cc8a9579ba6de41d3b77b32403bf0f5e8809649d)

See: https://github.com/stackrox/stackrox/actions/runs/18283349491/job/52051757197

I will then give you a token of trust that this has been tested and works correctly.

To test this, I created a branch in stackrox/test-gh-actions. In that branch I changed the occurrences of v1 in .github/workflows/create-clusters.yml to this branch, so that the actions from this branch are used.

You need to change

uses: stackrox/actions/.github/workflows/create-demo-clusters.yml@v1

and

workflow-ref: v1
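Concretely, both references in .github/workflows/create-clusters.yml would be pointed at the branch under test instead of the v1 tag. The fragment below is hypothetical: the job name, surrounding structure, and `my-test-branch` name are assumptions for illustration; only the two changed values come from the comment above:

```yaml
jobs:
  create-demo-clusters:
    # was: stackrox/actions/.github/workflows/create-demo-clusters.yml@v1
    uses: stackrox/actions/.github/workflows/create-demo-clusters.yml@my-test-branch
    with:
      # was: v1
      workflow-ref: my-test-branch
```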

@JoukoVirtanen JoukoVirtanen dismissed tommartensen’s stale review October 6, 2025 22:51

He is on vacation and this should go in before the long running cluster is created for 4.9.

@JoukoVirtanen JoukoVirtanen merged commit 09b66ab into main Oct 6, 2025
3 checks passed
@JoukoVirtanen JoukoVirtanen deleted the jv-ROX-28976-optimize-berserker-load-in-long-running-cluster branch October 6, 2025 22:51