Conversation

@JoukoVirtanen JoukoVirtanen commented Sep 5, 2025

Recently there were changes to the berserker load for the long running cluster. One of the changes was an increase in the number of berserker pods. Currently Prometheus scrapes metrics from all of the pods in the long running cluster, including berserker pods. With the increased number of berserker pods, Prometheus began to OOM. To fix this, Prometheus is switched to a different config that filters out some, but not all, of the metrics for berserker pods. The memory requests and limits are also increased.

The changes to the berserker load are already merged and were used for testing this PR: stackrox/stackrox#15886

Changes were made here after testing the above PR. A new tag, 0.0.51, was created based on a recent master and was used to test the changes here.
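The filtering approach described above can be sketched as a Prometheus metric-relabeling rule. This is a hypothetical fragment, not the PR's actual prometheus.yaml; the `app` label and the `process_.*` metric pattern are assumptions for illustration:

```yaml
# Hypothetical sketch: keep scraping berserker pods, but drop their
# high-cardinality metrics so Prometheus memory usage stays bounded.
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the pod's app label so it can be matched below (assumed label).
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    metric_relabel_configs:
      # Drop a subset of metrics from berserker pods (assumed pattern).
      - source_labels: [app, __name__]
        regex: "berserker;process_.*"
        action: drop
```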

@JoukoVirtanen JoukoVirtanen requested a review from a team as a code owner September 5, 2025 15:07
@JoukoVirtanen JoukoVirtanen marked this pull request as draft September 13, 2025 18:28
@JoukoVirtanen JoukoVirtanen changed the title Jv rox 28976 optimize berserker load in long running cluster ROX-28976: Optimize berserker load in long running cluster Sep 29, 2025
@JoukoVirtanen JoukoVirtanen marked this pull request as ready for review September 29, 2025 04:39
"name": "prometheus",
"resources": {
"requests": {
"memory": "2Gi",
Collaborator

How about we request 8Gi to prevent OOMKills caused by the node running out of memory if it is overcommitted?

If you change this here, please verify in another long-running cluster that Prometheus is starting (as I am unsure if there are 8Gi available on each node with ACS' requirements too).

Contributor Author

Done


# Replace the prometheus ConfigMap with one that doesn't scrape as much info from berserker containers
kubectl -n stackrox delete configmap prometheus
kubectl create -f "${SCRIPT_DIR}"/prometheus.yaml
Collaborator

I would prefer if we can override the offending values in the monitoring chart. Can you check if that is possible? The same goes for the update to the monitoring deployment.

Contributor Author

Do you mean you want me to make the changes in stackrox/stackrox? I feel like 8Gi is too much for every case in which the monitoring pod is used. I could make the changes to prometheus.yaml in stackrox/stackrox, but I didn't want to pollute it with references to berserker.

Contributor Author

I now use yq to set the memory limit and request to 8Gi.
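The yq override described here amounts to setting two nested keys in the monitoring chart's values file. A minimal stdlib-Python sketch of the same operation follows; the `.prometheus.resources` key path is an assumption about the chart's values layout, and the yq command in the comment is a reconstruction, not the PR's exact invocation:

```python
# Sketch of what the yq override does, roughly equivalent to:
#   yq -i '.prometheus.resources.requests.memory = "8Gi" |
#          .prometheus.resources.limits.memory   = "8Gi"' values.yaml
# (key path assumed for illustration)

def set_path(tree: dict, path: list, value) -> None:
    """Walk `path`, creating intermediate dicts as needed, and set the leaf."""
    node = tree
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value

values = {"prometheus": {"resources": {"requests": {"memory": "2Gi"}}}}
set_path(values, ["prometheus", "resources", "requests", "memory"], "8Gi")
set_path(values, ["prometheus", "resources", "limits", "memory"], "8Gi")
print(values["prometheus"]["resources"])
# → {'requests': {'memory': '8Gi'}, 'limits': {'memory': '8Gi'}}
```

Note that `setdefault` creates the `limits` mapping on the fly, which is why the request can be bumped and the limit introduced in one pass.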

Contributor

I would simply copy ../charts/monitoring/values.yaml and create a berserker-values.yaml to use directly later, but the approach with yq is okay for me.

Contributor Author

@tommartensen requested the use of yq in a private conversation, and it is okay with me too.

@JoukoVirtanen JoukoVirtanen force-pushed the jv-ROX-28976-optimize-berserker-load-in-long-running-cluster branch from 7d3f49d to 04a2cb7 Compare September 30, 2025 04:37
@JoukoVirtanen JoukoVirtanen requested review from tommartensen and vikin91 and removed request for vikin91 October 1, 2025 03:00
sed '/captureStart/d' "${KUBE_BURNER_METRICS_FILE}" > "$temp_metrics_file"
kubectl create configmap --from-file="$temp_metrics_file" kube-burner-metrics-config -n kube-burner

kubectl create configmap --from-file="$KUBE_BURNER_METRICS_FILE" kube-burner-metrics-config -n kube-burner
Contributor

Why don't we need that anymore?

Contributor Author

We didn't need it in the first place. It is redundant with the line above it.

@JoukoVirtanen JoukoVirtanen requested a review from vikin91 October 1, 2025 18:19
@tommartensen
Collaborator

I believe we're running in circles here and on Slack, so here is my (untested) counterproposal:

What I would like to avoid is duplicating much of the Prometheus config (it might change) and have changes to the configuration obscured. I believe that it is cleaner to use the Helm built-in capabilities for overrides than to update values files with yq or deleting and re-creating Configmaps outside of the Helm cycle.

@JoukoVirtanen
Contributor Author

JoukoVirtanen commented Oct 2, 2025

I believe we're running in circles here and on Slack, so here is my (untested) counterproposal:

What I would like to avoid is duplicating much of the Prometheus config (it might change) and have changes to the configuration obscured. I believe that it is cleaner to use the Helm built-in capabilities for overrides than to update values files with yq or deleting and re-creating Configmaps outside of the Helm cycle.

Thank you for putting together this counterproposal. I really like it when reviewers make the effort to create new branches and PRs. However, I think there might be a flaw in your proposal: it requires changes in both stackrox/stackrox and here. The problem with that is that long running clusters are sometimes created for the release branches. If we go with your approach, we would also need backports to the release branches in stackrox/stackrox; otherwise the code here would be incompatible with the Helm charts for the monitoring deployment in stackrox/stackrox. Perhaps we can merge this as is and create a ticket to go with your approach. I will test your approach and let you know how it goes.

@tommartensen
Collaborator

tommartensen commented Oct 2, 2025

Your proposal requires changes in both stackrox/stackrox and here.

I think that is a fundamental flaw in how this berserker thing is set up and should be addressed separately.

The problem with that is sometimes long running clusters are created for the release branches.

That shouldn't happen. Can you point me to instances where this led to a problem?

Perhaps we can merge this as is and create a ticket to go with your approach.

I will be out of office in the next days. Please create a ticket to follow up on this PR with the values improvements in ROX-10657 and work with someone in @stackrox/release-improvers to get this PR merged.

@JoukoVirtanen
Contributor Author

JoukoVirtanen commented Oct 2, 2025

Your proposal requires changes in both stackrox/stackrox and here.

I think that is a fundamental flaw in how this berserker thing is set up and should be addressed separately.

I agree. I am not sure what the solution is, though I have a mini design proposal with some ideas on preventing breaking changes to the long running cluster: https://docs.google.com/document/d/1Vfq1piBebKwAE9EvjNdb7P_TTO1kAgZvCzcPTqx-Nvg/edit?usp=sharing

It hasn't gotten much attention and hasn't gone anywhere.

The problem with that is sometimes long running clusters are created for the release branches.

That shouldn't happen. Can you point me to instances where this led to a problem?

Sometimes long running clusters are needed for patch releases. In one case there was a release that had a memory leak, which was fixed in a patch. The long running cluster was helpful with that.

One case in which problems occurred for patch releases was when metrics were sent to OpenSearch. There were changes in stackrox/stackrox that had to be made to get that to work. Initially, changes were made in stackrox/actions to detect which version of stackrox/stackrox was in use and run different scripts based on that: an old script was used for old release branches and a new script for the current release. The changes to stackrox/stackrox were backported to older release branches. In the time between when the backports were merged and when the script-selection logic in stackrox/actions was updated, a long running cluster for an old release branch was created. That long running cluster failed. stackrox/actions had to be updated, and then another long running cluster was successfully created for the patch release.

Perhaps we can merge this as is and create a ticket to go with your approach.

I will be out of office in the next days. Please create a ticket to follow up on this PR with the values improvements in ROX-10657 and work with someone in @stackrox/release-improvers to get this PR merged.

I have created the following ticket https://issues.redhat.com/browse/ROX-31149

@JoukoVirtanen
Contributor Author

@tommartensen I have tested your PRs. They didn't work. I think your idea is sound; there might just be some small mistake somewhere. I ran the following test.

I created a branch from stackrox/stackrox#17105 and tagged it 0.0.52. I then created a new branch in stackrox/test-gh-actions. In that branch I changed the refs in .github/workflows/create-clusters.yml to tm/override-helm-values-berserker. I then created a new long running cluster at https://github.com/stackrox/test-gh-actions/actions/workflows/create-clusters.yml with the branch set to the one I created in stackrox/test-gh-actions and the version set to 0.0.52. When I checked the long running cluster, the monitoring pod was in a crash loop. I checked the logs and found the following:

$ ks logs monitoring-5b657c9fc5-6kstf -c prometheus
ts=2025-10-03T03:30:11.087Z caller=main.go:438 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="parsing YAML file /etc/prometheus/prometheus.yml: yaml: unmarshal errors:\n  line 14: cannot unmarshal !!str `- job_n...` into []*config.ScrapeConfig"

The Prometheus ConfigMap had the following:

$ ks get configmap prometheus -o yaml
apiVersion: v1
data:
  prometheus.yml: "global:\n  scrape_interval: 30s\n\nalerting:\n  alertmanagers:\n
    \   - static_configs:\n        - targets:\n            - stackrox-monitoring-alertmanager:9093\n\nrule_files:\n
    \ - /etc/prometheus/rules_*.yml\n\nscrape_configs:\n  |-\n    - job_name: \"kubernetes-pods\"\n
    \     tls_config:\n        insecure_skip_verify: false\n      kubernetes_sd_configs:\n
    \       - role: pod\n          namespaces:\n            own_namespace: true\n
    \     relabel_configs:\n        - action: labelmap\n          regex: __meta_kubernetes_pod_label_(.+)\n
    \       - source_labels: [__meta_kubernetes_namespace]\n          action: replace\n
    \         target_label: namespace\n        - source_labels: [__meta_kubernetes_pod_name]\n
    \         action: replace\n          target_label: pod\n        - source_labels:
    [__meta_kubernetes_pod_node_name]\n          action: replace\n          target_label:
    node_name\n  \n    - job_name: \"kubernetes-cadvisor\"\n      scheme: https\n
    \     metrics_path: /metrics/cadvisor\n      tls_config:\n        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n
    \       insecure_skip_verify: true\n      authorization:\n        credentials_file:
    /var/run/secrets/kubernetes.io/serviceaccount/token\n      kubernetes_sd_configs:\n
    \       - role: node\n      relabel_configs:\n        - action: labelmap\n          regex:
    __meta_kubernetes_node_label_(.+)\n  \n    - job_name: stackrox\n      tls_config:\n
    \       insecure_skip_verify: false\n      kubernetes_sd_configs:\n        - role:
    endpoints\n          namespaces:\n            own_namespace: true\n      relabel_configs:\n
    \       - source_labels: [__meta_kubernetes_endpoint_port_name]\n          action:
    keep\n          regex: monitoring\n        - source_labels: [__meta_kubernetes_endpoints_name]\n
    \         action: replace\n          target_label: job\n        - source_labels:
    [__meta_kubernetes_namespace]\n          action: replace\n          target_label:
    namespace\n        - source_labels: [__meta_kubernetes_endpoint_node_name]\n          action:
    replace\n          target_label: node_name"
  rules_custom.yml: |-
    |-
      groups:
      - name: Kubernetes
        rules:
          - alert: KubernetesContainerOomKiller
            expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes container OOM killer (pod {{ "{{" }} .exported_pod {{ "}}" }})
              description: "Container {{ "{{" }} .container {{ "}}" }} in pod {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .exported_pod {{ "}}" }} has been OOMKilled {{ "{{" }}  {{ "}}" }} times in the last 10 minutes."

          - alert: KubernetesPodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod crash looping (pod {{ "{{" }} .exported_pod {{ "}}" }})
              description: "Pod {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .exported_pod {{ "}}" }} is crash looping."

          - alert: KubernetesReplicaSetMismatch
            expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes ReplicaSet mismatch (replicaset {{ "{{" }} .replicaset {{ "}}" }})
              description: "Replicas mismatch in ReplicaSet {{ "{{" }} .exported_namespace {{ "}}" }}/{{ "{{" }} .replicaset {{ "}}" }}"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: stackrox-monitoring
    meta.helm.sh/release-namespace: stackrox
  creationTimestamp: "2025-10-03T00:00:19Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
  name: prometheus
  namespace: stackrox
  resourceVersion: "1759449619961263014"
  uid: e2acf091-fe60-4ba3-9c91-b49dc5096827
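The unmarshal error above is consistent with `scrape_configs` having been rendered as a block-scalar string rather than a YAML list: a stray `|-` indicator leaked into the template output, so the parser produced `!!str` where Prometheus expects `[]*config.ScrapeConfig`. This reading of the dump is an inference about the templating, not a confirmed diagnosis. Simplified, the broken vs. expected shapes are:

```yaml
# Broken shape: the stray `|-` block-scalar indicator turns everything
# beneath it into a single string, so Prometheus sees !!str instead of
# a list of scrape configs.
scrape_configs:
  |-
    - job_name: "kubernetes-pods"
---
# Expected shape: scrape_configs is a YAML sequence of mappings.
scrape_configs:
  - job_name: "kubernetes-pods"
```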

Contributor

@vikin91 vikin91 left a comment


The changes look good!

I tried testing this branch with stackrox/stackrox#17146, but in the action execution I saw that the code from the actions repository is hardcoded to main:

Download action repository 'stackrox/actions@main' (SHA:cc8a9579ba6de41d3b77b32403bf0f5e8809649d)

See: https://github.com/stackrox/stackrox/actions/runs/18283349491/job/52051757197

I will then give you a token of trust that this has been tested and works correctly.

@JoukoVirtanen
Contributor Author

The changes look good!

I tried testing this branch with stackrox/stackrox#17146, but in the action execution I saw that the code from the actions repository is hardcoded to main:

Download action repository 'stackrox/actions@main' (SHA:cc8a9579ba6de41d3b77b32403bf0f5e8809649d)

See: https://github.com/stackrox/stackrox/actions/runs/18283349491/job/52051757197

I will then give you a token of trust that this has been tested and works correctly.

To test this, I created a branch in stackrox/test-gh-actions. In that branch I changed the occurrences of v1 in .github/workflows/create-clusters.yml to this branch, so that the actions from this branch are used.

You need to change

uses: stackrox/actions/.github/workflows/create-demo-clusters.yml@v1

and

workflow-ref: v1
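Concretely, both references in .github/workflows/create-clusters.yml would be pointed at the branch under test instead of the v1 tag. The fragment below is hypothetical: the job name, surrounding structure, and `my-test-branch` name are assumptions for illustration; only the two changed values come from the comment above:

```yaml
jobs:
  create-demo-clusters:
    # was: stackrox/actions/.github/workflows/create-demo-clusters.yml@v1
    uses: stackrox/actions/.github/workflows/create-demo-clusters.yml@my-test-branch
    with:
      # was: v1
      workflow-ref: my-test-branch
```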

@JoukoVirtanen JoukoVirtanen dismissed tommartensen’s stale review October 6, 2025 22:51

He is on vacation and this should go in before the long running cluster is created for 4.9.

@JoukoVirtanen JoukoVirtanen merged commit 09b66ab into main Oct 6, 2025
3 checks passed
@JoukoVirtanen JoukoVirtanen deleted the jv-ROX-28976-optimize-berserker-load-in-long-running-cluster branch October 6, 2025 22:51