ROX-28976: Optimize berserker load in long running cluster #64
Conversation
| "name": "prometheus", | ||
| "resources": { | ||
| "requests": { | ||
| "memory": "2Gi", |
How about requesting 8Gi to prevent OOMKills when the node runs out of memory because it is overcommitted?
If you change this here, please verify in another long-running cluster that Prometheus starts, as I am unsure whether 8Gi is available on each node given ACS' requirements.
Done
# Replace the prometheus ConfigMap with one that doesn't scrape as much info from berserker containers
kubectl -n stackrox delete configmap prometheus
kubectl create -f "${SCRIPT_DIR}"/prometheus.yaml
I would prefer it if we could override the offending values in the monitoring chart. Can you check whether that is possible? The same goes for the update to the monitoring deployment.
Do you mean you want me to make the changes in stackrox/stackrox? I feel like 8Gi is too much for every case in which the monitoring pod is used. I could make the changes to prometheus.yaml in stackrox/stackrox, but I didn't want to pollute it with references to berserker.
I now use yq to set memory limit and request to 8Gi.
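For reference, a minimal sketch of what such a yq invocation could look like, assuming yq v4 and that the monitoring chart exposes the Prometheus resources under .prometheus.resources in charts/monitoring/values.yaml (both the file path and the key path are assumptions, not taken from this PR):

# Assumed values file and key structure; adjust to the actual monitoring chart layout.
VALUES_FILE="charts/monitoring/values.yaml"

# Set both the memory request and the limit to 8Gi in place (yq v4 syntax).
yq -i '.prometheus.resources.requests.memory = "8Gi" |
       .prometheus.resources.limits.memory = "8Gi"' "$VALUES_FILE"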
I would simply copy ../charts/monitoring/values.yaml into a berserker-values.yaml and later use it directly, but the yq approach is okay for me.
@tommartensen requested the use of yq in a private conversation, and it is okay with me too.
+ sed '/captureStart/d' "${KUBE_BURNER_METRICS_FILE}" > "$temp_metrics_file"
+ kubectl create configmap --from-file="$temp_metrics_file" kube-burner-metrics-config -n kube-burner
- kubectl create configmap --from-file="$KUBE_BURNER_METRICS_FILE" kube-burner-metrics-config -n kube-burner
Why don't we need that anymore?
We didn't need it in the first place. It is redundant with the line above it.
I believe we're running in circles here and on Slack, so here is my (untested) counterproposal:
What I would like to avoid is duplicating much of the Prometheus config (it might change) and having changes to the configuration obscured. I believe it is cleaner to use Helm's built-in capabilities for overrides than to update values files with yq or to delete and re-create ConfigMaps outside of the Helm cycle.
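For illustration, the Helm-native override approach could look roughly like this; a minimal sketch in which the release name, chart path, and values-file name are assumptions, not taken from this PR:

# berserker-values.yaml contains only the values that differ from the chart defaults,
# e.g. larger Prometheus memory requests/limits and a reduced scrape config.
helm upgrade --install monitoring ./charts/monitoring \
  -n stackrox \
  -f berserker-values.yaml

Because Helm merges the extra values file over the chart defaults, only the overridden keys need to be maintained, and the rest of the Prometheus config stays in one place.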
Thank you for putting together this counterproposal. I really like it when reviewers make the effort to create new branches and PRs. However, I think there might be a flaw in your proposal: it requires changes in both repositories.
I think that is a fundamental flaw in how this berserker thing is set up, and it should be addressed separately.
That shouldn't happen; can you point me to instances where this led to a problem?
I will be out of office in the next days. Please create a ticket to follow up on this PR with the values improvements in ROX-10657 and work with someone in @stackrox/release-improvers to get this PR merged.
I agree. I am not sure what the solution is. I do have a mini design proposal with some ideas on preventing breaking changes to the long-running cluster (https://docs.google.com/document/d/1Vfq1piBebKwAE9EvjNdb7P_TTO1kAgZvCzcPTqx-Nvg/edit?usp=sharing), but it hasn't gotten much attention and hasn't gone anywhere.
Sometimes long-running clusters are needed for patch releases. In one case a release had a memory leak that was fixed in a patch, and the long-running cluster was helpful there. One case in which problems occurred for patch releases was when metrics were sent to OpenSearch: changes had to be made in stackrox/stackrox to get that to work. Initially, changes were made in stackrox/actions to detect which version of stackrox/stackrox was being used and to select different scripts based on that; an old script was used for old release branches and a new script for the current release. The changes to stackrox/stackrox were then backported to older release branches. In the time between when the backports were merged and when the script-selection logic in stackrox/actions was updated, a long-running cluster was created for an old release branch, and it failed. stackrox/actions had to be updated, and then another long-running cluster was successfully created for the patch release.
I have created the following ticket: https://issues.redhat.com/browse/ROX-31149
@tommartensen I have tested your PRs, and it didn't work. I think your idea is sound; there might just be some small mistake somewhere. I ran the following test: I created a branch from stackrox/stackrox#17105 and tagged it 0.0.52. I then created a new branch in stackrox/test-gh-actions. In that branch I changed the refs in […]. The Prometheus ConfigMap had the following […]
vikin91 left a comment:
The changes look good!
I tried testing this branch with stackrox/stackrox#17146 but in the action execution, I saw that the code from the actions repository is hardcoded to main:
Download action repository 'stackrox/actions@main' (SHA:cc8a9579ba6de41d3b77b32403bf0f5e8809649d)
See: https://github.com/stackrox/stackrox/actions/runs/18283349491/job/52051757197
I will therefore extend you a token of trust that this has been tested and works correctly.
To test this I created a branch in […]. You need to change […] and […].
He is on vacation and this should go in before the long running cluster is created for 4.9.
Recently there were changes to the berserker load for the long-running cluster. One of the changes was an increase in the number of berserker pods. Currently, Prometheus scrapes metrics for all of the pods in the long-running cluster, including the berserker pods. With the increased number of berserker pods, Prometheus began to OOM. To fix this, Prometheus is switched to a different config that filters out some, but not all, of the metrics for berserker pods. The memory requests and limits are also increased.
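As an illustration of the kind of filtering described above, here is a minimal sketch of a metric_relabel_configs fragment; the label name, the berserker pod-name prefix, and the metric prefixes being dropped are assumptions for illustration, not the actual prometheus.yaml from this PR:

# Sketch only: write a fragment that drops the most expensive per-container series
# for berserker pods while keeping their basic health metrics. Adjust label names
# and regexes to match the real scrape config.
cat <<'EOF' > berserker-metric-relabel-fragment.yaml
metric_relabel_configs:
  - source_labels: [pod, __name__]
    regex: berserker-.*;container_(network|fs|memory)_.*
    action: drop
EOF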
The changes to the berserker load are already merged and were used for testing this PR: stackrox/stackrox#15886
Changes were made here after testing the above PR. A new tag, 0.0.51, was created based on a recent master and was used to test the changes here.