Add resource limits to all overlays with CI enforcement and documentation #12
Open
tomncooper wants to merge 5 commits into main
Set CPU and memory requests equal to limits for every operator deployment and custom resource across the core and metrics overlays.

* Patch Strimzi, Apicurio Registry, and Console operator deployments via kustomize strategic-merge patches
* Set resource specs on Kafka node pools and entity operator (topic + user operator) in the Kafka CR
* Set resource specs on Apicurio Registry app and UI containers
* Set resource specs on Console API and UI containers
* Set resource specs on Prometheus Operator deployment and Prometheus CR

Signed-off-by: Thomas Cooper <code@tomcooper.dev>

Add CI scripts that verify every container in an overlay has resource
requests and limits, and that overlay documentation pages declare
accurate resource totals. Add documentation for the core overlay and
a guide for overlay contributors.

* Add VerifyResourceLimits script to check all containers and CR
  resource fields have requests and limits set
* Add VerifyDocumentedResources script to check documented cpu_total
  and memory_total match kustomize build output
* Add CrdSchemaUtils shared utility for CRD schema introspection
* Add unit tests for both verification scripts
* Move existing script tests into tests subdirectory
* Add script-tests.yaml workflow to run script unit tests in CI
* Add docs/overlays/core.md with install instructions, components
  table, and resource requirements
* Add docs/overlays/developing.md guide covering resource limit and
  documentation requirements for overlay contributors
* Add resource requirements frontmatter and section to metrics overlay
  docs
* Refactor validate.yaml to discover overlays dynamically instead of
  a hardcoded list
* Update README with new script descriptions and test instructions

Signed-off-by: Thomas Cooper <code@tomcooper.dev>
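
For readers who want a feel for what the container check above involves, here is a minimal Python sketch. It is not the VerifyResourceLimits script from this PR: the function name `find_missing_resources`, the sample manifest, and the choice to inspect only Deployment-style pod templates are all assumptions made for illustration, and it does not cover custom resource fields.

```python
from __future__ import annotations

from typing import Any

# Resource names every container is expected to declare (assumption for this sketch).
REQUIRED = ("cpu", "memory")


def find_missing_resources(manifest: dict[str, Any]) -> list[str]:
    """Return one message per missing requests/limits entry in a Deployment-like manifest."""
    problems: list[str] = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for name in REQUIRED:
                if name not in values:
                    label = container.get("name", "<unnamed>")
                    problems.append(f"{label}: missing {section}.{name}")
    return problems


# Hypothetical manifest: the container declares everything except a CPU limit.
example = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{
        "name": "operator",
        "resources": {
            "requests": {"cpu": "100m", "memory": "256Mi"},
            "limits": {"memory": "256Mi"},
        },
    }]}}},
}

print(find_missing_resources(example))
```

A check in this style returns an empty list for a compliant manifest, so a CI step can fail whenever the list is non-empty.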
* Update overlay developer docs with more details
* Add helper script to show the resource limits set in a given overlay
* Set the docs preview script to use the same hugo-book version as the StreamsHub site

Signed-off-by: Thomas Cooper <code@tomcooper.dev>
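
Totalling the resources in an overlay mostly comes down to parsing Kubernetes quantity strings and summing them. The sketch below shows one way to do that in Python. It is not the helper script added in this commit: the function names and sample values are hypothetical, and it handles only the common suffixes (`m` for CPU; `Ki`, `Mi`, `Gi`, `Ti`, `k`, `M`, `G`, `T` for memory).

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '250m', '1' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain number of bytes


def total_requests(containers: list) -> tuple:
    """Sum CPU (millicores) and memory (bytes) requests over container resource dicts."""
    cpu = sum(cpu_millis(c["requests"]["cpu"]) for c in containers)
    memory = sum(memory_bytes(c["requests"]["memory"]) for c in containers)
    return cpu, memory


# Hypothetical containers, not taken from any overlay in this repository.
sample = [
    {"requests": {"cpu": "250m", "memory": "512Mi"}},
    {"requests": {"cpu": "1", "memory": "1Gi"}},
]
print(total_requests(sample))
```

Totals computed this way could then be compared against the documented `cpu_total` and `memory_total` values after rounding to whatever unit the docs use.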
Signed-off-by: Thomas Cooper <code@tomcooper.dev>
04c6c17 to d3b553e
The requirement for requests == limits caused total CPU requests (3,650m for metrics overlay) to exceed what a 4-CPU minikube node can allocate after Kubernetes system overhead (~900m), leaving the console deployment stuck Pending with "Insufficient cpu".

Lower CPU requests while keeping limits unchanged so pods reserve less for scheduling but can still burst under load:

* Console operator: 500m → 100m request, 500m limit
* Kafka dual-role: 500m → 250m request, 500m limit
* Apicurio registry app: 500m → 250m request, 500m limit
* Console API: 500m → 250m request, 500m limit

Relax CI invariant from requests == limits to requests <= limits:

* Rename checkRequestsEqualsLimits → checkRequestsNotExceedLimits
* Update VerifyResourceLimits and VerifyDocumentedResources call sites
* Update unit tests to accept requests < limits, reject requests > limits
* Update documented cpu_total from 4 to 3 CPU cores for both overlays
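
The scheduling arithmetic in this commit message can be checked with a few lines of Python. The node size, overhead, original total, and per-component figures come from the message above. The helper names are hypothetical, and the projected new total assumes each listed component runs as a single replica, which the message does not state.

```python
NODE_CPU_MILLIS = 4000          # 4-CPU minikube node
SYSTEM_OVERHEAD_MILLIS = 900    # approximate Kubernetes system overhead
ORIGINAL_REQUESTS_MILLIS = 3650 # metrics overlay total with requests == limits


def fits(requested_millis: int) -> bool:
    """True if the requested CPU fits in what the node can allocate."""
    return requested_millis <= NODE_CPU_MILLIS - SYSTEM_OVERHEAD_MILLIS


def requests_not_exceed_limits(request_millis: int, limit_millis: int) -> bool:
    """The relaxed invariant: a request may be lower than its limit, never higher."""
    return request_millis <= limit_millis


# component: (old request, new request, limit), all in millicores
CHANGES = {
    "console-operator": (500, 100, 500),
    "kafka-dual-role": (500, 250, 500),
    "apicurio-registry-app": (500, 250, 500),
    "console-api": (500, 250, 500),
}

# Savings per replica; assumes one replica of each component.
saved = sum(old - new for old, new, _limit in CHANGES.values())
projected = ORIGINAL_REQUESTS_MILLIS - saved

print(fits(ORIGINAL_REQUESTS_MILLIS), saved, projected, fits(projected))
print(all(requests_not_exceed_limits(new, limit) for _old, new, limit in CHANGES.values()))
```

With 3,100m allocatable, the original 3,650m total is 550m over budget, which matches the "Insufficient cpu" symptom described above.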
I had to reduce the requests for all resources to get them below the 4-CPU limit of the standard GH Actions runner. We may hit a wall with more complicated overlays and may need a custom testing solution.
Users currently have no way of knowing how much CPU and memory a given overlay requires before deploying it. This makes it difficult to right-size clusters, and deploying without sufficient resources leads to pods stuck in Pending with no clear explanation.
This PR addresses that by:
The PR is quite large as it combines the actual resource limit updates and the CI checking code, but they are in separate commits (the limits are in the first commit, with the CI and docs in the later commits), so they can be reviewed separately. If this is still too much, I can look at splitting the docs and resource checks into individual PRs.