Add resource limits to all overlays with CI enforcement and documentation #12

Open
tomncooper wants to merge 5 commits into main from resource-limits

@tomncooper tomncooper commented Apr 10, 2026

Users currently have no way of knowing how much CPU and memory a given overlay requires before deploying it. This makes it difficult to right-size clusters, and deploying without sufficient resources leads to pods stuck in Pending with no clear explanation.

This PR addresses that by:

  1. Setting resource requests and limits on every component: all operator deployments and custom resources across the core and metrics overlays now have explicit CPU and memory specs, with requests equal to limits.
  2. Documenting per-overlay resource totals: each overlay's documentation page now declares the total CPU and memory required in its frontmatter (cpu_total, memory_total), rendered in a "Resource Requirements" section so users can check before installing.
  3. Adding CI scripts to enforce this going forward:
     • VerifyResourceLimits: walks the kustomize build output and CRD schemas to check that every container and every configured CR resource field has both requests and limits set. Optional CRD fields that aren't configured are skipped.
     • VerifyDocumentedResources: sums the actual resource requests from kustomize build and checks that the documented cpu_total and memory_total in the overlay's doc page are sufficient.
     • Both scripts have unit tests, run in CI on every PR, and auto-discover overlays so new overlays are validated without manual workflow updates.
  4. Adding overlay contributor documentation: docs/overlays/developing.md explains the resource limit and documentation requirements so contributors know what CI will enforce.
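
The two CI checks described above can be illustrated with a minimal Python sketch. It is not the PR's code: the function names (`missing_resources`, `total_requests`, `documented_is_sufficient`), the container names, and the quantities are all hypothetical, and the sketch assumes the containers have already been extracted from the kustomize build output into plain dictionaries. CRD schema handling is left out.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: enough for CPU and memory).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
}


def parse_quantity(quantity):
    """Convert '250m', '2' or '512Mi' to a Decimal number of cores or bytes."""
    text = str(quantity)
    # Try two-character suffixes first so 'Mi' is not mistaken for a bare number.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def missing_resources(containers):
    """List 'container: section.resource' entries that are not set."""
    problems = []
    for container in containers:
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for name in ("cpu", "memory"):
                if name not in resources.get(section, {}):
                    problems.append(f"{container['name']}: {section}.{name}")
    return problems


def total_requests(containers, name):
    """Sum the requests for one resource ('cpu' or 'memory') over all containers."""
    return sum(
        (parse_quantity(c["resources"]["requests"][name]) for c in containers),
        Decimal(0),
    )


def documented_is_sufficient(containers, cpu_total, memory_total):
    """True when the documented totals cover the summed requests."""
    return (
        total_requests(containers, "cpu") <= parse_quantity(cpu_total)
        and total_requests(containers, "memory") <= parse_quantity(memory_total)
    )


# Hypothetical containers; the second one has no memory limit.
containers = [
    {"name": "operator", "resources": {
        "requests": {"cpu": "100m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"}}},
    {"name": "api", "resources": {
        "requests": {"cpu": "250m", "memory": "512Mi"},
        "limits": {"cpu": "500m"}}},
]

print(missing_resources(containers))
print(documented_is_sufficient(containers, cpu_total="1", memory_total="1Gi"))
```

The sufficiency check uses `<=` rather than equality, matching the description that the documented totals must be "sufficient" for the summed requests.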

The PR is quite large because it combines the resource limit updates with the CI checking code, but they are in separate commits (the limits are in the first commit, with the CI and docs in the later commits), so they can be reviewed separately. If this is still too much, I can split the docs and resource checks into individual PRs.

Set CPU and memory requests equal to limits for every operator deployment
and custom resource across the core and metrics overlays.

 * Patch Strimzi, Apicurio Registry, and Console operator deployments
   via kustomize strategic-merge patches
 * Set resource specs on Kafka node pools and entity operator (topic +
   user operator) in the Kafka CR
 * Set resource specs on Apicurio Registry app and UI containers
 * Set resource specs on Console API and UI containers
 * Set resource specs on Prometheus Operator deployment and Prometheus CR

Signed-off-by: Thomas Cooper <code@tomcooper.dev>

Add CI scripts that verify every container in an overlay has resource
requests and limits, and that overlay documentation pages declare
accurate resource totals. Add documentation for the core overlay and
a guide for overlay contributors.

* Add VerifyResourceLimits script to check all containers and CR
  resource fields have requests and limits set
* Add VerifyDocumentedResources script to check documented cpu_total
  and memory_total match kustomize build output
* Add CrdSchemaUtils shared utility for CRD schema introspection
* Add unit tests for both verification scripts
* Move existing script tests into tests subdirectory
* Add script-tests.yaml workflow to run script unit tests in CI
* Add docs/overlays/core.md with install instructions, components
  table, and resource requirements
* Add docs/overlays/developing.md guide covering resource limit and
  documentation requirements for overlay contributors
* Add resource requirements frontmatter and section to metrics overlay
  docs
* Refactor validate.yaml to discover overlays dynamically instead of
  a hardcoded list
* Update README with new script descriptions and test instructions

Signed-off-by: Thomas Cooper <code@tomcooper.dev>
@tomncooper tomncooper requested review from Frawless and kornys April 10, 2026 14:05
* Update overlay developer docs with more details
* Add helper script to show the resource limits set in a given overlay
* Set the docs preview script to use the same hugo-book version as the
  StreamsHub site

Signed-off-by: Thomas Cooper <code@tomcooper.dev>
Signed-off-by: Thomas Cooper <code@tomcooper.dev>
The requirement for requests == limits caused total CPU
requests (3,650m for metrics overlay) to exceed what a 4-CPU minikube
node can allocate after Kubernetes system overhead (~900m), leaving the
console deployment stuck Pending with "Insufficient cpu".

Lower CPU requests while keeping limits unchanged so pods reserve less
for scheduling but can still burst under load:

* Console operator: 500m → 100m request, 500m limit
* Kafka dual-role: 500m → 250m request, 500m limit
* Apicurio registry app: 500m → 250m request, 500m limit
* Console API: 500m → 250m request, 500m limit
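
The scheduling arithmetic behind this change can be worked through in a few lines of Python. The three input figures and the old and new requests come from the commit message above. The dictionary keys are illustrative labels only, and the savings are computed per replica because the replica counts are not stated here.

```python
# Figures quoted in the commit message, all in millicores (1 CPU = 1000m).
NODE_CPU_M = 4 * 1000       # 4-CPU minikube node
SYSTEM_OVERHEAD_M = 900     # approximate Kubernetes system overhead
REQUESTED_M = 3650          # total CPU requests, metrics overlay, requests == limits

# CPU left for workloads, and how far the old requests overshoot it.
allocatable_m = NODE_CPU_M - SYSTEM_OVERHEAD_M
shortfall_m = REQUESTED_M - allocatable_m

# (old request, new request) per component, taken from the list above.
REDUCTIONS_M = {
    "console-operator": (500, 100),
    "kafka-dual-role": (500, 250),
    "apicurio-registry-app": (500, 250),
    "console-api": (500, 250),
}
saved_per_replica_m = sum(old - new for old, new in REDUCTIONS_M.values())

print(f"allocatable: {allocatable_m}m")
print(f"shortfall before the change: {shortfall_m}m")
print(f"saved per replica: {saved_per_replica_m}m")
```

With these inputs the old requests overshoot the allocatable CPU by 550m, while the four reductions free 1150m if each component runs as a single replica.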

Relax CI invariant from requests == limits to requests <= limits:

* Rename checkRequestsEqualsLimits → checkRequestsNotExceedLimits
* Update VerifyResourceLimits and VerifyDocumentedResources call sites
* Update unit tests to accept requests < limits, reject requests > limits
* Update documented cpu_total from 4 to 3 CPU cores for both overlays
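
As a rough illustration of the relaxed invariant, the sketch below flags only the resources whose request is greater than its limit. It is written in Python for brevity and is not the `checkRequestsNotExceedLimits` implementation from this PR: the function names and the suffix table are assumptions.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: enough for CPU and memory).
_UNITS = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
    "m": Decimal("0.001"),
}


def to_number(quantity):
    """Convert '250m', '1' or '512Mi' into a comparable Decimal."""
    text = str(quantity)
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def requests_exceeding_limits(resources):
    """Return the resource names whose request is greater than its limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return [
        name
        for name in requests
        if name in limits and to_number(requests[name]) > to_number(limits[name])
    ]


# Request below limit (the new Console operator CPU values): accepted.
print(requests_exceeding_limits(
    {"requests": {"cpu": "100m"}, "limits": {"cpu": "500m"}}))

# Request above limit (hypothetical values): rejected.
print(requests_exceeding_limits(
    {"requests": {"cpu": "750m", "memory": "1Gi"},
     "limits": {"cpu": "500m", "memory": "512Mi"}}))
```

Equal requests and limits still pass, so overlays written against the earlier requests == limits rule remain valid under the relaxed check.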
@tomncooper
Contributor Author

I had to reduce the requests for all resources to get them below the 4-CPU limit of the standard GitHub Actions runner. We may hit a wall with more complicated overlays and may need a custom testing solution.
