[Do Not Merge]debug OOM in sail-operator e2e by ctartici · Pull Request #79298 · openshift/release

ctartici · 2026-05-14T12:37:08Z

This PR enhances the OpenShift CI infrastructure for the sail-operator's e2e test suite with improved out-of-memory (OOM) debugging capabilities.

The modified test runner script (ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh) adds a new collect_oom_debug_info() function that automatically gathers diagnostic information when the e2e test pod exits with code 137 (conventionally 128 + 9, meaning the process was terminated by SIGKILL, which is typically sent by the OOM killer). The diagnostic collection includes:

Pod description and resource utilization details
The node hosting the test pod, including its resource state and events
Cluster-wide namespace events and node-level events
Resource metrics for all pods running on the affected node
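
As a rough illustration of the list above, a collector along these lines could gather the pod, node, and event data. This is a minimal sketch and not the code from the PR: the function body, its two parameters (namespace and pod name), and the fallback messages are assumptions made for the example. Only standard oc subcommands are used.

```shell
#!/usr/bin/env bash
# Sketch only: the body below is an assumption, not the PR's implementation.
# The parameters (namespace, pod name) are hypothetical.

collect_oom_debug_info() {
  local namespace="$1" pod_name="$2"
  local node_name

  echo "=== OOM debug info for ${namespace}/${pod_name} ==="

  # Pod description: container states, last termination reason, requests/limits.
  oc describe pod "${pod_name}" -n "${namespace}" 2>&1 \
    || echo "Failed to describe pod"

  # Resolve the node that hosted the test pod.
  node_name="$(oc get pod "${pod_name}" -n "${namespace}" \
    -o jsonpath='{.spec.nodeName}' 2>/dev/null || true)"
  if [ -z "${node_name}" ]; then
    echo "Could not determine node for pod ${pod_name}"
    return 0
  fi

  # Node state: allocatable resources and conditions such as MemoryPressure.
  oc describe node "${node_name}" 2>&1 \
    || echo "Failed to describe node"

  # Events in the test namespace, sorted by timestamp.
  oc get events -n "${namespace}" --sort-by=.lastTimestamp 2>&1 \
    || echo "Failed to get namespace events"

  # Events that reference the node itself.
  oc get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node_name}" 2>&1 \
    || echo "Failed to get node events"

  # Node-level usage; requires a working metrics pipeline.
  oc adm top node "${node_name}" 2>&1 \
    || echo "Failed to get node metrics (metrics-server may not be available)"
}
```

Each command has its own fallback message so that one failing call does not stop the rest of the collection, which matters when the cluster is already under memory pressure.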

The test orchestration function (execute_and_collect_artifacts) now checks the test pod's exit code after running tests, and if it detects exit code 137, automatically triggers the debug info collection before proceeding with artifact retrieval. This enables better troubleshooting of OOM incidents by distinguishing between pod-level resource limit violations and cluster-wide memory pressure situations.
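
The exit-code check described above can be sketched as follows. This is an illustrative example under stated assumptions, not the PR's code: run_tests is a hypothetical stand-in that simulates a SIGKILL, collect_oom_debug_info is reduced to a placeholder, and check_exit_and_collect is a made-up name. How the real script obtains the pod's exit code and propagates failure is not modeled here.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and bodies here are illustrative assumptions.

# Placeholder for the real diagnostics collector.
collect_oom_debug_info() { echo "collecting OOM debug info"; }

# Hypothetical stand-in for the test run: a child shell that sends itself
# SIGKILL, which is the signal the kernel OOM killer delivers.
run_tests() { bash -c 'kill -KILL $$'; }

check_exit_and_collect() {
  local rc=0
  run_tests || rc=$?
  echo "test exit code: ${rc}"

  if [ "${rc}" -eq 137 ]; then
    # Shells report "128 + signal number" for a process killed by a signal.
    local sig=$(( rc - 128 ))
    echo "exit code ${rc} = 128 + ${sig} (SIG$(kill -l "${sig}"))"
    collect_oom_debug_info
  fi

  # Artifact retrieval proceeds whether or not diagnostics were collected.
  echo "copying artifacts"
}

check_exit_and_collect
```

Note that exit code 137 only shows that the process received SIGKILL. The collected pod and node data are what allow a reader to tell a container limit violation apart from node-wide memory pressure.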

openshift-ci · 2026-05-14T12:39:39Z

Hi @ctartici. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-05-14T12:39:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ctartici
Once this PR has been reviewed and has the lgtm label, please assign dcillera for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

ci-operator/step-registry/servicemesh/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

FilipB · 2026-05-14T12:39:52Z

/ok-to-test

coderabbitai · 2026-05-14T12:40:27Z

Walkthrough

A test automation script now automatically collects out-of-memory (OOM) diagnostics when a test pod exits with code 137. The new collect_oom_debug_info function gathers pod and node descriptions, cluster events, and resource metrics via OpenShift commands. The existing test execution flow conditionally invokes this helper upon detecting the 137 exit code before proceeding to copy test artifacts.

Changes

OOM Diagnostics Collection

Layer / File(s)	Summary
OOM diagnostic helper and invocation `ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`	A new `collect_oom_debug_info` function prints pod and node descriptions, namespace and node events, and resource usage metrics via `oc` commands. The `execute_and_collect_artifacts` function now checks the test exit code and calls this helper when the code equals 137 (SIGKILL from OOM) before proceeding with artifact collection.

Estimated Code Review Effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies only a bash shell script for E2E test orchestration, containing no Ginkgo test patterns or test names. Check is not applicable to shell script code.
Test Structure And Quality	✅ Passed	Check is not applicable. PR modifies only a bash shell script, not Ginkgo test code. The custom check requires reviewing Ginkgo test blocks (It, BeforeEach, etc.) which are absent here.
Microshift Test Compatibility	✅ Passed	The PR modifies a bash shell script for CI test orchestration, not Go-based Ginkgo e2e tests. The custom check applies only to new Ginkgo test definitions, which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR modifies a bash CI orchestration script, not Ginkgo tests. The custom check applies only to new e2e test definitions. No test cases are added.
Topology-Aware Scheduling Compatibility	✅ Passed	Test script modification. No deployment manifests, operator code, or scheduling constraints introduced. Check not applicable.
Ote Binary Stdout Contract	✅ Passed	The PR modifies a bash shell script, not a Go OTE binary. The OTE Binary Stdout Contract check applies only to Go test binaries communicating with openshift-tests, not to CI/CD orchestration scripts.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR does not add any Ginkgo e2e tests. It only modifies a Bash shell script for CI debugging. The custom check is not applicable.
Title check	✅ Passed	The title clearly describes the main change: adding OOM debugging functionality to the sail-operator e2e test script.

openshift-merge-bot · 2026-05-14T12:41:30Z

[REHEARSALNOTIFIER]
@ctartici: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.21-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.21-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.21-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.21-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.20-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.20-e2e-ocp-arm	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.20-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.20-e2e-ocp-arm	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.21-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.20-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.20-e2e-ocp-arm	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.20-e2e-ocp	openshift-service-mesh/sail-operator	presubmit	Registry content changed
pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.20-e2e-ocp-arm	openshift-service-mesh/sail-operator	presubmit	Registry content changed

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`:
- Around line 58-60: The current use of oc adm top pods with
--selector="spec.nodeName=..." won't work because the Metrics API cannot filter
by spec.nodeName; change the logic to first list pods on the node using the
existing oc get pods --all-namespaces --field-selector
"spec.nodeName=${node_name}" -o
custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers
and then iterate those namespace/name pairs to call oc adm top pod <NAME> -n
<NAMESPACE> (or oc top pod) for each pod to retrieve per-pod metrics, while
keeping the existing fallbacks/messages when metrics-server is unavailable.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8f1febdf-bf03-4061-a5ef-d673e0b85042

📥 Commits

Reviewing files that changed from the base of the PR and between 55c6069 and b5a4d0a.

📒 Files selected for processing (1)

ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh

coderabbitai · 2026-05-14T12:43:22Z

+    echo "--- All pods on node ${node_name} with resource usage ---"
+    oc adm top pods --all-namespaces --selector="spec.nodeName=${node_name}" 2>&1 || echo "Failed to get pod metrics (metrics-server may not be available)"
+    oc get pods --all-namespaces --field-selector "spec.nodeName=${node_name}" -o wide 2>&1 || echo "Failed to list pods on node"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For oc adm top pods (or kubectl top pods), does --selector support field selectors like spec.nodeName=, or only label selectors? What is the recommended way to get resource usage only for pods scheduled on a specific node?

💡 Result:

Both oc adm top pods and kubectl top pods support both --selector (or -l) for label selectors and --field-selector for field selectors [1][2][3]. Field selectors support spec.nodeName=<node> for Pods in general [4], and the commands document --field-selector with examples like key1=value1 [1][3]. However, when using --field-selector spec.nodeName=<node>, it fails with "spec.nodeName is not a known field selector: only metadata.name, metadata.namespace" because the Metrics API (used by top) only supports filtering on those metadata fields [5]. This is an ongoing limitation as of 2025-2026, with open issues/PRs for broader support [5][6]. The recommended way to get resource usage for pods on a specific node is a two-step script: first list pods with kubectl get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe to xargs for kubectl top pod <pod> -n <ns> --no-headers [7][5][8]. For OpenShift, use oc equivalents. Alternatively, label nodes and use label selectors on pods if node labels propagate [8].

Citations:

1: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top-pod.1.en.html

2: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod/

3: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod

4: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/

5: kubectl top pods filter by the node kubernetes/kubernetes#131896

6: Add field-selector option for kubectl top pod kubernetes/kubectl#923

7: https://oneuptime.com/blog/post/2026-02-09-kubectl-top-resource-hungry-pods-nodes/view

8: https://stackoverflow.com/questions/57473042/list-kubectl-top-pods-filtered-by-node

oc adm top pods cannot filter by spec.nodeName due to Metrics API limitations

Line 59 attempts to filter pod metrics by node using --selector="spec.nodeName=...", but this will not work. --selector takes a label selector, so spec.nodeName is treated as a label key rather than a pod field, and the Metrics API (used by oc adm top pods) only supports field selectors on metadata.name and metadata.namespace, not spec.nodeName. Filtering by node this way will therefore either match nothing or fail.

To get resource usage for pods on a specific node, use a two-step approach: first, list pods on the node with oc get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then query metrics for each pod individually with oc adm top pod <pod> -n <namespace>.
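
The two-step approach could look roughly like this. It is a sketch under assumptions rather than the fix that was applied: top_pods_on_node is a hypothetical helper name, and the per-pod fallback message is invented for the example. The oc get pods invocation is the one quoted in the review comment.

```shell
#!/usr/bin/env bash
# Sketch only: top_pods_on_node is a hypothetical helper, not part of the PR.

top_pods_on_node() {
  local node_name="$1"
  local ns name

  # Step 1: list namespace/name pairs for pods scheduled on the node.
  # Field selectors on spec.nodeName work for the core Pod API.
  oc get pods --all-namespaces \
    --field-selector "spec.nodeName=${node_name}" \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name \
    --no-headers |
  while read -r ns name; do
    [ -n "${name}" ] || continue
    # Step 2: query metrics one pod at a time. stdin is redirected so the
    # command cannot consume the pod list that the loop is reading.
    oc adm top pod "${name}" -n "${ns}" --no-headers 2>&1 </dev/null \
      || echo "${ns}/${name}: metrics unavailable"
  done
}
```

A call such as top_pods_on_node "${node_name}" would then replace the single oc adm top pods line. Issuing one metrics request per pod is slower than a single bulk call, which seems an acceptable trade-off for a debug path that only runs after a failure.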

FilipB · 2026-05-14T12:45:26Z

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

openshift-merge-bot · 2026-05-14T12:45:50Z

@FilipB: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2026-05-14T12:45:53Z

@ctartici: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ctartici · 2026-05-14T14:26:29Z

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.21-e2e-ocp

openshift-merge-bot · 2026-05-14T14:26:44Z

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ctartici · 2026-05-14T15:01:37Z

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

openshift-merge-bot · 2026-05-14T15:01:40Z

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

debug OOM in sail-operator e2e

b5a4d0a

openshift-ci Bot requested review from mayleighnmyers and rcernich May 14, 2026 12:39

openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 14, 2026

openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2026

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

ctartici changed the title ~~debug OOM in sail-operator e2e~~ [Do Not Merge]debug OOM in sail-operator e2e May 14, 2026
