[Do Not Merge]debug OOM in sail-operator e2e #79298

Open
ctartici wants to merge 1 commit into openshift:main from ctartici:main

@ctartici
Contributor

@ctartici ctartici commented May 14, 2026

This PR enhances the OpenShift CI infrastructure for the sail-operator's e2e test suite with improved out-of-memory (OOM) debugging capabilities.

The modified test runner script (ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh) adds a new collect_oom_debug_info() function that automatically gathers diagnostic information when the e2e test pod exits with code 137 (128 + 9, meaning the process was terminated by SIGKILL, most commonly by the OOM killer). The diagnostic collection includes:

  • Pod description and resource utilization details
  • The node hosting the test pod, including its resource state and events
  • Namespace events and node-level events
  • Resource metrics for all pods running on the affected node
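As a rough illustration of the diagnostics listed above, a collector could look like the sketch below. This is not the code from this PR: the function name collect_oom_debug_info_sketch, its two arguments, and the exact selection of oc commands are assumptions made for the example, and the per-node pod metrics step is left out.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and its arguments are hypothetical,
# not taken from the script changed in this PR.
collect_oom_debug_info_sketch() {
  local namespace="$1" pod_name="$2" node_name

  echo "--- Pod description ---"
  oc describe pod "${pod_name}" -n "${namespace}" 2>&1 || echo "Failed to describe pod"

  echo "--- Pod resource usage ---"
  oc adm top pod "${pod_name}" -n "${namespace}" --containers 2>&1 \
    || echo "Failed to get pod metrics (metrics-server may not be available)"

  # Look up the node that hosts the pod; an empty result means the lookup failed.
  node_name=$(oc get pod "${pod_name}" -n "${namespace}" \
    -o jsonpath='{.spec.nodeName}' 2>/dev/null || true)
  if [ -z "${node_name}" ]; then
    echo "Could not determine the node for pod ${pod_name}"
    return 0
  fi

  echo "--- Node ${node_name} ---"
  oc describe node "${node_name}" 2>&1 || echo "Failed to describe node"
  oc adm top node "${node_name}" 2>&1 || echo "Failed to get node metrics"

  echo "--- Events in namespace ${namespace} ---"
  oc get events -n "${namespace}" --sort-by=.lastTimestamp 2>&1 \
    || echo "Failed to get namespace events"

  echo "--- Events for node ${node_name} ---"
  oc get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=${node_name}" 2>&1 \
    || echo "Failed to get node events"
}
```

Every oc call has its own fallback message, so a missing metrics-server or a failed lookup does not abort the rest of the collection.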

The test orchestration function (execute_and_collect_artifacts) now checks the test pod's exit code after running tests, and if it detects exit code 137, automatically triggers the debug info collection before proceeding with artifact retrieval. This enables better troubleshooting of OOM incidents by distinguishing between pod-level resource limit violations and cluster-wide memory pressure situations.
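A minimal sketch of that exit-code check, assuming a helper that receives the exit code followed by the collector command to run. The name check_for_oom_kill and its calling convention are hypothetical and are not taken from execute_and_collect_artifacts.

```shell
#!/usr/bin/env bash
# Sketch only: check_for_oom_kill is a hypothetical helper.
# Usage: check_for_oom_kill <exit_code> <collector command...>
check_for_oom_kill() {
  local exit_code="$1"
  shift
  # 137 = 128 + 9: the process was terminated by signal 9 (SIGKILL).
  if [ "${exit_code}" -eq 137 ]; then
    echo "Test pod exited with code 137 (SIGKILL, possibly OOM), collecting debug info"
    # "$@" is whatever collector command the caller passed in.
    "$@" || echo "Debug info collection failed, continuing with artifact retrieval"
  fi
  return 0
}

# Demo: a process that is killed with SIGKILL is reported as exit code 137.
exit_code=0
bash -c 'kill -9 $$' || exit_code=$?
check_for_oom_kill "${exit_code}" echo "(collector would run here)"
```

The helper always returns 0 so that a failure while collecting diagnostics cannot prevent the artifact retrieval that follows.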

@openshift-ci openshift-ci Bot requested review from mayleighnmyers and rcernich May 14, 2026 12:39
@openshift-ci openshift-ci Bot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) May 14, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

Hi @ctartici. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ctartici
Once this PR has been reviewed and has the lgtm label, please assign dcillera for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@FilipB
Contributor

FilipB commented May 14, 2026

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) May 14, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

A test automation script now automatically collects out-of-memory (OOM) diagnostics when a test pod exits with code 137. The new collect_oom_debug_info function gathers pod and node descriptions, cluster events, and resource metrics via OpenShift commands. The existing test execution flow conditionally invokes this helper upon detecting the 137 exit code before proceeding to copy test artifacts.

Changes

OOM Diagnostics Collection

| Layer / File(s) | Summary |
| --- | --- |
| OOM diagnostic helper and invocation: `ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh` | A new collect_oom_debug_info function prints pod and node descriptions, namespace and node events, and resource usage metrics via oc commands. The execute_and_collect_artifacts function now checks the test exit code and calls this helper when the code equals 137 (SIGKILL from OOM) before proceeding with artifact collection. |

Estimated Code Review Effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only a bash shell script for E2E test orchestration, containing no Ginkgo test patterns or test names. Check is not applicable to shell script code. |
| Test Structure And Quality | ✅ Passed | Check is not applicable. PR modifies only a bash shell script, not Ginkgo test code. The custom check requires reviewing Ginkgo test blocks (It, BeforeEach, etc.) which are absent here. |
| Microshift Test Compatibility | ✅ Passed | The PR modifies a bash shell script for CI test orchestration, not Go-based Ginkgo e2e tests. The custom check applies only to new Ginkgo test definitions, which are not present in this PR. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies a bash CI orchestration script, not Ginkgo tests. The custom check applies only to new e2e test definitions. No test cases are added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Test script modification. No deployment manifests, operator code, or scheduling constraints introduced. Check not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | The PR modifies a bash shell script, not a Go OTE binary. The OTE Binary Stdout Contract check applies only to Go test binaries communicating with openshift-tests, not to CI/CD orchestration scripts. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. It only modifies a Bash shell script for CI debugging. The custom check is not applicable. |
| Title check | ✅ Passed | The title clearly describes the main change: adding OOM debugging functionality to the sail-operator e2e test script. |

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@ctartici: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.21-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.21-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.21-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.21-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.20-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.3-ocp-4.20-e2e-ocp-arm | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.20-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.0-ocp-4.20-e2e-ocp-arm | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.21-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.20-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.2-ocp-4.20-e2e-ocp-arm | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.20-e2e-ocp | openshift-service-mesh/sail-operator | presubmit | Registry content changed |
| pull-ci-openshift-service-mesh-sail-operator-release-3.1-ocp-4.20-e2e-ocp-arm | openshift-service-mesh/sail-operator | presubmit | Registry content changed |

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`:
- Around line 58-60: The current use of oc adm top pods with
--selector="spec.nodeName=..." won't work because the Metrics API cannot filter
by spec.nodeName; change the logic to first list pods on the node using the
existing oc get pods --all-namespaces --field-selector
"spec.nodeName=${node_name}" -o
custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers
and then iterate those namespace/name pairs to call oc adm top pod <NAME> -n
<NAMESPACE> (or oc top pod) for each pod to retrieve per-pod metrics, while
keeping the existing fallbacks/messages when metrics-server is unavailable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8f1febdf-bf03-4061-a5ef-d673e0b85042

📥 Commits

Reviewing files that changed from the base of the PR and between 55c6069 and b5a4d0a.

📒 Files selected for processing (1)
  • ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh

Comment on lines +58 to +60
echo "--- All pods on node ${node_name} with resource usage ---"
oc adm top pods --all-namespaces --selector="spec.nodeName=${node_name}" 2>&1 || echo "Failed to get pod metrics (metrics-server may not be available)"
oc get pods --all-namespaces --field-selector "spec.nodeName=${node_name}" -o wide 2>&1 || echo "Failed to list pods on node"
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For `oc adm top pods` (or `kubectl top pods`), does `--selector` support field selectors like `spec.nodeName=` or only label selectors? What is the recommended way to get resource usage only for pods scheduled on a specific node?

💡 Result:

Both oc adm top pods and kubectl top pods support both --selector (or -l) for label selectors and --field-selector for field selectors [1][2][3]. Field selectors support spec.nodeName=<node> for Pods in general [4], and the commands document --field-selector with examples like key1=value1 [1][3].

However, when using --field-selector spec.nodeName=<node>, it fails with "spec.nodeName is not a known field selector: only metadata.name, metadata.namespace" because the Metrics API (used by top) only supports filtering on those metadata fields [5]. This is an ongoing limitation as of 2025-2026, with open issues/PRs for broader support [5][6].

The recommended way to get resource usage for pods on a specific node is a two-step script: first list pods with kubectl get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe to xargs for kubectl top pod <pod> -n <ns> --no-headers [7][5][8]. For OpenShift, use oc equivalents. Alternatively, label nodes and use label selectors on pods if node labels propagate [8].

Citations:


oc adm top pods cannot filter by spec.nodeName due to Metrics API limitations

Line 59 attempts to filter pod metrics by node using --selector="spec.nodeName=...", but this will not work. --selector is a label selector, so the expression is matched against pod labels rather than against the pod's spec.nodeName field. Switching to --field-selector does not help either: the Metrics API (used by oc adm top pods) only supports field selectors on metadata.name and metadata.namespace, not spec.nodeName.

To get resource usage for pods on a specific node, use a two-step approach: first, list the pods on the node with oc get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then query metrics for each listed pod individually with oc adm top pod <pod> -n <namespace>.
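The two-step approach could be sketched roughly as follows. The helper name top_pods_on_node is hypothetical, and the fallback messages only loosely mirror the ones in the script under review, so treat this as an illustration rather than a drop-in replacement for lines 58-60.

```shell
#!/usr/bin/env bash
# Sketch only: top_pods_on_node is a hypothetical helper name.
top_pods_on_node() {
  local node_name="$1"
  local pods ns name

  # Step 1: the core API supports the spec.nodeName field selector for pods.
  if ! pods=$(oc get pods --all-namespaces \
      --field-selector "spec.nodeName=${node_name}" \
      -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name \
      --no-headers); then
    echo "Failed to list pods on node ${node_name}"
    return 0
  fi

  # Step 2: query the Metrics API one pod at a time.
  while read -r ns name; do
    [ -n "${name}" ] || continue
    # </dev/null keeps the command from consuming the loop's input.
    oc adm top pod "${name}" -n "${ns}" --no-headers 2>&1 </dev/null \
      || echo "Failed to get metrics for ${ns}/${name} (metrics-server may not be available)"
  done <<< "${pods}"
  return 0
}
```

Calling top_pods_on_node "${node_name}" in place of the single oc adm top pods invocation issues one metrics request per pod, which is slower on busy nodes but avoids relying on selector support in the Metrics API.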

@FilipB
Contributor

FilipB commented May 14, 2026

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

@openshift-merge-bot
Contributor

@FilipB: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

@ctartici: all tests passed!

Full PR test history. Your PR dashboard.

@ctartici ctartici changed the title from "debug OOM in sail-operator e2e" to "[Do Not Merge]debug OOM in sail-operator e2e" May 14, 2026
@ctartici
Contributor Author

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.21-e2e-ocp

@openshift-merge-bot
Contributor

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ctartici
Contributor Author

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

@openshift-merge-bot
Contributor

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
