[Do Not Merge] debug OOM in sail-operator e2e #79298
Conversation
Hi @ctartici. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. (Tip: We noticed you've done this a few times! Consider joining the org to skip this step.) Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ctartici. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
Walkthrough: A test automation script now automatically collects out-of-memory (OOM) diagnostics when a test pod exits with code 137. Changes: OOM diagnostics collection.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes. Pre-merge checks: ✅ 12 passed.
[REHEARSALNOTIFIER]
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals. Once you are satisfied with the results of the rehearsals, comment `/pj-rehearse ack`.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`:
- Around line 58-60: The current use of oc adm top pods with
--selector="spec.nodeName=..." won't work because the Metrics API cannot filter
by spec.nodeName; change the logic to first list pods on the node using the
existing oc get pods --all-namespaces --field-selector
"spec.nodeName=${node_name}" -o
custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers
and then iterate those namespace/name pairs to call oc adm top pod <NAME> -n
<NAMESPACE> (or oc top pod) for each pod to retrieve per-pod metrics, while
keeping the existing fallbacks/messages when metrics-server is unavailable.
ℹ️ Review info: Configuration used: Repository YAML (base), Central YAML (inherited). Review profile: CHILL. Plan: Enterprise. Run ID: 8f1febdf-bf03-4061-a5ef-d673e0b85042.
📒 Files selected for processing (1)
ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh
```sh
echo "--- All pods on node ${node_name} with resource usage ---"
oc adm top pods --all-namespaces --selector="spec.nodeName=${node_name}" 2>&1 || echo "Failed to get pod metrics (metrics-server may not be available)"
oc get pods --all-namespaces --field-selector "spec.nodeName=${node_name}" -o wide 2>&1 || echo "Failed to list pods on node"
```
🧩 Analysis chain
🌐 Web query:
For `oc adm top pods` (or `kubectl top pods`), does `--selector` support field selectors like `spec.nodeName=`, or only label selectors? What is the recommended way to get resource usage only for pods scheduled on a specific node?
💡 Result:
Both oc adm top pods and kubectl top pods support both --selector (or -l) for label selectors and --field-selector for field selectors [1][2][3]. Field selectors support spec.nodeName=<node> for Pods in general [4], and the commands document --field-selector with examples like key1=value1 [1][3]. However, when using --field-selector spec.nodeName=<node>, it fails with "spec.nodeName is not a known field selector: only metadata.name, metadata.namespace" because the Metrics API (used by top) only supports filtering on those metadata fields [5]. This is an ongoing limitation as of 2025-2026, with open issues/PRs for broader support [5][6]. The recommended way to get resource usage for pods on a specific node is a two-step script: first list pods with kubectl get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe to xargs for kubectl top pod <pod> -n <ns> --no-headers [7][5][8]. For OpenShift, use oc equivalents. Alternatively, label nodes and use label selectors on pods if node labels propagate [8].
Citations:
- 1: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top-pod.1.en.html
- 2: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod/
- 3: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod
- 4: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/
- 5: `kubectl top pods` filter by the node (kubernetes/kubernetes#131896)
- 6: Add `field-selector` option for `kubectl top pod` (kubernetes/kubectl#923)
- 7: https://oneuptime.com/blog/post/2026-02-09-kubectl-top-resource-hungry-pods-nodes/view
- 8: https://stackoverflow.com/questions/57473042/list-kubectl-top-pods-filtered-by-node
`oc adm top pods` cannot filter by `spec.nodeName` due to Metrics API limitations
Line 59 attempts to filter pod metrics by node using --selector="spec.nodeName=...", but this will not work. The Metrics API (used by oc adm top pods) only supports filtering on metadata.name and metadata.namespace—not spec.nodeName. Any attempt to filter by node selector or field selector will be ignored or fail.
To get resource usage for pods on a specific node, use a two-step approach: First, list pods on the node with oc get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe the results to query metrics for each pod individually with oc top pod <pod> -n <namespace>.
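The two-step approach described above can be sketched as a small shell function. This is an illustrative sketch only: the function name `collect_pod_metrics_for_node` and the fallback messages are assumptions, not taken from the actual step script, and it assumes `oc` is logged in and metrics-server is available.

```shell
# Sketch of the recommended two-step lookup (illustrative names).
collect_pod_metrics_for_node() {
  local node_name="$1"
  # Step 1: list namespace/name pairs for pods scheduled on the node.
  # The Pods API (unlike the Metrics API) supports spec.nodeName field selectors.
  oc get pods --all-namespaces \
    --field-selector "spec.nodeName=${node_name}" \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name \
    --no-headers |
  # Step 2: query the Metrics API per pod; it only filters on metadata fields.
  while read -r ns name; do
    oc adm top pod "${name}" -n "${ns}" --no-headers 2>&1 ||
      echo "Failed to get metrics for ${ns}/${name} (metrics-server may not be available)"
  done
}
```

Keeping the per-pod `|| echo` fallback preserves the original script's behavior of degrading gracefully when metrics-server is unavailable.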
/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

@FilipB: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ctartici: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.21-e2e-ocp

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse pull-ci-openshift-service-mesh-sail-operator-main-ocp-4.20-e2e-ocp-arm

@ctartici: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
This PR enhances the OpenShift CI infrastructure for the sail-operator's e2e test suite with improved out-of-memory (OOM) debugging capabilities.
The modified test runner script (`ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`) adds a new `collect_oom_debug_info()` function that automatically gathers diagnostic information when the e2e test pod exits with code 137, indicating it was killed, typically by SIGKILL from the OOM killer.

The test orchestration function (`execute_and_collect_artifacts`) now checks the test pod's exit code after running tests and, if it detects exit code 137, automatically triggers the debug info collection before proceeding with artifact retrieval. This enables better troubleshooting of OOM incidents by distinguishing between pod-level resource limit violations and cluster-wide memory pressure situations.
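The exit-code check described above can be sketched as a wrapper like the following. This is a hedged sketch, not the script's actual implementation: the wrapper name `run_and_check_oom` is illustrative, and `collect_oom_debug_info` is assumed to be defined elsewhere, as the PR description states.

```shell
# Illustrative sketch: run a command, detect exit code 137 (128 + 9, i.e.
# SIGKILL, the usual signature of the OOM killer or an exceeded memory limit),
# and trigger diagnostics collection before propagating the original status.
run_and_check_oom() {
  "$@"
  local rc=$?
  if [ "${rc}" -eq 137 ]; then
    echo "Test pod exited with code 137; collecting OOM diagnostics"
    collect_oom_debug_info   # assumed defined elsewhere in the step script
  fi
  return "${rc}"
}
```

Returning the original exit code lets the caller still fail the CI step while the collected diagnostics land in the artifacts directory.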