@@ -34,6 +34,37 @@ readonly RETRY_SLEEP_INTERVAL=30

# --- Functions ---

# collect_oom_debug_info gathers debugging information when a pod exits with 137 (OOM/SIGKILL).
# This helps determine if the pod was killed due to pod limits or node-level OOM.
collect_oom_debug_info() {
echo "=== OOM DEBUG INFO: Collecting diagnostics for exit code 137 ==="

echo "--- Pod description for ${MAISTRA_SC_POD} ---"
oc describe pod "${MAISTRA_SC_POD}" -n "${MAISTRA_NAMESPACE}" 2>&1 || echo "Failed to describe pod"

local node_name
node_name=$(oc get pod "${MAISTRA_SC_POD}" -n "${MAISTRA_NAMESPACE}" -o jsonpath='{.spec.nodeName}' 2>/dev/null || echo "")

echo "--- Events from namespace ${MAISTRA_NAMESPACE} ---"
oc get events -n "${MAISTRA_NAMESPACE}" --sort-by='.lastTimestamp' 2>&1 || echo "Failed to get namespace events"

if [[ -n "${node_name}" ]]; then
echo "--- Node ${node_name} description ---"
oc describe node "${node_name}" 2>&1 || echo "Failed to describe node"

echo "--- Events from node ${node_name} ---"
oc get events --field-selector "involvedObject.name=${node_name}" --all-namespaces --sort-by='.lastTimestamp' 2>&1 || echo "Failed to get node events"

echo "--- All pods on node ${node_name} with resource usage ---"
oc adm top pods --all-namespaces --selector="spec.nodeName=${node_name}" 2>&1 || echo "Failed to get pod metrics (metrics-server may not be available)"
oc get pods --all-namespaces --field-selector "spec.nodeName=${node_name}" -o wide 2>&1 || echo "Failed to list pods on node"
Comment on lines +58 to +60

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For oc adm top pods (or kubectl top pods), does --selector support field selectors like spec.nodeName=, or only label selectors? What is the recommended way to get resource usage only for pods scheduled on a specific node?

💡 Result:

Both oc adm top pods and kubectl top pods support both --selector (or -l) for label selectors and --field-selector for field selectors [1][2][3]. Field selectors support spec.nodeName=<node> for Pods in general [4], and the commands document --field-selector with examples like key1=value1 [1][3]. However, when using --field-selector spec.nodeName=<node>, it fails with "spec.nodeName is not a known field selector: only metadata.name, metadata.namespace" because the Metrics API (used by top) only supports filtering on those metadata fields [5]. This is an ongoing limitation as of 2025-2026, with open issues/PRs for broader support [5][6]. The recommended way to get resource usage for pods on a specific node is a two-step script: first list pods with kubectl get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe to xargs for kubectl top pod <pod> -n <ns> --no-headers [7][5][8]. For OpenShift, use oc equivalents. Alternatively, label nodes and use label selectors on pods if node labels propagate [8].

oc adm top pods cannot filter by spec.nodeName due to Metrics API limitations

Line 59 attempts to filter pod metrics by node using --selector="spec.nodeName=...", but this will not work. The Metrics API (used by oc adm top pods) only supports filtering on metadata.name and metadata.namespace—not spec.nodeName. Any attempt to filter by node selector or field selector will be ignored or fail.

To get resource usage for pods on a specific node, use a two-step approach: First, list pods on the node with oc get pods -A --field-selector spec.nodeName=<node-name> -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers, then pipe the results to query metrics for each pod individually with oc top pod <pod> -n <namespace>.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/servicemesh/sail-operator/e2e-ocp/servicemesh-sail-operator-e2e-ocp-commands.sh`
around lines 58-60, the current use of oc adm top pods with
--selector="spec.nodeName=..." won't work because the Metrics API cannot filter
by spec.nodeName; change the logic to first list pods on the node using the
existing oc get pods --all-namespaces --field-selector
"spec.nodeName=${node_name}" -o
custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers
and then iterate those namespace/name pairs to call oc adm top pod <NAME> -n
<NAMESPACE> (or oc top pod) for each pod to retrieve per-pod metrics, while
keeping the existing fallbacks/messages when metrics-server is unavailable.
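The two-step approach the reviewer recommends can be sketched as below. This is an illustrative mock, not code from the PR: `list_node_pods` stands in for the real `oc get pods --all-namespaces --field-selector "spec.nodeName=${node_name}" -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers` call, and the `echo` stands in for actually invoking `oc adm top pod`, so the iteration pattern can be run without a cluster.

```shell
#!/usr/bin/env bash
# Mock of step 1: the core API *does* support the spec.nodeName field
# selector, so this would list every pod scheduled on the node as
# "<namespace> <name>" pairs. Sample data used here instead of oc.
list_node_pods() {
  printf '%s\n' \
    "istio-system istiod-abc" \
    "maistra-e2e test-pod"
}

# Step 2: the Metrics API only filters on metadata.name and
# metadata.namespace, so per-pod metrics must be fetched one at a time.
top_node_pods() {
  list_node_pods | while read -r ns pod; do
    # In the real script this line would execute the command it prints.
    echo "oc adm top pod ${pod} -n ${ns}"
  done
}

top_node_pods
```

In the real script, the `echo` body would become `oc adm top pod "${pod}" -n "${ns}" --no-headers || echo "Failed to get metrics for ${ns}/${pod}"`, preserving the existing fallback messages when metrics-server is unavailable.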

else
echo "WARNING: Could not determine node name for pod ${MAISTRA_SC_POD}"
fi

echo "=== END OOM DEBUG INFO ==="
}

# check_cluster_operators waits up to 15 minutes for all OpenShift cluster
# operators to be in a stable (not Progressing, not Degraded, and Available) state.
check_cluster_operators() {
@@ -127,6 +158,11 @@ execute_and_collect_artifacts() {
test_rc=$?
echo "Test run (attempt ${attempt}) completed with exit code ${test_rc}"

# Collect debug info if pod was killed (likely OOM)
if [[ "${test_rc}" -eq 137 ]]; then
collect_oom_debug_info
fi
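The check for 137 relies on the shell convention that a process terminated by signal N exits with status 128 + N: SIGKILL is signal 9, and the kernel OOM killer delivers SIGKILL, so an OOM-killed container surfaces as 128 + 9 = 137. A self-contained demonstration of that convention (not part of the PR):

```shell
#!/usr/bin/env bash
# Demonstrate the 128 + signal exit-status convention behind the
# `test_rc -eq 137` check: SIGKILL (signal 9) yields status 137.
sleep 30 &
pid=$!
kill -9 "${pid}"   # simulate the OOM killer's SIGKILL
wait "${pid}"      # wait reports 128 + 9 = 137 for the killed job
rc=$?
echo "exit code: ${rc}"
```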

echo "Copying artifacts from test pod after attempt ${attempt}..."
oc cp "${MAISTRA_NAMESPACE}"/"${MAISTRA_SC_POD}":"${ARTIFACT_DIR}"/. "${ARTIFACT_DIR}"
