add runbooks for new alerts #363

yati1998 · 2025-12-12T07:31:07Z

New alerts were introduced for the ODF
health score calculation. This commit adds
a runbook for each of them.

openshift-ci · 2025-12-12T07:31:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yati1998
Once this PR has been reviewed and has the lgtm label, please assign blaineexe for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

alerts/openshift-container-storage-operator/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yati1998 · 2025-12-12T07:31:40Z

@aruniiird @weirdwiz please take a look.

weirdwiz

needs some changes

weirdwiz · 2025-12-15T06:05:28Z

alerts/openshift-container-storage-operator/ODFCorePodRestarted.md

@@ -0,0 +1,37 @@
+# ODFDiskUtilizationHigh


Suggested change

-# ODFDiskUtilizationHigh
+# ODFCorePodRestarted

weirdwiz · 2025-12-15T06:05:51Z

alerts/openshift-container-storage-operator/ODFCorePodRestarted.md

+2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
+3. If node-related: Cordon and drain the node; replace if faulty.
+4. Ensure HA: MONs should be ≥3; OSDs should be distributed.
+5. Update: If due to a known bug, upgrade ODF to a fixed version.


add a new line here

Suggested change

5. Update: If due to a known bug, upgrade ODF to a fixed version.

5. Update: If due to a known bug, upgrade ODF to a fixed version.

weirdwiz · 2025-12-15T06:06:32Z

alerts/openshift-container-storage-operator/ODFDiskUtilizationHigh.md

No mitigation section? Please add mitigation steps.

I am not sure what mitigation steps should be added here, so I left it empty for now.
@weirdwiz if you have any suggestions, we can discuss them offline.

weirdwiz · 2025-12-15T06:07:38Z

alerts/openshift-container-storage-operator/ODFNodeMTULessThan9000.md

Same here: please add mitigation steps.

The MTU runbook should also mention how to verify that jumbo frames work end-to-end.

I am not sure about this; maybe we can work on it once you are back.
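
For reference, one common way to check jumbo frames end-to-end is to ping a peer node with fragmentation prohibited and a payload sized to fill exactly one frame. The sketch below only builds that command. It assumes IPv4 and the Linux iputils `ping` (which accepts `-M do`); the helper names and the node IP are hypothetical and are not taken from the runbooks in this PR.

```shell
# Sketch only: assumes IPv4 and iputils ping (Linux), which supports "-M do"
# (set the don't-fragment bit so oversized packets are rejected, not split).

# ICMP payload that exactly fills one frame:
# MTU - 20 bytes (IPv4 header) - 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Print (do not run) the probe command for a peer node.
# The second argument is the expected MTU and defaults to 9000.
jumbo_probe_cmd() {
  peer="$1"
  mtu="${2:-9000}"
  echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") $peer"
}

# 10.0.0.12 is a hypothetical node-internal IP used for illustration.
jumbo_probe_cmd 10.0.0.12
```

Running the printed command from one storage node against another should succeed only if every hop on the path carries 9000-byte frames. A "message too long" error or total packet loss at this size, while a default-size ping works, would point to a smaller MTU somewhere along the path.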

weirdwiz · 2025-12-15T06:08:55Z

alerts/openshift-container-storage-operator/ODFCorePodRestarted.md

Existing runbooks reference shared helper documents such as:

helpers/podDebug.md

helpers/troubleshootCeph.md

helpers/gatherLogs.md

helpers/networkConnectivity.md

The new runbooks embed all commands inline instead of referencing these. Consider linking to the helpers for consistency and maintainability.

weirdwiz · 2025-12-15T06:09:47Z

alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnOSDNodes.md

+ping <node-internal-ip>
+```
+4. Use mtr or traceroute to analyze path and hops.
+5. Verify if the node is under high CPU or network load:


Suggested change

-5. Verify if the node is under high CPU or network load:
+5. Verify if the node is under high CPU or network load:
+   oc debug node/<node>
+   top -b -n 1 | head -20
+   sar -u 1 5

weirdwiz · 2025-12-15T06:10:13Z

alerts/openshift-container-storage-operator/ODFNodeNICBandwidthSaturation.md

+sar -n DEV 1 5
+```
+3. Use Prometheus to graph:
+```prompql


Suggested change

-```prompql
+```promql

yati1998 · 2025-12-31T06:11:23Z

@weirdwiz I have updated the PR except for the two open comments; we can work on them once you are back.


openshift-ci · 2025-12-31T06:18:44Z

@yati1998: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | `4efb275` | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.


openshift-ci bot requested review from agarwal-mudit and malayparida2000 December 12, 2025 07:31

yati1998 force-pushed the odfhealthscore branch from 9a36696 to 9773226 Compare December 12, 2025 07:38

weirdwiz suggested changes Dec 15, 2025


yati1998 force-pushed the odfhealthscore branch from 9773226 to 2deb710 Compare December 31, 2025 06:09

add runbooks for new alerts

4efb275

There are new alerts introduced for the ODF health score calculation. This commit adds runbooks for each of them.

Signed-off-by: yati1998 <ypadia@redhat.com>

yati1998 force-pushed the odfhealthscore branch from 2deb710 to 4efb275 Compare December 31, 2025 06:16

