add runbooks for new alerts #363
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: yati1998. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing
@aruniiird @weirdwiz please do have a look.
9a36696 to 9773226
weirdwiz
left a comment
needs some changes
| @@ -0,0 +1,37 @@ | |||
| # ODFDiskUtilizationHigh | |||
| # ODFDiskUtilizationHigh | |
| # ODFCorePodRestarted |
| 2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities. | ||
| 3. If node-related: Cordon and drain the node; replace if faulty. | ||
| 4. Ensure HA: MONs should be ≥3; OSDs should be distributed. | ||
| 5. Update: If due to a known bug, upgrade ODF to a fixed version. No newline at end of file |
add a new line here
| 5. Update: If due to a known bug, upgrade ODF to a fixed version. | |
| 5. Update: If due to a known bug, upgrade ODF to a fixed version. | |
No mitigation section? Please add mitigation steps.
I am not sure which mitigation steps should be added here, so I left it empty for now.
@weirdwiz if you have any suggestions, we can discuss offline.
Same here: please add mitigation steps.
The MTU runbook should mention how to verify jumbo frames work end-to-end
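One common way to check this on Linux, offered here as a sketch and not as the runbook's agreed wording, is to ping with the Don't Fragment bit set and a payload sized to the expected MTU. The function names and the 9000-byte default are illustrative assumptions; `-M do` and `-s` are options of the iputils `ping` found on most Linux nodes.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
  mtu="$1"
  echo $(( mtu - 28 ))
}

# Hypothetical helper: send three pings with Don't Fragment set (-M do).
# If any hop on the path has a smaller MTU, the pings fail instead of
# being silently fragmented.
check_jumbo_path() {
  target="$1"
  mtu="${2:-9000}"   # assumed jumbo-frame MTU; adjust to the cluster's setting
  ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$target"
}

# Example, run from a node debug shell:
#   check_jumbo_path <node-internal-ip> 9000
```

A failure at the jumbo payload size combined with a success at `check_jumbo_path <node-internal-ip> 1500` would suggest that some hop is not configured for jumbo frames.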
I am not sure about this; maybe we can work on it once you are back.
Existing runbooks reference shared helper documents such as:
- helpers/podDebug.md
- helpers/troubleshootCeph.md
- helpers/gatherLogs.md
- helpers/networkConnectivity.md
The new runbooks embed all commands inline instead of referencing these. Consider using helper links for consistency and maintainability.
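A quick way to see which runbooks do not link to any helper is sketched below. It assumes, hypothetically, that the runbooks are `*.md` files in a single directory and that helper links contain the string `helpers/`; the function name is illustrative.

```shell
# List runbook files that contain no reference to a helper document.
# grep -L prints the names of files in which no line matches.
runbooks_without_helper_links() {
  dir="$1"   # assumed: directory holding the runbook .md files
  grep -L 'helpers/' "$dir"/*.md
}

# Example:
#   runbooks_without_helper_links path/to/runbooks
```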
| ping <node-internal-ip> | ||
| ``` | ||
| 4. Use mtr or traceroute to analyze path and hops. | ||
| 5. Verify if the node is under high CPU or network load: |
| 5. Verify if the node is under high CPU or network load: | |
| 5. Verify if the node is under high CPU or network load: | |
| oc debug node/<node> | |
| top -b -n 1 | head -20 | |
| sar -u 1 5 |
| sar -n DEV 1 5 | ||
| ``` | ||
| 3. Use Prometheus to graph: | ||
| ```prompql |
| ```prompql | |
| ```promql |
9773226 to 2deb710
@weirdwiz I have updated the PR except for the 2 open comments; we can work on them once you are back.
New alerts were introduced for the ODF health score calculation. This commit adds runbooks for each of them. Signed-off-by: yati1998 <ypadia@redhat.com>
2deb710 to 4efb275
@yati1998: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.