sled-diagnostics: Capture nvmeadm health logpage #10031
wfchandler wants to merge 1 commit into main
b1ba6d8 to e15ebe2
rmustacc
left a comment
Is this the first thing we're adding that relies on the disk working correctly to finish generating a support bundle? Assume we have a device that is hanging or cannot complete commands: how do we ensure that we don't hang the entire support bundle generation process?
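To make the concern concrete, here is a minimal sketch of one way to bound how long a diagnostic command may run, using only the Rust standard library. The helper name `run_with_timeout` and the polling interval are hypothetical, and this is not how sled-diagnostics actually runs its commands. Output capture is omitted for brevity. Note that a process blocked inside the kernel on an unresponsive device may not exit even after being killed, so the sketch deliberately avoids a blocking wait after the deadline.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `cmd` and returns `Ok(Some(status))` if it exits before `timeout`.
/// Returns `Ok(None)` if the deadline passes first, after a best-effort kill.
/// Hypothetical helper for illustration only.
fn run_with_timeout(mut cmd: Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    // Discard output so the child can never block on a full pipe.
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;

    loop {
        // Non-blocking check for exit.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Best effort: the kill signal may not take effect while the
            // process is stuck in the kernel, so do not block waiting here.
            let _ = child.kill();
            let _ = child.try_wait();
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

A caller would treat `Ok(None)` as "this command timed out", record that in the bundle, and move on to the next command rather than stalling the whole collection.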
    pub fn nvmeadm_logpage_health(nvme_num: u32) -> Command {
        let mut cmd = std::process::Command::new(PFEXEC);
        cmd.env_clear()
            .arg(NVMEADM)
            .arg("-v")
            .arg("get-logpage")
            .arg(&format!("nvme{nvme_num}"))
            .arg("health");
        cmd
    }
If we're going to invoke this, please just use the -O option to get-logpage to send this entirely to a binary file that can be interpreted more efficiently with tools.
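As a rough sketch of what that suggestion might look like next to the function under review: the variant below passes `-O` with an output path so the raw log page lands in a file. The function name `nvmeadm_logpage_health_binary`, the constant values, and the placement of `-O` before the device argument are all assumptions made for illustration. Check `nvmeadm(8)` for the exact synopsis before relying on it.

```rust
use std::path::Path;
use std::process::Command;

// Assumed paths for illustration; the real crate defines its own constants.
const PFEXEC: &str = "/usr/bin/pfexec";
const NVMEADM: &str = "/usr/sbin/nvmeadm";

/// Builds a command that asks `nvmeadm get-logpage` to write the raw health
/// log page to `out` instead of printing text. Hypothetical variant; the
/// argument order is an assumption.
pub fn nvmeadm_logpage_health_binary(nvme_num: u32, out: &Path) -> Command {
    let mut cmd = Command::new(PFEXEC);
    cmd.env_clear()
        .arg(NVMEADM)
        .arg("get-logpage")
        .arg("-O")
        .arg(out)
        .arg(format!("nvme{nvme_num}"))
        .arg("health");
    cmd
}
```

The binary file would then be added to the bundle as-is and decoded later with whatever tooling is available.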
@rmustacc I think this is a "yes and" situation. When we're specifically looking for a problem with a disk, the binary files are superior. In scenarios where we're just performing a quick health check against the bundle, it's more convenient to have text output.
Text files can be analyzed on non-illumos hosts, and are trivial to read without extracting the files, e.g., `bundle-cat bundle.zip --path '*logpage*' | grep -A 4 "Critical Warnings"`.
Happy to make a follow-on PR for the binary health log page, and any others you want.
The text output format is not a stable interface and is going to change. So I think it's critical if we're going to build tooling on top of this that we're doing something that is going to continue to work and not silently break.
It looks like the print-logpage CL you have in flight for illumos will cover both of our needs, or maybe get-logpage -p.
Perhaps I should just close this PR and wait for those commands to be available.
The same features for print-logpage work for get-logpage. However, if I were doing the support bundle, I would again just gather the thing we want once and then do whatever we want after the fact. Note, the changes going in there don't touch the extent logs today, but will in the future.
During recent customer installs we have found that the `health` logpage exposed by `nvmeadm(8)` was useful in identifying failing drives. Add this output to support bundles.
e15ebe2 to cb97c8d
No, these commands (and all others in |