virtcontainers: Add automatic crashdump collection for Cloud Hypervisor #421
base: msft-main
Conversation
Force-pushed from 1575712 to 1b77ec9
	virtioFsSocket = "virtiofsd.sock"
	defaultClhPath = "/usr/local/bin/cloud-hypervisor"
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
Do we know how the k8s control plane reacts during this coredump phase? Does it see the pod as dead the moment the kernel crashes (so that it can move to restart the pod) or do we essentially have a zombie pod preventing replacement for the duration of coredump collection (potentially 5 minutes)?
It's the second case, actually. K8s sees the zombie pod but still considers it running while the coredump is being collected, and then reports the error when stopping the pod. No restart happens. The 5-minute timeout is there for potentially large coredumps; in the usual UVM scenarios this completes quickly (around 30-40 seconds).
Nevertheless, I am working on an implementation to handle some of these edge cases, where the timeout could be reduced and a cancellable context exists for the pod, so that collecting coredumps doesn't block pod replacement for too long.
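A minimal sketch of what such a cancellable, time-bounded collection could look like. The helper `dumpGuestMemory`, the 5-minute budget, and the wiring to a parent context are assumptions for illustration, not the PR's actual API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// dumpGuestMemory is a stand-in for the real coredump call (hypothetical).
func dumpGuestMemory(ctx context.Context, path string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the dump takes 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collectCrashdump bounds the collection: it ends when the dump finishes,
// the timeout expires, or the parent context is cancelled (e.g. pod teardown).
func collectCrashdump(parent context.Context, dumpSavePath string) error {
	ctx, cancel := context.WithTimeout(parent, 5*time.Minute)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- dumpGuestMemory(ctx, dumpSavePath) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("coredump collection aborted: %w", ctx.Err())
	}
}

func main() {
	if err := collectCrashdump(context.Background(), "/var/crash/sandbox-id"); err != nil {
		fmt.Println(err)
	}
}
```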
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
	// Timeout for waiting for event monitor file to be created
	clhEventMonitorFileTimeout = 30
Can you name the variable such that the units are clear, e.g. `clhEventMonitorFileTimeoutSeconds`?
I feel this way is more consistent with the rest of the code.
I agree with @sprt here. This is more consistent with the rest of the constants in the code.
My 2 cents: I see this as a core readability improvement, not a huge divergence from the coding pattern.
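For illustration, two ways the unit could be made explicit — either in the identifier or via the type (these names are suggestions, not the PR's final choice):

```go
package clh

import "time"

const (
	// Option 1: encode the unit in the identifier.
	clhEventMonitorFileTimeoutSeconds = 30

	// Option 2: let time.Duration carry the unit, so call sites can pass
	// it straight to time.After or context.WithTimeout.
	clhEventMonitorFileTimeoutDuration = 30 * time.Second
)
```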
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
	// Timeout for waiting for event monitor file to be created
	clhEventMonitorFileTimeout = 30
BTW, wouldn't very large files potentially take more than 30 seconds to be written out? Should the strategy here be to scale the timeout based on the file size?
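A rough sketch of what scaling the wait by expected dump size could look like. The throughput figure and the helper name are purely illustrative assumptions:

```go
package clh

import "time"

const (
	minEventFileWait = 30 * time.Second
	// Assumed sustained write throughput, for illustration only.
	assumedWriteMiBPerSec = 100
)

// eventFileWaitFor scales the wait with the expected dump size in MiB,
// never dropping below the 30-second floor.
func eventFileWaitFor(expectedDumpMiB uint64) time.Duration {
	scaled := time.Duration(expectedDumpMiB/assumedWriteMiBPerSec) * time.Second
	if scaled < minEventFileWait {
		return minEventFileWait
	}
	return scaled
}
```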
for {
	select {
	case <-timeoutChan:
If a very large memory dump is created while the Kata host is restarting, would Kata be stuck waiting for the memory dump to complete? Now, I agree that with the current code, that timeout would be 30 seconds at most, but during a host restart, it can represent a significant amount of time waiting. Thoughts?
// Create context for event reading
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Are you using "cancel" anywhere? I would need to look into it further, but that may be one of the ways to cancel a pending copy if Kata is restarted during copy of the crash dump file.
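One hedged way `cancel` could be put to use: stash it on the hypervisor struct and invoke it from the shutdown path, so a pending dump copy observes `ctx.Done()` when Kata restarts. Field and method names below are illustrative, not the PR's code:

```go
package clh

import (
	"context"
	"sync"
)

type cloudHypervisor struct {
	mu          sync.Mutex
	eventCancel context.CancelFunc // set when event monitoring starts
}

// startEventMonitor derives a cancellable context for event reading and
// remembers its cancel function.
func (clh *cloudHypervisor) startEventMonitor(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	clh.mu.Lock()
	clh.eventCancel = cancel
	clh.mu.Unlock()
	return ctx
}

// terminate cancels the event-reading context so any in-flight copy of the
// crash dump stops promptly instead of running to completion.
func (clh *cloudHypervisor) terminate() {
	clh.mu.Lock()
	cancel := clh.eventCancel
	clh.mu.Unlock()
	if cancel != nil {
		cancel()
	}
}
```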
src/runtime/virtcontainers/clh.go (Outdated)
	clh.Logger().WithError(err).WithField("dumpSavePath", dumpSavePath).Error("failed to call Statfs")
	return nil
}
availableSpaceInBytes := fs.Bavail * uint64(fs.Bsize)
[nitpick] Why use bytes here? Would we ever need anything at the granularity of a byte?
Typically, storage is represented in megabytes/mebibytes (even for tiny computers). If you stick to MiB, the point at which an overflow could occur is pushed much further out. When it comes to disk space, MiB is also arguably more readable for humans than bytes.
Ah right, thanks for pointing this out! I probably missed this one since I took the implementation from QEMU (and that code is very old). I've now updated the implementation to use MiB.
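A small sketch of a MiB-based check along these lines (Linux-only via `syscall.Statfs`; the helper names are illustrative, not the updated PR code):

```go
package clh

import (
	"fmt"
	"syscall"
)

// availableMiB returns the free space on the filesystem containing path, in MiB.
func availableMiB(path string) (uint64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(path, &fs); err != nil {
		return 0, err
	}
	return (fs.Bavail * uint64(fs.Bsize)) >> 20, nil
}

// checkDumpSpace compares available space with the expected dump size, both in MiB.
func checkDumpSpace(path string, expectedMemoryMiB uint64) error {
	avail, err := availableMiB(path)
	if err != nil {
		return err
	}
	if avail < expectedMemoryMiB {
		return fmt.Errorf("not enough free space for memory dump: need %d MiB, have %d MiB",
			expectedMemoryMiB, avail)
	}
	return nil
}
```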
src/runtime/virtcontainers/clh.go (Outdated)
// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}
Why do we want to blindly copy recursively here? Can we be more specific to avoid any potential security issue?
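To illustrate the suggestion, a sketch that copies only an explicit allow-list of state files with `io.Copy` instead of shelling out to a recursive `cp`. The file names listed here are hypothetical; the real sandbox state layout may differ:

```go
package clh

import (
	"io"
	"os"
	"path/filepath"
)

// copyStateFiles copies a known set of sandbox state files rather than
// recursively copying the whole state directory.
func copyStateFiles(statePath, dumpStatePath string) error {
	// Hypothetical allow-list for illustration only.
	files := []string{"state.json", "config.json"}

	if err := os.MkdirAll(dumpStatePath, 0o700); err != nil {
		return err
	}
	for _, name := range files {
		src, err := os.Open(filepath.Join(statePath, name))
		if err != nil {
			return err
		}
		dst, err := os.OpenFile(filepath.Join(dumpStatePath, name),
			os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
		if err != nil {
			src.Close()
			return err
		}
		_, err = io.Copy(dst, src)
		src.Close()
		dst.Close()
		if err != nil {
			return err
		}
	}
	return nil
}
```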
// Set panic behavior based on crashdump configuration
if crashdumpEnabled {
	// Don't reboot on panic - wait for crashdump collection
	params = append(params, Param{"panic", "0"})
} else {
	// Reboot after 1 second on panic (normal behavior)
	params = append(params, Param{"panic", "1"})
}

params = append(params, clhKernelParams...)
- This way, wouldn't the overridden `panic` parameter appear before the default set in `clhKernelParams`? Is that ok?
- I don't think we need the `else` branch if that's the default in `clhKernelParams`.
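To illustrate the second point, a sketch that keeps the defaults first and only appends the `panic` override when crashdumps are enabled (so the override appears later on the command line, which is typically what takes effect for repeated kernel params). The `Param` type and the assumption that `panic=1` already lives in `clhKernelParams` are illustrative:

```go
package clh

type Param struct {
	Key   string
	Value string
}

// Hypothetical defaults; assumes panic=1 is already the default here.
var clhKernelParams = []Param{{"panic", "1"}}

// kernelParams starts from the defaults and overrides panic only when needed,
// removing the need for an else branch.
func kernelParams(crashdumpEnabled bool) []Param {
	params := append([]Param{}, clhKernelParams...)
	if crashdumpEnabled {
		// Don't reboot on panic, so the crashdump can be collected.
		params = append(params, Param{"panic", "0"})
	}
	return params
}
```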
src/runtime/virtcontainers/clh.go (Outdated)
const (
	// Memory dump format will be set to elf
	clhMemoryDumpFormat = "elf"
)
nit: This could be collapsed into the previous const block, and the comment isn't crucial.
Updated
// Safely stop event monitoring - uses sync.Once to prevent double-close panic
clh.stopEventMonitor()
Why do we need sync.Once? Can terminate() be called more than once?
Yes, `terminate()` can be called from multiple places (user, virtiofsd callback):
- Multiple goroutines might call `StopVM()` simultaneously. The `clh.mu` lock protects the atomic flag check/set, but `terminate()` still executes and calls `stopEventMonitor()`.
- If virtiofsd crashes, its callback triggers `StopVM()`. The user might also call `StopVM()` around the same time, which creates a race condition!

The atomic `stopped` flag prevents re-execution in most cases, but there's still a small race window between concurrent calls. Closing an already-closed channel panics, and `sync.Once` prevents this.
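A minimal sketch of the `sync.Once`-guarded close being described (simplified, not the PR's exact code):

```go
package clh

import "sync"

type eventMonitor struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

// stop closes stopCh exactly once, even if terminate() is reached
// concurrently from StopVM() and the virtiofsd crash callback.
func (m *eventMonitor) stop() {
	m.stopOnce.Do(func() {
		close(m.stopCh)
	})
}
```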
src/runtime/virtcontainers/clh.go (Outdated)
if err := os.Remove(eventMonitorPath); err != nil && !os.IsNotExist(err) {
	clh.Logger().WithError(err).WithField("path", eventMonitorPath).Warn("removing event monitor file failed")
Why do we ignore os.IsNotExist(err)?
Handled
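For reference, the modern `errors.Is` spelling of the same check — whether a missing file should be treated as benign is exactly the question raised above; this sketch only restates the pattern:

```go
package clh

import (
	"errors"
	"log"
	"os"
)

// removeEventMonitorFile removes the event monitor file and warns only on
// real failures; the file already being gone is treated as benign here.
func removeEventMonitorFile(path string) {
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		log.Printf("removing event monitor file %q failed: %v", path, err)
	}
}
```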
src/runtime/virtcontainers/clh.go (Outdated)
func (clh *cloudHypervisor) handleCLHEvent(eventJSON string) {
	clh.Logger().WithField("event", eventJSON).Debug("Received CLH event")

	var event map[string]interface{}
nit: If events have a schema it would be easier to unmarshal into a struct.
Updated!
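A sketch of what unmarshalling into a typed struct could look like; the field names are assumptions about the CLH event schema and should be checked against Cloud Hypervisor's docs:

```go
package clh

import (
	"encoding/json"
	"log"
)

// clhEvent is a hypothetical shape for Cloud Hypervisor event-monitor records.
type clhEvent struct {
	Source string `json:"source"`
	Event  string `json:"event"`
}

func handleCLHEvent(eventJSON string) {
	var ev clhEvent
	if err := json.Unmarshal([]byte(eventJSON), &ev); err != nil {
		log.Printf("failed to parse CLH event: %v", err)
		return
	}
	if ev.Source == "vm" && ev.Event == "panic" {
		// Trigger crashdump collection here.
		log.Printf("guest panic detected")
	}
}
```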
src/runtime/virtcontainers/clh.go (Outdated)
	return fmt.Errorf("There is not enough free space to store memory dump file. Expected %d bytes, but only %d bytes available", expectedMemorySize, availableSpaceInBytes)
}

func (clh *cloudHypervisor) handleGuestPanic() {
nit: I would remove this function since its only role is to call another one.
Done!
if dumpSavePath == "" {
	return nil
}
nit: We should probably do this check before we start the watcher loop - surprised to see it so far down in the call stack.
Actually, in the `launchClh()` function we check `clh.config.IfPVPanicEnabled()`, which essentially checks that `dumpSavePath` is not empty. So we are already checking this very early in the code.
src/runtime/virtcontainers/clh.go (Outdated)
// Check device free space and estimated dump size
if err := clh.canDumpGuestMemory(dumpSavePath); err != nil {
	clh.Logger().Warnf("Can't dump guest memory: %s", err.Error())
nit: Use WithError() instead.
Updated!
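For reference, the `WithError()` form being suggested — both calls are standard logrus APIs; the message text here is illustrative:

```go
// Before:
clh.Logger().Warnf("Can't dump guest memory: %s", err.Error())

// After: structured logging, consistent with the other call sites in clh.go.
clh.Logger().WithError(err).Warn("cannot dump guest memory")
```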
src/runtime/virtcontainers/clh.go (Outdated)
// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}
We should use io.Copy() instead of calling cp.
Updated to use the already existing `fs.CopyDir` function.
Force-pushed from 305e714 to 48201ce
Implement automatic guest memory dump collection when a guest VM panics in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.

Implementation:
- Monitor CLH event socket for panic events with non-blocking I/O
- Enable pvpanic device and configure event-monitor when crashdump enabled
- Set panic=-1 kernel param to prevent reboot during memory dump
- Dump guest memory to ELF format with hypervisor metadata
- Add guest_memory_dump_path option to configuration-clh.toml.in

Memory dumps saved to <guest_memory_dump_path>/<sandbox-id>/ including vmcore ELF file, hypervisor config/version, and sandbox state.

Requires CLH built with guest_debug feature.

This enables automated crash analysis workflows for Kata Containers with Cloud Hypervisor, similar to existing QEMU functionality.

Signed-off-by: Ankita Pareek <ankitapareek@microsoft.com>
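A rough sketch of the event-monitoring loop this commit message describes: poll the event-monitor output for panic records and react. Whether CLH emits events over a socket or a file, the record format, the path, and the polling interval are all assumptions here (reusing the hypothetical `clhEvent` shape from the earlier sketch):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type clhEvent struct {
	Source string `json:"source"`
	Event  string `json:"event"`
}

// watchEvents polls an event-monitor file for newline-delimited JSON records
// and returns when a guest panic is seen or stop is closed.
func watchEvents(path string, stop <-chan struct{}) error {
	var offset int64
	for {
		select {
		case <-stop:
			return nil
		case <-time.After(500 * time.Millisecond):
		}
		f, err := os.Open(path)
		if err != nil {
			continue // the file may not exist yet
		}
		if _, err := f.Seek(offset, 0); err == nil {
			sc := bufio.NewScanner(f)
			for sc.Scan() {
				line := sc.Bytes()
				offset += int64(len(line)) + 1 // assumes "\n"-terminated records
				var ev clhEvent
				if json.Unmarshal(line, &ev) == nil && ev.Source == "vm" && ev.Event == "panic" {
					f.Close()
					fmt.Println("guest panic detected; collecting crashdump")
					return nil
				}
			}
		}
		f.Close()
	}
}

func main() {
	stop := make(chan struct{})
	go func() { time.Sleep(3 * time.Second); close(stop) }()
	_ = watchEvents("/run/vc/clh-events.json", stop) // hypothetical path
}
```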
Force-pushed from 48201ce to ea1871a
Merge Checklist
Summary
Implement automatic guest memory dump collection when a guest VM panics in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.
Implementation:
- `panic=0` kernel cmdline param to prevent reboot during memory dump
- `guest_memory_dump_path` option to configuration-clh.toml.in

Memory dumps are saved to `<guest_memory_dump_path>/<sandbox-id>/`, including the vmcore ELF file, hypervisor config/version, and sandbox state.
Requires CLH built with the `guest_debug` feature.

This enables automated crash analysis workflows for Kata Containers with Cloud Hypervisor, similar to existing QEMU functionality.
Associated issues
Test Methodology