Conversation

@Ankita13-code Ankita13-code commented Nov 25, 2025

Merge Checklist
  • Followed patch format from upstream recommendation: https://github.com/kata-containers/community/blob/main/CONTRIBUTING.md#patch-format
  • Included a single commit in a given PR - at least unless there are related commits and each makes sense as a change on its own.
  • Merged using "create a merge commit" rather than "squash and merge" (or similar)
  • genPolicy only: Builds on Windows
  • genPolicy only: Updated sample YAMLs' policy annotations, if applicable
Summary

Implement automatic guest memory dump collection when a guest VM panics in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.

Implementation:

  • Monitor CLH event file for panic events with non-blocking I/O
  • Enable pvpanic device and configure event-monitor when crashdump enabled
  • Set panic=0 kernel cmdline param to prevent reboot during memory dump
  • Dump guest memory to ELF format with hypervisor metadata
  • Add guest_memory_dump_path option to configuration-clh.toml.in

Memory dumps are saved to <guest_memory_dump_path>/<sandbox-id>/, including the vmcore ELF file, hypervisor config/version, and sandbox state.

Requires CLH built with guest_debug feature.

This enables automated crash analysis workflows for Kata Containers with Cloud Hypervisor, similar to existing QEMU functionality.

Associated issues
Test Methodology

@Ankita13-code Ankita13-code force-pushed the ankitaparek/enable-clh-auto-crashdump branch 7 times, most recently from 1575712 to 1b77ec9 Compare December 1, 2025 13:06
@Ankita13-code Ankita13-code marked this pull request as ready for review December 1, 2025 13:07
@Ankita13-code Ankita13-code requested review from a team as code owners December 1, 2025 13:07
virtioFsSocket = "virtiofsd.sock"
defaultClhPath = "/usr/local/bin/cloud-hypervisor"
// Timeout for coredump operation - memory dumps can take significant time
clhCoredumpTimeout = 300 // 5 minutes
Do we know how the k8s control plane reacts during this coredump phase? Does it see the pod as dead the moment the kernel crashes (so that it can move to restart the pod) or do we essentially have a zombie pod preventing replacement for the duration of coredump collection (potentially 5 minutes)?

Author:

It's the second case, actually. K8s sees the zombie pod and still considers it running while the coredump is being collected, then reports an error while stopping the pod. No restart happens. The 5-minute timeout is kept for potentially large coredumps; in the usual UVM scenarios, however, the dump completes quickly (around 30-40 seconds).

Nevertheless, I am working on an implementation to handle some of these edge cases, where the timeout could be reduced and a cancellable context exists for the pod, so that collecting coredumps doesn't block pod replacement for too long.

// Timeout for coredump operation - memory dumps can take significant time
clhCoredumpTimeout = 300 // 5 minutes
// Timeout for waiting for event monitor file to be created
clhEventMonitorFileTimeout = 30
Member:
Can you name the variable such that we are clear what the units are here? i.e. clhEventMonitorFileTimeoutSeconds

I feel this way is more consistent with the rest of the code.

Author:

I agree with @sprt here. This is more consistent with the rest of the constants in the code.

My 2 cents: I see this as a core readability improvement and not a huge divergence from the coding pattern.

// Timeout for coredump operation - memory dumps can take significant time
clhCoredumpTimeout = 300 // 5 minutes
// Timeout for waiting for event monitor file to be created
clhEventMonitorFileTimeout = 30
Member:

BTW, wouldn't very large files potentially take more than 30 seconds to be written down? Should the strategy here to scale the timeout based on the file size?


for {
	select {
	case <-timeoutChan:
Member:

If a very large memory dump is created while the Kata host is restarting, would Kata be stuck waiting for the memory dump to complete? Now, I agree that with the current code, that timeout would be 30 seconds at most, but during a host restart, it can represent a significant amount of time waiting. Thoughts?


// Create context for event reading
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Member:

Are you using "cancel" anywhere? I would need to look into it further, but that may be one of the ways to cancel a pending copy if Kata is restarted during copy of the crash dump file.

clh.Logger().WithError(err).WithField("dumpSavePath", dumpSavePath).Error("failed to call Statfs")
return nil
}
availableSpaceInBytes := fs.Bavail * uint64(fs.Bsize)
Member:

[nitpick] Why using bytes here? Would we ever need anything at the granularity of a byte?
Typically, storage is represented in megabytes/mebibytes (even for tiny computers). If you stick to MiB, then the likelihood of overflowing in the future is greatly delayed. When it comes to disk space, MiB is also arguably more readable by humans than bytes.

Author:

Ahh right, thanks for pointing this out! I probably missed this one since I took the implementation from QEMU (and that code is very old). I've now updated the implementation to use MiB.
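The updated free-space guard presumably looks something like this (a sketch, not the PR's exact code; bavail/bsize mirror the syscall.Statfs_t fields already used above):

```go
package main

import "fmt"

// availableMiB converts statfs block counts into MiB, as suggested in
// review, instead of carrying raw byte counts around.
func availableMiB(bavail uint64, bsize int64) uint64 {
	return bavail * uint64(bsize) / (1024 * 1024)
}

// canDump checks the free-space guard in MiB terms.
func canDump(expectedMiB, availMiB uint64) error {
	if expectedMiB > availMiB {
		return fmt.Errorf("not enough free space for memory dump: need %d MiB, only %d MiB available", expectedMiB, availMiB)
	}
	return nil
}

func main() {
	avail := availableMiB(262144, 4096) // 262144 blocks of 4 KiB = 1024 MiB
	fmt.Println(avail, canDump(2048, avail) != nil)
}
```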


// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}
Member:

Why do we want to blindly copy recursively here? Can we be more specific to avoid any potential security issue?

Comment on lines +482 to 500

// Set panic behavior based on crashdump configuration
if crashdumpEnabled {
	// Don't reboot on panic - wait for crashdump collection
	params = append(params, Param{"panic", "0"})
} else {
	// Reboot after 1 second on panic (normal behavior)
	params = append(params, Param{"panic", "1"})
}

params = append(params, clhKernelParams...)

  1. This way, wouldn't the overridden panic parameter appear before the default set in clhKernelParams? Is that ok?
  2. I don't think we need the else branch if that's the default in clhKernelParams.
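On point 1: as far as I know the kernel applies command-line parameters in order, so for a repeated parameter the last occurrence wins. Replacing the parameter in place (or appending the override after the defaults) sidesteps the ordering question entirely. A sketch with a simplified stand-in for the runtime's Param type:

```go
package main

import "fmt"

// Param is a simplified stand-in for the runtime's kernel-parameter type.
type Param struct {
	Key   string
	Value string
}

// setParam replaces an existing key in place, or appends it, so only one
// panic= setting ever reaches the kernel command line.
func setParam(params []Param, key, value string) []Param {
	for i := range params {
		if params[i].Key == key {
			params[i].Value = value
			return params
		}
	}
	return append(params, Param{key, value})
}

func main() {
	defaults := []Param{{"panic", "1"}, {"console", "ttyS0"}}
	params := setParam(defaults, "panic", "0") // crashdump enabled: override default
	fmt.Println(params)
}
```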

Comment on lines 99 to 103
const (
	// Memory dump format will be set to elf
	clhMemoryDumpFormat = "elf"
)

nit: This could be collapsed into the previous const block, and the comment isn't crucial.

Author:

Updated

Comment on lines +1333 to +1342
// Safely stop event monitoring - uses sync.Once to prevent double-close panic
clh.stopEventMonitor()

Why do we need sync.Once? Can terminate() be called more than once?

Author (@Ankita13-code, Dec 15, 2025):

Yes, terminate() can be called from multiple places (user, virtiofsd callback):

  1. Multiple goroutines might call StopVM() simultaneously. The clh.mu lock protects the atomic flag check/set, but terminate() still executes and calls stopEventMonitor().
  2. If virtiofsd crashes, its callback triggers StopVM(). The user might also call StopVM() around the same time, which creates a race condition.

The atomic stopped flag prevents re-execution in most cases, but there is still a small race window between concurrent calls. Closing an already-closed channel panics, and sync.Once prevents that.
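The double-close guard described here can be demonstrated in isolation (a minimal sketch — the type and method names are illustrative, not the PR's):

```go
package main

import (
	"fmt"
	"sync"
)

// eventMonitor shows the double-close guard: stop() may be called
// concurrently from several paths, but the channel is closed exactly once.
type eventMonitor struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newEventMonitor() *eventMonitor {
	return &eventMonitor{stopCh: make(chan struct{})}
}

func (m *eventMonitor) stop() {
	m.stopOnce.Do(func() { close(m.stopCh) })
}

func main() {
	m := newEventMonitor()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // user StopVM, virtiofsd callback, etc.
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.stop() // safe: without sync.Once a second close would panic
		}()
	}
	wg.Wait()
	<-m.stopCh // channel is closed; receive returns immediately
	fmt.Println("stopped once, no panic")
}
```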

Comment on lines 2058 to 2059
if err := os.Remove(eventMonitorPath); err != nil && !os.IsNotExist(err) {
clh.Logger().WithError(err).WithField("path", eventMonitorPath).Warn("removing event monitor file failed")

Why do we ignore os.IsNotExist(err)?

Author:

Handled

func (clh *cloudHypervisor) handleCLHEvent(eventJSON string) {
clh.Logger().WithField("event", eventJSON).Debug("Received CLH event")

var event map[string]interface{}

nit: If events have a schema it would be easier to unmarshal into a struct.
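A typed version of the event parsing might look like this (field names follow CLH's event-monitor JSON shape as I understand it, and may not match the PR's final struct exactly):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clhEvent is a typed view of one event-monitor record.
type clhEvent struct {
	Source string `json:"source"`
	Event  string `json:"event"`
}

// parseEvent unmarshals a single JSON event line into the struct,
// replacing the map[string]interface{} approach.
func parseEvent(line string) (clhEvent, error) {
	var ev clhEvent
	err := json.Unmarshal([]byte(line), &ev)
	return ev, err
}

func main() {
	ev, err := parseEvent(`{"source":"vm","event":"panic"}`)
	fmt.Println(err == nil, ev.Source == "vm" && ev.Event == "panic")
}
```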

Author:

Updated!

return fmt.Errorf("There is not enough free space to store memory dump file. Expected %d bytes, but only %d bytes available", expectedMemorySize, availableSpaceInBytes)
}

func (clh *cloudHypervisor) handleGuestPanic() {

nit: I would remove this function since its only role is to call another one.

Author:

Done!

Comment on lines +1731 to +1741
if dumpSavePath == "" {
return nil
}

nit: We should probably do this check before we start the watcher loop - surprised to see it so far down in the call stack.

Author (@Ankita13-code, Dec 15, 2025):

Actually, in the launchClh() function we check clh.config.IfPVPanicEnabled(), which essentially checks that dumpSavePath is not empty. Hence we are already checking this very early in the code.
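For context, the gate described here boils down to a dump-path check, along these lines (a pared-down sketch of the config type, not the runtime's actual struct):

```go
package main

import "fmt"

// hypervisorConfig is a pared-down stand-in for the runtime's config.
type hypervisorConfig struct {
	GuestMemoryDumpPath string
}

// IfPVPanicEnabled mirrors the check discussed above: pvpanic (and hence
// the event watcher) is enabled exactly when a dump path is configured.
func (c *hypervisorConfig) IfPVPanicEnabled() bool {
	return c.GuestMemoryDumpPath != ""
}

func main() {
	on := hypervisorConfig{GuestMemoryDumpPath: "/var/crash"}
	off := hypervisorConfig{}
	fmt.Println(on.IfPVPanicEnabled(), off.IfPVPanicEnabled())
}
```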


// Check device free space and estimated dump size
if err := clh.canDumpGuestMemory(dumpSavePath); err != nil {
clh.Logger().Warnf("Can't dump guest memory: %s", err.Error())

nit: Use WithError() instead.

Author:

Updated!


// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}

We should use io.Copy() instead of calling cp.

Author:

Updated to use the existing fs.CopyDir function.

@Ankita13-code Ankita13-code force-pushed the ankitaparek/enable-clh-auto-crashdump branch 2 times, most recently from 305e714 to 48201ce Compare December 15, 2025 10:27
Implement automatic guest memory dump collection when a guest VM panics
in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.

Implementation:
- Monitor CLH event socket for panic events with non-blocking I/O
- Enable pvpanic device and configure event-monitor when crashdump enabled
- Set panic=-1 kernel param to prevent reboot during memory dump
- Dump guest memory to ELF format with hypervisor metadata
- Add guest_memory_dump_path option to configuration-clh.toml.in

Memory dumps saved to <guest_memory_dump_path>/<sandbox-id>/ including
vmcore ELF file, hypervisor config/version, and sandbox state.

Requires CLH built with guest_debug feature.

This enables automated crash analysis workflows for Kata Containers with
Cloud Hypervisor, similar to existing QEMU functionality.

Signed-off-by: Ankita Pareek <ankitapareek@microsoft.com>
@Ankita13-code Ankita13-code force-pushed the ankitaparek/enable-clh-auto-crashdump branch from 48201ce to ea1871a Compare December 16, 2025 09:18
6 participants