virtcontainers: Add automatic crashdump collection for Cloud Hypervisor #421
base: msft-main
Conversation
Force-pushed from 1575712 to 1b77ec9
	virtioFsSocket = "virtiofsd.sock"
	defaultClhPath = "/usr/local/bin/cloud-hypervisor"
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
Do we know how the k8s control plane reacts during this coredump phase? Does it see the pod as dead the moment the kernel crashes (so that it can move to restart the pod) or do we essentially have a zombie pod preventing replacement for the duration of coredump collection (potentially 5 minutes)?
It's the second case, actually. K8s sees the zombie pod but still considers it running while the coredump is being collected, and then reports the error when stopping the pod. No restart happens. The 5-minute timeout is there for potentially large coredumps; in the usual UVM scenarios this completes quickly (around 30-40 seconds).
Nevertheless, I am working on an implementation to handle some of these edge cases, where the timeout could be reduced and a cancellable context exists for the pod, so that collecting coredumps doesn't block pod replacement for too long.
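A minimal sketch of what such a cancellable, time-bounded collection could look like. The helper `dumpGuestMemory`, the 5-minute budget, and the wiring to a parent context are assumptions for illustration, not the PR's actual API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// dumpGuestMemory is a stand-in for the real coredump call (hypothetical).
func dumpGuestMemory(ctx context.Context, path string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the dump takes 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collectCrashdump bounds the collection: it ends when the dump finishes,
// the timeout expires, or the parent context is cancelled (e.g. pod teardown).
func collectCrashdump(parent context.Context, dumpSavePath string) error {
	ctx, cancel := context.WithTimeout(parent, 5*time.Minute)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- dumpGuestMemory(ctx, dumpSavePath) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("coredump collection aborted: %w", ctx.Err())
	}
}

func main() {
	if err := collectCrashdump(context.Background(), "/var/crash/sandbox-id"); err != nil {
		fmt.Println(err)
	}
}
```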
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
	// Timeout for waiting for event monitor file to be created
	clhEventMonitorFileTimeout = 30
Can you name the variable such that the units are clear, e.g. `clhEventMonitorFileTimeoutSeconds`?
I feel this way is more consistent with the rest of the code.
I agree with @sprt here. This is more consistent with the rest of the constants in the code.
My 2 cents: I see this as a core readability improvement, not a huge divergence from the coding pattern.
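For illustration, two ways the unit could be made explicit — either in the identifier or via the type (these names are suggestions, not the PR's final choice):

```go
package clh

import "time"

const (
	// Option 1: encode the unit in the identifier.
	clhEventMonitorFileTimeoutSeconds = 30

	// Option 2: let time.Duration carry the unit, so call sites can pass
	// it straight to time.After or context.WithTimeout.
	clhEventMonitorFileTimeoutDuration = 30 * time.Second
)
```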
	// Timeout for coredump operation - memory dumps can take significant time
	clhCoredumpTimeout = 300 // 5 minutes
	// Timeout for waiting for event monitor file to be created
	clhEventMonitorFileTimeout = 30
BTW, wouldn't very large files potentially take more than 30 seconds to be written out? Should the strategy here be to scale the timeout based on the file size?
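A rough sketch of what scaling the wait by expected dump size could look like. The throughput figure and the helper name are purely illustrative assumptions:

```go
package clh

import "time"

const (
	minEventFileWait = 30 * time.Second
	// Assumed sustained write throughput, for illustration only.
	assumedWriteMiBPerSec = 100
)

// eventFileWaitFor scales the wait with the expected dump size in MiB,
// never dropping below the 30-second floor.
func eventFileWaitFor(expectedDumpMiB uint64) time.Duration {
	scaled := time.Duration(expectedDumpMiB/assumedWriteMiBPerSec) * time.Second
	if scaled < minEventFileWait {
		return minEventFileWait
	}
	return scaled
}
```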
for {
	select {
	case <-timeoutChan:
If a very large memory dump is created while the Kata host is restarting, would Kata be stuck waiting for the memory dump to complete? Now, I agree that with the current code, that timeout would be 30 seconds at most, but during a host restart, it can represent a significant amount of time waiting. Thoughts?
// Create context for event reading
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Are you using "cancel" anywhere? I would need to look into it further, but that may be one of the ways to cancel a pending copy if Kata is restarted during copy of the crash dump file.
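One hedged way `cancel` could be put to use: stash it on the hypervisor struct and invoke it from the shutdown path, so a pending dump copy observes `ctx.Done()` when Kata restarts. Field and method names below are illustrative, not the PR's code:

```go
package clh

import (
	"context"
	"sync"
)

type cloudHypervisor struct {
	mu          sync.Mutex
	eventCancel context.CancelFunc // set when event monitoring starts
}

// startEventMonitor derives a cancellable context for event reading and
// remembers its cancel function.
func (clh *cloudHypervisor) startEventMonitor(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	clh.mu.Lock()
	clh.eventCancel = cancel
	clh.mu.Unlock()
	return ctx
}

// terminate cancels the event-reading context so any in-flight copy of the
// crash dump stops promptly instead of running to completion.
func (clh *cloudHypervisor) terminate() {
	clh.mu.Lock()
	cancel := clh.eventCancel
	clh.mu.Unlock()
	if cancel != nil {
		cancel()
	}
}
```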
src/runtime/virtcontainers/clh.go (Outdated)
	clh.Logger().WithError(err).WithField("dumpSavePath", dumpSavePath).Error("failed to call Statfs")
	return nil
}
availableSpaceInBytes := fs.Bavail * uint64(fs.Bsize)
[nitpick] Why use bytes here? Would we ever need anything at the granularity of a byte?
Typically, storage is represented in megabytes/mebibytes (even for tiny computers). If you stick to MiB, the point at which an overflow could occur is pushed much further out. When it comes to disk space, MiB is also arguably more readable for humans than bytes.
Ah right, thanks for pointing this out! I probably missed this one since I took the implementation from QEMU (and that code is very old). I've now updated the implementation to use MiB.
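A small sketch of a MiB-based check along these lines (Linux-only via `syscall.Statfs`; the helper names are illustrative, not the updated PR code):

```go
package clh

import (
	"fmt"
	"syscall"
)

// availableMiB returns the free space on the filesystem containing path, in MiB.
func availableMiB(path string) (uint64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(path, &fs); err != nil {
		return 0, err
	}
	return (fs.Bavail * uint64(fs.Bsize)) >> 20, nil
}

// checkDumpSpace compares available space with the expected dump size, both in MiB.
func checkDumpSpace(path string, expectedMemoryMiB uint64) error {
	avail, err := availableMiB(path)
	if err != nil {
		return err
	}
	if avail < expectedMemoryMiB {
		return fmt.Errorf("not enough free space for memory dump: need %d MiB, have %d MiB",
			expectedMemoryMiB, avail)
	}
	return nil
}
```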
src/runtime/virtcontainers/clh.go (Outdated)
// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}
Why do we want to blindly copy recursively here? Can we be more specific to avoid any potential security issue?
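To illustrate the suggestion, a sketch that copies only an explicit allow-list of state files with `io.Copy` instead of shelling out to a recursive `cp`. The file names listed here are hypothetical; the real sandbox state layout may differ:

```go
package clh

import (
	"io"
	"os"
	"path/filepath"
)

// copyStateFiles copies a known set of sandbox state files rather than
// recursively copying the whole state directory.
func copyStateFiles(statePath, dumpStatePath string) error {
	// Hypothetical allow-list for illustration only.
	files := []string{"state.json", "config.json"}

	if err := os.MkdirAll(dumpStatePath, 0o700); err != nil {
		return err
	}
	for _, name := range files {
		src, err := os.Open(filepath.Join(statePath, name))
		if err != nil {
			return err
		}
		dst, err := os.OpenFile(filepath.Join(dumpStatePath, name),
			os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
		if err != nil {
			src.Close()
			return err
		}
		_, err = io.Copy(dst, src)
		src.Close()
		dst.Close()
		if err != nil {
			return err
		}
	}
	return nil
}
```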
// Set panic behavior based on crashdump configuration
if crashdumpEnabled {
	// Don't reboot on panic - wait for crashdump collection
	params = append(params, Param{"panic", "0"})
} else {
	// Reboot after 1 second on panic (normal behavior)
	params = append(params, Param{"panic", "1"})
}

params = append(params, clhKernelParams...)
- This way, wouldn't the overridden `panic` parameter appear before the default set in `clhKernelParams`? Is that ok?
- I don't think we need the `else` branch if that's the default in `clhKernelParams`.
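To illustrate the second point, a sketch that keeps the defaults first and only appends the `panic` override when crashdumps are enabled (so the override appears later on the command line, which is typically what takes effect for repeated kernel params). The `Param` type and the assumption that `panic=1` already lives in `clhKernelParams` are illustrative:

```go
package clh

type Param struct {
	Key   string
	Value string
}

// Hypothetical defaults; assumes panic=1 is already the default here.
var clhKernelParams = []Param{{"panic", "1"}}

// kernelParams starts from the defaults and overrides panic only when needed,
// removing the need for an else branch.
func kernelParams(crashdumpEnabled bool) []Param {
	params := append([]Param{}, clhKernelParams...)
	if crashdumpEnabled {
		// Don't reboot on panic, so the crashdump can be collected.
		params = append(params, Param{"panic", "0"})
	}
	return params
}
```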
src/runtime/virtcontainers/clh.go (Outdated)
const (
	// Memory dump format will be set to elf
	clhMemoryDumpFormat = "elf"
)
nit: This could be collapsed into the previous const block, and the comment isn't crucial.
Updated
// Safely stop event monitoring - uses sync.Once to prevent double-close panic
clh.stopEventMonitor()
Why do we need sync.Once? Can terminate() be called more than once?
Yes, `terminate()` can be called from multiple places (user, virtiofsd callback):
- Multiple goroutines might call `StopVM()` simultaneously. The `clh.mu` lock protects the atomic flag check/set, but `terminate()` still executes and calls `stopEventMonitor()`.
- If virtiofsd crashes, its callback triggers `StopVM()`. The user might also call `StopVM()` around the same time, which creates a race condition!

The atomic `stopped` flag prevents re-execution in most cases, but there's still a small race window between concurrent calls. Closing an already-closed channel panics, and `sync.Once` prevents this.
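A minimal sketch of the `sync.Once`-guarded close being described (simplified, not the PR's exact code):

```go
package clh

import "sync"

type eventMonitor struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

// stop closes stopCh exactly once, even if terminate() is reached
// concurrently from StopVM() and the virtiofsd crash callback.
func (m *eventMonitor) stop() {
	m.stopOnce.Do(func() {
		close(m.stopCh)
	})
}
```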
src/runtime/virtcontainers/clh.go (Outdated)
if err := os.Remove(eventMonitorPath); err != nil && !os.IsNotExist(err) {
	clh.Logger().WithError(err).WithField("path", eventMonitorPath).Warn("removing event monitor file failed")
Why do we ignore os.IsNotExist(err)?
Handled
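For reference, the modern `errors.Is` spelling of the same check — whether a missing file should be treated as benign is exactly the question raised above; this sketch only restates the pattern:

```go
package clh

import (
	"errors"
	"log"
	"os"
)

// removeEventMonitorFile removes the event monitor file and warns only on
// real failures; the file already being gone is treated as benign here.
func removeEventMonitorFile(path string) {
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		log.Printf("removing event monitor file %q failed: %v", path, err)
	}
}
```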
src/runtime/virtcontainers/clh.go (Outdated)
func (clh *cloudHypervisor) handleCLHEvent(eventJSON string) {
	clh.Logger().WithField("event", eventJSON).Debug("Received CLH event")

	var event map[string]interface{}
nit: If events have a schema it would be easier to unmarshal into a struct.
Updated!
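A sketch of what unmarshalling into a typed struct could look like; the field names are assumptions about the CLH event schema and should be checked against Cloud Hypervisor's docs:

```go
package clh

import (
	"encoding/json"
	"log"
)

// clhEvent is a hypothetical shape for Cloud Hypervisor event-monitor records.
type clhEvent struct {
	Source string `json:"source"`
	Event  string `json:"event"`
}

func handleCLHEvent(eventJSON string) {
	var ev clhEvent
	if err := json.Unmarshal([]byte(eventJSON), &ev); err != nil {
		log.Printf("failed to parse CLH event: %v", err)
		return
	}
	if ev.Source == "vm" && ev.Event == "panic" {
		// Trigger crashdump collection here.
		log.Printf("guest panic detected")
	}
}
```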
src/runtime/virtcontainers/clh.go (Outdated)
	return fmt.Errorf("There is not enough free space to store memory dump file. Expected %d bytes, but only %d bytes available", expectedMemorySize, availableSpaceInBytes)
}

func (clh *cloudHypervisor) handleGuestPanic() {
nit: I would remove this function since its only role is to call another one.
Done!
if dumpSavePath == "" {
	return nil
}
nit: We should probably do this check before we start the watcher loop - surprised to see it so far down in the call stack.
Actually, in the `launchClh()` function we check `clh.config.IfPVPanicEnabled()`, which essentially checks that `dumpSavePath` is not empty. So we are already checking this very early in the code.
src/runtime/virtcontainers/clh.go (Outdated)
// Check device free space and estimated dump size
if err := clh.canDumpGuestMemory(dumpSavePath); err != nil {
	clh.Logger().Warnf("Can't dump guest memory: %s", err.Error())
nit: Use WithError() instead.
Updated!
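For reference, the `WithError()` form being suggested — both calls are standard logrus APIs; the message text here is illustrative:

```go
// Before:
clh.Logger().Warnf("Can't dump guest memory: %s", err.Error())

// After: structured logging, consistent with the other call sites in clh.go.
clh.Logger().WithError(err).Warn("cannot dump guest memory")
```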
src/runtime/virtcontainers/clh.go (Outdated)
// Copy state from /run/vc/sbs to memory dump directory
statePath := filepath.Join(clh.config.RunStorePath, clh.id)
command := []string{"/bin/cp", "-ar", statePath, dumpStatePath}
We should use io.Copy() instead of calling cp.
Updated to use the already existing `fs.CopyDir` function.
Force-pushed from 305e714 to 48201ce
Implement automatic guest memory dump collection when a guest VM panics in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.

Implementation:
- Monitor CLH event socket for panic events with non-blocking I/O
- Enable pvpanic device and configure event-monitor when crashdump enabled
- Set panic=-1 kernel param to prevent reboot during memory dump
- Dump guest memory to ELF format with hypervisor metadata
- Add guest_memory_dump_path option to configuration-clh.toml.in

Memory dumps saved to <guest_memory_dump_path>/<sandbox-id>/ including vmcore ELF file, hypervisor config/version, and sandbox state.

Requires CLH built with guest_debug feature.

This enables automated crash analysis workflows for Kata Containers with Cloud Hypervisor, similar to existing QEMU functionality.

Signed-off-by: Ankita Pareek <ankitapareek@microsoft.com>
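A rough sketch of the event-monitoring loop this commit message describes: poll the event-monitor output for panic records and react. Whether CLH emits events over a socket or a file, the record format, the path, and the polling interval are all assumptions here (reusing the hypothetical `clhEvent` shape from the earlier sketch):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type clhEvent struct {
	Source string `json:"source"`
	Event  string `json:"event"`
}

// watchEvents polls an event-monitor file for newline-delimited JSON records
// and returns when a guest panic is seen or stop is closed.
func watchEvents(path string, stop <-chan struct{}) error {
	var offset int64
	for {
		select {
		case <-stop:
			return nil
		case <-time.After(500 * time.Millisecond):
		}
		f, err := os.Open(path)
		if err != nil {
			continue // the file may not exist yet
		}
		if _, err := f.Seek(offset, 0); err == nil {
			sc := bufio.NewScanner(f)
			for sc.Scan() {
				line := sc.Bytes()
				offset += int64(len(line)) + 1 // assumes "\n"-terminated records
				var ev clhEvent
				if json.Unmarshal(line, &ev) == nil && ev.Source == "vm" && ev.Event == "panic" {
					f.Close()
					fmt.Println("guest panic detected; collecting crashdump")
					return nil
				}
			}
		}
		f.Close()
	}
}

func main() {
	stop := make(chan struct{})
	go func() { time.Sleep(3 * time.Second); close(stop) }()
	_ = watchEvents("/run/vc/clh-events.json", stop) // hypothetical path
}
```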
Force-pushed from 48201ce to ea1871a
Merge Checklist
Summary
Implement automatic guest memory dump collection when a guest VM panics in Cloud Hypervisor, achieving feature parity with QEMU hypervisor.
Implementation:
- `panic=0` kernel cmdline param to prevent reboot during memory dump
- `guest_memory_dump_path` option to configuration-clh.toml.in

Memory dumps are saved to `<guest_memory_dump_path>/<sandbox-id>/`, including the vmcore ELF file, hypervisor config/version, and sandbox state.
Requires CLH built with the `guest_debug` feature.

This enables automated crash analysis workflows for Kata Containers with Cloud Hypervisor, similar to existing QEMU functionality.
Associated issues
Test Methodology