docs: root cause analysis for AL2023 EC2 c8i VM boot failure #425

Open
DorianZheng wants to merge 2 commits into main from docs/al2023-investigation

@DorianZheng
Member

Summary

Documents the root cause of BoxLite VMs failing to start on Amazon Linux 2023 (kernel 6.1) on EC2 c8i instances with nested KVM.

Root Cause

The guest kernel (Linux 6.12.62 in libkrunfw) triggers an i8042 CMD_RESET_CPU during early boot, and the VMM responds by calling _exit(0) immediately, so no console output appears. The guest kernel detects an incompatible CPU configuration under kernel 6.1's nested KVM and falls back to a hardware reset.

Shutdown sequence

  1. Boot vCPU runs ~5 KVM_RUN iterations
  2. Guest kernel writes CMD_RESET_CPU (0xFE) to i8042 port 0x64
  3. i8042 handler triggers reset_evt EventFd → VMM calls _exit(0)
  4. All threads killed — no console output
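
The sequence above can be illustrated with a minimal sketch. It is not libkrun's actual handler: `ResetEvent` is a hypothetical stand-in for the `reset_evt` EventFd, and the `I8042` struct and its `write` method are simplified assumptions. Only the port number (0x64) and command byte (0xFE) come from the steps above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// i8042 command port and reset command, as listed in the shutdown sequence.
const I8042_CMD_PORT: u16 = 0x64;
const CMD_RESET_CPU: u8 = 0xFE;

/// Hypothetical stand-in for the VMM's `reset_evt` EventFd.
#[derive(Clone, Default)]
struct ResetEvent(Arc<AtomicBool>);

impl ResetEvent {
    fn signal(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_signaled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical, simplified i8042 device model.
struct I8042 {
    reset_evt: ResetEvent,
}

impl I8042 {
    /// Handles a guest port write that caused a KVM_RUN exit.
    /// Only the reset command on the command port signals the event.
    fn write(&self, port: u16, data: &[u8]) {
        if port == I8042_CMD_PORT && data.first() == Some(&CMD_RESET_CPU) {
            self.reset_evt.signal();
        }
    }
}
```

In this sketch, a VMM loop that observes `is_signaled()` returning true would then exit the process, which is the point at which all threads die and console output is lost.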

Why Ubuntu 24.04 works

Ubuntu 24.04 (kernel 6.17) provides better nested VMX emulation that satisfies the guest kernel's requirements. KVM capabilities (ept, vpid, etc.) are identical between both kernels — the difference is in the VMX implementation details.
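
The write-up does not say how the `ept` and `vpid` settings were compared. One possible approach, sketched here as an assumption, is to read the `kvm_intel` module parameters from `/sys/module/kvm_intel/parameters` on each host and diff the results. The helper names below are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Reads one module parameter (for example `ept`) from a parameters directory.
/// Returns None if the file is missing or unreadable.
fn read_kvm_param(params_dir: &Path, name: &str) -> Option<String> {
    fs::read_to_string(params_dir.join(name))
        .ok()
        .map(|s| s.trim().to_string())
}

/// Collects the requested parameters so two hosts can be compared side by side.
fn kvm_param_report(params_dir: &Path, names: &[&str]) -> Vec<(String, Option<String>)> {
    names
        .iter()
        .map(|name| (name.to_string(), read_kvm_param(params_dir, name)))
        .collect()
}
```

On a real host the directory would be `Path::new("/sys/module/kvm_intel/parameters")` and the names `&["ept", "vpid"]`. Matching output on both hosts would be consistent with the observation that the difference lies in the VMX implementation rather than in the advertised capabilities.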

Upstream status

  • No existing libkrun/libkrunfw issues for this scenario
  • Running libkrun on nested KVM Linux hosts is untested upstream
  • libkrunfw#50 concerns nested KVM inside guests, which is a different scenario

Investigation method

Added eprintln and std::fs::write instrumentation to libkrun's i8042 device handler (devices/src/legacy/i8042.rs) and vCPU run loop (vmm/src/linux/vstate.rs). Confirmed the reset via the /tmp/krun-i8042-reset.log file written by the instrumented binary.
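
A minimal sketch of this kind of instrumentation follows. The function name `record_reset` and the message format are hypothetical, not the actual patch. The sketch only shows the pattern of pairing `eprintln!` with `std::fs::write` so that evidence remains on disk even when nothing reaches the console.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: logs the reset to stderr and to a file.
/// The file is the durable record if stderr output is never seen.
fn record_reset(log_path: &Path, port: u16, value: u8) -> io::Result<()> {
    let msg = format!("i8042 reset: port={:#06x} value={:#04x}\n", port, value);
    eprintln!("{}", msg.trim_end());
    fs::write(log_path, msg)
}
```

In the investigation described above, the log path was /tmp/krun-i8042-reset.log. A call such as `record_reset(Path::new("/tmp/krun-i8042-reset.log"), 0x64, 0xFE)` placed just before the reset event is signaled would produce a comparable record.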

@DorianZheng DorianZheng force-pushed the docs/al2023-investigation branch 2 times, most recently from 0bc3063 to 23509fe Compare April 2, 2026 13:12
Guest kernel (libkrunfw 6.12.62) triggers i8042 CMD_RESET_CPU during
early boot on nested KVM with host kernel 6.1 (Amazon Linux 2023).
The reset causes immediate _exit(0) with no console output.

Root cause: the guest kernel detects an incompatible CPU/hardware
configuration under kernel 6.1's nested KVM emulation and performs
a hardware reset via the i8042 controller. Ubuntu 24.04 (kernel 6.17)
works because it provides better nested VMX emulation.

This is an unreported configuration upstream — libkrun has not been
tested on nested KVM Linux hosts.
@DorianZheng DorianZheng force-pushed the docs/al2023-investigation branch from 23509fe to eebcc59 Compare April 2, 2026 15:09
…ID verification

Add four security improvements to the OCI image pull pipeline, closing
gaps identified by comparing with Docker (containerd) and Podman
(containers/image):

- Size validation: LayerInfo now carries expected size from manifest
  descriptors; StagedDownload.commit() rejects blobs with mismatched
  size before hash check (prevents disk exhaustion from oversized blobs)
- Foreign layer URL rejection: layers_from_image() rejects layers with
  non-distributable media types or foreign URLs (CVE-2020-15157
  mitigation)
- HashingWriter: new AsyncWrite wrapper computes SHA256 inline during
  download, eliminating the post-download re-read and halving I/O while
  maintaining independent verification from oci-client
- DiffID verification: verify_diff_id() decompresses and hashes layer
  tarballs to verify uncompressed content matches rootfs.diff_ids from
  the image config, called during layer_extracted()
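
The first two checks can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: `LayerDescriptor`, `LayerError`, `reject_foreign`, and `check_size` are hypothetical names and shapes, and the real `LayerInfo`, `StagedDownload.commit()`, and `layers_from_image()` may differ. The media type substrings match the OCI non-distributable and Docker foreign layer media types.

```rust
/// Hypothetical, reduced view of a manifest layer descriptor.
struct LayerDescriptor {
    media_type: String,
    size: u64,
    urls: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum LayerError {
    ForeignLayer,
    SizeMismatch { expected: u64, actual: u64 },
}

/// Rejects layers with a non-distributable media type or any foreign URL.
fn reject_foreign(layer: &LayerDescriptor) -> Result<(), LayerError> {
    let non_distributable = layer.media_type.contains("nondistributable")
        || layer.media_type.contains("foreign");
    if non_distributable || !layer.urls.is_empty() {
        return Err(LayerError::ForeignLayer);
    }
    Ok(())
}

/// Compares the downloaded byte count with the size declared in the manifest.
/// Intended to run before the (more expensive) hash check.
fn check_size(layer: &LayerDescriptor, actual: u64) -> Result<(), LayerError> {
    if actual != layer.size {
        return Err(LayerError::SizeMismatch {
            expected: layer.size,
            actual,
        });
    }
    Ok(())
}
```

Checking the size first keeps a mismatched blob from reaching the hash step. A stricter variant, not shown here, would also cap the number of bytes accepted while the download is still in progress.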