Skip to content

vmm: avoid pause deadlock on CPU hotplug MMIO#150

Closed
phip1611 wants to merge 1 commit into
cyberus-technology:gardenlinuxfrom
phip1611:backport-fix-cpu-deadlock
Closed

vmm: avoid pause deadlock on CPU hotplug MMIO#150
phip1611 wants to merge 1 commit into
cyberus-technology:gardenlinuxfrom
phip1611:backport-fix-cpu-deadlock

Conversation

@phip1611
Copy link
Copy Markdown
Member

@phip1611 phip1611 commented Apr 24, 2026

Backport of cloud-hypervisor#8092. We also have the same problem in our fork, where I originally designed a solution that I then upstreamed...

Hints for Reviewers

Steps to Undraft

  • Sebastian noticed that I still didn't fix the dead lock, I just reduced the window

@phip1611 phip1611 self-assigned this Apr 24, 2026
@phip1611 phip1611 requested review from Coffeeri, amphi, olivereanderson and tpressure and removed request for olivereanderson and tpressure April 24, 2026 09:21
# TL;DR

In 979fb1c we
replaced the old deadlock with another deadlock. This commit finally
resolves (hopefully) all dead locks on that code path by not holding
`CpuManager::vcpu_states` while waiting for vCPU pause acknowledgements.
A vCPU can receive the pause kick while servicing the ACPI CPU hotplug
MMIO device, and that MMIO path also needs `vcpu_states`. Holding the
mutex across the wait phase deadlocks pause against that MMIO access.

# Problem

`signal_vcpus()` used to lock `CpuManager::vcpu_states` for the whole
function, signal every vCPU, and then wait for each vCPU to acknowledge
the kick.

That lock scope is too wide. A vCPU is allowed to observe the kick in
userspace rather than returning directly from `KVM_RUN`. During boot,
`vcpu0` can be in an MMIO access on the ACPI CPU hotplug device when
pause arrives. `AcpiCpuHotplugController::read()` and `write()` both
lock `vcpu_states` to inspect or update the selected vCPU state.

The deadlock looks like this:

    VMM thread                           vCPU thread
    ----------                           ----------
    lock(vcpu_states)
    signal_vcpus()
    wait for ack  ---------------------> receives pause kick
                                          enters ACPI CPU hotplug MMIO
                                          lock(vcpu_states)  [blocks]
    wait for ack  <--------------------- cannot set vcpu_run_interrupted

The VMM thread waits for `vcpu_run_interrupted` to flip, but the vCPU
cannot reach the pause acknowledgement path because it is sleeping on
the same mutex.

The debug logs matched that cycle exactly: signal delivery kept
working, `vcpu0` stayed in one unmatched `run()` invocation, the stuck
thread sampled in `futex_do_wait`, and the backtrace pointed at
`AcpiCpuHotplugController::read()`.

# Reproducer

This was reproducible by continuously issuing `pause()` / `resume()`
from while a Linux guest was still booting. That boot-tim window
reliably exercises the ACPI CPU hotplug MMIO access that participates in
the deadlock. Once the guest had finished booting, the problem became
much harder to trigger (as there is no MMIO operation without explicit
CPU plugging).

# Solution

Keep the existing two-phase behavior so all vCPUs are still signalled
before the wait phase, but narrow the lifetime of the `vcpu_states`
mutex. Reacquire it only long enough to access one `VcpuState` at a
time in each phase.

That preserves the original pause semantics and the fast signal-all /
wait-all structure, while removing the lock inversion with the ACPI CPU
hotplug MMIO path.

This also remains safe if a vCPU is hot-removed while pause is in
progress. Hot-remove does not shrink `vcpu_states`; it stops the thread
and clears the `VcpuState` handle in place. `signal_vcpus()` can
therefore snapshot the vector length up front, and if a vCPU disappears
between the signal and wait phases,
`wait_until_signal_acknowledged()` will observe `handle.is_none()` and
return successfully.

The interruption handshake itself lives in atomics inside each
`VcpuState`. The outer mutex is only needed to reach the state objects,
not to keep the acknowledgement protocol correct. Dropping the mutex
between iterations therefore does not weaken the pause protocol, but it
does allow MMIO handlers and other `vcpu_states` users to make forward
progress while the VMM waits for the kick to be observed.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
@phip1611 phip1611 force-pushed the backport-fix-cpu-deadlock branch from f0aa7d1 to 3751233 Compare April 24, 2026 09:22
Copy link
Copy Markdown

@Coffeeri Coffeeri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding work!

@phip1611 phip1611 marked this pull request as draft April 24, 2026 14:22
@amphi
Copy link
Copy Markdown

amphi commented Apr 24, 2026

I don't think this really fixes the problem, I think you just shrunk the time-window for the deadlock.

@phip1611
Copy link
Copy Markdown
Member Author

I don't think this really fixes the problem, I think you just shrunk the time-window for the deadlock.

yup, we are holding the lock for the whole retry loop..

@phip1611 phip1611 closed this Apr 28, 2026
@phip1611 phip1611 deleted the backport-fix-cpu-deadlock branch April 28, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants