From 3751233026dea582846dd200fadb3e5585747a82 Mon Sep 17 00:00:00 2001
From: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Date: Fri, 24 Apr 2026 10:27:08 +0200
Subject: [PATCH] vmm: avoid pause deadlock on CPU hotplug MMIO

# TL;DR

In 979fb1c5675caf53176eff28113094f3b2dc104e we
replaced the old deadlock with another deadlock. This commit finally
resolves (hopefully) all dead locks on that code path by not holding
`CpuManager::vcpu_states` while waiting for vCPU pause acknowledgements.
A vCPU can receive the pause kick while servicing the ACPI CPU hotplug
MMIO device, and that MMIO path also needs `vcpu_states`. Holding the
mutex across the wait phase deadlocks pause against that MMIO access.

# Problem

`signal_vcpus()` used to lock `CpuManager::vcpu_states` for the whole
function, signal every vCPU, and then wait for each vCPU to acknowledge
the kick.

That lock scope is too wide. A vCPU is allowed to observe the kick in
userspace rather than returning directly from `KVM_RUN`. During boot,
`vcpu0` can be in an MMIO access on the ACPI CPU hotplug device when
pause arrives. `AcpiCpuHotplugController::read()` and `write()` both
lock `vcpu_states` to inspect or update the selected vCPU state.

The deadlock looks like this:

    VMM thread                           vCPU thread
    ----------                           ----------
    lock(vcpu_states)
    signal_vcpus()
    wait for ack  ---------------------> receives pause kick
                                          enters ACPI CPU hotplug MMIO
                                          lock(vcpu_states)  [blocks]
    wait for ack  <--------------------- cannot set vcpu_run_interrupted

The VMM thread waits for `vcpu_run_interrupted` to flip, but the vCPU
cannot reach the pause acknowledgement path because it is sleeping on
the same mutex.

The debug logs matched that cycle exactly: signal delivery kept
working, `vcpu0` stayed in one unmatched `run()` invocation, the stuck
thread sampled in `futex_do_wait`, and the backtrace pointed at
`AcpiCpuHotplugController::read()`.

# Reproducer

This was reproducible by continuously issuing `pause()` / `resume()`
from while a Linux guest was still booting. That boot-tim window
reliably exercises the ACPI CPU hotplug MMIO access that participates in
the deadlock. Once the guest had finished booting, the problem became
much harder to trigger (as there is no MMIO operation without explicit
CPU plugging).

# Solution

Keep the existing two-phase behavior so all vCPUs are still signalled
before the wait phase, but narrow the lifetime of the `vcpu_states`
mutex. Reacquire it only long enough to access one `VcpuState` at a
time in each phase.

That preserves the original pause semantics and the fast signal-all /
wait-all structure, while removing the lock inversion with the ACPI CPU
hotplug MMIO path.

This also remains safe if a vCPU is hot-removed while pause is in
progress. Hot-remove does not shrink `vcpu_states`; it stops the thread
and clears the `VcpuState` handle in place. `signal_vcpus()` can
therefore snapshot the vector length up front, and if a vCPU disappears
between the signal and wait phases,
`wait_until_signal_acknowledged()` will observe `handle.is_none()` and
return successfully.

The interruption handshake itself lives in atomics inside each
`VcpuState`. The outer mutex is only needed to reach the state objects,
not to keep the acknowledgement protocol correct. Dropping the mutex
between iterations therefore does not weaken the pause protocol, but it
does allow MMIO handlers and other `vcpu_states` users to make forward
progress while the VMM waits for the kick to be observed.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
---
 vmm/src/cpu.rs | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/vmm/src/cpu.rs b/vmm/src/cpu.rs
index 02dd4bb9ff..a967a09666 100644
--- a/vmm/src/cpu.rs
+++ b/vmm/src/cpu.rs
@@ -1551,16 +1551,22 @@ impl CpuManager {
     /// Calls [`VcpuState::signal_thread`] and
     /// [`VcpuState::wait_until_signal_acknowledged`] for each vCPU.
     fn signal_vcpus(&mut self) -> Result<()> {
-        // Holding the lock for the whole operation is correct:
-        let vcpu_states = self.vcpu_states.lock().unwrap();
+        let vcpu_count = self.vcpu_states.lock().unwrap().len();
 
         // Splitting this into two loops reduced the time to pause many vCPUs
         // massively. Example: 254 vCPUs. >254ms -> ~4ms.
-        for state in vcpu_states.iter() {
-            state.signal_thread();
+        //
+        // Do not hold `vcpu_states` across the wait phase. A vCPU can handle
+        // the kick in userspace while servicing MMIO on the ACPI CPU hotplug
+        // device, and that path also takes `vcpu_states`. Holding the mutex
+        // here while waiting would deadlock pause against that MMIO access.
+        for cpu_id in 0..vcpu_count {
+            let vcpu_states = self.vcpu_states.lock().unwrap();
+            vcpu_states[cpu_id].signal_thread();
         }
-        for state in vcpu_states.iter() {
-            state.wait_until_signal_acknowledged()?;
+        for cpu_id in 0..vcpu_count {
+            let vcpu_states = self.vcpu_states.lock().unwrap();
+            vcpu_states[cpu_id].wait_until_signal_acknowledged()?;
         }
 
         Ok(())