From 3751233026dea582846dd200fadb3e5585747a82 Mon Sep 17 00:00:00 2001 From: Philipp Schuster Date: Fri, 24 Apr 2026 10:27:08 +0200 Subject: [PATCH] vmm: avoid pause deadlock on CPU hotplug MMIO # TL;DR In 979fb1c5675caf53176eff28113094f3b2dc104e we replaced the old deadlock with another deadlock. This commit finally resolves (hopefully) all dead locks on that code path by not holding `CpuManager::vcpu_states` while waiting for vCPU pause acknowledgements. A vCPU can receive the pause kick while servicing the ACPI CPU hotplug MMIO device, and that MMIO path also needs `vcpu_states`. Holding the mutex across the wait phase deadlocks pause against that MMIO access. # Problem `signal_vcpus()` used to lock `CpuManager::vcpu_states` for the whole function, signal every vCPU, and then wait for each vCPU to acknowledge the kick. That lock scope is too wide. A vCPU is allowed to observe the kick in userspace rather than returning directly from `KVM_RUN`. During boot, `vcpu0` can be in an MMIO access on the ACPI CPU hotplug device when pause arrives. `AcpiCpuHotplugController::read()` and `write()` both lock `vcpu_states` to inspect or update the selected vCPU state. The deadlock looks like this: VMM thread vCPU thread ---------- ---------- lock(vcpu_states) signal_vcpus() wait for ack ---------------------> receives pause kick enters ACPI CPU hotplug MMIO lock(vcpu_states) [blocks] wait for ack <--------------------- cannot set vcpu_run_interrupted The VMM thread waits for `vcpu_run_interrupted` to flip, but the vCPU cannot reach the pause acknowledgement path because it is sleeping on the same mutex. The debug logs matched that cycle exactly: signal delivery kept working, `vcpu0` stayed in one unmatched `run()` invocation, the stuck thread sampled in `futex_do_wait`, and the backtrace pointed at `AcpiCpuHotplugController::read()`. # Reproducer This was reproducible by continuously issuing `pause()` / `resume()` from while a Linux guest was still booting. That boot-tim window reliably exercises the ACPI CPU hotplug MMIO access that participates in the deadlock. Once the guest had finished booting, the problem became much harder to trigger (as there is no MMIO operation without explicit CPU plugging). # Solution Keep the existing two-phase behavior so all vCPUs are still signalled before the wait phase, but narrow the lifetime of the `vcpu_states` mutex. Reacquire it only long enough to access one `VcpuState` at a time in each phase. That preserves the original pause semantics and the fast signal-all / wait-all structure, while removing the lock inversion with the ACPI CPU hotplug MMIO path. This also remains safe if a vCPU is hot-removed while pause is in progress. Hot-remove does not shrink `vcpu_states`; it stops the thread and clears the `VcpuState` handle in place. `signal_vcpus()` can therefore snapshot the vector length up front, and if a vCPU disappears between the signal and wait phases, `wait_until_signal_acknowledged()` will observe `handle.is_none()` and return successfully. The interruption handshake itself lives in atomics inside each `VcpuState`. The outer mutex is only needed to reach the state objects, not to keep the acknowledgement protocol correct. Dropping the mutex between iterations therefore does not weaken the pause protocol, but it does allow MMIO handlers and other `vcpu_states` users to make forward progress while the VMM waits for the kick to be observed. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster --- vmm/src/cpu.rs | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/vmm/src/cpu.rs b/vmm/src/cpu.rs index 02dd4bb9ff..a967a09666 100644 --- a/vmm/src/cpu.rs +++ b/vmm/src/cpu.rs @@ -1551,16 +1551,22 @@ impl CpuManager { /// Calls [`VcpuState::signal_thread`] and /// [`VcpuState::wait_until_signal_acknowledged`] for each vCPU. fn signal_vcpus(&mut self) -> Result<()> { - // Holding the lock for the whole operation is correct: - let vcpu_states = self.vcpu_states.lock().unwrap(); + let vcpu_count = self.vcpu_states.lock().unwrap().len(); // Splitting this into two loops reduced the time to pause many vCPUs // massively. Example: 254 vCPUs. >254ms -> ~4ms. - for state in vcpu_states.iter() { - state.signal_thread(); + // + // Do not hold `vcpu_states` across the wait phase. A vCPU can handle + // the kick in userspace while servicing MMIO on the ACPI CPU hotplug + // device, and that path also takes `vcpu_states`. Holding the mutex + // here while waiting would deadlock pause against that MMIO access. + for cpu_id in 0..vcpu_count { + let vcpu_states = self.vcpu_states.lock().unwrap(); + vcpu_states[cpu_id].signal_thread(); } - for state in vcpu_states.iter() { - state.wait_until_signal_acknowledged()?; + for cpu_id in 0..vcpu_count { + let vcpu_states = self.vcpu_states.lock().unwrap(); + vcpu_states[cpu_id].wait_until_signal_acknowledged()?; } Ok(())