Skip to content

vmm: migration: ignore snapshots of removed devices#156

Closed
tpressure wants to merge 1 commit into
cyberus-technology:gardenlinuxfrom
tpressure:pci_fix_migrate
Closed

vmm: migration: ignore snapshots of removed devices#156
tpressure wants to merge 1 commit into
cyberus-technology:gardenlinuxfrom
tpressure:pci_fix_migrate

Conversation

@tpressure
Copy link
Copy Markdown

@tpressure tpressure commented May 18, 2026

The migration snapshot stays available after restore so that devices present in the restored VM can recover their state. This breaks down for hotplugged devices that are removed after migration and later re-added with the same identifier.

In that case the stale snapshot is still found by device ID and the new device is constructed as if it were being restored. For virtio-pci hotplug, this leaves the PCI configuration in a restored state and BAR registration fails when the fresh device is added again.

Only use per-device snapshot/state when the device still exists in the restored device tree. Once a device was removed from the live VM, a later hotplug with the same identifier must be treated as a fresh device.

This fixes https://github.com/cobaltcore-dev/cobaltcore/issues/567

An alternative would be to drop the snapshot after live migration, but only after all devices have been reconstructed successfully. In any case, I would argue that the new sanity checks make sense anyway.

Regression test can be found here: https://gitlab.cyberus-technology.de/cyberus/cloud/libvirt/-/merge_requests/209

@tpressure tpressure requested review from amphi and phip1611 May 18, 2026 18:44
The migration snapshot stays available after restore so that devices
present in the restored VM can recover their state. This breaks down
for hotplugged devices that are removed after migration and later
re-added with the same identifier.

In that case the stale snapshot is still found by device ID and the
new device is constructed as if it were being restored. For virtio-pci
hotplug, this leaves the PCI configuration in a restored state and BAR
registration fails when the fresh device is added again.

Only use per-device snapshot/state when the device still exists in the
restored device tree. Once a device was removed from the live VM, a
later hotplug with the same identifier must be treated as a fresh
device.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
On-behalf-of: SAP thomas.prescher@sap.com
Comment thread vmm/src/device_manager.rs

impl DeviceManager {
fn has_restored_device(&self, id: &str) -> bool {
self.snapshot.is_some() && self.device_tree.lock().unwrap().contains_key(id)
Copy link
Copy Markdown
Member

@phip1611 phip1611 May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice debugging! My very first impression is that this solution is more a workaround than a solution. To me it is odd why the snapshot could exist for that device when the device manager doesn't has that device. I'll look into the root cause today with high priority and then will come back

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@Coffeeri
Copy link
Copy Markdown

Coffeeri commented May 19, 2026

I think the core problem is that self.snapshot is never taken or cleared after the devices have been restored. The detach via eject_device removes the device from the device tree but leaves self.snapshot untouched.
When we hotplug a new disk colliding with the same letter, we still have self.snapshot in place and state_from_id(self.snapshot, id) will use it unconditionally. The minimal fix would set self.snapshot = None; in the very end of DeviceManager::create_devices().

I have just validated it and your reproducer succeeds with this single line fix.

@phip1611
Copy link
Copy Markdown
Member

In coordination with Thomas: Will be replaced with another solution, inspired by @Coffeeri

Copy link
Copy Markdown

@olivereanderson olivereanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

As I understand it snapshot restoration will only work with these changes provided that the device tree is reconstructed first, before the individual devices are re-introduced. That is indeed the current behavior so this is OK, but it might be an idea to document that somewhere.

@phip1611
Copy link
Copy Markdown
Member

Superseded by #157

@phip1611
Copy link
Copy Markdown
Member

LGTM

As I understand it snapshot restoration will only work with these changes provided that the device tree is reconstructed first, before the individual devices are re-introduced. That is indeed the current behavior so this is OK, but it might be an idea to document that somewhere.

Oh sorry, @olivereanderson . At the same time you reviewed this, I was in a meeting room with Thomas and decided to go for a different solution

@olivereanderson
Copy link
Copy Markdown

LGTM
As I understand it snapshot restoration will only work with these changes provided that the device tree is reconstructed first, before the individual devices are re-introduced. That is indeed the current behavior so this is OK, but it might be an idea to document that somewhere.

Oh sorry, @olivereanderson . At the same time you reviewed this, I was in a meeting room with Thomas and decided to go for a different solution

No worries!
I view reviewing as a way to learn more about this domain so I don't see the time as wasted at all.

Regardless, the most important thing is always that we find the best solution within the allocated time frame 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants