vmm: migration: ignore snapshots of removed devices #156
The migration snapshot stays available after restore so that devices present in the restored VM can recover their state. This breaks down for hotplugged devices that are removed after migration and later re-added with the same identifier.

In that case the stale snapshot is still found by device ID and the new device is constructed as if it were being restored. For virtio-pci hotplug, this leaves the PCI configuration in a restored state and BAR registration fails when the fresh device is added again.

Only use per-device snapshot/state when the device still exists in the restored device tree. Once a device was removed from the live VM, a later hotplug with the same identifier must be treated as a fresh device.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
On-behalf-of: SAP thomas.prescher@sap.com
2ff1298 to 9d1d47a
    impl DeviceManager {
        fn has_restored_device(&self, id: &str) -> bool {
            self.snapshot.is_some() && self.device_tree.lock().unwrap().contains_key(id)
Very nice debugging! My very first impression is that this is more of a workaround than a solution. It seems odd to me that a snapshot can still exist for a device when the device manager doesn't have that device. I'll look into the root cause today with high priority and then come back.
I think the core problem is that I have just validated it and your reproducer succeeds with this single-line fix.
In coordination with Thomas: Will be replaced with another solution, inspired by @Coffeeri
olivereanderson left a comment
LGTM
As I understand it, snapshot restoration will only work with these changes provided that the device tree is reconstructed first, before the individual devices are re-introduced. That is indeed the current behavior, so this is OK, but it might be worth documenting that somewhere.
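The ordering dependency described here can be made concrete with a small, self-contained sketch. It is not the Cloud Hypervisor code: `Restore`, `rebuild_device_tree`, and `state_for` are hypothetical names, and the snapshot is modelled as a plain map from device ID to state bytes. It only illustrates that a lookup gated on the device tree treats every device as fresh until the tree has been rebuilt.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified model of a restore in two phases.
pub struct Restore {
    /// Per-device state taken from the migration snapshot.
    snapshots: BTreeMap<String, Vec<u8>>,
    /// Device IDs known to the restored device tree (empty until phase 1 ran).
    device_tree: BTreeSet<String>,
}

impl Restore {
    pub fn new(snapshots: Vec<(String, Vec<u8>)>) -> Self {
        Restore {
            snapshots: snapshots.into_iter().collect(),
            device_tree: BTreeSet::new(),
        }
    }

    /// Phase 1: rebuild the device tree from the snapshot.
    pub fn rebuild_device_tree(&mut self) {
        self.device_tree = self.snapshots.keys().cloned().collect();
    }

    /// Phase 2 lookup: state is only handed out for devices in the tree.
    pub fn state_for(&self, id: &str) -> Option<&[u8]> {
        if self.device_tree.contains(id) {
            self.snapshots.get(id).map(|state| state.as_slice())
        } else {
            None
        }
    }
}
```

In this model, calling `state_for` before `rebuild_device_tree` returns `None` even for a device that was migrated, which is the failure mode the documentation note would guard against.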
Superseded by #157
Oh sorry, @olivereanderson. While you were reviewing this, I was in a meeting room with Thomas and we decided to go for a different solution.
No worries! Regardless, the most important thing is always that we find the best solution within the allocated time frame 👍
The migration snapshot stays available after restore so that devices present in the restored VM can recover their state. This breaks down for hotplugged devices that are removed after migration and later re-added with the same identifier.
In that case the stale snapshot is still found by device ID and the new device is constructed as if it were being restored. For virtio-pci hotplug, this leaves the PCI configuration in a restored state and BAR registration fails when the fresh device is added again.
Only use per-device snapshot/state when the device still exists in the restored device tree. Once a device was removed from the live VM, a later hotplug with the same identifier must be treated as a fresh device.
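A minimal sketch of this rule, under simplifying assumptions: `Manager`, `Construction`, `after_restore`, `construction_for`, `remove`, and `hotplug` are hypothetical names, the snapshot is a plain map of state bytes, and the device tree is a set of IDs without locking. Only the shape of `has_restored_device` follows the change in this PR.

```rust
use std::collections::{HashMap, HashSet};

/// How a device ends up being constructed in this simplified model.
#[derive(Debug, PartialEq)]
pub enum Construction {
    /// Built from the state bytes saved in the migration snapshot.
    Restored(Vec<u8>),
    /// Built from scratch, ignoring any snapshot entry.
    Fresh,
}

/// Hypothetical stand-in for the device manager after a restore.
pub struct Manager {
    /// Stays available after restore, as described above.
    snapshot: Option<HashMap<String, Vec<u8>>>,
    /// IDs of devices currently present in the VM.
    device_tree: HashSet<String>,
}

impl Manager {
    /// Model a finished restore: every migrated device is in the tree.
    pub fn after_restore(devices: Vec<(String, Vec<u8>)>) -> Self {
        let device_tree: HashSet<String> =
            devices.iter().map(|(id, _)| id.clone()).collect();
        Manager { snapshot: Some(devices.into_iter().collect()), device_tree }
    }

    /// Same shape as the check in this PR, minus the mutex.
    fn has_restored_device(&self, id: &str) -> bool {
        self.snapshot.is_some() && self.device_tree.contains(id)
    }

    /// Use snapshot state only if the device still exists in the tree.
    pub fn construction_for(&self, id: &str) -> Construction {
        if !self.has_restored_device(id) {
            return Construction::Fresh;
        }
        match self.snapshot.as_ref().and_then(|s| s.get(id)) {
            Some(state) => Construction::Restored(state.clone()),
            None => Construction::Fresh,
        }
    }

    /// Hot-unplug: the ID leaves the tree, the stale snapshot entry stays.
    pub fn remove(&mut self, id: &str) {
        self.device_tree.remove(id);
    }

    /// Hotplug: decide how to construct first, then register the device.
    pub fn hotplug(&mut self, id: &str) -> Construction {
        let how = self.construction_for(id);
        self.device_tree.insert(id.to_string());
        how
    }
}
```

Note that the sketch decides how to construct the device before inserting its ID into the tree. If the order were reversed, the stale snapshot entry would be picked up again.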
This fixes https://github.com/cobaltcore-dev/cobaltcore/issues/567
An alternative would be to drop the snapshot after live migration, but only after all devices have been reconstructed successfully. In any case, I would argue that the new sanity checks make sense anyway.
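The alternative mentioned here could look roughly like the following sketch. It is an illustration under assumptions, not the approach that was merged: `Manager`, `with_snapshot`, `restore_all`, and `snapshot_for` are hypothetical names, and device reconstruction is reduced to a caller-supplied closure.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified device manager holding a migration snapshot.
pub struct Manager {
    snapshot: Option<HashMap<String, Vec<u8>>>,
}

impl Manager {
    pub fn with_snapshot(devices: Vec<(String, Vec<u8>)>) -> Self {
        Manager { snapshot: Some(devices.into_iter().collect()) }
    }

    /// Reconstruct every device, then drop the snapshot.
    /// On the first error the snapshot is kept untouched.
    pub fn restore_all<F>(&mut self, mut reconstruct: F) -> Result<usize, String>
    where
        F: FnMut(&str, &[u8]) -> Result<(), String>,
    {
        let count = match &self.snapshot {
            Some(devices) => {
                for (id, state) in devices {
                    // `?` returns early, before the snapshot is dropped.
                    reconstruct(id.as_str(), state.as_slice())?;
                }
                devices.len()
            }
            None => 0,
        };
        // Only reached when all devices were reconstructed successfully.
        self.snapshot = None;
        Ok(count)
    }

    /// Later hotplugs consult this; after a successful restore it is `None`.
    pub fn snapshot_for(&self, id: &str) -> Option<&[u8]> {
        self.snapshot.as_ref()?.get(id).map(|state| state.as_slice())
    }
}
```

With the snapshot gone after a successful restore, a later hotplug with a reused identifier cannot find stale state, so no per-lookup check against the device tree is needed in this variant.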
A regression test can be found here: https://gitlab.cyberus-technology.de/cyberus/cloud/libvirt/-/merge_requests/209