Summary
Upgrading drbd-utils from 9.34.0-1 to 9.34.3-1 on a running production cluster causes all DRBD connections to be torn down and not re-established. The cluster ends up fragmented, with each node showing peers as Connecting and resources as Unknown/Diskless from the LINSTOR controller's perspective. Manual intervention (drbdadm adjust all on every node) is required to recover.
Environment
- drbd-utils: 9.34.3-1 (built 2026-04-17), upgraded from 9.34.0-1
- DRBD kernel module: 9.3.1
- LINSTOR: linstor-server 1.27.1, satellites on each node
- Kernel: 6.17.13-2-pve (Proxmox VE)
- Cluster: 4 nodes (pvevsan1–4), LINSTOR over ZFS_THIN on FC SAN, ~8 active resources, multiple primary VMs running
Root Cause Analysis
The package's postinst script (auto-generated by dh_systemd_start) invokes:

```sh
deb-systemd-invoke start 'drbd-configured.target' 'drbd-graceful-shutdown.service'
```
Under the maintainer-script sequence that dh_systemd_start generates, an upgrade stops the unit and then starts it again, so drbd-graceful-shutdown.service receives a stop followed by a start. The service is defined as:

```ini
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/lib/drbd/scripts/drbd-service-shim.sh down all
```
So during the upgrade, ExecStop runs, which executes drbdsetup down all. This:
- Disconnects all connections for all resources (succeeds unconditionally).
- Attempts to demote primary resources, which fails with (-12) Device is held open by someone for resources used by running VMs.
- Leaves the cluster in a state where connections are torn down but resources remain active locally; the disconnects are not rolled back when subsequent steps fail.
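The stop-time behavior is easy to confirm on an affected node, and the same failure can be triggered by hand; a sketch, with the resource name taken from the logs below purely as an illustration:

```sh
# Inspect the installed unit and its stop action:
systemctl cat drbd-graceful-shutdown.service
systemctl show -p ExecStop drbd-graceful-shutdown.service

# Trigger the same sequence against one resource held open by a VM:
drbdsetup down pm-47d8523a
# -> State change failed: (-12) Device is held open by someone
#    (per the sequence above, the disconnect is not rolled back)
```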
Reproduction Steps
- Have a healthy multi-node DRBD cluster with active primaries (e.g., running VMs holding /dev/drbdX open).
- Run apt upgrade drbd-utils from 9.34.0-1 to 9.34.3-1 on every node.
- Observe: connections between nodes drop to Connecting/StandAlone and do not recover.
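A convenient way to watch the teardown in real time (a sketch; assumes the linstor client is installed on the node):

```sh
# Refresh the DRBD and LINSTOR views every second during the upgrade:
watch -n1 'drbdadm status; echo; linstor resource list'
```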
Evidence from Logs
/var/log/dpkg.log on pvevsan1:

```
2026-04-18 13:45:52 upgrade drbd-utils:amd64 9.34.0-1 9.34.3-1
2026-04-18 13:45:53 configure drbd-utils:amd64 9.34.3-1 <none>
2026-04-18 13:45:55 status installed drbd-utils:amd64 9.34.3-1
```
journalctl (same time window) — systemd stops the service:

```
Apr 18 13:45:52 pvevsan1 systemd[1]: Stopping drbd-graceful-shutdown.service - ensure all DRBD resources shut down gracefully at system shut down...
```
drbd-service-shim.sh output — demote fails for primaries with active VMs:

```
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: pm-47d8523a: State change failed: (-12) Device is held open by someone
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: failed to demote
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: /dev/drbd1005 open_cnt:1, writable:1; list of openers follows
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: drbd1005 opened by kvm (pid 2754028)
```
Kernel log — disconnects succeed despite demote failures:

```
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan2: conn( Connected -> Disconnecting ) peer( Primary -> Unknown ) [down]
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan2: conn( Disconnecting -> StandAlone ) [disconnected]
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan3: Cluster is now split
```
The same sequence repeats on each node as it gets upgraded (pvevsan2 at 13:46:30, pvevsan3 at 13:46:56, pvevsan4 at 13:47:23).
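For reference, the evidence above can be pulled with standard tooling; the time window is this incident's and is purely illustrative:

```sh
grep drbd-utils /var/log/dpkg.log
journalctl -u drbd-graceful-shutdown.service \
    --since "2026-04-18 13:45" --until "2026-04-18 13:50"
journalctl -k --since "2026-04-18 13:45" --until "2026-04-18 13:50" | grep drbd
```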
Resulting State
After the upgrade completed on all nodes, linstor resource list showed:
- All connections in Connecting state.
- Most resources reported as Unknown from peers' perspective.
- Two diskless InUse resources (whose peers became unreachable) effectively isolated.

drbdadm status on each node showed only the locally-active resources, with all peers stuck in Connecting.
Workaround
Run on every node:
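```sh
# The command named in the Summary: re-apply the on-disk configuration
# to the kernel and reconnect peers.
drbdadm adjust all
```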
This re-establishes the configuration in the kernel and reconnects peers. After applying on all four nodes, the cluster recovered fully (with normal resyncs for resources that had diverged briefly).
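For a small cluster like this one, the fix can be driven from a single host; a sketch, assuming passwordless SSH and the node names above:

```sh
for n in pvevsan1 pvevsan2 pvevsan3 pvevsan4; do
    ssh "$n" drbdadm adjust all
done
```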
Expected Behavior
Upgrading the drbd-utils package on a healthy production cluster must not disrupt active DRBD connections.
Possible fixes (any of which would address the issue):
- The drbd-graceful-shutdown.service's ExecStop should not run during package upgrades. One option is to add a guard that detects whether the system is actually shutting down (e.g., check the output of systemctl is-system-running, or test a flag set by shutdown.target); see the sketch after this list.
- The dh_systemd_start directive in debian/rules (or equivalent) could be customized to use --no-restart-after-upgrade for drbd-graceful-shutdown.service.
- drbd-service-shim.sh down all could be made transactional: if any disconnect succeeds but a subsequent demote/down fails, the disconnects should be rolled back.

The first or second option seems the most surgical.
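A minimal sketch of the first option, assuming the guard sits at the top of the ExecStop path in drbd-service-shim.sh (the exact integration point is a guess): systemctl is-system-running prints stopping while the system is shutting down, and some other state (running, degraded, ...) during a package upgrade.

```sh
#!/bin/sh
# Hypothetical ExecStop guard (sketch): only tear DRBD down when the
# whole system is actually shutting down, not when the unit is stopped
# as a side effect of a package upgrade.
state="$(systemctl is-system-running 2>/dev/null)"
if [ "$state" != "stopping" ]; then
    echo "system is '$state', not shutting down; leaving DRBD resources up" >&2
    exit 0
fi
exec drbdsetup down all
```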
Severity
This affects every multi-node DRBD cluster running active workloads at the moment of an apt upgrade drbd-utils. The window of disruption was small in our case (10–15 minutes before manual recovery), but the breakage is invisible to the operator until they check linstor resource list or notice that replication has stopped. For clusters with strict quorum policies, this could cause I/O errors visible to applications.
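One quick way to gauge that exposure (a sketch; the resource name is illustrative, and drbdsetup output formatting may vary by version):

```sh
# Show effective quorum-related options for a resource:
drbdsetup show pm-47d8523a --show-defaults | grep -i quorum
```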
Notes
- The upgraded version (9.34.3-1) appears to be very recent (build dated 2026-04-17).
- We had no prior issues with 9.34.0-1.
- We have not yet investigated whether downgrading or rebuilding the postinst would mitigate, since the fix is needed upstream.
Acknowledgements
Log collection, root cause analysis, and the writing of this report were done with the assistance of Claude (Anthropic), an AI assistant. The reproduction, kernel/userspace logs, and verification of the workaround on a live production cluster are mine.