Summary
Upgrading drbd-utils from 9.34.0-1 to 9.34.3-1 on a running production cluster causes all DRBD connections to be torn down and not re-established. The cluster ends up fragmented, with each node showing peers as Connecting and resources as Unknown/Diskless from the LINSTOR controller's perspective. Manual intervention (drbdadm adjust all on every node) is required to recover.
Environment
- drbd-utils: 9.34.3-1 (built 2026-04-17), upgraded from 9.34.0-1
- DRBD kernel module: 9.3.1
- LINSTOR: linstor-server 1.27.1, satellites on each node
- Kernel: 6.17.13-2-pve (Proxmox VE)
- Cluster: 4 nodes (pvevsan1–4), LINSTOR over ZFS_THIN on FC SAN, ~8 active resources, multiple primary VMs running
Root Cause Analysis
The package's postinst script (auto-generated by dh_systemd_start) invokes:

```sh
deb-systemd-invoke start 'drbd-configured.target' 'drbd-graceful-shutdown.service'
```
Under the maintainer-script sequence that dh_systemd_start generates, an upgrade stops the unit and then starts it again, so drbd-graceful-shutdown.service receives a stop followed by a start. The service is defined as:

```ini
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/lib/drbd/scripts/drbd-service-shim.sh down all
```
So during the upgrade, ExecStop runs, which executes drbdsetup down all. This:
- Disconnects all connections for all resources (succeeds unconditionally).
- Attempts to demote primary resources, which fails with (-12) Device is held open by someone for resources used by running VMs.
- Leaves the cluster in a state where connections are torn down but resources remain active locally; the disconnects are not rolled back when subsequent steps fail.
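The stop-time behavior is easy to confirm on an affected node, and the same failure can be triggered by hand; a sketch, with the resource name taken from the logs below purely as an illustration:

```sh
# Inspect the installed unit and its stop action:
systemctl cat drbd-graceful-shutdown.service
systemctl show -p ExecStop drbd-graceful-shutdown.service

# Trigger the same sequence against one resource held open by a VM:
drbdsetup down pm-47d8523a
# -> State change failed: (-12) Device is held open by someone
#    (per the sequence above, the disconnect is not rolled back)
```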
Reproduction Steps
- Have a healthy multi-node DRBD cluster with active primaries (e.g., running VMs holding /dev/drbdX open).
- Run apt upgrade drbd-utils from 9.34.0-1 to 9.34.3-1 on every node.
- Observe: connections between nodes drop to Connecting/StandAlone and do not recover.
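A convenient way to watch the teardown in real time (a sketch; assumes the linstor client is installed on the node):

```sh
# Refresh the DRBD and LINSTOR views every second during the upgrade:
watch -n1 'drbdadm status; echo; linstor resource list'
```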
Evidence from Logs
/var/log/dpkg.log on pvevsan1:

```
2026-04-18 13:45:52 upgrade drbd-utils:amd64 9.34.0-1 9.34.3-1
2026-04-18 13:45:53 configure drbd-utils:amd64 9.34.3-1 <none>
2026-04-18 13:45:55 status installed drbd-utils:amd64 9.34.3-1
```
journalctl (same time window) — systemd stops the service:

```
Apr 18 13:45:52 pvevsan1 systemd[1]: Stopping drbd-graceful-shutdown.service - ensure all DRBD resources shut down gracefully at system shut down...
```
drbd-service-shim.sh output — demote fails for primaries with active VMs:

```
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: pm-47d8523a: State change failed: (-12) Device is held open by someone
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: failed to demote
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: /dev/drbd1005 open_cnt:1, writable:1; list of openers follows
Apr 18 13:45:53 pvevsan1 drbd-service-shim.sh[2970916]: drbd1005 opened by kvm (pid 2754028)
```
Kernel log — disconnects succeed despite demote failures:

```
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan2: conn( Connected -> Disconnecting ) peer( Primary -> Unknown ) [down]
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan2: conn( Disconnecting -> StandAlone ) [disconnected]
Apr 18 13:45:53 pvevsan1 kernel: drbd pm-b425b652 pvevsan3: Cluster is now split
```
The same sequence repeats on each node as it gets upgraded (pvevsan2 at 13:46:30, pvevsan3 at 13:46:56, pvevsan4 at 13:47:23).
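For reference, the evidence above can be pulled with standard tooling; the time window is this incident's and is purely illustrative:

```sh
grep drbd-utils /var/log/dpkg.log
journalctl -u drbd-graceful-shutdown.service \
    --since "2026-04-18 13:45" --until "2026-04-18 13:50"
journalctl -k --since "2026-04-18 13:45" --until "2026-04-18 13:50" | grep drbd
```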
Resulting State
After the upgrade completed on all nodes, linstor resource list showed:
- All connections in Connecting state.
- Most resources reported as Unknown from peers' perspective.
- Two diskless InUse resources (whose peers became unreachable) effectively isolated.

drbdadm status on each node showed only the locally-active resources, with all peers stuck in Connecting.
Workaround
Run on every node:
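```sh
# The command named in the Summary: re-apply the on-disk configuration
# to the kernel and reconnect peers.
drbdadm adjust all
```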
This re-establishes the configuration in the kernel and reconnects peers. After applying on all four nodes, the cluster recovered fully (with normal resyncs for resources that had diverged briefly).
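For a small cluster like this one, the fix can be driven from a single host; a sketch, assuming passwordless SSH and the node names above:

```sh
for n in pvevsan1 pvevsan2 pvevsan3 pvevsan4; do
    ssh "$n" drbdadm adjust all
done
```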
Expected Behavior
Upgrading the drbd-utils package on a healthy production cluster must not disrupt active DRBD connections.
Possible fixes (any of which would address the issue):
- The drbd-graceful-shutdown.service's ExecStop should not run during package upgrades. One option is to add a guard that detects whether the system is actually shutting down (e.g., check the output of systemctl is-system-running, or test a flag set by shutdown.target); see the sketch after this list.
- The dh_systemd_start directive in debian/rules (or equivalent) could be customized to use --no-restart-after-upgrade for drbd-graceful-shutdown.service.
- drbd-service-shim.sh down all could be made transactional: if any disconnect succeeds but a subsequent demote/down fails, the disconnects should be rolled back.

The first or second option seems the most surgical.
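A minimal sketch of the first option, assuming the guard sits at the top of the ExecStop path in drbd-service-shim.sh (the exact integration point is a guess): systemctl is-system-running prints stopping while the system is shutting down, and some other state (running, degraded, ...) during a package upgrade.

```sh
#!/bin/sh
# Hypothetical ExecStop guard (sketch): only tear DRBD down when the
# whole system is actually shutting down, not when the unit is stopped
# as a side effect of a package upgrade.
state="$(systemctl is-system-running 2>/dev/null)"
if [ "$state" != "stopping" ]; then
    echo "system is '$state', not shutting down; leaving DRBD resources up" >&2
    exit 0
fi
exec drbdsetup down all
```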
Severity
This affects every multi-node DRBD cluster running active workloads at the moment of an apt upgrade drbd-utils. The window of disruption was small in our case (10–15 minutes before manual recovery), but the breakage is invisible to the operator until they check linstor resource list or notice that replication has stopped. For clusters with strict quorum policies, this could cause I/O errors visible to applications.
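One quick way to gauge that exposure (a sketch; the resource name is illustrative, and drbdsetup output formatting may vary by version):

```sh
# Show effective quorum-related options for a resource:
drbdsetup show pm-47d8523a --show-defaults | grep -i quorum
```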
Notes
- The upgraded version (9.34.3-1) appears to be very recent (build dated 2026-04-17).
- We had no prior issues with 9.34.0-1.
- We have not yet investigated whether downgrading or rebuilding the postinst would mitigate, since the fix is needed upstream.
Acknowledgements
Log collection, root cause analysis, and the writing of this report were done with the assistance of Claude (Anthropic), an AI assistant. The reproduction, kernel/userspace logs, and verification of the workaround on a live production cluster are mine.