Rebase onto upstream drbd-9.2.18 and port the flant-9.2.17 fix line (9.2.18-flant.9)#9

Merged: astef merged 121 commits into flant-9.2 from flant-9.2.18 on Jun 24, 2026

astef commented Jun 23, 2026

Moves our kernel-module fork off the old 9.2.16 base onto upstream drbd-9.2.18, keeps every Flant patch, and folds in the flant-9.2.17 bug-fix line. Builds clean (drbd.ko + tcp/lb-tcp/rdma, kernel 6.8).

  • Rebased onto drbd-9.2.18 — picks up all upstream 9.2.17/9.2.18 fixes; our work replayed as 9.2.18-flant.1…3 (quorum-minimum-redundancy, configurable dynamic voters, non-voting disk, plus the drbdsetup-teardown / bitmap-IO-quorum / resync-correctness deadlock fixes).
  • Ported the flant-9.2.17 fixes as flant.4…9: use-after-free in resync-discard, dagtag resync stall after Primary reconnect, and four diskless-Primary deadlocks (UUID rotation, UUID-ancestor, false split-brain, reconciliation-resync).
  • Renumbered our netlink fields non_voting 28→36 and quorum_dynamic_voters 17→25 (headers a6d1604c) to stay collision-free against future upstream fields.

Ships with 3p-drbd-utils flant-9.32.0 and 3p-drbd-headers flant — all three pin headers a6d1604c (netlink ABI 36/25, proto 118–124). Note: flant.4+ IDs (36/25) differ from flant.1–3 (28/17), so pair with the matching utils.
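
The pairing rule in the note above can be restated as a small lookup. This is only a sketch: the function name `netlink_field_ids` and its integer `flant_rev` argument are hypothetical, and the IDs are simply the ones quoted in this description.

```python
def netlink_field_ids(flant_rev: int) -> dict:
    """Return the netlink field IDs used by a 9.2.18-flant.<flant_rev> build.

    Hypothetical helper: it only restates the numbers quoted in the
    description (28/17 for flant.1-3, 36/25 from flant.4 on).
    """
    if flant_rev < 1:
        raise ValueError("flant revisions start at 1")
    if flant_rev >= 4:
        return {"non_voting": 36, "quorum_dynamic_voters": 25}
    return {"non_voting": 28, "quorum_dynamic_voters": 17}
```

A module and a utils build that disagree on these IDs would not understand each other's `non_voting` and `quorum_dynamic_voters` attributes, which is why the matching utils are required.
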

JoelColledge and others added 30 commits November 27, 2025 11:03
Previously we considered whether the peer is SyncTarget, but not whether
we are SyncTarget. This could lead to mismatched decisions where one
node detects "Missed end of resync as sync-source", but the
corresponding peer does not detect "Peer missed end of resync". This
results in mismatched replication states.

Below is an example of some logs where this occurred. Both nodes become
PausedSyncS towards each other, and later get stuck in SyncSource
towards each other.

drbd res/0 drbd0 lin3: drbd_sync_handshake:
drbd res/0 drbd0 lin3: self 7A2D98F4A853F406:0000000000000000:0000000000000000:0000000000000000 bits:76800 flags:20
drbd res/0 drbd0 lin3: peer 7A2D98F4A853F406:A0B2DA7EBB1562DA:0000000000000000:0000000000000000 bits:2816 flags:1824
drbd res/0 drbd0 lin3: uuid_compare()=no-sync by rule=both-off
drbd res/0 drbd0 lin3: strategy = source-copy-other-bitmap due to disk states. (UpToDate/Inconsistent)
drbd res/0 drbd0 lin3: Copying bitmap of peer node_id=3 (bitmap_index=2)
drbd res lin3: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [connected]
drbd res/0 drbd0 lin3: pdsk( DUnknown -> Inconsistent ) repl( Off -> WFBitMapS ) resync-susp( no -> peer ) [connected]
drbd res/0 drbd0 lin3: repl( WFBitMapS -> PausedSyncS ) [receive-bitmap]

drbd res/0 drbd0 lin1: Missed end of resync as sync-source
drbd res/0 drbd0 lin1: drbd_sync_handshake:
drbd res/0 drbd0 lin1: self 7A2D98F4A853F406:A0B2DA7EBB1562DA:0000000000000000:0000000000000000 bits:2816 flags:24
drbd res/0 drbd0 lin1: peer 7A2D98F4A853F406:0000000000000000:0000000000000000:0000000000000000 bits:76800 flags:1020
drbd res/0 drbd0 lin1: uuid_compare()=source-use-bitmap by rule=sync-source-missed-finish
drbd res lin1: Committing remote state change 767181164 (primary_nodes=0)
drbd res lin1: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [remote]
drbd res/0 drbd0 lin1: pdsk( DUnknown -> Consistent ) repl( Off -> WFBitMapS ) [remote]
drbd res/0 drbd0 lin1: repl( WFBitMapS -> PausedSyncS ) [receive-bitmap]
The "COMPAT_84=y" was already on the build command line, but not on the
make install command line. So the make install sees that invocations
won't match, builds without the compat define, installs those, and rpm
then collects the binaries from the second build.

Fixes: 510bec0 ("build: make COMPAT_84 flag controllable in packaging")
This parameter recently got added to queue_limits, and it gets
initialized to UINT_MAX by default. Set it to 0 to indicate that we
don't support unmap write zeroes.

Equivalent of upstream commit 027a7a9c07d0 ("drbd: init
queue_limits->max_hw_wzeroes_unmap_sectors parameter"), including the
appropriate compat patch.
Use the %pISpc format specifier to print IP addresses with port numbers
in dtl_debugfs_show(). This also adds IPv6 support, as the previous
code only handled IPv4.
Commit d042fab ("drbd: Fix a memory leak and remove the open-coded
page pool") removed the only reader of this list outside of debugging
code. Hence the list is no longer necessary and can be removed. This is
similar to the removal of net_ee in the aforementioned commit.
It took me some time to rediscover why we need this, so add an
explanatory comment.
This makes the meaning clearer in most cases and prepares the way for
introducing a separate flag with the specific meaning that the request
has been sent.
This is preparation so that we can use both flags on one peer request.
So that we can use find_resync_request() in the normal case. This
prevents problems that occurred when find_resync_requests() [note the
plural] found requests that it should not have done.

Also use a different approach for finding multiple matching requests
which avoids matching too much. Since this is only used in a specific
case, we can implement it in a way that should be reliable.

An example of the problems mentioned above follows. We are node "S" and
peer "T" is running DRBD 9.1.23, that is protocol version 121. S is
SyncSource and T is SyncTarget:
* S handles a write submission while the bitmap exchange occurs; sends
  P_OUT_OF_SYNC
* T sends resync request r0
* T receives P_OUT_OF_SYNC; the resync position jumps backwards
* T sends resync request r1 for the same interval as r0
* S receives and submits r0, handles completion and sends reply
* T sends resync ack for r0
* S receives r1
* S receives resync ack for r0; find_resync_requests() matches both r0
  and r1; S manipulates ((drbd_peer_request) r1).w.list, which is
  concurrently being used for something else

This results in a crash such as this one:

list_del corruption. prev->next should be ff3cb1eb2f47afa0, but was ff3cb1e799af3000
kernel BUG at lib/list_debug.c:51!
RIP: 0010:__list_del_entry_valid.cold+0x31/0x47
Call Trace:
 free_waiting_resync_requests+0x23a/0x4c0 [drbd]
 drain_resync_activity+0x2ab/0x4b0 [drbd]
 conn_disconnect+0xf5/0x770 [drbd]

Or similarly:

list_del corruption, ffff9c95829624d0->next is LIST_POISON1 (dead000000000100)
kernel BUG at lib/list_debug.c:45!
RIP: 0010:__list_del_entry_valid.cold+0xf/0x47
Call Trace:
 drbd_free_peer_req+0xde/0x270 [drbd]
 drbd_free_peer_reqs+0x80/0xc0 [drbd]
 conn_disconnect+0x375/0x770 [drbd]
That is, peers that send one ack corresponding to multiple resync
requests. In particular, protocol version 121 with feature
RESYNC_DAGTAG.

The code for handling this case is untested and, even if we test it now,
is liable to rot. Better to avoid the situation rather than keep code
that is likely to be buggy.
A resync might skip over blocks that are already in-sync. Prior to this
change, P_PEERS_IN_SYNC packets were sent for the blocks the resync
skipped over. When this was a large range, many P_PEERS_IN_SYNC packets
were sent, causing similar problems to those fixed by commit
e2d0439 ("drbd: only send P_PEERS_IN_SYNC for up to 4 MiB when
resync finished").

The solution is to skip steps (extents) where there is no resync
activity. Track the last sync position (last_in_sync_end) and send
P_PEERS_IN_SYNC whenever we jump to a different step or reach the end of
a step.

Fixes: bc218ad ("drbd: only send P_PEERS_IN_SYNC every 4MiB")

Co-developed-by: zhengbing.huang <zhengbing.huang@easystack.cn>
Signed-off-by: zhengbing.huang <zhengbing.huang@easystack.cn>
Signed-off-by: Joel Colledge <joel.colledge@linbit.com>
alloc_send_buffer() does an implicit flush_send_buffer(),
but ignores its return value. This masks the network failure
from _drbd_send_bio() until it reaches the last page in the bio.

The minimal fix seems to be to change_state(,C_NETWORK_FAILURE,)
as soon as we detect the send failure in flush_send_buffer().
alloc_send_buffer() may implicitly flush accumulated data
if there is not enough room to accommodate new data.

It must not ignore potential network failures.

Now alloc_send_buffer() may also return an ERR_PTR.
Callers need to check with IS_ERR and handle errors as appropriate.
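
The error-propagation idea can be modelled outside the kernel. The sketch below is a simplified Python model, not the DRBD code: the `SendBuffer` class, its `transmit` callback and the `NetworkFailure` exception are hypothetical, and a raised exception stands in for the ERR_PTR / IS_ERR convention used in C.

```python
class NetworkFailure(Exception):
    """Stands in for the ERR_PTR that alloc_send_buffer() may now return."""


class SendBuffer:
    def __init__(self, capacity, transmit):
        self.capacity = capacity
        self.transmit = transmit  # callable(bytes) -> bool; False means the send failed
        self.pending = bytearray()

    def flush(self):
        data, self.pending = bytes(self.pending), bytearray()
        if data and not self.transmit(data):
            raise NetworkFailure("send failed while flushing")

    def alloc(self, size):
        # Not enough room: flush first, and let a failure reach the caller
        # instead of silently dropping it.
        if len(self.pending) + size > self.capacity:
            self.flush()
        start = len(self.pending)
        self.pending.extend(bytes(size))
        return start
```

With this shape a caller learns about the failure at the first allocation that triggers a flush, rather than only at the last page of the bio.
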
Same thing as 6c7ff08 (ci: use full container repository names), but
for the drbd-9.2 branch.
Skip this commit when forward-merging to master.
In _dtl_recv_page(), the receive buffer pointer data was used instead
of the advancing pointer pos when calling dtl_recv_short(). When
receiving a page in multiple chunks from different load-balanced TCP
paths, each chunk overwrites the beginning of the page instead of being
appended at the correct offset, causing data corruption.

Replace data with pos so that each received chunk is placed at the
correct position within the page.

Fixes: 3ba5b50 ('lb-tcp: Fix dtl_recv_pages()')
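
The effect of using the fixed start pointer instead of the advancing one can be shown with a small model. This is a Python sketch, not the transport code: the function names are hypothetical, and it assumes the chunks together fit into the page.

```python
def recv_page_buggy(chunks, page_size):
    """Every chunk is written at the start of the page (the 'data' pointer)."""
    page = bytearray(page_size)
    for chunk in chunks:
        page[0:len(chunk)] = chunk
    return bytes(page)


def recv_page_fixed(chunks, page_size):
    """Every chunk is written at the advancing offset (the 'pos' pointer)."""
    page = bytearray(page_size)
    pos = 0
    for chunk in chunks:
        page[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(page)
```

For the chunks `b"AAAA"` and `b"BB"` in a 6-byte page, the buggy variant overwrites the first two bytes and leaves the tail zeroed, while the fixed variant reassembles `b"AAAABB"`.
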
DRBD requires stable pages because it may read the same bio data
multiple times for local disk I/O and network transmission, and in
some cases for calculating checksums.

The BLK_FEAT_STABLE_WRITES flag is set when the device is first
created, but blk_set_stacking_limits() clears it whenever a
backing device is attached. In some cases the flag may be
inherited from the backing device, but we want it to be enabled
at all times.

Unconditionally re-enable BLK_FEAT_STABLE_WRITES in
drbd_reconsider_queue_parameters() after the queue parameter
negotiations.

Also, document why we want this flag enabled in the first place.
Commit 464c5c7 introduced freeing the bitmap of an existing device.
Only a device that has a backing disk can have a bitmap. Phrased
differently, a diskless node can never have a bitmap. A consequence of
this is that the get_ldev()/put_ldev() protection is sufficient to
protect accesses to device->bitmap. That extends the previously
existing convention that every access to device->ldev needs to be
protected by get_ldev()/put_ldev().

The get_ldev()/put_ldev() bracket delays freeing of a backing device, and
from now on also of a bitmap, until every other context has stopped using
the bitmap or ldev objects.

Here is the stack trace from a LINSTOR test that triggered the above insight:

 BUG: kernel NULL pointer dereference, address: 000000000000016c
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
[...]
 Call Trace:
  <TASK>
  ? show_regs+0x6d/0x80
  ? __die+0x24/0x80
  ? page_fault_oops+0x99/0x1b0
  ? do_user_addr_fault+0x2ee/0x6b0
  ? exc_page_fault+0x83/0x1b0
  ? asm_exc_page_fault+0x27/0x30
  ? drbd_set_sync+0x2e/0x410 [drbd]
  ? dtt_recv+0x157/0x270 [drbd_transport_tcp]
  ? srso_alias_return_thunk+0x5/0xfbef5
  drbd_set_all_out_of_sync+0x1c/0x30 [drbd]
  receive_rs_deallocated+0xd1/0x270 [drbd]
  ? __pfx_receive_rs_deallocated+0x10/0x10 [drbd]
  drbd_receiver+0x5a0/0xae0

Move drbd_bm_free() to drbd_ldev_destroy(), which runs as deferred
work only after the disk state is D_DISKLESS and local_cnt has reached
zero. This guarantees that no get_ldev() holders remain when the bitmap is
freed.

Fixes: 464c5c7 ("drbd: drbd_alloc_bitmap() only after drbd_read_md")
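
The lifetime rule described above can be modelled as a reference count with deferred destruction. This is a single-threaded Python sketch under stated assumptions: the `Device` class and `_maybe_destroy` helper are hypothetical, and the real code uses disk states, atomics and deferred work rather than plain attributes.

```python
class Device:
    def __init__(self):
        self.diskless = False       # models disk state D_DISKLESS
        self.local_cnt = 0          # number of get_ldev() holders
        self.ldev = object()        # backing device
        self.bitmap = bytearray(8)  # only exists while there is a backing disk

    def get_ldev(self):
        if self.diskless:
            return False
        self.local_cnt += 1
        return True

    def put_ldev(self):
        self.local_cnt -= 1
        self._maybe_destroy()

    def detach(self):
        self.diskless = True
        self._maybe_destroy()

    def _maybe_destroy(self):
        # Models drbd_ldev_destroy(): runs only when diskless AND no holders,
        # and frees the bitmap together with the backing device.
        if self.diskless and self.local_cnt == 0:
            self.ldev = None
            self.bitmap = None
```

A holder that obtained its reference before the detach keeps both objects alive until its `put_ldev()`, and no new reference can be taken afterwards.
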
If a netlink message provides a peer_node_id inside DRBD_NLA_CFG_CONTEXT
but omits the resource_name, adm_ctx->resource stays NULL. The code then
calls drbd_get_connection_by_node_id(NULL, ...) which dereferences the
NULL pointer when iterating resource->connections, causing a kernel crash.

Reject such requests early with ERR_INVALID_REQUEST instead.

Reported-by: Xinqian Sun <xinqian.sun@u.northwestern.edu>
When a resync completes on the target side, but the source never
receives the completion notification (e.g., due to a disconnect), the
source retains a stale bitmap UUID while both sides have 0 bitmap
bits. On reconnect, if the missed-end-of-resync detection forces a node
that is now Primary to become a resync target, we get a state transition
failure, since the state machine refuses to give up the last
up-to-date copy of the data.

Fix that by, when detecting a "missed end of resync" during reconnect,
checking whether there are actually any out-of-sync bits before
setting the RS_SOURCE_MISSED_END / RS_PEER_MISSED_END flags.
Also, clear the stale bitmap UUID and push it to history.
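
A rough model of that decision, as a Python sketch: the function name is hypothetical, the two flags are collapsed into one boolean, and the UUID history is a plain list rather than the fixed-size slots DRBD keeps.

```python
def on_missed_end_detected(bitmap_uuid, history, out_of_sync_bits):
    """Return (set_missed_end_flag, new_bitmap_uuid, new_history)."""
    if out_of_sync_bits > 0:
        # There really is something left to resync: keep the old behaviour.
        return True, bitmap_uuid, history
    # Nothing is out of sync, so the bitmap UUID is stale:
    # retire it to the history and do not force a resync direction.
    if bitmap_uuid != 0:
        history = [bitmap_uuid] + history
    return False, 0, history
```
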
Add a static analysis tool that verifies all accesses to device->ldev
and device->bitmap are protected by get_ldev()/put_ldev() brackets.

The checker uses tree-sitter to parse C without preprocessing and
performs bottom-up call graph analysis: functions that directly access
->ldev or ->bitmap propagate a "needs_ldev" requirement upward through
callers. A function is reported when it reaches a call site that is
neither inside a get_ldev/put_ldev bracket nor deferred to its own
caller.

Key features:
- Recognizes negated get_ldev (if (!get_ldev()) bail-out) and positive
  get_ldev (if (get_ldev()) { ... }) protection patterns
- Handles function pointer arguments (e.g. drbd_bitmap_io(dev, &fn, ...))
  as indirect call sites inheriting the caller's protection context
- Resolves variable types from function parameters and local declarations
  to only flag accesses on struct drbd_device, avoiding false positives
  from unrelated structs with bitmap or ldev fields
- Supports /* ldev_safe: reason */ comment annotations for call sites
  where domain knowledge guarantees ldev cannot go away
- Tolerates tree-sitter misparses caused by unexpanded macros by
  detecting and skipping fake outer function_definitions
- Reports full call chain trees from unprotected entry point down to
  the leaf access

Usage: python3 checks/check_ldev_access.py drbd/*.c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
They were part of a framework to detect accesses to device->ldev that
relied on a patched version of sparse. Remove all that, as we have a
better checker now.
peer_device_proc_drbd_show() accessed device->bitmap (via
drbd_bm_total_weight() and drbd_syncer_progress()) and device->act_log
without holding a get_ldev() reference. If a detach races with a debugfs
read, the bitmap or ldev could be freed while still in use.

Acquire get_ldev_if_state() once at the top and hold it across all
bitmap and activity log accesses, replacing the narrower get_ldev block
that only covered lc_seq_printf_stats().
seq_print_device_proc_drbd() accessed device->bitmap (via
drbd_bm_total_weight() and drbd_syncer_progress()) without holding a
get_ldev() reference. If a detach races with a /proc/drbd read, the
bitmap could be freed while still in use.

Acquire get_ldev_if_state() at the top and guard all bitmap accesses
with have_ldev.
receive_bitmap() accessed the bitmap (via drbd_bm_slot_lock(),
drbd_bm_bits(), drbd_bm_words(), and the receive/decode/send bitmap
calls) without holding a get_ldev() reference. If a detach races with
bitmap reception, the bitmap could be freed while still in use.

Acquire get_ldev() after the wait_event and before drbd_bm_slot_lock(),
release it after drbd_bm_slot_unlock() on both the normal and error
paths.
receive_peer_dagtag() called drbd_bm_clear_many_bits() without holding
a get_ldev() reference. If a detach races with the reconciliation
logic, the bitmap could be freed while still in use.

Wrap the call with get_ldev()/put_ldev() per device.
…ev()

make_ov_request() accesses device->ldev indirectly through
drbd_rs_c_min_rate_throttle() without holding an ldev reference. If a
concurrent detach races with online-verify, this can lead to a
use-after-free. Add get_ldev()/put_ldev() brackets around the body.
And it had an unbalanced put_ldev() on the allocation failure path.
Extend the get_ldev()/put_ldev() bracket in drbd_bm_resize() to also
cover the drbd_md_dax_active(device->ldev) and drbd_dax_bitmap() calls,
which previously accessed device->ldev outside any protection.

The old code had a narrow get_ldev/put_ldev bracket that only validated
on-disk bitmap space, then accessed device->ldev unprotected for the DAX
check. While callers passing capacity != 0 hold ldev today, the function
itself would crash if ever called without ldev on a non-zero capacity.

Restructure the code so that the DAX path runs inside the extended
get_ldev bracket, and the page-based allocation (which does not use ldev)
runs unconditionally when bm_on_pmem was not set.

Also annotate the device->bitmap access for the static checker.
Mark code paths where ldev is implicitly held with /* ldev_safe: reason */
comments so the get_ldev()/put_ldev() static checker can distinguish them
from real bugs. These paths include bio endio callbacks (where ldev is
held since I/O submission), request processing (where queued requests
hold their own ldev references), state machine callbacks (where
extra_ldev_ref_for_after_state_chg() holds an extra local_cnt reference),
and worker/sender thread operations on requests with existing ldev refs.
… annotation

Add a second analysis pass to check_ldev_access.py that verifies every
exit path from a function balances get_ldev() with put_ldev(). This
catches reference leaks where an error path returns without releasing
the ldev reference, which prevents detach.

Introduce a dedicated /* ldev_ref_transfer: reason */ annotation for
functions that intentionally pass their ldev reference to an async
operation (e.g. submitted peer_request whose endio calls put_ldev).
Annotate the four existing reference transfer sites.

When the new annotation is placed at the beginning of a function, it
indicates that the function receives the ldev reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
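
The balance rule can be illustrated with a toy model. This Python sketch is not how check_ldev_access.py works internally (that tool parses C with tree-sitter); the function name and the representation of an exit path as a list of call names are assumptions made for illustration.

```python
def unbalanced_exit_paths(paths, transfer_annotated=()):
    """paths maps an exit-path label to the ordered calls seen on that path.

    Returns (label, held_references) for every path that leaves the function
    with a non-zero get_ldev()/put_ldev() balance and is not annotated as an
    intentional reference transfer.
    """
    leaks = []
    for label, calls in paths.items():
        held = 0
        for call in calls:
            if call == "get_ldev":
                held += 1
            elif call == "put_ldev":
                held -= 1
        if held != 0 and label not in transfer_annotated:
            leaks.append((label, held))
    return leaks
```

An error path that returns after `get_ldev()` without `put_ldev()` is reported with a balance of 1, while a path that hands its reference to an async operation is accepted once it is listed as annotated.
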
astef and others added 25 commits June 23, 2026 21:57
Mark the module version to distinguish builds that include our quorum
and non-voting disk patches from upstream 9.2.18.

flant.1 features:
- quorum-minimum-redundancy enforcement (diskless nodes, tiebreaker,
  runtime recalculation)
- CS_FORCE_RECALC bypass of quorum validation
- recalculate quorum after forget-peer
- CS_HARD for C_UNCONNECTED transition in conn_disconnect
- configurable dynamic voter reduction
- non-voting disk for quorum exclusion
Fixed a race in w_resync_timer that could permanently stall resync.
Administrative operations on any resource (disk attach, resync-after
reconfiguration) trigger a global resync-dependency recalculation
(drbd_pause_after / drbd_resume_next) that may briefly pause and
unpause an unrelated resource's sync target. If the resync timer
fires during this transient pause, the sender thread could read the
intermediate L_PAUSED_SYNC_T state without holding the state lock,
enter the wrong code path, and never reschedule the resync work.
The affected resource would show SyncTarget with 0% progress and
all worker threads idle; the only recovery was to manually
disconnect and reconnect.

Read repl_state[NOW] under read_lock_irq(state_rwlock) before the
switch statement to guarantee a consistent snapshot of the state.

Also bump version to 9.2.18-flant.2.
When a receiver thread restarts after a failed connection attempt,
there is a window between drbd_put_listener() in finish_connect and
drbd_get_listener() in the next prepare_connect where the path is not
registered as a waiter on the shared TCP listener. During this window,
incoming connections for that path are rejected with "Closing
unexpected connection" because drbd_find_path_by_addr() cannot find
the path in the listener's waiters list.

This race becomes critical when combined with strict quorum settings
(quorum-minimum-redundancy >= 2). A diskless node that needs multiple
quorate peers may connect to only one peer before the rejected
connection triggers a cascade: the node detects a cluster split,
disconnects its only peer, loses quorum, and enters suspend-io. It
then has to go through a full resource teardown and reconnect cycle.

The race is amplified when one peer has an asymmetric configuration
(missing connections to some nodes). The receiver thread for the
unreachable peer sits permanently in dtt_wait_for_connect() on the
shared listener, making it the one that always calls accept() and
encounters the missing path.

TCP transport (drbd_transport_tcp.c):

 - dtt_finish_connect(): only unregister listeners and clean up
   accepted sockets on successful connect. On failure, keep the path
   registered so incoming connections are routed correctly.

 - dtt_prepare_connect(): always clean up accepted sockets at the
   start of each connect attempt, regardless of whether the listener
   is already registered. This is safe because dtt_prepare_connect()
   is called at the start of each drbd_transport_connect() invocation,
   before dtt_connect() runs, so there is no concurrent socket routing
   via dtt_wait_for_connect(). Without this cleanup, stale accepted
   sockets from interrupted connect attempts would accumulate and
   cause permanent "short read (expected size 8)" / BrokenPipe
   failures on subsequent connect cycles.

 - dtt_remove_path(): add dtt_cleanup_accepted_sockets() after
   drbd_put_listener() to prevent socket leaks on path destruction.

lb-tcp transport (drbd_transport_lb-tcp.c):

 - dtl_set_active(false): only close sockets, do not unregister
   listeners. The DTL_CONNECTING flag check in dtl_accept_work_fn()
   already prevents stale socket accumulation by rejecting accepts
   when the transport is not actively connecting.

Both transports guarantee eventual cleanup through two paths:
successful connect (finish_connect unregisters) or path destruction
(remove_path unregisters). The RDMA transport is not affected as it
uses an event-driven model with atomic state transitions.

Reproducible with some probability by even a single drbdsetup disconnect,
and reliably by rapid repeated disconnects.

Signed-off-by: David Magton <david.magton@flant.com>
drbd_should_abort_listening() only checked get_t_state() for EXITING
inside an "if (signal_pending(current))" block. When
_drbd_thread_stop() sends SIGHUP to a receiver thread that is inside
dtt_try_connect() -> sock->ops->connect(), the signal is consumed by
the interruptible connect() syscall. By the time the receiver returns
to dtt_connect() and calls drbd_should_abort_listening(),
signal_pending() is false, so get_t_state() is never checked. The
receiver loops back into connect(), never sees t_state == EXITING,
and _drbd_thread_stop() waits on the completion forever in D-state.

This is easily triggered when a connection is configured to a
non-existent peer: the receiver thread loops in dtt_connect() doing
TCP connect attempts that always fail. drbdsetup down -> del_connection
-> _drbd_thread_stop hangs indefinitely, blocking all subsequent admin
commands on the resource due to adm_mutex contention.

Fix by checking get_t_state() unconditionally, not only when a signal
is pending. EXITING is only set by _drbd_thread_stop(restart=false),
which is an explicit request for the thread to terminate. Checking it
without a pending signal cannot cause false positives.

Signed-off-by: David Magton <david.magton@flant.com>
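
The difference between the two checks can be reduced to a pair of predicates. This is a Python sketch that models only the EXITING decision; the function names and the string thread states are hypothetical, and the real drbd_should_abort_listening() does more than this.

```python
EXITING, RUNNING = "EXITING", "RUNNING"


def should_abort_before(signal_pending, t_state):
    # Old behaviour: the thread state is only looked at while a signal is pending.
    if signal_pending:
        if t_state == EXITING:
            return True
    return False


def should_abort_after(signal_pending, t_state):
    # New behaviour: EXITING is honoured even if connect() already consumed the signal.
    return t_state == EXITING
```

With the signal already consumed, `should_abort_before(False, EXITING)` is false and the receiver keeps looping, whereas `should_abort_after(False, EXITING)` is true.
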
Problem:
When a DRBD connection is being torn down (e.g. drbdsetup del-peer),
conn_disconnect() waits for active_ee_cnt to reach zero via wait_event().
However, a PeerWrite request that has been received but has not yet
acquired an Activity Log (AL) entry (i.e. stuck in "blocked-on-al" state)
keeps active_ee_cnt at 1 indefinitely.

The request lifecycle in this deadlock scenario:
1. PeerWrite is received in receive_Data(), active_ee_cnt is incremented,
   request is added to connection->peer_requests (recv_order list).
2. drbd_al_begin_io_fastpath() fails (e.g. due to quorum suspension or
   AL being locked), so the request is queued to device->submit.peer_writes
   via drbd_queue_peer_request(), which also triggers do_submit worker.
3. The connection drops. conn_disconnect() sets cstate to C_NETWORK_FAILURE
   and then waits for active_ee_cnt == 0.
4. The do_submit worker may have already completed its run before the
   connection state changed, or may be blocked behind a locked AL in
   prepare_al_transaction_nonblock(). In either case, it never processes
   this PeerWrite for cleanup.
5. Deadlock: conn_disconnect waits on active_ee_cnt, but the only code
   that can decrement it (drbd_cleanup_peer_requests_wfa, called from
   do_submit) never runs for this request. drbdsetup hangs in D state.

This was observed in production: drbdsetup del-peer and the DRBD receiver
thread both stuck in D (uninterruptible sleep), with debugfs showing
active_ee_cnt: 1 and a PeerWrite with flags "blocked-on-al" aged
67 million jiffies.

Fix (three parts):

1. prepare_al_transaction_nonblock(): Move the cstate < C_CONNECTED
   cleanup loop BEFORE the __LC_LOCKED check. Previously, if the AL was
   locked, the function would immediately goto out, skipping the cleanup
   of requests from disconnected connections. Now these requests are
   always moved to the cleanup list regardless of AL lock state.

2. conn_disconnect(): After drain_resync_activity() and before the
   active_ee_cnt wait, explicitly kick the submit worker (queue_work)
   and wake the AL wait queue (wake_up) for every device on the
   disconnecting connection. This forces do_submit to run, see the
   disconnected cstate, and clean up any blocked-on-al PeerWrites
   via prepare_al_transaction_nonblock -> drbd_cleanup_peer_requests_wfa,
   which decrements active_ee_cnt and wakes ee_wait.

3. cleanup_unacked_peer_requests(): Add a WARN_ON_ONCE guard before
   drbd_al_complete_io() to detect if a peer request without
   EE_IN_ACTLOG ever reaches this function. Calling drbd_al_complete_io
   on such a request would corrupt AL reference counts. With fixes 1+2
   this should never happen, but the guard provides safety and
   diagnostics.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
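
Part 1 of the fix is an ordering change, which a small model makes visible. This Python sketch is heavily simplified: the signature, the `is_connected` callback and the returned tuple are hypothetical, and the unlocked case ignores that acquiring AL extents can still fail.

```python
def prepare_al_transaction_nonblock(pending, al_locked, is_connected):
    """Return (to_submit, still_pending, to_cleanup)."""
    # Cleanup of requests from disconnected connections happens first,
    # so it no longer depends on the AL lock state.
    to_cleanup = [req for req in pending if not is_connected(req)]
    still_pending = [req for req in pending if is_connected(req)]
    if al_locked:
        # AL is locked: nothing can be submitted now, but cleanup still ran.
        return [], still_pending, to_cleanup
    return still_pending, [], to_cleanup
```

In the old ordering the locked case returned before the cleanup list was built, which is how a blocked-on-al PeerWrite could keep active_ee_cnt above zero.
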
When del_connection() calls _drbd_thread_stop() for the receiver,
an external event (such as a peer reconnecting or a state change
callback from another connection) can trigger a C_STANDALONE ->
C_UNCONNECTED transition, which calls drbd_thread_start(). This
converts the receiver's t_state from EXITING to RESTARTING, causing
it to restart instead of exit. The _drbd_thread_stop caller then
waits on the completion forever in D-state.

This was observed in production: drbdsetup down issued while a peer
was simultaneously reconnecting. The receiver exited its connect
loop (seeing EXITING), entered conn_disconnect, went to StandAlone.
But before _drbd_thread_stop's wait_for_completion could be
satisfied, the state machine restarted the receiver due to the
peer's connection attempt. The receiver re-entered drbdd and the
completion was never signaled.

Fix by moving change_cstate(C_STANDALONE) before drbd_thread_stop()
in del_connection(). With C_STANDALONE set, conn_disconnect() skips
the receiver restart (oc >= C_UNCONNECTED is false for C_STANDALONE),
and the C_STANDALONE -> C_UNCONNECTED transition that triggers
drbd_thread_start() cannot occur.

The change_cstate(C_STANDALONE) call already existed in
del_connection() as a "race breaker" (added by Lars Ellenberg in
2011, commit 980d566), but was placed after drbd_thread_stop()
where it was never reached when the hang occurred.

Signed-off-by: David Magton <david.magton@flant.com>
…shed connection

When _drbd_thread_stop() sends SIGHUP to a receiver thread that is
blocked in tcp_recvmsg (established connection, receiving data), the
signal may be consumed by the TCP receive path without causing
drbd_recv() to return an error. Additionally, when del_connection()
sets C_STANDALONE before _drbd_thread_stop, the state machine's
finish_state_change() already calls drbd_thread_stop_nowait() for
the C_STANDALONE transition, which sends SIGHUP first. The subsequent
_drbd_thread_stop in del_connection sees t_state already EXITING and
does not send a second signal.

Fix by checking get_t_state() after every transport recv in
drbd_recv(). If the thread is in EXITING state, force -EINTR return
regardless of recv result. This makes the receiver exit immediately
without waiting for the next loop condition check.

Only EXITING is checked, not RESTARTING. RESTARTING means "finish
current iteration and restart" (normal disconnect + reconnect flow).
Treating RESTARTING as an exit condition would kill the feature
exchange during reconnect, causing a permanent BrokenPipe loop:
Connecting -> BrokenPipe -> "short read (expected size 8)" ->
Unconnected -> Connecting, repeating indefinitely.

Triggered by drbdsetup down or del-peer when the receiver is blocked
in tcp_recvmsg on an established connection with the peer actively
sending data.

Signed-off-by: David Magton <david.magton@flant.com>
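
The rule "EXITING forces -EINTR, RESTARTING does not" can be written as one function. This is a Python sketch; `recv_result` and the string thread states are hypothetical names, not the kernel code.

```python
import errno

EINTR = errno.EINTR


def recv_result(transport_rv, t_state):
    """Model of the check after each transport recv in drbd_recv()."""
    if t_state == "EXITING":
        # Explicit stop request: report an interruption regardless of the recv result.
        return -EINTR
    # RESTARTING (and every other state) passes the result through unchanged,
    # so the feature exchange during a reconnect is not cut short.
    return transport_rv
```
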
When del_connection sets C_STANDALONE and marks the receiver thread as
EXITING, the receiver in conn_connect's retry loop can race past both
state changes. At the "start:" label, change_cstate(C_CONNECTING)
succeeds because net_conf still exists, overriding C_STANDALONE.
The -EAGAIN handler only checks for C_DISCONNECTING (==), missing
C_STANDALONE. The receiver re-enters the connect loop indefinitely,
and del_connection (drbdsetup down / del-peer) hangs in
_drbd_thread_stop waiting for thread completion.

Add get_t_state() == EXITING check at the "start:" label before
attempting change_cstate(C_CONNECTING). Widen the -EAGAIN cstate
check from == C_DISCONNECTING to <= C_DISCONNECTING, which also
catches C_STANDALONE.

Both checks only trigger during connection destruction (EXITING /
C_STANDALONE), never during normal disconnect-reconnect cycles which
use RESTARTING / C_UNCONNECTED.

Triggered by drbdsetup down while the receiver thread is in the
connect retry loop (e.g. connection to an unreachable or slowly
responding peer). The receiver races past the C_STANDALONE transition
and re-enters conn_connect indefinitely.

Signed-off-by: David Magton <david.magton@flant.com>
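
Why widening the comparison catches C_STANDALONE follows from the ordering of the connection states. In this Python sketch the numeric values are illustrative; only the relative order (C_STANDALONE below C_DISCONNECTING below C_UNCONNECTED) is taken from the text above, and the function names are hypothetical.

```python
from enum import IntEnum


class CState(IntEnum):
    C_STANDALONE = 0
    C_DISCONNECTING = 1
    C_UNCONNECTED = 2


def gives_up_before(cstate):
    # Old -EAGAIN check: only an exact match on C_DISCONNECTING stops the retry loop.
    return cstate == CState.C_DISCONNECTING


def gives_up_after(cstate):
    # Widened check: C_STANDALONE also stops the retry loop.
    return cstate <= CState.C_DISCONNECTING
```

C_UNCONNECTED, which is used by normal disconnect-reconnect cycles, is still above the threshold and keeps retrying.
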
drbd_adm_down called conn_try_disconnect(force=0) for all connections.
Without CS_HARD, the state machine requires a cluster-wide two-phase
commit for graceful disconnect. This fails with:
- SS_NEED_CONNECTION when the peer is unreachable (Connecting state)
- SS_NO_QUORUM when disconnecting would break quorum
- SS_CW_FAILED_BY_PEER when the peer rejects the twopc

For SS_NEED_CONNECTION, conn_try_disconnect spins in its repeat loop
indefinitely. For SS_NO_QUORUM, it returns failure and drbd_adm_down
aborts entirely.

Only attempt graceful disconnect for C_CONNECTED peers where twopc
is possible and outdate negotiation matters. For all other connection
states, or if graceful disconnect fails for any reason (including
quorum loss), fall back to force disconnect (CS_HARD) which takes
the local-only code path and succeeds immediately.

Triggered by drbdsetup down when connections are in Connecting or
Unconnected state (e.g. peer is unreachable or was recently
disconnected). The graceful disconnect hangs or fails, and without
the force fallback drbdsetup down never completes.

Signed-off-by: David Magton <david.magton@flant.com>
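
The resulting policy is "graceful only when connected, otherwise or on failure force". A Python sketch of that decision follows; the function name, the string states and the `graceful_disconnect` callback are hypothetical stand-ins for the kernel logic.

```python
def disconnect_for_down(cstate, graceful_disconnect):
    """Return which kind of disconnect was used for one connection."""
    # Graceful disconnect needs a two-phase commit, so it is only tried
    # for peers that are actually connected.
    if cstate == "C_CONNECTED" and graceful_disconnect():
        return "graceful"
    # Any other state, or any graceful failure (including quorum loss),
    # falls back to the local-only forced path (CS_HARD).
    return "forced"
```
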
drbd_adm_net_opts() and drbd_fsync_device() call
drbd_flush_workqueue(&connection->sender_work) unconditionally. This
queues a completion work item and waits for it to be processed. However,
the sender thread only runs when the connection is at least in
C_CONNECTING state; for StandAlone or Unconnected connections, no thread
processes the sender workqueue, so wait_for_completion() blocks forever,
putting the calling process into uninterruptible D state.

This happens when drbdsetup net-options is called on a connection
that is in StandAlone state (e.g. after new-peer but before connect).
At that point the sender thread is not yet started.

Fix by guarding the flush with a cstate check: only flush if the
connection is at least C_CONNECTING, which guarantees the sender
thread is running and will process the work item.

Signed-off-by: David Magton <david.magton@flant.com>
When bd_link_disk_holder() fails in link_backing_dev(), the function
calls fput(file) to release the backing device. However, the caller
(open_backing_devices) also calls close_backing_dev() on failure,
which calls fput() again on the same file. This double-fput corrupts
the file reference count, leading to use-after-free and ultimately:

    kernel BUG at mm/slub.c:448!

In normal operation this path is never hit because bd_link_disk_holder()
succeeds. But when drbdsetup down races with a concurrent drbdsetup
attach on the same backing device, bd_link_disk_holder() fails with
-EINVAL, triggering the double-free.

Fix by removing the fput()/bdev_release() from link_backing_dev() and
letting the caller handle cleanup exclusively through close_backing_dev().
Also update the corresponding coccinelle compat patches that transform
fput() to bdev_release() (kernel 6.6-6.8) and remove bdev_release()
(kernel <6.6), so the fix applies consistently across all supported
kernel versions.
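
The ownership rule behind the fix can be illustrated with a toy reference counter. This is only a sketch: the model_* types and functions are hypothetical and do not mirror the real struct file or backing-device handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_file { int refs; };
struct model_backing_dev { struct model_file *file; };

static void model_fput(struct model_file *f)
{
    assert(f->refs > 0);   /* a second release would trip this */
    f->refs--;
}

/* After the fix: a failed link does not release the file itself. */
static int model_link_backing_dev(struct model_backing_dev *bdev, bool link_ok)
{
    (void)bdev;
    return link_ok ? 0 : -1;
}

/* The single place where the reference is dropped. */
static void model_close_backing_dev(struct model_backing_dev *bdev)
{
    if (bdev->file) {
        model_fput(bdev->file);
        bdev->file = NULL;
    }
}

static int model_open_backing_device(struct model_backing_dev *bdev,
                                     struct model_file *f, bool link_ok)
{
    f->refs++;             /* models taking the reference on open */
    bdev->file = f;
    if (model_link_backing_dev(bdev, link_ok) != 0) {
        model_close_backing_dev(bdev);   /* caller cleans up, once */
        return -1;
    }
    return 0;
}
```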

Signed-off-by: David Magton <david.magton@flant.com>
When drbdsetup down is in progress, concurrent drbdsetup new-peer,
attach, or primary/secondary on the same resource can reach blocking
waits (drbd_flush_workqueue, wait_event) on queues whose processing
threads have already been stopped by the ongoing teardown, causing
permanent D-state hangs.

Additionally, new-peer during teardown can create orphaned sender
threads that are never cleaned up, leaking kernel threads and kref
references, preventing module unload.

Fix by checking DOWN_IN_PROGRESS and R_UNREGISTERED flags at the entry
of drbd_adm_set_role, drbd_adm_attach, and drbd_adm_new_peer. Return
ERR_INVALID_REQUEST immediately if the resource is being torn down.
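
As a rough illustration, the entry check amounts to a flag test before any blocking work is started. The flag values, the struct and model_admin_entry below are hypothetical; only the flag and error names are borrowed from the description above.

```c
#include <assert.h>

/* Hypothetical bit positions; the real flags live in the resource. */
enum {
    MODEL_DOWN_IN_PROGRESS = 1 << 0,
    MODEL_R_UNREGISTERED   = 1 << 1
};

enum model_rc { MODEL_OK = 0, MODEL_ERR_INVALID_REQUEST = -1 };

struct model_resource { unsigned int flags; };

/* Called first in each admin handler, before any wait can happen. */
static enum model_rc model_admin_entry(const struct model_resource *res)
{
    assert(res);

    if (res->flags & (MODEL_DOWN_IN_PROGRESS | MODEL_R_UNREGISTERED))
        return MODEL_ERR_INVALID_REQUEST;

    return MODEL_OK;
}
```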

Reproduced by running drbdsetup down concurrently with drbdsetup
new-peer / attach / primary on the same resource (e.g. rapid resource
teardown and recreation). The concurrent operation races past
adm_mutex and reaches a blocking wait on a dead workqueue.

Signed-off-by: David Magton <david.magton@flant.com>
drbd_queue_bitmap_io() puts bitmap work into pending_bitmap_work and
relies on dec_ap_bio() to move it to the worker queue when
ap_bio_cnt[WRITE] reaches zero. However, when IO is suspended (e.g.
on-no-quorum suspend-io), application writes hold ap_bio_cnt > 0
indefinitely because suspended IO never completes. This creates a
circular dependency:

  1. Suspended IO cannot complete without quorum
  2. Quorum requires at least one UpToDate peer
  3. UpToDate requires resync to complete
  4. Resync requires bitmap exchange
  5. Bitmap exchange is stuck in pending_bitmap_work waiting for
     ap_bio_cnt == 0

Since suspended IO does not modify the on-disk bitmap (it is frozen
before being submitted to peers), it is safe to bypass the
ap_bio_cnt == 0 requirement and move bitmap work directly to the
worker queue.

The fix is applied in two places to cover different race windows:

  - drbd_queue_bitmap_io(): immediately after queuing, if the device
    is already suspended.

  - w_after_state_change(): at the end of every state change, if the
    resource is suspended and any device has pending bitmap work. This
    covers the case where suspension occurs after bitmap work was
    queued but before ap_bio_cnt reached zero.
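
The two call sites can be modelled with a toy device, assuming counters in place of the real queues. All model_* names are hypothetical; the sketch only shows why both sites are needed.

```c
#include <assert.h>
#include <stdbool.h>

struct model_device {
    int ap_bio_cnt;            /* in-flight application writes */
    bool suspended;            /* IO suspended, e.g. on quorum loss */
    int pending_bitmap_work;   /* waiting for ap_bio_cnt == 0 */
    int worker_queue;          /* handed over to the worker */
};

/* Move pending work when writes have drained or cannot drain at all. */
static void model_kick_bitmap_work(struct model_device *d)
{
    assert(d);
    if (d->pending_bitmap_work > 0 &&
        (d->ap_bio_cnt == 0 || d->suspended)) {
        d->worker_queue += d->pending_bitmap_work;
        d->pending_bitmap_work = 0;
    }
}

/* Site 1: right after queuing, covers "already suspended". */
static void model_queue_bitmap_io(struct model_device *d)
{
    d->pending_bitmap_work++;
    model_kick_bitmap_work(d);
}

/* Site 2: after each state change, covers "suspended later". */
static void model_after_state_change(struct model_device *d)
{
    model_kick_bitmap_work(d);
}
```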

Triggered by quorum loss on Primary with active IO (e.g. mounted
filesystem with writers), followed by peer reconnection that requires
resync. Reproducible by killing TCP connections to both peers with
ss -K while 8 parallel dd writers do fsync on the DRBD device.

Signed-off-by: David Magton <david.magton@flant.com>
When two peers reconnect simultaneously after quorum loss, the UUID
handshake for one peer can be invalidated by a concurrent state change
from the other peer's handshake. Specifically, UUID_FLAG_PRIMARY_LOST_QUORUM
is set during quorum loss and changes the uuid_flags between when they
are sent to the peer and when they are verified locally. This triggers
rule=initial-handshake-changed, strategy=RETRY_CONNECT.

The RETRY_CONNECT strategy has .reconnect=true, which unconditionally
calls maybe_force_secondary(). For a suspended Primary, this demotes
it to Secondary with force-io-failures, causing ext4 journal abort and
permanent filesystem read-only — even though the Primary's data is not
outdated and the handshake succeeds on the very next retry.

RETRY_CONNECT means "inputs changed during handshake, try again" — it
does not indicate that the peer has newer data. The only other
.reconnect strategy is SYNC_TARGET_PRIMARY_RECONNECT, where
force-secondary is justified because the Primary genuinely needs to
become a sync target.

Fix by skipping maybe_force_secondary() for RETRY_CONNECT. The
handshake will be retried via CONN_HANDSHAKE_RETRY with current
uuid_flags and succeed without demoting the Primary.
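
Reduced to its core, the change is a per-strategy decision. The enum and both helpers below are hypothetical and list only the strategies named in this message plus one non-reconnecting placeholder.

```c
#include <assert.h>
#include <stdbool.h>

enum model_strategy {
    MODEL_RETRY_CONNECT,
    MODEL_SYNC_TARGET_PRIMARY_RECONNECT,
    MODEL_NO_SYNC              /* placeholder for a non-reconnect strategy */
};

/* Models the .reconnect property of a strategy. */
static bool model_strategy_reconnects(enum model_strategy s)
{
    return s == MODEL_RETRY_CONNECT ||
           s == MODEL_SYNC_TARGET_PRIMARY_RECONNECT;
}

/* After the fix: a plain retry never demotes the Primary. */
static bool model_should_force_secondary(enum model_strategy s)
{
    assert(s <= MODEL_NO_SYNC);

    if (!model_strategy_reconnects(s))
        return false;

    return s != MODEL_RETRY_CONNECT;
}
```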

Triggered by killing TCP connections to both peers (ss -K or tcpkill)
while 8 parallel writers do fsync on a mounted DRBD filesystem.
The parallel reconnection of two peers causes concurrent UUID
handshakes that interfere with each other's uuid_flags.

Signed-off-by: David Magton <david.magton@flant.com>
…nded

drbd_rs_complete_io() and make_resync_request() delay resync
completion (RS_DONE / drbd_resync_finished) while
drbd_any_flush_pending() returns true, waiting for barrier acks
from the Primary. However, when the resource is suspended (e.g.
on-no-quorum suspend-io), the Primary's IO is frozen and pending
flushes will never complete.

This creates a circular dependency:

  1. Resync completion waits for flush ack from Primary
  2. Flush ack requires Primary to process the barrier
  3. Primary IO is suspended, waiting for quorum
  4. Quorum requires peers to be UpToDate
  5. UpToDate requires resync to complete

Since suspended IO does not modify data on peers (it is frozen
before being sent), skipping the flush check is safe — the resync
data is already durable on the target's backing device.

Fix by checking drbd_suspended() before drbd_any_flush_pending()
in both callsites. When suspended, allow resync to complete
immediately so that quorum can be restored and IO can resume.
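
The ordering of the two checks is the whole fix, so a very small model is enough to show it. The struct and model_resync_may_finish are hypothetical names; the two fields stand in for drbd_suspended() and drbd_any_flush_pending().

```c
#include <assert.h>
#include <stdbool.h>

struct model_resync_state {
    bool suspended;       /* stands in for drbd_suspended() */
    bool flush_pending;   /* stands in for drbd_any_flush_pending() */
};

static bool model_resync_may_finish(const struct model_resync_state *s)
{
    assert(s);

    /* Checked first: a frozen Primary will never ack the flush. */
    if (s->suspended)
        return true;

    return !s->flush_pending;
}
```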

Triggered by quorum loss on Primary with active IO, followed by
peer reconnection. The resync completes its data transfer but
the final RS_DONE is blocked by pending flushes from the frozen
Primary, preventing the peer from becoming UpToDate.

Signed-off-by: David Magton <david.magton@flant.com>
…ync blocks

When application writes occur on the SyncSource while a resync is in
progress, the written sectors are replicated to the SyncTarget peer via
P_DATA (drbd_should_do_remote returns true for D_INCONSISTENT +
L_SYNC_TARGET). However, there is a timing window: the bitmap bit is set
when the write is submitted locally, and cleared only when the peer acks
the write. If drbd_resync_finished() calls drbd_bm_total_weight() while
acks are still in flight, it observes n_oos > rs_failed.

This is a critical data safety issue: on a subsequent disconnect/reconnect,
uuid_compare() returns no-sync because the UUIDs were already rotated by
the "completed" resync. The peer is then promoted to UpToDate despite
having stale data — silent data corruption on failover.

Fix:
1. In drbd_resync_finished(), only trigger resync_again++ when
   !drbd_should_do_remote(peer_device). When the peer IS receiving
   application writes, the OOS bits are from in-flight acks — the data
   is already on the peer and the bitmap will clear momentarily. When
   the peer is NOT receiving writes (disconnected/Outdated), the OOS bits
   are genuinely missing data requiring another resync pass.

2. In resync_again(), check drbd_bm_total_weight() before starting a
   new resync pass. If the bitmap is already clean by then (late acks
   cleared the bits), skip the pass. This prevents zero-length resync passes.
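
Both checks can be sketched as two predicates. This is a model under one stated assumption: that the follow-up pass is only considered when n_oos > rs_failed. The struct and function names are hypothetical; peer_gets_app_writes stands in for drbd_should_do_remote().

```c
#include <assert.h>
#include <stdbool.h>

struct model_resync_result {
    unsigned long n_oos;         /* out-of-sync bits at completion */
    unsigned long rs_failed;     /* bits that failed to resync */
    bool peer_gets_app_writes;   /* stands in for drbd_should_do_remote() */
};

/* Check 1: leftover bits only count if the peer is not being written to. */
static bool model_wants_resync_again(const struct model_resync_result *r)
{
    assert(r);

    if (r->n_oos <= r->rs_failed)
        return false;            /* nothing unexplained is left */

    return !r->peer_gets_app_writes;
}

/* Check 2: re-read the bitmap weight right before starting the pass. */
static bool model_start_resync_again(unsigned long bm_weight_now)
{
    return bm_weight_now > 0;
}
```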

Reproduction: 3-node cluster (Primary + 2 replicas), 8 parallel dd
writers with fsync on mounted filesystem. Kill TCP connections (tcpkill
or ss -K) to both peers simultaneously. 5 out of 7 resyncs finish with
n_oos > rs_failed. After reconnect, the Outdated peer is incorrectly
marked UpToDate by uuid_compare()=no-sync — verified data inconsistency.

Signed-off-by: David Magton <david.magton@flant.com>
…re L_ESTABLISHED for Primary stable source

Two related fixes for unstable resync completion:

BUG (WFBitMapT blocking): After an unstable resync completes, the
after-unstable handshake may initiate a follow-up resync from another
secondary peer (WFBitMapT). If that peer is also unstable, the resync
always results in Outdated (was_resync_stable() returns false).
Meanwhile, the WFBitMapT state causes sanitize_state() to clamp
max_disk_state to D_OUTDATED, blocking even a concurrent stable resync
from the Primary from promoting the device to UpToDate.

Fix 1: Skip after-unstable resync initiation (drbd_start_resync_side)
when a stable sync source (connected Primary with L_ESTABLISHED) is
already available. The Primary will resync directly.

Fix 2: Cancel pending WFBitMapT in __cancel_other_resyncs() when a
stable resync completes. Previously only L_PAUSED_SYNC_T was cancelled.

BUG (wrong stable source detection): drbd_stable_sync_source_present()
considers a connected Primary as a stable source regardless of
replication state. During resync or reconnect handshake, the Primary
may not yet have our full bitmap. This causes was_resync_stable() to
return true for an ongoing unstable resync from a different peer,
leading to premature UUID rotation and permanently Outdated Target.

Fix 3: Require repl_state == L_ESTABLISHED before considering a
connected Primary as a stable source. Only a fully established
replication link guarantees that the Primary has our complete bitmap
and all writes are being replicated.
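
A minimal sketch of the tightened check from Fix 3, using hypothetical types: a peer counts as a stable source only if it is connected, Primary, and its replication state is established.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_role { M_PRIMARY, M_SECONDARY };
enum model_repl { M_OFF, M_WF_BITMAP_T, M_SYNC_TARGET, M_ESTABLISHED };

struct model_peer {
    bool connected;
    enum model_role role;
    enum model_repl repl;
};

static bool model_stable_sync_source_present(const struct model_peer *peers,
                                             size_t n)
{
    assert(peers || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* A connected Primary alone is no longer sufficient. */
        if (peers[i].connected &&
            peers[i].role == M_PRIMARY &&
            peers[i].repl == M_ESTABLISHED)
            return true;
    }
    return false;
}
```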

Reproduction: 3-node cluster (A=Primary, B=Secondary, C=Secondary).
Kill A-C TCP connection while writers are active. C resyncs from B
(unstable). A-C reconnects during resync. Without L_ESTABLISHED check,
C considers A a stable source, was_resync_stable() returns true,
UUIDs rotate, but bitmap is not clean → C stays Outdated permanently.
Observed in stress testing with tcpkill + iptables on 3-node cluster
with 8 parallel dd writers.

Signed-off-by: David Magton <david.magton@flant.com>
… stall

After a connection loss during an active resync, Source-side resync read
requests (P_RS_DATA_REQUEST) that were sent but never acknowledged remain
in the device's interval tree with INTERVAL_SENT set and INTERVAL_RECEIVED
clear. On reconnect, the new resync attempts to write to the same sectors
but finds conflicting intervals and stalls indefinitely waiting for them
to complete — they never will, because the connection that was supposed
to deliver the reply is gone.

The existing drbd_cancel_conflicting_resync_requests() only handles
entries that have been RECEIVED or SUBMITTED (it queues them through the
submit_conflict worker which calls drbd_cleanup_received_resync_write).
SENT-but-not-RECEIVED entries are skipped because they are Source-side
requests: they have no allocated receive buffer, no incremented
backing_ee_cnt or unacked counters, and were never submitted to the
backing device. Passing them through the normal cleanup path causes
counter underflow (dec_unacked on a zero counter) and kernel deadlock.

Add drbd_cleanup_stale_resync_intervals() which directly removes
SENT && !RECEIVED resync entries from the interval tree, decrements
rs_pending_cnt, and frees the peer_req. Called from drain_resync_activity
(conn_disconnect path) after drbd_cancel_conflicting_resync_requests.
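
The selection rule can be shown on a flat array instead of an interval tree. This is a sketch only: the flag values, struct model_req and model_cleanup_stale are hypothetical, and freeing the request is reduced to clearing in_tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit positions for the interval flags named above. */
enum {
    MODEL_SENT      = 1 << 0,
    MODEL_RECEIVED  = 1 << 1,
    MODEL_SUBMITTED = 1 << 2
};

struct model_req {
    unsigned int flags;
    bool in_tree;          /* still present in the interval tree */
};

/* Drop SENT && !RECEIVED entries; returns how many were removed. */
static int model_cleanup_stale(struct model_req *reqs, size_t n,
                               int *rs_pending_cnt)
{
    int removed = 0;

    assert(reqs && rs_pending_cnt);
    for (size_t i = 0; i < n; i++) {
        struct model_req *r = &reqs[i];

        if (!r->in_tree)
            continue;
        if ((r->flags & MODEL_SENT) && !(r->flags & MODEL_RECEIVED)) {
            r->in_tree = false;
            assert(*rs_pending_cnt > 0);   /* guard against underflow */
            (*rs_pending_cnt)--;
            removed++;
        }
    }
    return removed;
}
```

Entries that were received or submitted are left alone here, since those go through the existing cancel path.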

Reproduction: 3-node cluster with active filesystem IO (8 dd writers
with fsync). Kill TCP connections (tcpkill) to trigger resync. During
active resync, kill connections again. After second reconnect, resync
hangs permanently with rs_pending_cnt > 0. Stale entries are visible
in debugfs interval tree (SENT flag set, RECEIVED flag clear). Manual
drbdsetup disconnect resolves it because conn_disconnect now calls
the cleanup function.

Signed-off-by: David Magton <david.magton@flant.com>
…UUIDs

When a resync from an unstable Source (secondary with no connected
Primary) completes, the Target must NOT rotate UUIDs or promote its
disk to UpToDate. The Source may have been behind the Primary, so the
data just synced could be incomplete. A UUID rotation at this point
would make the incomplete data "authoritative", and a subsequent
reconnect to the Primary would see matching UUIDs and skip the needed
resync — silent data loss.

Previously, the Target would unconditionally rotate UUIDs and become
Established+UpToDate after any completed resync, regardless of Source
stability. The "Peer was unstable during resync" message was logged
but had no corrective effect.

Fix: when UNSTABLE_RESYNC is set after resync completion on the Target:
1. Skip UUID rotation and disk promotion entirely.
2. Transition to Established (already done by the state change).
3. Send UUID_FLAG_RESYNC to the Source to trigger a full re-handshake.
4. The Source compares UUIDs and sends a new bitmap. If there are
   dirty sectors (from Primary writes during the unstable resync),
   another resync pass runs. If the bitmap is clean AND a stable
   source (Primary with L_ESTABLISHED) is now available, the resync
   completes stably and UUIDs rotate normally.

Target-side receive_bitmap() is extended to accept bitmaps in
L_ESTABLISHED state (from the re-handshake). If bm_total > 0, it
sends its own bitmap back and starts L_SYNC_TARGET. If bm_total == 0,
the Target silently stays in L_ESTABLISHED (a no-op; it is already
effectively UpToDate).
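
The Target-side behaviour can be sketched as a tiny state model. Everything here is hypothetical (types, field names, functions); it only mirrors the steps listed above and leaves out the Source side entirely.

```c
#include <assert.h>
#include <stdbool.h>

enum model_repl { M_ESTABLISHED, M_SYNC_TARGET };

struct model_target {
    enum model_repl repl;
    bool uuids_rotated;
    bool disk_uptodate;
    bool sent_uuid_flag_resync;
};

/* Resync finished on the Target; unstable models UNSTABLE_RESYNC. */
static void model_resync_finished(struct model_target *t, bool unstable)
{
    assert(t);
    t->repl = M_ESTABLISHED;

    if (unstable) {
        /* No rotation, no promotion: ask the Source to re-handshake. */
        t->sent_uuid_flag_resync = true;
        return;
    }
    t->uuids_rotated = true;
    t->disk_uptodate = true;
}

/* A bitmap arriving while established comes from the re-handshake. */
static void model_receive_bitmap(struct model_target *t,
                                 unsigned long bm_total)
{
    assert(t);
    if (t->repl == M_ESTABLISHED && bm_total > 0)
        t->repl = M_SYNC_TARGET;   /* another pass is needed */
    /* bm_total == 0: stay established, nothing to do */
}
```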

Reproduction: 3-node cluster (A=Primary, B, C). Kill A-C connection.
C resyncs from B (unstable). Resync completes. Without this fix, C
rotates UUIDs and becomes UpToDate with potentially stale data. When
A reconnects, uuid_compare sees matching UUIDs and declares no-sync.
With this fix, C requests re-handshake from B, which cascades to A,
ensuring C gets all data that A wrote during the unstable resync period.

Signed-off-by: David Magton <david.magton@flant.com>
Summary of changes since flant.2:

Stability fixes for drbdsetup operations:
- Fix drbdsetup down hang (receiver stuck in connect loop)
- Fix del-connection hang (receiver restart race)
- Fix del-peer deadlock (blocked-on-al PeerWrite)
- Fix receiver thread not exiting when connection is established
- Fall back to force disconnect in drbd_adm_down
- Skip sender workqueue flush when sender is not running
- Reject admin operations on resource being removed

Kernel crash / IO deadlock fixes:
- Fix double fput in link_backing_dev error path (kernel BUG)
- Fix bitmap IO deadlock when IO suspended due to quorum loss
- Do not force-secondary on transient handshake retry
- Fix transport listener race causing rejected connections

Resync reliability fixes:
- Fix resync-again triggering on in-flight acks (do_remote guard)
- Fix was_resync_stable() wrong stable source detection
  (require L_ESTABLISHED for connected Primary)
- Fix resync stall after reconnect (stale SENT && !RECEIVED
  intervals lingering in interval tree)
- Fix unstable resync completion (re-handshake via UUID_FLAG_RESYNC
  instead of premature UUID rotation)
When drbd_submit_rs_discard() error path frees entries from
resync_requests, the received_last pointer is reset to NULL.
A subsequent call to drbd_process_rs_discards() then rescans from
the list head and finds entries belonging to an already-submitted
discard group. Between this scan (under peer_reqs_lock) and the
actual drbd_submit_rs_discard() call (without the lock), the
ack_sender thread may complete and free those entries via
e_end_resync_block -> drbd_check_peers_in_sync_progress.

This leads to use-after-free: drbd_remove_peer_req_interval() hits
the ASSERTION !drbd_interval_empty(i) on the freed entry, and
drbd_unmerge_discard() dereferences LIST_POISON1 from the poisoned
recv_order linked list, causing a general protection fault.

Fix this in two ways:

1) In drbd_advance_to_next_rs_discard(), skip TRIM entries that
   belong to already-submitted discard groups. The main entry is
   identified by INTERVAL_SUBMITTED, merged entries by their
   cleared (empty) intervals. This prevents double-processing
   regardless of how received_last was set.

2) In drbd_list_del_resync_request(), when the freed entry is
   received_last or discard_last, set the pointer to the
   predecessor entry on resync_requests instead of NULL. This
   avoids unnecessary full rescans from the list head.
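
Part 2 can be illustrated with a small circular doubly linked list. The types and functions are hypothetical; one assumption is that the pointer falls back to NULL when the freed entry was the first on the list and so has no predecessor entry.

```c
#include <assert.h>
#include <stddef.h>

struct model_node { struct model_node *prev, *next; };

struct model_list {
    struct model_node head;            /* sentinel, not a real entry */
    struct model_node *received_last;  /* cursor into the list */
};

static void model_list_init(struct model_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->received_last = NULL;
}

static void model_list_add_tail(struct model_list *l, struct model_node *n)
{
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
}

static void model_list_del_request(struct model_list *l, struct model_node *n)
{
    assert(n->prev && n->next);

    /* Step the cursor back instead of resetting it to NULL. */
    if (l->received_last == n)
        l->received_last = (n->prev == &l->head) ? NULL : n->prev;

    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
}
```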

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Point submodule to a6d1604 which shifts Flant-specific netlink
attribute IDs to avoid collisions with upstream:
- disk_conf/non_voting: field 28 -> 36
- res_opts/quorum_dynamic_voters: field 17 -> 25

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
When the Primary reconnects after quorum loss, w_send_dagtag may
silently skip sending P_DAGTAG if writes have already been sent past
the queued dagtag value (the dagtag_newer_eq optimization). This can
leave the SyncSource's last_dagtag_sector stale, causing resync
requests with a depend_dagtag to wait indefinitely on dagtag_wait_ee.

Fix this with a three-part approach:

1. Advance last_dagtag_sector from received writes in receive_Data,
   and release dagtag waiters from the write completion handler
   (e_end_block) where data is safely committed to disk. Use an
   atomic dagtag_waiters counter to avoid lock overhead in the
   common case.

2. Fix w_send_dagtag to always send P_DAGTAG with the actual current
   position (send.current_dagtag_sector) instead of silently returning
   when writes raced ahead.

3. Add diagnostic warning when a dagtag wait exceeds 10 seconds,
   logging the stall duration and relevant dagtag values.
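
The receiver side of part 1 can be modelled as follows. This is a deliberately simplified sketch: the names are hypothetical, locking and the waiter counter are left out, and dagtags are compared as plain integers rather than with a wrap-aware helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_peer_device {
    uint64_t last_dagtag_sector;   /* newest position known to be applied */
};

/* Called when a received write has completed; only moves forward. */
static void model_note_write_done(struct model_peer_device *pd,
                                  uint64_t dagtag)
{
    assert(pd);
    if (dagtag > pd->last_dagtag_sector)
        pd->last_dagtag_sector = dagtag;
}

/* A resync request may run once its dependency has been reached. */
static bool model_resync_request_ready(const struct model_peer_device *pd,
                                       uint64_t depend_dagtag)
{
    assert(pd);
    return pd->last_dagtag_sector >= depend_dagtag;
}
```

With this in place a waiter no longer depends solely on a P_DAGTAG packet arriving; completed writes advance the position as well.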

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@astef self-assigned this on Jun 23, 2026
astef added a commit that referenced this pull request Jun 23, 2026
flant-9.2.18 is flant-9.2 rebased onto upstream drbd-9.2.18, plus the
ported flant-9.2.17 fix line (tip 9.2.18-flant.9). This merge records the
pre-rebase branch as an ancestor so PR #9 can fast-forward; the tree is
taken entirely from flant-9.2.18 (strategy=ours).

All flant-9.2 code fixes are already present as rebased equivalents
(verified by content: w_resync_timer race, non-voting disk, reject-admin-
on-teardown guards, unstable-resync re-handshake, etc.). Only the obsolete
9.2.16-flant.* version strings are superseded.
flant-9.2.18 is flant-9.2 rebased onto upstream drbd-9.2.18, plus the
flant-9.2.17 fixes ported as 9.2.18-flant.4 (use-after-free in resync
discard + netlink ID shift to a6d1604c) and 9.2.18-flant.5 (dagtag resync
stall). The later diskless-Primary fixes (flant.6-9: uuid-rotation skip,
UUID-ancestor, false split-brain, reconciliation timer) were dropped
pending re-test against upstream's 9.2.17/9.2.18 diskless-UUID rework.

This merge records the pre-rebase branch as an ancestor so PR #9 can
fast-forward; the tree is taken entirely from flant-9.2.18 (strategy=ours).
@astef merged commit 9122f86 into flant-9.2 on Jun 24, 2026