Rebase onto upstream drbd-9.2.18 and port the flant-9.2.17 fix line (9.2.18-flant.9) #9
Merged
Previously we considered whether the peer is SyncTarget, but not whether we are SyncTarget. This could lead to mismatched decisions where one node detects "Missed end of resync as sync-source", but the corresponding peer does not detect "Peer missed end of resync". This results in mismatched replication states.
Below is an example of some logs where this occurred. Both nodes become PausedSyncS towards each other, and later get stuck in SyncSource towards each other.
drbd res/0 drbd0 lin3: drbd_sync_handshake:
drbd res/0 drbd0 lin3: self 7A2D98F4A853F406:0000000000000000:0000000000000000:0000000000000000 bits:76800 flags:20
drbd res/0 drbd0 lin3: peer 7A2D98F4A853F406:A0B2DA7EBB1562DA:0000000000000000:0000000000000000 bits:2816 flags:1824
drbd res/0 drbd0 lin3: uuid_compare()=no-sync by rule=both-off
drbd res/0 drbd0 lin3: strategy = source-copy-other-bitmap due to disk states. (UpToDate/Inconsistent)
drbd res/0 drbd0 lin3: Copying bitmap of peer node_id=3 (bitmap_index=2)
drbd res lin3: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [connected]
drbd res/0 drbd0 lin3: pdsk( DUnknown -> Inconsistent ) repl( Off -> WFBitMapS ) resync-susp( no -> peer ) [connected]
drbd res/0 drbd0 lin3: repl( WFBitMapS -> PausedSyncS ) [receive-bitmap]
drbd res/0 drbd0 lin1: Missed end of resync as sync-source
drbd res/0 drbd0 lin1: drbd_sync_handshake:
drbd res/0 drbd0 lin1: self 7A2D98F4A853F406:A0B2DA7EBB1562DA:0000000000000000:0000000000000000 bits:2816 flags:24
drbd res/0 drbd0 lin1: peer 7A2D98F4A853F406:0000000000000000:0000000000000000:0000000000000000 bits:76800 flags:1020
drbd res/0 drbd0 lin1: uuid_compare()=source-use-bitmap by rule=sync-source-missed-finish
drbd res lin1: Committing remote state change 767181164 (primary_nodes=0)
drbd res lin1: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [remote]
drbd res/0 drbd0 lin1: pdsk( DUnknown -> Consistent ) repl( Off -> WFBitMapS ) [remote]
drbd res/0 drbd0 lin1: repl( WFBitMapS -> PausedSyncS ) [receive-bitmap]
The "COMPAT_84=y" was already on the build command line, but not on the make install command line. So the make install sees that invocations won't match, builds without the compat define, installs those, and rpm then collects the binaries from the second build. Fixes: 510bec0 ("build: make COMPAT_84 flag controllable in packaging")
This parameter recently got added to queue_limits, and it gets
initialized to UINT_MAX by default. Set it to 0 to indicate that we
don't support unmap write zeroes.
Equivalent of upstream commit 027a7a9c07d0 ("drbd: init
queue_limits->max_hw_wzeroes_unmap_sectors parameter"), including the
appropriate compat patch.
Use the %pISpc format specifier to print IP addresses with port numbers in dtl_debugfs_show(). This also adds IPv6 support, as the previous code only handled IPv4.
Commit d042fab ("drbd: Fix a memory leak and remove the open-coded page pool") removed the only reader of this list outside of debugging code. Hence the list is no longer necessary and can be removed. This is similar to the removal of net_ee in the aforementioned commit.
It took me some time to rediscover why we need this, so add an explanatory comment.
This makes the meaning clearer in most cases and prepares the way for introducing a separate flag with the specific meaning that the request has been sent.
This is preparation so that we can use both flags on one peer request.
So that we can use find_resync_request() in the normal case. This prevents problems that occurred when find_resync_requests() [note the plural] found requests that it should not have done. Also use a different approach for finding multiple matching requests which avoids matching too much. Since this is only used in a specific case, we can implement it in a way that should be reliable.
An example of the problems mentioned above follows. We are node "S" and peer "T" is running DRBD 9.1.23, that is protocol version 121. S is SyncSource and T is SyncTarget:
* S handles a write submission while the bitmap exchange occurs; sends P_OUT_OF_SYNC
* T sends resync request r0
* T receives P_OUT_OF_SYNC; the resync position jumps backwards
* T sends resync request r1 for the same interval as r0
* S receives and submits r0, handles completion and sends reply
* T sends resync ack for r0
* S receives r1
* S receives resync ack for r0; find_resync_requests() matches both r0 and r1; S manipulates ((drbd_peer_request) r1).w.list, which is concurrently being used for something else
This results in a crash such as this one:
list_del corruption. prev->next should be ff3cb1eb2f47afa0, but was ff3cb1e799af3000
kernel BUG at lib/list_debug.c:51!
RIP: 0010:__list_del_entry_valid.cold+0x31/0x47
Call Trace:
free_waiting_resync_requests+0x23a/0x4c0 [drbd]
drain_resync_activity+0x2ab/0x4b0 [drbd]
conn_disconnect+0xf5/0x770 [drbd]
Or similarly:
list_del corruption, ffff9c95829624d0->next is LIST_POISON1 (dead000000000100)
kernel BUG at lib/list_debug.c:45!
RIP: 0010:__list_del_entry_valid.cold+0xf/0x47
Call Trace:
drbd_free_peer_req+0xde/0x270 [drbd]
drbd_free_peer_reqs+0x80/0xc0 [drbd]
conn_disconnect+0x375/0x770 [drbd]
That is, peers that send one ack corresponding to multiple resync requests. In particular, protocol version 121 with feature RESYNC_DAGTAG. The code for handling this case is untested and, even if we test it now, is liable to rot. Better to avoid the situation rather than keep code that is likely to be buggy.
A resync might skip over blocks that are already in-sync. Prior to this change, P_PEERS_IN_SYNC packets were sent for the blocks the resync skipped over. When this was a large range, many P_PEERS_IN_SYNC packets were sent, causing similar problems to those fixed by commit e2d0439 ("drbd: only send P_PEERS_IN_SYNC for up to 4 MiB when resync finished"). The solution is to skip steps (extents) where there is no resync activity. Track the last sync position (last_in_sync_end) and send P_PEERS_IN_SYNC whenever we jump to a different step or reach the end of a step. Fixes: bc218ad ("drbd: only send P_PEERS_IN_SYNC every 4MiB") Co-developed-by: zhengbing.huang <zhengbing.huang@easystack.cn> Signed-off-by: zhengbing.huang <zhengbing.huang@easystack.cn> Signed-off-by: Joel Colledge <joel.colledge@linbit.com>
alloc_send_buffer() does an implicit flush_send_buffer(), but ignores its return value. This masks the network failure from _drbd_send_bio() until it reaches the last page in the bio. The minimal fix seems to be to change_state(,C_NETWORK_FAILURE,) as soon as we detect the send failure in flush_send_buffer().
alloc_send_buffer() may implicitly flush accumulated data if there is not enough room to accommodate new data. It must not ignore potential network failures. Now alloc_send_buffer() may also return an ERR_PTR. Callers need to check with IS_ERR and handle errors as appropriate.
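The idea of surfacing a failure from an implicit flush can be illustrated with a small toy model. This is only a sketch in Python, not DRBD code: SendBuffer, send_pages and broken_link are hypothetical names, and a raised OSError stands in for the ERR_PTR that callers test with IS_ERR.

```python
class SendBuffer:
    """Toy stand-in for a per-connection send buffer (hypothetical, not DRBD code)."""

    def __init__(self, capacity, transmit):
        self.capacity = capacity
        self.transmit = transmit      # callable(bytes); raises OSError on failure
        self.pending = bytearray()

    def flush(self):
        """Send everything accumulated so far; an OSError propagates to the caller."""
        if self.pending:
            self.transmit(bytes(self.pending))
            self.pending.clear()

    def alloc(self, data):
        """Queue data, flushing first if it would not fit.

        The implicit flush may fail. The error is raised instead of being
        dropped, which is the analogue of returning an ERR_PTR.
        """
        if len(self.pending) + len(data) > self.capacity:
            self.flush()
        self.pending += data


def send_pages(buf, pages):
    """Return how many pages were queued before the first send failure."""
    queued = 0
    for page in pages:
        try:
            buf.alloc(page)
        except OSError:
            break                     # stop at once instead of continuing to the last page
        queued += 1
    return queued


def broken_link(_payload):
    """Transmit callback that always fails, simulating a dead connection."""
    raise OSError("network failure")
```

With a capacity of 8 bytes and 4-byte pages, the third alloc() triggers the flush, so the failure becomes visible there rather than at the end of the whole transfer.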
Same thing as 6c7ff08 (ci: use full container repository names), but for the drbd-9.2 branch. Skip this commit when forward-merging to master.
In _dtl_recv_page(), the receive buffer pointer data was used instead of the advancing pointer pos when calling dtl_recv_short(). When receiving a page in multiple chunks from different load-balanced TCP paths, each chunk overwrites the beginning of the page instead of being appended at the correct offset, causing data corruption. Replace data with pos so that each received chunk is placed at the correct position within the page. Fixes: 3ba5b50 ('lb-tcp: Fix dtl_recv_pages()')
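The effect of writing at the buffer start instead of at the advancing position can be reproduced with a few lines of Python. This is an illustrative model only; recv_page and its flag are hypothetical and do not mirror the real function signatures.

```python
def recv_page(chunks, page_size, use_advancing_offset):
    """Assemble one page from chunks that arrive separately.

    use_advancing_offset=False reproduces the bug pattern: every chunk is
    written at the start of the page. True writes each chunk at `pos`.
    """
    page = bytearray(page_size)
    pos = 0
    for chunk in chunks:
        offset = pos if use_advancing_offset else 0
        page[offset:offset + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(page)


chunks = [b"AAAA", b"BB", b"CC"]
fixed = recv_page(chunks, 8, use_advancing_offset=True)
buggy = recv_page(chunks, 8, use_advancing_offset=False)
```

In the buggy variant each later chunk overwrites the head of the page and the tail is never filled, which matches the corruption described above.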
DRBD requires stable pages because it may read the same bio data multiple times for local disk I/O and network transmission, and in some cases for calculating checksums. The BLK_FEAT_STABLE_WRITES flag is set when the device is first created, but blk_set_stacking_limits() clears it whenever a backing device is attached. In some cases the flag may be inherited from the backing device, but we want it to be enabled at all times. Unconditionally re-enable BLK_FEAT_STABLE_WRITES in drbd_reconsider_queue_parameters() after the queue parameter negotiations. Also, document why we want this flag enabled in the first place.
Commit 464c5c7 introduced freeing the bitmap of an existing device. Only a device that has a backing disk can have a bitmap. Phrased differently, a diskless node can never have a bitmap. A consequence of this is that the get_ldev()/put_ldev() protection is sufficient to protect accesses to device->bitmap. That extends the previously existing convention that every access to device->ldev needs to be protected by get_ldev()/put_ldev(). The get_ldev()/put_ldev() bracket delays freeing of a backing device, and from now on also of a bitmap, until every other context has stopped using the bitmap or ldev objects.
Here is the stack trace from a LINSTOR test that triggered the above insight:
BUG: kernel NULL pointer dereference, address: 000000000000016c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
[...]
Call Trace:
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
? drbd_set_sync+0x2e/0x410 [drbd]
? dtt_recv+0x157/0x270 [drbd_transport_tcp]
? srso_alias_return_thunk+0x5/0xfbef5
drbd_set_all_out_of_sync+0x1c/0x30 [drbd]
receive_rs_deallocated+0xd1/0x270 [drbd]
? __pfx_receive_rs_deallocated+0x10/0x10 [drbd]
drbd_receiver+0x5a0/0xae0
Move drbd_bm_free() to drbd_ldev_destroy(), which runs as deferred work only after the disk state is D_DISKLESS and local_cnt has reached zero. This guarantees that no get_ldev() holders remain when the bitmap is freed.
Fixes: 464c5c7 ("drbd: drbd_alloc_bitmap() only after drbd_read_md")
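The lifetime rule can be pictured with a single-threaded toy model. It is a sketch under simplifying assumptions: there is no locking or deferred work, and the Device class and set_out_of_sync helper are hypothetical stand-ins, not the kernel structures.

```python
class Device:
    """Toy model of the local_cnt reference protecting ldev and bitmap."""

    def __init__(self):
        self.diskless = False
        self.local_cnt = 0
        self.ldev = object()
        self.bitmap = bytearray(16)

    def get_ldev(self):
        """Take a reference unless the device is already diskless."""
        if self.diskless:
            return False
        self.local_cnt += 1
        return True

    def put_ldev(self):
        """Drop a reference; the last holder after a detach runs the destroy step."""
        self.local_cnt -= 1
        if self.diskless and self.local_cnt == 0:
            self._ldev_destroy()

    def detach(self):
        """Mark the device diskless; free right away only if nobody holds a reference."""
        self.diskless = True
        if self.local_cnt == 0:
            self._ldev_destroy()

    def _ldev_destroy(self):
        # Backing device and bitmap are released together, never separately.
        self.ldev = None
        self.bitmap = None


def set_out_of_sync(device, bit):
    """Touch the bitmap only inside a get_ldev()/put_ldev() bracket."""
    if not device.get_ldev():
        return False
    try:
        device.bitmap[bit] = 1
        return True
    finally:
        device.put_ldev()
```

A holder that took its reference before detach() still sees a valid bitmap; the bitmap disappears only when that holder calls put_ldev().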
If a netlink message provides a peer_node_id inside DRBD_NLA_CFG_CONTEXT but omits the resource_name, adm_ctx->resource stays NULL. The code then calls drbd_get_connection_by_node_id(NULL, ...) which dereferences the NULL pointer when iterating resource->connections, causing a kernel crash. Reject such requests early with ERR_INVALID_REQUEST instead. Reported-by: Xinqian Sun <xinqian.sun@u.northwestern.edu>
When a resync completes on the target side, but the source never receives the completion notification (e.g., due to a disconnect), the source retains a stale bitmap UUID while both sides have 0 bitmap bits. On reconnect, if the missed-end of resync detection forces a now Primary node to become a resync target, we get a state transition failure, since the state machine refuses to give up the last up-to-date copy of the data. Fix that by, when detecting a "missed end of resync" during reconnect, checking whether there are actually any out-of-sync bits before setting the RS_SOURCE_MISSED_END / RS_PEER_MISSED_END flags. Also, clear the stale bitmap UUID and push it to history.
Add a static analysis tool that verifies all accesses to device->ldev
and device->bitmap are protected by get_ldev()/put_ldev() brackets.
The checker uses tree-sitter to parse C without preprocessing and
performs bottom-up call graph analysis: functions that directly access
->ldev or ->bitmap propagate a "needs_ldev" requirement upward through
callers. A function is reported when it reaches a call site that is
neither inside a get_ldev/put_ldev bracket nor deferred to its own
caller.
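The bottom-up propagation can be sketched on a hand-written call graph. This is a simplified illustration, not the checker itself: it skips tree-sitter parsing entirely, the function names in the example graph are made up, and "reported" is reduced to "needs ldev and has no caller to defer to".

```python
def find_unprotected(direct_access, calls):
    """Return entry points that can reach an ldev/bitmap access unprotected.

    direct_access: set of function names touching ->ldev or ->bitmap directly
    calls: dict mapping caller -> list of (callee, inside_bracket) tuples
    """
    needs_ldev = set(direct_access)
    changed = True
    while changed:                        # iterate until a fixed point is reached
        changed = False
        for caller, sites in calls.items():
            if caller in needs_ldev:
                continue
            if any(callee in needs_ldev and not inside
                   for callee, inside in sites):
                needs_ldev.add(caller)
                changed = True
    called = {callee for sites in calls.values() for callee, _ in sites}
    # A function without callers cannot defer the requirement any further.
    return sorted(f for f in needs_ldev if f not in called)


# Hypothetical example graph: two entry points reach the same leaf access.
example_calls = {
    "show_stats": [("bm_weight", False)],
    "debugfs_read": [("show_stats", False)],   # no bracket around the call
    "proc_read": [("show_stats", True)],       # call sits inside get_ldev/put_ldev
}
report = find_unprotected({"bm_weight"}, example_calls)
```

Here only debugfs_read is reported, because proc_read satisfies the requirement at its own call site.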
Key features:
- Recognizes negated get_ldev (if (!get_ldev()) bail-out) and positive
get_ldev (if (get_ldev()) { ... }) protection patterns
- Handles function pointer arguments (e.g. drbd_bitmap_io(dev, &fn, ...))
as indirect call sites inheriting the caller's protection context
- Resolves variable types from function parameters and local declarations
to only flag accesses on struct drbd_device, avoiding false positives
from unrelated structs with bitmap or ldev fields
- Supports /* ldev_safe: reason */ comment annotations for call sites
where domain knowledge guarantees ldev cannot go away
- Tolerates tree-sitter misparses caused by unexpanded macros by
detecting and skipping fake outer function_definitions
- Reports full call chain trees from unprotected entry point down to
the leaf access
Usage: python3 checks/check_ldev_access.py drbd/*.c
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
They were part of a framework to detect accesses to device->ldev that relied on a patched version of sparse. Remove all of that, as we have a better checker now.
peer_device_proc_drbd_show() accessed device->bitmap (via drbd_bm_total_weight() and drbd_syncer_progress()) and device->act_log without holding a get_ldev() reference. If a detach races with a debugfs read, the bitmap or ldev could be freed while still in use. Acquire get_ldev_if_state() once at the top and hold it across all bitmap and activity log accesses, replacing the narrower get_ldev block that only covered lc_seq_printf_stats().
seq_print_device_proc_drbd() accessed device->bitmap (via drbd_bm_total_weight() and drbd_syncer_progress()) without holding a get_ldev() reference. If a detach races with a /proc/drbd read, the bitmap could be freed while still in use. Acquire get_ldev_if_state() at the top and guard all bitmap accesses with have_ldev.
receive_bitmap() accessed the bitmap (via drbd_bm_slot_lock(), drbd_bm_bits(), drbd_bm_words(), and the receive/decode/send bitmap calls) without holding a get_ldev() reference. If a detach races with bitmap reception, the bitmap could be freed while still in use. Acquire get_ldev() after the wait_event and before drbd_bm_slot_lock(), release it after drbd_bm_slot_unlock() on both the normal and error paths.
receive_peer_dagtag() called drbd_bm_clear_many_bits() without holding a get_ldev() reference. If a detach races with the reconciliation logic, the bitmap could be freed while still in use. Wrap the call with get_ldev()/put_ldev() per device.
…ev() make_ov_request() accesses device->ldev indirectly through drbd_rs_c_min_rate_throttle() without holding an ldev reference. If a concurrent detach races with online-verify, this can lead to a use-after-free. Add get_ldev()/put_ldev() brackets around the body. The old code also had an unbalanced put_ldev() on the allocation failure path; fix that as well.
Extend the get_ldev()/put_ldev() bracket in drbd_bm_resize() to also cover the drbd_md_dax_active(device->ldev) and drbd_dax_bitmap() calls, which previously accessed device->ldev outside any protection. The old code had a narrow get_ldev/put_ldev bracket that only validated on-disk bitmap space, then accessed device->ldev unprotected for the DAX check. While callers passing capacity != 0 hold ldev today, the function itself would crash if ever called without ldev on a non-zero capacity. Restructure the code so that the DAX path runs inside the extended get_ldev bracket, and the page-based allocation (which does not use ldev) runs unconditionally when bm_on_pmem was not set. Also annotate the device->bitmap access for the static checker.
Mark code paths where ldev is implicitly held with /* ldev_safe: reason */ comments so the get_ldev()/put_ldev() static checker can distinguish them from real bugs. These paths include bio endio callbacks (where ldev is held since I/O submission), request processing (where queued requests hold their own ldev references), state machine callbacks (where extra_ldev_ref_for_after_state_chg() holds an extra local_cnt reference), and worker/sender thread operations on requests with existing ldev refs.
… annotation Add a second analysis pass to check_ldev_access.py that verifies every exit path from a function balances get_ldev() with put_ldev(). This catches reference leaks where an error path returns without releasing the ldev reference, which prevents detach. Introduce a dedicated /* ldev_ref_transfer: reason */ annotation for functions that intentionally pass their ldev reference to an async operation (e.g. submitted peer_request whose endio calls put_ldev). Annotate the four existing reference transfer sites. When a new annotation is placed at the beginning of a function, it indicates that the function receives the ldev reference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
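The balance rule of the second pass can be expressed as a small sketch over pre-extracted exit paths. The event names and the unbalanced_paths helper are assumptions made for illustration; how the real checker derives paths from the source is not shown here.

```python
def unbalanced_paths(paths):
    """Return the labels of exit paths that leak or over-release a reference.

    paths: dict mapping a path label to the ordered list of events on it.
    Events: "get" (successful get_ldev), "put" (put_ldev), "transfer"
    (annotated hand-off of the reference to an async operation) and
    "receive" (function annotated as receiving a reference at its start).
    """
    bad = []
    for label, events in paths.items():
        held = 0
        ok = True
        for event in events:
            if event in ("get", "receive"):
                held += 1
            elif event in ("put", "transfer"):
                held -= 1
                if held < 0:
                    ok = False        # released more than was taken
        if held != 0:
            ok = False                # path exits with a nonzero reference balance
        if not ok:
            bad.append(label)
    return bad


example_paths = {
    "success": ["get", "put"],
    "error_return": ["get"],              # leak: early return without put_ldev
    "submit": ["get", "transfer"],        # endio will call put_ldev later
    "endio": ["receive", "put"],
    "double_put": ["get", "put", "put"],
}
leaks = unbalanced_paths(example_paths)
```

The error_return path is the kind of leak the description mentions: a reference that is never released prevents detach.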
Mark the module version to distinguish builds that include our quorum and non-voting disk patches from upstream 9.2.18. flant.1 features: - quorum-minimum-redundancy enforcement (diskless nodes, tiebreaker, runtime recalculation) - CS_FORCE_RECALC bypass of quorum validation - recalculate quorum after forget-peer - CS_HARD for C_UNCONNECTED transition in conn_disconnect - configurable dynamic voter reduction - non-voting disk for quorum exclusion
Fixed a race in w_resync_timer that could permanently stall resync. Administrative operations on any resource (disk attach, resync-after reconfiguration) trigger a global resync-dependency recalculation (drbd_pause_after / drbd_resume_next) that may briefly pause and unpause an unrelated resource's sync target. If the resync timer fires during this transient pause, the sender thread could read the intermediate L_PAUSED_SYNC_T state without holding the state lock, enter the wrong code path, and never reschedule the resync work. The affected resource would show SyncTarget with 0% progress and all worker threads idle; the only recovery was to manually disconnect and reconnect. Read repl_state[NOW] under read_lock_irq(state_rwlock) before the switch statement to guarantee a consistent snapshot of the state. Also bump version to 9.2.18-flant.2.
When a receiver thread restarts after a failed connection attempt, there is a window between drbd_put_listener() in finish_connect and drbd_get_listener() in the next prepare_connect where the path is not registered as a waiter on the shared TCP listener. During this window, incoming connections for that path are rejected with "Closing unexpected connection" because drbd_find_path_by_addr() cannot find the path in the listener's waiters list.
This race becomes critical when combined with strict quorum settings (quorum-minimum-redundancy >= 2). A diskless node that needs multiple quorate peers may connect to only one peer before the rejected connection triggers a cascade: the node detects a cluster split, disconnects its only peer, loses quorum, and enters suspend-io. It then has to go through a full resource teardown and reconnect cycle.
The race is amplified when one peer has an asymmetric configuration (missing connections to some nodes). The receiver thread for the unreachable peer sits permanently in dtt_wait_for_connect() on the shared listener, making it the one that always calls accept() and encounters the missing path.
TCP transport (drbd_transport_tcp.c):
- dtt_finish_connect(): only unregister listeners and clean up accepted sockets on successful connect. On failure, keep the path registered so incoming connections are routed correctly.
- dtt_prepare_connect(): always clean up accepted sockets at the start of each connect attempt, regardless of whether the listener is already registered. This is safe because dtt_prepare_connect() is called at the start of each drbd_transport_connect() invocation, before dtt_connect() runs, so there is no concurrent socket routing via dtt_wait_for_connect(). Without this cleanup, stale accepted sockets from interrupted connect attempts would accumulate and cause permanent "short read (expected size 8)" / BrokenPipe failures on subsequent connect cycles.
- dtt_remove_path(): add dtt_cleanup_accepted_sockets() after drbd_put_listener() to prevent socket leaks on path destruction.
lb-tcp transport (drbd_transport_lb-tcp.c):
- dtl_set_active(false): only close sockets, do not unregister listeners. The DTL_CONNECTING flag check in dtl_accept_work_fn() already prevents stale socket accumulation by rejecting accepts when the transport is not actively connecting.
Both transports guarantee eventual cleanup through two paths: successful connect (finish_connect unregisters) or path destruction (remove_path unregisters). The RDMA transport is not affected as it uses an event-driven model with atomic state transitions.
Reproducible with even a single drbdsetup disconnect with some probability, and reliably with rapid repeated disconnects.
Signed-off-by: David Magton <david.magton@flant.com>
drbd_should_abort_listening() only checked get_t_state() for EXITING inside an "if (signal_pending(current))" block. When _drbd_thread_stop() sends SIGHUP to a receiver thread that is inside dtt_try_connect() -> sock->ops->connect(), the signal is consumed by the interruptible connect() syscall. By the time the receiver returns to dtt_connect() and calls drbd_should_abort_listening(), signal_pending() is false, so get_t_state() is never checked. The receiver loops back into connect(), never sees t_state == EXITING, and _drbd_thread_stop() waits on the completion forever in D-state. This is easily triggered when a connection is configured to a non-existent peer: the receiver thread loops in dtt_connect() doing TCP connect attempts that always fail. drbdsetup down -> del_connection -> _drbd_thread_stop hangs indefinitely, blocking all subsequent admin commands on the resource due to adm_mutex contention. Fix by checking get_t_state() unconditionally, not only when a signal is pending. EXITING is only set by _drbd_thread_stop(restart=false), which is an explicit request for the thread to terminate. Checking it without a pending signal cannot cause false positives. Signed-off-by: David Magton <david.magton@flant.com>
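The change in the check can be reduced to two small predicates. This is a sketch of the described condition only; the real function may consider more than what is modelled here, and ThreadState plus both helper names are hypothetical.

```python
from enum import Enum


class ThreadState(Enum):
    RUNNING = "running"
    RESTARTING = "restarting"
    EXITING = "exiting"


def should_abort_old(signal_pending, t_state):
    """Pre-fix shape: the thread state is consulted only if a signal is pending."""
    if signal_pending:
        if t_state is ThreadState.EXITING:
            return True
    return False


def should_abort_new(signal_pending, t_state):
    """Post-fix shape: EXITING is acted on whether or not a signal is pending."""
    return t_state is ThreadState.EXITING
```

The interesting input is (signal_pending=False, EXITING): the signal was already consumed by connect(), so the old predicate keeps the receiver looping while the new one lets it exit.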
Problem: When a DRBD connection is being torn down (e.g. drbdsetup del-peer), conn_disconnect() waits for active_ee_cnt to reach zero via wait_event(). However, a PeerWrite request that has been received but not yet acquired an Activity Log (AL) entry (i.e. stuck in "blocked-on-al" state) keeps active_ee_cnt at 1 indefinitely.
The request lifecycle in this deadlock scenario:
1. PeerWrite is received in receive_Data(), active_ee_cnt is incremented, request is added to connection->peer_requests (recv_order list).
2. drbd_al_begin_io_fastpath() fails (e.g. due to quorum suspension or AL being locked), so the request is queued to device->submit.peer_writes via drbd_queue_peer_request(), which also triggers do_submit worker.
3. The connection drops. conn_disconnect() sets cstate to C_NETWORK_FAILURE and then waits for active_ee_cnt == 0.
4. The do_submit worker may have already completed its run before the connection state changed, or may be blocked behind a locked AL in prepare_al_transaction_nonblock(). In either case, it never processes this PeerWrite for cleanup.
5. Deadlock: conn_disconnect waits on active_ee_cnt, but the only code that can decrement it (drbd_cleanup_peer_requests_wfa, called from do_submit) never runs for this request. drbdsetup hangs in D state.
This was observed in production: drbdsetup del-peer and the DRBD receiver thread both stuck in D (uninterruptible sleep), with debugfs showing active_ee_cnt: 1 and a PeerWrite with flags "blocked-on-al" aged 67 million jiffies.
Fix (three parts):
1. prepare_al_transaction_nonblock(): Move the cstate < C_CONNECTED cleanup loop BEFORE the __LC_LOCKED check. Previously, if the AL was locked, the function would immediately goto out, skipping the cleanup of requests from disconnected connections. Now these requests are always moved to the cleanup list regardless of AL lock state.
2. conn_disconnect(): After drain_resync_activity() and before the active_ee_cnt wait, explicitly kick the submit worker (queue_work) and wake the AL wait queue (wake_up) for every device on the disconnecting connection. This forces do_submit to run, see the disconnected cstate, and clean up any blocked-on-al PeerWrites via prepare_al_transaction_nonblock -> drbd_cleanup_peer_requests_wfa, which decrements active_ee_cnt and wakes ee_wait.
3. cleanup_unacked_peer_requests(): Add a WARN_ON_ONCE guard before drbd_al_complete_io() to detect if a peer request without EE_IN_ACTLOG ever reaches this function. Calling drbd_al_complete_io on such a request would corrupt AL reference counts. With fixes 1+2 this should never happen, but the guard provides safety and diagnostics.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
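Part 1 of the fix is an ordering change, which a toy model makes easy to see. The sketch below is an assumption-laden simplification: requests are (name, connected) tuples, and prepare_old/prepare_new are hypothetical helpers that only classify requests rather than touch an activity log.

```python
def _split(requests):
    """Separate requests on live connections from those on dropped ones."""
    keep = [r for r in requests if r[1]]
    drop = [r for r in requests if not r[1]]
    return keep, drop


def prepare_old(al_locked, requests):
    """Pre-fix ordering: a locked AL returns before any cleanup happens."""
    if al_locked:
        return {"cleanup": [], "waiting": list(requests), "started": []}
    keep, drop = _split(requests)
    return {"cleanup": drop, "waiting": [], "started": keep}


def prepare_new(al_locked, requests):
    """Post-fix ordering: disconnected requests are sorted out first."""
    keep, drop = _split(requests)
    if al_locked:
        return {"cleanup": drop, "waiting": keep, "started": []}
    return {"cleanup": drop, "waiting": [], "started": keep}


blocked = [("peer_write_1", False)]       # its connection has already dropped
old_result = prepare_old(True, blocked)
new_result = prepare_new(True, blocked)
```

With the AL locked, the old ordering leaves the orphaned request waiting forever, while the new ordering hands it to cleanup so the counter it pins can drop to zero.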
When del_connection() calls _drbd_thread_stop() for the receiver, an external event (such as a peer reconnecting or a state change callback from another connection) can trigger a C_STANDALONE -> C_UNCONNECTED transition, which calls drbd_thread_start(). This converts the receiver's t_state from EXITING to RESTARTING, causing it to restart instead of exit. The _drbd_thread_stop caller then waits on the completion forever in D-state. This was observed in production: drbdsetup down issued while a peer was simultaneously reconnecting. The receiver exited its connect loop (seeing EXITING), entered conn_disconnect, went to StandAlone. But before _drbd_thread_stop's wait_for_completion could be satisfied, the state machine restarted the receiver due to the peer's connection attempt. The receiver re-entered drbdd and the completion was never signaled. Fix by moving change_cstate(C_STANDALONE) before drbd_thread_stop() in del_connection(). With C_STANDALONE set, conn_disconnect() skips the receiver restart (oc >= C_UNCONNECTED is false for C_STANDALONE), and the C_STANDALONE -> C_UNCONNECTED transition that triggers drbd_thread_start() cannot occur. The change_cstate(C_STANDALONE) call already existed in del_connection() as a "race breaker" (added by Lars Ellenberg in 2011, commit 980d566), but was placed after drbd_thread_stop() where it was never reached when the hang occurred. Signed-off-by: David Magton <david.magton@flant.com>
…shed connection When _drbd_thread_stop() sends SIGHUP to a receiver thread that is blocked in tcp_recvmsg (established connection, receiving data), the signal may be consumed by the TCP receive path without causing drbd_recv() to return an error. Additionally, when del_connection() sets C_STANDALONE before _drbd_thread_stop, the state machine's finish_state_change() already calls drbd_thread_stop_nowait() for the C_STANDALONE transition, which sends SIGHUP first. The subsequent _drbd_thread_stop in del_connection sees t_state already EXITING and does not send a second signal. Fix by checking get_t_state() after every transport recv in drbd_recv(). If the thread is in EXITING state, force -EINTR return regardless of recv result. This makes the receiver exit immediately without waiting for the next loop condition check. Only EXITING is checked, not RESTARTING. RESTARTING means "finish current iteration and restart" (normal disconnect + reconnect flow). Treating RESTARTING as an exit condition would kill the feature exchange during reconnect, causing a permanent BrokenPipe loop: Connecting -> BrokenPipe -> "short read (expected size 8)" -> Unconnected -> Connecting, repeating indefinitely. Triggered by drbdsetup down or del-peer when the receiver is blocked in tcp_recvmsg on an established connection with the peer actively sending data. Signed-off-by: David Magton <david.magton@flant.com>
When del_connection sets C_STANDALONE and marks the receiver thread as EXITING, the receiver in conn_connect's retry loop can race past both state changes. At the "start:" label, change_cstate(C_CONNECTING) succeeds because net_conf still exists, overriding C_STANDALONE. The -EAGAIN handler only checks for C_DISCONNECTING (==), missing C_STANDALONE. The receiver re-enters the connect loop indefinitely, and del_connection (drbdsetup down / del-peer) hangs in _drbd_thread_stop waiting for thread completion. Add get_t_state() == EXITING check at the "start:" label before attempting change_cstate(C_CONNECTING). Widen the -EAGAIN cstate check from == C_DISCONNECTING to <= C_DISCONNECTING, which also catches C_STANDALONE. Both checks only trigger during connection destruction (EXITING / C_STANDALONE), never during normal disconnect-reconnect cycles which use RESTARTING / C_UNCONNECTED. Triggered by drbdsetup down while the receiver thread is in the connect retry loop (e.g. connection to an unreachable or slowly responding peer). The receiver races past the C_STANDALONE transition and re-enters conn_connect indefinitely. Signed-off-by: David Magton <david.magton@flant.com>
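Why widening the comparison catches C_STANDALONE follows from the enum ordering. The sketch below uses an abridged, assumed ordering in which only the relative position of the first three states matters (consistent with the description above); the numeric values and helper names are illustrative.

```python
from enum import IntEnum


class CState(IntEnum):
    # Abridged and assumed ordering; the real enum has more states.
    C_STANDALONE = 0
    C_DISCONNECTING = 1
    C_UNCONNECTED = 2
    C_CONNECTING = 3
    C_CONNECTED = 4


def stop_retry_old(cstate):
    """Pre-fix check: only an exact C_DISCONNECTING ends the retry loop."""
    return cstate == CState.C_DISCONNECTING


def stop_retry_new(cstate):
    """Post-fix check: anything at or below C_DISCONNECTING ends it."""
    return cstate <= CState.C_DISCONNECTING
```

C_UNCONNECTED stays outside the widened check, so a normal disconnect and reconnect cycle still retries.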
drbd_adm_down called conn_try_disconnect(force=0) for all connections. Without CS_HARD, the state machine requires a cluster-wide two-phase commit for graceful disconnect. This fails with: - SS_NEED_CONNECTION when the peer is unreachable (Connecting state) - SS_NO_QUORUM when disconnecting would break quorum - SS_CW_FAILED_BY_PEER when the peer rejects the twopc For SS_NEED_CONNECTION, conn_try_disconnect spins in its repeat loop indefinitely. For SS_NO_QUORUM, it returns failure and drbd_adm_down aborts entirely. Only attempt graceful disconnect for C_CONNECTED peers where twopc is possible and outdate negotiation matters. For all other connection states, or if graceful disconnect fails for any reason (including quorum loss), fall back to force disconnect (CS_HARD) which takes the local-only code path and succeeds immediately. Triggered by drbdsetup down when connections are in Connecting or Unconnected state (e.g. peer is unreachable or was recently disconnected). The graceful disconnect hangs or fails, and without the force fallback drbdsetup down never completes. Signed-off-by: David Magton <david.magton@flant.com>
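The resulting decision flow is short enough to sketch. The function and the string state names are hypothetical; try_graceful stands in for the two-phase-commit attempt and simply returns whether it succeeded.

```python
def disconnect_for_down(cstate, try_graceful):
    """Return which mode ended up disconnecting the peer.

    try_graceful: callable returning True if the cluster-wide two-phase
    commit succeeded. It is attempted only for connected peers.
    """
    if cstate == "Connected" and try_graceful():
        return "graceful"
    return "force"                # local-only path, the CS_HARD equivalent


attempts = []

def failing_twopc():
    """Simulated graceful attempt that fails, e.g. because of quorum loss."""
    attempts.append("twopc")
    return False
```

For a peer in Connecting state the graceful attempt is never made, and for a connected peer a failed attempt falls through to the forced path instead of aborting the whole operation.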
drbd_adm_net_opts() and drbd_fsync_device() call drbd_flush_workqueue(&connection->sender_work) unconditionally. This queues a completion work item and waits for it to be processed. However, the sender thread only runs when the connection is at least in C_CONNECTING state; for StandAlone or Unconnected connections, no thread processes the sender workqueue, so wait_for_completion() blocks forever, putting the calling process into uninterruptible D state. This happens when drbdsetup net-options is called on a connection that is in StandAlone state (e.g. after new-peer but before connect). At that point the sender thread is not yet started. Fix by guarding the flush with a cstate check: only flush if the connection is at least C_CONNECTING, which guarantees the sender thread is running and will process the work item. Signed-off-by: David Magton <david.magton@flant.com>
When bd_link_disk_holder() fails in link_backing_dev(), the function
calls fput(file) to release the backing device. However, the caller
(open_backing_devices) also calls close_backing_dev() on failure,
which calls fput() again on the same file. This double-fput corrupts
the file reference count, leading to use-after-free and ultimately:
kernel BUG at mm/slub.c:448!
In normal operation this path is never hit because bd_link_disk_holder()
succeeds. But when drbdsetup down races with a concurrent drbdsetup
attach on the same backing device, bd_link_disk_holder() fails with
-EINVAL, triggering the double-free.
Fix by removing the fput()/bdev_release() from link_backing_dev() and
letting the caller handle cleanup exclusively through close_backing_dev().
Also update the corresponding coccinelle compat patches that transform
fput() to bdev_release() (kernel 6.6-6.8) and remove bdev_release()
(kernel <6.6), so the fix applies consistently across all supported
kernel versions.
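The single-owner cleanup rule can be illustrated with a toy reference count. This is not kernel code: File, link_backing_dev and open_backing_device here are simplified stand-ins, and the release_on_failure flag exists only to switch between the old and new behaviour.

```python
class File:
    """Minimal reference-counted handle (toy model, not the kernel's struct file)."""

    def __init__(self):
        self.refs = 1

    def fput(self):
        if self.refs == 0:
            raise RuntimeError("fput on an already released file")
        self.refs -= 1


def link_backing_dev(file, link_ok, release_on_failure):
    """Return True on success; optionally drop the reference on failure."""
    if not link_ok:
        if release_on_failure:
            file.fput()           # old behaviour: the callee cleans up too
        return False
    return True


def open_backing_device(link_ok, release_on_failure):
    """The caller always cleans up on failure, so the callee must not."""
    file = File()
    if not link_backing_dev(file, link_ok, release_on_failure):
        file.fput()               # stands in for close_backing_dev()
        return None
    return file
```

With release_on_failure=True the second fput() hits an already released handle, which the model turns into an exception; with False the reference is dropped exactly once.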
Signed-off-by: David Magton <david.magton@flant.com>
When drbdsetup down is in progress, concurrent drbdsetup new-peer, attach, or primary/secondary on the same resource can reach blocking waits (drbd_flush_workqueue, wait_event) on queues whose processing threads have already been stopped by the ongoing teardown, causing permanent D-state hangs. Additionally, new-peer during teardown can create orphaned sender threads that are never cleaned up, leaking kernel threads and kref references, preventing module unload. Fix by checking DOWN_IN_PROGRESS and R_UNREGISTERED flags at the entry of drbd_adm_set_role, drbd_adm_attach, and drbd_adm_new_peer. Return ERR_INVALID_REQUEST immediately if the resource is being torn down. Reproduced by running drbdsetup down concurrently with drbdsetup new-peer / attach / primary on the same resource (e.g. rapid resource teardown and recreation). The concurrent operation races past adm_mutex and reaches a blocking wait on a dead workqueue. Signed-off-by: David Magton <david.magton@flant.com>
drbd_queue_bitmap_io() puts bitmap work into pending_bitmap_work and
relies on dec_ap_bio() to move it to the worker queue when
ap_bio_cnt[WRITE] reaches zero. However, when IO is suspended (e.g.
on-no-quorum suspend-io), application writes hold ap_bio_cnt > 0
indefinitely because suspended IO never completes. This creates a
circular dependency:
1. Suspended IO cannot complete without quorum
2. Quorum requires at least one UpToDate peer
3. UpToDate requires resync to complete
4. Resync requires bitmap exchange
5. Bitmap exchange is stuck in pending_bitmap_work waiting for
ap_bio_cnt == 0
Since suspended IO does not modify the on-disk bitmap (it is frozen
before being submitted to peers), it is safe to bypass the
ap_bio_cnt == 0 requirement and move bitmap work directly to the
worker queue.
The fix is applied in two places to cover different race windows:
- drbd_queue_bitmap_io(): immediately after queuing, if the device
is already suspended.
- w_after_state_change(): at the end of every state change, if the
resource is suspended and any device has pending bitmap work. This
covers the case where suspension occurs after bitmap work was
queued but before ap_bio_cnt reached zero.
Triggered by quorum loss on Primary with active IO (e.g. mounted
filesystem with writers), followed by peer reconnection that requires
resync. Reproducible by killing TCP connections to both peers with
ss -K while 8 parallel dd writers do fsync on the DRBD device.
Signed-off-by: David Magton <david.magton@flant.com>
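The two bypass points can be modelled in userspace as follows. This is a sketch under assumptions: `toy_device` collapses the device and resource into one struct, the queues are plain counters, and the "queue directly when ap_bio_cnt is zero" branch is a simplification of the normal path rather than a copy of the driver logic.

```c
#include <stdbool.h>

/* Toy device: counters stand in for the real lists and atomics. */
struct toy_device {
    int ap_bio_cnt_write;     /* application writes in flight */
    bool suspended;           /* IO suspended, e.g. on quorum loss */
    int pending_bitmap_work;  /* parked, waiting for ap_bio_cnt == 0 */
    int worker_queue;         /* handed to the worker */
};

/* Move all parked bitmap work to the worker queue. */
void toy_kick_pending_bitmap_work(struct toy_device *d)
{
    d->worker_queue += d->pending_bitmap_work;
    d->pending_bitmap_work = 0;
}

/* Bypass point 1: right after queuing, if already suspended. */
void toy_queue_bitmap_io(struct toy_device *d)
{
    d->pending_bitmap_work++;
    if (d->ap_bio_cnt_write == 0 || d->suspended)
        toy_kick_pending_bitmap_work(d);
}

/* Bypass point 2: after every state change, if suspension arrived later. */
void toy_after_state_change(struct toy_device *d)
{
    if (d->suspended && d->pending_bitmap_work > 0)
        toy_kick_pending_bitmap_work(d);
}
```

The second hook matters for the ordering "work queued first, suspension second": without it the parked work would still wait for a counter that can no longer drop.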
When two peers reconnect simultaneously after quorum loss, the UUID handshake for one peer can be invalidated by a concurrent state change from the other peer's handshake. Specifically, UUID_FLAG_PRIMARY_LOST_QUORUM is set during quorum loss and changes the uuid_flags between when they are sent to the peer and when they are verified locally. This triggers rule=initial-handshake-changed, strategy=RETRY_CONNECT.

The RETRY_CONNECT strategy has .reconnect=true, which unconditionally calls maybe_force_secondary(). For a suspended Primary, this demotes it to Secondary with force-io-failures, causing ext4 journal abort and a permanently read-only filesystem, even though the Primary's data is not outdated and the handshake succeeds on the very next retry.

RETRY_CONNECT means "inputs changed during handshake, try again"; it does not indicate that the peer has newer data. The only other .reconnect strategy is SYNC_TARGET_PRIMARY_RECONNECT, where force-secondary is justified because the Primary genuinely needs to become a sync target.

Fix by skipping maybe_force_secondary() for RETRY_CONNECT. The handshake will be retried via CONN_HANDSHAKE_RETRY with current uuid_flags and succeed without demoting the Primary.

Triggered by killing TCP connections to both peers (ss -K or tcpkill) while 8 parallel writers do fsync on a mounted DRBD filesystem. The parallel reconnection of two peers causes concurrent UUID handshakes that interfere with each other's uuid_flags.

Signed-off-by: David Magton <david.magton@flant.com>
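The decision described above reduces to a small predicate. The sketch below is a model, not the driver code: the enum lists only the strategies mentioned in the commit message plus one non-reconnecting placeholder, and both function names are hypothetical.

```c
#include <stdbool.h>

/* Only the strategies named in the commit message, plus a placeholder. */
enum toy_strategy {
    TOY_NO_SYNC,
    TOY_SYNC_TARGET_PRIMARY_RECONNECT,
    TOY_RETRY_CONNECT
};

/* Models the .reconnect attribute of a strategy. */
bool toy_strategy_reconnects(enum toy_strategy s)
{
    return s == TOY_SYNC_TARGET_PRIMARY_RECONNECT || s == TOY_RETRY_CONNECT;
}

/* Demote only when reconnecting because the Primary must become a target;
 * a plain retry of the handshake never demotes. */
bool toy_should_force_secondary(enum toy_strategy s)
{
    return toy_strategy_reconnects(s) && s != TOY_RETRY_CONNECT;
}
```

Both reconnecting strategies still reconnect; only the demotion is made conditional.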
…nded

drbd_rs_complete_io() and make_resync_request() delay resync completion (RS_DONE / drbd_resync_finished) while drbd_any_flush_pending() returns true, waiting for barrier acks from the Primary. However, when the resource is suspended (e.g. on-no-quorum suspend-io), the Primary's IO is frozen and pending flushes will never complete. This creates a circular dependency:
1. Resync completion waits for flush ack from Primary
2. Flush ack requires Primary to process the barrier
3. Primary IO is suspended, waiting for quorum
4. Quorum requires peers to be UpToDate
5. UpToDate requires resync to complete

Since suspended IO does not modify data on peers (it is frozen before being sent), skipping the flush check is safe: the resync data is already durable on the target's backing device.

Fix by checking drbd_suspended() before drbd_any_flush_pending() in both callsites. When suspended, allow resync to complete immediately so that quorum can be restored and IO can resume.

Triggered by quorum loss on Primary with active IO, followed by peer reconnection. The resync completes its data transfer but the final RS_DONE is blocked by pending flushes from the frozen Primary, preventing the peer from becoming UpToDate.

Signed-off-by: David Magton <david.magton@flant.com>
…ync blocks

When application writes occur on the SyncSource while a resync is in progress, the written sectors are replicated to the SyncTarget peer via P_DATA (drbd_should_do_remote returns true for D_INCONSISTENT + L_SYNC_TARGET). However, there is a timing window: the bitmap bit is set when the write is submitted locally, and cleared only when the peer acks the write. If drbd_resync_finished() calls drbd_bm_total_weight() while acks are still in flight, it observes n_oos > rs_failed.

This is a critical data safety issue: on a subsequent disconnect/reconnect, uuid_compare() returns no-sync because the UUIDs were already rotated by the "completed" resync. The peer is then promoted to UpToDate despite having stale data, i.e. silent data corruption on failover.

Fix:
1. In drbd_resync_finished(), only trigger resync_again++ when !drbd_should_do_remote(peer_device). When the peer IS receiving application writes, the OOS bits are from in-flight acks; the data is already on the peer and the bitmap will clear momentarily. When the peer is NOT receiving writes (disconnected/Outdated), the OOS bits are genuinely missing data requiring another resync pass.
2. In resync_again(), check drbd_bm_total_weight() before starting a new resync pass. If bitmap is clean by the time we get here (late acks cleared bits), skip. This prevents zero-length resync passes.

Reproduction: 3-node cluster (Primary + 2 replicas), 8 parallel dd writers with fsync on mounted filesystem. Kill TCP connections (tcpkill or ss -K) to both peers simultaneously. 5 out of 7 resyncs finish with n_oos > rs_failed. After reconnect, the Outdated peer is incorrectly marked UpToDate by uuid_compare()=no-sync, a verified data inconsistency.

Signed-off-by: David Magton <david.magton@flant.com>
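The two checks in this fix can be written as pure functions over the values the commit message names. This is a sketch only: the inputs are passed in as plain parameters (in the driver they come from the bitmap and peer-device state), and the function and enum names are hypothetical.

```c
#include <stdbool.h>

enum toy_action { TOY_FINISH, TOY_RESYNC_AGAIN };

/* Check 1: leftover out-of-sync bits count as missing data only when the
 * peer is not receiving application writes. */
enum toy_action toy_resync_finished(unsigned long n_oos,
                                    unsigned long rs_failed,
                                    bool should_do_remote)
{
    if (n_oos > rs_failed && !should_do_remote)
        return TOY_RESYNC_AGAIN;
    return TOY_FINISH;
}

/* Check 2: before starting the extra pass, re-read the bitmap weight;
 * late acks may have cleared every bit in the meantime. */
bool toy_resync_again_should_start(unsigned long bm_total_weight)
{
    return bm_total_weight > 0;
}
```

The second check is what prevents a zero-length resync pass when the acks land between the two calls.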
…re L_ESTABLISHED for Primary stable source

Two related fixes for unstable resync completion:

BUG (WFBitMapT blocking): After an unstable resync completes, the after-unstable handshake may initiate a follow-up resync from another secondary peer (WFBitMapT). If that peer is also unstable, the resync always results in Outdated (was_resync_stable() returns false). Meanwhile, the WFBitMapT state causes sanitize_state() to clamp max_disk_state to D_OUTDATED, blocking even a concurrent stable resync from the Primary from promoting the device to UpToDate.

Fix 1: Skip after-unstable resync initiation (drbd_start_resync_side) when a stable sync source (connected Primary with L_ESTABLISHED) is already available. The Primary will resync directly.

Fix 2: Cancel pending WFBitMapT in __cancel_other_resyncs() when a stable resync completes. Previously only L_PAUSED_SYNC_T was cancelled.

BUG (wrong stable source detection): drbd_stable_sync_source_present() considers a connected Primary as a stable source regardless of replication state. During resync or reconnect handshake, the Primary may not yet have our full bitmap. This causes was_resync_stable() to return true for an ongoing unstable resync from a different peer, leading to premature UUID rotation and permanently Outdated Target.

Fix 3: Require repl_state == L_ESTABLISHED before considering a connected Primary as a stable source. Only a fully established replication link guarantees that the Primary has our complete bitmap and all writes are being replicated.

Reproduction: 3-node cluster (A=Primary, B=Secondary, C=Secondary). Kill A-C TCP connection while writers are active. C resyncs from B (unstable). A-C reconnects during resync. Without L_ESTABLISHED check, C considers A a stable source, was_resync_stable() returns true, UUIDs rotate, but bitmap is not clean → C stays Outdated permanently.

Observed in stress testing with tcpkill + iptables on 3-node cluster with 8 parallel dd writers.
Signed-off-by: David Magton <david.magton@flant.com>
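Fix 3 amounts to tightening one predicate. The model below is a sketch under assumptions: `toy_peer` carries only the three fields the commit message talks about, the replication-state enum is a shortened illustrative subset, and the function is not the real drbd_stable_sync_source_present().

```c
#include <stdbool.h>
#include <stddef.h>

/* Shortened, illustrative subset of replication states. */
enum toy_repl_state {
    TOY_L_OFF,
    TOY_L_WF_BITMAP_T,
    TOY_L_SYNC_TARGET,
    TOY_L_ESTABLISHED
};

struct toy_peer {
    bool connected;
    bool is_primary;
    enum toy_repl_state repl;
};

/* A connected Primary counts as a stable source only once the
 * replication link towards it is fully established. */
bool toy_stable_sync_source_present(const struct toy_peer *peers, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (peers[i].connected && peers[i].is_primary &&
            peers[i].repl == TOY_L_ESTABLISHED)
            return true;
    }
    return false;
}
```

In the reproduction scenario, A is connected and Primary while the A-C link is still in handshake, so this predicate stays false for C until that link reaches Established.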
… stall

After a connection loss during an active resync, Source-side resync read requests (P_RS_DATA_REQUEST) that were sent but never acknowledged remain in the device's interval tree with INTERVAL_SENT set and INTERVAL_RECEIVED clear. On reconnect, the new resync attempts to write to the same sectors but finds conflicting intervals and stalls indefinitely waiting for them to complete. They never will, because the connection that was supposed to deliver the reply is gone.

The existing drbd_cancel_conflicting_resync_requests() only handles entries that have been RECEIVED or SUBMITTED (it queues them through the submit_conflict worker which calls drbd_cleanup_received_resync_write). SENT-but-not-RECEIVED entries are skipped because they are Source-side requests: they have no allocated receive buffer, no incremented backing_ee_cnt or unacked counters, and were never submitted to the backing device. Passing them through the normal cleanup path causes counter underflow (dec_unacked on a zero counter) and kernel deadlock.

Add drbd_cleanup_stale_resync_intervals() which directly removes SENT && !RECEIVED resync entries from the interval tree, decrements rs_pending_cnt, and frees the peer_req. Called from drain_resync_activity (conn_disconnect path) after drbd_cancel_conflicting_resync_requests.

Reproduction: 3-node cluster with active filesystem IO (8 dd writers with fsync). Kill TCP connections (tcpkill) to trigger resync. During active resync, kill connections again. After second reconnect, resync hangs permanently with rs_pending_cnt > 0. Stale entries are visible in debugfs interval tree (SENT flag set, RECEIVED flag clear). Manual drbdsetup disconnect resolves it because conn_disconnect now calls the cleanup function.

Signed-off-by: David Magton <david.magton@flant.com>
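The cleanup pass described above can be sketched on a simplified data structure. Assumptions, stated plainly: a singly linked list replaces the interval tree, the flag values are invented, and `toy_peer_req`, `toy_push` and `toy_cleanup_stale_resync_intervals` are hypothetical names that only mirror the shape of the real function.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Flag names follow the commit message; the values are made up. */
enum { TOY_INTERVAL_SENT = 1, TOY_INTERVAL_RECEIVED = 2 };

struct toy_peer_req {
    unsigned int flags;
    struct toy_peer_req *next;
};

/* Helper for building a list: prepend a request with the given flags. */
struct toy_peer_req *toy_push(struct toy_peer_req *head, unsigned int flags)
{
    struct toy_peer_req *r = malloc(sizeof(*r));

    if (!r)
        abort();
    r->flags = flags;
    r->next = head;
    return r;
}

/* Unlink and free every SENT && !RECEIVED entry, decrementing the pending
 * counter once per entry. Returns the number of entries removed. */
int toy_cleanup_stale_resync_intervals(struct toy_peer_req **head,
                                       int *rs_pending_cnt)
{
    struct toy_peer_req **link = head;
    int removed = 0;

    while (*link) {
        struct toy_peer_req *r = *link;
        bool stale = (r->flags & TOY_INTERVAL_SENT) &&
                     !(r->flags & TOY_INTERVAL_RECEIVED);

        if (stale) {
            *link = r->next; /* unlink before freeing */
            free(r);
            (*rs_pending_cnt)--;
            removed++;
        } else {
            link = &r->next;
        }
    }
    return removed;
}
```

Entries that were already received are left alone, since the commit message assigns them to the existing cancel path with its own counter handling.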
…UUIDs

When a resync from an unstable Source (secondary with no connected Primary) completes, the Target must NOT rotate UUIDs or promote its disk to UpToDate. The Source may have been behind the Primary, so the data just synced could be incomplete. A UUID rotation at this point would make the incomplete data "authoritative", and a subsequent reconnect to the Primary would see matching UUIDs and skip the needed resync, resulting in silent data loss.

Previously, the Target would unconditionally rotate UUIDs and become Established+UpToDate after any completed resync, regardless of Source stability. The "Peer was unstable during resync" message was logged but had no corrective effect.

Fix: when UNSTABLE_RESYNC is set after resync completion on the Target:
1. Skip UUID rotation and disk promotion entirely.
2. Transition to Established (already done by the state change).
3. Send UUID_FLAG_RESYNC to the Source to trigger a full re-handshake.
4. The Source compares UUIDs and sends a new bitmap. If there are dirty sectors (from Primary writes during the unstable resync), another resync pass runs. If the bitmap is clean AND a stable source (Primary with L_ESTABLISHED) is now available, the resync completes stably and UUIDs rotate normally.

Target-side receive_bitmap() is extended to accept bitmaps in L_ESTABLISHED state (from the re-handshake). If bm_total > 0, it sends its own bitmap back and starts L_SYNC_TARGET. If bm_total == 0, L_ESTABLISHED is silently maintained (no-op, already UpToDate effectively).

Reproduction: 3-node cluster (A=Primary, B, C). Kill A-C connection. C resyncs from B (unstable). Resync completes. Without this fix, C rotates UUIDs and becomes UpToDate with potentially stale data. When A reconnects, uuid_compare sees matching UUIDs and declares no-sync. With this fix, C requests re-handshake from B, which cascades to A, ensuring C gets all data that A wrote during the unstable resync period.

Signed-off-by: David Magton <david.magton@flant.com>
Summary of changes since flant.2:

Stability fixes for drbdsetup operations:
- Fix drbdsetup down hang (receiver stuck in connect loop)
- Fix del-connection hang (receiver restart race)
- Fix del-peer deadlock (blocked-on-al PeerWrite)
- Fix receiver thread not exiting when connection is established
- Fall back to force disconnect in drbd_adm_down
- Skip sender workqueue flush when sender is not running
- Reject admin operations on resource being removed

Kernel crash / IO deadlock fixes:
- Fix double fput in link_backing_dev error path (kernel BUG)
- Fix bitmap IO deadlock when IO suspended due to quorum loss
- Do not force-secondary on transient handshake retry
- Fix transport listener race causing rejected connections

Resync reliability fixes:
- Fix resync-again triggering on in-flight acks (do_remote guard)
- Fix was_resync_stable() wrong stable source detection (require L_ESTABLISHED for connected Primary)
- Fix resync stall after reconnect (stale SENT && !RECEIVED intervals lingering in interval tree)
- Fix unstable resync completion (re-handshake via UUID_FLAG_RESYNC instead of premature UUID rotation)
When the drbd_submit_rs_discard() error path frees entries from resync_requests, the received_last pointer is reset to NULL. A subsequent call to drbd_process_rs_discards() then rescans from the list head and finds entries belonging to an already-submitted discard group. Between this scan (under peer_reqs_lock) and the actual drbd_submit_rs_discard() call (without the lock), the ack_sender thread may complete and free those entries via e_end_resync_block -> drbd_check_peers_in_sync_progress.

This leads to use-after-free: drbd_remove_peer_req_interval() hits the ASSERTION !drbd_interval_empty(i) on the freed entry, and drbd_unmerge_discard() dereferences LIST_POISON1 from the poisoned recv_order linked list, causing a general protection fault.

Fix this in two ways:
1) In drbd_advance_to_next_rs_discard(), skip TRIM entries that belong to already-submitted discard groups. The main entry is identified by INTERVAL_SUBMITTED, merged entries by their cleared (empty) intervals. This prevents double-processing regardless of how received_last was set.
2) In drbd_list_del_resync_request(), when the freed entry is received_last or discard_last, set the pointer to the predecessor entry on resync_requests instead of NULL. This avoids unnecessary full rescans from the list head.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
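The second part of the fix, moving a cursor to the predecessor instead of resetting it, can be illustrated on a tiny circular doubly linked list. This is a sketch, not the kernel list API: `toy_node`, `toy_list` and the functions below are hypothetical, only the received_last cursor is modelled, and falling back to NULL when the deleted entry was the first one is an assumption.

```c
#include <stddef.h>

struct toy_node {
    struct toy_node *prev, *next;
};

struct toy_list {
    struct toy_node head;            /* sentinel, not a real request */
    struct toy_node *received_last;  /* cursor into the list, or NULL */
};

void toy_list_init(struct toy_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->received_last = NULL;
}

void toy_list_add_tail(struct toy_list *l, struct toy_node *n)
{
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
}

/* Delete n; if the cursor pointed at n, step it back to the predecessor
 * (NULL only when n was the first entry) instead of forcing a rescan. */
void toy_list_del_request(struct toy_list *l, struct toy_node *n)
{
    if (l->received_last == n)
        l->received_last = (n->prev == &l->head) ? NULL : n->prev;

    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
}
```

Because the cursor stays inside the list, a later scan resumes after entries that were already handled rather than starting again at the head.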
Point submodule to a6d1604 which shifts Flant-specific netlink attribute IDs to avoid collisions with upstream:
- disk_conf/non_voting: field 28 -> 36
- res_opts/quorum_dynamic_voters: field 17 -> 25

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
When the Primary reconnects after quorum loss, w_send_dagtag may silently skip sending P_DAGTAG if writes have already been sent past the queued dagtag value (the dagtag_newer_eq optimization). This can leave the SyncSource's last_dagtag_sector stale, causing resync requests with a depend_dagtag to wait indefinitely on dagtag_wait_ee.

Fix this with a three-part approach:
1. Advance last_dagtag_sector from received writes in receive_Data, and release dagtag waiters from the write completion handler (e_end_block) where data is safely committed to disk. Use an atomic dagtag_waiters counter to avoid lock overhead in the common case.
2. Fix w_send_dagtag to always send P_DAGTAG with the actual current position (send.current_dagtag_sector) instead of silently returning when writes raced ahead.
3. Add diagnostic warning when a dagtag wait exceeds 10 seconds, logging the stall duration and relevant dagtag values.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
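The first part of this fix can be pictured with a toy model of the waiter condition. Everything here is an assumption made for illustration: the wraparound-tolerant comparison is one common way to write a "newer or equal" test on 64-bit counters and is not copied from the driver, and `toy_dagtag_peer` with its two functions is hypothetical. Locking, the waiter counter and the completion handler are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* "a is newer than or equal to b", tolerant of counter wraparound.
 * Relies on the usual two's-complement conversion to int64_t. */
bool toy_dagtag_newer_eq(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) >= 0;
}

struct toy_dagtag_peer {
    uint64_t last_dagtag_sector; /* newest dagtag known to be received */
};

/* Advance the position from a received write; never move backwards. */
void toy_note_received_write(struct toy_dagtag_peer *p, uint64_t dagtag)
{
    if (toy_dagtag_newer_eq(dagtag, p->last_dagtag_sector))
        p->last_dagtag_sector = dagtag;
}

/* A resync request may run once the position has reached its dependency. */
bool toy_resync_request_may_run(const struct toy_dagtag_peer *p,
                                uint64_t depend_dagtag)
{
    return toy_dagtag_newer_eq(p->last_dagtag_sector, depend_dagtag);
}
```

In this model a received write alone is enough to unblock a waiter, which is why a skipped P_DAGTAG no longer leaves the position stale.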
astef added a commit that referenced this pull request Jun 23, 2026
flant-9.2.18 is flant-9.2 rebased onto upstream drbd-9.2.18, plus the ported flant-9.2.17 fix line (tip 9.2.18-flant.9). This merge records the pre-rebase branch as an ancestor so PR #9 can fast-forward; the tree is taken entirely from flant-9.2.18 (strategy=ours). All flant-9.2 code fixes are already present as rebased equivalents (verified by content: w_resync_timer race, non-voting disk, reject-admin-on-teardown guards, unstable-resync re-handshake, etc.). Only the obsolete 9.2.16-flant.* version strings are superseded.
flant-9.2.18 is flant-9.2 rebased onto upstream drbd-9.2.18, plus the flant-9.2.17 fixes ported as 9.2.18-flant.4 (use-after-free in resync discard + netlink ID shift to a6d1604c) and 9.2.18-flant.5 (dagtag resync stall). The later diskless-Primary fixes (flant.6-9: uuid-rotation skip, UUID-ancestor, false split-brain, reconciliation timer) were dropped pending re-test against upstream's 9.2.17/9.2.18 diskless-UUID rework. This merge records the pre-rebase branch as an ancestor so PR #9 can fast-forward; the tree is taken entirely from flant-9.2.18 (strategy=ours).
Moves our kernel-module fork off the old 9.2.16 base onto upstream drbd-9.2.18, keeps every Flant patch, and folds in the flant-9.2.17 bug-fix line. Builds clean (drbd.ko + tcp/lb-tcp/rdma, kernel 6.8).

- drbd-9.2.18: picks up all upstream 9.2.17/9.2.18 fixes; our work replayed as 9.2.18-flant.1…3 (quorum-minimum-redundancy, configurable dynamic voters, non-voting disk, plus the drbdsetup-teardown / bitmap-IO-quorum / resync-correctness deadlock fixes).
- flant-9.2.17 fixes as flant.4…9: use-after-free in resync-discard, dagtag resync stall after Primary reconnect, and four diskless-Primary deadlocks (UUID rotation, UUID-ancestor, false split-brain, reconciliation-resync).
- non_voting 28→36 and quorum_dynamic_voters 17→25 (headers a6d1604c) to stay collision-free against future upstream fields.

Ships with 3p-drbd-utils flant-9.32.0 and 3p-drbd-headers flant; all three pin headers a6d1604c (netlink ABI 36/25, proto 118–124). Note: flant.4+ IDs (36/25) differ from flant.1–3 (28/17), so pair with the matching utils.