bpf: Add SOCK_OPS hooks for TCP AutoLOWAT. by kernel-patches-daemon-bpf[bot] · Pull Request #12167 · kernel-patches/bpf

kernel-patches-daemon-bpf · 2026-05-23T08:50:43Z

Pull request for series with
subject: bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1099743

Once bpf_sock_ops_cb_flags_set() supports a new flag, tcpbpf_user.c fails due to the hard-coded max value, 0x80. Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

We will introduce a new type of opt-in hooks for BPF SOCK_OPS prog. The hooks can be enabled on per-socket basis by bpf_setsockopt(): int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG; bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS, &flags, sizeof(flags)); or via the SOCK_OPS specific helper: bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG); Once activated, the BPF prog will be invoked with bpf_sock_ops.op set to BPF_SOCK_OPS_RCVQ_CB upon the following events: 1. TCP stack enqueues skb to sk->sk_receive_queue 2. TCP recvmsg() completes This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat, suppressing unnecessary EPOLLIN wakeups until sufficient data (e.g., a full RPC frame) is available in the receive queue. Note that is_locked_tcp_sock_ops() is left unchanged not to enable bpf_setsockopt() unnecessarily, but bpf_sock_ops_cb_flags_set() is supported at BPF_SOCK_OPS_RCVQ_CB to disable by itself. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS prog can be called with BPF_SOCK_OPS_RCVQ_CB. In this hook, we want to parse the RPC descriptor in the skb and adjust sk->sk_rcvlowat based on the RPC frame size. However, we cannot access payload via bpf_sock_ops.data on modern NICs with TCP header/data split on as the payload is not placed in the linear area. Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB. Three notes: 1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is invoked from recvmsg(). 2) Access to bpf_sock_ops.data will be disabled by passing 0 end_offset to bpf_skops_init_skb(). 3) ____bpf_skb_load_bytes() is called directly instead of __bpf_skb_load_bytes() to allow compilers to inline it instead of generating a tail-call. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

We will add a kfunc for BPF_SOCK_OPS_RCVQ_CB hooks to adjust sk->sk_rcvlowat. These hooks will be triggered when: 1. TCP stack enqueues skb to sk->sk_receive_queue 2. TCP recvmsg() completes In the enqueue path, tcp_data_ready() is always called after the hooks in tcp_queue_rcv() and tcp_ofo_queue(). If tcp_set_rcvlowat() were used as is, tcp_data_ready() could be called twice for the same skb, which is redundant and also confusing. Let's split out __tcp_set_rcvlowat() and add a flag to control wakeup behaviour. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB to adjust sk->sk_rcvlowat when 1. TCP stack enqueues skb to sk->sk_receive_queue 2. TCP recvmsg() completes Let's provide a kfunc to set sk->sk_rcvlowat. Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT. The wakeup flag is determined based on bpf_sock_ops_kern.skb: * For the enqueue hook, skb is always non-NULL, and wakeup is set to false because * tcp_data_ready() is always called after the hooks in tcp_queue_rcv() and tcp_ofo_queue(). * when tcp_fastopen_add_skb() is called for TFO SYN, the socket is not yet accept()ed, and when called for TFO SYN+ACK, the socket is woken up by sk->sk_state_change() anyway. * For the recvmsg() hook, skb is always NULL, and wakeup is set to true because tcp_data_ready() is not called in the path. An alternative would be to support bpf_setsockopt() by adding BPF_SOCK_OPS_RCVQ_CB to is_locked_tcp_sock_ops(). However, that approach involves excessive conditionals and an unnecessary memcpy(), costs we do not want to pay for every skb in the TCP fast path. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle socket receive queues, leading to overlapping use cases. While BPF_SOCK_OPS_RCVQ_CB focuses on optimizing single-socket performance by reducing EPOLLIN wakeups and fully preserves TCP zerocopy support, SOCKMAP is designed to facilitate multi-socket routing at the cost of higher overhead and no zerocopy support. Enabling both features on the same socket makes no sense and results in unexpected interference between them. For instance, SOCKMAP calls __tcp_cleanup_rbuf(), where we will add a BPF_SOCK_OPS_RCVQ_CB hook, and bpf_sock_ops_tcp_set_rcvlowat() calls sk->sk_data_ready(), which would trigger SOCKMAP. Let's make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive. Note that it requires write_lock_bh(&sk->sk_callback_lock) to synchronise with tcp_bpf_update_proto() and check if sk->sk_prot is one of tcp_bpf_prots[][] because sock_map_update_elem() only holds bh_lock_sock() without checking sock_owned_by_user(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

MPTCP has its own sock->ops->set_rcvlowat() / mptcp_set_rcvlowat(). We should not allow calling __tcp_set_rcvlowat() for MPTCP subflows. Let's disable BPF_SOCK_OPS_RCVQ_CB for MPTCP for now. If needed in the future, bpf_sock_ops_tcp_set_rcvlowat() could be extended to properly support MPTCP. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing skbs in the receive queue when it is enabled for the first time. In practical production use cases, this behavior is usually not a problem. We can safely assume that the upper-layer protocol is designed with specific synchronisation points where the connection is temporarily quiet. At these points, the application can completely drain the receive queue and safely enable BPF_SOCK_OPS_RCVQ_CB while no skbs are pending. A prime example is an application transitioning from HTTP to an RPC protocol: Client Server | | | --- HTTP Upgrade request ---------> | | | [Drain all skbs] | | [Enable BPF_SOCK_OPS_RCVQ_CB] | <-- HTTP 200/Switching protocol --- | | | | --- RPC Frame 1 ------------------> | However, to strictly prevent any potential race conditions arising from unconventional upper-layer protocol designs, let's explicitly signal a failure if BPF_SOCK_OPS_RCVQ_CB is enabled while the receive queue is not empty. -EUCLEAN is chosen to indicate that the caller needs to clean up the receive queue before enabling the feature. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB. It requires a similar setup to bpf_skops_established(), and the only difference is the skb data length. Let's factor out the common logic into bpf_skops_common_locked(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB. Let's invoke the BPF SOCK_OPS prog when 1. TCP stack enqueues skb to sk->sk_receive_queue -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb() 2. TCP recvmsg() completes -> __tcp_cleanup_rbuf() This will allow the BPF prog to parse each skb and dynamically adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups until sufficient data (e.g., a full RPC frame) is available in the receive queue. Note that the direct access to bpf_sock_ops.data is intentionally disabled by passing 0 as end_offset. Instead, the BPF prog is supposed to use bpf_skb_load_bytes() with bpf_sock_ops because payload is not in the linear area with TCP header/data split on and skb may contain a RPC descriptor in skb frag. This also simplifies the BPF prog. The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and tcp_fastopen_add_skb() is chosen to provide the same snapshot with tcp_queue_rcv(). For example, if tcp_bpf_rcvlowat() were called before updating TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would need to implement an unlikely if branch to strip SYN. In addition, TCP stack can queue overlapping skb into recvq. Once rcv_nxt is updated with a new skb, BPF prog cannot infer the previous one from skb->len. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

@client

The test is roughly divided into two stages, and the sequence is as follows: I) Setup 1. Attach two BPF programs to a cgroup 2. Establish a TCP connection (@client <-> @child) within the cgroup 3. Enable BPF_SOCK_OPS_RCVQ_CB on @child II) RPC frame exchange in various patterns 4. Send a partial RPC descriptor from @client to @child 5. Verify that epoll does NOT wake up @child 6. Send the remaining data of the RPC frame 7. Verify that epoll finally wakes up @child During setup, two BPF programs are attached to simulate a real-world scenario; one is SOCK_OPS and the other is CGROUP_SOCKOPT. While the SOCK_OPS prog handles the dynamic adjustment of sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt() using pseudo options: #define SOL_BPF 0xdeadbeef #define BPF_TCP_AUTOLOWAT 0x8badf00d setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int)); This reflects a common production use case where an application decides to start parsing RPC frames only at a certain point in the stream (e.g., after HTTP Upgrade), rather than immediately after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc). When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises sk_local_storage for two sequence numbers to manage its state. Then, for the RPC frame exchange, this test uses a simple format defined as follows 0 8 16 24 32 +--------+--------+-------+--------+ `. | header size | | +--------+--------+-------+--------+ > RPC descriptor (8 bytes) | payload size | | +--------+--------+-------+--------+ .' ~ header ~ +--------+--------+-------+--------+ ~ payload ~ +--------+--------+-------+--------+ Every time a new skb is enqueued to sk->sk_receive_queue, the SOCK_OPS prog parses it and updates these sequence numbers: rpc_desc_seq : the SEQ # of the start of the RPC descriptor rpc_end_seq : the SEQ # of the end of the RPC frame => rpc_desc_seq + 8 + header size + payload size Assume we receive two RPC descriptors in the following pattern: 1. When we receive skb-1, only a part of RPC descriptor is parsed. rpc_desc_seq is set to the first byte while rpc_end_seq is unknown. Thus, sk->sk_rcvlowat is set to the size of the RPC descriptor (8 bytes). <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+.................+....................+...... | RPC desc 1 | header + payload | RPC desc 2 | ... +-----------+.................+....................+...... ^ ^-. `- rpc_desc_seq `- sk->sk_rcvlowat 2. Next, we receive skb-2, which completes the first RPC descriptor. Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it. <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+-----------------+....................+...... | RPC desc 1 | header + payload | RPC desc 2 | ... +-----------+-----------------+....................+...... ^ ^ '- rpc_desc_seq '- rpc_end_seq & sk->sk_rcvlowat 3. Once we receive skb-3, which contains the next full RPC descriptor, rpc_desc_seq is advanced and rpc_end_seq is updated according to the size of RPC frame 2. Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq yet. This ensures that the application is woken up to read the already complete RPC frame 1. <- skb-1 -> <---- skb-2 ----> <------ skb-3 -----> +-----------+-----------------+--------------------+...... | RPC desc 1 | header + payload | RPC desc 2 | ... | +-----------+-----------------+--------------------+...... ^ ^ rpc_desc_seq -----------' rpc_end_seq ----...-' & sk->sk_rcvlowat This sequence corresponds to the 4th test case in rpc_test_cases[], and we can see helpful output if we "#define DEBUG": # cat /sys/kernel/tracing/trace_pipe | \ awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \ BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID ... AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0 AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8 AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1 AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150 AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8 AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8 AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258 ... Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

kernel-patches-daemon-bpf · 2026-05-23T08:50:45Z

Upstream branch: b1fcdf9
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1099743
version: 3

kernel-patches-review-bot · 2026-05-23T09:06:12Z

commit 406468e0f402c5463322d476e5ac286f0ca48cd7
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.

This commit adds a new BPF kfunc to allow SOCK_OPS programs to adjust
sk->sk_rcvlowat when TCP enqueues data or when recvmsg() completes.
The kfunc provides negative value clamping and determines the wakeup
flag based on whether skb is NULL.

> diff --git a/net/core/filter.c b/net/core/filter.c
> index fa8a7c7d86eb..3608036632a8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
>  #endif
>  }
>
> +__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
> +					      int rcvlowat)
> +{
> +#ifdef CONFIG_INET
> +	if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
> +		return -EOPNOTSUPP;
> +
> +	if (rcvlowat < 0)
> +		rcvlowat = INT_MAX;
> +
> +	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);

Should this kfunc prevent operation on MPTCP subflows? Bjorn Topel
raised this concern in v1, noting the issue that commit fcf4692fa39e
("mptcp: prevent BPF accessing lowat from a subflow socket.") was
designed to prevent.

You acknowledged this and stated you would drop MPTCP support for now,
mentioning that a subsequent patch (patch 7 in the v3 series: "bpf:
mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.") would disable the callback
for MPTCP.

However, the current code lacks any MPTCP check. Without either disabling
BPF_SOCK_OPS_RCVQ_CB for MPTCP subflows in the callback registration or
adding a check like:

	if (sk_is_mptcp(skops->sk))
		return -EOPNOTSUPP;

couldn't a BPF program invoke this kfunc on MPTCP subflows and trigger
the same problem that fcf4692fa39e was designed to prevent?

Reference: https://lore.kernel.org/bpf/87ik8ujfa1.fsf@all.your.base.are.belong.to.us/

> +#else
> +	return -EOPNOTSUPP;
> +#endif
> +}
> +
>  __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>  					      u64 flags)
>  {
> @@ -12539,6 +12555,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
>
>  BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
>  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
> +BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
>  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
>
>  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

kernel-patches-review-bot · 2026-05-23T09:06:33Z

commit 3ab188a8281bfa86b644cc5261b164e9d80eb1be
Author: Kuniyuki Iwashima <kuniyu@google.com>

selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.

This commit replaces the hard-coded max value 0x80 with
BPF_SOCK_OPS_ALL_CB_FLAGS + 1 to prevent test failures when
bpf_sock_ops_cb_flags_set() supports a new flag.

Link: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 7e8fe1bad03f..e4849d2a2956 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
>  	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
>  	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
>  	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> -	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> +	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> +		  "bad_cb_test_rv");
>  	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
>  	ASSERT_EQ(result->num_listen, 1, "num_listen");
>

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> index 6935f32eeb8f..e4cb1fab079 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> @@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
>  		break;
>  	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
>  		/* Test failure to set largest cb flag (assumes not defined) */
> -		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
> +		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
> +								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);

Multiple reviewers raised concerns about forward compatibility with this
approach. When the test is compiled, BPF_SOCK_OPS_ALL_CB_FLAGS + 1 becomes
a hardcoded value. If this compiled test runs on a future kernel where that
value becomes a valid flag, the helper will successfully set the flag and
return 0, causing the test to fail unexpectedly.

Additionally, this assumes contiguous bitmasks. If a flag is ever deprecated
or skipped, adding 1 could overlap with valid bits.

Would it be better to use a safely distant bit, such as 1 << 30, to guarantee
the flag remains unsupported and the test doesn't break in the future?

Reference: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/

This concern was raised by sashiko-bot@kernel.org in v1 and v2, and also
flagged by bot+bpf-ci@kernel.org in v2. The author's v1 response indicated
the test is not supposed to run on older kernels, but the reviewers are
concerned about FUTURE kernels, not older ones.

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

kernel-patches-daemon-bpf · 2026-05-23T09:06:35Z

Forwarding comment 4524906572 via email
In-Reply-To: 20260523083001.2911931-6-kuniyu@google.com
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/20260523083001.2911931-6-kuniyu@google.com/

kernel-patches-daemon-bpf · 2026-05-23T09:06:36Z

Forwarding comment 4524907225 via email
In-Reply-To: 20260523083001.2911931-2-kuniyu@google.com
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/20260523083001.2911931-2-kuniyu@google.com/

kernel-patches-review-bot · 2026-05-23T09:06:46Z

commit 63a4101f4cf3e07b1f5dae2fa3b294d87fb1b4f7
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.

Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing skbs in
the receive queue when it is enabled for the first time. In practical
production use cases, applications drain the receive queue at
synchronization points before enabling the feature.

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 82ec2291d6f0..4041b9fc1c74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5390,6 +5390,9 @@ static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
>  		return 0;
>  	}
>
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
> +		return -EUCLEAN;
> +

Is there a TOCTOU issue here? The check occurs before acquiring
sk->sk_callback_lock, which means:

Thread A (process context)
__bpf_sock_ops_cb_flags_set()
    if (!skb_queue_empty(&sk->sk_receive_queue))  // queue is empty
        // check passes

Thread B (softirq context)
tcp_rcv_established()
    // adds packet to sk->sk_receive_queue

Thread A (continuing)
__bpf_sock_ops_cb_flags_set()
    write_lock_bh(&sk->sk_callback_lock);
    // sets bpf_sock_ops_cb_flags despite queue being non-empty

Looking at the tcp_in_sockmap() check below, it contains
lockdep_assert_held(&sk->sk_callback_lock), showing that related
validation checks are expected to be performed under this lock:

>  	if (unlikely(sk_is_mptcp(sk)))
>  		return -EOPNOTSUPP;
>
>  	write_lock_bh(&sk->sk_callback_lock);
>
>  	if (unlikely(tcp_in_sockmap(sk))) {
>  		write_unlock_bh(&sk->sk_callback_lock);
>  		return -EBUSY;
>  	}

The commit message says the goal is to "strictly prevent any potential
race conditions" by ensuring the receive queue is empty when
BPF_SOCK_OPS_RCVQ_CB is enabled. Can this check be moved after the
write_lock_bh() call to eliminate the race window?

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

kernel-patches-review-bot · 2026-05-23T09:11:49Z

commit 17de64d91f7bec30b7d257f399bbf7a81be05954
Author: Kuniyuki Iwashima <kuniyu@google.com>

selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.

The test is roughly divided into two stages, and the sequence is as
follows: setup of BPF programs and TCP connection, followed by RPC frame
exchange in various patterns. The test uses a SOCK_OPS prog to handle
dynamic adjustment of sk->sk_rcvlowat and a CGROUP_SOCKOPT prog to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt().

> diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
> --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
> +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
> @@ -64,6 +64,10 @@
>  extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
>  				   struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
>
> +struct bpf_sock_ops_kern;
> +extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
> +					 int rcvlowat) __ksym;
> +
>  void *bpf_cast_to_kern_ctx(void *) __ksym;
>
>  extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> @@ -74,6 +74,8 @@ #define ETH_P_IPV6		0x86DD
>
>  #define NEXTHDR_TCP		6
>
> +#define TCPHDR_FIN		0x01
> +
>  #define TCPOPT_NOP		1
>  #define TCPOPT_EOL		0
>  #define TCPOPT_MSS		2

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> @@ -0,0 +1,350 @@

[ ... ]

> +		.event = {
> +			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
> +			/* The first descriptor is partial. */
> +			{.type = RPC_EVENT_SEND,	.len = 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},
> +			/* The first descriptor is available. */
> +			{.type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE - 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* The first header is ready. */
> +			{.type = RPC_EVENT_SEND,	.len = 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* skb has the first payload and 1 byte of the next descriptor. */
> +			{.type = RPC_EVENT_SEND,	.len = 150 + 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
> +			{.type = RPC_EVENT_RECV,	.len = RPC_DESC_SIZE + 150 + 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},

Does calling recv() actually trigger the BPF program to update SO_RCVLOWAT?

Since the kernel only invokes BPF_SOCK_OPS_RCVQ_CB when a new skb is added
to the receive queue, and not when data is dequeued via recv(), the BPF
program doesn't seem to run here.

Could this cause the test to fail unconditionally because sk_rcvlowat
retains its previous value?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> +				       struct rpc_test_case *test_case)
> +{
> +	struct rpc_event *event = test_case->event;
> +	char *ptr = test_case->data;
> +	struct epoll_event ev;
> +	socklen_t optlen;
> +	int err, optval;
> +	char buf[4096];
> +
> +	if (tcp_autolowat_build_data(test_case))
> +		return;
> +
> +	while (1) {
> +		switch (event->type) {
> +		case RPC_EVENT_END:
> +			return;
> +		case RPC_EVENT_AUTOLOWAT:
> +			err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> +					 &event->val, sizeof(event->val));
> +			if (!ASSERT_OK(err, "setsockopt"))
> +				return;
> +			break;
> +		case RPC_EVENT_SEND:
> +			err = send(cb->client, ptr, event->len, 0);
> +			if (!ASSERT_EQ(err, event->len, "send"))
> +				return;
> +
> +			ptr += event->len;
> +			break;

Could this lead to race conditions causing flaky tests?

The test calls send() to transmit data and immediately executes the next
event. Because network transmission over the loopback interface involves
asynchronous softirq processing, send() might return before the packet is
fully processed by the receiving socket.

If the next event is RPC_EVENT_EPOLL with a timeout of 0, could
epoll_wait() prematurely return 0? Similarly, for RPC_EVENT_RCVLOWAT,
could getsockopt() read a stale sk_rcvlowat value before the TCP stack and
BPF program have finished processing the packet?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> @@ -0,0 +1,326 @@

[ ... ]

> +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> +				struct tcp_autolowat_cb *cb,
> +				u32 seq, u32 end_seq)
> +{
> +	struct rpc_descriptor *rpc_desc;
> +	u32 rpc_copied_seq;
> +	u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
> +	u64 rpc_len;
> +	int err;
> +
> +	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> +
> +	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> +		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> +	else
> +		copy_len = end_seq - rpc_copied_seq;
> +
> +	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> +	 * clang omits the "copy_len == 0" check below, which is necessary
> +	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> +	 */

This isn't a bug, but multi-line comments in the BPF subsystem are
expected to have the opening '/*' on its own line:

	/*
	 * Since LLVM commit ...
	 * ...
	 */

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
> +			      struct tcp_autolowat_cb *cb,
> +			      struct tcp_sock *tp)
> +{
> +	/* To handle wraparound. */
> +	u32 val = 0;
> +
> +	LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
> +	    TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
> +
> +	if (before(tp->copied_seq, cb->rpc_desc_seq))
> +		val = cb->rpc_desc_seq - tp->copied_seq;
> +	else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
> +		val = RPC_DESC_SIZE;
> +	else
> +		val = cb->rpc_end_seq - tp->copied_seq;

Does this code incorrectly set SO_RCVLOWAT to the total combined size of
all unread frames when multiple RPC frames are received in a single skb?

If multiple complete frames arrive in the same skb, the parsing loop in
tcp_do_autolowat() parses all of them and advances cb->rpc_desc_seq to the
end of the last parsed frame. This sets:

val = cb->rpc_desc_seq - tp->copied_seq

which evaluates to the length of all unread frames combined.

If an application sequentially reads one frame at a time, tp->copied_seq
advances and the queue size decreases, but sk_rcvlowat is not updated
because BPF_SOCK_OPS_RCVQ_CB only fires on packet enqueue.

The remaining queue size will then be strictly less than sk_rcvlowat,
causing epoll_wait to block and trapping the remaining fully-received
frames in the receive queue.

Could this hang applications that rely on epoll to read one message at a
time?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

kernel-patches-review-bot · 2026-05-23T09:12:11Z

commit a7e955d270014037d3dd691d9d45087033d858af
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.

Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle socket
receive queues, leading to overlapping use cases. While BPF_SOCK_OPS_RCVQ_CB
focuses on optimizing single-socket performance by reducing EPOLLIN wakeups
and fully preserves TCP zerocopy support, SOCKMAP is designed to facilitate
multi-socket routing at the cost of higher overhead and no zerocopy support.

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c6a6853909c4..bc95d8e7b62e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2853,6 +2853,7 @@ struct sk_msg;
>  struct sk_psock;
>
>  #ifdef CONFIG_BPF_SYSCALL
> +bool tcp_in_sockmap(const struct sock *sk);
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
>  void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
>  #ifdef CONFIG_BPF_STREAM_PARSER
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3608036632a8..1fb63b264b18 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -5382,12 +5382,34 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
>  	return 0;
>  }
>
> +static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
> +{
> +	if (!(val & BPF_SOCK_OPS_RCVQ_CB_FLAG) ||
> +	    tcp_sk(sk)->bpf_sock_ops_cb_flags & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
> +		tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +		return 0;
> +	}
> +
> +	write_lock_bh(&sk->sk_callback_lock);
> +
> +	if (unlikely(tcp_in_sockmap(sk))) {
> +		write_unlock_bh(&sk->sk_callback_lock);
> +		return -EBUSY;
> +	}
> +
> +	tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +
> +	write_unlock_bh(&sk->sk_callback_lock);
> +
> +	return 0;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6..7e7966b095f9 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -705,6 +705,16 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
>  }
>  #endif /* CONFIG_BPF_STREAM_PARSER */
>
> +bool tcp_in_sockmap(const struct sock *sk)
> +{
> +	const struct proto *prot = sk->sk_prot;
> +
> +	lockdep_assert_held(&sk->sk_callback_lock);
> +
> +	return &tcp_bpf_prots[0][0] <= prot &&
> +		prot <= &tcp_bpf_prots[TCP_BPF_NUM_PROTS - 1][TCP_BPF_NUM_CFGS - 1];
> +}
> +
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
>  	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
> @@ -729,6 +739,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  			sock_replace_proto(sk, psock->sk_proto);
>  		}
>  		return 0;
> +	} else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
> +		return -EBUSY;
                   ^^^^

A subsystem pattern flags this as potentially concerning: the commit
message explains that write_lock_bh(&sk->sk_callback_lock) is taken to
synchronise with tcp_bpf_update_proto(), but does tcp_bpf_update_proto()
actually take sk->sk_callback_lock?

Looking at the call chain, tcp_bpf_update_proto() is invoked from the
SOCKMAP path through:

sock_map_link() in net/core/sock_map.c
  -> sock_map_init_proto()
     -> tcp_bpf_update_proto()

In sock_map_link(), sk_callback_lock is acquired after sock_map_init_proto()
completes, not before. The lock is taken when setting sk_user_data to the
psock, but sock_map_init_proto() runs outside the lock and calls
sock_replace_proto() to write sk->sk_prot without holding sk_callback_lock.

Is there a race here? Consider this interleaving (the commit message
acknowledges sock_map_update_elem only holds bh_lock_sock(), so it can run
concurrently with setsockopt that holds the socket lock):

  CPU 0 (sockmap path)                   CPU 1 (BPF setsockopt RCVQ_CB)
  sock_map_link():                       __bpf_sock_ops_cb_flags_set():
    sk_psock_init():                       (val has RCVQ_CB, current=0)
      lock sk_callback_lock
      set sk_user_data = psock
      unlock sk_callback_lock
    sock_map_init_proto():
      tcp_bpf_update_proto():
        BPF_SOCK_OPS_TEST_FLAG: 0
        (no flag yet, check passes)
                                           write_lock_bh(sk_callback_lock)
                                           tcp_in_sockmap(sk): sk_prot
                                           still old, returns false
                                           tp->bpf_sock_ops_cb_flags = val
                                           (RCVQ_CB set)
                                           write_unlock_bh
        sock_replace_proto(sk,
          &tcp_bpf_prots[][])
        (sk_prot now in tcp_bpf_prots)

Final state: sk_prot is in tcp_bpf_prots[][] AND bpf_sock_ops_cb_flags has
BPF_SOCK_OPS_RCVQ_CB_FLAG. Both features are now enabled, which violates the
mutual exclusion goal.

The validation checks in tcp_bpf_update_proto() and __bpf_sock_ops_cb_flags_set()
each check shared state before the exclusion point, not under common exclusion.
Running a lockset analysis on the two shared variables:

  bpf_sock_ops_cb_flags: written by __bpf_sock_ops_cb_flags_set under
  sk_callback_lock; read by tcp_bpf_update_proto with no lock.
  Lockset intersection = {sk_callback_lock} intersect {} = {}.

  sk->sk_prot: read by tcp_in_sockmap under sk_callback_lock; written by
  sock_replace_proto (called from tcp_bpf_update_proto) with no lock.
  Lockset intersection = {} intersect {sk_callback_lock} = {}.

Could the synchronization be strengthened so both paths access both variables
under the same lock?

>  	}
>
>  	if (sk->sk_family == AF_INET6) {

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

kernel-patches-daemon-bpf · 2026-05-23T09:20:47Z

Forwarding comment 4524907669 via email
In-Reply-To: 20260523083001.2911931-9-kuniyu@google.com
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/20260523083001.2911931-9-kuniyu@google.com/

kernel-patches-daemon-bpf · 2026-05-23T09:20:48Z

Forwarding comment 4524917859 via email
In-Reply-To: 20260523083001.2911931-12-kuniyu@google.com
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/20260523083001.2911931-12-kuniyu@google.com/

kernel-patches-daemon-bpf · 2026-05-23T09:20:50Z

Forwarding comment 4524918589 via email
In-Reply-To: 20260523083001.2911931-7-kuniyu@google.com
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/20260523083001.2911931-7-kuniyu@google.com/

q2ven added 11 commits May 23, 2026 01:47

kernel-patches-daemon-bpf Bot added new bpf-next V3 labels May 23, 2026

kernel-patches-review-bot Bot added the ai-review label May 23, 2026

kernel-patches-daemon-bpf Bot removed the ai-review label May 23, 2026

kernel-patches-review-bot Bot added the ai-review label May 23, 2026

kernel-patches-daemon-bpf Bot added V3-ci-fail and removed ai-review labels May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.#12167

bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.#12167
kernel-patches-daemon-bpf[bot] wants to merge 11 commits into
bpf-next_basefrom
series/1099743=>bpf-next

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-review-bot Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

kernel-patches-daemon-bpf Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant