Skip to content

bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.#12167

Open
kernel-patches-daemon-bpf[bot] wants to merge 11 commits into
bpf-next_basefrom
series/1099743=>bpf-next
Open

bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.#12167
kernel-patches-daemon-bpf[bot] wants to merge 11 commits into
bpf-next_basefrom
series/1099743=>bpf-next

Conversation

@kernel-patches-daemon-bpf
Copy link
Copy Markdown

Pull request for series with
subject: bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1099743

q2ven added 11 commits May 23, 2026 01:47
Once bpf_sock_ops_cb_flags_set() supports a new flag,
tcpbpf_user.c fails due to the hard-coded max value, 0x80.

Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
We will introduce a new type of opt-in hooks for BPF SOCK_OPS prog.

The hooks can be enabled on per-socket basis by bpf_setsockopt():

  int flag = BPF_SOCK_OPS_RCVQ_CB_FLAG;

  bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                 &flags, sizeof(flags));

or via the SOCK_OPS specific helper:

  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);

Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
(e.g., a full RPC frame) is available in the receive queue.

Note that is_locked_tcp_sock_ops() is left unchanged not to enable
bpf_setsockopt() unnecessarily, but bpf_sock_ops_cb_flags_set() is
supported at BPF_SOCK_OPS_RCVQ_CB to disable by itself.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
prog can be called with BPF_SOCK_OPS_RCVQ_CB.

In this hook, we want to parse the RPC descriptor in the skb
and adjust sk->sk_rcvlowat based on the RPC frame size.

However, we cannot access payload via bpf_sock_ops.data on
modern NICs with TCP header/data split on as the payload is
not placed in the linear area.

Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.

Three notes:

  1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
      invoked from recvmsg().

  2) Access to bpf_sock_ops.data will be disabled by passing
      0 end_offset to bpf_skops_init_skb().

  3) ____bpf_skb_load_bytes() is called directly instead of
     __bpf_skb_load_bytes() to allow compilers to inline it
     instead of generating a tail-call.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
We will add a kfunc for BPF_SOCK_OPS_RCVQ_CB hooks to
adjust sk->sk_rcvlowat.

These hooks will be triggered when:

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

In the enqueue path, tcp_data_ready() is always called
after the hooks in tcp_queue_rcv() and tcp_ofo_queue().

If tcp_set_rcvlowat() were used as is, tcp_data_ready()
could be called twice for the same skb, which is redundant
and also confusing.

Let's split out __tcp_set_rcvlowat() and add a flag to
control wakeup behaviour.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB
to adjust sk->sk_rcvlowat when

  1. TCP stack enqueues skb to sk->sk_receive_queue
  2. TCP recvmsg() completes

Let's provide a kfunc to set sk->sk_rcvlowat.

Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.

The wakeup flag is determined based on bpf_sock_ops_kern.skb:

  * For the enqueue hook, skb is always non-NULL, and wakeup is
    set to false because

    * tcp_data_ready() is always called after the hooks in
       tcp_queue_rcv() and tcp_ofo_queue().

    * when tcp_fastopen_add_skb() is called for TFO SYN,
       the socket is not yet accept()ed, and when called
       for TFO SYN+ACK, the socket is woken up by
       sk->sk_state_change() anyway.

  * For the recvmsg() hook, skb is always NULL, and wakeup is set
    to true because tcp_data_ready() is not called in the path.

An alternative would be to support bpf_setsockopt() by adding
BPF_SOCK_OPS_RCVQ_CB to is_locked_tcp_sock_ops().

However, that approach involves excessive conditionals and an
unnecessary memcpy(), costs we do not want to pay for every skb
in the TCP fast path.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle
socket receive queues, leading to overlapping use cases.

While BPF_SOCK_OPS_RCVQ_CB focuses on optimizing single-socket
performance by reducing EPOLLIN wakeups and fully preserves TCP
zerocopy support, SOCKMAP is designed to facilitate multi-socket
routing at the cost of higher overhead and no zerocopy support.

Enabling both features on the same socket makes no sense and
results in unexpected interference between them.

For instance, SOCKMAP calls __tcp_cleanup_rbuf(), where we will
add a BPF_SOCK_OPS_RCVQ_CB hook, and bpf_sock_ops_tcp_set_rcvlowat()
calls sk->sk_data_ready(), which would trigger SOCKMAP.

Let's make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.

Note that it requires write_lock_bh(&sk->sk_callback_lock) to
synchronise with tcp_bpf_update_proto() and check if sk->sk_prot
is one of tcp_bpf_prots[][] because sock_map_update_elem() only
holds bh_lock_sock() without checking sock_owned_by_user().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
MPTCP has its own sock->ops->set_rcvlowat() / mptcp_set_rcvlowat().

We should not allow calling __tcp_set_rcvlowat() for MPTCP subflows.

Let's disable BPF_SOCK_OPS_RCVQ_CB for MPTCP for now.

If needed in the future, bpf_sock_ops_tcp_set_rcvlowat() could be
extended to properly support MPTCP.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing
skbs in the receive queue when it is enabled for the first time.

In practical production use cases, this behavior is usually not
a problem.

We can safely assume that the upper-layer protocol is designed
with specific synchronisation points where the connection is
temporarily quiet.

At these points, the application can completely drain the receive
queue and safely enable BPF_SOCK_OPS_RCVQ_CB while no skbs are
pending.

A prime example is an application transitioning from HTTP to an
RPC protocol:

     Client                                Server
       |                                     |
       | --- HTTP Upgrade request ---------> |
       |                                     | [Drain all skbs]
       |                                     | [Enable BPF_SOCK_OPS_RCVQ_CB]
       | <-- HTTP 200/Switching protocol --- |
       |                                     |
       | --- RPC Frame 1 ------------------> |

However, to strictly prevent any potential race conditions arising
from unconventional upper-layer protocol designs, let's explicitly
signal a failure if BPF_SOCK_OPS_RCVQ_CB is enabled while the receive
queue is not empty.

-EUCLEAN is chosen to indicate that the caller needs to clean up
the receive queue before enabling the feature.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB.

It requires a similar setup to bpf_skops_established(), and the
only difference is the skb data length.

Let's factor out the common logic into bpf_skops_common_locked().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.

Let's invoke the BPF SOCK_OPS prog when

  1. TCP stack enqueues skb to sk->sk_receive_queue
     -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()

  2. TCP recvmsg() completes
     -> __tcp_cleanup_rbuf()

This will allow the BPF prog to parse each skb and dynamically
adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
until sufficient data (e.g., a full RPC frame) is available
in the receive queue.

Note that the direct access to bpf_sock_ops.data is intentionally
disabled by passing 0 as end_offset.

Instead, the BPF prog is supposed to use bpf_skb_load_bytes()
with bpf_sock_ops because payload is not in the linear area
with TCP header/data split on and skb may contain a RPC
descriptor in skb frag.  This also simplifies the BPF prog.

The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
tcp_fastopen_add_skb() is chosen to provide the same snapshot
with tcp_queue_rcv().

For example, if tcp_bpf_rcvlowat() were called before updating
TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
need to implement an unlikely if branch to strip SYN.

In addition, TCP stack can queue overlapping skb into recvq.
Once rcv_nxt is updated with a new skb, BPF prog cannot infer
the previous one from skb->len.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
The test is roughly divided into two stages, and the sequence
is as follows:

  I) Setup

    1. Attach two BPF programs to a cgroup
    2. Establish a TCP connection (@client <-> @child) within the cgroup
    3. Enable BPF_SOCK_OPS_RCVQ_CB on @child

 II) RPC frame exchange in various patterns

    4. Send a partial RPC descriptor from @client to @child
    5. Verify that epoll does NOT wake up @child
    6. Send the remaining data of the RPC frame
    7. Verify that epoll finally wakes up @child

During setup, two BPF programs are attached to simulate
a real-world scenario; one is SOCK_OPS and the other is
CGROUP_SOCKOPT.

While the SOCK_OPS prog handles the dynamic adjustment of
sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt() using
pseudo options:

  #define SOL_BPF               0xdeadbeef
  #define BPF_TCP_AUTOLOWAT     0x8badf00d

  setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int));

This reflects a common production use case where an application
decides to start parsing RPC frames only at a certain point in
the stream (e.g., after HTTP Upgrade), rather than immediately
after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc).

When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises
sk_local_storage for two sequence numbers to manage its state.

Then, for the RPC frame exchange, this test uses a simple format
defined as follows

  0        8       16      24       32
  +--------+--------+-------+--------+ `.
  |            header size           |  |
  +--------+--------+-------+--------+   > RPC descriptor (8 bytes)
  |            payload size          |  |
  +--------+--------+-------+--------+ .'
  ~               header             ~
  +--------+--------+-------+--------+
  ~               payload            ~
  +--------+--------+-------+--------+

Every time a new skb is enqueued to sk->sk_receive_queue,
the SOCK_OPS prog parses it and updates these sequence numbers:

  rpc_desc_seq : the SEQ # of the start of the RPC descriptor
  rpc_end_seq  : the SEQ # of the end of the RPC frame
                 => rpc_desc_seq + 8 + header size + payload size

Assume we receive two RPC descriptors in the following pattern:

  1. When we receive skb-1, only a part of RPC descriptor is parsed.
     rpc_desc_seq is set to the first byte while rpc_end_seq is
     unknown.  Thus, sk->sk_rcvlowat is set to the size of the RPC
     descriptor (8 bytes).

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+.................+....................+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...
  +-----------+.................+....................+......
  ^              ^-.
  `- rpc_desc_seq   `- sk->sk_rcvlowat

  2. Next, we receive skb-2, which completes the first RPC descriptor.
     Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it.

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+-----------------+....................+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...
  +-----------+-----------------+....................+......
  ^                                   ^
  '- rpc_desc_seq                     '- rpc_end_seq
                                           & sk->sk_rcvlowat

  3. Once we receive skb-3, which contains the next full RPC descriptor,
     rpc_desc_seq is advanced and rpc_end_seq is updated according
     to the size of RPC frame 2.

     Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq
     yet.  This ensures that the application is woken up to read the
     already complete RPC frame 1.

   <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
  +-----------+-----------------+--------------------+......
  |  RPC desc 1  |  header + payload  |  RPC desc 2  | ...   |
  +-----------+-----------------+--------------------+......
                                      ^                      ^
              rpc_desc_seq -----------'  rpc_end_seq ----...-'
                & sk->sk_rcvlowat

This sequence corresponds to the 4th test case in rpc_test_cases[],
and we can see helpful output if we "#define DEBUG":

  # cat /sys/kernel/tracing/trace_pipe | \
    awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \
    BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID
  ...
  AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0
  AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8

  AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1
  AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150
  AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8
  AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8
  AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258
  ...

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: b1fcdf9
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1099743
version: 3

@kernel-patches-review-bot
Copy link
Copy Markdown

commit 406468e0f402c5463322d476e5ac286f0ca48cd7
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.

This commit adds a new BPF kfunc to allow SOCK_OPS programs to adjust
sk->sk_rcvlowat when TCP enqueues data or when recvmsg() completes.
The kfunc provides negative value clamping and determines the wakeup
flag based on whether skb is NULL.

> diff --git a/net/core/filter.c b/net/core/filter.c
> index fa8a7c7d86eb..3608036632a8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12388,6 +12388,22 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
>  #endif
>  }
>
> +__bpf_kfunc int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops,
> +					      int rcvlowat)
> +{
> +#ifdef CONFIG_INET
> +	if (skops->op != BPF_SOCK_OPS_RCVQ_CB)
> +		return -EOPNOTSUPP;
> +
> +	if (rcvlowat < 0)
> +		rcvlowat = INT_MAX;
> +
> +	return __tcp_set_rcvlowat(skops->sk, rcvlowat, !skops->skb);

Should this kfunc prevent operation on MPTCP subflows? Bjorn Topel
raised this concern in v1, noting the issue that commit fcf4692fa39e
("mptcp: prevent BPF accessing lowat from a subflow socket.") was
designed to prevent.

You acknowledged this and stated you would drop MPTCP support for now,
mentioning that a subsequent patch (patch 7 in the v3 series: "bpf:
mptcp: Don't support BPF_SOCK_OPS_RCVQ_CB.") would disable the callback
for MPTCP.

However, the current code lacks any MPTCP check. Without either disabling
BPF_SOCK_OPS_RCVQ_CB for MPTCP subflows in the callback registration or
adding a check like:

	if (sk_is_mptcp(skops->sk))
		return -EOPNOTSUPP;

couldn't a BPF program invoke this kfunc on MPTCP subflows and trigger
the same problem that fcf4692fa39e was designed to prevent?

Reference: https://lore.kernel.org/bpf/87ik8ujfa1.fsf@all.your.base.are.belong.to.us/

> +#else
> +	return -EOPNOTSUPP;
> +#endif
> +}
> +
>  __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>  					      u64 flags)
>  {
> @@ -12539,6 +12555,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
>
>  BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
>  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
> +BTF_ID_FLAGS(func, bpf_sock_ops_tcp_set_rcvlowat)
>  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
>
>  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Add kfunc to adjust sk->sk_rcvlowat.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

@kernel-patches-review-bot
Copy link
Copy Markdown

commit 3ab188a8281bfa86b644cc5261b164e9d80eb1be
Author: Kuniyuki Iwashima <kuniyu@google.com>

selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.

This commit replaces the hard-coded max value 0x80 with
BPF_SOCK_OPS_ALL_CB_FLAGS + 1 to prevent test failures when
bpf_sock_ops_cb_flags_set() supports a new flag.

Link: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 7e8fe1bad03f..e4849d2a2956 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
>  	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
>  	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
>  	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
> -	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
> +	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
> +		  "bad_cb_test_rv");
>  	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
>  	ASSERT_EQ(result->num_listen, 1, "num_listen");
>

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> index 6935f32eeb8f..e4cb1fab079 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> @@ -92,7 +92,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
>  		break;
>  	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
>  		/* Test failure to set largest cb flag (assumes not defined) */
> -		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
> +		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
> +								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);

Multiple reviewers raised concerns about forward compatibility with this
approach. When the test is compiled, BPF_SOCK_OPS_ALL_CB_FLAGS + 1 becomes
a hardcoded value. If this compiled test runs on a future kernel where that
value becomes a valid flag, the helper will successfully set the flag and
return 0, causing the test to fail unexpectedly.

Additionally, this assumes contiguous bitmasks. If a flag is ever deprecated
or skipped, adding 1 could overlap with valid bits.

Would it be better to use a safely distant bit, such as 1 << 30, to guarantee
the flag remains unsupported and the test doesn't break in the future?

Reference: https://lore.kernel.org/bpf/20260508190227.AFAE9C2BD00@smtp.kernel.org/

This concern was raised by sashiko-bot@kernel.org in v1 and v2, and also
flagged by bot+bpf-ci@kernel.org in v2. The author's v1 response indicated
the test is not supposed to run on older kernels, but the reviewers are
concerned about FUTURE kernels, not older ones.



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftest: bpf: Use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 for bad_cb_test_rv.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-review-bot
Copy link
Copy Markdown

commit 63a4101f4cf3e07b1f5dae2fa3b294d87fb1b4f7
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.

Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing skbs in
the receive queue when it is enabled for the first time. In practical
production use cases, applications drain the receive queue at
synchronization points before enabling the feature.

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 82ec2291d6f0..4041b9fc1c74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5390,6 +5390,9 @@ static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
>  		return 0;
>  	}
>
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
> +		return -EUCLEAN;
> +

Is there a TOCTOU issue here? The check occurs before acquiring
sk->sk_callback_lock, which means:

Thread A (process context)
__bpf_sock_ops_cb_flags_set()
    if (!skb_queue_empty(&sk->sk_receive_queue))  // queue is empty
        // check passes

Thread B (softirq context)
tcp_rcv_established()
    // adds packet to sk->sk_receive_queue

Thread A (continuing)
__bpf_sock_ops_cb_flags_set()
    write_lock_bh(&sk->sk_callback_lock);
    // sets bpf_sock_ops_cb_flags despite queue being non-empty

Looking at the tcp_in_sockmap() check below, it contains
lockdep_assert_held(&sk->sk_callback_lock), showing that related
validation checks are expected to be performed under this lock:

>  	if (unlikely(sk_is_mptcp(sk)))
>  		return -EOPNOTSUPP;
>
>  	write_lock_bh(&sk->sk_callback_lock);
>
>  	if (unlikely(tcp_in_sockmap(sk))) {
>  		write_unlock_bh(&sk->sk_callback_lock);
>  		return -EBUSY;
>  	}

The commit message says the goal is to "strictly prevent any potential
race conditions" by ensuring the receive queue is empty when
BPF_SOCK_OPS_RCVQ_CB is enabled. Can this check be moved after the
write_lock_bh() call to eliminate the race window?



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Reject BPF_SOCK_OPS_RCVQ_CB if receive queue is not empty.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

@kernel-patches-review-bot
Copy link
Copy Markdown

commit 17de64d91f7bec30b7d257f399bbf7a81be05954
Author: Kuniyuki Iwashima <kuniyu@google.com>

selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.

The test is roughly divided into two stages, and the sequence is as
follows: setup of BPF programs and TCP connection, followed by RPC frame
exchange in various patterns. The test uses a SOCK_OPS prog to handle
dynamic adjustment of sk->sk_rcvlowat and a CGROUP_SOCKOPT prog to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt().

> diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
> --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
> +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
> @@ -64,6 +64,10 @@
>  extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
>  				   struct bpf_tcp_req_attrs *attrs, int attrs__sz) __ksym;
>
> +struct bpf_sock_ops_kern;
> +extern int bpf_sock_ops_tcp_set_rcvlowat(struct bpf_sock_ops_kern *skops_kern,
> +					 int rcvlowat) __ksym;
> +
>  void *bpf_cast_to_kern_ctx(void *) __ksym;
>
>  extern void *bpf_rdonly_cast(const void *obj, __u32 btf_id) __ksym __weak;

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> @@ -74,6 +74,8 @@ #define ETH_P_IPV6		0x86DD
>
>  #define NEXTHDR_TCP		6
>
> +#define TCPHDR_FIN		0x01
> +
>  #define TCPOPT_NOP		1
>  #define TCPOPT_EOL		0
>  #define TCPOPT_MSS		2

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_autolowat.c
> @@ -0,0 +1,350 @@

[ ... ]

> +		.event = {
> +			{ .type = RPC_EVENT_AUTOLOWAT,	.val = 1},
> +			/* The first descriptor is partial. */
> +			{.type = RPC_EVENT_SEND,	.len = 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},
> +			/* The first descriptor is available. */
> +			{.type = RPC_EVENT_SEND,	.len = RPC_DESC_SIZE - 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* The first header is ready. */
> +			{.type = RPC_EVENT_SEND,	.len = 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* skb has the first payload and 1 byte of the next descriptor. */
> +			{.type = RPC_EVENT_SEND,	.len = 150 + 1},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 1},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE + 150 + 100},
> +			/* After reading the first RPC message, SO_RCVLOWAT should be RPC_DESC_SIZE. */
> +			{.type = RPC_EVENT_RECV,	.len = RPC_DESC_SIZE + 150 + 100},
> +			{.type = RPC_EVENT_EPOLL,	.nfds = 0},
> +			{.type = RPC_EVENT_RCVLOWAT,	.rcvlowat = RPC_DESC_SIZE},

Does calling recv() actually trigger the BPF program to update SO_RCVLOWAT?

Since the kernel only invokes BPF_SOCK_OPS_RCVQ_CB when a new skb is added
to the receive queue, and not when data is dequeued via recv(), the BPF
program doesn't seem to run here.

Could this cause the test to fail unconditionally because sk_rcvlowat
retains its previous value?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_autolowat_run_rpc_test(struct tcp_autolowat_test_cb *cb,
> +				       struct rpc_test_case *test_case)
> +{
> +	struct rpc_event *event = test_case->event;
> +	char *ptr = test_case->data;
> +	struct epoll_event ev;
> +	socklen_t optlen;
> +	int err, optval;
> +	char buf[4096];
> +
> +	if (tcp_autolowat_build_data(test_case))
> +		return;
> +
> +	while (1) {
> +		switch (event->type) {
> +		case RPC_EVENT_END:
> +			return;
> +		case RPC_EVENT_AUTOLOWAT:
> +			err = setsockopt(cb->child, SOL_BPF, BPF_TCP_AUTOLOWAT,
> +					 &event->val, sizeof(event->val));
> +			if (!ASSERT_OK(err, "setsockopt"))
> +				return;
> +			break;
> +		case RPC_EVENT_SEND:
> +			err = send(cb->client, ptr, event->len, 0);
> +			if (!ASSERT_EQ(err, event->len, "send"))
> +				return;
> +
> +			ptr += event->len;
> +			break;

Could this lead to race conditions causing flaky tests?

The test calls send() to transmit data and immediately executes the next
event. Because network transmission over the loopback interface involves
asynchronous softirq processing, send() might return before the packet is
fully processed by the receiving socket.

If the next event is RPC_EVENT_EPOLL with a timeout of 0, could
epoll_wait() prematurely return 0? Similarly, for RPC_EVENT_RCVLOWAT,
could getsockopt() read a stale sk_rcvlowat value before the TCP stack and
BPF program have finished processing the packet?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/tcp_autolowat.c b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tcp_autolowat.c
> @@ -0,0 +1,326 @@

[ ... ]

> +static int tcp_parse_descriptor(struct bpf_sock_ops *skops,
> +				struct tcp_autolowat_cb *cb,
> +				u32 seq, u32 end_seq)
> +{
> +	struct rpc_descriptor *rpc_desc;
> +	u32 rpc_copied_seq;
> +	u64 copy_len; /* u32 should work, but not for no_alu32 :/ */
> +	u64 rpc_len;
> +	int err;
> +
> +	rpc_copied_seq = cb->rpc_desc_seq + cb->rpc_desc_buff_len;
> +
> +	if (before(cb->rpc_desc_seq + RPC_DESC_SIZE, end_seq))
> +		copy_len = RPC_DESC_SIZE - cb->rpc_desc_buff_len;
> +	else
> +		copy_len = end_seq - rpc_copied_seq;
> +
> +	/* Since LLVM commit 324e27e8bad83ca23a3cd276d7e2e729b1b0b8c7,
> +	 * clang omits the "copy_len == 0" check below, which is necessary
> +	 * to satisfy the BPF verifier's range check for bpf_skb_load_bytes().
> +	 */

This isn't a bug, but multi-line comments in the BPF subsystem are
expected to have the opening '/*' on its own line:

	/*
	 * Since LLVM commit ...
	 * ...
	 */

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)

[ ... ]

> +static void tcp_set_autolowat(struct bpf_sock_ops_kern *skops_kern,
> +			      struct tcp_autolowat_cb *cb,
> +			      struct tcp_sock *tp)
> +{
> +	/* To handle wraparound. */
> +	u32 val = 0;
> +
> +	LOG("Setting rcvlowat: tp->copied_seq: %u, rpc_desc_seq: %u, rpc_end_seq: %u, rpc_desc_buff_len: %u",
> +	    TP_SEQ(copied_seq), CB_SEQ(rpc_desc_seq), CB_SEQ(rpc_end_seq), cb->rpc_desc_buff_len);
> +
> +	if (before(tp->copied_seq, cb->rpc_desc_seq))
> +		val = cb->rpc_desc_seq - tp->copied_seq;
> +	else if (cb->rpc_desc_buff_len != RPC_DESC_SIZE)
> +		val = RPC_DESC_SIZE;
> +	else
> +		val = cb->rpc_end_seq - tp->copied_seq;

Does this code incorrectly set SO_RCVLOWAT to the total combined size of
all unread frames when multiple RPC frames are received in a single skb?

If multiple complete frames arrive in the same skb, the parsing loop in
tcp_do_autolowat() parses all of them and advances cb->rpc_desc_seq to the
end of the last parsed frame. This sets:

val = cb->rpc_desc_seq - tp->copied_seq

which evaluates to the length of all unread frames combined.

If an application sequentially reads one frame at a time, tp->copied_seq
advances and the queue size decreases, but sk_rcvlowat is not updated
because BPF_SOCK_OPS_RCVQ_CB only fires on packet enqueue.

The remaining queue size will then be strictly less than sk_rcvlowat,
causing epoll_wait to block and trapping the remaining fully-received
frames in the receive queue.

Could this hang applications that rely on epoll to read one message at a
time?

Reference: https://lore.kernel.org/bpf/20260522083527.D40D51F000E9@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org (Fri, 22 May 2026 08:35:27 +0000)



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftest: bpf: Add test for BPF_SOCK_OPS_RCVQ_CB.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

@kernel-patches-review-bot
Copy link
Copy Markdown

commit a7e955d270014037d3dd691d9d45087033d858af
Author: Kuniyuki Iwashima <kuniyu@google.com>

bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.

Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle socket
receive queues, leading to overlapping use cases. While BPF_SOCK_OPS_RCVQ_CB
focuses on optimizing single-socket performance by reducing EPOLLIN wakeups
and fully preserves TCP zerocopy support, SOCKMAP is designed to facilitate
multi-socket routing at the cost of higher overhead and no zerocopy support.

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c6a6853909c4..bc95d8e7b62e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2853,6 +2853,7 @@ struct sk_msg;
>  struct sk_psock;
>
>  #ifdef CONFIG_BPF_SYSCALL
> +bool tcp_in_sockmap(const struct sock *sk);
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
>  void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
>  #ifdef CONFIG_BPF_STREAM_PARSER
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3608036632a8..1fb63b264b18 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -5382,12 +5382,34 @@ static int bpf_sol_tcp_getsockopt(struct sock *sk, int optname,
>  	return 0;
>  }
>
> +static int __bpf_sock_ops_cb_flags_set(struct sock *sk, int val)
> +{
> +	if (!(val & BPF_SOCK_OPS_RCVQ_CB_FLAG) ||
> +	    tcp_sk(sk)->bpf_sock_ops_cb_flags & BPF_SOCK_OPS_RCVQ_CB_FLAG) {
> +		tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +		return 0;
> +	}
> +
> +	write_lock_bh(&sk->sk_callback_lock);
> +
> +	if (unlikely(tcp_in_sockmap(sk))) {
> +		write_unlock_bh(&sk->sk_callback_lock);
> +		return -EBUSY;
> +	}
> +
> +	tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +
> +	write_unlock_bh(&sk->sk_callback_lock);
> +
> +	return 0;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6..7e7966b095f9 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -705,6 +705,16 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
>  }
>  #endif /* CONFIG_BPF_STREAM_PARSER */
>
> +bool tcp_in_sockmap(const struct sock *sk)
> +{
> +	const struct proto *prot = sk->sk_prot;
> +
> +	lockdep_assert_held(&sk->sk_callback_lock);
> +
> +	return &tcp_bpf_prots[0][0] <= prot &&
> +		prot <= &tcp_bpf_prots[TCP_BPF_NUM_PROTS - 1][TCP_BPF_NUM_CFGS - 1];
> +}
> +
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
>  	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
> @@ -729,6 +739,8 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  			sock_replace_proto(sk, psock->sk_proto);
>  		}
>  		return 0;
> +	} else if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RCVQ_CB_FLAG)) {
> +		return -EBUSY;
                   ^^^^

A subsystem pattern flags this as potentially concerning: the commit
message explains that write_lock_bh(&sk->sk_callback_lock) is taken to
synchronise with tcp_bpf_update_proto(), but does tcp_bpf_update_proto()
actually take sk->sk_callback_lock?

Looking at the call chain, tcp_bpf_update_proto() is invoked from the
SOCKMAP path through:

sock_map_link() in net/core/sock_map.c
  -> sock_map_init_proto()
     -> tcp_bpf_update_proto()

In sock_map_link(), sk_callback_lock is acquired after sock_map_init_proto()
completes, not before. The lock is taken when setting sk_user_data to the
psock, but sock_map_init_proto() runs outside the lock and calls
sock_replace_proto() to write sk->sk_prot without holding sk_callback_lock.

Is there a race here? Consider this interleaving (the commit message
acknowledges sock_map_update_elem only holds bh_lock_sock(), so it can run
concurrently with setsockopt that holds the socket lock):

  CPU 0 (sockmap path)                   CPU 1 (BPF setsockopt RCVQ_CB)
  sock_map_link():                       __bpf_sock_ops_cb_flags_set():
    sk_psock_init():                       (val has RCVQ_CB, current=0)
      lock sk_callback_lock
      set sk_user_data = psock
      unlock sk_callback_lock
    sock_map_init_proto():
      tcp_bpf_update_proto():
        BPF_SOCK_OPS_TEST_FLAG: 0
        (no flag yet, check passes)
                                           write_lock_bh(sk_callback_lock)
                                           tcp_in_sockmap(sk): sk_prot
                                           still old, returns false
                                           tp->bpf_sock_ops_cb_flags = val
                                           (RCVQ_CB set)
                                           write_unlock_bh
        sock_replace_proto(sk,
          &tcp_bpf_prots[][])
        (sk_prot now in tcp_bpf_prots)

Final state: sk_prot is in tcp_bpf_prots[][] AND bpf_sock_ops_cb_flags has
BPF_SOCK_OPS_RCVQ_CB_FLAG. Both features are now enabled, which violates the
mutual exclusion goal.

The validation checks in tcp_bpf_update_proto() and __bpf_sock_ops_cb_flags_set()
each check shared state before the exclusion point, not under common exclusion.
Running a lockset analysis on the two shared variables:

  bpf_sock_ops_cb_flags: written by __bpf_sock_ops_cb_flags_set under
  sk_callback_lock; read by tcp_bpf_update_proto with no lock.
  Lockset intersection = {sk_callback_lock} intersect {} = {}.

  sk->sk_prot: read by tcp_in_sockmap under sk_callback_lock; written by
  sock_replace_proto (called from tcp_bpf_update_proto) with no lock.
  Lockset intersection = {} intersect {sk_callback_lock} = {}.

Could the synchronization be strengthened so both paths access both variables
under the same lock?

>  	}
>
>  	if (sk->sk_family == AF_INET6) {



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: tcp: Make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26328541814

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant