bpf: Add SOCK_OPS hooks for TCP AutoLOWAT. #12167
kernel-patches-daemon-bpf[bot] wants to merge 11 commits into
Once bpf_sock_ops_cb_flags_set() supports a new flag, tcpbpf_user.c
fails due to the hard-coded max value, 0x80.
Let's replace 0x80 with BPF_SOCK_OPS_ALL_CB_FLAGS + 1.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
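To illustrate why a hard-coded 0x80 stops working, here is a minimal
userspace sketch. It assumes the helper reports back the requested bits
that fall outside the supported mask, and that the mask is a contiguous
run of low bits. demo_cb_flags_set() and the mask values are hypothetical
stand-ins, not the kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical model of the helper's return value: the bits of argval
 * that are not part of the supported mask. 0 means every requested bit
 * was accepted.
 */
static uint32_t demo_cb_flags_set(uint32_t all_flags, uint32_t argval)
{
	return argval & ~all_flags;
}
```

With a 7-bit mask (0x7f), 0x80 is rejected as expected. Once an eighth
flag is added (0xff), 0x80 becomes a valid flag and the old check no
longer sees a rejection, while "mask + 1" keeps pointing at the lowest
unsupported bit.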
We will introduce a new type of opt-in hook for BPF SOCK_OPS progs.
The hooks can be enabled on a per-socket basis by bpf_setsockopt():
    int flags = BPF_SOCK_OPS_RCVQ_CB_FLAG;
    bpf_setsockopt(sk, SOL_TCP, TCP_BPF_SOCK_OPS_CB_FLAGS,
                   &flags, sizeof(flags));
or via the SOCK_OPS specific helper:
    bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RCVQ_CB_FLAG);
Once activated, the BPF prog will be invoked with bpf_sock_ops.op
set to BPF_SOCK_OPS_RCVQ_CB upon the following events:
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
This will allow the BPF prog to dynamically adjust sk->sk_rcvlowat,
suppressing unnecessary EPOLLIN wakeups until sufficient data
(e.g., a full RPC frame) is available in the receive queue.
Note that is_locked_tcp_sock_ops() is left unchanged so that
bpf_setsockopt() is not enabled unnecessarily at BPF_SOCK_OPS_RCVQ_CB.
bpf_sock_ops_cb_flags_set() is still supported there so that the
BPF prog can disable the hook by itself.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
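The effect of sk->sk_rcvlowat on EPOLLIN can be sketched as below.
This is a simplified userspace model, not the kernel's readiness check:
it only compares the unread byte count against the low-water mark and
ignores the other conditions the TCP stack considers (e.g. memory
pressure). The demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes queued but not yet copied to userspace; sequence numbers wrap. */
static uint32_t demo_avail(uint32_t rcv_nxt, uint32_t copied_seq)
{
	return rcv_nxt - copied_seq;
}

/* Simplified check: report EPOLLIN only once avail reaches rcvlowat. */
static bool demo_epollin_ready(uint32_t rcv_nxt, uint32_t copied_seq,
			       int rcvlowat)
{
	return (int64_t)demo_avail(rcv_nxt, copied_seq) >= (int64_t)rcvlowat;
}
```

If the BPF prog raises the low-water mark to the size of a full RPC
frame, a partially received frame does not satisfy the check, so the
application is not woken up for it.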
When a TCP skb is queued to sk->sk_receive_queue, BPF SOCK_OPS
prog can be called with BPF_SOCK_OPS_RCVQ_CB.
In this hook, we want to parse the RPC descriptor in the skb
and adjust sk->sk_rcvlowat based on the RPC frame size.
However, we cannot access the payload via bpf_sock_ops.data when
TCP header/data split is enabled on modern NICs, because the payload
is not placed in the linear area.
Let's support bpf_skb_load_bytes() for BPF_SOCK_OPS_RCVQ_CB.
Three notes:
  1) bpf_sock_ops_kern.skb will be NULL when the BPF prog is
     invoked from recvmsg().
  2) Access to bpf_sock_ops.data will be disabled by passing
     0 end_offset to bpf_skops_init_skb().
  3) ____bpf_skb_load_bytes() is called directly instead of
     __bpf_skb_load_bytes() to allow compilers to inline it
     instead of generating a tail-call.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
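The reason a byte-copy helper is needed instead of a direct data
pointer can be sketched in userspace as follows. The model treats an
skb as an ordered list of contiguous pieces (linear area, then frags)
and copies a byte range that may span several of them. struct demo_seg
and demo_load_bytes() are hypothetical and only mimic the idea; they
are not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one contiguous piece of an skb. */
struct demo_seg {
	const uint8_t *data;
	size_t len;
};

/*
 * Copy len bytes starting at offset into to, walking the segments in
 * order. Returns 0 on success or -EFAULT if the range runs past the
 * last segment (in which case to may hold a partial copy).
 */
static int demo_load_bytes(const struct demo_seg *segs, size_t nsegs,
			   size_t offset, void *to, size_t len)
{
	uint8_t *dst = to;
	size_t i;

	for (i = 0; i < nsegs && len > 0; i++) {
		size_t n;

		if (offset >= segs[i].len) {
			/* The range starts in a later segment. */
			offset -= segs[i].len;
			continue;
		}

		n = segs[i].len - offset;
		if (n > len)
			n = len;

		memcpy(dst, segs[i].data + offset, n);
		dst += n;
		len -= n;
		offset = 0;
	}

	return len ? -EFAULT : 0;
}
```

A caller that wants an 8-byte descriptor does not need to know whether
those bytes sit in one piece or straddle two; a plain pointer into the
linear area cannot offer that.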
We will add a kfunc for BPF_SOCK_OPS_RCVQ_CB hooks to adjust
sk->sk_rcvlowat.
These hooks will be triggered when:
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
In the enqueue path, tcp_data_ready() is always called after the
hooks in tcp_queue_rcv() and tcp_ofo_queue().
If tcp_set_rcvlowat() were used as is, tcp_data_ready() could be
called twice for the same skb, which is redundant and also confusing.
Let's split out __tcp_set_rcvlowat() and add a flag to control
wakeup behaviour.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
We will invoke BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB
to adjust sk->sk_rcvlowat when
1. TCP stack enqueues skb to sk->sk_receive_queue
2. TCP recvmsg() completes
Let's provide a kfunc to set sk->sk_rcvlowat.
Negative values are clamped to INT_MAX, consistent with SO_RCVLOWAT.
The wakeup flag is determined based on bpf_sock_ops_kern.skb:
  * For the enqueue hook, skb is always non-NULL, and wakeup is
    set to false because
      * tcp_data_ready() is always called after the hooks in
        tcp_queue_rcv() and tcp_ofo_queue().
      * when tcp_fastopen_add_skb() is called for TFO SYN,
        the socket is not yet accept()ed, and when called
        for TFO SYN+ACK, the socket is woken up by
        sk->sk_state_change() anyway.
  * For the recvmsg() hook, skb is always NULL, and wakeup is set
    to true because tcp_data_ready() is not called in the path.
An alternative would be to support bpf_setsockopt() by adding
BPF_SOCK_OPS_RCVQ_CB to is_locked_tcp_sock_ops().
However, that approach involves excessive conditionals and an
unnecessary memcpy(), costs we do not want to pay for every skb
in the TCP fast path.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
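The two decisions described above (clamping the value and choosing the
wakeup flag) can be sketched as below. This is a userspace illustration
only; struct demo_skops and the demo_* helpers are hypothetical names,
and the real kfunc operates on kernel structures not modelled here.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the hook context; only skb matters here. */
struct demo_skops {
	const void *skb;	/* non-NULL in the enqueue hook, NULL from recvmsg() */
};

/* Negative values are clamped to INT_MAX, as stated for SO_RCVLOWAT. */
static int demo_clamp_lowat(int val)
{
	return val < 0 ? INT_MAX : val;
}

/* A wakeup is only needed when no tcp_data_ready() call follows the hook. */
static bool demo_need_wakeup(const struct demo_skops *skops)
{
	return skops->skb == NULL;
}
```

Deriving the flag from the skb pointer means the BPF prog does not have
to pass it explicitly, so it cannot request a redundant wakeup from the
enqueue path.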
Both BPF_SOCK_OPS_RCVQ_CB and SOCKMAP can intercept and handle socket
receive queues, leading to overlapping use cases.
While BPF_SOCK_OPS_RCVQ_CB focuses on optimizing single-socket
performance by reducing EPOLLIN wakeups and fully preserves TCP
zerocopy support, SOCKMAP is designed to facilitate multi-socket
routing at the cost of higher overhead and no zerocopy support.
Enabling both features on the same socket makes no sense and results
in unexpected interference between them.
For instance, SOCKMAP calls __tcp_cleanup_rbuf(), where we will add
a BPF_SOCK_OPS_RCVQ_CB hook, and bpf_sock_ops_tcp_set_rcvlowat() calls
sk->sk_data_ready(), which would trigger SOCKMAP.
Let's make BPF_SOCK_OPS_RCVQ_CB and SOCKMAP mutually exclusive.
Note that it requires write_lock_bh(&sk->sk_callback_lock) to
synchronise with tcp_bpf_update_proto() and check if sk->sk_prot is
one of tcp_bpf_prots[][] because sock_map_update_elem() only holds
bh_lock_sock() without checking sock_owned_by_user().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
MPTCP has its own sock->ops->set_rcvlowat() / mptcp_set_rcvlowat().
We should not allow calling __tcp_set_rcvlowat() for MPTCP subflows.
Let's disable BPF_SOCK_OPS_RCVQ_CB for MPTCP for now.
If needed in the future, bpf_sock_ops_tcp_set_rcvlowat() could be
extended to properly support MPTCP.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Unlike SOCKMAP, BPF_SOCK_OPS_RCVQ_CB does not iterate existing
skbs in the receive queue when it is enabled for the first time.
In practical production use cases, this behavior is usually not
a problem.
We can safely assume that the upper-layer protocol is designed
with specific synchronisation points where the connection is
temporarily quiet.
At these points, the application can completely drain the receive
queue and safely enable BPF_SOCK_OPS_RCVQ_CB while no skbs are
pending.
A prime example is an application transitioning from HTTP to an
RPC protocol:
    Client                                Server
      |                                     |
      | --- HTTP Upgrade request ---------> |
      |                                     | [Drain all skbs]
      |                                     | [Enable BPF_SOCK_OPS_RCVQ_CB]
      | <-- HTTP 200/Switching protocol --- |
      |                                     |
      | --- RPC Frame 1 ------------------> |
However, to strictly prevent any potential race conditions arising
from unconventional upper-layer protocol designs, let's explicitly
signal a failure if BPF_SOCK_OPS_RCVQ_CB is enabled while the receive
queue is not empty.
-EUCLEAN is chosen to indicate that the caller needs to clean up
the receive queue before enabling the feature.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
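The enable-time check can be pictured with the following userspace
sketch. struct demo_sock and demo_enable_rcvq_cb() are hypothetical;
the sketch only models "refuse with -EUCLEAN while skbs are pending"
and leaves out locking and the other checks in the series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; only used if errno.h lacks it */
#endif

/* Hypothetical stand-in for the socket state relevant to this check. */
struct demo_sock {
	size_t rcvq_len;	/* number of skbs pending in the receive queue */
	bool rcvq_cb_enabled;
};

/* Enable the hook only when nothing is pending in the receive queue. */
static int demo_enable_rcvq_cb(struct demo_sock *sk)
{
	if (sk->rcvq_len > 0)
		return -EUCLEAN;

	sk->rcvq_cb_enabled = true;
	return 0;
}
```

A caller that gets -EUCLEAN is expected to drain the queue first and
then retry, which matches the HTTP Upgrade flow shown above.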
We will call BPF SOCK_OPS prog with BPF_SOCK_OPS_RCVQ_CB.
It requires a similar setup to bpf_skops_established(), and the only
difference is the skb data length.
Let's factor out the common logic into bpf_skops_common_locked().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Now, it is time to add the new hooks for BPF_SOCK_OPS_RCVQ_CB.
Let's invoke the BPF SOCK_OPS prog when
  1. TCP stack enqueues skb to sk->sk_receive_queue
     -> tcp_queue_rcv(), tcp_ofo_queue(), and tcp_fastopen_add_skb()
  2. TCP recvmsg() completes
     -> __tcp_cleanup_rbuf()
This will allow the BPF prog to parse each skb and dynamically
adjust sk->sk_rcvlowat to suppress unnecessary EPOLLIN wakeups
until sufficient data (e.g., a full RPC frame) is available
in the receive queue.
Note that direct access to bpf_sock_ops.data is intentionally
disabled by passing 0 as end_offset.
Instead, the BPF prog is supposed to call bpf_skb_load_bytes()
with bpf_sock_ops, because with TCP header/data split on, the
payload is not in the linear area and the RPC descriptor may sit
in an skb frag. This also simplifies the BPF prog.
The placement of tcp_bpf_rcvlowat() in tcp_ofo_queue() and
tcp_fastopen_add_skb() is chosen to provide the same snapshot
as tcp_queue_rcv().
For example, if tcp_bpf_rcvlowat() were called before updating
TCP_SKB_CB(skb)->seq in tcp_fastopen_add_skb(), BPF prog would
need to implement an unlikely if branch to strip SYN.
In addition, the TCP stack can queue overlapping skbs into the
receive queue. Once rcv_nxt is updated with a new skb, the BPF prog
cannot infer the previous rcv_nxt from skb->len.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
The test is roughly divided into two stages, and the sequence
is as follows:
I) Setup
1. Attach two BPF programs to a cgroup
2. Establish a TCP connection (@client <-> @child) within the cgroup
3. Enable BPF_SOCK_OPS_RCVQ_CB on @child
II) RPC frame exchange in various patterns
4. Send a partial RPC descriptor from @client to @child
5. Verify that epoll does NOT wake up @child
6. Send the remaining data of the RPC frame
7. Verify that epoll finally wakes up @child
During setup, two BPF programs are attached to simulate
a real-world scenario; one is SOCK_OPS and the other is
CGROUP_SOCKOPT.
While the SOCK_OPS prog handles the dynamic adjustment of
sk->sk_rcvlowat, the CGROUP_SOCKOPT prog is used to enable
BPF_SOCK_OPS_RCVQ_CB via userspace setsockopt() using
pseudo options:
    #define SOL_BPF            0xdeadbeef
    #define BPF_TCP_AUTOLOWAT  0x8badf00d
    setsockopt(fd, SOL_BPF, BPF_TCP_AUTOLOWAT, &(int){1}, sizeof(int));
This reflects a common production use case where an application
decides to start parsing RPC frames only at a certain point in
the stream (e.g., after HTTP Upgrade), rather than immediately
after TCP 3WHS (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc).
When BPF_TCP_AUTOLOWAT is enabled, the BPF prog initialises
sk_local_storage for two sequence numbers to manage its state.
Then, for the RPC frame exchange, this test uses a simple format
defined as follows:
    0        8       16      24       32
    +--------+--------+-------+--------+  `.
    |           header size            |    |
    +--------+--------+-------+--------+     >  RPC descriptor (8 bytes)
    |           payload size           |    |
    +--------+--------+-------+--------+  .'
    ~              header              ~
    +--------+--------+-------+--------+
    ~              payload             ~
    +--------+--------+-------+--------+
Every time a new skb is enqueued to sk->sk_receive_queue,
the SOCK_OPS prog parses it and updates these sequence numbers:
rpc_desc_seq : the SEQ # of the start of the RPC descriptor
rpc_end_seq : the SEQ # of the end of the RPC frame
=> rpc_desc_seq + 8 + header size + payload size
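The rpc_end_seq formula can be written out as a small userspace sketch.
The byte order of the two 32-bit size fields is not stated in this
message, so big-endian (network order) is assumed here purely for
illustration. demo_be32() and demo_rpc_end_seq() are hypothetical names.

```c
#include <stdint.h>

#define DEMO_RPC_DESC_LEN 8

/* Read a 32-bit big-endian value (byte order is an assumption). */
static uint32_t demo_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* rpc_end_seq = rpc_desc_seq + 8 + header size + payload size */
static uint32_t demo_rpc_end_seq(uint32_t rpc_desc_seq,
				 const uint8_t desc[DEMO_RPC_DESC_LEN])
{
	uint32_t header_size = demo_be32(desc);
	uint32_t payload_size = demo_be32(desc + 4);

	return rpc_desc_seq + DEMO_RPC_DESC_LEN + header_size + payload_size;
}
```

For a descriptor at SEQ 0 announcing a 100-byte header and a 150-byte
payload, this gives 0 + 8 + 100 + 150 = 258.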
Assume we receive two RPC descriptors in the following pattern:
1. When we receive skb-1, only a part of RPC descriptor is parsed.
rpc_desc_seq is set to the first byte while rpc_end_seq is
unknown. Thus, sk->sk_rcvlowat is set to the size of the RPC
descriptor (8 bytes).
     <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
    +-----------+.................+....................+......
    | RPC desc 1 | header + payload | RPC desc 2 | ...
    +-----------+.................+....................+......
    ^            ^-.
    `- rpc_desc_seq `- sk->sk_rcvlowat
2. Next, we receive skb-2, which completes the first RPC descriptor.
Now rpc_end_seq is known, so sk->sk_rcvlowat is advanced to it.
     <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
    +-----------+-----------------+....................+......
    | RPC desc 1 | header + payload | RPC desc 2 | ...
    +-----------+-----------------+....................+......
    ^                               ^
    '- rpc_desc_seq                 '- rpc_end_seq
                                       & sk->sk_rcvlowat
3. Once we receive skb-3, which contains the next full RPC descriptor,
rpc_desc_seq is advanced and rpc_end_seq is updated according
to the size of RPC frame 2.
Note that sk->sk_rcvlowat is NOT updated to the new rpc_end_seq
yet. This ensures that the application is woken up to read the
already complete RPC frame 1.
     <- skb-1 -> <---- skb-2 ----> <------ skb-3 ----->
    +-----------+-----------------+--------------------+......
    | RPC desc 1 | header + payload | RPC desc 2 | ... |
    +-----------+-----------------+--------------------+......
                                    ^                     ^
            rpc_desc_seq -----------' rpc_end_seq ----...-'
          & sk->sk_rcvlowat
This sequence corresponds to the 4th test case in rpc_test_cases[],
and we can see helpful output if we "#define DEBUG":
# cat /sys/kernel/tracing/trace_pipe | \
awk '{ if ($0 ~ /AF_/) sub(/^.*AF_/, "AF_"); print $0 }' & \
BGPID=$!; ./test_progs -t tcp_autolowat; kill -9 -$BGPID
...
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 0, end_seq: 1, len: 1, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 0
AF_INET6 rpc_test_cases[3]: Copied 1 bytes: rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_desc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 8, actual: 8
AF_INET6 rpc_test_cases[3]: Start parsing skb: seq: 1, end_seq: 8, len: 7, rpc_desc_seq: 0, rpc_end_seq: 0, rpc_buff_len: 1
AF_INET6 rpc_test_cases[3]: Copied full descriptor: rpc_desc_seq: 0, rpc_end_seq: 258, header_len: 100, payload_len: 150
AF_INET6 rpc_test_cases[3]: No more descriptor: rpc_end_seq: 258, end_seq: 8
AF_INET6 rpc_test_cases[3]: Setting rcvlowat: tp->copied_seq: 0, rpc_desc_seq: 0, rpc_end_seq: 258, rpc_desc_buff_len: 8
AF_INET6 rpc_test_cases[3]: Set rcvlowat: expected: 258, actual: 258
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
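The "Setting rcvlowat" lines in the debug output above are consistent
with a rule like the following sketch: wait for the full descriptor
while it is incomplete, otherwise wait for the end of the frame. This
is inferred from the log values only and covers steps 1 and 2 of the
walkthrough; it is not the selftest's code, and demo_pick_rcvlowat()
is a hypothetical name.

```c
#include <stdint.h>

#define DEMO_RPC_DESC_LEN 8

/*
 * Pick a low-water mark relative to the first unread byte (copied_seq).
 * While fewer than 8 descriptor bytes are buffered, the frame size is
 * unknown, so wait for the end of the descriptor. Afterwards, wait for
 * the end of the frame.
 */
static int demo_pick_rcvlowat(uint32_t copied_seq, uint32_t rpc_desc_seq,
			      uint32_t rpc_end_seq, uint32_t rpc_desc_buff_len)
{
	uint32_t target_seq;

	if (rpc_desc_buff_len < DEMO_RPC_DESC_LEN)
		target_seq = rpc_desc_seq + DEMO_RPC_DESC_LEN;
	else
		target_seq = rpc_end_seq;

	return (int)(target_seq - copied_seq);
}
```

With the logged values, a 1-byte partial descriptor at SEQ 0 yields 8,
and the completed descriptor with rpc_end_seq 258 yields 258.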
Upstream branch: b1fcdf9
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject:
Pull request for series with
subject: bpf: Add SOCK_OPS hooks for TCP AutoLOWAT.
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1099743