Skip to content

bpf, sockmap: keep sk_msg copy state in sync#12137

Open
kernel-patches-daemon-bpf[bot] wants to merge 1 commit into
bpf-next_basefrom
series/1097929=>bpf-next
Open

bpf, sockmap: keep sk_msg copy state in sync#12137
kernel-patches-daemon-bpf[bot] wants to merge 1 commit into
bpf-next_basefrom
series/1097929=>bpf-next

Conversation

@kernel-patches-daemon-bpf
Copy link
Copy Markdown

Pull request for series with
subject: bpf, sockmap: keep sk_msg copy state in sync
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1097929

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: 523d2f4
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1097929
version: 3

@kernel-patches-review-bot
Copy link
Copy Markdown

commit 0a58a0c24cc61549577f9d37ffa32a979c3425c2
Author: Zhang Cen <rollkingzzc@gmail.com>

bpf, sockmap: keep sk_msg copy state in sync

SK_MSG uses msg->sg.copy as per-scatterlist-entry provenance. Entries
with this bit set are copied before data/data_end are exposed to SK_MSG
BPF programs for direct packet access.

bpf_msg_pull_data(), bpf_msg_push_data(), and bpf_msg_pop_data()
rewrite the sk_msg scatterlist ring by collapsing, splitting, and
shifting entries. These operations move msg->sg.data[] entries, but the
parallel copy bitmap can be left behind on the old slot. A copied entry
can then return to msg->sg.start with its copy bit clear and be exposed
as directly writable packet data.

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..223d78708486 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -2794,6 +2835,8 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
>  {
>  	struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
>  	u32 new, i = 0, l = 0, space, copy = 0, offset = 0;
> +	bool sge_copy = false, nsge_copy = false, nnsge_copy = false;
> +	bool rsge_copy = false;
>  	u8 *raw, *to, *from;
>  	struct page *page;
>
> @@ -2866,6 +2909,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
>  			sk_msg_iter_var_prev(i);
>  		psge = sk_msg_elem(msg, i);
>  		rsge = sk_msg_elem_cpy(msg, i);
> +		rsge_copy = sk_msg_elem_is_copy(msg, i);
>
>  		psge->length = start - offset;
>  		rsge.length -= psge->length;

Could copy + len overflow and cause an out-of-bounds write? In
bpf_msg_push_data(), the len parameter is an unbounded u32 value passed
from a bpf program. When calculating the allocation size, copy + len can
overflow a 32-bit integer:

    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
                       get_order(copy + len));

If copy + len wraps around, alloc_pages() will allocate an undersized
buffer. Later, the destination pointer for shifting tail data is
calculated as:

    to = raw + front + len

Due to len being a large value, front + len wraps around and results in a
large positive 32-bit offset (zero-extended to 64-bit). The subsequent
memcpy(to, from, back) then performs a controlled out-of-bounds write well
past the allocated page.

Does bpf_msg_push_data() leak uninitialized kernel memory? The function
allocates a new memory page:

    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
                       get_order(copy + len));

Since the allocation does not include the __GFP_ZERO flag, the new len
bytes inserted into the message are never explicitly zeroed or initialized.
If the bpf program does not manually overwrite these bytes and the message
is transmitted over a socket, this could leak uninitialized kernel memory
to the network or userspace.

Could the ring buffer state become corrupted when data is appended exactly
at the end of the message and the ring buffer is full?

If data is appended exactly at the end of the message (start == offset +
l), the loop finishes with i pointing to the unused slot msg->sg.end. If
the ring buffer is full (!space), the coalescing fallback logic reads:

    copy = msg->sg.data[i].length

Because i points to an unused slot, its length is 0, making copy == 0, and
the fallback coalescing block is bypassed. The function then
unconditionally calls:

    sk_msg_iter_next(msg, end)

which increments msg->sg.end. Since the ring buffer was already full,
advancing msg->sg.end causes it to wrap around and equal msg->sg.start.
This could corrupt the ring buffer state and overwrite the first element
without calling put_page(), leading to page leaks.

[ ... ]

> @@ -2891,23 +2935,34 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
>  	/* Shift one or two slots as needed */
>  	sge = sk_msg_elem_cpy(msg, new);
>  	sg_unmark_end(&sge);
> +	sge_copy = sk_msg_elem_is_copy(msg, new);
>
>  	nsge = sk_msg_elem_cpy(msg, i);
> +	nsge_copy = sk_msg_elem_is_copy(msg, i);
>  	if (rsge.length) {
>  		sk_msg_iter_var_next(i);
>  		nnsge = sk_msg_elem_cpy(msg, i);
> +		nnsge_copy = sk_msg_elem_is_copy(msg, i);
>  		sk_msg_iter_next(msg, end);
>  	}

Should rsge.offset be adjusted after splitting? When rsge is created
from a split entry, the code does not include:

    rsge.offset += start - offset;

to correctly adjust the offset. The rsge points to the tail portion of
the split entry but retains the original offset, which appears incorrect.
This could cause the residual entry to point to the wrong data within
the page.

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf, sockmap: keep sk_msg copy state in sync
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26156967283

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: 523d2f4
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1097929
version: 3

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: 3db0419
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1097929
version: 3

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: b1fcdf9
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1097929
version: 3

SK_MSG uses msg->sg.copy as per-scatterlist-entry provenance. Entries
with this bit set are copied before data/data_end are exposed to SK_MSG
BPF programs for direct packet access.

bpf_msg_pull_data(), bpf_msg_push_data(), and bpf_msg_pop_data()
rewrite the sk_msg scatterlist ring by collapsing, splitting, and
shifting entries. These operations move msg->sg.data[] entries, but the
parallel copy bitmap can be left behind on the old slot. A copied entry
can then return to msg->sg.start with its copy bit clear and be exposed
as directly writable packet data.

This corruption path requires an attached SK_MSG BPF program that calls
the mutating helpers; ordinary sockmap/TLS traffic that never runs
push/pop/pull helper sequences is not affected.

Keep msg->sg.copy synchronized with scatterlist entry moves, preserve
the copy bit when an entry is split, clear it when a helper replaces an
entry with a private page, and clear slots vacated by pull-data
compaction.

Fixes: 015632b ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Fixes: 6fff607 ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8e ("bpf: helper to pop data from messages")
Cc: stable@vger.kernel.org
Co-developed-by: Han Guidong <2045gemini@gmail.com>
Signed-off-by: Han Guidong <2045gemini@gmail.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf Bot force-pushed the series/1097929=>bpf-next branch from 970310f to f0c74f5 Compare May 21, 2026 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants