Skip to content

Cgrp.struct ops.rfc.v1#12124

Open
iamkafai wants to merge 13 commits into
kernel-patches:bpf-next_basefrom
iamkafai:cgrp.struct_ops.rfc.v1
Open

Cgrp.struct ops.rfc.v1#12124
iamkafai wants to merge 13 commits into
kernel-patches:bpf-next_basefrom
iamkafai:cgrp.struct_ops.rfc.v1

Conversation

@iamkafai
Copy link
Copy Markdown
Contributor

RFC v1: support attaching struct_ops to cgroup

Kernel Patches Daemon and others added 13 commits May 17, 2026 14:03
The tcp_congestion_ops struct_ops requires both rcu_tasks
and regular rcu grace period before its map can be freed.

Previously, bpf_struct_ops_map_free() called
synchronize_rcu_mult(call_rcu, call_rcu_tasks) unconditionally for all
struct_ops types.

Introduce free_after_rcu_tasks_gp and free_after_rcu_gp flags on
struct bpf_struct_ops so each type can specify its own requirements.

The free_after_rcu_gp is copied to the map at map creation time so
that the bpf_map_put() will take care of the regular rcu grace period.

The free_after_rcu_tasks_gp is specific to tcp_congestion_ops, so
this patch does not introduce it to the bpf_map. Instead,
it will wait for the rcu_tasks gp internally in bpf_struct_ops_map_free().
Since rcu_tasks is the same as the regular rcu when CONFIG_TASKS_RCU=n,
bpf_struct_ops_map_free() only does synchronize_rcu_tasks() when needed.

This is a cleanup work to prepare for the later patch that needs to
support the rcu_tasks_trace gp.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
st_link->map is always written under update_mutex. The paths that read
st_link->map with rcu_read_lock() are not in the fast path, so they can
simply take update_mutex instead. Remove the __rcu annotation and replace
all RCU accessors with direct pointer reads under update_mutex. Use
READ_ONCE() in bpf_struct_ops_map_link_poll() which reads the pointer
without holding update_mutex.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add the helper functions bpf_struct_ops_map_kdata(),
bpf_struct_ops_kdata_map_id(), and bpf_struct_ops_map_cfi_stubs()
in bpf_struct_ops.c. They will be called from cgroup.c in the upcoming
patch to create a struct_ops to cgroup attachment link.

bpf_struct_ops_valid_to_reg() is also exposed for the upcoming caller
in cgroup.c.

The link update validation is also refactored into a new function
bpf_struct_ops_link_update_check() such that it can be reused by the caller
in cgroup.c in the upcoming patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Both replace_effective_prog() and purge_effective_progs() test
the "!prog_list_prog(pl)" to skip a 'detaching' pl.

When detaching a pl, pl->prog and pl->link are set to NULL in case
the update_effective_progs() failed.

However, replace_effective_prog() is not detaching a pl,
so the case "!prog_list_prog()" will not happen.

In purge_effective_prog(), the pl->prog and pl->link are restored
before calling purge_effective_progs(), so the case "!prog_list_prog()"
will not happen either.

This patch removes them as a prep work for the upcoming work
in attaching struct_ops to cgroup. When attaching a struct_ops
to cgroup, there is a link->map case and the prog_list_prog()
will not consider the link->map. The replace_effective_prog()
and purge_effective_progs() will then incorrectly skip a pl
with struct_ops map attached to it.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
… check

prog_list_length() and compute_effective_progs() use !prog_list_prog(pl)
to skip a 'detaching' pl.

When pl->link is not NULL, prog_list_prog(pl) returns
the pl->link->link.prog. This does not work for the upcoming struct_ops
patch where pl->link is not NULL but pl->link->link.prog is NULL,
because a struct_ops map is attached to the cgroup instead of a BPF prog.

To prepare for the upcoming struct_ops patch, this patch
replaces the prog_list_prog() test with the
"!pl->prog && !pl->link". In __cgroup_bpf_detach(),
both pl->prog and pl->link are set to NULL, so testing
"!pl->prog && !pl->link" is the same test to tell
if a pl is being detached. This change should be a no-op.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
…st_id()

Add three functions to abstract operations on a bpf_prog_list entry.

Right now, the bpf_prog_array_item is initialized by the prog
in pl->link->link.prog. This will not work when struct_ops
is attached to a cgroup. Instead of initializing the bpf_prog_array_item
by a prog, the bpf_prog_array_item should be initialized with
a struct_ops map.

The same goes for __cgroup_bpf_query. Instead of copying the
prog id to user, the struct_ops map id should be copied to user instead.

This patch refactored the bpf_prog_array_item initialization to
prog_list_init_item() and prog_list_replace_item(). The id-getter
is refactored to prog_list_id(). These three new functions
will be modified to support pl->link->map in the later patch.

This is a no-op change.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move the LSM trampoline unlink into bpf_cgroup_link_auto_detach().
The purpose is to consolidate the auto_detach cleanup logic.

It prepares for the upcoming struct_ops cgroup attachment patch where
bpf_cgroup_link_auto_detach() will need to handle the struct_ops case
(link->map != NULL).

This is a no-op change.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In the upcoming patch, the array can store a struct_ops map.
The array could have a cfi_stubs acting as a dummy instead of
the dummy_bpf_prog. The array logic will need to skip the cfi_stubs
also in order to support storing struct_ops map in the array.

bpf_cgroup_array_length(), bpf_cgroup_array_copy_to_user(), and
bpf_cgroup_array_delete_safe_at() are added as a preparation work
to allow skipping the cfi_stubs in the upcoming patch. The patch
only skips the dummy_bpf_prog which is the same as the existing behavior.
The current bpf_prog_array_*() callers are changed to call the new
bpf_cgroup_array_*(). This is a no-op change.

Another addition is the bpf_cgroup_array_free(). This prepares
the array to have a different rcu gp for the struct_ops use case.
In this patch, bpf_cgroup_array_free() only goes through the
regular rcu gp. This is a no-op change also.

bpf_prog_dummy() is also added to return the global dummy_bpf_prog.

bpf_cgroup_array_dummy() is added to decide the sentinel based on atype.
It now always returns bpf_prog_dummy(). In the upcoming patch,
it can return a cfi_stubs if the atype belongs to a struct_ops.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch adds necessary infrastructure to attach a struct_ops
map to a cgroup. The initial need was to support migrating
the legacy BPF_PROG_TYPE_SOCK_OPS to a struct_ops.
Recently, there are other struct_ops use cases that
need to attach struct_ops to a cgroup. For example,
the recent BPF OOM and memcg discussion in LSFMMBPF 2026.

The motivation is to create a consistent expectation
for attaching struct_ops to cgroup instead of each subsystem
creating its own infrastructure. This logic includes
hierarchy expectation, ordering expectation,
attachment API, and rcu gp.

There is already an existing implementation for attaching
multiple bpf progs to a cgroup. There are also tools
built around it for querying. Attaching a struct_ops map
(which is a group of bpf programs) could also adhere to
a similar API and potentially reuse most of the existing
implementation.

A couple of ideas have been tried. One of them
is to use mprog.c. In terms of the amount of changes,
I eventually came to the same conclusion as Yonghong did
when he implemented the relative_fd/id in
commit 1209339 ("bpf: Implement mprog API on top of existing cgroup progs").
I then shifted the focus to reusing the current
{update,compute,activate,purge}_effective_progs() which has
the main logic that implements the mprog API.

Since then, I tried to add a 'struct cgroup *cgroup' member
to the existing 'struct bpf_struct_ops_link' and link_create
will create a 'struct bpf_struct_ops_link' object to be stored
in the pl->link. This turns out to have more changes on
both cgroup.c and bpf_struct_ops.c than I like.

This patch directly reuses the 'struct bpf_cgroup_link' which
cgroup.c already understands. Add 'struct bpf_map *map'
to 'struct bpf_cgroup_link'. In the future, as more subsystems
are extended by struct_ops, we may consider to make
'struct bpf_map *map' as a primary citizen of a link
like 'struct bpf_program *prog' and directly add
'struct bpf_map *map' to the generic 'struct bpf_link'.

The pl->link could be the traditional 'prog' link or the
new 'map' link. The places that need to handle them differently
have already been refactored into the new prog_list_*() added in
the earlier patch. In those new prog_list_*(), this patch will
check "pl->link && pl->link->map", learn that it is a 'map' link
and handle it correctly.

The bpf_prog_array also needs to handle that its item can store
the traditional 'prog' or it can store a struct_ops map.
The places that need to handle them differently have also
been refactored into the new bpf_cgroup_array_*() added
in the earlier patch. The two differences are:
  - different sentinel (dummy_bpf_prog in prog vs cfi_stub in struct_ops)
  - the array for struct_ops may need to go through different
    rcu gp.
The bpf_cgroup_array_*() functions mainly use the
cgroup_bpf_attach_type (ie atype) to distinguish the array
is storing prog or storing struct_ops map.

This patch also implements a separate struct bpf_link_ops
(named bpf_cgroup_link_lops in this patch) to have a separate
link_ops implementation that only handles the cgroup's struct_ops link.

Misc notes:
- CGROUP_TCP_SOCK_OPS is added to the 'enum cgroup_bpf_attach_type'.
  The actual implementation of the tcp_bpf_ops (a struct_ops)
  will be added in the next patch.

- free_after_mult_rcu_gp is added to 'struct bpf_struct_ops' such that
  the bpf_prog_array can have a mix of sleepable and
  non-sleepable prog in a struct_ops. This can tell
  how the bpf_prog_array should be freed.

- For a struct_ops that supports cgroup attachment, it cannot
  implement its own reg/unreg function. reg/unreg to a cgroup is
  done by the common infrastructure added in this patch.

- The cgroup's struct_ops link only supports BPF_F_ALLOW_MULTI.
  This is enforced internally in cgroup_bpf_struct_ops_attach.
  This should be consistent with the current prog's link
  behavior in cgroup_bpf_link_attach.

  In the future, we may allow each subsystem to choose
  differently. However, it probably does not need
  to expose this flexibility to the user space.

- A cgrp_atype member is added to 'struct bpf_struct_ops'.
  When a subsystem struct_ops needs to support cgroup attachment,
  it needs to add a value to 'enum cgroup_bpf_attach_type'
  and then assign it to the newly added cgrp_atype member
  in the bpf_struct_ops.

- During LINK_CREATE in syscall, the patch uses the same
  BPF_STRUCT_OPS (in attr->link_create.attach_type).
  The bpf_struct_ops_link_create learns the map and
  from the map it learns the st_ops. If the st_ops->cgrp_atype
  is not 0, it will create a cgroup's link.

  The downside of this approach is, if a struct_ops subsystem
  wants to also be attached to non-cgroup later, this will not work.

- When a subsystem registers a struct_ops that supports cgroup
  attachment, the struct_ops infrastructure will also ask the
  cgroup infrastructure to remember a few things. This is done
  by calling bpf_cgroup_struct_ops_register().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In LSFMMBPF 2025, I have talked about moving the existing BPF_SOCK_OPS_*_CB
to a struct_ops interface [1].

One user experience on the service is that a
BPF_PROG_TYPE_SOCK_OPS program usually has a long switch case to handle
all CB it needs to handle. This is inconvenient but not big enough to
move to the new struct_ops interface.

We managed to add new CB and the switch case in a BPF_PROG_TYPE_SOCK_OPS program
is getting longer. This is inconvenient to write the bpf prog but not
necessarily a breaking point to retire the interface.

When we managed to find new use cases and adding new CB, the interface started
to show its teeth. The sock_ops->op is a runtime enum to tell where the
bpf prog is called (e.g. BPF_SOCK_OPS_TCP_LISTEN_CB).
While adding more CB flags, we find cases that break the earlier assumption
that is shared by all CB flags. For example, not all call sites have
the tcp_sock locked, so adding is_locked_tcp_sock logic. Not all call sites
have skb, then we added kernel code to handle that. Not all helpers are safe
for sock_ops->op, we also added kernel code for that. When the new CB flags
need arguments other than sk, we also need to extend 'struct bpf_sock_ops_kern'.
The existing 'union { u32 args[4]; u32 replylong[4]; }' is not reliable in
passing args to bpf prog when there are multiple progs attached to a cgroup.

The above has already been solved in struct_ops. struct_ops can
provide a better interface to continue extending operations for sock.
Each ops can define its own argument instead of initializing everything
in 'struct bpf_sock_ops_kern'. Each ops can define what kfunc is allowed.

The patch has moved some of the BPF_SOCK_OPS_* callbacks to bpf_tcp_ops.
I don't think BASE_RTT is useful. NEEDS_ECN should be done in
bpf-tcp-cc instead. The tstamp ones should be a separate
struct_ops (e.g. "bpf_sock_ops") that can work in both TCP and UDP.

TODOs:
- ACTIVE_ESTABLISHED_CB and PASSIVE_ESTABLISHED_CB
- BPF_SOCK_OPS_*HDR related ops
- RTO and BASE_RTT

[1]: https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view?usp=sharing

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add bpf_map__attach_cgroup_opts.

Make bpf_prog_query_opts support type_id.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Simple smoke tests for attaching struct_ops to cgroup.

TODO:
- more than one level of cgroup setup.
- BPF_F_BEFORE/AFTER.
- bpf_link__update_map an existing struct_ops with and without BPF_F_PREORDER

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf Bot force-pushed the bpf-next_base branch 3 times, most recently from ec31e3e to b3beebb Compare May 21, 2026 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant