Cgrp.struct ops.rfc.v1#12124
Open
iamkafai wants to merge 13 commits into
Open
Conversation
The tcp_congestion_ops struct_ops requires both rcu_tasks and regular rcu grace period before its map can be freed. Previously, bpf_struct_ops_map_free() called synchronize_rcu_mult(call_rcu, call_rcu_tasks) unconditionally for all struct_ops types. Introduce free_after_rcu_tasks_gp and free_after_rcu_gp flags on struct bpf_struct_ops so each type can specify its own requirements. The free_after_rcu_gp is copied to the map at map creation time so that the bpf_map_put() will take care of the regular rcu grace period. The free_after_rcu_tasks_gp is specific to tcp_congestion_ops, so this patch does not introduce it to the bpf_map. Instead, it will wait for the rcu_tasks gp internally in bpf_struct_ops_map_free(). Since rcu_tasks is the same as the regular rcu when CONFIG_TASKS_RCU=n, bpf_struct_ops_map_free() only does synchronize_rcu_tasks() when needed. This is a cleanup work to prepare for the later patch that needs to support the rcu_tasks_trace gp. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
st_link->map is always written under update_mutex. The paths that read st_link->map with rcu_read_lock() are not in the fast path, so they can simply take update_mutex instead. Remove the __rcu annotation and replace all RCU accessors with direct pointer reads under update_mutex. Use READ_ONCE() in bpf_struct_ops_map_link_poll() which reads the pointer without holding update_mutex. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add the helper functions bpf_struct_ops_map_kdata(), bpf_struct_ops_kdata_map_id(), and bpf_struct_ops_map_cfi_stubs() in bpf_struct_ops.c. They will be called from cgroup.c in the upcoming patch to create a struct_ops to cgroup attachment link. bpf_struct_ops_valid_to_reg() is also exposed for the upcoming caller in cgroup.c. The link update validation is also refactored into a new function bpf_struct_ops_link_update_check() such that it can be reused by the caller in cgroup.c in the upcoming patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Both replace_effective_prog() and purge_effective_progs() test the "!prog_list_prog(pl)" to skip a 'detaching' pl. When detaching a pl, pl->prog and pl->link are set to NULL in case the update_effective_progs() failed. However, replace_effective_prog() is not detaching a pl, so the case "!prog_list_prog()" will not happen. In purge_effective_prog(), the pl->prog and pl->link are restored before calling purge_effective_progs(), so the case "!prog_list_prog()" will not happen either. This patch removes them as a prep work for the upcoming work in attaching struct_ops to cgroup. When attaching a struct_ops to cgroup, there is a link->map case and the prog_list_prog() will not consider the link->map. The replace_effective_prog() and purge_effective_progs() will then incorrectly skip a pl with struct_ops map attached to it. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
… check prog_list_length() and compute_effective_progs() use !prog_list_prog(pl) to skip a 'detaching' pl. When pl->link is not NULL, prog_list_prog(pl) returns the pl->link->link.prog. This does not work for the upcoming struct_ops patch where pl->link is not NULL but pl->link->link.prog is NULL, because a struct_ops map is attached to the cgroup instead of a BPF prog. To prepare for the upcoming struct_ops patch, this patch replaces the prog_list_prog() test with the "!pl->prog && !pl->link". In __cgroup_bpf_detach(), both pl->prog and pl->link are set to NULL, so testing "!pl->prog && !pl->link" is the same test to tell if a pl is being detached. This change should be a no-op. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
…st_id() Add three functions to abstract operations on a bpf_prog_list entry. Right now, the bpf_prog_array_item is initialized by the prog in pl->link->link.prog. This will not work when struct_ops is attached to a cgroup. Instead of initializing the bpf_prog_array_item by a prog, the bpf_prog_array_item should be initialized with a struct_ops map. The same goes for __cgroup_bpf_query. Instead of copying the prog id to user, the struct_ops map id should be copied to user instead. This patch refactored the bpf_prog_array_item initialization to prog_list_init_item() and prog_list_replace_item(). The id-getter is refactored to prog_list_id(). These three new functions will be modified to support pl->link->map in the later patch. This is a no-op change. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move the LSM trampoline unlink into bpf_cgroup_link_auto_detach(). The purpose is to consolidate the auto_detach cleanup logic. It prepares for the upcoming struct_ops cgroup attachment patch where bpf_cgroup_link_auto_detach() will need to handle the struct_ops case (link->map != NULL). This is a no-op change. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In the upcoming patch, the array can store a struct_ops map. The array could have a cfi_stubs acting as a dummy instead of the dummy_bpf_prog. The array logic will need to skip the cfi_stubs also in order to support storing struct_ops map in the array. bpf_cgroup_array_length(), bpf_cgroup_array_copy_to_user(), and bpf_cgroup_array_delete_safe_at() are added as a preparation work to allow skipping the cfi_stubs in the upcoming patch. The patch only skips the dummy_bpf_prog which is the same as the existing behavior. The current bpf_prog_array_*() callers are changed to call the new bpf_cgroup_array_*(). This is a no-op change. Another addition is the bpf_cgroup_array_free(). This prepares the array to have a different rcu gp for the struct_ops use case. In this patch, bpf_cgroup_array_free() only goes through the regular rcu gp. This is a no-op change also. bpf_prog_dummy() is also added to return the global dummy_bpf_prog. bpf_cgroup_array_dummy() is added to decide the sentinel based on atype. It now always returns bpf_prog_dummy(). In the upcoming patch, it can return a cfi_stubs if the atype belongs to a struct_ops. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch adds necessary infrastructure to attach a struct_ops map to a cgroup. The initial need was to support migrating the legacy BPF_PROG_TYPE_SOCK_OPS to a struct_ops. Recently, there are other struct_ops use cases that need to attach struct_ops to a cgroup. For example, the recent BPF OOM and memcg discussion in LSFMMBPF 2026. The motivation is to create a consistent expectation for attaching struct_ops to cgroup instead of each subsystem creating its own infrastructure. This logic includes hierarchy expectation, ordering expectation, attachment API, and rcu gp. There is already an existing implementation for attaching multiple bpf progs to a cgroup. There are also tools built around it for querying. Attaching a struct_ops map (which is a group of bpf programs) could also adhere to a similar API and potentially reuse most of the existing implementation. A couple of ideas have been tried. One of them is to use mprog.c. In terms of the amount of changes, I eventually came to the same conclusion as Yonghong did when he implemented the relative_fd/id in commit 1209339 ("bpf: Implement mprog API on top of existing cgroup progs"). I then shifted the focus to reusing the current {update,compute,activate,purge}_effective_progs() which has the main logic that implements the mprog API. Since then, I tried to add a 'struct cgroup *cgroup' member to the existing 'struct bpf_struct_ops_link' and link_create will create a 'struct bpf_struct_ops_link' object to be stored in the pl->link. This turns out to have more changes on both cgroup.c and bpf_struct_ops.c than I like. This patch directly reuses the 'struct bpf_cgroup_link' which cgroup.c already understands. Add 'struct bpf_map *map' to 'struct bpf_cgroup_link'. In the future, as more subsystems are extended by struct_ops, we may consider to make 'struct bpf_map *map' as a primary citizen of a link like 'struct bpf_program *prog' and directly add 'struct bpf_map *map' to the generic 'struct bpf_link'. The pl->link could be the traditional 'prog' link or the new 'map' link. The places that need to handle them differently have already been refactored into the new prog_list_*() added in the earlier patch. In those new prog_list_*(), this patch will check "pl->link && pl->link->map", learn that it is a 'map' link and handle it correctly. The bpf_prog_array also needs to handle that its item can store the traditional 'prog' or it can store a struct_ops map. The places that need to handle them differently have also been refactored into the new bpf_cgroup_array_*() added in the earlier patch. The two differences are: - different sentinel (dummy_bpf_prog in prog vs cfi_stub in struct_ops) - the array for struct_ops may need to go through different rcu gp. The bpf_cgroup_array_*() functions mainly use the cgroup_bpf_attach_type (ie atype) to distinguish the array is storing prog or storing struct_ops map. This patch also implements a separate struct bpf_link_ops (named bpf_cgroup_link_lops in this patch) to have a separate link_ops implementation that only handles the cgroup's struct_ops link. Misc notes: - CGROUP_TCP_SOCK_OPS is added to the 'enum cgroup_bpf_attach_type'. The actual implementation of the tcp_bpf_ops (a struct_ops) will be added in the next patch. - free_after_mult_rcu_gp is added to 'struct bpf_struct_ops' such that the bpf_prog_array can have a mix of sleepable and non-sleepable prog in a struct_ops. This can tell how the bpf_prog_array should be freed. - For a struct_ops that supports cgroup attachment, it cannot implement its own reg/unreg function. reg/unreg to a cgroup is done by the common infrastructure added in this patch. - The cgroup's struct_ops link only supports BPF_F_ALLOW_MULTI. This is enforced internally in cgroup_bpf_struct_ops_attach. This should be consistent with the current prog's link behavior in cgroup_bpf_link_attach. In the future, we may allow each subsystem to choose differently. However, it probably does not need to expose this flexibility to the user space. - A cgrp_atype member is added to 'struct bpf_struct_ops'. When a subsystem struct_ops needs to support cgroup attachment, it needs to add a value to 'enum cgroup_bpf_attach_type' and then assign it to the newly added cgrp_atype member in the bpf_struct_ops. - During LINK_CREATE in syscall, the patch uses the same BPF_STRUCT_OPS (in attr->link_create.attach_type). The bpf_struct_ops_link_create learns the map and from the map it learns the st_ops. If the st_ops->cgrp_atype is not 0, it will create a cgroup's link. The downside of this approach is, if a struct_ops subsystem wants to also be attached to non-cgroup later, this will not work. - When a subsystem registers a struct_ops that supports cgroup attachment, the struct_ops infrastructure will also ask the cgroup infrastructure to remember a few things. This is done by calling bpf_cgroup_struct_ops_register(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In LSFMMBPF 2025, I have talked about moving the existing BPF_SOCK_OPS_*_CB
to a struct_ops interface [1].
One user experience on the service is that a
BPF_PROG_TYPE_SOCK_OPS program usually has a long switch case to handle
all CB it needs to handle. This is inconvenient but not big enough to
move to the new struct_ops interface.
We managed to add new CB and the switch case in a BPF_PROG_TYPE_SOCK_OPS program
is getting longer. This is inconvenient to write the bpf prog but not
necessarily a breaking point to retire the interface.
When we managed to find new use cases and adding new CB, the interface started
to show its teeth. The sock_ops->op is a runtime enum to tell where the
bpf prog is called (e.g. BPF_SOCK_OPS_TCP_LISTEN_CB).
While adding more CB flags, we find cases that break the earlier assumption
that is shared by all CB flags. For example, not all call sites have
the tcp_sock locked, so adding is_locked_tcp_sock logic. Not all call sites
have skb, then we added kernel code to handle that. Not all helpers are safe
for sock_ops->op, we also added kernel code for that. When the new CB flags
need arguments other than sk, we also need to extend 'struct bpf_sock_ops_kern'.
The existing 'union { u32 args[4]; u32 replylong[4]; }' is not reliable in
passing args to bpf prog when there are multiple progs attached to a cgroup.
The above has already been solved in struct_ops. struct_ops can
provide a better interface to continue extending operations for sock.
Each ops can define its own argument instead of initializing everything
in 'struct bpf_sock_ops_kern'. Each ops can define what kfunc is allowed.
The patch has moved some of the BPF_SOCK_OPS_* callbacks to bpf_tcp_ops.
I don't think BASE_RTT is useful. NEEDS_ECN should be done in
bpf-tcp-cc instead. The tstamp ones should be a separate
struct_ops (e.g. "bpf_sock_ops") that can work in both TCP and UDP.
TODOs:
- ACTIVE_ESTABLISHED_CB and PASSIVE_ESTABLISHED_CB
- BPF_SOCK_OPS_*HDR related ops
- RTO and BASE_RTT
[1]: https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view?usp=sharing
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add bpf_map__attach_cgroup_opts. Make bpf_prog_query_opts support type_id. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Simple smoke tests for attaching struct_ops to cgroup. TODO: - more than one level of cgroup setup. - BPF_F_BEFORE/AFTER. - bpf_link__update_map an existing struct_ops with and without BPF_F_PREORDER Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
ec31e3e to
b3beebb
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
RFC v1: support attaching struct_ops to cgroup