uprobes/x86: Fix red zone issue for optimized uprobes #12147
kernel-patches-daemon-bpf[bot] wants to merge 12 commits into
Upstream branch: b1fcdf9
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
Forwarding comment 4508637262 via email
Forwarding comment 4508652446 via email
Forwarding comment 4508663260 via email
Forwarding comment 4508668819 via email
Forwarding comment 4508683614 via email
In the unregister path we use the __in_uprobe_trampoline check with current->mm for the VMA lookup, which is wrong, because we are in the tracer context, not the traced process.
Add an mm_struct pointer argument to __in_uprobe_trampoline and change the related callers to pass the proper mm_struct pointer.
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Removing the struct uprobe_trampoline object and its tracking code, because it's not needed. We can do the same thing directly on top of struct vm_area_struct objects. This makes the code simpler and allows easy propagation of the trampoline vma object into the child process in a following change.
Note the original code called destroy_uprobe_trampoline if the optimization failed, but it only freed the struct uprobe_trampoline object, not the vma.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
When we do fork or clone without CLONE_VM, the new process won't have the uprobe trampoline vma objects, but it will still have the optimized code calling that trampoline, so it will crash.
Fixing this by allowing the uprobe trampoline vma objects to be copied on fork to the new process.
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Andrii reported an issue with optimized uprobes [1]: the call instruction
stores its return address on the stack and can clobber the red zone area,
where user code may keep temporary data without adjusting rsp.
Fixing this by moving the optimized uprobes on top of a 10-byte nop
instruction, so we can squeeze in another instruction to escape the
red zone area before doing the call, like:
    lea -0x80(%rsp), %rsp
    call tramp
Note the lea instruction is used to adjust the rsp register without
changing the flags.
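The 10-byte sequence above can be sketched as a small encoder. This is only an illustration, not the kernel code: the function name `encode_optimized` and its parameters are hypothetical, and it assumes the trampoline is reachable with a 32-bit displacement from the probe address.

```c
#include <stdint.h>
#include <string.h>

#define OPT_INSN_SIZE 10   /* same length as the nop10 being replaced */
#define LEA_SIZE       5   /* lea -0x80(%rsp), %rsp = 48 8d 64 24 80 */
#define CALL_OPCODE 0xe8   /* call rel32 */

/*
 * Hypothetical helper: fill buf with "lea -0x80(%rsp), %rsp; call tramp"
 * for a probe located at vaddr. The caller is assumed to guarantee that
 * tramp is within +/-2GB of the end of the sequence.
 */
static void encode_optimized(uint8_t buf[OPT_INSN_SIZE],
			     uint64_t vaddr, uint64_t tramp)
{
	static const uint8_t lea[LEA_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
	/* rel32 is relative to the address right after the call instruction */
	uint32_t rel = (uint32_t)(tramp - (vaddr + OPT_INSN_SIZE));
	int i;

	memcpy(buf, lea, LEA_SIZE);
	buf[LEA_SIZE] = CALL_OPCODE;
	/* store the displacement in little-endian byte order */
	for (i = 0; i < 4; i++)
		buf[LEA_SIZE + 1 + i] = (uint8_t)(rel >> (8 * i));
}
```

The call sits at offset 5 and is 5 bytes long, so its displacement is computed from vaddr + 10 rather than from vaddr + 5 as in the old nop5 layout.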
We use nop10 and the following transformation to the optimized instructions
above and back, as suggested by Peterz [2].
Optimize path (int3_update_optimize):
1) Initial state after set_swbp() installed the uprobe:
cc 2e 0f 1f 84 00 00 00 00 00
From offset 0 this is INT3 followed by the tail of the original
10-byte NOP.
2) Trap the call slot before rewriting the NOP tail:
cc 2e 0f 1f 84 [cc] 00 00 00 00
From offset 0 this traps on the uprobe INT3. A thread reaching
offset 5 traps on the temporary INT3 instead of seeing a partially
patched call.
3) Rewrite the LEA tail and call displacement, keeping both INT3 bytes:
cc [8d 64 24 80] cc [d0 d1 d2 d3]
From offset 0 and offset 5 this still traps. The bytes between
them are not executable entry points while both traps are in place.
4) Restore the call opcode at offset 5:
cc 8d 64 24 80 [e8] d0 d1 d2 d3
From offset 0 this still traps. From offset 5 the instruction is
the final CALL to the uprobe trampoline.
5) Publish the first LEA byte:
[48] 8d 64 24 80 e8 d0 d1 d2 d3
From offset 0 this is:
lea -0x80(%rsp), %rsp
call <uprobe-trampoline>
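The optimize steps above can be modelled in user space as in-place byte writes on a 10-byte buffer. This is a sketch only: `optimize_step` is a hypothetical name, not the int3_update_optimize implementation, and whatever synchronization across CPUs a real implementation needs between steps is not modelled here.

```c
#include <stdint.h>
#include <string.h>

#define INSN_SIZE 10
#define INT3 0xcc
#define CALL 0xe8

/*
 * Hypothetical model of optimize steps 2..5. insn must be in the state
 * left by the previous step; disp is the 4-byte call displacement.
 */
static void optimize_step(uint8_t insn[INSN_SIZE], int step,
			  const uint8_t disp[4])
{
	static const uint8_t lea_tail[4] = { 0x8d, 0x64, 0x24, 0x80 };

	switch (step) {
	case 2: /* trap the call slot before touching the NOP tail */
		insn[5] = INT3;
		break;
	case 3: /* write LEA tail and displacement, both INT3 bytes stay */
		memcpy(insn + 1, lea_tail, 4);
		memcpy(insn + 6, disp, 4);
		break;
	case 4: /* restore the call opcode at offset 5 */
		insn[5] = CALL;
		break;
	case 5: /* publish the first LEA byte */
		insn[0] = 0x48;
		break;
	default:
		break;
	}
}
```

Starting from `cc 2e 0f 1f 84 00 00 00 00 00` and applying steps 2 to 5 in order reproduces the byte states listed above; byte 0 stays INT3 until the very last write.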
Unoptimize path (int3_update_unoptimize):
1) Initial optimized state:
48 8d 64 24 80 e8 d0 d1 d2 d3
Same as 5) above.
2) Trap new entries before restoring the NOP bytes:
[cc] 8d 64 24 80 e8 d0 d1 d2 d3
From offset 0 this traps. A thread that had already executed the
LEA can still reach the intact CALL at offset 5.
3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
and byte 5 as CALL.
cc [2e 0f 1f 84] e8 d0 d1 d2 d3
From offset 0 this still traps. Offset 5 is still the CALL for any
thread that was already past the first LEA byte.
4) Publish the first byte of the original NOP:
[66] 2e 0f 1f 84 e8 d0 d1 d2 d3
From offset 0 this is the restored 10-byte NOP; the CALL opcode and
displacement are now only NOP operands. Offset 5 still decodes as
CALL for a thread that was already there.
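The unoptimize steps can be modelled the same way. Again this is a user-space sketch with hypothetical names (`unoptimize_step`, `call_slot_intact`), not the int3_update_unoptimize code; it only shows that bytes 5..9 are never written, so a thread already past the LEA always finds a valid CALL.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE 10
#define INT3 0xcc
#define CALL 0xe8

/* Hypothetical model of unoptimize steps 2..4, applied in place. */
static void unoptimize_step(uint8_t insn[INSN_SIZE], int step)
{
	static const uint8_t nop10_tail[4] = { 0x2e, 0x0f, 0x1f, 0x84 };

	switch (step) {
	case 2: /* trap new entries at offset 0 */
		insn[0] = INT3;
		break;
	case 3: /* restore bytes 1..4 of the original NOP */
		memcpy(insn + 1, nop10_tail, 4);
		break;
	case 4: /* publish the first byte of the original NOP */
		insn[0] = 0x66;
		break;
	default:
		break;
	}
}

/* True if offset 5 still holds the CALL opcode and the given displacement. */
static bool call_slot_intact(const uint8_t insn[INSN_SIZE],
			     const uint8_t disp[4])
{
	return insn[5] == CALL && memcmp(insn + 6, disp, 4) == 0;
}
```

Checking `call_slot_intact()` after every step is one way to express the invariant described in steps 2) to 4).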
Note that, as explained in [2], we need to use the following nop10:
        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
NOP10:  0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
which means we need to allow the 0x2e prefix, which maps to the
INAT_PFX_CS attribute, in the is_prefix_bad function.
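A byte-level match against this exact nop10 encoding could look like the sketch below. The helper name `is_nop10` is hypothetical and the code does not reflect how the kernel decodes the instruction; it only compares against the ten bytes listed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The nop10 encoding from the commit message: cs nopw 0x0(%rax,%rax,1) */
static const uint8_t nop10[10] = {
	0x66, 0x2e,             /* operand-size and CS prefixes */
	0x0f, 0x1f,             /* escape byte + NOPL opcode */
	0x84, 0x00,             /* ModRM + SIB */
	0x00, 0x00, 0x00, 0x00, /* disp32 */
};

/* Hypothetical helper: true if insn starts with the nop10 bytes above. */
static bool is_nop10(const uint8_t *insn, size_t len)
{
	return len >= sizeof(nop10) && memcmp(insn, nop10, sizeof(nop10)) == 0;
}
```

Other 10-byte NOP encodings exist, so a check like this deliberately accepts only the one form whose tail bytes line up with the patching steps.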
The optimized uprobe performance stays the same:
uprobe-nop : 3.129 ± 0.013M/s
uprobe-push : 3.045 ± 0.006M/s
uprobe-ret : 1.095 ± 0.004M/s
--> uprobe-nop10 : 7.170 ± 0.020M/s
uretprobe-nop : 2.143 ± 0.021M/s
uretprobe-push : 2.090 ± 0.000M/s
uretprobe-ret : 0.942 ± 0.000M/s
--> uretprobe-nop10: 3.381 ± 0.003M/s
usdt-nop : 3.245 ± 0.004M/s
--> usdt-nop10 : 7.256 ± 0.023M/s
[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
[2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
Fixes: ba2bfc9 ("uprobes/x86: Add support to optimize uprobes")
Assisted-by: Codex:GPT-5.5
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
We now expect the nop combo to use a 10-byte nop instead of a 5-byte nop; fixing has_nop_combo to reflect that.
Fixes: 41a5c7d ("libbpf: Add support to detect nop,nop5 instructions combo for usdt probe")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
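As an illustration of what the combo check has to match after this change, here is a buffer-based sketch. It is not the libbpf has_nop_combo code; `buf_has_nop_combo` is a hypothetical name, and it assumes the combo is a single-byte nop (0x90) immediately followed by the nop10 encoding used by the kernel side.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if buf starts with nop followed by nop10. */
static bool buf_has_nop_combo(const uint8_t *buf, size_t len)
{
	static const uint8_t combo[] = {
		0x90,                                     /* nop */
		0x66, 0x2e, 0x0f, 0x1f, 0x84,             /* nop10 ... */
		0x00, 0x00, 0x00, 0x00, 0x00,
	};

	return len >= sizeof(combo) && memcmp(buf, combo, sizeof(combo)) == 0;
}
```

The old nop,nop5 pattern is 6 bytes long, so a caller that previously read 6 bytes at the probe site would need to read 11 for a check like this.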
In the previous optimized uprobe fix we changed the syscall error used for its detection from ENXIO to EPROTO. Changing the related probe_uprobe_syscall detection check accordingly.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 05738da ("libbpf: Add uprobe syscall feature detection")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
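The detection logic described here boils down to classifying the result of a probing syscall. The sketch below assumes the probe invokes the uprobe syscall from outside a trampoline and inspects errno; the helper name `uprobe_syscall_detected` is hypothetical and the syscall invocation itself is left out.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: ret and err are the return value and errno of a
 * probing uprobe syscall made outside of a uprobe trampoline. With this
 * series a supporting kernel is assumed to fail it with EPROTO.
 */
static bool uprobe_syscall_detected(long ret, int err)
{
	return ret < 0 && err == EPROTO;
}
```

With a check of this shape, a kernel that still answers ENXIO is treated as not supporting the fixed optimization, which appears to be the point of switching the error code.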
Syncing the latest usdt.h change [1]. Now that we have nop10 optimization support in the kernel, let's emit nop,nop10 for usdt probes. We leave it up to the library to use the desired nop instruction.
[1] TBD
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Optimized uprobes are now on top of 10-byte nop instructions; reflect that in the existing tests.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Changing the uprobe/usdt trigger bench code to use nop10 instead of nop5. Also changing run_bench_uprobes.sh to use the nop10 triggers.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Adding reattach tests to the uprobe syscall tests to make sure we can re-attach and optimize the same uprobe multiple times.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte CALL to a trampoline. The CALL pushes a return address onto the stack at [rsp-8], clobbering whatever was stored there.
On x86-64, the red zone is the 128 bytes below rsp that user code may use for temporary storage without adjusting rsp. Compilers can place USDT argument operands there, generating specs like "8@-8(%rbp)" when rbp == rsp. With the CALL-based optimization, the return address overwrites that argument before the BPF-side USDT argument fetch runs.
Add two tests for this case. The uprobe_syscall subtest stores known values at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop10 uprobe, and verifies the red-zone data is still intact. The USDT subtest triggers a probe in a function where the compiler places three USDT operands in the red zone and verifies that all 10 optimized invocations deliver the expected argument values to BPF.
On an unfixed kernel, the first hit goes through the INT3 path and later hits use the optimized CALL path, so the red-zone checks fail after optimization.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
[ updates to use nop10 ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
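The clobbering described above can be illustrated with a tiny stack model. This is a plain C simulation with hypothetical names (`struct sim`, `sim_call`, `sim_lea_skip_red_zone`), not the selftest code: it only tracks where a CALL's return address lands relative to three values stored in the red zone.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS    64
#define RED_ZONE_BYTES 128

/* Hypothetical model: stack[i] holds the 8 bytes at byte offset i * 8. */
struct sim {
	uint64_t stack[STACK_WORDS];
	uint64_t rsp; /* byte offset into stack[] */
};

/* Place rsp mid-stack and store known values at -8, -16 and -24(%rsp). */
static void sim_init(struct sim *s)
{
	memset(s, 0, sizeof(*s));
	s->rsp = 256;
	s->stack[s->rsp / 8 - 1] = 0x1111;
	s->stack[s->rsp / 8 - 2] = 0x2222;
	s->stack[s->rsp / 8 - 3] = 0x3333;
}

/* CALL: decrement rsp by 8 and store the return address there. */
static void sim_call(struct sim *s, uint64_t ret_addr)
{
	s->rsp -= 8;
	s->stack[s->rsp / 8] = ret_addr;
}

/* lea -0x80(%rsp), %rsp: step over the whole red zone first. */
static void sim_lea_skip_red_zone(struct sim *s)
{
	s->rsp -= RED_ZONE_BYTES;
}
```

Calling `sim_call()` directly after `sim_init()` overwrites the value stored at -8(%rsp), while calling `sim_lea_skip_red_zone()` first puts the return address below the 128-byte red zone and leaves all three values untouched.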
Adding tests for forked/cloned optimized uprobes to make sure the child can properly execute the optimized probe for both fork (dups mm) and clone with CLONE_VM.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
c3a7577 to 1626528
Pull request for series with
subject: uprobes/x86: Fix red zone issue for optimized uprobes
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1098719