dmaengine: adi-dma: fix use-after-free in cyclic transfers #3112
pamolloy wants to merge 1 commit into adsp-6.12.0-y
Conversation
Agreed! Claude overdid it IMO. I would go with your proposed fix as something temporary... Looking at the driver it seems like a big refactor is needed anyways.
Yah, I dug a little deeper, especially since it removed the following comment: And after some prompting it concluded: Before I saw that comment I was confused why the cyclic and non-cyclic approaches were different.
Probably over complicating again 😄 . That driver looks like a typical out-of-tree driver, so we'll have to refactor it a bit. So, IMO, I would go with your fix. Yeah, we have a copy in there, but it's a small struct and I would not expect any overhead at all.
A race condition exists in the threaded interrupt handler when processing
cyclic DMA descriptors. The handler accesses channel->current_desc->result
after releasing the channel lock, creating a window where another CPU
executing adi_dma_terminate_all() can free the descriptor.
Race sequence:
  CPU 0 (thread handler)          CPU 1 (terminate_all)
  ----------------------          ---------------------
  Lock channel->lock
  Check current_desc->cyclic
  Get callback pointer
  Unlock channel->lock
                                  Lock channel->lock
                                  Free current_desc
                                  Unlock channel->lock
  Access current_desc->result  <- Use-after-free
This results in accessing freed memory containing list poison values
(0xdead000000000100), leading to kernel crashes.
Fix this by copying the result structure to a local variable while
still holding the lock, then using the copy after unlocking. This
follows the same safe pattern used in the non-cyclic path, which
already uses a local descriptor pointer.
The fix is minimal and avoids the complications warned about in the
original code comment regarding cyclic descriptors and the pending
list during termination.
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Philip Molloy <philip.molloy@analog.com>
Before I force push, here is the more complicated patch for future reference:
Force-pushed 4bd5720 to fd506fb.
@vasbimpikasadi can someone on your team test this? Ideally also with the example code the customer provided to reproduce the problem 🙏
apologies for taking a while. This will be now tested and we'll try to reproduce |
> @vasbimpikasadi can someone on your team test this? Ideally also with the example code the customer provided to reproduce the problem 🙏
Hi @pamolloy, I attempted to run the reproducer (without the fix applied) and got the following splat:
root@adsp-sc598-som-ezkit:~# insmod /dma_race_sim.ko
[ 55.314105] dma_race_sim: loading out-of-tree module taints kernel.
root@adsp-sc598-som-ezkit:~# [ 55.423785] SIM: Terminator done.
[ 55.523793] Unable to handle kernel paging request at virtual address dead000000000108
[ 55.531570] Mem abort info:
[ 55.534356] ESR = 0x0000000096000044
[ 55.538074] EC = 0x25: DABT (current EL), IL = 32 bits
[ 55.543369] SET = 0, FnV = 0
[ 55.546405] EA = 0, S1PTW = 0
[ 55.549544] FSC = 0x04: level 0 translation fault
[ 55.554392] Data abort info:
[ 55.557256] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[ 55.562723] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[ 55.567763] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 55.573052] [dead000000000108] address between user and kernel address ranges
[ 55.580179] Internal error: Oops: 0000000096000044 [#1] PREEMPT SMP
[ 55.586414] Modules linked in: dma_race_sim(O) crct10dif_ce
[ 55.591971] CPU: 0 UID: 0 PID: 271 Comm: sim_handler Tainted: G O 6.12.0-yocto-standard #1
[ 55.601603] Tainted: [O]=OOT_MODULE
[ 55.605073] Hardware name: ADI 64-bit SC598 SOM EZ Kit (DT)
[ 55.610629] pstate: 60000009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 55.617572] pc : handler_fn+0x84/0x94 [dma_race_sim]
[ 55.622520] lr : handler_fn+0x80/0x94 [dma_race_sim]
[ 55.627467] sp : ffff8000819f3e30
[ 55.630765] x29: ffff8000819f3e30 x28: 0000000000000000 x27: 0000000000000000
[ 55.637883] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 55.645001] x23: dead000000000122 x22: dead000000000100 x21: ffff000090f33040
[ 55.652118] x20: ffff800079238000 x19: ffff8000792384c0 x18: 0000000000000000
[ 55.659236] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 55.666353] x14: 0000000000000000 x13: 0000000000000008 x12: 0000000c936545c0
[ 55.673471] x11: 0000000000000001 x10: 0000000000000a60 x9 : ffff8000819f3cf0
[ 55.680589] x8 : ffff000090f55e80 x7 : ffff00009df43700 x6 : 0000000000000000
[ 55.687706] x5 : ffff8000792384c0 x4 : ffff00009df340c0 x3 : ffff8000792384c0
[ 55.694824] x2 : 0000000000000000 x1 : dead000000000100 x0 : dead000000000122
[ 55.701942] Call trace:
[ 55.704372] handler_fn+0x84/0x94 [dma_race_sim]
[ 55.708972] kthread+0xd8/0xe8
[ 55.712010] ret_from_fork+0x10/0x20
[ 55.715575] Code: 95ba41a9 aa1303e0 95dbb1b5 a94002a1 (f9000420)
[ 55.721646] {}{}[ end trace 0000000000000000 ]{}{}
[ 55.726282] note: sim_handler[271] exited with preempt_count 1
From the splat, the fault appears to occur in the reproducer module itself rather than directly in the adi-dma driver path.
The faulting PC is reported as handler_fn+0x84/0x94 [dma_race_sim]. I also reran it with CONFIG_DEBUG_INFO enabled to get debug symbols, but the result was the same.

After reviewing the reproducer source, it appears to create a shared linked list and start two kernel threads on it. The handler thread saves a pointer to a list node, drops the lock, sleeps, and later tries to delete that same node. In parallel, the terminator thread removes and frees all nodes from the list. That makes the reproducer itself capable of triggering a stale-pointer/UAF crash independently of the adi-dma driver. So at least from this run, it does not appear to reproduce the failure through the actual adi-dma driver path shown in the customer's original stack trace.

It should probably also be noted this was tested on kernel version 6.12 (yocto 5.0.1), while the customer seems to encounter the issue on kernel version 5.15.78, which would be yocto 3.1.0, released in 2023 iirc.
A customer reported a bug and proposed a fix for a kernel crash. The following stack trace is generated by simulating that crash.
I generated the patch in this PR based on the above. It will require more thorough review and testing. The comments are also fairly verbose and should likely get dropped before merging.
A more trivial fix would be as follows and more clearly identifies the problem: