amdxdna: fix NULL-deref / UAF flushing timed-out mailbox messages#1348
amdxdna: fix NULL-deref / UAF flushing timed-out mailbox messages#1348FIM43-Redeye wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR hardens the AMD XDNA mailbox message lifecycle to prevent NULL/UAF dereferences during channel teardown/unload, especially when firmware is wedged and requests time out.
Changes:
- Add NULL-handle guards in mailbox notify callbacks to avoid dereferencing missing completion state during teardown flushes.
- Reclaim timed-out outstanding messages to prevent teardown from later flushing them with dangling/on-stack handles.
- Introduce a new public helper API to cancel an outstanding mailbox message by id.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/driver/amdxdna/amdxdna_mailbox_helper.c | Adds NULL-handle guard in callback and cancels outstanding msg on timeout to prevent teardown UAF/NULL-deref. |
| src/driver/amdxdna/amdxdna_mailbox.h | Documents and declares new xdna_mailbox_cancel_msg() API. |
| src/driver/amdxdna/amdxdna_mailbox.c | Implements xdna_mailbox_cancel_msg() by erasing from XArray and freeing the message. |
| src/driver/amdxdna/aie4_message.c | Mirrors NULL-handle guard in the AIE4 variant callback. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * with a dangling or NULL handle. Race-safe against a late response via | ||
| * the existing xa_erase ownership model. | ||
| */ | ||
| void xdna_mailbox_cancel_msg(struct mailbox_channel *mailbox_chann, int msg_id); |
| if (unlikely(!cb_arg)) | ||
| return 0; |
|
The driver under src/driver/amdxdna is obsolete now. The default driver is under drivers/accel/amdxdna. Could you give it a try and file a PR to fix it, if the same issue also exists in this one? |
A mailbox request whose firmware response never arrives times out in
xdna_send_msg_wait() (-ETIME) but was left outstanding in the channel's
chan_xa. On channel teardown (xdna_mailbox_stop_channel ->
mailbox_release_msg) every such message is flushed via
notify_cb(handle, NULL, 0). By then the handle is either a dangling
on-stack xdna_notify (synchronous waiters) or NULL (fire-and-forget
sends), and xdna_msg_cb() dereferences it unconditionally at out:
BUG: kernel NULL pointer dereference, address: 00000000000001ac
RIP: do_raw_spin_lock+0x8
complete+0x4b [&cb_arg->comp on a NULL cb_arg]
xdna_msg_cb [amdxdna]
xdna_mailbox_stop_channel [amdxdna]
aie2_hw_stop -> aie2_fini -> amdxdna_remove
i.e. an oops (or, for the dangling-stack case, a use-after-free write)
on module unload of a firmware-wedged device -- the exact recovery path
modprobe -r relies on.
Two complementary fixes:
1. Reclaim on timeout (root cause): add xdna_mailbox_cancel_msg(), and
call it from xdna_send_msg_wait() on -ETIME. xdna_mailbox_send_msg()
takes a const message and does not hand the firmware-assigned id back
to the caller, so the timeout site identifies the abandoned message
by the notify handle it was sent with. xdna_mailbox_cancel_msg()
walks chan_xa under xa_lock_irq, erases the matching entry, and frees
it. Holding the lock across the find-and-erase makes it race-safe
against a late response: whichever path erases the entry first owns
the free, and if a response beats the cancel, mailbox_get_resp()
finds the id already gone and bails. The abandoned message no longer
lingers to be flushed with an invalid handle.
2. Defense-in-depth: xdna_msg_cb() now returns early on a NULL handle
instead of dereferencing it at out:. This also covers fire-and-forget
sends (handle == NULL by construction) that are flushed on teardown
and never go through a timed wait.
With both, modprobe -r on a firmware-wedged NPU unloads cleanly instead
of oopsing.
Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox")
Signed-off-by: Jordan Fogel <jhfogel@gmail.com>
9979dc9 to
b1d58df
Compare
|
Done -- force-pushed The bug is identical there: Fix is the same two parts -- |
Fixes #1347.
A mailbox request whose firmware response never arrives times out in
xdna_send_msg_wait() (-ETIME) but is left outstanding in the channel's
chan_xa. On channel teardown, mailbox_release_msg() flushes it via
notify_cb(handle, NULL, 0) -- by then the handle is a dangling on-stack
xdna_notify or NULL, and xdna_msg_cb() dereferences it unconditionally
at out:, oopsing on module unload of a firmware-wedged device.
Two complementary fixes:
Reclaim on timeout (root cause): xdna_mailbox_cancel_msg() xa_erase's
the entry on -ETIME, using the existing xa_erase ownership model so
it is race-safe against a late response.
Defense-in-depth: xdna_msg_cb() / aie4_xdna_msg_cb() return early on
a NULL handle, which also covers fire-and-forget sends flushed on
teardown.
Validated on Phoenix/NPU1: modprobe -r on a firmware-wedged device
unloads cleanly instead of oopsing.