Skip to content

amdxdna: fix NULL-deref / UAF flushing timed-out mailbox messages#1348

Open
FIM43-Redeye wants to merge 1 commit into
amd:mainfrom
FIM43-Redeye:fix/mailbox-timeout-msg-uaf
Open

amdxdna: fix NULL-deref / UAF flushing timed-out mailbox messages#1348
FIM43-Redeye wants to merge 1 commit into
amd:mainfrom
FIM43-Redeye:fix/mailbox-timeout-msg-uaf

Conversation

@FIM43-Redeye
Copy link
Copy Markdown
Contributor

Fixes #1347.

A mailbox request whose firmware response never arrives times out in
xdna_send_msg_wait() (-ETIME) but is left outstanding in the channel's
chan_xa. On channel teardown, mailbox_release_msg() flushes it via
notify_cb(handle, NULL, 0) -- by then the handle is a dangling on-stack
xdna_notify or NULL, and xdna_msg_cb() dereferences it unconditionally
at out:, oopsing on module unload of a firmware-wedged device.

Two complementary fixes:

  1. Reclaim on timeout (root cause): xdna_mailbox_cancel_msg() xa_erase's
    the entry on -ETIME, using the existing xa_erase ownership model so
    it is race-safe against a late response.

  2. Defense-in-depth: xdna_msg_cb() / aie4_xdna_msg_cb() return early on
    a NULL handle, which also covers fire-and-forget sends flushed on
    teardown.

Validated on Phoenix/NPU1: modprobe -r on a firmware-wedged device
unloads cleanly instead of oopsing.

Copilot AI review requested due to automatic review settings May 21, 2026 03:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens the AMD XDNA mailbox message lifecycle to prevent NULL/UAF dereferences during channel teardown/unload, especially when firmware is wedged and requests time out.

Changes:

  • Add NULL-handle guards in mailbox notify callbacks to avoid dereferencing missing completion state during teardown flushes.
  • Reclaim timed-out outstanding messages to prevent teardown from later flushing them with dangling/on-stack handles.
  • Introduce a new public helper API to cancel an outstanding mailbox message by id.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/driver/amdxdna/amdxdna_mailbox_helper.c Adds NULL-handle guard in callback and cancels outstanding msg on timeout to prevent teardown UAF/NULL-deref.
src/driver/amdxdna/amdxdna_mailbox.h Documents and declares new xdna_mailbox_cancel_msg() API.
src/driver/amdxdna/amdxdna_mailbox.c Implements xdna_mailbox_cancel_msg() by erasing from XArray and freeing the message.
src/driver/amdxdna/aie4_message.c Mirrors NULL-handle guard in the AIE4 variant callback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/driver/amdxdna/amdxdna_mailbox.h Outdated
* with a dangling or NULL handle. Race-safe against a late response via
* the existing xa_erase ownership model.
*/
void xdna_mailbox_cancel_msg(struct mailbox_channel *mailbox_chann, int msg_id);
Comment on lines +36 to +37
if (unlikely(!cb_arg))
return 0;
@maxzhen
Copy link
Copy Markdown
Collaborator

maxzhen commented May 21, 2026

The driver under src/driver/amdxdna is obsolete now. The default driver is under drivers/accel/amdxdna. Could you give it a try and file a PR to fix it, if the same issue also exists in this one?

A mailbox request whose firmware response never arrives times out in
xdna_send_msg_wait() (-ETIME) but was left outstanding in the channel's
chan_xa. On channel teardown (xdna_mailbox_stop_channel ->
mailbox_release_msg) every such message is flushed via
notify_cb(handle, NULL, 0). By then the handle is either a dangling
on-stack xdna_notify (synchronous waiters) or NULL (fire-and-forget
sends), and xdna_msg_cb() dereferences it unconditionally at out:

  BUG: kernel NULL pointer dereference, address: 00000000000001ac
  RIP: do_raw_spin_lock+0x8
   complete+0x4b   [&cb_arg->comp on a NULL cb_arg]
   xdna_msg_cb     [amdxdna]
   xdna_mailbox_stop_channel [amdxdna]
   aie2_hw_stop -> aie2_fini -> amdxdna_remove

i.e. an oops (or, for the dangling-stack case, a use-after-free write)
on module unload of a firmware-wedged device -- the exact recovery path
modprobe -r relies on.

Two complementary fixes:

1. Reclaim on timeout (root cause): add xdna_mailbox_cancel_msg(), and
   call it from xdna_send_msg_wait() on -ETIME. xdna_mailbox_send_msg()
   takes a const message and does not hand the firmware-assigned id back
   to the caller, so the timeout site identifies the abandoned message
   by the notify handle it was sent with. xdna_mailbox_cancel_msg()
   walks chan_xa under xa_lock_irq, erases the matching entry, and frees
   it. Holding the lock across the find-and-erase makes it race-safe
   against a late response: whichever path erases the entry first owns
   the free, and if a response beats the cancel, mailbox_get_resp()
   finds the id already gone and bails. The abandoned message no longer
   lingers to be flushed with an invalid handle.

2. Defense-in-depth: xdna_msg_cb() now returns early on a NULL handle
   instead of dereferencing it at out:. This also covers fire-and-forget
   sends (handle == NULL by construction) that are flushed on teardown
   and never go through a timed wait.

With both, modprobe -r on a firmware-wedged NPU unloads cleanly instead
of oopsing.

Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox")
Signed-off-by: Jordan Fogel <jhfogel@gmail.com>
@FIM43-Redeye FIM43-Redeye force-pushed the fix/mailbox-timeout-msg-uaf branch from 9979dc9 to b1d58df Compare May 22, 2026 03:57
@FIM43-Redeye
Copy link
Copy Markdown
Contributor Author

Done -- force-pushed b1d58df, the PR now targets drivers/accel/amdxdna.

The bug is identical there: xdna_send_msg_wait() returns -ETIME without reclaiming the timed-out message from chan_xa, and xdna_msg_cb() then dereferences the handle unconditionally at out:. Same introducing commit (b87f920b9344 "accel/amdxdna: Support hardware mailbox"), still unfixed at mainline HEAD.

Fix is the same two parts -- xdna_mailbox_cancel_msg() called from the timeout path to reclaim the message under xa_lock_irq, plus a NULL-handle early return in xdna_msg_cb() for the fire-and-forget teardown case. Applies clean to main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NULL-deref / use-after-free flushing timed-out mailbox messages on channel teardown

3 participants