Possible missing compiler memory barriers in Cortex-M BASEPRI port: TX_NOT_INTERRUPTABLE tx_queue_send can resume receiver without delivered message #516

@Kairalite

Summary

The TX_PORT_USE_BASEPRI path does not appear to include a compiler memory barrier, whereas the PRIMASK path does.

I am seeing what looks like a ThreadX queue/resume race when TX_NOT_INTERRUPTABLE is enabled. I believe TX_NOT_INTERRUPTABLE makes the memory barrier issue easier to surface.

Under load, a thread blocked in tx_queue_receive(..., TX_WAIT_FOREVER) can resume even though no message was actually delivered into its receive buffer, and the message that the sender attempted to hand off never appears in the queue. In my application this causes the sender-owned block to leak, because cleanup depends on either successful dequeue by the receiver or immediate send failure handling by the sender.

The most reliable trigger in my system is CDC ACM traffic after a host handshake, but the failure itself looks like a generic ThreadX queue semantic failure rather than a USBX-specific problem. I do not yet have a standalone minimal reproducer outside the application.

Update / Likely Root Cause

I now suspect the deeper root cause is not tx_queue_send alone, but the Cortex-M GNU TX_PORT_USE_BASEPRI critical-section implementation.

In the Cortex-M33 GNU port, the PRIMASK path uses inline asm with a "memory" clobber, but the BASEPRI path writes BASEPRI without an equivalent compiler memory barrier. That means enabling TX_PORT_USE_BASEPRI may allow the compiler to reorder ordinary memory accesses across TX_DISABLE / TX_RESTORE, especially under -O3 and -flto.
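To illustrate the distinction, here is a minimal host-buildable sketch, not the port's actual source. The names `fake_basepri`, `COMPILER_BARRIER`, `basepri_disable`, `basepri_restore`, and `increment_shared` are hypothetical. On a real Cortex-M target the register accesses would be MRS/MSR BASEPRI instructions, and the sketch assumes GCC or Clang inline-asm syntax.

```c
#include <stdint.h>

/* Hypothetical stand-in for the BASEPRI register so the sketch builds on a
   host. On a Cortex-M target these accesses would be MRS/MSR instructions. */
static volatile uint32_t fake_basepri;

/* Compiler-only barrier: emits no instruction, but the "memory" clobber
   tells the compiler not to move ordinary loads/stores across this point. */
#define COMPILER_BARRIER()  __asm__ volatile ("" ::: "memory")

/* Raise the mask, then fence, so protected accesses cannot be hoisted
   above the point where interrupts are masked. */
static inline uint32_t basepri_disable(uint32_t mask)
{
    uint32_t previous = fake_basepri;
    fake_basepri = mask;
    COMPILER_BARRIER();
    return previous;
}

/* Fence, then restore the mask, so protected accesses cannot sink below
   the point where interrupts are unmasked. */
static inline void basepri_restore(uint32_t previous)
{
    COMPILER_BARRIER();
    fake_basepri = previous;
}

/* State that is only meant to be touched inside the critical section.
   It is deliberately non-volatile: without the barriers above, nothing
   would tie its accesses to the mask writes from the compiler's view. */
static uint32_t shared_counter;

uint32_t increment_shared(void)
{
    uint32_t saved = basepri_disable(16u);
    uint32_t value = ++shared_counter;
    basepri_restore(saved);
    return value;
}
```

Removing the two `COMPILER_BARRIER()` lines gives the shape I suspect in the BASEPRI path: the mask writes are still emitted in order relative to each other, but the compiler is free to move the `shared_counter` update outside them.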

I locally reintroduced compiler memory barriers around the BASEPRI critical-section path and:

  • the queue/message-loss issue disappeared;
  • the application image became slightly smaller rather than larger.

Based on this, I now believe the queue symptom is likely an effect of missing compiler barriers in the BASEPRI port, with TX_NOT_INTERRUPTABLE making the problem visible.

Environment

  • ThreadX 6.4.0
  • USBX 6.4.0
  • STM32H563RGVx / Cortex-M33
  • GNU Arm Embedded toolchain
  • -O3
  • -flto
  • Non-secure build
  • TX_NOT_INTERRUPTABLE
  • TX_PORT_USE_BASEPRI
  • TX_PORT_BASEPRI=16 (corresponds to priority level 1 on a system that implements 4 priority bits)
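The TX_PORT_BASEPRI arithmetic above can be checked with a small sketch. The function names are hypothetical, and it assumes the usual Cortex-M layout in which the implemented priority bits occupy the most significant bits of an 8-bit field.

```c
#include <stdint.h>

/* Convert a logical priority level into the 8-bit value written to BASEPRI.
   prio_bits is the number of implemented priority bits (assumed 1..8). */
uint8_t basepri_from_level(uint8_t level, uint8_t prio_bits)
{
    return (uint8_t)(level << (8u - prio_bits));
}

/* Inverse mapping: recover the logical level from a BASEPRI value. */
uint8_t level_from_basepri(uint8_t basepri, uint8_t prio_bits)
{
    return (uint8_t)(basepri >> (8u - prio_bits));
}
```

With 4 priority bits, `basepri_from_level(1, 4)` yields 16, so only priority level 0 stays unmasked, which matches the single unmasked interrupt described under the interrupt/preemption notes.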

Steps To Reproduce

My real application setup is:

  • One thread blocks on tx_queue_receive(&queue, &msg, TX_WAIT_FOREVER).
  • Multiple contexts repeatedly call tx_queue_send(&queue, &msg, TX_NO_WAIT).
  • The most reliable trigger is traffic originating from the USBX CDC ACM read callback along with another thread after a host handshake.
  • After the host connects and sends a handshake command, the device starts registering outbound traffic, including realtime telemetry messages, and the failure becomes reproducible.
  • It is the realtime telemetry message that gets dropped.

I realize this is not yet a minimal reproducer, but the observed failure mode appears consistent.

Actual Result

With TX_NOT_INTERRUPTABLE enabled:

  • the waiting thread can resume without a real message being placed into its receive buffer;
  • the message that was supposed to be delivered never appears in the queue;
  • the sender-side message block is leaked because the sender believes ownership was handed off;
  • disabling TX_NOT_INTERRUPTABLE makes the problem disappear in the same application;
  • enabling LTO makes the problem happen more often;
  • the number of spurious thread resumes corresponds one-to-one with the number of leaked messages.

Expected Result

  • If tx_queue_send resumes a suspended receiver, the receiver should always observe the message that was sent.
  • If the message is not delivered, the sender should not observe a successful handoff.
  • A blocked receiver should not be resumed spuriously in a way that loses the message.

Evidence And Invariants

Application-side evidence from this project:

  • The leaked blocks come from a block pool
  • The leaked block is typically one intended for an outbound realtime telemetry message
  • These leaked outbound blocks are never stamped with transmission metadata, which is only written after successful dequeue
  • That strongly suggests the leak happens before the suspended thread actually dequeues and processes the outbound message.
  • The sender failure path releases the block immediately on non-success from tx_queue_send, so the leak implies the send side believed the handoff succeeded.
  • The most reliable trigger path starts from the CDC ACM read callback
  • Other threads using similar application patterns appear healthy in the same firmware, including other threads that block on queues
  • Locally restoring compiler memory barriers in the BASEPRI critical-section path makes the issue disappear in the same firmware/configuration.
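The ownership reasoning in the list above can be expressed as a small accounting model. This is a hypothetical sketch of the invariant, not code from my application or from ThreadX. Every name in it (`block_audit_t`, `audit_on_alloc`, and so on) is invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters for blocks passed through one queue. */
typedef struct {
    uint32_t allocated;         /* blocks taken from the pool by senders    */
    uint32_t sender_released;   /* blocks freed by a sender after a failure */
    uint32_t receiver_stamped;  /* blocks dequeued and stamped by receiver  */
} block_audit_t;

void audit_on_alloc(block_audit_t *a) { a->allocated++; }

/* Sender rule: on any non-success result the sender frees the block itself;
   on success, ownership is assumed to have moved to the receiver. */
void audit_on_send_result(block_audit_t *a, bool send_succeeded)
{
    if (!send_succeeded) {
        a->sender_released++;
    }
}

void audit_on_dequeue(block_audit_t *a) { a->receiver_stamped++; }

/* With the queue empty and the system idle this should be zero. A non-zero
   value means a send reported success but the block was never dequeued. */
uint32_t audit_outstanding(const block_audit_t *a)
{
    return a->allocated - a->sender_released - a->receiver_stamped;
}

/* Walk through three cases: a normal delivery, a failed send, and the
   reported failure mode (success reported, no dequeue ever happens). */
uint32_t audit_demo_lost_handoff(void)
{
    block_audit_t audit = {0u, 0u, 0u};

    audit_on_alloc(&audit);                 /* normal delivery */
    audit_on_send_result(&audit, true);
    audit_on_dequeue(&audit);

    audit_on_alloc(&audit);                 /* failed send, sender frees */
    audit_on_send_result(&audit, false);

    audit_on_alloc(&audit);                 /* success reported, never seen */
    audit_on_send_result(&audit, true);

    return audit_outstanding(&audit);
}
```

In this model only the third case leaves a block outstanding, which is the signature I see: unstamped blocks whose sender never took the failure path.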

Interrupt/preemption notes:

  • This build uses TX_PORT_USE_BASEPRI with TX_PORT_BASEPRI=16.
  • In this configuration, only one interrupt remains above the ThreadX BASEPRI mask in my project
  • That interrupt does not call ThreadX services.

Suspected Area / Related Report

I originally suspected the TX_NOT_INTERRUPTABLE fast path in tx_queue_send, specifically the empty-queue / suspended-receiver handoff path. I now think that path may only be where the failure becomes visible.

The stronger suspect is the Cortex-M GNU TX_PORT_USE_BASEPRI port implementation, where the BASEPRI interrupt disable/restore helpers appear to lack the compiler memory barrier semantics present in the PRIMASK path. If so, normal memory operations in ThreadX critical sections may be reordered across TX_DISABLE / TX_RESTORE under optimization/LTO.

In ThreadX 6.4.0, Middlewares/ST/threadx/common/src/tx_queue_send.c has a path where it:

  • selects the suspended thread,
  • unlinks it from the queue suspension list,
  • clears the suspend cleanup pointer,
  • copies the message into thread_ptr->tx_thread_additional_suspend_info,
  • sets tx_thread_suspend_status = TX_SUCCESS,
  • then calls _tx_thread_system_ni_resume(thread_ptr) while interrupts are still disabled.

This may also be related to USBX, in the sense that USB interrupts that modify kernel state could be interfering with ThreadX services.
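The handoff steps listed above can be sketched as a simplified host-side model. This is not the ThreadX source: the `model_*` types, fields, and constants are hypothetical stand-ins that only mirror the order of operations, and the final "resume" is reduced to setting a flag.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SUCCESS 0u     /* stands in for TX_SUCCESS             */
#define MODEL_PENDING 0xFFu  /* hypothetical "still suspended" value */

typedef struct model_thread {
    struct model_thread *next_suspended;
    void (*suspend_cleanup)(struct model_thread *);
    void *additional_suspend_info;  /* receiver's destination buffer */
    uint32_t suspend_status;
    int resumed;
} model_thread_t;

typedef struct {
    model_thread_t *suspension_list;
    size_t message_bytes;
} model_queue_t;

/* Returns 1 if a suspended receiver was handed the message, else 0. */
int model_handoff(model_queue_t *queue, const void *message)
{
    model_thread_t *thread = queue->suspension_list;  /* select the thread  */
    if (thread == NULL) {
        return 0;
    }
    queue->suspension_list = thread->next_suspended;  /* unlink from list   */
    thread->next_suspended = NULL;
    thread->suspend_cleanup = NULL;                   /* clear cleanup      */
    memcpy(thread->additional_suspend_info, message,  /* copy the message   */
           queue->message_bytes);
    thread->suspend_status = MODEL_SUCCESS;           /* mark success       */
    thread->resumed = 1;                              /* stand-in: resume   */
    return 1;
}

/* Checks the expected post-condition: a resumed receiver sees the message. */
int model_handoff_demo(void)
{
    uint32_t sent = 0xA5A5A5A5u;
    uint32_t received = 0u;
    model_thread_t receiver = { NULL, NULL, &received, MODEL_PENDING, 0 };
    model_queue_t queue = { &receiver, sizeof sent };

    int handed_off = model_handoff(&queue, &sent);
    return handed_off && receiver.resumed && (received == sent)
        && (receiver.suspend_status == MODEL_SUCCESS)
        && (queue.suspension_list == NULL);
}
```

The model always satisfies the post-condition because it runs single-threaded. The point is only to show which stores must be complete before the receiver runs. If the compiler were allowed to move the message copy or the status write outside the interrupt-masked region, the receiver could plausibly observe the resume without the message.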

I am not claiming this is definitely the root cause, but this is the area that currently looks most suspicious.

This may be related to threadx#258, which also describes a suspended thread resuming from tx_queue_receive with an effectively empty message under load. I am only linking it as possibly related, not as proof of the same root cause.

With the barrier restored, the application image is slightly smaller, so the fix does not appear to be introducing a size regression in this build.

Questions For Maintainers

  • Is this a known issue in the TX_NOT_INTERRUPTABLE queue/resume path?
  • Is calling tx_queue_send(..., TX_NO_WAIT) from the USBX CDC ACM read-callback path an unsupported context assumption that could explain this behavior?
  • Does USBX code that runs from interrupt context modify kernel state in a way that could interfere with other ThreadX services?
  • If that callback context is supported, does the behavior above match any known race involving _tx_queue_send and _tx_thread_system_ni_resume?
  • Is there a recommended diagnostic or patch I should try next to confirm whether this is a ThreadX kernel bug?
  • Is the absence of a compiler memory barrier in the Cortex-M GNU TX_PORT_USE_BASEPRI path intentional?
  • Should the BASEPRI path provide the same compiler barrier semantics as the PRIMASK path for TX_DISABLE / TX_RESTORE?
