Description
Summary
The TX_PORT_USE_BASEPRI path does not appear to include a compiler memory barrier, whereas the PRIMASK path does.
I am seeing what looks like a ThreadX queue/resume race when TX_NOT_INTERRUPTABLE is enabled. I believe TX_NOT_INTERRUPTABLE makes the memory barrier issue easier to surface.
Under load, a thread blocked in tx_queue_receive(..., TX_WAIT_FOREVER) can resume even though no message was actually delivered into its receive buffer, and the message that the sender attempted to hand off never appears in the queue. In my application this causes the sender-owned block to leak, because cleanup depends on either successful dequeue by the receiver or immediate send failure handling by the sender.
The most reliable trigger in my system is CDC ACM traffic after a host handshake, but the failure itself looks like a generic ThreadX queue semantic failure rather than a USBX-specific problem. I do not yet have a standalone minimal reproducer outside the application.
Update / Likely Root Cause
I now suspect the deeper root cause is not tx_queue_send alone, but the Cortex-M GNU TX_PORT_USE_BASEPRI critical-section implementation.
In the Cortex-M33 GNU port, the PRIMASK path uses inline asm with a "memory" clobber, but the BASEPRI path writes BASEPRI without an equivalent compiler memory barrier. That means enabling TX_PORT_USE_BASEPRI may allow the compiler to reorder ordinary memory accesses across TX_DISABLE / TX_RESTORE, especially under -O3 and -flto.
I locally reintroduced compiler memory barriers around the BASEPRI critical-section path and:
- the queue/message-loss issue disappeared;
- the application image became slightly smaller rather than larger.
Based on this, I now believe the queue symptom is likely an effect of missing compiler barriers in the BASEPRI port, with TX_NOT_INTERRUPTABLE making the problem visible.
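To make the kind of barrier I mean concrete, here is a minimal sketch and not the actual port code. All sketch_* names and the SKETCH_TARGET_HAS_BASEPRI switch are hypothetical. The point is the "memory" clobber on the BASEPRI accesses, which tells the compiler not to move ordinary loads and stores across them. It is a compiler barrier only, not a DSB/ISB hardware barrier. Without the hypothetical switch defined, the block falls back to a host stand-in so the ordering logic can be built off-target.

```c
#include <stdint.h>

/* Hypothetical mask value mirroring TX_PORT_BASEPRI=16 from this report. */
#define SKETCH_BASEPRI_MASK 16u

#ifdef SKETCH_TARGET_HAS_BASEPRI   /* hypothetical switch: define on a Cortex-M with BASEPRI */

static inline uint32_t sketch_read_basepri(void)
{
    uint32_t value;
    /* "memory" clobber: the compiler may not reorder memory accesses across this. */
    __asm__ volatile ("MRS %0, BASEPRI" : "=r" (value) : : "memory");
    return value;
}

static inline void sketch_write_basepri(uint32_t value)
{
    __asm__ volatile ("MSR BASEPRI, %0" : : "r" (value) : "memory");
}

#else /* host stand-in: a plain variable plays the role of the BASEPRI register */

static volatile uint32_t sketch_fake_basepri;

static inline uint32_t sketch_read_basepri(void)
{
    uint32_t value = sketch_fake_basepri;
    __asm__ volatile ("" : : : "memory");   /* compiler barrier only */
    return value;
}

static inline void sketch_write_basepri(uint32_t value)
{
    __asm__ volatile ("" : : : "memory");
    sketch_fake_basepri = value;
    __asm__ volatile ("" : : : "memory");
}

#endif

/* Raise the mask and return the previous value (the role TX_DISABLE plays). */
static inline uint32_t sketch_disable(void)
{
    uint32_t previous = sketch_read_basepri();
    sketch_write_basepri(SKETCH_BASEPRI_MASK);
    return previous;
}

/* Put back the saved mask (the role TX_RESTORE plays). */
static inline void sketch_restore(uint32_t previous)
{
    sketch_write_basepri(previous);
}
```

With the clobber in place, stores made inside the critical section (for example the message copy and the suspend status write) cannot be sunk below sketch_restore or hoisted above sketch_disable by the compiler.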
Environment
- ThreadX 6.4.0
- USBX 6.4.0
- STM32H563RGVx / Cortex-M33
- GNU Arm Embedded toolchain
- -O3
- -flto
- Non-secure build
- TX_NOT_INTERRUPTABLE
- TX_PORT_USE_BASEPRI
- TX_PORT_BASEPRI=16 (corresponds to a BASEPRI priority level of 1 on a system that implements 4 priority bits)
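For reference, the mapping from TX_PORT_BASEPRI=16 to priority level 1 follows from Cortex-M keeping the implemented priority bits in the most significant bits of an 8-bit field. A small sketch of that arithmetic, where sketch_basepri_from_level is a hypothetical helper and not part of ThreadX or CMSIS:

```c
#include <stdint.h>

/* Convert a logical priority level into the raw 8-bit BASEPRI value,
 * given how many priority bits the device implements (4 on this part).
 * Hypothetical helper for illustration only. */
static inline uint32_t sketch_basepri_from_level(uint32_t level, uint32_t prio_bits)
{
    return (level << (8u - prio_bits)) & 0xFFu;
}
```

With 4 implemented bits, level 1 shifts left by 4 and gives 0x10, which is the 16 used in this build.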
Steps To Reproduce
My real application setup is:
- One thread blocks on tx_queue_receive(&queue, &msg, TX_WAIT_FOREVER).
- Multiple contexts repeatedly call tx_queue_send(&queue, &msg, TX_NO_WAIT).
- The most reliable trigger is traffic originating from the USBX CDC ACM read callback, along with another thread, after a host handshake.
- After the host connects and sends a handshake command, the device starts registering outbound traffic, including realtime telemetry messages, and the failure becomes reproducible.
- It is the realtime telemetry message that gets dropped.
I realize this is not yet a minimal reproducer, but the observed failure mode appears consistent.
Actual Result
With TX_NOT_INTERRUPTABLE enabled:
- the waiting thread can resume without a real message being placed into its receive buffer;
- the message that was supposed to be delivered never appears in the queue;
- the sender-side message block is leaked because the sender believes ownership was handed off;
- disabling TX_NOT_INTERRUPTABLE makes the problem disappear in the same application;
- enabling LTO makes the problem happen more often;
- the number of spurious thread resumes corresponds 1 to 1 with the number of messages leaked.
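One possible way to count such spurious resumes is a sentinel check around the receive call. This is a sketch under assumptions, not necessarily the instrumentation used in this application: it assumes a one-word message (for example a block pointer) and that no valid message ever equals the sentinel. The sketch_* names and the sentinel value are hypothetical. The ThreadX calls appear only in a comment so the block builds on a host.

```c
#include <stdint.h>

#define SKETCH_SENTINEL        ((uintptr_t)0xA5A5A5A5u)  /* hypothetical marker value */
#define SKETCH_STATUS_SUCCESS  0u                        /* TX_SUCCESS is 0x00 */

/* Number of resumes that reported success but left the buffer untouched. */
static uint32_t sketch_spurious_resumes;

/* Returns 1 and bumps the counter when the receive call reported success
 * but the destination still holds the sentinel written before the call. */
static inline int sketch_check_receive(unsigned int status, uintptr_t received)
{
    if (status == SKETCH_STATUS_SUCCESS && received == SKETCH_SENTINEL) {
        sketch_spurious_resumes++;
        return 1;
    }
    return 0;
}

/* Intended use in the receiver thread:
 *
 *   ULONG msg = (ULONG)SKETCH_SENTINEL;
 *   UINT status = tx_queue_receive(&queue, &msg, TX_WAIT_FOREVER);
 *   if (sketch_check_receive(status, (uintptr_t)msg)) {
 *       // resumed without a delivered message
 *   }
 */
```

Comparing this counter against the number of blocks missing from the pool is how the 1 to 1 relationship above can be checked.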
Expected Result
- If tx_queue_send resumes a suspended receiver, the receiver should always observe the message that was sent.
- If the message is not delivered, the sender should not observe a successful handoff.
- A blocked receiver should not be resumed spuriously in a way that loses the message.
Evidence And Invariants
Application-side evidence from this project:
- The leaked blocks come from a block pool.
- The leaked block is typically one intended for an outbound realtime telemetry message.
- These leaked outbound blocks are never stamped with transmission metadata, which is only written after successful dequeue.
- That strongly suggests the leak happens before the suspended thread actually dequeues and processes the outbound message.
- The sender failure path releases the block immediately on non-success from tx_queue_send, so the leak implies the send side believed the handoff succeeded.
- The most reliable trigger path starts from the CDC ACM read callback.
- Other threads using similar application patterns appear healthy in the same firmware, including other threads that block on queues.
- Locally restoring compiler memory barriers in the BASEPRI critical-section path makes the issue disappear in the same firmware/configuration.
Interrupt/preemption notes:
- This build uses TX_PORT_USE_BASEPRI with TX_PORT_BASEPRI=16.
- In this configuration, only one interrupt remains above the ThreadX BASEPRI mask in my project.
- That interrupt does not call ThreadX services.
Suspected Area / Related Report
I originally suspected the TX_NOT_INTERRUPTABLE fast path in tx_queue_send, specifically the empty-queue / suspended-receiver handoff path. I now think that path may only be where the failure becomes visible.
The stronger suspect is the Cortex-M GNU TX_PORT_USE_BASEPRI port implementation, where the BASEPRI interrupt disable/restore helpers appear to lack the compiler memory barrier semantics present in the PRIMASK path. If so, normal memory operations in ThreadX critical sections may be reordered across TX_DISABLE / TX_RESTORE under optimization/LTO.
In ThreadX 6.4.0, Middlewares/ST/threadx/common/src/tx_queue_send.c has a path where it:
- selects the suspended thread,
- unlinks it from the queue suspension list,
- clears the suspend cleanup pointer,
- copies the message into thread_ptr->tx_thread_additional_suspend_info,
- sets tx_thread_suspend_status = TX_SUCCESS,
- then calls _tx_thread_system_ni_resume(thread_ptr) while interrupts are still disabled.
This may also be related to USBX, in the sense that USB interrupts that modify kernel state may be interfering with ThreadX services.
I am not claiming this is definitely the root cause, but this is the area that currently looks most suspicious.
This may be related to threadx#258, which also describes a suspended thread resuming from tx_queue_receive with an effectively empty message under load. I am only linking it as possibly related, not as proof of the same root cause.
With the barrier restored, the application image is slightly smaller, so the fix does not appear to be introducing a size regression in this build.
Questions For Maintainers
- Is this a known issue in the TX_NOT_INTERRUPTABLE queue/resume path?
- Is calling tx_queue_send(..., TX_NO_WAIT) from the USBX CDC ACM read-callback path an unsupported context assumption that could explain this behavior?
- Does USBX code that runs from interrupt context modify kernel state in such a way as to interfere with other ThreadX services?
- If that callback context is supported, does the behavior above match any known race involving _tx_queue_send and _tx_thread_system_ni_resume?
- Is there a recommended diagnostic or patch I should try next to confirm whether this is a ThreadX kernel bug?
- Is the absence of a compiler memory barrier in the Cortex-M GNU TX_PORT_USE_BASEPRI path intentional?
- Should the BASEPRI path provide the same compiler barrier semantics as the PRIMASK path for TX_DISABLE / TX_RESTORE?