motion: make emcmotError fifo lockfree MPSC (fixes #3951) #3993
grandixximo wants to merge 2 commits into LinuxCNC:master from
|
I fixed rtapi/rtapi_atomic.h some time ago, when I fixed the hal streamer code, so that it is actually useful and compiles successfully. |
|
Thanks @BsAtHome. Switched to |
|
There does seem to be a problem with atomics in kernel compile mode... |
I think it was because of the include order; it has to be last. Edit: It was the order, passes now. |
|
You resolved the type declaration comment without saying why... So, why did you? |
I must have done it by mistake, wasn't intentional, I was wondering why it was compacted lol |
|
Some quick comments from my side, but I still have to look into it in detail. This atomic stuff is always complicated and can cause delays or deadlocks if not done properly. An RT thread should never spinlock, and even yield() can lock up because the CPU always runs the highest-priority thread unless priority inheritance is in place; the way it is done now, that does not look to be the case. |
|
Reworked. Dropped the centralized |
|
There is one thing we can assume for this queue which makes the algorithm simpler: So something like this might work, quick and dirty pseudo code: |
|
@hdiethelm thanks for the alternative design. Question: in your version the consumer drains only when |
|
I think the simpler the code the better, as long as it fulfills the purpose. The consumer doesn't really stall; it only skips the message until the next cycle. As long as there is no message storm, this should be no issue and happen about once a year. The producers are all threads; the consumer is emctaskmain, which consumes only one message per cycle. It can be an issue if no messages are printed any more because, for example, the stepper thread writes a message at 25kHz, but as long as the ring buffer is not overwritten when it is full, a message will be written again as soon as it is full. But this is not in my pseudo code. |
|
Ah but there is an edge case: If a producer task freezes or crashes mid write, the ring buffer is locked forever. |
|
There is a reason why lock-free queue algorithms have papers written about them; it's hard to get right: |
|
Thanks both. Pushed |
|
Agreed, lock-free is treacherous. Walked through the race cases on the current design before pushing: producer-producer concurrency, mid-write skip, buffer wraparound, and the memory-ordering on |
|
Thanks, looks good at first glance. Does read_seq need to be an atomic? emcmotErrorGet() is not thread safe anyway and emcmotErrorPutfv() only reads it. |
|
You have again that header ordering problem. Where does that happen? What is it that needs to be included before the rtapi_atomic.h header? |
emc_message_handler() can be invoked from any RT thread, making
emcmotErrorPutfv() a multi-producer path that races on num/start/end
against itself and against emcmotErrorGet() in the userspace task.
rtapi_mutex_get spins with sched_yield, unsafe in RT context (priority
inversion vs the non-RT consumer). rtapi_mutex_try would trade the
race for silent message drops on contention.
Replace the int counters with a lockfree MPSC ring using C11
<stdatomic.h> primitives via rtapi_atomic.h, matching the convention
used by hal_lib.c. Struct fields stay plain unsigned long long for
shmem layout compatibility with C++ TUs; access casts to _Atomic at
the use site.
- producers CAS-claim a slot on write_reserve, write payload, then
publish in seq order via release-store on write_commit;
- consumer compares read_seq against acquire-loaded write_commit.
Buffer-full path returns -1 (same bound as before). No sched_yield in
producer, no priority inversion, no silent drops outside the full case.
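
The "plain fields, atomic access at the use site" convention described in the commit message above can be sketched roughly as follows. This is only an illustration, assuming GCC or Clang on a platform where unsigned long long and its _Atomic counterpart share size and alignment; the struct name err_counters_t and the helpers load_acquire/store_release are hypothetical and not taken from the patch. The pointer cast is a compiler-specific convention, not something the C standard guarantees.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the shared counters; the real layout lives in motion.h. */
typedef struct {
    unsigned long long write_reserve;
    unsigned long long write_commit;
    unsigned long long read_seq;
} err_counters_t;

/* Assumption: the atomic type has the same size as the plain type. */
_Static_assert(sizeof(_Atomic unsigned long long) == sizeof(unsigned long long),
               "atomic and plain representations differ");

/* Acquire-load a plain field by casting to _Atomic at the use site. */
static inline unsigned long long load_acquire(unsigned long long *p)
{
    return atomic_load_explicit((_Atomic unsigned long long *)p,
                                memory_order_acquire);
}

/* Release-store to a plain field by casting to _Atomic at the use site. */
static inline void store_release(unsigned long long *p, unsigned long long v)
{
    atomic_store_explicit((_Atomic unsigned long long *)p, v,
                          memory_order_release);
}
```

Keeping the struct fields plain means a C++ translation unit can include the same header without needing a matching atomic type; only the C file that touches the counters needs the casts.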
|
@hdiethelm yes, @BsAtHome the conflict is Heading to sleep, will pick up tomorrow. |
Yes, that's true. It normally works assuming that the CPU can write the value in one op. But this assumption probably makes it UB when you look at it from a standards viewpoint. For testing: Do you have a machine with good real-time behavior? I got mine down to ~1us jitter as long as the GPU does not have a lot to do. Testing two scenarios would make sense:
|
|
Latency A/B on a non-RT N100 (no_isolcpus), 5 min each,
Same delta on servo sdev. Servo max delta is smaller on B (+11.4) than A (+17.7), so the atomic fifo is at worst neutral and shows tighter servo-max under stress. Box is non-isolated so absolute numbers are inflated; the relevant signal is the delta within each package. No regression. Concurrency test was added as a separate commit ( |
|
Nice, even with a test! :-) Just checked, the test fails on master, so that's good. Just a tiny thing: How do you do |
I made a modified version of latency-histogram with component proposed by Bertho here #3948 pushmsg.zip Then rebuild, and you should be able to spam messages with
I believe I fixed it |
8 producer pthreads pump 100k messages each; one consumer verifies
exactly-once delivery and per-producer sequence ordering. Builds
against the in-tree motion sources (emcmotutil.c, dbuf.c, stashf.c).
Run with:
scripts/runtests -p tests/lowlevel/emcmot-error-mpsc
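
The consumer-side check described above (exactly-once delivery plus per-producer ordering) could be sketched as below. This is not the code from the test directory: it assumes each message carries a producer id and a per-producer sequence number starting at 0, and that producers retry on buffer-full so nothing is dropped. The names checker_t, checker_accept and checker_complete are hypothetical.

```c
#include <stdbool.h>

#define N_PRODUCERS 8   /* matches the 8 producer threads mentioned above */

/* Next sequence number expected from each producer. */
typedef struct {
    unsigned long next[N_PRODUCERS];
} checker_t;

/* Accept a message only if it is the next one expected from its producer.
 * A duplicate, a gap or a reordering within one producer is rejected. */
static bool checker_accept(checker_t *c, unsigned producer, unsigned long seq)
{
    if (producer >= N_PRODUCERS)
        return false;
    if (seq != c->next[producer])
        return false;
    c->next[producer]++;
    return true;
}

/* After the run: every producer must have delivered exactly per_producer messages. */
static bool checker_complete(const checker_t *c, unsigned long per_producer)
{
    for (unsigned i = 0; i < N_PRODUCERS; i++)
        if (c->next[i] != per_producer)
            return false;
    return true;
}
```

If every message passes checker_accept and checker_complete holds at the end, each producer's stream arrived as 0, 1, 2, ... with no repeats or gaps, which is the exactly-once and ordering property for that run.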
|
My pushmsg was a quick-and-dirty hack. Your version is pushing messages at the rate of the thread because it is level triggered. That may be very inadequate. I suggest making a version that can be used generically, with load-time settable messages and an edge trigger. Maybe something like: dmsg - debug Could e.g. use personality to set the number of messages per level (to limit the loop). |
|
It was only used as a quick and dirty test to see if performance regressed, are you asking for a component to put in tree? Or you want me to run the test again with the more general component, but still keep it out of tree?? |
This pushmsg was just a dirty hack for testing. But it may be interesting to be able to push custom messages. There is one component, message, but that one is rather inflexible afaics. Need to think a bit about a generic message generator that is flexible. Most constructs are difficult, compute heavy or have problematic interfacing. Here a genuine KISS approach is sought after ;-) |
|
Got it, will leave it as a future separate PR. The dirty pushmsg stays a local test rig for this PR; the fifo race fix and concurrency test here don't depend on it. We should be good here right? We can work on pushmsg over at #4003 |
A KISS RT-to-non-RT message generator. Each rising edge on pushmsg.error-N, pushmsg.warning-N, pushmsg.info-N, or pushmsg.debug-N emits the corresponding string from the load-time emsg/wmsg/imsg/dmsg arrays through rtapi_print_msg(). Slots without a configured message are inert. 16 slots per level, singleton instance. Discussed with @BsAtHome in LinuxCNC#3993; replaces the inflexible existing message HAL component for diagnostic / test-rig use.
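
The rising-edge behaviour described above can be sketched as follows. This is a simplified illustration, not the component's source: level_t and level_poll are hypothetical names, and a real HAL component would pass the returned string to rtapi_print_msg() with the matching message level rather than return it.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_SLOTS 16   /* 16 slots per level, as described above */

/* State for one message level (error, warning, info or debug). */
typedef struct {
    const char *msg[N_SLOTS];  /* load-time configured strings; NULL means inert */
    bool prev[N_SLOTS];        /* pin value seen on the previous thread cycle */
} level_t;

/* Called once per thread cycle for each slot.
 * Returns the message to emit on a rising edge, or NULL otherwise. */
static const char *level_poll(level_t *lv, int slot, bool pin)
{
    bool rising = pin && !lv->prev[slot];
    lv->prev[slot] = pin;
    if (!rising || lv->msg[slot] == NULL)
        return NULL;
    return lv->msg[slot];
}
```

Because only the false-to-true transition emits, a pin held high produces one message instead of one per thread cycle, which is the difference from the level-triggered test rig discussed earlier.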
Fixes #3951.
cc @BsAtHome @hdiethelm (from the discussion in #3948 / #3951).
emc_message_handler(), installed by motmod at the end of rtapi_app_main, can be invoked from any RT thread. emcmotErrorPutfv() and emcmotErrorGet() share plain int counters with no synchronisation: concurrent RT producers race on num/start/end, and the RT producer also races num++ against the userspace consumer's num-- across processes.

rtapi_mutex_get is unsuitable here: it spins with sched_yield, which is priority inversion when an RT producer waits on the non-RT consumer (the header explicitly says it should not be used in realtime code). rtapi_mutex_try would convert the race into a silent message drop on contention.

This patch replaces the counters with a lockfree multi-producer single-consumer ring using GCC __atomic_* builtins on plain unsigned long long fields, so the layout is identical in the C producer (motmod) and the C++ consumer (milltask) sharing the shmem region.

Producers CAS-claim a sequence number on write_reserve, write the payload to slot S % EMCMOT_ERROR_NUM, then spin until write_commit == S and release-store S + 1. The in-order publish step ensures the consumer never sees a committed slot whose payload from a lower seq is still being written. The consumer compares read_seq to acquire-loaded write_commit, copies the slot, and release-stores read_seq + 1. Buffer-full returns -1, same bound as before.

Properties: RT-safe, no sched_yield, no priority inversion, MPSC-correct, no silent drops outside the buffer-full case, no API change.

Files touched:
- src/emc/motion/motion.h: struct fields replaced (head/tail/start/end/num removed; write_reserve/write_commit/read_seq added).
- src/emc/motion/emcmotutil.c: emcmotErrorInit/emcmotErrorPutfv/emcmotErrorGet rewritten.

This does not change when emc_message_handler is installed (#3948); that symptom is orthogonal and intentionally left for the discussion in #3948 to converge. The pushmsg component proposed by @BsAtHome in #3948 would be a useful test bed once available.

Tested: clean RIP build, no warnings.
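
The claim/publish scheme in the description above can be sketched in a few lines of C. This is a hedged illustration of the described algorithm, not the code in emcmotutil.c: ring_t, ring_put, ring_get, RING_SLOTS and SLOT_LEN are hypothetical names, the payload is reduced to a fixed-size string, and GCC or Clang __atomic_* builtins are assumed. Note that the in-order publish step spins while an earlier producer is still writing, which is the behaviour the review comments above go on to question.

```c
#include <string.h>

#define RING_SLOTS 8    /* hypothetical; stands in for EMCMOT_ERROR_NUM */
#define SLOT_LEN   64   /* hypothetical payload size */

typedef struct {
    unsigned long long write_reserve;  /* next sequence number to claim */
    unsigned long long write_commit;   /* every sequence below this is published */
    unsigned long long read_seq;       /* next sequence number to consume */
    char slot[RING_SLOTS][SLOT_LEN];
} ring_t;

static void ring_init(ring_t *r) { memset(r, 0, sizeof(*r)); }

/* Producer side: returns 0 on success, -1 when the ring is full. */
static int ring_put(ring_t *r, const char *msg)
{
    unsigned long long s;
    for (;;) {
        /* Load read_seq first so that s can never be behind it. */
        unsigned long long rd = __atomic_load_n(&r->read_seq, __ATOMIC_ACQUIRE);
        s = __atomic_load_n(&r->write_reserve, __ATOMIC_RELAXED);
        if (s - rd >= RING_SLOTS)
            return -1;                       /* full: report, do not overwrite */
        if (__atomic_compare_exchange_n(&r->write_reserve, &s, s + 1, 0,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
            break;                           /* sequence s is now ours */
    }
    strncpy(r->slot[s % RING_SLOTS], msg, SLOT_LEN - 1);
    r->slot[s % RING_SLOTS][SLOT_LEN - 1] = '\0';
    /* In-order publish: wait until all lower sequences are committed. */
    while (__atomic_load_n(&r->write_commit, __ATOMIC_ACQUIRE) != s)
        ;
    __atomic_store_n(&r->write_commit, s + 1, __ATOMIC_RELEASE);
    return 0;
}

/* Single consumer: returns 0 and fills out (SLOT_LEN bytes), or -1 when empty. */
static int ring_get(ring_t *r, char *out)
{
    unsigned long long rd = __atomic_load_n(&r->read_seq, __ATOMIC_RELAXED);
    unsigned long long wc = __atomic_load_n(&r->write_commit, __ATOMIC_ACQUIRE);
    if (rd == wc)
        return -1;
    memcpy(out, r->slot[rd % RING_SLOTS], SLOT_LEN);
    __atomic_store_n(&r->read_seq, rd + 1, __ATOMIC_RELEASE);
    return 0;
}
```

The release-store on write_commit pairs with the consumer's acquire-load, so a slot's payload is visible before its sequence number counts as committed; the release-store on read_seq pairs with the producer's acquire-load, so a slot is only reused after the consumer has finished copying it.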