-
Notifications
You must be signed in to change notification settings - Fork 140
[PATCH v1] linux-gen: queue: add loop unrolling for event conversions #2211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues. Signed-off-by: Sanjyot Vaidya <Sanjyot.Vaidya@arm.com>
| event_index[i+2] = event_hdr[idx+2]->index.u32; | ||
| event_index[i+3] = event_hdr[idx+3]->index.u32; | ||
| } | ||
| switch (num & 0x3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it is going to increase code size and add a few new branches (although I did not look at any generated assembly). It is not clear to me that the change is a net positive in actual applications and not just in microbenchmarks that presumably always enqueue long bursts. I believe some ODP apps have high instruction cache pressure.
Another possible optimization might be to get rid of the whole header-pointer-to-index conversion and just store the pointers in rings (but that would then use more cache lines in the rings, so I do not know if that is good or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, cache pressure is something we need to understand. Is there any benchmarking application (not a micro benchmark like we used here) that can be executed to confirm the behaviour?
| switch (num & 0x3) { | ||
| case 3: | ||
| event_index[i++] = event_hdr[idx++]->index.u32; | ||
| __attribute__((fallthrough)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Project convention has been to use simply /* Fallthrough */ comments. i386 build seems to have issues with these attributes.
| switch (num & 0x3) { | ||
| case 3: | ||
| event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While testing this I noticed that now we could have separate optimized implementations for single and multi event enq/deq functions (e.g. odp_queue_enq(), odp_queue_enq_multi()). Currently, the single event variants e.g. plain_queue_enq()) share the same code and I measured some performance penalty from these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this makes sense.
| } | ||
|
|
||
| static inline void event_index_from_hdr(uint32_t event_index[], | ||
| _odp_event_hdr_t *event_hdr[], int num) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since event_index_from_hdr() and event_index_to_hdr() functions are identical for basic and spsc queues, their code could be moved to platform/linux-generic/include/odp_event_internal.h. The convention has been to prefix implementation internal functions with _odp, so _odp_event_index_from_hdr() and _odp_event_index_to_hdr().
|
After this commit this PR seems no longer necessary. |
Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues.