Skip to content

Conversation

@sanvai01
Copy link

Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues.

Added loop unrolling for event to header conversions to improve
performance for single producer single consumer and basic queues.

Signed-off-by: Sanjyot Vaidya <Sanjyot.Vaidya@arm.com>
@odpbuild odpbuild changed the title linux-gen: queue: add loop unrolling for event conversions [PATCH v1] linux-gen: queue: add loop unrolling for event conversions Apr 30, 2025
event_index[i+2] = event_hdr[idx+2]->index.u32;
event_index[i+3] = event_hdr[idx+3]->index.u32;
}
switch (num & 0x3) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it is going to increase code size and add a few new branches (although I did not look at any generated assembly). It is not clear to me that the change is a net positive in actual applications and not just in microbenchmarks that presumably always enqueue long bursts. I believe some ODP apps have high instruction cache pressure.

Another possible optimization might be to get rid of the whole header-pointer-to-index conversion and just store the pointers in rings (but that would then use more cache lines in the rings, so I do not know if that is good or not).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, cache pressure is something we need to understand. Is there any benchmarking application (not a micro benchmark like we used here) that can be executed to confirm the behaviour?

switch (num & 0x3) {
case 3:
event_index[i++] = event_hdr[idx++]->index.u32;
__attribute__((fallthrough));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Project convention has been to use simply /* Fallthrough */ comments. i386 build seems to have issues with these attributes.

switch (num & 0x3) {
case 3:
event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary newline.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing this I noticed that now we could have separate optimized implementations for single and multi event enq/deq functions (e.g. odp_queue_enq(), odp_queue_enq_multi()). Currently, the single event variants e.g. plain_queue_enq()) share the same code and I measured some performance penalty from these changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this makes sense.

}

static inline void event_index_from_hdr(uint32_t event_index[],
_odp_event_hdr_t *event_hdr[], int num)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since event_index_from_hdr() and event_index_to_hdr() functions are identical for basic and spsc queues, their code could be moved to platform/linux-generic/include/odp_event_internal.h. The convention has been to prefix implementation internal functions with _odp, so _odp_event_index_from_hdr() and _odp_event_index_to_hdr().

@MatiasElo
Copy link
Collaborator

After this commit this PR seems no longer necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants