comparison for load/store vs loadu/storeu #21
Malkovsky
left a comment
Missing core functionality
src/alignment_comparison.cpp
Outdated
#ifdef PIXIE_AVX512_SUPPORT
static void BM_Loadu512(benchmark::State& state) {
We discussed that we want to compare:
-- Aligned store/load
-- Aligned storeu/loadu
-- Unaligned storeu/loadu within a single 64-byte block (for <=256-bit registers)
-- Unaligned storeu/loadu crossing a 64-byte block border
-- (optional) test that unaligned store/load crashes
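To make the cases above concrete, here is a minimal sketch of how a byte offset could be classified. The helper names `crosses_block` and `is_aligned` and the constant `kBlock` are hypothetical (they are not from the PR), offsets are assumed to be measured from a 64-byte-aligned base, and `width` is assumed to be non-zero.

```cpp
#include <cstddef>

// Assumed block (cache line) size in bytes, matching the 64-byte blocks
// mentioned in the review comments.
constexpr std::size_t kBlock = 64;

// Hypothetical helper: true if an access of `width` bytes starting at byte
// `offset` (relative to a 64-byte-aligned base) touches more than one block.
// Assumes width > 0.
constexpr bool crosses_block(std::size_t offset, std::size_t width) {
    return offset / kBlock != (offset + width - 1) / kBlock;
}

// Hypothetical helper: true if `offset` is a multiple of the access width,
// which is what the aligned load/store instructions require.
constexpr bool is_aligned(std::size_t offset, std::size_t width) {
    return offset % width == 0;
}
```

With these definitions, a 32-byte access at offset 1 stays inside one block, the same access at offset 33 crosses a border, and any unaligned 64-byte access always crosses one, which is why the "within a single block" case only applies to registers of 256 bits or fewer.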
I also think the benchmarks would be better organized so that each store/load goes to a (pseudo-)random address within arrays of different sizes, so that we also see the impact of cache misses. In particular, we definitely expect some degradation for a store/load at an address that crosses a 64-byte block border.
For example, an unaligned storeu/loadu within a 64-byte block should look something like this:
alignas(64) uint8_t data[64 * n];
for (auto _ : state) {
    // Offset 1 inside a randomly chosen 64-byte block: the 32-byte load
    // covers bytes 1..32 of that block and never crosses its border.
    const __m256i* ptr = reinterpret_cast<const __m256i*>(data + 1 + 64 * (rng() % (n - 1)));
    benchmark::DoNotOptimize(_mm256_loadu_si256(ptr));
}
Note that the rng() call and the % might be heavy in this context.
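One possible way to keep the rng() call and the % out of the timed loop is to precompute the offsets before the loop starts. The sketch below is only an illustration: `make_offsets` and its parameters are hypothetical names, not part of the PR, and it assumes `blocks > 0`.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: builds `count` byte offsets of the form
// shift + 64 * block, with block drawn uniformly from [0, blocks).
// The seed defaults to 42 only to mirror the fixed seed used in the PR.
std::vector<std::size_t> make_offsets(std::size_t blocks, std::size_t shift,
                                      std::size_t count,
                                      std::uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, blocks - 1);
    std::vector<std::size_t> offsets(count);
    for (auto& off : offsets) {
        off = shift + 64 * pick(rng);  // same shift in every chosen block
    }
    return offsets;
}
```

The timed loop would then only read the next entry from the table. Note that the table itself occupies cache, so it should be kept small relative to the data array, and the buffer needs one extra 64-byte block when `shift` makes the access cross a block border.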
…ixie into alignment-comparison
src/alignment_comparison.cpp
Outdated
std::mt19937_64 rng(42);
for (auto _ : state) {
    size_t idx = 64 * (rng() % (n - 1));
It is probably better to make n = 2^k + 1 and use rng() & ((1 << k) - 1): since n - 1 = 2^k, the mask selects the same index as rng() % (n - 1) but avoids the division.
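A small sketch of the mask trick, assuming unsigned 64-bit random values and k < 64. The function name `block_index` is hypothetical; it uses a 64-bit shift so that larger k does not overflow an int.

```cpp
#include <cstdint>

// Hypothetical helper: maps a raw random value to a block index in
// [0, 2^k) with a bit mask. For unsigned r this equals r % (1 << k),
// but needs no division. Assumes k < 64.
constexpr std::uint64_t block_index(std::uint64_t r, unsigned k) {
    return r & ((std::uint64_t{1} << k) - 1);
}
```

In the benchmark this would replace rng() % (n - 1) with block_index(rng(), k) for n = 2^k + 1.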
The following instructions were compared:
No obvious differences in execution time were observed.
The AVX-512 analogs still need to be compared as well.