Collaborator

@mperikov mperikov commented Dec 2, 2025

The following instructions were compared:

  1. _mm256_loadu_si256 vs _mm256_load_si256
  2. _mm256_storeu_si256 vs _mm256_store_si256

No obvious difference in execution time was observed.

The AVX-512 analogs still need to be compared.

Owner

@Malkovsky Malkovsky left a comment

Missing core functionality


#ifdef PIXIE_AVX512_SUPPORT

static void BM_Loadu512(benchmark::State& state) {
Owner

@Malkovsky Malkovsky Dec 9, 2025

We discussed that we want to compare:
-- store/load on an aligned address
-- storeu/loadu on an aligned address
-- storeu/loadu on an unaligned address within a single 64-byte block (for <=256-bit registers)
-- storeu/loadu on an unaligned address that crosses a 64-byte block border
-- (optional) a test that store/load on an unaligned address crashes
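
The cases in this list differ only in where the access starts relative to a 64-byte boundary. A minimal sketch of that classification, assuming a 64-byte cache line; the helper names `is_aligned` and `crosses_block_border` are hypothetical and not part of this PR:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed block size; 64 bytes matches the block size discussed above.
constexpr std::size_t kBlockSize = 64;

// True if `addr` is a multiple of `width`, which is what the aligned
// load/store intrinsics require (`width` is the register size in bytes).
constexpr bool is_aligned(std::uintptr_t addr, std::size_t width) {
  return addr % width == 0;
}

// True if the `width` bytes starting at `addr` touch two 64-byte blocks.
constexpr bool crosses_block_border(std::uintptr_t addr, std::size_t width) {
  return addr / kBlockSize != (addr + width - 1) / kBlockSize;
}
```

For a 32-byte (256-bit) access, offset 1 in a block stays inside that block, while offset 33 spills into the next one. A 64-byte (512-bit) access crosses a border at any unaligned offset, which is why the "within a single block" case only applies to registers of 256 bits or less.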

Owner

I also think the benchmarks would be better organized so that each store/load goes to a pseudo-random address in an array, with arrays of several sizes, so that we also see the impact of cache misses. In particular, we definitely expect some degradation for a store/load on an address that crosses a 64-byte block border.

Owner

@Malkovsky Malkovsky Dec 9, 2025

For example, unaligned storeu/loadu within a 64-byte block should look something like:

  alignas(64) uint8_t data[64 * n];

  for (auto _ : state) {
    // The stride is 64 so that offset 1 always lands inside a single 64-byte block;
    // a stride of 32 would put every other load at offset 33, crossing the border.
    const __m256i* ptr = reinterpret_cast<const __m256i*>(data + 1 + 64 * (rng() % (n - 1)));
    benchmark::DoNotOptimize(_mm256_loadu_si256(ptr));
  }

Note that the rng() call and the % might be heavy in this context.
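
One way to keep the rng() call and the % out of the timed loop is to precompute the offsets before the benchmark starts and only index into a table while timing. A minimal sketch, assuming a buffer of `n` 64-byte blocks; `make_offsets` and its parameters are hypothetical names, not code from this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Builds `count` byte offsets into a buffer of `n` 64-byte blocks.
// Each offset is 64 * block + shift, so shift = 1 keeps a 32-byte access
// inside one block and shift = 33 makes it cross a block border.
// The last block is never chosen as a start, so an access of up to
// 64 bytes stays inside the buffer.
std::vector<std::size_t> make_offsets(std::size_t n, std::size_t shift,
                                      std::size_t count) {
  assert(n >= 2 && shift < 64);
  std::mt19937_64 rng(42);  // fixed seed for repeatable runs
  std::vector<std::size_t> offsets(count);
  for (std::size_t& off : offsets) {
    off = 64 * static_cast<std::size_t>(rng() % (n - 1)) + shift;
  }
  return offsets;
}
```

The timed loop would then read `offsets[i]` and advance `i`. The table itself adds memory traffic, so it is probably best kept small enough to stay in cache; whether that overhead is lower than calling rng() inline would need to be measured.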

  std::mt19937_64 rng(42);

  for (auto _ : state) {
    size_t idx = 64 * (rng() % (n - 1));
Owner

It is probably better to make n = 2^k + 1 and compute rng() & ((1 << k) - 1), so that the % (n - 1) becomes a cheap bit mask.
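
With n = 2^k + 1, n - 1 is a power of two, so the modulo and the mask yield the same index. A small sketch of that equivalence; `block_index` is a hypothetical helper name, not code from this PR:

```cpp
#include <cassert>
#include <cstdint>

// Returns r mod 2^k using a mask instead of %.
// With n = 2^k + 1 this equals r % (n - 1). Valid for k < 64.
inline std::uint64_t block_index(std::uint64_t r, unsigned k) {
  assert(k < 64);
  // A 64-bit one avoids overflowing a plain int shift when k >= 31.
  return r & ((std::uint64_t{1} << k) - 1);
}
```

The byte offset in the benchmark would then be `64 * block_index(rng(), k)`, which covers the same range of blocks as `64 * (rng() % (n - 1))`.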

@Malkovsky Malkovsky merged commit 96e8966 into main Dec 18, 2025
5 checks passed