Supposing your hash values are uniformly distributed "random" values in the range of 0..2**64-1, you can skip the modulo and instead use random-number scaling techniques to distribute the result (the bucket number) evenly throughout that space. I don't think there's any reason to stick with modulo, is there?
The quickest-and-dirtiest of these would be:
size_t bucket(uint64_t hash) {
return uint64_t(uint128_t(hash) * K >> 64);
}
But if the incoming hashes are not already uniform then try this one weird trick:
size_t bucket(uint64_t hash) {
crc32(hash) * K >> 32;
}
(which only works when K is less than 2**32), and you'll have to find the right intrinsic for the target to get the hardware-accelerated CRC instruction)
This one is more thorough:
static inline uint64_t murmurmix64(uint64_t h) {
h ^= h >> 33;
h *= 0xff51afd7ed558ccdULL;
h ^= h >> 33;
h *= 0xc4ceb9fe1a85ec53ULL;
h ^= h >> 33;
return h;
}
size_t bucket(uint64_t hash) {
hash = murmurmix64(hash);
return uint64_t(uint128_t(hash) * K >> 64);
}
And this one is probably fine, covers up to 64-bit ranges, and is probably still faster than mod by a constant:
size_t bucket(uint64_t hash) {
hash *= 0xc4ceb9fe1a85ec53ULL;
return uint64_t(uint128_t(hash) * K >> 64);
}
Plus it probably makes no difference if K is not constant, and so you can avoid calling via a function pointer and just load the table size into a register and perform the multiply by that inline.
Supposing your hash values are uniformly distributed "random" values in the range of 0..2**64-1, you can skip the modulo and instead use random-number scaling techniques to distribute the result (the bucket number) evenly throughout that space. I don't think there's any reason to stick with modulo, is there?
The quickest-and-dirtiest of these would be:
But if the incoming hashes are not already uniform then try this one weird trick:
(which only works when
Kis less than 2**32), and you'll have to find the right intrinsic for the target to get the hardware-accelerated CRC instruction)This one is more thorough:
And this one is probably fine, covers up to 64-bit ranges, and is probably still faster than mod by a constant:
Plus it probably makes no difference if
Kis not constant, and so you can avoid calling via a function pointer and just load the table size into a register and perform the multiply by that inline.