Make XORFastHash OS-independent #7400

Open

MarcelloPerathoner wants to merge 3 commits into Project-OSRM:master from MarcelloPerathoner:fix-XORFastHash
Conversation

@MarcelloPerathoner
Contributor

Issue

XORFastHash gives different results on Windows: see #4693

Fix

  • Rewrote XORFastHash without std::shuffle, which is implementation-dependent.
  • Split the hash 4-ways, which makes the random tables small enough to fit into the L1 cache.
  • Fixed tests.
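For readers unfamiliar with the technique, here is a minimal sketch of what 4×256 tabulation hashing looks like. The class name, the 16-bit output width, and the seed are illustrative, not the exact PR code:

```cpp
#include <array>
#include <cstdint>
#include <random>

// Tabulation hashing: split a 32-bit key into four bytes and XOR together
// one random table entry per byte. Four 256-entry tables of uint16_t are
// 2 KiB total, small enough to stay resident in the L1 cache.
class TabulationHash32
{
  public:
    TabulationHash32()
    {
        // mt19937_64 with a fixed seed produces a sequence fully specified
        // by the C++ standard, so the tables are identical on every platform.
        std::mt19937_64 gen(69);
        for (auto &table : tables)
            for (auto &entry : table)
                entry = static_cast<std::uint16_t>(gen());
    }

    std::uint16_t operator()(std::uint32_t key) const
    {
        return tables[0][key & 0xff] ^         //
               tables[1][(key >> 8) & 0xff] ^  //
               tables[2][(key >> 16) & 0xff] ^ //
               tables[3][(key >> 24) & 0xff];
    }

  private:
    std::array<std::array<std::uint16_t, 256>, 4> tables;
};
```

Because the table initialization avoids std::shuffle entirely, two independently constructed hashers produce identical values, on any conforming implementation.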

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses OS-dependent behavior in XORFastHash (notably differing results on Windows) by replacing the previous shuffle-based initialization with a deterministic table initialization, and updates alternative-route cucumber features accordingly.

Changes:

  • Reimplemented util::XORFastHash to avoid std::shuffle-dependent initialization and reduce table size (L1-friendly).
  • Updated contractor code and XORFastHashStorage to use the non-templated XORFastHash.
  • Adjusted testbot alternative-route feature scenarios (including removing forced single-threaded preparation).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File Description
src/contractor/graph_contractor.cpp Switches contractor tie-breaking/independence checks to the new non-templated XORFastHash.
include/util/xor_fast_hash.hpp Rewrites the hash implementation to deterministic 4×256 tabulation hashing.
include/util/xor_fast_hash_storage.hpp Updates storage to use the new XORFastHash type.
features/testbot/alternative_loop.feature Updates alternative-loop cucumber scenarios/tags to reflect the new deterministic behavior expectations.
features/testbot/alternative.feature Removes forced single-threaded data-prep arguments for alternative-route cucumber tests.


Comment on lines +26 to +29
std::mt19937_64 generator(69); // impl. defined but deterministic default seed

for (size_t i = 0; i < 0x100; ++i)
{

Copilot AI Mar 5, 2026


size_t is used unqualified in this header, but <cstddef> isn’t included and size_t isn’t guaranteed to be available via the current includes on all standard libraries/compilers. This can break builds on stricter toolchains; prefer std::size_t (and include <cstddef>) for the loop index.


std::iota(begin(table2), end(table2), std::uint16_t{0});
std::shuffle(begin(table2), end(table2), generator);
std::mt19937_64 generator(69); // impl. defined but deterministic default seed

Copilot AI Mar 5, 2026


The comment // impl. defined but deterministic default seed on std::mt19937_64 is misleading: the mt19937_64 algorithm and its output sequence for a given seed are specified by the C++ standard, so it should be portable/deterministic across platforms. Consider rewording to avoid suggesting the engine is implementation-defined (the earlier issue was std::shuffle, not the engine).

Suggested change
std::mt19937_64 generator(69); // impl. defined but deterministic default seed
std::mt19937_64 generator(69); // fixed seed for deterministic, portable sequence
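Copilot's point can in fact be checked mechanically: the C++ standard ([rand.predef]) pins down the engine's output sequence, requiring for example that the 10000th value drawn from a default-constructed std::mt19937_64 be exactly 9981545732273789042 on every conforming implementation. A quick sketch (hypothetical helper name):

```cpp
#include <cstdint>
#include <random>

// The C++ standard specifies that the 10000th consecutive invocation of a
// default-constructed std::mt19937_64 yields 9981545732273789042, i.e. the
// engine itself is fully portable; only std::shuffle is not.
inline std::uint64_t mt19937_64_10000th()
{
    std::mt19937_64 gen; // default seed 5489u
    std::uint64_t v = 0;
    for (int i = 0; i < 10000; ++i)
        v = gen();
    return v;
}
```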

Comment on lines 74 to 77
private:
std::vector<HashCell> positions;
XORFastHash<MaxNumElements> fast_hasher;
XORFastHash fast_hasher;
unsigned current_timestamp;

Copilot AI Mar 5, 2026


XORFastHashStorage is still templated on MaxNumElements, but the hasher is no longer parameterized and always returns a full 16-bit value. If MaxNumElements is instantiated to something other than 65536, fast_hasher(node) can produce an initial position outside positions and cause OOB access before the probing loop applies % MaxNumElements. Consider enforcing MaxNumElements == (1u<<16) via static_assert, or folding the hash down (position %= MaxNumElements) immediately after hashing.

Contributor Author


Copilot is only partially right.

MaxNumElements was removed because a tabulation hash cannot guarantee what Copilot thinks it guarantees:

The returned hash value was not always less than MaxNumElements, because the XOR of two numbers can easily be greater than either number (e.g. Max=10, 7 ^ 8 = 15). The bug Copilot complains about was there before this PR.

You cannot take the hash modulo N like Copilot suggests, because you lose uniformity unless N is a power of 2; in that case it is better to use an AND mask on the hash.

... needs a bit more thought.
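To illustrate the power-of-two case mentioned above (hypothetical helper, not PR code): when N = 2^k, masking with N - 1 gives exactly hash % N, with no division and no loss of uniformity, whereas for non-power-of-two N the modulo skews the low buckets.

```cpp
#include <cstdint>

// For a table size that is a power of two, masking with size-1 is equivalent
// to taking the hash modulo size, and every bucket stays equally likely when
// the hash bits are uniform.
inline std::uint32_t bucket_of(std::uint32_t hash, std::uint32_t table_size_pow2)
{
    return hash & (table_size_pow2 - 1); // same as hash % table_size_pow2
}
```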

@DennisOSRM
Collaborator

It's good to have consistency. I am wondering though if supplying an implementation of the shuffle algorithm is the less intrusive approach. E.g. what cppreference suggests is very similar to what libstdc++ and libc++ have implemented:


template<class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g)
{
    typedef typename std::iterator_traits<RandomIt>::difference_type diff_t;
    typedef std::uniform_int_distribution<diff_t> distr_t;
    typedef typename distr_t::param_type param_t;
 
    distr_t D;
    for (diff_t i = last - first - 1; i > 0; --i)
    {
        using std::swap;
        swap(first[i], first[D(g, param_t(0, i))]);
    }
}

@MarcelloPerathoner
Contributor Author

> It's good to have consistency. I am wondering though if supplying an implementation of the shuffle algorithm is the less intrusive approach. E.g. what cppreference suggests is very similar to what libstdc++ and libc++ have implemented:
>
> template<class RandomIt, class URBG>
> void shuffle(RandomIt first, RandomIt last, URBG&& g)
> {
>     typedef typename std::iterator_traits<RandomIt>::difference_type diff_t;
>     typedef std::uniform_int_distribution<diff_t> distr_t;
>     typedef typename distr_t::param_type param_t;
>
>     distr_t D;
>     for (diff_t i = last - first - 1; i > 0; --i)
>     {
>         using std::swap;
>         swap(first[i], first[D(g, param_t(0, i))]);
>     }
> }

I think the shuffle is bogus. Tabulation hashing requires an array filled with random numbers. It does not require an array filled with a random permutation of the numbers 1..N. The latter is much less random than the former. Check with:

https://en.wikipedia.org/wiki/Tabulation_hashing
https://opendatastructures.org/ods-cpp/5_2_Linear_Probing.html#SECTION00923000000000000000

@MarcelloPerathoner
Contributor Author

Are we sure XORFastHashStorage::operator[] never gets called from multiple threads? Because it is not thread-safe.

@DennisOSRM
Collaborator

DennisOSRM commented Mar 5, 2026

I would have to look into code to verify, but I think the instances are thread local only.

@MarcelloPerathoner
Contributor Author

I'm now convinced that the "shuffle" implementation is wrong.

> One way to achieve this is to store a giant array, tab of length 2^w where each entry is a random w-bit integer, independent of all the other entries.
> ---- https://opendatastructures.org/ods-cpp/5_2_Linear_Probing.html#SECTION00923000000000000000

Let A be an array containing a random permutation of the numbers 0..255. Only the first entry of A is truly random. The farther you go into the array the less random the entries become. The second entry is dependent on the first one, and the last entry is totally constrained by all the entries before.

The current implementation claims 3-independence but it does not even achieve 2-independence: if I select 2 different inputs and expect 2 equal hash codes, the probability of this outcome is 0. (For 2-independence it should be $m^{-2}$)
See: https://en.wikipedia.org/wiki/K-independent_hashing#Definitions
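For reference, the 2-independence condition invoked here: a family $H$ of hash functions into a codomain of size $m$ is 2-independent when, for all distinct keys $x_1 \ne x_2$ and all values $y_1, y_2$,

```latex
\Pr_{h \in H}\left[\, h(x_1) = y_1 \;\wedge\; h(x_2) = y_2 \,\right] = \frac{1}{m^2}
```

Setting $y_1 = y_2$ shows that a 2-independent family must allow two distinct inputs to collide on a given value with probability exactly $m^{-2}$, which a permutation table (whose 256 outputs are all distinct) cannot do.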

@DennisOSRM
Collaborator

> I'm now convinced that the "shuffle" implementation is wrong.
>
> > One way to achieve this is to store a giant array, tab of length 2^w where each entry is a random w-bit integer, independent of all the other entries.
> > ---- https://opendatastructures.org/ods-cpp/5_2_Linear_Probing.html#SECTION00923000000000000000
>
> Let A be an array containing a random permutation of the numbers 0..255. Only the first entry of A is truly random. The farther you go into the array the less random the entries become. The second entry is dependent on the first one, and the last entry is totally constrained by all the entries before.
>
> The current implementation claims 3-independence but it does not even achieve 2-independence: if I select 2 different inputs and expect 2 equal hash codes, the probability of this outcome is 0. (For 2-independence it should be $m^{-2}$.) See: https://en.wikipedia.org/wiki/K-independent_hashing#Definitions

The implementation is poorly documented. At the time it was chosen to shuffle the range of n integers, as this provided enough randomness for a compact and performant implementation. You are right, it is not textbook tabulation hashing. We re-used the approach described here.

@MarcelloPerathoner
Contributor Author

I did some benchmarks, and while I would not bet the farm on the absolute values, the trend is clear: the fixed XORFastHash is faster than the old "shuffle" hash. But the simple ANDHash (which just takes the lowest 16 bits, NodeID & 0xffff) outperforms all the rest.

It seems we don't need a fancy hash at all. Why is that? The data we want to store is a subsequence of consecutive node ids, that is, there are no duplicate values in the data. The data itself is 32-bit unsigned ints. Taking the lowest 16 bits is as good a hash as any, and faster too.

number of nodes: 2147483648  hash table size: 65536  occupancy: 50%

HashStorage     read (ms)  write (ms)
UnorderedMap           86          91
OldXORFastHash         47          47
XORFastHash            41          41
ANDHash                41          41

number of nodes: 2147483648  hash table size: 65536  occupancy: 75%

HashStorage     read (ms)  write (ms)
UnorderedMap          127         131
OldXORFastHash         89          88
XORFastHash            79          78
ANDHash                76          76

number of nodes: 2147483648  hash table size: 65536  occupancy: 90%

HashStorage     read (ms)  write (ms)
UnorderedMap          169         172
OldXORFastHash        127         123
XORFastHash           114         111
ANDHash               108         108

number of nodes: 2147483648  hash table size: 65536  occupancy: 95%

HashStorage     read (ms)  write (ms)
UnorderedMap          181         185
OldXORFastHash        152         150
XORFastHash           140         137
ANDHash               137         138

number of nodes: 2147483648  hash table size: 65536  occupancy: 99%

HashStorage     read (ms)  write (ms)
UnorderedMap          194         196
OldXORFastHash        465         464
XORFastHash           305         300
ANDHash               276         275
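For completeness, the ANDHash being benchmarked is, per the description above, nothing more than the low 16 bits of the node id (function name illustrative):

```cpp
#include <cstdint>

// "ANDHash": just the lowest 16 bits of the 32-bit node id. For inputs
// without duplicates (runs of distinct node ids) this spreads keys across
// a 65536-entry table as well as any hash, with no table lookups at all.
inline std::uint16_t and_hash(std::uint32_t node_id)
{
    return static_cast<std::uint16_t>(node_id & 0xffffu);
}
```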

@MarcelloPerathoner
Contributor Author

Thanks for the article. From a first reading it seems that we can use the node id as tie breaker in the contraction. We don't need XORFastHash at all.

Do you have an idea how many nodes remain after extraction in a typical case, say Germany? I could adjust the benchmarks.

@DennisOSRM
Collaborator

I would think maybe 10-25 million nodes remain after extraction.

But the searches during preprocessing are all smallish. The contraction only considers a small hop distance, and the subsequent route queries have to settle maybe up to 10k nodes for long-distance routes. This is also why the small hash table works so well: the number of settled nodes is very small compared to the overall size of the complete network.

@MarcelloPerathoner
Contributor Author

There are two places where XORFastHash is currently used:

  1. as tie-breaker in the contractor. It is redundant there because:

    a) hashes do collide, and that makes them useless as tie-breaker functions. You
    always need a second tie-breaker-breaker to safeguard against first tie-breaker
    collision. Then, why not use this second function in the first place?

    b) We have unique integer node ids. A comparison of the node ids is the ideal
    tie-breaker function. It is guaranteed to succeed and very cheap: One assembler
    instruction on most modern CPUs and the node ids are already in the CPU cache.

  2. as primary hash for a custom hash table implementation. In synthetic benchmarks the
    simple AND mask has already proven to perform better than the XORFastHash. In real-world
    benchmarks the Berlin-latest dataset contracts almost 10% faster using 10% less memory:

1 origin-master
[2026-03-07T19:53:58] [info] Contraction took 23.7241 sec
[2026-03-07T19:53:58] [info] RAM: peak bytes used: 251019264
1 fix-XORFastHash
[2026-03-07T19:54:21] [info] Contraction took 23.139 sec
[2026-03-07T19:54:21] [info] RAM: peak bytes used: 226705408
2 origin-master
[2026-03-07T19:54:47] [info] Contraction took 25.584 sec
[2026-03-07T19:54:47] [info] RAM: peak bytes used: 249913344
2 fix-XORFastHash
[2026-03-07T19:55:11] [info] Contraction took 24.5741 sec
[2026-03-07T19:55:11] [info] RAM: peak bytes used: 226045952
3 origin-master
[2026-03-07T19:55:38] [info] Contraction took 26.9462 sec
[2026-03-07T19:55:38] [info] RAM: peak bytes used: 248930304
3 fix-XORFastHash
[2026-03-07T19:56:04] [info] Contraction took 25.4197 sec
[2026-03-07T19:56:04] [info] RAM: peak bytes used: 226287616
4 origin-master
[2026-03-07T19:56:32] [info] Contraction took 28.126 sec
[2026-03-07T19:56:32] [info] RAM: peak bytes used: 249135104
4 fix-XORFastHash
[2026-03-07T19:56:59] [info] Contraction took 26.8965 sec
[2026-03-07T19:56:59] [info] RAM: peak bytes used: 227037184
5 origin-master
[2026-03-07T19:57:28] [info] Contraction took 28.5742 sec
[2026-03-07T19:57:28] [info] RAM: peak bytes used: 248373248
5 fix-XORFastHash
[2026-03-07T19:57:55] [info] Contraction took 27.2992 sec
[2026-03-07T19:57:55] [info] RAM: peak bytes used: 226648064
6 origin-master
[2026-03-07T19:58:26] [info] Contraction took 30.4532 sec
[2026-03-07T19:58:26] [info] RAM: peak bytes used: 249106432
6 fix-XORFastHash
[2026-03-07T19:58:54] [info] Contraction took 27.8771 sec
[2026-03-07T19:58:54] [info] RAM: peak bytes used: 227864576
7 origin-master
[2026-03-07T19:59:24] [info] Contraction took 30.2883 sec
[2026-03-07T19:59:24] [info] RAM: peak bytes used: 249479168
7 fix-XORFastHash
[2026-03-07T19:59:52] [info] Contraction took 27.8033 sec
[2026-03-07T19:59:52] [info] RAM: peak bytes used: 226459648
8 origin-master
[2026-03-07T20:00:22] [info] Contraction took 30.1351 sec
[2026-03-07T20:00:22] [info] RAM: peak bytes used: 250085376
8 fix-XORFastHash
[2026-03-07T20:00:50] [info] Contraction took 27.7521 sec
[2026-03-07T20:00:50] [info] RAM: peak bytes used: 226709504
9 origin-master
[2026-03-07T20:01:21] [info] Contraction took 30.4311 sec
[2026-03-07T20:01:21] [info] RAM: peak bytes used: 250949632
9 fix-XORFastHash
[2026-03-07T20:01:49] [info] Contraction took 27.8589 sec
[2026-03-07T20:01:49] [info] RAM: peak bytes used: 226758656
10 origin-master
[2026-03-07T20:02:19] [info] Contraction took 30.3585 sec
[2026-03-07T20:02:19] [info] RAM: peak bytes used: 249942016
10 fix-XORFastHash
[2026-03-07T20:02:47] [info] Contraction took 27.4976 sec
[2026-03-07T20:02:47] [info] RAM: peak bytes used: 226824192

The advantage of the ANDHash becomes more apparent as the CPU heats up.

From the synthetic benchmarks (30-60K Nodes) I'm pretty sure that even searches will perform significantly faster. The only advantage of using a more involved hash function would be if we routinely inserted long runs of consecutive ids. In that case a better hash would avoid long collision runs. But neither the contractor nor the router operate on consecutive ids.
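The node-id comparison proposed as a tie-breaker in point 1 above amounts to something like the following sketch (names and signature are hypothetical, not PR code):

```cpp
#include <cstdint>

using NodeID = std::uint32_t;

// Hash-based tie-breaking needs a fallback for hash collisions; comparing
// the unique node ids directly is total, deterministic, and a single
// compare instruction on the ids already in cache.
inline bool contract_before(float priority_a, NodeID a, float priority_b, NodeID b)
{
    if (priority_a != priority_b)
        return priority_a < priority_b;
    return a < b; // unique ids guarantee a strict total order
}
```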

@DennisOSRM
Collaborator

Good investigation! This has good insights. If you have the time, could you check whether these preprocessing speedups also translate to a larger data set, e.g. the Germany or Europe data sets?

I agree that this speedup looks very likely to exist also for queries. Could you perhaps run a check to verify?

I am growing more and more convinced we should replace the xorfasthash implementation.

Again, good investigative work.

@MarcelloPerathoner
Contributor Author

Germany latest

On a Lenovo ThinkPad P1 Gen 7 with Intel Core Ultra 9 185H × 22 and 64,0 GiB

RAW data

                        edges     time          mem
run branch                                         
0   origin-master    96057873  960.745  7862.183594
    fix-XORFastHash  96000988  883.187  7780.847656
1   origin-master    96053769  952.199  7764.093750
    fix-XORFastHash  95974183  878.743  7844.097656
2   origin-master    96053120  960.140  7858.722656
    fix-XORFastHash  95974232  874.394  7983.953125
3   origin-master    96060410  955.880  7876.789062
    fix-XORFastHash  95979722  872.756  7902.941406
4   origin-master    96068170  956.221  7796.613281
    fix-XORFastHash  95956574  872.085  7856.320312
5   origin-master    96063959  971.047  7864.437500
    fix-XORFastHash  95970749  872.735  7880.996094
6   origin-master    95975796  947.365  7907.230469
    fix-XORFastHash  95994454  870.033  7892.316406
7   origin-master    96074653  944.491  7829.511719
    fix-XORFastHash  95992970  875.811  7853.335938
8   origin-master    96096552  950.868  7771.285156
    fix-XORFastHash  96014403  869.920  7798.371094
9   origin-master    96053686  946.476  7857.882812
    fix-XORFastHash  95947393  871.209  7959.000000

Mean

                    edges  stdev  time (s)  stdev  mem (MB)  stdev
origin-master    96055799  31051    954.54   8.05    7838.9   47.3
fix-XORFastHash  95980567  20447    874.09   4.18    7875.2   63.7

Normalized

                 edges   time    mem
origin-master    1.000  1.000  1.000
fix-XORFastHash  0.999  0.916  1.005

@DennisOSRM
Collaborator

OK, that looks encouraging. Let's proceed with this. The file and class names may be changed to reflect the new implementation.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 13 comments.



with open(args.coordinates, newline="") as csvfile:
    reader = csv.DictReader(csvfile, delimiter="\t")
    for row in reader:
        print(row)
Comment on lines +139 to +145
nargs="+",
metavar="PATH",
help="The osrm-routed binaries to use or compare (path/to/osrm-routed)",
)
parser_run.add_argument(
"--datasets",
nargs="+",
Contributor Author


BS. nargs="+" already does that.

++current_timestamp;
if (std::numeric_limits<unsigned>::max() == current_timestamp)
{
    cells.clear();
Contributor Author


BS. current_timestamp will overflow to 0 and reset itself.

Comment on lines +464 to +466
/** A heap kept in thread-local storage to avoid multiple recreation of it. */
ContractorHeap heap_exemplar(8000);
tbb::enumerable_thread_specific<ContractorHeap> thread_data(heap_exemplar);
Contributor Author


Searches are already limited to 2000 nodes.

Comment on lines +49 to +51
def make_request(self):
    """Make one routed request"""
    requests.get(self.make_url())
size = size | (size >> 4);
size = size | (size >> 8);
size = size | (size >> 16);
size = size | (size >> 32);
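The shift-or cascade above smears the highest set bit of the requested size downwards, turning any size into a 2^n - 1 bit mask covering it, so table indices can wrap with a single AND instead of a modulo. A standalone sketch (hypothetical name, full cascade shown):

```cpp
#include <cstdint>

// Smearing the highest set bit downwards turns any size into the smallest
// 2^n - 1 mask that covers it; index wrap-around is then `index & mask`.
inline std::uint64_t mask_for(std::uint64_t size)
{
    size |= size >> 1;
    size |= size >> 2;
    size |= size >> 4;
    size |= size >> 8;
    size |= size >> 16;
    size |= size >> 32;
    return size;
}
```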
Comment on lines +68 to +74
ValueType &operator[](const KeyType key)
{
    std::size_t position = key & mask;
    while ((cells[position].time == current_timestamp) && (cells[position].key != key))
    {
        ++position &= mask;
    }
const constexpr size_t DeleteGrainSize = 1;

const NodeID number_of_nodes = graph.GetNumberOfNodes();

Comment on lines 418 to 422
const float target_priority = priorities[hop1];
BOOST_ASSERT(target_priority >= 0);
// found a neighbour with lower priority?
if (priority > target_priority)
{
return false;
}
// tie breaking
if (std::abs(priority - target_priority) < std::numeric_limits<float>::epsilon() &&
Bias(hash, new_to_old_node_id[node], new_to_old_node_id[target]))

if (priority > target_priority || (priority == target_priority && bias(node, hop1)))
{
Contributor Author


BS. We are only interested in a strict monotone ordering of the nodes. Priority is a heuristic anyway so we couldn't care less about exact float comparisons.

Comment on lines 347 to 356
inline std::set<NodeID> GetNeighbours(const ContractorGraph &graph, const NodeID v)
{
std::vector<NodeID> &neighbours = data->neighbours;
neighbours.clear();

// find all neighbours
for (auto e : graph.GetAdjacentEdgeRange(node))
std::set<NodeID> neighbours;
for (auto e : graph.GetAdjacentEdgeRange(v))
{
const NodeID u = graph.GetTarget(e);
if (u != node)
if (u != v)
{
neighbours.push_back(u);
neighbours.insert(u);
}
@MarcelloPerathoner
Contributor Author

The PR has evolved into a rewrite of the contractor.

The benchmarks are as follows:

log                      edges     norm   time (s)  norm   mem (MB)  norm
/tmp/germany-origin.log  96004743  1.000    899.69  1.000      7762  1.000
/tmp/germany-hash.log    95658742  0.996    828.10  0.920      7632  0.983

We see an 8% improvement of contraction time on germany-latest. (Caveat: this machine is not a benchmark machine, it does other things too.) The memory usage has slightly decreased too.

The current implementation inserts self-loops unconditionally whenever the target of a contracted node is a oneway street. This PR inserts the same self-loop only if the loop is shorter than any other path to the target, ie. treats self-loops the same as any other path through the contracted node. This reduces the total number of edges by about 0.5%.

To test the correctness of this approach I wrote a benchmark that requests 1000 routes between randomly selected pairs of German cities (population > 100k) and compared the distances obtained with origin/master:

log                             time (ms)  norm   distance   norm
/tmp/germany-routed-origin.log    2870.05  1.000  323433851  1.000
/tmp/germany-routed-hash.log      2804.05  0.977  323433851  1.000

We see that the routes obtained are the same. We also see a 2% drop in routing times.

Other changes:

  • The code has been simplified and commented.
  • Tests have been adapted to the new contraction order.
  • 2 benchmark scripts (which generated the above tables) and 1 compiled benchmark have been added.
  • xor_fast_hash is gone; a new linear hash storage replaces it.
  • Still-open questions are marked with FIXME.
