Make XORFastHash OS-independent #7400
MarcelloPerathoner wants to merge 3 commits into Project-OSRM:master
Conversation
Pull request overview
This PR addresses OS-dependent behavior in XORFastHash (notably differing results on Windows) by replacing the previous shuffle-based initialization with a deterministic table initialization, and updates alternative-route cucumber features accordingly.
Changes:
- Reimplemented `util::XORFastHash` to avoid `std::shuffle`-dependent initialization and reduce the table size (L1-cache-friendly).
- Updated contractor code and `XORFastHashStorage` to use the non-templated `XORFastHash`.
- Adjusted testbot alternative-route feature scenarios (including removing forced single-threaded preparation).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/contractor/graph_contractor.cpp | Switches contractor tie-breaking/independence checks to the new non-templated XORFastHash. |
| include/util/xor_fast_hash.hpp | Rewrites the hash implementation to deterministic 4×256 tabulation hashing. |
| include/util/xor_fast_hash_storage.hpp | Updates storage to use the new XORFastHash type. |
| features/testbot/alternative_loop.feature | Updates alternative-loop cucumber scenarios/tags to reflect the new deterministic behavior expectations. |
| features/testbot/alternative.feature | Removes forced single-threaded data-prep arguments for alternative-route cucumber tests. |
include/util/xor_fast_hash.hpp
```cpp
std::mt19937_64 generator(69); // impl. defined but deterministic default seed

for (size_t i = 0; i < 0x100; ++i)
{
```
size_t is used unqualified in this header, but <cstddef> isn’t included and size_t isn’t guaranteed to be available via the current includes on all standard libraries/compilers. This can break builds on stricter toolchains; prefer std::size_t (and include <cstddef>) for the loop index.
include/util/xor_fast_hash.hpp
```cpp
std::iota(begin(table2), end(table2), std::uint16_t{0});
std::shuffle(begin(table2), end(table2), generator);
std::mt19937_64 generator(69); // impl. defined but deterministic default seed
```
The comment // impl. defined but deterministic default seed on std::mt19937_64 is misleading: the mt19937_64 algorithm and its output sequence for a given seed are specified by the C++ standard, so it should be portable/deterministic across platforms. Consider rewording to avoid suggesting the engine is implementation-defined (the earlier issue was std::shuffle, not the engine).
```diff
-std::mt19937_64 generator(69); // impl. defined but deterministic default seed
+std::mt19937_64 generator(69); // fixed seed for deterministic, portable sequence
```
```diff
 private:
     std::vector<HashCell> positions;
-    XORFastHash<MaxNumElements> fast_hasher;
+    XORFastHash fast_hasher;
     unsigned current_timestamp;
```
XORFastHashStorage is still templated on MaxNumElements, but the hasher is no longer parameterized and always returns a full 16-bit value. If MaxNumElements is instantiated to something other than 65536, fast_hasher(node) can produce an initial position outside positions and cause OOB access before the probing loop applies % MaxNumElements. Consider enforcing MaxNumElements == (1u<<16) via static_assert, or folding the hash down (position %= MaxNumElements) immediately after hashing.
Copilot is only partially right.
MaxNumElements was removed because a tabulation hash cannot guarantee what Copilot thinks it guarantees:
- The returned hash value was not always less than MaxNumElements, because the XOR of two numbers can easily be greater than either number (e.g. with Max = 10, 7 ^ 8 = 15). The bug Copilot complains about was there before this PR.
- You cannot take the hash modulo N as Copilot suggests, because you lose uniformity, except when N is a power of 2; in that case it is better to use an AND mask on the hash.

... needs a bit more thought.
It's good to have consistency. I am wondering, though, whether supplying our own implementation of the shuffle algorithm would be the less intrusive approach. E.g. what cppreference suggests is very similar to what libstdc++ and libc++ have implemented:
I think the shuffle is bogus. Tabulation hashing requires an array filled with random numbers. It does not require an array filled with a random permutation of the numbers 1..N; the latter is much less random than the former. Check with: https://en.wikipedia.org/wiki/Tabulation_hashing
Are we sure …
I would have to look into the code to verify, but I think the instances are thread-local only.
I'm now convinced that the "shuffle" implementation is wrong.
Let A be an array containing a random permutation of the numbers 0..255. Only the first entry of A is truly random. The farther you go into the array, the less random the entries become: the second entry depends on the first, and the last entry is totally constrained by all the entries before it. The current implementation claims 3-independence, but it does not even achieve 2-independence: if I select 2 different inputs and expect 2 equal hash codes, the probability of this outcome is 0. (For 2-independence it should be nonzero, namely 1/m for a hash range of size m.)
The implementation is poorly documented. It was chosen at the time to shuffle the range of n integers because this provided enough randomness for a compact and performant implementation. You are right, it is not tabulation hashing by the textbook. We re-used the approach described here.
I did some benchmarks, and while I would not bet the farm on the absolute values, the trend is clear: the fixed XORFastHash is faster than the old "shuffle" hash. But the simple ANDHash (which just takes the lowest 16 bits) is faster still. Seems we don't need no fancy hash at all.

Why is this so? The data we want to store is a subsequence of consecutive node ids, that is, we have no duplicate values in the data. The data itself is 32-bit unsigned ints. Taking the lowest 16 bits is as good a hash as any, and faster too.

Benchmark configurations: number of nodes 2147483648, hash table size 65536, at occupancies of 50%, 75%, 90%, 95%, and 99%. [Timing tables not recovered.]
Thanks for the article. From a first reading it seems that we can use the node id as tie breaker in the contraction. We don't need no XORFastHash at all. Do you have an idea how many nodes remain after extraction in a typical case, say Germany? I could adjust the benchmarks.
I would think maybe 10-25 million nodes remain after extraction. But the searches during preprocessing are all smallish. The contraction only considers a small hop distance, and the subsequent route queries have to settle maybe up to 10k nodes for long-distance routes. This is also the reason the small hash table works so well: the number of settled nodes is very small compared to the overall size of the complete network.
There are two places where XORFastHash is currently used: in the contractor (node tie-breaking / independence checks) and in XORFastHashStorage (the hash table used by the searches).

The advantage of the ANDHash becomes more apparent as the CPU heats up. From the synthetic benchmarks (30-60K nodes) I'm pretty sure that even searches will perform significantly faster. The only advantage of a more involved hash function would show if we routinely inserted long runs of consecutive ids; in that case a better hash would avoid long collision runs. But neither the contractor nor the router operates on consecutive ids.
Good investigation! This has good insights. If you have the time, could you check whether these preprocessing speedups also translate to a larger data set, e.g. the Germany or Europe data sets? I agree that this speedup looks very likely to exist for queries as well. Could you perhaps run a check to verify? I am growing more and more convinced we should replace the XORFastHash implementation. Again, good investigative work.
Germany latest

On a Lenovo ThinkPad P1 Gen 7 with Intel Core Ultra 9 185H × 22 and 64.0 GiB RAM. [Raw data, mean, and normalized benchmark tables not recovered.]
Ok. That looks encouraging. Let's proceed with this. The file and class names may need to change to reflect the new implementation.
Pull request overview
Copilot reviewed 17 out of 18 changed files in this pull request and generated 13 comments.
scripts/routed_benchmark.py
```python
with open(args.coordinates, newline="") as csvfile:
    reader = csv.DictReader(csvfile, delimiter="\t")
    for row in reader:
        print(row)
```
```python
    nargs="+",
    metavar="PATH",
    help="The osrm-routed binaries to use or compare (path/to/osrm-routed)",
)
parser_run.add_argument(
    "--datasets",
    nargs="+",
```
BS. nargs="+" already does that.
```cpp
++current_timestamp;
if (std::numeric_limits<unsigned>::max() == current_timestamp)
{
    cells.clear();
```
BS. current_timestamp will overflow to 0 and reset itself.
src/contractor/graph_contractor.cpp
```cpp
/** A heap kept in thread-local storage to avoid multiple recreation of it. */
ContractorHeap heap_exemplar(8000);
tbb::enumerable_thread_specific<ContractorHeap> thread_data(heap_exemplar);
```
Searches are already limited to 2000 nodes.
scripts/routed_benchmark.py
```python
def make_request(self):
    """Make one routed request"""
    requests.get(self.make_url())
```
```cpp
size = size | (size >> 4);
size = size | (size >> 8);
size = size | (size >> 16);
size = size | (size >> 32);
```
```cpp
ValueType &operator[](const KeyType key)
{
    std::size_t position = key & mask;
    while ((cells[position].time == current_timestamp) && (cells[position].key != key))
    {
        ++position &= mask;
    }
```
src/contractor/graph_contractor.cpp
```cpp
const constexpr size_t DeleteGrainSize = 1;

const NodeID number_of_nodes = graph.GetNumberOfNodes();
```
src/contractor/graph_contractor.cpp
```diff
 const float target_priority = priorities[hop1];
 BOOST_ASSERT(target_priority >= 0);
 // found a neighbour with lower priority?
-if (priority > target_priority)
-{
-    return false;
-}
-// tie breaking
-if (std::abs(priority - target_priority) < std::numeric_limits<float>::epsilon() &&
-    Bias(hash, new_to_old_node_id[node], new_to_old_node_id[target]))
+if (priority > target_priority || (priority == target_priority && bias(node, hop1)))
+{
```
BS. We are only interested in a strict monotone ordering of the nodes. Priority is a heuristic anyway, so we couldn't care less about exact float comparisons.
src/contractor/graph_contractor.cpp
```diff
+inline std::set<NodeID> GetNeighbours(const ContractorGraph &graph, const NodeID v)
 {
-    std::vector<NodeID> &neighbours = data->neighbours;
-    neighbours.clear();
-
-    // find all neighbours
-    for (auto e : graph.GetAdjacentEdgeRange(node))
+    std::set<NodeID> neighbours;
+    for (auto e : graph.GetAdjacentEdgeRange(v))
     {
         const NodeID u = graph.GetTarget(e);
-        if (u != node)
+        if (u != v)
         {
-            neighbours.push_back(u);
+            neighbours.insert(u);
         }
```
Force-pushed from 06660e7 to 3b17140
The PR has evolved into a rewrite of the contractor. The benchmarks are as follows:

We see an 8% improvement in contraction time on germany-latest. (Caveat: this machine is not a benchmark machine; it does other things too.) Memory usage has slightly decreased too.

The current implementation inserts self-loops unconditionally whenever the target of a contracted node is a oneway street. This PR inserts the same self-loop only if the loop is shorter than any other path to the target, i.e. it treats self-loops the same as any other path through the contracted node. This reduces the total number of edges by about 0.5%.

To test the correctness of this approach I wrote a benchmark that requests 1000 routes between pairs of randomly selected German cities (with population > 100k) and compared the distances obtained with origin/master:

We see that the routes obtained are the same. We also see a 2% drop in routing times. Other changes:
Issue
XORFastHash gives different results on Windows: see #4693
Fix