The current implementation will never output the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that's less serious since the computations are consistent, so the comparisons aren't effected.
Given input 0X0800040002000100L the current algorithm will generate
when it should generate:
[0_{8}, 1_{9}, 2_{10}, 3_{11}]
It would actually be much more efficient (and easier to understand) if it switched the Hadoop type to Long instead of Text and just generated:
[0X0000000000000100L,
0X0000000002000000L,
0X0000040000000000L,
0X0800000000000000L]
This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.
The current implementation will never output the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that's less serious since the computations are consistent, so the comparisons aren't effected.
Given input
0X0800040002000100Lthe current algorithm will generatewhen it should generate:
It would actually be much more efficient (and easier to understand) if it switched the Hadoop type to Long instead of Text and just generated:
This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.