Skip to content

SimHash slicing algorithm incorrect & inefficient #19

@tfmorris

Description

@tfmorris

The current implementation will never output the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that's less serious since the computations are consistent, so the comparisons aren't effected.

Given input 0X0800040002000100L the current algorithm will generate

[0_{8}, 1_{8}, 2_{8}]

when it should generate:

[0_{8}, 1_{9},  2_{10}, 3_{11}]

It would actually be much more efficient (and easier to understand) if it switched the Hadoop type to Long instead of Text and just generated:

[0X0000000000000100L,
 0X0000000002000000L,
 0X0000040000000000L,
 0X0800000000000000L]

This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions