SimHash returning 32-bit results, not 64-bit #21

Although the code and paper suggest that 64-bit hashes are being used, Java's Object.hashCode() returns only 32 bits. The good news is that the bug in #19 has no effect, since the upper 16 bits are always 0 (or perhaps all 1s, depending on sign-extension effects).
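To illustrate the sign-extension effect, here is a minimal sketch of what happens when a 32-bit `int` (such as a `hashCode()` result) is widened to a 64-bit `long`. The class and method names (`HashWidthDemo`, `widen`, `upper16`) are hypothetical and not taken from the project; the sketch only assumes the hash is stored in a `long` via ordinary int-to-long conversion.

```java
class HashWidthDemo {
    // Assumption: the 32-bit hash is stored in a long via implicit
    // int-to-long widening, which sign-extends the value.
    static long widen(int hash32) {
        return hash32;
    }

    // Extract bits 48-63 (the upper 16 bits) of a 64-bit value.
    static int upper16(long value) {
        return (int) (value >>> 48);
    }

    public static void main(String[] args) {
        long nonNegative = widen(0x12345678); // sign bit clear
        long negative = widen(0x87654321);    // sign bit set

        System.out.println(Long.toHexString(nonNegative)); // 12345678
        System.out.println(Long.toHexString(negative));    // ffffffff87654321

        // The upper 16 bits are all zeros or all ones, never anything else.
        System.out.println(Integer.toHexString(upper16(nonNegative))); // 0
        System.out.println(Integer.toHexString(upper16(negative)));    // ffff
    }
}
```

In either case, every bit above bit 31 carries no information beyond the sign of the original 32-bit value.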

The bad news is that because bits 32-47 are all zero (or perhaps split evenly between all zeros and all ones), I suspect that all (or at least half) of the documents will end up clustered together, making for a very expensive O(n^2) comparison.
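The clustering concern can be sketched as follows, assuming (this is a guess at the grouping step, not the project's actual code) that documents are bucketed by the 16-bit block at bits 32-47 and then compared pairwise within each bucket. `BlockBucketDemo`, `block32to47`, `bucketSizes`, and `pairCount` are hypothetical names, and `String.hashCode()` stands in for whatever 32-bit hash the real code produces.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BlockBucketDemo {
    // Extract bits 32-47 of a 64-bit hash as a bucket key.
    static int block32to47(long hash) {
        return (int) ((hash >>> 32) & 0xFFFF);
    }

    // Count how many documents fall into each bucket.
    static Map<Integer, Integer> bucketSizes(List<String> docs) {
        Map<Integer, Integer> sizes = new HashMap<>();
        for (String doc : docs) {
            long hash = doc.hashCode(); // 32-bit value, sign-extended to 64 bits
            sizes.merge(block32to47(hash), 1, Integer::sum);
        }
        return sizes;
    }

    // Number of pairwise comparisons within a bucket of n documents.
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("alpha", "beta", "gamma", "delta", "epsilon");
        // Only keys 0x0000 and 0xFFFF can ever appear, so there are at most two buckets.
        System.out.println(bucketSizes(docs));
        System.out.println(pairCount(1_000_000)); // 499999500000
    }
}
```

With at most two buckets, the grouping step removes almost no candidate pairs, which is why the comparison cost approaches the full O(n^2) case.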

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.

