SimHash slicing algorithm incorrect & inefficient #19

The current implementation never outputs the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that is less serious: the computation is consistent across hashes, so the comparisons aren't affected.

Given the input `0X0800040002000100L`, the current algorithm generates:

```
[0_{8}, 1_{8}, 2_{8}]
```

when it should generate:

```
[0_{8}, 1_{9}, 2_{10}, 3_{11}]
```
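For reference, here is a minimal sketch of a slicer that produces the expected output above. It assumes each entry is formatted as `<slice index>_{<set bit positions within the 16-bit slice>}`, with slice 0 being the lowest 16 bits; the class name, method name, and the `", "` separator used when several bits are set are all hypothetical and are not taken from the existing code.

```java
// Hypothetical sketch, not the project's actual implementation.
final class SimHashTextSlices {
    static final int SLICE_BITS = 16;
    static final int SLICE_COUNT = Long.SIZE / SLICE_BITS; // 4 slices in a 64-bit hash

    // Returns one label per 16-bit slice, including the top slice (index 3).
    static String[] sliceLabels(long simhash) {
        String[] labels = new String[SLICE_COUNT];
        for (int slice = 0; slice < SLICE_COUNT; slice++) {
            // Unsigned shift so the sign bit never leaks into the top slice.
            int bits = (int) ((simhash >>> (slice * SLICE_BITS)) & 0xFFFF);
            StringBuilder sb = new StringBuilder();
            sb.append(slice).append("_{");
            boolean first = true;
            for (int bit = 0; bit < SLICE_BITS; bit++) {
                if ((bits & (1 << bit)) != 0) {
                    if (!first) {
                        sb.append(", "); // assumed separator for multiple set bits
                    }
                    sb.append(bit);
                    first = false;
                }
            }
            labels[slice] = sb.append('}').toString();
        }
        return labels;
    }
}
```

Calling `SimHashTextSlices.sliceLabels(0X0800040002000100L)` under these assumptions yields the four labels `0_{8}`, `1_{9}`, `2_{10}`, `3_{11}`.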

It would actually be much more efficient (and easier to understand) if it switched the Hadoop type from Text to a long (e.g. `LongWritable`) and just generated:

```
[0X0000000000000100L,
 0X0000000002000000L,
 0X0000040000000000L,
 0X0800000000000000L]
```
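The masked values above can be computed with one shift and one AND per slice. The following is only a sketch of that idea; the class and method names are made up for illustration, and wrapping each value in a Hadoop `LongWritable` is left out.

```java
// Hypothetical sketch of the proposed long-based slicing.
final class SimHashSlices {
    static final int SLICE_BITS = 16;
    static final int SLICE_COUNT = Long.SIZE / SLICE_BITS; // 4 slices in a 64-bit hash

    // Returns the hash masked to each 16-bit slice, keeping every slice
    // in its original bit position (lowest slice first).
    static long[] slice(long simhash) {
        long[] slices = new long[SLICE_COUNT];
        for (int i = 0; i < SLICE_COUNT; i++) {
            long mask = 0xFFFFL << (i * SLICE_BITS);
            slices[i] = simhash & mask;
        }
        return slices;
    }
}
```

For `0X0800040002000100L` this returns the four values listed above. One caveat with this encoding: a slice with no bits set is `0` regardless of its position, so if empty slices need to stay distinguishable by position, the slice index would have to be carried separately.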

This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.

