Fix simhash slicing and add tests. Fixes #19. #20
tfmorris wants to merge 1 commit into dkpro:master
Good catch, Tom! Since your contributions are getting non-trivial, I'd like to ask you to fill out our contributor license agreement, which we apply to all DKPro open-source software (after discussions with the legal department at the Darmstadt Technical University). Please consult http://dkpro.github.io/contributing/ (you can send it via e-mail: habernal@ukp.informatik.tu-darmstadt.de). Thanks :)
I hope it did not scare you off, Tom :)
You didn't scare me off. :-) I just needed some quiet time to review the agreement. Plus, as I suggested above, I've gone off on a different tack and implemented a new hashing scheme and built a little benchmarking framework so I can compare. I'll have more PRs in the pipe as soon as I stop playing around (and sign the CLA).
Also includes a more efficient slicing algorithm that could be used, but it requires changes elsewhere in the system.
This adds very basic tests for all the static methods in SimHashUtils and fixes the simhash slicing algorithm.
The fixed version keeps the current text representation, but I'd actually suggest switching to Longs instead of text and computing the slices using bitmasks. This would not only make the slice computation faster and easier to understand, but would also speed up the sorting and comparisons during the shuffle phase of clustering (though it does require changes elsewhere in the system).
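To illustrate the bitmask suggestion, here is a minimal sketch, assuming the simhash is a 64-bit value held in a `long` and is split into equal-width bands. The class `SimHashSliceSketch` and the method `slices` are hypothetical names chosen for this example; they are not part of the existing SimHashUtils API, and the band layout (lowest bits first) is an assumption.

```java
// Hypothetical illustration only: not the code from this PR or from SimHashUtils.
final class SimHashSliceSketch {

    private SimHashSliceSketch() {
        // static helper, no instances
    }

    /**
     * Splits a 64-bit simhash into equal-width slices using shifts and a mask.
     * Slice 0 holds the lowest-order bits (an assumed convention).
     *
     * @param hash  the 64-bit simhash
     * @param bands number of slices; must divide 64 evenly
     */
    static long[] slices(long hash, int bands) {
        if (bands <= 0 || 64 % bands != 0) {
            throw new IllegalArgumentException("bands must be a positive divisor of 64");
        }
        int width = 64 / bands;
        // A shift by 64 would wrap around in Java, so handle the full-width case explicitly.
        long mask = (width == 64) ? -1L : (1L << width) - 1;
        long[] out = new long[bands];
        for (int i = 0; i < bands; i++) {
            // Unsigned shift so the sign bit does not smear into the top slice.
            out[i] = (hash >>> (i * width)) & mask;
        }
        return out;
    }
}
```

For example, `SimHashSliceSketch.slices(0x0123456789ABCDEFL, 4)` yields the four 16-bit slices `0xCDEF`, `0x89AB`, `0x4567`, `0x0123`. Because each slice is a plain `long`, sorting and equality checks become numeric comparisons rather than string comparisons.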