-
Notifications
You must be signed in to change notification settings - Fork 51
Fix regression in dice::min_size #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
This regression was introduced in 69a6a6f. |
|
@AndreyBozhko do you have a test case that fails due to this? While it's faithful to the paper, the regression being around for almost 11 years could mean that certain code can depend on the bug. |
|
@cynthia Thank you for the feedback. From what I understand, this regression only affects the performance of the algorithm, but does not affect its correctness. Here's my reasoning. Suppose, one wants to retrieve matches using the simstring algorithm such that Since it is always true that Relation So how does the regression in question affect the algorithm? One can show that the min_size value produced by the current implementation never exceeds the min_size value defined by and In the expressions above, This means that during the overlapjoin step, the algorithm should in theory only check the strings whose number of features is between Since the condition for the number To summarize, the error in With that, I believe the fix does not introduce any user-facing changes - the code would still work as expected, just a little more efficient. |
|
And here's the simple test that illustrates the potential problem - master...AndreyBozhko:failing-test. The test checks that the returned value of Below are the outcome of running the test:
$ g++ --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
// with no optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 && ./executable
// with sanitizer and no optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 -fsanitize=float-divide-by-zero,float-cast-overflow && ./executable
include/simstring/measure.h:69:45: runtime error: division by zero
include/simstring/measure.h:69:30: runtime error: inf is outside the range of representable values of type 'int'
// with optimizations
$ g++ sample/tests.cpp -o executable -std=c++17 -I./include -O3 && ./executable
$
$ clang++ --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
// with no optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 && ./executable
// with sanitizer and no optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O0 -fsanitize=float-divide-by-zero,float-cast-overflow && ./executable
include/simstring/measure.h:69:45: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior include/simstring/measure.h:69:45 in
include/simstring/measure.h:69:21: runtime error: inf is outside the range of representable values of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior include/simstring/measure.h:69:21 in
// with optimizations
$ clang++ sample/tests.cpp -o executable -std=c++17 -I./include -O3 && ./executable
Assertion failed in sample/tests.cpp, line 14: 8250 not equal to 1
$Note that the regression in As a result, this may lead to some sneaky production bugs like in the test above that used |
cynthia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the extremely comprehensive analysis and some personal searching, this change looks good to me. I'll do some local testing before actually landing it though. Thank you so much for spending the time to provide the detailed explanation!
Justification
The SimString paper (link) defines the conditions for each similarity measure as follows:
The denominator in the min_size formula for Dice measure reads
(2-alpha), not(2-|X|).