Description
src/string/jaro_winkler_distance.rs can panic on non-ASCII (multi-byte UTF-8) input. The function mixes character indices (from .chars().enumerate()) with byte lengths (str::len()), then uses the resulting positions to slice &str by byte offset. When a slice boundary falls inside a multi-byte UTF-8 character, Rust panics (byte index N is not a char boundary).
The offending spots use byte offsets derived from char-based counting:
let limit = std::cmp::min(s1.len(), s2.len()) / 2; // .len() = BYTE length
for (i, l) in s1.chars().enumerate() { // i = CHAR index
    let left = std::cmp::max(0, i as i32 - limit as i32) as usize;
    let right = std::cmp::min(i + limit + 1, s2.len());
    if s2[left..right].contains(l) { // <-- byte slice with char-derived indices -> panic
        ...
    }
}
...
let bound = std::cmp::min(std::cmp::min(str1.len(), str2.len()), 4);
for (c1, c2) in str1[..bound].chars().zip(str2[..bound].chars()) { ... } // <-- same hazard
The existing unit tests only use ASCII, so the bug is not currently covered.
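To make the byte/char mismatch concrete, here is a minimal sketch using only the standard library. The helper names byte_and_char_len and char_and_byte_positions are hypothetical and exist only for illustration; they are not part of the crate.

```rust
/// Returns (byte length, char count) for a string.
/// `str::len()` counts bytes; `chars().count()` counts Unicode scalar values.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Pairs each char with its char index and its starting byte offset.
/// For ASCII the two indices coincide; after a multi-byte char they drift apart.
fn char_and_byte_positions(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .enumerate()
        .map(|(char_idx, (byte_idx, c))| (char_idx, byte_idx, c))
        .collect()
}
```

For "céd" the byte length is 4 but the char count is 3, and 'd' sits at char index 2 but byte offset 3. Any code that feeds a char index into a byte slice is only correct while the input stays ASCII.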
Steps to reproduce
// ASCII works:
jaro_winkler_distance("martha", "marhta"); // 0.9611111111111111
// A multi-byte char at a slice boundary makes it panic:
jaro_winkler_distance("ab", "céd");
Actual output (verified with rustc -O):
ascii ok: 0.9611111111111111
thread 'main' panicked at ...:
byte index 2 is not a char boundary; it is inside 'é' (bytes 1..3) of `céd`
(get_matched_characters("ab", "céd") computes right = 2 and evaluates s2[0..2], which cuts the 2-byte é in half.)
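The arithmetic behind that panic can be replayed without crashing. The sketch below mirrors the window computation from the quoted loop (it is not the crate's function; window_bounds and checked_window are hypothetical names) and uses str::get, which returns None instead of panicking when a bound is not on a char boundary.

```rust
/// Recomputes the window bounds the way the quoted loop does for char index `i`.
/// Note that both lengths are BYTE lengths, as in the original snippet.
fn window_bounds(s1: &str, s2: &str, i: usize) -> (usize, usize) {
    let limit = std::cmp::min(s1.len(), s2.len()) / 2;
    // Same result as max(0, i - limit) in the snippet, without signed casts.
    let left = i.saturating_sub(limit);
    let right = std::cmp::min(i + limit + 1, s2.len());
    (left, right)
}

/// Non-panicking slice: `None` when `left..right` is out of range or
/// either bound falls inside a multi-byte character.
fn checked_window(s2: &str, left: usize, right: usize) -> Option<&str> {
    s2.get(left..right)
}
```

With s1 = "ab", s2 = "céd" and i = 0, window_bounds yields (0, 2); checked_window("céd", 0, 2) is None because byte 2 is inside 'é', which is exactly where s2[0..2] panics.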
Expected behavior
jaro_winkler_distance should compute the metric for arbitrary Unicode strings (or at least not panic). String distance metrics are routinely applied to non-ASCII text.
Actual behavior
The function panics (byte index … is not a char boundary) for any input whose access pattern slices through a multi-byte character — i.e. realistic non-ASCII input.
Suggested fix
Operate on characters, not bytes. Collect both inputs into Vec<char> once and index/measure by character throughout:
let s1: Vec<char> = str1.chars().collect();
let s2: Vec<char> = str2.chars().collect();
// use s1.len()/s2.len() (char counts), and slice s2[left..right] as a char slice
This removes the byte/char index mismatch and the panic.
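As a rough illustration of the char-based approach, the sketch below performs the same window check on char slices. It keeps the current limit formula unchanged so that only the indexing changes; in_window and in_window_str are hypothetical helpers, not existing functions in the crate.

```rust
/// Is `s1[i]` present in the window of `s2` around position `i`?
/// All lengths and indices here are char counts, so slicing cannot split a character.
fn in_window(s1: &[char], s2: &[char], i: usize) -> bool {
    let limit = std::cmp::min(s1.len(), s2.len()) / 2;
    let left = i.saturating_sub(limit);
    let right = std::cmp::min(i + limit + 1, s2.len());
    match s1.get(i) {
        // Guard against an empty or inverted window when s1 is much longer than s2.
        Some(c) => left < right && s2[left..right].contains(c),
        None => false,
    }
}

/// Convenience wrapper: collect both inputs into Vec<char> once, then index by char.
fn in_window_str(str1: &str, str2: &str, i: usize) -> bool {
    let s1: Vec<char> = str1.chars().collect();
    let s2: Vec<char> = str2.chars().collect();
    in_window(&s1, &s2, i)
}
```

Collecting into Vec<char> costs one allocation per input but gives O(1) indexing by character. Note that this compares Unicode scalar values, not grapheme clusters, so a precomposed 'é' and 'e' followed by a combining accent are still treated as different sequences.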
Additional notes (secondary, not the main bug)
- The matching window limit = min(len1, len2) / 2 differs from the standard Jaro definition floor(max(len1, len2) / 2) - 1; this can yield non-standard scores for some inputs.
- get_matched_characters removes the first occurrence of a matched char via s2.find(l) (and substitutes a space) rather than the occurrence found within the search window, which can mis-handle repeated characters.
A char-based rewrite is a good opportunity to address these as well, but the panic above is the blocking correctness issue.
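For reference, a char-based rewrite that also addresses both notes might look like the sketch below. It follows the textbook Jaro-Winkler definition (window floor(max(len1, len2) / 2) - 1, per-position matched flags, prefix of up to 4 chars, scaling factor 0.1). It is an assumption about what a fix could look like, not the crate's implementation; the names jaro_chars and jaro_winkler_chars are hypothetical.

```rust
/// Jaro similarity over Unicode scalar values (textbook definition).
fn jaro_chars(a: &[char], b: &[char]) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Standard window: floor(max(len) / 2) - 1, clamped at zero.
    let window = (std::cmp::max(a.len(), b.len()) / 2).saturating_sub(1);
    // Flags mark the exact positions consumed, so repeated chars are handled.
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = std::cmp::min(i + window + 1, b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Walk the matched chars of both strings in order; mismatches come in pairs,
    // and each pair counts as one transposition.
    let mut k = 0;
    let mut out_of_order = 0usize;
    for i in 0..a.len() {
        if !a_matched[i] {
            continue;
        }
        while !b_matched[k] {
            k += 1;
        }
        if a[i] != b[k] {
            out_of_order += 1;
        }
        k += 1;
    }
    let m = matches as f64;
    let t = (out_of_order / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler similarity: Jaro plus a bonus for a common prefix of up to 4 chars.
fn jaro_winkler_chars(s1: &str, s2: &str) -> f64 {
    let a: Vec<char> = s1.chars().collect();
    let b: Vec<char> = s2.chars().collect();
    let sim = jaro_chars(&a, &b);
    let prefix = a.iter().zip(b.iter()).take(4).take_while(|(x, y)| x == y).count();
    sim + prefix as f64 * 0.1 * (1.0 - sim)
}
```

Under these assumptions jaro_winkler_chars("martha", "marhta") agrees with the ASCII value quoted above to within floating-point rounding, and jaro_winkler_chars("ab", "céd") returns 0.0 instead of panicking.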