jaro_winkler_distance panics on non-ASCII input (byte/char index mismatch in string slicing) #1047

## Description

`src/string/jaro_winkler_distance.rs` can panic on non-ASCII (multi-byte UTF-8) input. The function mixes **character indices** (from `.chars().enumerate()`) with **byte lengths** (`str::len()`), then uses the results as byte offsets when slicing `&str`. When a slice boundary falls inside a multi-byte UTF-8 character, Rust panics (`byte index N is not a char boundary`).

The offending spots use byte offsets derived from char-based counting:

```rust
let limit = std::cmp::min(s1.len(), s2.len()) / 2;          // .len() = BYTE length
for (i, l) in s1.chars().enumerate() {                       // i = CHAR index
    let left  = std::cmp::max(0, i as i32 - limit as i32) as usize;
    let right = std::cmp::min(i + limit + 1, s2.len());
    if s2[left..right].contains(l) {                         // <-- byte slice with char-derived indices -> panic
        ...
    }
}
...
let bound = std::cmp::min(std::cmp::min(str1.len(), str2.len()), 4);
for (c1, c2) in str1[..bound].chars().zip(str2[..bound].chars()) { ... }   // <-- same hazard
```

The existing unit tests only use ASCII, so the bug is not currently covered.

## Steps to reproduce

```rust
// ASCII works:
jaro_winkler_distance("martha", "marhta"); // 0.9611111111111111

// A multi-byte char that straddles a slice boundary makes it panic:
jaro_winkler_distance("ab", "céd");
```

Actual output (verified with `rustc -O`):

```
ascii ok: 0.9611111111111111

thread 'main' panicked at ...:
byte index 2 is not a char boundary; it is inside 'é' (bytes 1..3) of `céd`
```

(`get_matched_characters("ab", "céd")` computes `right = 2` and evaluates `s2[0..2]`, which cuts the 2-byte `é` in half.)
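The boundary arithmetic can be checked in isolation with the standard library alone. The sketch below is only an illustration: the helper names `byte_and_char_len`, `char_starts`, and `checked_byte_slice` are hypothetical and do not exist in the repository.

```rust
/// Byte length and char count of a string; these differ for non-ASCII text.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Byte offset at which each `char` of `s` starts.
fn char_starts(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

/// Non-panicking counterpart of `&s[start..end]`: returns `None` when the
/// range is out of bounds or when either end splits a multi-byte character.
fn checked_byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For `céd`, `byte_and_char_len` gives `(4, 3)` and `char_starts` gives `[(0, 'c'), (1, 'é'), (3, 'd')]`, so byte index 2 lies inside `é`. `checked_byte_slice("céd", 0, 2)` returns `None` at exactly the position where `s2[0..2]` panics.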

## Expected behavior

`jaro_winkler_distance` should compute the metric for arbitrary Unicode strings (or at least not panic). String distance metrics are routinely applied to non-ASCII text.

## Actual behavior

The function panics (`byte index … is not a char boundary`) for any input whose access pattern slices through a multi-byte character — i.e. realistic non-ASCII input.

## Suggested fix

Operate on characters, not bytes. Collect both inputs into `Vec<char>` once and index/measure by character throughout:

```rust
let s1: Vec<char> = str1.chars().collect();
let s2: Vec<char> = str2.chars().collect();
// use s1.len()/s2.len() (char counts), and slice s2[left..right] as a char slice
```

This removes the byte/char index mismatch and the panic.
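To make the suggestion concrete, here is a minimal sketch of a char-based Jaro-Winkler similarity. It is not the repository's code: the name `jaro_winkler_chars` is hypothetical, and the sketch assumes the textbook definition (window `floor(max(len1, len2) / 2) - 1`, per-position match flags, prefix capped at 4, scaling factor 0.1), so its scores can differ from the current implementation for some inputs. It compares Unicode scalar values (`char`), not grapheme clusters.

```rust
/// Jaro-Winkler similarity over `char`s instead of bytes.
/// Hypothetical sketch; textbook parameters are assumed (see lead-in).
fn jaro_winkler_chars(str1: &str, str2: &str) -> f64 {
    let s1: Vec<char> = str1.chars().collect();
    let s2: Vec<char> = str2.chars().collect();
    if s1.is_empty() && s2.is_empty() {
        return 1.0;
    }
    if s1.is_empty() || s2.is_empty() {
        return 0.0;
    }

    // All lengths below are char counts, so every index is a valid position.
    let window = (s1.len().max(s2.len()) / 2).saturating_sub(1);

    let mut matched1 = vec![false; s1.len()];
    let mut matched2 = vec![false; s2.len()];
    let mut matches = 0usize;
    for (i, &c) in s1.iter().enumerate() {
        let left = i.saturating_sub(window);
        let right = (i + window + 1).min(s2.len());
        // An empty range (left >= right) simply skips the loop; no slicing.
        for j in left..right {
            if !matched2[j] && s2[j] == c {
                matched1[i] = true;
                matched2[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }

    // Matched characters of each string, in their original order.
    let m1: Vec<char> = (0..s1.len()).filter(|&i| matched1[i]).map(|i| s1[i]).collect();
    let m2: Vec<char> = (0..s2.len()).filter(|&j| matched2[j]).map(|j| s2[j]).collect();
    // Half the number of positions where the matched sequences disagree.
    let transpositions = m1.iter().zip(&m2).filter(|(a, b)| a != b).count() / 2;

    let m = matches as f64;
    let jaro = (m / s1.len() as f64
        + m / s2.len() as f64
        + (m - transpositions as f64) / m)
        / 3.0;

    // Winkler bonus: length of the common prefix, at most 4 chars.
    let prefix = s1.iter().zip(&s2).take(4).take_while(|(a, b)| a == b).count();
    jaro + 0.1 * prefix as f64 * (1.0 - jaro)
}
```

With this sketch, `jaro_winkler_chars("ab", "céd")` returns `0.0` instead of panicking, and `jaro_winkler_chars("martha", "marhta")` stays at approximately `0.9611`. Tracking matches with boolean flags per position also avoids the need to overwrite matched characters with a placeholder.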

## Additional notes (secondary, not the main bug)

- The matching window `limit = min(len1, len2) / 2` differs from the standard Jaro definition `floor(max(len1, len2) / 2) - 1`; this can yield non-standard scores for some inputs.
- `get_matched_characters` removes the **first** occurrence of a matched char via `s2.find(l)` (and substitutes a space) rather than the occurrence found within the search window, which can mis-handle repeated characters.

A char-based rewrite is a good opportunity to address these as well, but the panic above is the blocking correctness issue.

