fix: crash when processing invalid UTF-8 in embeddings tokenizer #126
Open
sanikolaev wants to merge 2 commits into master
The FFI layer used `std::str::from_utf8_unchecked` to convert raw bytes from C++ into Rust strings. That function performs no validation: passing it invalid UTF-8 is undefined behavior, which surfaced here as a panic that crashed the entire process. The conversion now uses `String::from_utf8_lossy`, which replaces each invalid sequence with the replacement character (U+FFFD) instead of panicking. This lets the tokenizer process the text and generate embeddings even when the input contains invalid UTF-8. Related issue #125
Add a regression test verifying that `from_utf8_lossy` converts invalid UTF-8 sequences to valid strings (replacing invalid bytes with U+FFFD) instead of panicking. Also includes code style fixes from `cargo fmt`.
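For illustration, a regression test along these lines might look like the sketch below. The helper `to_text`, the test name, and the input bytes are hypothetical stand-ins, not the code added in this PR; the only real API used is `String::from_utf8_lossy`.

```rust
/// Hypothetical stand-in for the conversion step under test:
/// turns arbitrary bytes into an owned, always-valid Rust string.
fn to_text(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn invalid_utf8_is_replaced_not_panicked_on() {
        // 0xFF can never appear in well-formed UTF-8.
        let out = to_text(b"My\xFF text");

        // The invalid byte is replaced with U+FFFD ...
        assert!(out.contains('\u{FFFD}'));
        // ... and the valid text around it is preserved.
        assert!(out.starts_with("My"));
        assert!(out.ends_with(" text"));
    }

    #[test]
    fn valid_utf8_is_unchanged() {
        assert_eq!(to_text("plain text".as_bytes()), "plain text");
    }
}
```

The assertions check for the presence of the replacement character rather than an exact output string, so the test does not depend on how many replacement characters a longer invalid run produces.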
Linux debug test results: 8 files, 8 suites, 12m 37s. Results for commit 60d956b.
Windows test results: 5 files, 5 suites, 18m 2s. Results for commit 60d956b.
Linux release test results: 8 files, 8 suites, 6m 11s. Results for commit 60d956b.
The FFI layer used `std::str::from_utf8_unchecked` to convert raw bytes from C++ into Rust strings. That function performs no validation: passing it invalid UTF-8 is undefined behavior, which surfaced here as a panic that crashed the entire process.

When `html_strip='1'` is enabled in the daemon, the HTML stripper processes text byte by byte without validating UTF-8. If the input contains invalid UTF-8 sequences (e.g., corrupted text like 'Myâ¦'), those sequences are passed through to the Rust tokenizer, causing a panic.

The conversion now uses `String::from_utf8_lossy`, which replaces each invalid sequence with the replacement character (U+FFFD) instead of panicking. This lets the tokenizer process the text and generate embeddings even when the input contains invalid UTF-8.

Related issue #125
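As a rough sketch of the change described above, the conversion at the FFI boundary could look like the following. The function name `bytes_to_text` and its pointer-plus-length signature are assumptions made for illustration; the actual FFI entry points in this repository may be shaped differently.

```rust
use std::borrow::Cow;
use std::os::raw::c_char;
use std::slice;

/// Hypothetical helper (not the PR's actual function): converts a
/// (pointer, length) pair received from C++ into text without trusting
/// the bytes to be valid UTF-8.
///
/// # Safety
/// `ptr` must point to `len` readable bytes that stay valid for `'a`.
pub unsafe fn bytes_to_text<'a>(ptr: *const c_char, len: usize) -> Cow<'a, str> {
    if ptr.is_null() || len == 0 {
        return Cow::Borrowed("");
    }
    // SAFETY: the caller guarantees `ptr` is valid for `len` bytes.
    let bytes = unsafe { slice::from_raw_parts(ptr as *const u8, len) };

    // Before the fix, the equivalent step was roughly
    // `std::str::from_utf8_unchecked(bytes)`, which skips validation.
    // `from_utf8_lossy` validates and substitutes U+FFFD for bad sequences.
    String::from_utf8_lossy(bytes)
}
```

`String::from_utf8_lossy` returns a `Cow<str>`: well-formed input is borrowed without copying, and a new `String` is allocated only when a replacement is actually needed, so the extra cost on the common path is the validation pass itself.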