fix: crash when processing invalid UTF-8 in embeddings tokenizer by sanikolaev · Pull Request #126 · manticoresoftware/columnar

sanikolaev · 2026-01-08T17:22:49Z

The FFI layer was using std::str::from_utf8_unchecked to convert raw bytes from C++ to Rust strings. This function performs no validation: it assumes the bytes are valid UTF-8, and passing it invalid sequences is undefined behavior. Here the resulting malformed string caused a panic in the tokenizer, which crashed the entire process.

When html_strip='1' is enabled in the daemon, the HTML stripper processes text byte-by-byte without validating UTF-8. If the input contains invalid UTF-8 sequences (e.g., corrupted text like 'Myâ¦'), these sequences are passed through to the Rust tokenizer, causing a panic.
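To make "invalid UTF-8 sequence" concrete, the following sketch uses the standard library's validating `std::str::from_utf8` to locate the first offending byte in a buffer. The helper name `first_invalid_offset` and the truncated-ellipsis input are illustrative assumptions and are not taken from the daemon or from this PR.

```rust
/// Hypothetical diagnostic helper (not part of this PR): returns the byte
/// offset of the first invalid UTF-8 sequence, or None if the buffer is valid.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    match std::str::from_utf8(bytes) {
        Ok(_) => None,
        // valid_up_to() is the length of the longest valid prefix,
        // i.e. the index where the first invalid sequence starts.
        Err(e) => Some(e.valid_up_to()),
    }
}
```

For example, the ellipsis character is encoded as the three bytes E2 80 A6; if byte-level processing drops the last byte, the two bytes left behind no longer form a complete sequence, and the helper reports the offset where they start.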

Changed to use String::from_utf8_lossy instead, which gracefully handles invalid UTF-8 by replacing invalid sequences with the replacement character (U+FFFD) rather than panicking. This allows the tokenizer to process the text and generate embeddings, even when the input contains invalid UTF-8.
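A minimal sketch of the lossy conversion at an FFI boundary, assuming the C++ side hands over a pointer and a length. The function name `text_from_raw` and its signature are hypothetical; the actual function changed in this PR may look different.

```rust
use std::borrow::Cow;
use std::slice;

/// Hypothetical FFI helper (not the function changed in this PR).
///
/// # Safety
/// `ptr` must be null or point to `len` readable bytes that stay alive
/// and unmodified for as long as the returned value is used.
pub unsafe fn text_from_raw<'a>(ptr: *const u8, len: usize) -> Cow<'a, str> {
    if ptr.is_null() || len == 0 {
        return Cow::Borrowed("");
    }
    // SAFETY: guaranteed by the caller, see the contract above.
    let bytes = unsafe { slice::from_raw_parts(ptr, len) };
    // Borrows when the bytes are already valid UTF-8; otherwise allocates
    // a copy with each invalid sequence replaced by U+FFFD.
    String::from_utf8_lossy(bytes)
}
```

Because `from_utf8_lossy` returns a `Cow<str>`, valid input is borrowed without copying, so the extra cost is the validation pass plus an allocation only when replacement is actually needed.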

Related issue #125

Add regression test to verify invalid UTF-8 sequences are handled gracefully without panicking. The test verifies that from_utf8_lossy correctly converts invalid UTF-8 sequences to valid strings (replacing invalid bytes with U+FFFD) instead of panicking. Also includes code style fixes from cargo fmt.
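A regression test along these lines might look like the sketch below. It is not the test added in this PR: `lossy_text` is a hypothetical stand-in for the conversion step, and the input bytes are chosen for illustration.

```rust
/// Hypothetical stand-in for the FFI conversion step.
fn lossy_text(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

#[cfg(test)]
mod tests {
    use super::lossy_text;

    #[test]
    fn invalid_utf8_is_replaced_without_panicking() {
        // 0xFF never appears in well-formed UTF-8.
        let text = lossy_text(b"My\xFF text");
        assert_eq!(text, "My\u{FFFD} text");
        // The result is a valid string, so the tokenizer can consume it.
        assert!(std::str::from_utf8(text.as_bytes()).is_ok());
    }

    #[test]
    fn valid_utf8_is_unchanged() {
        assert_eq!(lossy_text("My text".as_bytes()), "My text");
    }
}
```

The second case guards against the fix altering well-formed input.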

github-actions · 2026-01-08T17:58:19Z

Linux debug test results

8 files 8 suites 12m 37s ⏱️
492 tests 472 ✅ 20 💤 0 ❌
506 runs 486 ✅ 20 💤 0 ❌

Results for commit 60d956b.

github-actions · 2026-01-08T18:02:44Z

Windows test results

5 files 5 suites 18m 2s ⏱️
474 tests 461 ✅ 13 💤 0 ❌
482 runs 469 ✅ 13 💤 0 ❌

Results for commit 60d956b.

github-actions · 2026-01-08T18:04:19Z

Linux release test results

8 files 8 suites 6m 11s ⏱️
492 tests 479 ✅ 13 💤 0 ❌
506 runs 493 ✅ 13 💤 0 ❌

Results for commit 60d956b.

sanikolaev added 2 commits January 9, 2026 00:19

sanikolaev requested a review from donhardman January 8, 2026 18:05

sanikolaev mentioned this pull request Jan 9, 2026

Crash at insert with auto embeddings and html_strip #125

Open

4 tasks

sanikolaev wants to merge 2 commits into master from issue-125
