cli : merge tokens split across UTF-8 boundaries in JSON output #3751
Open
texasich wants to merge 1 commit into
When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes) is split across multiple whisper tokens, the -ojf/--output-json-full writer emitted each token's partial bytes as its own JSON string, producing invalid UTF-8 that chokes downstream parsers. Merge adjacent tokens in output_json whenever the accumulated text still ends on an incomplete UTF-8 sequence. The merged entry keeps the first token's id/p/t_dtw and extends t1 to the last absorbed token, which matches how segment text is assembled elsewhere. Refs ggml-org#1798
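To make the failure mode concrete, here is a minimal standalone sketch (not code from this PR) of why a per-token split yields invalid UTF-8. The function name is_structurally_valid_utf8 is hypothetical, and the check is deliberately simplified: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of whisper.cpp.
// Simplified structural check: every lead byte must be followed by the
// number of continuation bytes (10xxxxxx) that its bit pattern announces.
static bool is_structurally_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        size_t len = 0;
        if      ((c & 0x80) == 0x00) len = 1; // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3; // 1110xxxx (most CJK)
        else if ((c & 0xF8) == 0xF0) len = 4; // 11110xxx
        else return false;                    // stray continuation or invalid lead

        if (i + len > s.size()) {
            return false;                     // sequence is cut off at the end
        }
        for (size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
                return false;                 // expected a continuation byte
            }
        }
        i += len;
    }
    return true;
}
```

With the 3-byte sequence E6 97 A5 split after the second byte, neither half passes this check on its own, while the concatenation does. That is the situation the merge in output_json is meant to repair.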
When a multi-byte UTF-8 codepoint (most commonly a CJK character, which takes 3 bytes) lands across two or more adjacent whisper tokens, the -ojf/--output-json-full writer was emitting each partial-byte token as its own JSON string. That produces invalid UTF-8, and most downstream parsers rightly choke on it (reported in #1798).

Root cause

output_json in examples/cli/cli.cpp iterates at the token level and writes one JSON object per token. The segment-level text helpers (whisper_full_get_segment_text) concatenate tokens first, so they're fine; only the per-token loop was broken.

Fix

Two small additions to examples/cli/cli.cpp:

- utf8_trailing_bytes_needed(const std::string &): a static helper that walks back past UTF-8 continuation bytes, finds the last lead byte, and returns how many more continuation bytes are still expected (0 if the tail is already a complete codepoint or looks malformed).
- Merge loop in output_json: before writing tokens, adjacent tokens are accumulated into a std::vector<merged_token> while the accumulated text still ends on an incomplete UTF-8 sequence. Each merged entry keeps the first token's id, p, and t_dtw and extends t1 to the last absorbed token. The emission loop then runs over the merged list instead of raw tokens.

No public API change. No format change for codepoints that already fit in one token. Only output_json is touched; output_score writes tab-separated plain text and never had a JSON-validity issue, so it's left alone.

Not a duplicate of #3619

PR #3619 (feat(cli): add word-level LRC output with UTF-8 fix) adds a new LRC output mode and includes a UTF-8 fix scoped to that new writer. It does not touch output_json or address the issue described in #1798. The two changes are independent.

Testing
Standalone unit tests against the helper (all pass):
Built cleanly against current master: cmake --build build --target whisper-cli -j, no new warnings in cli.cpp.
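
For anyone who wants to sanity-check the approach without building the tree, the two pieces described under Fix can be sketched standalone as below. This is an approximation written from the description, not the code in the diff: the struct token_info and the function merge_split_tokens are hypothetical stand-ins (the PR uses a merged_token type inside output_json), t_dtw is omitted for brevity, and the body of utf8_trailing_bytes_needed is one plausible reading of "walk back past continuation bytes, find the last lead byte".

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-token fields written to JSON.
struct token_info {
    std::string text;
    int         id;
    float       p;
    int64_t     t0;
    int64_t     t1;
};

// Returns how many continuation bytes are still missing at the end of s.
// Returns 0 when the tail is a complete codepoint or looks malformed.
static int utf8_trailing_bytes_needed(const std::string & s) {
    size_t i      = s.size();
    int    n_cont = 0;
    // Walk back over at most 3 continuation bytes (10xxxxxx).
    while (i > 0 && n_cont < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++n_cont;
    }
    if (i == 0) {
        return 0; // empty, or continuation bytes with no lead byte
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    int expected = 0;
    if      ((lead & 0x80) == 0x00) expected = 0; // ASCII
    else if ((lead & 0xE0) == 0xC0) expected = 1; // 2-byte sequence
    else if ((lead & 0xF0) == 0xE0) expected = 2; // 3-byte sequence
    else if ((lead & 0xF8) == 0xF0) expected = 3; // 4-byte sequence
    else return 0;                                // malformed tail
    return n_cont < expected ? expected - n_cont : 0;
}

// Absorbs following tokens into the current entry while its text still ends
// mid-codepoint. The merged entry keeps the first token's id, p and t0, and
// takes t1 from the last absorbed token.
static std::vector<token_info> merge_split_tokens(const std::vector<token_info> & tokens) {
    std::vector<token_info> merged;
    bool incomplete = false; // true while merged.back() ends mid-codepoint
    for (const auto & tok : tokens) {
        if (incomplete) {
            merged.back().text += tok.text;
            merged.back().t1    = tok.t1;
        } else {
            merged.push_back(tok);
        }
        incomplete = utf8_trailing_bytes_needed(merged.back().text) > 0;
    }
    return merged;
}
```

In this sketch, a trailing incomplete sequence at the very end of a segment is emitted as-is, since there is nothing left to merge it with. Whether the actual patch handles that case differently is not stated in the description.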