cli : merge tokens split across UTF-8 boundaries in JSON output by texasich · Pull Request #3751 · ggml-org/whisper.cpp

texasich · 2026-04-11T01:06:21Z

When a multi-byte UTF-8 codepoint — most commonly a CJK character, which takes 3 bytes — happens to land across two or more adjacent whisper tokens, the -ojf/--output-json-full writer was emitting each partial-byte token as its own JSON string. That produces invalid UTF-8, and most downstream parsers rightly choke on it (reported in #1798).

Root cause

output_json in examples/cli/cli.cpp iterates at the token level and writes one JSON object per token. The segment-level text helpers (whisper_full_get_segment_text) concatenate tokens first, so they're fine — only the per-token loop was broken.

Fix

Two small additions to examples/cli/cli.cpp:

utf8_trailing_bytes_needed(const std::string &) — a static helper that walks back past UTF-8 continuation bytes, finds the last lead byte, and returns how many more continuation bytes are still expected (0 if the tail is already a complete codepoint or looks malformed).
Merge loop in output_json — before writing tokens, adjacent tokens are accumulated into a std::vector<merged_token> while the accumulated text still ends on an incomplete UTF-8 sequence. Each merged entry keeps the first token's id, p, and t_dtw and extends t1 to the last absorbed token. The emission loop then runs over the merged list instead of raw tokens.

No public API change. No format change for codepoints that already fit in one token. Only output_json is touched; output_score writes tab-separated plain text and never had a JSON-validity issue, so it's left alone.

Not a duplicate of #3619

PR #3619 (feat(cli): add word-level LRC output with UTF-8 fix) adds a new LRC output mode and includes a UTF-8 fix scoped to that new writer. It does not touch output_json or address the issue described in #1798. The two changes are independent.

Testing

Standalone unit tests against the helper (all pass):

empty string, pure ASCII → 0
complete and incomplete 2-byte sequences (é)
complete and incomplete 3-byte sequences (私, the CJK char from the bug report)
all three partial states of a 4-byte emoji
incremental merge simulation: 私 in three tokens → merges into one entry

Built cleanly against current master: cmake --build build --target whisper-cli -j, no new warnings in cli.cpp.

When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes) is split across multiple whisper tokens, the -ojf/--output-json-full writer emitted each token's partial bytes as its own JSON string, producing invalid UTF-8 that chokes downstream parsers.

Merge adjacent tokens in output_json whenever the accumulated text still ends on an incomplete UTF-8 sequence. The merged entry keeps the first token's id/p/t_dtw and extends t1 to the last absorbed token, which matches how segment text is assembled elsewhere.

Refs ggml-org#1798

texasich · 2026-04-18T17:43:21Z

gentle ping — this fixes the CJK character split in JSON output (#1798). verified it doesn't overlap with #3619 which only touches the new LRC writer. happy to iterate if anything needs adjusting.

texasich wants to merge 1 commit into ggml-org:master from texasich:fix/cjk-token-boundary-json
