cli : merge tokens split across UTF-8 boundaries in JSON output #3751

Open
texasich wants to merge 1 commit into ggml-org:master from texasich:fix/cjk-token-boundary-json

@texasich
Contributor

When a multi-byte UTF-8 codepoint (most commonly a CJK character, which takes 3 bytes) lands across two or more adjacent whisper tokens, the -ojf/--output-json-full writer emitted each partial-byte token as its own JSON string. The result is invalid UTF-8, and most downstream parsers rightly choke on it (reported in #1798).

Root cause

output_json in examples/cli/cli.cpp iterates at the token level and writes one JSON object per token. The segment-level text helpers (whisper_full_get_segment_text) concatenate tokens first, so they're fine — only the per-token loop was broken.
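
To make the failure mode concrete, here is a minimal sketch that is not taken from the patch. The checker `is_structurally_valid_utf8` is a hypothetical helper written only for this illustration; it verifies lead/continuation byte structure and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not part of cli.cpp).
// Structural check: every lead byte must be followed by the right number of
// continuation bytes (10xxxxxx). Overlong forms and surrogates are not rejected.
static bool is_structurally_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        size_t len = 0;
        if      ((c & 0x80) == 0x00) len = 1; // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4; // 11110xxx
        else return false;                    // stray continuation or invalid lead

        if (i + len > s.size()) {
            return false;                     // sequence truncated at end of string
        }
        for (size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
                return false;                 // expected a continuation byte
            }
        }
        i += len;
    }
    return true;
}
```

With 私 encoded as the bytes E7 A7 81, the fragments "\xE7", "\xA7", and "\x81" each fail this check when taken alone, while their concatenation passes. That is the difference between the per-token loop and the segment-level text path.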

Fix

Two small additions to examples/cli/cli.cpp:

  1. utf8_trailing_bytes_needed(const std::string &) — a static helper that walks back past UTF-8 continuation bytes, finds the last lead byte, and returns how many more continuation bytes are still expected (0 if the tail is already a complete codepoint or looks malformed).

  2. Merge loop in output_json — before writing tokens, adjacent tokens are accumulated into a std::vector<merged_token> while the accumulated text still ends on an incomplete UTF-8 sequence. Each merged entry keeps the first token's id, p, and t_dtw and extends t1 to the last absorbed token. The emission loop then runs over the merged list instead of raw tokens.
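
For readers who do not want to open the diff, the two additions can be sketched roughly as follows. This is a simplified reconstruction from the description above, not the code in the patch: the `merged_token` field layout (`text`, `t0`, and the integer types) and the free function `merge_split_tokens` are assumptions made so the example stands on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns how many continuation bytes are still missing at the end of `s`.
// Returns 0 for an empty string, a complete tail, or a tail that looks malformed.
static int utf8_trailing_bytes_needed(const std::string & s) {
    size_t i = s.size();
    int n_cont = 0;
    // Walk back over at most 3 continuation bytes (10xxxxxx).
    while (i > 0 && n_cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++n_cont;
    }
    if (i == 0) {
        return 0; // empty, or continuation bytes with no lead byte
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    int expected = 0;
    if      ((lead & 0x80) == 0x00) expected = 0; // ASCII
    else if ((lead & 0xE0) == 0xC0) expected = 1; // 2-byte sequence
    else if ((lead & 0xF0) == 0xE0) expected = 2; // 3-byte sequence
    else if ((lead & 0xF8) == 0xF0) expected = 3; // 4-byte sequence
    else return 0;                                // invalid lead: treat as malformed
    return expected > n_cont ? expected - n_cont : 0;
}

// Assumed layout for illustration; the real struct lives in cli.cpp.
struct merged_token {
    std::string text;
    int         id;
    float       p;
    int64_t     t0;
    int64_t     t1;
    int64_t     t_dtw;
};

// Absorb following tokens while the accumulated text ends mid-codepoint.
// The merged entry keeps the first token's id, p, t0 and t_dtw; t1 is extended.
static std::vector<merged_token> merge_split_tokens(const std::vector<merged_token> & raw) {
    std::vector<merged_token> out;
    bool pending = false;
    for (const auto & tok : raw) {
        if (pending) {
            out.back().text += tok.text;
            out.back().t1    = tok.t1;
        } else {
            out.push_back(tok);
        }
        pending = utf8_trailing_bytes_needed(out.back().text) > 0;
    }
    return out;
}
```

Under these assumptions, three raw tokens carrying "\xE7", "\xA7", and "\x81" collapse into a single entry whose text is 私, with the first token's id and the last token's t1. If the final token of a segment still leaves an incomplete sequence, this sketch emits the entry as-is rather than dropping it.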

No public API change. No format change for codepoints that already fit in one token. Only output_json is touched; output_score writes tab-separated plain text and never had a JSON-validity issue, so it's left alone.

Not a duplicate of #3619

PR #3619 (feat(cli): add word-level LRC output with UTF-8 fix) adds a new LRC output mode and includes a UTF-8 fix scoped to that new writer. It does not touch output_json or address the issue described in #1798. The two changes are independent.

Testing

Standalone unit tests against the helper (all pass):

  • empty string, pure ASCII → 0
  • complete and incomplete 2-byte sequences (é)
  • complete and incomplete 3-byte sequences (私, the CJK char from the bug report)
  • all three partial states of a 4-byte emoji
  • incremental merge simulation: 私 in three tokens → merges into one entry

Built cleanly against current master: cmake --build build --target whisper-cli -j, no new warnings in cli.cpp.

When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.

Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.

Refs ggml-org#1798
@texasich
Contributor Author

gentle ping — this fixes the CJK character split in JSON output (#1798). verified it doesn't overlap with #3619 which only touches the new LRC writer. happy to iterate if anything needs adjusting.
