[Tokenizer] Fix silent output truncation in encode/decode finalize #23828

Merged
jtuyls merged 1 commit into iree-org:main from jtuyls:fix-encode-silent-truncation
Apr 3, 2026


@jtuyls jtuyls commented Mar 18, 2026

Summary

Return RESOURCE_EXHAUSTED instead of silently truncating when the output buffer is too small, matching the documented API contract in tokenizer.h.
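
To illustrate the contract described above, here is a minimal Python sketch. It is not the IREE C implementation: the names `Status`, `finalize_truncating`, and `finalize_checked` are hypothetical, the token ids are made up, and returning an empty result on failure is an assumption, since the PR does not state what ends up in the buffer in the error case.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status codes, named after the one mentioned in this PR.
    OK = "OK"
    RESOURCE_EXHAUSTED = "RESOURCE_EXHAUSTED"


def finalize_truncating(pending, capacity):
    """Silent truncation: emit whatever fits and still report success."""
    return Status.OK, list(pending[:capacity])


def finalize_checked(pending, capacity):
    """Checked behaviour: report an error when the output does not fit."""
    if len(pending) > capacity:
        # Assumption for this sketch: nothing is emitted on failure.
        return Status.RESOURCE_EXHAUSTED, []
    return Status.OK, list(pending)


pending_tokens = [101, 7592, 2088, 102]  # made-up token ids

# The caller cannot tell that two tokens were dropped.
print(finalize_truncating(pending_tokens, capacity=2))

# The caller gets an explicit error and can retry with a larger buffer.
print(finalize_checked(pending_tokens, capacity=2))
print(finalize_checked(pending_tokens, capacity=4))
```

The point of the checked variant is that a too-small buffer becomes visible to the caller, who can then grow the buffer and retry rather than consume a shortened result.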

Test results

HuggingFace smoketest (huggingface_smoketest.py)

1667/1667 tokenization comparisons pass across ~80 HuggingFace models (0 mismatches).

76 additional tests fail to load; these are load failures, not tokenization mismatches. They are tiktoken models listed in the HF smoketest that aren't valid HuggingFace model identifiers, and they are tested by the dedicated tiktoken smoketest instead. These tests are removed from the HF smoketest in #23830.

Tiktoken smoketest (tiktoken_smoketest.py)

72/76 tokenization comparisons pass across 4 tiktoken encodings (cl100k_base, o200k_base, r50k_base, p50k_base).

4 failures, identical between upstream and this PR (pre-existing, same as on the fix-tokenizer-added-tokens branch):

| Encoding | Failing test | Cause |
| --- | --- | --- |
| cl100k_base | special_token_endoftext | IREE matches `<\|endoftext\|>` as special token; tiktoken's encode_ordinary treats it as literal text |
| o200k_base | special_token_endoftext | Same |
| r50k_base | special_token_endoftext | Same |
| p50k_base | special_token_endoftext | Same |

Root cause: the IREE tokenizer has no "encode ordinary" mode (equivalent to tiktoken's disallowed_special=()). Fixed in #23830.
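
The difference between the two modes can be sketched with a toy byte-level tokenizer. This is only an illustration under stated assumptions, not the IREE or tiktoken implementation: `SPECIAL_TOKENS`, `encode_with_specials`, and the id `256` are hypothetical, and ordinary text is simply mapped to its UTF-8 byte values. The function name `encode_ordinary` is borrowed from tiktoken for readability only.

```python
# Hypothetical special-token table; 256 is just the first id past the byte range.
SPECIAL_TOKENS = {"<|endoftext|>": 256}


def encode_ordinary(text):
    """Treat all input, including special-token strings, as literal text."""
    return list(text.encode("utf-8"))


def encode_with_specials(text):
    """Match special-token strings and emit their dedicated ids."""
    ids = []
    i = 0
    while i < len(text):
        for token, token_id in SPECIAL_TOKENS.items():
            if text.startswith(token, i):
                ids.append(token_id)
                i += len(token)
                break
        else:
            # No special token starts here: encode one character as bytes.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


text = "hi<|endoftext|>"
print(encode_with_specials(text))  # [104, 105, 256]
print(encode_ordinary(text))       # 15 byte-level ids, none of them 256
```

A comparison test that feeds the same string to one tokenizer in the first mode and another in the second will always disagree on such inputs, which is the shape of the four failures listed above.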

@jtuyls jtuyls requested a review from benvanik as a code owner March 18, 2026 09:21
@jtuyls jtuyls force-pushed the fix-encode-silent-truncation branch from 21df09d to 6a7f673 Compare March 18, 2026 11:33
Comment thread runtime/src/iree/tokenizer/tokenizer_decode_test.cc Outdated
Return RESOURCE_EXHAUSTED instead of silently truncating when the output
buffer is too small, matching the documented API contract in tokenizer.h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jorn <jorn.tuyls@gmail.com>
@jtuyls jtuyls force-pushed the fix-encode-silent-truncation branch from 6a7f673 to 4337664 Compare March 25, 2026 16:27
@jtuyls jtuyls requested a review from benvanik March 25, 2026 17:39
@jtuyls jtuyls merged commit 25076ee into iree-org:main Apr 3, 2026
57 checks passed
@jtuyls jtuyls deleted the fix-encode-silent-truncation branch April 3, 2026 07:17
2 participants