[Tokenizer] Fix silent output truncation in encode/decode finalize by jtuyls · Pull Request #23828 · iree-org/iree

jtuyls · 2026-03-18T09:21:27Z

Summary

Return RESOURCE_EXHAUSTED instead of silently truncating when the output buffer is too small, matching the documented API contract in tokenizer.h.
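For reviewers unfamiliar with the contract, the intended behavior can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual IREE code: `sketch_status_t`, `sketch_finalize`, and all parameter names are hypothetical stand-ins (the real API uses `iree_status_t` and the signatures in tokenizer.h), and the choice to write nothing on failure is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical status codes for this sketch only; IREE uses iree_status_t.
typedef enum {
  SKETCH_STATUS_OK = 0,
  SKETCH_STATUS_RESOURCE_EXHAUSTED = 1,
} sketch_status_t;

// Flushes `pending_count` pending token ids into `out`.
// If they do not all fit in `out_capacity`, nothing is written and
// RESOURCE_EXHAUSTED is returned, so the caller can retry with a larger
// buffer instead of receiving a silently truncated result.
sketch_status_t sketch_finalize(const int32_t* pending, size_t pending_count,
                                int32_t* out, size_t out_capacity,
                                size_t* out_count) {
  assert(out_count != NULL);
  *out_count = 0;
  if (pending_count > out_capacity) {
    // Before the fix (as described in this PR) the equivalent path truncated
    // the output; the sketch reports the shortfall as an error instead.
    return SKETCH_STATUS_RESOURCE_EXHAUSTED;
  }
  if (pending_count > 0) {
    memcpy(out, pending, pending_count * sizeof(*pending));
  }
  *out_count = pending_count;
  return SKETCH_STATUS_OK;
}
```

A caller that sees the error status can grow its buffer and call again, which is not possible when truncation is silent.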

Test results

HuggingFace smoketest (`huggingface_smoketest.py`)

1667/1667 tokenization comparisons pass across ~80 HuggingFace models (0 mismatches).

76 additional tests fail to load (not tokenization mismatches) — these are tiktoken models listed in the HF smoketest that aren't valid HuggingFace model identifiers. They're tested by the dedicated tiktoken smoketest instead. These tests are removed from the HF smoketest in #23830.

Tiktoken smoketest (`tiktoken_smoketest.py`)

72/76 tokenization comparisons pass across 4 tiktoken encodings (cl100k_base, o200k_base, r50k_base, p50k_base).

4 failures — identical between upstream and this PR (pre-existing, same as fix-tokenizer-added-tokens branch):

| Encoding | Failing test | Cause |
| --- | --- | --- |
| cl100k_base | `special_token_endoftext` | IREE matches `<\|endoftext\|>` as special token; tiktoken's `encode_ordinary` treats it as literal text |
| o200k_base | `special_token_endoftext` | Same |
| r50k_base | `special_token_endoftext` | Same |
| p50k_base | `special_token_endoftext` | Same |

Root cause: the IREE tokenizer has no "encode ordinary" mode (equivalent to tiktoken's disallowed_special=()). Fixed in #23830.
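The difference between the two modes can be illustrated with a toy sketch. It only models whether the `<|endoftext|>` literal is recognized as a special token; it is not IREE's or tiktoken's matching logic, and `sketch_count_special_matches` and the `allow_special` flag are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SPECIAL_TEXT "<|endoftext|>"

// Counts how many occurrences of the special-token literal in `text` would
// be emitted as a single special token.
// With allow_special == false (the "encode ordinary" behavior) the literal
// is treated as plain text, so no special tokens are matched at all.
size_t sketch_count_special_matches(const char* text, bool allow_special) {
  assert(text != NULL);
  if (!allow_special) return 0;
  const size_t special_length = strlen(SKETCH_SPECIAL_TEXT);
  size_t count = 0;
  const char* cursor = text;
  while ((cursor = strstr(cursor, SKETCH_SPECIAL_TEXT)) != NULL) {
    ++count;
    cursor += special_length;  // Continue scanning after this match.
  }
  return count;
}
```

In these terms, the four failing tests compare a tokenizer that always behaves as if `allow_special` were true against a reference that behaves as if it were false.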

Return RESOURCE_EXHAUSTED instead of silently truncating when the output buffer is too small, matching the documented API contract in tokenizer.h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jorn <jorn.tuyls@gmail.com>

jtuyls requested a review from benvanik as a code owner March 18, 2026 09:21

jtuyls force-pushed the fix-encode-silent-truncation branch from 21df09d to 6a7f673 Compare March 18, 2026 11:33

benvanik requested changes Mar 25, 2026

Comment thread runtime/src/iree/tokenizer/tokenizer_decode_test.cc Outdated

jtuyls force-pushed the fix-encode-silent-truncation branch from 6a7f673 to 4337664 Compare March 25, 2026 16:27

jtuyls requested a review from benvanik March 25, 2026 17:39

benvanik approved these changes Apr 3, 2026

jtuyls merged commit 25076ee into iree-org:main Apr 3, 2026
57 checks passed

jtuyls deleted the fix-encode-silent-truncation branch April 3, 2026 07:17
