[Tokenizer] Add encode ordinary mode (no special token matching)#23830

Merged
jtuyls merged 1 commit into iree-org:main from jtuyls:fix-tiktoken-encode-ordinary
Apr 3, 2026

Conversation

@jtuyls
Contributor

@jtuyls jtuyls commented Mar 18, 2026

Add `IREE_TOKENIZER_ENCODE_FLAG_NO_SPECIAL_TOKEN_MATCHING` to skip special token recognition in input text, equivalent to tiktoken's `encode_ordinary()`. Fixes the 4 (of 76) failing tiktoken smoketest cases. Also skips tiktoken models in the HuggingFace smoketest; they are covered by the dedicated tiktoken smoketest instead.
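
A minimal Python sketch of what the flag changes (illustrative only — the real implementation is in the IREE C runtime, and the character-level fallback and id value here stand in for real BPE output):

```python
# Toy model of special-token matching vs. "encode ordinary" mode.
# The id 100257 mirrors cl100k_base's <|endoftext|>, but both the id and
# the per-character fallback are illustrative, not the real BPE path.
SPECIALS = {"<|endoftext|>": 100257}

def encode(text, match_special_tokens=True):
    ids, i = [], 0
    while i < len(text):
        if match_special_tokens:
            hit = next((s for s in SPECIALS if text.startswith(s, i)), None)
            if hit is not None:
                ids.append(SPECIALS[hit])  # whole special -> one id
                i += len(hit)
                continue
        ids.append(ord(text[i]))  # "ordinary" path: treat as literal text
        i += 1
    return ids
```

With matching enabled, `"<|endoftext|>"` collapses to a single id; with the new flag (modeled by `match_special_tokens=False`, mirroring `encode_ordinary()`), the same string is tokenized as 13 literal characters.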

Test Results:

  • HuggingFace smoketest: 1667/1667 passed (tiktoken models skipped, tested separately)
  • Tiktoken smoketest: 76/76 passed (all 4 `special_token_endoftext` tests now pass with `--match_special`)

benvanik pushed a commit that referenced this pull request Mar 25, 2026
#23852)

## Summary

When text preceded back-to-back special tokens (e.g.
`<|user|>Hi<|end|><|assistant|>`), the first special token was dropped
because matching continued while a deferred token was pending,
overwriting the single `pending_special_token` slot. This breaks every
multi-turn chat prompt. Add the missing `pending < 0` guard to skip
matching until the deferred token is emitted.
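
The failure mode can be simulated with a toy matcher loop (a Python sketch of the mechanism described above, not the IREE C code — `guarded` models the added `pending < 0` check):

```python
def encode(text, specials, guarded=True):
    """Toy matcher: a matched special is deferred in a single pending
    slot and emitted only after the preceding text is flushed."""
    out, pending, i = [], None, 0
    while i < len(text):
        special = None
        if pending is None or not guarded:
            for s in specials:
                if text.startswith(s, i):
                    special = s
                    break
        if special is not None:
            # Unguarded, a second back-to-back match lands here while
            # `pending` is still set and silently overwrites it (the bug).
            pending = special
            i += len(special)
            continue
        if pending is not None:
            out.append(pending)  # emit the deferred special token
            pending = None
            continue             # retry matching at the same offset
        out.append(text[i])      # stand-in for ordinary text tokenization
        i += 1
    if pending is not None:
        out.append(pending)
    return out
```

With `guarded=False`, encoding `<|user|>Hi<|end|><|assistant|>` drops `<|end|>` exactly as described; with the guard, all three specials survive.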

## Test results

### HuggingFace smoketest (`huggingface_smoketest.py`)

**1667/1667** tokenization comparisons pass across ~80 HuggingFace
models (0 mismatches).

76 additional tests fail to **load** (not tokenization mismatches) —
these are tiktoken models listed in the HF smoketest that aren't valid
HuggingFace model identifiers. They're tested by the dedicated tiktoken
smoketest instead. This is fixed in
#23830.

### Tiktoken smoketest (`tiktoken_smoketest.py`)

**72/76** tokenization comparisons pass across 4 tiktoken encodings
(cl100k_base, o200k_base, r50k_base, p50k_base).

4 failures — **identical between upstream and this PR** (pre-existing,
same as `fix-tokenizer-added-tokens` branch):

| Encoding | Failing test | Cause |
|----------|-------------|-------|
| cl100k_base | `special_token_endoftext` | IREE matches `<\|endoftext\|>` as a special token; tiktoken's `encode_ordinary` treats it as literal text |
| o200k_base | `special_token_endoftext` | Same |
| r50k_base | `special_token_endoftext` | Same |
| p50k_base | `special_token_endoftext` | Same |

Root cause: the IREE tokenizer has no "encode ordinary" mode (equivalent
to tiktoken's `disallowed_special=()`). Fixed in
#23830.

Signed-off-by: Jorn <jorn.tuyls@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jtuyls force-pushed the fix-tiktoken-encode-ordinary branch from 905f4a3 to a5ec2c9 on March 25, 2026 17:04
@jtuyls requested a review from benvanik on March 25, 2026 17:40
@jtuyls merged commit 230ef28 into iree-org:main on Apr 3, 2026
106 of 111 checks passed
@jtuyls deleted the fix-tiktoken-encode-ordinary branch on April 3, 2026 07:11
jtuyls added a commit that referenced this pull request Apr 3, 2026
…23828)

## Summary

Return RESOURCE_EXHAUSTED instead of silently truncating when the output
buffer is too small, matching the documented API contract in
tokenizer.h.
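
The contract change amounts to a capacity check before writing (a hypothetical Python analogue — the real API reports status codes from C, and the names below are invented for illustration):

```python
class ResourceExhausted(Exception):
    """Stand-in for an IREE RESOURCE_EXHAUSTED status."""

def encode_into(token_ids, out, out_capacity):
    """Copy token_ids into a fixed-capacity output buffer.

    Old behavior: silently copy only the first out_capacity ids.
    New behavior: fail loudly so the caller can resize and retry.
    """
    if len(token_ids) > out_capacity:
        raise ResourceExhausted(
            f"need {len(token_ids)} slots, have {out_capacity}")
    out[:len(token_ids)] = token_ids
    return len(token_ids)  # number of tokens written
```

Failing loudly lets callers distinguish "buffer too small" from a genuinely short encoding, which silent truncation conflates.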

## Test results

### HuggingFace smoketest (`huggingface_smoketest.py`)

**1667/1667** tokenization comparisons pass across ~80 HuggingFace
models (0 mismatches).

76 additional tests fail to **load** (not tokenization mismatches) —
these are tiktoken models listed in the HF smoketest that aren't valid
HuggingFace model identifiers. They're tested by the dedicated tiktoken
smoketest instead. These tests are removed from the HF smoketest in
#23830.

### Tiktoken smoketest (`tiktoken_smoketest.py`)

**72/76** tokenization comparisons pass across 4 tiktoken encodings
(cl100k_base, o200k_base, r50k_base, p50k_base).

4 failures — **identical between upstream and this PR** (pre-existing,
same as `fix-tokenizer-added-tokens` branch):

| Encoding | Failing test | Cause |
|----------|-------------|-------|
| cl100k_base | `special_token_endoftext` | IREE matches `<\|endoftext\|>` as a special token; tiktoken's `encode_ordinary` treats it as literal text |
| o200k_base | `special_token_endoftext` | Same |
| r50k_base | `special_token_endoftext` | Same |
| p50k_base | `special_token_endoftext` | Same |

Root cause: the IREE tokenizer has no "encode ordinary" mode (equivalent
to tiktoken's `disallowed_special=()`). Fixed in
#23830.

Signed-off-by: Jorn <jorn.tuyls@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>