Summary
hash_tokenizer_config() in openverifiablellm/tokenizer.py is hardcoded to look for BPE artifacts (vocab.json and merges.txt). Calling it after training a SentencePiece tokenizer crashes immediately with
FileNotFoundError because SentencePiece produces spm.model and spm.vocab instead.
The function claims to hash tokenizer configuration but silently only works for one of the two supported
tokenizer types.
Root Cause
In openverifiablellm/tokenizer.py, hash_tokenizer_config() is hardcoded
for BPE artifacts only
SentencePiece training produces completely different artifacts:
spm.model ← binary model file
spm.vocab ← human readable vocab
Neither vocab.json nor merges.txt is ever produced by SentencePiece,
so the function always crashes when called after SentencePiece training.
Proposed Fix
openverifiablellm/tokenizer.py
Detect tokenizer type from artifacts present on disk
tests/test_tokenizer.py
Add missing tests
Files to Change
openverifiablellm/tokenizer.py — fix hash_tokenizer_config()
tests/test_tokenizer.py — add missing test cases
Related
Impact
Critical - Application is unusable
Code of Conduct
Summary
hash_tokenizer_config()inopenverifiablellm/tokenizer.pyis hardcoded to look for BPE artifacts (vocab.jsonandmerges.txt). Calling it after training a SentencePiece tokenizer crashes immediately withFileNotFoundErrorbecause SentencePiece producesspm.modelandspm.vocabinstead.The function claims to hash tokenizer configuration but silently only works for one of the two supported
tokenizer types.
Root Cause
In
openverifiablellm/tokenizer.py,hash_tokenizer_config()is hardcodedfor BPE artifacts only
SentencePiece training produces completely different artifacts:
Neither
vocab.jsonnormerges.txtis ever produced by SentencePiece,so the function always crashes when called after SentencePiece training.
Proposed Fix
openverifiablellm/tokenizer.pyDetect tokenizer type from artifacts present on disk
tests/test_tokenizer.pyAdd missing tests
Files to Change
openverifiablellm/tokenizer.py— fixhash_tokenizer_config()tests/test_tokenizer.py— add missing test casesRelated
(introduced
hash_tokenizer_config()with BPE-only support)test_hash_changes_when_merges_changeImpact
Critical - Application is unusable
Code of Conduct