System Info
- transformers version: 5.4.0
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
- Python version: 3.13.5
- Huggingface_hub version: 1.8.0
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A100-PCIE-40GB
Who can help?
@ArthurZucker @itazap
Information
Tasks
Reproduction
import transformers
print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"mistralai/Mistral-Nemo-Instruct-2407",
trust_remote_code=True,
fix_mistral_regex=True,
)
Traceback (most recent call last):
File "/mnt/RAPID/tmp/repro_mistral_regex_bug.py", line 23, in <module>
tokenizer = AutoTokenizer.from_pretrained(
"mistralai/Mistral-Nemo-Instruct-2407",
trust_remote_code=True,
fix_mistral_regex=True,
)
File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
return tokenizer_class_from_name(tokenizer_config_class).from_pretrained(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
pretrained_model_name_or_path, *inputs, **kwargs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
return cls._from_pretrained(
~~~~~~~~~~~~~~~~~~~~^
resolved_vocab_files,
^^^^^^^^^^^^^^^^^^^^^
...<9 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 477, in __init__
self._tokenizer = self._patch_mistral_regex(
~~~~~~~~~~~~~~~~~~~~~~~~~^
self._tokenizer,
^^^^^^^^^^^^^^^^
...<3 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 1363, in _patch_mistral_regex
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'
Expected behavior
fix_mistral_regex=True should successfully replace the incorrect pre-tokenizer regex pattern in the Mistral tokenizer without raising any error.
Root cause analysis and suggested fix
In tokenization_utils_tokenizers.py, _patch_mistral_regex is called from __init__ as:
# line ~477
self._tokenizer = self._patch_mistral_regex(
self._tokenizer, # <-- this is a raw tokenizers.Tokenizer (Rust object)
...
)
Inside _patch_mistral_regex, line 1363 then does:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer # BUG
But tokenizer here is already self._tokenizer — the raw Rust tokenizers.Tokenizer object. The .backend_tokenizer property exists on the Python-level PreTrainedTokenizerFast / TokenizersBackend wrapper, not on the underlying Rust object itself.
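The mismatch is easy to reproduce in isolation with stand-in classes (`RustTokenizer` and `FastWrapper` below are illustrative stand-ins, not transformers internals):

```python
class RustTokenizer:
    """Stand-in for the raw tokenizers.Tokenizer: it exposes
    .pre_tokenizer directly and has no .backend_tokenizer attribute."""
    pre_tokenizer = "Split(pattern=...)"

class FastWrapper:
    """Stand-in for the Python-level PreTrainedTokenizerFast /
    TokenizersBackend wrapper, which re-exports the raw object
    as .backend_tokenizer."""
    def __init__(self):
        self._tokenizer = RustTokenizer()

    @property
    def backend_tokenizer(self):
        return self._tokenizer

wrapper = FastWrapper()

# Going through the wrapper works:
print(wrapper.backend_tokenizer.pre_tokenizer)

# But _patch_mistral_regex receives self._tokenizer (the raw object),
# and the same attribute access fails there:
try:
    wrapper._tokenizer.backend_tokenizer
except AttributeError:
    print("AttributeError, as in the traceback above")
```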
Fix: access .pre_tokenizer directly, since the argument is already the backend tokenizer:
# line 1363 — change:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
# to:
current_pretokenizer = tokenizer.pre_tokenizer
And correspondingly update any write-back in the same method that goes through .backend_tokenizer to use the object directly.
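A slightly more defensive variant of the fix can be sketched with a hypothetical helper (`get_backend_tokenizer` is not an existing transformers function); it would keep the method working even if a wrapper object were ever passed in:

```python
def get_backend_tokenizer(tokenizer):
    """Hypothetical helper: return the raw Rust tokenizers.Tokenizer,
    whether we were handed the PreTrainedTokenizerFast wrapper or the
    backend object itself.  The wrapper exposes .backend_tokenizer;
    the raw object does not, but it already is the backend."""
    return getattr(tokenizer, "backend_tokenizer", tokenizer)

# _patch_mistral_regex could then read and write the pre-tokenizer via:
#   backend = get_backend_tokenizer(tokenizer)
#   current_pretokenizer = backend.pre_tokenizer
#   backend.pre_tokenizer = patched_pretokenizer
```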