_patch_mistral_regex crashes with AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer' when loading Mistral tokenizer with fix_mistral_regex=True #45081

@kruthtom0

Description

System Info

  • transformers version: 5.4.0
  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
  • Python version: 3.13.5
  • Huggingface_hub version: 1.8.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)
Running this script raises:

Traceback (most recent call last):
  File "/mnt/RAPID/tmp/repro_mistral_regex_bug.py", line 23, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-Nemo-Instruct-2407",
        trust_remote_code=True,
        fix_mistral_regex=True,
    )
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_from_name(tokenizer_config_class).from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *inputs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 477, in __init__
    self._tokenizer = self._patch_mistral_regex(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._tokenizer,
        ^^^^^^^^^^^^^^^^
    ...<3 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 1363, in _patch_mistral_regex
    current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Expected behavior

fix_mistral_regex=True should successfully replace the incorrect pre-tokenizer regex pattern in the Mistral tokenizer without raising any error.


Root cause analysis and suggested fix

In tokenization_utils_tokenizers.py, _patch_mistral_regex is called from __init__ as:

# line ~477
self._tokenizer = self._patch_mistral_regex(
    self._tokenizer,   # <-- this is a raw tokenizers.Tokenizer (Rust object)
    ...
)

Inside _patch_mistral_regex, line 1363 then does:

current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer  # BUG

But tokenizer here is already self._tokenizer, i.e. the raw Rust tokenizers.Tokenizer object. The .backend_tokenizer property exists on the Python-level PreTrainedTokenizerFast / TokenizersBackend wrapper, not on the underlying Rust object itself.
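The mismatch can be illustrated with two minimal stand-ins (illustrative stubs only, not the real transformers/tokenizers classes):

```python
class RawBackendTokenizer:
    """Plays the role of the Rust tokenizers.Tokenizer:
    it exposes pre_tokenizer directly."""
    def __init__(self):
        self.pre_tokenizer = "mistral-split-regex"


class FastTokenizerWrapper:
    """Plays the role of the Python-level PreTrainedTokenizerFast:
    it wraps the raw object behind .backend_tokenizer."""
    def __init__(self, backend):
        self.backend_tokenizer = backend


raw = RawBackendTokenizer()
wrapper = FastTokenizerWrapper(raw)

# The wrapper route works:
print(wrapper.backend_tokenizer.pre_tokenizer)  # mistral-split-regex

# The raw object has no .backend_tokenizer, which is exactly what
# _patch_mistral_regex trips over when handed self._tokenizer:
print(hasattr(raw, "backend_tokenizer"))  # False
print(raw.pre_tokenizer)                  # mistral-split-regex
```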

Fix: access .pre_tokenizer directly, since the argument is already the backend tokenizer:

# line 1363 — change:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
# to:
current_pretokenizer = tokenizer.pre_tokenizer

Any write-back later in the same method that goes through .backend_tokenizer should likewise be updated to use the object directly.
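Alternatively, if the method may be handed either the wrapper or the raw object, a more defensive variant is to fall back with getattr. This is a sketch, not the actual transformers code; the helper name and the stub classes below are illustrative:

```python
def resolve_backend(tokenizer):
    """Return the raw backend tokenizer whether we were handed the
    Python wrapper (which has .backend_tokenizer) or the Rust object
    itself (which is already the backend)."""
    return getattr(tokenizer, "backend_tokenizer", tokenizer)


# Illustrative stubs standing in for the two possible object shapes:
class RawBackendTokenizer:
    def __init__(self):
        self.pre_tokenizer = "mistral-split-regex"


class FastTokenizerWrapper:
    def __init__(self, backend):
        self.backend_tokenizer = backend


raw = RawBackendTokenizer()
wrapper = FastTokenizerWrapper(raw)

# Both shapes now resolve to the same backend object:
print(resolve_backend(raw).pre_tokenizer)      # mistral-split-regex
print(resolve_backend(wrapper).pre_tokenizer)  # mistral-split-regex
```

With this guard in place, line 1363 could read resolve_backend(tokenizer).pre_tokenizer and would be safe against either calling convention.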
