_patch_mistral_regex crashes with AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer' when loading Mistral tokenizer with fix_mistral_regex=True #45081

@kruthtom0

Description

System Info

  • transformers version: 5.4.0
  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
  • Python version: 3.13.5
  • Huggingface_hub version: 1.8.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)
Running this script raises:

Traceback (most recent call last):
  File "/mnt/RAPID/tmp/repro_mistral_regex_bug.py", line 23, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-Nemo-Instruct-2407",
        trust_remote_code=True,
        fix_mistral_regex=True,
    )
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_from_name(tokenizer_config_class).from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *inputs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 477, in __init__
    self._tokenizer = self._patch_mistral_regex(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._tokenizer,
        ^^^^^^^^^^^^^^^^
    ...<3 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 1363, in _patch_mistral_regex
    current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Expected behavior

fix_mistral_regex=True should successfully replace the incorrect pre-tokenizer regex pattern in the Mistral tokenizer without raising any error.


Root cause analysis and suggested fix

In tokenization_utils_tokenizers.py, _patch_mistral_regex is called from __init__ as:

# line ~477
self._tokenizer = self._patch_mistral_regex(
    self._tokenizer,   # <-- this is a raw tokenizers.Tokenizer (Rust object)
    ...
)

Inside _patch_mistral_regex, line 1363 then does:

current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer  # BUG

But tokenizer here is already self._tokenizer, i.e. the raw Rust tokenizers.Tokenizer object. The .backend_tokenizer property exists on the Python-level PreTrainedTokenizerFast / TokenizersBackend wrapper, not on the underlying Rust object itself.
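The mismatch can be illustrated with two minimal stand-ins (illustrative stubs only, not the real transformers/tokenizers classes):

```python
class RawBackendTokenizer:
    """Plays the role of the Rust tokenizers.Tokenizer:
    it exposes pre_tokenizer directly."""
    def __init__(self):
        self.pre_tokenizer = "mistral-split-regex"


class FastTokenizerWrapper:
    """Plays the role of the Python-level PreTrainedTokenizerFast:
    it wraps the raw object behind .backend_tokenizer."""
    def __init__(self, backend):
        self.backend_tokenizer = backend


raw = RawBackendTokenizer()
wrapper = FastTokenizerWrapper(raw)

# The wrapper route works:
print(wrapper.backend_tokenizer.pre_tokenizer)  # mistral-split-regex

# The raw object has no .backend_tokenizer, which is exactly what
# _patch_mistral_regex trips over when handed self._tokenizer:
print(hasattr(raw, "backend_tokenizer"))  # False
print(raw.pre_tokenizer)                  # mistral-split-regex
```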

Fix: access .pre_tokenizer directly, since the argument is already the backend tokenizer:

# line 1363 — change:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
# to:
current_pretokenizer = tokenizer.pre_tokenizer

Any write-back later in the same method that goes through .backend_tokenizer should likewise be updated to use the object directly.
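Alternatively, if the method may be handed either the wrapper or the raw object, a more defensive variant is to fall back with getattr. This is a sketch, not the actual transformers code; the helper name and the stub classes below are illustrative:

```python
def resolve_backend(tokenizer):
    """Return the raw backend tokenizer whether we were handed the
    Python wrapper (which has .backend_tokenizer) or the Rust object
    itself (which is already the backend)."""
    return getattr(tokenizer, "backend_tokenizer", tokenizer)


# Illustrative stubs standing in for the two possible object shapes:
class RawBackendTokenizer:
    def __init__(self):
        self.pre_tokenizer = "mistral-split-regex"


class FastTokenizerWrapper:
    def __init__(self, backend):
        self.backend_tokenizer = backend


raw = RawBackendTokenizer()
wrapper = FastTokenizerWrapper(raw)

# Both shapes now resolve to the same backend object:
print(resolve_backend(raw).pre_tokenizer)      # mistral-split-regex
print(resolve_backend(wrapper).pre_tokenizer)  # mistral-split-regex
```

With this guard in place, line 1363 could read resolve_backend(tokenizer).pre_tokenizer and would be safe against either calling convention.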
