
Add chat-template hooks to LMEvalORTGenAIEvaluator #2462

Open

ykhrustalev wants to merge 4 commits into microsoft:main from ykhrustalev:lmeval-ort-chat-template

Conversation

@ykhrustalev (Contributor) commented May 12, 2026

Describe your changes

Implement tokenizer_name and apply_chat_template on LMEvalORTGenAIEvaluator so the backend supports lm_eval.simple_evaluate(apply_chat_template=True). Without these, lm-eval raises NotImplementedError at task setup for any chat-formatted task.

This mirrors the HuggingFace backend in lm_eval/models/huggingface.py. The HF tokenizer is loaded lazily on the first apply_chat_template call, so model directories without HF tokenizer files still work for non-chat evaluation. Generation continues to go through og.Tokenizer.
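
A minimal sketch of the shape these hooks take, assuming the evaluator keeps the model directory in `self._pretrained`; attribute and parameter names here are illustrative rather than the exact ones in `olive/evaluator/lmeval_ort.py`:

```python
from transformers import AutoTokenizer


class LMEvalORTGenAIEvaluator:
    def __init__(self, pretrained: str):
        # the real __init__ also builds the og.Tokenizer used for generation
        self._pretrained = str(pretrained)
        self._hf_tokenizer = None  # loaded lazily, only for chat templating

    @property
    def tokenizer_name(self) -> str:
        # lm-eval folds this string into its chat-aware cache key
        return self._pretrained.replace("/", "__")

    def apply_chat_template(self, chat_history, add_generation_prompt=True):
        # Lazy-load so model directories without tokenizer_config.json etc.
        # keep working for non-chat evaluation.
        if self._hf_tokenizer is None:
            self._hf_tokenizer = AutoTokenizer.from_pretrained(self._pretrained)
        return self._hf_tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=add_generation_prompt
        )
```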

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? Release note: Enable apply_chat_template=True in lm-eval for ortgenai-backed evaluators.

(Optional) Issue link

N/A

lm-eval's `simple_evaluate(..., apply_chat_template=True)` requires the
underlying LM class to implement `tokenizer_name` and `apply_chat_template`.
The HFLM backend has both; the ORT GenAI backend does not, so any attempt
to evaluate a chat-tuned ONNX model with chat-formatted prompts raises
`NotImplementedError: To use this model with chat templates, please
implement the 'tokenizer_name' property.`

This adds the two members with the minimum surface area:
  - `tokenizer_name` returns the model path (for lm-eval's chat-aware
    result caching), matching the HFLM convention of slash-replacement.
  - `apply_chat_template` defers to the model's HF tokenizer via
    `AutoTokenizer.apply_chat_template`, mirroring HFLM's
    implementation.

The HF tokenizer is loaded once at `__init__` purely for chat-template
rendering; token-level encode/decode still goes through `og.Tokenizer`
and the runtime, so there is no change to generation behavior or any
existing code path.

Verified end-to-end running MBPP on LFM2.5-350M (int4, k_quant_mixed):
without chat-template hooks the eval raised at task start; with them
plus `num_fewshot=0` and a chat-friendly stop list, pass@1 went from
0.0/500 to 67/500 (13.4%) -- the original 0.0 was a prompt-format
artifact (instruct model + completion-style few-shot), not a
conversion regression.
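
For reference, a chat-formatted run then goes through the standard lm-eval entry point; the task list and evaluator construction below are illustrative:

```python
import lm_eval

# `evaluator` is an LMEvalORTGenAIEvaluator wrapping the ONNX model directory
results = lm_eval.simple_evaluate(
    model=evaluator,
    tasks=["mbpp"],
    num_fewshot=0,             # completion-style few-shot defeats instruct models
    apply_chat_template=True,  # raises NotImplementedError without these hooks
)
```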
Copilot AI review requested due to automatic review settings (May 12, 2026 21:06)
Copilot AI left a comment

Pull request overview

This PR adds lm-eval “chat template” integration hooks to LMEvalORTGenAIEvaluator so ORT GenAI–backed models can be evaluated via lm_eval.simple_evaluate(..., apply_chat_template=True) (matching the capability available in the HuggingFace backend).

Changes:

  • Add an HF tokenizer instance to LMEvalORTGenAIEvaluator for rendering chat templates.
  • Implement tokenizer_name for lm-eval chat-template-aware caching.
  • Implement apply_chat_template(...) by delegating to the HF tokenizer.

Comment thread: olive/evaluator/lmeval_ort.py (Outdated)
Comment on lines +501 to +504

    # HF tokenizer kept solely to render `apply_chat_template`; generation
    # still uses og.Tokenizer above.
    self._pretrained = str(pretrained)
    self._hf_tokenizer = AutoTokenizer.from_pretrained(self._pretrained)

Comment thread: olive/evaluator/lmeval_ort.py (Outdated)

    @property
    def tokenizer_name(self) -> str:
        """Identifier used by lm-eval for chat-template-aware caching."""
        return self._pretrained.replace("/", "__")
… key, tests

- Lazy-load the HF tokenizer on the first ``apply_chat_template`` call rather
  than at ``__init__``. Callers that never enable chat templating no longer
  need HF tokenizer files (``tokenizer_config.json`` etc.) in the model
  directory; eager loading would have regressed those workflows.

- ``tokenizer_name`` now replaces both POSIX and Windows path separators with
  ``__`` so the lm-eval cache identifier is stable across platforms (see the
  sketch after this list). The previous implementation only handled forward
  slashes, leaving backslashes in the key on Windows because
  ``str(Path(...))`` preserves the native separator.

- Add unit tests for both behaviours (condensed sketch further below):
    - ``tokenizer_name`` parametrised over POSIX, relative, and Windows-style
      paths to lock in the normalisation contract.
    - ``apply_chat_template`` verified to (a) not load the HF tokenizer at
      construction, (b) load once on first call, and (c) reuse the cached
      tokenizer on subsequent calls. ``AutoTokenizer`` is patched so the
      tests run without any HF tokenizer files on disk.
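
A sketch of the separator normalisation described above; the regex is one way to do it, and the class scaffolding is illustrative:

```python
import re


class _Sketch:
    def __init__(self, pretrained: str):
        self._pretrained = str(pretrained)

    @property
    def tokenizer_name(self) -> str:
        # str(Path(...)) keeps the native separator, so strip both "/" and
        # "\" for a cache identifier that is stable across platforms
        return re.sub(r"[\\/]", "__", self._pretrained)


assert _Sketch("models/lfm").tokenizer_name == "models__lfm"
assert _Sketch(r"models\lfm").tokenizer_name == "models__lfm"
```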

All four new tests pass; ``test_olive_evaluator.py`` as a whole stays green
(85 passed). ``lintrunner`` reports no new warnings on the changed files.
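
A condensed sketch of the lazy-load test described above; `make_evaluator` is a hypothetical construction helper, and the patch target assumes `AutoTokenizer` is imported in `olive/evaluator/lmeval_ort.py`:

```python
from unittest.mock import MagicMock, patch


@patch("olive.evaluator.lmeval_ort.AutoTokenizer")
def test_apply_chat_template_lazy_loads_once(mock_auto_tok):
    evaluator = make_evaluator()  # hypothetical helper; must not need HF files
    # (a) construction must not load the HF tokenizer
    mock_auto_tok.from_pretrained.assert_not_called()

    mock_auto_tok.from_pretrained.return_value = MagicMock()
    evaluator.apply_chat_template([{"role": "user", "content": "hi"}])
    evaluator.apply_chat_template([{"role": "user", "content": "again"}])
    # (b) loaded once on first call, (c) cached instance reused afterwards
    mock_auto_tok.from_pretrained.assert_called_once()
```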
Comment thread: test/evaluator/test_olive_evaluator.py (Fixed)
