
Add chat-template hooks to LMEvalORTGenAIEvaluator #2462

Open

ykhrustalev wants to merge 4 commits into microsoft:main from ykhrustalev:lmeval-ort-chat-template

Conversation

@ykhrustalev (Contributor) commented May 12, 2026

Describe your changes

Implement tokenizer_name and apply_chat_template on LMEvalORTGenAIEvaluator so the backend supports lm_eval.simple_evaluate(apply_chat_template=True). Without these, lm-eval raises NotImplementedError at task setup for any chat-formatted task.

This mirrors the HuggingFace backend in lm_eval/models/huggingface.py. The HF tokenizer is loaded lazily on the first apply_chat_template call, so model directories without HF tokenizer files still work for non-chat evaluation. Generation continues to go through og.Tokenizer.
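
A minimal sketch of the shape these hooks take, assuming the evaluator keeps the model directory in `self._pretrained`; attribute and parameter names here are illustrative rather than the exact ones in `olive/evaluator/lmeval_ort.py`:

```python
from transformers import AutoTokenizer


class LMEvalORTGenAIEvaluator:
    def __init__(self, pretrained: str):
        # the real __init__ also builds the og.Tokenizer used for generation
        self._pretrained = str(pretrained)
        self._hf_tokenizer = None  # loaded lazily, only for chat templating

    @property
    def tokenizer_name(self) -> str:
        # lm-eval folds this string into its chat-aware cache key
        return self._pretrained.replace("/", "__")

    def apply_chat_template(self, chat_history, add_generation_prompt=True):
        # Lazy-load so model directories without tokenizer_config.json etc.
        # keep working for non-chat evaluation.
        if self._hf_tokenizer is None:
            self._hf_tokenizer = AutoTokenizer.from_pretrained(self._pretrained)
        return self._hf_tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=add_generation_prompt
        )
```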

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? Release note: Enable apply_chat_template=True in lm-eval for ortgenai-backed evaluators.

(Optional) Issue link

N/A

lm-eval's `simple_evaluate(..., apply_chat_template=True)` requires the
underlying LM class to implement `tokenizer_name` and `apply_chat_template`.
The HFLM backend has both; the ORT GenAI backend does not, so any attempt
to evaluate a chat-tuned ONNX model with chat-formatted prompts raises
`NotImplementedError: To use this model with chat templates, please
implement the 'tokenizer_name' property.`

This adds the two members with the minimum surface area:
  - `tokenizer_name` returns the model path (for lm-eval's chat-aware
    result caching), matching the HFLM convention of slash-replacement.
  - `apply_chat_template` defers to the model's HF tokenizer via
    `AutoTokenizer.apply_chat_template`, mirroring HFLM's
    implementation.

The HF tokenizer is loaded once at `__init__` purely for chat-template
rendering; token-level encode/decode still goes through `og.Tokenizer`
and the runtime, so there is no change to generation behavior or any
existing code path.

Verified end-to-end running MBPP on LFM2.5-350M (int4, k_quant_mixed):
without chat-template hooks the eval raised at task start; with them
plus `num_fewshot=0` and a chat-friendly stop list, pass@1 went from
0.0/500 to 67/500 (13.4%) -- the original 0.0 was a prompt-format
artifact (instruct model + completion-style few-shot), not a
conversion regression.
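
For reference, a chat-formatted run then goes through the standard lm-eval entry point; the task list and evaluator construction below are illustrative:

```python
import lm_eval

# `evaluator` is an LMEvalORTGenAIEvaluator wrapping the ONNX model directory
results = lm_eval.simple_evaluate(
    model=evaluator,
    tasks=["mbpp"],
    num_fewshot=0,             # completion-style few-shot defeats instruct models
    apply_chat_template=True,  # raises NotImplementedError without these hooks
)
```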
Copilot AI review requested due to automatic review settings (May 12, 2026 21:06)
Copilot AI left a comment

Pull request overview

This PR adds lm-eval “chat template” integration hooks to LMEvalORTGenAIEvaluator so ORT GenAI–backed models can be evaluated via lm_eval.simple_evaluate(..., apply_chat_template=True) (matching the capability available in the HuggingFace backend).

Changes:

  • Add an HF tokenizer instance to LMEvalORTGenAIEvaluator for rendering chat templates.
  • Implement tokenizer_name for lm-eval chat-template-aware caching.
  • Implement apply_chat_template(...) by delegating to the HF tokenizer.

Comment thread: olive/evaluator/lmeval_ort.py (Outdated)
Comment on lines +501 to +504

    # HF tokenizer kept solely to render `apply_chat_template`; generation
    # still uses og.Tokenizer above.
    self._pretrained = str(pretrained)
    self._hf_tokenizer = AutoTokenizer.from_pretrained(self._pretrained)

Comment thread: olive/evaluator/lmeval_ort.py (Outdated)

    @property
    def tokenizer_name(self) -> str:
        """Identifier used by lm-eval for chat-template-aware caching."""
        return self._pretrained.replace("/", "__")
… key, tests

- Lazy-load the HF tokenizer on the first ``apply_chat_template`` call rather
  than at ``__init__``. Callers that never enable chat templating no longer
  need HF tokenizer files (``tokenizer_config.json`` etc.) in the model
  directory; eager loading would have regressed those workflows.

- ``tokenizer_name`` now replaces both POSIX and Windows path separators with
  ``__`` so the lm-eval cache identifier is stable across platforms (see the
  sketch after this list). The previous implementation only handled forward
  slashes, leaving backslashes in the key on Windows because
  ``str(Path(...))`` preserves the native separator.

- Add unit tests for both behaviours (condensed sketch further below):
    - ``tokenizer_name`` parametrised over POSIX, relative, and Windows-style
      paths to lock in the normalisation contract.
    - ``apply_chat_template`` verified to (a) not load the HF tokenizer at
      construction, (b) load once on first call, and (c) reuse the cached
      tokenizer on subsequent calls. ``AutoTokenizer`` is patched so the
      tests run without any HF tokenizer files on disk.
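
A sketch of the separator normalisation described above; the regex is one way to do it, and the class scaffolding is illustrative:

```python
import re


class _Sketch:
    def __init__(self, pretrained: str):
        self._pretrained = str(pretrained)

    @property
    def tokenizer_name(self) -> str:
        # str(Path(...)) keeps the native separator, so strip both "/" and
        # "\" for a cache identifier that is stable across platforms
        return re.sub(r"[\\/]", "__", self._pretrained)


assert _Sketch("models/lfm").tokenizer_name == "models__lfm"
assert _Sketch(r"models\lfm").tokenizer_name == "models__lfm"
```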

All four new tests pass; ``test_olive_evaluator.py`` as a whole stays green
(85 passed). ``lintrunner`` reports no new warnings on the changed files.
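
A condensed sketch of the lazy-load test described above; `make_evaluator` is a hypothetical construction helper, and the patch target assumes `AutoTokenizer` is imported in `olive/evaluator/lmeval_ort.py`:

```python
from unittest.mock import MagicMock, patch


@patch("olive.evaluator.lmeval_ort.AutoTokenizer")
def test_apply_chat_template_lazy_loads_once(mock_auto_tok):
    evaluator = make_evaluator()  # hypothetical helper; must not need HF files
    # (a) construction must not load the HF tokenizer
    mock_auto_tok.from_pretrained.assert_not_called()

    mock_auto_tok.from_pretrained.return_value = MagicMock()
    evaluator.apply_chat_template([{"role": "user", "content": "hi"}])
    evaluator.apply_chat_template([{"role": "user", "content": "again"}])
    # (b) loaded once on first call, (c) cached instance reused afterwards
    mock_auto_tok.from_pretrained.assert_called_once()
```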
Comment thread: test/evaluator/test_olive_evaluator.py (Fixed)
