Implement generate_until for LMEvalORTGenAIEvaluator #2448

Open

justinchuby wants to merge 1 commit into main
Conversation
Add a `generate_until` method to the ortgenai evaluator, enabling chain-of-thought (CoT) benchmarks like MMLU Pro that generate text and extract answers via regex filters. Previously, `generate_until` raised `NotImplementedError`, limiting the evaluator to log-likelihood-only benchmarks. This blocked CoT-scored benchmarks, which are the standard methodology for instruction-tuned models like Gemma4.

The implementation:

- Tokenizes the prompt and generates token-by-token using `og.Generator`
- Supports multiple EOS token IDs (common in modern models)
- Checks stop sequences periodically during generation for early exit
- Handles temperature/sampling and `max_gen_toks` from `gen_kwargs`
- Truncates output at the first matching stop sequence

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
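For orientation, here is a minimal, self-contained sketch of such a loop against the onnxruntime-genai Python API (`og.Model`, `og.Generator`, `append_tokens`, `set_search_options`). The method names follow recent onnxruntime-genai releases and may differ across versions, and the helper below is illustrative rather than the PR's actual code:

```python
# Hypothetical sketch of a generate_until-style loop with onnxruntime-genai.
# API names (og.Generator, append_tokens, set_search_options) follow recent
# releases and may vary by version; this is not the PR's exact implementation.
import onnxruntime_genai as og

def generate_until(model_dir: str, prompt: str, stop: list[str],
                   max_gen_toks: int = 256, temperature: float = 0.0) -> str:
    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    prompt_ids = tokenizer.encode(prompt)
    params = og.GeneratorParams(model)
    # Greedy decoding when temperature == 0, sampling otherwise.
    params.set_search_options(
        max_length=len(prompt_ids) + max_gen_toks,
        do_sample=temperature > 0,
        temperature=max(temperature, 1e-5),  # avoid a temperature=0 edge case
    )

    generator = og.Generator(model, params)
    generator.append_tokens(prompt_ids)

    generated_ids: list[int] = []
    while not generator.is_done():
        generator.generate_next_token()
        generated_ids.append(int(generator.get_next_tokens()[0]))
        # Decode every 16 tokens to allow early exit on a stop sequence.
        if stop and len(generated_ids) % 16 == 0:
            if any(s in tokenizer.decode(generated_ids) for s in stop):
                break

    text = tokenizer.decode(generated_ids)
    # Truncate at the earliest matching stop sequence.
    cuts = [text.find(s) for s in stop if s in text]
    return text[:min(cuts)] if cuts else text
```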
Contributor
Pull request overview
Adds generate_until support to the ortgenai (ONNX Runtime GenAI) lm-eval-harness evaluator so CoT-style benchmarks can run via text generation rather than log-likelihood only.
Changes:
- Implement `generate_until` in `LMEvalORTGenAIEvaluator`, including stop-sequence handling and early-exit checks.
- Add config handling for `eos_token_id` that may be a list (but currently only retains the first element).
- Support basic sampling vs. greedy decoding based on `temperature`, and enforce `max_gen_toks` within the model's `max_length`.
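A hedged sketch of how an evaluator can pull these settings out of lm-eval-harness `gen_kwargs` (the `until`/`max_gen_toks`/`temperature` keys follow lm-eval-harness conventions; the function name and defaults here are illustrative, not the PR's exact code):

```python
# Illustrative helper: normalize lm-eval-harness gen_kwargs for generation.
# Key names follow lm-eval-harness conventions; defaults are assumptions.
def parse_gen_kwargs(gen_kwargs: dict, model_max_length: int, prompt_len: int):
    until = gen_kwargs.get("until") or []
    if isinstance(until, str):
        until = [until]
    # Enforce max_gen_toks within the model's max_length budget.
    max_gen_toks = min(gen_kwargs.get("max_gen_toks", 256),
                       model_max_length - prompt_len)
    temperature = gen_kwargs.get("temperature", 0.0)
    # Greedy decoding unless a positive temperature (or do_sample) is given.
    do_sample = gen_kwargs.get("do_sample", temperature > 0)
    return until, max_gen_toks, temperature, do_sample
```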
Comment on lines +514 to +515:

```python
# Use the first element for loglikelihood evaluation.
self._eot_token_id = eot[0] if isinstance(eot, list) else eot
```
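As a hedged illustration of the reviewer's point, one alternative is to retain the whole list for generation and keep only a single ID for loglikelihood scoring; the names below mirror the snippet but are assumptions, not the PR's code:

```python
# Hypothetical alternative: keep every EOS ID for generation instead of
# dropping the rest of the list. Attribute and config names are illustrative.
eot = config.get("eos_token_id")  # may be an int or a list, e.g. [1, 106]
self._eos_token_ids = eot if isinstance(eot, list) else [eot]
self._eot_token_id = self._eos_token_ids[0]  # single ID for loglikelihood
```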
Comment on lines +621 to +631:

```python
eos_ids = self._eot_token_id if isinstance(self._eot_token_id, (list, tuple)) else [self._eot_token_id]

generated_ids = []
# Decode periodically to check for stop sequences
decode_interval = 16
while not generator.is_done():
    generator.generate_next_token()
    token_id = generator.get_next_tokens()[0]
    generated_ids.append(token_id)
    if token_id in eos_ids:
        break
```
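The excerpt defines `decode_interval` but the collapsed lines that use it are not visible. Presumably the loop decodes every `decode_interval` tokens to check the `until` stop strings; a hedged sketch of that check (the `tokenizer` and `until` names are assumptions, and the hidden code may differ):

```python
# Hypothetical continuation of the loop body above: decode periodically and
# exit early when any stop string appears in the generated text.
if len(generated_ids) % decode_interval == 0:
    text = tokenizer.decode(generated_ids)
    if any(stop in text for stop in until):
        break
```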
Summary
Implements `generate_until` for the ortgenai (ORT GenAI) evaluator in the lm-eval-harness integration, enabling chain-of-thought (CoT) benchmarks.

Motivation

The `LMEvalORTGenAIEvaluator` previously raised `NotImplementedError` for `generate_until`, which blocked CoT-scored benchmarks like MMLU Pro (v3). These benchmarks are the standard methodology for evaluating instruction-tuned models: Google's published Gemma4 scores use CoT generation + regex answer extraction, not log-likelihood scoring.

Changes

- Add a `generate_until` method to `LMEvalORTGenAIEvaluator`
- Support multiple EOS token IDs ([1, 106])

Testing
Validated with Gemma4 E4B-IT ONNX models on MMLU Pro:

- Log-likelihood (`leaderboard_mmlu_pro`): 33.0% F16
- CoT (`mmlu_pro`): running, results pending

The log-likelihood vs. CoT methodology difference explains the gap vs. Google's published 69.4% (which uses CoT).