
Implement generate_until for LMEvalORTGenAIEvaluator #2448

Open
justinchuby wants to merge 1 commit into main from justinchu/ortgenai-generate-until

Conversation

@justinchuby
Contributor

Summary

Implements generate_until for the ortgenai (ORT GenAI) evaluator in the lm-eval-harness integration, enabling chain-of-thought (CoT) benchmarks.

Motivation

The LMEvalORTGenAIEvaluator previously raised NotImplementedError for generate_until, which blocked CoT-scored benchmarks like MMLU Pro (v3). These benchmarks are the standard methodology for evaluating instruction-tuned models — Google's published Gemma4 scores use CoT generation + regex answer extraction, not log-likelihood scoring.

Changes

  • Add generate_until method to LMEvalORTGenAIEvaluator
  • Support multiple EOS token IDs (e.g., Gemma4 uses [1, 106])
  • Periodic stop-sequence checking during generation for early exit
  • Handle temperature/sampling and max_gen_toks from gen_kwargs
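As a rough illustration of the gen_kwargs handling listed above, the parsing step might look like the sketch below. The helper name _parse_gen_kwargs is hypothetical; until, max_gen_toks, and temperature are standard lm-eval-harness keys, but the defaults shown are assumptions.

def _parse_gen_kwargs(self, gen_kwargs: dict) -> tuple[list[str], int, float]:
    # Hypothetical helper: extract stop sequences, token budget, and temperature
    # from the gen_kwargs dict that lm-eval-harness passes with each request.
    kwargs = dict(gen_kwargs)  # copy so the caller's dict is not mutated
    until = kwargs.pop("until", [])
    if isinstance(until, str):
        until = [until]
    max_gen_toks = int(kwargs.pop("max_gen_toks", 256))
    temperature = float(kwargs.pop("temperature", 0.0))
    return until, max_gen_toks, temperature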

Testing

Validated with Gemma4 E4B-IT ONNX models on MMLU Pro:

  • Log-likelihood scoring (leaderboard_mmlu_pro): 33.0% F16
  • CoT generation (mmlu_pro): running, results pending

The methodology difference (log-likelihood scoring vs. CoT generation) explains the gap relative to Google's published 69.4%, which uses CoT.

Add generate_until method to the ortgenai evaluator, enabling
chain-of-thought (CoT) benchmarks like MMLU Pro that generate text
and extract answers via regex filters.

Previously, generate_until raised NotImplementedError, limiting the
evaluator to log-likelihood-only benchmarks. This blocked CoT-scored
benchmarks which are the standard methodology for instruction-tuned
models like Gemma4.

The implementation:
- Tokenizes the prompt and generates token-by-token using og.Generator
- Supports multiple EOS token IDs (common in modern models)
- Checks stop sequences periodically during generation for early exit
- Handles temperature/sampling and max_gen_toks from gen_kwargs
- Truncates output at the first matching stop sequence

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
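
A minimal sketch of the stop-sequence truncation step described in the commit message above, assuming the stop sequences arrive as plain strings (the helper name is hypothetical; the PR may implement this inline):

def _truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    # Cut the generated text at the earliest occurrence of any stop sequence,
    # so content after a stop marker (e.g. the next "Question:") is not scored.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]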
Copilot AI review requested due to automatic review settings May 7, 2026 22:02
Contributor

Copilot AI left a comment


Pull request overview

Adds generate_until support to the ortgenai (ONNX Runtime GenAI) lm-eval-harness evaluator so CoT-style benchmarks can run via text generation rather than log-likelihood only.

Changes:

  • Implement generate_until in LMEvalORTGenAIEvaluator, including stop-sequence handling and early-exit checks.
  • Add config handling for eos_token_id that may be a list (but currently only retains the first element).
  • Support basic sampling vs greedy decoding based on temperature, and enforce max_gen_toks within the model’s max_length.
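
For context, the sampling/greedy switch and the max_gen_toks clamp could be set up roughly as below, assuming the onnxruntime-genai Python API (og.GeneratorParams, set_search_options, Generator.append_tokens; older releases set the prompt via params.input_ids instead). Variable names such as model_max_length are assumptions, not the PR's identifiers.

prompt_ids = tokenizer.encode(prompt)
# Keep the total sequence within the model's max_length budget.
max_length = min(len(prompt_ids) + max_gen_toks, model_max_length)

params = og.GeneratorParams(model)
if temperature > 0.0:
    # Sampling path when a non-zero temperature is requested.
    params.set_search_options(max_length=max_length, do_sample=True, temperature=temperature)
else:
    # Greedy decoding otherwise.
    params.set_search_options(max_length=max_length, do_sample=False)

generator = og.Generator(model, params)
generator.append_tokens(prompt_ids)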

Comment on lines +514 to +515
# Use the first element for loglikelihood evaluation.
self._eot_token_id = eot[0] if isinstance(eot, list) else eot
Comment on lines +621 to +631
eos_ids = self._eot_token_id if isinstance(self._eot_token_id, (list, tuple)) else [self._eot_token_id]

generated_ids = []
# Decode periodically to check for stop sequences
decode_interval = 16
while not generator.is_done():
    generator.generate_next_token()
    token_id = generator.get_next_tokens()[0]
    generated_ids.append(token_id)
    if token_id in eos_ids:
        break
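
The excerpt above exits on an EOS token; the periodic stop-sequence check implied by decode_interval would continue roughly as follows (a sketch, not the PR's exact code; stop_sequences stands in for the request's until list):

    # (sketch: continuing inside the while loop above)
    if stop_sequences and len(generated_ids) % decode_interval == 0:
        text_so_far = tokenizer.decode(generated_ids)
        if any(stop in text_so_far for stop in stop_sequences):
            break  # early exit once any stop sequence has appeared

# After the loop the generation is decoded once more and truncated at the
# first matching stop sequence before being returned to the harness.
output_text = tokenizer.decode(generated_ids)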