feat: add ort-multimodal evaluator for fast multimodal benchmarks#2451

Draft
justinchuby wants to merge 3 commits into main from justinchu/ort-multimodal-eval

Conversation

@justinchuby
Contributor

Summary

Add LMEvalORTMultimodalEvaluator (ort-multimodal backend), which uses a direct ORT InferenceSession for multimodal GenAI packages. It is 15× faster than the ortgenai backend for text-only benchmarks on multimodal models.

Problem

The ortgenai evaluator loads all sub-models (decoder + vision + audio + embedding) and creates a new og.Generator per model call, adding significant overhead for text-only benchmarks like MMLU Pro.

The built-in ort evaluator can't handle models with heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256 for sliding attention and head_dim=512 for full attention) because its Prefill class allocates uniform IOBinding buffers.

Solution

The new ort-multimodal evaluator (see the sketch after this list):

  • Reads genai_config.json to locate decoder and embedding ONNX files
  • Loads only the decoder and embedding sessions (skips vision/audio)
  • Runs the embedding model to convert input_ids → inputs_embeds
  • Runs the decoder with per-layer empty KV cache buffers (supporting heterogeneous head_dim)
  • Returns full logits [batch, seq_len, vocab] in a single forward pass
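
A minimal sketch of this flow, assuming the genai_config.json lists the decoder and embedding files under config["model"], a single-output embedding model, and a decoder taking inputs_embeds / attention_mask / past_key_values.* (input names, dtypes, and the config layout vary per package; paths are placeholders):

import json
from pathlib import Path

import numpy as np
import onnxruntime as ort

model_dir = Path("/path/to/genai/package")
config = json.loads((model_dir / "genai_config.json").read_text())

# Assumed config layout; real GenAI configs may nest these differently.
decoder_cfg = config["model"]["decoder"]
embedding_cfg = config["model"]["embedding"]

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
decoder = ort.InferenceSession(str(model_dir / decoder_cfg["filename"]), providers=providers)
embedding = ort.InferenceSession(str(model_dir / embedding_cfg["filename"]), providers=providers)

# Text-only path: run the embedding model to turn input_ids into inputs_embeds.
input_ids = np.ones((1, 16), dtype=np.int64)  # placeholder prompt
(inputs_embeds,) = embedding.run(None, {"input_ids": input_ids})

# Build the decoder feed with per-layer EMPTY KV cache buffers. Allocating one
# buffer per declared input lets layers disagree on head_dim (e.g. 256 vs 512),
# which a single uniformly shaped IOBinding buffer cannot accommodate.
feed = {"inputs_embeds": inputs_embeds, "attention_mask": np.ones_like(input_ids)}
for inp in decoder.get_inputs():
    if inp.name.startswith("past_key_values"):
        # Resolve symbolic dims: batch -> 1, past sequence length -> 0;
        # num_heads/head_dim are concrete ints in the graph. Dtype assumed fp16.
        shape = [1 if "batch" in str(d) else (d if isinstance(d, int) else 0) for d in inp.shape]
        feed[inp.name] = np.zeros(shape, dtype=np.float16)

# Single forward pass returning full logits [batch, seq_len, vocab].
(logits,) = decoder.run(["logits"], feed)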

Benchmark

Gemma4 E2B-IT MMLU Pro (12,032 questions, CUDA EP, H200 GPU):

Backend          Accuracy   Throughput   vs ortgenai
PyTorch HF       22.6%      185 req/s    19.5×
ort-multimodal   22.6%      142 req/s    15×
ortgenai         22.6%      9.5 req/s    1× (baseline)

Full analysis: onnxruntime/mobius#285

Usage

import olive.evaluator.lmeval_ort  # registers backends

from lm_eval.api.registry import get_model
model = get_model('ort-multimodal')(
    pretrained='/path/to/genai/package',
    ep='CUDAExecutionProvider',
    max_length=2048,
)

Or via Olive config:

{"type": "LMEvaluator", "tasks": ["leaderboard_mmlu_pro"], "model_class": "ort-multimodal"}

Add LMEvalORTMultimodalEvaluator ('ort-multimodal' backend) that uses
direct ORT InferenceSession for multimodal GenAI packages. This avoids
the overhead of GenAI's Generator API while supporting models with
heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256
for sliding attention and head_dim=512 for full attention).

The evaluator:
- Loads decoder and embedding ONNX models from genai_config.json
- Runs embedding model to convert input_ids -> inputs_embeds
- Runs decoder with per-layer empty KV cache buffers
- Returns full logits [batch, seq_len, vocab] in a single forward pass

Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- ort-multimodal: 142 req/s (full run, matches PyTorch HF accuracy)
- ortgenai:       9.5 req/s (19.5x slower than PyTorch HF)
- PyTorch HF:     185 req/s (baseline)

Also wire up 'ort-multimodal' in LMEvaluator.evaluate() for Olive
pipeline integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Copilot AI review requested due to automatic review settings May 8, 2026 00:52
@justinchuby
Contributor Author

Not generalized

@justinchuby justinchuby marked this pull request as draft May 8, 2026 00:55
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new lm-eval backend (ort-multimodal) to run fast text-only benchmarks on ORT GenAI multimodal packages by directly invoking ONNX Runtime InferenceSession for the decoder (+ embedding), avoiding GenAI generator overhead and supporting heterogeneous KV cache head dimensions.

Changes:

  • Add LMEvalORTMultimodalEvaluator registered as ort-multimodal in olive.evaluator.lmeval_ort.
  • Wire LMEvaluator to dispatch model_class="ort-multimodal" with appropriate init args.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File                                Description
olive/evaluator/olive_evaluator.py  Adds the ort-multimodal dispatch path to construct lm-eval model init args.
olive/evaluator/lmeval_ort.py       Implements the ort-multimodal lm-eval backend using direct ORT sessions for decoder/embedding.

Comment on lines +1575 to +1580

    elif self.model_class == "ort-multimodal":
        init_args = {
            "pretrained": str(Path(model.model_path).parent),
            "ep": self.ep or execution_providers,
            "ep_options": self.ep_options,
        }
Comment on lines +521 to +531

    # Set up execution providers
    providers = []
    if ep:
        providers.append(ep)
    providers.append("CPUExecutionProvider")

    # Load decoder session
    decoder_path = str(model_dir / decoder_config["filename"])
    logger.info("Loading decoder from %s", decoder_path)
    self._decoder_sess = ort.InferenceSession(decoder_path, providers=providers)
Comment on lines +522 to +525

    providers = []
    if ep:
        providers.append(ep)
    providers.append("CPUExecutionProvider")

    result = self._decoder_sess.run(["logits"], dec_feed)
    return torch.from_numpy(result[0])
Cache the og.Generator object across model_call invocations and use
rewind_to(0) to reset state instead of creating a new Generator per
sample. Falls back to creating a new Generator if rewind_to is not
supported (e.g. older GenAI versions without multimodal rewind fix).

Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- Per-call: 379ms -> 256ms (1.48x per model call)
- End-to-end limit=200: 160.6s -> 145.2s (1.11x overall)
- Estimated full run savings: ~12 minutes

Requires onnxruntime-genai >= 0.14.0 with microsoft/onnxruntime-genai#2141
for multimodal model support.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
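
A hedged sketch of the caching pattern this commit describes. rewind_to, GeneratorParams, and set_search_options are onnxruntime-genai APIs; _cached_generator, _model, and _max_length are illustrative attribute names, and the cache is assumed to be used with a fixed batch size:

import onnxruntime_genai as og

def _get_generator(self, batch_size: int) -> "og.Generator":
    """Reuse one og.Generator across model calls, rewinding its state
    instead of paying Generator construction cost every time."""
    if self._cached_generator is not None:
        try:
            # Reset the position and KV cache back to 0 in place.
            self._cached_generator.rewind_to(0)
            return self._cached_generator
        except Exception:
            # Older onnxruntime-genai builds cannot rewind multimodal
            # models; fall through and build a fresh Generator instead.
            self._cached_generator = None
    params = og.GeneratorParams(self._model)
    params.set_search_options(max_length=self._max_length)
    # A fuller implementation would invalidate the cache when batch_size
    # changes; here it is assumed constant across calls.
    self._cached_generator = og.Generator(self._model, params)
    return self._cached_generator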

    def model_call(self, input_ids: torch.Tensor, cont_len: int = 0) -> torch.Tensor:
        batch_size, seq_len = input_ids.shape

    def _get_generator(self, batch_size: int) -> "og.Generator":
Replace get_output('logits') with incremental get_logits() in the
ortgenai evaluator's model_call. get_output copies the FULL logits
tensor (seq_len × vocab_size × 2 bytes, e.g. 472MB for 900 tokens
with 262K vocab) from GPU to CPU each call, taking 410ms. get_logits()
returns only the last position's logits (~1MB), taking 1.8ms.

For loglikelihood scoring, we only need logits at the continuation
token positions (typically 1-20 tokens). The new approach appends the
context as a bulk prefill, then steps through continuation tokens one
at a time using get_logits(), collecting only the needed positions.

Performance on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After:  24.2s (20.7 req/s)
- Speedup: 1.94x

Also removes the _returns_full_logits detection since the evaluator
now always uses the incremental path (which works for both model types).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
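
A sketch of the incremental scoring loop described above, using onnxruntime-genai's append_tokens and get_logits. score_continuation is a hypothetical helper, generator is an og.Generator prepared as in the previous commit, and the reshape assumes batch size 1:

import numpy as np
import torch

def score_continuation(generator, context_ids, continuation_ids):
    """Collect logits only at the continuation positions."""
    # Bulk prefill: append the whole context at once. get_logits() then
    # returns the LAST position's logits (the scores for the first
    # continuation token) without copying the full seq_len x vocab tensor.
    generator.append_tokens(np.array([context_ids], dtype=np.int32))
    rows = [np.array(generator.get_logits()).reshape(1, -1)]

    # Step one token at a time: logits after appending token i score token
    # i + 1, so the final continuation token needs no extra step.
    for tok in continuation_ids[:-1]:
        generator.append_tokens(np.array([[tok]], dtype=np.int32))
        rows.append(np.array(generator.get_logits()).reshape(1, -1))

    # Shape [len(continuation_ids), vocab]; row i scores continuation_ids[i].
    return torch.from_numpy(np.concatenate(rows, axis=0))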