feat: add ort-multimodal evaluator for fast multimodal benchmarks#2451

Draft
justinchuby wants to merge 3 commits into main from justinchu/ort-multimodal-eval

Conversation

@justinchuby
Contributor

Summary

Add LMEvalORTMultimodalEvaluator (ort-multimodal backend), which uses a direct ORT InferenceSession for multimodal GenAI packages. It is 15× faster than the ortgenai backend for text-only benchmarks on multimodal models.

Problem

The ortgenai evaluator loads all sub-models (decoder + vision + audio + embedding) and creates a new og.Generator per model call, adding significant overhead for text-only benchmarks like MMLU Pro.

The built-in ort evaluator can't handle models with heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256 for sliding attention and head_dim=512 for full attention) because its Prefill class allocates uniform IOBinding buffers.

Solution

The new ort-multimodal evaluator (see the sketch after this list):

  • Reads genai_config.json to locate decoder and embedding ONNX files
  • Loads only the decoder and embedding sessions (skips vision/audio)
  • Runs the embedding model to convert input_ids → inputs_embeds
  • Runs the decoder with per-layer empty KV cache buffers (supporting heterogeneous head_dim)
  • Returns full logits [batch, seq_len, vocab] in a single forward pass
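
A minimal sketch of this flow, assuming the genai_config.json lists the decoder and embedding files under config["model"], a single-output embedding model, and a decoder taking inputs_embeds / attention_mask / past_key_values.* (input names, dtypes, and the config layout vary per package; paths are placeholders):

import json
from pathlib import Path

import numpy as np
import onnxruntime as ort

model_dir = Path("/path/to/genai/package")
config = json.loads((model_dir / "genai_config.json").read_text())

# Assumed config layout; real GenAI configs may nest these differently.
decoder_cfg = config["model"]["decoder"]
embedding_cfg = config["model"]["embedding"]

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
decoder = ort.InferenceSession(str(model_dir / decoder_cfg["filename"]), providers=providers)
embedding = ort.InferenceSession(str(model_dir / embedding_cfg["filename"]), providers=providers)

# Text-only path: run the embedding model to turn input_ids into inputs_embeds.
input_ids = np.ones((1, 16), dtype=np.int64)  # placeholder prompt
(inputs_embeds,) = embedding.run(None, {"input_ids": input_ids})

# Build the decoder feed with per-layer EMPTY KV cache buffers. Allocating one
# buffer per declared input lets layers disagree on head_dim (e.g. 256 vs 512),
# which a single uniformly shaped IOBinding buffer cannot accommodate.
feed = {"inputs_embeds": inputs_embeds, "attention_mask": np.ones_like(input_ids)}
for inp in decoder.get_inputs():
    if inp.name.startswith("past_key_values"):
        # Resolve symbolic dims: batch -> 1, past sequence length -> 0;
        # num_heads/head_dim are concrete ints in the graph. Dtype assumed fp16.
        shape = [1 if "batch" in str(d) else (d if isinstance(d, int) else 0) for d in inp.shape]
        feed[inp.name] = np.zeros(shape, dtype=np.float16)

# Single forward pass returning full logits [batch, seq_len, vocab].
(logits,) = decoder.run(["logits"], feed)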

Benchmark

Gemma4 E2B-IT MMLU Pro (12,032 questions, CUDA EP, H200 GPU):

Backend          Accuracy   Throughput   vs ortgenai
PyTorch HF       22.6%      185 req/s    19.5×
ort-multimodal   22.6%      142 req/s    15×
ortgenai         22.6%      9.5 req/s    1× (baseline)

Full analysis: onnxruntime/mobius#285

Usage

import olive.evaluator.lmeval_ort  # registers backends

from lm_eval.api.registry import get_model
model = get_model('ort-multimodal')(
    pretrained='/path/to/genai/package',
    ep='CUDAExecutionProvider',
    max_length=2048,
)

Or via Olive config:

{"type": "LMEvaluator", "tasks": ["leaderboard_mmlu_pro"], "model_class": "ort-multimodal"}

Add LMEvalORTMultimodalEvaluator ('ort-multimodal' backend) that uses
direct ORT InferenceSession for multimodal GenAI packages. This avoids
the overhead of GenAI's Generator API while supporting models with
heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256
for sliding attention and head_dim=512 for full attention).

The evaluator:
- Loads decoder and embedding ONNX models from genai_config.json
- Runs embedding model to convert input_ids -> inputs_embeds
- Runs decoder with per-layer empty KV cache buffers
- Returns full logits [batch, seq_len, vocab] in a single forward pass

Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- ort-multimodal: 142 req/s (full run, matches PyTorch HF accuracy)
- ortgenai:       9.5 req/s (19.5x slower than PyTorch HF)
- PyTorch HF:     185 req/s (baseline)

Also wire up 'ort-multimodal' in LMEvaluator.evaluate() for Olive
pipeline integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Copilot AI review requested due to automatic review settings May 8, 2026 00:52
@justinchuby
Contributor Author

Not generalized

@justinchuby justinchuby marked this pull request as draft May 8, 2026 00:55
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new lm-eval backend (ort-multimodal) to run fast text-only benchmarks on ORT GenAI multimodal packages by directly invoking ONNX Runtime InferenceSession for the decoder (+ embedding), avoiding GenAI generator overhead and supporting heterogeneous KV cache head dimensions.

Changes:

  • Add LMEvalORTMultimodalEvaluator registered as ort-multimodal in olive.evaluator.lmeval_ort.
  • Wire LMEvaluator to dispatch model_class="ort-multimodal" with appropriate init args.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File                                Description
olive/evaluator/olive_evaluator.py  Adds the ort-multimodal dispatch path to construct lm-eval model init args.
olive/evaluator/lmeval_ort.py       Implements the ort-multimodal lm-eval backend using direct ORT sessions for decoder/embedding.

Comment on lines +1575 to +1580

    elif self.model_class == "ort-multimodal":
        init_args = {
            "pretrained": str(Path(model.model_path).parent),
            "ep": self.ep or execution_providers,
            "ep_options": self.ep_options,
        }
Comment on lines +521 to +531

    # Set up execution providers
    providers = []
    if ep:
        providers.append(ep)
    providers.append("CPUExecutionProvider")

    # Load decoder session
    decoder_path = str(model_dir / decoder_config["filename"])
    logger.info("Loading decoder from %s", decoder_path)
    self._decoder_sess = ort.InferenceSession(decoder_path, providers=providers)
Comment on lines +522 to +525

    providers = []
    if ep:
        providers.append(ep)
    providers.append("CPUExecutionProvider")

    result = self._decoder_sess.run(["logits"], dec_feed)
    return torch.from_numpy(result[0])
Cache the og.Generator object across model_call invocations and use
rewind_to(0) to reset state instead of creating a new Generator per
sample. Falls back to creating a new Generator if rewind_to is not
supported (e.g. older GenAI versions without multimodal rewind fix).

Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- Per-call: 379ms -> 256ms (1.48x per model call)
- End-to-end limit=200: 160.6s -> 145.2s (1.11x overall)
- Estimated full run savings: ~12 minutes

Requires onnxruntime-genai >= 0.14.0 with microsoft/onnxruntime-genai#2141
for multimodal model support.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
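
A hedged sketch of the caching pattern this commit describes. rewind_to, GeneratorParams, and set_search_options are onnxruntime-genai APIs; _cached_generator, _model, and _max_length are illustrative attribute names, and the cache is assumed to be used with a fixed batch size:

import onnxruntime_genai as og

def _get_generator(self, batch_size: int) -> "og.Generator":
    """Reuse one og.Generator across model calls, rewinding its state
    instead of paying Generator construction cost every time."""
    if self._cached_generator is not None:
        try:
            # Reset the position and KV cache back to 0 in place.
            self._cached_generator.rewind_to(0)
            return self._cached_generator
        except Exception:
            # Older onnxruntime-genai builds cannot rewind multimodal
            # models; fall through and build a fresh Generator instead.
            self._cached_generator = None
    params = og.GeneratorParams(self._model)
    params.set_search_options(max_length=self._max_length)
    # A fuller implementation would invalidate the cache when batch_size
    # changes; here it is assumed constant across calls.
    self._cached_generator = og.Generator(self._model, params)
    return self._cached_generator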

    def model_call(self, input_ids: torch.Tensor, cont_len: int = 0) -> torch.Tensor:
        batch_size, seq_len = input_ids.shape

    def _get_generator(self, batch_size: int) -> "og.Generator":
Replace get_output('logits') with incremental get_logits() in the
ortgenai evaluator's model_call. get_output copies the FULL logits
tensor (seq_len × vocab_size × 2 bytes, e.g. 472MB for 900 tokens
with 262K vocab) from GPU to CPU each call, taking 410ms. get_logits()
returns only the last position's logits (~1MB), taking 1.8ms.

For loglikelihood scoring, we only need logits at the continuation
token positions (typically 1-20 tokens). The new approach appends the
context as a bulk prefill, then steps through continuation tokens one
at a time using get_logits(), collecting only the needed positions.

Performance on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After:  24.2s (20.7 req/s)
- Speedup: 1.94x

Also removes the _returns_full_logits detection since the evaluator
now always uses the incremental path (which works for both model types).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
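
A sketch of the incremental scoring loop described above, using onnxruntime-genai's append_tokens and get_logits. score_continuation is a hypothetical helper, generator is an og.Generator prepared as in the previous commit, and the reshape assumes batch size 1:

import numpy as np
import torch

def score_continuation(generator, context_ids, continuation_ids):
    """Collect logits only at the continuation positions."""
    # Bulk prefill: append the whole context at once. get_logits() then
    # returns the LAST position's logits (the scores for the first
    # continuation token) without copying the full seq_len x vocab tensor.
    generator.append_tokens(np.array([context_ids], dtype=np.int32))
    rows = [np.array(generator.get_logits()).reshape(1, -1)]

    # Step one token at a time: logits after appending token i score token
    # i + 1, so the final continuation token needs no extra step.
    for tok in continuation_ids[:-1]:
        generator.append_tokens(np.array([[tok]], dtype=np.int32))
        rows.append(np.array(generator.get_logits()).reshape(1, -1))

    # Shape [len(continuation_ids), vocab]; row i scores continuation_ids[i].
    return torch.from_numpy(np.concatenate(rows, axis=0))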