feat: add ort-multimodal evaluator for fast multimodal benchmarks #2451

Draft

justinchuby wants to merge 3 commits into main from
Conversation
Add LMEvalORTMultimodalEvaluator ('ort-multimodal' backend) that uses
direct ORT InferenceSession for multimodal GenAI packages. This avoids
the overhead of GenAI's Generator API while supporting models with
heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256
for sliding attention and head_dim=512 for full attention).
The evaluator:
- Loads decoder and embedding ONNX models from genai_config.json
- Runs embedding model to convert input_ids -> inputs_embeds
- Runs decoder with per-layer empty KV cache buffers
- Returns full logits [batch, seq_len, vocab] in a single forward pass
Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- ort-multimodal: 142 req/s (full run, matches PyTorch HF accuracy)
- ortgenai: 9.5 req/s (19.5x slower)
- PyTorch HF: 185 req/s (baseline)
Also wire up 'ort-multimodal' in LMEvaluator.evaluate() for Olive
pipeline integration.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
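For illustration, here is a minimal sketch of that single-pass flow. It assumes GenAI-style decoder I/O names (`inputs_embeds`, `attention_mask`, `past_key_values.N.key`/`.value`, `logits`); the helper name, fp16 dtype, and cache shapes are assumptions, not this PR's actual code:

```python
# Sketch only: I/O names follow common ORT GenAI decoder conventions; the
# function and parameter names are illustrative, not the PR's implementation.
import numpy as np
import torch

def forward_full_logits(embed_sess, decoder_sess, input_ids, num_kv_heads, head_dims):
    """One forward pass returning logits for every position (no generation loop)."""
    batch, seq_len = input_ids.shape  # input_ids: int64 numpy array
    # Embedding model: input_ids -> inputs_embeds
    (inputs_embeds,) = embed_sess.run(None, {"input_ids": input_ids})
    feed = {
        "inputs_embeds": inputs_embeds,
        "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    }
    # Per-layer empty KV buffers: each layer gets its own shape, so mixed
    # head dims (e.g. 256 for sliding attention, 512 for full) are fine.
    for layer, head_dim in enumerate(head_dims):
        empty = np.zeros((batch, num_kv_heads, 0, head_dim), dtype=np.float16)
        feed[f"past_key_values.{layer}.key"] = empty
        feed[f"past_key_values.{layer}.value"] = empty
    (logits,) = decoder_sess.run(["logits"], feed)
    return torch.from_numpy(logits)  # [batch, seq_len, vocab]
```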
justinchuby (Author) commented:
Not generalized
Pull request overview
This PR adds a new lm-eval backend (ort-multimodal) to run fast text-only benchmarks on ORT GenAI multimodal packages by directly invoking ONNX Runtime InferenceSession for the decoder (+ embedding), avoiding GenAI generator overhead and supporting heterogeneous KV cache head dimensions.
Changes:
- Add `LMEvalORTMultimodalEvaluator` registered as `ort-multimodal` in `olive.evaluator.lmeval_ort`.
- Wire `LMEvaluator` to dispatch `model_class="ort-multimodal"` with appropriate init args.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| olive/evaluator/olive_evaluator.py | Adds ort-multimodal dispatch path to construct lm-eval model init args. |
| olive/evaluator/lmeval_ort.py | Implements the ort-multimodal lm-eval backend using direct ORT sessions for decoder/embedding. |
Comment on lines +1575 to +1580

```python
elif self.model_class == "ort-multimodal":
    init_args = {
        "pretrained": str(Path(model.model_path).parent),
        "ep": self.ep or execution_providers,
        "ep_options": self.ep_options,
    }
```
Comment on lines +521 to +531

```python
# Set up execution providers
providers = []
if ep:
    providers.append(ep)
providers.append("CPUExecutionProvider")

# Load decoder session
decoder_path = str(model_dir / decoder_config["filename"])
logger.info("Loading decoder from %s", decoder_path)
self._decoder_sess = ort.InferenceSession(decoder_path, providers=providers)
```
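Since the init args above also carry `ep_options`, one plausible way to thread them through is `ort.InferenceSession`'s `provider_options` parameter, which takes a list of dicts parallel to `providers`. A sketch under that assumption (the helper name is illustrative, not the PR's code):

```python
# Sketch: pair each provider with its options dict; the CPU fallback EP
# gets empty options. make_session is a hypothetical helper.
from typing import Optional

import onnxruntime as ort

def make_session(model_path: str, ep: Optional[str], ep_options: Optional[dict]):
    providers, provider_options = [], []
    if ep:
        providers.append(ep)
        provider_options.append(ep_options or {})
    providers.append("CPUExecutionProvider")
    provider_options.append({})
    return ort.InferenceSession(
        model_path, providers=providers, provider_options=provider_options
    )
```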
Comment on lines +522 to +525

```python
providers = []
if ep:
    providers.append(ep)
providers.append("CPUExecutionProvider")
```
```python
)

result = self._decoder_sess.run(["logits"], dec_feed)
return torch.from_numpy(result[0])
```
Cache the og.Generator object across model_call invocations and use
rewind_to(0) to reset state instead of creating a new Generator per
sample. Falls back to creating a new Generator if rewind_to is not
supported (e.g. older GenAI versions without the multimodal rewind fix).

Performance on Gemma4 E2B-IT MMLU Pro (CUDA EP):
- Per-call: 379ms -> 256ms (1.48x per model call)
- End-to-end limit=200: 160.6s -> 145.2s (1.11x overall)
- Estimated full run savings: ~12 minutes

Requires onnxruntime-genai >= 0.14.0 with microsoft/onnxruntime-genai#2141 for multimodal model support.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
```python
def model_call(self, input_ids: torch.Tensor, cont_len: int = 0) -> torch.Tensor:
    batch_size, seq_len = input_ids.shape

def _get_generator(self, batch_size: int) -> "og.Generator":
```
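A sketch of the caching strategy the commit describes, written as a standalone helper; the cache dict and function name are illustrative, not the PR's exact implementation:

```python
# Sketch only: reuse one Generator per batch size, resetting it with
# rewind_to(0); fall back to a fresh Generator if rewind is unsupported.
import onnxruntime_genai as og

def get_or_rewind_generator(cache: dict, model: "og.Model", batch_size: int) -> "og.Generator":
    gen = cache.get(batch_size)
    if gen is not None:
        try:
            gen.rewind_to(0)  # reset the KV cache in place instead of reallocating
            return gen
        except Exception:
            # Older GenAI builds (without the multimodal rewind fix) raise here;
            # fall through and build a fresh Generator instead.
            pass
    params = og.GeneratorParams(model)
    gen = og.Generator(model, params)
    cache[batch_size] = gen
    return gen
```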
Replace get_output('logits') with incremental get_logits() in the
ortgenai evaluator's model_call. get_output copies the FULL logits
tensor (seq_len × vocab_size × 2 bytes, e.g. 472MB for 900 tokens
with 262K vocab) from GPU to CPU each call, taking 410ms. get_logits()
returns only the last position's logits (~1MB), taking 1.8ms.
For loglikelihood scoring, we only need logits at the continuation
token positions (typically 1-20 tokens). The new approach appends the
context as a bulk prefill, then steps through continuation tokens one
at a time using get_logits(), collecting only the needed positions.
Performance on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After: 24.2s (20.7 req/s)
- Speedup: 1.94x
Also removes the _returns_full_logits detection since the evaluator
now always uses the incremental path (which works for both model types).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
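A sketch of the incremental scoring loop the commit describes: bulk-prefill the context, then teacher-force continuation tokens one at a time, collecting only the logits needed for loglikelihood. Helper and variable names are illustrative, not the PR's exact code:

```python
# Sketch only: assumes an og.Generator with append_tokens()/get_logits(),
# where get_logits() returns only the last position's logits.
import numpy as np
import torch

def score_continuation(generator, context_ids, continuation_ids):
    """Collect logits only at continuation positions via get_logits()."""
    # Bulk prefill: after appending the context, get_logits() yields the
    # last position's logits, i.e. the prediction for continuation token 0.
    generator.append_tokens(np.asarray([context_ids], dtype=np.int32))
    position_logits = []
    for i, token in enumerate(continuation_ids):
        logits = torch.from_numpy(np.asarray(generator.get_logits()))
        position_logits.append(logits.reshape(1, -1))  # [1, vocab]
        if i < len(continuation_ids) - 1:
            # Teacher forcing: append the ground-truth continuation token,
            # which advances the KV cache by one position.
            generator.append_tokens(np.asarray([[token]], dtype=np.int32))
    return torch.stack(position_logits, dim=1)  # [1, cont_len, vocab]
```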
Summary
Add `LMEvalORTMultimodalEvaluator` (`ort-multimodal` backend) that uses a direct ORT InferenceSession for multimodal GenAI packages. This is 15× faster than the `ortgenai` backend for text-only benchmarks on multimodal models.

Problem
The `ortgenai` evaluator loads all sub-models (decoder + vision + audio + embedding) and creates a new `og.Generator` per model call, adding significant overhead for text-only benchmarks like MMLU Pro.

The built-in `ort` evaluator can't handle models with heterogeneous KV cache head dimensions (e.g. Gemma4 with head_dim=256 for sliding attention and head_dim=512 for full attention) because its `Prefill` class allocates uniform IOBinding buffers.

Solution
The new `ort-multimodal` evaluator:
- Parses `genai_config.json` to locate the decoder and embedding ONNX files
- Runs the embedding model to convert `input_ids` → `inputs_embeds`
- Runs the decoder with per-layer empty KV cache buffers, returning full logits `[batch, seq_len, vocab]` in a single forward pass

Benchmark
Gemma4 E2B-IT MMLU Pro (12,032 questions, CUDA EP, H200 GPU):

| Backend | Throughput | Notes |
|---|---|---|
| ort-multimodal | 142 req/s | full run, matches PyTorch HF accuracy |
| ortgenai | 9.5 req/s | 19.5× slower |
| PyTorch HF | 185 req/s | baseline |

Full analysis: onnxruntime/mobius#285
Usage
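A minimal direct-invocation sketch, assuming that importing `olive.evaluator.lmeval_ort` registers the `ort-multimodal` backend with lm-eval and that `model_args` mirror the init args above (`pretrained`, `ep`); the import path and paths here are illustrative:

```python
# Sketch only: backend registration and model_args keys are assumptions
# based on this PR's description, not documented usage.
import lm_eval
import olive.evaluator.lmeval_ort  # noqa: F401 -- registers 'ort-multimodal'

results = lm_eval.simple_evaluate(
    model="ort-multimodal",
    model_args="pretrained=/path/to/genai/package,ep=CUDAExecutionProvider",
    tasks=["leaderboard_mmlu_pro"],
)
print(results["results"])
```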
Or via Olive config:

```json
{"type": "LMEvaluator", "tasks": ["leaderboard_mmlu_pro"], "model_class": "ort-multimodal"}
```