perf: use get_logits() to avoid massive GPU→CPU logits copy in ortgenai evaluator #2452
Open
justinchuby wants to merge 2 commits into main from
Conversation
Replace get_output('logits') with get_logits() in the ortgenai
evaluator. get_output copies the full logits tensor from GPU to CPU
each call (e.g. 472MB for 900 tokens × 262K vocab in f16), taking
~410ms. get_logits() returns only the last position's logits (~1MB),
taking ~2ms.
The evaluator now always uses incremental token appending: bulk-prefill
the context, then step through continuation tokens collecting logits
at each position via get_logits(). This is both faster and simpler than
the previous approach which had separate paths for full-logits and
single-position models.
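The incremental strategy above can be sketched as a short loop. This is not the evaluator's actual code: the `append_tokens`/`get_logits` method names are taken from the PR description, and a tiny stub generator (with a fabricated random-logits backend) stands in for onnxruntime-genai's `Generator` so the control flow is runnable on its own.

```python
import numpy as np

class StubGenerator:
    """Stand-in for an onnxruntime-genai Generator (assumed interface)."""
    def __init__(self, vocab_size=8):
        self.vocab_size = vocab_size
        self.seq = []

    def append_tokens(self, tokens):
        self.seq.extend(int(t) for t in np.ravel(tokens))

    def get_logits(self):
        # Returns only the last position's logits, shape [1, 1, vocab]
        # (~1MB for a real 262K vocab, vs. ~472MB for the full
        # [1, seq_len, vocab] tensor that get_output('logits') copies).
        rng = np.random.default_rng(len(self.seq))
        return rng.standard_normal((1, 1, self.vocab_size)).astype(np.float32)

def continuation_logits(generator, context_ids, cont_ids):
    # Bulk-prefill the whole context in one append; the resulting
    # last-position logits score the first continuation token.
    generator.append_tokens(np.asarray([context_ids], dtype=np.int32))
    collected = [generator.get_logits()]
    # Step one token at a time: after appending cont_ids[i],
    # get_logits() scores cont_ids[i + 1].
    for tok in cont_ids[:-1]:
        generator.append_tokens(np.asarray([[tok]], dtype=np.int32))
        collected.append(generator.get_logits())
    return np.concatenate(collected, axis=1)  # [1, cont_len, vocab]

logits = continuation_logits(StubGenerator(), [5, 3, 7, 1], [2, 4, 6])
print(logits.shape)  # (1, 3, 8): one logits row per continuation position
```

The key point is that only `cont_len` single-position copies ever cross the GPU→CPU boundary, regardless of how long the prefilled context is.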
Benchmark on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After: 24.2s (20.7 req/s)
- Speedup: 1.94×
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Contributor
Pull request overview
This PR improves the performance of the ortgenai-backed LM Eval evaluator by avoiding repeated large GPU→CPU transfers of the full logits tensor during loglikelihood scoring.
Changes:
- Switches logits retrieval from `generator.get_output("logits")` to `generator.get_logits()` to fetch only the last-position logits.
- Refactors `model_call` to bulk-append the prefix once, then incrementally append remaining tokens and accumulate per-position logits for the continuation window.
Comment on lines 554 to 559
```python
if batch_size > 1 and cont_len > 1:
    raise ValueError(
        "batch_size > 1 is not supported when the model returns single-position logits"
        " and continuation length > 1. Right-padding misaligns continuation positions across"
        " batch elements. Use batch_size=1 instead."
    )
```
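A toy calculation shows why right-padding breaks the single-position strategy for batched requests. With unequal context lengths, each batch row's continuation occupies different absolute positions, so one shared `get_logits()` read per step cannot score every row (the variable names below are illustrative, not from the evaluator):

```python
# Two right-padded prompts of different length, both continuing for 2 tokens.
ctx_lens = [5, 3]
cont_len = 2

# Position i's logits score the token at position i + 1, so row with
# context length c needs logits at positions c-1 .. c-1 + cont_len - 1.
positions = [list(range(c - 1, c - 1 + cont_len)) for c in ctx_lens]
print(positions)  # [[4, 5], [2, 3]] -- the rows disagree, hence batch_size=1
```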
Contributor
Author
@copilot apply changes based on this feedback
Contributor
Updated in f086863: reworded the ValueError to describe the evaluator’s incremental get_logits() strategy (and right-padding misalignment) instead of implying this is a model output-format property.
Agent-Logs-Url: https://github.com/microsoft/Olive/sessions/8e4a3ef0-bdf6-4a0c-a2a8-3b3e56650f0a
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Summary
Replace `get_output('logits')` with `get_logits()` in the ortgenai evaluator to avoid copying the full logits tensor from GPU to CPU each call.
Problem
`generator.get_output('logits')` copies the entire logits tensor (`[batch, seq_len, vocab_size]`) from GPU to CPU. For a 900-token prompt with 262K vocab, this is 472MB per call, taking ~410ms, longer than the forward pass itself (289ms).
For loglikelihood evaluation, we only need logits at the continuation token positions (typically 1–20 positions out of 900). Copying all 900 positions is wasteful.
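The quoted sizes follow from simple arithmetic (assuming "262K vocab" means 262,144 entries and 2 bytes per f16 element; the ~1MB figure for the single-position copy matches a float32 row of the same vocab):

```python
# Back-of-the-envelope check of the transfer sizes quoted above.
seq_len, vocab = 900, 262_144          # 900 tokens, 262K vocab (assumed exact value)
full_copy_f16 = seq_len * vocab * 2    # get_output('logits'): whole tensor in f16
last_pos_f32 = 1 * vocab * 4           # get_logits(): one position, if float32

print(full_copy_f16 / 1e6)  # ~471.9 MB, i.e. the ~472MB per call
print(last_pos_f32 / 1e6)   # ~1.0 MB, i.e. the ~1MB per call
```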
Fix
Use `generator.get_logits()`, which returns only the last position's logits (~1MB, ~2ms). The evaluator now always uses incremental token appending: bulk-prefill the context, then step through continuation tokens, collecting logits at each position.
Benchmark
Gemma4 E2B-IT, MMLU Pro limit=50, CUDA EP, H200:

|         | Total time | Throughput |
|---------|------------|------------|
| Before  | 46.9s      | 10.6 req/s |
| After   | 24.2s      | 20.7 req/s |
| Speedup | 1.94×      |            |

Per-call impact:

|           | `get_output('logits')` | `get_logits()` |
|-----------|------------------------|----------------|
| Copy size | 472MB                  | ~1MB           |
| Copy time | ~410ms                 | ~2ms           |

Full analysis: onnxruntime/mobius#285