perf: use get_logits() to avoid massive GPU→CPU logits copy in ortgenai evaluator #2452

Open

justinchuby wants to merge 2 commits into main from justinchu/fix-ortgenai-logits-perf

Conversation

@justinchuby
Contributor

Summary

Replace `get_output('logits')` with `get_logits()` in the ortgenai evaluator to avoid copying the full logits tensor from GPU to CPU on each call.

Problem

`generator.get_output('logits')` copies the entire logits tensor (`[batch, seq_len, vocab_size]`) from GPU to CPU. For a 900-token prompt with a 262K vocab, that is 472 MB per call and takes ~410 ms, longer than the forward pass itself (289 ms).
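As a quick sanity check on that figure (assuming the 262K vocab is 2^18 = 262,144 entries, and f16 logits at 2 bytes each, as noted in the commit message below):

```python
# 900 positions × 262,144 vocab entries × 2 bytes (f16):
print(900 * 262_144 * 2)  # 471_859_200 bytes ≈ 472 MB per get_output('logits') call
```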

For loglikelihood evaluation, we only need logits at the continuation token positions (typically 1–20 positions out of 900). Copying all 900 positions is wasteful.

Fix

Use `generator.get_logits()`, which returns only the last position's logits (~1 MB, ~2 ms). The evaluator now always uses incremental token appending: bulk-prefill the context, then step through the continuation tokens, collecting logits at each position.
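A minimal sketch of that incremental loop, assuming batch size 1 and that `get_logits()` exposes the last position's logits with the vocab on the trailing axis; only `append_tokens()` and `get_logits()` come from the API as used in this PR, and the exact call shapes and helper name are illustrative, not this repo's code:

```python
import numpy as np

def collect_continuation_logits(generator, context_tokens, continuation_tokens):
    """Sketch: bulk-prefill the context, then step through the continuation."""
    rows = []
    # After prefilling the context, the last-position logits are the
    # distribution over the *first* continuation token.
    generator.append_tokens(context_tokens)
    rows.append(np.asarray(generator.get_logits()).reshape(-1).copy())
    # Each append shifts get_logits() to predict the following token, so
    # every continuation token except the last must be appended.
    for token in continuation_tokens[:-1]:
        generator.append_tokens([token])
        rows.append(np.asarray(generator.get_logits()).reshape(-1).copy())
    return np.stack(rows)  # [cont_len, vocab_size]
```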

Benchmark

Gemma4 E2B-IT, MMLU Pro limit=50, CUDA EP, H200:

| Metric     | Before     | After      |
|------------|------------|------------|
| Time       | 46.9 s     | 24.2 s     |
| Throughput | 10.6 req/s | 20.7 req/s |
| Speedup    |            | 1.94×      |
| Accuracy   | 16.0%      | 16.0% ✅   |

Per-call impact

| API                    | Copy size              | Time    |
|------------------------|------------------------|---------|
| `get_output('logits')` | 472 MB (all positions) | ~410 ms |
| `get_logits()`         | ~1 MB (last position)  | ~2 ms   |

Full analysis: onnxruntime/mobius#285
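For completeness, a minimal sketch of how the per-position logits gathered by the incremental loop can be turned into a loglikelihood; NumPy only, and the helper name and array shapes are assumptions rather than this repo's API:

```python
import numpy as np

def score_continuation(per_position_logits: np.ndarray, continuation: list[int]) -> float:
    """Sum log p(token) over the continuation.

    per_position_logits: [cont_len, vocab_size]; row i holds the logits
    produced just before continuation token i (see the sketch above).
    """
    # Numerically stable log-softmax over the vocab axis.
    shifted = per_position_logits - per_position_logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(logprobs[np.arange(len(continuation)), continuation].sum())
```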

Replace get_output('logits') with get_logits() in the ortgenai
evaluator. get_output copies the full logits tensor from GPU to CPU
each call (e.g. 472MB for 900 tokens × 262K vocab in f16), taking
~410ms. get_logits() returns only the last position's logits (~1MB),
taking ~2ms.

The evaluator now always uses incremental token appending: bulk-prefill
the context, then step through continuation tokens collecting logits
at each position via get_logits(). This is both faster and simpler than
the previous approach which had separate paths for full-logits and
single-position models.

Benchmark on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After:  24.2s (20.7 req/s)
- Speedup: 1.94×

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Copilot AI review requested due to automatic review settings May 8, 2026 16:28
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the performance of the ortgenai-backed LM Eval evaluator by avoiding repeated large GPU→CPU transfers of the full logits tensor during loglikelihood scoring.

Changes:

- Switches logits retrieval from `generator.get_output("logits")` to `generator.get_logits()` to fetch only the last-position logits.
- Refactors `model_call` to bulk-append the prefix once, then incrementally append remaining tokens and accumulate per-position logits for the continuation window.

Comment on lines 554 to 559
```python
if batch_size > 1 and cont_len > 1:
    raise ValueError(
        "batch_size > 1 is not supported when the model returns single-position logits"
        " and continuation length > 1. Right-padding misaligns continuation positions across"
        " batch elements. Use batch_size=1 instead."
    )
```
Contributor Author

@copilot apply changes based on this feedback

Contributor

Updated in f086863: reworded the `ValueError` to describe the evaluator's incremental `get_logits()` strategy (and the right-padding misalignment) rather than implying it is a property of the model's output format.
