perf: use get_logits() to avoid massive GPU→CPU logits copy in ortgenai evaluator #2452
Open
justinchuby wants to merge 2 commits into main from
Conversation
Replace get_output('logits') with get_logits() in the ortgenai
evaluator. get_output copies the full logits tensor from GPU to CPU
each call (e.g. 472MB for 900 tokens × 262K vocab in f16), taking
~410ms. get_logits() returns only the last position's logits (~1MB),
taking ~2ms.
The evaluator now always uses incremental token appending: bulk-prefill
the context, then step through continuation tokens collecting logits
at each position via get_logits(). This is both faster and simpler than
the previous approach which had separate paths for full-logits and
single-position models.
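The incremental strategy above can be sketched as a short loop. This is not the evaluator's actual code: the `append_tokens`/`get_logits` method names are taken from the PR description, and a tiny stub generator (with a fabricated random-logits backend) stands in for onnxruntime-genai's `Generator` so the control flow is runnable on its own.

```python
import numpy as np

class StubGenerator:
    """Stand-in for an onnxruntime-genai Generator (assumed interface)."""
    def __init__(self, vocab_size=8):
        self.vocab_size = vocab_size
        self.seq = []

    def append_tokens(self, tokens):
        self.seq.extend(int(t) for t in np.ravel(tokens))

    def get_logits(self):
        # Returns only the last position's logits, shape [1, 1, vocab]
        # (~1MB for a real 262K vocab, vs. ~472MB for the full
        # [1, seq_len, vocab] tensor that get_output('logits') copies).
        rng = np.random.default_rng(len(self.seq))
        return rng.standard_normal((1, 1, self.vocab_size)).astype(np.float32)

def continuation_logits(generator, context_ids, cont_ids):
    # Bulk-prefill the whole context in one append; the resulting
    # last-position logits score the first continuation token.
    generator.append_tokens(np.asarray([context_ids], dtype=np.int32))
    collected = [generator.get_logits()]
    # Step one token at a time: after appending cont_ids[i],
    # get_logits() scores cont_ids[i + 1].
    for tok in cont_ids[:-1]:
        generator.append_tokens(np.asarray([[tok]], dtype=np.int32))
        collected.append(generator.get_logits())
    return np.concatenate(collected, axis=1)  # [1, cont_len, vocab]

logits = continuation_logits(StubGenerator(), [5, 3, 7, 1], [2, 4, 6])
print(logits.shape)  # (1, 3, 8): one logits row per continuation position
```

The key point is that only `cont_len` single-position copies ever cross the GPU→CPU boundary, regardless of how long the prefilled context is.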
Benchmark on Gemma4 E2B-IT MMLU Pro (limit=50, CUDA EP):
- Before: 46.9s (10.6 req/s)
- After: 24.2s (20.7 req/s)
- Speedup: 1.94×
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Contributor
Pull request overview
This PR improves the performance of the ortgenai-backed LM Eval evaluator by avoiding repeated large GPU→CPU transfers of the full logits tensor during loglikelihood scoring.
Changes:
- Switches logits retrieval from `generator.get_output("logits")` to `generator.get_logits()` to fetch only the last-position logits.
- Refactors `model_call` to bulk-append the prefix once, then incrementally append remaining tokens and accumulate per-position logits for the continuation window.
Comment on lines 554 to 559
```python
if batch_size > 1 and cont_len > 1:
    raise ValueError(
        "batch_size > 1 is not supported when the model returns single-position logits"
        " and continuation length > 1. Right-padding misaligns continuation positions across"
        " batch elements. Use batch_size=1 instead."
    )
```
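A toy calculation shows why right-padding breaks the single-position strategy for batched requests. With unequal context lengths, each batch row's continuation occupies different absolute positions, so one shared `get_logits()` read per step cannot score every row (the variable names below are illustrative, not from the evaluator):

```python
# Two right-padded prompts of different length, both continuing for 2 tokens.
ctx_lens = [5, 3]
cont_len = 2

# Position i's logits score the token at position i + 1, so row with
# context length c needs logits at positions c-1 .. c-1 + cont_len - 1.
positions = [list(range(c - 1, c - 1 + cont_len)) for c in ctx_lens]
print(positions)  # [[4, 5], [2, 3]] -- the rows disagree, hence batch_size=1
```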
Contributor
Author
@copilot apply changes based on this feedback
Contributor
Updated in f086863: reworded the ValueError to describe the evaluator’s incremental get_logits() strategy (and right-padding misalignment) instead of implying this is a model output-format property.
Agent-Logs-Url: https://github.com/microsoft/Olive/sessions/8e4a3ef0-bdf6-4a0c-a2a8-3b3e56650f0a
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Summary
Replace `get_output('logits')` with `get_logits()` in the ortgenai evaluator to avoid copying the full logits tensor from GPU to CPU each call.
Problem
`generator.get_output('logits')` copies the entire logits tensor (`[batch, seq_len, vocab_size]`) from GPU to CPU. For a 900-token prompt with 262K vocab, this is 472MB per call, taking ~410ms, longer than the forward pass itself (289ms).
For loglikelihood evaluation, we only need logits at the continuation token positions (typically 1–20 positions out of 900). Copying all 900 positions is wasteful.
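The quoted sizes follow from simple arithmetic (assuming "262K vocab" means 262,144 entries and 2 bytes per f16 element; the ~1MB figure for the single-position copy matches a float32 row of the same vocab):

```python
# Back-of-the-envelope check of the transfer sizes quoted above.
seq_len, vocab = 900, 262_144          # 900 tokens, 262K vocab (assumed exact value)
full_copy_f16 = seq_len * vocab * 2    # get_output('logits'): whole tensor in f16
last_pos_f32 = 1 * vocab * 4           # get_logits(): one position, if float32

print(full_copy_f16 / 1e6)  # ~471.9 MB, i.e. the ~472MB per call
print(last_pos_f32 / 1e6)   # ~1.0 MB, i.e. the ~1MB per call
```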
Fix
Use `generator.get_logits()`, which returns only the last position's logits (~1MB, ~2ms). The evaluator now always uses incremental token appending: bulk-prefill the context, then step through continuation tokens, collecting logits at each position.
Benchmark
Gemma4 E2B-IT, MMLU Pro limit=50, CUDA EP, H200:

|         | Total time | Throughput |
|---------|------------|------------|
| Before  | 46.9s      | 10.6 req/s |
| After   | 24.2s      | 20.7 req/s |
| Speedup | 1.94×      |            |

Per-call impact:

|           | `get_output('logits')` | `get_logits()` |
|-----------|------------------------|----------------|
| Copy size | 472MB                  | ~1MB           |
| Copy time | ~410ms                 | ~2ms           |

Full analysis: onnxruntime/mobius#285