Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries for INT8 embedding quantization #2464
Open
apsonawane wants to merge 4 commits into
Conversation
Pull request overview
This PR adds two new ONNX graph surgeries to enable post-hoc INT8 embedding quantization and embedding/lm_head weight sharing (to reduce model size for large-vocab LLMs), and updates the lm-eval ORT evaluator + IO utilities to better support hybrid attention architectures and pruned/non-contiguous KV-cache indices.
Changes:
- Add `QuantizeEmbeddingInt8` (FP16/FP32 `Gather` → INT8 `GatherBlockQuantized`) and `ShareEmbeddingLmHead` (reuse embedding quantization params/weights for INT8 `MatMulNBits`) graph surgeries.
- Improve `lmeval_ort` runtime IO binding to support 3D `position_ids` (mRoPE) and hybrid state tensors (`conv_state`/`recurrent_state`).
- Fix KV-cache layer index detection for non-contiguous layer indices and make LM-eval metric parsing more robust to varied key formats/values. A hedged sketch of the index detection follows this list.
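To illustrate the non-contiguous index handling, here is a minimal sketch of how layer indices can be recovered from KV-cache input names. The helper name and the `past_key_values.<i>.key/value` naming pattern are assumptions for illustration; the actual logic lives in `olive/common/onnx_io.py`.

```python
import re
from typing import List


def detect_kv_layer_indices(input_names: List[str]) -> List[int]:
    """Return the sorted set of layer indices that actually have KV-cache inputs.

    Hypothetical helper; assumes inputs named like "past_key_values.<i>.key"/".value".
    """
    pattern = re.compile(r"past_key_values\.(\d+)\.(?:key|value)")
    indices = {int(m.group(1)) for name in input_names if (m := pattern.search(name))}
    return sorted(indices)


# Hybrid models may only have attention at a subset of layers,
# so the indices are not necessarily 0..N-1:
print(detect_kv_layer_indices(
    ["input_ids", "past_key_values.3.key", "past_key_values.3.value",
     "past_key_values.7.key", "past_key_values.7.value"]
))  # -> [3, 7]
```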
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `olive/passes/onnx/graph_surgeries.py` | Adds two new embedding-focused graph surgeries and helper functions. |
| `olive/passes/onnx/model_builder.py` | Removes a debug message about ignored tied-embedding flags in embedding construction. |
| `olive/evaluator/lmeval_ort.py` | Adds support for mRoPE position_ids rank detection and hybrid state IO binding/buffers. |
| `olive/evaluator/olive_evaluator.py` | Tightens parsing of lm-eval metric outputs (skip aliases/non-numeric, handle comma keys). |
| `olive/common/onnx_io.py` | Detects actual KV-cache layer indices from input names (supports non-contiguous indices). |
| `test/passes/onnx/test_quantize_embedding.py` | Adds unit tests covering the new embedding surgeries. |
Comments suppressed due to low confidence (1)
test/passes/onnx/test_quantize_embedding.py:176
`old_init_names` is assigned but never used, which will fail linting (ruff F841). Remove the variable or assert on it (e.g., compare old vs new initializers) so the assignment is meaningful.
old_init_names = {init.name for init in model.graph.initializer}
Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries
Summary
Adds two new graph surgeries for post-hoc INT8 embedding quantization and weight sharing, along with evaluator fixes for hybrid attention architectures (e.g., Qwen3.5-2B with GatedDeltaNet + standard attention).
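For context, both surgeons would typically be enabled through Olive's existing GraphSurgeries pass. Below is a minimal sketch of a pass entry, assuming the usual `surgeon`-keyed config shape used by other surgeons; the exact keys are an assumption and should be checked against the pass documentation, not taken from this PR.

```python
# Hedged sketch: enabling the new surgeons via the GraphSurgeries pass.
# The "surgeries" list of {"surgeon": ...} entries is an assumed config shape.
pass_config = {
    "surgery": {
        "type": "GraphSurgeries",
        "surgeries": [
            {"surgeon": "QuantizeEmbeddingInt8"},
            {"surgeon": "ShareEmbeddingLmHead"},
        ],
    }
}
```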
Motivation
Models with large vocabularies (e.g., Qwen3.5-2B with 248K tokens) have FP16 embeddings that dominate model size (~970 MB out of 2.0 GB for INT4 weights). The ModelBuilder's default quantizer (Neural Compressor) only quantizes `MatMul` ops, leaving `Gather` (embedding) as FP16. RTN-based quantizers that support INT8 embedding natively (`k_quant_last`) destroy accuracy on hybrid architectures (26% vs 59% MMLU).
Changes
New Graph Surgeries (`graph_surgeries.py`)
- `QuantizeEmbeddingInt8`: Converts the FP16 `Gather` embedding to INT8 `GatherBlockQuantized` with per-block asymmetric quantization (zero_point=128, block_size=32). Reduces the embedding from ~970 MB to ~530 MB with negligible accuracy loss. A hedged sketch of the quantization scheme follows this list.
- `ShareEmbeddingLmHead`: Replaces the lm_head's INT4 `MatMulNBits` with an INT8 `MatMulNBits` that shares the embedding weight via `Reshape`, eliminating duplicate storage. Saves ~250 MB.
- New helper functions: `_find_embed_node`, `_find_lm_head_node`, `_find_initializer`, `_get_node_attrs`.
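As a rough illustration of the scheme described above (block_size=32, fixed zero_point=128), the sketch below quantizes an embedding table per 32-element block along the hidden axis. The blocking axis, dtype layout, and the exact attributes `GatherBlockQuantized` expects are handled by the surgery itself, so treat this as an assumption-laden sketch rather than the implementation.

```python
import numpy as np


def quantize_embedding_blocks(weight: np.ndarray, block_size: int = 32, zero_point: int = 128):
    """Quantize a (vocab, hidden) FP16/FP32 embedding per block of the hidden axis.

    Hypothetical helper; returns quantized codes and per-block scales.
    """
    vocab, hidden = weight.shape
    assert hidden % block_size == 0
    blocks = weight.astype(np.float32).reshape(vocab, hidden // block_size, block_size)
    # One scale per block; with a fixed zero point of 128 the block is mapped
    # symmetrically around that midpoint into the 8-bit range.
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales) + zero_point, 0, 255).astype(np.uint8)
    return q.reshape(vocab, hidden), scales.squeeze(-1).astype(np.float16)


# Dequantization of a gathered row is then (q - 128) * scale, applied per block.
```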
Evaluator Fixes
- `lmeval_ort.py`: Support for 3D `position_ids` (mRoPE) and hybrid `conv_state`/`recurrent_state` inputs for models with mixed attention + linear attention layers.
- `olive_evaluator.py`: Fix metric parsing for lm-eval results with non-comma metric keys and non-numeric values.
- `onnx_io.py`: Fix KV-cache layer index detection for non-contiguous indices (e.g., attention at layers 3, 7, 11, 15, 19, 23 only).

Results (Qwen3.5-2B)
Testing
`test/passes/onnx/test_quantize_embedding.py`