Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries for INT8 embedding quantization #2464
Open
apsonawane wants to merge 4 commits into
Conversation
Pull request overview
This PR adds two new ONNX graph surgeries to enable post-hoc INT8 embedding quantization and embedding/lm_head weight sharing (to reduce model size for large-vocab LLMs), and updates the lm-eval ORT evaluator + IO utilities to better support hybrid attention architectures and pruned/non-contiguous KV-cache indices.
Changes:
- Add `QuantizeEmbeddingInt8` (FP16/FP32 `Gather` → INT8 `GatherBlockQuantized`) and `ShareEmbeddingLmHead` (reuse embedding quantization params/weights for INT8 `MatMulNBits`) graph surgeries.
- Improve `lmeval_ort` runtime IO binding to support 3D `position_ids` (mRoPE) and hybrid state tensors (`conv_state`/`recurrent_state`).
- Fix KV-cache layer index detection for non-contiguous layer indices and make LM-eval metric parsing more robust to varied key formats/values. A hedged sketch of the index detection follows this list.
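To illustrate the non-contiguous index handling, here is a minimal sketch of how layer indices can be recovered from KV-cache input names. The helper name and the `past_key_values.<i>.key/value` naming pattern are assumptions for illustration; the actual logic lives in `olive/common/onnx_io.py`.

```python
import re
from typing import List


def detect_kv_layer_indices(input_names: List[str]) -> List[int]:
    """Return the sorted set of layer indices that actually have KV-cache inputs.

    Hypothetical helper; assumes inputs named like "past_key_values.<i>.key"/".value".
    """
    pattern = re.compile(r"past_key_values\.(\d+)\.(?:key|value)")
    indices = {int(m.group(1)) for name in input_names if (m := pattern.search(name))}
    return sorted(indices)


# Hybrid models may only have attention at a subset of layers,
# so the indices are not necessarily 0..N-1:
print(detect_kv_layer_indices(
    ["input_ids", "past_key_values.3.key", "past_key_values.3.value",
     "past_key_values.7.key", "past_key_values.7.value"]
))  # -> [3, 7]
```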
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `olive/passes/onnx/graph_surgeries.py` | Adds two new embedding-focused graph surgeries and helper functions. |
| `olive/passes/onnx/model_builder.py` | Removes a debug message about ignored tied-embedding flags in embedding construction. |
| `olive/evaluator/lmeval_ort.py` | Adds support for mRoPE position_ids rank detection and hybrid state IO binding/buffers. |
| `olive/evaluator/olive_evaluator.py` | Tightens parsing of lm-eval metric outputs (skip aliases/non-numeric, handle comma keys). |
| `olive/common/onnx_io.py` | Detects actual KV-cache layer indices from input names (supports non-contiguous indices). |
| `test/passes/onnx/test_quantize_embedding.py` | Adds unit tests covering the new embedding surgeries. |
Comments suppressed due to low confidence (1)
test/passes/onnx/test_quantize_embedding.py:176
`old_init_names` is assigned but never used, which will fail linting (ruff F841). Remove the variable or assert on it (e.g., compare old vs new initializers) so the assignment is meaningful.
old_init_names = {init.name for init in model.graph.initializer}
Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries
Summary
Adds two new graph surgeries for post-hoc INT8 embedding quantization and weight sharing, along with evaluator fixes for hybrid attention architectures (e.g., Qwen3.5-2B with GatedDeltaNet + standard attention).
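For context, both surgeons would typically be enabled through Olive's existing GraphSurgeries pass. Below is a minimal sketch of a pass entry, assuming the usual `surgeon`-keyed config shape used by other surgeons; the exact keys are an assumption and should be checked against the pass documentation, not taken from this PR.

```python
# Hedged sketch: enabling the new surgeons via the GraphSurgeries pass.
# The "surgeries" list of {"surgeon": ...} entries is an assumed config shape.
pass_config = {
    "surgery": {
        "type": "GraphSurgeries",
        "surgeries": [
            {"surgeon": "QuantizeEmbeddingInt8"},
            {"surgeon": "ShareEmbeddingLmHead"},
        ],
    }
}
```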
Motivation
Models with large vocabularies (e.g., Qwen3.5-2B with 248K tokens) have FP16 embeddings that dominate model size (~970 MB out of 2.0 GB for INT4 weights). The ModelBuilder's default quantizer (Neural Compressor) only quantizes `MatMul` ops, leaving `Gather` (embedding) as FP16. RTN-based quantizers that support INT8 embedding natively (`k_quant_last`) destroy accuracy on hybrid architectures (26% vs 59% MMLU).
Changes
New Graph Surgeries (`graph_surgeries.py`)
- `QuantizeEmbeddingInt8`: Converts the FP16 `Gather` embedding to INT8 `GatherBlockQuantized` with per-block asymmetric quantization (zero_point=128, block_size=32). Reduces the embedding from ~970 MB to ~530 MB with negligible accuracy loss. A hedged sketch of the quantization scheme follows this list.
- `ShareEmbeddingLmHead`: Replaces the lm_head's INT4 `MatMulNBits` with an INT8 `MatMulNBits` that shares the embedding weight via `Reshape`, eliminating duplicate storage. Saves ~250 MB.
- New helper functions: `_find_embed_node`, `_find_lm_head_node`, `_find_initializer`, `_get_node_attrs`.
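As a rough illustration of the scheme described above (block_size=32, fixed zero_point=128), the sketch below quantizes an embedding table per 32-element block along the hidden axis. The blocking axis, dtype layout, and the exact attributes `GatherBlockQuantized` expects are handled by the surgery itself, so treat this as an assumption-laden sketch rather than the implementation.

```python
import numpy as np


def quantize_embedding_blocks(weight: np.ndarray, block_size: int = 32, zero_point: int = 128):
    """Quantize a (vocab, hidden) FP16/FP32 embedding per block of the hidden axis.

    Hypothetical helper; returns quantized codes and per-block scales.
    """
    vocab, hidden = weight.shape
    assert hidden % block_size == 0
    blocks = weight.astype(np.float32).reshape(vocab, hidden // block_size, block_size)
    # One scale per block; with a fixed zero point of 128 the block is mapped
    # symmetrically around that midpoint into the 8-bit range.
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales) + zero_point, 0, 255).astype(np.uint8)
    return q.reshape(vocab, hidden), scales.squeeze(-1).astype(np.float16)


# Dequantization of a gathered row is then (q - 128) * scale, applied per block.
```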
Evaluator Fixes
- `lmeval_ort.py`: Support for 3D `position_ids` (mRoPE) and hybrid `conv_state`/`recurrent_state` inputs for models with mixed attention + linear attention layers.
- `olive_evaluator.py`: Fix metric parsing for lm-eval results with non-comma metric keys and non-numeric values.
- `onnx_io.py`: Fix KV-cache layer index detection for non-contiguous indices (e.g., attention at layers 3, 7, 11, 15, 19, 23 only).

Results (Qwen3.5-2B)
Testing
`test/passes/onnx/test_quantize_embedding.py`