
Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries for INT8 embedding quantization #2464

Open

apsonawane wants to merge 4 commits into main from asonawane/tieword

Conversation

@apsonawane
Contributor

Add QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries

Summary

Adds two new graph surgeries for post-hoc INT8 embedding quantization and weight sharing, along with evaluator fixes for hybrid attention architectures (e.g., Qwen3.5-2B with GatedDeltaNet + standard attention).

Motivation

Models with large vocabularies (e.g., Qwen3.5-2B with a 248K-token vocabulary) have FP16 embeddings that dominate model size (~970 MB of the 2.0 GB INT4-quantized model). The ModelBuilder's default quantizer (Neural Compressor) only quantizes MatMul ops, leaving the Gather (embedding) in FP16. RTN-based quantizers that natively support INT8 embeddings (k_quant_last) destroy accuracy on hybrid architectures (26% vs. 59% MMLU).

Changes

New Graph Surgeries (graph_surgeries.py)

  • QuantizeEmbeddingInt8: Converts the FP16 Gather embedding to an INT8 GatherBlockQuantized with per-block asymmetric quantization (zero_point=128, block_size=32), reducing the embedding from ~970 MB to ~530 MB with negligible accuracy loss (see the sketch after this list).
  • ShareEmbeddingLmHead: Replaces the lm_head's INT4 MatMulNBits with an INT8 MatMulNBits that shares the embedding weight via Reshape, eliminating duplicate storage. Saves ~250 MB.
  • Helper functions: _find_embed_node, _find_lm_head_node, _find_initializer, _get_node_attrs
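
A minimal standalone sketch of the per-block scheme (illustrative only; the actual surgery rewrites the graph to a GatherBlockQuantized node and stores the quantized weight, scales, and zero points as initializers; function and variable names here are hypothetical):

```python
import numpy as np

def quantize_embedding_per_block(weight: np.ndarray, block_size: int = 32,
                                 zero_point: int = 128):
    """Sketch of per-block quantization for a (vocab, hidden) embedding.

    Each block of `block_size` values along the hidden dim gets its own scale;
    values are stored as 8-bit integers around a fixed zero point of 128,
    mirroring the (zero_point=128, block_size=32) scheme described above.
    """
    vocab, hidden = weight.shape
    assert hidden % block_size == 0, "assumes hidden dim divides evenly"
    blocks = weight.reshape(vocab, hidden // block_size, block_size).astype(np.float32)
    # One scale per block: map the largest magnitude to 127 steps from the zero point.
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero blocks
    q = np.clip(np.round(blocks / scales) + zero_point, 0, 255).astype(np.uint8)
    # Dequantization (what the quantized Gather effectively does per looked-up row):
    # x approx= (float(q) - zero_point) * scale_of_block
    return q.reshape(vocab, hidden), scales.squeeze(-1)
```

As a rough size check: one byte per value plus one scale per 32 values is about 485 MB + 30 MB for an embedding that was ~970 MB in FP16, consistent with the ~530 MB figure quoted above.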

Evaluator Fixes

  • lmeval_ort.py: Support 3D position_ids (mRoPE) and hybrid conv_state/recurrent_state inputs for models that mix standard attention and linear-attention layers
  • olive_evaluator.py: Fix metric parsing for lm-eval results with non-comma metric keys and non-numeric values
  • onnx_io.py: Fix KV-cache layer index detection for non-contiguous indices (e.g., attention only at layers 3, 7, 11, 15, 19, 23); see the sketch after this list
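
For the onnx_io.py fix, a sketch of detecting the actual layer indices from the model's KV-cache input names rather than assuming range(num_layers); the name pattern is an assumption based on common ONNX LLM exports:

```python
import re

def detect_kv_layer_indices(input_names: list[str]) -> list[int]:
    """Return the sorted set of layer indices that actually have KV-cache inputs.

    Handles non-contiguous indices (e.g., attention only at layers 3, 7, 11, ...)
    in hybrid models where the remaining layers are linear attention and carry
    conv_state/recurrent_state inputs instead of past key/values.
    """
    pattern = re.compile(r"past_key_values\.(\d+)\.")  # assumed naming convention
    return sorted({int(m.group(1)) for name in input_names
                   for m in [pattern.match(name)] if m})

# Example:
# detect_kv_layer_indices(["past_key_values.3.key", "past_key_values.3.value",
#                          "past_key_values.7.key"])  # -> [3, 7]
```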

Results (Qwen3.5-2B)

| Configuration | Size | MMLU | Δ vs FP16 |
| --- | --- | --- | --- |
| Baseline FP16 | 4.3 GB | 59.27% | |
| INT4 weights + FP16 embed | 2.0 GB | 57.21% | -2.06% |
| INT4 weights + INT8 embed | 1.6 GB | 57.19% | -2.08% |
| INT4 weights + shared INT8 embed/lm_head | 1.4 GB | 57.11% | -2.16% |

Testing

  • 7 unit tests added in test/passes/onnx/test_quantize_embedding.py
  • All tests pass
  • End-to-end validated via Olive recipe with MMLU evaluation

Copilot AI left a comment

Pull request overview

This PR adds two new ONNX graph surgeries to enable post-hoc INT8 embedding quantization and embedding/lm_head weight sharing (to reduce model size for large-vocab LLMs), and updates the lm-eval ORT evaluator + IO utilities to better support hybrid attention architectures and pruned/non-contiguous KV-cache indices.

Changes:

  • Add QuantizeEmbeddingInt8 (FP16/FP32 Gather → INT8 GatherBlockQuantized) and ShareEmbeddingLmHead (reuse embedding quantization params/weights for INT8 MatMulNBits) graph surgeries.
  • Improve lmeval_ort runtime IO binding to support 3D position_ids (mRoPE) and hybrid state tensors (conv_state / recurrent_state); see the sketch after this list.
  • Fix KV-cache layer index detection for non-contiguous layer indices and make LM-eval metric parsing more robust to varied key formats/values.
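
A sketch of the position_ids rank handling (illustrative; the real evaluator binds inputs through IO binding, and the 3-plane layout is an assumption based on Qwen-style mRoPE exports):

```python
import numpy as np

def build_position_ids(session, batch: int, seq_len: int) -> np.ndarray:
    """Build position_ids matching the model's declared input rank (sketch)."""
    meta = next(i for i in session.get_inputs() if i.name == "position_ids")
    positions = np.tile(np.arange(seq_len, dtype=np.int64), (batch, 1))
    if len(meta.shape) == 3:
        # mRoPE: (3, batch, seq), one plane per rotary section; for text-only
        # inputs all three planes carry the same positions.
        return np.repeat(positions[np.newaxis], 3, axis=0)
    return positions  # standard 2D (batch, seq)
```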

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| olive/passes/onnx/graph_surgeries.py | Adds two new embedding-focused graph surgeries and helper functions. |
| olive/passes/onnx/model_builder.py | Removes a debug message about ignored tied-embedding flags in embedding construction. |
| olive/evaluator/lmeval_ort.py | Adds support for mRoPE position_ids rank detection and hybrid state IO binding/buffers. |
| olive/evaluator/olive_evaluator.py | Tightens parsing of lm-eval metric outputs (skip aliases/non-numeric, handle comma keys). |
| olive/common/onnx_io.py | Detects actual KV-cache layer indices from input names (supports non-contiguous indices). |
| test/passes/onnx/test_quantize_embedding.py | Adds unit tests covering the new embedding surgeries. |
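
For the olive_evaluator.py change, a sketch of the more tolerant parsing (lm-eval result keys are typically "metric,filter" pairs such as "acc,none", but alias entries and non-numeric values also appear; the exact key handling in Olive may differ):

```python
def parse_lmeval_metrics(task_results: dict) -> dict:
    """Extract numeric metrics from one lm-eval task result dict (sketch)."""
    metrics = {}
    for key, value in task_results.items():
        name = key.split(",")[0] if "," in key else key  # tolerate comma-free keys
        if name == "alias" or not isinstance(value, (int, float)):
            continue  # skip display labels and non-numeric values such as "N/A"
        metrics[name] = float(value)
    return metrics

# parse_lmeval_metrics({"alias": "mmlu", "acc,none": 0.5719, "acc_stderr,none": 0.004})
# -> {"acc": 0.5719, "acc_stderr": 0.004}
```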
Comments suppressed due to low confidence (1)

test/passes/onnx/test_quantize_embedding.py:176

  • old_init_names is assigned but never used, which will fail linting (ruff F841). Remove the variable or assert on it (e.g., compare old vs new initializers) so the assignment is meaningful.

```python
old_init_names = {init.name for init in model.graph.initializer}
```
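
One way to make the assignment meaningful, per the suggestion above (a sketch; the pass invocation and helper names are placeholders for whatever the test actually calls):

```python
old_init_names = {init.name for init in model.graph.initializer}
run_surgery(model)  # placeholder for the QuantizeEmbeddingInt8 pass under test
new_init_names = {init.name for init in model.graph.initializer}
# The FP16 weight should be gone and quantized weight/scale initializers added.
assert new_init_names != old_init_names
```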

