feat(qwen3): genai bundle generation and inference script by DingmaomaoBJTU · Pull Request #996 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-29T08:01:17Z

Summary

Adds onnxruntime-genai integration for winml-exported Qwen3 transformer-only models.

What's new

File	Change
`src/winml/modelkit/models/hf/qwen3/genai.py`	New module: `PipelineStage`, `DecoderIOMapping`, `build_genai_config()`, `build_qwen3_transformer_only_stages()`, `write_genai_bundle()`
`scripts/export_qwen3_transformer_only.py`	New `--genai-bundle`, `--embeddings`, `--lm-head` flags
`scripts/infer_genai.py`	New inference script (CPU / QNN EP)
`src/winml/modelkit/models/hf/qwen3/__init__.py`	Exports new symbols
`tests/unit/models/qwen3/test_genai_config.py`	35 unit tests

Design

build_genai_config is architecture-agnostic: it takes a list[PipelineStage] and DecoderIOMapping — no tensor names are hardcoded inside it.

build_qwen3_transformer_only_stages is the Qwen3-specific factory. It calls _introspect_onnx_io on the built ctx.onnx / iter.onnx, runs _detect_format_patterns to discover past_keys_%d / present_values_%d style patterns from the actual graph I/O, and returns (stages, decoder_io). Tensor names can never drift from what the ONNX really contains.

build_genai_config(hf_config, ..., pipeline=stages, decoder_io=decoder_io)
        ↑ generic, architecture-agnostic
        
build_qwen3_transformer_only_stages(ctx_onnx, iter_onnx, num_layers)
        ↑ Qwen3-specific; introspects ONNX to discover all tensor names

write_genai_bundle calls build_qwen3_transformer_only_stages internally, so the one-shot API is unchanged.

Usage

# Build + emit genai bundle in one step
python scripts/export_qwen3_transformer_only.py \
  --model Qwen/Qwen3-0.6B \
  --output out/qwen3 \
  --genai-bundle out/qwen3_bundle \
  --embeddings path/to/embeddings.onnx \
  --lm-head    path/to/lm_head.onnx

# Inference
python scripts/infer_genai.py \
  --model out/qwen3_bundle \
  --ep qnn \
  --chat \
  --prompt "Hello, how are you?"

Notes

embeddings.onnx and lm_head.onnx are placeholders — copy them in via --embeddings / --lm-head. Future PRs will generate them from winml directly.
infer_genai.py requires onnxruntime-genai-winml and windowsml[with-ort].
The 4-stage pipeline config format matches result/qwen3-genai-share/model/genai_config.json exactly.

Follows from #836.

- src/winml/modelkit/models/hf/qwen3/genai.py: new module with build_genai_config() and write_genai_bundle(). build_genai_config generates the onnxruntime-genai pipeline config JSON from a HF PretrainedConfig + max_cache_len + prefill_seq_len. write_genai_bundle copies the winml-built ctx/iter ONNX, optional placeholder embeddings and lm_head ONNX, saves tokenizer files from HF, and writes genai_config.json. - scripts/export_qwen3_transformer_only.py: add --genai-bundle DIR, --embeddings ONNX, --lm-head ONNX flags. When --genai-bundle is set, write_genai_bundle is called after the build to emit a complete onnxruntime-genai bundle. - scripts/infer_genai.py: new inference script. Loads the genai bundle with og.Config, registers WinML EPs (QNN), and runs greedy generation via og.Generator. Supports --ep cpu|qnn, --chat template wrapping, --max-new, --context-length, --verbose. - src/winml/modelkit/models/hf/qwen3/__init__.py: export build_genai_config and write_genai_bundle. - tests/unit/models/qwen3/test_genai_config.py: 21 unit tests for build_genai_config covering pipeline structure, KV name counts, tensor name constants, edge cases (list eos_token_id, missing head_dim, None pad_token_id, custom filenames, variable layer count).

…tion Replace hardcoded tensor-name constants with a data-driven design: - PipelineStage dataclass: carries name, filename, run_on_prompt/token_gen, inputs, outputs, is_lm_head. Callers construct stages explicitly; no tensor names are baked into build_genai_config itself. - DecoderIOMapping dataclass: holds the %d-style format strings that genai uses to expand per-layer KV tensor names. Defaults match Qwen3 naming but any naming convention is supported. - build_genai_config: now takes pipeline: list[PipelineStage] and decoder_io: DecoderIOMapping. Architecture-agnostic; no Qwen3-specific logic. prefill_seq_len=None omits the sliding_window section. - _introspect_onnx_io: reads graph.input / graph.output from an ONNX model without loading external data weights. - _detect_format_patterns: scans tensor names for indexed groups matching <prefix><int> with exactly num_layers consecutive zero-based indices, returns {prefix: 'prefix%d'} patterns. - build_qwen3_transformer_only_stages: Qwen3-specific factory that calls _introspect_onnx_io on the built ctx/iter ONNX, detects KV patterns via _detect_format_patterns, and returns (list[PipelineStage], DecoderIOMapping). Tensor names can never drift from the actual ONNX graph I/O. - write_genai_bundle: delegates to build_qwen3_transformer_only_stages instead of hardcoding names. Tests (35 total, all pass): - TestBuildGenaiConfig: +2 new cases (no sliding_window, custom DecoderIOMapping) - TestDetectFormatPatterns: 6 new unit tests for the pattern detector - TestBuildQwen3TransformerOnlyStages: 6 new tests using patched _introspect_onnx_io (no real ONNX files required)

+DEFAULT_LM_HEAD_FILENAME = "lm_head.onnx"
+
+# Tokenizer files written by AutoTokenizer.save_pretrained.
+_TOKENIZER_FILES = [


- GenaiSession drives og.Model + og.Generator lifecycle for autoregressive text generation; peer class to WinMLSession (not a subclass) - GenerationConfig dataclass: temperature, top_p, top_k, max_new_tokens, repetition_penalty, do_sample - Lazy onnxruntime_genai import via _import_og() — class importable without the package installed (raises GenaiNotInstalledError on first use) - Reuses WinMLEPRegistry for EP discovery/registration (idempotent) - EP support: cpu (clear_providers only), qnn, dml - context_length read from genai_config.json; overridable at construction - generate_streaming() yields decoded token strings; generator del'd in finally - generate() returns joined string; auto-load on first call if not loaded - 33 unit tests; all use patch.dict(sys.modules) to avoid real hardware

- Moves chat template logic from infer_genai.py into GenaiSession - Supports optional system prompt - ChatML is not Qwen3-specific; used by Qwen2/3, Yi, Mistral, etc. - infer_genai.py _wrap_chat_template now delegates to the static method - Updated --chat flag help text and script docstring - 4 new tests covering user-only, with-system, no-system-turn, assistant-priming

+                    break
+        finally:
+            # Explicit deletion releases the KV cache buffer held by the generator.
+            del generator


- PipelineStage gains session_options: dict | None = None field; PipelineStage.to_dict() emits it when set - Add _qnn_stage_session_options(log_id, soc_model) helper that produces QNN HTP provider_options for a pipeline stage - build_qwen3_transformer_only_stages gains ep='cpu' and soc_model='60' params; when ep='qnn' the context and iterator stages receive QNN session_options, embeddings and lm_head stay on CPU (no session_options) - write_genai_bundle threads ep/soc_model through - export_qwen3_transformer_only.py passes ep='qnn' when --device npu - 5 new tests covering cpu/qnn ep routing and soc_model propagation (39 total, all pass)

Remove clear_providers/append_provider calls from GenaiSession.load(). EP placement is fully driven by per-stage session_options in genai_config.json. clear_providers() only clears the top-level provider and cannot override per-stage session_options embedded in the pipeline config. - Add 'mixed' EP (use genai_config.json as-is; default for infer_genai.py) - _NEEDS_WINML_EPS covers mixed/qnn/dml to trigger EP registration - Replace _EP_PROVIDER_MAP with _VALID_EPS + _NEEDS_WINML_EPS sets - Update tests: remove append_provider assertions, add mixed/config-not-modified tests - infer_genai.py default EP changed from 'cpu' to 'mixed' Result: NPU bundle (out/qwen3_bundle_npu) now runs at 9.3 tok/s vs 1.2 tok/s CPU

- GenaiSession gains compile=True parameter - _prepare_compiled_bundle(): detects QNN stages from genai_config.json, compiles each stage to EPContext ONNX via ort.ModelCompiler in a subprocess - _compile_stage(): 5-minute timeout per stage to handle QNN SDK hang (known bug: w8a16 + multi-token prefill hangs indefinitely) - Compiled artifacts cached in bundle_dir/_compiled/; reused on subsequent runs - _mirror_non_onnx_files(): symlinks/copies tokenizer files so og.Config can load from the compiled sub-directory - infer_genai.py --compile flag wired through to GenaiSession

…on_optimization_mode=0 Root cause: QNN SDK ModelCompiler deadlocks when compiling w8a16 quantized ONNX with multi-token static input shapes (seq_len > 1) at graph finalization optimization levels 1-3. The genai_config uses level 3 for runtime inference, which triggers the hang when passed to ModelCompiler directly. Fix: _compile_stage now forces htp_graph_finalization_optimization_mode=0 for compilation. This lets ModelCompiler finish (ctx ~41s, iter ~67s) while runtime inference still uses the full level-3 optimization from genai_config (EPContext loading bypasses compilation entirely, so the runtime option is irrelevant). Also fixes: - Pipeline stage detection: genai_config uses 'qnn' key (not 'QNNExecutionProvider') in provider_options; detection and option extraction now uses the correct key - _patch_stage_filename: genai_config pipeline is a list, not a dict; updated to iterate list entries correctly - _prepare_compiled_bundle: passes QNN provider options from each stage's session_options to _compile_stage so soc_model, backend_path, etc. are respected - Removed the 'prefill fallback to JIT' warning since the hang is now fixed

… spawn Windows multiprocessing spawn serialises the subprocess target via pickle. Local functions (closures) defined inside a method cannot be pickled, which caused 'AttributeError: Can't pickle local function' at runtime. Moved the compilation logic to a module-level function _qnn_compile_worker so it is importable by name in the spawned subprocess. Also fix ONNX filename in compiled genai_config: use ctx_onnx.name (just the filename) instead of str(ctx_onnx) (absolute path). ort-genai resolves filenames relative to the directory passed to og.Config, so an absolute path causes double-path concatenation and a 'file not found' error.

… stages Previously _compile_stage forced mode='0' for ALL stages to avoid a QNN SDK deadlock on w8a16 + multi-token prefill. This also silently capped the iter (generation) stage at mode 0, producing under-optimized kernels (~10 tok/s). Fix: only force mode=0 for prefill stages (run_on_prompt=true, seq_len>1 where the deadlock occurs). Generation stages (run_on_token_gen=true, seq_len=1) use the configured mode from genai_config.json (typically '3'), which is safe for single-token input and produces fully-optimized kernels. Performance: Before: 10.4 tok/s (both ctx+iter compiled with mode 0) After: 43.4 tok/s (ctx mode 0, iter mode 3) — matches reference ~45 tok/s _prepare_compiled_bundle now passes is_prefill flag per stage based on run_on_prompt / run_on_token_gen fields in genai_config.json pipeline config.

… _compile_stage The original mode=0 override was added to avoid a QNN SDK deadlock when compiling w8a16 prefill (seq_len>1) at higher optimization levels. Testing revealed the deadlock only occurs when QNN provider options are NOT passed to ort.ModelCompiler at all (causing it to fall back to a broken default path). With correct QNN options (backend_path, soc_model, etc.) forwarded, mode=3 compiles successfully for both ctx (~73s) and iter (~67s) with no hang. Remove the is_prefill flag and mode override entirely. _compile_stage now passes genai_config QNN options unchanged, giving fully-optimized kernels for all stages. Performance (hot NPU, EPContext loaded): ctx+iter both mode=3: ~44.5 tok/s vs reference ~45 tok/s

…i as shim - Extract all architecture-agnostic logic (PipelineStage, DecoderIOMapping, build_genai_config, build_decoder_pipeline_stages, write_genai_bundle, qnn_stage_session_options, ONNX introspection helpers) into src/winml/modelkit/utils/genai.py so other model families can reuse it - Reduce qwen3/genai.py to a thin re-export shim with a backward-compatible build_qwen3_transformer_only_stages alias for existing callers - fix(codeql): remove unused _TOKENIZER_FILES from utils/genai.py - fix(codeql): remove unnecessary del generator in GenaiSession.generate_streaming - fix(codeql): add missing Protocol body ellipsis in QuantConfigFinalizer.finalize - fix(codeql): import get_quant_finalizer directly in quant/__init__.py - fix(test): update mock patch path to winml.modelkit.utils.genai._introspect_onnx_io - fix(test): replace bare 'import onnx' with 'from onnx import ...' in test_qwen3_calibration.py

github-actions Bot added 2 commits June 29, 2026 16:00

github-advanced-security AI found potential problems Jun 29, 2026

View reviewed changes

Comment thread src/winml/modelkit/models/hf/qwen3/genai.py Outdated

DEFAULT_LM_HEAD_FILENAME = "lm_head.onnx"

# Tokenizer files written by AutoTokenizer.save_pretrained.

_TOKENIZER_FILES = [

github-advanced-security AI found potential problems Jun 29, 2026

View reviewed changes

Comment thread src/winml/modelkit/session/genai_session.py Fixed

github-advanced-security AI found potential problems Jun 29, 2026

View reviewed changes

Comment thread src/winml/modelkit/session/genai_session.py Outdated

break

finally:

# Explicit deletion releases the KV cache buffer held by the generator.

del generator

github-actions Bot added 8 commits June 30, 2026 12:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(qwen3): genai bundle generation and inference script#996

feat(qwen3): genai bundle generation and inference script#996
DingmaomaoBJTU wants to merge 12 commits into
mainfrom
pr/836/feature/qwen3-quant

DingmaomaoBJTU commented Jun 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

DingmaomaoBJTU commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's new

Design

Usage

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DingmaomaoBJTU commented Jun 29, 2026 •

edited

Loading