Skip to content

feat(qwen3): genai bundle generation and inference script#996

Draft
DingmaomaoBJTU wants to merge 12 commits into
mainfrom
pr/836/feature/qwen3-quant
Draft

feat(qwen3): genai bundle generation and inference script#996
DingmaomaoBJTU wants to merge 12 commits into
mainfrom
pr/836/feature/qwen3-quant

Conversation

@DingmaomaoBJTU

@DingmaomaoBJTU DingmaomaoBJTU commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds onnxruntime-genai integration for winml-exported Qwen3 transformer-only models.

What's new

File Change
src/winml/modelkit/models/hf/qwen3/genai.py New module: PipelineStage, DecoderIOMapping, build_genai_config(), build_qwen3_transformer_only_stages(), write_genai_bundle()
scripts/export_qwen3_transformer_only.py New --genai-bundle, --embeddings, --lm-head flags
scripts/infer_genai.py New inference script (CPU / QNN EP)
src/winml/modelkit/models/hf/qwen3/__init__.py Exports new symbols
tests/unit/models/qwen3/test_genai_config.py 35 unit tests

Design

build_genai_config is architecture-agnostic: it takes a list[PipelineStage] and DecoderIOMapping — no tensor names are hardcoded inside it.

build_qwen3_transformer_only_stages is the Qwen3-specific factory. It calls _introspect_onnx_io on the built ctx.onnx / iter.onnx, runs _detect_format_patterns to discover past_keys_%d / present_values_%d style patterns from the actual graph I/O, and returns (stages, decoder_io). Tensor names can never drift from what the ONNX really contains.

build_genai_config(hf_config, ..., pipeline=stages, decoder_io=decoder_io)
        ↑ generic, architecture-agnostic
        
build_qwen3_transformer_only_stages(ctx_onnx, iter_onnx, num_layers)
        ↑ Qwen3-specific; introspects ONNX to discover all tensor names

write_genai_bundle calls build_qwen3_transformer_only_stages internally, so the one-shot API is unchanged.

Usage

# Build + emit genai bundle in one step
python scripts/export_qwen3_transformer_only.py \
  --model Qwen/Qwen3-0.6B \
  --output out/qwen3 \
  --genai-bundle out/qwen3_bundle \
  --embeddings path/to/embeddings.onnx \
  --lm-head    path/to/lm_head.onnx

# Inference
python scripts/infer_genai.py \
  --model out/qwen3_bundle \
  --ep qnn \
  --chat \
  --prompt "Hello, how are you?"

Notes

  • embeddings.onnx and lm_head.onnx are placeholders — copy them in via --embeddings / --lm-head. Future PRs will generate them from winml directly.
  • infer_genai.py requires onnxruntime-genai-winml and windowsml[with-ort].
  • The 4-stage pipeline config format matches result/qwen3-genai-share/model/genai_config.json exactly.

Follows from #836.

github-actions Bot added 2 commits June 29, 2026 16:00
- src/winml/modelkit/models/hf/qwen3/genai.py: new module with
  build_genai_config() and write_genai_bundle(). build_genai_config
  generates the onnxruntime-genai pipeline config JSON from a HF
  PretrainedConfig + max_cache_len + prefill_seq_len. write_genai_bundle
  copies the winml-built ctx/iter ONNX, optional placeholder embeddings
  and lm_head ONNX, saves tokenizer files from HF, and writes
  genai_config.json.

- scripts/export_qwen3_transformer_only.py: add --genai-bundle DIR,
  --embeddings ONNX, --lm-head ONNX flags. When --genai-bundle is set,
  write_genai_bundle is called after the build to emit a complete
  onnxruntime-genai bundle.

- scripts/infer_genai.py: new inference script. Loads the genai bundle
  with og.Config, registers WinML EPs (QNN), and runs greedy generation
  via og.Generator. Supports --ep cpu|qnn, --chat template wrapping,
  --max-new, --context-length, --verbose.

- src/winml/modelkit/models/hf/qwen3/__init__.py: export
  build_genai_config and write_genai_bundle.

- tests/unit/models/qwen3/test_genai_config.py: 21 unit tests for
  build_genai_config covering pipeline structure, KV name counts,
  tensor name constants, edge cases (list eos_token_id, missing head_dim,
  None pad_token_id, custom filenames, variable layer count).
…tion

Replace hardcoded tensor-name constants with a data-driven design:

- PipelineStage dataclass: carries name, filename, run_on_prompt/token_gen,
  inputs, outputs, is_lm_head. Callers construct stages explicitly; no
  tensor names are baked into build_genai_config itself.

- DecoderIOMapping dataclass: holds the %d-style format strings that genai
  uses to expand per-layer KV tensor names. Defaults match Qwen3 naming
  but any naming convention is supported.

- build_genai_config: now takes pipeline: list[PipelineStage] and
  decoder_io: DecoderIOMapping. Architecture-agnostic; no Qwen3-specific
  logic. prefill_seq_len=None omits the sliding_window section.

- _introspect_onnx_io: reads graph.input / graph.output from an ONNX
  model without loading external data weights.

- _detect_format_patterns: scans tensor names for indexed groups matching
  <prefix><int> with exactly num_layers consecutive zero-based indices,
  returns {prefix: 'prefix%d'} patterns.

- build_qwen3_transformer_only_stages: Qwen3-specific factory that calls
  _introspect_onnx_io on the built ctx/iter ONNX, detects KV patterns via
  _detect_format_patterns, and returns (list[PipelineStage], DecoderIOMapping).
  Tensor names can never drift from the actual ONNX graph I/O.

- write_genai_bundle: delegates to build_qwen3_transformer_only_stages
  instead of hardcoding names.

Tests (35 total, all pass):
- TestBuildGenaiConfig: +2 new cases (no sliding_window, custom DecoderIOMapping)
- TestDetectFormatPatterns: 6 new unit tests for the pattern detector
- TestBuildQwen3TransformerOnlyStages: 6 new tests using patched
  _introspect_onnx_io (no real ONNX files required)
DEFAULT_LM_HEAD_FILENAME = "lm_head.onnx"

# Tokenizer files written by AutoTokenizer.save_pretrained.
_TOKENIZER_FILES = [
- GenaiSession drives og.Model + og.Generator lifecycle for autoregressive
  text generation; peer class to WinMLSession (not a subclass)
- GenerationConfig dataclass: temperature, top_p, top_k, max_new_tokens,
  repetition_penalty, do_sample
- Lazy onnxruntime_genai import via _import_og() — class importable without
  the package installed (raises GenaiNotInstalledError on first use)
- Reuses WinMLEPRegistry for EP discovery/registration (idempotent)
- EP support: cpu (clear_providers only), qnn, dml
- context_length read from genai_config.json; overridable at construction
- generate_streaming() yields decoded token strings; generator del'd in finally
- generate() returns joined string; auto-load on first call if not loaded
- 33 unit tests; all use patch.dict(sys.modules) to avoid real hardware
Comment thread src/winml/modelkit/session/genai_session.py Fixed
- Moves chat template logic from infer_genai.py into GenaiSession
- Supports optional system prompt
- ChatML is not Qwen3-specific; used by Qwen2/3, Yi, Mistral, etc.
- infer_genai.py _wrap_chat_template now delegates to the static method
- Updated --chat flag help text and script docstring
- 4 new tests covering user-only, with-system, no-system-turn, assistant-priming
break
finally:
# Explicit deletion releases the KV cache buffer held by the generator.
del generator
github-actions Bot added 8 commits June 30, 2026 12:39
- PipelineStage gains session_options: dict | None = None field;
  PipelineStage.to_dict() emits it when set
- Add _qnn_stage_session_options(log_id, soc_model) helper that
  produces QNN HTP provider_options for a pipeline stage
- build_qwen3_transformer_only_stages gains ep='cpu' and soc_model='60'
  params; when ep='qnn' the context and iterator stages receive QNN
  session_options, embeddings and lm_head stay on CPU (no session_options)
- write_genai_bundle threads ep/soc_model through
- export_qwen3_transformer_only.py passes ep='qnn' when --device npu
- 5 new tests covering cpu/qnn ep routing and soc_model propagation
  (39 total, all pass)
Remove clear_providers/append_provider calls from GenaiSession.load().
EP placement is fully driven by per-stage session_options in genai_config.json.
clear_providers() only clears the top-level provider and cannot override
per-stage session_options embedded in the pipeline config.

- Add 'mixed' EP (use genai_config.json as-is; default for infer_genai.py)
- _NEEDS_WINML_EPS covers mixed/qnn/dml to trigger EP registration
- Replace _EP_PROVIDER_MAP with _VALID_EPS + _NEEDS_WINML_EPS sets
- Update tests: remove append_provider assertions, add mixed/config-not-modified tests
- infer_genai.py default EP changed from 'cpu' to 'mixed'

Result: NPU bundle (out/qwen3_bundle_npu) now runs at 9.3 tok/s vs 1.2 tok/s CPU
- GenaiSession gains compile=True parameter
- _prepare_compiled_bundle(): detects QNN stages from genai_config.json,
  compiles each stage to EPContext ONNX via ort.ModelCompiler in a subprocess
- _compile_stage(): 5-minute timeout per stage to handle QNN SDK hang
  (known bug: w8a16 + multi-token prefill hangs indefinitely)
- Compiled artifacts cached in bundle_dir/_compiled/; reused on subsequent runs
- _mirror_non_onnx_files(): symlinks/copies tokenizer files so og.Config
  can load from the compiled sub-directory
- infer_genai.py --compile flag wired through to GenaiSession
…on_optimization_mode=0

Root cause: QNN SDK ModelCompiler deadlocks when compiling w8a16 quantized
ONNX with multi-token static input shapes (seq_len > 1) at graph finalization
optimization levels 1-3. The genai_config uses level 3 for runtime inference,
which triggers the hang when passed to ModelCompiler directly.

Fix: _compile_stage now forces htp_graph_finalization_optimization_mode=0 for
compilation. This lets ModelCompiler finish (ctx ~41s, iter ~67s) while runtime
inference still uses the full level-3 optimization from genai_config (EPContext
loading bypasses compilation entirely, so the runtime option is irrelevant).

Also fixes:
- Pipeline stage detection: genai_config uses 'qnn' key (not 'QNNExecutionProvider')
  in provider_options; detection and option extraction now uses the correct key
- _patch_stage_filename: genai_config pipeline is a list, not a dict; updated
  to iterate list entries correctly
- _prepare_compiled_bundle: passes QNN provider options from each stage's
  session_options to _compile_stage so soc_model, backend_path, etc. are respected
- Removed the 'prefill fallback to JIT' warning since the hang is now fixed
… spawn

Windows multiprocessing spawn serialises the subprocess target via pickle.
Local functions (closures) defined inside a method cannot be pickled, which
caused 'AttributeError: Can't pickle local function' at runtime.

Moved the compilation logic to a module-level function _qnn_compile_worker
so it is importable by name in the spawned subprocess.

Also fix ONNX filename in compiled genai_config: use ctx_onnx.name (just the
filename) instead of str(ctx_onnx) (absolute path).  ort-genai resolves
filenames relative to the directory passed to og.Config, so an absolute path
causes double-path concatenation and a 'file not found' error.
… stages

Previously _compile_stage forced mode='0' for ALL stages to avoid a QNN SDK
deadlock on w8a16 + multi-token prefill. This also silently capped the iter
(generation) stage at mode 0, producing under-optimized kernels (~10 tok/s).

Fix: only force mode=0 for prefill stages (run_on_prompt=true, seq_len>1
where the deadlock occurs). Generation stages (run_on_token_gen=true,
seq_len=1) use the configured mode from genai_config.json (typically '3'),
which is safe for single-token input and produces fully-optimized kernels.

Performance:
  Before: 10.4 tok/s (both ctx+iter compiled with mode 0)
  After:  43.4 tok/s (ctx mode 0, iter mode 3) — matches reference ~45 tok/s

_prepare_compiled_bundle now passes is_prefill flag per stage based on
run_on_prompt / run_on_token_gen fields in genai_config.json pipeline config.
… _compile_stage

The original mode=0 override was added to avoid a QNN SDK deadlock when
compiling w8a16 prefill (seq_len>1) at higher optimization levels.

Testing revealed the deadlock only occurs when QNN provider options are
NOT passed to ort.ModelCompiler at all (causing it to fall back to a
broken default path). With correct QNN options (backend_path, soc_model,
etc.) forwarded, mode=3 compiles successfully for both ctx (~73s) and
iter (~67s) with no hang.

Remove the is_prefill flag and mode override entirely. _compile_stage now
passes genai_config QNN options unchanged, giving fully-optimized kernels
for all stages.

Performance (hot NPU, EPContext loaded):
  ctx+iter both mode=3: ~44.5 tok/s vs reference ~45 tok/s
…i as shim

- Extract all architecture-agnostic logic (PipelineStage, DecoderIOMapping,
  build_genai_config, build_decoder_pipeline_stages, write_genai_bundle,
  qnn_stage_session_options, ONNX introspection helpers) into
  src/winml/modelkit/utils/genai.py so other model families can reuse it
- Reduce qwen3/genai.py to a thin re-export shim with a backward-compatible
  build_qwen3_transformer_only_stages alias for existing callers
- fix(codeql): remove unused _TOKENIZER_FILES from utils/genai.py
- fix(codeql): remove unnecessary del generator in GenaiSession.generate_streaming
- fix(codeql): add missing Protocol body ellipsis in QuantConfigFinalizer.finalize
- fix(codeql): import get_quant_finalizer directly in quant/__init__.py
- fix(test): update mock patch path to winml.modelkit.utils.genai._introspect_onnx_io
- fix(test): replace bare 'import onnx' with 'from onnx import ...' in
  test_qwen3_calibration.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants