Multi Resolution Support for Qwen2.5-VL Model #875

Open
quic-sanising wants to merge 6 commits into quic:release/v1.21.0 from quic-sanising:qwenvl2_5_multi_spec
quic-sanising commented Mar 18, 2026

Qwen2.5‑VL workloads often see variable image sizes, so supporting multiple resolutions improves usability and makes benchmarking more realistic. This PR adds multi‑resolution support to the Qwen2.5‑VL specialization flow. The goal is to let a single run handle multiple image sizes while keeping specialization metadata consistent and avoiding shape/buffer mismatches.

What changed?

Qwen2.5‑VL specialization now supports multiple (width, height) pairs without requiring changes to the model ONNX:

  • Accept width and height as either int or List[int].
  • Reuse shared smart_resize utility instead of local implementation (imports: from qwen_vl_utils import smart_resize).
  • Compute encoder specialization for each resolution by looping over (width, height) pairs.
  • Compute decoder specialization based on the min_vision_size across all provided resolutions.
  • Allow overriding image tokenization constraints via mm_processor_kwargs, in particular min_pixels and max_pixels.
  • Allow overriding the prefill/decode vision_size specialization via a user-supplied vision_size.
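
The flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the dictionary layout of a specialization, and `round_to_block` (a simplified stand-in for the shared `smart_resize` utility) are all hypothetical, and the sketch assumes one vision token per 28x28 pixel block.

```python
from typing import Dict, List, Optional, Tuple, Union

IntOrList = Union[int, List[int]]

# Assumption: one merged vision token covers a 28x28 pixel block.
TOKEN_BLOCK = 28


def normalize_resolutions(width: IntOrList, height: IntOrList) -> List[Tuple[int, int]]:
    """Turn scalar or list inputs into a list of (width, height) pairs."""
    widths = [width] if isinstance(width, int) else list(width)
    heights = [height] if isinstance(height, int) else list(height)
    if len(widths) != len(heights):
        raise ValueError("width and height must have the same number of entries")
    return list(zip(widths, heights))


def round_to_block(value: int, block: int = TOKEN_BLOCK) -> int:
    """Hypothetical stand-in for the resize step: snap to a multiple of `block`."""
    return max(block, round(value / block) * block)


def vision_size_for(width: int, height: int) -> int:
    """Number of vision tokens for one resolution under the assumption above."""
    return (round_to_block(height) // TOKEN_BLOCK) * (round_to_block(width) // TOKEN_BLOCK)


def build_specializations(
    width: IntOrList,
    height: IntOrList,
    vision_size: Optional[int] = None,
) -> Tuple[List[Dict[str, int]], int]:
    """Return one encoder entry per resolution and a single decoder vision size."""
    pairs = normalize_resolutions(width, height)
    encoder_specs = [
        {"width": w, "height": h, "vision_size": vision_size_for(w, h)}
        for w, h in pairs
    ]
    # The decoder uses the smallest vision size unless the user overrides it.
    min_vision_size = min(spec["vision_size"] for spec in encoder_specs)
    decoder_vision_size = vision_size if vision_size is not None else min_vision_size
    return encoder_specs, decoder_vision_size
```

Under these assumptions, `build_specializations([280, 560], [280, 560])` yields two encoder entries with vision sizes 100 and 400 and a decoder vision size of 100, while the scalar call `build_specializations(280, 280)` yields a single entry.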

Testing

  • Single‑resolution path: pass height=int, width=int and verify behavior matches previous outputs.
  • Multi‑resolution path: pass height=[...], width=[...] and confirm:
    • specializations are generated for each resolution,
    • min_vision_size is used consistently,
    • no shape mismatch at runtime.
  • KV‑offload run with at least two distinct pixel_values shapes that are present in vision_session.allowed_shapes.
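
For the KV‑offload check, a shape lookup along the following lines may be useful. It is only a sketch: it treats the allowed shapes as a plain list of tuples, which is an assumption about how `vision_session.allowed_shapes` can be viewed, and `find_allowed_shape` is a hypothetical helper, not part of this PR.

```python
from typing import Iterable, Sequence


def find_allowed_shape(
    input_shape: Sequence[int],
    allowed_shapes: Iterable[Sequence[int]],
) -> int:
    """Return the index of the allowed shape equal to `input_shape`.

    Raises ValueError when the input was not compiled for, so that a
    mismatch is caught before any buffers are bound.
    """
    target = tuple(input_shape)
    for index, shape in enumerate(allowed_shapes):
        if tuple(shape) == target:
            return index
    raise ValueError(f"pixel_values shape {target} is not an allowed shape")
```

A test can then call the helper once per distinct `pixel_values` shape and assert that each call returns a different index, for example with two illustrative shapes such as `(400, 1176)` and `(1600, 1176)`.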

Checklist

  • Support both scalar and list inputs for image resolution.
  • Remove duplicated resizing logic in favor of shared utility.
  • Select correct buffers for vision session based on input shapes.
  • Pad vision embeddings to match language session constraints.
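
The padding item can be illustrated with a small pure-Python sketch. It assumes the language session expects a fixed number of embedding rows that is at least the number the vision encoder produced, and that zero rows are an acceptable filler; both points, and the helper name `pad_vision_embeddings`, are assumptions rather than details taken from this PR.

```python
from typing import List


def pad_vision_embeddings(
    embeddings: List[List[float]],
    target_len: int,
    pad_value: float = 0.0,
) -> List[List[float]]:
    """Append rows of `pad_value` until there are `target_len` rows."""
    if len(embeddings) > target_len:
        raise ValueError("embeddings are longer than the target length")
    hidden = len(embeddings[0]) if embeddings else 0
    padding = [[pad_value] * hidden for _ in range(target_len - len(embeddings))]
    # Copy the original rows so the caller's data is not mutated later.
    return [list(row) for row in embeddings] + padding
```

For instance, padding two rows of width 2 to a target length of 4 appends two zero rows and leaves the original rows unchanged.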

quic-sanising added 5 commits March 18, 2026 14:40
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
…nd max pixels

Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>