feat: add vision support for MLX models (Kimi K2.5 + Qwen3-VL)#1455
Closed
Drifter4242 wants to merge 1 commit into exo-explore:main
Conversation
Collaborator
Thank you as always! I was looking at this but didn't have time to get around to it. I will do a proper review soon, but on a quick skim it looks great :)
Dual-path VisionEncoder architecture:
- Custom path: explicit projector + image processor (Kimi K2.5 / MoonViT)
- HF path: Qwen2VLImageProcessor + VisionModel (Qwen3-VL)

Pipeline:
- VisionEncoder loads vision weights separately from the language model
- build_vision_prompt() injects image placeholder tokens
- create_vision_embeddings() merges vision features into text embeddings
- KV prefix cache validates media regions across requests

Integration:
- chat_completions adapter: base64/URL image extraction from OpenAI-format messages
- download_utils: HTTP URL fetch for remote images
- Model cards for Kimi K2.5 (vision fields) and Qwen3-VL-4B-Instruct-4bit
- Dashboard vision state management
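The placeholder-injection step in the commit message could be sketched roughly as below. This is a hypothetical illustration, not the actual code: the function body, the sentinel convention, and the pad-token id are all assumptions.

```python
# Hypothetical sketch of the placeholder-injection step described above.
# IMAGE_PAD_ID and the -1 sentinel are illustrative assumptions; the real
# ids and marker handling are model-specific and live in vision.py.

IMAGE_PAD_ID = 151655  # assumed pad-token id

def build_vision_prompt(text_tokens, num_patches_per_image):
    """Expand each image marker into a run of pad tokens.

    text_tokens: token ids with a sentinel -1 wherever an image belongs.
    num_patches_per_image: how many vision embeddings each image produces.
    """
    out = []
    image_idx = 0
    for tok in text_tokens:
        if tok == -1:  # sentinel for an <image> marker
            out.extend([IMAGE_PAD_ID] * num_patches_per_image[image_idx])
            image_idx += 1
        else:
            out.append(tok)
    return out

tokens = build_vision_prompt([1, 2, -1, 3], [4])
# the marker expands into four pad positions: [1, 2, PAD, PAD, PAD, PAD, 3]
```

The pad positions are what create_vision_embeddings() later overwrites with projected vision features.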
Collaborator
Hey! Thanks so much for this! Since this is stale, I'm closing it in favour of #1802. However, I have cherry-picked your commit into the PR and used a lot of the code from this one :)
Motivation
I thought it would be nice for Kimi K2.5 to have vision.
It's a very large change and I appreciate that the exo team might not want such a big commit, but it does work, and I felt I should at least do a show and tell.
I used Opus 4.6 extensively, but I read through all the code, tested it and got it to rewrite several sections. In particular, I pushed it to minimize the changes to generate.py (putting all the vision code in vision.py). The runner also has only minor changes.
It does download the Kimi K2.5 vision weights from my own hf repo (since the existing mlx community model only has text).
It also uses the hf image processing code, which is somewhat of a security issue (but exo already uses hf code for Kimi).
I've also used mlx-vlm but only for the neural net architecture definitions.
Known issues (written by Opus 4.6):
- grid_thw dimensions are ignored: vision_prefill passes no position_ids, so vision tokens fall back to 1D arange positions. This likely degrades Qwen3-VL output quality; Kimi K2.5 doesn't use M-RoPE, so it is unaffected.
- Qwen3-VL's deepstack layers (deepstack_merger_list) are unused: we extract only the final hidden states and discard the rest.

Changes
(written by Opus 4.6)
Vision pipeline (src/exo/worker/engines/mlx/vision.py, new, ~780 lines)
- VisionEncoder: lazy-loads vision tower + projector weights from safetensors (supports both bundled MLX weights and separate PyTorch repos)
- VisionPipeline.process(): full encode → prompt → embed → result flow, returning a VisionResult that generate.py consumes without knowing vision internals
- vision_prefill(): directly runs transformer layers with spliced embeddings (bypasses stream_generate, which doesn't support input_embeddings for all architectures)
- create_vision_embeddings(): cumsum-based splicing of vision features into LM token embeddings at pad-token positions
- MediaRegion: content-hashed image regions for KV prefix cache validation (prevents false cache hits when different images share the same pad-token IDs)
- trust_remote_code (Kimi's custom processor, Qwen2VLImageProcessor, etc.)

Model cards
- mlx-community--Kimi-K2.5.toml with a [vision] section pointing to the davehind/Kimi-K2.5-vision weights repo
- mlx-community--Qwen3-VL-4B-Instruct-4bit.toml with bundled vision config

API adapter (chat_completions.py)
- Extracts images from content parts (inline base64 + http(s):// fetch)
- Updates chat_template_messages so Jinja templates emit correct <image> placeholders

Transport
- Raises max_transmit_size from 1 MB → 8 MB in rust/networking/src/swarm.rs to handle image payloads

KV prefix cache (cache.py)
- _validate_media_match(): truncates cache hits at the first media region where the content hash doesn't match, preventing stale vision features from being reused

Minimal changes to generate.py
- Calls prepare_vision(), uses VisionResult for tokens/embeddings, and takes vision_prefill_cached() instead of the normal prefill path when images are present

Why It Works
(written by Opus 4.6)
The design keeps vision details out of the generation loop.
generate.py gets a VisionResult containing pre-spliced embeddings and just prefills the KV cache with them; it doesn't know about image processors, vision towers, or projectors. Vision features are spliced into text embeddings using cumsum indexing over the image-token mask, which is both model-agnostic and efficient. The KV prefix cache uses content hashes on media regions to detect when images change between requests, avoiding false cache hits that would cause the model to "see" the wrong image.
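The cumsum splice described above can be sketched as follows. This is a numpy stand-in for the mlx code, with assumed names and pad-token id; the real implementation is in vision.py.

```python
# Illustrative sketch (numpy, not mlx) of cumsum-based splicing of vision
# features into token embeddings. All names here are assumptions.
import numpy as np

IMAGE_PAD_ID = 151655  # assumed pad-token id

def splice_vision_embeddings(token_ids, text_emb, vision_emb):
    """Replace embeddings at pad-token positions with vision features.

    token_ids:  (seq,) int array
    text_emb:   (seq, dim) token embeddings from the LM
    vision_emb: (n_img_tokens, dim) projected vision features, in order
    """
    mask = token_ids == IMAGE_PAD_ID
    # cumsum over the mask tells each pad position which vision feature
    # belongs there, with no Python loop over the sequence
    idx = np.cumsum(mask) - 1
    out = text_emb.copy()
    out[mask] = vision_emb[idx[mask]]
    return out

token_ids = np.array([1, 2, IMAGE_PAD_ID, IMAGE_PAD_ID, 3])
spliced = splice_vision_embeddings(
    token_ids, np.zeros((5, 4)), np.arange(8.0).reshape(2, 4)
)
# rows 2 and 3 now hold the two vision features, in order
```

Because the splice only depends on where the pad tokens sit, the same routine works for any architecture that marks image positions this way.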
The safety-net resize at the API entry point (before gossipsub serialization) prevents large images from blocking the entire command pipeline: the original hang was caused by base64 payloads exceeding gossipsub's message size limit, which blocked _send_out() before publish() could deliver locally.

Test Plan
Manual Testing
Two Mac Studio M3 Ultras with 512 GB each, connected via RDMA.
I tried various images on Qwen3-VL and Kimi K2.5.
Automated Testing
There is a basic image KV cache test.
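Such a test could look roughly like the sketch below. The MediaRegion fields, helper names, and truncation logic here are hypothetical illustrations of the described behaviour (truncate a cache hit at the first media region whose content hash differs), not the actual test code.

```python
# Hypothetical sketch of the image KV-cache check: a cached prefix must be
# cut at the first media region whose content hash differs from the new
# request, so stale vision features are never reused.
import hashlib
from dataclasses import dataclass

@dataclass
class MediaRegion:
    start: int         # first token index covered by the image
    end: int           # one past the last covered token index
    content_hash: str  # hash of the raw image bytes

def region_for(start, end, image_bytes):
    return MediaRegion(start, end, hashlib.sha256(image_bytes).hexdigest())

def validate_media_match(cached_len, cached_regions, new_regions):
    """Return how many cached tokens are safe to reuse."""
    for cached, new in zip(cached_regions, new_regions):
        if cached.content_hash != new.content_hash:
            # different image behind identical pad tokens: cut the
            # reusable prefix just before the stale region
            return min(cached_len, cached.start)
    return cached_len

cached = [region_for(4, 20, b"cat.png bytes")]
same   = [region_for(4, 20, b"cat.png bytes")]
other  = [region_for(4, 20, b"dog.png bytes")]
assert validate_media_match(32, cached, same) == 32   # full cache hit
assert validate_media_match(32, cached, other) == 4   # truncated at image
```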