feat: add vision support for MLX models (Kimi K2.5 + Qwen3-VL)#1455

Closed
Drifter4242 wants to merge 1 commit into exo-explore:main from Drifter4242:vision

Conversation

@Drifter4242
Contributor

Motivation

I thought it would be nice for Kimi K2.5 to have vision.
It's a very large change and I appreciate that the exo team might not want such a big commit, but it does work and I felt I should at least do a show and tell.
I used Opus 4.6 extensively, but I read through all the code, tested it and got it to rewrite several sections. In particular, I pushed it to minimize the changes to generate.py (putting all the vision code in vision.py). The runner also has only minor changes.
It does download the Kimi K2.5 vision weights from my own hf repo (since the existing mlx community model only has text).
It also uses the HF image-processing code, which is somewhat of a security concern (but exo already uses HF code for Kimi).
I've also used mlx-vlm but only for the neural net architecture definitions.

Known issues (written by Opus 4.6):

  • No video support. Only static images; video frames and temporal grid_thw dimensions are ignored.
  • Qwen3-VL M-RoPE not implemented. Qwen3-VL uses 3D positional embeddings (temporal/height/width) for vision tokens. Our vision_prefill passes no position_ids, so vision tokens fall back to 1D arange positions. This likely degrades Qwen3-VL output quality — Kimi K2.5 doesn't use M-RoPE so is unaffected.
  • Qwen3-VL deepstack features discarded. The vision tower returns intermediate hidden states for injection into early LM layers (deepstack_merger_list). We extract only the final hidden states and discard the rest.
  • Cross-image contamination. With multiple images in one conversation, the model can confuse content across images (hallucinated text from image 1 appearing in the description of image 2). Likely a VLM attention limitation rather than a code bug.
  • Gossipsub limit at 8MB. Large images are resized at the API entry point (2048×2048 cap) but the overall message budget is generous. Very large multi-image payloads could still approach the limit.

Changes

(written by Opus 4.6)

Vision pipeline (src/exo/worker/engines/mlx/vision.py — new, ~780 lines)

  • VisionEncoder — lazy-loads vision tower + projector weights from safetensors (supports both bundled MLX weights and separate PyTorch repos)
  • VisionPipeline.process() — full encode → prompt → embed → result flow, returning a VisionResult that generate.py consumes without knowing vision internals
  • vision_prefill() — directly runs transformer layers with spliced embeddings (bypasses stream_generate which doesn't support input_embeddings for all architectures)
  • create_vision_embeddings() — cumsum-based splicing of vision features into LM token embeddings at pad-token positions
  • MediaRegion — content-hashed image regions for KV prefix cache validation (prevents false cache hits when different images share the same pad-token IDs)
  • Works with any HF image processor via trust_remote_code (Kimi's custom processor, Qwen2VLImageProcessor, etc.)
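The cumsum-based splicing behind create_vision_embeddings() can be sketched in NumPy. This is a minimal illustration of the technique, not the exo implementation; the function and variable names here are hypothetical:

```python
import numpy as np

def splice_vision_embeddings(text_emb, vision_feats, image_token_mask):
    """Replace embeddings at image pad-token positions with vision features.

    A cumsum over the boolean mask gives each image-token position its
    ordinal among all image tokens, which indexes directly into the stacked
    vision features -- model-agnostic and fully vectorised.
    """
    out = text_emb.copy()
    # For a position i with mask True, cumsum - 1 is its order among image tokens.
    feat_idx = np.cumsum(image_token_mask) - 1
    out[image_token_mask] = vision_feats[feat_idx[image_token_mask]]
    return out

# Toy example: 6 text tokens, positions 2..4 are image pad tokens.
text_emb = np.zeros((6, 4))
vision_feats = np.arange(12, dtype=float).reshape(3, 4)  # 3 vision tokens
mask = np.array([False, False, True, True, True, False])
spliced = splice_vision_embeddings(text_emb, vision_feats, mask)
```

The same indexing works unchanged for any architecture whose chat template emits a run of pad tokens per image, which is what makes the splice model-agnostic.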

Model cards

  • Updated mlx-community--Kimi-K2.5.toml with [vision] section pointing to davehind/Kimi-K2.5-vision weights repo
  • Added mlx-community--Qwen3-VL-4B-Instruct-4bit.toml with bundled vision config

API adapter (chat_completions.py)

  • Extracts images from multimodal content parts (inline base64 + http(s):// fetch)
  • Safety-net resize at 2048×2048 pixels to keep base64 payloads under the gossipsub transport limit
  • Preserves multimodal content structure in chat_template_messages so Jinja templates emit correct <image> placeholders
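The safety-net resize reduces to simple dimension math. A sketch of the capping rule (the real adapter presumably delegates the actual resampling to an image library; `capped_size` is a hypothetical helper, not the adapter's API):

```python
MAX_SIDE = 2048  # safety-net cap applied before gossipsub serialization

def capped_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return image dimensions after the safety-net resize.

    Downscales only, preserving aspect ratio, so the longer side never
    exceeds max_side and base64 payloads stay bounded.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # small images pass through untouched
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Capping only the longer side keeps aspect ratio intact while bounding the worst-case payload, so a single oversized upload cannot stall the gossipsub publish path.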

Transport

  • Bumped gossipsub max_transmit_size from 1MB → 8MB in rust/networking/src/swarm.rs to handle image payloads

KV prefix cache (cache.py)

  • _validate_media_match() — truncates cache hits at the first media region where the content hash doesn't match, preventing stale vision features from being reused
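A minimal sketch of the idea behind _validate_media_match(): walk the media regions in order and truncate the reusable prefix at the first mismatch. Names, fields, and the signature here are assumptions based on the description above, not the actual cache.py code:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class MediaRegion:
    start: int         # token offset where the image's pad tokens begin
    length: int        # number of pad tokens occupied by the image
    content_hash: str  # digest of the raw image bytes

def hash_media(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def validate_media_match(cached: list[MediaRegion],
                         incoming: list[MediaRegion],
                         prefix_len: int) -> int:
    """Return how many tokens of a KV prefix-cache hit are actually reusable.

    At the first region where position, length, or content hash differs,
    truncate the usable prefix to that region's start, so stale vision
    features are never reused for a different image.
    """
    for old, new in zip(cached, incoming):
        if old.start >= prefix_len:
            break  # region lies beyond the matched prefix anyway
        if (old.start, old.length, old.content_hash) != (new.start, new.length, new.content_hash):
            return min(prefix_len, old.start)
    return prefix_len
```

Hashing the image bytes rather than the token IDs is what prevents the false cache hits described above: two different images produce identical pad-token runs but distinct content hashes.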

Minimal changes to generate.py

  • ~50 lines added: calls prepare_vision(), uses VisionResult for tokens/embeddings, and vision_prefill_cached() instead of the normal prefill path when images are present

Why It Works

(written by Opus 4.6)
The design keeps vision details out of the generation loop. generate.py gets a VisionResult containing pre-spliced embeddings and just prefills the KV cache with them — it doesn't know about image processors, vision towers, or projectors.

Vision features are spliced into text embeddings using cumsum indexing over the image-token mask, which is both model-agnostic and efficient. The KV prefix cache uses content hashes on media regions to detect when images change between requests, avoiding false cache hits that would cause the model to "see" the wrong image.

The safety-net resize at the API entry point (before gossipsub serialization) prevents large images from blocking the entire command pipeline — the original hang was caused by base64 payloads exceeding gossipsub's message size limit, which blocked _send_out() before publish() could deliver locally.

Test Plan

Manual Testing

Two Mac Studio M3 Ultras with 512 GB each, connected via RDMA.
I tried various images on Qwen3-VL and Kimi K2.5.

Automated Testing

There is a basic image KV cache test.

@Drifter4242
Contributor Author

(screenshot attached)

@Drifter4242
Contributor Author

(screenshot attached)

@Drifter4242
Contributor Author

(screenshot attached)

@rltakashige
Collaborator

Thank you as always! I was looking at this but didn't have time to get around to it.

I will do a proper review soon, but on a quick skim, it looks great :)

Dual-path VisionEncoder architecture:
- Custom path: explicit projector + image processor (Kimi K2.5 / MoonViT)
- HF path: Qwen2VLImageProcessor + VisionModel (Qwen3-VL)

Pipeline:
- VisionEncoder loads vision weights separately from language model
- build_vision_prompt() injects image placeholder tokens
- create_vision_embeddings() merges vision features into text embeddings
- KV prefix cache validates media regions across requests

Integration:
- chat_completions adapter: base64/URL image extraction from OpenAI-format messages
- download_utils: HTTP URL fetch for remote images
- Model cards for Kimi K2.5 (vision fields) and Qwen3-VL-4B-Instruct-4bit
- Dashboard vision state management

@rltakashige
Collaborator

Hey! Thanks so much for this! Since this is stale, I'm closing it in favour of #1802. However, I have cherry-picked your commit into the PR and used a lot of the code from this one :)

@rltakashige
Collaborator

(screenshot attached)

Qwen3-VL still works. I don't think I have the patience to wait for Kimi to download, so I'm hoping that does, too. Please check it out if you'd like
