feat: add vision support for MLX models (Kimi K2.5 + Qwen3-VL)#1455

Closed
Drifter4242 wants to merge 1 commit into exo-explore:main from Drifter4242:vision

Conversation

@Drifter4242
Contributor

Motivation

I thought it would be nice for Kimi K2.5 to have vision.
It's a very large change and I appreciate that the exo team might not want such a big commit, but it does work and I felt I should at least do a show and tell.
I used Opus 4.6 extensively, but I read through all the code, tested it and got it to rewrite several sections. In particular, I pushed it to minimize the changes to generate.py (putting all the vision code in vision.py). The runner also has only minor changes.
It does download the Kimi K2.5 vision weights from my own hf repo (since the existing mlx community model only has text).
It also uses the HF image-processing code, which is somewhat of a security concern (but exo already uses HF code for Kimi).
I've also used mlx-vlm but only for the neural net architecture definitions.

Known issues (written by Opus 4.6):

  • No video support. Only static images; video frames and temporal grid_thw dimensions are ignored.
  • Qwen3-VL M-RoPE not implemented. Qwen3-VL uses 3D positional embeddings (temporal/height/width) for vision tokens. Our vision_prefill passes no position_ids, so vision tokens fall back to 1D arange positions. This likely degrades Qwen3-VL output quality — Kimi K2.5 doesn't use M-RoPE so is unaffected.
  • Qwen3-VL deepstack features discarded. The vision tower returns intermediate hidden states for injection into early LM layers (deepstack_merger_list). We extract only the final hidden states and discard the rest.
  • Cross-image contamination. With multiple images in one conversation, the model can confuse content across images (hallucinated text from image 1 appearing in the description of image 2). Likely a VLM attention limitation rather than a code bug.
  • Gossipsub limit at 8MB. Large images are resized at the API entry point (2048×2048 cap) but the overall message budget is generous. Very large multi-image payloads could still approach the limit.

Changes

(written by Opus 4.6)

Vision pipeline (src/exo/worker/engines/mlx/vision.py — new, ~780 lines)

  • VisionEncoder — lazy-loads vision tower + projector weights from safetensors (supports both bundled MLX weights and separate PyTorch repos)
  • VisionPipeline.process() — full encode → prompt → embed → result flow, returning a VisionResult that generate.py consumes without knowing vision internals
  • vision_prefill() — directly runs transformer layers with spliced embeddings (bypasses stream_generate which doesn't support input_embeddings for all architectures)
  • create_vision_embeddings() — cumsum-based splicing of vision features into LM token embeddings at pad-token positions
  • MediaRegion — content-hashed image regions for KV prefix cache validation (prevents false cache hits when different images share the same pad-token IDs)
  • Works with any HF image processor via trust_remote_code (Kimi's custom processor, Qwen2VLImageProcessor, etc.)
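The cumsum-based splicing behind create_vision_embeddings() can be sketched in NumPy. This is a minimal illustration of the technique, not the exo implementation; the function and variable names here are hypothetical:

```python
import numpy as np

def splice_vision_embeddings(text_emb, vision_feats, image_token_mask):
    """Replace embeddings at image pad-token positions with vision features.

    A cumsum over the boolean mask gives each image-token position its
    ordinal among all image tokens, which indexes directly into the stacked
    vision features -- model-agnostic and fully vectorised.
    """
    out = text_emb.copy()
    # For a position i with mask True, cumsum - 1 is its order among image tokens.
    feat_idx = np.cumsum(image_token_mask) - 1
    out[image_token_mask] = vision_feats[feat_idx[image_token_mask]]
    return out

# Toy example: 6 text tokens, positions 2..4 are image pad tokens.
text_emb = np.zeros((6, 4))
vision_feats = np.arange(12, dtype=float).reshape(3, 4)  # 3 vision tokens
mask = np.array([False, False, True, True, True, False])
spliced = splice_vision_embeddings(text_emb, vision_feats, mask)
```

The same indexing works unchanged for any architecture whose chat template emits a run of pad tokens per image, which is what makes the splice model-agnostic.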

Model cards

  • Updated mlx-community--Kimi-K2.5.toml with [vision] section pointing to davehind/Kimi-K2.5-vision weights repo
  • Added mlx-community--Qwen3-VL-4B-Instruct-4bit.toml with bundled vision config

API adapter (chat_completions.py)

  • Extracts images from multimodal content parts (inline base64 + http(s):// fetch)
  • Safety-net resize at 2048×2048 pixels to keep base64 payloads under the gossipsub transport limit
  • Preserves multimodal content structure in chat_template_messages so Jinja templates emit correct <image> placeholders
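The safety-net resize reduces to simple dimension math. A sketch of the capping rule (the real adapter presumably delegates the actual resampling to an image library; `capped_size` is a hypothetical helper, not the adapter's API):

```python
MAX_SIDE = 2048  # safety-net cap applied before gossipsub serialization

def capped_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return image dimensions after the safety-net resize.

    Downscales only, preserving aspect ratio, so the longer side never
    exceeds max_side and base64 payloads stay bounded.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # small images pass through untouched
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Capping only the longer side keeps aspect ratio intact while bounding the worst-case payload, so a single oversized upload cannot stall the gossipsub publish path.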

Transport

  • Bumped gossipsub max_transmit_size from 1MB → 8MB in rust/networking/src/swarm.rs to handle image payloads

KV prefix cache (cache.py)

  • _validate_media_match() — truncates cache hits at the first media region where the content hash doesn't match, preventing stale vision features from being reused
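A minimal sketch of the idea behind _validate_media_match(): walk the media regions in order and truncate the reusable prefix at the first mismatch. Names, fields, and the signature here are assumptions based on the description above, not the actual cache.py code:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class MediaRegion:
    start: int         # token offset where the image's pad tokens begin
    length: int        # number of pad tokens occupied by the image
    content_hash: str  # digest of the raw image bytes

def hash_media(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def validate_media_match(cached: list[MediaRegion],
                         incoming: list[MediaRegion],
                         prefix_len: int) -> int:
    """Return how many tokens of a KV prefix-cache hit are actually reusable.

    At the first region where position, length, or content hash differs,
    truncate the usable prefix to that region's start, so stale vision
    features are never reused for a different image.
    """
    for old, new in zip(cached, incoming):
        if old.start >= prefix_len:
            break  # region lies beyond the matched prefix anyway
        if (old.start, old.length, old.content_hash) != (new.start, new.length, new.content_hash):
            return min(prefix_len, old.start)
    return prefix_len
```

Hashing the image bytes rather than the token IDs is what prevents the false cache hits described above: two different images produce identical pad-token runs but distinct content hashes.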

Minimal changes to generate.py

  • ~50 lines added: calls prepare_vision(), uses VisionResult for tokens/embeddings, and vision_prefill_cached() instead of the normal prefill path when images are present

Why It Works

(written by Opus 4.6)
The design keeps vision details out of the generation loop. generate.py gets a VisionResult containing pre-spliced embeddings and just prefills the KV cache with them — it doesn't know about image processors, vision towers, or projectors.

Vision features are spliced into text embeddings using cumsum indexing over the image-token mask, which is both model-agnostic and efficient. The KV prefix cache uses content hashes on media regions to detect when images change between requests, avoiding false cache hits that would cause the model to "see" the wrong image.

The safety-net resize at the API entry point (before gossipsub serialization) prevents large images from blocking the entire command pipeline — the original hang was caused by base64 payloads exceeding gossipsub's message size limit, which blocked _send_out() before publish() could deliver locally.

Test Plan

Manual Testing

Two Mac Studio M3 Ultras with 512 GB each, connected via RDMA.
I tried various images on Qwen3-VL and Kimi K2.5.

Automated Testing

There is a basic image KV cache test.

@Drifter4242
Contributor Author

(screenshot attached)

@Drifter4242
Contributor Author

(screenshot attached)

@Drifter4242
Contributor Author

(screenshot attached)

@rltakashige
Collaborator

Thank you as always! I was looking at this but didn't have time to get around to it.

I will do a proper review soon, but on a quick skim, it looks great :)

Dual-path VisionEncoder architecture:
- Custom path: explicit projector + image processor (Kimi K2.5 / MoonViT)
- HF path: Qwen2VLImageProcessor + VisionModel (Qwen3-VL)

Pipeline:
- VisionEncoder loads vision weights separately from language model
- build_vision_prompt() injects image placeholder tokens
- create_vision_embeddings() merges vision features into text embeddings
- KV prefix cache validates media regions across requests

Integration:
- chat_completions adapter: base64/URL image extraction from OpenAI-format messages
- download_utils: HTTP URL fetch for remote images
- Model cards for Kimi K2.5 (vision fields) and Qwen3-VL-4B-Instruct-4bit
- Dashboard vision state management

@rltakashige
Collaborator

Hey! Thanks so much for this! Since this is stale, I'm closing it in favour of #1802. However, I have cherry-picked your commit into the PR and used a lot of the code from this one :)

@rltakashige
Collaborator

(screenshot attached)

Qwen3-VL still works. I don't think I have the patience to wait for Kimi to download, so I'm hoping that does, too. Please check it out if you'd like
