
Add DeepSeek-v4 (Flash/Pro)#1192

Open
Blaizzy wants to merge 62 commits into ml-explore:main from Blaizzy:pc/add-deepseekv4flash-model

Conversation

@Blaizzy (Contributor) commented Apr 24, 2026

Note: Please install this transformers PR from source to avoid tokenizer bugs.

pip install git+https://github.com/huggingface/transformers.git@refs/pull/45643/head

Weights here:
https://huggingface.co/collections/mlx-community/deepseek-v4


@Blaizzy changed the title from "Add DeepSeekv4 (Flash/Pro)" to "Add DeepSeek-v4 (Flash/Pro)" on Apr 24, 2026
@Blaizzy (Contributor Author) commented Apr 24, 2026

You can now run it on a 256GB Mac by keeping the experts in 4-bit!

We could do 5-bit instead, since it's much better than 4-bit right now. I'm open to opinions @angeloskath.


@Blaizzy (Contributor Author) commented Apr 24, 2026

It's faster now!

Comment thread mlx_lm/utils.py
Comment thread mlx_lm/models/deepseek_v4.py Outdated
@machiabeli commented:

Hey @Blaizzy — just flagging some technical notes since we're both working on V4 support and PR #1189 landed ~10 hours earlier with significant overlap:

Compressed attention mask direction (lines 770–773):
The mask padding for compressed KV rows uses mx.ones, but create_attention_mask returns negative values for blocked positions. Padding with ones would block attention to compressed rows rather than allow it. PR #1189 uses mx.zeros here.
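
For reference, a minimal sketch of the additive-mask convention this point relies on; the function name and shapes here are illustrative assumptions, not the PR's code:

```python
import mlx.core as mx

def pad_mask_for_compressed_kv(mask: mx.array, n_compressed: int) -> mx.array:
    # `mask` is an additive mask of shape (..., L_q, L_kv): 0.0 marks positions a
    # query may attend to, large negative values mark blocked positions.
    # Appending zero-valued columns keeps the compressed KV rows attendable;
    # appending ones adds a spurious bias instead of the "allowed" encoding
    # the rest of the mask uses.
    pad_shape = mask.shape[:-1] + (n_compressed,)
    allowed = mx.zeros(pad_shape, dtype=mask.dtype)
    return mx.concatenate([mask, allowed], axis=-1)
```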

Sinkhorn normalization:
The Python loop path (lines 222–226) dispatches ~40 kernel launches per call (softmax + iters × sum + div). PR #1189 has a fused Metal kernel that does this in a single register-resident dispatch, benchmarked at 3.5–5.7× faster in microbenchmarks and 1.83× end-to-end.
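
A minimal, uncompiled sketch of the loop path being described (iteration count, epsilon, and axis conventions are assumptions rather than the PR's exact code); each iteration issues several separate kernels, which is what a fused Metal kernel collapses into one dispatch:

```python
import mlx.core as mx

def sinkhorn_normalize(scores: mx.array, n_iters: int = 10, eps: float = 1e-6) -> mx.array:
    # scores: (tokens, experts) router logits
    p = mx.softmax(scores, axis=-1)
    for _ in range(n_iters):
        p = p / (p.sum(axis=0, keepdims=True) + eps)   # normalize over tokens
        p = p / (p.sum(axis=-1, keepdims=True) + eps)  # normalize over experts
    return p
```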

sqrt-softplus numerical stability:
nn.softplus(x) can overflow for large scores. PR #1189 uses mx.logaddexp(scores, zeros), which is log-sum-exp stable.
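
A small sketch of the stability point, assuming the op in question is sqrt(softplus(x)); a naive softplus = log(1 + exp(x)) overflows once exp(x) does, while logaddexp(x, 0) computes the same value with log-sum-exp stabilization:

```python
import mlx.core as mx

def sqrt_softplus_stable(scores: mx.array) -> mx.array:
    # log(1 + exp(scores)) evaluated stably via log-sum-exp against zero
    softplus = mx.logaddexp(scores, mx.zeros_like(scores))
    return mx.sqrt(softplus)
```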

Happy to coordinate if the maintainers want to consolidate into one PR. Our implementation has live generation validation at 21.86 tok/s on M3 Ultra (DeepSeek-V4-Flash-4bit, 160GB peak).

@Blaizzy (Contributor Author) commented Apr 24, 2026

Hey @machiabeli, thanks!

Yes, same person who left the earlier feedback, good to connect properly.

I've been poking at this in parallel and landed on something numerically close to the source with minimal changes, but there's definitely room to combine approaches. A PR from you on the compressed attention mask, Sinkhorn norm, and sqrt-softplus would be really welcome; I'm happy to review and merge what works best.

Or I can cherry-pick and add you as a co-author.

Comment thread mlx_lm/utils.py Outdated
Comment on lines +395 to +411
if (
    config.get("quantization", None) is None
    and getattr(model_args, "quantization", None) is not None
    and any(k.endswith(".scales") for k in weights)
):
    config["quantization"] = model_args.quantization

def _quantize(quantization):
    def class_predicate(p, m):
        if not hasattr(m, "to_quantized"):
            return False
        if f"{p}.scales" not in weights:
            return False
        # Handle custom per-layer quantizations
        if p in config["quantization"]:
            return config["quantization"][p]
        return True
@Blaizzy (Contributor Author) commented Apr 24, 2026

The goal here is to preserve the mxfp4 expert quant since MLX supports it. So I made the quantize_config key in the config class default to that, and these changes help prequantized models load properly.

It could be done via a predicate, but I couldn't find an elegant way of doing it.

Note: it doesn't affect any other model.
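
For context, a hedged sketch of how a per-path predicate like the one above is typically wired into nn.quantize; `weights` and `quant_config` are stand-ins for the loaded checkpoint and config["quantization"], and returning a dict for a path is what lets a layer keep its own settings (e.g. mxfp4 experts) while True falls back to the global bits/group_size:

```python
import mlx.nn as nn

def make_predicate(weights: dict, quant_config: dict):
    def class_predicate(path, module):
        if not hasattr(module, "to_quantized"):
            return False
        if f"{path}.scales" not in weights:
            return False
        # Per-layer override, e.g. {"bits": 4, "group_size": 32} for expert layers
        override = quant_config.get(path)
        return override if override is not None else True
    return class_predicate

# nn.quantize(model, group_size=quant_config["group_size"],
#             bits=quant_config["bits"],
#             class_predicate=make_predicate(weights, quant_config))
```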

@Blaizzy (Contributor Author) commented:

An alternative is to dequant -> requant, similar to how we do it with FP8.

@trevorgordon981 commented:

Following up on #issuecomment-4329720377 with more specific findings.

I attempted to build a true 8-bit conversion myself by streaming through deepseek-ai/DeepSeek-V4-Flash (the 148 GB native FP8 release): read each shard, dequantize the I8 + F8_E8M0 (block-scale) and F8_E4M3 weights to bf16, re-quantize as affine q8 with group_size=64, and write MLX-format shards with the sanitize-equivalent name remapping (embed.weight → model.embed_tokens.weight, hc_attn_X → attn_hc.X, experts stacked into switch_mlp.{gate,down,up}_proj, etc.). The dequant + requantize math validates fine (round-trip error 0.74% rel). The pipeline runs in ~5 min for the full 46-shard input.
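
For what it's worth, a hedged sketch of the round-trip check mentioned above (affine q8, group_size=64); the FP8 → bf16 dequantization is assumed to have already produced `w_bf16`, and the names are illustrative rather than the converter's actual code:

```python
import mlx.core as mx

def roundtrip_rel_error(w_bf16: mx.array, group_size: int = 64, bits: int = 8) -> float:
    # Affine re-quantization followed by dequantization, then relative L2 error
    wq, scales, biases = mx.quantize(w_bf16, group_size=group_size, bits=bits)
    w_rt = mx.dequantize(wq, scales, biases, group_size=group_size, bits=bits)
    diff = w_rt.astype(mx.float32) - w_bf16.astype(mx.float32)
    return float(mx.linalg.norm(diff) / mx.linalg.norm(w_bf16.astype(mx.float32)))
```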

Hit an architectural wall on the routed-expert dimensions that I can't reconcile from public artifacts.

Source (deepseek-ai/DeepSeek-V4-Flash) per-expert shapes:

experts.X.w1.weight: I8 [2048, 2048]
experts.X.w2.weight: I8 [4096, 1024]
experts.X.w3.weight: I8 [2048, 2048]

Your mlx-community/DeepSeek-V4-Flash-8bit switch_mlp (inferred from the mxfp4 packed/scales shapes):

gate_proj per expert (bf16): [2048, 4096]
down_proj per expert (bf16): [4096, 2048]
up_proj   per expert (bf16): [2048, 4096]

Both repos' configs say hidden_size=4096, moe_intermediate_size=2048, n_routed_experts=256, n_shared_experts=1 — they should produce the same per-expert shapes. But the source's last dims are exactly half of yours.

Param accounting confirms it: source's routed experts at (2048, 2048) sum to ~138B params (consistent with the 148 GB I8 file size), while your switch_mlp shapes imply ~280B in routed experts alone, which lines up with mxfp4's compressed 155 GB file size for a fuller param count.

The shared_experts block in source is sized correctly (shared_experts.w1.weight: F8_E4M3 [2048, 4096] matches your shared_experts.gate_proj (2048, 4096)). Only the routed experts disagree.

Two questions, in order of usefulness to me:

  1. How does your conversion process derive (2048, 4096) per-expert weights from the source's (2048, 2048)? Concat of w1+w3 (which would only give you gate_proj, not also up_proj), TP-shard combination across pairs of source experts, or some V4-Flash-specific reshaping I'm not seeing in mlx_lm/models/deepseek_v4.py?
  2. Would a true 8-bit version (no bits: 4 override on the FFN/MoE expert layers) be feasible to publish? My use case is a 512 GB Mac Studio backend; ~280-300 GB q8 fits comfortably with KV cache headroom and would be a quality bump over the current mxfp4-on-FFN profile. Happy to run the same harness against it if you publish.

If (1) has a clean answer I can finish the converter myself for (2). If not, only you have the conversion path that produces the right shapes.

Either way thanks for the work — dd6b92f mixed quant has been serving Alfred backend reliably for several hours now.

@Tonoken3 commented:

Performance update — Apple optimizations live ✨

After pulling the latest with @angeloskath's three commits (RoPE kernel native path, GLU cast simplification, original-checkpoint loader):

Setup: Mac Studio M3 Ultra 512GB, mxfp8, single-machine, no TP

| Prompt type | TTFT | TPS | Notes |
| --- | --- | --- | --- |
| Short JP (~50 tok output) | 2793 ms (cold) | 33.8 tok/s | first run, model warmup |
| Long JP (~500 char essay) | 232 ms (cached) | 33.6 tok/s | flat |
| Code (Python w/ docstrings) | 199 ms | 33.6 tok/s | flat |

The thing that jumps out: TPS is now context-flat (33.8 vs 33.6 vs 33.6 across very different workloads). Previous builds had visible decay on longer generations. This is the signature of the cast/RoPE overhead going away.

Cumulative trajectory on this hardware:

| Build | TPS |
| --- | --- |
| Initial mxfp8 (with token-drop regression) | 22.4 |
| Sinkhorn / HC fusion (0xClandestine) | 26.7 |
| acf650c9 orientation fix (Blaizzy) | 30.2 |
| Apple kernel optimizations (angeloskath) | 33.6–33.8 |

~+50% from baseline in 4 days of community iteration. Quality remains pristine (no token drops, no repetition loops, perfect Japanese & code).

Welcome to the PR @angeloskath — the kernel-native RoPE path is doing real work here on Apple Silicon. 🙌

adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 29, 2026
…'s fp32-cast patches

Pulls in PR ml-explore#1192 from upstream including:
- Apple-team commits from Angelos Katharopoulos:
  - 3cf5282 "Start simplifying and speeding up the attention" (2026-04-29)
  - 4951496 "Fix RoPE to use the kernel by scaling freqs" (2026-04-28)
  - 81a8c57 "Simplify GLU and gate remove intermediate castings" (2026-04-28)
- Blaizzy refactor stack (output projection, KV reshape, BatchRotatingKVCache,
  scoring/RoPE compile, the matmul rewrite from 0xClandestine that we already
  had a copy of, etc.)

Conflict resolution: took theirs for mlx_lm/models/deepseek_v4.py wholesale.
Our previous fork patches that are now superseded:
- f4dd9e7 / 2a1dcf6 "drop fp32 casts in Indexer / MoEGate / Compressor" —
  Angelos' 81a8c57 covers the same ground more cleanly.

Fork patches that need re-applying separately on top:
- mlx_lm/profiler.py span/finalize hooks scattered across deepseek_v4.py
  (attn_q_lora / attn_kv_proj / attn_compressor / attn_indexer /
  attn_sdpa_sparse / attn_sdpa_dense / moe_* spans).
- 1d78d62 Indexer wq_b/weights_proj sharding for TP.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 29, 2026
… + non-sparse attention refactor

New since our last merge attempt (then reverted for prefill regression):

- 83d7e74 (Angelos, Apr 29) "Refactor compressor and compressed
  non-sparse attention" — 225 line touchup. Suspected to address
  the prefill regression we hit.
- d8a47ae (pcuenca) "Ensure dtype going into mx.fast.rope is int32"
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — could
  fix the c=4 SIGABRT we hit during AIME eval.

Took theirs for mlx_lm/models/deepseek_v4.py wholesale. Will
re-apply our sliding-window indexer + EXO_DSV4_INDEX_TOPK env knob
on top in a follow-up commit. mxfp4 detection in utils.py preserved.

Note: switch_mlp call signature is still the 2-arg `(x, inds)` form
in PR ml-explore#1192 (scores applied externally in DeepseekV4MoE). Keep
EXO_DSV4_FUSED_MOE=0 until FusedDeepseekV4SwitchGLU is updated in
exo's auto_parallel.py.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 30, 2026
Previously reverted PR ml-explore#1192 today suspecting Apple's compressor
refactor (commit 83d7e74) caused our AIME quality failure. That was
wrong — the AIME failure was max_tokens=16384 too low for Think
mode (model card recommends >=384K for Think Max). The re-merge
brings back:

- 83d7e74 (Angelos) compressor + non-sparse attention refactor
- 4951496 RoPE kernel scaling fix
- 81a8c57 GLU/gate cast simplification
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — likely
  fixes the recurring c=2 SIGABRT in mlx_lm/generate.py:1369 _step
  we've hit 4+ times today during AIME runs

Re-applied EXO_DSV4_INDEX_TOPK + EXO_DSV4_INDEXER_WINDOW env knobs
on top of the new Indexer, with the windowed-coordinate index
translation so V4Attention's sparse-pooled-attention gather works.

Keep EXO_DSV4_FUSED_MOE=0 — switch_mlp's 2-arg API still doesn't
match exo's FusedDeepseekV4SwitchGLU monkey-patch.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 30, 2026
…xplore#1192

Cherry-pick upstream PR ml-explore#1192 commit ad9e329 (Pedro Cuenca, "Attempt
to fix IndexError on batching") onto our pre-Apple-refactor base.

The fix is structural and doesn't depend on the post-83d7e74
refactored Compressor — it just touches DeepseekV4Cache.filter() to
zero-pad compressor/indexer state when batch_indices references
slots that haven't emitted pooled tokens yet. Without it, c=2
batched decode SIGABRTs deep in mlx_lm/generate.py:_step (we hit
this 4+ times today during AIME-5 c=2 runs).

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
…h-model@2120a1b)

58 commits ahead. Major changes:
- Refactor DeepseekV4Cache for state management (8ac37c0)
- Add DeepseekV4SwitchGLU + BatchRotatingKVCache + PoolingCache (c2c9801)
- Refactor compressor + non-sparse compressed attention (83d7e74)
- HC sinkhorn normalization + sinkhorn unit tests (c0d9222, f7ff216)
- Replace einsum with matmul in HyperConnection (ef8c95d)
- Numerical stability + compressed-attn mask padding (c6a7828)
- pcuenca's IndexError fix on batching (ad9e329, was already cherry-picked
  separately as 9f6a9d1, now superseded)
- pcuenca's RoPE dtype fix (d8a47ae)
- Quant predicate refactor (d59907c, f4f7b4d, 166bba6, cbb0b72)

Conflict resolution:
- mlx_lm/models/deepseek_v4.py: 39 conflicts. Took theirs entirely (the
  PR's substantial refactoring of Indexer/Compressor/Cache makes our
  sliding-window patches non-applicable). Re-added EXO_DSV4_INDEX_TOPK
  env override to the new Indexer class — the simplest of our three env
  knobs to forward-port. EXO_DSV4_INDEXER_WINDOW + WINDOW_LATE patches
  are NOT carried forward; the new code restructures pooled attention
  enough that the slicing strategy needs a different implementation,
  which can be re-added later if perf testing shows it's needed.
- mlx_lm/utils.py: auto-merged. Our mxfp4 quant-config inference patch
  (commit 6d0c164) survived intact (verified via grep).
- mlx_lm/generate.py: auto-merged. PR added BatchPoolingCache support
  in _make_cache; clean addition.
- mlx_lm/models/cache.py: auto-merged. PR added 539+ lines for
  PoolingCache and BatchPoolingCache.
- tests/test_models.py: auto-merged.

This is a significant rebase — needs cluster validation before promoting
to the standard deploy. Likely to require rebench of DSv4-Flash-8bit
AIME runs and recheck of the IOGPU residency-set fix interaction with
the new BatchRotatingKVCache code path.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
The Indexer.__call__ score computation runs every decode step on
indexer-equipped layers (~21 layers per token). The score block was
plain Python uncompiled:
  scores = q @ pooled.T (cast fp32)
  scores = mx.maximum(scores, 0) * scale
  scores = (scores * weights.T[..., None]).sum(axis=1)

Extract to a free @mx.compile function. Per the dsv4_optimization_results
memory the Indexer was already a known hot path and earlier fork patches
had a similar compiled helper before the PR ml-explore#1192 merge took theirs.
Restoring it on top of the new structure.
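
A hedged sketch of what such a compiled helper could look like; shapes, argument names, and the head-mixing layout are assumptions, not the fork's actual code:

```python
import mlx.core as mx

@mx.compile
def _indexer_score(q: mx.array, pooled: mx.array, weights: mx.array, scale: float) -> mx.array:
    # q: (H, L_q, D) query heads, pooled: (L_kv, D) pooled keys,
    # weights: (H, L_q) per-head mixing weights from weights_proj
    scores = (q @ pooled.T).astype(mx.float32)        # (H, L_q, L_kv)
    scores = mx.maximum(scores, 0) * scale            # ReLU + scaling
    return (scores * weights[..., None]).sum(axis=0)  # mix heads -> (L_q, L_kv)
```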
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
…c10538)

4 new commits:
- 5c10538 Fix tensor parallel distributed
- 7efb57f Fix batch cache edge case
- c58834b Add hyper_connection.py (moves HyperConnection out of deepseek_v4.py)
- ca8b299 Simplify HyperConnection

Auto-merge clean — our EXO_DSV4_INDEX_TOPK env override on Indexer.__init__
and the @mx.compile _indexer_score helper both survived. Verified file parses.
@ivanfioravanti (Contributor) commented:

A prompt longer than 2K tokens leads to no response at all:
cat 4k.txt | mlx_lm.generate --model ~/DeepSeek-V4-Flash-5bit --max-tokens 200 --prompt -

I've used https://raw.githubusercontent.com/ivanfioravanti/llm_context_benchmarks/refs/heads/master/4k.txt

@ivanfioravanti (Contributor) commented:

convert is not working properly: both 5-bit and 6-bit end up at 4.349 bits per weight.

❯ mlx_lm.convert --hf-path deepseek-ai/DeepSeek-V4-Flash -q --q-bits 6 --mlx-path ~/DeepSeek-V4-Flash-6bit
[INFO] Loading
Fetching 67 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:00<00:00, 8842.62it/s]
Download complete: : 0.00B [00:00, ?B/s] | 0/67 [00:00<?, ?it/s]
[transformers] You are using a model of type deepseek_v4 to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
[transformers] PreTrainedConfig got `key=rope_scaling` in kwargs but hasn't set it as attribute. For RoPE standardization you need to set `self.rope_parameters` in model's config.
[INFO] Using dtype: bfloat16
[INFO] Quantizing
[INFO] Quantized model with 4.349 bits per weight.

It seems related to the fact that, when starting from the FP8 model, mxfp4 + mxfp8 is applied automatically regardless of --q-bits.

jundot added a commit to jundot/omlx that referenced this pull request May 6, 2026
The pinned mlx-lm v0.31.3 has no DeepSeek V4 support. Bring it in
through the omlx/patches pattern rather than touching the upstream
package.

New `omlx/patches/deepseek_v4/`:
- 1:1 copies of PR 1192's `deepseek_v4.py`, `hyper_connection.py`,
  and the `PoolingCache` / `BatchPoolingCache` additions to `cache.py`.
- Function-replacement patches for `mlx_lm.utils.load_model`
  (F8_E8M0 dtype fallback + fp8 quant branch) and
  `mlx_lm.generate._make_cache` (PoolingCache → BatchPoolingCache).
- `AutoTokenizer` wrapper that retries with `PreTrainedConfig()` when
  transformers ≤5.7.0 does not yet recognize `model_type=deepseek_v4`.
  Becomes dead code automatically once transformers ships native
  support.
- `PoolingCacheHandler` / `BatchPoolingCacheHandler` registered with
  the omlx CacheTypeRegistry so SSD and prefix cache state extraction
  do not silently fall through to `DefaultCacheHandler`.

omlx core: conditional dispatch in `utils/model_loading.py`,
`engine/batched.py`, and `models/llm.py` (gated on
`config.json::model_type=="deepseek_v4"` so other models pay nothing);
two new enum values and class-name mappings in `cache/type_handlers.py`
and `cache/type_registry.py`.

31 new unit tests in `tests/test_deepseek_v4_patch.py`. 344 cache /
engine / patches regression tests still pass.

Huge thanks to @Blaizzy for the upstream V4 implementation in
ml-explore/mlx-lm#1192 — this commit is just a thin glue layer
around that work.