Add DeepSeek-v4 (Flash/Pro) #1192
You can now run it on a 256GB Mac by keeping the experts in 4-bit! We could do 5-bit, since it's much better than 4-bit right now. I'm open to opinions @angeloskath
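A rough sketch of what that mixed setup could look like with mlx_lm's convert() and its quant_predicate hook; the "switch_mlp" path check and the exact bit split are assumptions for illustration, not this PR's defaults:

```python
from mlx_lm import convert

def quant_predicate(path, module, config):
    # Hypothetical split: routed experts at 4-bit, everything else at 5-bit.
    if not hasattr(module, "to_quantized"):
        return False
    if "switch_mlp" in path:  # assumed naming for the routed experts
        return {"bits": 4, "group_size": 64}
    return {"bits": 5, "group_size": 64}

convert(
    "deepseek-ai/DeepSeek-V4-Flash",
    mlx_path="DeepSeek-V4-Flash-mixed",
    quantize=True,
    quant_predicate=quant_predicate,
)
```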
Hey @Blaizzy — just flagging some technical notes, since we're both working on V4 support and PR #1189 landed ~10 hours earlier with significant overlap:

- Compressed attention mask direction (lines 770-773)
- Sinkhorn normalization
- sqrt-softplus numerical stability

Happy to coordinate if the maintainers want to consolidate into one PR. Our implementation has live generation validation at 21.86 tok/s on M3 Ultra (DeepSeek-V4-Flash-4bit, 160 GB peak).
Hey @machiabeli, thanks! Yes, same person who left the earlier feedback; good to connect properly. I've been poking at this in parallel and landed on something numerically close to the source with minimal changes, but there's definitely room to combine approaches. A PR from you on the compressed attention mask, Sinkhorn norm, and sqrt-softplus would be really welcome; happy to review and merge what works best. Or I can cherry-pick and add you as a co-author.
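For readers skimming the thread, the two techniques named above have simple textbook forms. A minimal sketch (iteration count, epsilon, and where these are applied are assumptions, not the PR's actual code):

```python
import mlx.core as mx

def sinkhorn(logits, n_iters=8, eps=1e-6):
    # Alternate row/column normalization to push exp(logits)
    # toward a doubly-stochastic matrix.
    p = mx.exp(logits)
    for _ in range(n_iters):
        p = p / (p.sum(axis=-1, keepdims=True) + eps)  # rows sum to 1
        p = p / (p.sum(axis=-2, keepdims=True) + eps)  # columns sum to 1
    return p

def sqrt_softplus(x):
    # logaddexp(x, 0) is a numerically stable softplus;
    # the sqrt tames its growth for large activations.
    return mx.sqrt(mx.logaddexp(x, 0.0))
```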
```python
if (
    config.get("quantization", None) is None
    and getattr(model_args, "quantization", None) is not None
    and any(k.endswith(".scales") for k in weights)
):
    config["quantization"] = model_args.quantization
```
```python
def _quantize(quantization):
    def class_predicate(p, m):
        if not hasattr(m, "to_quantized"):
            return False
        if f"{p}.scales" not in weights:
            return False
        # Handle custom per-layer quantizations
        if p in config["quantization"]:
            return config["quantization"][p]
        return True
```
The goal here is to preserve the mxfp4 expert quant, since MLX supports it. So I made the `quantize_config` key in the config class default to that, and these changes help prequantized models load properly.
It could be done via the predicate, but I couldn't find an elegant way of doing it.
Note: it doesn't affect any existing model.
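For context, this is roughly how the predicate above gets consumed. A minimal sketch, assuming mlx's `nn.quantize` hook (a per-module predicate; returning a dict overrides the quantization parameters for that module):

```python
import mlx.nn as nn

# `model`, `config`, and `class_predicate` as in the snippet above.
quantization = config["quantization"]
nn.quantize(
    model,
    group_size=quantization["group_size"],
    bits=quantization["bits"],
    class_predicate=class_predicate,  # per-layer overrides win here
)
```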
An alternative is to dequant -> requant, similar to how we do with FP8.
…ponding unit tests for HyperConnection
… with matmul for improved performance
Following up on #issuecomment-4329720377 with more specific findings. I attempted to build a true 8-bit conversion myself by streaming through the source weights, and hit an architectural wall on the routed-expert dimensions that I can't reconcile from public artifacts.

- Source (deepseek-ai/DeepSeek-V4-Flash) per-expert shapes:
- Your switch_mlp shapes:
- Both repos' configs say:

Param accounting confirms it: the source's routed experts at (2048, 2048) sum to ~138B params (consistent with the 148 GB I8 file size), while your switch_mlp shapes imply ~280B in routed experts alone, which lines up with mxfp4's compressed 155 GB file size for a fuller param count. The shared_experts block in the source is sized correctly.

Two questions, in order of usefulness to me:

If (1) has a clean answer, I can finish the converter myself for (2). If not, only you have the conversion path that produces the right shapes. Either way, thanks for the work.
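The accounting above is reproducible with placeholder counts. A back-of-envelope sketch; the layer and expert counts are assumptions chosen to land near the quoted ~138B (the real config values are elided in the comment):

```python
# All counts below are hypothetical, not V4-Flash's actual config.
n_moe_layers = 43           # assumed
n_experts_per_layer = 256   # assumed
d_model, d_ff = 2048, 2048  # per-expert shapes quoted above

per_expert = 3 * d_model * d_ff  # gate, up, and down projections
routed = n_moe_layers * n_experts_per_layer * per_expert
print(f"routed experts: {routed / 1e9:.1f}B params")  # ~138.5B
```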
Performance update — Apple optimizations live ✨

After pulling the latest with @angeloskath's three commits (RoPE kernel native path, GLU cast simplification, original-checkpoint loader):

Setup: Mac Studio M3 Ultra 512GB, mxfp8, single-machine, no TP
The thing that jumps out: TPS is now context-flat (33.8 vs 33.6 vs 33.6 across very different workloads). Previous builds had visible decay on longer generations. This is the signature of the cast/RoPE overhead going away. Cumulative trajectory on this hardware:
~+50% from baseline in 4 days of community iteration. Quality remains pristine (no token drops, no repetition loops, perfect Japanese & code). Welcome to the PR @angeloskath — the kernel-native RoPE path is doing real work here on Apple Silicon. 🙌
…'s fp32-cast patches

Pulls in PR ml-explore#1192 from upstream, including:

- Apple-team commits from Angelos Katharopoulos:
  - 3cf5282 "Start simplifying and speeding up the attention" (2026-04-29)
  - 4951496 "Fix RoPE to use the kernel by scaling freqs" (2026-04-28)
  - 81a8c57 "Simplify GLU and gate remove intermediate castings" (2026-04-28)
- Blaizzy refactor stack (output projection, KV reshape, BatchRotatingKVCache, scoring/RoPE compile, the matmul rewrite from 0xClandestine that we already had a copy of, etc.)

Conflict resolution: took theirs for mlx_lm/models/deepseek_v4.py wholesale.

Our previous fork patches that are now superseded:

- f4dd9e7 / 2a1dcf6 "drop fp32 casts in Indexer / MoEGate / Compressor" — Angelos' 81a8c57 covers the same ground more cleanly.

Fork patches that need re-applying separately on top:

- mlx_lm/profiler.py span/finalize hooks scattered across deepseek_v4.py (attn_q_lora / attn_kv_proj / attn_compressor / attn_indexer / attn_sdpa_sparse / attn_sdpa_dense / moe_* spans).
- 1d78d62 Indexer wq_b/weights_proj sharding for TP.
[Deepseek] Attempt to fix IndexError on batching.
… + non-sparse attention refactor

New since our last merge attempt (then reverted for prefill regression):

- 83d7e74 (Angelos, Apr 29) "Refactor compressor and compressed non-sparse attention" — a 225-line touchup, suspected to address the prefill regression we hit.
- d8a47ae (pcuenca) "Ensure dtype going into mx.fast.rope is int32"
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — could fix the c=4 SIGABRT we hit during AIME eval.

Took theirs for mlx_lm/models/deepseek_v4.py wholesale. Will re-apply our sliding-window indexer + EXO_DSV4_INDEX_TOPK env knob on top in a follow-up commit. mxfp4 detection in utils.py preserved.

Note: switch_mlp's call signature is still the 2-arg `(x, inds)` form in PR ml-explore#1192 (scores applied externally in DeepseekV4MoE). Keep EXO_DSV4_FUSED_MOE=0 until FusedDeepseekV4SwitchGLU is updated in exo's auto_parallel.py.
Previously reverted PR ml-explore#1192 today, suspecting Apple's compressor refactor (commit 83d7e74) had caused our AIME quality failure. That was wrong — the AIME failure was max_tokens=16384 being too low for Think mode (the model card recommends >=384K for Think Max).

The re-merge brings back:

- 83d7e74 (Angelos) compressor + non-sparse attention refactor
- 4951496 RoPE kernel scaling fix
- 81a8c57 GLU/gate cast simplification
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — likely fixes the recurring c=2 SIGABRT in mlx_lm/generate.py:1369 _step that we've hit 4+ times today during AIME runs

Re-applied the EXO_DSV4_INDEX_TOPK + EXO_DSV4_INDEXER_WINDOW env knobs on top of the new Indexer, with the windowed-coordinate index translation so V4Attention's sparse-pooled-attention gather works.

Keep EXO_DSV4_FUSED_MOE=0 — switch_mlp's 2-arg API still doesn't match exo's FusedDeepseekV4SwitchGLU monkey-patch.
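The env-knob pattern referenced here is simple to sketch. A minimal version, assuming the knob overrides a config value at Indexer construction time (attribute names are assumptions about the fork, not verified):

```python
import os

def _env_int(name, default):
    # Integer override from the environment, falling back to the config value.
    return int(os.environ.get(name, default))

# Hypothetical usage inside Indexer.__init__:
#   self.index_topk = _env_int("EXO_DSV4_INDEX_TOPK", config.index_topk)
```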
…xplore#1192

Cherry-pick upstream PR ml-explore#1192 commit ad9e329 (Pedro Cuenca, "Attempt to fix IndexError on batching") onto our pre-Apple-refactor base.

The fix is structural and doesn't depend on the post-83d7e74 refactored Compressor — it just touches DeepseekV4Cache.filter() to zero-pad compressor/indexer state when batch_indices references slots that haven't emitted pooled tokens yet. Without it, c=2 batched decode SIGABRTs deep in mlx_lm/generate.py:_step (we hit this 4+ times today during AIME-5 c=2 runs).

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
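The shape logic being described is roughly the following. A sketch under stated assumptions (the state layout and function name are hypothetical; this is not pcuenca's literal diff):

```python
import mlx.core as mx

def zero_pad_then_gather(state, batch_indices, target_len):
    # state: (batch, t, dim) pooled compressor/indexer cache.
    # Pad the time axis with zeros before gathering batch slots, so slots
    # that haven't emitted pooled tokens yet index safely into zeros
    # instead of raising an IndexError.
    b, t, d = state.shape
    if t < target_len:
        pad = mx.zeros((b, target_len - t, d), dtype=state.dtype)
        state = mx.concatenate([state, pad], axis=1)
    return state[mx.array(batch_indices)]
```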
…h-model@2120a1b)

58 commits ahead. Major changes:

- Refactor DeepseekV4Cache for state management (8ac37c0)
- Add DeepseekV4SwitchGLU + BatchRotatingKVCache + PoolingCache (c2c9801)
- Refactor compressor + non-sparse compressed attention (83d7e74)
- HC sinkhorn normalization + sinkhorn unit tests (c0d9222, f7ff216)
- Replace einsum with matmul in HyperConnection (ef8c95d)
- Numerical stability + compressed-attn mask padding (c6a7828)
- pcuenca's IndexError fix on batching (ad9e329, was already cherry-picked separately as 9f6a9d1, now superseded)
- pcuenca's RoPE dtype fix (d8a47ae)
- Quant predicate refactor (d59907c, f4f7b4d, 166bba6, cbb0b72)

Conflict resolution:

- mlx_lm/models/deepseek_v4.py: 39 conflicts. Took theirs entirely (the PR's substantial refactoring of Indexer/Compressor/Cache makes our sliding-window patches non-applicable). Re-added the EXO_DSV4_INDEX_TOPK env override to the new Indexer class — the simplest of our three env knobs to forward-port. The EXO_DSV4_INDEXER_WINDOW + WINDOW_LATE patches are NOT carried forward; the new code restructures pooled attention enough that the slicing strategy needs a different implementation, which can be re-added later if perf testing shows it's needed.
- mlx_lm/utils.py: auto-merged. Our mxfp4 quant-config inference patch (commit 6d0c164) survived intact (verified via grep).
- mlx_lm/generate.py: auto-merged. PR added BatchPoolingCache support in _make_cache; clean addition.
- mlx_lm/models/cache.py: auto-merged. PR added 539+ lines for PoolingCache and BatchPoolingCache.
- tests/test_models.py: auto-merged.

This is a significant rebase — it needs cluster validation before promoting to the standard deploy. Likely to require a rebench of the DSv4-Flash-8bit AIME runs and a recheck of the IOGPU residency-set fix interaction with the new BatchRotatingKVCache code path.
The Indexer.__call__ score computation runs every decode step on indexer-equipped layers (~21 layers per token). The score block was plain uncompiled Python:

```python
scores = q @ pooled.T  # cast fp32
scores = mx.maximum(scores, 0) * scale
scores = (scores * weights.T[..., None]).sum(axis=1)
```

Extract it to a free @mx.compile function. Per the dsv4_optimization_results memory, the Indexer was already a known hot path, and earlier fork patches had a similar compiled helper before the PR ml-explore#1192 merge took theirs. Restoring it on top of the new structure.
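Reconstructed from the score block quoted above, the extracted helper looks roughly like this; tensor layouts and the exact signature are assumptions:

```python
import mlx.core as mx

@mx.compile
def _indexer_score(q, pooled, weights, scale):
    # Compiled once, then reused every decode step on each indexer layer.
    scores = (q @ pooled.T).astype(mx.float32)  # fp32 for stability
    scores = mx.maximum(scores, 0) * scale      # ReLU, then scale
    return (scores * weights.T[..., None]).sum(axis=1)
```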
…c10538)

4 new commits:

- 5c10538 Fix tensor parallel distributed
- 7efb57f Fix batch cache edge case
- c58834b Add hyper_connection.py (moves HyperConnection out of deepseek_v4.py)
- ca8b299 Simplify HyperConnection

Auto-merge clean — our EXO_DSV4_INDEX_TOPK env override on Indexer.__init__ and the @mx.compile _indexer_score helper both survived. Verified the file parses.
Prompts longer than 2K lead to no response at all. I've used https://raw.githubusercontent.com/ivanfioravanti/llm_context_benchmarks/refs/heads/master/4k.txt
convert is not working properly: 5-bit or 6-bit both use 4.349 bits per weight.

```
❯ mlx_lm.convert --hf-path deepseek-ai/DeepSeek-V4-Flash -q --q-bits 6 --mlx-path ~/DeepSeek-V4-Flash-6bit
```

It seems related to the fact that, starting from the FP8 model, mxfp4 + mxfp8 is automatically applied independently of --q-bits.
The pinned mlx-lm v0.31.3 has no DeepSeek V4 support. Bring it in through the omlx/patches pattern rather than touching the upstream package.

New `omlx/patches/deepseek_v4/`:

- 1:1 copies of PR 1192's `deepseek_v4.py`, `hyper_connection.py`, and the `PoolingCache` / `BatchPoolingCache` additions to `cache.py`.
- Function-replacement patches for `mlx_lm.utils.load_model` (F8_E8M0 dtype fallback + fp8 quant branch) and `mlx_lm.generate._make_cache` (PoolingCache → BatchPoolingCache).
- `AutoTokenizer` wrapper that retries with `PretrainedConfig()` when transformers ≤5.7.0 does not yet recognize `model_type=deepseek_v4`; becomes dead code automatically once transformers ships native support.
- `PoolingCacheHandler` / `BatchPoolingCacheHandler` registered with the omlx CacheTypeRegistry so SSD and prefix-cache state extraction do not silently fall through to `DefaultCacheHandler`.

omlx core: conditional dispatch in `utils/model_loading.py`, `engine/batched.py`, and `models/llm.py` (gated on `config.json::model_type=="deepseek_v4"` so other models pay nothing); two new enum values and class-name mappings in `cache/type_handlers.py` and `cache/type_registry.py`.

31 new unit tests in `tests/test_deepseek_v4_patch.py`. 344 cache / engine / patches regression tests still pass.

Huge thanks to @Blaizzy for the upstream V4 implementation in ml-explore/mlx-lm#1192 — this commit is just a thin glue layer around that work.
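The tokenizer retry wrapper is the only piece that needs care outside the copied files. A sketch of the pattern, assuming the failure mode is transformers raising on the unknown model_type before the tokenizer files are read:

```python
from transformers import AutoTokenizer, PretrainedConfig

def load_tokenizer(path):
    try:
        return AutoTokenizer.from_pretrained(path)
    except (KeyError, ValueError):
        # transformers <= 5.7.0 doesn't recognize model_type=deepseek_v4;
        # passing a generic config skips the model-type lookup so the
        # tokenizer files still load. Dead code on newer transformers.
        return AutoTokenizer.from_pretrained(path, config=PretrainedConfig())
```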


Note: Please install this transformers PR from source to avoid tokenizer bugs.
```
pip install git+https://github.com/huggingface/transformers.git@refs/pull/45643/head
```

Weights here:
https://huggingface.co/collections/mlx-community/deepseek-v4