
Add DeepSeek-v4 (Flash/Pro)#1192

Open
Blaizzy wants to merge 62 commits into ml-explore:main from Blaizzy:pc/add-deepseekv4flash-model

Conversation

@Blaizzy (Contributor) commented Apr 24, 2026

Note: Please install this transformers PR from source to avoid tokenizer bugs.

pip install git+https://github.com/huggingface/transformers.git@refs/pull/45643/head

Weights here:
https://huggingface.co/collections/mlx-community/deepseek-v4


@Blaizzy changed the title from "Add DeepSeekv4 (Flash/Pro)" to "Add DeepSeek-v4 (Flash/Pro)" on Apr 24, 2026
@Blaizzy (Contributor Author) commented Apr 24, 2026

You can now run it on a 256GB Mac by keeping the experts in 4-bit!

We could do 5-bit instead, since it's much better than 4-bit right now. I'm open to opinions @angeloskath.


@Blaizzy (Contributor Author) commented Apr 24, 2026

It's faster now!

Comment thread mlx_lm/utils.py
Comment thread mlx_lm/models/deepseek_v4.py Outdated
@machiabeli commented:

Hey @Blaizzy — just flagging some technical notes since we're both working on V4 support and PR #1189 landed ~10 hours earlier with significant overlap:

Compressed attention mask direction (lines 770–773):
The mask padding for compressed KV rows uses mx.ones, but create_attention_mask returns negative values for blocked positions. Padding with ones would block attention to compressed rows rather than allow it. PR #1189 uses mx.zeros here.
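
For reference, a minimal sketch of the additive-mask convention this point relies on; the function name and shapes here are illustrative assumptions, not the PR's code:

```python
import mlx.core as mx

def pad_mask_for_compressed_kv(mask: mx.array, n_compressed: int) -> mx.array:
    # `mask` is an additive mask of shape (..., L_q, L_kv): 0.0 marks positions a
    # query may attend to, large negative values mark blocked positions.
    # Appending zero-valued columns keeps the compressed KV rows attendable;
    # appending ones adds a spurious bias instead of the "allowed" encoding
    # the rest of the mask uses.
    pad_shape = mask.shape[:-1] + (n_compressed,)
    allowed = mx.zeros(pad_shape, dtype=mask.dtype)
    return mx.concatenate([mask, allowed], axis=-1)
```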

Sinkhorn normalization:
The Python loop path (lines 222–226) dispatches ~40 kernel launches per call (softmax + iters × sum + div). PR #1189 has a fused Metal kernel that does this in a single register-resident dispatch, benchmarked at 3.5–5.7× faster in microbenchmarks and 1.83× end-to-end.
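
A minimal, uncompiled sketch of the loop path being described (iteration count, epsilon, and axis conventions are assumptions rather than the PR's exact code); each iteration issues several separate kernels, which is what a fused Metal kernel collapses into one dispatch:

```python
import mlx.core as mx

def sinkhorn_normalize(scores: mx.array, n_iters: int = 10, eps: float = 1e-6) -> mx.array:
    # scores: (tokens, experts) router logits
    p = mx.softmax(scores, axis=-1)
    for _ in range(n_iters):
        p = p / (p.sum(axis=0, keepdims=True) + eps)   # normalize over tokens
        p = p / (p.sum(axis=-1, keepdims=True) + eps)  # normalize over experts
    return p
```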

sqrt-softplus numerical stability:
nn.softplus(x) can overflow for large scores. PR #1189 uses mx.logaddexp(scores, zeros), which is log-sum-exp stable.
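
A small sketch of the stability point, assuming the op in question is sqrt(softplus(x)); a naive softplus = log(1 + exp(x)) overflows once exp(x) does, while logaddexp(x, 0) computes the same value with log-sum-exp stabilization:

```python
import mlx.core as mx

def sqrt_softplus_stable(scores: mx.array) -> mx.array:
    # log(1 + exp(scores)) evaluated stably via log-sum-exp against zero
    softplus = mx.logaddexp(scores, mx.zeros_like(scores))
    return mx.sqrt(softplus)
```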

Happy to coordinate if the maintainers want to consolidate into one PR. Our implementation has live generation validation at 21.86 tok/s on M3 Ultra (DeepSeek-V4-Flash-4bit, 160GB peak).

@Blaizzy (Contributor Author) commented Apr 24, 2026

Hey @machiabeli, thanks!

Yes, same person who left the earlier feedback, good to connect properly.

I've been poking at this in parallel and landed on something numerically close to the source with minimal changes, but there's definitely room to combine approaches. A PR from you on the compressed attention mask, Sinkhorn norm, and sqrt-softplus would be really welcome; I'm happy to review and merge what works best.

Or I can cherry-pick and add you as a co-author.

Comment thread mlx_lm/utils.py Outdated
Comment on lines +395 to +411
if (
    config.get("quantization", None) is None
    and getattr(model_args, "quantization", None) is not None
    and any(k.endswith(".scales") for k in weights)
):
    config["quantization"] = model_args.quantization

def _quantize(quantization):
    def class_predicate(p, m):
        if not hasattr(m, "to_quantized"):
            return False
        if f"{p}.scales" not in weights:
            return False
        # Handle custom per-layer quantizations
        if p in config["quantization"]:
            return config["quantization"][p]
        return True
@Blaizzy (Contributor Author) commented Apr 24, 2026

The goal here is to preserve the mxfp4 expert quant since MLX supports it. So I made the quantize_config key in the config class default to that, and these changes help prequantized models load properly.

It could be done via a predicate, but I couldn't find an elegant way of doing it.

Note: it doesn't affect any other model.
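
For context, a hedged sketch of how a per-path predicate like the one above is typically wired into nn.quantize; `weights` and `quant_config` are stand-ins for the loaded checkpoint and config["quantization"], and returning a dict for a path is what lets a layer keep its own settings (e.g. mxfp4 experts) while True falls back to the global bits/group_size:

```python
import mlx.nn as nn

def make_predicate(weights: dict, quant_config: dict):
    def class_predicate(path, module):
        if not hasattr(module, "to_quantized"):
            return False
        if f"{path}.scales" not in weights:
            return False
        # Per-layer override, e.g. {"bits": 4, "group_size": 32} for expert layers
        override = quant_config.get(path)
        return override if override is not None else True
    return class_predicate

# nn.quantize(model, group_size=quant_config["group_size"],
#             bits=quant_config["bits"],
#             class_predicate=make_predicate(weights, quant_config))
```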

@Blaizzy (Contributor Author) commented:

An alternative is to dequant -> requant, similar to how we do it with FP8.

@trevorgordon981 commented:

Following up on #issuecomment-4329720377 with more specific findings.

I attempted to build a true 8-bit conversion myself by streaming through deepseek-ai/DeepSeek-V4-Flash (the 148 GB native FP8 release): read each shard, dequantize the I8 + F8_E8M0 (block-scale) and F8_E4M3 weights to bf16, re-quantize as affine q8 with group_size=64, and write MLX-format shards with the sanitize-equivalent name remapping (embed.weight → model.embed_tokens.weight, hc_attn_X → attn_hc.X, experts stacked into switch_mlp.{gate,down,up}_proj, etc.). The dequant + requantize math validates fine (round-trip error 0.74% rel). The pipeline runs in ~5 min for the full 46-shard input.
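
For what it's worth, a hedged sketch of the round-trip check mentioned above (affine q8, group_size=64); the FP8 → bf16 dequantization is assumed to have already produced `w_bf16`, and the names are illustrative rather than the converter's actual code:

```python
import mlx.core as mx

def roundtrip_rel_error(w_bf16: mx.array, group_size: int = 64, bits: int = 8) -> float:
    # Affine re-quantization followed by dequantization, then relative L2 error
    wq, scales, biases = mx.quantize(w_bf16, group_size=group_size, bits=bits)
    w_rt = mx.dequantize(wq, scales, biases, group_size=group_size, bits=bits)
    diff = w_rt.astype(mx.float32) - w_bf16.astype(mx.float32)
    return float(mx.linalg.norm(diff) / mx.linalg.norm(w_bf16.astype(mx.float32)))
```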

Hit an architectural wall on the routed-expert dimensions that I can't reconcile from public artifacts.

Source (deepseek-ai/DeepSeek-V4-Flash) per-expert shapes:

experts.X.w1.weight: I8 [2048, 2048]
experts.X.w2.weight: I8 [4096, 1024]
experts.X.w3.weight: I8 [2048, 2048]

Your mlx-community/DeepSeek-V4-Flash-8bit switch_mlp (inferred from the mxfp4 packed/scales shapes):

gate_proj per expert (bf16): [2048, 4096]
down_proj per expert (bf16): [4096, 2048]
up_proj   per expert (bf16): [2048, 4096]

Both repos' configs say hidden_size=4096, moe_intermediate_size=2048, n_routed_experts=256, n_shared_experts=1 — they should produce the same per-expert shapes. But the source's last dims are exactly half of yours.

Param accounting confirms it: source's routed experts at (2048, 2048) sum to ~138B params (consistent with the 148 GB I8 file size), while your switch_mlp shapes imply ~280B in routed experts alone, which lines up with mxfp4's compressed 155 GB file size for a fuller param count.

The shared_experts block in source is sized correctly (shared_experts.w1.weight: F8_E4M3 [2048, 4096] matches your shared_experts.gate_proj (2048, 4096)). Only the routed experts disagree.

Two questions, in order of usefulness to me:

  1. How does your conversion process derive (2048, 4096) per-expert weights from the source's (2048, 2048)? Concat of w1+w3 (which would only give you gate_proj, not also up_proj), TP-shard combination across pairs of source experts, or some V4-Flash-specific reshaping I'm not seeing in mlx_lm/models/deepseek_v4.py?
  2. Would a true 8-bit version (no bits: 4 override on the FFN/MoE expert layers) be feasible to publish? My use case is a 512 GB Mac Studio backend; ~280-300 GB q8 fits comfortably with KV cache headroom and would be a quality bump over the current mxfp4-on-FFN profile. Happy to run the same harness against it if you publish.

If (1) has a clean answer I can finish the converter myself for (2). If not, only you have the conversion path that produces the right shapes.

Either way thanks for the work — dd6b92f mixed quant has been serving Alfred backend reliably for several hours now.

@Tonoken3 commented:

Performance update — Apple optimizations live ✨

After pulling the latest with @angeloskath's three commits (RoPE kernel native path, GLU cast simplification, original-checkpoint loader):

Setup: Mac Studio M3 Ultra 512GB, mxfp8, single-machine, no TP

| Prompt type | TTFT | TPS | Notes |
| --- | --- | --- | --- |
| Short JP (~50 tok output) | 2793 ms (cold) | 33.8 tok/s | first run, model warmup |
| Long JP (~500 char essay) | 232 ms (cached) | 33.6 tok/s | flat |
| Code (Python w/ docstrings) | 199 ms | 33.6 tok/s | flat |

The thing that jumps out: TPS is now context-flat (33.8 vs 33.6 vs 33.6 across very different workloads). Previous builds had visible decay on longer generations. This is the signature of the cast/RoPE overhead going away.

Cumulative trajectory on this hardware:

| Build | TPS |
| --- | --- |
| Initial mxfp8 (with token-drop regression) | 22.4 |
| Sinkhorn / HC fusion (0xClandestine) | 26.7 |
| acf650c9 orientation fix (Blaizzy) | 30.2 |
| Apple kernel optimizations (angeloskath) | 33.6–33.8 |

~+50% from baseline in 4 days of community iteration. Quality remains pristine (no token drops, no repetition loops, perfect Japanese & code).

Welcome to the PR @angeloskath — the kernel-native RoPE path is doing real work here on Apple Silicon. 🙌

adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 29, 2026
…'s fp32-cast patches

Pulls in PR ml-explore#1192 from upstream including:
- Apple-team commits from Angelos Katharopoulos:
  - 3cf5282 "Start simplifying and speeding up the attention" (2026-04-29)
  - 4951496 "Fix RoPE to use the kernel by scaling freqs" (2026-04-28)
  - 81a8c57 "Simplify GLU and gate remove intermediate castings" (2026-04-28)
- Blaizzy refactor stack (output projection, KV reshape, BatchRotatingKVCache,
  scoring/RoPE compile, the matmul rewrite from 0xClandestine that we already
  had a copy of, etc.)

Conflict resolution: took theirs for mlx_lm/models/deepseek_v4.py wholesale.
Our previous fork patches that are now superseded:
- f4dd9e7 / 2a1dcf6 "drop fp32 casts in Indexer / MoEGate / Compressor" —
  Angelos' 81a8c57 covers the same ground more cleanly.

Fork patches that need re-applying separately on top:
- mlx_lm/profiler.py span/finalize hooks scattered across deepseek_v4.py
  (attn_q_lora / attn_kv_proj / attn_compressor / attn_indexer /
  attn_sdpa_sparse / attn_sdpa_dense / moe_* spans).
- 1d78d62 Indexer wq_b/weights_proj sharding for TP.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 29, 2026
… + non-sparse attention refactor

New since our last merge attempt (then reverted for prefill regression):

- 83d7e74 (Angelos, Apr 29) "Refactor compressor and compressed
  non-sparse attention" — 225 line touchup. Suspected to address
  the prefill regression we hit.
- d8a47ae (pcuenca) "Ensure dtype going into mx.fast.rope is int32"
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — could
  fix the c=4 SIGABRT we hit during AIME eval.

Took theirs for mlx_lm/models/deepseek_v4.py wholesale. Will
re-apply our sliding-window indexer + EXO_DSV4_INDEX_TOPK env knob
on top in a follow-up commit. mxfp4 detection in utils.py preserved.

Note: switch_mlp call signature is still the 2-arg `(x, inds)` form
in PR ml-explore#1192 (scores applied externally in DeepseekV4MoE). Keep
EXO_DSV4_FUSED_MOE=0 until FusedDeepseekV4SwitchGLU is updated in
exo's auto_parallel.py.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 30, 2026
Previously reverted PR ml-explore#1192 today suspecting Apple's compressor
refactor (commit 83d7e74) caused our AIME quality failure. That was
wrong — the AIME failure was max_tokens=16384 too low for Think
mode (model card recommends >=384K for Think Max). The re-merge
brings back:

- 83d7e74 (Angelos) compressor + non-sparse attention refactor
- 4951496 RoPE kernel scaling fix
- 81a8c57 GLU/gate cast simplification
- ad9e329 (pcuenca) "Attempt to fix IndexError on batching" — likely
  fixes the recurring c=2 SIGABRT in mlx_lm/generate.py:1369 _step
  we've hit 4+ times today during AIME runs

Re-applied EXO_DSV4_INDEX_TOPK + EXO_DSV4_INDEXER_WINDOW env knobs
on top of the new Indexer, with the windowed-coordinate index
translation so V4Attention's sparse-pooled-attention gather works.

Keep EXO_DSV4_FUSED_MOE=0 — switch_mlp's 2-arg API still doesn't
match exo's FusedDeepseekV4SwitchGLU monkey-patch.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request Apr 30, 2026
…xplore#1192

Cherry-pick upstream PR ml-explore#1192 commit ad9e329 (Pedro Cuenca, "Attempt
to fix IndexError on batching") onto our pre-Apple-refactor base.

The fix is structural and doesn't depend on the post-83d7e74
refactored Compressor — it just touches DeepseekV4Cache.filter() to
zero-pad compressor/indexer state when batch_indices references
slots that haven't emitted pooled tokens yet. Without it, c=2
batched decode SIGABRTs deep in mlx_lm/generate.py:_step (we hit
this 4+ times today during AIME-5 c=2 runs).

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
…h-model@2120a1b)

58 commits ahead. Major changes:
- Refactor DeepseekV4Cache for state management (8ac37c0)
- Add DeepseekV4SwitchGLU + BatchRotatingKVCache + PoolingCache (c2c9801)
- Refactor compressor + non-sparse compressed attention (83d7e74)
- HC sinkhorn normalization + sinkhorn unit tests (c0d9222, f7ff216)
- Replace einsum with matmul in HyperConnection (ef8c95d)
- Numerical stability + compressed-attn mask padding (c6a7828)
- pcuenca's IndexError fix on batching (ad9e329, was already cherry-picked
  separately as 9f6a9d1, now superseded)
- pcuenca's RoPE dtype fix (d8a47ae)
- Quant predicate refactor (d59907c, f4f7b4d, 166bba6, cbb0b72)

Conflict resolution:
- mlx_lm/models/deepseek_v4.py: 39 conflicts. Took theirs entirely (the
  PR's substantial refactoring of Indexer/Compressor/Cache makes our
  sliding-window patches non-applicable). Re-added EXO_DSV4_INDEX_TOPK
  env override to the new Indexer class — the simplest of our three env
  knobs to forward-port. EXO_DSV4_INDEXER_WINDOW + WINDOW_LATE patches
  are NOT carried forward; the new code restructures pooled attention
  enough that the slicing strategy needs a different implementation,
  which can be re-added later if perf testing shows it's needed.
- mlx_lm/utils.py: auto-merged. Our mxfp4 quant-config inference patch
  (commit 6d0c164) survived intact (verified via grep).
- mlx_lm/generate.py: auto-merged. PR added BatchPoolingCache support
  in _make_cache; clean addition.
- mlx_lm/models/cache.py: auto-merged. PR added 539+ lines for
  PoolingCache and BatchPoolingCache.
- tests/test_models.py: auto-merged.

This is a significant rebase — needs cluster validation before promoting
to the standard deploy. Likely to require rebench of DSv4-Flash-8bit
AIME runs and recheck of the IOGPU residency-set fix interaction with
the new BatchRotatingKVCache code path.
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
The Indexer.__call__ score computation runs every decode step on
indexer-equipped layers (~21 layers per token). The score block was
plain Python uncompiled:
  scores = q @ pooled.T (cast fp32)
  scores = mx.maximum(scores, 0) * scale
  scores = (scores * weights.T[..., None]).sum(axis=1)

Extract to a free @mx.compile function. Per the dsv4_optimization_results
memory the Indexer was already a known hot path and earlier fork patches
had a similar compiled helper before the PR ml-explore#1192 merge took theirs.
Restoring it on top of the new structure.
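
A hedged sketch of what such a compiled helper could look like; shapes, argument names, and the head-mixing layout are assumptions, not the fork's actual code:

```python
import mlx.core as mx

@mx.compile
def _indexer_score(q: mx.array, pooled: mx.array, weights: mx.array, scale: float) -> mx.array:
    # q: (H, L_q, D) query heads, pooled: (L_kv, D) pooled keys,
    # weights: (H, L_q) per-head mixing weights from weights_proj
    scores = (q @ pooled.T).astype(mx.float32)        # (H, L_q, L_kv)
    scores = mx.maximum(scores, 0) * scale            # ReLU + scaling
    return (scores * weights[..., None]).sum(axis=0)  # mix heads -> (L_q, L_kv)
```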
adurham pushed a commit to adurham/mlx-lm that referenced this pull request May 1, 2026
…c10538)

4 new commits:
- 5c10538 Fix tensor parallel distributed
- 7efb57f Fix batch cache edge case
- c58834b Add hyper_connection.py (moves HyperConnection out of deepseek_v4.py)
- ca8b299 Simplify HyperConnection

Auto-merge clean — our EXO_DSV4_INDEX_TOPK env override on Indexer.__init__
and the @mx.compile _indexer_score helper both survived. Verified file parses.
@ivanfioravanti (Contributor) commented:

A prompt longer than 2K tokens leads to no response at all:
cat 4k.txt | mlx_lm.generate --model ~/DeepSeek-V4-Flash-5bit --max-tokens 200 --prompt -

I've used https://raw.githubusercontent.com/ivanfioravanti/llm_context_benchmarks/refs/heads/master/4k.txt

@ivanfioravanti (Contributor) commented:

convert is not working properly: both 5-bit and 6-bit end up at 4.349 bits per weight.

❯ mlx_lm.convert --hf-path deepseek-ai/DeepSeek-V4-Flash -q --q-bits 6 --mlx-path ~/DeepSeek-V4-Flash-6bit
[INFO] Loading
Fetching 67 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:00<00:00, 8842.62it/s]
Download complete: : 0.00B [00:00, ?B/s] | 0/67 [00:00<?, ?it/s]
[transformers] You are using a model of type deepseek_v4 to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
[transformers] PreTrainedConfig got `key=rope_scaling` in kwargs but hasn't set it as attribute. For RoPE standardization you need to set `self.rope_parameters` in model's config.
[INFO] Using dtype: bfloat16
[INFO] Quantizing
[INFO] Quantized model with 4.349 bits per weight.

It seems related to the fact that, when starting from the FP8 model, mxfp4 + mxfp8 is applied automatically regardless of --q-bits.

jundot added a commit to jundot/omlx that referenced this pull request May 6, 2026
The pinned mlx-lm v0.31.3 has no DeepSeek V4 support. Bring it in
through the omlx/patches pattern rather than touching the upstream
package.

New `omlx/patches/deepseek_v4/`:
- 1:1 copies of PR 1192's `deepseek_v4.py`, `hyper_connection.py`,
  and the `PoolingCache` / `BatchPoolingCache` additions to `cache.py`.
- Function-replacement patches for `mlx_lm.utils.load_model`
  (F8_E8M0 dtype fallback + fp8 quant branch) and
  `mlx_lm.generate._make_cache` (PoolingCache → BatchPoolingCache).
- `AutoTokenizer` wrapper that retries with `PreTrainedConfig()` when
  transformers ≤5.7.0 does not yet recognize `model_type=deepseek_v4`.
  Becomes dead code automatically once transformers ships native
  support.
- `PoolingCacheHandler` / `BatchPoolingCacheHandler` registered with
  the omlx CacheTypeRegistry so SSD and prefix cache state extraction
  do not silently fall through to `DefaultCacheHandler`.

omlx core: conditional dispatch in `utils/model_loading.py`,
`engine/batched.py`, and `models/llm.py` (gated on
`config.json::model_type=="deepseek_v4"` so other models pay nothing);
two new enum values and class-name mappings in `cache/type_handlers.py`
and `cache/type_registry.py`.

31 new unit tests in `tests/test_deepseek_v4_patch.py`. 344 cache /
engine / patches regression tests still pass.

Huge thanks to @Blaizzy for the upstream V4 implementation in
ml-explore/mlx-lm#1192 — this commit is just a thin glue layer
around that work.