
Add quantized KV cache for MLX draft #89

Open
Ziqiao-git wants to merge 1 commit into z-lab:main from Ziqiao-git:quantized-draft-kv

Conversation

@Ziqiao-git

Adds quantize_kv_bits=4/8 to load_draft() and --draft-quantize-kv-bits to the benchmark CLI.
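For reference, a rough sketch of how the new option would be invoked; the model path, the other arguments, and the return value are placeholders rather than the repo's actual signature:

```python
# Hypothetical call; only quantize_kv_bits / sliding_window_size come from this PR,
# the path and remaining details are illustrative.
draft = load_draft(
    "path/to/draft-model",
    sliding_window_size=2048,   # existing option that bounds cache length
    quantize_kv_bits=4,         # new: int4 draft KV cache (8 for int8, omit for fp16)
)

# Benchmark CLI: pass the matching flag, e.g. --draft-quantize-kv-bits 4
```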

Like the existing sliding_window_size option, this bounds the draft KV cache's memory footprint (int4 is roughly 4x smaller than fp16, int8 roughly 2x).
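For a rough sense of scale, back-of-envelope per-token KV sizes (the layer/head/dim numbers below are made up for illustration, and the small per-group scale/bias overhead is ignored):

```python
# Illustrative draft-model dimensions, not the actual config.
n_layers, n_kv_heads, head_dim = 36, 8, 128
elems = 2 * n_layers * n_kv_heads * head_dim   # K + V elements per token
print("fp16:", elems * 2, "bytes/token")       # 16 bits per element
print("int8:", elems * 1, "bytes/token")       # ~2x smaller
print("int4:", elems // 2, "bytes/token")      # ~4x smaller
```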

Implementation: DFlashAttention dequantizes the cached KV before concatenating it with the fp16 proposal KV, which keeps the cache compact without changing the SDPA kernel.
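That path looks roughly like the sketch below; this is not the actual DFlashAttention code, and it assumes the cached K/V are stored as (packed, scales, biases) tuples from mx.quantize with a (batch, n_kv_heads, seq, head_dim) layout:

```python
import mlx.core as mx

def sdpa_with_quantized_cache(q, k_new, v_new, k_cached, v_cached,
                              scale, bits=4, group_size=64):
    # k_cached / v_cached: (packed, scales, biases) tuples as returned by mx.quantize.
    # Dequantize the compact cache back to the proposal dtype...
    k_old = mx.dequantize(*k_cached, group_size=group_size, bits=bits).astype(k_new.dtype)
    v_old = mx.dequantize(*v_cached, group_size=group_size, bits=bits).astype(v_new.dtype)

    # ...then concatenate with the fp16 proposal KV along the sequence axis,
    # so the unmodified SDPA kernel only sees ordinary fp16 tensors.
    k = mx.concatenate([k_old, k_new], axis=2)
    v = mx.concatenate([v_old, v_new], axis=2)
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)
```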

Spot-checked on Qwen3.5-4B (GSM8K, thinking mode): outputs are identical across fp16/int4/int8 for the first 500 characters, and the acceptance rate is within noise. Happy to add unit tests if needed.
