
Add quantized KV cache for MLX draft #89

Open
Ziqiao-git wants to merge 1 commit into z-lab:main from Ziqiao-git:quantized-draft-kv

Conversation

@Ziqiao-git

Adds quantize_kv_bits=4/8 to load_draft() and --draft-quantize-kv-bits to the benchmark CLI.
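For reference, a rough sketch of how the new option would be invoked; the model path, the other arguments, and the return value are placeholders rather than the repo's actual signature:

```python
# Hypothetical call; only quantize_kv_bits / sliding_window_size come from this PR,
# the path and remaining details are illustrative.
draft = load_draft(
    "path/to/draft-model",
    sliding_window_size=2048,   # existing option that bounds cache length
    quantize_kv_bits=4,         # new: int4 draft KV cache (8 for int8, omit for fp16)
)

# Benchmark CLI: pass the matching flag, e.g. --draft-quantize-kv-bits 4
```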

Like the existing sliding_window_size option, this bounds the draft KV cache's memory footprint (int4 is roughly 4x smaller than fp16, int8 roughly 2x).
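For a rough sense of scale, back-of-envelope per-token KV sizes (the layer/head/dim numbers below are made up for illustration, and the small per-group scale/bias overhead is ignored):

```python
# Illustrative draft-model dimensions, not the actual config.
n_layers, n_kv_heads, head_dim = 36, 8, 128
elems = 2 * n_layers * n_kv_heads * head_dim   # K + V elements per token
print("fp16:", elems * 2, "bytes/token")       # 16 bits per element
print("int8:", elems * 1, "bytes/token")       # ~2x smaller
print("int4:", elems // 2, "bytes/token")      # ~4x smaller
```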

Implementation: DFlashAttention dequantizes the cached KV before concatenating it with the fp16 proposal KV, which keeps the cache compact without changing the SDPA kernel.
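That path looks roughly like the sketch below; this is not the actual DFlashAttention code, and it assumes the cached K/V are stored as (packed, scales, biases) tuples from mx.quantize with a (batch, n_kv_heads, seq, head_dim) layout:

```python
import mlx.core as mx

def sdpa_with_quantized_cache(q, k_new, v_new, k_cached, v_cached,
                              scale, bits=4, group_size=64):
    # k_cached / v_cached: (packed, scales, biases) tuples as returned by mx.quantize.
    # Dequantize the compact cache back to the proposal dtype...
    k_old = mx.dequantize(*k_cached, group_size=group_size, bits=bits).astype(k_new.dtype)
    v_old = mx.dequantize(*v_cached, group_size=group_size, bits=bits).astype(v_new.dtype)

    # ...then concatenate with the fp16 proposal KV along the sequence axis,
    # so the unmodified SDPA kernel only sees ordinary fp16 tensors.
    k = mx.concatenate([k_old, k_new], axis=2)
    v = mx.concatenate([v_old, v_new], axis=2)
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)
```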

Spot-checked on Qwen3.5-4B (GSM8K, thinking mode): outputs are identical across fp16/int4/int8 for the first 500 characters, and the acceptance rate is within noise. Happy to add unit tests if needed.
