[Bugfix] Allow fp8_e5m2 KV cache on W4A16 compressed-tensors models (V100/SM70) #49

Open
rivetphilbot wants to merge 1 commit into 1CatAI:main from rivetphilbot:p7-fp8-kv-ct-classifier-fix

@rivetphilbot
Summary

CompressedTensorsConfig.get_quant_method() unconditionally registers a CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors quantization — even when kv_cache_scheme is None (i.e. the checkpoint does not ship KV cache scales).

That registration causes should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 to return True. The classifier then asserts that the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 rejects --kv-cache-dtype fp8_e5m2 with:

ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

…despite the checkpoint being a plain W4A16 model with no FP8 scales at all.
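
For reviewers who want to see the cascade in isolation, here is a minimal sketch of the failure path. It is not the vLLM source: the classes are empty stand-ins, check_kv_cache_dtype is a hypothetical name for the classifier branch in attention.py, and should_load_quant_weights is reduced to a "quant method is not None" check, which is an assumption based only on the behavior described above.

```python
from typing import Optional


class BaseKVCacheMethod:
    """Stand-in for the base class of KV cache quant methods."""


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Stand-in for the method registered on every Attention layer."""


def should_load_quant_weights(quant_method: Optional[object]) -> bool:
    # Assumption: simplified to "a quant method was registered".
    return quant_method is not None


def check_kv_cache_dtype(quant_method: Optional[object], kv_cache_dtype: str) -> None:
    # Hypothetical helper modelling the classifier branch and the e5m2 guard.
    if should_load_quant_weights(quant_method):
        assert isinstance(quant_method, BaseKVCacheMethod)
        if kv_cache_dtype == "fp8_e5m2":
            raise ValueError(
                "fp8_e5m2 kv-cache is not supported with fp8 checkpoints."
            )


if __name__ == "__main__":
    # A W4A16 checkpoint with no KV scales still gets a KV cache method
    # registered, so the guard fires even though nothing is FP8.
    try:
        check_kv_cache_dtype(CompressedTensorsKVCacheMethod(), "fp8_e5m2")
    except ValueError as exc:
        print(f"rejected: {exc}")
```

In this model the only input that decides whether the guard fires is whether a method object was registered, which is why a checkpoint without KV scales is misclassified.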

On V100 / SM70 this is severe: Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available — and this bug blocks it for every compressed-tensors W4A16 model, of which there are now many in production (Qwen3.5/3.6 variants, DeepSeek W4A16, etc.).

Fix

Short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None makes should_load_quant_weights() return False, so the classifier branch is skipped and the user's --kv-cache-dtype choice is honored as a runtime decision rather than treated as a checkpoint property.

12 lines inserted, 0 removed. No behavior change for models that DO ship kv_cache_scheme — they continue to receive CompressedTensorsKVCacheMethod exactly as before.
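
The shape of the short-circuit can be sketched as follows. This is an illustration under stated assumptions, not the actual 12-line diff: Attention, CompressedTensorsKVCacheMethod and CompressedTensorsConfig are stripped-down stand-ins, the get_quant_method signature is simplified, the linear and MoE branches are omitted, and the kv_cache_scheme dictionary contents are made up for the example.

```python
from typing import Optional


class Attention:
    """Stand-in for vLLM's Attention layer."""


class CompressedTensorsKVCacheMethod:
    """Stand-in for the KV cache method that loads checkpoint scales."""

    def __init__(self, config: "CompressedTensorsConfig") -> None:
        self.config = config


class CompressedTensorsConfig:
    """Stand-in config holding only the field relevant to this fix."""

    def __init__(self, kv_cache_scheme: Optional[dict] = None) -> None:
        self.kv_cache_scheme = kv_cache_scheme

    def get_quant_method(self, layer: object, prefix: str = ""):
        if isinstance(layer, Attention):
            # Short-circuit: no KV cache scheme in the checkpoint means there
            # are no KV scales to load, so register nothing for this layer.
            if self.kv_cache_scheme is None:
                return None
            return CompressedTensorsKVCacheMethod(self)
        # Linear / MoE handling omitted from this sketch.
        return None


if __name__ == "__main__":
    plain_w4a16 = CompressedTensorsConfig(kv_cache_scheme=None)
    # Hypothetical scheme contents, for illustration only.
    calibrated = CompressedTensorsConfig(kv_cache_scheme={"num_bits": 8, "type": "float"})
    print(plain_w4a16.get_quant_method(Attention()))
    print(type(calibrated.get_quant_method(Attention())).__name__)
```

Because the check is an identity test against None, any checkpoint that does define a scheme takes the original registration path unchanged.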

Validation

Tested on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration, hybrid linear+full attention) at TP=2 on dual Tesla V100-PCIE-32GB:

  • ✅ Boots cleanly with --kv-cache-dtype fp8_e5m2 --max-model-len 262144 --max-num-seqs 16 --gpu-memory-utilization 0.89
  • FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn)
  • ✅ Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03× concurrency at 262,144
  • ✅ 9-stream concurrent aggregate decode throughput 75 tok/s vs single-stream baseline 20 tok/s (3.75× multiplier)
  • ✅ 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale = v_scale = 1.0 (no calibrated FP8 KV scales needed)
  • ✅ Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass
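
The 3.75× figure above is simply the ratio of the two reported throughput numbers. The short check below reproduces it; the per-stream value is derived here on the assumption that aggregate throughput is split evenly across the 9 streams, which the measurements above do not confirm.

```python
# Reported values from the validation run above.
aggregate_tok_s = 75.0       # 9-stream aggregate decode throughput
single_stream_tok_s = 20.0   # single-stream baseline
streams = 9

# Aggregate speedup over a single stream.
multiplier = aggregate_tok_s / single_stream_tok_s

# Assumption: throughput is shared evenly across the streams.
per_stream_tok_s = aggregate_tok_s / streams

print(f"{multiplier:.2f}x aggregate, {per_stream_tok_s:.2f} tok/s per stream")
```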

Test plan

  • Boots with --kv-cache-dtype fp8_e5m2 on a W4A16 compressed-tensors model
  • Boots with --kv-cache-dtype auto (FP16 KV) on the same model — no regression
  • Long-context retrieval validated at 128K
  • Maintainers: please verify no regression on a model that DOES ship kv_cache_scheme (e.g. a calibrated FP8 KV checkpoint). The guard only checks whether kv_cache_scheme is None, so registration is preserved when scales are present.

🤖 Authored by RivetOS Claude (Opus 4.7, 1M context)

CompressedTensorsConfig.get_quant_method() unconditionally registers
CompressedTensorsKVCacheMethod for every Attention layer when the model
uses compressed-tensors, even when kv_cache_scheme is None (i.e. the
checkpoint does not ship per-layer KV cache scales).

Once that method is registered, should_load_quant_weights() at
vllm/model_executor/layers/attention/attention.py:166 returns True, the
classifier asserts the quant method is a BaseKVCacheMethod (it is), and
the e5m2 guard at attention.py:167 then refuses --kv-cache-dtype
fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints"
- despite the checkpoint being a plain W4A16 model with no FP8 scales
at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors
W4A16 models, because Triton on SM70 only supports fp8e5 (not fp8e4nv),
so fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when
kv_cache_scheme is None. Returning None lets should_load_quant_weights()
return False, the classifier branch is skipped, and the user's
--kv-cache-dtype choice (fp8_e5m2 in this case) is honored as a
runtime decision rather than a checkpoint property.

Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at
TP=2 on dual Tesla V100-PCIE-32GB:

- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144
  (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback
  to triton_attn): confirmed via per-layer log emit at
  flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens,
  1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s
  (vs single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with
  default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step
  reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme - they
continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context) <noreply@anthropic.com>
@valentijnvenus
nice one!
