[Bugfix] Allow fp8_e5m2 KV cache on W4A16 compressed-tensors models (V100/SM70) by rivetphilbot · Pull Request #49 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-22T16:59:24Z

Summary

CompressedTensorsConfig.get_quant_method() unconditionally registers a CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors quantization — even when kv_cache_scheme is None (i.e. the checkpoint does not ship KV cache scales).

That registration sets off a cascade: should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 returns True, the classifier asserts that the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 then rejects --kv-cache-dtype fp8_e5m2 with:

ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

…despite the checkpoint being a plain W4A16 model with no FP8 scales at all.
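
The cascade above can be reproduced in a toy model. This is a minimal sketch, not the vLLM source: the classes are empty stand-ins, resolve_kv_cache_dtype() is a hypothetical helper, and treating "any registered quant method" as the trigger is an assumption based on the description above.

```python
class BaseKVCacheMethod:
    """Stand-in for the base class of KV cache quantization methods."""


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Stand-in for the method registered on every Attention layer."""


def should_load_quant_weights(quant_method):
    # Assumption: any registered quant method is read as
    # "the checkpoint ships quantization weights/scales".
    return quant_method is not None


def resolve_kv_cache_dtype(quant_method, kv_cache_dtype):
    """Hypothetical helper that mirrors the guard chain described above."""
    if should_load_quant_weights(quant_method):
        # The classifier check passes, because the registered method
        # really is a BaseKVCacheMethod.
        assert isinstance(quant_method, BaseKVCacheMethod)
        if kv_cache_dtype == "fp8_e5m2":
            raise ValueError(
                "fp8_e5m2 kv-cache is not supported with fp8 checkpoints."
            )
    return kv_cache_dtype


# A W4A16 checkpoint with no KV scales still gets a method registered,
# so the runtime flag is rejected.
try:
    resolve_kv_cache_dtype(CompressedTensorsKVCacheMethod(), "fp8_e5m2")
except ValueError as err:
    print(err)
```

In this sketch the error depends only on whether a method object exists, not on whether the checkpoint carries FP8 scales, which is the mismatch this PR addresses.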

On V100 / SM70 this is severe: Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available — and this bug blocks it for every compressed-tensors W4A16 model, of which there are now many in production (Qwen3.5/3.6 variants, DeepSeek W4A16, etc.).

Fix

Short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None lets should_load_quant_weights() return False, the classifier branch is skipped, and the user's --kv-cache-dtype choice is honored as a runtime decision rather than treated as a checkpoint property.

12 lines inserted, 0 removed. No behavior change for models that DO ship kv_cache_scheme — they continue to receive CompressedTensorsKVCacheMethod exactly as before.
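
The shape of the short-circuit can be sketched as follows. This is an illustration under stated assumptions rather than the actual 12-line diff: ToyCompressedTensorsConfig and the Attention stand-in are hypothetical, and handling of non-Attention layers is omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


class Attention:
    """Stand-in for an attention layer."""


class CompressedTensorsKVCacheMethod:
    """Stand-in for the KV cache method tied to checkpoint-provided scales."""

    def __init__(self, config):
        self.config = config


@dataclass
class ToyCompressedTensorsConfig:
    # None means the checkpoint does not ship a KV cache scheme.
    kv_cache_scheme: Optional[Dict[str, Any]] = None

    def get_quant_method(self, layer):
        if isinstance(layer, Attention):
            # Short-circuit: without a kv_cache_scheme there is nothing to
            # load, so register no method and leave the KV cache dtype to
            # the runtime flag.
            if self.kv_cache_scheme is None:
                return None
            return CompressedTensorsKVCacheMethod(self)
        # Linear and other layer types are out of scope for this sketch.
        return None


plain_w4a16 = ToyCompressedTensorsConfig()
# Hypothetical scheme contents, for illustration only.
calibrated = ToyCompressedTensorsConfig(kv_cache_scheme={"num_bits": 8, "type": "float"})

print(plain_w4a16.get_quant_method(Attention()))
print(type(calibrated.get_quant_method(Attention())).__name__)
```

Checkpoints that do carry a scheme take the same path as before, so only the scheme-less case changes.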

Validation

Tested on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration, hybrid linear+full attention) at TP=2 on dual Tesla V100-PCIE-32GB:

✅ Boots cleanly with --kv-cache-dtype fp8_e5m2 --max-model-len 262144 --max-num-seqs 16 --gpu-memory-utilization 0.89
✅ FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn)
✅ Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03× concurrency at 262,144
✅ 9-stream concurrent aggregate decode throughput 75 tok/s vs single-stream baseline 20 tok/s (3.75× multiplier)
✅ 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale = v_scale = 1.0 (no calibrated FP8 KV scales needed)
✅ Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass

Test plan

Boots with --kv-cache-dtype fp8_e5m2 on a W4A16 compressed-tensors model
Boots with --kv-cache-dtype auto (FP16 KV) on the same model — no regression
Long-context retrieval validated at 128K
Maintainers: please verify no regression on a model that DOES ship kv_cache_scheme (e.g. a calibrated FP8 KV checkpoint). The guard condition is "kv_cache_scheme is None", so registration is preserved when scales are present.

🤖 Authored by RivetOS Claude (Opus 4.7, 1M context)

CompressedTensorsConfig.get_quant_method() unconditionally registers CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors, even when kv_cache_scheme is None (i.e. the checkpoint does not ship per-layer KV cache scales). Once that method is registered, should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 returns True, the classifier asserts the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 then refuses --kv-cache-dtype fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" - despite the checkpoint being a plain W4A16 model with no FP8 scales at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors W4A16 models, because Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None lets should_load_quant_weights() return False, the classifier branch is skipped, and the user's --kv-cache-dtype choice (fp8_e5m2 in this case) is honored as a runtime decision rather than a checkpoint property.

Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at TP=2 on dual Tesla V100-PCIE-32GB:
- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144 (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn): confirmed via per-layer log emit at flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s (vs single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme - they continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context) <noreply@anthropic.com>

valentijnvenus · 2026-05-26T09:31:42Z

nice one!

rivetphilbot wants to merge 1 commit into 1CatAI:main from rivetphilbot:p7-fp8-kv-ct-classifier-fix
