From b7a5dc3004e1b0c64d5683447119cbc652ded527 Mon Sep 17 00:00:00 2001
From: RivetOS Claude <266195206+rivetphilbot@users.noreply.github.com>
Date: Fri, 22 May 2026 16:56:08 +0000
Subject: [PATCH] P7: skip CT KVCacheMethod when kv_cache_scheme is None

CompressedTensorsConfig.get_quant_method() unconditionally registers
CompressedTensorsKVCacheMethod for every Attention layer when the model
uses compressed-tensors, even when kv_cache_scheme is None (i.e. the
checkpoint does not ship per-layer KV cache scales).

Once that method is registered, should_load_quant_weights() at
vllm/model_executor/layers/attention/attention.py:166 returns True, the
classifier asserts the quant method is a BaseKVCacheMethod (it is), and
the e5m2 guard at attention.py:167 then refuses --kv-cache-dtype
fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints"
- despite the checkpoint being a plain W4A16 model with no FP8 scales
at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors
W4A16 models, because Triton on SM70 only supports fp8e5 (not fp8e4nv),
so fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when
kv_cache_scheme is None. Returning None lets should_load_quant_weights()
return False, the classifier branch is skipped, and the user's
--kv-cache-dtype choice (fp8_e5m2 in this case) is honored as a runtime
decision rather than a checkpoint property.
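
The check order described above can be sketched with a toy model. This
is a simplified illustration, not vLLM code: Config, KVCacheMethod,
guard_enabled, get_quant_method_for_attention() and
check_kv_cache_dtype() are hypothetical stand-ins, and the real checks
in attention.py are assumed to behave as this message describes.

```python
# Simplified stand-ins for illustration only; these are NOT vLLM's classes.
from dataclasses import dataclass
from typing import Optional


class KVCacheMethod:
    """Stand-in for a KV-cache quant method (BaseKVCacheMethod in the text)."""


@dataclass
class Config:
    """Stand-in for the quantization config; only the fields discussed here."""

    kv_cache_scheme: Optional[dict]
    guard_enabled: bool = True  # hypothetical switch: True models this fix

    def get_quant_method_for_attention(self) -> Optional[KVCacheMethod]:
        # With the fix, a checkpoint without KV scales gets no quant method.
        if self.guard_enabled and self.kv_cache_scheme is None:
            return None
        return KVCacheMethod()


def check_kv_cache_dtype(config: Config, kv_cache_dtype: str) -> str:
    """Mimic the check order from the commit message (assumed, simplified)."""
    quant_method = config.get_quant_method_for_attention()
    should_load_quant_weights = quant_method is not None
    if should_load_quant_weights:
        assert isinstance(quant_method, KVCacheMethod)
        if kv_cache_dtype == "fp8_e5m2":
            raise ValueError(
                "fp8_e5m2 kv-cache is not supported with fp8 checkpoints"
            )
    # No quant method registered: the runtime dtype choice is honored.
    return kv_cache_dtype
```

In this sketch, check_kv_cache_dtype(Config(kv_cache_scheme=None),
"fp8_e5m2") returns the requested dtype, while the same call with
guard_enabled=False raises the ValueError, mirroring the behavior
before this patch.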

Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at TP=2
on dual Tesla V100-PCIE-32GB:

- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144
  (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback
  to triton_attn): confirmed via per-layer log emit at
  flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens,
  1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s (vs
  single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with
  default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step
  reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme - they
continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context)
---
 .../compressed_tensors/compressed_tensors.py | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index c7d36e7a0e..11501eb99a 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -180,6 +180,18 @@ def get_quant_method(
             return quant_method
         if isinstance(layer, Attention):
+            # Only register a KV-cache quant method when the checkpoint
+            # actually ships KV scales.
+            # Without this guard, compressed-tensors W4A16 models
+            # (kv_cache_scheme=None) are routed through
+            # CompressedTensorsKVCacheMethod, treated as FP8 checkpoints
+            # by should_load_quant_weights() (attention.py:166), and
+            # refused --kv-cache-dtype fp8_e5m2 at attention.py:167.
+            # fp8_e5m2 is the only FP8 KV path Triton supports on
+            # V100/SM70 (it rejects fp8e4nv there), so without this
+            # fix V100 cannot use FP8 KV cache with a W4A16 model.
+            if self.kv_cache_scheme is None:
+                return None
             return CompressedTensorsKVCacheMethod(self)
         if isinstance(layer, FusedMoE):
             return CompressedTensorsMoEMethod.get_moe_method(