From b7a5dc3004e1b0c64d5683447119cbc652ded527 Mon Sep 17 00:00:00 2001
From: RivetOS Claude <266195206+rivetphilbot@users.noreply.github.com>
Date: Fri, 22 May 2026 16:56:08 +0000
Subject: [PATCH] P7: skip CT KVCacheMethod when kv_cache_scheme is None

CompressedTensorsConfig.get_quant_method() unconditionally registers
CompressedTensorsKVCacheMethod for every Attention layer when the model
uses compressed-tensors, even when kv_cache_scheme is None (i.e. the
checkpoint does not ship per-layer KV cache scales).

Once that method is registered, should_load_quant_weights() at
vllm/model_executor/layers/attention/attention.py:166 returns True and
the classifier asserts that the quant method is a BaseKVCacheMethod
(it is). The e5m2 guard at attention.py:167 then rejects
--kv-cache-dtype fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with
fp8 checkpoints", even though the checkpoint is a plain W4A16 model
with no FP8 scales at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors
W4A16 models: Triton on SM70 supports only fp8e5 (not fp8e4nv), so
fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when
kv_cache_scheme is None. Returning None makes
should_load_quant_weights() return False, so the classifier branch is
skipped and the user's --kv-cache-dtype choice (fp8_e5m2 in this case)
is honored as a runtime decision rather than a checkpoint property.
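
For illustration only, the control flow described above can be sketched
with simplified stand-ins. Attention, KVCacheMethod, Config, the
patched flag and check_kv_cache_dtype() are hypothetical names invented
for this sketch. They are not the vLLM classes, and the real guard in
attention.py is condensed here into a single condition:

```python
from dataclasses import dataclass
from typing import Optional


class Attention:
    """Stand-in for vLLM's Attention layer (hypothetical)."""


class KVCacheMethod:
    """Stand-in for CompressedTensorsKVCacheMethod (hypothetical)."""


@dataclass
class Config:
    """Stand-in for CompressedTensorsConfig (hypothetical)."""

    kv_cache_scheme: Optional[dict]
    patched: bool = True  # False models the pre-patch behavior

    def get_quant_method(self, layer):
        if isinstance(layer, Attention):
            # The guard added by this patch: no KV scheme, no method.
            if self.patched and self.kv_cache_scheme is None:
                return None
            return KVCacheMethod()
        return None


def check_kv_cache_dtype(config, kv_cache_dtype):
    """Condensed, assumed form of the e5m2 guard described above."""
    method = config.get_quant_method(Attention())
    if method is not None and kv_cache_dtype == "fp8_e5m2":
        raise ValueError(
            "fp8_e5m2 kv-cache is not supported with fp8 checkpoints"
        )
    return kv_cache_dtype
```

In this sketch, Config(kv_cache_scheme=None) accepts "fp8_e5m2", while
the same config with patched=False raises ValueError, mirroring the
failure described earlier. A config with a non-None kv_cache_scheme
still receives a KVCacheMethod either way.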

Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at
TP=2 on dual Tesla V100-PCIE-32GB:

- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144
  (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback
  to triton_attn): confirmed via per-layer log emit at
  flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens,
  1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s
  (vs single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with
  default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step
  reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme - they
continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context) <noreply@anthropic.com>
---
 .../compressed_tensors/compressed_tensors.py         | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index c7d36e7a0e..11501eb99a 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -180,6 +180,18 @@ def get_quant_method(
                 return quant_method
 
         if isinstance(layer, Attention):
+            # Only register a KV-cache quant method when the checkpoint
+            # actually ships KV scales. Without this guard, compressed-
+            # tensors W4A16 models (which set kv_cache_scheme=None) are
+            # routed through CompressedTensorsKVCacheMethod and then
+            # misclassified by should_load_quant_weights() in
+            # attention.py:166 as "FP8 checkpoints", which refuses
+            # --kv-cache-dtype fp8_e5m2 at attention.py:167. fp8_e5m2
+            # is the only FP8 KV path Triton supports on V100/SM70
+            # (Triton on SM70 rejects fp8e4nv), so without this fix
+            # V100 deployments cannot use FP8 KV cache on a W4A16 model.
+            if self.kv_cache_scheme is None:
+                return None
             return CompressedTensorsKVCacheMethod(self)
         if isinstance(layer, FusedMoE):
             return CompressedTensorsMoEMethod.get_moe_method(