[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading #25
rivetphilbot wants to merge 4 commits into
- Revert sledgehammer in compressed_tensors.py:345: restore proper capability-vs-min_capability gate.
- Add SM70 carve-out in compressed_tensors_wNa16.get_min_capability: return 70 on V100 (CC 7.0), 75 elsewhere.

Validates: Qwen3.6-27B-AWQ-INT4 now boots through engine init and TP=2 NCCL all_reduce, then advances to weight loading, where it fails at qwen3_5.py:531 on BasevLLMParameter.output_dim. Admittance is therefore unblocked, and the failure is in the next layer (DeltaNet merged-projection parameter setup, work for P3a).
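The carve-out described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: the device capability is passed in as a plain integer (major * 10 + minor) instead of being queried from the platform, and `is_admitted` is a hypothetical stand-in for the capability-vs-min_capability gate in compressed_tensors.py.

```python
def get_min_capability(device_capability: int) -> int:
    """Return the minimum compute capability the WNA16 scheme accepts.

    Assumption: capability is encoded as major * 10 + minor (V100 = 70).
    """
    # SM70 carve-out: V100 (CC 7.0) is admitted at 70; every other device
    # keeps the original floor of 75.
    return 70 if device_capability == 70 else 75


def is_admitted(device_capability: int) -> bool:
    """Hypothetical stand-in for the capability-vs-min_capability gate."""
    return device_capability >= get_min_capability(device_capability)
```

With this shape, only CC 7.0 is newly admitted; a CC 7.2 device, for example, is still rejected because its floor stays at 75.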
CT WNA16 creates auxiliary BasevLLMParameter (weight_shape) and RowvLLMParameter (weight_g_idx) tensors that have no output_dim, since they hold metadata or input-dim-sharded indices rather than output-dim weight data. The tuple-shard load path in qwen3_5.py:load_weights raised AttributeError when the qkvz stacked-load mapping reached one of these auxiliary tensors.

Fix: when output_dim is missing, fall through to the standard non-shard weight_loader (last-write-wins for replicated metadata) and break out of the sub-id loop.

Validated: Qwen3.6-27B-AWQ-INT4 now completes model loading on V100 (11.28 GiB / GPU) and runs through 90s of memory-profiling forward passes. The only remaining failure is KV-cache budget tuning.
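The fall-through described above might look roughly like the toy model below. Everything here is hypothetical scaffolding for illustration: `Param`, `shard_loader`, and `load_stacked` are not vLLM names, and the real loader in qwen3_5.py deals with tensors and shard offsets, not strings.

```python
class Param:
    """Toy parameter; `output_dim` exists only on output-sharded weights."""

    def __init__(self, name, output_dim=None):
        self.name = name
        self.loaded = []  # records every load call for inspection
        if output_dim is not None:
            self.output_dim = output_dim

    def weight_loader(self, value):
        # Standard non-shard load: the whole value is written once.
        self.loaded.append(("plain", value))

    def shard_loader(self, value, shard_id):
        # Shard-aware load: one write per sub-id of the stacked projection.
        self.loaded.append((shard_id, value))


def load_stacked(param, value, shard_ids):
    """Load `value` into `param` for each sub-id of a stacked mapping."""
    for shard_id in shard_ids:
        if not hasattr(param, "output_dim"):
            # Auxiliary metadata tensor: load it once through the plain
            # loader and leave the sub-id loop.
            param.weight_loader(value)
            break
        param.shard_loader(value, shard_id)
```

A parameter with `output_dim` receives one write per sub-id, while a metadata parameter such as weight_shape receives a single plain write.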
Replace 1Cat fork's vendored sm70_884_4.cu (lmdeploy v0.12.1) with the upstream lmdeploy main version (commit e5fbd4da). Pure registration changes -- mainloop_sm70.h, iterator_sm70.h, scheduler_sm70.cuh are byte-identical between the two.

Source: lmdeploy PR #4429 'fully implement compressed-tensors gs32 support' (Mar 23, commit c8eaf3b2). Net diff: +45 -11 lines.

Adds:
- New Config_U4_d<kColMajor> block with 21 gs32 tile sizes (1Cat had ZERO Config_U4_d gs32 tiles)
- Expanded Config_U4_g<kColMajor> gs32 block: 17 tiles (vs 6 in 1Cat)

Drops:
- gs64 tile block (1Cat carried it; both deployed quants are gs32)

Cyankiwi quants (qwen3.6-27b-int4, granite-4.1-8b-awq-int4) are gs32 -- both directly affected. More tile candidates = better autotune match for actual decode shapes (M=2-4 batched at TP=2).

Profile basis: turbomind::gemm = 83% of decode GPU time on V100 dual TP=2 serving Qwen3.6-27B; recurrent GDN <1%, attention <1%. The GEMM autotuner has no gs32 candidates in the most common kColMajor layout in the 1Cat snapshot, forcing fallback paths.

Build: full _C.abi3.so rebuild required (kernel registration is statically linked). TORCH_CUDA_ARCH_LIST=7.0 for V100-only build.
… patches needed (#4)

cyankiwi/granite-4.1-8b-AWQ-INT4 is dense ``GraniteForCausalLM`` quantized as compressed-tensors W4A16 group_size=32 asymmetric. The fork's existing ``TurboMindAsymLinearKernel`` (CT pack -> AWQ pack -> awq_sm70_prepare / awq_gemm_sm70) handles this format end-to-end -- so the patches in 1Cat-vLLM PR 1CatAI#25 ("[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading") turn out to be redundant for granite:

- PR P0 (new ``SM70TurboMindLinearKernel``): functionally a re-impl of our ``TurboMindAsymLinearKernel`` -- same scalar type, same group sizes, same CT->AWQ transcode, same kernel dispatch.
- PR P1 (``min_capability=70`` for CT WNA16): already in the fork's ``compressed_tensors_wNa16.py`` (returns 70 unconditionally).
- PR P3a (qwen3_5 weight-loader fix): only relevant to hybrid DeltaNet models like Qwen3.6-27B; granite is pure dense attention.
- PR P6 (sm70_884_4.cu tile registry expansion): not needed for granite. The asymmetric path ends up dispatching through ``Config_U4_g`` gs32 (which we have), not ``Config_U4_d`` gs32 (which we don't); shapes match and output is correct.

What does matter is the PR's operational note about ``FLASH_ATTN_V100``: with default chunked-prefill split, granite produces all-token-id-0 ("!!!!...") garbage. The PR's workaround (``--compilation-config compile_ranges_split_points:[]``) prevents that silent fallback.

With the workaround AND cudagraph engaged (no ``--enforce-eager``), local bench at TP=2 dual V100 32GB SXM2, 32-prompt -> 128-gen:

  batch=1  : 126.6 tok/s decode (single-stream)
  batch=8  : 586.8 tok/s aggregate / 73.3 per-seq
  batch=16 : eager-only sweep showed 261.7 aggregate, but cudagraph scaling not yet measured at this batch size

Eager-mode is ~3.9x slower (32.7 tok/s batch=1); the V4-Flash convention of ``enforce_eager=True`` does NOT carry over -- granite has none of V4-Flash's three cudagraph blockers, so capture engages cleanly. ``TRITON_ATTN`` (no compilation-config tweak required) is a slower fallback at ~26 tok/s eager batch=1.

Adds:
- ``tests/models/test_granite_v100_smoke.py``: TP=1 single-prompt math probe with selectable attention backend; was used to isolate that the original "!!!!..." failure was the FLASH_ATTN_V100 silent fallback, not the kernel.
- ``tests/models/test_granite_v100_bench.py``: TP=2 throughput bench parameterized by ``BATCH``, ``ENFORCE_EAGER``, ``PR_CHUNKED_PREFILL_FIX``, ``ATTN_BACKEND``. Has a coherence gate (``token_id == 0`` < 10% of generated tokens) so a silent fallback is detected, not confused with a perf regression.
- README "Verified models" row + a new "Quick run" section in the same docker-compose-env-var style as the Qwen3.6 / V4-Flash examples.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
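The coherence gate mentioned in the commit message above (fewer than 10% of generated tokens equal to token id 0) could be expressed along these lines. This is a sketch, not the code in the bench file: the function name, the parameter name, and the choice to treat an empty generation as incoherent are assumptions.

```python
def is_coherent(token_ids, max_zero_fraction=0.10):
    """Return True when token id 0 makes up less than `max_zero_fraction`
    of the generated tokens.

    Assumption: an empty generation is treated as incoherent.
    """
    if not token_ids:
        return False
    zero_fraction = sum(1 for t in token_ids if t == 0) / len(token_ids)
    return zero_fraction < max_zero_fraction
```

An all-zero output (the "!!!!..." case) fails the gate, so a silent fallback shows up as a correctness failure and not as a throughput number.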
Summary
Adds an SM70 (V100) path for `compressed-tensors` dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

Status: DRAFT. Branch is pushed for visibility and discussion. Author identities and commit organization will be cleaned up before this is marked ready for review (see "Known issues" below). The patches themselves are in working order.
Patches in this branch
- P0 (`wip` commit): `vllm/model_executor/layers/quantization/kernels/mixed_precision/sm70_turbomind.py` (new, 193 LoC). `SM70TurboMindLinearKernel` for dense compressed-tensors WNA16. Transcodes CT pack format → AWQ pack format and dispatches through the existing `awq_sm70_prepare` / `awq_gemm_sm70` TurboMind kernels. Registers in `_POSSIBLE_KERNELS[CUDA]`.
- P1 (`e9a7f8a`): `compressed_tensors_wNa16.py`. … `CompressedTensorsWNA16` schemes on SM70 by overriding the hardcoded `min_capability=75` when an SM70-capable kernel is registered. Lets the kernel selector handle availability rather than reject up-front.
- P3a (`f46eec7`): `vllm/model_executor/models/qwen3_5.py:512-543`. … `output_dim` params. CT WNA16 creates auxiliary tensors (`weight_shape`, `weight_g_idx`) that lack `output_dim`; the legacy `weight_loader_with_alias` path crashes on them. Routes those through the standard `weight_loader` and breaks the sub-id loop.
- P6 (`5de6ca5`): `src/turbomind/kernels/gemm/kernel/sm70_884_4.cu`. … `Config_U4_d` gs32 tile sizes and expands `Config_U4_g` gs32 from 6→17 tiles. Pure registry additions; mainloop/iterator/scheduler unchanged.

Validated on V100 dual-TP
Models confirmed loading & generating correctly:
- `cyankiwi/Qwen3.6-27B-AWQ-INT4` (hybrid DeltaNet + attention dense, 16/64 attention layers)
- `cyankiwi/granite-4.1-8b-AWQ-INT4` (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% in `turbomind::gemm`); this is the V100 HBM-bandwidth ceiling for 27B-class dense, not a patch-related limitation.

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):
What this enables
Today, `1Cat-vLLM` SM70 support covers:
- `awq` format (works, via `AWQLinearMethod`)
- `compressed-tensors` MoE (works, via `CompressedTensorsSM70WNA16MoEMethod` from prior MoE work)

This patch series adds the missing third corner: `compressed-tensors` dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers
- `nvidia-nccl-cu13` shadowing `nvidia-nccl-cu12` in cu128 builds causes `unhandled cuda error` at the first `all_reduce`. Fix: ensure only `nvidia-nccl-cu12` is installed. We hit this twice during patch development; if you see the same crash, this is almost certainly the cause.
- … `--compilation-config compile_ranges_split_points:[]` (disables chunked prefill split) when used with `FLASH_ATTN_V100`; without this we observed silent fallback. Likely a separate issue worth its own investigation.

Known issues (to clean up before marking ready)
- The `wip` commit (`3f1fefa`) contains ~806 lines of mostly-whitespace diff against `compressed_tensors_moe.py` and `qwen3_next.py`. The substantive content (~83 lines) is the P0 kernel + DeltaNet param routing. This will be split into proper P0 (kernel + register) and P3a-prelim commits, with whitespace noise dropped.
- Mixed author identities (`Phil <phil@pve3>`, `phil <phil@rivetos.local>`, `SM70 Patch <sm70-patch@local>`). Will normalize to a single identity before ready-for-review.

Happy to take feedback on the patch shapes themselves before doing the cleanup pass; I'd rather restructure once based on review than do it twice.
Test plan
- … `FLASH_ATTN_V100` backend