[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading by rivetphilbot · Pull Request #25 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-04T20:19:15Z

Summary

Adds an SM70 (V100) path for compressed-tensors dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

Status: DRAFT. Branch is pushed for visibility and discussion. Author identities + commit organization will be cleaned up before this is marked ready for review (see "Known issues" below). Patches themselves are in working order.

Patches in this branch

Patch	File	What it does
P0 (in `wip` commit)	`vllm/model_executor/layers/quantization/kernels/mixed_precision/sm70_turbomind.py` (new, 193 LoC)	New `SM70TurboMindLinearKernel` for dense compressed-tensors WNA16. Transcodes CT pack format → AWQ pack format and dispatches through the existing `awq_sm70_prepare` / `awq_gemm_sm70` TurboMind kernels. Registers in `_POSSIBLE_KERNELS[CUDA]`.
P1 (`e9a7f8a`)	`compressed_tensors_wNa16.py`	Allow `CompressedTensorsWNA16` schemes on SM70 by overriding the hardcoded `min_capability=75` when an SM70-capable kernel is registered. Lets the kernel selector handle availability rather than reject up-front.
P3a (`f46eec7`)	`vllm/model_executor/models/qwen3_5.py:512-543`	Skip the tuple-shard split for non-`output_dim` params. CT WNA16 creates auxiliary tensors (`weight_shape`, `weight_g_idx`) that lack `output_dim`; the legacy `weight_loader_with_alias` path crashes on them. Routes those through the standard `weight_loader` and breaks the sub-id loop.
P6 (`5de6ca5`)	`src/turbomind/kernels/gemm/kernel/sm70_884_4.cu`	Sync to lmdeploy main: adds 21 new `Config_U4_d` gs32 tile sizes and expands `Config_U4_g` gs32 from 6→17 tiles. Pure registry additions; mainloop/iterator/scheduler unchanged.
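
The CT pack → AWQ pack transcode that P0 describes can be illustrated with a small pure-Python sketch. This is not the kernel's code. It assumes that compressed-tensors packs eight unsigned 4-bit values per 32-bit word along the input dimension, lowest nibble first, and that AWQ packs along the output dimension with the `[0, 2, 4, 6, 1, 3, 5, 7]` interleave. All function names are hypothetical, and scales and zero points are left out.

```python
# Sketch only: both nibble layouts below are assumptions, not read from the PR.
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed AWQ interleave within one word
PACK = 8  # eight 4-bit values per 32-bit word


def pack_ct(dense):
    """Pack dense[out][in] along the input dim, lowest nibble first (assumed CT layout)."""
    return [
        [
            sum((row[k * PACK + i] & 0xF) << (4 * i) for i in range(PACK))
            for k in range(len(row) // PACK)
        ]
        for row in dense
    ]


def unpack_ct(words):
    """Inverse of pack_ct: recover dense[out][in] from packed words."""
    return [[(w >> (4 * i)) & 0xF for w in row for i in range(PACK)] for row in words]


def pack_awq(dense_t):
    """Pack dense_t[in][out] along the output dim using the assumed AWQ interleave."""
    return [
        [
            sum((row[c * PACK + AWQ_ORDER[i]] & 0xF) << (4 * i) for i in range(PACK))
            for c in range(len(row) // PACK)
        ]
        for row in dense_t
    ]


def unpack_awq(words):
    """Inverse of pack_awq, used here only to check the round trip."""
    out = []
    for row in words:
        vals = []
        for w in row:
            chunk = [0] * PACK
            for i in range(PACK):
                chunk[AWQ_ORDER[i]] = (w >> (4 * i)) & 0xF
            vals.extend(chunk)
        out.append(vals)
    return out


def ct_to_awq(ct_words):
    """Transcode: unpack CT words, transpose to [in][out], repack in AWQ order."""
    dense = unpack_ct(ct_words)
    dense_t = [list(col) for col in zip(*dense)]
    return pack_awq(dense_t)
```

Going through a dense intermediate is slower than a direct bit shuffle, but it keeps the two layouts independent and makes the transpose explicit.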

Validated on V100 dual-TP

Models confirmed loading & generating correctly:

cyankiwi/Qwen3.6-27B-AWQ-INT4 (hybrid DeltaNet + attention dense, 16/64 attention layers)
cyankiwi/granite-4.1-8b-AWQ-INT4 (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% in turbomind::gemm) — this is the V100 HBM-bandwidth ceiling for 27B-class dense, not a patch-related limitation.

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):

RealWorldQA: 75.42% (Qwen published: 84.1)
MMMU (full): 52.33% raw / ~62-69% methodology-adjusted (Qwen published: 82.9)
Wrong-answer audit: residual gap is plausible reasoning errors + truncation + extraction parser limits, not kernel garbage. Detailed audit available on request.
A 40-probe head-to-head vs Qwen3.6-35B-A3B (the previously-validated MoE model) on this build showed 27B-INT4 winning decisively on tool-call structure (88% vs 62%) and matching or slightly trailing on reasoning categories at expected per-active-param ratios.

What this enables

Today, 1Cat-vLLM SM70 support covers:

Legacy awq format (works, via AWQLinearMethod)
compressed-tensors MoE (works, via CompressedTensorsSM70WNA16MoEMethod from prior MoE work)

This patch series adds the missing third corner: compressed-tensors dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers

NCCL package conflict. nvidia-nccl-cu13 shadowing nvidia-nccl-cu12 in cu128 builds causes unhandled cuda error at first all_reduce. Fix: ensure only nvidia-nccl-cu12 is installed. We hit this twice during patch development; if you see the same crash, this is almost certainly the cause.
Chunked prefill. Decode benefits from --compilation-config compile_ranges_split_points:[] (disables chunked prefill split) when used with FLASH_ATTN_V100; without this we observed silent fallback. Likely a separate issue worth its own investigation.
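
The NCCL package conflict above can be checked before launch with a short script. The sketch below uses only the standard-library `importlib.metadata` module. The helper names are hypothetical, and it assumes the conflicting wheels all share the `nvidia-nccl-` name prefix, as the two packages named above do.

```python
from importlib import metadata


def nccl_conflicts(installed, wanted="nvidia-nccl-cu12"):
    """Return installed NCCL wheels other than `wanted`, sorted by name."""
    # Normalise case and underscores so "nvidia_nccl_cu12" matches too.
    names = {n.lower().replace("_", "-") for n in installed}
    return sorted(n for n in names if n.startswith("nvidia-nccl-") and n != wanted)


def installed_distribution_names():
    """List the names of every distribution visible to this interpreter."""
    return [d.metadata["Name"] for d in metadata.distributions() if d.metadata["Name"]]


if __name__ == "__main__":
    extra = nccl_conflicts(installed_distribution_names())
    if extra:
        print("Conflicting NCCL packages:", ", ".join(extra))
        print("Suggested fix: pip uninstall " + " ".join(extra))
    else:
        print("No conflicting NCCL packages found")
```

Run it in the same environment that launches the server, since the check only sees packages visible to the interpreter that executes it.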

Known issues — to clean up before marking ready

The wip commit (3f1fefa) contains ~806 lines of mostly-whitespace diff against compressed_tensors_moe.py and qwen3_next.py. The substantive content (~83 lines) is the P0 kernel + DeltaNet param routing. This will be split into proper P0 (kernel + register) and P3a-prelim commits, with whitespace noise dropped.
Author identities are inconsistent across commits (Phil <phil@pve3>, phil <phil@rivetos.local>, SM70 Patch <sm70-patch@local>). Will normalize to a single identity before ready-for-review.
P1 currently reverts a sledgehammer hack from the wip commit; in the cleaned version the sledgehammer never appears.

Happy to take feedback on the patch shapes themselves before doing the cleanup pass — I'd rather restructure once based on review than do it twice.

Test plan

Boot Qwen3.6-27B-AWQ-INT4 on dual V100-32GB TP=2 with FLASH_ATTN_V100 backend
Boot Granite-4.1-8B-AWQ-INT4 on the same hardware (independent validation of P0/P1)
Smoke-test text reasoning, code, factual recall at temp=0
Smoke-test multimodal (image input → vision tower → LLM): vision tower ignored from quantization, runs in BF16 native attention, color-identification probe correct
Operator-level profile to confirm no kernel-correctness regression
Full RealWorldQA (765q) + MMMU (900q) academic comparison vs Qwen-published
Head-to-head vs Qwen3.6-35B-A3B on this build, 40-probe agentic workload set

- Revert sledgehammer in compressed_tensors.py:345: restore proper capability-vs-min_capability gate.
- Add SM70 carve-out in compressed_tensors_wNa16.get_min_capability: return 70 on V100 (CC 7.0), 75 elsewhere.

Validates: Qwen3.6-27B-AWQ-INT4 now boots through engine init and TP=2 NCCL all_reduce, then advances to weight loading, where it fails at qwen3_5.py:531 with BasevLLMParameter.output_dim. Admittance is therefore unblocked and we hit the next layer (DeltaNet merged-projection parameter setup, work for P3a).
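
The carve-out and the restored gate can be sketched as standalone functions. This is an illustration of the logic described in the commit message, not the fork's code: the real `get_min_capability` lives on the scheme class, so passing the device capability as an argument is an assumption made here to keep the example self-contained. `scheme_admitted` is a hypothetical name.

```python
SM70 = 70  # V100, compute capability 7.0
DEFAULT_MIN_CAPABILITY = 75  # the hardcoded floor mentioned in the PR description


def to_int_capability(major: int, minor: int) -> int:
    """Encode a (major, minor) compute capability as a two-digit integer."""
    return major * 10 + minor


def get_min_capability(device_capability: int) -> int:
    """Return 70 on V100 (CC 7.0) and 75 elsewhere, as the commit describes."""
    return SM70 if device_capability == SM70 else DEFAULT_MIN_CAPABILITY


def scheme_admitted(device_capability: int) -> bool:
    """Capability-vs-min_capability gate: admit only devices at or above the floor."""
    return device_capability >= get_min_capability(device_capability)
```

Under this sketch a CC 7.2 device is still rejected, because the carve-out applies only to an exact match on 7.0.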

CT WNA16 creates auxiliary BasevLLMParameter (weight_shape) and RowvLLMParameter (weight_g_idx) that have no output_dim, since they hold metadata or input-dim-sharded indices rather than output-dim weight data. The tuple-shard load path in qwen3_5.py:load_weights hit AttributeError when the qkvz stacked-load mapping reached one of these auxiliary tensors. Fix: when output_dim is missing, fall through to the standard non-shard weight_loader (last-write-wins for replicated metadata) and break out of the sub-id loop. Validated: Qwen3.6-27B-AWQ-INT4 now completes model loading on V100 (11.28 GiB / GPU) and runs through 90s of memory profiling forward passes. Remaining failure is KV-cache budget tuning only.
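
The fall-through can be sketched with stand-in parameter classes. Everything here is hypothetical scaffolding: the class names, the loader signatures, and the shard ids are assumptions chosen to show the control flow, and they do not mirror the actual vLLM parameter API.

```python
class ShardedParam:
    """Stand-in for a weight parameter sharded along its output dimension."""

    def __init__(self):
        self.output_dim = 0
        self.loaded = []

    def weight_loader_with_alias(self, weight, shard_id):
        # Record which shard slot received the weight.
        self.loaded.append((shard_id, weight))


class MetadataParam:
    """Stand-in for weight_shape / weight_g_idx: deliberately has no output_dim."""

    def __init__(self):
        self.loaded = []

    def weight_loader(self, weight):
        # Replicated metadata: a later write simply replaces an earlier one.
        self.loaded.append(weight)


def load_stacked(param, weight, shard_ids):
    """Load `weight` into `param` for each shard id, skipping the split for metadata."""
    for shard_id in shard_ids:
        if not hasattr(param, "output_dim"):
            # No output dimension to split on: use the standard loader once
            # and leave the sub-id loop.
            param.weight_loader(weight)
            break
        param.weight_loader_with_alias(weight, shard_id)
```

Checking with `hasattr` rather than catching `AttributeError` keeps the fall-through explicit and avoids masking unrelated attribute errors raised inside a loader.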

Replace 1Cat fork's vendored sm70_884_4.cu (lmdeploy v0.12.1) with the upstream lmdeploy main version (commit e5fbd4da). Pure registration changes -- mainloop_sm70.h, iterator_sm70.h, scheduler_sm70.cuh are byte-identical between the two.

Source: lmdeploy PR #4429 'fully implement compressed-tensors gs32 support' (Mar 23, commit c8eaf3b2). Net diff: +45 -11 lines.

Adds:
- New Config_U4_d<kColMajor> block with 21 gs32 tile sizes (1Cat had ZERO Config_U4_d gs32 tiles)
- Expanded Config_U4_g<kColMajor> gs32 block: 17 tiles (vs 6 in 1Cat)

Drops:
- gs64 tile block (1Cat carried it; both deployed quants are gs32)

Cyankiwi quants (qwen3.6-27b-int4, granite-4.1-8b-awq-int4) are gs32 -- both directly affected. More tile candidates = better autotune match for actual decode shapes (M=2-4 batched at TP=2).

Profile basis: turbomind::gemm = 83% of decode GPU time on V100 dual TP=2 serving Qwen3.6-27B; recurrent GDN <1%, attention <1%. The GEMM autotuner has no gs32 candidates in the most common kColMajor layout in the 1Cat snapshot, forcing fallback paths.

Build: full _C.abi3.so rebuild required (kernel registration is statically linked). TORCH_CUDA_ARCH_LIST=7.0 for V100-only build.

github-actions · 2026-05-04T20:19:31Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

… patches needed (#4)

cyankiwi/granite-4.1-8b-AWQ-INT4 is dense ``GraniteForCausalLM`` quantized as compressed-tensors W4A16 group_size=32 asymmetric. The fork's existing ``TurboMindAsymLinearKernel`` (CT pack -> AWQ pack -> awq_sm70_prepare / awq_gemm_sm70) handles this format end-to-end -- so the patches in 1Cat-vLLM PR 1CatAI#25 ("[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading") turn out to be redundant for granite:

- PR P0 (new ``SM70TurboMindLinearKernel``): functionally a re-impl of our ``TurboMindAsymLinearKernel`` -- same scalar type, same group sizes, same CT->AWQ transcode, same kernel dispatch.
- PR P1 (``min_capability=70`` for CT WNA16): already in the fork's ``compressed_tensors_wNa16.py`` (returns 70 unconditionally).
- PR P3a (qwen3_5 weight-loader fix): only relevant to hybrid DeltaNet models like Qwen3.6-27B; granite is pure dense attention.
- PR P6 (sm70_884_4.cu tile registry expansion): not needed for granite. The asymmetric path ends up dispatching through ``Config_U4_g`` gs32 (which we have), not ``Config_U4_d`` gs32 (which we don't); shapes match and output is correct.

What does matter is the PR's operational note about ``FLASH_ATTN_V100``: with default chunked-prefill split, granite produces all-token-id-0 ("!!!!...") garbage. The PR's workaround (``--compilation-config compile_ranges_split_points:[]``) prevents that silent fallback.

With the workaround AND cudagraph engaged (no ``--enforce-eager``), local bench at TP=2 dual V100 32GB SXM2, 32-prompt -> 128-gen:

  batch=1  : 126.6 tok/s decode (single-stream)
  batch=8  : 586.8 tok/s aggregate / 73.3 per-seq
  batch=16 : eager-only sweep showed 261.7 aggregate, but cudagraph scaling not yet measured at this batch size

Eager-mode is ~3x slower (32.7 tok/s batch=1); the V4-Flash convention of ``enforce_eager=True`` does NOT carry over -- granite has none of V4-Flash's three cudagraph blockers, so capture engages cleanly. ``TRITON_ATTN`` (no compilation-config tweak required) is a slower fallback at ~26 tok/s eager batch=1.

Adds:

- ``tests/models/test_granite_v100_smoke.py``: TP=1 single-prompt math probe with selectable attention backend; was used to isolate that the original "!!!!..." failure was the FLASH_ATTN_V100 silent fallback, not the kernel.
- ``tests/models/test_granite_v100_bench.py``: TP=2 throughput bench parameterized by ``BATCH``, ``ENFORCE_EAGER``, ``PR_CHUNKED_PREFILL_FIX``, ``ATTN_BACKEND``. Has a coherence gate (``token_id == 0`` < 10% of generated tokens) so a silent fallback is detected, not confused with a perf regression.
- README "Verified models" row + a new "Quick run" section in the same docker-compose-env-var style as the Qwen3.6 / V4-Flash examples.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
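
The coherence gate described for the bench test can be sketched as a pure function over generated token ids. This is a minimal illustration of the stated rule (``token_id == 0`` in fewer than 10% of generated tokens), not the code in ``test_granite_v100_bench.py``. The function names are hypothetical, and treating an empty generation as a failure is an assumption made here.

```python
def zero_token_fraction(token_ids):
    """Fraction of generated tokens whose id is 0; empty output counts as all zeros."""
    if not token_ids:
        return 1.0
    return sum(1 for t in token_ids if t == 0) / len(token_ids)


def passes_coherence_gate(token_ids, max_zero_fraction=0.10):
    """True when strictly fewer than `max_zero_fraction` of the tokens are id 0."""
    return zero_token_fraction(token_ids) < max_zero_fraction
```

Checking the gate before recording throughput keeps an all-"!" run from being logged as a fast but otherwise valid result.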

rivetphilbot · 2026-05-19T01:40:14Z

Superseded by #45 — a clean rebuild of this series onto current main (post-1.0.0). Same patch content, rebuilt history: 6 atomic single-author commits, no whitespace/line-ending noise, +335/−23 instead of +736/−429. Closing this in favor of #45.

SM70 Patch and others added 4 commits May 3, 2026 13:20

wip: SM70 dense CT + DeltaNet baseline (inherited from prior patch at…

3f1fefa

…tempts)

humanjesse mentioned this pull request May 7, 2026

Verify cyankiwi/granite-4.1-8b-AWQ-INT4 on V100 (no patches needed) humanjesse/vllm-v100#4

Merged

4 tasks

rivetphilbot mentioned this pull request May 19, 2026

[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading #45

Open

6 tasks

rivetphilbot closed this May 19, 2026

rivetphilbot wants to merge 4 commits into 1CatAI:main from rivetphilbot:sm70-dense-ct-deltanet
