[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading#25

Closed
rivetphilbot wants to merge 4 commits into 1CatAI:main from rivetphilbot:sm70-dense-ct-deltanet

Conversation

@rivetphilbot

Summary

Adds an SM70 (V100) path for compressed-tensors dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

Status: DRAFT. Branch is pushed for visibility and discussion. Author identities + commit organization will be cleaned up before this is marked ready for review (see "Known issues" below). Patches themselves are in working order.

Patches in this branch

| Patch | File | What it does |
|---|---|---|
| P0 (in wip commit) | vllm/model_executor/layers/quantization/kernels/mixed_precision/sm70_turbomind.py (new, 193 LoC) | New SM70TurboMindLinearKernel for dense compressed-tensors WNA16. Transcodes CT pack format → AWQ pack format and dispatches through the existing awq_sm70_prepare / awq_gemm_sm70 TurboMind kernels. Registers in _POSSIBLE_KERNELS[CUDA]. |
| P1 (e9a7f8a) | compressed_tensors_wNa16.py | Allow CompressedTensorsWNA16 schemes on SM70 by overriding the hardcoded min_capability=75 when an SM70-capable kernel is registered. Lets the kernel selector handle availability rather than reject up-front. |
| P3a (f46eec7) | vllm/model_executor/models/qwen3_5.py:512-543 | Skip the tuple-shard split for non-output_dim params. CT WNA16 creates auxiliary tensors (weight_shape, weight_g_idx) that lack output_dim; the legacy weight_loader_with_alias path crashes on them. Routes those through the standard weight_loader and breaks the sub-id loop. |
| P6 (5de6ca5) | src/turbomind/kernels/gemm/kernel/sm70_884_4.cu | Sync to lmdeploy main: adds 21 new Config_U4_d gs32 tile sizes and expands Config_U4_g gs32 from 6→17 tiles. Pure registry additions; mainloop/iterator/scheduler unchanged. |
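
The repacking idea behind P0 can be illustrated with a minimal pure-Python sketch. It rests on two assumptions that are not stated in this PR: that compressed-tensors stores eight 4-bit values per 32-bit word in sequential nibble order along the input dimension, and that the AWQ layout packs along the output dimension with the interleaved nibble order [0, 2, 4, 6, 1, 3, 5, 7]. All function names are hypothetical. This is not the code in sm70_turbomind.py, which works on tensors and also has to handle scales and zero points.

```python
# Illustrative only: hypothetical helper names, plain lists instead of tensors.
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed AWQ nibble interleave


def unpack_sequential(word):
    """Split one 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def pack_awq(values):
    """Pack eight 4-bit values into one word using the AWQ interleave."""
    word = 0
    for slot, src in enumerate(AWQ_ORDER):
        word |= (values[src] & 0xF) << (4 * slot)
    return word


def unpack_awq(word):
    """Inverse of pack_awq, used here only to check the round trip."""
    values = [0] * 8
    for slot, src in enumerate(AWQ_ORDER):
        values[src] = (word >> (4 * slot)) & 0xF
    return values


def ct_to_awq(ct_words):
    """Transcode [out][in // 8] sequential words into [in][out // 8] AWQ words."""
    # Step 1: unpack every row into its individual 4-bit values -> [out][in].
    dense = [[v for w in row for v in unpack_sequential(w)] for row in ct_words]
    out_features, in_features = len(dense), len(dense[0])
    assert out_features % 8 == 0, "output dim must be a multiple of the pack factor"
    # Step 2: walk the transposed matrix and repack along the output dim.
    awq = []
    for k in range(in_features):
        column = [dense[o][k] for o in range(out_features)]
        awq.append([pack_awq(column[c:c + 8]) for c in range(0, out_features, 8)])
    return awq
```

The transpose is the important part of the sketch: under the assumed layouts the two formats pack along different dimensions, so a pure nibble permutation within each word would not be enough.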

Validated on V100 dual-TP

Models confirmed loading & generating correctly:

  • cyankiwi/Qwen3.6-27B-AWQ-INT4 (hybrid DeltaNet + attention dense, 16/64 attention layers)
  • cyankiwi/granite-4.1-8b-AWQ-INT4 (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% in turbomind::gemm) — this is the V100 HBM-bandwidth ceiling for 27B-class dense, not a patch-related limitation.
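
A back-of-envelope check of the bandwidth argument, as a sketch under stated assumptions: decode reads every loaded weight byte exactly once per token, the per-GPU weight footprint is the 11.28 GiB reported at model load in the P3a commit message, and V100 SXM2 peak HBM2 bandwidth is NVIDIA's nominal 900 GB/s (not measured here). The function name is hypothetical, and a real ceiling is lower because peak bandwidth is not achievable in practice and KV-cache reads, activations, and all_reduce add traffic.

```python
GIB = 2**30


def decode_ceiling_tok_s(weight_bytes_per_gpu, hbm_bytes_per_s):
    """Idealized upper bound on tokens/s if each token streams all weights once."""
    return hbm_bytes_per_s / weight_bytes_per_gpu


weights = 11.28 * GIB   # per-GPU footprint reported at load time
peak_bw = 900e9         # assumed nominal V100 HBM2 bandwidth, bytes/s
observed = 40.0         # reported steady-state decode for Qwen3.6-27B, tok/s

ceiling = decode_ceiling_tok_s(weights, peak_bw)
print(f"idealized ceiling ~{ceiling:.0f} tok/s, observed/ceiling = {observed / ceiling:.2f}")
```

Under these assumptions the reported ~40 tok/s sits at roughly half of the idealized bound, which is in the range one would expect from a memory-bound decode rather than from a gross kernel inefficiency.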

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):

  • RealWorldQA: 75.42% (Qwen published: 84.1)
  • MMMU (full): 52.33% raw / ~62-69% methodology-adjusted (Qwen published: 82.9)
  • Wrong-answer audit: residual gap is plausible reasoning errors + truncation + extraction parser limits, not kernel garbage. Detailed audit available on request.
  • A 40-probe head-to-head vs Qwen3.6-35B-A3B (the previously-validated MoE model) on this build showed 27B-INT4 winning decisively on tool-call structure (88% vs 62%) and matching or slightly trailing on reasoning categories at expected per-active-param ratios.

What this enables

Today, 1Cat-vLLM SM70 support covers:

  1. Legacy awq format (works, via AWQLinearMethod)
  2. compressed-tensors MoE (works, via CompressedTensorsSM70WNA16MoEMethod from prior MoE work)

This patch series adds the missing third corner: compressed-tensors dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers

  • NCCL package conflict. In cu128 builds, nvidia-nccl-cu13 shadowing nvidia-nccl-cu12 causes an "unhandled cuda error" at the first all_reduce. Fix: ensure only nvidia-nccl-cu12 is installed. We hit this twice during patch development; if you see the same crash, this is almost certainly the cause.
  • Chunked prefill. Decode benefits from --compilation-config compile_ranges_split_points:[] (disables chunked prefill split) when used with FLASH_ATTN_V100; without this we observed silent fallback. Likely a separate issue worth its own investigation.
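
The NCCL conflict above can be detected before launch with a small check, sketched here with the standard library only. The helper names are hypothetical; the two wheel names are the ones mentioned in the note, and the check only reports what is installed rather than fixing anything.

```python
from importlib import metadata


def installed_nccl_wheels(candidates=("nvidia-nccl-cu12", "nvidia-nccl-cu13")):
    """Return {wheel name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this wheel is not installed in the current environment
    return found


def nccl_conflict(found):
    """True when more than one NCCL wheel is installed side by side."""
    return len(found) > 1


found = installed_nccl_wheels()
if nccl_conflict(found):
    print("Multiple NCCL wheels installed:", found)
else:
    print("NCCL wheels found:", found or "none")
```

Running this inside the serving environment before starting the engine surfaces the conflict as a readable message instead of a crash at the first all_reduce.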

Known issues — to clean up before marking ready

  • The wip commit (3f1fefa) contains ~806 lines of mostly-whitespace diff against compressed_tensors_moe.py and qwen3_next.py. The substantive content (~83 lines) is the P0 kernel + DeltaNet param routing. This will be split into proper P0 (kernel + register) and P3a-prelim commits, with whitespace noise dropped.
  • Author identities are inconsistent across commits (Phil <phil@pve3>, phil <phil@rivetos.local>, SM70 Patch <sm70-patch@local>). Will normalize to a single identity before ready-for-review.
  • P1 currently reverts a sledgehammer hack from the wip commit; in the cleaned version the sledgehammer never appears.

Happy to take feedback on the patch shapes themselves before doing the cleanup pass — I'd rather restructure once based on review than do it twice.

Test plan

  • Boot Qwen3.6-27B-AWQ-INT4 on dual V100-32GB TP=2 with FLASH_ATTN_V100 backend
  • Boot Granite-4.1-8B-AWQ-INT4 on the same hardware (independent validation of P0/P1)
  • Smoke-test text reasoning, code, factual recall at temp=0
  • Smoke-test multimodal (image input → vision tower → LLM): vision tower excluded from quantization, runs in BF16 native attention, color-identification probe correct
  • Operator-level profile to confirm no kernel-correctness regression
  • Full RealWorldQA (765q) + MMMU (900q) academic comparison vs Qwen-published
  • Head-to-head vs Qwen3.6-35B-A3B on this build, 40-probe agentic workload set

SM70 Patch and others added 4 commits May 3, 2026 13:20
- Revert sledgehammer in compressed_tensors.py:345 — restore proper
  capability-vs-min_capability gate.
- Add SM70 carve-out in compressed_tensors_wNa16.get_min_capability:
  return 70 on V100 (CC 7.0), 75 elsewhere.
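
The carve-out described above amounts to the following gate, shown as a minimal sketch. The function names are hypothetical and this is not the body of get_min_capability; it assumes compute capability is encoded as major * 10 + minor, matching the 70 / 75 values used in this commit message.

```python
def wna16_min_capability(device_capability: int) -> int:
    """Return 70 on V100 (CC 7.0), 75 elsewhere, as the commit describes."""
    return 70 if device_capability == 70 else 75


def scheme_admitted(device_capability: int) -> bool:
    """Restored gate: the device must meet the scheme's minimum capability."""
    return device_capability >= wna16_min_capability(device_capability)


for cc in (61, 70, 75, 80):
    print(cc, scheme_admitted(cc))
```

Keeping the comparison in place and only changing the minimum for CC 7.0 means older devices are still rejected up front, while V100 proceeds to kernel selection.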

Validates: Qwen3.6-27B-AWQ-INT4 now boots through engine init and TP=2
NCCL all_reduce, advances to weight-loading where it fails at
qwen3_5.py:531 with BasevLLMParameter.output_dim — meaning admittance
is unblocked and we hit the next layer (DeltaNet merged-projection
parameter setup, work for P3a).

CT WNA16 creates auxiliary BasevLLMParameter (weight_shape) and
RowvLLMParameter (weight_g_idx) that have no output_dim, since they
hold metadata or input-dim-sharded indices rather than output-dim
weight data. The tuple-shard load path in qwen3_5.py:load_weights
hit AttributeError when the qkvz stacked-load mapping reached one
of these auxiliary tensors.

Fix: when output_dim is missing, fall through to the standard
non-shard weight_loader (last-write-wins for replicated metadata)
and break out of the sub-id loop.
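
The control flow of this fix can be sketched as follows. Param is a hypothetical stand-in for a vLLM parameter and load_stacked is a hypothetical name for the tuple-shard loop; the real code in qwen3_5.py differs in its details, and only the "no output_dim, so use the standard loader and break" shape is taken from the description above.

```python
class Param:
    """Hypothetical stand-in: records how its weight_loader was called."""

    def __init__(self, name, output_dim=None):
        self.name = name
        self.calls = []
        if output_dim is not None:
            self.output_dim = output_dim  # auxiliary tensors never get this attribute

    def weight_loader(self, loaded_weight, shard_id=None):
        self.calls.append(shard_id)


def load_stacked(param, loaded_weight, shard_ids):
    """Load one checkpoint tensor into a stacked parameter, shard by shard."""
    for shard_id in shard_ids:
        if not hasattr(param, "output_dim"):
            # Metadata such as weight_shape cannot be split along an output
            # dim: load it once through the standard path and stop iterating.
            param.weight_loader(loaded_weight)
            break
        param.weight_loader(loaded_weight, shard_id)


packed = Param("weight_packed", output_dim=0)
aux = Param("weight_shape")
load_stacked(packed, None, (0, 1, 2))
load_stacked(aux, None, (0, 1, 2))
print(packed.calls, aux.calls)
```

The break matters because the auxiliary tensor is replicated: loading it once per sub-id would only repeat the same write.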

Validated: Qwen3.6-27B-AWQ-INT4 now completes model loading on
V100 (11.28 GiB / GPU) and runs through 90s of memory profiling
forward passes. Remaining failure is KV-cache budget tuning only.

Replace 1Cat fork's vendored sm70_884_4.cu (lmdeploy v0.12.1) with the
upstream lmdeploy main version (commit e5fbd4da). Pure registration
changes -- mainloop_sm70.h, iterator_sm70.h, scheduler_sm70.cuh are
byte-identical between the two.

Source: lmdeploy PR #4429 'fully implement compressed-tensors gs32
support' (Mar 23, commit c8eaf3b2).

Net diff: +45 -11 lines.

Adds:
- New Config_U4_d<kColMajor> block with 21 gs32 tile sizes
  (1Cat had ZERO Config_U4_d gs32 tiles)
- Expanded Config_U4_g<kColMajor> gs32 block: 17 tiles (vs 6 in 1Cat)

Drops:
- gs64 tile block (1Cat carried it; both deployed quants are gs32)

Cyankiwi quants (qwen3.6-27b-int4, granite-4.1-8b-awq-int4) are gs32
-- both directly affected. More tile candidates = better autotune
match for actual decode shapes (M=2-4 batched at TP=2).

Profile basis: turbomind::gemm = 83% of decode GPU time on V100 dual
TP=2 serving Qwen3.6-27B; recurrent GDN <1%, attention <1%. The GEMM
autotuner has no gs32 candidates in the most common kColMajor layout
in the 1Cat snapshot, forcing fallback paths.

Build: full _C.abi3.so rebuild required (kernel registration is
statically linked). TORCH_CUDA_ARCH_LIST=7.0 for V100-only build.
@github-actions

github-actions Bot commented May 4, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

humanjesse added a commit to humanjesse/vllm-v100 that referenced this pull request May 7, 2026
… patches needed (#4)

cyankiwi/granite-4.1-8b-AWQ-INT4 is dense ``GraniteForCausalLM`` quantized
as compressed-tensors W4A16 group_size=32 asymmetric. The fork's existing
``TurboMindAsymLinearKernel`` (CT pack -> AWQ pack -> awq_sm70_prepare /
awq_gemm_sm70) handles this format end-to-end -- so the patches in
1Cat-vLLM PR 1CatAI#25 ("[V100/SM70] Add compressed-tensors dense WNA16 path
+ DeltaNet weight loading") turn out to be redundant for granite:

  - PR P0 (new ``SM70TurboMindLinearKernel``): functionally a re-impl of
    our ``TurboMindAsymLinearKernel`` -- same scalar type, same group
    sizes, same CT->AWQ transcode, same kernel dispatch.
  - PR P1 (``min_capability=70`` for CT WNA16): already in the fork's
    ``compressed_tensors_wNa16.py`` (returns 70 unconditionally).
  - PR P3a (qwen3_5 weight-loader fix): only relevant to hybrid DeltaNet
    models like Qwen3.6-27B; granite is pure dense attention.
  - PR P6 (sm70_884_4.cu tile registry expansion): not needed for
    granite. The asymmetric path ends up dispatching through
    ``Config_U4_g`` gs32 (which we have), not ``Config_U4_d`` gs32
    (which we don't); shapes match and output is correct.

What does matter is the PR's operational note about ``FLASH_ATTN_V100``:
with default chunked-prefill split, granite produces all-token-id-0
("!!!!...") garbage. The PR's workaround
(``--compilation-config compile_ranges_split_points:[]``) prevents that
silent fallback. With the workaround AND cudagraph engaged (no
``--enforce-eager``), local bench at TP=2 dual V100 32GB SXM2,
32-prompt -> 128-gen:

  batch=1  : 126.6 tok/s decode  (single-stream)
  batch=8  : 586.8 tok/s aggregate / 73.3 per-seq
  batch=16 : eager-only sweep showed 261.7 aggregate, but cudagraph
             scaling not yet measured at this batch size

Eager-mode is ~3x slower (32.7 tok/s batch=1); the V4-Flash convention
of ``enforce_eager=True`` does NOT carry over -- granite has none of
V4-Flash's three cudagraph blockers, so capture engages cleanly.
``TRITON_ATTN`` (no compilation-config tweak required) is a slower
fallback at ~26 tok/s eager batch=1.

Adds:
  - ``tests/models/test_granite_v100_smoke.py``: TP=1 single-prompt
    math probe with selectable attention backend; was used to isolate
    that the original "!!!!..." failure was the FLASH_ATTN_V100 silent
    fallback, not the kernel.
  - ``tests/models/test_granite_v100_bench.py``: TP=2 throughput bench
    parameterized by ``BATCH``, ``ENFORCE_EAGER``,
    ``PR_CHUNKED_PREFILL_FIX``, ``ATTN_BACKEND``. Has a coherence
    gate (``token_id == 0`` < 10% of generated tokens) so a silent
    fallback is detected, not confused with a perf regression.
  - README "Verified models" row + a new "Quick run" section in the
    same docker-compose-env-var style as the Qwen3.6 / V4-Flash
    examples.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
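
The coherence gate described in that commit (``token_id == 0`` in fewer than 10% of generated tokens) can be sketched as below. The function name and the handling of an empty generation are assumptions made for illustration, not taken from tests/models/test_granite_v100_bench.py.

```python
def is_coherent(token_ids, bad_token_id=0, max_bad_fraction=0.10):
    """Return True when the share of bad_token_id stays below the threshold."""
    if not token_ids:
        return False  # assumption: an empty generation counts as a failure
    bad = sum(1 for t in token_ids if t == bad_token_id)
    return bad / len(token_ids) < max_bad_fraction


garbage = [0] * 128                # the "!!!!..." failure mode: every token is id 0
healthy = list(range(1, 129))      # no id-0 tokens at all
print(is_coherent(garbage), is_coherent(healthy))
```

Gating on output content rather than on throughput alone is what lets a benchmark tell a silent backend fallback apart from an ordinary performance regression.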
@rivetphilbot
Author

Superseded by #45 — a clean rebuild of this series onto current main (post-1.0.0). Same patch content, rebuilt history: 6 atomic single-author commits, no whitespace/line-ending noise, +335/−23 instead of +736/−429. Closing this in favor of #45.
