[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading #45

Open
rivetphilbot wants to merge 6 commits into 1CatAI:main from rivetphilbot:sm70-dense-ct-deltanet-v2

Summary

Adds an SM70 (V100) path for compressed-tensors dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

This branch (sm70-dense-ct-deltanet-v2) is a clean rebuild of the earlier draft #25 onto current main. It replaces #25, which was opened against main @ 197f1cc6 (pre-1.0.0) and had drifted into conflict. The patch content is unchanged; only the history is rebuilt: 6 atomic single-author commits, no whitespace/line-ending noise, and the dead supported = True hack plus its revert cancelled out.

+335/−23 across 7 files (was +736/−429 across 8). Per-file CRLF preserved to match main.

Patches in this branch

| Commit | What it does |
|---|---|
| [SM70] Add V100 dense WNA16 TurboMind linear kernel | New SM70TurboMindLinearKernel for dense compressed-tensors WNA16: transcodes CT pack format → AWQ pack format and dispatches through the existing awq_sm70_prepare / awq_gemm_sm70 TurboMind kernels. Registers in _POSSIBLE_KERNELS[CUDA]. |
| [SM70] Admit V100 (CC 7.0) in CompressedTensorsWNA16 | Override the hardcoded min_capability=75 when an SM70-capable kernel is registered, so the kernel selector handles availability rather than rejecting up-front. |
| [SM70] Wire compressed-tensors MoE decode buffers for V100 | SM70 CT MoE batched-GEMM decode buffers. Also fixes a live dangling import on main: compressed_tensors_moe.py imports _DEFAULT_MAX_TOKENS, which awq_sm70_moe.py renamed to _DEFAULT_PERSISTENT_MAX_TOKENS. |
| [Qwen3] Keep router gate / split GDN projections unquantized under CT | CT exposes its skip list as .ignore; re-expressed as a clean fallback on main's renamed _uses_split_gdn_input_projections. |
| [Qwen3.5] Skip tuple-shard split for non-output-dim CT params | CT WNA16 creates auxiliary tensors (weight_shape, weight_g_idx) that lack output_dim; the legacy weight_loader_with_alias path crashes on them. Routes those through the standard weight_loader. |
| [SM70] Sync sm70_884_4.cu kernel registry to lmdeploy main (gs32) | Adds 21 new Config_U4_d gs32 tile sizes, expands Config_U4_g gs32 from 6→17 tiles, and drops the gs64 block. Registry-only changes; mainloop/iterator/scheduler unchanged. |

Validated on V100 dual-TP

Models confirmed loading & generating correctly:

  • cyankiwi/Qwen3.6-27B-AWQ-INT4 (hybrid DeltaNet + attention dense, 16/64 attention layers)
  • cyankiwi/granite-4.1-8b-AWQ-INT4 (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% of time in turbomind::gemm), which points to the V100 HBM-bandwidth ceiling for 27B-class dense models rather than a limitation of this patch.

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):

  • RealWorldQA: 75.42% (Qwen published: 84.1%)
  • MMMU (full): 52.33% raw / ~62-69% methodology-adjusted (Qwen published: 82.9%)
  • Wrong-answer audit: residual gap is plausible reasoning errors + truncation + extraction parser limits, not kernel garbage. Detailed audit available on request.

What this enables

Today, 1Cat-vLLM SM70 support covers:

  1. Legacy awq format (works, via AWQLinearMethod)
  2. compressed-tensors MoE (works, via CompressedTensorsSM70WNA16MoEMethod from prior MoE work)

This patch series adds the missing third corner: compressed-tensors dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers

  • NCCL package conflict. When nvidia-nccl-cu13 shadows nvidia-nccl-cu12 in cu128 builds, the first all_reduce fails with "unhandled cuda error". Fix: ensure only nvidia-nccl-cu12 is installed.
  • Chunked prefill. Decode benefits from --compilation-config compile_ranges_split_points:[] (disables chunked prefill split) when used with FLASH_ATTN_V100; without this we observed silent fallback. Likely a separate issue worth its own investigation.
  • Overlap with main's 1.0.0 SM70 work. main's 1.0.0 carries its own dense-SM70 helpers in qwen3_5.py (_parse_sm70_moe_dense_allowlist, _mark_default_sm70_dense_modules, _sm70_f16_force_enable, etc.). These sit on different code paths and don't conflict with this series, but maintainers may have opinions on consolidating the two.
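
For the NCCL package conflict in the first note, a pre-flight check along these lines may help reviewers reproduce a clean environment. This is a sketch only: it uses the standard-library importlib.metadata lookup, and the helper names (installed_nccl_wheels, has_nccl_conflict) are illustrative, not part of this patch.

```python
from importlib import metadata

# Wheel names taken from the note above; cu12 is the one expected on cu128 builds.
EXPECTED = "nvidia-nccl-cu12"
CANDIDATES = (EXPECTED, "nvidia-nccl-cu13")


def installed_nccl_wheels(candidates=CANDIDATES, version_of=metadata.version):
    """Return {wheel name: version} for every candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            pass
    return found


def has_nccl_conflict(found, expected=EXPECTED):
    """True when any NCCL wheel other than the expected one is present."""
    return any(name != expected for name in found)


if __name__ == "__main__":
    wheels = installed_nccl_wheels()
    print("NCCL wheels:", wheels or "none found")
    if has_nccl_conflict(wheels):
        print("Conflict: uninstall everything except", EXPECTED)
```

The version_of parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.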

Test plan

  • Boot Qwen3.6-27B-AWQ-INT4 on dual V100-32GB TP=2 with FLASH_ATTN_V100 backend
  • Boot Granite-4.1-8B-AWQ-INT4 on the same hardware (independent validation)
  • Smoke-test text reasoning, code, factual recall at temp=0
  • Smoke-test multimodal (image input → vision tower → LLM)
  • Operator-level profile to confirm no kernel-correctness regression
  • Full RealWorldQA (765q) + MMMU (900q) academic comparison vs Qwen-published

Supersedes #25.

Add SM70TurboMindLinearKernel, an MPLinearKernel implementation that
routes compressed-tensors / AWQ WNA16 dense GEMMs through the bundled
TurboMind sm70_884_4 INT4 path. V100 (CC 7.0) has only first-gen FP16
WMMA cores and no Turing INT4 tensor-core GEMM, so the stock CUTLASS /
Machete kernels are unavailable; this kernel gives dense WNA16 layers a
working code path on SM70.
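
A minimal sketch of the CT → AWQ transcoding idea, written in pure Python on integer lists rather than tensors. It assumes the commonly documented layouts: compressed-tensors packs eight 4-bit values sequentially into each 32-bit word along the input dimension, while AWQ packs along the output dimension using the nibble order 0, 2, 4, 6, 1, 3, 5, 7. The function names are illustrative and are not the ones used in this patch; scales, zero points and signed-offset handling are left out.

```python
# Assumed AWQ nibble order: slot i of a packed word holds element AWQ_ORDER[i].
AWQ_ORDER = (0, 2, 4, 6, 1, 3, 5, 7)


def pack_sequential(vals):
    """Pack eight 4-bit values, element i in bits [4*i, 4*i + 4)."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xF) << (4 * i)
    return word


def unpack_sequential(word):
    """Inverse of pack_sequential."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def pack_awq(vals):
    """Pack eight 4-bit values using the interleaved AWQ order."""
    word = 0
    for slot, src in enumerate(AWQ_ORDER):
        word |= (vals[src] & 0xF) << (4 * slot)
    return word


def unpack_awq(word):
    """Inverse of pack_awq."""
    vals = [0] * 8
    for slot, src in enumerate(AWQ_ORDER):
        vals[src] = (word >> (4 * slot)) & 0xF
    return vals


def ct_to_awq(ct_packed):
    """Transcode [out][in // 8] sequential words into [in][out // 8] AWQ words."""
    # Step 1: unpack to a dense [out][in] matrix of 4-bit values.
    dense = [[v for word in row for v in unpack_sequential(word)]
             for row in ct_packed]
    out_features, in_features = len(dense), len(dense[0])
    if out_features % 8:
        raise ValueError("out_features must be a multiple of 8")
    # Step 2: walk the matrix column by column (a transpose) and repack
    # each group of eight output channels in AWQ order.
    awq = []
    for k in range(in_features):
        column = [dense[n][k] for n in range(out_features)]
        awq.append([pack_awq(column[j:j + 8])
                    for j in range(0, out_features, 8)])
    return awq
```

Doing the transcode once at load time keeps the GEMM call itself unchanged, which is consistent with the commit reusing the existing TurboMind SM70 INT4 path.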

Register it at the head of the CUDA _POSSIBLE_KERNELS priority list so
it is preferred when running on V100; on newer architectures the
existing kernels still win their min-capability checks.

CompressedTensorsWNA16.get_min_capability hard-coded 75, so loading a
compressed-tensors WNA16 model on a V100 failed with 'Failed to find a
kernel that can implement the WNA16 linear layer' before the new
SM70TurboMindLinearKernel ever got a chance to bid.

Lower the reported minimum to 70 specifically when running on an SM70
device (CC 7.0). Older pre-Turing GPUs (sm_60/61/62) still get 75 and
remain correctly rejected, since only V100 has the FP16 WMMA path the
TurboMind kernel relies on.
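
The interaction between the scheme-level capability gate and the kernel priority list can be sketched as follows. This is a simplified model, not vLLM's actual selector: KernelSpec, only_capability and the two "Hypothetical..." kernels are invented for illustration, and the assumption that the SM70 kernel accepts only CC 7.0 is one way to explain why newer GPUs still fall through to their usual kernels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical stand-in for a kernel class in the priority list."""
    name: str
    min_capability: int
    only_capability: Optional[int] = None  # assumption: SM70 kernel is V100-only


# Priority order: the SM70 kernel sits at the head of the list.
POSSIBLE_KERNELS = (
    KernelSpec("SM70TurboMindLinearKernel", 70, only_capability=70),
    KernelSpec("HypotheticalSM90Kernel", 90),
    KernelSpec("HypotheticalSM75Kernel", 75),
)


def scheme_min_capability(device_capability):
    """Report 70 only on an SM70 device; everything else keeps the old 75."""
    return 70 if device_capability == 70 else 75


def choose_kernel(device_capability, kernels=POSSIBLE_KERNELS):
    """Apply the scheme gate first, then take the first kernel that fits."""
    if device_capability < scheme_min_capability(device_capability):
        raise ValueError(
            "device capability %d is below the scheme minimum" % device_capability)
    for kernel in kernels:
        if kernel.only_capability not in (None, device_capability):
            continue  # kernel is pinned to a different architecture
        if device_capability >= kernel.min_capability:
            return kernel.name
    raise ValueError(
        "Failed to find a kernel that can implement the WNA16 linear layer")
```

With a gate that always returned 75, choose_kernel(70) would be rejected before the loop runs, which mirrors the failure described in the commit message.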

CompressedTensorsSM70WNA16MoEMethod delegates its decode to
AWQSM70MoEMethod.apply(), but only allocated a subset of the buffers
that path reads, so a CT-quantized MoE model crashed on the first
decode step on V100.

- Allocate the full buffer set that
  AWQSM70MoEMethod.process_weights_after_loading creates: gate/up and
  permutation scratch, sorted-output and m-index buffers, int64 expert
  offsets, and the single-token batched pointer buffers.
- Publish sm70_hidden/intermediate logical+aligned sizes (CT weights
  are already in TurboMind layout, so logical == aligned).
- Build per-expert StridedPtr row views and record sm70_ptr_row_bytes
  for the batched GEMM path.
- Pass interleave_gated_silu=True to awq_sm70_prepare so the fused
  gate/up weights match the decode kernel's expectation.

Also switch the import to _DEFAULT_PERSISTENT_MAX_TOKENS; awq_sm70_moe
renamed _DEFAULT_MAX_TOKENS, leaving the old name a dangling import.

…compressed-tensors

Two related fixes for running Qwen3.5/3.6 compressed-tensors checkpoints:

- Qwen3NextSparseMoeBlock: the MoE router gate is stored as bf16 in the
  checkpoint and has no quantized form. Passing the model quant_config
  to its ReplicatedLinear made the loader expect quantized weights;
  force quant_config=None so the gate stays bf16.

- _uses_split_gdn_input_projections only inspected modules_to_not_convert
  and ignored_layers. Compressed-tensors records its skip list under the
  ignore attribute, so the BF16 in_proj_a / in_proj_b GDN projections of
  a CT checkpoint were not detected and the split-projection layout was
  not selected. Consult quant_config.ignore as a final fallback.
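
The skip-list fallback in the second fix can be sketched as below. The attribute names modules_to_not_convert, ignored_layers and ignore come from the commit message; the function bodies, the substring matching rule, and the example layer names are assumptions made for illustration and may differ from the real _uses_split_gdn_input_projections.

```python
from types import SimpleNamespace

# Checked in order; "ignore" is the compressed-tensors spelling and comes last.
SKIP_LIST_ATTRS = ("modules_to_not_convert", "ignored_layers", "ignore")


def quant_skip_list(quant_config):
    """Return the first non-empty skip list found on the quant config."""
    if quant_config is None:
        return []
    for attr in SKIP_LIST_ATTRS:
        value = getattr(quant_config, attr, None)
        if value:
            return list(value)
    return []


def uses_split_gdn_input_projections(quant_config):
    """True when the skip list names the unquantized GDN input projections."""
    skipped = quant_skip_list(quant_config)
    # Assumption: a plain substring match is enough to spot these modules.
    return any("in_proj_a" in name or "in_proj_b" in name for name in skipped)


# Hypothetical compressed-tensors style config: only `.ignore` is populated.
ct_config = SimpleNamespace(
    ignore=["lm_head", "model.layers.0.linear_attn.in_proj_a"])
print(uses_split_gdn_input_projections(ct_config))
```

Treating ignore as the last fallback leaves behaviour for the other quantization configs untouched, since their attributes are still consulted first.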

CompressedTensorsWNA16 creates auxiliary parameters -- weight_shape
(BasevLLMParameter) and weight_g_idx (RowvLLMParameter) -- that hold
metadata or input-dim-sharded indices rather than output-dim weight
data, so they have no output_dim attribute.

When the qkvz stacked-load mapping in Qwen3_5Model.load_weights reached
one of these via the tuple-shard path, it hit AttributeError on
param.output_dim. Fix: when output_dim is absent, load through the
standard non-shard weight_loader (last-write-wins for replicated
metadata) and break out of the sub-id loop. The companion debug log
now formats output_dim with %s / getattr default so it tolerates the
missing attribute too.
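
The guard described above can be illustrated with a small stand-alone model. FakeParam, its two loader methods, and the shard ids are hypothetical stand-ins; only the getattr-with-default check, the fallback to the plain loader, and the early break reflect what the commit message describes.

```python
import logging

logger = logging.getLogger(__name__)


class FakeParam:
    """Hypothetical stand-in for a vLLM parameter; records loader calls."""

    def __init__(self, name, output_dim=None):
        self.name = name
        self.calls = []
        if output_dim is not None:
            # Metadata params such as weight_shape never get this attribute.
            self.output_dim = output_dim

    def weight_loader(self, loaded_weight):
        self.calls.append(("whole", None))

    def weight_loader_with_alias(self, loaded_weight, shard_id):
        self.calls.append(("shard", shard_id))


def load_stacked(param, loaded_weight, shard_ids):
    """Load each shard, falling back to one plain load when output_dim is absent."""
    for shard_id in shard_ids:
        output_dim = getattr(param, "output_dim", None)
        # %s tolerates None, so the log line no longer raises.
        logger.debug("loading %s shard=%s output_dim=%s",
                     param.name, shard_id, output_dim)
        if output_dim is None:
            # Replicated metadata: load once through the standard loader.
            param.weight_loader(loaded_weight)
            break
        param.weight_loader_with_alias(loaded_weight, shard_id)
    return param.calls
```

A parameter with output_dim set is loaded once per shard id, while weight_shape style metadata is loaded a single time and the loop exits.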

Replace the fork's vendored sm70_884_4.cu tile registry (lmdeploy
v0.12.1) with the upstream lmdeploy main version (commit e5fbd4da,
from PR #4429 'fully implement compressed-tensors gs32 support').
mainloop_sm70.h, iterator_sm70.h and scheduler_sm70.cuh are
byte-identical between the two snapshots -- only the
Registry::sm70_884_4 tile-config list changed.

- Add a Config_U4_d<kColMajor> block with 21 gs32 tiles (the fork
  carried none for this layout).
- Expand the Config_U4_g<kColMajor> gs32 block from 6 to 17 tiles.
- Drop the gs64 block; both deployed quants (qwen3.6-27b-int4,
  granite-4.1-8b-awq-int4) are gs32.

Decode on V100 TP=2 is ~83% turbomind::gemm; the autotuner had no
gs32 candidates in the most common kColMajor layout, forcing fallback
tiles. Net diff +45 -11. Requires a full _C extension rebuild since
kernel registration is statically linked.