[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading by rivetphilbot · Pull Request #45 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-19T01:40:09Z

Summary

Adds an SM70 (V100) path for compressed-tensors dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

This branch (sm70-dense-ct-deltanet-v2) is a clean rebuild of the earlier draft #25 onto current main. It replaces #25, which was opened against main @ 197f1cc6 (pre-1.0.0) and had drifted into conflict. The patch content is unchanged; only the history is rebuilt: 6 atomic single-author commits, no whitespace or line-ending noise, and the dead `supported = True` hack dropped together with its revert, since the two cancelled out.

+335/−23 across 7 files (was +736/−429 across 8). Per-file CRLF preserved to match main.

Patches in this branch

| Commit | What it does |
| --- | --- |
| `[SM70]` Add V100 dense WNA16 TurboMind linear kernel | New `SM70TurboMindLinearKernel` for dense compressed-tensors WNA16: transcodes CT pack format → AWQ pack format and dispatches through the existing `awq_sm70_prepare` / `awq_gemm_sm70` TurboMind kernels. Registers in `_POSSIBLE_KERNELS[CUDA]`. |
| `[SM70]` Admit V100 (CC 7.0) in CompressedTensorsWNA16 | Override the hardcoded `min_capability=75` when an SM70-capable kernel is registered, so the kernel selector handles availability rather than rejecting up-front. |
| `[SM70]` Wire compressed-tensors MoE decode buffers for V100 | SM70 CT MoE batched-GEMM decode buffers. Also fixes a live dangling import on `main`: `compressed_tensors_moe.py` imports `_DEFAULT_MAX_TOKENS`, which `awq_sm70_moe.py` renamed to `_DEFAULT_PERSISTENT_MAX_TOKENS`. |
| `[Qwen3]` Keep router gate / split GDN projections unquantized under CT | CT exposes its skip list as `.ignore`; re-expressed as a clean fallback on `main`'s renamed `_uses_split_gdn_input_projections`. |
| `[Qwen3.5]` Skip tuple-shard split for non-output-dim CT params | CT WNA16 creates auxiliary tensors (`weight_shape`, `weight_g_idx`) that lack `output_dim`; the legacy `weight_loader_with_alias` path crashes on them. Routes those through the standard `weight_loader`. |
| `[SM70]` Sync `sm70_884_4.cu` kernel registry to lmdeploy main (gs32) | Adds 21 new `Config_U4_d` gs32 tile sizes and expands `Config_U4_g` gs32 from 6→17 tiles. Pure registry additions; mainloop/iterator/scheduler unchanged. |
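
For reviewers unfamiliar with the pack formats, here is a rough sketch of what "transcoding" one packed 32-bit word could look like. This is not the code in the patch. The two nibble orders are assumptions (sequential for CT, the interleaved `[0, 2, 4, 6, 1, 3, 5, 7]` order commonly used by AWQ packers), the helper names are hypothetical, and any transpose, zero-point, or scale handling the real kernel performs is omitted.

```python
# Sketch only: nibble orders below are assumptions, not read from the patch.
CT_ORDER = [0, 1, 2, 3, 4, 5, 6, 7]   # assumed: values packed sequentially, low nibble first
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed: interleaved order used by AWQ packers


def unpack_int4(word: int, order: list[int]) -> list[int]:
    """Split one 32-bit word into eight 4-bit values in logical order."""
    values = [0] * 8
    for slot, index in enumerate(order):
        # Nibble `slot` of the word holds the logical value `index`.
        values[index] = (word >> (4 * slot)) & 0xF
    return values


def pack_int4(values: list[int], order: list[int]) -> int:
    """Pack eight 4-bit values into one 32-bit word using `order`."""
    word = 0
    for slot, index in enumerate(order):
        word |= (values[index] & 0xF) << (4 * slot)
    return word


def ct_word_to_awq_word(word: int) -> int:
    """Re-pack a single word from the assumed CT order to the assumed AWQ order."""
    return pack_int4(unpack_int4(word, CT_ORDER), AWQ_ORDER)
```

The round trip is lossless by construction: unpacking the result with `AWQ_ORDER` returns the same eight values that went in.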

Validated on V100 dual-TP

Models confirmed loading & generating correctly:

- cyankiwi/Qwen3.6-27B-AWQ-INT4 (hybrid DeltaNet + attention dense, 16/64 attention layers)
- cyankiwi/granite-4.1-8b-AWQ-INT4 (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% in turbomind::gemm) — the V100 HBM-bandwidth ceiling for 27B-class dense, not a patch-related limitation.

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):

- RealWorldQA: 75.42% (Qwen published: 84.1)
- MMMU (full): 52.33% raw / ~62-69% methodology-adjusted (Qwen published: 82.9)
- Wrong-answer audit: residual gap is plausible reasoning errors + truncation + extraction parser limits, not kernel garbage. Detailed audit available on request.

What this enables

Today, 1Cat-vLLM SM70 support covers:

- Legacy awq format (works, via AWQLinearMethod)
- compressed-tensors MoE (works, via CompressedTensorsSM70WNA16MoEMethod from prior MoE work)

This patch series adds the missing third corner: compressed-tensors dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers

- NCCL package conflict. `nvidia-nccl-cu13` shadowing `nvidia-nccl-cu12` in cu128 builds causes `unhandled cuda error` at first all_reduce. Fix: ensure only `nvidia-nccl-cu12` is installed.
- Chunked prefill. Decode benefits from `--compilation-config compile_ranges_split_points:[]` (disables chunked prefill split) when used with FLASH_ATTN_V100; without this we observed silent fallback. Likely a separate issue worth its own investigation.
- Overlap with main's 1.0.0 SM70 work. main's 1.0.0 carries its own dense-SM70 helpers in qwen3_5.py (`_parse_sm70_moe_dense_allowlist`, `_mark_default_sm70_dense_modules`, `_sm70_f16_force_enable`, etc.). These sit on different code paths and don't conflict with this series, but maintainers may have opinions on consolidating the two.
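
To check for the NCCL conflict before launching, something like the following could be run in the target environment. It is a small sketch that only uses the standard-library `importlib.metadata`; the helper name `unexpected_nccl_wheels` is hypothetical, and the only package names it knows about are the two mentioned above.

```python
from importlib import metadata

EXPECTED_NCCL = "nvidia-nccl-cu12"  # the wheel the note above says should be the only one


def unexpected_nccl_wheels(names=None):
    """Return installed nvidia-nccl-* wheels other than the expected one.

    `names` can be passed explicitly (useful for testing); by default the
    current environment's installed distributions are inspected.
    """
    if names is None:
        names = [dist.metadata["Name"] for dist in metadata.distributions()]
    # Normalise case and separators so "nvidia_nccl_cu13" is also caught.
    normalised = {n.lower().replace("_", "-") for n in names if n}
    nccl = sorted(n for n in normalised if n.startswith("nvidia-nccl-"))
    return [n for n in nccl if n != EXPECTED_NCCL]


if __name__ == "__main__":
    extra = unexpected_nccl_wheels()
    if extra:
        print("Conflicting NCCL wheels installed:", ", ".join(extra))
    else:
        print("No conflicting NCCL wheels found.")
```

An empty result means no `nvidia-nccl-*` wheel other than `nvidia-nccl-cu12` was found; it does not verify that `nvidia-nccl-cu12` itself is present.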

Test plan

- Boot Qwen3.6-27B-AWQ-INT4 on dual V100-32GB TP=2 with FLASH_ATTN_V100 backend
- Boot Granite-4.1-8B-AWQ-INT4 on the same hardware (independent validation)
- Smoke-test text reasoning, code, factual recall at temp=0
- Smoke-test multimodal (image input → vision tower → LLM)
- Operator-level profile to confirm no kernel-correctness regression
- Full RealWorldQA (765q) + MMMU (900q) academic comparison vs Qwen-published

Supersedes #25.

Add SM70TurboMindLinearKernel, an MPLinearKernel implementation that routes compressed-tensors / AWQ WNA16 dense GEMMs through the bundled TurboMind sm70_884_4 INT4 path. V100 (CC 7.0) has only first-gen FP16 WMMA cores and no Turing INT4 tensor-core GEMM, so the stock CUTLASS / Machete kernels are unavailable; this kernel gives dense WNA16 layers a working code path on SM70. Register it at the head of the CUDA _POSSIBLE_KERNELS priority list so it is preferred when running on V100; on newer architectures the existing kernels still win their min-capability checks.
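
The priority-list behaviour described above can be illustrated with a toy selector. This is a sketch under stated assumptions, not the selector in the tree: `KernelSpec`, `POSSIBLE_KERNELS`, `choose_kernel`, the two `NewerArchKernel*` entries, and all capability numbers other than 70 are placeholders, and restricting the SM70 kernel to exactly CC 7.0 via an upper bound is just one possible way to let newer kernels win elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelSpec:
    name: str
    min_capability: int
    max_capability: Optional[int] = None  # None means no upper bound

    def can_implement(self, capability: int) -> bool:
        if capability < self.min_capability:
            return False
        return self.max_capability is None or capability <= self.max_capability


# Hypothetical priority list: first entry that can implement the layer wins.
POSSIBLE_KERNELS = [
    KernelSpec("SM70TurboMindLinearKernel", min_capability=70, max_capability=70),
    KernelSpec("NewerArchKernelA", min_capability=90),  # placeholder entry
    KernelSpec("NewerArchKernelB", min_capability=80),  # placeholder entry
]


def choose_kernel(capability: int) -> str:
    """Walk the list in priority order and return the first usable kernel."""
    for kernel in POSSIBLE_KERNELS:
        if kernel.can_implement(capability):
            return kernel.name
    raise ValueError(
        "Failed to find a kernel that can implement the WNA16 linear layer"
    )
```

With this layout, putting the SM70 entry first is harmless on newer GPUs because it declines anything other than CC 7.0, and the walk continues to the next candidate.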

CompressedTensorsWNA16.get_min_capability hard-coded 75, so loading a compressed-tensors WNA16 model on a V100 failed with 'Failed to find a kernel that can implement the WNA16 linear layer' before the new SM70TurboMindLinearKernel ever got a chance to bid. Lower the reported minimum to 70 specifically when running on an SM70 device (CC 7.0). Older pre-Turing GPUs (sm_60/61/62) still get 75 and remain correctly rejected, since only V100 has the FP16 WMMA path the TurboMind kernel relies on.
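
The capability rule in this commit message reduces to a few lines. The sketch below restates it as free functions with hypothetical names (`wna16_min_capability`, `is_supported`); the actual change lives in `CompressedTensorsWNA16.get_min_capability` and may be structured differently.

```python
DEFAULT_MIN_CAPABILITY = 75  # the previously hard-coded minimum
SM70_CAPABILITY = 70         # V100, compute capability 7.0


def wna16_min_capability(device_capability: int) -> int:
    """Report 70 only on an SM70 device; everything else keeps 75."""
    if device_capability == SM70_CAPABILITY:
        return SM70_CAPABILITY
    return DEFAULT_MIN_CAPABILITY


def is_supported(device_capability: int) -> bool:
    """True when the device meets the minimum reported for it."""
    return device_capability >= wna16_min_capability(device_capability)
```

Keying the exception on an exact match rather than lowering the minimum globally is what keeps sm_60/61/62 rejected.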

CompressedTensorsSM70WNA16MoEMethod delegates its decode to AWQSM70MoEMethod.apply(), but only allocated a subset of the buffers that path reads, so a CT-quantized MoE model crashed on the first decode step on V100.

- Allocate the full buffer set AWQSM70MoEMethod.process_weights_after_loading creates: gate/up and permutation scratch, sorted-output and m-index buffers, int64 expert offsets, and the single-token batched pointer buffers.
- Publish sm70_hidden/intermediate logical+aligned sizes (CT weights are already in TurboMind layout, so logical == aligned).
- Build per-expert StridedPtr row views and record sm70_ptr_row_bytes for the batched GEMM path.
- Pass interleave_gated_silu=True to awq_sm70_prepare so the fused gate/up weights match the decode kernel's expectation.

Also switch the import to _DEFAULT_PERSISTENT_MAX_TOKENS; awq_sm70_moe renamed _DEFAULT_MAX_TOKENS, leaving the old name a dangling import.

…compressed-tensors

Two related fixes for running Qwen3.5/3.6 compressed-tensors checkpoints:

- Qwen3NextSparseMoeBlock: the MoE router gate is stored as bf16 in the checkpoint and has no quantized form. Passing the model quant_config to its ReplicatedLinear made the loader expect quantized weights; force quant_config=None so the gate stays bf16.
- _uses_split_gdn_input_projections only inspected modules_to_not_convert and ignored_layers. Compressed-tensors records its skip list under the ignore attribute, so the BF16 in_proj_a / in_proj_b GDN projections of a CT checkpoint were not detected and the split-projection layout was not selected. Consult quant_config.ignore as a final fallback.
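
The fallback in the second fix can be pictured as an ordered attribute lookup. The sketch below is illustrative only: the function names are hypothetical, and matching on module-name suffixes is an assumption about how the real `_uses_split_gdn_input_projections` decides, not a description of it.

```python
from types import SimpleNamespace

# Attribute names checked in order; "ignore" is the compressed-tensors one.
SKIP_LIST_ATTRS = ("modules_to_not_convert", "ignored_layers", "ignore")


def skipped_modules(quant_config) -> list:
    """Return the first non-empty skip list found on the quant config."""
    for attr in SKIP_LIST_ATTRS:
        names = getattr(quant_config, attr, None)
        if names:
            return list(names)
    return []


def uses_split_gdn_input_projections(quant_config) -> bool:
    """Assumed rule: split layout when in_proj_a / in_proj_b are left unquantized."""
    return any(
        name.endswith(("in_proj_a", "in_proj_b"))
        for name in skipped_modules(quant_config)
    )


# Example with a stand-in for a compressed-tensors config object.
ct_config = SimpleNamespace(ignore=["model.layers.0.linear_attn.in_proj_a"])
print(uses_split_gdn_input_projections(ct_config))
```

Because `ignore` is consulted last, configs that already populate one of the two older attributes behave exactly as before.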

CompressedTensorsWNA16 creates auxiliary parameters -- weight_shape (BasevLLMParameter) and weight_g_idx (RowvLLMParameter) -- that hold metadata or input-dim-sharded indices rather than output-dim weight data, so they have no output_dim attribute. When the qkvz stacked-load mapping in Qwen3_5Model.load_weights reached one of these via the tuple-shard path, it hit AttributeError on param.output_dim. Fix: when output_dim is absent, load through the standard non-shard weight_loader (last-write-wins for replicated metadata) and break out of the sub-id loop. The companion debug log now formats output_dim with %s / getattr default so it tolerates the missing attribute too.
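
In outline, the guard looks like the following. `FakeParam` and `load_stacked` are stand-ins written for this sketch; the real loop in `Qwen3_5Model.load_weights` has more context (name mapping, aliases, logging) that is left out here.

```python
class FakeParam:
    """Minimal stand-in for a vLLM parameter; records how it was loaded."""

    def __init__(self, name, output_dim=None):
        self.name = name
        if output_dim is not None:
            self.output_dim = output_dim  # metadata params never get this attribute
        self.calls = []

    def weight_loader(self, param, loaded_weight, shard_id=None):
        self.calls.append(shard_id)


def load_stacked(param, loaded_weight, shard_ids):
    """Load one checkpoint tensor into a stacked param, shard by shard."""
    for shard_id in shard_ids:
        if getattr(param, "output_dim", None) is None:
            # No output_dim: nothing to split, so load the tensor once through
            # the plain loader and stop iterating over sub-ids.
            param.weight_loader(param, loaded_weight)
            break
        param.weight_loader(param, loaded_weight, shard_id)


meta = FakeParam("weight_shape")
load_stacked(meta, object(), (0, 1, 2))
print(meta.calls)
```

Using `getattr` with a default, rather than touching `param.output_dim` directly, is what avoids the `AttributeError` described above.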

Replace the fork's vendored sm70_884_4.cu tile registry (lmdeploy v0.12.1) with the upstream lmdeploy main version (commit e5fbd4da, from PR #4429 'fully implement compressed-tensors gs32 support'). mainloop_sm70.h, iterator_sm70.h and scheduler_sm70.cuh are byte-identical between the two snapshots; only the Registry::sm70_884_4 tile-config list changed.

- Add a Config_U4_d<kColMajor> block with 21 gs32 tiles (the fork carried none for this layout).
- Expand the Config_U4_g<kColMajor> gs32 block from 6 to 17 tiles.
- Drop the gs64 block; both deployed quants (qwen3.6-27b-int4, granite-4.1-8b-awq-int4) are gs32.

Decode on V100 TP=2 is ~83% turbomind::gemm; the autotuner had no gs32 candidates in the most common kColMajor layout, forcing fallback tiles.

Net diff +45 -11. Requires a full _C extension rebuild since kernel registration is statically linked.

rivetphilbot added 6 commits May 19, 2026 01:29

rivetphilbot mentioned this pull request May 19, 2026 in #25, "[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading" (closed)

rivetphilbot marked this pull request as ready for review May 19, 2026 15:14

rivetphilbot wants to merge 6 commits into 1CatAI:main from rivetphilbot:sm70-dense-ct-deltanet-v2
