[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading#45
Open
rivetphilbot wants to merge 6 commits into
Add SM70TurboMindLinearKernel, an MPLinearKernel implementation that routes compressed-tensors / AWQ WNA16 dense GEMMs through the bundled TurboMind sm70_884_4 INT4 path. V100 (CC 7.0) has only first-gen FP16 WMMA cores and no Turing INT4 tensor-core GEMM, so the stock CUTLASS / Machete kernels are unavailable; this kernel gives dense WNA16 layers a working code path on SM70. Register it at the head of the CUDA _POSSIBLE_KERNELS priority list so it is preferred when running on V100; on newer architectures the existing kernels still win their min-capability checks.
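The priority-list behaviour described above can be sketched as a first-match scan over candidates. This is a minimal illustration, not the fork's selector: `KernelSpec`, `choose_kernel`, the kernel names, and the capability numbers are all hypothetical, and the sketch assumes the SM70 kernel declines every device other than CC 7.0 so that newer GPUs fall through to the existing kernels.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical stand-in for one entry in a kernel priority list."""
    name: str
    min_capability: int                    # lowest compute capability accepted
    max_capability: Optional[int] = None   # None means no upper bound


def choose_kernel(candidates: Sequence[KernelSpec], capability: int) -> KernelSpec:
    """Return the first candidate, in priority order, that accepts the device."""
    for spec in candidates:
        if capability < spec.min_capability:
            continue  # device too old for this kernel
        if spec.max_capability is not None and capability > spec.max_capability:
            continue  # kernel is pinned to older hardware (assumed for SM70)
        return spec
    raise ValueError(
        "Failed to find a kernel that can implement the WNA16 linear layer"
    )


# Illustrative priority list: the SM70 kernel sits at the head but only
# accepts CC 7.0; the second entry models any existing newer-arch kernel.
CANDIDATES = [
    KernelSpec("sm70_turbomind", min_capability=70, max_capability=70),
    KernelSpec("newer_arch_kernel", min_capability=80),
]
```

With this list, a CC 7.0 device selects the head entry, a CC 8.0 device skips it and selects the second entry, and a CC 6.1 device raises the same error message quoted in the next commit.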
CompressedTensorsWNA16.get_min_capability hard-coded 75, so loading a compressed-tensors WNA16 model on a V100 failed with 'Failed to find a kernel that can implement the WNA16 linear layer' before the new SM70TurboMindLinearKernel ever got a chance to bid. Lower the reported minimum to 70 specifically when running on an SM70 device (CC 7.0). Older pre-Turing GPUs (sm_60/61/62) still get 75 and remain correctly rejected, since only V100 has the FP16 WMMA path the TurboMind kernel relies on.
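A minimal sketch of the capability gate this commit describes, assuming the device capability is available as a plain integer (70 for CC 7.0). The function names are hypothetical; the real change lives in `CompressedTensorsWNA16.get_min_capability`.

```python
SM70 = 70     # V100
TURING = 75   # previous hard-coded floor


def wna16_min_capability(device_capability: int) -> int:
    """Report 70 only on an SM70 device; otherwise keep the Turing floor."""
    return SM70 if device_capability == SM70 else TURING


def scheme_is_admitted(device_capability: int) -> bool:
    """True when the device meets the minimum reported for it."""
    return device_capability >= wna16_min_capability(device_capability)
```

Because the floor is lowered only for an exact match on 70, sm_60/61/62 still compare against 75 and are rejected, as the commit message states.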
CompressedTensorsSM70WNA16MoEMethod delegates its decode to AWQSM70MoEMethod.apply(), but only allocated a subset of the buffers that path reads, so a CT-quantized MoE model crashed on the first decode step on V100.
- Allocate the full buffer set AWQSM70MoEMethod.process_weights_after_loading creates: gate/up and permutation scratch, sorted-output and m-index buffers, int64 expert offsets, and the single-token batched pointer buffers.
- Publish sm70_hidden/intermediate logical+aligned sizes (CT weights are already in TurboMind layout, so logical == aligned).
- Build per-expert StridedPtr row views and record sm70_ptr_row_bytes for the batched GEMM path.
- Pass interleave_gated_silu=True to awq_sm70_prepare so the fused gate/up weights match the decode kernel's expectation.
Also switch the import to _DEFAULT_PERSISTENT_MAX_TOKENS; awq_sm70_moe renamed _DEFAULT_MAX_TOKENS, leaving the old name a dangling import.
…compressed-tensors
Two related fixes for running Qwen3.5/3.6 compressed-tensors checkpoints:
- Qwen3NextSparseMoeBlock: the MoE router gate is stored as bf16 in the checkpoint and has no quantized form. Passing the model quant_config to its ReplicatedLinear made the loader expect quantized weights; force quant_config=None so the gate stays bf16.
- _uses_split_gdn_input_projections only inspected modules_to_not_convert and ignored_layers. Compressed-tensors records its skip list under the ignore attribute, so the BF16 in_proj_a / in_proj_b GDN projections of a CT checkpoint were not detected and the split-projection layout was not selected. Consult quant_config.ignore as a final fallback.
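The skip-list fallback in the second fix can be sketched as an ordered attribute lookup. This is an illustration under stated assumptions: the helper names are hypothetical, matching is done by plain substring, and real compressed-tensors ignore entries may use other forms (such as regex patterns) that this sketch does not handle.

```python
from types import SimpleNamespace
from typing import Any, List

# Attributes consulted in order; "ignore" is the compressed-tensors name
# and is checked last, as a final fallback.
_SKIP_LIST_ATTRS = ("modules_to_not_convert", "ignored_layers", "ignore")


def unquantized_patterns(quant_config: Any) -> List[str]:
    """Return the first non-empty skip list found on the config, else []."""
    for attr in _SKIP_LIST_ATTRS:
        value = getattr(quant_config, attr, None)
        if value:
            return list(value)
    return []


def is_left_unquantized(quant_config: Any, module_name: str) -> bool:
    """True if any skip-list entry occurs in the module name (substring match)."""
    return any(p in module_name for p in unquantized_patterns(quant_config))


# Hypothetical configs: only the attribute names come from the commit message.
ct_config = SimpleNamespace(ignore=["in_proj_a", "in_proj_b"])
awq_config = SimpleNamespace(modules_to_not_convert=["mlp.gate"])
```

A config object that exposes only `ignore` is then treated the same way as one that exposes `modules_to_not_convert`, which is the detection the split-projection check needs.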
CompressedTensorsWNA16 creates auxiliary parameters -- weight_shape (BasevLLMParameter) and weight_g_idx (RowvLLMParameter) -- that hold metadata or input-dim-sharded indices rather than output-dim weight data, so they have no output_dim attribute. When the qkvz stacked-load mapping in Qwen3_5Model.load_weights reached one of these via the tuple-shard path, it hit AttributeError on param.output_dim. Fix: when output_dim is absent, load through the standard non-shard weight_loader (last-write-wins for replicated metadata) and break out of the sub-id loop. The companion debug log now formats output_dim with %s / getattr default so it tolerates the missing attribute too.
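The guard described in this commit can be sketched as follows. The function and loader signatures are hypothetical simplifications of the stacked-load loop; only the `output_dim` check and the early break mirror the commit message. Note that `getattr(..., None)` treats an attribute explicitly set to `None` the same as a missing one.

```python
from types import SimpleNamespace
from typing import Any, Callable, Sequence


def load_stacked(
    param: Any,
    loaded_weight: Any,
    shard_ids: Sequence[int],
    sharded_loader: Callable[[Any, Any, int], None],
    plain_loader: Callable[[Any, Any], None],
) -> None:
    """Load one checkpoint tensor into a parameter via the tuple-shard path."""
    for shard_id in shard_ids:
        if getattr(param, "output_dim", None) is None:
            # Metadata-style parameter (e.g. weight_shape, weight_g_idx):
            # load it once through the non-shard loader and stop iterating.
            plain_loader(param, loaded_weight)
            break
        sharded_loader(param, loaded_weight, shard_id)


# Tiny demonstration with recording loaders and stand-in parameters.
calls = []
record_sharded = lambda p, w, sid: calls.append(("sharded", p.name, sid))
record_plain = lambda p, w: calls.append(("plain", p.name))

load_stacked(SimpleNamespace(name="weight_shape"), "w", (0, 1, 2),
             record_sharded, record_plain)
load_stacked(SimpleNamespace(name="weight_packed", output_dim=0), "w", (0, 1),
             record_sharded, record_plain)
```

After the two calls, `calls` holds one plain load for `weight_shape` followed by one sharded load per sub-id for `weight_packed`.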
Replace the fork's vendored sm70_884_4.cu tile registry (lmdeploy v0.12.1) with the upstream lmdeploy main version (commit e5fbd4da, from PR #4429 'fully implement compressed-tensors gs32 support'). mainloop_sm70.h, iterator_sm70.h and scheduler_sm70.cuh are byte identical between the two snapshots -- only the Registry::sm70_884_4 tile-config list changed.
- Add a Config_U4_d<kColMajor> block with 21 gs32 tiles (the fork carried none for this layout).
- Expand the Config_U4_g<kColMajor> gs32 block from 6 to 17 tiles.
- Drop the gs64 block; both deployed quants (qwen3.6-27b-int4, granite-4.1-8b-awq-int4) are gs32.
Decode on V100 TP=2 is ~83% turbomind::gemm; the autotuner had no gs32 candidates in the most common kColMajor layout, forcing fallback tiles. Net diff +45 -11. Requires a full _C extension rebuild since kernel registration is statically linked.
7 tasks
Summary
Adds an SM70 (V100) path for compressed-tensors dense WNA16 quantized models, plus DeltaNet weight-loading and tile-config alignment fixes. Validated end-to-end on dual V100-32GB TP=2 with two different model architectures.

This branch (sm70-dense-ct-deltanet-v2) is a clean rebuild of the earlier draft #25 onto current main. It replaces #25, which was opened against main@197f1cc6 (pre-1.0.0) and had drifted into conflict. The patch content is unchanged; only the history is rebuilt: 6 atomic single-author commits, no whitespace/line-ending noise, and the dead supported = True hack and its revert cancelled out.

+335/−23 across 7 files (was +736/−429 across 8). Per-file CRLF preserved to match main.

Patches in this branch
- [SM70] Add V100 dense WNA16 TurboMind linear kernel: adds SM70TurboMindLinearKernel for dense compressed-tensors WNA16, which transcodes CT pack format → AWQ pack format and dispatches through the existing awq_sm70_prepare / awq_gemm_sm70 TurboMind kernels. Registers in _POSSIBLE_KERNELS[CUDA].
- [SM70] Admit V100 (CC 7.0) in CompressedTensorsWNA16: relaxes min_capability=75 when an SM70-capable kernel is registered, so the kernel selector handles availability rather than rejecting up-front.
- [SM70] Wire compressed-tensors MoE decode buffers for V100: allocates the full decode buffer set and fixes a dangling import on main, where compressed_tensors_moe.py imports _DEFAULT_MAX_TOKENS, which awq_sm70_moe.py renamed to _DEFAULT_PERSISTENT_MAX_TOKENS.
- [Qwen3] Keep router gate / split GDN projections unquantized under CT: consults quant_config.ignore; re-expressed as a clean fallback on main's renamed _uses_split_gdn_input_projections.
- [Qwen3.5] Skip tuple-shard split for non-output-dim CT params: handles the auxiliary CT params (weight_shape, weight_g_idx) that lack output_dim; the legacy weight_loader_with_alias path crashes on them. Routes those through the standard weight_loader.
- [SM70] Sync sm70_884_4.cu kernel registry to lmdeploy main (gs32): adds 21 Config_U4_d gs32 tile sizes and expands Config_U4_g gs32 from 6→17 tiles. Pure registry additions; mainloop/iterator/scheduler unchanged.

Validated on V100 dual-TP
Models confirmed loading & generating correctly:
- cyankiwi/Qwen3.6-27B-AWQ-INT4 (hybrid DeltaNet + attention dense, 16/64 attention layers)
- cyankiwi/granite-4.1-8b-AWQ-INT4 (pure dense attention)

Steady-state decode throughput: ~40 tok/s for Qwen3.6-27B and ~70 tok/s for Granite-4.1-8B at TP=2 on dual V100-32GB. Operator-level profiling shows decode is GEMM-bound (~83% in turbomind::gemm), which reflects the V100 HBM-bandwidth ceiling for 27B-class dense rather than a patch-related limitation.

Quality probes (Qwen3.6-27B-AWQ-INT4, temp=0, non-thinking mode):
What this enables
Today, 1Cat-vLLM SM70 support covers:
- awq format (works, via AWQLinearMethod)
- compressed-tensors MoE (works, via CompressedTensorsSM70WNA16MoEMethod from prior MoE work)

This patch series adds the missing third corner: compressed-tensors dense on V100. Modern community quants (cyankiwi, llmcompressor-based) increasingly default to compressed-tensors format; without this path, V100 users are limited to legacy AWQ quants only.

Operational notes for reviewers
- nvidia-nccl-cu13 shadowing nvidia-nccl-cu12 in cu128 builds causes unhandled cuda error at the first all_reduce. Fix: ensure only nvidia-nccl-cu12 is installed.
- --compilation-config compile_ranges_split_points:[] (disables chunked prefill split) when used with FLASH_ATTN_V100; without this we observed silent fallback. Likely a separate issue worth its own investigation.
- main's 1.0.0 carries its own dense-SM70 helpers in qwen3_5.py (_parse_sm70_moe_dense_allowlist, _mark_default_sm70_dense_modules, _sm70_f16_force_enable, etc.). These sit on different code paths and don't conflict with this series, but maintainers may have opinions on consolidating the two.

Test plan
- FLASH_ATTN_V100 backend

Supersedes #25.
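
For the NCCL note under the operational notes, a small pre-flight check can flag stray NCCL wheels before launching. This is a hedged sketch using only the standard library: the helper names are hypothetical, and it assumes the wheels are installed under the distribution names quoted in that note.

```python
from importlib import metadata
from typing import Iterable, List

EXPECTED_NCCL = "nvidia-nccl-cu12"


def unexpected_nccl_wheels(installed_names: Iterable[str]) -> List[str]:
    """Return NCCL wheel names other than the expected cu12 one, sorted."""
    normalized = {name.lower().replace("_", "-") for name in installed_names}
    return sorted(
        name
        for name in normalized
        if name.startswith("nvidia-nccl-") and name != EXPECTED_NCCL
    )


def installed_distribution_names() -> List[str]:
    """List distribution names visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that carry no Name field
            names.append(name)
    return names


if __name__ == "__main__":
    extra = unexpected_nccl_wheels(installed_distribution_names())
    if extra:
        print("Unexpected NCCL wheels installed:", ", ".join(extra))
```

Run in the serving environment, an empty result means no NCCL wheel other than nvidia-nccl-cu12 is visible to that interpreter.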