Add Qwen3.6-27B contrib model with vLLM APC baseline #164
m-deepankar-singh wants to merge 21 commits into
Working with our team to evaluate.
Qwen3.6: Fused DeltaNet Direct-Solve Follow-Up
Summary
This branch is a reviewer-friendly presentation of the fused DeltaNet direct-solve result for Qwen3.6. It is intentionally based on PR 164,
The important result is that the fused DeltaNet CTE path is now coherent with realistic Qwen gate values when the Neumann power-doubling solve is replaced by a direct triangular RHS solve.
Branch Lineage
The actual development history was:
PR 164 is the original Qwen3.6 vLLM APC baseline for this branch, and PR 164
This clean branch extracts the direct-solve fused DeltaNet work and its result artifacts onto PR 164 for review. It does not include the entire
Why This Exists
The original fused DeltaNet path used a Neumann power-doubling solve for the recurrence. That approach is mathematically convenient, but it is fragile for realistic Qwen gate scales because it repeatedly forms full matrix powers and can amplify numerical error. In practice, the fused path could produce unstable or incoherent tokens.
The chunked DeltaNet path already used a more stable direct triangular solve. This branch ports that idea to the fused path: compute the causal recurrence through a direct triangular RHS solve instead of Neumann power-doubling.
The goal is not to claim the fused path is now the final production baseline by itself. The goal is to make the fused-kernel stability fix reviewable and to preserve the validation evidence that it produces coherent output inside the full experimental runtime lineage.
Major Changes From PR 164 To The Tested Branch
The full tested branch differs from PR 164 by roughly 105 source/result files. The main work streams from
Hybrid APC Runtime
vLLM / NxDI Scheduler Bridge
Qwen Model Execution
NxDI Prefix-Cache Plumbing
DeltaNet NKI Kernels
FP8 / Artifact Compile Path
Serving And API Compatibility
Validation And Results
What This Clean Branch Adds
This branch extracts only the fused DeltaNet follow-up commits:
Concretely, it changes:
Implementation Details
The fused kernel previously used Neumann power-doubling to solve the recurrent DeltaNet update. The direct-solve version computes the lower-triangular causal recurrence explicitly in the RHS solve path. This avoids the repeated full-matrix power operations that were unstable under realistic gate values.
The validator was also made standalone enough to load the fused NKI kernel directly. This matters because review and debug runs should not depend on package import side effects.
The CPU regression test was updated to cover realistic decay/gate scales and to catch the class of instability that showed up in the fused branch.
Validation Results
Validation artifacts are stored under:
Local Checks
python3 -m py_compile \
contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py \
contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py \
contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py
python3 -m pytest contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py -q
Result:
Artifact Under Test
Runtime:
Coherence
File:
Result:
Decode
File:
Result:
Cold / Warm Prefill
File:
Memory
File:
Result:
Memory caveat: this is a Neuron high-water peak from the Hybrid APC artifact,
Known Limitations
The compiled artifact used for this validation has
That is an artifact bucket coverage limitation, not a direct-solve correctness failure. A long-context follow-up compile should include prefix buckets beyond 16K, ideally through the target 64K/128K range.
The artifact results were produced from
What Is Intentionally Not Included
This clean branch does not include the full 80+ commit
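The contrast between the two solve strategies described in this comment can be sketched in NumPy. This is a minimal CPU illustration, not the NKI kernel: the function names, the toy matrix construction, and the chunk size are all assumptions made for the example. It assumes the within-chunk recurrence reduces to solving (I + A) X = B with A strictly lower triangular, which is how chunked delta-rule formulations are commonly written.

```python
import numpy as np


def make_toy_system(n=16, d_k=32, d_v=8, seed=0):
    """Build a toy (A, B) pair; the layout is hypothetical, for illustration only."""
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((n, d_k))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
    beta = rng.uniform(0.1, 0.9, size=n)                  # per-token write strength
    log_decay = np.cumsum(-rng.uniform(0.0, 0.1, size=n)) # cumulative log gate, decreasing
    gamma = np.exp(log_decay[:, None] - log_decay[None, :])
    # Keep only the strictly lower triangle so the system is causal.
    A = np.tril(beta[:, None] * (keys @ keys.T) * gamma, k=-1)
    B = beta[:, None] * rng.standard_normal((n, d_v))
    return A, B


def neumann_doubling_solve(A, B):
    """Solve (I + A) X = B via prod_j (I + N^(2^j)) with N = -A.

    A is nilpotent (A^n = 0), so the product equals the full Neumann
    series once 2^steps >= n. Every step forms a full matrix power.
    """
    n = A.shape[0]
    inv = np.eye(n, dtype=A.dtype)
    power = -A
    steps = (n - 1).bit_length()  # smallest m with 2^m >= n
    for _ in range(steps):
        inv = inv + inv @ power   # inv <- inv (I + N^(2^j))
        power = power @ power     # N^(2^j) -> N^(2^(j+1))
    return inv @ B


def direct_triangular_solve(A, B):
    """Solve (I + A) X = B by forward substitution, one row at a time."""
    n = A.shape[0]
    X = np.array(B, dtype=A.dtype, copy=True)
    for i in range(1, n):
        # Row i depends only on rows already solved above it.
        X[i] -= A[i, :i] @ X[:i]
    return X


if __name__ == "__main__":
    A, B = make_toy_system()
    reference = np.linalg.solve(np.eye(A.shape[0]) + A, B)  # float64 reference
    A32, B32 = A.astype(np.float32), B.astype(np.float32)
    for name, solve in [("neumann", neumann_doubling_solve),
                        ("direct", direct_triangular_solve)]:
        err = np.max(np.abs(solve(A32, B32) - reference))
        print(f"{name}: max abs error vs float64 reference = {err:.3e}")
```

Both routines agree in exact arithmetic because A is nilpotent. The difference is in the intermediates: power doubling materializes A^2, A^4, and so on, whose entries can be much larger than the final answer, while forward substitution only touches rows of A and the already-solved rows of X.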
Summary
Relationship to PR #140
This PR keeps the same high-level Qwen3.6 architecture described in PR #140: Qwen3.6-27B is a post-training update of Qwen3.5-27B with the same qwen3_5 architecture, 64 layers, and [3 DeltaNet + 1 GQA] pattern. The additional focus here is the production serving path:
Architecture
[3 DeltaNet + 1 GQA] x 16
Files
Test Results
Static Checks
git diff --check: PASS
python3 -m py_compile on model, kernel, vLLM helper, and OpenAI server Python files: PASS
bash -n on vLLM shell helpers: PASS
Local unit-test execution on the Mac checkout is blocked because the Neuron runtime packages (neuronx_distributed) are not installed there. Hardware/unit validation below was run on Trn2 with the Neuron inference environment.
Unit Coverage
The contrib includes 57 CPU unit tests:
test_config.py
test_weight_conversion.py
test_hybrid_cache_manager.py
test_deltanet_decay.py
Coverage includes config parsing, Qwen3.6/Qwen3.5 architectural compatibility, weight conversion, q/gate split handling, RMSNorm +1 conversion, hybrid cache allocation, DeltaNet state cache shapes, and decay handling.
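As a rough illustration of what a CPU check for the "RMSNorm +1 conversion" item might look like, the sketch below assumes the checkpoint stores a zero-centered scale that is applied as (1 + weight), and that conversion folds the +1 into the stored weight so a plain RMSNorm can be used at runtime. That reading, and every function name here, is an assumption for the example and not taken from the contrib's test files.

```python
import numpy as np


def rmsnorm_zero_centered(x, w, eps=1e-6):
    """Checkpoint-style norm (assumed): scale is applied as (1 + w)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * (1.0 + w)


def rmsnorm_standard(x, w, eps=1e-6):
    """Plain RMSNorm: scale is applied as w."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * w


def convert_rmsnorm_weight(w):
    """Hypothetical conversion step: fold the +1 into the stored weight."""
    return w + 1.0


def check_conversion(hidden=64, tokens=4, seed=0):
    """Return True if the converted weight reproduces the original output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((tokens, hidden))
    w = 0.02 * rng.standard_normal(hidden)  # zero-centered checkpoint weight
    before = rmsnorm_zero_centered(x, w)
    after = rmsnorm_standard(x, convert_rmsnorm_weight(w))
    return bool(np.allclose(before, after))
```

A check of this shape guards against the conversion being skipped or applied twice, since either mistake changes the effective scale by one.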
Long-Context vLLM/APC Validation
Hardware: trn2.3xlarge, TP=4, LNC=2, SDK 2.29, vLLM Neuron plugin path, MLP-only FP8 artifact, CTE bucket 512.
APC / Prefix Cache Validation
Native vLLM APC was validated with exact greedy output matches.
Shared-prefix concurrency at 1/2/4 requests returned all requested markers exactly. The current artifact is compiled for max_num_seqs=1, so concurrent requests are queued rather than run as a true multi-sequence batch.
Notes and Limitations
Checklist
contrib/models/Qwen3.6-27B/