Add Qwen3.6-27B contrib model with vLLM APC baseline#164

Open
m-deepankar-singh wants to merge 21 commits into aws-neuron:main from m-deepankar-singh:contrib/qwen36-27b-vllm-apc-pr

@m-deepankar-singh m-deepankar-singh commented May 13, 2026

Summary

  • Adds a contrib implementation of Qwen3.6-27B, a 27B dense hybrid DeltaNet + GQA model.
  • Builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") and the shared Qwen3.5/Qwen3.6 hybrid architecture pattern.
  • Adds the validated Qwen3.6 dense text/VL model code, DeltaNet NKI kernels, hybrid cache manager, MLP-only FP8 compile path, vLLM Neuron launch helpers, and OpenAI-compatible serving utilities.
  • Adds a vLLM/APC long-context baseline for 128K serving on Trn2 with chunked prefill and prefix-cache validation.

Relationship to PR #140

This PR keeps the same high-level Qwen3.6 architecture described in PR #140: Qwen3.6-27B is a post-training update of Qwen3.5-27B with the same qwen3_5 architecture, 64 layers, and [3 DeltaNet + 1 GQA] pattern. The additional focus here is the production serving path:

  • hybrid cache manager for DeltaNet recurrent/conv state plus GQA KV cache;
  • chunked prefill for long prompts;
  • MLP-only FP8 artifact path for 128K support on trn2.3xlarge;
  • vLLM Neuron registry/launcher support;
  • native vLLM APC validation for repeated and partial-prefix reuse.
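
As a rough illustration of the chunked-prefill item above, the sketch below splits a long prompt into fixed-size spans that would be prefilled one after another. The 512-token chunk size is an assumption taken from the CTE bucket reported for the long-context artifact, and `chunk_spans` is a hypothetical helper, not a function from this PR.

```python
def chunk_spans(prompt_len: int, chunk_size: int = 512) -> list[tuple[int, int]]:
    """Return half-open (start, end) token spans covering the prompt in order."""
    if prompt_len < 0 or chunk_size <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


print(chunk_spans(1000))              # [(0, 512), (512, 1000)]
print(len(chunk_spans(131_072)))      # 256
```

Under these assumptions a full 131,072-token prompt takes 256 sequential prefill steps, which is why the recurrent and KV state must carry over cleanly between chunks.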

Architecture

| Feature | Value |
| --- | --- |
| Layers | 64 |
| Layer pattern | [3 DeltaNet + 1 GQA] x 16 |
| Hidden size | 5120 |
| GQA attention | 24 heads, 4 KV heads, head_dim=256 |
| DeltaNet | 48 value heads, 16 key heads, k_dim=v_dim=128 |
| Position encoding | Partial RoPE, mRoPE-compatible |
| Vocabulary | 248,320 |
| Long-context artifact | 131,072 tokens, CTE bucket 512 |
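
The layer pattern and GQA shape above determine how much KV cache a long prompt needs. The sketch below expands the pattern and estimates KV bytes for one sequence. The BF16 (2-byte) KV element size is an assumption, the total is summed over all TP ranks, and `gqa_kv_bytes` is a hypothetical helper rather than code from this PR.

```python
LAYER_BLOCK = ["deltanet", "deltanet", "deltanet", "gqa"]
NUM_BLOCKS = 16
layer_types = LAYER_BLOCK * NUM_BLOCKS

num_deltanet = layer_types.count("deltanet")
num_gqa = layer_types.count("gqa")

KV_HEADS = 4
HEAD_DIM = 256
BYTES_PER_ELEM = 2  # assumption: KV cache stored in BF16


def gqa_kv_bytes(num_tokens: int) -> int:
    """K plus V bytes across all GQA layers for one sequence, over all TP ranks."""
    return num_tokens * num_gqa * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


print(len(layer_types), num_deltanet, num_gqa)   # 64 48 16
print(gqa_kv_bytes(1))                            # 65536
print(gqa_kv_bytes(131_072) / 2**30)              # 8.0
```

Only the 16 GQA layers contribute a cache that grows with sequence length; the 48 DeltaNet layers keep fixed-size recurrent and conv state, which is what the hybrid cache manager has to track separately.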

Files

contrib/models/Qwen3.6-27B/
├── README.md
├── scripts/
│   └── openai_compat_server.py
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py
│   ├── modeling_qwen35_vision.py
│   ├── modeling_qwen35_vl.py
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py
│       ├── nki_deltanet_chunked.py
│       └── nki_deltanet_fused.py
├── test/
│   ├── integration/
│   │   ├── qwen36_27b_compile_fp8.py
│   │   └── test_model.py
│   └── unit/
│       ├── test_config.py
│       ├── test_deltanet_decay.py
│       ├── test_hybrid_cache_manager.py
│       └── test_weight_conversion.py
└── vllm/
    ├── README.md
    ├── hf_qwen35_config.py
    ├── install_qwen36_vllm.sh
    ├── patch_nxdi_registry.py
    ├── qwen36_chat_proxy.py
    ├── run_offline_inference.py
    ├── serve_qwen36.py
    ├── sitecustomize.py
    └── start_vllm_server.sh

Test Results

Static Checks

  • git diff --check: PASS
  • python3 -m py_compile on model, kernel, vLLM helper, and OpenAI server Python files: PASS
  • bash -n on vLLM shell helpers: PASS

Local unit-test execution on the Mac checkout is blocked because the Neuron runtime packages (neuronx_distributed) are not installed there. Hardware/unit validation below was run on Trn2 with the Neuron inference environment.

Unit Coverage

The contrib includes 57 CPU unit tests:

| Test module | Tests |
| --- | --- |
| test_config.py | 26 |
| test_weight_conversion.py | 16 |
| test_hybrid_cache_manager.py | 13 |
| test_deltanet_decay.py | 2 |

Coverage includes config parsing, Qwen3.6/Qwen3.5 architectural compatibility, weight conversion, q/gate split handling, RMSNorm +1 conversion, hybrid cache allocation, DeltaNet state cache shapes, and decay handling.
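
For readers unfamiliar with the "RMSNorm +1 conversion" mentioned above, here is a minimal sketch of one plausible reading: the checkpoint stores a zero-centered norm weight that is applied as `(1 + w)`, and conversion folds the offset into the weight once so a conventional RMSNorm can be used at runtime. That interpretation, and every function name below, is an assumption; the PR's conversion code is not shown here.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Divide by the root mean square of x (eps added inside the square root)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rmsnorm_zero_centered(x, w):
    """Assumed checkpoint convention: scale the normalized input by (1 + w)."""
    return [n * (1.0 + wi) for n, wi in zip(rms_normalize(x), w)]


def rmsnorm_standard(x, w):
    """Conventional RMSNorm: scale the normalized input by w."""
    return [n * wi for n, wi in zip(rms_normalize(x), w)]


def convert_weight(w):
    """Hypothetical '+1 conversion': fold the offset into the stored weight."""
    return [1.0 + wi for wi in w]


x = [0.5, -1.0, 2.0, 4.0]
w = [0.0, 0.1, -0.2, 0.3]
assert rmsnorm_zero_centered(x, w) == rmsnorm_standard(x, convert_weight(w))
```

A unit test along these lines checks that the converted weight reproduces the original layer output, which is the property a weight-conversion test needs to protect.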

Long-Context vLLM/APC Validation

Hardware: trn2.3xlarge, TP=4, LNC=2, SDK 2.29, vLLM Neuron plugin path, MLP-only FP8 artifact, CTE bucket 512.

| Scenario | Result |
| --- | --- |
| 128K artifact compile/load | PASS |
| 32K and 64K short-after-long state reset | PASS |
| 32K and 64K needle retrieval prompts | PASS |
| Prefill throughput | 404-428 tok/s from 512 through 64K prompt tokens |
| Decode throughput | 26.3-26.6 tok/s |
| Decode TPOT | ~37.6-38.0 ms/token |
| Cold 512-token TTFT | ~1.2-1.3s, derived from measured prefill plus one decode step |
| Cold 32K-token TTFT | ~76.6-81.1s, derived from measured prefill plus one decode step |
| Cold 64K-token TTFT | ~153-162s, derived from measured prefill plus one decode step |
| Peak Neuron device memory | ~53.25 GB decimal during 64K eval |
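
The cold TTFT rows are described as derived from measured prefill plus one decode step. A minimal sketch of that arithmetic follows, assuming TTFT is prompt tokens divided by prefill throughput plus one TPOT, and pairing the fastest prefill with the fastest decode (and slowest with slowest). The pairing and the helper name `derived_ttft_s` are assumptions.

```python
def derived_ttft_s(prompt_tokens: int, prefill_tok_per_s: float, tpot_s: float) -> float:
    """Estimated time to first token: full prefill plus a single decode step."""
    return prompt_tokens / prefill_tok_per_s + tpot_s


for prompt_tokens in (512, 32_768, 65_536):
    best = derived_ttft_s(prompt_tokens, 428.0, 0.0376)   # upper throughput bound
    worst = derived_ttft_s(prompt_tokens, 404.0, 0.0380)  # lower throughput bound
    print(f"{prompt_tokens:>6} tokens: {best:.1f}-{worst:.1f} s")
```

With the reported 404-428 tok/s and ~37.6-38.0 ms/token, this lands in the same ranges as the TTFT rows above, so the decode step is a negligible part of long-prompt TTFT.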

APC / Prefix Cache Validation

Native vLLM APC was validated with exact greedy output matches.

| APC scenario | Cold | Warm | Speedup | Result |
| --- | --- | --- | --- | --- |
| Server exact-repeat, ~10.8K prompt tokens | 26.68s | 1.67s | 16.0x | exact text match |
| Offline exact-repeat | 26.19s | 2.38s | 11.0x | exact token-ID match |
| Offline partial-prefix reuse | 25.52s | 1.70s | 15.0x | exact token-ID match |
| Server cross-prefix reuse | 25.17s | 1.36s | 18.5x | exact text match |
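
The speedup column is the cold latency divided by the warm latency. The short sketch below recomputes it from the reported timings; the dictionary keys are shortened labels and `speedup` is a hypothetical helper.

```python
apc_runs = {
    "server exact-repeat": (26.68, 1.67),
    "offline exact-repeat": (26.19, 2.38),
    "offline partial-prefix": (25.52, 1.70),
    "server cross-prefix": (25.17, 1.36),
}


def speedup(cold_s: float, warm_s: float) -> float:
    """Ratio of cold (no cache hit) latency to warm (prefix cache hit) latency."""
    return cold_s / warm_s


for name, (cold_s, warm_s) in apc_runs.items():
    print(f"{name}: {speedup(cold_s, warm_s):.1f}x")
# server exact-repeat: 16.0x
# offline exact-repeat: 11.0x
# offline partial-prefix: 15.0x
# server cross-prefix: 18.5x
```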

Shared-prefix concurrency at 1/2/4 requests returned all requested markers exactly. The current artifact is compiled for max_num_seqs=1, so concurrent requests queue instead of running as a true multi-sequence batch.

Notes and Limitations

  • This PR is contrib-scoped and does not modify core NxDI files.
  • The validated 128K serving path uses MLP-only FP8; sensitive modules and DeltaNet recurrent state remain BF16/FP32 where required for stability.
  • Qwen3.6 full-attention head_dim is 256, so the stock head_dim<=128 CTE flash-attention path is not used.
  • vLLM APC is validated for exact-repeat and partial-prefix reuse with the current artifact, but continuous batching and speculative decoding are follow-up work.
  • Native Qwen MTP speculative decoding is intentionally not included in this baseline PR.

Checklist

  • Contrib-only changes under contrib/models/Qwen3.6-27B/
  • Qwen3.6-27B text model implementation
  • DeltaNet NKI kernels
  • Hybrid cache manager tests
  • MLP-only FP8 compile path
  • vLLM Neuron registry and launcher helpers
  • OpenAI-compatible guarded serving helper
  • 128K vLLM/APC hardware validation on Trn2
  • PR body credits Jim Burtoft's PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") as the baseline/reference
  • CI / reviewer hardware validation

@m-deepankar-singh m-deepankar-singh marked this pull request as ready for review May 13, 2026 05:31
@aws-reutermj

Working with our team to evaluate.

m-deepankar-singh commented May 22, 2026

Qwen3.6: Fused DeltaNet Direct-Solve Follow-Up

Summary

This branch is a reviewer-friendly presentation of the fused DeltaNet direct-solve result for Qwen3.6. It is intentionally based on PR 164, contrib/qwen36-27b-vllm-apc-pr at ac7df71, so reviewers can see the direct-solve change and validation artifacts without also reviewing the full experimental branch stack.

The key result is that the fused DeltaNet CTE path now produces coherent output at realistic Qwen gate values once the Neumann power-doubling solve is replaced by a direct triangular RHS solve.

Branch Lineage

The actual development history was:

PR 164 / vLLM APC baseline, building on PR #140
  -> experimental
      -> qwen-fused-neumann-stable-decay

PR 164 is the original Qwen3.6 vLLM APC baseline for this branch, and PR 164
itself builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140. After that,
the experimental branch accumulated the runtime and validation work needed to
make Hybrid APC usable and measurable. The final
qwen-fused-neumann-stable-decay branch was created from experimental and
added the fused DeltaNet stability work.

This clean branch extracts the direct-solve fused DeltaNet work and its result artifacts onto PR 164 for review. It does not include the entire experimental branch history.

Why This Exists

The original fused DeltaNet path used a Neumann power-doubling solve for the recurrence. That approach is mathematically convenient, but it is fragile for realistic Qwen gate scales because it repeatedly forms full matrix powers and can amplify numerical error. In practice, the fused path could produce unstable or incoherent tokens.

The chunked DeltaNet path already used a more stable direct triangular solve. This branch ports that idea to the fused path: compute the causal recurrence through a direct triangular RHS solve instead of Neumann power-doubling.
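
To make the two approaches concrete, here is a toy comparison in plain Python. It assumes the per-chunk recurrence reduces to solving (I + A) X = B with A strictly lower triangular, which is the usual form in chunked delta-rule formulations; that framing and all function names are assumptions, and this is not the NKI kernel. Power-doubling builds the inverse from repeated full-matrix products, while forward substitution solves the right-hand side row by row.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def solve_neumann_doubling(a, b):
    """Apply (I + A)^-1 = sum_k (-A)^k, doubling the number of terms per step."""
    n = len(a)
    power = [[-v for v in row] for row in a]   # N = -A; holds N^terms
    inverse = identity(n)                      # partial sum of N^k for k < terms
    terms = 1
    while terms < n:                           # N^n == 0, so n terms are exact
        inverse = matadd(inverse, matmul(power, inverse))
        power = matmul(power, power)           # full matrix power N^(2 * terms)
        terms *= 2
    return matmul(inverse, b)


def solve_forward_substitution(a, b):
    """Direct solve, row by row: x_i = b_i - sum_{k < i} a[i][k] * x_k."""
    n, m = len(b), len(b[0])
    x = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            x[i][j] = b[i][j] - sum(a[i][k] * x[k][j] for k in range(i))
    return x


A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.25, 0.5, 0.0, 0.0],
     [0.125, 0.25, 0.5, 0.0]]
B = [[1.0], [2.0], [3.0], [4.0]]

print(solve_forward_substitution(A, B))  # [[1.0], [1.5], [2.0], [2.5]]
print(solve_neumann_doubling(A, B))
```

On a small, well-scaled system like this one the two methods agree. The concern described above applies to larger chunks and larger coefficient magnitudes in reduced precision, where the intermediate matrix powers can grow before cancelling and the substitution form avoids computing them at all.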

The goal is not to claim the fused path is now the final production baseline by itself. The goal is to make the fused-kernel stability fix reviewable and to preserve the validation evidence that it produces coherent output inside the full experimental runtime lineage.

Major Changes From PR 164 To The Tested Branch

The full tested branch differs from PR 164 by roughly 105 source/result files. The main work streams from experimental were:

Hybrid APC Runtime

  • Added Hybrid APC request records and cache metadata.
  • Added checkpoint-slot lifecycle management.
  • Added restore and commit masks for GDN/recurrent state reuse.
  • Added backed prefix reads and stricter unbacked-read guards.
  • Added explicit metadata contracts so runtime decisions are scheduler-authorized instead of inferred locally.

vLLM / NxDI Scheduler Bridge

  • Added Qwen-specific vLLM scheduler patching for Hybrid APC.
  • Propagated request IDs into the Neuron model runner path.
  • Recognized cached chunked-prefill continuations.
  • Tracked active scheduled suffix lengths.
  • Added no-prefix fallback handling.
  • Authorized backed prefix continuations through scheduler metadata.

Qwen Model Execution

  • Extended Qwen chunked prefill for Hybrid APC.
  • Added GDN checkpoint commit/restore handling.
  • Added text-only CTE input handling.
  • Added compact CTE masks.
  • Fixed prefix/suffix boundary handling.
  • Guarded decode rows from unnecessary APC restore handling.
  • Added chat-template thinking controls for validation and serving.

NxDI Prefix-Cache Plumbing

  • Updated prefix-cache model wrapper paths for vectorized APC args.
  • Fixed prefix-cache bucket selection and padded-row safety.
  • Added cached decode row handling.
  • Added async checkpoint lifecycle cleanup.
  • Added unit tests around bucket selection, async execution, and Hybrid APC prefix cache behavior.

DeltaNet NKI Kernels

  • Added DeltaNet backend compile controls.
  • Added chunked and fused DeltaNet validator paths.
  • Tested masked Neumann variants.
  • Stabilized the fused DeltaNet kernel.
  • Replaced the fused Neumann power-doubling solve with the direct triangular RHS solve.

FP8 / Artifact Compile Path

  • Added FP8 MLP-only compile configuration coverage.
  • Added artifact config audit guardrails.
  • Aligned validation max prompt length with compiled artifacts.
  • Added pa_num_blocks and bucket sanity checks.
  • Added 128K/TKG bucket validation support.

Serving And API Compatibility

  • Updated OpenAI-compatible proxy/server behavior.
  • Normalized chat-template enable_thinking=false.
  • Fixed stop-sequence handling.
  • Added server startup/offline inference helpers.
  • Added OpenAI API probe scripts and results.

Validation And Results

  • Added exactness validation.
  • Added OpenAI chat APC validation.
  • Added boundary APC probes.
  • Added context sweeps.
  • Added offline decode benchmarking.
  • Added memory/perf capture.
  • Recorded 4K/128K FP8 Hybrid APC results, decode results, and fused direct-solve results.

What This Clean Branch Adds

This branch extracts only the fused DeltaNet follow-up commits:

  • Stabilize Qwen fused DeltaNet decay
  • Add isolated fused DeltaNet NKI validation
  • Load fused NKI kernel directly in validator
  • Fix fused DeltaNet solve stability
  • Validation/result documentation commits

Concretely, it changes:

  • contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py
  • contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py
  • contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py
  • profile_artifacts/qwen36_fused_directsolve_20260522/*

Implementation Details

The fused kernel previously used Neumann power-doubling to solve the recurrent DeltaNet update. The direct-solve version computes the lower-triangular causal recurrence explicitly in the RHS solve path. This avoids the repeated full-matrix power operations that were unstable under realistic gate values.

The validator was also made self-contained so that it loads the fused NKI kernel directly. This matters because review and debug runs should not depend on package import side effects.

The CPU regression test was updated to cover realistic decay/gate scales and to catch the class of instability that showed up in the fused branch.
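
The body of the regression test is not shown in this comment, so the following is only a hypothetical illustration of the kind of decay-scale check such a test can make. It assumes per-token log gates that are non-positive and builds the causal decay matrix by subtracting cumulative sums in log space before exponentiating; `causal_decay_matrix` is an invented name.

```python
import math


def causal_decay_matrix(log_gates):
    """D[i][j] = exp(sum of log_gates[j+1..i]) for j <= i, else 0.0."""
    cumulative, total = [], 0.0
    for g in log_gates:
        if g > 0.0:
            raise ValueError("log gates must be <= 0 for a decaying state")
        total += g
        cumulative.append(total)
    n = len(cumulative)
    # Subtract in log space first: on and below the diagonal the exponent is
    # <= 0, so exp() cannot overflow, and the masked upper triangle is never
    # exponentiated. Forming exp(c_i) / exp(c_j) instead would underflow to
    # 0.0 / 0.0 for strongly negative cumulative sums.
    return [[math.exp(cumulative[i] - cumulative[j]) if j <= i else 0.0
             for j in range(n)] for i in range(n)]


decay = causal_decay_matrix([-20.0] * 64)   # strong decay over a toy 64-step chunk
assert all(0.0 <= v <= 1.0 for row in decay for v in row)
assert all(decay[i][i] == 1.0 for i in range(64))
```

A check of this shape fails loudly if a refactor reintroduces an overflow-prone or division-based decay computation, which is one class of instability that gate-scale tests are meant to catch.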

Validation Results

Validation artifacts are stored under:

profile_artifacts/qwen36_fused_directsolve_20260522/

Local Checks

python3 -m py_compile \
  contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py \
  contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py \
  contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py

python3 -m pytest contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py -q

Result:

2 passed

Artifact Under Test

qwen36_27b_128k_fp8_mlp_edgebf16_hybrid_apc_nki_fusedstable_directsolve_retry_b256_cte256_512_pfx16k_slots64_tkg8192_32768_131072_async_20260522T130050Z

Runtime:

  • Instance: trn2.3xlarge
  • Runtime path: offline vLLM/NxDI
  • Sampling: on-device greedy
  • output_logits=false
  • TKG buckets: [8192, 32768, 131072]
  • Prefix buckets in this artifact: up to 16384

Coherence

File:

qwen36_directsolve_chat_coherence_20260522T1332Z.json

Result:

  • Overall pass: true
  • Chat template: enable_thinking=false
  • Fact, code, and prefix-cache prompts produced real non-repetitive output.
  • Smoke decode throughput: about 20.5 tok/s

Decode

File:

qwen36_directsolve_decode_bench_20260522T1348Z.json

Result:

  • Average decode throughput: 21.63 tok/s
  • TPOT: 46.2 ms/token
  • 128-token decode latency: 5.92 s
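
The three decode figures are related by simple arithmetic, sketched below under the assumption that TPOT is the reciprocal of average decode throughput and that the 128-token latency excludes prefill. Both helper names are hypothetical.

```python
def tpot_ms(decode_tok_per_s: float) -> float:
    """Time per output token in milliseconds."""
    return 1000.0 / decode_tok_per_s


def decode_latency_s(num_tokens: int, decode_tok_per_s: float) -> float:
    """Seconds to decode num_tokens at a steady rate."""
    return num_tokens / decode_tok_per_s


print(f"{tpot_ms(21.63):.1f} ms/token")          # 46.2 ms/token
print(f"{decode_latency_s(128, 21.63):.2f} s")   # 5.92 s
```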

Cold / Warm Prefill

File:

context_sweep_partial_20260522T1348Z.json

| Prompt tokens | Cold TTFT | Cold prefill | Warm TTFT | Warm prefill |
| --- | --- | --- | --- | --- |
| 512 | 1.31 s | 390 tok/s | 0.42 s | 1.2k tok/s |
| 4096 | 7.03 s | 582 tok/s | 0.42 s | 9.8k tok/s |
| 8192 | 13.61 s | 602 tok/s | 0.43 s | 18.9k tok/s |
| 16384 | 27.84 s | 589 tok/s | 0.45 s | 36.3k tok/s |

Memory

File:

qwen36_directsolve_perf_capture_20260522T1348Z.json

Result:

  • Neuron HBM peak sum: 60.1 GiB
  • Host process RSS peak: 46.3 GiB
  • Main logical cores peaked around 14.57 GiB each on cores 0, 2, 4, and 6

Memory caveat: this is a Neuron high-water peak from the Hybrid APC artifact,
not a prompt-length-only 16K allocation. In decimal units the direct-solve peak
is 64.54 GB, versus the PR 164 vLLM/APC README's ~53.25 GB decimal 64K
eval number. The artifact was compiled with pa_num_blocks=512,
max_gdn_checkpoint_slots=64, and TKG buckets [8192, 32768, 131072]. The
64-slot GDN checkpoint bank is expected to reserve about 9.85 GB decimal
across TP=4, explaining most of the delta. A strict memory regression claim
still needs like-for-like A/B measurement with the same capture script and
comparable cache/bucket settings.
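
The unit conversion and the checkpoint-bank estimate in the caveat can be approximated with a few lines of arithmetic. The sketch assumes each DeltaNet layer keeps an FP32 recurrent state of shape [48 value heads, 128, 128] per checkpoint slot, taken from the architecture table; that shape and dtype are assumptions, and conv state is not counted.

```python
GIB = 2**30
GB = 10**9


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / GB


DELTANET_LAYERS = 48        # 3 of every 4 layers in a 64-layer model
VALUE_HEADS = 48
K_DIM = V_DIM = 128
STATE_BYTES = 4             # assumption: FP32 recurrent state
CHECKPOINT_SLOTS = 64       # max_gdn_checkpoint_slots

recurrent_bank_bytes = (
    CHECKPOINT_SLOTS * DELTANET_LAYERS * VALUE_HEADS * K_DIM * V_DIM * STATE_BYTES
)

print(f"{gib_to_gb(60.1):.1f} GB")              # 64.5 GB
print(f"{recurrent_bank_bytes / GB:.2f} GB")    # 9.66 GB
```

Under these assumptions the recurrent state alone accounts for roughly 9.66 GB of the ~9.85 GB estimate; the remainder is not modeled here, and conv state is one candidate for it.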

Known Limitations

The compiled artifact used for this validation has prefix_buckets only through 16384. The 32K row failed with:

Prefix len 16640 exceeds largest bucket 16384 for context_encoding_model

That is an artifact bucket coverage limitation, not a direct-solve correctness failure. A long-context follow-up compile should include prefix buckets beyond 16K, ideally through the target 64K/128K range.
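
The failure mode is easy to reproduce in isolation. The sketch below picks the smallest compiled bucket that fits a prefix and raises when none does. `select_bucket` is a hypothetical helper that mirrors the reported error text, not the NxDI implementation, and the bucket list is invented apart from the 16384 ceiling.

```python
from bisect import bisect_left


def select_bucket(length: int, buckets: list[int], model_name: str) -> int:
    """Return the smallest bucket that can hold `length` tokens."""
    ordered = sorted(buckets)
    if not ordered:
        raise ValueError(f"No buckets configured for {model_name}")
    index = bisect_left(ordered, length)
    if index == len(ordered):
        raise ValueError(
            f"Prefix len {length} exceeds largest bucket {ordered[-1]} for {model_name}"
        )
    return ordered[index]


# Hypothetical bucket list; only the 16384 ceiling comes from the PR comment.
prefix_buckets = [4096, 8192, 16384]

print(select_bucket(8192, prefix_buckets, "context_encoding_model"))   # 8192
try:
    select_bucket(16640, prefix_buckets, "context_encoding_model")
except ValueError as err:
    print(err)
```

Extending the compiled prefix bucket list is therefore a compile-time configuration change, independent of the solve used inside the fused kernel.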

The artifact results were produced from qwen-fused-neumann-stable-decay, which includes the full experimental runtime lineage. This clean branch shows the fused direct-solve extraction and result evidence, but it should not be read as proof that PR 164 plus only these extracted commits reproduces every Hybrid APC runtime behavior from experimental.

What Is Intentionally Not Included

This clean branch does not include the full 80+ commit experimental stack. It also does not include large raw logs, obsolete investigation branches, or temporary scripts. Those were useful during development but would make this review branch hard to inspect.
