Add Qwen3.6-27B contrib model with vLLM APC baseline by m-deepankar-singh · Pull Request #164 · aws-neuron/neuronx-distributed-inference

m-deepankar-singh · 2026-05-13T05:25:34Z

Summary

Adds a contrib implementation of Qwen3.6-27B, a 27B dense hybrid DeltaNet + GQA model.
Builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") and the shared Qwen3.5/Qwen3.6 hybrid architecture pattern.
Adds the validated Qwen3.6 dense text/VL model code, DeltaNet NKI kernels, hybrid cache manager, MLP-only FP8 compile path, vLLM Neuron launch helpers, and OpenAI-compatible serving utilities.
Adds a vLLM/APC long-context baseline for 128K serving on Trn2 with chunked prefill and prefix-cache validation.

Relationship to PR #140

This PR keeps the same high-level Qwen3.6 architecture described in PR #140: Qwen3.6-27B is a post-training update of Qwen3.5-27B with the same qwen3_5 architecture, 64 layers, and [3 DeltaNet + 1 GQA] pattern. The additional focus here is the production serving path:

hybrid cache manager for DeltaNet recurrent/conv state plus GQA KV cache;
chunked prefill for long prompts;
MLP-only FP8 artifact path for 128K support on trn2.3xlarge;
vLLM Neuron registry/launcher support;
native vLLM APC validation for repeated and partial-prefix reuse.

Architecture

Feature	Value
Layers	64
Layer pattern	`[3 DeltaNet + 1 GQA] x 16`
Hidden size	5120
GQA attention	24 heads, 4 KV heads, head_dim=256
DeltaNet	48 value heads, 16 key heads, k_dim=v_dim=128
Position encoding	Partial RoPE, mRoPE-compatible
Vocabulary	248,320
Long-context artifact	131,072 tokens, CTE bucket 512
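
As a quick sanity check on the table above, the layer layout and the GQA KV-cache footprint can be derived from the listed values. This is only a sketch: the zero-based layer indexing and the BF16 (2-byte) KV-cache element size are assumptions, since the table does not state either.

```python
# Layer layout from the Architecture table: [3 DeltaNet + 1 GQA] x 16.
PATTERN = ["deltanet", "deltanet", "deltanet", "gqa"]
NUM_BLOCKS = 16

layer_types = PATTERN * NUM_BLOCKS
# Assumption: layers are indexed from 0, so every 4th layer (3, 7, ...) is GQA.
gqa_layers = [i for i, kind in enumerate(layer_types) if kind == "gqa"]

# GQA KV-cache bytes per token, summed over all GQA layers (K and V).
NUM_KV_HEADS = 4
HEAD_DIM = 256
BYTES_PER_ELEM = 2  # assumption: BF16 KV cache; not stated in the table

kv_bytes_per_token = len(gqa_layers) * 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
kv_bytes_at_128k = kv_bytes_per_token * 131_072

print(len(layer_types), len(gqa_layers), gqa_layers[:4])
print(kv_bytes_per_token, kv_bytes_at_128k / 2**30, "GiB")
```

Under these assumptions only the 16 GQA layers contribute a KV cache that grows with context length; the 48 DeltaNet layers carry fixed-size recurrent/conv state instead.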

Files

contrib/models/Qwen3.6-27B/
├── README.md
├── scripts/
│   └── openai_compat_server.py
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py
│   ├── modeling_qwen35_vision.py
│   ├── modeling_qwen35_vl.py
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py
│       ├── nki_deltanet_chunked.py
│       └── nki_deltanet_fused.py
├── test/
│   ├── integration/
│   │   ├── qwen36_27b_compile_fp8.py
│   │   └── test_model.py
│   └── unit/
│       ├── test_config.py
│       ├── test_deltanet_decay.py
│       ├── test_hybrid_cache_manager.py
│       └── test_weight_conversion.py
└── vllm/
    ├── README.md
    ├── hf_qwen35_config.py
    ├── install_qwen36_vllm.sh
    ├── patch_nxdi_registry.py
    ├── qwen36_chat_proxy.py
    ├── run_offline_inference.py
    ├── serve_qwen36.py
    ├── sitecustomize.py
    └── start_vllm_server.sh

Test Results

Static Checks

git diff --check: PASS
python3 -m py_compile on model, kernel, vLLM helper, and OpenAI server Python files: PASS
bash -n on vLLM shell helpers: PASS

Local unit-test execution on the Mac checkout is blocked because the Neuron runtime packages (neuronx_distributed) are not installed there. Hardware/unit validation below was run on Trn2 with the Neuron inference environment.

Unit Coverage

The contrib includes 57 CPU unit tests:

Test module	Tests
`test_config.py`	26
`test_weight_conversion.py`	16
`test_hybrid_cache_manager.py`	13
`test_deltanet_decay.py`	2

Coverage includes config parsing, Qwen3.6/Qwen3.5 architectural compatibility, weight conversion, q/gate split handling, RMSNorm +1 conversion, hybrid cache allocation, DeltaNet state cache shapes, and decay handling.

Long-Context vLLM/APC Validation

Hardware: trn2.3xlarge, TP=4, LNC=2, SDK 2.29, vLLM Neuron plugin path, MLP-only FP8 artifact, CTE bucket 512.

Scenario	Result
128K artifact compile/load	PASS
32K and 64K short-after-long state reset	PASS
32K and 64K needle retrieval prompts	PASS
Prefill throughput	404-428 tok/s from 512 through 64K prompt tokens
Decode throughput	26.3-26.6 tok/s
Decode TPOT	~37.6-38.0 ms/token
Cold 512-token TTFT	~1.2-1.3s, derived from measured prefill plus one decode step
Cold 32K-token TTFT	~76.6-81.1s, derived from measured prefill plus one decode step
Cold 64K-token TTFT	~153-162s, derived from measured prefill plus one decode step
Peak Neuron device memory	~53.25 GB decimal during 64K eval
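
The cold TTFT rows are described as derived from measured prefill plus one decode step. A minimal sketch of that derivation, using only the throughput and TPOT ranges from the table (the function name is illustrative, not part of the contrib code):

```python
def derived_ttft_s(prompt_tokens, prefill_tok_s, tpot_ms):
    """Cold TTFT estimate: prefill time plus one decode step."""
    return prompt_tokens / prefill_tok_s + tpot_ms / 1000.0

PREFILL_TOK_S = (404.0, 428.0)  # measured prefill throughput range
TPOT_MS = (37.6, 38.0)          # measured decode TPOT range

for n in (512, 32_768, 65_536):
    best = derived_ttft_s(n, PREFILL_TOK_S[1], TPOT_MS[0])
    worst = derived_ttft_s(n, PREFILL_TOK_S[0], TPOT_MS[1])
    print(f"{n:>6} prompt tokens: {best:.1f}-{worst:.1f} s")
```

At these prompt lengths the prefill term dominates; the single decode step adds under 40 ms.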

APC / Prefix Cache Validation

Native vLLM APC was validated with exact greedy output matches.

APC scenario	Cold	Warm	Speedup	Result
Server exact-repeat, ~10.8K prompt tokens	26.68s	1.67s	16.0x	exact text match
Offline exact-repeat	26.19s	2.38s	11.0x	exact token-ID match
Offline partial-prefix reuse	25.52s	1.70s	15.0x	exact token-ID match
Server cross-prefix reuse	25.17s	1.36s	18.5x	exact text match
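
The Speedup column is the cold latency divided by the warm latency, and the Result column is an exact comparison of greedy outputs. A small sketch of that check follows; `apc_check` and the placeholder token IDs are hypothetical and are not taken from the validation scripts.

```python
def apc_check(cold_s, warm_s, cold_ids, warm_ids):
    """Return (speedup, exact_match) for one cold/warm request pair."""
    return cold_s / warm_s, cold_ids == warm_ids

# (cold seconds, warm seconds) from the table above.
timings = {
    "server exact-repeat": (26.68, 1.67),
    "offline exact-repeat": (26.19, 2.38),
    "offline partial-prefix": (25.52, 1.70),
    "server cross-prefix": (25.17, 1.36),
}

placeholder_ids = [101, 7, 42]  # hypothetical greedy output token IDs

for name, (cold, warm) in timings.items():
    speedup, exact = apc_check(cold, warm, placeholder_ids, list(placeholder_ids))
    print(f"{name}: {speedup:.1f}x, exact match: {exact}")
```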

Shared-prefix concurrency at 1/2/4 requests returned all requested markers exactly. The current artifact is compiled for max_num_seqs=1, so concurrent requests are queued and served one at a time rather than run as a true multi-sequence batch.

Notes and Limitations

This PR is contrib-scoped and does not modify core NxDI files.
The validated 128K serving path uses MLP-only FP8; sensitive modules and DeltaNet recurrent state remain BF16/FP32 where required for stability.
Qwen3.6 full-attention head_dim is 256, so the stock head_dim<=128 CTE flash-attention path is not used.
vLLM APC is validated for exact-repeat and partial-prefix reuse with the current artifact, but continuous batching and speculative decoding are follow-up work.
Native Qwen MTP speculative decoding is intentionally not included in this baseline PR.

Checklist

aws-reutermj · 2026-05-14T22:16:46Z

Working with our team to evaluate.

m-deepankar-singh · 2026-05-22T17:12:39Z

Qwen3.6: Fused DeltaNet Direct-Solve Follow-Up

Summary

This branch is a reviewer-friendly presentation of the fused DeltaNet direct-solve result for Qwen3.6. It is intentionally based on PR 164, contrib/qwen36-27b-vllm-apc-pr at ac7df71, so reviewers can see the direct-solve change and validation artifacts without also reviewing the full experimental branch stack.

The important result is that the fused DeltaNet CTE path now produces coherent output under realistic Qwen gate values once the Neumann power-doubling solve is replaced by a direct triangular RHS solve.

Branch Lineage

The actual development history was:

PR 164 / vLLM APC baseline, building on PR #140
  -> experimental
      -> qwen-fused-neumann-stable-decay

PR 164 is the original Qwen3.6 vLLM APC baseline for this branch, and PR 164
itself builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140. After that,
the experimental branch accumulated the runtime and validation work needed to
make Hybrid APC usable and measurable. The final
qwen-fused-neumann-stable-decay branch was created from experimental and
added the fused DeltaNet stability work.

This clean branch extracts the direct-solve fused DeltaNet work and its result artifacts onto PR 164 for review. It does not include the entire experimental branch history.

Why This Exists

The original fused DeltaNet path used a Neumann power-doubling solve for the recurrence. That approach is mathematically convenient, but it is fragile for realistic Qwen gate scales because it repeatedly forms full matrix powers and can amplify numerical error. In practice, the fused path could produce unstable or incoherent tokens.

The chunked DeltaNet path already used a more stable direct triangular solve. This branch ports that idea to the fused path: compute the causal recurrence through a direct triangular RHS solve instead of Neumann power-doubling.

The goal is not to claim the fused path is now the final production baseline by itself. The goal is to make the fused-kernel stability fix reviewable and to preserve the validation evidence that it produces coherent output inside the full experimental runtime lineage.

Major Changes From PR 164 To The Tested Branch

The full tested branch differs from PR 164 by roughly 105 source/result files. The main work streams from experimental were:

Hybrid APC Runtime

Added Hybrid APC request records and cache metadata.
Added checkpoint-slot lifecycle management.
Added restore and commit masks for GDN/recurrent state reuse.
Added backed prefix reads and stricter unbacked-read guards.
Added explicit metadata contracts so runtime decisions are scheduler-authorized instead of inferred locally.

vLLM / NxDI Scheduler Bridge

Added Qwen-specific vLLM scheduler patching for Hybrid APC.
Propagated request IDs into the Neuron model runner path.
Recognized cached chunked-prefill continuations.
Tracked active scheduled suffix lengths.
Added no-prefix fallback handling.
Authorized backed prefix continuations through scheduler metadata.

Qwen Model Execution

Extended Qwen chunked prefill for Hybrid APC.
Added GDN checkpoint commit/restore handling.
Added text-only CTE input handling.
Added compact CTE masks.
Fixed prefix/suffix boundary handling.
Guarded decode rows from unnecessary APC restore handling.
Added chat-template thinking controls for validation and serving.

NxDI Prefix-Cache Plumbing

Updated prefix-cache model wrapper paths for vectorized APC args.
Fixed prefix-cache bucket selection and padded-row safety.
Added cached decode row handling.
Added async checkpoint lifecycle cleanup.
Added unit tests around bucket selection, async execution, and Hybrid APC prefix cache behavior.

DeltaNet NKI Kernels

Added DeltaNet backend compile controls.
Added chunked and fused DeltaNet validator paths.
Tested masked Neumann variants.
Stabilized the fused DeltaNet kernel.
Replaced the fused Neumann power-doubling solve with the direct triangular RHS solve.

FP8 / Artifact Compile Path

Added FP8 MLP-only compile configuration coverage.
Added artifact config audit guardrails.
Aligned validation max prompt length with compiled artifacts.
Added pa_num_blocks and bucket sanity checks.
Added 128K/TKG bucket validation support.

Serving And API Compatibility

Updated OpenAI-compatible proxy/server behavior.
Normalized chat-template enable_thinking=false.
Fixed stop-sequence handling.
Added server startup/offline inference helpers.
Added OpenAI API probe scripts and results.

Validation And Results

Added exactness validation.
Added OpenAI chat APC validation.
Added boundary APC probes.
Added context sweeps.
Added offline decode benchmarking.
Added memory/perf capture.
Recorded 4K/128K FP8 Hybrid APC results, decode results, and fused direct-solve results.

What This Clean Branch Adds

This branch extracts only the fused DeltaNet follow-up commits:

Stabilize Qwen fused DeltaNet decay
Add isolated fused DeltaNet NKI validation
Load fused NKI kernel directly in validator
Fix fused DeltaNet solve stability
Validation/result documentation commits

Concretely, it changes:

contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py
contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py
contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py
profile_artifacts/qwen36_fused_directsolve_20260522/*

Implementation Details

The fused kernel previously used Neumann power-doubling to solve the recurrent DeltaNet update. The direct-solve version computes the lower-triangular causal recurrence explicitly in the RHS solve path. This avoids the repeated full-matrix power operations that were unstable under realistic gate values.
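
To illustrate the difference between the two solve strategies, here is a generic pure-Python sketch on a system (I + L) X = B with strictly lower-triangular L. It is not the NKI kernel: the real matrices are built from keys, betas, and decay gates that this PR text does not show, and the size 8 and off-diagonal value 1.5 are illustrative assumptions. Both functions return the same answer in exact arithmetic; the sketch also records how large the intermediate matrix powers become in the power-doubling variant.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def max_abs(m):
    return max(abs(v) for row in m for v in row)

def neumann_power_doubling(lower):
    """(I + L)^-1 as (I + N)(I + N^2)(I + N^4)... with N = -L.

    Exact for strictly lower-triangular L because N is nilpotent.
    Also returns the largest entry seen in any intermediate power of N.
    """
    n = len(lower)
    power = [[-v for v in row] for row in lower]  # N^(2^0)
    inv = add(identity(n), power)
    peak = max_abs(power)
    terms = 2  # series terms N^0 .. N^(terms-1) covered so far
    while terms < n:
        power = matmul(power, power)  # squares the previous power
        peak = max(peak, max_abs(power))
        inv = matmul(inv, add(identity(n), power))
        terms *= 2
    return inv, peak

def direct_triangular_solve(lower, rhs):
    """Solve (I + L) X = RHS row by row (forward substitution)."""
    x = []
    for i in range(len(lower)):
        x.append([rhs[i][c] - sum(lower[i][j] * x[j][c] for j in range(i))
                  for c in range(len(rhs[0]))])
    return x

# Illustrative values only: 8 positions, constant off-diagonal 1.5.
SIZE, C = 8, 1.5
L = [[C if j < i else 0.0 for j in range(SIZE)] for i in range(SIZE)]
rhs = identity(SIZE)

inv, peak = neumann_power_doubling(L)
x_neumann = matmul(inv, rhs)
x_direct = direct_triangular_solve(L, rhs)
max_diff = max(abs(a - b) for ra, rb in zip(x_neumann, x_direct)
               for a, b in zip(ra, rb))

print("largest intermediate power entry:", peak)
print("largest solution entry:", max_abs(x_direct))
print("max difference between methods:", max_diff)
```

In this toy case the intermediate powers contain entries far larger than anything in the final solution, so the power-doubling result depends on cancellation between large terms. That is harmless in float64 at this size, but it is the kind of behaviour that can lose precision in reduced-precision arithmetic, whereas forward substitution only ever touches the original entries of L and already-solved rows.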

The validator was also made standalone enough to load the fused NKI kernel directly. This matters because review and debug runs should not depend on package import side effects.

The CPU regression test was updated to cover realistic decay/gate scales and to catch the class of instability that showed up in the fused branch.

Validation Results

Validation artifacts are stored under:

profile_artifacts/qwen36_fused_directsolve_20260522/

Local Checks

python3 -m py_compile \
  contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py \
  contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py \
  contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py

python3 -m pytest contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py -q

Result:

2 passed

Artifact Under Test

qwen36_27b_128k_fp8_mlp_edgebf16_hybrid_apc_nki_fusedstable_directsolve_retry_b256_cte256_512_pfx16k_slots64_tkg8192_32768_131072_async_20260522T130050Z

Runtime:

Instance: trn2.3xlarge
Runtime path: offline vLLM/NxDI
Sampling: on-device greedy
output_logits=false
TKG buckets: [8192, 32768, 131072]
Prefix buckets in this artifact: up to 16384

Coherence

File:

qwen36_directsolve_chat_coherence_20260522T1332Z.json

Result:

Overall pass: true
Chat template: enable_thinking=false
Fact, code, and prefix-cache prompts produced coherent, non-repetitive output.
Smoke decode throughput: about 20.5 tok/s

Decode

File:

qwen36_directsolve_decode_bench_20260522T1348Z.json

Result:

Average decode throughput: 21.63 tok/s
TPOT: 46.2 ms/token
128-token decode latency: 5.92 s
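
The three decode figures are mutually consistent: TPOT is the reciprocal of throughput, and the 128-token latency is 128 decode steps. A short check using only the reported average throughput:

```python
decode_tok_s = 21.63  # reported average decode throughput

tpot_ms = 1000.0 / decode_tok_s
latency_128_s = 128 * tpot_ms / 1000.0

print(f"TPOT: {tpot_ms:.1f} ms/token")
print(f"128-token decode latency: {latency_128_s:.2f} s")
```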

Cold / Warm Prefill

File:

context_sweep_partial_20260522T1348Z.json

Prompt tokens	Cold TTFT	Cold prefill	Warm TTFT	Warm prefill
512	1.31 s	390 tok/s	0.42 s	1.2k tok/s
4096	7.03 s	582 tok/s	0.42 s	9.8k tok/s
8192	13.61 s	602 tok/s	0.43 s	18.9k tok/s
16384	27.84 s	589 tok/s	0.45 s	36.3k tok/s

Memory

File:

qwen36_directsolve_perf_capture_20260522T1348Z.json

Result:

Neuron HBM peak sum: 60.1 GiB
Host process RSS peak: 46.3 GiB
Main logical cores peaked around 14.57 GiB each on cores 0, 2, 4, and 6

Memory caveat: this is a Neuron high-water peak from the Hybrid APC artifact,
not a prompt-length-only 16K allocation. In decimal units the direct-solve peak
is 64.54 GB, versus the PR 164 vLLM/APC README's ~53.25 GB decimal 64K
eval number. The artifact was compiled with pa_num_blocks=512,
max_gdn_checkpoint_slots=64, and TKG buckets [8192, 32768, 131072]. The
64-slot GDN checkpoint bank is expected to reserve about 9.85 GB decimal
across TP=4, explaining most of the delta. A strict memory regression claim
still needs like-for-like A/B measurement with the same capture script and
comparable cache/bucket settings.
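
The arithmetic behind this caveat can be sketched as follows. The unit conversion and the delta use only numbers quoted above. The last part is a rough estimate under loud assumptions that are not stated in this PR text: each checkpoint slot is taken to hold one FP32 k_dim x v_dim recurrent-state matrix per value head for each of the 48 DeltaNet layers, with nothing else counted.

```python
GIB = 2**30
GB = 10**9

# Unit conversion of the reported HBM peak (60.1 GiB is itself rounded).
hbm_peak_gb = 60.1 * GIB / GB

# Delta against the PR 164 README figure, using the decimal values quoted above.
delta_gb = 64.54 - 53.25
checkpoint_bank_gb = 9.85
explained_share = checkpoint_bank_gb / delta_gb

# Rough estimate of recurrent state in the checkpoint bank.
# Assumptions (not from the source): FP32 state, one k_dim x v_dim matrix
# per value head per DeltaNet layer, conv state and metadata ignored.
DELTANET_LAYERS = 48
VALUE_HEADS = 48
K_DIM = V_DIM = 128
STATE_BYTES_PER_ELEM = 4
SLOTS = 64

state_bytes_per_slot = (DELTANET_LAYERS * VALUE_HEADS * K_DIM * V_DIM
                        * STATE_BYTES_PER_ELEM)
bank_estimate_gb = state_bytes_per_slot * SLOTS / GB

print(f"HBM peak: {hbm_peak_gb:.2f} GB decimal")
print(f"Delta vs PR 164 README: {delta_gb:.2f} GB")
print(f"Share attributed to checkpoint bank: {explained_share:.0%}")
print(f"Estimated recurrent state in bank: {bank_estimate_gb:.2f} GB")
```

The conversion of the rounded 60.1 GiB lands within about 0.01 GB of the quoted 64.54 GB. The recurrent-state estimate comes out somewhat below the quoted 9.85 GB; the source does not break down the remainder, so this should be read as an order-of-magnitude cross-check rather than an accounting.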

Known Limitations

The compiled artifact used for this validation has prefix_buckets only through 16384. The 32K row failed with:

Prefix len 16640 exceeds largest bucket 16384 for context_encoding_model

That is an artifact bucket coverage limitation, not a direct-solve correctness failure. A long-context follow-up compile should include prefix buckets beyond 16K, ideally through the target 64K/128K range.
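
The failure mode can be reproduced with a generic bucket-selection sketch: pick the smallest compiled bucket that fits the prefix, and fail if none does. This is not the NxDI implementation; `select_bucket` and the bucket list are hypothetical, with only the 16384 maximum and the error wording taken from the log line above.

```python
from bisect import bisect_left

def select_bucket(length, buckets, model_name="context_encoding_model"):
    """Return the smallest bucket >= length, or raise if none is large enough."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, length)
    if idx == len(ordered):
        raise ValueError(
            f"Prefix len {length} exceeds largest bucket {ordered[-1]} "
            f"for {model_name}"
        )
    return ordered[idx]

# Hypothetical bucket list; only the 16384 upper bound comes from the source.
PREFIX_BUCKETS = [512, 1024, 2048, 4096, 8192, 16384]

print(select_bucket(16384, PREFIX_BUCKETS))  # fits the largest bucket
try:
    select_bucket(16640, PREFIX_BUCKETS)     # the failing 32K-row prefix length
except ValueError as err:
    print(err)
```

Extending the compiled bucket list past 16384 is what lets longer prefixes resolve to a bucket instead of raising.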

The artifact results were produced from qwen-fused-neumann-stable-decay, which includes the full experimental runtime lineage. This clean branch shows the fused direct-solve extraction and result evidence, but it should not be read as proof that PR 164 plus only these extracted commits reproduces every Hybrid APC runtime behavior from experimental.

What Is Intentionally Not Included

This clean branch does not include the full 80+ commit experimental stack. It also does not include large raw logs, obsolete investigation branches, or temporary scripts. Those were useful during development but would make this review branch hard to inspect.

m-deepankar-singh added 2 commits May 13, 2026 10:49

Contrib: add Qwen3.6-27B vLLM APC baseline

22d58b7

Docs: align Qwen3.6 README with contrib guidelines

bb94c28

m-deepankar-singh marked this pull request as ready for review May 13, 2026 05:31

Docs: add Qwen3.6 TTFT and TPOT benchmarks

6d6ae62

m-deepankar-singh added 17 commits May 18, 2026 20:58

Docs: summarize Hybrid APC follow-up status

8760bf5

Docs: correct Trn2 compatibility matrix

ac7df71

Stabilize Qwen fused DeltaNet decay

11c550c

(cherry picked from commit 8af0219)

Add isolated fused DeltaNet NKI validation

0a68b82

(cherry picked from commit fd30e32)

Load fused NKI kernel directly in validator

d84b3c7

(cherry picked from commit 3995389)

Fix fused DeltaNet solve stability

3e56e58

(cherry picked from commit ae2613d)

Record Qwen fused direct-solve validation

e938d71

Document Qwen fused direct-solve PR delta

4c44489

Clarify Qwen fused direct-solve branch lineage

b609c87

Summarize Qwen experimental branch changes

f895246

Expand Qwen fused direct-solve PR draft

6d758a2

Document Qwen fused direct-solve validation

935dab9

Clarify Qwen3.6-only validation scope

724f9f2

Verify Qwen validation environment versions

79a2d03

Correct Qwen3.6 lineage to PR 140

3698a4c

Document Qwen HBM comparison caveat

df6d936

Document Qwen fused validation follow-ups

24babfb

Remove recommended next step from PR body

882e9da

m-deepankar-singh wants to merge 21 commits into aws-neuron:main from m-deepankar-singh:contrib/qwen36-27b-vllm-apc-pr
