21 commits
22d58b7
Contrib: add Qwen3.6-27B vLLM APC baseline
m-deepankar-singh May 13, 2026
bb94c28
Docs: align Qwen3.6 README with contrib guidelines
m-deepankar-singh May 13, 2026
6d6ae62
Docs: add Qwen3.6 TTFT and TPOT benchmarks
m-deepankar-singh May 13, 2026
8760bf5
Docs: summarize Hybrid APC follow-up status
m-deepankar-singh May 18, 2026
ac7df71
Docs: correct Trn2 compatibility matrix
m-deepankar-singh May 18, 2026
11c550c
Stabilize Qwen fused DeltaNet decay
m-deepankar-singh May 22, 2026
0a68b82
Add isolated fused DeltaNet NKI validation
m-deepankar-singh May 22, 2026
d84b3c7
Load fused NKI kernel directly in validator
m-deepankar-singh May 22, 2026
3e56e58
Fix fused DeltaNet solve stability
m-deepankar-singh May 22, 2026
e938d71
Record Qwen fused direct-solve validation
m-deepankar-singh May 22, 2026
4c44489
Document Qwen fused direct-solve PR delta
m-deepankar-singh May 22, 2026
b609c87
Clarify Qwen fused direct-solve branch lineage
m-deepankar-singh May 22, 2026
f895246
Summarize Qwen experimental branch changes
m-deepankar-singh May 22, 2026
6d758a2
Expand Qwen fused direct-solve PR draft
m-deepankar-singh May 22, 2026
935dab9
Document Qwen fused direct-solve validation
m-deepankar-singh May 22, 2026
724f9f2
Clarify Qwen3.6-only validation scope
m-deepankar-singh May 22, 2026
79a2d03
Verify Qwen validation environment versions
m-deepankar-singh May 22, 2026
3698a4c
Correct Qwen3.6 lineage to PR 140
m-deepankar-singh May 22, 2026
df6d936
Document Qwen HBM comparison caveat
m-deepankar-singh May 22, 2026
24babfb
Document Qwen fused validation follow-ups
m-deepankar-singh May 22, 2026
882e9da
Remove recommended next step from PR body
m-deepankar-singh May 22, 2026
435 changes: 435 additions & 0 deletions contrib/models/Qwen3.6-27B/README.md

@@ -0,0 +1,165 @@
# Qwen3.6 Fused Direct-Solve Validation

This note records the reviewer-facing validation for the fused DeltaNet CTE
direct-solve branch. The raw JSON captures are kept in
`profile_artifacts/qwen36_fused_directsolve_20260522/`.

## Artifact

| Field | Value |
|-------|-------|
| Source branch | `qwen-fused-neumann-stable-decay` |
| Source commit | `ae2613d` |
| Clean PR branch | `qwen36-pr164-directsolve-baseline` |
| Artifact | `qwen36_27b_128k_fp8_mlp_edgebf16_hybrid_apc_nki_fusedstable_directsolve_retry_b256_cte256_512_pfx16k_slots64_tkg8192_32768_131072_async_20260522T130050Z` |
| Runtime | trn2.3xlarge, TP=4, LNC=2 |
| Serving path | Offline vLLM/NxDI, on-device greedy sampling |

The Trn2 validation host was rechecked on 2026-05-22 before publishing these
notes: it was reachable, reported instance type `trn2.3xlarge` and a logical
NeuronCore config of `2`, and held the artifact directory (36G on disk).

The clean PR branch is based on PR 164 and extracts the fused DeltaNet
direct-solve change. The measured artifact was compiled from the full
experimental runtime lineage:

```text
PR 164 / vLLM APC baseline, building on PR #140
-> experimental
-> qwen-fused-neumann-stable-decay
```

PR 164 itself builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140.
That lineage matters because the artifact also uses the Hybrid APC runtime,
FP8 128K compile settings, and vLLM/NxDI serving fixes developed after the
original PR 164 baseline.

## What Changed

The fused DeltaNet CTE kernel previously used Neumann power-doubling for the
intra-chunk correction. That approximation was fast, but it was not stable for
realistic Qwen3.6 gate/decay values and produced invalid output in fused BF16
experiments.

The direct-solve version replaces that correction with the lower-triangular
solve strategy used by the stable chunked CTE path. This keeps the fused
single-kernel structure while avoiding the observed Neumann instability.
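
As a rough illustration of the difference (a pure-Python sketch, not the NKI
kernel; the function names and the toy matrix are assumptions made for this
example), both approaches below solve `(I + A) x = b` for a strictly
lower-triangular `A`. The power-doubling form accumulates
`(I - A)(I + A^2)(I + A^4)...`, which is exact for a nilpotent `A` in exact
arithmetic but forms explicit matrix powers along the way. Forward substitution
never forms those powers.

```python
def matmul(x, y):
    """Plain dense matrix product on nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def neumann_inverse(a):
    """Invert (I + a) for strictly lower-triangular a via power doubling."""
    n = len(a)
    power = [[-v for v in row] for row in a]  # N = -a, so (I + a)^-1 = sum N^k
    result = add(identity(n), power)          # covers terms N^0 and N^1
    covered = 2
    while covered < n:                        # N^n = 0, so n terms suffice
        power = matmul(power, power)          # N^(2^j), formed explicitly
        result = matmul(result, add(identity(n), power))
        covered *= 2
    return result

def forward_solve(a, b):
    """Solve (I + a) x = b by forward substitution."""
    x = []
    for i, b_i in enumerate(b):
        x.append(b_i - sum(a[i][j] * x[j] for j in range(i)))
    return x

# Toy 4x4 example with made-up values (not Qwen3.6 gate/decay data).
a = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [-0.25, 0.75, 0.0, 0.0],
     [0.1, -0.3, 0.6, 0.0]]
b = [1.0, 2.0, 3.0, 4.0]

x_direct = forward_solve(a, b)
inv = neumann_inverse(a)
x_neumann = [sum(inv[i][j] * b[j] for j in range(len(b))) for i in range(len(b))]
print(x_direct)
print(x_neumann)
```

With small, well-scaled entries in double precision the two results agree. The
sketch does not reproduce the BF16 failure; it only shows where the two
strategies differ structurally.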

## Coherence

Raw file:
`../../../../profile_artifacts/qwen36_fused_directsolve_20260522/qwen36_directsolve_chat_coherence_20260522T1332Z.json`

| Check | Result |
|-------|--------|
| Overall pass | PASS |
| Chat template | `enable_thinking=false` |
| Fact prompt | Coherent |
| Code prompt | Coherent |
| Prefix-cache prompt | Coherent |
| Smoke decode | ~20.5 tok/s |

The sampled outputs were coherent, non-repetitive text. This addresses the
earlier failure mode where the fused path could emit invalid repeated tokens.

## Decode

Raw file:
`../../../../profile_artifacts/qwen36_fused_directsolve_20260522/qwen36_directsolve_decode_bench_20260522T1348Z.json`

| Metric | Result |
|--------|--------|
| Average decode throughput | 21.63 tok/s |
| TPOT | 46.2 ms/token |
| 128-token decode average latency | 5.92s |
| Sampling | On-device greedy, `output_logits=false` |
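
The three decode figures are mutually consistent if TPOT is taken as the
reciprocal of average throughput. The short check below assumes that
relationship; it is not taken from the benchmark script.

```python
# Average decode throughput copied from the table above.
throughput_tok_s = 21.63

# Assumed relationships: TPOT = 1 / throughput, latency = tokens / throughput.
tpot_ms = 1000.0 / throughput_tok_s
latency_128_s = 128 / throughput_tok_s

print(f"TPOT ~ {tpot_ms:.1f} ms/token")             # TPOT ~ 46.2 ms/token
print(f"128-token decode ~ {latency_128_s:.2f} s")  # 128-token decode ~ 5.92 s
```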

## Cold And Warm Prefill

Raw file:
`../../../../profile_artifacts/qwen36_fused_directsolve_20260522/context_sweep_partial_20260522T1348Z.json`

| Prompt tokens | Cold TTFT | Cold prefill | Warm TTFT | Warm prefill |
|--------------:|----------:|-------------:|----------:|-------------:|
| 512 | 1.31s | 390 tok/s | 0.42s | 1.2K tok/s |
| 4,096 | 7.03s | 582 tok/s | 0.42s | 9.8K tok/s |
| 8,192 | 13.61s | 602 tok/s | 0.43s | 18.9K tok/s |
| 16,384 | 27.84s | 589 tok/s | 0.45s | 36.3K tok/s |
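
The prefill columns can be approximately recomputed from the TTFT columns,
assuming prefill rate is prompt tokens divided by TTFT (an assumption about how
the sweep derives it). Because the TTFT values in the table are rounded, the
recomputed warm rates differ from the table by up to about 2%.

```python
# (prompt_tokens, cold_ttft_s, warm_ttft_s) copied from the table above.
ROWS = [
    (512, 1.31, 0.42),
    (4096, 7.03, 0.42),
    (8192, 13.61, 0.43),
    (16384, 27.84, 0.45),
]

def prefill_rate(tokens: int, ttft_s: float) -> float:
    """Assumed definition: prompt tokens divided by time to first token."""
    return tokens / ttft_s

for tokens, cold_s, warm_s in ROWS:
    cold = prefill_rate(tokens, cold_s)
    warm = prefill_rate(tokens, warm_s)
    print(f"{tokens:>6} tokens: cold {cold:7.0f} tok/s, "
          f"warm {warm:7.0f} tok/s, TTFT speedup {cold_s / warm_s:5.1f}x")
```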

A 32K row is absent because that run did not complete: this artifact was
compiled with `prefix_buckets` only through 16K:

```text
Prefix len 16640 exceeds largest bucket 16384 for context_encoding_model
```

That is a bucket-coverage limitation of this artifact, not a fused direct-solve
correctness failure. Longer-context warm APC validation requires recompiling
with larger prefix buckets.
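
The failure mode can be sketched as a smallest-fitting-bucket lookup. This is a
hypothetical illustration, not the NxDI implementation: `select_bucket` and the
bucket list are invented for the example, and only the 16384 maximum comes from
the error message above.

```python
from bisect import bisect_left

def select_bucket(length: int, buckets: list[int]) -> int:
    """Return the smallest bucket that fits `length`, or raise if none does."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, length)
    if idx == len(ordered):
        raise ValueError(
            f"Prefix len {length} exceeds largest bucket {ordered[-1]}"
        )
    return ordered[idx]

# Hypothetical bucket list; only the largest value (16384) is from the note.
prefix_buckets = [4096, 8192, 16384]

print(select_bucket(16384, prefix_buckets))  # fits the largest bucket
try:
    select_bucket(16640, prefix_buckets)     # length from the logged error
except ValueError as err:
    print(err)
```

Under this model, a 16640-token prefix has no bucket to land in regardless of
kernel correctness, which is why the fix is a recompile with larger buckets.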

## Memory

Raw file:
`../../../../profile_artifacts/qwen36_fused_directsolve_20260522/qwen36_directsolve_perf_capture_20260522T1348Z.json`

| Metric | Result |
|--------|--------|
| Neuron HBM peak sum | 60.1 GiB |
| Host process RSS peak | 46.3 GiB |
| Main logical-core peaks | ~14.57 GiB on cores 0, 2, 4, and 6 |

The HBM number above is a high-water peak from the Hybrid APC artifact. In
decimal units it is `64.54 GB`, so the comparison against PR 164's `~53.25 GB`
decimal 64K vLLM/APC eval is an `~11.29 GB` delta. The peak was identical for
the context-sweep and decode-benchmark phases in the raw capture, so it should
be read primarily as static artifact/runtime allocation rather than as memory
consumed by a 16K prompt alone.

The artifact config includes `pa_num_blocks=512`,
`max_gdn_checkpoint_slots=64`, and token-generation buckets
`[8192, 32768, 131072]`. The 64-slot GDN checkpoint bank is expected to reserve
about `38.49 MB` per checkpoint per TP rank, or `9.85 GB` decimal across the
64 slots and TP=4. That accounts for most of the observed delta. The remaining
`~1.44 GB` can plausibly come from bucket/runtime/capture differences, but
confirming that requires a like-for-like rerun of the PR 164 artifact and the
direct-solve artifact with the same memory capture script and comparable
cache/bucket settings.
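
The arithmetic behind these figures can be rechecked as follows. All inputs are
the rounded values quoted in this section; the variable names are illustrative.
Converting the rounded `60.1 GiB` gives about `64.53 GB`, so the quoted
`64.54 GB` presumably comes from the unrounded capture value.

```python
GIB = 2**30   # bytes per GiB (binary)
GB = 10**9    # bytes per GB (decimal)
MB = 10**6    # bytes per MB (decimal)

# Rounded HBM peak from the table, converted to decimal GB.
hbm_peak_gb = 60.1 * GIB / GB

# Delta against the PR 164 figure, using the decimal values quoted above.
delta_gb = 64.54 - 53.25

# GDN checkpoint bank: per-checkpoint size x slots x TP ranks.
bank_gb = 38.49 * MB * 64 * 4 / GB

print(f"HBM peak ~ {hbm_peak_gb:.2f} GB")
print(f"Delta vs PR 164 ~ {delta_gb:.2f} GB")
print(f"Checkpoint bank ~ {bank_gb:.2f} GB "
      f"({bank_gb / delta_gb:.0%} of the delta)")
print(f"Unexplained remainder ~ {delta_gb - bank_gb:.2f} GB")
```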

## Observed Package Versions

These versions were read from the running validation host and the active NxDI
venv on 2026-05-22.

| Component | Version |
|-----------|---------|
| neuronx-distributed-inference | 0.9.17334+ced6ae4e |
| neuronx-distributed | 0.18.27753+1cafd54f |
| neuronx-cc | 2.24.8799.0+6f62ff7c |
| nki | 0.3.0+23928721754.g18aa1271 |
| torch | 2.9.1 |
| torch-neuronx | 2.9.0.2.13.26312+8e870898 |
| torch-xla | 2.9.0 |
| transformers | 4.57.6 |
| aws-neuronx-runtime-lib | 2.31.24.0-0b044f4ce |
| aws-neuronx-tools | 2.29.22.0-b486b0ade |
| NXDI venv | `/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/` |
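
One way to recapture the Python package rows is `importlib.metadata` from
inside the active venv. This is a sketch rather than the script used for the
table; it assumes the pip distribution names match the component names above.
The `aws-neuronx-runtime-lib` and `aws-neuronx-tools` rows are system packages
and would need the OS package manager instead.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names, taken from the component column above.
PACKAGES = [
    "neuronx-distributed-inference",
    "neuronx-distributed",
    "neuronx-cc",
    "nki",
    "torch",
    "torch-neuronx",
    "torch-xla",
    "transformers",
]

def collect_versions(names):
    """Map each distribution name to its installed version, if any."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found

for name, ver in collect_versions(PACKAGES).items():
    print(f"| {name} | {ver} |")
```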

## Validation Commands

Local CPU checks used for this branch:

```bash
python3 -m py_compile \
contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py \
contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py \
contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py

python3 -m pytest contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py -q
```

Result:

```text
2 passed
```

Neuron validation used the compiled artifact above on trn2.3xlarge.