diff --git a/NEXTN.md b/NEXTN.md index 85a4d914390..c2f1000ecec 100644 --- a/NEXTN.md +++ b/NEXTN.md @@ -16,16 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts. ## 0. Pre-built model GGUFs -Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the -**unsloth** Hugging Face collection — the same files exercised in the -matrix bench (§7): +**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection — drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`, plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too — same arch metadata, same NextN tail. -| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture | +| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture | |---|---|---|---| -| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` | -| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` | +| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 20.7 GiB) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | `qwen35moe` | +| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 17.7 GiB) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | `qwen35` | -Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path +**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging). + +The shared-model NextN path works on **any** of them as long as the file contains the NextN auxiliary head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by construction. 
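A quick way to spot-check a download before wiring it up is to dump the GGUF metadata and grep for the NextN key. A minimal sketch using `gguf-dump` from the `gguf` Python package (assumes `pip install gguf`; the KV name carries the arch prefix, likely `qwen35moe.nextn_predict_layers` here, and the file path is illustrative):

```bash
# Dump KV metadata only and filter for the NextN head; a combined
# *_MTP.gguf should report nextn_predict_layers with a value > 0.
gguf-dump --no-tensors ./Qwen3.6-35B-A3B-UDT-Q4_K_XL_MTP.gguf | grep -i nextn
```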
`scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -37,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:

```bash
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
llama-server \
-    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-    -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+    -hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+    -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
    --spec-type nextn --draft-max 2 --draft-min 1 \
    -c 8192 -ngl 99 -ngld 99 -fa on

# 27B dense
llama-server \
-    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-    -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+    -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
+    -hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
    --spec-type nextn --draft-max 2 --draft-min 1 \
    -c 8192 -ngl 99 -ngld 99 -fa on
```
@@ -121,8 +121,8 @@ let `llama-server` pull both from Hugging Face into the local cache:

```bash
llama-server \
-    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-    -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+    -hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+    -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
    --spec-type nextn --draft-max 2 --draft-min 1
```
@@ -130,11 +130,12 @@ llama-server \

## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)

-Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
+Median TPS over 2 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
NextN draft DM=2 (single async chain), context 8192. Single-slot
(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
-shared-model draft path (no second mmap of combined `_MTP.gguf`). See
-`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
+shared-model draft path (no second mmap of combined `_MTP.gguf`),
+AtomicChat **`UDT-Q4_K_XL_MTP`** file. See
+`.scratch/bench-logs/qwen-udt-ab-20260513-132549.md`.

### Bench host

@@ -191,3 +192,46 @@ The jump came from a single architectural change: dropping the second
now runs without OOM and posts +24-36%; (b) draft KV cache resized only
for the NextN layer (`kv_only_nextn = true` is mutated transparently in
`llama_context` ctor for draft); (c) the NextN graph builder now flows
through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.
+
+---
+
+## 8. UDT quantization recipe (calibration + masks)
+
+**Goal:** keep Unsloth’s **MTP-aware imatrix** (public `imatrix_unsloth.gguf_file` per HF repo) while applying **AtomicChat-specific** `--tensor-type-file` overrides (all mask files live in `scripts/quantize-masks/`):
+
+| File | Extra tensors vs base |
+|------|-------------------------|
+| `qwen36-ud-base.txt` | `token_embd` / `output` high bit width; `attn_v` / `ffn_down` lifted; `ffn_gate_inp` for MoE |
+| `qwen36-ud-v1-nextn.txt` | All `blk.*.nextn.*` and `mtp.*` at `q8_0` (draft-head preservation) |
+| `qwen36-ud-v2-turbo3.txt` | `attn_q` / `attn_k` at `q6_K` (stack with TurboQuant3 KV) |
+| `qwen36-ud-v3-combined.txt` | Union of v1 + v2 (default release build) |
+
+**Build entrypoints**
+
+- Single quant: `scripts/quantize-qwen-udt.sh`
+- Full sweep: `scripts/quantize-qwen-udt-matrix.sh`
+- Remote / bench / HF: **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)**
+
+**Note:** `UDT` filenames use `…Q4_K_XL…` as a product tag; `llama-quantize` is still invoked with family types `Q4_K_M`, `Q5_K_M`, etc.
+
+---
+
+## 9. 
Released artifacts — AtomicChat UDT collection
+
+The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
+
+- Collection — [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)
+- 27B dense — [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)
+- 35B-A3B MoE — [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)
+
+What's actually in each repo, and why it's a bit unusual for a quant drop:
+
+- **5 quants per model, all `_MTP.gguf`** — `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md` — no second GGUF, no second mmap, no second tokenizer.
+- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
+- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
+- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
+- **mmproj mirrored from Unsloth** — `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
+- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
+- **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
+
+The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identically.
diff --git a/README.md b/README.md
index 05dfb4caef7..fb60aac7197 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ LLM inference in C/C++

## Hot topics

- **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
-- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). 
See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).** +- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the **[AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)** collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)) — built with the Unsloth public MTP-aware imatrix + fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV); upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).** - **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).** - **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.** - **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)** @@ -198,14 +198,22 @@ Highlights: ### Pre-built model GGUFs -Recommended source is the **unsloth** Hugging Face collection — the same -combined `*_MTP.gguf` files exercised in the matrix bench. The -`UD-Q4_K_XL` quant is the recommended default (matches the bench cells). +**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection — drop-in combined `_MTP.gguf` quants tuned for this fork. 
One repo per model, 5 quants each (Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:

-| Target | Combined `_MTP.gguf` (target + NextN head) |
-|---|---|
-| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
-| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) |
+|---|---|---|
+| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
+| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+
+What makes UDT different from a vanilla `llama-quantize --imatrix` run:
+
+- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
+- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
+- **TurboQuant3-friendly mask** — `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
+- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
+- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
+
+Collection: [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176). Full recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Mask files: [`scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt`](scripts/quantize-masks).

### Quick start

```bash
# Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf;
# they resolve to the same cached file → the server takes the shared-model branch.
llama-server \
-    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-    -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+    -hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+    -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
    --spec-type nextn \
    --draft-max 2 --draft-min 1 \
    -c 8192 \
@@ -760,13 +768,15 @@ To learn more about model quantization, [read this documentation](tools/quantize
35B-A3B MoE the combination is **+24-36 % tps** vs the same target
without speculation. 
- Pre-built combined `_MTP.gguf` quants (recommended **`UD-Q4_K_XL`**, + Pre-built combined `_MTP.gguf` quants (recommended **`Q4_K_XL`**, matches the matrix bench cells): | Target | Combined `_MTP.gguf` | |---|---| - | Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | - | Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | + | Qwen 3.6 35B-A3B (MoE) — AtomicChat UDT | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | + | Qwen 3.6 27B (dense) — AtomicChat UDT | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) | + | Qwen 3.6 35B-A3B (MoE) — Unsloth | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | + | Qwen 3.6 27B (dense) — Unsloth | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | ```bash # Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf. diff --git a/docs/qwen-udt/RUNBOOK.md b/docs/qwen-udt/RUNBOOK.md new file mode 100644 index 00000000000..4430a426b1c --- /dev/null +++ b/docs/qwen-udt/RUNBOOK.md @@ -0,0 +1,160 @@ +# Qwen 3.6 UDT (UD-Turbo) — runbook + +This runbook covers **remote CUDA quantization**, **local Metal throughput benches**, and **Hugging Face release** for the AtomicChat `UDT` GGUF line. It implements the mask variants: + +| Variant | Mask file | Intent | +|--------|-----------|--------| +| `base` | `scripts/quantize-masks/qwen36-ud-base.txt` | Reproduce Unsloth-style imatrix + selective high-bit tensors | +| `v1` | `scripts/quantize-masks/qwen36-ud-v1-nextn.txt` | Bump NextN / MTP tensors to `q8_0` (acceptance-focused) | +| `v2` | `scripts/quantize-masks/qwen36-ud-v2-turbo3.txt` | Bump `attn_q` / `attn_k` to `q6_K` (TurboQuant3 KV stack) | +| `v3` | `scripts/quantize-masks/qwen36-ud-v3-combined.txt` | Union of v1 + v2 (default release recipe) | + +**Naming:** filenames use `...UDT-Q4_K_XL...` while `llama-quantize` is invoked with base family types `Q4_K_M` (the `XL` token denotes the extra tensor-type-file overrides, matching Unsloth’s naming style). + +**Attribution:** model weights follow the Qwen license; calibration uses Unsloth’s public `imatrix_unsloth.gguf_file` from each HF repo; masks and tooling are from this fork. + +--- + +## 0. Layout + +| Path | Role | +|------|------| +| `.scratch/qwen-ud-sources/27b/` | 27B BF16 shards + `imatrix_unsloth.gguf_file` + optional reference `UD-Q4_K_XL.gguf` | +| `.scratch/qwen-ud-sources/35a3b/` | Same for 35B-A3B | +| `.scratch/qwen-udt-quants/` | Output GGUFs | +| `.scratch/quant-logs/` | Matrix quantization logs | +| `.scratch/bench-logs/` | Local bench markdown | + +--- + +## 1. Remote host (Ubuntu + CUDA) + +```bash +ssh ubuntu@192.222.54.232 +git clone --depth 1 --branch master https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git +cd atomic-llama-cpp-turboquant +bash scripts/qwen-udt/remote-bootstrap.sh +``` + +Optional: set `REPO_URL` / `REPO_BRANCH` / `DEST` before running `remote-bootstrap.sh` on a machine **without** an existing checkout (see script header). 
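For a machine that starts completely bare, that looks like the following (mirrors the usage block in the script header; copy `remote-bootstrap.sh` over first, e.g. with `scp`):

```bash
# fresh box: env overrides tell the bootstrap where to clone + build
export REPO_URL=https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
export REPO_BRANCH=master
export DEST=$HOME/atomic-llama-cpp-turboquant
bash remote-bootstrap.sh
```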
+ +Log in for downloads: + +```bash +huggingface-cli login +``` + +### 1.1 Download BF16 + imatrix + +From the repo root on the remote: + +```bash +bash scripts/qwen-udt/hf-download-sources.sh +``` + +If `huggingface-cli` rejects `--include`, download `BF16/` manually from the Hugging Face UI into `.scratch/qwen-ud-sources/{27b,35a3b}/BF16/`. + +### 1.2 Single quant + +```bash +export LLAMA_QUANTIZE="$PWD/build/bin/llama-quantize" +export QWEN_UDT_SOURCES_DIR="$PWD/.scratch/qwen-ud-sources" +export QWEN_UDT_OUT_DIR="$PWD/.scratch/qwen-udt-quants" +./scripts/quantize-qwen-udt.sh 27b Q4_K_M v3 +``` + +### 1.3 Full matrix (32 jobs: 2 models × 4 ftypes × 4 variants) + +```bash +unset IMATRIX_FILE BF16_INPUT +export QWEN_UDT_SKIP_BASE=0 # set to 1 after sanity passes to save disk/time +./scripts/quantize-qwen-udt-matrix.sh +``` + +### 1.4 Sanity (27B Q4 base vs Unsloth reference) + +```bash +./scripts/qwen-udt/run-sanity-q4-27b.sh +``` + +Compare reported perplexity and file size. Note: the reference `UD-Q4_K_XL.gguf` may be **non-MTP** while the reproduced artifact is **`*_MTP.gguf`** — small PPL deltas are expected. + +--- + +## 2. Copy artifacts to local Mac (Metal) + +```bash +export REMOTE=ubuntu@192.222.54.232 +export REMOTE_DIR='~/atomic-llama-cpp-turboquant/.scratch/qwen-udt-quants' +bash scripts/qwen-udt/rsync-pull-quants.example.sh +``` + +--- + +## 3. Local throughput matrix + +Build `llama-server` locally (Metal). Then: + +```bash +export BENCH_MATRIX_MD="$PWD/.scratch/bench-logs/qwen-udt-matrix-$(date +%Y%m%d).md" +mkdir -p "$(dirname "$BENCH_MATRIX_MD")" +export QWEN_UDT_BENCH_DIR="$PWD/.scratch/qwen-udt-quants" +bash scripts/bench-qwen-udt-matrix-local.sh +``` + +Ablation subsets (optional): + +```bash +export BENCH_MODES_FILTER='f16-nextn,turbo3-nextn' # v1 focus +bash scripts/bench-matrix-qwen.sh # with QWEN*_MTP paths set manually +``` + +--- + +## 4. Perplexity (quality) + +```bash +sh scripts/get-wikitext-2.sh +export WIKI_FILE="$PWD/wikitext-2-raw/wiki.test.raw" +./scripts/bench-qwen-udt-quality.sh \ + .scratch/qwen-udt-quants/Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf +``` + +Append the printed table into `BENCH_MATRIX_MD` by hand or with `tee -a`. + +--- + +## 5. Release decision + +Copy the template into `.scratch` (gitignored) for local edits, then fill after benches: + +```bash +mkdir -p .scratch/bench-logs +cp docs/qwen-udt/release-decision.md .scratch/bench-logs/qwen-udt-release-decision.md +``` + +Edit `.scratch/bench-logs/qwen-udt-release-decision.md` (or edit [release-decision.md](./release-decision.md) in-repo and sync). + +--- + +## 6. Hugging Face upload + +Create empty model repos (if they do not exist): + +- `AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF` +- `AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF` + +Add `README.md` from [release/qwen-udt/MODEL_CARD_TEMPLATE.md](../../release/qwen-udt/MODEL_CARD_TEMPLATE.md). Then: + +```bash +huggingface-cli login +./scripts/qwen-udt/hf-upload-qwen-udt.sh /path/to/quants +``` + +Create an HF **Collection** in the UI grouping both repos. + +--- + +## 7. MoE mask follow-up + +If 35B-A3B shows router regressions, inspect tensor names (`llama-gguf-dump` / `gguf-py`) and extend `qwen36-ud-base.txt` with additional `ffn_*` or expert-specific overrides. This is expected to be an iterative step after the first 35B bench pass. 
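A minimal sketch of that inspection with `gguf-dump` from `gguf-py` (assumed installed via `pip install gguf`; the `ffn_*_exps` / `ffn_*_shexp` suffixes below are assumptions based on llama.cpp's usual MoE tensor naming, so confirm against the real dump first):

```bash
# list router/expert tensor names to decide which extra overrides to add
gguf-dump .scratch/qwen-udt-quants/Qwen3.6-35B-A3B-UDT-Q4_K_XL_MTP.gguf \
  | grep -E 'ffn_(gate|up|down)_(exps|shexp)|ffn_gate_inp' | head -20
# a pattern worth lifting becomes one `regex=ggml_type` line in
# scripts/quantize-masks/qwen36-ud-base.txt, e.g.:
#   blk\.[0-9]+\.ffn_down_exps\..*=q6_K
```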
diff --git a/docs/qwen-udt/release-decision.md b/docs/qwen-udt/release-decision.md
new file mode 100644
index 00000000000..22f71b95c39
--- /dev/null
+++ b/docs/qwen-udt/release-decision.md
@@ -0,0 +1,33 @@
+# Qwen 3.6 UDT — release decision log
+
+**Status:** pending benchmark data.
+
+After running:
+
+1. `scripts/bench-qwen-udt-matrix-local.sh` (throughput + acceptance)
+2. `scripts/bench-qwen-udt-quality.sh` (WikiText-2 PPL)
+
+fill the table below. Per `(model, bit-width)` publish **`v3`** by default. Publish **`v1`** or **`v2`** separately only if they beat `v3` on their target metric without regressing PPL / unrelated modes.
+
+## Matrix
+
+| Model | Bit (UDT) | PPL (wiki) v3 | PPL v1 | PPL v2 | Notes |
+|-------|-----------|---------------|--------|--------|-------|
+| 27B | Q3_K_XL | | | | |
+| 27B | Q4_K_XL | | | | |
+| 27B | Q5_K_XL | | | | |
+| 27B | Q6_K | | | | |
+| 35B-A3B | Q3_K_XL | | | | |
+| 35B-A3B | Q4_K_XL | | | | |
+| 35B-A3B | Q5_K_XL | | | | |
+| 35B-A3B | Q6_K | | | | |
+
+## Ship list
+
+- [ ] `AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF` — files:
+- [ ] `AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF` — files:
+
+## Sign-off
+
+- Date:
+- Reviewer:
diff --git a/scripts/bench-matrix-qwen.sh b/scripts/bench-matrix-qwen.sh
index dfa2b0d6c1b..d832c0fd0c0 100755
--- a/scripts/bench-matrix-qwen.sh
+++ b/scripts/bench-matrix-qwen.sh
@@ -7,6 +7,11 @@
#   bash scripts/bench-matrix-qwen.sh
# Env overrides:
#   QWEN27_BASE, QWEN27_MTP, QWEN35_BASE, QWEN35_MTP — GGUF paths
+#   QWEN_BENCH_COMBINED_GGUF_ONLY — if 1, set QWEN*_BASE to same path as QWEN*_MTP (only *_MTP.gguf exists)
+#   BENCH_MODES_FILTER — comma-separated mode ids (subset of f16-base,turbo3-base,f16-nextn,turbo3-nextn)
+#   BENCH_MATRIX_MD — append markdown summary table to this file
+#   BENCH_LABEL — optional markdown heading printed before the summary block
+#   BENCH_QWEN_MODELS — all (default) | 27 | 35 (run only one row in the matrix)
#   HOST, PORT, SHORT_N, LONG_N, RUNS, CTX (context size for server)

set -uo pipefail
@@ -32,14 +37,29 @@ QWEN27_MTP="${QWEN27_MTP:-$ROOT/.scratch/Qwen3.6-27B-UD-Q4_K_XL_MTP/Qwen3.6-27B-
QWEN35_BASE="${QWEN35_BASE:-$ROOT/.scratch/qwen-3.6-35b-a3b/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf}"
QWEN35_MTP="${QWEN35_MTP:-$ROOT/.scratch/Qwen3.6-35B-A3B-UD-Q4_K_XL_MTP/Qwen3.6-35B-A3B-UD-Q4_K_XL_MTP.gguf}"

+if [[ "${QWEN_BENCH_COMBINED_GGUF_ONLY:-0}" == "1" ]]; then
+  QWEN27_BASE="$QWEN27_MTP"
+  QWEN35_BASE="$QWEN35_MTP"
+fi
+
PROMPT='Write a detailed 300-word essay about the history of artificial intelligence, including early pioneers like Alan Turing and John McCarthy, key milestones such as the Dartmouth Conference and the development of expert systems, and future predictions about AGI and superintelligence.' 
# model_id | run_script | gguf_baseline (no NextN) | gguf_mtp (NextN draft = same file) -MODELS=( +ALL_MODELS=( "qwen-27B|${ROOT}/scripts/run-qwen36-27b-nextn-server.sh|${QWEN27_BASE}|${QWEN27_MTP}" "qwen-35B-A3B|${ROOT}/scripts/run-qwen36-35ba3b-nextn-server.sh|${QWEN35_BASE}|${QWEN35_MTP}" ) +case "${BENCH_QWEN_MODELS:-all}" in + all) MODELS=("${ALL_MODELS[@]}") ;; + 27) MODELS=("${ALL_MODELS[0]}") ;; + 35) MODELS=("${ALL_MODELS[1]}") ;; + *) + echo "error: BENCH_QWEN_MODELS must be all|27|35 (got ${BENCH_QWEN_MODELS})" >&2 + exit 1 + ;; +esac + # mode_id | SPEC (off|nextn) | CTK (propagated to CTV/CTKD/CTVD by run script) MODES=( "f16-base|off|f16" @@ -48,6 +68,12 @@ MODES=( "turbo3-nextn|nextn|turbo3" ) +mode_allowed() { + local mid="$1" + [[ -z "${BENCH_MODES_FILTER:-}" ]] && return 0 + [[ ",${BENCH_MODES_FILTER}," == *",${mid},"* ]] +} + SRV_LOG=$(mktemp -t bench-matrix-qwen-srv.XXXXXX.log) stop_server() { @@ -263,28 +289,43 @@ for model_entry in "${MODELS[@]}"; do fi for mode_entry in "${MODES[@]}"; do IFS='|' read -r mode_id spec ctk <<< "${mode_entry}" + mode_allowed "$mode_id" || continue run_cell "${model_id}" "${run_script}" "${gguf_base}" "${gguf_mtp}" "${mode_id}" "${spec}" "${ctk}" || true done done -echo "" -echo "## Qwen3.6 bench matrix (median tps over ${RUNS} runs; accept% from draft_n / draft_n_accepted)" -echo "" -printf "| model | mode | short tps (n=%d) | long tps (n=%d) | short accept | long accept |\n" "${SHORT_N}" "${LONG_N}" -echo "|---|---|---:|---:|---:|---:|" -for model_entry in "${MODELS[@]}"; do - IFS='|' read -r model_id _ _ _ <<< "${model_entry}" - for mode_entry in "${MODES[@]}"; do - IFS='|' read -r mode_id _ _ <<< "${mode_entry}" - short_val="${RESULTS["${model_id}|${mode_id}|short"]:-N/A|-}" - long_val="${RESULTS["${model_id}|${mode_id}|long"]:-N/A|-}" - short_tps="${short_val%|*}" - short_acc="${short_val#*|}" - long_tps="${long_val%|*}" - long_acc="${long_val#*|}" - printf "| %s | %s | %s | %s | %s%% | %s%% |\n" \ - "${model_id}" "${mode_id}" "${short_tps}" "${long_tps}" "${short_acc}" "${long_acc}" +emit_summary() { + if [[ -n "${BENCH_LABEL:-}" ]]; then + echo "${BENCH_LABEL}" + echo "" + fi + echo "## Qwen3.6 bench matrix (median tps over ${RUNS} runs; accept% from draft_n / draft_n_accepted)" + echo "" + printf "| model | mode | short tps (n=%d) | long tps (n=%d) | short accept | long accept |\n" "${SHORT_N}" "${LONG_N}" + echo "|---|---|---:|---:|---:|---:|" + for model_entry in "${MODELS[@]}"; do + IFS='|' read -r model_id _ _ _ <<< "${model_entry}" + for mode_entry in "${MODES[@]}"; do + IFS='|' read -r mode_id _ _ <<< "${mode_entry}" + mode_allowed "$mode_id" || continue + short_val="${RESULTS["${model_id}|${mode_id}|short"]:-N/A|-}" + long_val="${RESULTS["${model_id}|${mode_id}|long"]:-N/A|-}" + short_tps="${short_val%|*}" + short_acc="${short_val#*|}" + long_tps="${long_val%|*}" + long_acc="${long_val#*|}" + printf "| %s | %s | %s | %s | %s%% | %s%% |\n" \ + "${model_id}" "${mode_id}" "${short_tps}" "${long_tps}" "${short_acc}" "${long_acc}" + done done -done -echo "" -echo "(last server log: ${SRV_LOG})" + echo "" + echo "(last server log: ${SRV_LOG})" +} + +if [[ -n "${BENCH_MATRIX_MD:-}" ]]; then + echo "" + emit_summary | tee -a "$BENCH_MATRIX_MD" +else + echo "" + emit_summary +fi diff --git a/scripts/bench-qwen-udt-matrix-local.sh b/scripts/bench-qwen-udt-matrix-local.sh new file mode 100755 index 00000000000..c301cd78707 --- /dev/null +++ b/scripts/bench-qwen-udt-matrix-local.sh @@ -0,0 +1,71 @@ +#!/usr/bin/env bash +# Run 
scripts/bench-matrix-qwen.sh for each UDT GGUF in a directory (combined MTP files).
+#
+# Usage:
+#   QWEN_UDT_BENCH_DIR=/path/to/quants ./scripts/bench-qwen-udt-matrix-local.sh
+# Env:
+#   QWEN_UDT_BENCH_DIR     directory containing *.gguf (default: ${ROOT}/.scratch/qwen-udt-quants)
+#   BENCH_MATRIX_MD        append all summaries to this markdown file
+#   QWEN_UDT_BENCH_TAG     optional filter substring (e.g. "-V1" or "Q4_K_XL")
+#   QWEN_UDT_ABLATION_AUTO if 1 (default), restrict BENCH_MODES_FILTER for -V1 / -V2 filenames
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+DIR="${QWEN_UDT_BENCH_DIR:-${ROOT}/.scratch/qwen-udt-quants}"
+TAG="${QWEN_UDT_BENCH_TAG:-}"
+OUT_MD="${BENCH_MATRIX_MD:-}"
+ABL_AUTO="${QWEN_UDT_ABLATION_AUTO:-1}"
+
+if [[ ! -d "$DIR" ]]; then
+  echo "error: directory not found: $DIR" >&2
+  exit 1
+fi
+
+# portable to bash 3.2 (the macOS system bash lacks mapfile); find emits one path per line
+FILES=()
+while IFS= read -r f; do FILES+=("$f"); done < <(find "$DIR" -maxdepth 1 -name '*.gguf' -print | sort)
+
+if [[ ${#FILES[@]} -eq 0 ]]; then
+  echo "error: no .gguf under $DIR" >&2
+  exit 1
+fi
+
+for f in "${FILES[@]}"; do
+  base=$(basename "$f")
+  [[ -n "$TAG" && "$base" != *"$TAG"* ]] && continue
+  echo "" >&2
+  echo "===== bench: $base =====" >&2
+  if [[ "$base" == *"35B-A3B"* ]] || [[ "$base" == *"35B-A3b"* ]]; then
+    kind=35
+  elif [[ "$base" == *"27B"* ]]; then
+    kind=27
+  else
+    echo "skip (unknown model prefix): $base" >&2
+    continue
+  fi
+  (
+    if [[ "$kind" == "35" ]]; then
+      export BENCH_QWEN_MODELS=35
+      export QWEN35_MTP="$f" QWEN35_BASE="$f"
+    else
+      export BENCH_QWEN_MODELS=27
+      export QWEN27_MTP="$f" QWEN27_BASE="$f"
+    fi
+    export QWEN_BENCH_COMBINED_GGUF_ONLY=1
+    export BENCH_LABEL="### UDT bench: ${base}"
+    if [[ -n "$OUT_MD" ]]; then
+      export BENCH_MATRIX_MD="$OUT_MD"
+    fi
+    if [[ "$ABL_AUTO" == "1" ]]; then
+      if [[ "$base" == *"-V1"* ]]; then
+        export BENCH_MODES_FILTER="f16-nextn,turbo3-nextn"
+      elif [[ "$base" == *"-V2"* ]]; then
+        export BENCH_MODES_FILTER="turbo3-base,turbo3-nextn"
+      else
+        unset BENCH_MODES_FILTER || true
+      fi
+    fi
+    bash "${ROOT}/scripts/bench-matrix-qwen.sh"
+  )
+done
diff --git a/scripts/bench-qwen-udt-quality.sh b/scripts/bench-qwen-udt-quality.sh
new file mode 100755
index 00000000000..cafca1ef126
--- /dev/null
+++ b/scripts/bench-qwen-udt-quality.sh
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+# Run llama-perplexity on one or more GGUFs (WikiText-2 test split by default;
+# optional second pass on a small chat-style text file).
+#
+# Usage:
+#   ./scripts/bench-qwen-udt-quality.sh model1.gguf [model2.gguf ...]
+# Env:
+#   ROOT, LLAMA_PERPLEXITY, WIKI_FILE, CHAT_FILE, PPL_THREADS, PPL_NGL
+#
+# CHAT_FILE defaults to scripts/qwen-udt/sample-chat-calib.txt when USE_CHAT_CALIB=1.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+PPL="${LLAMA_PERPLEXITY:-${ROOT}/build/bin/llama-perplexity}"
+WIKI="${WIKI_FILE:-${ROOT}/wikitext-2-raw/wiki.test.raw}"
+CHAT="${CHAT_FILE:-${ROOT}/scripts/qwen-udt/sample-chat-calib.txt}"
+USE_CHAT="${USE_CHAT_CALIB:-1}"
+NGL="${PPL_NGL:-99}"
+THREADS="${PPL_THREADS:-8}"
+
+if [[ $# -lt 1 ]]; then
+  echo "usage: $0 model1.gguf [model2.gguf ...]" >&2
+  exit 1
+fi
+
+if [[ ! -f "$WIKI" ]]; then
+  echo "error: wiki corpus not found: $WIKI" >&2
+  echo "hint: run (cd \"$ROOT\" && sh scripts/get-wikitext-2.sh)" >&2
+  exit 1
+fi
+if [[ ! 
-f "$PPL" ]]; then
+  echo "error: llama-perplexity not found: $PPL" >&2
+  exit 1
+fi
+
+run_ppl_table() {
+  local label="$1"
+  local datafile="$2"
+  echo ""
+  echo "## PPL — ${label}"
+  echo ""
+  echo "| file | tail metric line |"
+  echo "|---|---|"
+  for gguf in "${@:3}"; do
+    if [[ ! -f "$gguf" ]]; then
+      echo "| $gguf | MISSING |"
+      continue
+    fi
+    log=$(mktemp -t ppl-qwen-udt.XXXXXX.log)
+    if ! "$PPL" -m "$gguf" -f "$datafile" -ngl "$NGL" -t "$THREADS" >"$log" 2>&1; then
+      echo "| $(basename "$gguf") | FAIL |"
+      rm -f "$log"
+      continue
+    fi
+    tail_line=$(grep -E '[Pp]erplexity|ppl' "$log" | tail -1 || true)
+    echo "| $(basename "$gguf") | ${tail_line:-see $log} |"
+    if [[ -n "$tail_line" ]]; then rm -f "$log"; fi  # keep the log around when the row says "see $log"
+  done
+}
+
+run_ppl_table "WikiText-2 (wiki.test.raw)" "$WIKI" "$@"
+
+if [[ "$USE_CHAT" == "1" && -f "$CHAT" ]]; then
+  run_ppl_table "chat-style calib (${CHAT})" "$CHAT" "$@"
+elif [[ "$USE_CHAT" == "1" ]]; then
+  echo "" >&2
+  echo "info: USE_CHAT_CALIB=1 but CHAT_FILE missing: $CHAT (skip second pass)" >&2
+fi
diff --git a/scripts/get-wikitext-2.sh b/scripts/get-wikitext-2.sh
index bd03ad35263..8b172e30eab 100755
--- a/scripts/get-wikitext-2.sh
+++ b/scripts/get-wikitext-2.sh
@@ -17,11 +17,11 @@ have_cmd() {
}

dl() {
-    [ -f "$2" ] && return
    if have_cmd wget; then
-        wget "$1" -O "$2"
+        wget -q "$1" -O "$2" || return 1
    elif have_cmd curl; then
-        curl -L "$1" -o "$2"
+        # -f: fail on HTTP errors (avoid HTML error pages named .zip)
+        curl -fsSL "$1" -o "$2" || return 1
    else
        die "Please install wget or curl"
    fi
}
@@ -30,8 +30,13 @@ have_cmd unzip || die "Please install unzip"

if [ ! -f "$FILE" ]; then
-    dl "$URL" "$ZIP" || exit
-    unzip -o "$ZIP" || exit
+    if [ -f "$ZIP" ] && ! unzip -t "$ZIP" >/dev/null 2>&1; then
+        rm -f -- "$ZIP"
+    fi
+    if [ ! -f "$ZIP" ]; then
+        dl "$URL" "$ZIP" || exit 1
+    fi
+    unzip -o "$ZIP" || exit 1
    rm -f -- "$ZIP"
fi
diff --git a/scripts/quantize-masks/README.md b/scripts/quantize-masks/README.md
new file mode 100644
index 00000000000..31325e24b1f
--- /dev/null
+++ b/scripts/quantize-masks/README.md
@@ -0,0 +1,14 @@
+# Quantize tensor-type masks
+
+Text files consumed by `llama-quantize --tensor-type-file`. Each non-empty line is `regex=ggml_type` (tensor name regex is lower-cased by quantize; type names are case-insensitive).
+
+## Qwen 3.6 UDT
+
+| File | Purpose |
+|------|---------|
+| `qwen36-ud-base.txt` | Baseline dynamic recipe + MoE router input |
+| `qwen36-ud-v1-nextn.txt` | Preserve NextN / MTP head weights |
+| `qwen36-ud-v2-turbo3.txt` | Lift Q/K for TurboQuant3 KV stacks |
+| `qwen36-ud-v3-combined.txt` | Default release (v1 ∪ v2) |
+
+See [docs/qwen-udt/RUNBOOK.md](../../docs/qwen-udt/RUNBOOK.md). 
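For reference, a mask is applied together with the imatrix in one `llama-quantize` call. The sketch below mirrors the final invocation in `scripts/quantize-qwen-udt.sh` (shard and output paths are illustrative):

```bash
# BF16 input + Unsloth imatrix + v3 mask -> combined UDT _MTP.gguf
./build/bin/llama-quantize \
  --imatrix .scratch/qwen-ud-sources/27b/imatrix_unsloth.gguf_file \
  --tensor-type-file scripts/quantize-masks/qwen36-ud-v3-combined.txt \
  .scratch/qwen-ud-sources/27b/BF16/Qwen3.6-27B-BF16-00001-of-00002.gguf \
  .scratch/qwen-udt-quants/Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
  Q4_K_M 16
```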
diff --git a/scripts/quantize-masks/qwen36-ud-base.txt b/scripts/quantize-masks/qwen36-ud-base.txt new file mode 100644 index 00000000000..a5f6f92418c --- /dev/null +++ b/scripts/quantize-masks/qwen36-ud-base.txt @@ -0,0 +1,5 @@ +token_embd\..*=q8_0 +output\.weight=q8_0 +attn_v\..*=q6_K +ffn_down\..*=q6_K +ffn_gate_inp\..*=q8_0 diff --git a/scripts/quantize-masks/qwen36-ud-v1-nextn.txt b/scripts/quantize-masks/qwen36-ud-v1-nextn.txt new file mode 100644 index 00000000000..90e656aba26 --- /dev/null +++ b/scripts/quantize-masks/qwen36-ud-v1-nextn.txt @@ -0,0 +1,7 @@ +token_embd\..*=q8_0 +output\.weight=q8_0 +attn_v\..*=q6_K +ffn_down\..*=q6_K +ffn_gate_inp\..*=q8_0 +blk\.[0-9]+\.nextn\..*=q8_0 +mtp\..*=q8_0 diff --git a/scripts/quantize-masks/qwen36-ud-v2-turbo3.txt b/scripts/quantize-masks/qwen36-ud-v2-turbo3.txt new file mode 100644 index 00000000000..e128bd1d3c1 --- /dev/null +++ b/scripts/quantize-masks/qwen36-ud-v2-turbo3.txt @@ -0,0 +1,7 @@ +token_embd\..*=q8_0 +output\.weight=q8_0 +attn_v\..*=q6_K +ffn_down\..*=q6_K +ffn_gate_inp\..*=q8_0 +blk\.[0-9]+\.attn_q\..*=q6_K +blk\.[0-9]+\.attn_k\..*=q6_K diff --git a/scripts/quantize-masks/qwen36-ud-v3-combined.txt b/scripts/quantize-masks/qwen36-ud-v3-combined.txt new file mode 100644 index 00000000000..2a2e1c0d326 --- /dev/null +++ b/scripts/quantize-masks/qwen36-ud-v3-combined.txt @@ -0,0 +1,9 @@ +token_embd\..*=q8_0 +output\.weight=q8_0 +attn_v\..*=q6_K +ffn_down\..*=q6_K +ffn_gate_inp\..*=q8_0 +blk\.[0-9]+\.nextn\..*=q8_0 +mtp\..*=q8_0 +blk\.[0-9]+\.attn_q\..*=q6_K +blk\.[0-9]+\.attn_k\..*=q6_K diff --git a/scripts/quantize-qwen-udt-matrix.sh b/scripts/quantize-qwen-udt-matrix.sh new file mode 100755 index 00000000000..8d18c1cfb13 --- /dev/null +++ b/scripts/quantize-qwen-udt-matrix.sh @@ -0,0 +1,45 @@ +#!/usr/bin/env bash +# Batch quantize: 2 models x 4 ftypes x 4 variants (base,v1,v2,v3) = 32 jobs. +# Skip base after validation with QWEN_UDT_SKIP_BASE=1. +# +# Environment: same as scripts/quantize-qwen-udt.sh plus +# QWEN_UDT_SKIP_BASE if 1, skip base variant rows +# QWEN_UDT_LOG_DIR per-job logs (default: ${ROOT}/.scratch/quant-logs) + +set -euo pipefail + +unset IMATRIX_FILE BF16_INPUT || true + +ROOT="$(cd "$(dirname "$0")/.." && pwd)" +LOG_DIR="${QWEN_UDT_LOG_DIR:-${ROOT}/.scratch/quant-logs}" +TS="$(date +%Y%m%d-%H%M%S)" +MASTER_LOG="${LOG_DIR}/matrix-${TS}.log" +mkdir -p "$LOG_DIR" + +MODELS=(27b 35a3b) +FTYPES=(Q3_K_M Q4_K_M Q5_K_M Q6_K) +VARIANTS=(base v1 v2 v3) + +if [[ "${QWEN_UDT_SKIP_BASE:-0}" == "1" ]]; then + VARIANTS=(v1 v2 v3) +fi + +echo "info: logging to ${MASTER_LOG}" | tee -a "$MASTER_LOG" + +for m in "${MODELS[@]}"; do + for f in "${FTYPES[@]}"; do + for v in "${VARIANTS[@]}"; do + echo "" | tee -a "$MASTER_LOG" + echo "=== $(date -Iseconds) START ${m} ${f} ${v} ===" | tee -a "$MASTER_LOG" + t0=$(date +%s) + if ! "${ROOT}/scripts/quantize-qwen-udt.sh" "$m" "$f" "$v" >>"$MASTER_LOG" 2>&1; then + echo "FAIL: ${m} ${f} ${v}" | tee -a "$MASTER_LOG" + exit 1 + fi + t1=$(date +%s) + echo "=== DONE ${m} ${f} ${v} in $((t1 - t0))s ===" | tee -a "$MASTER_LOG" + done + done +done + +echo "info: matrix complete. Master log: ${MASTER_LOG}" | tee -a "$MASTER_LOG" diff --git a/scripts/quantize-qwen-udt.sh b/scripts/quantize-qwen-udt.sh new file mode 100755 index 00000000000..ab2be7aa870 --- /dev/null +++ b/scripts/quantize-qwen-udt.sh @@ -0,0 +1,128 @@ +#!/usr/bin/env bash +# Quantize Qwen 3.6 27B / 35B-A3B MTP GGUF with Unsloth MTP-aware imatrix + UDT tensor-type masks. 
+#
+# Usage:
+#   ./scripts/quantize-qwen-udt.sh <27b|35a3b> <Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0> <base|v1|v2|v3> [out.gguf]
+#
+# Environment:
+#   ROOT                   repo root (default: parent of scripts/)
+#   LLAMA_QUANTIZE         path to llama-quantize (default: ${ROOT}/build/bin/llama-quantize)
+#   QWEN_UDT_SOURCES_DIR   directory with BF16 shards + imatrix (see docs/qwen-udt/RUNBOOK.md)
+#   QUANT_THREADS          thread count (default: nproc)
+#   QWEN_UDT_OUT_DIR       output directory (default: ${ROOT}/.scratch/qwen-udt-quants)
+#
+# Output filename (when out.gguf omitted):
+#   Qwen3.6-27B-UDT-{Q3_K_XL|Q4_K_XL|Q5_K_XL|Q6_K|Q8_K_XL}[-V1|-V2|-base]_MTP.gguf
+#   Qwen3.6-35B-A3B-UDT-{Q3_K_XL|Q4_K_XL|Q5_K_XL|Q6_K|Q8_K_XL}[-V1|-V2|-base]_MTP.gguf
+#   v3 (release) omits the -V3 suffix.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+QUANT="${LLAMA_QUANTIZE:-${ROOT}/build/bin/llama-quantize}"
+SOURCES="${QWEN_UDT_SOURCES_DIR:-${ROOT}/.scratch/qwen-ud-sources}"
+OUT_DIR="${QWEN_UDT_OUT_DIR:-${ROOT}/.scratch/qwen-udt-quants}"
+THREADS="${QUANT_THREADS:-$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8)}"
+
+usage() {
+  echo "usage: $0 <27b|35a3b> <Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0> <base|v1|v2|v3> [out.gguf]" >&2
+  exit 1
+}
+
+[[ $# -ge 3 ]] || usage
+
+MODEL="$1"
+FTYPE="$2"
+VARIANT="$3"
+CUSTOM_OUT="${4:-}"
+
+case "$MODEL" in
+  27b)
+    PREFIX="Qwen3.6-27B"
+    SUB="${QWEN_UDT_SUBDIR_27:-27b}"
+    IMT="${IMATRIX_FILE:-${SOURCES}/${SUB}/imatrix_unsloth.gguf_file}"
+    INP="${BF16_INPUT:-}"
+    if [[ -z "$INP" ]]; then
+      shopt -s nullglob
+      cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-27B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-27B-BF16-*.gguf )
+      shopt -u nullglob
+      INP="${cand[0]:-}"
+    fi
+    ;;
+  35a3b)
+    PREFIX="Qwen3.6-35B-A3B"
+    SUB="${QWEN_UDT_SUBDIR_35:-35a3b}"
+    IMT="${IMATRIX_FILE:-${SOURCES}/${SUB}/imatrix_unsloth.gguf_file}"
+    INP="${BF16_INPUT:-}"
+    if [[ -z "$INP" ]]; then
+      shopt -s nullglob
+      cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-35B-A3B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-35B-A3B-BF16-*.gguf )
+      shopt -u nullglob
+      INP="${cand[0]:-}"
+    fi
+    ;;
+  *)
+    usage
+    ;;
+esac
+
+case "$FTYPE" in
+  Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0) ;;
+  *) echo "error: unsupported ftype '$FTYPE' (expected Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0)" >&2; exit 1 ;;
+esac
+
+case "$FTYPE" in
+  Q3_K_M|Q4_K_M|Q5_K_M) XL_TAG="${FTYPE/_K_M/_K_XL}" ;;
+  Q6_K) XL_TAG="Q6_K" ;;
+  Q8_0) XL_TAG="Q8_K_XL" ;;
+esac
+case "$VARIANT" in
+  base) SUFFIX="-base" ;;
+  v1) SUFFIX="-V1" ;;
+  v2) SUFFIX="-V2" ;;
+  v3) SUFFIX="" ;;
+  *) echo "error: unknown variant '$VARIANT'" >&2; exit 1 ;;
+esac
+
+case "$VARIANT" in
+  base) MASK="${ROOT}/scripts/quantize-masks/qwen36-ud-base.txt" ;;
+  v1) MASK="${ROOT}/scripts/quantize-masks/qwen36-ud-v1-nextn.txt" ;;
+  v2) MASK="${ROOT}/scripts/quantize-masks/qwen36-ud-v2-turbo3.txt" ;;
+  v3) MASK="${ROOT}/scripts/quantize-masks/qwen36-ud-v3-combined.txt" ;;
+esac
+
+if [[ ! -f "$MASK" ]]; then
+  echo "error: mask file not found: $MASK" >&2
+  exit 1
+fi
+if [[ ! -f "$IMT" ]]; then
+  echo "error: imatrix not found: $IMT (set IMATRIX_FILE or QWEN_UDT_SOURCES_DIR)" >&2
+  exit 1
+fi
+if [[ ! -f "$INP" ]]; then
+  echo "error: BF16 input shard not found: $INP" >&2
+  echo "hint: download Unsloth BF16 shards; use first shard *-00001-of-*.gguf as BF16_INPUT" >&2
+  exit 1
+fi
+if [[ ! -x "$QUANT" && ! 
-f "$QUANT" ]]; then + echo "error: llama-quantize not found: $QUANT" >&2 + exit 1 +fi + +mkdir -p "$OUT_DIR" +if [[ -n "$CUSTOM_OUT" ]]; then + OUT="$CUSTOM_OUT" +else + OUT="${OUT_DIR}/${PREFIX}-UDT-${XL_TAG}${SUFFIX}_MTP.gguf" +fi + +echo "info: quantize model=${MODEL} ftype=${FTYPE} variant=${VARIANT}" >&2 +echo "info: in=${INP}" >&2 +echo "info: out=${OUT}" >&2 +echo "info: imatrix=${IMT}" >&2 +echo "info: mask=${MASK}" >&2 + +exec "$QUANT" \ + --imatrix "$IMT" \ + --tensor-type-file "$MASK" \ + "$INP" "$OUT" "$FTYPE" "$THREADS" diff --git a/scripts/qwen-udt/hf-download-sources.sh b/scripts/qwen-udt/hf-download-sources.sh new file mode 100755 index 00000000000..0f0955495f4 --- /dev/null +++ b/scripts/qwen-udt/hf-download-sources.sh @@ -0,0 +1,47 @@ +#!/usr/bin/env bash +# Download BF16 shards + imatrix + reference UD-Q4_K_XL.gguf from Unsloth HF repos. +# +# Usage: +# bash scripts/qwen-udt/hf-download-sources.sh [DEST_DIR] +# +# Default DEST_DIR: ${REPO}/.scratch/qwen-ud-sources with per-model subdirs 27b/ and 35a3b/ +# +# Requires: `hf` (huggingface_hub>=1.0) or the older `huggingface-cli`. + +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +DEST="${1:-${ROOT}/.scratch/qwen-ud-sources}" + +if command -v hf >/dev/null 2>&1; then + HF=hf +elif command -v huggingface-cli >/dev/null 2>&1; then + HF=huggingface-cli +else + echo 'error: neither `hf` nor `huggingface-cli` found (pip install -U "huggingface_hub[cli]")' >&2 + exit 1 +fi + +mkdir -p "$DEST/27b" "$DEST/35a3b" + +dl() { + local repo="$1"; shift + local local_dir="$1"; shift + "$HF" download "$repo" "$@" --local-dir "$local_dir" +} + +echo "info: 27B — imatrix..." +dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" imatrix_unsloth.gguf_file +echo "info: 27B — reference UD-Q4_K_XL..." +dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" Qwen3.6-27B-UD-Q4_K_XL.gguf +echo "info: 27B — BF16 shards..." +dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" --include "BF16/*" + +echo "info: 35B-A3B — imatrix..." +dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" imatrix_unsloth.gguf_file +echo "info: 35B-A3B — reference UD-Q4_K_XL..." +dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf +echo "info: 35B-A3B — BF16 shards..." +dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" --include "BF16/*" + +echo "ok: sources under $DEST/{27b,35a3b}" diff --git a/scripts/qwen-udt/hf-upload-qwen-udt.sh b/scripts/qwen-udt/hf-upload-qwen-udt.sh new file mode 100755 index 00000000000..e8bddb05d41 --- /dev/null +++ b/scripts/qwen-udt/hf-upload-qwen-udt.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# Upload release GGUFs to Hugging Face (AtomicChat org). +# +# Usage: +# bash scripts/qwen-udt/hf-upload-qwen-udt.sh /path/to/local/quants/dir +# +# Repos (create empty model repos + README via the HF UI first if needed): +# AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF +# AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF +# +# Requires: `hf` (huggingface_hub>=1.0) or the older `huggingface-cli`. + +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +DIR="${1:-${ROOT}/.scratch/qwen-udt-quants}" + +if command -v hf >/dev/null 2>&1; then + HF=hf +elif command -v huggingface-cli >/dev/null 2>&1; then + HF=huggingface-cli +else + echo 'error: neither `hf` nor `huggingface-cli` found' >&2 + exit 1 +fi + +if [[ ! 
-d "$DIR" ]]; then
+  echo "error: not a directory: $DIR" >&2
+  exit 1
+fi
+
+upload_one() {
+  local f="$1"
+  local base
+  base=$(basename "$f")
+  local repo
+  if [[ "$base" == Qwen3.6-27B-* ]]; then
+    repo="AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF"
+  elif [[ "$base" == Qwen3.6-35B-A3B-* ]]; then
+    repo="AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF"
+  else
+    echo "skip: $base"
+    return
+  fi
+  echo "info: upload $base -> $repo"
+  "$HF" upload "$repo" "$f" "$base" --repo-type model --commit-message "Add ${base}"
+}
+
+shopt -s nullglob
+for f in "$DIR"/Qwen3.6-27B-UDT-*.gguf "$DIR"/Qwen3.6-35B-A3B-UDT-*.gguf; do
+  [[ -f "$f" ]] || continue
+  upload_one "$f"
+done
+shopt -u nullglob
+
+echo "ok: upload pass complete (large files may take hours)."
diff --git a/scripts/qwen-udt/ppl-matrix-remote.sh b/scripts/qwen-udt/ppl-matrix-remote.sh
new file mode 100644
index 00000000000..800ebf3b64d
--- /dev/null
+++ b/scripts/qwen-udt/ppl-matrix-remote.sh
@@ -0,0 +1,51 @@
+#!/usr/bin/env bash
+# Run llama-perplexity over wikitext-2 + sample-chat for every UDT quant on the remote CUDA box.
+#
+# Writes one CSV row per quant to .scratch/bench-logs/qwen-udt-ppl-<UTC timestamp>.csv.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+PPL="${LLAMA_PERPLEXITY:-${ROOT}/build/bin/llama-perplexity}"
+QDIR="${QWEN_UDT_QUANT_DIR:-${ROOT}/.scratch/qwen-udt-quants}"
+WIKI="${WIKITEXT:-${ROOT}/wikitext-2-raw/wiki.test.raw}"
+CHAT="${CHAT_CALIB:-${ROOT}/scripts/qwen-udt/sample-chat-calib.txt}"
+NGL="${NGL:-99}"
+THREADS="${THREADS:-8}"
+STAMP="$(date -u +%Y%m%d-%H%M%S)"
+OUT="${ROOT}/.scratch/bench-logs/qwen-udt-ppl-${STAMP}.csv"
+
+mkdir -p "$(dirname "$OUT")"
+[[ -f "$WIKI" ]] || { echo "error: wikitext not found: $WIKI" >&2; exit 1; }
+[[ -x "$PPL" || -f "$PPL" ]] || { echo "error: $PPL not built" >&2; exit 1; }
+
+echo "file,size_bytes,ppl_wikitext2,ppl_wikitext2_err,ppl_chat,ppl_chat_err,seconds" > "$OUT"
+
+run_ppl() {
+  local model="$1"
+  local corpus="$2"
+  "$PPL" -m "$model" -f "$corpus" -ngl "$NGL" -t "$THREADS" 2>&1 \
+    | awk '/Final estimate: PPL =/ {print $5","$7}' \
+    | head -1
+}
+
+for f in "$QDIR"/Qwen3.6-27B-UDT-*.gguf "$QDIR"/Qwen3.6-35B-A3B-UDT-*.gguf; do
+  [[ -f "$f" ]] || continue
+  base="$(basename "$f")"
+  size="$(stat -c %s "$f" 2>/dev/null || stat -f %z "$f")"
+  echo "info: PPL $base" >&2
+  t0=$(date +%s)
+  wiki_line=$(run_ppl "$f" "$WIKI" || true)  # a failing quant records NA instead of aborting the sweep under set -e
+  chat_line=""
+  if [[ -f "$CHAT" ]]; then
+    chat_line=$(run_ppl "$f" "$CHAT" || true)
+  fi
+  t1=$(date +%s)
+  wiki_val="${wiki_line%%,*}"
+  wiki_err="${wiki_line##*,}"
+  chat_val="${chat_line%%,*}"
+  chat_err="${chat_line##*,}"
+  echo "$base,$size,${wiki_val:-NA},${wiki_err:-NA},${chat_val:-NA},${chat_err:-NA},$((t1-t0))" >> "$OUT"
+done
+
+echo "ok: $OUT"
diff --git a/scripts/qwen-udt/remote-bootstrap.sh b/scripts/qwen-udt/remote-bootstrap.sh
new file mode 100755
index 00000000000..1890c546431
--- /dev/null
+++ b/scripts/qwen-udt/remote-bootstrap.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+# One-shot bootstrap for Ubuntu CUDA host: deps, optional clone, build quantize + perplexity + server.
+#
+# Typical usage (already inside a git checkout):
+#   cd atomic-llama-cpp-turboquant
+#   bash scripts/qwen-udt/remote-bootstrap.sh
+#
+# Fresh machine (no checkout yet):
+#   export REPO_URL=https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
+#   export REPO_BRANCH=master
+#   export DEST=$HOME/atomic-llama-cpp-turboquant
+#   bash remote-bootstrap.sh
+#
+# Requires: NVIDIA driver + CUDA toolkit matching the driver (nvidia-smi works). 
+ +set -euo pipefail + +REPO_URL="${REPO_URL:-https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git}" +REPO_BRANCH="${REPO_BRANCH:-master}" + +if [[ -f "$(pwd)/CMakeLists.txt" ]]; then + DEST="$(pwd)" +elif [[ -n "${DEST:-}" && -f "$DEST/CMakeLists.txt" ]]; then + DEST="$(cd "$DEST" && pwd)" +else + DEST="${DEST:-$HOME/atomic-llama-cpp-turboquant}" + if [[ ! -d "$DEST/.git" ]]; then + git clone --depth 1 --branch "$REPO_BRANCH" "$REPO_URL" "$DEST" + else + git -C "$DEST" fetch --depth 1 origin "$REPO_BRANCH" || true + git -C "$DEST" checkout "$REPO_BRANCH" || true + git -C "$DEST" pull --ff-only || true + fi +fi + +sudo apt-get update +sudo apt-get install -y build-essential cmake git git-lfs python3-venv python3-pip curl ca-certificates pkg-config + +cmake -S "$DEST" -B "$DEST/build" \ + -DCMAKE_BUILD_TYPE=Release \ + -DGGML_CUDA=ON + +cmake --build "$DEST/build" -j "$(nproc)" --target llama-quantize llama-imatrix llama-perplexity llama-server + +python3 -m pip install --user -U "huggingface_hub[cli]" + +echo "ok: repo at $DEST ; binaries in $DEST/build/bin/" +ls -la "$DEST/build/bin/llama-quantize" "$DEST/build/bin/llama-perplexity" diff --git a/scripts/qwen-udt/rsync-pull-quants.example.sh b/scripts/qwen-udt/rsync-pull-quants.example.sh new file mode 100755 index 00000000000..6e86df49f1c --- /dev/null +++ b/scripts/qwen-udt/rsync-pull-quants.example.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash +# Example: copy quantized GGUFs from remote CUDA host to local Mac for Metal benches. +# +# export REMOTE=ubuntu@192.222.54.232 +# export REMOTE_DIR=~/atomic-llama-cpp-turboquant/.scratch/qwen-udt-quants +# bash scripts/qwen-udt/rsync-pull-quants.example.sh +# +# Adjust REMOTE / REMOTE_DIR to match your layout. + +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +REMOTE="${REMOTE:-ubuntu@192.222.54.232}" +REMOTE_DIR="${REMOTE_DIR:-~/atomic-llama-cpp-turboquant/.scratch/qwen-udt-quants}" +LOCAL="${LOCAL_DIR:-${ROOT}/.scratch/qwen-udt-quants}" + +mkdir -p "$LOCAL" +rsync -avP --progress "${REMOTE}:${REMOTE_DIR}/" "$LOCAL/" diff --git a/scripts/qwen-udt/run-sanity-q4-27b.sh b/scripts/qwen-udt/run-sanity-q4-27b.sh new file mode 100755 index 00000000000..fc7dc462bde --- /dev/null +++ b/scripts/qwen-udt/run-sanity-q4-27b.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash +# Phase 2c sanity: reproduce Unsloth UD-Q4_K_XL on 27B with our base mask + imatrix, then compare PPL to their published GGUF. +# +# Prereqs: scripts/qwen-udt/hf-download-sources.sh completed; llama-quantize + llama-perplexity built. +# +# Env: +# QWEN_UDT_SOURCES_DIR, ROOT, LLAMA_QUANTIZE, LLAMA_PERPLEXITY, WIKI_FILE + +set -euo pipefail + +ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +SOURCES="${QWEN_UDT_SOURCES_DIR:-${ROOT}/.scratch/qwen-ud-sources}" +OUT="${SANITY_OUT:-${ROOT}/.scratch/qwen-udt-quants/Qwen3.6-27B-UDT-Q4_K_XL-base_MTP.gguf}" +REF="${SANITY_REF:-${SOURCES}/27b/Qwen3.6-27B-UD-Q4_K_XL.gguf}" +WIKI="${WIKI_FILE:-${ROOT}/wikitext-2-raw/wiki.test.raw}" + +if [[ ! -f "$REF" ]]; then + echo "error: reference GGUF missing: $REF (run scripts/qwen-udt/hf-download-sources.sh)" >&2 + exit 1 +fi + +echo "info: quantize 27B Q4_K_M base -> $OUT" +QWEN_UDT_SOURCES_DIR="$SOURCES" QWEN_UDT_OUT_DIR="$(dirname "$OUT")" \ + "${ROOT}/scripts/quantize-qwen-udt.sh" 27b Q4_K_M base "$OUT" + +if [[ ! -f "$WIKI" ]]; then + echo "info: fetching wikitext-2..." 
+ (cd "$ROOT" && sh scripts/get-wikitext-2.sh) +fi + +PPL="${LLAMA_PERPLEXITY:-${ROOT}/build/bin/llama-perplexity}" +echo "info: PPL reference (Unsloth UD-Q4_K_XL)" +"$PPL" -m "$REF" -f "$WIKI" -ngl 99 -t 8 2>&1 | tail -5 + +echo "info: PPL ours (UDT base mask, MTP output)" +"$PPL" -m "$OUT" -f "$WIKI" -ngl 99 -t 8 2>&1 | tail -5 + +echo "ok: compare the two perplexity lines manually (expect small delta; MTP file vs non-MTP ref may differ slightly)." diff --git a/scripts/qwen-udt/sample-chat-calib.txt b/scripts/qwen-udt/sample-chat-calib.txt new file mode 100644 index 00000000000..965b7cf4435 --- /dev/null +++ b/scripts/qwen-udt/sample-chat-calib.txt @@ -0,0 +1,8 @@ +User: Explain speculative decoding in one paragraph. +Assistant: Speculative decoding runs a smaller draft model or auxiliary head to predict several tokens ahead; the large target model verifies them in parallel or with minimal extra compute, accepting a prefix of correct predictions and rolling back when a mismatch occurs. + +User: What is a KV cache? +Assistant: The KV cache stores key and value tensors from past tokens so attention can reuse them instead of recomputing the full history on each new token. + +User: Name three loss functions used in language modeling. +Assistant: Cross-entropy on next-token prediction, label smoothing cross-entropy, and optionally auxiliary losses such as z-loss for stabilizing large logits.