# Chapter 01: llama.cpp Toolchain & Patches

## Introduction

This project uses a **fork of llama.cpp** at `github.com/pestopoppa/llama.cpp` with local optimizations for the AMD EPYC 9655 "Turin" architecture. The fork includes parallel tensor repack (2.2x model loading speedup), sliding window attention (SWA) fixes for speculative decoding, and prompt lookup ported to llama-server.

The toolchain uses **git worktrees** to isolate production and experimental work, preventing branch conflicts when multiple agents share access. Production inference MUST use the `production-consolidated` branch; feature work happens in separate worktrees.

## Git Worktree Architecture

The codebase is split into two physical directories sharing a single git history. Production lives at `/mnt/raid0/llm/llama.cpp` and must always stay on the `production-consolidated` branch — all benchmarks and orchestration use this build. Experimental work happens in `/mnt/raid0/llm/llama.cpp-experimental`, where you can switch branches freely without affecting production.

<details>
<summary>Directory layout and worktree rules</summary>

| Directory | Branch | Purpose |
|-----------|--------|---------|
| `/mnt/raid0/llm/llama.cpp` | `production-consolidated` | **Production** - benchmarks, stable inference |
| `/mnt/raid0/llm/llama.cpp-experimental` | `feature/*` branches | **Experimental** - new features, research |

**Production directory** (`/mnt/raid0/llm/llama.cpp`):
- **NEVER** checkout a different branch
- **NEVER** commit experimental work
- Stay on `production-consolidated` at all times
- All benchmarks and orchestration use this build

**Experimental directory** (`/mnt/raid0/llm/llama.cpp-experimental`):
- Switch branches freely; changes here never affect production
- Test new features in isolation
- Built binaries land in `./build/bin/llama-*`

</details>

<details>
<summary>Common operations and branch verification</summary>

<details>
<summary>Code: worktree management commands</summary>

```bash
# Check current worktrees
cd /mnt/raid0/llm/llama.cpp
git worktree list

# Expected output:
# /mnt/raid0/llm/llama.cpp 6b43356a1 [production-consolidated]
# /mnt/raid0/llm/llama.cpp-experimental xxxxxxxx [feature/paged-attention]

# Start experimental work
cd /mnt/raid0/llm/llama.cpp-experimental
git checkout production-consolidated
git checkout -b feature/my-new-feature

# Build experimental version (Release, or CMake defaults to an unoptimized build)
cmake -B build -DGGML_NATIVE=ON -DGGML_AVX512=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j 96

# Verify binary version
./build/bin/llama-cli --version
```

</details>

<details>
<summary>Code: branch safety verification</summary>

```bash
# Manual verification
cd /mnt/raid0/llm/llama.cpp
git branch --show-current
# Output: production-consolidated

# If wrong branch, fix with:
git checkout production-consolidated
```

</details>

**Critical**: Never run benchmarks or live inference on a feature branch. Results won't be reproducible.

</details>

## Production Patches

The fork includes three major optimizations, two already merged upstream. These patches represent the project's direct contributions to the llama.cpp ecosystem — each one solves a real bottleneck we hit during inference optimization.

<details>
<summary>Patch 1: Parallel Tensor Repack (PR #18239)</summary>

**Status**: Merged upstream
**Speedup**: 2.2x faster model loading on 96-core systems

Parallelizes the tensor repack operation that converts GGUF tensors to runtime format. Before this patch, model loading was single-threaded and took 45-60 seconds for 235B Q4_K_M models. With parallel repack, loading drops to 20-25 seconds.

**Implementation**: Splits tensor repack across available threads using an OpenMP parallel loop with dynamic scheduling. Each thread processes independent tensor blocks, writing to pre-allocated output buffers.

| Method | Load Time | Speedup |
|--------|-----------|---------|
| Original (single-threaded) | 54.2s | 1.0x |
| Parallel repack | 24.8s | **2.2x** |
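
The pattern is easy to illustrate outside C++. A minimal Python sketch (illustrative only; the real patch is an OpenMP loop in C++, and `repack_block` here is a stand-in for the per-block repack kernel):

```python
from concurrent.futures import ThreadPoolExecutor

def repack_block(block):
    # Stand-in for the per-block repack kernel; any independent,
    # per-block transformation serves the illustration.
    return block[::-1]

def repack_parallel(blocks, n_workers=8):
    # Blocks are independent, so they can be farmed out to a pool and
    # written to pre-allocated output slots without locking.
    out = [None] * len(blocks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(repack_block, b): i for i, b in enumerate(blocks)}
        for fut, i in futures.items():
            out[i] = fut.result()
    return out
```

The result is identical to a sequential pass over the blocks; only wall-clock time changes.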

</details>

<details>
<summary>Patch 2: SWA Speculation Fix (PR #18720)</summary>

**Status**: Merged upstream
**Speedup**: Enables speculative decoding for SWA models (previously crashed)

Fixed a `std::bad_alloc` crash when using speculative decoding with sliding window attention (SWA) models. The crash occurred in `llama_kv_cache::slot_info` during KV cache initialization because SWA requires consecutive context positions, which is incompatible with speculation's non-sequential token prediction.

**Root Cause**: The draft model predicted tokens at position `N+K`, but the target model with SWA expected consecutive positions `[N, N+1, ..., N+K]`. KV cache allocation failed when trying to allocate non-contiguous slots.

**Fix**: Added an SWA compatibility check in speculative decoder initialization. If the target model uses SWA, disable speculation and fall back to standard generation.

**Models Affected**: Gemma-3 series (SWA with window size 8192).
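
The shape of the fix can be sketched as a guard at decoder setup (hypothetical helper names for illustration, not the actual llama.cpp API):

```python
def positions_are_consecutive(positions):
    # The invariant SWA depends on: KV cache slots must cover context
    # positions with no gaps.
    return all(b == a + 1 for a, b in zip(positions, positions[1:]))

def can_use_speculation(target_uses_swa):
    # The guard added by the fix: if the target model uses sliding window
    # attention, disable speculation and fall back to standard decoding.
    return not target_uses_swa
```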

</details>

<details>
<summary>Patch 3: Prompt Lookup for llama-server</summary>

**Status**: Local patch (not yet submitted)
**Speedup**: 8.6-12.7x on document QA tasks

Ported the prompt lookup optimization from `llama-cli` to `llama-server` to enable document summarization acceleration in the orchestrator stack. Prompt lookup detects repeated n-grams between the prompt and the generation, drafting the matched continuation from the prompt for batch verification instead of generating token-by-token.

**Best Results**:
- Summarization: 95.18 t/s (12.7x)
- Code editing: 25.82 t/s (8.6x)
- Requires source material in context
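
The core matching step can be sketched in a few lines of Python (illustrative; the lists stand in for token IDs, and the parameter names only loosely mirror `lookup_ngram_min`):

```python
def lookup_draft(prompt, generated, ngram_min=3, max_draft=16):
    # Match the last `ngram_min` generated tokens against the prompt;
    # on a hit, draft the tokens that followed the match in the prompt.
    if len(generated) < ngram_min:
        return []
    key = generated[-ngram_min:]
    for i in range(len(prompt) - ngram_min, -1, -1):
        if prompt[i:i + ngram_min] == key:
            return prompt[i + ngram_min:i + ngram_min + max_draft]
    return []
```

On a hit, the drafted tokens are verified by the target model in one batch, which is where the speedup comes from; without source material in context, matches are rare and the path degenerates to normal decoding.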

<details>
<summary>Code: prompt lookup via API</summary>

```bash
# Example: Summarization task achieves 95.18 t/s (12.7x baseline)
curl -X POST http://localhost:8081/completion \
  -d '{"prompt": "[source document]\n\nSummarize:", "lookup_ngram_min": 3}'
```

</details>
</details>

## Build System

Building llama.cpp for this hardware means enabling AVX-512 and its extensions (VBMI, VNNI) to take full advantage of Zen 5's true 512-bit execution units. The same cmake flags apply to both production and experimental directories.
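
Before enabling these flags on a new host, it is worth confirming the CPU actually advertises the extensions. A minimal sketch (assumed helper, not part of the repo) that parses `/proc/cpuinfo`-style text; on Linux, pass it `open('/proc/cpuinfo').read()`:

```python
def host_avx512_features(cpuinfo_text):
    # Report which of the AVX-512 extensions used by this build are
    # advertised in the `flags` line (kernel flag spellings).
    wanted = {"avx512f", "avx512_vbmi", "avx512_vnni"}
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return sorted(wanted & flags)
    return []
```

If `avx512f` is missing, forcing `-DGGML_AVX512=ON` typically yields a binary that dies with SIGILL on that host.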

<details>
<summary>Build configuration and verification</summary>

<details>
<summary>Code: production build</summary>

```bash
cd /mnt/raid0/llm/llama.cpp

# Configure with AVX-512 and native CPU optimizations
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_AVX512=ON \
  -DGGML_AVX512_VBMI=ON \
  -DGGML_AVX512_VNNI=ON \
  -DCMAKE_BUILD_TYPE=Release

# Build with all cores
cmake --build build -j 96

# Install binaries (optional)
cmake --install build --prefix /mnt/raid0/llm/llama.cpp/install
```

</details>

<details>
<summary>Code: experimental build</summary>

```bash
cd /mnt/raid0/llm/llama.cpp-experimental

# Same configuration
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_AVX512=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j 96

# Test experimental binary
./build/bin/llama-cli --version
./build/bin/llama-cli -m /mnt/raid0/llm/models/test.gguf -p "Hello"
```

</details>

<details>
<summary>Code: AVX-512 verification</summary>

```bash
# Check AVX-512 support in binary
./build/bin/llama-cli --version | grep AVX512

# Expected output:
# AVX512 = 1
# AVX512_VBMI = 1
# AVX512_VNNI = 1
```

</details>

**Note**: EPYC 9655 "Turin" (Zen 5) has true 512-bit AVX-512 execution units, unlike Zen 4's double-pumped 256-bit implementation. AVX-512 VNNI provides 2x INT8 throughput over AVX2.

</details>

## Binary Usage Patterns

Three binaries handle different inference modes: `llama-cli` for interactive and batch work, `llama-speculative` for draft-model acceleration, and `llama-server` for production API serving. Each has its own flags and typical launch patterns.

<details>
<summary>Binary commands and examples</summary>

<details>
<summary>Code: llama-cli (Interactive/Batch)</summary>

```bash
# Standard completion
OMP_NUM_THREADS=1 numactl --interleave=all \
  /mnt/raid0/llm/llama.cpp/build/bin/llama-cli \
  -m /mnt/raid0/llm/models/Qwen2.5-Coder-32B-Q4_K_M.gguf \
  -p "Write a function to compute factorial" \
  -n 512 -t 96 --temp 0

# Prompt lookup (for document QA)
/mnt/raid0/llm/llama.cpp/build/bin/llama-cli \
  -m model.gguf \
  -f prompt_with_source.txt \
  --lookup-ngram-min 3 \
  -t 96
```

</details>

<details>
<summary>Code: llama-speculative (Draft Model)</summary>

```bash
# External draft model (11x speedup on code)
OMP_NUM_THREADS=1 numactl --interleave=all \
  /mnt/raid0/llm/llama.cpp/build/bin/llama-speculative \
  -m /mnt/raid0/llm/models/Qwen2.5-Coder-32B-Q4_K_M.gguf \
  -md /mnt/raid0/llm/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
  --draft-max 24 -t 96 -p "prompt"
```

</details>

<details>
<summary>Code: llama-server (Production Orchestrator)</summary>

```bash
# HOT tier server (port 8080: frontdoor)
/mnt/raid0/llm/llama.cpp/build/bin/llama-server \
  -m /mnt/raid0/llm/models/Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --threads 96 --parallel 4 \
  --override-kv qwen3moe.expert_used_count=int:6 \
  --ctx-size 32768 --no-mmap

# Worker server (port 8082: spec + lookup)
/mnt/raid0/llm/llama.cpp/build/bin/llama-server \
  -m /mnt/raid0/llm/models/Qwen2.5-7B-Instruct-f16.gguf \
  -md /mnt/raid0/llm/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
  --draft-max 24 --lookup-ngram-min 3 \
  --host 0.0.0.0 --port 8082 --threads 96 --parallel 4
```

</details>
</details>

## Known Limitations

Two model families have hard incompatibilities that will silently produce garbage or crash if you ignore them. These aren't bugs to be fixed — they're architectural constraints of the models themselves.

<details>
<summary>SSM models and BOS token mismatch</summary>

### SSM Models (Qwen3-Next)

**NEVER** use speculative decoding or prompt lookup with SSM architecture models. SSM requires consecutive context positions for state propagation — speculation breaks this invariant.
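
Why consecutive positions matter falls out of the recurrence itself. A toy scalar state-space update (purely illustrative; real SSM kernels are vector-valued):

```python
def ssm_scan(state, tokens):
    # The state at step t depends on every earlier step, so token t+K
    # cannot be processed until tokens t..t+K-1 have updated the state.
    # This is the invariant that out-of-order drafts violate.
    states = []
    for x in tokens:
        state = 0.5 * state + x  # order-sensitive recurrent update
        states.append(state)
    return states
```

Feeding the same tokens in a different order yields a different state trajectory, which is why drafted tokens can't be verified independently the way they can with full attention.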

<details>
<summary>Code: correct vs incorrect SSM usage</summary>

```bash
# ❌ WRONG - will produce garbage
llama-speculative -m Qwen3-Next-80B-A3B-Q4_K_M.gguf -md draft.gguf

# ✅ CORRECT - expert reduction only
llama-cli -m Qwen3-Next-80B-A3B-Q4_K_M.gguf \
  --override-kv qwen3next.expert_used_count=int:2
```

</details>

### Qwen3-Coder-480B BOS Token

The 480B model has a BOS token mismatch (`BOS=','`) that breaks all speculation:

<details>
<summary>Code: correct 480B usage</summary>

```bash
# ❌ Speculation will fail
llama-speculative -m Qwen3-Coder-480B-A35B-Q4_K_M.gguf -md draft.gguf

# ✅ Use expert reduction only
llama-cli -m Qwen3-Coder-480B-A35B-Q4_K_M.gguf \
  --override-kv qwen3moe.expert_used_count=int:3
```

</details>

**Result**: 10.3 t/s with MoE3 (vs 3.0 t/s baseline), but no speculation compatibility.

</details>

## Troubleshooting

When things go wrong, it's usually one of two things: wrong branch or wrong directory. These quick checks cover the most common issues.

<details>
<summary>Common issues and fixes</summary>

### "Build is using wrong version"

<details>
<summary>Code: version verification and recovery</summary>

```bash
pwd # Should be /mnt/raid0/llm/llama.cpp for production
git branch --show-current # Should be production-consolidated
./build/bin/llama-cli --version # Verify commit hash
```

If on wrong branch:
```bash
cd /mnt/raid0/llm/llama.cpp
git checkout production-consolidated
cmake --build build -j 96 # Rebuild
```

</details>

### "I accidentally worked on production-consolidated"

<details>
<summary>Recovery steps</summary>

1. Stash or commit changes: `git stash` or `git commit -am "WIP"`
2. Create feature branch: `git checkout -b feature/my-work`
3. Switch production back: `cd /mnt/raid0/llm/llama.cpp && git checkout production-consolidated`
4. Move work to experimental: `cd /mnt/raid0/llm/llama.cpp-experimental && git cherry-pick <hash>`

</details>
</details>

<details>
<summary>References</summary>

- Fork: https://github.com/pestopoppa/llama.cpp
- Upstream: https://github.com/ggml-org/llama.cpp
- PR #18239: Parallel tensor repack (merged)
- PR #18720: SWA speculation fix (merged)
- `docs/reference/LLAMA_CPP_WORKTREES.md` - Detailed worktree workflow

</details>

---