
Commit 7c30bc3

pestopoppa and claude committed
docs: add EPYC toolchain chapter from research repo redistribution
Toolchain, worktrees, and production branch documentation (formerly Ch03) now lives in docs/epyc/ alongside the fork it documents. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6e49ca1 commit 7c30bc3

2 files changed: 394 additions & 0 deletions

File tree

- docs/epyc/01-toolchain.md
- docs/epyc/INDEX.md

docs/epyc/01-toolchain.md (379 additions)

# Chapter 01: llama.cpp Toolchain & Patches

## Introduction

This project uses a **fork of llama.cpp** at `github.com/pestopoppa/llama.cpp` with local optimizations for the AMD EPYC 9655 "Turin" architecture. The fork includes parallel tensor repack (a 2.2x model-loading speedup), sliding window attention (SWA) fixes for speculative decoding, and prompt lookup ported to llama-server.

The toolchain uses **git worktrees** to isolate production and experimental work, preventing branch conflicts when multiple agents share access. Production inference MUST use the `production-consolidated` branch; feature work happens in separate worktrees.

## Git Worktree Architecture

The codebase is split into two physical directories sharing a single git history. Production lives at `/mnt/raid0/llm/llama.cpp` and must always stay on the `production-consolidated` branch — all benchmarks and orchestration use this build. Experimental work happens in `/mnt/raid0/llm/llama.cpp-experimental`, where you can switch branches freely without affecting production.
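
For reference, the experimental tree is a standard git worktree, created once from the production checkout (a sketch; the branch name is illustrative):

```bash
# One-time setup: add a second working tree that shares this repo's
# history, starting a new feature branch from production-consolidated
cd /mnt/raid0/llm/llama.cpp
git worktree add ../llama.cpp-experimental \
    -b feature/my-new-feature production-consolidated
```
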

<details>
<summary>Directory layout and worktree rules</summary>

| Directory | Branch | Purpose |
|-----------|--------|---------|
| `/mnt/raid0/llm/llama.cpp` | `production-consolidated` | **Production** - benchmarks, stable inference |
| `/mnt/raid0/llm/llama.cpp-experimental` | `feature/*` branches | **Experimental** - new features, research |

**Production directory** (`/mnt/raid0/llm/llama.cpp`):
- **NEVER** checkout a different branch
- **NEVER** commit experimental work
- Stay on `production-consolidated` at all times
- All benchmarks and orchestration use this build

**Experimental directory** (`/mnt/raid0/llm/llama.cpp-experimental`):
- Switch branches freely
- Test new features without affecting production
- Binaries build to `./build/bin/llama-*`
- Changes here never touch the production build

</details>

<details>
<summary>Common operations and branch verification</summary>

<details>
<summary>Code: worktree management commands</summary>

```bash
# Check current worktrees
cd /mnt/raid0/llm/llama.cpp
git worktree list

# Expected output:
# /mnt/raid0/llm/llama.cpp              6b43356a1 [production-consolidated]
# /mnt/raid0/llm/llama.cpp-experimental xxxxxxxx  [feature/paged-attention]

# Start experimental work
cd /mnt/raid0/llm/llama.cpp-experimental
git checkout production-consolidated
git checkout -b feature/my-new-feature

# Build experimental version
cmake -B build -DGGML_NATIVE=ON -DGGML_AVX512=ON
cmake --build build -j 96

# Verify binary version
./build/bin/llama-cli --version
```

</details>

<details>
<summary>Code: branch safety verification</summary>

```bash
# Manual verification
cd /mnt/raid0/llm/llama.cpp
git branch --show-current
# Output: production-consolidated

# If wrong branch, fix with:
git checkout production-consolidated
```

</details>

**Critical**: Never run benchmarks or live inference on a feature branch. Results won't be reproducible.
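
One way to enforce this mechanically is a small wrapper that refuses to run anything while the production tree is off-branch. A minimal sketch (hypothetical script, not part of the fork):

```bash
#!/usr/bin/env bash
# guard-production.sh (hypothetical): run a command only if the
# production tree is on production-consolidated, e.g.
#   ./guard-production.sh ./build/bin/llama-bench -m model.gguf
set -euo pipefail
branch=$(git -C /mnt/raid0/llm/llama.cpp branch --show-current)
if [[ "$branch" != "production-consolidated" ]]; then
    echo "refusing to run: production tree is on '$branch'" >&2
    exit 1
fi
exec "$@"
```
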
</details>

## Production Patches

The fork includes three major optimizations, two of them already merged upstream. These patches are the project's direct contributions to the llama.cpp ecosystem — each one solves a real bottleneck we hit during inference optimization.

<details>
<summary>Patch 1: Parallel Tensor Repack (PR #18239)</summary>

**Status**: Merged upstream
**Speedup**: 2.2x faster model loading on 96-core systems

Parallelizes the tensor repack operation that converts GGUF tensors to runtime format. Before this patch, model loading was single-threaded and took 45-60 seconds for 235B Q4_K_M models. With parallel repack, loading drops to 20-25 seconds.

**Implementation**: Splits the tensor repack across available threads using an OpenMP parallel loop with dynamic scheduling. Each thread processes independent tensor blocks, writing to pre-allocated output buffers.

| Method | Load Time | Speedup |
|--------|-----------|---------|
| Original (single-threaded) | 54.2s | 1.0x |
| Parallel repack | 24.8s | **2.2x** |
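
To sanity-check load time on your own build, time a minimal run; `llama-cli` also prints its own load time in the timing summary at exit. The model path below is illustrative:

```bash
# Generate a single token so model load dominates wall-clock time
time /mnt/raid0/llm/llama.cpp/build/bin/llama-cli \
    -m /mnt/raid0/llm/models/Qwen3-235B-A22B-Q4_K_M.gguf \
    -n 1 -p "hi" -t 96
```
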
</details>

<details>
<summary>Patch 2: SWA Speculation Fix (PR #18720)</summary>

**Status**: Merged upstream
**Speedup**: Enables spec decode for SWA models (was crashing)

Fixed a `std::bad_alloc` crash when using speculative decoding with sliding window attention (SWA) models. The crash occurred in `llama_kv_cache::slot_info` during KV cache initialization because SWA requires consecutive context positions, which is incompatible with speculation's non-sequential token prediction.

**Root Cause**: The draft model predicted tokens at position `N+K`, but the target model with SWA expected consecutive positions `[N, N+1, ..., N+K]`. KV cache allocation failed when trying to allocate non-contiguous slots.

**Fix**: Added an SWA compatibility check in speculative decoder initialization. If the target model uses SWA, speculation is disabled and generation falls back to the standard path.

**Models Affected**: Gemma-3 series (SWA with window size 8192).
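
As a concrete illustration, the reproduction case (model paths and the Gemma-3 draft/target pairing are assumptions for the example):

```bash
# Before PR #18720: crashed with std::bad_alloc during KV-cache init.
# After the fix: the SWA check disables speculation and generation
# falls back to the standard path.
/mnt/raid0/llm/llama.cpp/build/bin/llama-speculative \
    -m /mnt/raid0/llm/models/gemma-3-27b-it-Q4_K_M.gguf \
    -md /mnt/raid0/llm/models/gemma-3-1b-it-Q8_0.gguf \
    -p "test" -t 96
```
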
</details>

<details>
<summary>Patch 3: Prompt Lookup for llama-server</summary>

**Status**: Local patch (not yet submitted)
**Speedup**: 8.6-12.7x on document QA tasks

Ported the prompt lookup optimization from `llama-cli` to `llama-server` to enable document summarization acceleration in the orchestrator stack. Prompt lookup detects repeated n-grams between the prompt and the generation, copying them directly instead of predicting token by token.

**Best Results**:
- Summarization: 95.18 t/s (12.7x)
- Code editing: 25.82 t/s (8.6x)
- Requires source material in context

<details>
<summary>Code: prompt lookup via API</summary>

```bash
# Example: Summarization task achieves 95.18 t/s (12.7x baseline)
curl -X POST http://localhost:8081/completion \
    -d '{"prompt": "[source document]\n\nSummarize:", "lookup_ngram_min": 3}'
```

</details>
</details>

## Build System

Building llama.cpp for this hardware means enabling AVX-512 and its extensions (VBMI, VNNI) to take full advantage of Zen 5's true 512-bit execution units. The same cmake flags apply to both the production and experimental directories.
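
Before configuring, it's worth confirming the CPU actually reports these extensions (a standard Linux check):

```bash
# List the AVX-512 feature flags the kernel reports, deduplicated
# across cores; expect avx512f, avx512vbmi, avx512_vnni, ...
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```
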

<details>
<summary>Build configuration and verification</summary>

<details>
<summary>Code: production build</summary>

```bash
cd /mnt/raid0/llm/llama.cpp

# Configure with AVX-512 and native CPU optimizations
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_AVX512=ON \
    -DGGML_AVX512_VBMI=ON \
    -DGGML_AVX512_VNNI=ON \
    -DCMAKE_BUILD_TYPE=Release

# Build with all cores
cmake --build build -j 96

# Install binaries (optional)
cmake --install build --prefix /mnt/raid0/llm/llama.cpp/install
```

</details>

<details>
<summary>Code: experimental build</summary>

```bash
cd /mnt/raid0/llm/llama.cpp-experimental

# Same configuration
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_AVX512=ON \
    -DCMAKE_BUILD_TYPE=Release

cmake --build build -j 96

# Test experimental binary
./build/bin/llama-cli --version
./build/bin/llama-cli -m /mnt/raid0/llm/models/test.gguf -p "Hello"
```

</details>

<details>
<summary>Code: AVX-512 verification</summary>

```bash
# Check AVX-512 support in binary
./build/bin/llama-cli --version | grep AVX512

# Expected output:
# AVX512 = 1
# AVX512_VBMI = 1
# AVX512_VNNI = 1
```

</details>

**Note**: EPYC 9655 "Turin" has true 512-bit AVX-512 execution units (not double-pumped through 256-bit units as on Zen 4; Intel's Alder/Raptor Lake consumer parts ship with AVX-512 fused off entirely). AVX-512 VNNI provides 2x INT8 throughput over AVX2.
</details>

## Binary Usage Patterns

Three binaries handle different inference modes: `llama-cli` for interactive and batch work, `llama-speculative` for draft-model acceleration, and `llama-server` for production API serving. Each has its own flags and typical launch patterns.

<details>
<summary>Binary commands and examples</summary>

<details>
<summary>Code: llama-cli (Interactive/Batch)</summary>

```bash
# Standard completion
OMP_NUM_THREADS=1 numactl --interleave=all \
    /mnt/raid0/llm/llama.cpp/build/bin/llama-cli \
    -m /mnt/raid0/llm/models/Qwen2.5-Coder-32B-Q4_K_M.gguf \
    -p "Write a function to compute factorial" \
    -n 512 -t 96 --temp 0

# Prompt lookup (for document QA)
/mnt/raid0/llm/llama.cpp/build/bin/llama-cli \
    -m model.gguf \
    -f prompt_with_source.txt \
    --lookup-ngram-min 3 \
    -t 96
```

</details>

<details>
<summary>Code: llama-speculative (Draft Model)</summary>

```bash
# External draft model (11x speedup on code)
OMP_NUM_THREADS=1 numactl --interleave=all \
    /mnt/raid0/llm/llama.cpp/build/bin/llama-speculative \
    -m /mnt/raid0/llm/models/Qwen2.5-Coder-32B-Q4_K_M.gguf \
    -md /mnt/raid0/llm/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
    --draft-max 24 -t 96 -p "prompt"
```

</details>

<details>
<summary>Code: llama-server (Production Orchestrator)</summary>

```bash
# HOT tier server (port 8080: frontdoor)
/mnt/raid0/llm/llama.cpp/build/bin/llama-server \
    -m /mnt/raid0/llm/models/Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
    --host 0.0.0.0 --port 8080 \
    --threads 96 --parallel 4 \
    --override-kv qwen3moe.expert_used_count=int:6 \
    --ctx-size 32768 --no-mmap

# Worker server (port 8082: spec + lookup)
/mnt/raid0/llm/llama.cpp/build/bin/llama-server \
    -m /mnt/raid0/llm/models/Qwen2.5-7B-Instruct-f16.gguf \
    -md /mnt/raid0/llm/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
    --draft-max 24 --lookup-ngram-min 3 \
    --host 0.0.0.0 --port 8082 --threads 96 --parallel 4
```
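
Once a server is up, a quick liveness check against the standard llama-server health endpoint (ports as configured above):

```bash
# Returns {"status":"ok"} once the model has finished loading
curl -s http://localhost:8080/health
curl -s http://localhost:8082/health
```
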

</details>
</details>

## Known Limitations

Two model families have hard incompatibilities that will silently produce garbage or crash if you ignore them. These aren't bugs to be fixed — they're architectural constraints of the models themselves.

<details>
<summary>SSM models and BOS token mismatch</summary>

### SSM Models (Qwen3-Next)

**NEVER** use speculative decoding or prompt lookup with SSM-architecture models. SSM requires consecutive context positions for state propagation — speculation breaks this invariant.

<details>
<summary>Code: correct vs incorrect SSM usage</summary>

```bash
# ❌ WRONG - will produce garbage
llama-speculative -m Qwen3-Next-80B-A3B-Q4_K_M.gguf -md draft.gguf

# ✅ CORRECT - expert reduction only
llama-cli -m Qwen3-Next-80B-A3B-Q4_K_M.gguf \
    --override-kv qwen3next.expert_used_count=int:2
```

</details>

### Qwen3-Coder-480B BOS Token

The 480B model has a BOS token mismatch (`BOS=','`) that breaks all speculation:

<details>
<summary>Code: correct 480B usage</summary>

```bash
# ❌ Speculation will fail
llama-speculative -m Qwen3-Coder-480B-A35B-Q4_K_M.gguf -md draft.gguf

# ✅ Use expert reduction only
llama-cli -m Qwen3-Coder-480B-A35B-Q4_K_M.gguf \
    --override-kv qwen3moe.expert_used_count=int:3
```

</details>

**Result**: 10.3 t/s with MoE3 (vs 3.0 t/s baseline), but no speculation compatibility.
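
To check a model's BOS token before attempting speculation, the `gguf-dump` utility from llama.cpp's gguf-py package can read the metadata (the install step and model path are assumptions):

```bash
# Print tokenizer metadata; look for tokenizer.ggml.bos_token_id.
# gguf-dump comes from `pip install ./gguf-py` in the llama.cpp repo.
gguf-dump /mnt/raid0/llm/models/Qwen3-Coder-480B-A35B-Q4_K_M.gguf | grep -i bos
```
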
</details>

## Troubleshooting

When things go wrong, it's usually one of two things: wrong branch or wrong directory. These quick checks cover the most common issues.

<details>
<summary>Common issues and fixes</summary>

### "Build is using wrong version"

<details>
<summary>Code: version verification and recovery</summary>

```bash
pwd                              # Should be /mnt/raid0/llm/llama.cpp for production
git branch --show-current        # Should be production-consolidated
./build/bin/llama-cli --version  # Verify commit hash
```

If on wrong branch:
```bash
cd /mnt/raid0/llm/llama.cpp
git checkout production-consolidated
cmake --build build -j 96  # Rebuild
```

</details>

### "I accidentally worked on production-consolidated"

<details>
<summary>Code: recovery steps</summary>

1. Stash or commit the changes: `git stash` or `git commit -am "WIP"`
2. Create a feature branch at that point: `git checkout -b feature/my-work` (if you stashed, `git stash pop` and commit here)
3. Switch production back: `git checkout production-consolidated`; if you committed in step 1, also drop the stray commit from production with `git reset --hard HEAD~1`
4. Move the work to experimental: `cd /mnt/raid0/llm/llama.cpp-experimental && git cherry-pick <hash>`

</details>
</details>

<details>
<summary>References</summary>

- Fork: https://github.com/pestopoppa/llama.cpp
- Upstream: https://github.com/ggml-org/llama.cpp
- PR #18239: Parallel tensor repack (merged)
- PR #18720: SWA speculation fix (merged)
- `docs/reference/LLAMA_CPP_WORKTREES.md` - Detailed worktree workflow

</details>

---

docs/epyc/INDEX.md (15 additions)

# EPYC llama.cpp Documentation

Documentation specific to the epyc-llama fork — toolchain management, worktrees, and production branch safety.

## Chapters

| # | Title | Key Topics |
|---|-------|------------|
| [01](01-toolchain.md) | Toolchain, Worktrees & Production Branch | Build system, upstream rebases, branch safety, worktree workflow |

## Cross-Repository Documentation

- **Orchestration architecture** (routing, memory, server stack): epyc-orchestrator `docs/chapters/`
- **Inference optimization** (speculative decoding, MoE, radix attention): epyc-inference-research `docs/chapters/`
- **Hardware and storage** (EPYC 9655 platform, RAID0 safety): epyc-root `docs/infrastructure/`
