Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork.
Base: QVAC-Fabric temp-upstream branch
Target: All Gemma 4 variants (E2B, E4B, etc. — architecture is size-agnostic)
QVAC-Fabric provides memory-based model loading, on-device LoRA hot-swap, BitNet TQ2_0 quantization, and Adreno GPU optimization. It supported gemma3 and gemma3n but had no gemma4 support. This patch bridges that gap.
Tested on NVIDIA RTX 4090 with Gemma 4 E2B Q4_K_M (3.2 GB):
| Metric | Value |
|---|---|
| Prompt processing | 132.5 tokens/sec |
| Generation | 116.5 tokens/sec |
| Model memory | 1,416 MiB |
| Context memory | 780 MiB |
Gemma 4 uses different attention head dimensions for SWA (Sliding Window Attention) and non-SWA layers:
- SWA layers: `n_embd_head = 256`, `n_rot = 256`
- Non-SWA layers: `n_embd_head = 512`, `n_rot = 512`
QVAC-Fabric assumed a single global n_embd_head_k value. This patch modifies:
- `n_embd_k_gqa()` and `n_embd_v_gqa()` to return per-layer values
- KV cache `get_k()` and `get_v()` to use per-layer head dims
- `build_graph_shift()` for per-layer RoPE dimensions
- Tensor loading to compute correct shapes per layer
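
The per-layer lookup described above can be modelled in a few lines. This is a minimal Python sketch of the idea, not the C++ code from the patch: the `HParams` class, the `swa_layers` set, and the `n_head_kv` values are hypothetical and chosen only for illustration. The head dimensions (256 for SWA, 512 for non-SWA) and the `n_embd_head_k_swa` field come from this README.

```python
from dataclasses import dataclass


@dataclass
class HParams:
    """Toy stand-in for the hparams struct (illustrative only)."""
    n_embd_head_k: int               # head dim for non-SWA layers
    n_embd_head_v: int
    n_head_kv: int                   # assumed identical for every layer
    n_embd_head_k_swa: int = 0       # 0 means "no separate SWA head dim"
    n_embd_head_v_swa: int = 0
    swa_layers: frozenset = frozenset()  # hypothetical set of SWA layer indices

    def is_swa(self, il: int) -> bool:
        return il in self.swa_layers

    def n_embd_k_gqa(self, il: int) -> int:
        # Use the SWA head dim only when one is configured and the layer is SWA.
        head = self.n_embd_head_k
        if self.n_embd_head_k_swa > 0 and self.is_swa(il):
            head = self.n_embd_head_k_swa
        return head * self.n_head_kv

    def n_embd_v_gqa(self, il: int) -> int:
        head = self.n_embd_head_v
        if self.n_embd_head_v_swa > 0 and self.is_swa(il):
            head = self.n_embd_head_v_swa
        return head * self.n_head_kv


# Gemma-4-like settings: layers 0-3 are SWA, layer 4 is global (hypothetical layout).
gemma4_like = HParams(512, 512, n_head_kv=2,
                      n_embd_head_k_swa=256, n_embd_head_v_swa=256,
                      swa_layers=frozenset({0, 1, 2, 3}))

# An older architecture that never sets the *_swa fields.
legacy = HParams(128, 128, n_head_kv=8, swa_layers=frozenset({0}))

print(gemma4_like.n_embd_k_gqa(0), gemma4_like.n_embd_k_gqa(4))
print(legacy.n_embd_k_gqa(0), legacy.n_embd_k_gqa(1))
```

With these assumed values, the SWA layers get a K row width of 256 × 2 and the global layer gets 512 × 2, while the legacy configuration returns the same width for every layer because its SWA head dim stays at 0.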
QVAC's build_moe_ffn() has a different signature from upstream llama.cpp:
- QVAC: `(cur, gate_inp, up_exps, gate_exps, down_exps, exp_probs_b, n_expert, n_expert_used, type_op, norm_w, scale_w, w_scale, gating_op, il, probs_in, gate_up_exps)`
- Upstream: uses trailing `*_exps_s` parameters instead of `scale_w` + `w_scale`
Several hparams that are functions in upstream llama.cpp are plain member variables in QVAC:
- `n_embd_head_k` / `n_embd_head_v`: member vars (upstream: functions with `il` param)
- `n_rot`: member var (upstream: function with `il` param)
The per-layer token embedding tensor is also named differently:
- QVAC: `tok_embd_per_layer` (model struct)
- Upstream: `per_layer_tok_embd`
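
When porting upstream code by hand, a whole-word rename is one way to handle the tensor naming difference. The sketch below is only an illustration: `RENAMES`, `port_identifiers`, and the sample source line are hypothetical and are not taken from the patch scripts in this repository. It deliberately leaves the hparams accessors alone, since those differ in kind (function versus member variable) and need per-layer handling rather than a textual rename.

```python
import re

# Hypothetical rename table: upstream identifier -> QVAC identifier.
RENAMES = {"per_layer_tok_embd": "tok_embd_per_layer"}


def port_identifiers(source: str, renames=RENAMES) -> str:
    """Rewrite whole-word occurrences of upstream names to their QVAC names."""
    for upstream, qvac in renames.items():
        # \b keeps longer identifiers that merely contain the name untouched.
        source = re.sub(rf"\b{re.escape(upstream)}\b", qvac, source)
    return source


# Illustrative line only; not copied from either code base.
upstream_line = "cur = ggml_get_rows(ctx0, model.per_layer_tok_embd, tokens);"
print(port_identifiers(upstream_line))
```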
11 files, +256 lines:
| File | Changes |
|---|---|
| `src/llama-arch.h` | LLM_ARCH_GEMMA4, tensor enums, KV enums |
| `src/llama-arch.cpp` | Name mapping, tensor strings, tensor-to-arch mapping, tensor info entries |
| `src/llama-model.h` | Struct members: ffn_gate_inp_s, ffn_post_norm_1/2, ffn_pre_norm_2, out_scale |
| `src/llama-model.cpp` | Hparams parsing, tensor loading (~95 lines), build dispatch, KV reuse |
| `src/llama-hparams.h` | n_embd_per_layer, n_embd_head_k_swa, n_embd_head_v_swa |
| `src/llama-hparams.cpp` | Per-layer-aware n_embd_k_gqa() and n_embd_v_gqa() |
| `src/llama-kv-cache.cpp` | Per-layer head dims in get_k(), get_v(), build_graph_shift() |
| `src/llama-vocab.h` | LLAMA_VOCAB_PRE_TYPE_GEMMA4 = 50 |
| `src/llama-vocab.cpp` | Gemma4 tokenizer model, pre-type regex, BPE merge loading |
| `src/models/models.h` | llm_build_gemma4_iswa struct declaration |
| `src/models/gemma4-iswa.cpp` | Full graph builder (~299 lines) |
| `src/CMakeLists.txt` | Added models/gemma4-iswa.cpp |
cd /path/to/qvac-fabric-llm.cpp
git checkout temp-upstream
git apply gemma4-qvac-full.patch
cp models/gemma4-iswa.cpp src/models/gemma4-iswa.cpp
mkdir -p build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_SKIP_INSTALL_RULES=ON
make -j$(nproc)
# Alternative to git apply: use the patch scripts
python scripts/patch_structs.py
python scripts/patch_models_h.py
python scripts/patch_fix_all.py
cp models/gemma4-iswa.cpp /path/to/qvac/src/models/
# Rebuild
./build/bin/llama-server -m /path/to/gemma4.gguf -ngl 99 --port 8090
curl -s http://localhost:8090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
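
The same smoke test can be scripted from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat` are hypothetical, the port matches the `--port 8090` used above, and the response parsing assumes the OpenAI-style `choices[0].message.content` shape, which should be checked against the server actually built.

```python
import json
import urllib.request


def build_chat_request(prompt, max_tokens=64, base_url="http://localhost:8090"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt, **kwargs):
    """Send the request and return the reply text (assumes an OpenAI-style body)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat("Hello!"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.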
- Tokenizer newline display: `[UNK_BYTE_0x0a]` appears in CLI output for newlines. This is cosmetic only; the server API works correctly. Root cause: QVAC lacks `byte_encode=false` for SPM-style BPE.
- Per-layer head dim safety: The `n_embd_k_gqa()` / `n_embd_v_gqa()` changes are global but safe: `n_embd_head_k_swa` defaults to 0 for non-Gemma4 architectures, so the SWA branch is never taken.
├── README.md
├── gemma4-qvac-full.patch # Unified git diff (git apply ready)
├── models/
│ ├── gemma4-iswa.cpp # Graph builder (new file)
│ └── models.h # Full header with gemma4 struct
├── patches/ # Full patched source files (reference)
│ ├── llama-arch.h / .cpp
│ ├── llama-model.h / .cpp
│ ├── llama-hparams.h / .cpp
│ ├── llama-kv-cache.cpp
│ ├── llama-vocab.h / .cpp
│ └── CMakeLists.txt
└── scripts/ # Automated patch scripts
├── patch_structs.py
├── patch_models_h.py
├── patch_fix_all.py
├── patch_vocab.py
├── patch_kvs.py
├── patch_tensors.py
└── patch_missing.py
Patches provided under MIT (same as QVAC-Fabric). Gemma 4 model weights are subject to Google's Gemma Terms of Use.