tinmanlabsl/qvac-gemma4-patch

QVAC-Fabric Gemma 4 Architecture Patch

Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork.

Base: QVAC-Fabric temp-upstream branch
Target: All Gemma 4 variants (E2B, E4B, etc. — architecture is size-agnostic)

Why

QVAC-Fabric provides memory-based model loading, on-device LoRA hot-swap, BitNet TQ2_0 quantization, and Adreno GPU optimization. It supported gemma3 and gemma3n but had no gemma4 support. This patch bridges that gap.

Performance

Tested on an NVIDIA RTX 4090 with Gemma 4 E2B Q4_K_M (3.2 GB):

| Metric            | Value             |
|-------------------|-------------------|
| Prompt processing | 132.5 tokens/sec  |
| Generation        | 116.5 tokens/sec  |
| Model memory      | 1,416 MiB         |
| Context memory    | 780 MiB           |

Key Technical Challenges

1. Per-Layer Head Dimensions

Gemma 4 uses different attention head dimensions for SWA (Sliding Window Attention) and non-SWA layers:

  • SWA layers: n_embd_head = 256, n_rot = 256
  • Non-SWA layers: n_embd_head = 512, n_rot = 512

QVAC-Fabric assumed a single global n_embd_head_k value. This patch modifies:

  • n_embd_k_gqa() and n_embd_v_gqa() to return per-layer values
  • KV cache get_k() and get_v() to use per-layer head dims
  • build_graph_shift() for per-layer RoPE dimensions
  • Tensor loading to compute correct shapes per layer
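
The per-layer lookup described above can be sketched as follows. This is a minimal illustration, not the patch's actual code: the struct `hparams_sketch`, its field names, and the `n_head_kv` value are hypothetical stand-ins, and only the 256/512 head dimensions come from this README. It shows why a single global head dimension is not enough: the width of a K or V row now depends on whether the layer is SWA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the relevant hparams fields; not QVAC's real struct.
struct hparams_sketch {
    uint32_t n_embd_head_swa  = 256; // SWA layers (value from the list above)
    uint32_t n_embd_head_full = 512; // non-SWA layers (value from the list above)
    uint32_t n_head_kv        = 2;   // illustrative value, not taken from the model
    std::vector<bool> is_swa;        // one flag per layer
};

// Head dimension for layer il, chosen by the layer's attention type.
inline uint32_t head_dim(const hparams_sketch & hp, size_t il) {
    assert(il < hp.is_swa.size());
    return hp.is_swa[il] ? hp.n_embd_head_swa : hp.n_embd_head_full;
}

// Elements in one K (or V) row for layer il: head dimension times KV heads.
inline uint32_t n_embd_kv_row(const hparams_sketch & hp, size_t il) {
    return head_dim(hp, il) * hp.n_head_kv;
}

// K-row elements per token, summed over all layers.
inline uint64_t kv_elems_per_token(const hparams_sketch & hp) {
    uint64_t total = 0;
    for (size_t il = 0; il < hp.is_swa.size(); ++il) {
        total += n_embd_kv_row(hp, il);
    }
    return total;
}
```

With a global head dimension, every layer's cache tensor would be given the same row width, which would be wrong for one of the two layer types. Resolving the dimension by layer index keeps the cache views, the RoPE shift, and the loaded tensor shapes consistent with each other.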

2. QVAC API Differences

QVAC's build_moe_ffn() has a different signature from upstream llama.cpp:

  • QVAC: (cur, gate_inp, up_exps, gate_exps, down_exps, exp_probs_b, n_expert, n_expert_used, type_op, norm_w, scale_w, w_scale, gating_op, il, probs_in, gate_up_exps)
  • Upstream: uses trailing *_exps_s parameters instead of scale_w + w_scale

3. Member Variables vs Functions

Several hparams that are functions in upstream llama.cpp are plain member variables in QVAC:

  • n_embd_head_k / n_embd_head_v — member vars (upstream: functions with il param)
  • n_rot — member var (upstream: function with il param)

4. Field Naming

  • QVAC: tok_embd_per_layer (model struct)
  • Upstream: per_layer_tok_embd

Files Modified

11 existing files modified (+256 lines), plus one new file (src/models/gemma4-iswa.cpp):

| File | Changes |
|------|---------|
| src/llama-arch.h | LLM_ARCH_GEMMA4, tensor enums, KV enums |
| src/llama-arch.cpp | Name mapping, tensor strings, tensor-to-arch mapping, tensor info entries |
| src/llama-model.h | Struct members: ffn_gate_inp_s, ffn_post_norm_1/2, ffn_pre_norm_2, out_scale |
| src/llama-model.cpp | Hparams parsing, tensor loading (~95 lines), build dispatch, KV reuse |
| src/llama-hparams.h | n_embd_per_layer, n_embd_head_k_swa, n_embd_head_v_swa |
| src/llama-hparams.cpp | Per-layer-aware n_embd_k_gqa() and n_embd_v_gqa() |
| src/llama-kv-cache.cpp | Per-layer head dims in get_k(), get_v(), build_graph_shift() |
| src/llama-vocab.h | LLAMA_VOCAB_PRE_TYPE_GEMMA4 = 50 |
| src/llama-vocab.cpp | Gemma4 tokenizer model, pre-type regex, BPE merge loading |
| src/models/models.h | llm_build_gemma4_iswa struct declaration |
| src/models/gemma4-iswa.cpp | Full graph builder (~299 lines) |
| src/CMakeLists.txt | Added models/gemma4-iswa.cpp |

Quick Start

Option A: Apply unified patch

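# /path/to/qvac-gemma4-patch is a placeholder for your clone of this repository
cd /path/to/qvac-fabric-llm.cpp
git checkout temp-upstream
git apply /path/to/qvac-gemma4-patch/gemma4-qvac-full.patch
cp /path/to/qvac-gemma4-patch/models/gemma4-iswa.cpp src/models/gemma4-iswa.cpp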
mkdir -p build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_SKIP_INSTALL_RULES=ON
make -j$(nproc)

Option B: Use patch scripts

python scripts/patch_structs.py
python scripts/patch_models_h.py
python scripts/patch_fix_all.py
cp models/gemma4-iswa.cpp /path/to/qvac-fabric-llm.cpp/src/models/
# Rebuild with the cmake and make steps from Option A

Test

./build/bin/llama-server -m /path/to/gemma4.gguf -ngl 99 --port 8090

curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

Known Issues

  1. Tokenizer newline display: the CLI prints [UNK_BYTE_0x0a] in place of newline characters. This is cosmetic only; the server API returns newlines correctly. Root cause: QVAC lacks byte_encode=false for SPM-style BPE.

  2. Per-layer head dim safety: The n_embd_k_gqa() / n_embd_v_gqa() changes are global but safe — n_embd_head_k_swa defaults to 0 for non-Gemma4 architectures, so the SWA branch is never taken.
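
The safety argument in item 2 can be illustrated with a small sketch. It assumes the guard is a simple non-zero check on `n_embd_head_k_swa`; the struct `hparams_guard_sketch`, the `layer_is_swa` parameter, and the numeric values are hypothetical, and the real function in src/llama-hparams.cpp may be structured differently.

```cpp
#include <cstdint>

// Hypothetical reduced hparams holding only the fields discussed in item 2.
struct hparams_guard_sketch {
    uint32_t n_embd_head_k     = 128; // global head dim (illustrative value)
    uint32_t n_embd_head_k_swa = 0;   // stays 0 unless the architecture sets it
    uint32_t n_head_kv         = 8;   // illustrative value
};

// Per-layer K width. The SWA value is used only when it is non-zero, so
// architectures that never set it keep the original global behaviour.
inline uint32_t n_embd_k_gqa(const hparams_guard_sketch & hp, bool layer_is_swa) {
    const uint32_t head = (layer_is_swa && hp.n_embd_head_k_swa > 0)
                              ? hp.n_embd_head_k_swa
                              : hp.n_embd_head_k;
    return head * hp.n_head_kv;
}
```

For a model that leaves `n_embd_head_k_swa` at 0, the function returns the same width for SWA and non-SWA layers, which matches the behaviour before the change. Only a model that sets the SWA field, as Gemma 4 does, sees different widths per layer type.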

Repository Structure

├── README.md
├── gemma4-qvac-full.patch        # Unified git diff (git apply ready)
├── models/
│   ├── gemma4-iswa.cpp           # Graph builder (new file)
│   └── models.h                  # Full header with gemma4 struct
├── patches/                      # Full patched source files (reference)
│   ├── llama-arch.h / .cpp
│   ├── llama-model.h / .cpp
│   ├── llama-hparams.h / .cpp
│   ├── llama-kv-cache.cpp
│   ├── llama-vocab.h / .cpp
│   └── CMakeLists.txt
└── scripts/                      # Automated patch scripts
    ├── patch_structs.py
    ├── patch_models_h.py
    ├── patch_fix_all.py
    ├── patch_vocab.py
    ├── patch_kvs.py
    ├── patch_tensors.py
    └── patch_missing.py

License

Patches provided under MIT (same as QVAC-Fabric). Gemma 4 model weights are subject to Google's Gemma Terms of Use.
