QVAC-Fabric Gemma 4 Architecture Patch

Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork.

Base: QVAC-Fabric temp-upstream branch
Target: All Gemma 4 variants (E2B, E4B, etc. — architecture is size-agnostic)

Why

QVAC-Fabric provides memory-based model loading, on-device LoRA hot-swap, BitNet TQ2_0 quantization, and Adreno GPU optimization. It already supported gemma3 and gemma3n but not gemma4; this patch closes that gap.

Performance

Tested on NVIDIA RTX 4090 with Gemma 4 E2B Q4_K_M (3.2 GB):

| Metric | Value |
| --- | --- |
| Prompt processing | 132.5 tokens/sec |
| Generation | 116.5 tokens/sec |
| Model memory | 1,416 MiB |
| Context memory | 780 MiB |

Key Technical Challenges

1. Per-Layer Head Dimensions

Gemma 4 uses different attention head dimensions for SWA (Sliding Window Attention) and non-SWA layers:

- SWA layers: n_embd_head = 256, n_rot = 256
- Non-SWA layers: n_embd_head = 512, n_rot = 512

QVAC-Fabric assumed a single global n_embd_head_k value. This patch modifies:

- n_embd_k_gqa() and n_embd_v_gqa() to return per-layer values
- KV cache get_k() and get_v() to use per-layer head dims
- build_graph_shift() to use per-layer RoPE dimensions
- Tensor loading to compute correct shapes per layer
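
The accessor change can be sketched as follows. This is a minimal illustration, not the patch's actual code: `hparams_sketch`, `swa_layer`, `head_dim_k()`, and `make_gemma4_like()` are hypothetical stand-ins, and the KV head count of 2 is an arbitrary illustrative value. Only the names `n_embd_head_k`, `n_embd_head_k_swa`, and `n_embd_k_gqa()` come from the patch description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the hparams struct (hypothetical, not the real QVAC-Fabric type).
struct hparams_sketch {
    uint32_t n_embd_head_k     = 0;  // K head dim for non-SWA layers
    uint32_t n_embd_head_k_swa = 0;  // K head dim for SWA layers; 0 means "no override"
    std::vector<uint32_t> n_head_kv; // KV heads per layer
    std::vector<bool>     swa_layer; // true where the layer uses sliding-window attention

    // Head dimension for layer il: the SWA value is used only when the layer
    // is an SWA layer and an override was actually loaded.
    uint32_t head_dim_k(uint32_t il) const {
        const bool swa = swa_layer.at(il);
        return (swa && n_embd_head_k_swa > 0) ? n_embd_head_k_swa : n_embd_head_k;
    }

    // Per-layer replacement for the former single global computation.
    uint32_t n_embd_k_gqa(uint32_t il) const {
        return head_dim_k(il) * n_head_kv.at(il);
    }
};

// Three layers (two SWA, one full attention) using the head dims listed above.
inline hparams_sketch make_gemma4_like() {
    hparams_sketch hp;
    hp.n_embd_head_k     = 512;
    hp.n_embd_head_k_swa = 256;
    hp.n_head_kv = {2, 2, 2};           // illustrative value only
    hp.swa_layer = {true, true, false};
    return hp;
}

inline void example_head_dims() {
    const hparams_sketch hp = make_gemma4_like();
    assert(hp.n_embd_k_gqa(0) == 256u * 2u); // SWA layer
    assert(hp.n_embd_k_gqa(2) == 512u * 2u); // non-SWA layer
}
```

With `n_embd_head_k_swa` left at 0, every layer falls back to `n_embd_head_k`, so architectures that never set the SWA value see the same result as before.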

2. QVAC API Differences

QVAC's build_moe_ffn() has a different signature from upstream llama.cpp:

- QVAC: (cur, gate_inp, up_exps, gate_exps, down_exps, exp_probs_b, n_expert, n_expert_used, type_op, norm_w, scale_w, w_scale, gating_op, il, probs_in, gate_up_exps)
- Upstream: uses trailing *_exps_s parameters instead of scale_w + w_scale

3. Member Variables vs Functions

Several hparams that are functions in upstream llama.cpp are plain member variables in QVAC:

- n_embd_head_k / n_embd_head_v: member variables (upstream: functions taking a layer index il)
- n_rot: member variable (upstream: function taking a layer index il)
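
When graph-builder code is ported from upstream, this difference shows up at every call site. The sketch below contrasts the two styles using mock structs. `upstream_style_hparams`, `qvac_style_hparams`, `SWA_PATTERN`, and the `rope_dims()` helper are hypothetical names chosen for illustration and exist in neither codebase. It also assumes that the RoPE dimension of an SWA layer equals the SWA head dimension, which matches the values listed in section 1.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t N_LAYER = 4;
// Illustrative layer pattern: three SWA layers followed by one full-attention layer.
constexpr std::array<bool, N_LAYER> SWA_PATTERN = {true, true, true, false};

// Upstream style: the per-layer choice is hidden behind a member function.
struct upstream_style_hparams {
    uint32_t n_rot_full = 512;
    uint32_t n_rot_swa  = 256;
    uint32_t n_rot(uint32_t il) const {
        return SWA_PATTERN.at(il) ? n_rot_swa : n_rot_full;
    }
};

// QVAC style: n_rot is a plain member variable, so the caller selects per layer.
struct qvac_style_hparams {
    uint32_t n_rot             = 512;
    uint32_t n_embd_head_k_swa = 256; // 0 would mean "no SWA override"
};

// Call-site helper for upstream-style hparams: just forward to the function.
inline uint32_t rope_dims(const upstream_style_hparams & hp, uint32_t il) {
    return hp.n_rot(il);
}

// Call-site helper for QVAC-style hparams: pick the SWA head dim when it applies.
inline uint32_t rope_dims(const qvac_style_hparams & hp, uint32_t il) {
    const bool swa = SWA_PATTERN.at(il);
    return (swa && hp.n_embd_head_k_swa > 0) ? hp.n_embd_head_k_swa : hp.n_rot;
}

inline void example_rope_dims() {
    const upstream_style_hparams up;
    const qvac_style_hparams     qv;
    for (uint32_t il = 0; il < N_LAYER; ++il) {
        assert(rope_dims(up, il) == rope_dims(qv, il)); // both styles agree per layer
    }
}
```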

4. Field Naming

- QVAC: tok_embd_per_layer (model struct)
- Upstream: per_layer_tok_embd

Files Modified

11 existing files modified (+256 lines), plus the new graph builder `src/models/gemma4-iswa.cpp`:

| File | Changes |
| --- | --- |
| `src/llama-arch.h` | `LLM_ARCH_GEMMA4`, tensor enums, KV enums |
| `src/llama-arch.cpp` | Name mapping, tensor strings, tensor-to-arch mapping, tensor info entries |
| `src/llama-model.h` | Struct members: `ffn_gate_inp_s`, `ffn_post_norm_1/2`, `ffn_pre_norm_2`, `out_scale` |
| `src/llama-model.cpp` | Hparams parsing, tensor loading (~95 lines), build dispatch, KV reuse |
| `src/llama-hparams.h` | `n_embd_per_layer`, `n_embd_head_k_swa`, `n_embd_head_v_swa` |
| `src/llama-hparams.cpp` | Per-layer-aware `n_embd_k_gqa()` and `n_embd_v_gqa()` |
| `src/llama-kv-cache.cpp` | Per-layer head dims in `get_k()`, `get_v()`, `build_graph_shift()` |
| `src/llama-vocab.h` | `LLAMA_VOCAB_PRE_TYPE_GEMMA4 = 50` |
| `src/llama-vocab.cpp` | Gemma4 tokenizer model, pre-type regex, BPE merge loading |
| `src/models/models.h` | `llm_build_gemma4_iswa` struct declaration |
| `src/models/gemma4-iswa.cpp` | Full graph builder (~299 lines) |
| `src/CMakeLists.txt` | Added `models/gemma4-iswa.cpp` |

Quick Start

Option A: Apply unified patch

cd /path/to/qvac-fabric-llm.cpp
git checkout temp-upstream
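# gemma4-qvac-full.patch and models/ live in this patch repository; prefix both paths with its location if it is cloned elsewhere
git apply gemma4-qvac-full.patch
cp models/gemma4-iswa.cpp src/models/gemma4-iswa.cpp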
mkdir -p build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_SKIP_INSTALL_RULES=ON
make -j$(nproc)

Option B: Use patch scripts

python scripts/patch_structs.py
python scripts/patch_models_h.py
python scripts/patch_fix_all.py
cp models/gemma4-iswa.cpp /path/to/qvac/src/models/
# Rebuild with the same cmake and make steps as in Option A

Test

./build/bin/llama-server -m /path/to/gemma4.gguf -ngl 99 --port 8090

curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

Known Issues

- Tokenizer newline display: newlines appear as [UNK_BYTE_0x0a] in CLI output. This is cosmetic only; the server API returns correct text. Root cause: QVAC lacks the byte_encode=false setting for SPM-style BPE.
- Per-layer head dim safety: the n_embd_k_gqa() / n_embd_v_gqa() changes apply to every architecture but are safe, because n_embd_head_k_swa defaults to 0 for non-Gemma4 architectures and the SWA branch is therefore never taken.
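
The newline symptom is easier to follow with a small model of byte-level BPE decoding. The sketch below builds the GPT-2 style byte-to-code-point table used by byte-level BPE tokenizers, then decodes token text by reversing it. The function names and the exact placeholder format are assumptions made for illustration, not QVAC-Fabric code. A vocabulary that stores a raw newline (SPM-style) has no entry in the reverse table, so a decoder that always applies the reverse mapping emits a placeholder instead of the byte.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// GPT-2 style table: printable bytes map to themselves, all others to 256, 257, ...
inline std::map<uint8_t, uint32_t> make_byte_to_codepoint() {
    std::map<uint8_t, uint32_t> table;
    uint32_t next = 256;
    for (int b = 0; b < 256; ++b) {
        const bool printable = (b >= 0x21 && b <= 0x7E) ||
                               (b >= 0xA1 && b <= 0xAC) ||
                               (b >= 0xAE && b <= 0xFF);
        table[static_cast<uint8_t>(b)] = printable ? static_cast<uint32_t>(b) : next++;
    }
    return table;
}

// Decode the code points of a token's text back to raw bytes.
// Code points without a reverse mapping become a visible placeholder.
inline std::string decode_byte_level(const std::vector<uint32_t> & codepoints) {
    std::map<uint32_t, uint8_t> reverse;
    for (const auto & kv : make_byte_to_codepoint()) {
        reverse[kv.second] = kv.first;
    }
    std::string out;
    for (const uint32_t cp : codepoints) {
        const auto it = reverse.find(cp);
        if (it != reverse.end()) {
            out.push_back(static_cast<char>(it->second));
        } else {
            char buf[32];
            std::snprintf(buf, sizeof(buf), "[UNK_BYTE_0x%02x]", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

inline void example_newline_decode() {
    // Byte-level vocab: newline is stored as U+010A (256 + 10) and decodes cleanly.
    assert(decode_byte_level({0x48, 0x69, 0x010A}) == "Hi\n");
    // SPM-style vocab: newline is stored raw (0x0A) and has no reverse mapping.
    assert(decode_byte_level({0x48, 0x69, 0x0A}) == "Hi[UNK_BYTE_0x0a]");
}
```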

Repository Structure

├── README.md
├── gemma4-qvac-full.patch        # Unified git diff (git apply ready)
├── models/
│   ├── gemma4-iswa.cpp           # Graph builder (new file)
│   └── models.h                  # Full header with gemma4 struct
├── patches/                      # Full patched source files (reference)
│   ├── llama-arch.h / .cpp
│   ├── llama-model.h / .cpp
│   ├── llama-hparams.h / .cpp
│   ├── llama-kv-cache.cpp
│   ├── llama-vocab.h / .cpp
│   └── CMakeLists.txt
└── scripts/                      # Automated patch scripts
    ├── patch_structs.py
    ├── patch_models_h.py
    ├── patch_fix_all.py
    ├── patch_vocab.py
    ├── patch_kvs.py
    ├── patch_tensors.py
    └── patch_missing.py

License

Patches provided under MIT (same as QVAC-Fabric). Gemma 4 model weights are subject to Google's Gemma Terms of Use.
