GPU-accelerated LLM inference engine for AMD GPUs, written in Rust with hand-tuned HIP/C++ WMMA (Wave Matrix Multiply-Accumulate) kernels for RDNA 4.
This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.
First open-source project to bypass hipBLAS/Tensile on gfx1201 with direct matrix-core utilisation. Supports Qwen2.5 (Q4_0), Qwen3, and Llama-3.1 (Q4_K_M) GGUFs with interactive chat, streaming output, and speculative decoding (opt-in).
- WMMA matrix-core acceleration — hand-written RDNA 4 WMMA kernels for prefill GEMM (Q4_0, Q4_1, Q4_K) and FlashAttention (online softmax, GQA, causal masking). Bypasses hipBLAS/Tensile, which lacks gfx1201 matrix-core support in ROCm 7.2.
- Q4_K_M quantisation support — industry-standard K-quant format with hierarchical scale/min unpacking, super-block-aware nibble layout, and a fused Q4_K gate+up+SwiGLU decode kernel.
- Multi-model, multi-architecture — Qwen2.5, Qwen3, and Llama-3.1 load from the same binary. GQA ratio, RoPE base, rope_freqs scaling, per-head Q/K RMSNorm, norm epsilon, and chat templates are all auto-configured from GGUF metadata.
- Llama-3 tokeniser — GPT-4-style BPE pre-tokeniser with 7/7 token parity against llama.cpp on Meta-Llama-3.1-8B-Instruct.
- Interactive chat CLI — multi-turn conversations with streaming output, per-architecture chat templates, session statistics, slash commands, Ctrl+C to interrupt generation.
- Speculative decoding (opt-in) — 0.5B draft + 7B target with configurable depth. Currently slower than greedy on most prompt types (acceptance rate ~54 % median, break-even needs ~80 %).
- Arbitrary prompt lengths — WMMA kernels handle any `seq_len ≥ 1` via automatic padding to 64-token boundaries.
- Structured logging — `RUST_LOG=debug` shows kernel dispatch decisions, `RUST_LOG=trace` shows per-layer timing.
- ROCm upgrade validation — built-in benchmark harness with an automatic diff tool for safe ROCm version upgrades.
- CPU fallback path — pure-Rust CPU backend with AVX-512 VNNI for debugging and zero-GPU environments. Not intended for production.
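
The padding rule from the "Arbitrary prompt lengths" item amounts to rounding the sequence length up to the next multiple of 64. A minimal sketch of that arithmetic, using hypothetical names (`TILE_TOKENS`, `padded_seq_len`) that are not taken from the ROCmForge source:

```rust
/// Tile width in tokens (the 64-token boundary mentioned in the feature list).
pub const TILE_TOKENS: usize = 64;

/// Round `seq_len` up to the next multiple of `TILE_TOKENS`.
/// Hypothetical helper for illustration, not the actual ROCmForge API.
pub fn padded_seq_len(seq_len: usize) -> usize {
    assert!(seq_len >= 1, "seq_len must be at least 1");
    // Classic integer round-up: add (tile - 1), then truncate to a multiple.
    (seq_len + TILE_TOKENS - 1) / TILE_TOKENS * TILE_TOKENS
}
```

Under this rule a 1-token prompt occupies one 64-token tile, a 64-token prompt needs no padding, and a 65-token prompt is padded to 128.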
Measured on an AMD Radeon RX 9070 XT (gfx1201, RDNA 4) with ROCm 7.2.2. All numbers are the median of 3 runs with greedy decoding (temperature = 0.0).
| Test (Qwen2.5-7B Q4_0) | ROCmForge (tok/s) | llama.cpp ROCm (tok/s) | Ratio |
|---|---|---|---|
| pp64 | 799 | 2,912 | 0.27× |
| pp128 | 1,132 | 3,966 | 0.29× |
| pp256 | 1,482 | 4,951 | 0.30× |
| pp512 | 1,689 | 5,158 | 0.33× |
| Decode 128 (tok/s) | 102 | 121 | 0.84× |
| Test (Qwen3-8B Q4_K_M) | v0.2.0 (tok/s) | v0.3.0 (tok/s) | llama.cpp (tok/s) | Ratio (v0.3.0 / llama.cpp) |
|---|---|---|---|---|
| pp64 | 388 | 744 | 2,000 | 0.37× |
| pp128 | 447 | 1,068 | 2,657 | 0.40× |
| pp256 | 470 | 1,381 | 3,661 | 0.38× |
| pp512 | 472 | 1,579 | 3,756 | 0.42× |
| Decode 128 (tok/s) | 29.9 | 43.3 | 87 | 0.50× |
| Test (Llama-3.1-8B Q4_K_M) | v0.2.0 (tok/s) | v0.3.0 (tok/s) | llama.cpp (tok/s) | Ratio (v0.3.0 / llama.cpp) |
|---|---|---|---|---|
| pp64 | 400 | 784 | 2,198 | 0.36× |
| pp128 | 457 | 1,117 | 2,975 | 0.38× |
| pp256 | 475 | 1,420 | 3,925 | 0.36× |
| pp512 | 471 | 1,619 | 3,922 | 0.41× |
| Decode 128 (tok/s) | 30.5 | 44.3 | 93 | 0.48× |
What changed in v0.3.0. Phase 8b closed two bottlenecks that dominated
Q4_K_M performance. A new Q6_K WMMA prefill kernel replaces the
257-dispatch GEMV loop that used to eat 82 % of the pp256 budget on
Qwen3/Llama-3.1 (their attn_v and ffn_down projections are stored
as Q6_K inside the Q4_K_M mixture). A new Q4_K Q8-inline GEMV family
quantises the activation vector to Q8_0 once per token and runs the
dot-product as integer MAC — 3.3× faster QKV, 3.8× faster O-projection,
with the correct dual accumulator needed for Q4_K's affine dmin term.
The Q4_0 path is untouched. Full per-op profiling in
profiling/results/phase8_q4_k_m_analysis.md.
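
To make the Q8-inline idea concrete, here is a minimal CPU-side sketch of the arithmetic in Rust. It is not the HIP kernel. It assumes a simplified layout: a single 32-value sub-block whose weights decode as `w_i = d * q_i - m` (in Q4_K, `d` and `m` would be the super-block scale and `dmin` multiplied by the 6-bit sub-block scale and min), and an `f32` scale on the Q8_0 block where ggml stores `f16`. All names (`BlockQ8`, `quantize_q8_0`, `dot_affine_q4_q8`, `dot_reference`) are hypothetical.

```rust
/// One Q8_0-style block: 32 signed 8-bit values sharing one scale.
/// Assumption: f32 scale for simplicity (ggml stores it as f16).
pub struct BlockQ8 {
    pub scale: f32,
    pub qs: [i8; 32],
}

/// Quantise 32 activations: scale = max|x| / 127, q = round(x / scale).
pub fn quantize_q8_0(x: &[f32; 32]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut qs = [0i8; 32];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (*v * inv).round() as i8;
    }
    BlockQ8 { scale, qs }
}

/// Integer dot product of one affine 4-bit sub-block with a Q8 block.
/// Two accumulators are needed because the weights carry an offset `m`.
pub fn dot_affine_q4_q8(q4: &[u8; 32], d: f32, m: f32, act: &BlockQ8) -> f32 {
    let mut acc_qq: i32 = 0; // sum of q4_i * q8_i (scaled by d)
    let mut acc_q: i32 = 0; // sum of q8_i (scaled by m)
    for (w, a) in q4.iter().zip(act.qs.iter()) {
        acc_qq += i32::from(*w) * i32::from(*a);
        acc_q += i32::from(*a);
    }
    act.scale * (d * acc_qq as f32 - m * acc_q as f32)
}

/// Float reference: dequantise each weight and multiply by the raw activation.
pub fn dot_reference(q4: &[u8; 32], d: f32, m: f32, x: &[f32; 32]) -> f32 {
    q4.iter()
        .zip(x.iter())
        .map(|(q, v)| (d * f32::from(*q) - m) * *v)
        .sum()
}
```

The second accumulator is what the affine term requires: since Σ (d·q_i − m)·x_i = d·Σ q_i·x_i − m·Σ x_i, the integer path has to track both Σ q4·q8 and Σ q8, and the result differs from the float reference only by the Q8 rounding error.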
| Metric | ROCmForge | llama.cpp | Ratio |
|---|---|---|---|
| Prefill (tok/s) | 493.5 | 525.4 | 0.94× |
| Time-to-first-token | 49 ms | 46 ms | 1.07× |
| Decode (tok/s) | 102 | 121 | 0.84× |
| Total wall-clock (ms) | 1,303 | 1,103 | 1.18× |
| Milestone | Q4_0 prefill pp256 | Q4_0 decode | Q4_K_M prefill pp256 | Q4_K_M decode | Models |
|---|---|---|---|---|---|
| Project start | 64 | 82 | — | — | Qwen2.5 Q4_0 |
| + WMMA GEMM (Phase 2) | 92 | 102 | — | — | Qwen2.5 Q4_0 |
| + WMMA Attention (Phase 3) | 623 | 102 | — | — | Qwen2.5 Q4_0 |
| + Dispatch fixes (Phase 4, v0.1.0) | 1,484 | 102 | — | — | Qwen2.5 Q4_0 |
| + Q4_K_M + multi-model (Phase 7, v0.2.0) | 1,482 | 102 | 470 | 30 | Qwen2.5, Qwen3, Llama-3.1 |
| + Q6_K WMMA + Q4_K Q8-inline (Phase 8b, v0.3.0) | 1,482 | 102 | 1,381 | 43 | Qwen2.5, Qwen3, Llama-3.1 |
Full analyses:
benches/results/phase4_final_analysis.md,
benches/results/phase7_final_analysis.md,
profiling/results/phase8_q4_k_m_analysis.md.
Speculative decoding is implemented but opt-in. On the 15-prompt benchmark it
is a loss on every prompt in greedy mode: median 72 tok/s at depth=1
vs. 102 tok/s baseline. Break-even needs an acceptance rate α ≈ 80 % on the
current verify path; the best observed is 78 % (code_03). It is useful only for
workloads with very high acceptance rates (tightly constrained code, templated output).
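
For readers unfamiliar with the term, the acceptance rate counts how many drafted tokens the target model confirms. Under greedy decoding, a verify step can be sketched as follows. This is the generic technique with hypothetical function names, not ROCmForge's verify path, and it assumes the target produces one argmax per draft position plus one extra.

```rust
/// Number of leading draft tokens that match the target's greedy choice.
pub fn accepted_prefix(draft: &[u32], target_argmax: &[u32]) -> usize {
    draft
        .iter()
        .zip(target_argmax.iter())
        .take_while(|(d, t)| d == t)
        .count()
}

/// Tokens emitted by one verify step: the accepted draft tokens plus one
/// token from the target (a correction on mismatch, a bonus otherwise).
/// Assumption: `target_argmax.len() == draft.len() + 1`.
pub fn tokens_emitted(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    let n = accepted_prefix(draft, target_argmax);
    target_argmax[..=n].to_vec()
}

/// Fraction of drafted tokens that were accepted.
pub fn acceptance_rate(accepted: usize, drafted: usize) -> f64 {
    if drafted == 0 {
        0.0
    } else {
        accepted as f64 / drafted as f64
    }
}
```

At depth 1 each step emits either one or two tokens, so the draft and verify overhead has to be recovered from the accepted fraction alone.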
Note: speculative decoding is currently wired into the one-shot
rocmforge entry point via --draft-model, but rocmforge chat
accepts the flag and ignores it (banner shows standard path). See
docs/cli-guide.md for details.
The CPU fallback is pure Rust with an AVX-512 VNNI Q4_0 GEMV on Zen 4. It is
functional but not performance-competitive: Qwen2.5-0.5B ~12 tok/s, Qwen2.5-7B ~0.7 tok/s.
For production inference, use --gpu.
| Component | Version | Notes |
|---|---|---|
| GPU | AMD Radeon RX 9070 XT | gfx1201, RDNA 4, 16 GB VRAM |
| CPU | AMD Ryzen 9 7945HX | Zen 4, AVX-512 (CPU fallback) |
| OS | CachyOS (Arch Linux-based) | Kernel 7.x |
| ROCm | 7.2.1 and 7.2.2 | Both validated via upgrade diff tool |
| Model | Qwen2.5-7B-Instruct Q4_0 | GGUF, mixed 25×Q4_0 + 3×Q4_1 for ffn_down; 102 tok/s decode |
| Model | Qwen3-8B Q4_K_M | Q4_K + Q6_K mixed-precision, per-head Q/K RMSNorm, 43 tok/s decode (v0.3.0) |
| Model | Meta-Llama-3.1-8B-Instruct Q4_K_M | Q4_K + Q6_K, rope_freqs 128 k-context scaling, 44 tok/s decode (v0.3.0) |
| Draft model | Qwen2.5-0.5B-Instruct Q4_0 | For speculative decoding (opt-in) |
| RAM | 64 GB DDR5 | |
| Component | Notes |
|---|---|
| Other RDNA 4 GPUs | RX 9070, RX 9060 XT — same gfx1201 arch, should work |
| Arch Linux (vanilla) | Same ROCm packages as CachyOS |
| Ubuntu 24.04 | ROCm 7.2 officially supported; see INSTALL.md for notes |
| Fedora 41+ / RHEL 10 | ROCm 7.2 officially supported; see INSTALL.md for notes |
| Other Qwen / LLaMA sizes | Loader is metadata-driven. 14B hits the VRAM limit on 16 GB cards. 0.5B tested as draft. |
| Mistral / GLM / Phi / Gemma | Trait table covers these architectures; only Qwen and Llama have been end-to-end validated. |
| Component | Reason |
|---|---|
| RDNA 3 (gfx1100, RX 7900 XTX, etc.) | Kernels use __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12. RDNA 3 uses a different intrinsic format, a different A/B register layout (8 VGPRs instead of 4), and needs lane duplication (lanes 0–15 → 16–31) that gfx12 does not. Porting is feasible but not attempted. |
| RDNA 2 (gfx1030) and older | No matrix cores. |
| CDNA (MI100/MI200/MI300) | Uses MFMA, not WMMA. Different instruction set. |
| Intel / NVIDIA GPUs | HIP/ROCm backend only. |
| Windows | ROCm is Linux-only for consumer GPUs. |
| Q5_K / Q8_0 prefill | WMMA kernels implemented for Q4_0, Q4_1, and Q4_K only. Q5_K / Q8_0 fall back to scalar / GEMV kernels. |
- Q4_K_M decode gap: Qwen3-8B decode at ~43 tok/s, Llama-3.1-8B at
~44 tok/s vs. llama.cpp's 87 / 93 tok/s (~0.48–0.50×). Post-v0.3.0 the
dominant kernel is the fused Q4_K gate+up+SwiGLU with Q8-inline
activation (~15 ms / token, ~42 % of budget). It is now LDS-bound
rather than ALU-bound; an interleaved/tile variant modelled on the
Q4_0 `q8_inline_interleaved_tile4` fastpath is the next lever.
- Q4_K_M prefill gap: pp256 at ~1,381 tok/s (Qwen3) / ~1,420 tok/s (Llama-3.1) vs. llama.cpp 3,661 / 3,925 (~0.36–0.38×). The Q6_K WMMA kernel added in v0.3.0 removed the GEMV-loop fallback; the remaining gap is mostly unfused norm/RoPE orchestration and FP32↔FP16 activation shuttling, same issue as the Q4_0 synthetic-prefill gap below.
- Q4_0 decode gap: 102 tok/s vs. llama.cpp 117–121 tok/s (~0.84–0.87×),
unchanged since the Phase 2 WMMA GEMM milestone. Fully profiled in Phase 6. The
9.76 ms per-token budget splits: GEMV 77 % (memory-bandwidth-bound
and CU-saturated), launch overhead 8–13 % (255 kernel launches per
token, roughly the size of the gap to llama.cpp), attention 6 %,
everything else 9 %. A fused RMSNorm + Gate + Up + SwiGLU kernel
exists but has a latent state-corruption bug that appears from
token 2 onwards when routed into the active decode path; see
`docs/known_issues/fused_norm_gate_up_bug.md`. Full analysis: `profiling/results/decode_profiling_analysis.md`.
- Synthetic prefill gap at long sequences: at pp256+ ROCmForge is ~3× slower than llama.cpp. The GEMM path is fully WMMA after Phase 4; the gap is unfused norm/RoPE/embedding orchestration and FP32↔FP16 activation shuttling that llama.cpp packs into fewer launches.
- Speculative decoding is slower than greedy on most prompt types.
Acceptance rate ~54 % median, below the ~80 % break-even threshold.
Works via `--draft-model` in the one-shot CLI; accepted-but-ignored in `rocmforge chat`.
- Chat CLI does not persist KV cache between turns: each turn re-prefills the entire conversation history. Multi-turn TTFT grows with history length (roughly linear in total tokens so far).
- Llama-3.1 multi-turn chat produces degraded output. Single-turn
works correctly. Root cause is a position/content-dependent prefill
divergence for longer prompts (~50+ tokens). Qwen3 multi-turn works
perfectly. Workaround: use Qwen3 for interactive chat, or use
Llama-3.1 in single-turn mode with `--no-template`. Tracked in `docs/known_issues/llama3_multiturn_prefill_bug.md`.
- Full-decode HIP graph disabled on RDNA 4: graph replay of
device-pointer reads in large graphs (~200+ nodes) returns stale
values; see `hip_graph_device_pointer_bug.md`. Tail-only graph (lm_head + argmax) is still active.
- rocprofv3 PMC counters hang on gfx1201: hardware perf-counter
capture via `rocprofv3 --pmc` is unstable on gfx1201. Use `RUST_LOG=trace` + `ROCMFORGE_PROFILE_PREFILL_OPS=1` for profiling.
- WMMA attention `head_dim == 128` only: Qwen2.5, LLaMA-2, Mistral are covered. Other head dimensions fall back to the scalar attention kernel.
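
The budget figures in the Q4_0 decode item can be cross-checked with a few lines of arithmetic. The helpers below are illustrative only (hypothetical names, not part of ROCmForge); they convert between per-token latency and throughput and estimate how much of the budget would have to go to reach a target rate.

```rust
/// Convert a per-token latency in milliseconds to tokens per second.
pub fn tok_per_s(ms_per_token: f64) -> f64 {
    1000.0 / ms_per_token
}

/// Convert a throughput in tokens per second to milliseconds per token.
pub fn ms_per_token(rate_tok_per_s: f64) -> f64 {
    1000.0 / rate_tok_per_s
}

/// Fraction of our per-token budget that must be removed to reach
/// `target_tok_per_s`, given our current latency `ours_ms`.
pub fn budget_gap_fraction(ours_ms: f64, target_tok_per_s: f64) -> f64 {
    (ours_ms - ms_per_token(target_tok_per_s)) / ours_ms
}
```

With the numbers quoted above, `tok_per_s(9.76)` is about 102.5, and matching llama.cpp's 117–121 tok/s would mean trimming roughly 12–15 % of the 9.76 ms budget, which is of the same order as the 8–13 % attributed to launch overhead.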
- `INSTALL.md` — installation (Arch tested, Ubuntu / Fedora notes)
- `docs/cli-guide.md` — CLI reference (chat + one-shot, slash commands, env flags)
- `CHANGELOG.md` — release notes and optimisation history
- `benches/results/phase4_final_analysis.md` — Phase 4 final benchmark analysis
- `docs/architecture_notes.md` — RDNA 4 memory pipelining and WMMA details
- `docs/spec_decode_milestone_summary.md` — speculative decoding investigation
MIT. See LICENSE.