ROCmForge

GPU-accelerated LLM inference engine for AMD GPUs, written in Rust with hand-tuned HIP/C++ WMMA (Wave Matrix Multiply-Accumulate) kernels for RDNA 4.

Acknowledgement

This project builds on the foundational work of oldnordic. Without his original ROCmForge implementation — the model loader, the CPU inference path, the GGUF parser, and the overall architecture — none of the WMMA matrix-core optimisations, the multi-model support, or the interactive chat CLI would have been possible. Thank you for making this project a reality.

First open-source project to bypass hipBLAS/Tensile on gfx1201 with direct matrix-core utilisation. Supports Qwen2.5 (Q4_0), Qwen3, and Llama-3.1 (Q4_K_M) GGUFs with interactive chat, streaming output, and speculative decoding (opt-in).

Features

WMMA matrix-core acceleration — hand-written RDNA 4 WMMA kernels for prefill GEMM (Q4_0, Q4_1, Q4_K, and Q6_K since v0.3.0) and FlashAttention (online softmax, GQA, causal masking). Bypasses hipBLAS/Tensile, which lack gfx1201 matrix-core support in ROCm 7.2.
Q4_K_M quantisation support — industry-standard K-quant format with hierarchical scale/min unpacking, super-block-aware nibble layout, and a fused Q4_K gate+up+SwiGLU decode kernel.
Multi-model, multi-architecture — Qwen2.5, Qwen3, and Llama-3.1 load from the same binary. GQA ratio, RoPE base, rope_freqs scaling, per-head Q/K RMSNorm, norm epsilon, and chat templates are all auto-configured from GGUF metadata.
Llama-3 tokeniser — GPT-4-style BPE pre-tokeniser with 7/7 token parity against llama.cpp on Meta-Llama-3.1-8B-Instruct.
Interactive chat CLI — multi-turn conversations with streaming output, per-architecture chat templates, session statistics, slash commands, Ctrl+C to interrupt generation.
Speculative decoding (opt-in) — 0.5B draft + 7B target with configurable depth. Currently slower than greedy on most prompt types (acceptance rate ~54 % median, break-even needs ~80 %).
Arbitrary prompt lengths — WMMA kernels handle any seq_len ≥ 1 via automatic padding to 64-token boundaries.
Structured logging — RUST_LOG=debug shows kernel dispatch decisions, RUST_LOG=trace shows per-layer timing.
ROCm upgrade validation — built-in benchmark harness with an automatic diff tool for safe ROCm version upgrades.
CPU fallback path — pure-Rust CPU backend with AVX-512 VNNI for debugging and zero-GPU environments. Not intended for production.
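
The padding rule behind "Arbitrary prompt lengths" can be sketched in a few lines of Rust. This is a minimal illustration, not the project's code: the names `TILE_TOKENS`, `padded_seq_len`, and `pad_activations` are hypothetical, and zero-filling the padded rows is an assumption about how the tail tile is handled.

```rust
/// Tile height in tokens that the WMMA prefill path pads to (64, per the README).
pub const TILE_TOKENS: usize = 64;

/// Round a prompt length up to the next multiple of TILE_TOKENS.
pub fn padded_seq_len(seq_len: usize) -> usize {
    assert!(seq_len >= 1, "seq_len must be at least 1");
    (seq_len + TILE_TOKENS - 1) / TILE_TOKENS * TILE_TOKENS
}

/// Copy a row-major [seq_len x hidden] activation matrix into a buffer with
/// padded_seq_len(seq_len) rows. The extra rows are zero (assumed here), so
/// their outputs can simply be discarded after the GEMM.
pub fn pad_activations(x: &[f32], seq_len: usize, hidden: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq_len * hidden, "x must be seq_len x hidden, row-major");
    let mut padded = vec![0.0f32; padded_seq_len(seq_len) * hidden];
    padded[..x.len()].copy_from_slice(x);
    padded
}
```

With this rule a 65-token prompt occupies two 64-token tiles, so the worst-case padding overhead is just under one tile.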

Performance (v0.3.0)

Measured on AMD Radeon RX 9070 XT (gfx1201, RDNA 4), ROCm 7.2.2. All numbers are median of 3 runs at greedy decode (temperature = 0.0).

Qwen2.5-7B Q4_0 (reference model, unchanged)

Benchmark (pp = prompt tokens)	ROCmForge (tok/s)	llama.cpp ROCm (tok/s)	Ratio (ROCmForge / llama.cpp)
pp64	799	2,912	0.27×
pp128	1,132	3,966	0.29×
pp256	1,482	4,951	0.30×
pp512	1,689	5,158	0.33×
Decode 128 (tok/s)	102	121	0.84×

Qwen3-8B Q4_K_M (+194 % prefill, +52 % decode vs v0.2.0)

Benchmark (pp = prompt tokens)	v0.2.0 (tok/s)	v0.3.0 (tok/s)	llama.cpp (tok/s)	Ratio (v0.3.0 / llama.cpp)
pp64	388	744	2,000	0.37×
pp128	447	1,068	2,657	0.40×
pp256	470	1,381	3,661	0.38×
pp512	472	1,579	3,756	0.42×
Decode 128 (tok/s)	29.9	43.3	87	0.50×

Llama-3.1-8B-Instruct Q4_K_M (+199 % prefill, +49 % decode vs v0.2.0)

Benchmark (pp = prompt tokens)	v0.2.0 (tok/s)	v0.3.0 (tok/s)	llama.cpp (tok/s)	Ratio (v0.3.0 / llama.cpp)
pp64	400	784	2,198	0.36×
pp128	457	1,117	2,975	0.38×
pp256	475	1,420	3,925	0.36×
pp512	471	1,619	3,922	0.41×
Decode 128 (tok/s)	30.5	44.3	93	0.48×

What changed in v0.3.0. Phase 8b closed two bottlenecks that dominated Q4_K_M performance. A new Q6_K WMMA prefill kernel replaces the 257-dispatch GEMV loop that used to eat 82 % of the pp256 budget on Qwen3/Llama-3.1 (their attn_v and ffn_down projections are stored as Q6_K inside the Q4_K_M mixture). A new Q4_K Q8-inline GEMV family quantises the activation vector to Q8_0 once per token and runs the dot-product as integer MAC — 3.3× faster QKV, 3.8× faster O-projection, with the correct dual accumulator needed for Q4_K's affine dmin term. The Q4_0 path is untouched. Full per-op profiling in profiling/results/phase8_q4_k_m_analysis.md.
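
The "dual accumulator" mentioned above can be illustrated on the CPU. The sketch below is a simplified model, not the HIP kernel: `Q8Block`, `Q4KSubBlock`, `quantize_q8`, and `dot_q4k_q8` are hypothetical names, and the weight struct stores one unpacked 32-weight sub-block with float scales, whereas real Q4_K packs nibbles and 6-bit scales inside a 256-weight super-block. It assumes a weight dequantises as `d * sc * q - dmin * m`, so the integer dot product needs one sum for the scale term and a second sum of the activation quants for the min term.

```rust
/// One Q8_0-style activation block: 32 signed 8-bit values sharing one scale.
pub struct Q8Block {
    pub scale: f32,
    pub q: [i8; 32],
}

/// Simplified stand-in for one 32-weight Q4_K sub-block (hypothetical layout).
pub struct Q4KSubBlock {
    pub d: f32,      // super-block scale
    pub dmin: f32,   // super-block scale for the min (affine) term
    pub sc: u8,      // sub-block scale
    pub m: u8,       // sub-block min
    pub q: [u8; 32], // 4-bit quants, one per byte, each in 0..=15
}

/// Quantise 32 activations once, so every weight row can reuse the result.
pub fn quantize_q8(x: &[f32; 32]) -> Q8Block {
    let amax = x.iter().fold(0.0f32, |a, v| a.max(v.abs()));
    let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let mut q = [0i8; 32];
    for (qi, xi) in q.iter_mut().zip(x.iter()) {
        *qi = (xi / scale).round().clamp(-127.0, 127.0) as i8;
    }
    Q8Block { scale, q }
}

/// Integer dot product with two accumulators:
///   acc_qq = sum(q4 * q8) feeds the scale term,
///   acc_q  = sum(q8)      feeds the dmin (affine) term.
pub fn dot_q4k_q8(w: &Q4KSubBlock, a: &Q8Block) -> f32 {
    let mut acc_qq: i32 = 0;
    let mut acc_q: i32 = 0;
    for i in 0..32 {
        acc_qq += w.q[i] as i32 * a.q[i] as i32;
        acc_q += a.q[i] as i32;
    }
    let scale_term = w.d * w.sc as f32 * acc_qq as f32;
    let min_term = w.dmin * w.m as f32 * acc_q as f32;
    a.scale * (scale_term - min_term)
}
```

Dropping the second accumulator would silently ignore the `dmin * m` offset, which is why a plain Q4_0-style integer dot product is not sufficient for Q4_K.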

Real-world prompts — Qwen2.5-7B Q4_0 (15 prompts, 19–41 tokens, 128 generated)

Metric	ROCmForge	llama.cpp	Ratio
Prefill (tok/s)	493.5	525.4	0.94×
Time-to-first-token	49 ms	46 ms	1.07×
Decode (tok/s)	102	121	0.84×
Total wall-clock (ms)	1,303	1,103	1.18×

Optimisation history

Milestone	Q4_0 prefill pp256 (tok/s)	Q4_0 decode (tok/s)	Q4_K_M prefill pp256 (tok/s)	Q4_K_M decode (tok/s)	Models
Project start	64	82	—	—	Qwen2.5 Q4_0
+ WMMA GEMM (Phase 2)	92	102	—	—	Qwen2.5 Q4_0
+ WMMA Attention (Phase 3)	623	102	—	—	Qwen2.5 Q4_0
+ Dispatch fixes (Phase 4, v0.1.0)	1,484	102	—	—	Qwen2.5 Q4_0
+ Q4_K_M + multi-model (Phase 7, v0.2.0)	1,482	102	470	30	Qwen2.5, Qwen3, Llama-3.1
+ Q6_K WMMA + Q4_K Q8-inline (Phase 8b, v0.3.0)	1,482	102	1,381	43	Qwen2.5, Qwen3, Llama-3.1

Full analyses: benches/results/phase4_final_analysis.md, benches/results/phase7_final_analysis.md, profiling/results/phase8_q4_k_m_analysis.md.

Speculative decoding

Implemented but opt-in. On the 15-prompt benchmark it is a loss on every prompt in greedy mode: median 72 tok/s at depth=1 vs. 102 tok/s baseline. Break-even needs an acceptance rate α ≈ 80 % on the current verify path; the best observed is 78 % (code_03). Useful only for workloads with very high acceptance rates (tightly constrained code, templated output).

Note: speculative decoding is currently wired into the one-shot rocmforge entry point via --draft-model, but rocmforge chat accepts the flag and ignores it (banner shows standard path). See docs/cli-guide.md for details.
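
Why the acceptance rate dominates can be seen from the textbook speculative-decoding model, sketched below. This is a generic simplification and does not reproduce the measured break-even above: it assumes each drafted token is accepted independently with probability `alpha`, and `draft_cost` and `verify_cost` are hypothetical parameters expressed in units of one ordinary target-model decode step.

```rust
/// Expected tokens emitted per draft-and-verify cycle:
/// 1 + alpha + alpha^2 + ... + alpha^depth.
/// The leading 1 is the token the target model always contributes.
pub fn expected_tokens_per_cycle(alpha: f64, depth: u32) -> f64 {
    let mut total = 0.0;
    let mut term = 1.0;
    for _ in 0..=depth {
        total += term;
        term *= alpha;
    }
    total
}

/// Throughput relative to plain greedy decode (1.0 = break-even).
/// draft_cost:  cost of one draft-model step (hypothetical).
/// verify_cost: cost of one batched verify pass on the target (hypothetical).
pub fn relative_throughput(alpha: f64, depth: u32, draft_cost: f64, verify_cost: f64) -> f64 {
    let cycle_cost = depth as f64 * draft_cost + verify_cost;
    expected_tokens_per_cycle(alpha, depth) / cycle_cost
}
```

In this model a verify pass that costs noticeably more than a single decode step pushes the break-even acceptance rate up quickly, which is consistent with the qualitative picture described in this section.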

CPU backend

Pure-Rust with AVX-512 VNNI Q4_0 GEMV on Zen 4. Functional but not performance-competitive: Qwen2.5-0.5B ~12 tok/s, Qwen2.5-7B ~0.7 tok/s. For production inference, use --gpu.

Compatibility

Tested ✅

Component	Version	Notes
GPU	AMD Radeon RX 9070 XT	gfx1201, RDNA 4, 16 GB VRAM
CPU	AMD Ryzen 9 7945HX	Zen 4, AVX-512 (CPU fallback)
OS	CachyOS (Arch Linux-based)	Kernel 7.x
ROCm	7.2.1 and 7.2.2	Both validated via upgrade diff tool
Model	Qwen2.5-7B-Instruct Q4_0	GGUF, mixed 25×Q4_0 + 3×Q4_1 for `ffn_down`; 102 tok/s decode
Model	Qwen3-8B Q4_K_M	Q4_K + Q6_K mixed-precision, per-head Q/K RMSNorm, 43 tok/s decode (v0.3.0)
Model	Meta-Llama-3.1-8B-Instruct Q4_K_M	Q4_K + Q6_K, rope_freqs 128 k-context scaling, 44 tok/s decode (v0.3.0)
Draft model	Qwen2.5-0.5B-Instruct Q4_0	For speculative decoding (opt-in)
RAM	64 GB DDR5

Expected to work (untested) ⚠️

Component	Notes
Other RDNA 4 GPUs	RX 9070 shares the gfx1201 target and should work. RX 9060 XT is gfx1200 (same RDNA 4 generation), so it would need kernels built for that target.
Arch Linux (vanilla)	Same ROCm packages as CachyOS
Ubuntu 24.04	ROCm 7.2 officially supported; see INSTALL.md for notes
Fedora 41+ / RHEL 10	ROCm 7.2 officially supported; see INSTALL.md for notes
Other Qwen / LLaMA sizes	Loader is metadata-driven. 14B runs into the 16 GB VRAM limit. 0.5B tested as draft.
Mistral / GLM / Phi / Gemma	Trait table covers these architectures; only Qwen and Llama have been end-to-end validated.

Not expected to work ❌

Component	Reason
RDNA 3 (gfx1100, RX 7900 XTX, etc.)	Kernels use `__builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12`. RDNA 3 uses a different intrinsic format, a different A/B register layout (8 VGPRs instead of 4), and needs lane duplication (lanes 0–15 → 16–31) that gfx12 does not. Porting is feasible but not attempted.
RDNA 2 (gfx1030) and older	No matrix cores.
CDNA (MI100/MI200/MI300)	Uses MFMA, not WMMA. Different instruction set.
Intel / NVIDIA GPUs	HIP/ROCm backend only.
Windows	ROCm is Linux-only for consumer GPUs.
Q5_K / Q8_0 prefill	WMMA kernels are implemented for Q4_0, Q4_1, Q4_K, and Q6_K only. Q5_K / Q8_0 fall back to scalar / GEMV kernels.

Known issues

Q4_K_M decode gap: Qwen3-8B decode at ~43 tok/s, Llama-3.1-8B at ~44 tok/s vs. llama.cpp's 87 / 93 tok/s (~0.48–0.50×). Post-v0.3.0 the dominant kernel is the fused Q4_K gate+up+SwiGLU with Q8-inline activation (~15 ms / token, ~42 % of budget). It is now LDS-bound rather than ALU-bound; an interleaved/tile variant modelled on the Q4_0 q8_inline_interleaved_tile4 fastpath is the next lever.
Q4_K_M prefill gap: pp256 at ~1,381 tok/s (Qwen3) / ~1,420 tok/s (Llama-3.1) vs. llama.cpp 3,661 / 3,925 (~0.36–0.38×). The Q6_K WMMA kernel added in v0.3.0 removed the GEMV-loop fallback; the remaining gap is mostly unfused norm/RoPE orchestration and FP32↔FP16 activation shuttling, same issue as the Q4_0 synthetic-prefill gap below.
Q4_0 decode gap: 102 tok/s vs. llama.cpp 117–121 tok/s (~0.84–0.87×), unchanged since Phase 2 (the project started at 82 tok/s). Fully profiled in Phase 6. The 9.76 ms per-token budget splits as follows: GEMV 77 % (memory-bandwidth-bound and CU-saturated), launch overhead 8–13 % (255 kernel launches per token, which matches the size of the gap to llama.cpp), attention 6 %, everything else 9 %. A fused RMSNorm + Gate + Up + SwiGLU kernel exists but has a latent state-corruption bug that appears from the second token onwards when routed into the active decode path; see docs/known_issues/fused_norm_gate_up_bug.md. Full analysis: profiling/results/decode_profiling_analysis.md.
Synthetic prefill gap at long sequences: at pp256+ ROCmForge is ~3× slower than llama.cpp. The GEMM path is fully WMMA after Phase 4; the gap is unfused norm/RoPE/embedding orchestration and FP32↔FP16 activation shuttling that llama.cpp packs into fewer launches.
Speculative decoding is slower than greedy on most prompt types. Acceptance rate ~54 % median, below the ~80 % break-even threshold. Works via --draft-model in the one-shot CLI; accepted-but-ignored in rocmforge chat.
Chat CLI does not persist KV cache between turns: each turn re-prefills the entire conversation history. Multi-turn TTFT grows with history length (roughly linear in total tokens so far).
Llama-3.1 multi-turn chat produces degraded output. Single-turn works correctly. Root cause is a position/content-dependent prefill divergence for longer prompts (~50+ tokens). Qwen3 multi-turn works perfectly. Workaround: use Qwen3 for interactive chat, or use Llama-3.1 in single-turn mode with --no-template. Tracked in docs/known_issues/llama3_multiturn_prefill_bug.md.
Full-decode HIP graph disabled on RDNA 4: graph replay of device-pointer reads in large graphs (~200+ nodes) returns stale values — see hip_graph_device_pointer_bug.md. Tail-only graph (lm_head + argmax) is still active.
rocprofv3 PMC counters hang on gfx1201: hardware perf-counter capture via rocprofv3 --pmc is unstable on gfx1201. Use RUST_LOG=trace + ROCMFORGE_PROFILE_PREFILL_OPS=1 for profiling.
WMMA attention head_dim == 128 only: Qwen2.5, LLaMA-2, Mistral are covered. Other head dimensions fall back to the scalar attention kernel.
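
The cost of not persisting the KV cache between chat turns can be made concrete with a small sketch. The function names are hypothetical, and each entry in the input is assumed to be the number of tokens a turn adds to the conversation (user message plus the previous reply); the sketch only counts tokens and does not model kernel timings.

```rust
/// Tokens that must be prefilled on each turn when the whole history is
/// re-processed from scratch: the running total of all tokens so far.
pub fn prefill_tokens_per_turn(new_tokens_per_turn: &[usize]) -> Vec<usize> {
    let mut history = 0usize;
    new_tokens_per_turn
        .iter()
        .map(|&n| {
            history += n;
            history
        })
        .collect()
}

/// Total prefill work over the conversation without a persistent KV cache.
/// With a persistent cache it would be the plain sum of the inputs instead.
pub fn total_prefill_tokens(new_tokens_per_turn: &[usize]) -> usize {
    prefill_tokens_per_turn(new_tokens_per_turn).iter().sum()
}
```

For turns that add 40, 30, and 50 tokens, the per-turn prefill sizes are 40, 70, and 120, so the third turn already prefills more than twice what a cached implementation would.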

Documentation

INSTALL.md — installation (Arch tested, Ubuntu / Fedora notes)
docs/cli-guide.md — CLI reference (chat + one-shot, slash commands, env flags)
CHANGELOG.md — release notes and optimisation history
benches/results/phase4_final_analysis.md — Phase 4 final benchmark analysis
docs/architecture_notes.md — RDNA 4 memory pipelining and WMMA details
docs/spec_decode_milestone_summary.md — speculative decoding investigation

License

MIT. See LICENSE.
