
Conversation


taimur-10x commented Dec 5, 2025

Summary

This PR adds repacking and GEMM/GEMV kernels for floating-point types (FP16 and FP32) on RVV (with the zvfh extension).

Key Changes

  • Added RVV kernels for GEMM with tiling:
    • 7 x {16, 32, 64, 128} (selected based on VLEN)
  • Added RVV kernels for GEMV with tiling:
    • 1 x {16, 32, 64, 128} (selected based on VLEN)
  • Added scalar repacking functions that support arbitrary tile sizes.
  • Added generic scalar fallbacks for the GEMM/GEMV operations.
  • Refactored ggml_quantize_mat_t into ggml_repack_mat_t to provide a common interface for both quantization and floating-point repacking.
  • Added a template parameter NB_ROWS to select the number of rows interleaved during repacking; previously this was fixed at 4 (see the sketch below).
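
The sketch below illustrates what the NB_ROWS parameterization does; the names and signature are hypothetical and simplified to FP32, not the PR's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (names are not the PR's): interleave NB_ROWS rows of a
// row-major matrix in groups of INTER_SIZE elements, producing the layout that
// the tiled GEMM/GEMV kernels read linearly.
template <int64_t NB_ROWS, int64_t INTER_SIZE>
static void repack_rows_ref(const float * src, float * dst, int64_t nrow, int64_t ncol) {
    assert(nrow % NB_ROWS == 0 && ncol % INTER_SIZE == 0);
    for (int64_t r0 = 0; r0 < nrow; r0 += NB_ROWS) {         // one tile of rows
        for (int64_t c0 = 0; c0 < ncol; c0 += INTER_SIZE) {  // one interleave group
            for (int64_t r = 0; r < NB_ROWS; ++r) {          // rows within the tile
                for (int64_t c = 0; c < INTER_SIZE; ++c) {
                    *dst++ = src[(r0 + r) * ncol + c0 + c];
                }
            }
        }
    }
}
```

With NB_ROWS = 4 this reproduces the previously fixed behavior; per the tile sizes described below, the RVV floating-point path interleaves 7 activation rows with an interleave size of 1.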

Tile Sizes

The repack operation interleaves N rows of activations and M columns of weights, each with an interleave size of K.

N x K is fixed at 7 x 1. This uses 7 accumulators at LMUL=4 (7 x 4 = 28 vector registers), each accumulating M results.

M is varied based on the available VLEN:

| VLEN (bits) | Tile Size (N x M x K) |
| --- | --- |
| 128 | 7 x 16 x 1 |
| 256 | 7 x 32 x 1 |
| 512 | 7 x 64 x 1 |
| 1024 | 7 x 128 x 1 |

M is the maximum number of values that can be loaded into a single register group (LMUL=2 for FP16, LMUL=4 for FP32).
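
As a scalar reference for this tiling (a sketch assuming the packed layouts described above, not the PR's actual fallback code), one 7 x M output tile can be computed as follows; each inner loop over m corresponds to one fused multiply-accumulate on an LMUL=4 accumulator group in the RVV kernel, with the activation broadcast as the scalar operand.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not the PR's code) for one 7 x M output tile.
// A_packed holds 7 activation rows interleaved with K-interleave = 1 (7 values
// per k); B_packed holds M weight columns interleaved the same way (M values
// per k).
static void gemm_7xM_ref(const float * A_packed, const float * B_packed, float * C,
                         std::size_t K, std::size_t M, std::size_t ldc) {
    constexpr std::size_t NR = 7;               // activation rows per tile
    std::vector<float> acc(NR * M, 0.0f);       // stands in for the 7 accumulator groups
    for (std::size_t k = 0; k < K; ++k) {
        const float * a = A_packed + k * NR;    // 7 interleaved activations for this k
        const float * b = B_packed + k * M;     // M interleaved weights for this k
        for (std::size_t r = 0; r < NR; ++r) {
            for (std::size_t m = 0; m < M; ++m) {
                acc[r * M + m] += a[r] * b[m];  // vector FMA with broadcast a[r] in the RVV kernel
            }
        }
    }
    for (std::size_t r = 0; r < NR; ++r) {
        for (std::size_t m = 0; m < M; ++m) {
            C[r * ldc + m] = acc[r * M + m];    // write back the finished tile
        }
    }
}
```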

Testing

Kernels were functionally tested on QEMU at VLENs of 128, 256, 512, and 1024 bits across a range of input sizes.

Benchmarking Results

End-to-end benchmarking on a Banana Pi BPI-F3 (VLEN=256)

Prefill / Prompt Processing (GEMM)

Throughput in tokens per second:

| Model | Prompt Size | Repack GEMM (7x32) | Vec Dot |
| --- | --- | --- | --- |
| Tinyllama F16 1.1B | 28 | 24.72 | 8.31 |
| Tinyllama F16 1.1B | 32 | 16.72 | 8.42 |
| Tinyllama F16 1.1B | 64 | 22.55 | 8.57 |
| Tinyllama F16 1.1B | 128 | 22.78 | 8.78 |
| Tinyllama F16 1.1B | 256 | 21.82 | 8.57 |
| Tinyllama F16 1.1B | 512 | 21.81 | 8.68 |

Result: ~2x-3x speedup over vec_dot

Decode (GEMV)

Throughput in tokens per second:

| Model | Decode Size (Prompt = 32) | Repack GEMV (1x32) | Vec Dot |
| --- | --- | --- | --- |
| Tinyllama F16 1.1B | 10 | 3.37 | 3.11 |
| Tinyllama F16 1.1B | 16 | 3.29 | 3.45 |
| Tinyllama F16 1.1B | 32 | 3.12 | 3.25 |
| Tinyllama F16 1.1B | 64 | 3.23 | 3.27 |
| Tinyllama F16 1.1B | 100 | 3.04 | 3.15 |
| Tinyllama F16 1.1B | 128 | 3.09 | 3.20 |
| Tinyllama F16 1.1B | 256 | 3.15 | 3.19 |

Result: No noticeable improvement, as decode remains memory-bound.

Additional Notes

  • The current fallback model requires every architecture to provide a scalar fallback for each implementation. This clutters arch-fallback.h, since the 7xMx1 tiling is very RVV-specific and should not be used by other architectures.
  • GEMM reaches peak performance when the prompt length is a multiple of 7 (for example, prompt=28). Leftover rows currently fall back to GEMV, which hurts performance (see the sketch after this list). Ideally, dedicated Nx32 leftover kernels would handle each case of 2-6 remaining tokens.
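
A minimal sketch of that row split, with hypothetical callback names rather than the PR's actual dispatch code:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical sketch (not the PR's dispatch code): full groups of 7 activation
// rows go through the 7xM GEMM kernel; the 1-6 leftover rows currently fall
// back to the 1xM GEMV kernel, one row at a time.
static void dispatch_rows(std::size_t nrows,
                          const std::function<void(std::size_t)> & gemm_7_rows,
                          const std::function<void(std::size_t)> & gemv_1_row) {
    const std::size_t full = nrows - nrows % 7;
    for (std::size_t r = 0; r < full; r += 7) {
        gemm_7_rows(r);   // one 7-row tile starting at row r
    }
    for (std::size_t r = full; r < nrows; ++r) {
        gemv_1_row(r);    // leftover rows handled individually
    }
}
```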

taimur-10x requested a review from ggerganov as a code owner December 5, 2025 11:22
taimur-10x marked this pull request as draft December 5, 2025 11:24
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 5, 2025