Commit 0cd4f47

kleidiai : support for concurrent sme and neon kernel execution (ggml-org#20070)
1 parent af237f3 commit 0cd4f47

3 files changed: 975 additions & 256 deletions

docs/build.md

Lines changed: 7 additions & 1 deletion
````diff
@@ -599,7 +599,13 @@ If KleidiAI is enabled, the output will contain a line similar to:
 ```
 load_tensors: CPU_KLEIDIAI model buffer size = 3474.00 MiB
 ```
-KleidiAI's microkernels implement optimized tensor operations using Arm CPU features such as dotprod, int8mm and SME. llama.cpp selects the most efficient kernel based on runtime CPU feature detection. However, on platforms that support SME, you must manually enable SME microkernels by setting the environment variable `GGML_KLEIDIAI_SME=1`.
+KleidiAI’s microkernels implement optimized tensor operations using Arm CPU features such as dotprod, int8mm, SVE, and SME. Llama.cpp selects the most efficient kernels at runtime based on detected CPU capabilities.
+
+On CPUs that support SME, SME microkernels are enabled automatically using runtime detection.
+The environment variable GGML_KLEIDIAI_SME can be used to control SME behavior:
+- Not set: enable SME automatically if supported and detected.
+- 0: disable SME.
+- <n> > 0: enable SME and assume <n> available SME units (override auto detection).
+If SME is not supported by the CPU, SME microkernels are always disabled.
 
 Depending on your build target, other higher priority backends may be enabled by default. To ensure the CPU backend is used, you must disable the higher priority backends either at compile time, e.g. -DGGML_METAL=OFF, or during run-time using the command line option `--device none`.
````

ggml/src/ggml-cpu/kleidiai/kernels.cpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -520,7 +520,7 @@ static ggml_kleidiai_kernels gemm_gemv_kernels[] = {
     /* .packed_stride_ex = */ &rhs_stride_fn4<kai_get_rhs_packed_stride_rhs_pack_nxk_qsi4c32pscalef16_qsu4c32s16s0>,
     /* .pack_func_ex = */ &rhs_pack_fn12<kai_run_rhs_pack_nxk_qsi4c32pscalef16_qsu4c32s16s0>,
     },
-    /* .required_cpu = */ CPU_FEATURE_DOTPROD | CPU_FEATURE_I8MM,
+    /* .required_cpu = */ CPU_FEATURE_I8MM,
     /* .lhs_type = */ GGML_TYPE_F32,
     /* .rhs_type = */ GGML_TYPE_Q4_0,
     /* .op_type = */ GGML_TYPE_F32,
@@ -631,7 +631,7 @@ static ggml_kleidiai_kernels gemm_gemv_kernels[] = {
     /* .packed_stride_ex = */ &rhs_stride_fn4<kai_get_rhs_packed_stride_rhs_pack_nxk_qsi4c32pscalef16_qsu4c32s16s0>,
     /* .pack_func_ex = */ &rhs_pack_fn12<kai_run_rhs_pack_nxk_qsi4c32pscalef16_qsu4c32s16s0>,
     },
-    /* .required_cpu = */ CPU_FEATURE_DOTPROD | CPU_FEATURE_I8MM,
+    /* .required_cpu = */ CPU_FEATURE_I8MM,
     /* .lhs_type = */ GGML_TYPE_F32,
     /* .rhs_type = */ GGML_TYPE_Q4_0,
     /* .op_type = */ GGML_TYPE_F32,
@@ -801,7 +801,7 @@ static ggml_kleidiai_kernels gemm_gemv_kernels_q8[] = {
     /* .packed_stride_ex = */ &rhs_stride_fn4<kai_get_rhs_packed_stride_rhs_pack_nxk_qsi8cxp_qsi8cx_neon>,
     /* .pack_func_ex = */ &rhs_pack_scale_fn12<kai_run_rhs_pack_nxk_qsi8cxp_qsi8cx_neon>,
     },
-    /* .required_cpu = */ CPU_FEATURE_DOTPROD | CPU_FEATURE_I8MM,
+    /* .required_cpu = */ CPU_FEATURE_I8MM,
     /* .lhs_type = */ GGML_TYPE_F32,
     /* .rhs_type = */ GGML_TYPE_Q8_0,
     /* .op_type = */ GGML_TYPE_F32,
```
