Environment
- Board: Radxa Orion O6
- SoC: CIX P1
- NPU: Zhouyi X2, target X2_1204MP3, 3 cores
- Driver: AIPU KMD v6.1.0 (zhouyi-v3 as reported by dmesg)
- SDK: cixbuild 6.1.3753, cix-noe-umd 3.1.2
- cix-npu-onnxruntime 1.2.0 (onnxruntime-zhouyi 1.22.0)
Summary
Every FullyConnected/MatMul workload compiled with cixbuild produces soms_scheduler not found! warnings at inference time, and all computation appears to run on a single core. Measured throughput for a representative LLM-scale MatMul ([64,4096]×[4096,4096], INT8) is ~0.82 TOPS against an expected ~10 TOPS per core.
Reproduction
ONNX export (PyTorch):
import torch
import torch.nn as nn

class MatMul(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4096, 4096, bias=False)

    def forward(self, x):
        return self.fc(x)

model = MatMul().eval()
x = torch.randn(64, 4096)
torch.onnx.export(model, x, "matmul_M64_K4096_N4096.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)
cixbuild config:
[Common]
mode = build
dump_flags = True
[Parser]
model_name = matmul_M64_K4096_N4096
model_type = onnx
input_model = /workspace/matmul_M64_K4096_N4096.onnx
input = input
input_shape = [64,4096]
output = output
output_dir = /workspace/out
[Optimizer]
dataset = numpydataset
calibration_data = /workspace/calib/matmul_calib.npy
calibration_batch_size = 1
metric_batch_size = 1
cast_dtypes_for_lib = True
calibration_strategy_for_activation = extrema & <[Convolution]:mean>
quantize_method_for_weight = per_channel_symmetric_full_range
quantize_method_for_activation = per_tensor_asymmetric
activation_bits = 8
weight_bits = 8
bias_bits = 32
lut_items_in_bits = 8
output_dir = /workspace/out
[GBuilder]
target = X2_1204MP3
outputs = matmul_M64_K4096_N4096_int8.cix
tiling = fps
Observed behaviour
At inference time, every FullyConnected layer produces:
soms_scheduler not found!
C++ benchmark results (100 runs, shape [64,4096]×[4096,4096]):
| Precision | Latency | Measured TOPS |
|---|---|---|
| INT16 | 4.81 ms | 0.45 TOPS |
| INT8 | 2.63 ms | 0.82 TOPS |
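For reference, the TOPS figures in the table can be reproduced from the measured latencies by counting each multiply-accumulate as two operations. The following is a minimal sketch, not the benchmark code itself; the helper names `matmul_ops` and `measured_tops` are hypothetical, and the 10 TOPS per-core value is simply the rated figure quoted in this report.

```python
def matmul_ops(m: int, k: int, n: int) -> int:
    """Operation count for [m,k] x [k,n], counting one MAC as 2 ops."""
    return 2 * m * k * n


def measured_tops(m: int, k: int, n: int, latency_s: float) -> float:
    """Achieved throughput in TOPS (1e12 ops/s) for one matmul of this shape."""
    return matmul_ops(m, k, n) / latency_s / 1e12


# Latencies taken from the benchmark table (M=64, K=N=4096).
RATED_TOPS_PER_CORE = 10.0  # rated figure quoted in this report (assumption)
for precision, latency_ms in (("INT16", 4.81), ("INT8", 2.63)):
    tops = measured_tops(64, 4096, 4096, latency_ms * 1e-3)
    share = tops / RATED_TOPS_PER_CORE
    print(f"{precision}: {tops:.2f} TOPS, {share:.1%} of one core's rated peak")
```

The same helper can be pointed at other shapes or latencies to compare runs on a common basis.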
Scaling behaviour suggests fixed dispatch overhead rather than compute saturation: smaller N gives lower TOPS (0.41 TOPS at N=512), which is consistent with single-core execution in which scheduling overhead dominates.
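The fixed-overhead reading above can be made concrete with a rough two-point fit of latency = t0 + ops / rate. This is only a sketch under stated assumptions: M=64 and K=4096 are assumed to be held fixed for the N=512 measurement, the N=512 latency is back-computed from the reported 0.41 TOPS, and both inputs are rounded figures. The function name `fit_overhead` is hypothetical.

```python
def fit_overhead(points):
    """Two-point fit of latency = t0 + ops / rate.

    points: two (ops, latency_s) pairs. Returns (t0_s, rate_ops_per_s).
    """
    (ops_a, lat_a), (ops_b, lat_b) = points
    seconds_per_op = (lat_b - lat_a) / (ops_b - ops_a)
    t0 = lat_a - ops_a * seconds_per_op
    return t0, 1.0 / seconds_per_op


M, K = 64, 4096  # assumed fixed for both measurements

ops_large = 2 * M * K * 4096
lat_large = 2.63e-3  # measured INT8 latency from the table

ops_small = 2 * M * K * 512
lat_small = ops_small / 0.41e12  # latency implied by the reported 0.41 TOPS

t0, rate = fit_overhead([(ops_small, lat_small), (ops_large, lat_large)])
print(f"fixed overhead ~ {t0 * 1e3:.2f} ms, asymptotic rate ~ {rate / 1e12:.2f} TOPS")
```

With these inputs the fit gives roughly 0.37 ms of fixed overhead per inference and an asymptotic rate a little under 1 TOPS. Two points cannot validate the linear model, so this is indicative only.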
Single-core performance also appears well below what might be expected. At 0.82 TOPS for INT8, throughput is roughly 8% of the rated ~10 TOPS per core. Some overhead is expected for a workload of this shape, but the gap seems larger than layout conversion alone would account for. One possible cause is the FC→Convolution mapping: the compiler appears to remap FullyConnected to Convolution internally, which likely introduces NCHWC32 layout conversion overhead and may not be an efficient path for this operation. We are not certain whether this is expected behaviour or a separate issue, but include it in case it is useful.
Investigation
The compiled IR for a single FullyConnected layer appears clean and correct, so the issue does not seem to originate in the Parser or Optimizer stages. GBuilder internally maps FullyConnected → Convolution; the result compiles and runs, but appears to use only one of the three available cores.
The Zhouyi ORT EP (cix-npu-onnxruntime) was also tested. It successfully assigns the Gemm node to ZhouyiExecutionProvider (confirmed via verbose logging), but produces only ~0.017 TOPS, which is consistent with the same underlying compiler limitation rather than CPU fallback.
Inspection of the job creation API (aipu_create_job_cfg_t) shows that multi-core binding via bind_core_ids is documented as ">v3 only", which we read as applying only to architectures newer than v3. dmesg reports zhouyi-v3, which may explain why multi-core dispatch is unavailable for this workload, though it is unclear to us whether this is a fundamental hardware constraint or a software gate that could be lifted.
Expected behaviour
FullyConnected/MatMul workloads should utilise available cores, with TOPS scaling accordingly toward the rated 10 TOPS per core.
Question for CIX
Is multi-core GEMM scheduling (soms_scheduler) planned for the v3 driver/SDK, or is it gated on v4+ hardware? If it is a software limitation on the current SDK, is there a workaround or timeline for a fix?
This capability would be significant for transformer/LLM inference workloads, which are dominated by MatMul operations. As things stand, CPU inference via llama.cpp outperforms the NPU for these workloads on the Orion O6.