soms_scheduler not found! on FullyConnected/MatMul workloads — GEMM appears to run single-core only on Zhouyi X2 (X2_1204MP3)

**Environment**
- Board: Radxa Orion O6
- SoC: CIX P1
- NPU: Zhouyi X2, target `X2_1204MP3`, 3 cores
- Driver: AIPU KMD v6.1.0 (`zhouyi-v3` as reported by dmesg)
- SDK: cixbuild 6.1.3753, cix-noe-umd 3.1.2
- Runtime: cix-npu-onnxruntime 1.2.0 (onnxruntime-zhouyi 1.22.0)
---
 
### Summary
 
Every FullyConnected/MatMul workload compiled with cixbuild emits `soms_scheduler not found!` warnings at inference time, and all computation appears to run on a single core. A representative LLM-scale MatMul ([64,4096]×[4096,4096], INT8) reaches ~0.82 TOPS against a rated ~10 TOPS per core.
 
---
 
### Reproduction
 
ONNX export (PyTorch):
```python
import torch
import torch.nn as nn
 
class MatMul(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4096, 4096, bias=False)
 
    def forward(self, x):
        return self.fc(x)
 
model = MatMul().eval()
x = torch.randn(64, 4096)
torch.onnx.export(model, x, "matmul_M64_K4096_N4096.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)
```
 
cixbuild config:
```ini
[Common]
mode = build
dump_flags = True
 
[Parser]
model_name = matmul_M64_K4096_N4096
model_type = onnx
input_model = /workspace/matmul_M64_K4096_N4096.onnx
input = input
input_shape = [64,4096]
output = output
output_dir = /workspace/out
 
[Optimizer]
dataset = numpydataset
calibration_data = /workspace/calib/matmul_calib.npy
calibration_batch_size = 1
metric_batch_size = 1
cast_dtypes_for_lib = True
calibration_strategy_for_activation = extrema & <[Convolution]:mean>
quantize_method_for_weight = per_channel_symmetric_full_range
quantize_method_for_activation = per_tensor_asymmetric
activation_bits = 8
weight_bits = 8
bias_bits = 32
lut_items_in_bits = 8
output_dir = /workspace/out
 
[GBuilder]
target = X2_1204MP3
outputs = matmul_M64_K4096_N4096_int8.cix
tiling = fps
```
 
---
 
### Observed behaviour
 
At inference time, every FullyConnected layer produces:
```
soms_scheduler not found!
```
 
C++ benchmark results (100 runs, shape [64,4096]×[4096,4096]):
 
| Precision | Latency (ms) | Measured TOPS |
|-----------|--------------|---------------|
| INT16 | 4.81 | 0.45 |
| INT8 | 2.63 | 0.82 |
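
For reference, the TOPS figures in the table can be reproduced from the latencies. This is a minimal sketch that assumes one multiply-accumulate counts as 2 operations, so an [M,K]×[K,N] MatMul costs 2·M·K·N ops. The helper name `matmul_tops` is ours and is not part of any CIX tool.

```python
def matmul_tops(m: int, k: int, n: int, latency_s: float) -> float:
    """Effective TOPS for an [m,k] x [k,n] MatMul.

    Assumption: one multiply-accumulate is counted as 2 operations.
    """
    ops = 2 * m * k * n
    return ops / latency_s / 1e12


RATED_TOPS_PER_CORE = 10.0  # rated per-core figure quoted in this issue

for label, latency_ms in (("INT16", 4.81), ("INT8", 2.63)):
    tops = matmul_tops(64, 4096, 4096, latency_ms * 1e-3)
    share = 100 * tops / RATED_TOPS_PER_CORE
    print(f"{label}: {tops:.2f} TOPS ({share:.1f}% of one core's rated peak)")
```

The same helper can be applied to other shapes, provided the op count is defined the same way.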
 
Smaller N gives lower TOPS (0.41 TOPS at N=512), which suggests a fixed dispatch overhead rather than compute saturation and is consistent with single-core execution with scheduling overhead dominating.
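
The fixed-overhead reading can be made concrete with a two-point fit of `latency = t_fixed + ops / rate`. This is only a rough sketch under assumptions we have not verified: that the 0.41 TOPS figure at N=512 is INT8 with the same M=64 and K=4096, and that a single linear model holds across both sizes. The N=512 latency is back-computed from the rounded TOPS value, so the result is approximate. `fit_fixed_overhead` is a hypothetical helper.

```python
def fit_fixed_overhead(ops_a, t_a, ops_b, t_b):
    """Solve t = t_fixed + ops / rate from two (ops, latency) points."""
    rate = (ops_a - ops_b) / (t_a - t_b)  # ops/s once fixed cost is removed
    t_fixed = t_a - ops_a / rate          # latency that does not scale with ops
    return t_fixed, rate


M, K = 64, 4096
ops_large = 2 * M * K * 4096  # N=4096, counting a MAC as 2 ops
ops_small = 2 * M * K * 512   # N=512

t_large = 2.63e-3              # measured INT8 latency for N=4096
# ASSUMPTION: N=512 latency back-computed from the reported 0.41 TOPS,
# taken to be INT8 at the same M and K.
t_small = ops_small / 0.41e12

t_fixed, rate = fit_fixed_overhead(ops_large, t_large, ops_small, t_small)
print(f"fixed overhead ~ {t_fixed * 1e3:.2f} ms, sustained rate ~ {rate / 1e12:.2f} TOPS")
```

If those assumptions hold, the fit gives roughly 0.37 ms of fixed cost per dispatch and a sustained rate of about 0.95 TOPS. A proper sweep over several N values would be needed to confirm either number.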
 
Single-core performance also appears well below what might be expected. At 0.82 TOPS for INT8, the NPU reaches roughly 8% of the rated ~10 TOPS per core. Some overhead is expected for a workload of this shape, but the gap seems larger than layout conversion overhead alone would explain. A possible contributor is the FC→Convolution mapping: the compiler appears to remap FullyConnected to Convolution internally, which likely adds NCHWC32 layout conversion and may also be an inefficient compute path for this operation. We are not certain whether this is expected behaviour or a separate issue, but include it in case it is useful.
 
---
 
### Investigation
 
The compiled IR for a single FullyConnected layer appears clean and correct — the issue does not seem to originate in the Parser or Optimizer stages. GBuilder internally maps FullyConnected → Convolution, which compiles and runs, but appears to use only one of the three available cores.
 
The Zhouyi ORT EP (`cix-npu-onnxruntime`) was also tested. It assigns the Gemm node to `ZhouyiExecutionProvider` (confirmed via verbose logging), so the node is not falling back to the CPU, yet throughput is only ~0.017 TOPS. This is consistent with the same underlying compiler limitation.
 
Inspection of the job creation API (`aipu_create_job_cfg_t`) shows that `bind_core_ids` multi-core binding is documented as ">v3 only". `dmesg` reports `zhouyi-v3`, which may explain why multi-core dispatch is unavailable for this workload — though it is unclear to us whether this is a fundamental hardware constraint or a software gate that could be lifted.
 
---
 
### Expected behaviour
 
FullyConnected/MatMul workloads should utilise available cores, with TOPS scaling accordingly toward the rated 10 TOPS per core.
 
---
 
### Question for CIX
 
Is multi-core GEMM scheduling (`soms_scheduler`) planned for the v3 driver/SDK, or is it gated on v4+ hardware? If it is a software limitation on the current SDK, is there a workaround or timeline for a fix?
 
This capability would be significant for transformer/LLM inference workloads, which are dominated by MatMul operations. As things stand, CPU inference via llama.cpp outperforms the NPU for these workloads on the Orion O6.
 
