
soms_scheduler not found! on FullyConnected/MatMul workloads — GEMM appears to run single-core only on Zhouyi X2 (X2_1204MP3) #34

Description

@jimtendo

Environment

  • Board: Radxa Orion O6
  • SoC: CIX P1
  • NPU: Zhouyi X2, target X2_1204MP3, 3 cores
  • Driver: AIPU KMD v6.1.0 (zhouyi-v3 as reported by dmesg)
  • SDK: cixbuild 6.1.3753, cix-noe-umd 3.1.2
  • cix-npu-onnxruntime 1.2.0 (onnxruntime-zhouyi 1.22.0)

Summary

Every FullyConnected/MatMul workload compiled with cixbuild produces "soms_scheduler not found!" warnings at inference time, and all computation appears to run on a single core. Measured throughput for a representative LLM-scale MatMul ([64,4096]×[4096,4096], INT8) is ~0.82 TOPS, against a rated ~10 TOPS per core.


Reproduction

ONNX export (PyTorch):

import torch
import torch.nn as nn
 
class MatMul(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4096, 4096, bias=False)
 
    def forward(self, x):
        return self.fc(x)
 
model = MatMul().eval()
x = torch.randn(64, 4096)
torch.onnx.export(model, x, "matmul_M64_K4096_N4096.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)

cixbuild config:

[Common]
mode = build
dump_flags = True
 
[Parser]
model_name = matmul_M64_K4096_N4096
model_type = onnx
input_model = /workspace/matmul_M64_K4096_N4096.onnx
input = input
input_shape = [64,4096]
output = output
output_dir = /workspace/out
 
[Optimizer]
dataset = numpydataset
calibration_data = /workspace/calib/matmul_calib.npy
calibration_batch_size = 1
metric_batch_size = 1
cast_dtypes_for_lib = True
calibration_strategy_for_activation = extrema & <[Convolution]:mean>
quantize_method_for_weight = per_channel_symmetric_full_range
quantize_method_for_activation = per_tensor_asymmetric
activation_bits = 8
weight_bits = 8
bias_bits = 32
lut_items_in_bits = 8
output_dir = /workspace/out
 
[GBuilder]
target = X2_1204MP3
outputs = matmul_M64_K4096_N4096_int8.cix
tiling = fps
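
The N=512 comparison point mentioned under "Observed behaviour" needs the same config with only the model name and output file changed. A minimal sketch of generating those variants from the config above follows. The helper name `render_config` is hypothetical, and it assumes the matching ONNX files were exported with `nn.Linear(4096, N)` and named with the same `matmul_M64_K4096_N<N>` pattern.

```python
from string import Template

# The cixbuild config from this report, with only the model name parameterised.
CONFIG_TEMPLATE = Template("""\
[Common]
mode = build
dump_flags = True

[Parser]
model_name = $name
model_type = onnx
input_model = /workspace/$name.onnx
input = input
input_shape = [64,4096]
output = output
output_dir = /workspace/out

[Optimizer]
dataset = numpydataset
calibration_data = /workspace/calib/matmul_calib.npy
calibration_batch_size = 1
metric_batch_size = 1
cast_dtypes_for_lib = True
calibration_strategy_for_activation = extrema & <[Convolution]:mean>
quantize_method_for_weight = per_channel_symmetric_full_range
quantize_method_for_activation = per_tensor_asymmetric
activation_bits = 8
weight_bits = 8
bias_bits = 32
lut_items_in_bits = 8
output_dir = /workspace/out

[GBuilder]
target = X2_1204MP3
outputs = ${name}_int8.cix
tiling = fps
""")


def render_config(n: int) -> str:
    """Return the config text for a [64,4096] x [4096,n] MatMul (hypothetical helper)."""
    name = f"matmul_M64_K4096_N{n}"
    return CONFIG_TEMPLATE.substitute(name=name)


# Only the two N values quoted in this report are generated here.
configs = {n: render_config(n) for n in (512, 4096)}
print(configs[512])
```

The input shape and calibration file stay the same across variants because N only changes the weight matrix, not the activation tensor.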

Observed behaviour

At inference time, every FullyConnected layer produces:

soms_scheduler not found!

C++ benchmark results (100 runs, shape [64,4096]×[4096,4096]):

| Precision | Latency | Measured TOPS |
|-----------|---------|---------------|
| INT16     | 4.81 ms | 0.45          |
| INT8      | 2.63 ms | 0.82          |
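
For reference, the TOPS figures follow from the latencies if each multiply-accumulate is counted as two operations. The sketch below shows that arithmetic; the function name `matmul_tops` is ours, and the 2·M·K·N operation count is the usual convention rather than anything confirmed by CIX documentation.

```python
def matmul_tops(m: int, k: int, n: int, latency_ms: float) -> float:
    """Effective TOPS for an [m,k] x [k,n] MatMul, counting one MAC as 2 ops."""
    ops = 2 * m * k * n
    return ops / (latency_ms * 1e-3) / 1e12


# Latencies taken from the benchmark results in this report.
for precision, latency_ms in [("INT16", 4.81), ("INT8", 2.63)]:
    tops = matmul_tops(64, 4096, 4096, latency_ms)
    print(f"{precision}: {tops:.2f} TOPS")
```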

Scaling behaviour suggests a fixed per-dispatch overhead rather than compute saturation: smaller N gives lower TOPS (0.41 TOPS at N=512). This is consistent with single-core execution in which scheduling overhead dominates smaller workloads.
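
One way to make the fixed-overhead reading concrete is a two-point fit of latency = overhead + ops / rate. The sketch below is only illustrative. It assumes the N=512 figure is also INT8 with M=64 and K=4096, and it infers the N=512 latency from the quoted 0.41 TOPS rather than from a directly reported measurement. Two points determine the line exactly, so this cannot validate the model.

```python
def fit_overhead(points):
    """Fit latency = overhead + ops / rate through two (ops, latency_s) points.

    Returns (overhead_s, sustained_ops_per_s).
    """
    (ops_a, t_a), (ops_b, t_b) = points
    seconds_per_op = (t_b - t_a) / (ops_b - ops_a)
    overhead_s = t_a - ops_a * seconds_per_op
    return overhead_s, 1.0 / seconds_per_op


M, K = 64, 4096
ops_small = 2 * M * K * 512    # N=512
ops_large = 2 * M * K * 4096   # N=4096

# ASSUMPTION: N=512 latency is back-computed from the quoted 0.41 TOPS.
t_small = ops_small / 0.41e12
t_large = 2.63e-3              # measured INT8 latency for N=4096

overhead_s, rate = fit_overhead([(ops_small, t_small), (ops_large, t_large)])
print(f"fixed overhead ~ {overhead_s * 1e3:.2f} ms")
print(f"sustained rate ~ {rate / 1e12:.2f} TOPS")
```

Under these assumptions the fit gives a fixed cost of roughly 0.37 ms per dispatch and a sustained rate of roughly 0.95 TOPS. If that holds, overhead explains the drop at small N, but the ceiling at large N would still be far below the rated figure.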

Even the single-core performance appears well below what might be expected. At 0.82 TOPS for INT8, it reaches roughly 8% of the rated ~10 TOPS per core. Some overhead is expected for a workload of this shape, but the gap seems larger than layout conversion overhead alone would account for. It may be a consequence of the FC→Convolution mapping: the compiler appears to remap FullyConnected to Convolution internally, which likely introduces NCHWC32 layout conversion overhead and may not be an efficient path for this operation. We are not certain whether this is expected behaviour or a separate issue, but include it in case it is useful.


Investigation

The compiled IR for a single FullyConnected layer appears clean and correct — the issue does not seem to originate in the Parser or Optimizer stages. GBuilder internally maps FullyConnected → Convolution, which compiles and runs, but appears to use only one of the three available cores.

The Zhouyi ORT EP (cix-npu-onnxruntime) was also tested. Verbose logging confirms that it assigns the Gemm node to ZhouyiExecutionProvider, so the work is not falling back to CPU, yet it produces only ~0.017 TOPS. This is consistent with the same underlying compiler limitation.

Inspection of the job creation API (aipu_create_job_cfg_t) shows that bind_core_ids multi-core binding is documented as ">v3 only". dmesg reports zhouyi-v3, which may explain why multi-core dispatch is unavailable for this workload — though it is unclear to us whether this is a fundamental hardware constraint or a software gate that could be lifted.


Expected behaviour

FullyConnected/MatMul workloads should utilise available cores, with TOPS scaling accordingly toward the rated 10 TOPS per core.
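
To put the expectation in latency terms, the sketch below computes the ideal time for the benchmark shape at the rated throughput. It assumes perfect linear scaling across cores and no dispatch or layout overhead, which no real workload will achieve. The 10 TOPS per core figure is the rating quoted in this report, and `ideal_latency_us` is a hypothetical helper name.

```python
RATED_TOPS_PER_CORE = 10.0  # rating quoted in this report


def ideal_latency_us(m: int, k: int, n: int, cores: int) -> float:
    """Lower-bound latency in microseconds, assuming perfect scaling and no overhead."""
    ops = 2 * m * k * n
    return ops / (cores * RATED_TOPS_PER_CORE * 1e12) * 1e6


measured_us = 2.63e3  # measured INT8 latency from this report
for cores in (1, 3):
    ideal = ideal_latency_us(64, 4096, 4096, cores)
    print(f"{cores} core(s): ideal {ideal:.1f} us, "
          f"measured is {measured_us / ideal:.0f}x slower")
```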


Question for CIX

Is multi-core GEMM scheduling (soms_scheduler) planned for the v3 driver/SDK, or is it gated on v4+ hardware? If it is a software limitation on the current SDK, is there a workaround or timeline for a fix?

This capability would be significant for transformer/LLM inference workloads, which are dominated by MatMul operations. As things stand, CPU inference via llama.cpp outperforms the NPU for these workloads on the Orion O6.
