This document covers ./test_ci.sh and test_stages.json — the declarative
CI driver for this fork — and walks through adding a new operator test
end-to-end.

    # Local host (CPU only):
    ./test_ci.sh local
    # Android device (full matrix):
    ./test_ci.sh android <serial>
    # Run a subset; the filter is the optional third positional argument:
    ./test_ci.sh android <serial> cpu            # CPU + lowmem + llm
    ./test_ci.sh android <serial> opencl-image   # only OpenCL IMAGE stages
    ./test_ci.sh android <serial> vulkan
    ./test_ci.sh android <serial> lowmem
    ./test_ci.sh android <serial> android-ci     # bench + smoke + llm only

The full matrix is described in `test_stages.json`. Every parameter (forward
type, precision, gpuMode, thread count, tag, memory mode, dynamic-quant
option, KleidiAI flag, per-stage skip list, smoke model list, benchmark
arguments) lives there. Editing the JSON is the supported way to add, drop,
or retune a stage; no shell edits are needed for the common cases.

This infrastructure is also exposed as the `test-ci` Agent Skill
(`skills/test-ci/SKILL.md`) so AI coding agents can discover and drive it.

The llm stage runs llm_demo against an MNN-format LLM model. Provisioning
is lazy — the model download (or the LLM_MODEL_DIR check) is deferred
until the llm stage actually runs, and a provisioning failure skips only
that stage. The unit / smoke / bench stages therefore run fine with no network.

| Env var | Meaning |
|---|---|
| `LLM_MODEL_DIR` | Path to an existing on-disk MNN-format model. When set, the directory is used as-is and nothing is downloaded. Defaults to `models/<repo-basename>/`. |
| `LLM_MODEL_REPO` | Model repo id. Default `taobao-mnn/Qwen2.5-0.5B-Instruct-MNN` (also set in `test_stages.json` → `llm.model_repo`). |
| `LLM_MODEL_SOURCE` | Download source when `LLM_MODEL_DIR` is unset: `huggingface` (default) or `modelscope`. |
| `LLM_MODEL_URL_BASE` | Override the resolve URL prefix outright (wins over `LLM_MODEL_SOURCE`). |

    # Already have the model on disk → zero download attempts:
    LLM_MODEL_DIR=/path/to/Qwen2.5-0.5B-Instruct-MNN ./test_ci.sh local
    # huggingface.co unreachable (e.g. mainland China) → fetch from ModelScope:
    LLM_MODEL_SOURCE=modelscope ./test_ci.sh android <serial>

HuggingFace serves files under `resolve/main`, ModelScope under
`resolve/master`. The MNN team mirrors its `taobao-mnn/*` HuggingFace models
under the `MNN/*` org on ModelScope, so the built-in default's org is remapped
automatically for ModelScope; an explicitly-set `LLM_MODEL_REPO` is used
verbatim.
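
The source selection described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the code in `test_ci.sh`: the function name `llm_url_base` is hypothetical, the host prefixes `https://huggingface.co/` and `https://modelscope.cn/models/` are assumed, and `LLM_MODEL_URL_BASE` is assumed to replace the whole prefix that file names get appended to.

```shell
# Hypothetical helper: prints the URL prefix that model file names are appended to.
DEFAULT_REPO="taobao-mnn/Qwen2.5-0.5B-Instruct-MNN"

llm_url_base() {
    # An explicit prefix wins over LLM_MODEL_SOURCE.
    if [ -n "${LLM_MODEL_URL_BASE:-}" ]; then
        printf '%s\n' "$LLM_MODEL_URL_BASE"
        return 0
    fi
    repo=${LLM_MODEL_REPO:-}
    case "${LLM_MODEL_SOURCE:-huggingface}" in
        huggingface)
            # Assumed host; files live under resolve/main.
            printf 'https://huggingface.co/%s/resolve/main\n' "${repo:-$DEFAULT_REPO}"
            ;;
        modelscope)
            # Only the built-in default is remapped taobao-mnn/* -> MNN/*;
            # an explicitly set LLM_MODEL_REPO is used verbatim.
            if [ -z "$repo" ]; then
                repo="MNN/${DEFAULT_REPO#taobao-mnn/}"
            fi
            # Assumed host; files live under resolve/master.
            printf 'https://modelscope.cn/models/%s/resolve/master\n' "$repo"
            ;;
        *)
            echo "unknown LLM_MODEL_SOURCE: ${LLM_MODEL_SOURCE}" >&2
            return 1
            ;;
    esac
}
```

With no variables set, the function yields the HuggingFace prefix for the default repo; with `LLM_MODEL_SOURCE=modelscope` and no explicit repo, the org part of the default becomes `MNN`.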

    test_ci.sh                       ← bash driver (local + android modes)
    ├── test_stages.json             ← every stage's parameters live here
    └── test/                        ← C++ test framework
        ├── MNNTestSuite.{h,cpp}     ← test registry + MNN_TEST_SKIP support
        ├── main.cpp                 ← argv → BackendConfig + RuntimeHint
        └── op/                      ← per-operator tests, one file each

Flow in `android` mode:

1. Build for `arm64-v8a` (`build_64.sh` + a parallel `make`).
2. Push the artefacts in `ANDROID_BIN_LIST` to `/data/local/tmp/MNN`.
3. Push the LLM model under `models/<repo>/` and the smoke caffe sources.
4. Convert smoke caffe sources → `.mnn` on-device with the just-pushed `MNNConvert` (`tools/converter/libMNNConvertDeps.so` is pushed alongside).
5. Run stages declared in `test_stages.json` in order: `unit/* → lowmem/* → smokeA/* → smokeB/* → bench/* → llm/*`.
6. Print summary: `total / passed / failed / skipped` plus per-stage `PASS / FAIL / SKIP` lines.


| Field | Meaning |
|---|---|
| `name` | Stage label. Also the per-stage log filename (`/` characters flatten to `_`). |
| `filter` | Filter tag. Matched against the positional filter argument or the implicit `all`. |
| `comment` | Free-form note about why the stage exists / what it covers. |
| `binary` | (smoke/bench only) `run_test` (default), `v2basic`, `backendtest`, or `benchmark`. |
| `prefix` | First positional arg to `run_test.out`: a test-name prefix or the literal `all`. |
| `type` | Forward type. `0`=CPU, `3`=OpenCL, `7`=Vulkan. |
| `precision` | `0`=Normal, `1`=High, `2`=Low. (Per `BackendConfig::PrecisionMode`.) |
| `threadOrGpuMode` | CPU: thread count. GPU: bitmask of `MNN_GPU_TUNING_*` (`1`=NONE, `2`=HEAVY, `4`=WIDE) OR'd with `MNN_GPU_MEMORY_*` (`64`=BUFFER, `128`=IMAGE). e.g. `129` = `NONE \| IMAGE`. |
| `tag` | Free-form report tag forwarded to `run_test.out` (printed in `TEST_NAME_UNIT<tag>` lines). |
| `memory` | `0`=Normal, `1`=High, `2`=Low. Omit when not setting (some stages rely on the default). |
| `dynamicOption` | `RuntimeHint::dynamicQuantOption` (0..7). Omit when not setting. |
| `kleidiAi` | `argv[8]`: `1` enables KleidiAI on ARM. Omit when not setting. |
| `skip` | Array of exact test names to skip. Passed to `MNNTestSuite::run()` via the `MNN_TEST_SKIP` env var. Use this for known device/driver bugs where you don't want to lose coverage globally. |
| `args` | (smoke/bench only) Positional argv array. `{model}` and `{models_dir}` get per-iteration substitution. |
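
Two of the fields above involve a small transformation that is easy to get wrong by hand: composing a GPU `threadOrGpuMode` bitmask, and flattening a stage `name` into a log-file stem. A minimal sketch follows; the helper names `gpu_mode` and `stage_log_stem` are hypothetical and do not come from `test_ci.sh`, while the bit values are the ones listed in the table.

```shell
# Compose a threadOrGpuMode value from symbolic flag names (bit values per the table).
gpu_mode() {
    mode=0
    for flag in "$@"; do
        case "$flag" in
            NONE)   bit=1 ;;
            HEAVY)  bit=2 ;;
            WIDE)   bit=4 ;;
            BUFFER) bit=64 ;;
            IMAGE)  bit=128 ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
        mode=$(( mode | bit ))
    done
    echo "$mode"
}

# Flatten a stage name into a log-file stem: every "/" becomes "_".
stage_log_stem() {
    printf '%s\n' "$1" | tr '/' '_'
}

gpu_mode NONE IMAGE             # prints 129
gpu_mode NONE BUFFER            # prints 65
stage_log_stem unit/opencl/op   # prints unit_opencl_op
```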

| Tag | Meaning |
|---|---|
| `cpu` | Plain CPU stages (also covers `lowmem` and `llm` under the `cpu` filter). |
| `opencl-image` | OpenCL with `MNN_GPU_MEMORY_IMAGE`. |
| `opencl-buffer` | OpenCL with `MNN_GPU_MEMORY_BUFFER`. |
| `vulkan` | Vulkan backend. |
| `lowmem` | Low-memory configurations (`memory=2`). |
| `smoke-opencl` | Smoke A/B per model on OpenCL. |
| `smoke-vulkan` | Smoke A/B per model on Vulkan. |
| `llm` | LLM smoke test. |

The `opencl` filter is a shortcut for `opencl-image | opencl-buffer | smoke-opencl`.
`gpu` covers everything OpenCL + Vulkan.
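
The filter rules above amount to a small lookup from the requested filter to the set of stage tags it selects. The sketch below is one way to express that, not the driver's actual code: `stage_matches` is a hypothetical name, the exact tag set behind `gpu` is an assumption read off "everything OpenCL + Vulkan", and the `android-ci` filter is not modeled.

```shell
# Returns 0 when a stage tagged $2 should run under the requested filter $1.
stage_matches() {
    wanted=$1
    tag=$2
    case "$wanted" in
        all) return 0 ;;
        cpu)
            case "$tag" in cpu|lowmem|llm) return 0 ;; esac ;;
        opencl)
            case "$tag" in opencl-image|opencl-buffer|smoke-opencl) return 0 ;; esac ;;
        gpu)
            # Assumption: every OpenCL and Vulkan tag, including the smoke ones.
            case "$tag" in
                opencl-image|opencl-buffer|smoke-opencl|vulkan|smoke-vulkan) return 0 ;;
            esac ;;
        *)
            # Any other filter selects only the stages carrying that exact tag.
            [ "$wanted" = "$tag" ] && return 0 ;;
    esac
    return 1
}

if stage_matches opencl opencl-buffer; then echo "run"; else echo "skip"; fi   # prints run
if stage_matches cpu vulkan; then echo "run"; else echo "skip"; fi             # prints skip
```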

The `unit/cpu/*` stages are the plain-CPU sweep across the registered C++ tests.
Variants in `test_stages.json`:

- `unit/cpu/all`: single-thread, `Precision_Normal`, broadest sanity check.
- `unit/cpu/op-mt`: 4-thread, op-only (`prefix: "op"`). Catches threadpool races.
- `unit/cpu/op-fp16-conv`: `Precision_Low`, only convolution tests. Exercises the FP16 ARM82 path without the rest of the suite.
- `unit/cpu/op-fp16-col2im`, `unit/cpu/op-fp16-roi`: narrow FP16 sweeps for ops that are particularly precision-sensitive.

GPU unit stages run with `TUNING_NONE` because we want correctness, not perf:
`TUNING_WIDE` adds many seconds of per-kernel auto-tuning that is wasted on a
single-shot run. Bench stages flip back to `TUNING_WIDE` since perf is the
point there.

- `unit/opencl/op`: OpenCL with `MEMORY_IMAGE` (gpuMode `129`).
- `unit/opencl/op-buffer`: OpenCL with `MEMORY_BUFFER` (gpuMode `65`). Critical for catching regressions in BUFFER-only creators (e.g. Attention) that the IMAGE path masks via CPU fallback.
- `unit/vulkan/op`: Vulkan with `TUNING_NONE` (gpuMode `1`). Vulkan ignores `MEMORY_*` bits; image-vs-buffer is selected at build time via the `MNN_VULKAN_IMAGE` CMake option.

Each of these stages carries a `skip` list for device-specific upstream bugs
that aren't ours to fix in this fork. Each entry is documented in
`test_stages.json::_documentation.skip_rationale`.

The `lowmem/*` stages use `memory=2` (`Memory_Low`) plus various
precision × dynamicOption × thread combinations. They mostly exercise the
`op/lowMemory/*` and weight-i8/i4 quantised conv tests.

The smoke A stages load each public smoke model with `MNNV2Basic.out` and do
one forward pass. They catch model-load and shape-inference regressions but
do not validate numerics.

The smoke B stages run `backendTest.out` with the CPU oracle vs the named
backend; tolerance defaults to 0.05. They catch backend kernel regressions
that produce numerically wrong output.

The bench stages run `benchmark.out` over the same model set. This is just a
perf sanity check: there is no pass/fail on numbers, and a stage passes as
long as the binary doesn't crash and produces timing lines.

The llm stage runs `llm_demo` against the provisioned `config.json` + `prompt.txt`.

Adding a new operator test involves two pieces:

1. Write the C++ test under `test/op/`.
2. (Optional) Wire a dedicated `test_ci.sh` stage in `test_stages.json` so the new test runs in a specific config (precision / threads / backend / memory).

If you only do (1), the new test is automatically picked up by every existing
unit/* stage that matches its prefix (e.g. unit/cpu/all runs everything;
unit/cpu/op-mt runs anything starting with op; unit/opencl/op runs
anything starting with op on OpenCL; …). That's usually enough.
Each test is a class deriving from MNNTestCase registered with
MNNTestSuiteRegister. Minimal template:

    // test/op/MyOpTest.cpp
    //
    //  MyOpTest.cpp
    //  MNNTests
    //
    //  Copyright © 2026, ...
    //
    #include <vector>
    #include <MNN/expr/Expr.hpp>
    #include <MNN/expr/ExprCreator.hpp>
    #include "MNNTestSuite.h"
    #include "TestUtils.h"

    using namespace MNN::Express;

    class MyOpTest : public MNNTestCase {
    public:
        virtual ~MyOpTest() = default;
        virtual bool run(int precision) override {
            // 1) Build a small graph with the op under test.
            auto x = _Input({1, 4}, NCHW, halide_type_of<float>());
            x->writeMap<float>()[0] = 1.0f;
            x->writeMap<float>()[1] = 2.0f;
            x->writeMap<float>()[2] = 3.0f;
            x->writeMap<float>()[3] = 4.0f;
            auto y = _Multiply(x, _Scalar<float>(2.0f)); // <- op under test

            // 2) Read result and compare against expected.
            const std::vector<float> expected = {2.0f, 4.0f, 6.0f, 8.0f};
            auto got = y->readMap<float>();
            if (!checkVector<float>(got, expected.data(), 4, 0.0001f)) {
                MNN_ERROR("MyOpTest: numerical mismatch\n");
                return false;
            }
            return true;
        }
    };
    // The registered name is what `prefix` in test_stages.json matches against.
    MNNTestSuiteRegister(MyOpTest, "op/myop");

Conventions:

- The test name (`"op/myop"`) is what `prefix` matches in `test_stages.json`. Group related tests under a common prefix (e.g. `op/binary/myop`, `op/convolution/myop`) so a JSON stage with `prefix: "op/binary"` automatically picks them up.
- Use `checkVector` for absolute tolerance, `checkVectorByRelativeError` for relative; both live in `test/TestUtils.h`.
- Read `MNNTestSuite::get()->pStaus` if you need the runtime config inside the test (it carries `precision`, `memory`, `forwardType`, `dynamicOption`).
- The `precision` arg into `run(int)` is the `Precision_Normal=0 / High=1 / Low=2` selector (so `FP32Converter[precision]` quantises reference values to match the runtime).

CMake picks up new files in test/op/ automatically (the converter test
target globs the directory). After adding the file:

    cd build && make -j$(nproc) run_test.out
    ./run_test.out op/myop   # local CPU

If your test prefix already matches an existing stage, you're done. Examples:

- A test named `op/myop` runs in `unit/cpu/all`, `unit/cpu/op-mt`, `unit/opencl/op`, `unit/opencl/op-buffer`, `unit/vulkan/op`.
- A test named `op/convolution/myop` additionally runs in `unit/cpu/op-fp16-conv`.

If you need a specific precision / thread / memory / dynamicOption config that
no existing stage covers, add a stage entry to test_stages.json:

    // in test_stages.json → android.stages
    {
      "name": "unit/cpu/myop-fp16-mt",
      "filter": "cpu",
      "comment": "MyOp at FP16 precision, 4-thread.",
      "prefix": "op/myop",
      "type": 0,
      "precision": 2,
      "threadOrGpuMode": 4,
      "tag": "fp16myop64",
      "memory": 0
    }

That's it. `./test_ci.sh android <serial>` will:

1. Build the test binary (your new `MyOpTest.cpp` is picked up automatically).
2. Push `run_test.out` to the device.
3. Run the new stage between the existing `unit/*` stages.

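
To make the mapping from a stage entry to a `run_test.out` invocation concrete, here is a rough sketch. It assumes the positional arguments follow the field order `prefix type precision threadOrGpuMode tag memory dynamicOption kleidiAi`; the source only states that `prefix` is the first argument and that `kleidiAi` is `argv[8]`, so positions 2 to 7 are an inference. The helper name `build_run_test_cmd` is hypothetical, and the on-device invocation through `adb` is left out.

```shell
# Build the run_test.out command line for one stage.
# $1..$5 (required): prefix type precision threadOrGpuMode tag
# $6..$8 (optional): memory dynamicOption kleidiAi
build_run_test_cmd() {
    if [ "$#" -lt 5 ]; then
        echo "need at least: prefix type precision threadOrGpuMode tag" >&2
        return 1
    fi
    cmd="./run_test.out $1 $2 $3 $4 $5"
    shift 5
    # Positional arguments cannot be skipped, so stop at the first omitted one.
    for opt in "$@"; do
        [ -n "$opt" ] || break
        cmd="$cmd $opt"
    done
    echo "$cmd"
}

build_run_test_cmd op/myop 0 2 4 fp16myop64 0
# prints: ./run_test.out op/myop 0 2 4 fp16myop64 0
```

The example call uses the values from the `unit/cpu/myop-fp16-mt` stage: CPU (`type` 0), `Precision_Low` (2), 4 threads, and `memory` 0.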
If your test exposes an upstream backend bug that you don't want blocking
the overall run, add the test name to that stage's skip array. The skip
list is passed via MNN_TEST_SKIP env to MNNTestSuite::run(), which
matches by exact name and prints skip <name> (in MNN_TEST_SKIP)
instead of running it.

    {
      "name": "unit/opencl/op-buffer",
      ...
      "skip": [
        "op/myop"   // <-- add here; document why in skip_rationale
      ]
    }

Always pair a new skip entry with a one-line entry in
`_documentation.skip_rationale` describing the upstream bug, so future
maintainers know whether the skip is still needed.
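
The exact-name matching matters: skipping `op/myop` must not silently skip `op/myop2` or everything under `op/`. The real check lives in the C++ `MNNTestSuite::run()`; the shell sketch below only illustrates the matching rule. It assumes `MNN_TEST_SKIP` is a comma-separated list, which the text above does not specify, and `is_skipped` / `run_or_skip` are hypothetical names.

```shell
# Returns 0 when test name $1 appears as a whole entry in MNN_TEST_SKIP.
# Assumption: entries are comma-separated.
is_skipped() {
    case ",${MNN_TEST_SKIP:-}," in
        *",$1,"*) return 0 ;;
    esac
    return 1
}

run_or_skip() {
    if is_skipped "$1"; then
        echo "skip $1 (in MNN_TEST_SKIP)"
    else
        echo "run $1"
    fi
}

MNN_TEST_SKIP="op/myop,op/foo"
run_or_skip op/foo     # prints: skip op/foo (in MNN_TEST_SKIP)
run_or_skip op/myop2   # prints: run op/myop2
```

Wrapping both the list and the candidate name in delimiters is what turns a substring test into an exact-entry test.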
If you need to exercise a model in addition to the existing `mobilenet_v{1,2}` / `squeezenet_v1.{0,1}` set:

    "android": {
      "smoke_models": [
        "MobileNet/v1/mobilenet_v1.caffe.mnn",
        ...
        "MyModel/v1/my_model.caffe.mnn"   // <-- new
      ],
      ...
    }

Then add the source caffe pair to the `SMOKE_SOURCES` array in `test_ci.sh`
so it gets fetched and converted on-device alongside the existing ones, and
extend `_smoke_pair_for` to map the `.caffe.mnn` filename back to its
caffemodel + prototxt source pair.

To add a benchmark stage, append an entry to `bench_stages`:

    "android": {
      ...
      "bench_stages": [
        ...
        {
          "name": "bench/cpu-fp16",
          "filter": "cpu",
          "comment": "10-iter benchmark on CPU, 4-thread, Precision_Low (FP16).",
          "binary": "benchmark",
          "args": ["{models_dir}", "10", "2", "0", "4", "2"]
        }
      ]
    }

`{models_dir}` is substituted with `/data/local/tmp/MNN/public_models` at
dispatch time.
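
The placeholder substitution can be pictured as a per-argument text replacement. The following is a minimal sketch, not the driver's implementation: `expand_arg` is a hypothetical helper, and it assumes the substituted values contain neither `|` nor `&`, since those are special to the `sed` expression used here.

```shell
MODELS_DIR=/data/local/tmp/MNN/public_models

# $1: raw entry from the JSON "args" array; $2: current model (may be empty).
expand_arg() {
    printf '%s\n' "$1" | sed -e "s|{models_dir}|$MODELS_DIR|g" -e "s|{model}|$2|g"
}

# Expand the bench/cpu-fp16 argument list; only the first entry changes.
for arg in "{models_dir}" 10 2 0 4 2; do
    expand_arg "$arg" ""
done
```

Entries without a placeholder pass through unchanged, so the same helper can be applied to every element of `args`.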
You've added an int4 grouped-conv kernel and want it covered at FP16 + 4-thread + memory-low + dynamicOption=2.

1. Create `test/op/Int4GroupConvTest.cpp` registering as `op/convolution/int4_group`.
2. Add a stage:

        {
          "name": "lowmem/int4_group-d2-p2",
          "filter": "lowmem",
          "comment": "Int4 grouped conv: precision=Low, 4-thread, memory=Low, dyn=2.",
          "prefix": "op/convolution/int4_group",
          "type": 0,
          "precision": 2,
          "threadOrGpuMode": 4,
          "tag": "64",
          "memory": 2,
          "dynamicOption": 2
        }

3. Run:

        ./test_ci.sh android <serial> lowmem

You want your new op cross-checked between CPU and GPU.

1. Register the test as `op/myop`. Make sure the assertion passes on CPU first.
2. The default stages already cover it: `unit/cpu/all` (CPU correctness), `unit/opencl/op` (OpenCL IMAGE), `unit/opencl/op-buffer` (OpenCL BUFFER), `unit/vulkan/op` (Vulkan).
3. If only one backend's path needs broader coverage, add a dedicated stage with the desired `prefix` + `type` + `threadOrGpuMode`.

You merged an upstream sync that exposed a per-channel ~1-LSB drift in
op/foo on Vulkan, and you don't want it blocking CI.

1. Add `"op/foo"` to `unit/vulkan/op`'s `skip` array.
2. Add a key under `_documentation.skip_rationale`:

        "vulkan_foo_drift": "op/foo accumulates ~1 LSB drift per channel on Vulkan due to FP16 intermediates in upstream's foo kernel; standalone passes, the bulk run flags it. Tracking upstream issue #N."

3. Commit both edits together so `git blame` gives future you the full story.


| File | What it provides |
|---|---|
| `test_ci.sh` | Bash driver. Reads `test_stages.json` for unit/lowmem/smoke/bench/llm. Pushes `libMNNConvertDeps.so` so on-device caffe→mnn conversion works. Lazy LLM provisioning (HuggingFace / ModelScope / local `LLM_MODEL_DIR`). |
| `skills/test-ci/SKILL.md` | Agent-discoverable Skill entry for this test infrastructure. |
| `test_stages.json` | Every stage parameter, skip lists, smoke model list, bench entries. Self-documenting via `_documentation`. |
| `test/MNNTestSuite.{h,cpp}` | `MNN_TEST_SKIP` env-var support; `Status.dynamicOption` so tests can adapt tolerance to the runtime hint. |
| `test/main.cpp` | Propagates `dynamicOption` from argv into `Status`. |
| `test/op/AttentionTest.cpp` | Test 3 skipped on OpenCL/Vulkan (CPUAttention `kv_cache=false` upstream TODO). |
| `test/op/BroadcastToTest.cpp` | Loosens absolute tolerance to `0.002f` for non-CPU `forwardType` (FP16 intermediates). |
| `test/op/ConvolutionTest.cpp` | `errorScale=200` for memory=Low + dynamicOption=1 (1-LSB systematic offset). |
| `source/backend/opencl/execution/{buffer,image}/Unary*` + `cl/unary*` | Native OpenCL ERFINV (TF two-branch polynomial). |
| `source/geometry/GeometryBinary.cpp` | Insert broadcast-to on Vulkan when input/output rank differ (fixes AddBroast). |


    {
      "android": {                      // stages dispatched by `./test_ci.sh android`
        "stages": [ ... ],              // run_test.out unit + lowmem stages
        "smoke_models": [ ... ],        // public-model paths
        "smoke_a_stages": [ ... ],      // forward-smoke per (model × backend)
        "smoke_b_stages": [ ... ],      // numeric CPU-vs-backend per (model × backend)
        "bench_stages": [ ... ]         // benchmark.out invocations
      },
      "local": {                        // stages dispatched by `./test_ci.sh local`
        "stages": [ ... ],
        "smoke_a_stages": [ ... ]       // (no smoke_b in local: there's no GPU oracle)
      },
      "llm": {                          // LLM smoke test (both modes)
        "model_repo": "...",
        "config_file": "config.json",
        "prompt_file": "prompt.txt",
        "stage": { ... }
      },
      "_documentation": { ... }         // self-describing schema doc; safe to read
    }