Skip to content

Releases: NeuroJSON/siamize

siamize v0.2.0 (Agate)

03 Jun 18:43

Choose a tag to compare

siamize v0.2.0 — initial release

native C++/ONNX port of SIAM v0.3 brain/head MRI segmentation


Click here to register and download pre-compiled siamize v0.2.0 binaries and siamize MATLAB/Octave mex functions with CUDA/OpenCL/CoreML backends


siamize is a native, vendor-neutral C++ port of SIAM v0.3 — the Segment It All Model for head/brain tissue segmentation — that runs without PyTorch, nnU-Net, or torchio at deployment time. This is the first public release: the same model runs on NVIDIA, AMD, and Intel GPUs (or any CPU), ships as a prebuilt Docker image, and includes MATLAB / GNU Octave bindings.

Demos

Using siamize cli to segment a T1 volume, showing 1) dynamic weight downloading, 2) handling out-of-memory (OOM) errors on low-memory GPU (RTX 2080) using --lowmem - the 5090 GPU produced about 7s/fold = 5x7 = 35s total processing time; even the 8GB 2080 is able to finish in 16s per fold (80s total).

siamize_demo_mini.mp4

here is a video showing using siamize.m in matlab

siam_matlab_demo_mini.mp4

finally, using the internal high-res TPM output, I was also able to create high quality GM surfaces using my iso2mesh toolbox

siam_matlab_gm_demo_mini.mp4

What's in the box

  • Slim Python reference (py/siam_ref.py) — reproduces SIAM v0.3 inference
    using only PyTorch + numpy + nibabel + scipy + dynamic_network_architectures
    (no nnU-Net, no torchio, no SimpleITK).
  • ONNX export pipeline (tools/onnx_export/) — converts each fold of the
    SIAM v0.3 ResEnc-UNet to fp16 .onnx, validated against the Python reference.
  • C++ standalone binary (src/) — drop-in for siam-pred with no Python at
    runtime, with two interchangeable inference backends (below).

Two inference backends

Binary (entrypoint) Backend Hardware
siamize (default) ONNX Runtime + CUDA EP NVIDIA GPU, or CPU
siamize-opencl MNN + OpenCL NVIDIA / AMD / Intel GPU, or CPU
  • ONNX Runtime / CUDA — 232 KB executable + libonnxruntime.so + per-fold
    fp16 .onnx weights; CUDA EP on NVIDIA, CPU fallback otherwise. A CoreML
    variant is built for macOS arm64.
  • MNN / OpenCL (-DSIAMIZE_BACKEND=mnn) — vendor-neutral GPU inference via
    OpenCL (and Vulkan / Metal). Statically linked (~9 MB, no libMNN.so),
    dlopens the OpenCL ICD at runtime; per-fold fp32 native-Conv3D .mnn weights.

Features

  • 5-fold ensemble, 18 SIAM classes, sliding-window inference at a target
    isotropic spacing (-u, default 0.75 mm).
  • Flexible I/O — input NIfTI (.nii/.nii.gz) or JNIfTI (.jnii/.bnii);
    output a uint8 labelmap, a 4D float32 tissue-probability map (--tpm), an
    SPM12-style 6-class map (-C spm), or JNIfTI (-F jnii).
  • Auto-downloaded weights — fold weights fetch from NeuroJSON on first run
    (mount a volume at /cache to keep them).
  • GPU device selection — flat, 1-based -G N index plus a --list-gpu
    listing of all OpenCL platforms/devices (mapped through MNN's platform
    reordering).
  • Auto-tuning & memory adaptivity — per-GPU OpenCL kernel-tuning cache,
    automatic sliding-window patch shrink on VRAM-tight hosts, and a --lowmem
    preset.
  • Docker imageopenjdata/siamize,
    CUDA 12 + cuDNN 9, bundling both backends. Calendar-versioned vYYYY.M
    (this release: v2026.6).
  • MATLAB / GNU Octave bindings — MEX for both backends, with portable Linux
    MEX (pinned condition_variable::wait to clear the GLIBCXX_3.4.30 load
    failure against bundled MATLAB libstdc++).

Why siamize (vs. upstream SIAM Python)

Same trained SIAM v0.3 weights, but packaged for real-world deployment.

Deployment & dependencies

Upstream siam-pred siamize
Runtime Python ≥3.10 + torch + nnunetv2 + torchio self-contained C++17 binary
Dependency footprint ~5–6 GB ~50 MB binary + ~70 MB ORT lib
Bindings Python CLI only CLI + MATLAB + GNU Octave (MEX)
Container/cluster use full Python + CUDA stack single binary + weights; published Docker image
OS coverage Linux (macOS via mps) Linux / macOS / Windows (all CI-tested)

Hardware

Backend Upstream siamize
NVIDIA CUDA ✅ (PyTorch) ✅ (ORT CUDA EP)
NVIDIA TensorRT ✅ (opt-in)
AMD / Intel GPU ✅ (MNN + OpenCL)
Apple Silicon partial (Metal GPU, no ANE) ✅ CoreML incl. ANE
CPU
Auto-fallback cuda → mps → cpu auto → tensorrt → cuda → coreml → cpu

Weights & performance

Upstream siamize
Format / precision nnU-Net .pth, fp32 .onnx, fp16 (same trained values)
Size per fold 1.14 GB ~270 MB raw / ~90 MB gzipped
5-fold total ~5.7 GB ~450 MB gzipped
Single fold, A100 not benchmarked 9.8 s
Single fold, RTX 2080S 13.3 s
CPU 5-fold ensemble ~25 min ~10.5 min
Determinism run-to-run cuDNN algo search deterministic fp16 graph

Output & I/O features

Capability Upstream siamize
Output containers NIfTI-1 .nii.gz only .nii / .nii.gz / .jnii / .bnii
Probability map --tpm (4D float32, + temperature)
Class remap -C spm (SIAM-18 → SPM12-6)
Embedded label table ✅ JGIFTI LabelTable
Fold selection always 5-fold single or N-fold via -M
Save at inference resolution --upsample

Trade-offs / not (yet) supported

Item Notes
~0.3% boundary-voxel gap vs Python almost entirely the resampling kernel (Catmull-Rom vs scipy cubic B-spline) — not the network or the fp16 cast
Batch-folder input one file per invocation (bash-loop workaround)
4D-volume input split first (same as upstream)
Test-time augmentation neither tool uses it
AMD ROCm EP not wired (OpenCL covers AMD instead)

Weights are unchanged from SIAM v0.3 — no pruning, no quantization beyond fp16,
no architecture trimming.

Patched MNN included

The MNN backend uses a patched MNN fork (v3.5-opencl-conv3d):
CPUConvolution3D / CPUDeconvolution3D now correctly handle MNN's NC4HW4
activation layout (unpack → NCDHW GEMM → repack), fixing incorrect CPU-backend
results. Verified at 667/667 ops and 99.59% end-to-end voxel agreement.

Accuracy

Reproduces the original SIAM output on the bundled sub-01_T1w.nii.gz (5-fold
ensemble, 18 classes) within a known ~0.3% precision gap vs the Python SIAM
pipeline; see the README's
Known precision gap
section.

Quick start (Docker)

# NVIDIA GPU, ONNX Runtime / CUDA -- 5-fold ensemble (default entrypoint)
docker run --rm --gpus all -v "$PWD":/data -v siamize-cache:/cache \
    openjdata/siamize:v2026.6 \
    -i /data/in.nii.gz -o /data/labels.nii.gz -M 0,1,2,3,4 -c cuda

# Vendor-neutral GPU via MNN / OpenCL (NVIDIA / AMD / Intel)
docker run --rm --gpus all -v "$PWD":/data -v siamize-cache:/cache \
    --entrypoint siamize-opencl openjdata/siamize:v2026.6 \
    -i /data/in.nii.gz -o /data/labels.nii.gz -M 0 -c opencl

See the README for
build-from-source, CPU-only, --tpm, -C spm, and JNIfTI examples.

Citation

If you use this in your work, please cite the upstream SIAM paper
(arXiv:2605.02737).

Acknowledgement: This project uses resources and data formats developed as
part of the NeuroJSON project, supported by US National
Institute of Health (NIH) grant
U24-NS124027.