tensorcore

CUDA for Apple Silicon.

tensorcore is the missing software layer that turns the matrix units on M-series GPUs into a training-grade foundation. It does for Metal what cuBLAS + cuDNN + CUTLASS + NCCL + ggml-quants combined do for CUDA: one hardware-aware library, one C ABI, one binary that runs unchanged from M1 (Apple7) through M5 (Apple11).

                          ┌──────────────────────┐
                          │   tensorcore         │
                          │   ─ tc_gemm          │  ← cuBLAS
                          │   ─ tc_attention_*   │  ← cuDNN attention
                          │   ─ tc_conv2d_*      │  ← cuDNN conv
                          │   ─ tc_rmsnorm / RoPE│  ← cuDNN norms
                          │   ─ tc_swiglu / softmax / AdamW
                          │   ─ tc_gemv_quantized│  ← ggml Q4_0 / Q8_0
                          │   ─ tc_gguf_*        │  ← GGUF v3 reader
                          │   ─ tc_allreduce / broadcast / allgather
                          │                       ← NCCL primitives
                          └──────────┬───────────┘
                                     │
                          ┌──────────▼───────────┐
                          │  Apple GPU            │
                          │  ─ simdgroup_matrix   │  (M1+)
                          │  ─ mpp::tensor_ops    │  (M5+)
                          └───────────────────────┘

The thesis

NVIDIA's moat in AI is software stack maturity × silicon × interconnect, and only the first of those is pure software. Apple already has the silicon: per-watt inference efficiency that NVIDIA can't match without changing chips, and unified memory that removes the entire host/device transfer problem class. What Apple is missing is a cuBLAS-grade kernel library on top of Metal.

tensorcore is the bet that closing that software gap changes the economics for any team training models that fit in the combined unified memory of 32 or fewer Macs.

For the direct mapping of every CUDA primitive to its tensorcore equivalent, see docs/cuda_comparison.md.

What v0.1 ships (measured, M2 Ultra)

Component	Status	Numbers
`tc_gemm` fp32	bit-exact vs Accelerate	2.46 TFLOPS @ 4096³
`tc_gemm` fp16 (Apple7+)	scaled-RMS err ≤ 5e-3 vs ref	17.88 TFLOPS @ 4096³ (~66% of peak)
`tc_gemm` bf16 (Apple9+ native, Apple7..8 fallback)	scaled-RMS ≤ 3e-3	correctness verified
`tc_gemm` int8 (Apple10+ native, Apple7..9 fallback)	bit-exact i32 accum	correctness verified
`tc_gemm_*_128` 128×128 tile	env-flag opt-in	regresses v0.1; v0.2 retunes
`tc_attention_forward` fp16 D=64, causal/GQA/window/ALiBi	scaled-RMS ≤ 1e-3	7.07 TFLOPS @ B=1, H=32, S=4096
`tc_attention_forward` fp16 D=128	correctness verified	bench harness v0.2
`tc_attention_backward` fp16 D=64/D=128	scaled-RMS ≤ 3e-3	LSE-saved scheme
Q4_0 / Q8_0 quantized GEMV plus GPU quantize	bit-exact vs dequant ref	7B decode harness
Q4_0 async-stream batched GEMV	~79% of LPDDR5 peak bw	186 tok/s, 632 GB/s @ synthetic 7B decode
RMSnorm / LayerNorm / RoPE / SwiGLU / softmax / AdamW	Metal, portable CPU, and CUDA managed-memory kernels	C tests + Python smoke
Fused RMSnorm/LayerNorm+GEMV	inference projection primitives	correctness vs separate paths
Conv2D fwd + backward (im2col + GEMM)	scaled-RMS ≤ 1e-3	multi-batch validated
GGUF reader	v3 metadata, tensors, bulk copy, Q4/Q8 descriptors	synthetic + Q4 GEMV end-to-end
Python ctypes binding	full ABI surface, NumPy interop	covered by CTest `python_basic`
Distributed (single-host ring + portable GLOO TCP)	bit-exact local ranks	thread, fork, and TCP transports
MPS + Accelerate fallback	wired, exercised by dispatch	—
Portable CPU backend	builds on Linux / Intel-Mac with `TC_ENABLE_METAL=OFF`; covers buffers, streams, GEMM, attention/training/conv, GGUF, `TC_DIST_SINGLE`, GLOO TCP, DiLoCo, and sparse compression.	for non-Apple mesh workers
CTest suite	27/27 pass on M2 Ultra (25 library/package tests + 2 example smokes)	`ctest --test-dir build`
CMake / pkg-config / Python install	`tensorcore::tensorcore[_shared]`, `tensorcore.pc`	tested out-of-tree
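
The Q4_0 rows above refer to ggml's 4-bit block format. To illustrate the kind of dequantized reference a "bit-exact vs dequant ref" check compares against, here is a minimal pure-Python sketch of that layout: 32 weights per block, one fp16 scale, and sixteen bytes of packed 4-bit codes. The names `quantize_q4_0` and `dequantize_q4_0` are hypothetical and are not part of the tensorcore ABI. The shipped kernels run on the GPU and may differ in detail.

```python
import struct

QK = 32  # weights per Q4_0 block in the ggml format


def quantize_q4_0(block):
    """Pack 32 floats into 18 bytes: a 2-byte fp16 scale plus 16 bytes of 4-bit codes."""
    assert len(block) == QK
    # Pick the signed value with the largest magnitude (first one wins ties).
    vmax = 0.0
    for v in block:
        if abs(v) > abs(vmax):
            vmax = v
    d = vmax / -8.0                      # scale; vmax itself maps to code 0
    inv = 1.0 / d if d != 0.0 else 0.0
    # Shift into [0, 16], truncate, and clamp to the 4-bit range [0, 15].
    codes = [min(15, int(v * inv + 8.5)) for v in block]
    # Element j shares a byte with element j + 16 (low nibble, high nibble).
    half = QK // 2
    packed = bytes(codes[j] | (codes[j + half] << 4) for j in range(half))
    return struct.pack("<e", d) + packed  # "<e" stores the scale as IEEE fp16


def dequantize_q4_0(raw):
    """Inverse of quantize_q4_0: recover 32 approximate floats from 18 bytes."""
    d = struct.unpack("<e", raw[:2])[0]
    half = QK // 2
    out = [0.0] * QK
    for j, byte in enumerate(raw[2:]):
        out[j] = ((byte & 0x0F) - 8) * d
        out[j + half] = ((byte >> 4) - 8) * d
    return out


if __name__ == "__main__":
    xs = [i / 31 - 0.5 for i in range(QK)]
    raw = quantize_q4_0(xs)
    ys = dequantize_q4_0(raw)
    print("bytes per block:", len(raw))
    print("bits per weight:", len(raw) * 8 / QK)
    print("max abs error:", max(abs(a - b) for a, b in zip(xs, ys)))
```

At 18 bytes per 32 weights the format costs 4.5 bits per weight, which is why quantized decode is limited mainly by how fast the weights can be streamed from memory.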

Public C ABI — `include/tensorcore/*.h`

The C ABI is about 1.3K lines across sixteen public headers (including the umbrella header), short enough to read end-to-end in an afternoon. The entry points group as follows:

Lifecycle: tc_init, tc_shutdown, tc_device_info_get, tc_buffer_alloc/_free/_map/_size, tc_stream_create/_destroy/_sync.
GEMM: tc_gemm, tc_gemm_async, tc_gemm_batched (fp16, bf16, fp32, int8). Diagnostics: tc_last_backend, tc_backend_name.
Attention: tc_attention_forward/_async, tc_attention_backward. Causal, GQA, sliding window, ALiBi, LSE save — all via the same descriptor.
Training kernels: tc_rmsnorm_*, tc_layernorm_*, tc_rope_forward, tc_swiglu_*, tc_softmax_*, tc_adamw_step, tc_fused_rmsnorm_gemv, tc_fused_layernorm_gemv.
Conv2D: tc_conv2d_forward, tc_conv2d_backward_input, tc_conv2d_backward_weight.
Quantized: tc_quantize_weights, tc_gemv_quantized/_async, tc_quantized_size.
GGUF: tc_gguf_open/_close, metadata getters, tensor iteration, tc_gguf_load_supported_tensors, matrix descriptor helpers, tc_gguf_get_llama_config.
Distributed: tc_dist_init/_finalize, tc_allreduce, tc_broadcast, tc_allgather, tc_barrier.

Complete reference: docs/api_reference.md.
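
For readers unfamiliar with the collectives in the Distributed group, the following is a minimal single-process simulation of the textbook ring allreduce (a reduce-scatter phase followed by an allgather phase). It is a sketch of the general technique only: `ring_allreduce` is a hypothetical helper, ranks are modelled as plain Python lists, and nothing here is taken from tensorcore's own ring implementation or its `tc_allreduce` signature.

```python
def ring_allreduce(rank_data):
    """Sum equal-length vectors across ranks; every rank ends with the full total."""
    n = len(rank_data)
    length = len(rank_data[0])
    assert all(len(v) == length for v in rank_data)
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = length // n
    bufs = [list(v) for v in rank_data]  # one private buffer per simulated rank

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot every payload first so all sends in a step are simultaneous.
        sends = []
        for r in range(n):
            c = chunk_for_rank(r)
            sends.append((r, c, [bufs[r][i] for i in span(c)]))
        for r, c, payload in sends:
            dst = (r + 1) % n  # each rank only ever talks to its right neighbour
            for i, v in zip(span(c), payload):
                bufs[dst][i] = combine(bufs[dst][i], v)

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n and
    # the receiver accumulates it. Afterwards rank r owns the total of chunk r + 1.
    for s in range(n - 1):
        exchange(lambda r, s=s: (r - s) % n, lambda old, new: old + new)

    # Phase 2: allgather. At step s, rank r forwards chunk (r + 1 - s) mod n and
    # the receiver overwrites its stale copy with the finished one.
    for s in range(n - 1):
        exchange(lambda r, s=s: (r + 1 - s) % n, lambda old, new: new)

    return bufs


if __name__ == "__main__":
    ranks = [[r * 10 + i for i in range(8)] for r in range(4)]
    for r, buf in enumerate(ring_allreduce(ranks)):
        print(f"rank {r}: {buf}")
```

In this scheme each chunk is summed exactly once, in one fixed order, and then copied around the ring, so every rank ends with identical bits even for floating-point inputs. Each rank sends 2·(n−1) chunks in total regardless of how many ranks participate.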

Apple GPU family gating

Family	Chips	Native MMA dtypes	TensorOps M5
Apple7	M1	fp16, fp32	—
Apple8	M2	fp16, fp32	—
Apple9	M3, A17 Pro	+ bf16	—
Apple10	M4	+ int8	—
Apple11	M5	(all of the above)	✓ (SDK 26.0+ + M5 runtime)

On silicon older than the family that supports them natively (bf16 before Apple9, int8 before Apple10), bf16 and int8 run through a software fallback, and the dispatcher picks the fastest available path. One library binary; no per-chip builds. See docs/family_gating.md.
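
The gating table can be read as a small lookup. The sketch below encodes it in Python to make the rule concrete. `NATIVE_SINCE`, `mma_path`, and `tensorops_available` are illustrative names, not tensorcore symbols, and the real dispatcher presumably weighs more than the family number (the text only says it picks the fastest available path).

```python
# First Apple GPU family with native matrix-multiply support for each dtype,
# transcribed from the table above.
NATIVE_SINCE = {"fp16": 7, "fp32": 7, "bf16": 9, "int8": 10}

TENSOROPS_FAMILY = 11     # Apple11 (M5)
TENSOROPS_SDK = (26, 0)   # the table lists SDK 26.0+ as a second requirement


def mma_path(family, dtype):
    """Return 'native' or 'fallback' for a dtype on a given Apple GPU family."""
    if family < 7:
        raise ValueError("the table starts at Apple7 (M1)")
    return "native" if family >= NATIVE_SINCE[dtype] else "fallback"


def tensorops_available(family, sdk_version):
    """True when both the hardware family and the SDK meet the TensorOps row."""
    return family >= TENSOROPS_FAMILY and tuple(sdk_version) >= TENSOROPS_SDK


if __name__ == "__main__":
    for family in range(7, 12):
        row = {dt: mma_path(family, dt) for dt in NATIVE_SINCE}
        print(f"Apple{family}: {row} tensorops={tensorops_available(family, (26, 0))}")
```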

Where it slots in

                          ┌──────────────────┐
                          │   eshkol         │  (compiler/runtime)
                          └────────┬─────────┘
                                   │ FFI bridge (opt-in)
                ┌──────────────────┼──────────────────┐
                │                  │                  │
   ┌────────────▼────────┐ ┌───────▼────────┐ ┌───────▼─────────┐
   │ eshkol-platform     │ │ qgt            │ │ semiclassical   │
   │ (Metal stub now)    │ │ (45 kernels)   │ │ _qllm           │
   └────────────┬────────┘ └───────┬────────┘ └───────┬─────────┘
                │                  │                  │
                └──────────────────┼──────────────────┘
                                   │
                          ┌────────▼─────────┐
                          │   tensorcore     │  ← THIS
                          └────────┬─────────┘
                                   │  Metal API
                          ┌────────▼─────────┐
                          │  Apple GPU       │
                          └──────────────────┘

After ROADMAP.md §v0.4, the three sibling projects retire their bespoke Metal backends and consume one shared kernel library. The SF64 / Ozaki-II / FP24 / FP53 precision modes that today live inside eshkol-platform/lib/backend/gpu/gpu_memory.mm move into tensorcore as named dtypes.

Build

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

ctest --test-dir build --output-on-failure          # expect all tests to pass
./build/bench/bench_gemm                             # TFLOPS sweep
./build/bench/bench_attention                        # FlashAttention TFLOPS
./build/bench/bench_inference_7b                     # Q4_0 7B decode harness
./build/examples/hello_gemm                          # minimal C usage
./build/examples/gguf_inspect model.gguf             # inspect a GGUF file
./build/examples/gguf_inspect model.gguf --load-supported
./build/examples/mesh_training_demo --inner 2 --outer 1
./build/examples/mesh_training_demo --inner 2 --outer 1 --checkpoint

The v0.2 target is for fp16 simdgroup_matrix GEMM on M3 Max to land within ~10% of MLX's hand-tuned kernels. On M2 Ultra today, expect roughly 17.9 TFLOPS fp16 at 4096³, matching the v0.1 measurement above.

bench_gemm prints the median TFLOPS and the backend that served each call. If you don't see simdgroup_matrix, see docs/troubleshooting.md.
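
To sanity-check the figures a bench run reports, the arithmetic behind them is short enough to do by hand. The sketch below uses the standard 2·M·N·K operation count for a GEMM and, for decode, assumes the workload is purely bandwidth-bound. The helper names are hypothetical and the inputs are the M2 Ultra numbers quoted in this README. It does not reproduce what `bench_gemm` itself computes internally.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in C = A @ B: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def tflops(m, n, k, seconds):
    """Throughput implied by a single GEMM that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e12


def seconds_at(m, n, k, rate_tflops):
    """Time one GEMM takes at a given sustained rate."""
    return gemm_flops(m, n, k) / (rate_tflops * 1e12)


def implied_peak(measured, fraction_of_peak):
    """Peak value implied by a measurement quoted as a fraction of peak."""
    return measured / fraction_of_peak


def gb_per_token(bandwidth_gb_s, tokens_per_s):
    """Data streamed per decoded token, assuming decode is purely bandwidth-bound."""
    return bandwidth_gb_s / tokens_per_s


if __name__ == "__main__":
    n = 4096
    print(f"fp16 4096^3 at 17.88 TFLOPS: {seconds_at(n, n, n, 17.88) * 1e3:.2f} ms per GEMM")
    print(f"implied fp16 peak: {implied_peak(17.88, 0.66):.1f} TFLOPS")
    print(f"implied memory peak: {implied_peak(632, 0.79):.0f} GB/s")
    print(f"streamed per token: {gb_per_token(632, 186):.2f} GB")
```

With the README's numbers this works out to roughly 7.7 ms per 4096³ fp16 GEMM, an implied fp16 peak of about 27 TFLOPS, an implied memory peak of about 800 GB/s, and about 3.4 GB streamed per decoded token.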

Install and link

cmake --install build --prefix /opt/tensorcore

The install carries the umbrella headers, both libraries (static + shared), the metallib, a CMake package config, and a pkg-config file:

find_package(tensorcore CONFIG REQUIRED)
target_link_libraries(my_app PRIVATE tensorcore::tensorcore_shared)

export PKG_CONFIG_PATH=/opt/tensorcore/lib/pkgconfig
cc main.c $(pkg-config --cflags --libs tensorcore) -o my_app

Python:

python3 -m pip install -e . --no-build-isolation
export TENSORCORE_LIB=/opt/tensorcore/lib/libtensorcore.dylib
python3 -c 'import tensorcore as tc; print(tc.version())'

Complete integration guide: docs/integrating_tensorcore.md. For a copyable out-of-tree project, see examples/native_sdk_consumer.

Layout

tensorcore/
├── include/tensorcore/   ← Public C ABI headers (stable across versions)
├── lib/
│   ├── core/             ← Device init, pipeline cache, buffer pool, autotune
│   ├── ops/              ← gemm.mm, attention.mm, training.mm, conv.mm, quantized.mm
│   ├── fallback/         ← MPS + Accelerate paths
│   ├── tensorops/        ← Metal 4 / M5 TensorOps (SDK-gated)
│   ├── distributed/      ← Single-host ring, portable GLOO TCP, TB5 stubs
│   ├── io/               ← GGUF v3 reader
│   └── c_api/            ← ABI shims
├── kernels/metal/        ← .metal sources → default.metallib
├── cmake/                ← compile_metallib.cmake, tensorcoreConfig.cmake.in, .pc.in
├── tests/                ← CTest correctness, ABI, Python, and CPU-portability tests
├── bench/                ← TFLOPS / tok/s harness
├── examples/             ← hello_gemm, gguf_inspect, decode/training demos
├── eshkol/               ← .esk bindings + FFI bridge for the Eshkol toolchain
│                            (see eshkol/bridge/INTEGRATION.md
│                            for the drop-in steps)
├── python/               ← ctypes Python binding (full ABI surface)
└── docs/                 ← Architecture, API reference, ROADMAP, integration guides

What's next (v0.2)

20+ TFLOPS fp16 4096³ on M2 Ultra via double-buffered K-loads + 128×128 tile retune.
FlashAttention parity with MFA (Br=64 for D=128 on Apple9+, K-block early-exit pruning, split-K).
Full mixed-precision training loop test (small transformer block, matched against PyTorch-MPS gradients).
Broader CUDA mixed-precision training evidence across full transformer loops and multi-host mesh runs.
M ≥ 4 quantized GEMV so prefill works at scale.

The honest "compete-with-NVIDIA" picture, the per-watt advantage, and the silicon-bound vs software-bound axes are all in ROADMAP.md.

Documentation

docs/ — full documentation tree (architecture, API per header, CUDA comparison, dtypes, every kernel area, GGUF, Python, benchmarks, troubleshooting, ICC-grounded codebase audit).
ONBOARDING.md — 30-second tour for a new contributor.
CONTRIBUTING.md — how to add a kernel, add a backend target, run the suites.
ROADMAP.md — what's next and how confident we are.
CHANGELOG.md — what's already shipped, per checkpoint.
SECURITY.md — threat model, supported versions, how to report a vulnerability.
examples/README.md — what each compilable example (hello_gemm, gguf_inspect, decode_step, training_step, mesh_training_demo) demonstrates.
tests/README.md — what each default and portable-CPU correctness test covers and the tolerance it enforces.
bench/README.md — what each bench measures (GEMM TFLOPS sweep, FlashAttention TFLOPS, 7B Q4_0 decode latency).

License

MIT. See LICENSE.
