
Orion

A local AI runtime that enables training and running small LLMs directly on Apple Silicon using the Neural Engine.

No CoreML. No Metal. No GPU. No cloud.

```
┌─────────────────────────────────────────────────────────────┐
│  Prompt → Tokenizer → ANE Runtime → CPU Sampling → Output   │
│                                                             │
│  170+ tok/s inference  ·  3.8x faster training via delta    │
│  compile  ·  LoRA hot-swap without recompilation            │
└─────────────────────────────────────────────────────────────┘
```

Quick Start

```sh
# Build (no Xcode required)
make

# Run inference
./orion infer --prompt "The meaning of life is" --max_tokens 128 --ane

# Train a model
./orion train --weights model/blobs/stories110m --dataset data/tinystories.bin --steps 1000
```

Everything runs offline. No data leaves your device.


Why ANE?

Apple ships a dedicated Neural Engine (NPU) in every iPhone, iPad, and Mac sold since 2020 — over 2 billion devices. The accelerator is capable of ~19 TFLOPS fp16 (Apple claims 38 TOPS INT8, but maderix showed that the ANE dequantizes to fp16 before computing). Yet Apple only exposes it through CoreML, which:

  • adds significant compilation and dispatch overhead
  • restricts operations to Apple-approved model formats
  • does not support training at all

Orion bypasses CoreML entirely. It talks to the ANE through private frameworks (_ANEClient, _ANECompiler, MIL IR), giving full control over compilation, memory layout, and scheduling. This opens up capabilities Apple never intended:

  • Training on the ANE — forward and backward passes on hardware Apple restricted to inference
  • Delta compilation — reload weights without recompiling, reducing training overhead by 8.5x
  • LoRA hot-swap — adapter matrices as IOSurface inputs, zero recompilation on swap
  • Custom kernel pipelines — compiler-generated MIL from graph IR with optimization passes

Orion builds on foundational work by maderix (private API reverse-engineering, hardware characterization) and ANEgpt (training kernel structure), extending them into a production-quality LLM training and inference runtime with a compiler, stable training (NaN fix), program caching, checkpointing, and a benchmark harness.


How Orion Compares

| Framework | Focus | Training | ANE Direct | Models |
|---|---|---|---|---|
| MLX | GPU inference + training | Yes (GPU) | No | Many |
| MLC-LLM | Portable inference | No | No | Many |
| Ollama | Easy model serving | No | No | Many |
| llama.cpp | Portable CPU/GPU inference | No | No | Many |
| Orion | ANE training + inference | Yes (ANE) | Yes | GPT-2 124M, Stories110M |

Orion's niche: the only framework that targets Apple's Neural Engine directly for both training and inference. Others use GPU (Metal) or CPU. Orion uses the NPU — dedicated silicon that's idle in most workloads.


What You Can Do

Inference (3 modes)

```sh
./orion infer --prompt "Hello, world" --ane             # Full ANE forward pass
./orion infer --prompt "Hello, world" --ane-prefill     # ANE prefill → CPU decode
./orion infer --prompt "Hello, world"                   # CPU-only baseline
```

Training

```sh
# Train with delta compilation (no exec() restart needed)
./orion train --weights model/blobs/stories110m \
  --dataset data/tinystories.bin \
  --steps 1000 --grad_accum 4 --lr 3e-4

# Resume from checkpoint
./orion train --weights model/blobs/stories110m \
  --dataset data/tinystories.bin \
  --steps 100 --grad_accum 4 --lr 1e-5 \
  --resume checkpoints/ckpt_00500.bin
```

Training runs: ANE forward/backward → CPU gradient accumulation → Adam optimizer → delta weight reload (no recompilation). ANE programs are compiled once at startup (72 programs, ~4.5s); subsequent steps reload updated weights in ~494ms instead of recompiling (~4,200ms). Verified stable over 1,000 steps in 22 minutes (loss: 12.3→8.9, 0 NaN, no memory leak).
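The speedup figures above follow directly from the quoted timings. A back-of-envelope check (values are taken from the text, not measured here):

```c
// Weight-update speedup: full recompile vs delta reload.
static double speedup(double before_ms, double after_ms) {
    return before_ms / after_ms;
}

// Average step cost from a wall-clock run.
static double ms_per_step(double wall_minutes, int steps) {
    return wall_minutes * 60.0 * 1000.0 / steps;
}
```

speedup(4200, 494) ≈ 8.5x for the weight update, speedup(85, 22) ≈ 3.9x for the whole run (consistent with the ~3.8x per-step figure), and ms_per_step(22, 1000) = 1320 ms per training step.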

Benchmarking

```sh
./orion bench kernels                        # Per-kernel ANE latency (5 kernel types)
./orion bench inference --prompt "Hello"     # End-to-end tok/s, TFLOPS, ANE utilization
./orion bench training --steps 10            # Training step breakdown (fwd/bwd/dW/Adam)
./orion bench kernels --save-baseline        # Save baseline for regression tracking
./orion bench kernels                        # Compare against saved baseline (>15% = WARN)
```

The benchmark suite measures:

  • TFLOPS — actual compute throughput on ANE
  • Tokens/sec — end-to-end inference speed
  • ANE utilization — time in orion_eval vs total wall time
  • Dispatch overhead — compile + eval scheduling cost
  • Memory — RSS growth across compiles and swaps

This turns Orion into a measurement tool for ANE performance, not just an inference engine.
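A sketch of how the headline throughput numbers can be derived (Orion's actual FLOP accounting may differ). Decoder inference costs roughly 2 × n_params FLOPs per generated token:

```c
// End-to-end generation rate.
static double tokens_per_sec(int tokens, double wall_s) {
    return (double)tokens / wall_s;
}

// Estimated model-math throughput, using the ~2*n_params FLOPs/token rule.
static double est_tflops(double n_params, double tok_per_s) {
    return 2.0 * n_params * tok_per_s / 1e12;  // FLOPs/s -> TFLOPS
}
```

For example, 170 tok/s on the 124M-parameter GPT-2 is about 0.04 TFLOPS of model math by this estimate — far below the ~19 TFLOPS fp16 peak, which is why the suite also tracks dispatch overhead and ANE utilization.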


Architecture

```
                          orion CLI
          ┌────────────────────────────────────────┐
          │      infer       train       bench     │
          └───────┬────────────┬────────────┬──────┘
                  │            │            │
          ┌───────┴────────────┴────────────┴──────┐
          │               Model Layer              │
          │  OrionModelConfig (GPT-2, Stories)     │
          │  Tokenizers (BPE, SentencePiece)       │
          │  Weight loading (BLOBFILE format)      │
          └───────────────────┬────────────────────┘
                              │
          ┌───────────────────┴────────────────────┐
          │         Compiler + Kernel Layer        │
          │  Graph IR → optimized MIL text         │
          │  Weight dict builders → BLOBFILE refs  │
          │  Program cache (store/lookup/evict)    │
          └───────────────────┬────────────────────┘
                              │
          ┌───────────────────┴────────────────────┐
          │              Core Runtime              │
          │  orion_compile_mil → OrionProgram      │
          │  orion_eval (IOSurface I/O)            │
          │  Delta reload (unload→patch→reload)    │
          └───────────────────┬────────────────────┘
                              │
          ┌───────────────────┴────────────────────┐
          │   Apple Neural Engine (private APIs)   │
          │  _ANEClient → _ANECompiler             │
          │  MIL IR → ANE microcode                │
          │  IOSurface-backed fp16 [1,C,1,S]       │
          └────────────────────────────────────────┘
```

Abstraction layers

| Layer | Role | Status |
|---|---|---|
| OrionKernel | Wraps MIL generation + weight dict + cache + compile + eval into one call | Done |
| OrionModelConfig | Model architecture description (layers, heads, dims, vocab) | Done |
| OrionRuntime | ANE init + delta reload + kernel dispatch | Done |
| OrionCompiler | Graph IR → optimized MIL text (validate, optimize, codegen) | Done |

These abstractions make Orion a general ANE model runtime rather than a single-model demo. Adding a new model means implementing its compiler frontend (graph builder) and weight layout — the compile-cache-eval machinery is shared.

Data flow

Inference: Tokenize → embed (CPU) → 12 transformer layers on ANE (bucketed prefill or per-token decode) → final layernorm (ANE) → logits projection (CPU, wte is 73MB — too large for ANE SRAM) → sample → repeat.
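The "wte is 73MB" figure above checks out arithmetically, assuming GPT-2 124M's 50257-token vocabulary and d_model = 768 stored as fp16 (2 bytes per element):

```c
// Size of a token-embedding table in bytes, fp16 storage.
static long long wte_bytes(long long vocab, long long d_model) {
    return vocab * d_model * 2;  // 50257 * 768 * 2 = 77,194,752 B ≈ 73.6 MiB
}
```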

Training (single process, no restarts): 72 ANE programs compiled once at startup (~4.5s). Each step: embed (CPU) → forward layers (ANE, 6 outputs per layer for backward reuse) → loss (CPU) → backward layers (ANE, dx path) → dW accumulation (CPU, GCD async) → Adam (CPU) → delta weight reload (unload → update BLOBFILE on disk → reload, ~494ms for 60 kernels). Zero compiles during training — the ~119 compile-per-process limit never applies.

Tensor layout: All ANE I/O uses fp16 [1, C, 1, S] on IOSurface-backed memory. CPU↔ANE data transfer transposes between [seq, d_model] and [d_model, seq].
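The CPU↔ANE layout change described above can be sketched as a plain transpose. Floats are used here for clarity; the runtime works in fp16 on IOSurface-backed buffers:

```c
#include <stddef.h>

// Row-major [seq, d_model] (CPU) -> [d_model, seq] (channel-major ANE view).
static void to_ane_layout(const float *cpu, float *ane,
                          size_t seq, size_t d_model) {
    for (size_t s = 0; s < seq; s++)
        for (size_t c = 0; c < d_model; c++)
            ane[c * seq + s] = cpu[s * d_model + c];
}
```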


Use Cases

  • Local AI assistants — private, offline language models on Mac, iPad, or iPhone
  • Educational robotics — a toy robot running a local LLM on an iPad, no cloud dependency
  • Offline copilots — code completion, writing assistance, and reasoning without internet
  • Privacy-first applications — medical, legal, or personal AI where data must never leave the device
  • Edge deployment — any Apple device with an ANE becomes an inference/training endpoint
  • ML research — benchmark and profile the Neural Engine with direct kernel-level control

Project Roadmap

Stage 1 — Orion Core (complete)

ANE training runtime + inference engine. Direct MIL compilation, program caching, weight swapping, benchmark harness. Two models: GPT-2 124M (inference) and Stories110M (training). Runtime abstractions (OrionKernel, OrionRuntime, model registry) make Orion a general toolkit rather than a single-model demo.

Status: Complete. 116/116 tasks.

Stage 2 — Orion Compiler (complete)

Graph-level IR and optimization pipeline that replaces hand-written MIL generators. Models are now defined as compiler frontends (pure C graph builders) instead of hand-written MIL text. The compiler validates, optimizes, and codegens to ANE-compatible MIL automatically.

Capabilities:

  • Graph IR — ~27 ops, fluent builder API, topological ordering
  • Optimization passes — DCE, identity elimination, cast fusion, SRAM annotation, uniform output padding
  • ANE validation — constraint checking before compilation (shape limits, op support)
  • 13 compiler frontends — 4 GPT-2 inference + 6 Stories training + 3 utility kernels
  • MIL diff tool — structural comparison between generated MIL programs

Status: Complete. 32/32 compiler tasks + milgen fully replaced. 21/21 tests pass.
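A toy illustration of the graph-IR idea behind the compiler: nodes with op names and input edges, appended by a small builder so that insertion order is automatically topological. This is NOT Orion's real IR; every name here is invented:

```c
#define MAX_NODES 32
#define MAX_INPUTS 2

typedef struct {
    const char *op;          // e.g. "input", "matmul", "layer_norm"
    int inputs[MAX_INPUTS];  // indices of producer nodes
    int n_inputs;
} Node;

typedef struct {
    Node nodes[MAX_NODES];
    int count;
} Graph;

// Append a node; pass -1 for unused input slots. Because producers must
// already exist when referenced, the node list is a topological order.
static int emit(Graph *g, const char *op, int a, int b) {
    Node *n = &g->nodes[g->count];
    n->op = op;
    n->n_inputs = 0;
    if (a >= 0) n->inputs[n->n_inputs++] = a;
    if (b >= 0) n->inputs[n->n_inputs++] = b;
    return g->count++;
}
```

Passes like DCE or cast fusion then become simple walks over this list before MIL text is emitted.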

Stage 2.5 — Orion v2.0 (complete)

Three workstreams driven by community feedback (r/MachineLearning) and deferred v1 goals:

  • Delta Compilation (complete) — Bypass ANE recompilation entirely for weight updates. Instead of recompiling 60 programs per step (4,200ms), Orion unloads each program, updates weight files on disk, and reloads (494ms). 8.5x faster recompilation, 3.8x faster total training step. 1,000-step endurance verified: 22 minutes wall time (vs ~85 min with full recompile), zero NaN, no memory leak.
  • LoRA Adapter-as-Input (complete) — Low-Rank Adaptation where adapter matrices A, B are passed as IOSurface inputs. Compiler frontends generate Y = conv1x1(x, W_base) + alpha * (x @ A) @ B. Hot-swap verified: different adapters produce different outputs with zero recompiles. 17/17 tests pass. Discovered 3 new ANE constraints during implementation (#18-#20).
  • SwiftUI Demo App — Deferred to Stage 3.

Status: 13/20 tasks complete. Delta compilation 7/7, LoRA 6/8 (2 deferred), Demo App deferred. Tagged v2.0.
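The LoRA formulation quoted above, Y = conv1x1(x, W_base) + alpha * (x @ A) @ B, reduces on CPU to two matmuls through a rank-r bottleneck. A minimal float sketch (the ANE path uses fp16 conv1x1 kernels; shapes here are illustrative):

```c
// x[n,d], W[d,d], A[d,r], B[r,d], y[n,d], all row-major; r assumed <= 64.
static void lora_forward(const float *x, const float *W,
                         const float *A, const float *B, float alpha,
                         float *y, int n, int d, int r) {
    for (int i = 0; i < n; i++) {
        // base path: y[i] = x[i] @ W
        for (int j = 0; j < d; j++) {
            float acc = 0.0f;
            for (int k = 0; k < d; k++) acc += x[i*d + k] * W[k*d + j];
            y[i*d + j] = acc;
        }
        // low-rank path: y[i] += alpha * (x[i] @ A) @ B
        float xa[64];
        for (int p = 0; p < r; p++) {
            float acc = 0.0f;
            for (int k = 0; k < d; k++) acc += x[i*d + k] * A[k*r + p];
            xa[p] = acc;
        }
        for (int j = 0; j < d; j++) {
            float acc = 0.0f;
            for (int p = 0; p < r; p++) acc += xa[p] * B[p*d + j];
            y[i*d + j] += alpha * acc;
        }
    }
}
```

Swapping adapters is just repointing A and B; the base weights — and, on the ANE, the compiled program — are untouched, which is why hot-swap needs zero recompiles.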

Stage 3 — Orion Platform (future)

Developer toolchain for building local AI applications on Apple devices. Model packaging, deployment, and a simple CLI/API:

```sh
pip install orion

orion train tiny mydata.txt
orion chat
```

Packaging Roadmap

| Phase | Distribution | Status |
|---|---|---|
| 1 | Native binary (`clang && ./orion`) | Done |
| 2 | Makefile (`make && ./orion`) | Current |
| 3 | Python wrapper (`import orion`) | Planned |
| 4 | pip distribution (`pip install orion`) | Planned |

Technical Discoveries

20 constraints discovered (6 from upstream, 14 newly documented by Orion). Full reference: docs/ane_constraints.md. Hardware-level constraints (compile limit, weight baking, conv vs matmul) were first documented by maderix; the rest were discovered empirically during Orion development:

Compile/eval failures:

  • ANE rejects the concat MIL op — must use multi-output programs
  • gelu is not a valid MIL op — decompose to tanh approximation
  • ~119 compile limit per process — bypassed entirely by delta compilation (programs compiled once at startup, only reloaded during training)
  • Minimum ~49KB IOSurface allocation — seq_len=1 compiles but fails at eval
  • Multi-output AND multi-input require uniform IOSurface allocation sizes — pad to max
  • Weight dict must be @{} (empty dict), not nil, for weight-free programs
  • milText must be NSData* (UTF-8 bytes), not NSString*

Silent wrong data:

  • Multi-output surfaces ordered alphabetically by MIL variable name, not return tuple order
  • Multi-input surfaces also ordered alphabetically by MIL parameter name
  • ANE reads flat buffer as packed shape data — no stride adjustment for oversized surfaces
  • SDPA causal masks are ignored — must decompose attention manually
  • BLOBFILE offset is uint64(64) (chunk header), not 0 or 128
  • Weights baked at compile time — overwriting BLOBFILE on disk doesn't change outputs (must reload)
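The alphabetical-ordering pitfalls above can be handled by sorting a program's variable names before binding surfaces. A small sketch (names here are invented examples, not Orion's):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_name(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

// Returns the surface slot a given MIL variable name lands in, given the
// ANE's alphabetical binding order (not the return-tuple order).
static int surface_slot(const char **names, int count, const char *want) {
    const char **sorted = malloc(count * sizeof *sorted);
    memcpy(sorted, names, count * sizeof *sorted);
    qsort(sorted, count, sizeof *sorted, cmp_name);
    int slot = -1;
    for (int i = 0; i < count; i++)
        if (strcmp(sorted[i], want) == 0) { slot = i; break; }
    free(sorted);
    return slot;
}
```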

Compiler-level:

  • ANE MIL requires named const refs for matmul transpose_x/transpose_y — inline true/false rejected
  • ANE MIL conv does NOT support bias= — bias must be a separate add op
  • Output variable names must reference live nodes — dead names in return tuples cause InvalidMILProgram

How Orion Differs from Upstream

Orion builds on two open-source projects that reverse-engineered Apple's ANE:

| | maderix/ANE (Part 1, Part 2) | ANEgpt | Orion |
|---|---|---|---|
| What it is | ANE reverse-engineering + hardware characterization | Training demo | Production LLM runtime |
| Key contribution | Private API discovery, SRAM/TFLOPS benchmarks, 38 TOPS debunk, compile limit, dispatch overhead | Forward+backward on ANE | Compiler, delta compilation, LoRA, 14 new constraints, complete system |
| Language | C/ObjC | C + Python | Pure ObjC (no Python at runtime) |
| Inference | Benchmarks only | ANE prefill only | ANE full forward (prefill + decode) |
| Training | N/A | Stories110M (NaN on resume) | Stories110M + delta compile (8.5x faster) + stable resume |
| Weight updates | N/A | Full recompile | Delta reload (unload→update→reload, no compiler) |
| LoRA | N/A | N/A | IOSurface-input adapters, hot-swap without recompile |
| Models | N/A | 1 | 2 (GPT-2 124M + Stories110M) |
| Tokenizer | N/A | Python-side | Native BPE + SentencePiece in C |
| Program cache | N/A | None | Composite-key cache with eviction |
| Benchmarking | Hardware microbenchmarks (SRAM, dispatch, peak TFLOPS) | N/A | 4-mode suite with regression tracking |
| MIL generation | Hardcoded | Hardcoded | Compiler (graph IR → optimized MIL) |

Requirements

  • macOS 15+ (Sequoia) on Apple Silicon (M1 or later)
  • Xcode Command Line Tools
  • Python 3.10+ with torch, transformers (weight conversion only — not needed at runtime)

Building

```sh
# Build with Makefile (~50 source files, no Xcode project required)
make            # Build the orion binary
make test       # Build and run all 23 test suites
make bench      # Run benchmark suite
make clean      # Remove all build artifacts
```

Weight Setup

```sh
# GPT-2 124M (inference)
python model/convert/hf_to_blobs_gpt2.py     # → model/blobs/gpt2_124m/

# Stories110M (training)
python model/convert/hf_to_blobs_llama.py    # → model/blobs/stories110m/
```

Project Status

v1.0: 148/148 tasks complete across 13 phases. Tagged v1.0. v2.0: 13/20 tasks complete across 3 workstreams. Tagged v2.0.

See STATUS.md for the full dashboard and RESULTS.md for comprehensive benchmarks.

| Phase | Status | Highlights |
|---|---|---|
| M0: Upstream Validation | Complete | Verified ANE private APIs on M4 Max |
| M1: CPU Baseline | Complete | Full GPT-2 forward pass, 283 tok/s CPU decode |
| M2: ANE Prefill | Complete | 12-layer ANE prefill, exact match vs CPU |
| M3: Training | Complete | Full Stories110M training loop, checkpointing |
| M4: Weight Swapping | Complete | Program cache, 100-iter swap endurance (RSS 1.41x) |
| Phase 8: ANE Full Forward | Complete | ANE decode at 5.8ms/tok, tokens match CPU exactly |
| Phase 9: Benchmark Harness | Complete | Per-kernel, inference, training, regression tracking |
| Phase 10: Runtime Abstractions | Complete | OrionKernel, OrionRuntime, model registry |
| Phase 11: Build & Quality | Complete | Makefile, zero warnings, constraints docs |
| Phase 12: Orion Compiler | Complete | Graph IR, 5 optimization passes, 13 frontends, MIL codegen |
| Phase 13: Milgen→Compiler | Complete | All MIL generators replaced by compiler, 21/21 tests pass |
| Phase 14: Delta Compilation | Complete | 8.5x faster recompile, 3.8x faster training, 1000-step endurance |
| Phase 15: LoRA | 6/8 | Compiler frontends, adapter loader, hot-swap, 17/17 tests |
| Phase 16: SwiftUI Demo | Pending | macOS app with live metrics |

Acknowledgements

Built on research from:

  • maderix/ANE — foundational ANE reverse-engineering: private API discovery, hardware characterization (SRAM cliff, dispatch overhead, 38 TOPS debunk, compile limit), and benchmarking tools. See Part 1 and Part 2 blog posts. (MIT)
  • ANEgpt — LLM training harness on ANE (MIT)
  • hollance/neural-engine — ANE community documentation (tensor layout insights)

License

MIT — Murai Labs
