Native Apple Metal GPU inference via mlx-c CGO bindings, implementing the inference.Backend and inference.TextModel interfaces from go-inference on Apple Silicon (M1-M4). Supports the Gemma 3, Gemma 4 (dense and MoE), Qwen 2/3, and Llama 3 architectures, loaded from HuggingFace safetensors directories or GGUF checkpoints, with fused Metal kernels for RMSNorm, RoPE, and scaled dot-product attention, plus KV cache management, LoRA fine-tuning with AdamW, and batch inference.

The root package also exposes an RFC-style direct model API (mlx.LoadModel, model.Generate, model.GenerateStream) and a non-LLM frame-compute API (mlx.NewSession, Session.BeginFrame, Session.FinishFrame, PixelBuffer, KernelRGB565ToRGBA8, KernelNearestScale, KernelScanlineFilter, KernelCRTFilter, KernelSoftenFilter, KernelSharpenFilter) for GPU-accelerated image and emulator workloads. A Python subprocess backend (mlxlm) is available as a CGO-free alternative. The package is platform-restricted to darwin/arm64; a no-op stub compiles on all other platforms.
Module: dappco.re/go/mlx
Licence: EUPL-1.2
Language: Go 1.26
Streaming generation through the go-inference registry:

```go
import (
	"context"
	"fmt"

	"dappco.re/go/inference"
	_ "dappco.re/go/mlx" // registers the "metal" backend via init()
)

model, err := inference.LoadModel("/Volumes/Data/lem/safetensors/gemma-3-1b/")
if err != nil {
	panic(err)
}
defer model.Close()

for tok := range model.Generate(context.Background(), "Hello", inference.WithMaxTokens(256)) {
	fmt.Print(tok.Text)
}
if err := model.Err(); err != nil {
	panic(err)
}
```
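The blank import above works through Go's init-time side effects: importing the package for effect only runs its init(), which registers the backend by name. A minimal sketch of that pattern, using an illustrative `Register` function and `backends` map rather than go-inference's actual internals:

```go
package main

import "fmt"

// backends maps backend names to constructors. In go-inference the real
// registry is internal to the package; these names are illustrative only.
var backends = map[string]func() string{}

// Register is what a backend package's init() would call. The init() runs
// as a side effect of a blank import such as `_ "dappco.re/go/mlx"`.
func Register(name string, ctor func() string) {
	backends[name] = ctor
}

func init() {
	// What the mlx package's init() conceptually does.
	Register("metal", func() string { return "metal backend" })
}

func main() {
	ctor, ok := backends["metal"]
	fmt.Println(ok, ctor())
}
```

Without the blank import, the lookup fails at load time, which is why the import must appear even though no identifier from the package is referenced.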
Direct model API, bypassing the registry:

```go
import (
	"fmt"

	mlx "dappco.re/go/mlx"
)

model, err := mlx.LoadModel("/path/to/model",
	mlx.WithContextLength(8192),
	mlx.WithQuantization(4),
	mlx.WithDevice("gpu"),
)
if err != nil {
	panic(err)
}
defer model.Close()

reply, err := model.Generate("Explain Gemma 4 shared KV layers", mlx.WithMaxTokens(128))
if err != nil {
	panic(err)
}
fmt.Println(reply)
```
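WithQuantization(4) requests 4-bit weights. As a rough illustration of what quantizing to that bit width involves, here is an affine per-group scheme with hypothetical helpers quantize4/dequantize4; the group size, packing, and exact scheme used by the library are not shown here, and the real kernels pack codes into words and run on the GPU:

```go
package main

import "fmt"

// quantize4 maps one group of weights to 4-bit codes in [0, 15] using a
// per-group scale and minimum (an affine scheme). Illustrative only.
func quantize4(group []float32) (codes []uint8, scale, min float32) {
	min, max := group[0], group[0]
	for _, v := range group {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	scale = (max - min) / 15
	if scale == 0 {
		scale = 1
	}
	codes = make([]uint8, len(group))
	for i, v := range group {
		q := int((v-min)/scale + 0.5) // round to nearest code
		if q > 15 {
			q = 15
		}
		codes[i] = uint8(q)
	}
	return codes, scale, min
}

// dequantize4 reconstructs approximate weights from the codes.
func dequantize4(codes []uint8, scale, min float32) []float32 {
	out := make([]float32, len(codes))
	for i, q := range codes {
		out[i] = float32(q)*scale + min
	}
	return out
}

func main() {
	group := []float32{-1, -0.5, 0, 0.5, 1}
	codes, scale, min := quantize4(group)
	fmt.Println(codes, dequantize4(codes, scale, min))
}
```

At 4 bits per weight plus a small per-group scale and offset, storage drops to roughly a quarter of float16, at the cost of bounded reconstruction error per group.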
A frame-compute session for image and emulator workloads:

```go
import mlx "dappco.re/go/mlx"

session, err := mlx.NewSession(mlx.WithSessionLabel("frame-pipeline"))
if err != nil {
	panic(err)
}
defer session.Close()

// 320x224 RGB565 source (2 bytes per pixel, stride 640).
src, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
	Width:  320,
	Height: 224,
	Stride: 640,
	Format: mlx.PixelRGB565,
})
if err != nil {
	panic(err)
}

// RGBA8 intermediate at the same resolution (4 bytes per pixel).
rgba, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
	Width:  320,
	Height: 224,
	Stride: 1280,
	Format: mlx.PixelRGBA8,
})
if err != nil {
	panic(err)
}

// 3x upscale target.
scaled, err := session.NewPixelBuffer(mlx.PixelBufferDesc{
	Width:  960,
	Height: 672,
	Stride: 3840,
	Format: mlx.PixelRGBA8,
})
if err != nil {
	panic(err)
}

frameBytes := make([]byte, src.Descriptor().SizeBytes())
if err := src.Upload(frameBytes); err != nil {
	panic(err)
}

if err := session.BeginFrame(); err != nil {
	panic(err)
}
if err := session.Run(mlx.KernelRGB565ToRGBA8, mlx.KernelArgs{
	Inputs:  map[string]mlx.Buffer{"src": src},
	Outputs: map[string]mlx.Buffer{"dst": rgba},
}); err != nil {
	panic(err)
}
if err := session.Run(mlx.KernelNearestScale, mlx.KernelArgs{
	Inputs:  map[string]mlx.Buffer{"src": rgba},
	Outputs: map[string]mlx.Buffer{"dst": scaled},
}); err != nil {
	panic(err)
}
if err := session.Run(mlx.KernelScanlineFilter, mlx.KernelArgs{
	Inputs:  map[string]mlx.Buffer{"src": scaled},
	Outputs: map[string]mlx.Buffer{"dst": scaled},
	Scalars: map[string]float64{"strength": 0.3},
}); err != nil {
	panic(err)
}
frameMetrics, err := session.FinishFrame()
if err != nil {
	panic(err)
}

finalFrame, err := scaled.Read()
if err != nil {
	panic(err)
}
_ = finalFrame
_ = frameMetrics
```
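As a CPU reference, the expansion performed by KernelRGB565ToRGBA8 can be written in a few lines. The 565 byte order (little-endian) and the bit-replication expansion below are assumptions for illustration, not the Metal kernel's actual implementation:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// rgb565ToRGBA8 is a CPU sketch of the RGB565 -> RGBA8 conversion: each
// little-endian 16-bit pixel (5 bits red, 6 green, 5 blue) expands to
// 8-bit channels by bit replication, with alpha forced to 255.
func rgb565ToRGBA8(src []byte) []byte {
	dst := make([]byte, len(src)/2*4)
	for i := 0; i+1 < len(src); i += 2 {
		p := binary.LittleEndian.Uint16(src[i:])
		r := uint8(p >> 11 & 0x1f)
		g := uint8(p >> 5 & 0x3f)
		b := uint8(p & 0x1f)
		o := i * 2
		dst[o+0] = r<<3 | r>>2 // replicate top bits into the low bits
		dst[o+1] = g<<2 | g>>4
		dst[o+2] = b<<3 | b>>2
		dst[o+3] = 0xff
	}
	return dst
}

func main() {
	// One white pixel (0xFFFF) and one pure-red pixel (0xF800).
	src := []byte{0xff, 0xff, 0x00, 0xf8}
	fmt.Printf("% x\n", rgb565ToRGBA8(src))
}
```

Bit replication (r<<3 | r>>2) maps the 5-bit maximum 0x1f to exactly 0xff, so full white in 565 stays full white in RGBA8.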
go-mlx is positioned as a Go-native, research-grade model runner, not just an inference engine. The root package exposes the full training and operations pipeline so harnesses can stop reaching for Python mlx-lm:
| Feature | Function | What it does |
|---|---|---|
| LoRA fine-tuning | `mlx.ApplyLoRA` + `mlx.NewAdamW` | Low-rank adaptation training with AdamW, mixed precision, gradient checkpointing |
| LoRA fusion | `mlx.FuseLoRAIntoModelPack(ctx, opts)` | Bake a trained LoRA adapter into the base model as a fresh safetensors pack |
| Knowledge distillation | `mlx.RunKnowledgeDistillation(ctx, runner, dataset, cfg)` | KL or soft-CE loss against a teacher's logits, with checkpoint resumption |
| GRPO | `mlx.RunGRPOReasoningTraining(ctx, runner, dataset, cfg)` | Group-relative policy optimisation with reward functions and reference KL |
| Eval | `mlx.RunModelEval(ctx, model, dataset, cfg)` | Dataset-native perplexity plus pluggable quality probes |
| Model merge | `mlx.MergeModelPacks(ctx, opts)` | Linear / SLERP / TIES / DARE merging of multiple model packs with provenance |
| GGUF quantise | `mlx.QuantizeModelPackToGGUF(ctx, opts)` | Native Go safetensors → GGUF Q8_0 / Q4_0 / Q4_K_M |
| KV snapshot | `snapshot.Save(path)` / `mlx.LoadKVSnapshot(path)` | Portable binary KV cache (Float32 or Q8 symmetric int8) for session restore |
| HF fit | `mlx.PlanHFModelFits(ctx, cfg)` | HuggingFace Hub metadata search to plan what fits on local hardware |
| Attention probe | `inference.AttentionInspector` adapter | Extract post-RoPE K vectors per head per layer for analysis |
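The Q8 symmetric int8 option in the KV snapshot row can be sketched in a few lines. This is an illustrative per-tensor scheme, not the package's actual on-disk format:

```go
package main

import (
	"fmt"
	"math"
)

// quantizeQ8 performs symmetric int8 quantization: one scale per tensor,
// codes in [-127, 127], and zero maps exactly to code zero. The snapshot
// format's real layout and granularity are defined by the package.
func quantizeQ8(x []float32) (codes []int8, scale float32) {
	var amax float32
	for _, v := range x {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	scale = amax / 127
	if scale == 0 {
		scale = 1
	}
	codes = make([]int8, len(x))
	for i, v := range x {
		codes[i] = int8(math.RoundToEven(float64(v / scale)))
	}
	return codes, scale
}

func main() {
	codes, scale := quantizeQ8([]float32{-2, -1, 0, 1, 2})
	fmt.Println(codes, scale)
}
```

A symmetric scheme maps zero to code zero exactly, so zero-valued cache entries survive the round trip losslessly, which an asymmetric (affine) scheme does not guarantee.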
See docs/ and examples/ for the full surface.
- Compute Guide — frame-oriented Metal compute sessions, pixel buffers, kernels, metrics
- Architecture — CGO binding, model architectures, weight loading, KV cache, attention, batch inference, LoRA training, mlxlm backend
- Models — model loading, supported architectures, tokenisation, chat templates
- Training — LoRA fine-tuning, AdamW, gradient computation, checkpoints, fusion
- Distillation — knowledge distillation (KL, soft cross-entropy)
- GRPO — group-relative policy optimisation for RL
- Eval — dataset-native perplexity, quality probes, eval reports
- Model Operations — merge, GGUF quantise, KV snapshot, HF fit
- Development Guide — prerequisites (mlx-c CMake build), CGO flags, test patterns, benchmarks
- Project History — completed phases, commit hashes, known limitations
- Examples — runnable usage examples organised by type
```sh
git submodule update --init --recursive
go generate ./...   # builds the mlx-c C library (required first time)
go test ./...
go build ./...
```

European Union Public Licence 1.2 — see LICENCE for details.