One Swift call site. Three on-device runtimes. The same LanguageModelSession.respond(...) reaches Apple Intelligence on iOS 26, CoreML on iOS 18, or any mlx-community/* model on the GPU — your code never changes.
Real token counts via each backend's own tokenizer (no chars/4 approximation). Median of 3 timed iterations after one warmup (the loop is sketched below the table). Same prompt every run: "Write a single-sentence Swift fact in under 30 words.", with `temperature: 0.0` and `maximumResponseTokens: 80`.
| Hardware | Runtime | Quant | TTFT | Decode tok/sec | tok/sec gap |
|---|---|---|---|---|---|
| Apple M4 Max (macOS 26.0) | CoreML / ANE | FP16-ish | 246 ms | 32.8 | — |
| Apple M4 Max (macOS 26.0) | MLX / GPU | 4-bit | 29 ms | 172.8 | 5.3× MLX |
| iPhone Air (iPhone18,1, iOS 26.4.2) | CoreML / ANE | FP16-ish | 661 ms | 34.6 | — |
| iPhone Air (iPhone18,1, iOS 26.4.2) | MLX / GPU | 4-bit | 84 ms | 45.2 | 1.31× MLX |
Two observations the chart kept hiding:
- CoreML decode rate is hardware-flat. Mac M4 Max ANE (32.8 tok/s) and iPhone Air ANE (34.6 tok/s) decode Gemma 4 E2B at essentially the same speed. The Neural Engine is bandwidth-bound on this workload, not compute-bound — and the bandwidth budget is similar on both chips.
- MLX scales with the GPU. Mac 4-bit GPU (172.8 tok/s) is 3.8× faster than iPhone 4-bit GPU (45.2 tok/s). The MLX-vs-CoreML decode gap therefore widens from 1.31× on iPhone to 5.3× on M4 Max — same model, same prompt, just more GPU.
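For reference, a minimal sketch of the timing discipline described above (one untimed warmup, then the median of three timed runs); `measureOnce` here is a hypothetical stand-in for one full prompt/response round trip, not the actual pfm-bench code:

```swift
import Foundation

// One untimed warmup, then the median of three timed iterations.
func medianOfThree(warmup: () async throws -> Void,
                   measureOnce: () async throws -> Void) async throws -> TimeInterval {
    try await warmup()                                   // not timed
    var samples: [TimeInterval] = []
    for _ in 0..<3 {
        let start = Date()
        try await measureOnce()
        samples.append(Date().timeIntervalSince(start))
    }
    return samples.sorted()[1]                           // middle value = median of 3
}
```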
Methodology, Qwen3.5 numbers, sideload instructions →
```swift
import PrivateFoundationModels
import PrivateFoundationModelsApple   // iOS 26+ — Apple Intelligence
import PrivateFoundationModelsCoreML  // iOS 18+ — Apple Neural Engine
import PrivateFoundationModelsMLX     // iOS 17+ — Apple GPU, any mlx-community/* model

// Pick a backend at startup. Everything below this is byte-identical to Apple's
// FoundationModels framework.
if #available(iOS 26.0, macOS 26.0, *), AppleFoundationModel.isAvailable {
    SystemLanguageModel.default = SystemLanguageModel(backend: AppleFoundationModel.load())
} else {
    SystemLanguageModel.default = SystemLanguageModel(
        backend: try await CoreMLLanguageModel.load(.lfm2_5_350M))
}

let session = LanguageModelSession(instructions: Instructions("Be brief."))
print(try await session.respond(to: "Capital of France?").content)
// "The capital of France is Paris." — from Apple's actual on-device model on iOS 26,
// or from LFM2.5-350M on the Apple Neural Engine on iOS 18. Your call site doesn't know.
```

`@Generable`, `Tool`, `@PromptBuilder`, streaming, transcripts — all of the Apple FM 26 surface, end-to-end verified across all three backends (see Verified below).
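A hedged sketch of that structured-output surface, assuming the mirrored `@Generable`/`@Guide` macros behave like Apple's documented ones (the `SwiftFact` type is hypothetical, not part of the package):

```swift
import PrivateFoundationModels

@Generable
struct SwiftFact {
    @Guide(description: "A single-sentence Swift fact, under 30 words")
    var fact: String
    @Guide(description: "Confidence from 0 to 1")
    var confidence: Double
}

let session = LanguageModelSession(instructions: Instructions("Be brief."))
let response = try await session.respond(
    to: "State one Swift fact.",
    generating: SwiftFact.self
)
print(response.content.fact, response.content.confidence)
```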
Apple shipped FoundationModels with iOS 26. It only runs on iOS 26. It only runs Apple's 3 B on-device model. If you ship an app that has to run today on iOS 18 — or you want to use your own model — you're stuck.
PFM is the iOS 18 polyfill that becomes a runtime passthrough on iOS 26. The same Apple-FM-shaped code compiles unchanged and runs against:
| Backend | Product | iOS | Model |
|---|---|---|---|
| Apple FoundationModels | PrivateFoundationModelsApple | iOS 26+ | Apple's 3 B on-device LLM (no download, ships in the OS) |
| CoreML / Apple Neural Engine | PrivateFoundationModelsCoreML | iOS 18+ | LFM2.5, Gemma 4, Qwen3.5, Qwen3-VL, FunctionGemma, EmbeddingGemma |
| MLX / Apple GPU | PrivateFoundationModelsMLX | iOS 17+ | Any mlx-community/* repo: Llama, Qwen, Gemma, Mistral, Phi, plus VLMs |
The day your deployment target reaches iOS 26 you can either:
- `s/PrivateFoundationModels/FoundationModels/` and delete the package, or
- keep it for the older-OS support and the bring-your-own-model story.
Either way your @Generable types, Tool instances, and respond(...) call sites don't change.
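If you'd rather not run the sed at all, one hedged option is a conditional import. Note that `#if canImport` keys off the SDK you build with, not the OS you run on, so this only replaces the rename once your deployment target actually reaches iOS 26:

```swift
// Compile against Apple's framework where the SDK ships it, PFM everywhere else.
// Call sites, @Generable types, and Tool conformances stay identical either way.
#if canImport(FoundationModels)
import FoundationModels
#else
import PrivateFoundationModels
#endif
```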
```swift
// Package.swift
.package(url: "https://github.com/john-rocky/PrivateFoundationModels", from: "0.10.4"),
```

Pick the backend products you need. Everything is pure SPM; no model files in the repo (they download on first call).
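Something like this in your target dependencies, assuming the SPM product names match the module names used above (check the package manifest for the authoritative list):

```swift
.target(
    name: "MyApp",
    dependencies: [
        // Core surface plus only the backends you ship (product names assumed):
        .product(name: "PrivateFoundationModels",       package: "PrivateFoundationModels"),
        .product(name: "PrivateFoundationModelsCoreML", package: "PrivateFoundationModels"),
        .product(name: "PrivateFoundationModelsMLX",    package: "PrivateFoundationModels"),
    ]
),
```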
The 5-minute walkthrough — `swift package init` to streaming `@Generable`: docs/TUTORIAL.md.
Already on Apple FM and want to backport to iOS 18: docs/MIGRATING_FROM_APPLE_FM.md — a five-step recipe.
Expose any PFM backend over the OpenAI HTTP shape so non-Swift codebases (Python, Node, curl, the official OpenAI SDKs) can drive Apple's on-device model unchanged:
```bash
swift run -c release pfm-serve-apple
# [pfm-serve] listening on http://127.0.0.1:11434
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="not-required")
resp = client.chat.completions.create(
    model="apple-fm",
    messages=[{"role": "user", "content": "Capital of France?"}],
)
# resp.choices[0].message.content == "The capital of France is Paris."
```

Implemented endpoints: `POST /v1/chat/completions` (with SSE streaming, tool calling, vision content arrays, JSON mode), `POST /v1/completions`, `POST /v1/embeddings`, `GET /v1/models`, `GET /healthz`, full CORS for browser `fetch()`. Multi-model loading (Ollama-style) since v0.10.0:
```bash
pfm-serve-mlx \
  --model mlx-community/Qwen3.5-0.8B-MLX-4bit \
  --model mlx-community/FastVLM-0.5B-bf16 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2
```

End-to-end verified against the official `openai==2.36` SDK, including streaming tool calls and embeddings. Demos in Examples/PythonClient/.
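The same wire shape is reachable from Swift without any SDK. A minimal sketch against the server started above, using the standard OpenAI chat/completions JSON contract the server implements (model name as served; not PFM API, just plain `URLSession`):

```swift
import Foundation

// Minimal OpenAI-shaped request body.
struct ChatRequest: Encodable {
    struct Message: Encodable {
        let role: String
        let content: String
    }
    let model: String
    let messages: [Message]
}

var request = URLRequest(url: URL(string: "http://127.0.0.1:11434/v1/chat/completions")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try JSONEncoder().encode(
    ChatRequest(model: "apple-fm",
                messages: [.init(role: "user", content: "Capital of France?")]))

let (data, _) = try await URLSession.shared.data(for: request)
print(String(decoding: data, as: UTF8.self))  // OpenAI-shaped chat.completion JSON
```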
Standardized pfm-bench harness with median-of-3 + warmup. Apples-to-apples cross-runtime numbers on M4 Max and iPhone Air, multi-language coverage across en/es/ko/ja/zh, contributable from any Mac with one command:
```bash
swift run -c release pfm-bench-apple  --csv-append docs/BENCHMARKS.csv
swift run -c release pfm-bench-coreml --csv-append docs/BENCHMARKS.csv --model qwen3.5-0.8B

# MLX needs xcodebuild
$(find ~/Library/Developer/Xcode/DerivedData -name pfm-bench-mlx -path '*Release*' -type f | head -1) \
  --csv-append docs/BENCHMARKS.csv
```

docs/BENCHMARKS.csv grows per contributor — both the M4 Max and iPhone18,1 (iPhone Air) baselines are already in there. Add your own iPhone via the `Examples/PFMiPhoneBench/` one-tap iOS app (auto-starts, AirDrop the CSV out, PR the diff).
Deep dives:
- docs/RUNTIME_COMPARISON.md — same model, three runtimes
- docs/MULTILANG_BENCH.md — same task, five languages
- docs/BENCHMARKS.md — full methodology
Captured on Apple M4 Max / macOS 26.0 / Xcode 26.1.1, against mlboydaisuke/lfm2.5-350m-coreml, mlx-community/Qwen3.5-0.8B-MLX-4bit, mlx-community/FastVLM-0.5B-bf16, sentence-transformers/all-MiniLM-L6-v2, and Apple's own on-device model:
| Harness | What it proves | Result |
|---|---|---|
| `swift test` | Session logic, schema decoder, tool dispatch, error wrapping — stub-backed for determinism | 94 / 94 pass |
| `pfm-verify` | Every public API path against a real CoreML model | 10 / 10 pass (log) |
| `pfm-portability` | Real Apple-FM-shaped code compiled and ran unchanged | 8 / 8 pass (log) |
| `pfm-deep` | Every Generable shape × Tool pattern against CoreML | PASS 7 / MODEL 4 / FAIL 0 (log) |
| `pfm-mlx-deep` | Same matrix routed through MLX-Swift | PASS 9 / MODEL 5 / FAIL 0 (log) |
| `pfm-apple-deep` | Same matrix through Apple's native FoundationModels | PASS 14 / MODEL 0 / FAIL 0 (log) |
| `pfm-apple-smoke` | respond + streamResponse + Generable through Apple FM | ✓ load 0 s · respond 0.7 s · stream (log) |
| `pfm-vision-sample` | OpenAI content array → MLX VLM (FastVLM-0.5B) end-to-end | ✓ identified red top-left, green top-right (log) |
| `pfm-embeddings-sample` | OpenAI /v1/embeddings → MLXEmbedder (MiniLM-L6-v2) | ✓ 384-dim, semantic ranking correct (log) |
Plus 6 captured runs through the openai Python SDK driving the HTTP server — chat, streaming, function calling, streaming tool calls, vision content arrays, embeddings — all in Examples/PythonClient/.
`LanguageModelBackend` is two methods (`generate` + `streamGenerate`) plus an availability property. Route to llama.cpp, a remote API, your own runtime — see Sources/PrivateFoundationModels/LanguageModelBackend.swift.
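A minimal sketch of a custom backend. The exact protocol requirements live in that file; the plain-`String` signatures below are assumptions for illustration, not the shipped protocol:

```swift
import PrivateFoundationModels

// Hypothetical conformance; check LanguageModelBackend.swift for the real
// method signatures before copying this.
struct EchoBackend: LanguageModelBackend {
    var isAvailable: Bool { true }

    // One-shot generation: return the whole completion at once.
    func generate(prompt: String, options: GenerationOptions) async throws -> String {
        "echo: \(prompt)"
    }

    // Streaming generation: yield chunks as they are produced.
    func streamGenerate(prompt: String, options: GenerationOptions)
        -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            continuation.yield("echo: ")
            continuation.yield(prompt)
            continuation.finish()
        }
    }
}

// Wired up exactly like the built-in backends:
// SystemLanguageModel.default = SystemLanguageModel(backend: EchoBackend())
```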
PFM mirrors Apple's FoundationModels API surface as of WWDC 2025 / iOS 26.1:
- `LanguageModelSession` — `respond(to:)`, `respond(to:generating:)`, `streamResponse(to:)`, `streamResponse(to:generating:)`, `prewarm()`, `transcript`, `isResponding`, `image:` overloads.
- `Instructions`, `GenerationOptions`, `SamplingMode`.
- `Response<Content>`, `ResponseStream<Content>` (`AsyncSequence` with `Snapshot`).
- `Transcript` + `Transcript.Entry` (`Codable`).
- `Tool` protocol, `AnyTool` type-erased wrapper, two-turn tool calling (sketched below).
- `Generable` protocol + macro, `GenerationSchema`, `@Guide(description:)`.
- `SystemLanguageModel` + `Availability` + `UnavailableReason`, `UseCase`, `Adapter`.
- `Prompt` + `@PromptBuilder` + `@InstructionsBuilder`.
- `Guardrails` (default accept-all; Apple FM passthrough delegates to Apple's).
- `GenerationError` with cases matching Apple's where they exist.
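For the tool-calling surface, a hedged sketch assuming the mirrored `Tool` protocol matches Apple's documented one (the `WeatherTool` and its stubbed output are hypothetical):

```swift
import PrivateFoundationModels

struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Current weather for a city"

    @Generable
    struct Arguments {
        @Guide(description: "City name, e.g. Paris")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Stubbed; a real tool would call a weather API here.
        "Sunny, 21 °C in \(arguments.city)"
    }
}

let session = LanguageModelSession(
    tools: [WeatherTool()],
    instructions: Instructions("Use tools when you need live data.")
)
let answer = try await session.respond(to: "What's the weather in Paris?")
print(answer.content)  // Two-turn tool call: model → getWeather → final answer
```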
If you find a method or initializer in Apple's docs that PFM doesn't ship, open an issue.
- Not affiliated with Apple. "Foundation Models" is Apple's trademark; this is an API-compatible alternative.
- Not a model. It's a thin Swift surface that delegates to whatever backend you wire up.
- Not a grammar-constrained sampler on CoreML / MLX. `@Generable` is enforced via system prompt + post-processing; on retry the schema is re-injected. Apple FM uses Apple's native grammar sampler. Grammar-constrained MLX sampling is on the roadmap.
- `Examples/PythonClient/` — official `openai` SDK driving pfm-serve. Chat, streaming, function calling, vision, embeddings.
- `Examples/PFMSwitcher/` — production-shaped iOS chat app with backend switching and strict release-before-load memory management.
- `Examples/PFMiPhoneBench/` — one-tap iPhone bench app. CSV harvest via AirDrop.
The current head is v0.10.4. Full version history in CHANGELOG.md. Next on the list:
- Grammar-constrained sampling on MLX (closes the last "Not a..." disclaimer above).
- Qwen3-VL stateful routing on CoreML.
- `llama.cpp` / GGUF backend.
- Multi-machine bench fill-in (M1 / M2 / M3 / iPhone / iPad / Vision Pro) — see CONTRIBUTING.md.
Daisuke Majima (@JackdeS11) — founder of Pebble Inc., maintainer of CoreML-Models (1.7k★), CoreML-LLM, and the mlboydaisuke Apple Silicon model collection.
Open to consulting on Apple Silicon LLM inference and on-device deployment — pebble.co.jp.
MIT. See LICENSE. Model weights inherit their own licenses (Gemma: Gemma Terms; Qwen: Apache 2.0; LFM2.5: LFM Open License v1.0).