PrivateFoundationModels


One Swift call site. Three on-device runtimes. The same LanguageModelSession.respond(...) reaches Apple Intelligence on iOS 26, CoreML on iOS 18, or any mlx-community/* model on the GPU — your code never changes.

Same prompt, same call site, two runtimes — Gemma 4 E2B

Real token counts via each backend's own tokenizer (no chars/4 approximation). Median of 3 timed iterations after one warmup. Same prompt ("Write a single-sentence Swift fact in under 30 words."), temperature: 0.0, maximumResponseTokens: 80.

| Hardware | Runtime | Quant | TTFT | Decode tok/sec | tok/sec gap |
|---|---|---|---|---|---|
| Apple M4 Max (macOS 26.0) | CoreML / ANE | FP16-ish | 246 ms | 32.8 | |
| Apple M4 Max (macOS 26.0) | MLX / GPU | 4-bit | 29 ms | 172.8 | 5.3× MLX |
| iPhone Air (iPhone18,1, iOS 26.4.2) | CoreML / ANE | FP16-ish | 661 ms | 34.6 | |
| iPhone Air (iPhone18,1, iOS 26.4.2) | MLX / GPU | 4-bit | 84 ms | 45.2 | 1.31× MLX |

Two observations the chart kept hiding:

  • CoreML decode rate is hardware-flat. Mac M4 Max ANE (32.8 tok/s) and iPhone Air ANE (34.6 tok/s) decode Gemma 4 E2B at essentially the same speed. The Neural Engine is bandwidth-bound on this workload, not compute-bound — and the bandwidth budget is similar on both chips.
  • MLX scales with the GPU. Mac 4-bit GPU (172.8 tok/s) is 3.8× faster than iPhone 4-bit GPU (45.2 tok/s). The MLX-vs-CoreML decode gap therefore widens from 1.31× on iPhone to 5.3× on M4 Max — same model, same prompt, just more GPU.

Methodology, Qwen3.5 numbers, sideload instructions →

30-second value prop

```swift
import PrivateFoundationModels
import PrivateFoundationModelsApple    // iOS 26+ — Apple Intelligence
import PrivateFoundationModelsCoreML   // iOS 18+ — Apple Neural Engine
import PrivateFoundationModelsMLX      // iOS 17+ — Apple GPU, any mlx-community/* model

// Pick a backend at startup. Everything below this is byte-identical to Apple's
// FoundationModels framework.
if #available(iOS 26.0, macOS 26.0, *), AppleFoundationModel.isAvailable {
    SystemLanguageModel.default = SystemLanguageModel(backend: AppleFoundationModel.load())
} else {
    SystemLanguageModel.default = SystemLanguageModel(
        backend: try await CoreMLLanguageModel.load(.lfm2_5_350M))
}

let session = LanguageModelSession(instructions: Instructions("Be brief."))
print(try await session.respond(to: "Capital of France?").content)
// "The capital of France is Paris."  — from Apple's actual on-device model on iOS 26,
// or from LFM2.5-350M on the Apple Neural Engine on iOS 18. Your call site doesn't know.
```

@Generable, Tool, @PromptBuilder, streaming, transcripts — all the Apple FM 26 surface, end-to-end verified across all three backends (see Verified below).
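As a taste of that surface, here is a minimal structured-output sketch using the `@Generable` macro and `respond(to:generating:)` from the compatibility list. `CityFact` and its fields are illustrative, not part of the package; the shape follows Apple's FoundationModels conventions, so double-check against the package's own tests before relying on it:

```swift
import PrivateFoundationModels

// Illustrative sketch: CityFact is a made-up type, not shipped by PFM.
@Generable
struct CityFact {
    @Guide(description: "City name")
    var city: String
    @Guide(description: "One-sentence fact, under 30 words")
    var fact: String
}

let session = LanguageModelSession(instructions: Instructions("Be brief."))
let response = try await session.respond(
    to: "Tell me a fact about Paris.",
    generating: CityFact.self
)
print(response.content.fact)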

The story

Apple shipped FoundationModels with iOS 26. It only runs on iOS 26. It only runs Apple's 3 B on-device model. If you ship an app that has to run today on iOS 18 — or you want to use your own model — you're stuck.

PFM is the iOS 18 polyfill that becomes a runtime passthrough on iOS 26. The same Apple-FM-shaped code compiles unchanged and runs against:

| Backend | Product | iOS | Model |
|---|---|---|---|
| Apple FoundationModels | PrivateFoundationModelsApple | iOS 26+ | Apple's 3 B on-device LLM (no download, ships in the OS) |
| CoreML / Apple Neural Engine | PrivateFoundationModelsCoreML | iOS 18+ | LFM2.5, Gemma 4, Qwen3.5, Qwen3-VL, FunctionGemma, EmbeddingGemma |
| MLX / Apple GPU | PrivateFoundationModelsMLX | iOS 17+ | Any mlx-community/* repo: Llama, Qwen, Gemma, Mistral, Phi, plus VLMs |

The day your deployment target reaches iOS 26 you can either:

  • s/PrivateFoundationModels/FoundationModels/ and delete the package, or
  • Keep it for the older-OS support and the bring-your-own-model story.

Either way your @Generable types, Tool instances, and respond(...) call sites don't change.

Install

```swift
// Package.swift
.package(url: "https://github.com/john-rocky/PrivateFoundationModels", from: "0.10.4"),
```

Pick the backend products you need. Everything is pure SPM; no model files in the repo (they download on first call).

Tutorial

The 5-minute walkthrough — swift package init to streaming @Generable: docs/TUTORIAL.md.

Already on Apple FM and want to backport to iOS 18: docs/MIGRATING_FROM_APPLE_FM.md — a five-step recipe.

OpenAI-compatible local HTTP API (v0.7.0+)

Expose any PFM backend over the OpenAI HTTP shape so non-Swift codebases (Python, Node, curl, the official OpenAI SDKs) can drive Apple's on-device model unchanged:

```sh
swift run -c release pfm-serve-apple
# [pfm-serve] listening on http://127.0.0.1:11434
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="not-required")

resp = client.chat.completions.create(
    model="apple-fm",
    messages=[{"role": "user", "content": "Capital of France?"}],
)
# resp.choices[0].message.content == "The capital of France is Paris."
```

Implemented endpoints: POST /v1/chat/completions (with SSE streaming, tool calling, vision content arrays, JSON mode), POST /v1/completions, POST /v1/embeddings, GET /v1/models, GET /healthz, full CORS for browser fetch(). Multi-model loading (Ollama-style) since v0.10.0:

```sh
pfm-serve-mlx \
  --model mlx-community/Qwen3.5-0.8B-MLX-4bit \
  --model mlx-community/FastVLM-0.5B-bf16 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2
```

End-to-end verified against the official openai==2.36 SDK including streaming tool calls and embeddings. Demos in Examples/PythonClient/.

Benchmarks

Standardized pfm-bench harness with median-of-3 + warmup. Apples-to-apples cross-runtime numbers on M4 Max and iPhone Air, multi-language coverage across en/es/ko/ja/zh, contributable from any Mac with one command:

```sh
swift run -c release pfm-bench-apple  --csv-append docs/BENCHMARKS.csv
swift run -c release pfm-bench-coreml --csv-append docs/BENCHMARKS.csv --model qwen3.5-0.8B
# MLX needs xcodebuild
$(find ~/Library/Developer/Xcode/DerivedData -name pfm-bench-mlx -path '*Release*' -type f | head -1) \
  --csv-append docs/BENCHMARKS.csv
```

docs/BENCHMARKS.csv grows per-contributor — both M4 Max and iPhone18,1 (iPhone Air) baselines are already in there. Add your own iPhone via the Examples/PFMiPhoneBench/ one-tap iOS app (auto-starts, AirDrop the CSV out, PR the diff).


Verified

Captured on Apple M4 Max / macOS 26.0 / Xcode 26.1.1, against mlboydaisuke/lfm2.5-350m-coreml, mlx-community/Qwen3.5-0.8B-MLX-4bit, mlx-community/FastVLM-0.5B-bf16, sentence-transformers/all-MiniLM-L6-v2, and Apple's own on-device model:

| Harness | What it proves | Result |
|---|---|---|
| `swift test` | Session logic, schema decoder, tool dispatch, error wrapping — stub-backed for determinism | 94 / 94 pass |
| `pfm-verify` | Every public API path against a real CoreML model | 10 / 10 pass (log) |
| `pfm-portability` | Real Apple-FM-shaped code compiled and ran unchanged | 8 / 8 pass (log) |
| `pfm-deep` | Every Generable shape × Tool pattern against CoreML | PASS 7 / MODEL 4 / FAIL 0 (log) |
| `pfm-mlx-deep` | Same matrix routed through MLX-Swift | PASS 9 / MODEL 5 / FAIL 0 (log) |
| `pfm-apple-deep` | Same matrix through Apple's native FoundationModels | PASS 14 / MODEL 0 / FAIL 0 (log) |
| `pfm-apple-smoke` | respond + streamResponse + Generable through Apple FM | ✓ load 0 s · respond 0.7 s · stream (log) |
| `pfm-vision-sample` | OpenAI content array → MLX VLM (FastVLM-0.5B) end-to-end | ✓ identified red top-left, green top-right (log) |
| `pfm-embeddings-sample` | OpenAI /v1/embeddings → MLXEmbedder (MiniLM-L6-v2) | ✓ 384-dim, semantic ranking correct (log) |

Plus 6 captured runs through the openai Python SDK driving the HTTP server — chat, streaming, function calling, streaming tool calls, vision content arrays, embeddings — all in Examples/PythonClient/.

Bring your own backend

LanguageModelBackend is two methods (generate + streamGenerate) plus an availability property. Route to llama.cpp, a remote API, your own runtime — see Sources/PrivateFoundationModels/LanguageModelBackend.swift.
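A hypothetical conformance might look like the sketch below. The real requirements live in `Sources/PrivateFoundationModels/LanguageModelBackend.swift`; `EchoBackend`, the parameter names, and the `AsyncThrowingStream` return shape are assumptions for illustration, not the package's actual signatures:

```swift
import PrivateFoundationModels

// Illustrative sketch only: check LanguageModelBackend.swift for the
// real method signatures. EchoBackend is a made-up example backend.
struct EchoBackend: LanguageModelBackend {
    var isAvailable: Bool { true }

    func generate(prompt: String, options: GenerationOptions) async throws -> String {
        // Route to llama.cpp, a remote API, or your own runtime here.
        "echo: \(prompt)"
    }

    func streamGenerate(prompt: String, options: GenerationOptions) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            continuation.yield("echo: \(prompt)")
            continuation.finish()
        }
    }
}

// Then wire it up once at startup:
// SystemLanguageModel.default = SystemLanguageModel(backend: EchoBackend())
```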

Compatibility with FoundationModels

PFM mirrors Apple's FoundationModels API surface as of WWDC 2025 / iOS 26.1:

  • LanguageModelSession: respond(to:), respond(to:generating:), streamResponse(to:), streamResponse(to:generating:), prewarm(), transcript, isResponding, image: overloads.
  • Instructions, GenerationOptions, SamplingMode.
  • Response<Content>, ResponseStream<Content> (AsyncSequence with Snapshot).
  • Transcript + Transcript.Entry (Codable).
  • Tool protocol, AnyTool type-erased wrapper, two-turn tool calling.
  • Generable protocol + macro, GenerationSchema, @Guide(description:).
  • SystemLanguageModel + Availability + UnavailableReason, UseCase, Adapter.
  • Prompt + @PromptBuilder + @InstructionsBuilder.
  • Guardrails (default accept-all; Apple FM passthrough delegates to Apple's).
  • GenerationError with cases matching Apple's where they exist.

If you find a method or initializer in Apple's docs that PFM doesn't ship, open an issue.
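For instance, two-turn tool calling from the list above can be sketched like this. `WeatherTool` is a made-up example, and the `call(arguments:)` return type here is a simplification; consult Apple's Tool protocol documentation (and PFM's tests) for the exact shape:

```swift
import PrivateFoundationModels

// Illustrative Tool conformance: names and the String return type are
// assumptions, not verified against the package's exact surface.
struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Look up the current temperature for a city."

    @Generable
    struct Arguments {
        @Guide(description: "City to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Stub; call a real weather API here.
        "22 °C and sunny in \(arguments.city)"
    }
}

let session = LanguageModelSession(
    tools: [WeatherTool()],
    instructions: Instructions("Use tools when asked about weather.")
)
let answer = try await session.respond(to: "How warm is it in Tokyo?")
```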

What this package is not

  • Not affiliated with Apple. "Foundation Models" is Apple's trademark; this is an API-compatible alternative.
  • Not a model. It's a thin Swift surface that delegates to whatever backend you wire up.
  • Not a grammar-constrained sampler on CoreML / MLX. @Generable is enforced via system-prompt + post-processing; on retry the schema is re-injected. Apple FM uses Apple's native grammar sampler. Grammar-constrained MLX sampling is on the roadmap.

Examples

Roadmap

The current head is v0.10.4. Full version history in CHANGELOG.md. Next on the list:

  • Grammar-constrained sampling on MLX (closes the last "Not a..." disclaimer above).
  • Qwen3-VL stateful routing on CoreML.
  • llama.cpp / GGUF backend.
  • Multi-machine bench fill-in (M1 / M2 / M3 / iPhone / iPad / Vision Pro) — see CONTRIBUTING.md.

Author

Daisuke Majima (@JackdeS11) — founder of Pebble Inc., maintainer of CoreML-Models (1.7k★), CoreML-LLM, and the mlboydaisuke Apple Silicon model collection.

Open to consulting on Apple Silicon LLM inference and on-device deployment — pebble.co.jp.

License

MIT. See LICENSE. Model weights inherit their own licenses (Gemma: Gemma Terms; Qwen: Apache 2.0; LFM2.5: LFM Open License v1.0).

About

Apple FoundationModels API on iOS 18+. Same call site, native passthrough on iOS 26 (Apple Intelligence), CoreML / MLX backends on older OSes. Drop-in source compatible.
