Skip to content

Add Apple Silicon support via vllm-metal#2

Merged
alchemystack merged 3 commits intoalchemystack:mainfrom
ereid7:fix/apple-silicon-support
Mar 8, 2026
Merged

Add Apple Silicon support via vllm-metal#2
alchemystack merged 3 commits intoalchemystack:mainfrom
ereid7:fix/apple-silicon-support

Conversation

@ereid7
Copy link
Copy Markdown
Contributor

@ereid7 ereid7 commented Mar 1, 2026

Summary

  • Adds Apple Silicon (macOS) support via vllm-metal — same plugin, same API, no code fork
  • 3-line fix in processor.py: always call .cpu() before .numpy() so MPS tensors convert correctly
  • README updated with full Apple Silicon setup guide, Open WebUI standalone instructions, and a note about the required vllm-metal PR #124

What changed

src/qr_sampler/processor.py — The _to_numpy() method previously only called .cpu() for CUDA tensors. MPS (Metal) tensors hit the else branch where .numpy() fails because the tensor isn't on CPU. Fix: always call .cpu() before .numpy() (no-op on CPU tensors).

README.md — Added:

  • Apple Silicon setup section with MLX-format model example (mlx-community/Qwen3-0.6B-4bit)
  • Verification step (check server logs for QRSamplerLogitsProcessor initialized)
  • Both /v1/completions and /v1/chat/completions curl examples
  • Prerequisite note about vllm-metal PR #124 (required for plugin discovery)
  • Open WebUI standalone Docker instructions for Apple Silicon
  • Split Web UI section into NVIDIA/Linux and Apple Silicon paths

.gitignore — Added .webui_secret_key (generated by Open WebUI, should not be committed)

Testing

  • 308/308 unit tests pass
  • E2E verified on Apple Silicon: both /v1/completions and /v1/chat/completions return entropy-driven responses
  • Per-token sampling logs confirmed — full pipeline active (z-score, u-value, token selection via CDF)
  • Open WebUI verified working against vllm-metal server

Dependencies

Requires vllm-metal PR #124 to be merged for plugin discovery to work. Without it, vllm-metal silently skips custom logits processors. The PR is a 9-line patch that mirrors GPUModelRunner's existing pattern.

Always call .cpu() before .numpy() in _to_numpy() — MPS tensors are not
on CPU and the previous CUDA-only check missed them. .cpu() is a no-op
on CPU tensors so this is safe for all devices.

Add Apple Silicon setup docs to README with vllm-metal install steps.
@alchemystack
Copy link
Copy Markdown
Owner

Great, thanks!

@alchemystack alchemystack merged commit beb1bbc into alchemystack:main Mar 8, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants