feat: auto-configure MTP models with metadata-based detection by offbyonebit · Pull Request #6 · offbyonebit/arc-llama

offbyonebit · 2026-05-21T14:50:37Z

Add three MTP-aware behaviors that inspect GGUF metadata (not filenames):

MTP head detection: read nextn_predict_layers and architecture from the GGUF kv store to determine whether a file actually contains MTP heads. Auto-enable --spec-type draft-mtp and -ub 8 at add/scan time.
Safety wiring at launch time:
- Auto-inject -ub 8 for any model with MTP heads (prevents SSM compute-buffer OOM during speculative decode verification batches).
- Warn if the user explicitly set spec_type=draft-mtp on a GGUF that lacks MTP heads.
Backend recommendation for hybrid SSM+attention MTP models on Xe2 (Battlemage, Lunar Lake): log a note that SYCL MTP is net-negative here due to GDN serial state passes, suggesting Vulkan for ~+9%.
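
The detection step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the GGUF key/value metadata has already been read into a plain dict, and that the MTP layer count is stored under an architecture-prefixed key of the form <arch>.nextn_predict_layers next to general.architecture. The function names and the "examplearch" architecture are hypothetical.

```python
from typing import Any, Dict, Mapping


def mtp_head_count(kv: Mapping[str, Any]) -> int:
    """Return the number of MTP layers declared in the metadata, or 0.

    Assumption: the count lives under "<arch>.nextn_predict_layers",
    where <arch> is the value of "general.architecture".
    """
    arch = kv.get("general.architecture")
    if not isinstance(arch, str) or not arch:
        return 0
    value = kv.get(f"{arch}.nextn_predict_layers", 0)
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        # Treat unreadable or malformed metadata as "no MTP heads".
        return 0


def suggested_defaults(kv: Mapping[str, Any]) -> Dict[str, Any]:
    """Settings to auto-enable at add/scan time (hypothetical helper)."""
    if mtp_head_count(kv) == 0:
        return {}
    return {"spec_type": "draft-mtp", "ubatch_size": 8}


if __name__ == "__main__":
    with_mtp = {
        "general.architecture": "examplearch",
        "examplearch.nextn_predict_layers": 1,
    }
    print(suggested_defaults(with_mtp))
    print(suggested_defaults({"general.architecture": "examplearch"}))
```

Keying the decision on metadata rather than the filename means a renamed file, or a quantization that dropped the MTP tensors, is classified by what it actually declares.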

Also adds:

mtp-info <path.gguf> CLI command for quick diagnostics
Admin edit endpoint accepts spec_type and ubatch_size
gguf>=0.10 as a hard dependency

Add support for discovering and serving models from Ollama instances and OpenAI-compatible endpoints alongside locally managed models.

- New UpstreamConfig in config.toml for explicit upstream URLs ([[upstreams]] url = "http://127.0.0.1:11435" name = "proxy")
- discover_ollama() probes ports 11434-11436 for /api/tags
- discover_openai_endpoints() scans ports 8080-8088, 18080 for /v1/models (opt-in via --discover-openai or configured upstreams)
- discover_upstreams() combines all sources with dedup
- Server merges upstream models into /v1/models and /admin/status, forwards chat/completion requests to the matching upstream
- Web UI shows upstream models with source pill (ollama/upstream)
- CLI: arc-llama add-upstream <url>, arc-llama scan --discover-openai
- Startup cache-warm prevents first-request latency
- Also includes: launcher log file handle fix, router fast-path optimization
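
As an illustration of the "combines all sources with dedup" step, here is one way such a merge could look. This is a sketch under stated assumptions and not the discover_upstreams() implementation from the commit: the Upstream dataclass, normalize(), and merge_upstreams() are hypothetical names, and treating localhost as 127.0.0.1 is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Upstream:
    url: str
    name: str
    source: str  # "ollama" or "upstream", matching the Web UI source pill


def normalize(url: str) -> str:
    """Reduce a URL to scheme://host:port so equivalent spellings compare equal."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "localhost":
        host = "127.0.0.1"  # assumption: loopback names refer to the same server
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"{parts.scheme}://{host}:{port}"


def merge_upstreams(*sources: Iterable[Upstream]) -> List[Upstream]:
    """Merge sources in priority order; the first entry for a URL wins."""
    seen: Dict[str, Upstream] = {}
    for group in sources:
        for upstream in group:
            seen.setdefault(normalize(upstream.url), upstream)
    return list(seen.values())


if __name__ == "__main__":
    configured = [Upstream("http://127.0.0.1:11435", "proxy", "upstream")]
    probed = [
        Upstream("http://localhost:11435/", "ollama-11435", "ollama"),
        Upstream("http://127.0.0.1:11434", "ollama-11434", "ollama"),
    ]
    for upstream in merge_upstreams(configured, probed):
        print(upstream.name, upstream.url)
```

Passing the explicitly configured upstreams first lets a config.toml entry take precedence over a port-probe hit for the same address.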

Add three MTP-aware behaviors that inspect GGUF metadata (not filenames):

1. MTP head detection: read nextn_predict_layers and architecture from the GGUF kv store to determine whether a file actually contains MTP heads. Auto-enable --spec-type draft-mtp and -ub 8 at add/scan time.
2. Safety wiring at launch time:
   • Auto-inject -ub 8 for any model with MTP heads (prevents SSM compute-buffer OOM during speculative decode verification batches).
   • Warn if the user explicitly set spec_type=draft-mtp on a GGUF that lacks MTP heads.
3. Backend recommendation for hybrid SSM+attention MTP models on Xe2 (Battlemage, Lunar Lake): log a note that SYCL MTP is net-negative here due to GDN serial state passes, suggesting Vulkan for ~+9%.

Also add an 'mtp-info <path.gguf>' CLI command for quick diagnostics, and extend the admin edit endpoint to accept spec_type and ubatch_size.

New dependency: gguf>=0.10 (hard dep, not optional).
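
The launch-time safety wiring in item 2 could be sketched like this. It is a hedged illustration rather than the commit's code: wire_launch_args() is a hypothetical helper, and leaving an existing -ub / --ubatch-size value untouched is an assumption, since the commit message does not say whether a user-supplied value is overridden.

```python
from typing import List, Optional, Sequence, Tuple


def wire_launch_args(
    args: Sequence[str],
    has_mtp_heads: bool,
    spec_type: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (launch args, warnings) after applying the MTP safety rules."""
    out = list(args)
    warnings: List[str] = []

    # Rule 1: models with MTP heads get a small micro-batch so verification
    # batches during speculative decoding do not exhaust the compute buffer.
    # Assumption: an explicit user-provided value is left untouched.
    ubatch_already_set = "-ub" in out or "--ubatch-size" in out
    if has_mtp_heads and not ubatch_already_set:
        out += ["-ub", "8"]

    # Rule 2: warn when draft-mtp is requested but the file has no MTP heads.
    if spec_type == "draft-mtp" and not has_mtp_heads:
        warnings.append(
            "spec_type=draft-mtp was set, but this GGUF has no MTP heads"
        )

    return out, warnings


if __name__ == "__main__":
    print(wire_launch_args(["-m", "model.gguf"], has_mtp_heads=True))
    print(wire_launch_args(["-m", "model.gguf"], False, spec_type="draft-mtp"))
```

Returning the warnings instead of logging them directly keeps the function side-effect free, so the caller can decide how to surface them.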

offbyonebit added 2 commits May 16, 2026 10:16

offbyonebit wants to merge 2 commits into main from mtp-auto-config
