feat: auto-configure MTP models with metadata-based detection#6

Open: offbyonebit wants to merge 2 commits into main from mtp-auto-config


@offbyonebit (Owner)

Add three MTP-aware behaviors that inspect GGUF metadata (not filenames):

  1. MTP head detection: read `nextn_predict_layers` and the architecture from the GGUF kv store to determine whether a file actually contains MTP heads. Auto-enable `--spec-type draft-mtp` and `-ub 8` at add/scan time.

  2. Safety wiring at launch time:

    • Auto-inject `-ub 8` for any model with MTP heads (prevents SSM compute-buffer OOM during speculative decode verification batches).
    • Warn if the user explicitly set `spec_type=draft-mtp` on a GGUF that lacks MTP heads.
  3. Backend recommendation for hybrid SSM+attention MTP models on Xe2 (Battlemage, Lunar Lake): log a note that SYCL MTP is net-negative here due to GDN serial state passes, and suggest Vulkan instead for a gain of about 9%.

Also adds:

  • `mtp-info <path.gguf>` CLI command for quick diagnostics
  • Admin edit endpoint accepts `spec_type` and `ubatch_size`
  • `gguf>=0.10` as a hard dependency

Add support for discovering and serving models from Ollama instances
and OpenAI-compatible endpoints alongside locally managed models.

- New UpstreamConfig in config.toml for explicit upstream URLs
  ([[upstreams]] entries with url = "http://127.0.0.1:11435", name = "proxy")
- discover_ollama() probes ports 11434-11436 for /api/tags
- discover_openai_endpoints() scans ports 8080-8088, 18080 for /v1/models
  (opt-in via --discover-openai or configured upstreams)
- discover_upstreams() combines all sources with dedup
- Server merges upstream models into /v1/models and /admin/status,
  forwards chat/completion requests to the matching upstream
- Web UI shows upstream models with source pill (ollama/upstream)
- CLI: arc-llama add-upstream <url>, arc-llama scan --discover-openai
- Startup cache-warm prevents first-request latency
- Also includes: launcher log file handle fix, router fast-path optimization
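
The "combines all sources with dedup" step could look roughly like the sketch below. It is illustrative only and not the code in this PR: `Upstream`, `normalize`, and `merge_upstreams` are hypothetical names, and both the first-source-wins priority and the treatment of `localhost` as `127.0.0.1` are assumptions.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Upstream:
    url: str
    source: str  # e.g. "ollama" or "upstream", like the Web UI source pill


def normalize(url: str) -> str:
    """Reduce a URL to scheme://host:port so equivalent spellings compare equal."""
    parts = urlsplit(url if "://" in url else f"http://{url}")
    host = (parts.hostname or "").lower()
    if host == "localhost":
        # Assumption: localhost and 127.0.0.1 name the same endpoint.
        host = "127.0.0.1"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"{parts.scheme}://{host}:{port}"


def merge_upstreams(*sources: list) -> list:
    """Merge sources in priority order, keeping the first entry per endpoint."""
    seen = {}
    for source in sources:
        for upstream in source:
            seen.setdefault(normalize(upstream.url), upstream)
    return list(seen.values())


configured = [Upstream("http://127.0.0.1:11435", "upstream")]
probed = [
    Upstream("http://localhost:11435/", "ollama"),  # same endpoint as above
    Upstream("http://127.0.0.1:11434", "ollama"),
]
merged = merge_upstreams(configured, probed)
```

Passing explicitly configured upstreams first means a probe that rediscovers the same host and port does not replace the configured entry.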

Add three MTP-aware behaviors that inspect GGUF metadata (not filenames):

1. MTP head detection — read nextn_predict_layers and architecture from
   the GGUF kv store to determine whether a file actually contains MTP
   heads. Auto-enable --spec-type draft-mtp and -ub 8 at add/scan time.

2. Safety wiring at launch time:
   • Auto-inject -ub 8 for any model with MTP heads (prevents SSM
     compute-buffer OOM during speculative decode verification batches).
   • Warn if the user explicitly set spec_type=draft-mtp on a GGUF that
     lacks MTP heads.

3. Backend recommendation for hybrid SSM+attention MTP models on Xe2
   (Battlemage, Lunar Lake): log a note that SYCL MTP is net-negative
   here due to GDN serial state passes, and suggest Vulkan instead for
   a gain of about 9%.

Also add an 'mtp-info <path.gguf>' CLI command for quick diagnostics,
and extend the admin edit endpoint to accept spec_type and ubatch_size.

New dependency: gguf>=0.10 (hard dep, not optional).
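
For reviewers, the detection and launch-time rules above can be sketched as follows. This is a minimal illustration, not the implementation in this PR. It works on a plain dict standing in for the GGUF kv store, so reading the file with the `gguf` package is left out. The helper names `mtp_head_count` and `safety_wire`, the `<architecture>.nextn_predict_layers` key layout, and the choice to leave a user-supplied ubatch size untouched are all assumptions.

```python
from typing import Optional

MTP_UBATCH = 8  # the -ub value named in the description


def mtp_head_count(metadata: dict) -> int:
    """Return the number of MTP layers recorded in the metadata, or 0."""
    arch = metadata.get("general.architecture")
    if not arch:
        return 0
    # Assumed key layout: "<architecture>.nextn_predict_layers".
    value = metadata.get(f"{arch}.nextn_predict_layers", 0)
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        return 0


def safety_wire(metadata: dict,
                spec_type: Optional[str] = None,
                ubatch_size: Optional[int] = None):
    """Build extra launch args and warnings from metadata and user settings."""
    args = []
    warnings = []
    has_mtp = mtp_head_count(metadata) > 0

    if spec_type:
        args += ["--spec-type", spec_type]
    if spec_type == "draft-mtp" and not has_mtp:
        warnings.append("spec_type=draft-mtp is set, but the GGUF has no MTP heads")

    # Assumption: an explicit user ubatch size is respected; otherwise inject 8.
    if has_mtp and ubatch_size is None:
        ubatch_size = MTP_UBATCH
    if ubatch_size is not None:
        args += ["-ub", str(ubatch_size)]
    return args, warnings


# "examplearch" is a placeholder architecture name, not a real model family.
with_mtp = {"general.architecture": "examplearch",
            "examplearch.nextn_predict_layers": 1}
without_mtp = {"general.architecture": "examplearch"}

print(safety_wire(with_mtp, "draft-mtp"))
print(safety_wire(without_mtp, "draft-mtp"))
```

Because the decision depends only on metadata values, a file whose name mentions MTP but whose kv store lacks the layer count is treated as having no MTP heads.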