Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554
Open
davanstrien wants to merge 3 commits into
Open
Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554davanstrien wants to merge 3 commits into
davanstrien wants to merge 3 commits into
Conversation
All commands and client snippets verified against live jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mounting the model repo read-only skips the download (verified: 31s vs 4m15s to listening for the same model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Wauplin
approved these changes
Jun 16, 2026
Wauplin
left a comment
Contributor
There was a problem hiding this comment.
Looks good, considering the Jobs commands (vllm and llama server) have been tested
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important
Depends on #2551 — this guide builds on the Expose Ports documentation added there (it links to
jobs-configuration#expose-ports). Best merged after it.Adds a new Serve Models guide page (
docs/hub/jobs-serving.md+ toctree entry) showing how to use the new--exposefeature (huggingface_hub 1.19.0) to run a Job as a temporary inference server.What's covered
vllm serve LiquidAI/LFM2.5-8B-A1Bon a GPU flavor with--expose 8000, following startup withhf jobs logs -f/v1URL pattern; notes the token travels in theAuthorizationheader (works from scripts/notebooks/agents, not directly in a browser)llama-serverpulling the Gemma 4 12B GGUF straight from the Hub with-hf, using Gemma's recommended sampling settingsdocker run-style entrypoint args), servers must bind0.0.0.0,-s HF_TOKENfor authenticated model downloads,--separator when the server's flags would collide with the CLI'sTesting
Every command and client snippet was run verbatim against live Jobs — servers reached ready and returned completions through the exposed-port proxy.
A task-titled guide page ("Serve Models") rather than a section under Examples, so people — and agents — searching for "serve a model on Jobs" land directly on it. Happy to fold it elsewhere if you'd prefer.
🤖 Generated with Claude Code
Note
Low Risk
Documentation-only changes with no runtime or API impact.
Overview
Adds a dedicated Hub docs page Serve Models under Jobs, wired into
_toctree.ymlbetween Popular Images and Examples.The guide explains using
--exposeso a Job runs as a short-lived OpenAI-compatible inference endpoint: vLLM on a GPU flavor with-s HF_TOKEN, client setup (PythonOpenAI+curlwith Bearer token and/v1base URL), and llama.cppllama-serverfor Hub GGUF models (including--for CLI flag separation, optionalhf://volume mount, and0.0.0.0binding). It contrasts ephemeral Jobs serving with Inference Endpoints, covers readiness via logs, cancellation/timeouts, and noteshuggingface_hub≥ 1.19.0 plus exposed-port billing.Reviewed by Cursor Bugbot for commit 54d8156. Bugbot is set up for automated code reviews on this repo. Configure here.