Skip to content

Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554

Open
davanstrien wants to merge 3 commits into
mainfrom
jobs-serving-guide
Open

Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554
davanstrien wants to merge 3 commits into
mainfrom
jobs-serving-guide

Conversation

@davanstrien

@davanstrien davanstrien commented Jun 11, 2026

Copy link
Copy Markdown
Member

Important

Depends on #2551 — this guide builds on the Expose Ports documentation added there (it links to jobs-configuration#expose-ports). Best merged after it.

Adds a new Serve Models guide page (docs/hub/jobs-serving.md + toctree entry) showing how to use the new --expose feature (huggingface_hub 1.19.0) to run a Job as a temporary inference server.

What's covered

  • One-command vLLM servervllm serve LiquidAI/LFM2.5-8B-A1B on a GPU flavor with --expose 8000, following startup with hf jobs logs -f
  • Connecting clients — OpenAI Python client and curl, with the HF token as the API key and the /v1 URL pattern; notes the token travels in the Authorization header (works from scripts/notebooks/agents, not directly in a browser)
  • llama.cppllama-server pulling the Gemma 4 12B GGUF straight from the Hub with -hf, using Gemma's recommended sampling settings
  • Positioning — when to use this vs Inference Endpoints (ephemeral vs managed)
  • Gotchas found while testing: Jobs runs the command directly (no docker run-style entrypoint args), servers must bind 0.0.0.0, -s HF_TOKEN for authenticated model downloads, -- separator when the server's flags would collide with the CLI's

Testing

Every command and client snippet was run verbatim against live Jobs — servers reached ready and returned completions through the exposed-port proxy.

A task-titled guide page ("Serve Models") rather than a section under Examples, so people — and agents — searching for "serve a model on Jobs" land directly on it. Happy to fold it elsewhere if you'd prefer.

🤖 Generated with Claude Code


Note

Low Risk
Documentation-only changes with no runtime or API impact.

Overview
Adds a dedicated Hub docs page Serve Models under Jobs, wired into _toctree.yml between Popular Images and Examples.

The guide explains using --expose so a Job runs as a short-lived OpenAI-compatible inference endpoint: vLLM on a GPU flavor with -s HF_TOKEN, client setup (Python OpenAI + curl with Bearer token and /v1 base URL), and llama.cpp llama-server for Hub GGUF models (including -- for CLI flag separation, optional hf:// volume mount, and 0.0.0.0 binding). It contrasts ephemeral Jobs serving with Inference Endpoints, covers readiness via logs, cancellation/timeouts, and notes huggingface_hub ≥ 1.19.0 plus exposed-port billing.

Reviewed by Cursor Bugbot for commit 54d8156. Bugbot is set up for automated code reviews on this repo. Configure here.

All commands and client snippets verified against live jobs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

davanstrien and others added 2 commits June 11, 2026 23:22
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mounting the model repo read-only skips the download (verified: 31s vs
4m15s to listening for the same model).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@davanstrien davanstrien marked this pull request as ready for review June 11, 2026 20:32
@davanstrien davanstrien requested review from Wauplin and julien-c June 11, 2026 20:32

@Wauplin Wauplin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, considering the Jobs commands (vllm and llama server) have been tested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants