Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports) by davanstrien · Pull Request #2554 · huggingface/hub-docs

davanstrien · 2026-06-11T20:18:09Z

Important

Depends on #2551 — this guide builds on the Expose Ports documentation added there (it links to jobs-configuration#expose-ports). Best merged after it.

Adds a new Serve Models guide page (docs/hub/jobs-serving.md + toctree entry) showing how to use the new --expose feature (huggingface_hub 1.19.0) to run a Job as a temporary inference server.

What's covered

One-command vLLM server — vllm serve LiquidAI/LFM2.5-8B-A1B on a GPU flavor with --expose 8000, following startup with hf jobs logs -f
Connecting clients — OpenAI Python client and curl, with the HF token as the API key and the /v1 URL pattern; notes the token travels in the Authorization header (works from scripts/notebooks/agents, not directly in a browser)
llama.cpp — llama-server pulling the Gemma 4 12B GGUF straight from the Hub with -hf, using Gemma's recommended sampling settings
Positioning — when to use this vs Inference Endpoints (ephemeral vs managed)
Gotchas found while testing: Jobs runs the command directly (no docker run-style entrypoint args), servers must bind 0.0.0.0, -s HF_TOKEN for authenticated model downloads, -- separator when the server's flags would collide with the CLI's

Testing

Every command and client snippet was run verbatim against live Jobs — servers reached ready and returned completions through the exposed-port proxy.

A task-titled guide page ("Serve Models") rather than a section under Examples, so people — and agents — searching for "serve a model on Jobs" land directly on it. Happy to fold it elsewhere if you'd prefer.

🤖 Generated with Claude Code

Note

Low Risk
Documentation-only changes with no runtime or API impact.

Overview
Adds a dedicated Hub docs page Serve Models under Jobs, wired into _toctree.yml between Popular Images and Examples.

The guide explains using --expose so a Job runs as a short-lived OpenAI-compatible inference endpoint: vLLM on a GPU flavor with -s HF_TOKEN, client setup (Python OpenAI + curl with Bearer token and /v1 base URL), and llama.cpp llama-server for Hub GGUF models (including -- for CLI flag separation, optional hf:// volume mount, and 0.0.0.0 binding). It contrasts ephemeral Jobs serving with Inference Endpoints, covers readiness via logs, cancellation/timeouts, and notes huggingface_hub ≥ 1.19.0 plus exposed-port billing.

^{Reviewed by Cursor Bugbot for commit 54d8156. Bugbot is set up for automated code reviews on this repo. Configure here.}

All commands and client snippets verified against live jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

HuggingFaceDocBuilderDev · 2026-06-11T20:20:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Mounting the model repo read-only skips the download (verified: 31s vs 4m15s to listening for the same model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Wauplin

Looks good, considering the Jobs commands (vllm and llama server) have been tested

Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)

063f5c4

All commands and client snippets verified against live jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

davanstrien and others added 2 commits June 11, 2026 23:22

Switch llama.cpp example to Gemma 4 E4B for a faster demo

b32a9e8

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add volume-mount tip for faster server startup

54d8156

Mounting the model repo read-only skips the download (verified: 31s vs 4m15s to listening for the same model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

davanstrien marked this pull request as ready for review June 11, 2026 20:32

davanstrien requested review from Wauplin and julien-c June 11, 2026 20:32

davanstrien mentioned this pull request Jun 12, 2026

Add Jobs guide: Process Large Datasets #2522

Draft

Wauplin approved these changes Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554

Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports)#2554
davanstrien wants to merge 3 commits into
mainfrom
jobs-serving-guide

davanstrien commented Jun 11, 2026 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Jun 11, 2026

Uh oh!

Wauplin left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

davanstrien commented Jun 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's covered

Testing

Uh oh!

HuggingFaceDocBuilderDev commented Jun 11, 2026

Uh oh!

Wauplin left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davanstrien commented Jun 11, 2026 •

edited by cursor Bot

Loading