Add Jobs guide: Process Large Datasets by davanstrien · Pull Request #2522 · huggingface/hub-docs

davanstrien · 2026-06-02T14:39:10Z

What

Adds a new Jobs documentation page — Process Large Datasets (docs/hub/jobs-large-datasets.md) — under the Jobs section of the nav, plus an inbound pointer from jobs-configuration's Volumes section.

It's a task-oriented guide for working with datasets larger than a Job's ephemeral disk, opening with a short decision rule and then covering each approach:

- Stream with datasets (streaming=True): links the Stream guide and the 100× more efficient blog
- Read & filter over hf:// directly with Polars/DuckDB/pandas (native readers, no mount)
- Mount a dataset/model/bucket for tools that expect local file paths
- Save results to a Storage Bucket (DuckDB COPY straight from hf:// to the mount) or push_to_hub
- A Common Crawl worked example: stream a WET shard straight from hf:// → DuckDB, no download
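For reviewers who want a feel for the "read & filter over hf://" approach without opening the rendered page, here is a minimal sketch. It is not the snippet from the doc: the repo id `some-user/some-dataset`, the `data/*.parquet` layout, and the column name passed to `count_rows_by` are hypothetical placeholders, and it assumes a recent Polars release with native `hf://` support.

```python
def hf_parquet_glob(repo_id: str, pattern: str = "data/*.parquet") -> str:
    """Build an hf:// path for a dataset repo (repo id and layout are placeholders)."""
    return f"hf://datasets/{repo_id}/{pattern}"


def count_rows_by(path: str, column: str):
    """Lazily scan Parquet files over hf:// and count rows per value of `column`."""
    import polars as pl  # imported here so the path helper works without Polars installed

    return (
        pl.scan_parquet(path)          # lazy: nothing is read yet
        .group_by(column)
        .agg(pl.len().alias("rows"))   # only the grouping column has to be fetched
        .collect()                     # the scan runs here, over the network
    )


# Offline-safe: only builds the path string.
print(hf_parquet_glob("some-user/some-dataset"))
```

Calling `count_rows_by(hf_parquet_glob("some-user/some-dataset"), "language")` inside a Job would run the scan; because the query is lazy, Polars can skip columns the aggregate does not need rather than pulling whole files.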

It links the canonical reference pages (jobs-pricing, jobs-configuration#volumes, storage-buckets-access, datasets/stream, per-library dataset pages) rather than restating them.

Verification

Every snippet was run against live Jobs. Timings that shaped the page (same aggregate query, same ~28 GB / 14-file Parquet glob):

| Read path | Time |
| --- | --- |
| Polars native `hf://` scan | ~4 min (in the doc) |
| DuckDB native `hf://` | ~18 min |
| DuckDB via `register_filesystem(HfFileSystem())` | did not finish (killed at 30-min default timeout) |
| Polars over a `-v` repo mount | ~6 min per file (~85 min extrapolated) |

Hence the page's structure: hf:// native readers for scans, mounts for whole-file/local-path access, explicit --timeout guidance for network-bound scans. The Common Crawl example runs verbatim in ~1 min on the default CPU flavor with no token; its printed output is included on the page.
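The arithmetic behind the extrapolation and the --timeout guidance can be sketched as follows. The figures are the rounded ones from the table above, so the mount estimate comes out at 84 min, consistent with the ~85 min reported; the helper name `needs_longer_timeout` is purely illustrative.

```python
DEFAULT_TIMEOUT_MIN = 30   # default Job timeout quoted above
N_FILES = 14               # files in the ~28 GB Parquet glob
MOUNT_MIN_PER_FILE = 6     # rounded per-file time through the -v repo mount

# Approximate wall-clock minutes per read path (rounded values from the table).
measured_min = {
    "Polars native hf:// scan": 4,
    "DuckDB native hf://": 18,
    "Polars over a -v repo mount": N_FILES * MOUNT_MIN_PER_FILE,  # extrapolated
}


def needs_longer_timeout(minutes: int, timeout: int = DEFAULT_TIMEOUT_MIN) -> bool:
    """True when a run would hit the timeout unless --timeout is raised."""
    return minutes >= timeout


for name, minutes in measured_min.items():
    verdict = "raise --timeout" if needs_longer_timeout(minutes) else "fits the default"
    print(f"{name}: ~{minutes} min ({verdict})")
```

Only the mount path exceeds the default here; the `register_filesystem(HfFileSystem())` run is left out because it has no measured completion time.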

Notes

- Draft to preview rendering.
- Touches the same _toctree.yml region as Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports) #2554; whichever merges second needs a trivial rebase.

New docs/hub/jobs-large-datasets.md: how to read, filter, and process datasets larger than a Job's ephemeral disk — streaming, mounting + lazy reads, reading/filtering over hf:// with Polars/DuckDB, saving results to a bucket, and a Common Crawl worked example. Links canonical pages rather than restating them; adds an inbound pointer from jobs-configuration#volumes.

HuggingFaceDocBuilderDev · 2026-06-02T14:42:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…sults

Every example was run against real Jobs. Key changes from what testing showed:

- hf:// section: native Polars scan (verified ~4 min for the 28 GB sample aggregate on the default CPU flavor) replaces the register_filesystem(HfFileSystem()) DuckDB example, which could not finish the same scan within the 30-minute default timeout. Native hf:// stays for datasets; HfFileSystem noted for buckets.
- Mount section: reframed for whole-file/local-path access; large Parquet scans pointed at hf:// instead (measured several times slower through the mount). Added the missing PEP 723 header to the script example.
- Save results: hosts the DuckDB COPY-to-bucket pattern (out-of-core write), scaled from 100BT to 10BT, with --timeout shown and a note that it transfers the full text column.
- Added --timeout guidance (long scans are network-bound), real output for the Common Crawl example (verified verbatim, ~1 min, no token needed), hf jobs hardware mention, and Title Case/heading consistency.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

davanstrien and others added 2 commits June 2, 2026 15:39

small edits

a379bf9

davanstrien wants to merge 3 commits into main from jobs-large-datasets-doc
