Add Jobs guide: Process Large Datasets#2522

Draft: davanstrien wants to merge 3 commits into main from jobs-large-datasets-doc

davanstrien (Member) commented Jun 2, 2026

What

Adds a new Jobs documentation page — Process Large Datasets (docs/hub/jobs-large-datasets.md) — under the Jobs section of the nav, plus an inbound pointer from jobs-configuration's Volumes section.

It's a task-oriented guide for working with datasets larger than a Job's ephemeral disk, opening with a short decision rule and then covering each approach:

  • Stream with datasets (streaming=True): links the Stream guide and the "100× more efficient" blog post
  • Read & filter over hf:// directly with Polars/DuckDB/pandas (native readers, no mount)
  • Mount a dataset/model/bucket for tools that expect local file paths
  • Save results to a Storage Bucket (DuckDB COPY straight from hf:// to the mount) or push_to_hub
  • A Common Crawl worked example: stream a WET shard straight from hf:// → DuckDB, no download

It links the canonical reference pages (jobs-pricing, jobs-configuration#volumes, storage-buckets-access, datasets/stream, per-library dataset pages) rather than restating them.
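
As a rough illustration of the first approach in the list above, the sketch below filters a streamed dataset lazily so that only the kept rows are ever held in memory. It is not taken from the guide: `take_matching` and `stream_from_hub` are hypothetical helper names, and the repo id in the usage note is a placeholder. The only library call assumed is `datasets.load_dataset(..., streaming=True)`.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def take_matching(
    rows: Iterable[dict], keep: Callable[[dict], bool], limit: int
) -> Iterator[dict]:
    """Lazily yield up to `limit` rows that pass `keep`.

    Works on any iterable of dict rows, so the full dataset is never
    materialized on the Job's disk or in memory.
    """
    return islice((row for row in rows if keep(row)), limit)


def stream_from_hub(repo_id: str, split: str = "train") -> Iterable[dict]:
    """Open a Hub dataset in streaming mode (hypothetical wrapper)."""
    # Imported lazily so take_matching stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

Usage would look like `take_matching(stream_from_hub("user/dataset"), lambda r: len(r["text"]) > 1000, 100)`, where both the repo id and the `text` column are assumptions about the data.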

Verification

Every snippet was run against live Jobs. Timings that shaped the page (same aggregate query, same ~28 GB / 14-file Parquet glob):

| Read path | Time |
| --- | --- |
| Polars native hf:// scan | ~4 min (in the doc) |
| DuckDB native hf:// | ~18 min |
| DuckDB via register_filesystem(HfFileSystem()) | did not finish (killed at 30-min default timeout) |
| Polars over a -v repo mount | ~6 min per file (~85 min extrapolated) |
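
The extrapolated mount figure and the relative slowdowns follow directly from the timings above. A short worked check, using only the numbers quoted in this description (variable names are illustrative):

```python
# Timings quoted above, in minutes, for the same ~28 GB / 14-file Parquet glob.
FILES = 14
DEFAULT_TIMEOUT_MIN = 30

polars_native = 4      # Polars native hf:// scan
duckdb_native = 18     # DuckDB native hf://
mount_per_file = 6     # Polars over a -v repo mount, per file

# Extrapolate the mount timing to the whole glob: 6 min x 14 files.
mount_total = mount_per_file * FILES

# Slowdown of each path relative to the Polars native scan.
duckdb_slowdown = duckdb_native / polars_native
mount_slowdown = mount_total / polars_native

for name, minutes in [
    ("Polars native hf://", polars_native),
    ("DuckDB native hf://", duckdb_native),
    ("Polars over mount (extrapolated)", mount_total),
]:
    fits = minutes <= DEFAULT_TIMEOUT_MIN
    print(f"{name}: {minutes} min, fits default timeout: {fits}")
```

The product is 84 minutes, which the description rounds to ~85. Only the mount path exceeds the 30-minute default timeout among the runs that finished.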

Hence the page's structure: hf:// native readers for scans, mounts for whole-file/local-path access, explicit --timeout guidance for network-bound scans. The Common Crawl example runs verbatim in ~1 min on the default CPU flavor with no token; its printed output is included on the page.
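
For readers who have not seen the pattern, a minimal sketch of a native hf:// scan packaged as a script with a PEP 723 header might look like the following. This is not the snippet from the guide: the helper names, the default glob pattern, and the group-by aggregate are assumptions chosen for illustration, and the repo id and column are supplied on the command line.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["polars"]
# ///
"""Aggregate a Parquet dataset on the Hub without downloading it first.

Hypothetical sketch: the glob pattern and the aggregate are placeholders,
not values taken from the guide.
"""
import sys


def hf_parquet_glob(repo_id: str, pattern: str = "**/*.parquet") -> str:
    """Build an hf:// URL for Polars' native reader to scan."""
    return f"hf://datasets/{repo_id}/{pattern}"


def count_by(repo_id: str, column: str):
    """Lazily scan the glob and count rows per value of `column`."""
    # Imported here so hf_parquet_glob has no third-party dependency.
    import polars as pl

    return (
        pl.scan_parquet(hf_parquet_glob(repo_id))
        .group_by(column)
        .agg(pl.len().alias("rows"))
        .sort("rows", descending=True)
        .collect()
    )


# Runs only when invoked as: script.py <user/dataset> <column>
if __name__ == "__main__" and len(sys.argv) == 3:
    print(count_by(sys.argv[1], sys.argv[2]))
```

Because such a scan is network-bound, a Job running it would presumably be submitted with an explicit `--timeout` larger than the 30-minute default; the exact command line is left to the guide itself.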

Notes

davanstrien and others added 2 commits June 2, 2026 15:39
New docs/hub/jobs-large-datasets.md: how to read, filter, and process datasets larger than a Job's ephemeral disk — streaming, mounting + lazy reads, reading/filtering over hf:// with Polars/DuckDB, saving results to a bucket, and a Common Crawl worked example. Links canonical pages rather than restating them; adds an inbound pointer from jobs-configuration#volumes.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…sults

Every example was run against real Jobs. Key changes from what testing showed:

- hf:// section: native Polars scan (verified ~4 min for the 28 GB sample
  aggregate on the default CPU flavor) replaces the
  register_filesystem(HfFileSystem()) DuckDB example, which could not
  finish the same scan within the 30-minute default timeout. Native hf://
  stays for datasets; HfFileSystem noted for buckets.
- Mount section: reframed for whole-file/local-path access; large Parquet
  scans pointed at hf:// instead (measured several times slower through
  the mount). Added the missing PEP 723 header to the script example.
- Save results: hosts the DuckDB COPY-to-bucket pattern (out-of-core
  write), scaled from 100BT to 10BT, with --timeout shown and a note that
  it transfers the full text column.
- Added --timeout guidance (long scans are network-bound), real output
  for the Common Crawl example (verified verbatim, ~1 min, no token
  needed), hf jobs hardware mention, and Title Case/heading consistency.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>