Add Jobs guide: Process Large Datasets #2522
Draft
davanstrien wants to merge 3 commits into
New docs/hub/jobs-large-datasets.md: how to read, filter, and process datasets larger than a Job's ephemeral disk — streaming, mounting + lazy reads, reading/filtering over hf:// with Polars/DuckDB, saving results to a bucket, and a Common Crawl worked example. Links canonical pages rather than restating them; adds an inbound pointer from jobs-configuration#volumes.
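As a rough illustration of the streaming approach, here is a minimal sketch, assuming the `datasets` library's `load_dataset(..., streaming=True)` and a hypothetical repo id `user/large-dataset` with a `text` column. The helper names are illustrative and are not taken from the new page. Rows are filtered lazily, so nothing is written to the Job's ephemeral disk.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def filter_rows(rows: Iterable[dict], keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily yield the rows for which `keep` returns True."""
    return (row for row in rows if keep(row))


def first_matches(rows: Iterable[dict], keep: Callable[[dict], bool], limit: int) -> List[dict]:
    """Collect at most `limit` matching rows, then stop consuming the stream."""
    return list(islice(filter_rows(rows, keep), limit))


def main() -> None:
    # Imported here so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    # Hypothetical repo id and column name; replace with a real dataset.
    stream = load_dataset("user/large-dataset", split="train", streaming=True)
    for row in first_matches(stream, lambda r: len(r["text"]) > 1000, limit=5):
        print(row["text"][:80])
```

Calling `main()` inside a Job pulls only as many rows as the filter needs; the helpers accept any iterable of dicts, so they can be tried locally on a small list first.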
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…sults

Every example was run against real Jobs. Key changes from what testing showed:

- hf:// section: native Polars scan (verified ~4 min for the 28 GB sample aggregate on the default CPU flavor) replaces the register_filesystem(HfFileSystem()) DuckDB example, which could not finish the same scan within the 30-minute default timeout. Native hf:// stays for datasets; HfFileSystem noted for buckets.
- Mount section: reframed for whole-file/local-path access; large Parquet scans pointed at hf:// instead (measured several times slower through the mount). Added the missing PEP 723 header to the script example.
- Save results: hosts the DuckDB COPY-to-bucket pattern (out-of-core write), scaled from 100BT to 10BT, with --timeout shown and a note that it transfers the full text column.
- Added --timeout guidance (long scans are network-bound), real output for the Common Crawl example (verified verbatim, ~1 min, no token needed), hf jobs hardware mention, and Title Case/heading consistency.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
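The COPY-to-bucket pattern named in the commit message could look roughly like the sketch below. It is not the script from the page: the repo id, the `language` column, and the `/output` mount path are hypothetical placeholders, and `build_copy_sql` is an illustrative helper. It assumes DuckDB's built-in `hf://` support and starts with a PEP 723 header so the script can declare its own dependency.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["duckdb"]
# ///
"""Filter a Parquet glob over hf:// and write the result to a mounted path."""


def build_copy_sql(source_glob: str, dest_path: str, where: str) -> str:
    """Build a COPY statement that streams filtered rows into one Parquet file."""
    return (
        f"COPY (SELECT * FROM read_parquet('{source_glob}') WHERE {where}) "
        f"TO '{dest_path}' (FORMAT PARQUET)"
    )


def main() -> None:
    # Imported here so build_copy_sql can be used without duckdb installed.
    import duckdb

    sql = build_copy_sql(
        "hf://datasets/user/large-dataset/data/*.parquet",  # hypothetical dataset
        "/output/filtered.parquet",  # hypothetical bucket mount path
        "language = 'en'",  # hypothetical filter column
    )
    # DuckDB reads and writes in chunks, so the full dataset never has to fit on disk.
    duckdb.connect().execute(sql)
```

Because `SELECT *` transfers every column, including large text columns, a narrower column list shortens the scan when only a few fields are needed.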
What

Adds a new Jobs documentation page, Process Large Datasets (docs/hub/jobs-large-datasets.md), under the Jobs section of the nav, plus an inbound pointer from jobs-configuration's Volumes section.

It's a task-oriented guide for working with datasets larger than a Job's ephemeral disk, opening with a short decision rule and then covering each approach:

- Streaming: datasets (streaming=True); links the Stream guide and the 100× more efficient blog
- Reading over hf://: hf:// directly with Polars/DuckDB/pandas (native readers, no mount)
- Saving results: DuckDB COPY straight from hf:// to the mount, or push_to_hub
- Common Crawl worked example: hf:// → DuckDB, no download

It links the canonical reference pages (jobs-pricing, jobs-configuration#volumes, storage-buckets-access, datasets/stream, per-library dataset pages) rather than restating them.

Verification
Every snippet was run against live Jobs. Timings that shaped the page (same aggregate query, same ~28 GB / 14-file Parquet glob):

- Native Polars hf:// scan: ~4 min on the default CPU flavor
- DuckDB over hf:// via register_filesystem(HfFileSystem()): did not finish within the 30-minute default timeout
- -v repo mount: several times slower than hf://

Hence the page's structure: hf:// native readers for scans, mounts for whole-file/local-path access, explicit --timeout guidance for network-bound scans. The Common Crawl example runs verbatim in ~1 min on the default CPU flavor with no token; its printed output is included on the page.

Notes
Touches the same _toctree.yml region as Add Serve Models guide for Jobs (vLLM + llama.cpp via exposed ports) #2554; whichever merges second needs a trivial rebase.
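For context on the native hf:// scan measured under Verification, a minimal sketch of that style of aggregate follows. It assumes Polars' `scan_parquet` accepting `hf://datasets/...` paths; the repo id, glob, and column name are hypothetical, and this is not the exact query that was timed.

```python
def hf_dataset_glob(repo_id: str, pattern: str) -> str:
    """Build an hf:// path for files in a dataset repo."""
    return f"hf://datasets/{repo_id}/{pattern}"


def count_by(path: str, column: str):
    """Lazily scan Parquet files and count rows per distinct value of `column`."""
    # Imported here so hf_dataset_glob can be used without polars installed.
    import polars as pl

    return (
        pl.scan_parquet(path)  # lazy: builds a query plan, reads nothing yet
        .group_by(column)
        .agg(pl.len().alias("rows"))
        .sort("rows", descending=True)
        .collect()  # runs the plan, reading only the column it needs
    )


def main() -> None:
    # Hypothetical repo id, file layout, and column name.
    path = hf_dataset_glob("user/large-dataset", "data/*.parquet")
    print(count_by(path, "language"))
```

Since a scan like this is network-bound, the run time depends mostly on how many bytes the selected columns occupy, which is why an explicit --timeout matters for larger globs.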