docs: Added starter dev notes on push to hugging face hub #355

Open: nabinchha wants to merge 6 commits into `main` from `nmulepati/docs/dev-notes-push-to-huggingface-hub`.

Commits (6):
- 031ad32 Added starter dev notes on push to huggingface hub (nabinchha)
- 46806da Merge branch 'main' into nmulepati/docs/dev-notes-push-to-huggingface… (nabinchha)
- e4a7e03 fix: move excerpt marker to intro and remove redundant markers (nabinchha)
- ba8a055 Merge branch 'main' into nmulepati/docs/dev-notes-push-to-huggingface… (nabinchha)
- b57e2d2 Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md (nabinchha)
- a923a2a Merge branch 'main' into nmulepati/docs/dev-notes-push-to-huggingface… (nabinchha)

**docs/devnotes/posts/push-datasets-to-hugging-face-hub.md** (294 additions, 0 deletions)

---
date: 2026-02-26
authors:
  - nmulepati
---

# **Push Datasets to Hugging Face Hub**



You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file?
Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚢.

<!-- more -->

Here's the full flow — build a multilingual greeting dataset with a conversation
training processor, generate it, and push it to the Hub in one go:
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings
```

---
## Two Ways In - Same Outcome

**From results** (the happy path) — you just ran `.create()`, you have the
results object, call `.push_to_hub()` on it.

**From a folder** (the "I closed my notebook" path) — you saved artifacts to
disk earlier and want to push them later:
```python
from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)
```

---
## What Gets Uploaded



Everything. The upload pipeline runs in this order:
```
1. README.md           ← auto-generated dataset card
2. data/*.parquet      ← your main dataset (remapped from parquet-files/)
3. images/*            ← if you have image columns (skipped otherwise)
4. {processor}/*       ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json       ← paths rewritten to match HF repo layout
```

Each step is its own commit on the HF repo, so you get a clean history.
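
If you're curious what the step-per-commit pattern looks like, here's a rough sketch using the
public `huggingface_hub` client. This is an illustration of the idea, not necessarily Data
Designer's internals, and the local folder names are assumptions based on the layout above.

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN or your cached login
repo_id = "my-org/multilingual-greetings"

# One commit per artifact keeps the repo history easy to read.
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="./my-saved-dataset/README.md",
    path_in_repo="README.md",
    repo_id=repo_id,
    repo_type="dataset",
    commit_message="Add auto-generated dataset card",
)
api.upload_folder(
    folder_path="./my-saved-dataset/parquet-files",
    path_in_repo="data",
    repo_id=repo_id,
    repo_type="dataset",
    commit_message="Upload main dataset parquet files",
)
```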

This is especially nice for large datasets. Data Designer writes output in
batched parquet partitions — generate 100k records and you'll have dozens of
parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`.
Manually uploading all of that, organizing it into the right HF repo structure,
writing the dataset card YAML configs, and rewriting metadata paths would be
tedious and error-prone. `push_to_hub` handles the whole thing in one call —
folder uploads, path remapping, config registration, dataset card generation,
all of it.

Re-pushing to the same `repo_id` updates the existing repo — no need to delete
and recreate.

---
## Processors Get First-Class Treatment



Notice the `SchemaTransformProcessorConfig` in the example above. That's doing
the heavy lifting — it takes the raw `greeting` and `response` columns and
reshapes each row into an OpenAI-style `messages` array:

```python
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)
```

The template is Jinja2 all the way down. Keys become columns in the output,
values get rendered per-row with the actual column data. The template dict must
be JSON-serializable — strings, lists, nested objects, all fair game. So you can
build arbitrarily complex conversation schemas (multi-turn, system prompts,
tool calls) just by adding more entries to the `messages` list.
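
To make "rendered per-row" concrete, here's a minimal sketch of applying a template dict like
this to a single row with plain Jinja2. It illustrates the mechanism only; it is not the
processor's actual implementation.

```python
import json

from jinja2 import Template

template = {
    "messages": [
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
row = {"greeting": "Hey! Como estás?", "response": "Hola! Estoy bien, gracias..."}

def render(node, row):
    # Walk the JSON-serializable template; render every string against the row's columns.
    if isinstance(node, str):
        return Template(node).render(**row)
    if isinstance(node, list):
        return [render(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render(value, row) for key, value in node.items()}
    return node  # numbers, booleans, None pass through untouched

print(json.dumps(render(template, row), ensure_ascii=False, indent=2))
```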

The processor runs after each batch and writes its output to a separate parquet
file alongside the main dataset. The main dataset (`data/`) still has the raw
columns — the processor output is an *additional* view, not a replacement.

**When you push to hub, each processor gets its own top-level directory and its
own HF dataset config.** So the `conversations` processor from our example ends
up like this on HF:

```
my-org/multilingual-greetings/
├── README.md
├── data/
│   ├── batch_00000.parquet   ← raw columns (greeting, response)
│   └── batch_00001.parquet
├── conversations/
│   ├── batch_00000.parquet   ← transformed (messages array)
│   └── batch_00001.parquet
├── builder_config.json
└── metadata.json
```

The dataset card YAML frontmatter registers each processor as its own named
config:

```yaml
configs:
  - config_name: data
    data_files: "data/*.parquet"
    default: true
  - config_name: conversations
    data_files: "conversations/*.parquet"
```

So consumers grab exactly the format they need:

```python
from datasets import load_dataset

# Raw columns — good for analysis
df = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format — ready for fine-tuning
df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(df_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}
```

The Quick Start section in the generated README includes these snippets
automatically — one `load_dataset` call per processor.

**Metadata paths are rewritten too.** Local paths like
`processors-files/conversations/batch_00000.parquet` become
`conversations/batch_00000.parquet` so file references in the metadata match
the actual HF repo structure.
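
The remapping itself is a prefix swap. A minimal sketch of the idea (hypothetical helper, not
the library's code):

```python
def remap_path(local_path: str) -> str:
    """Map local artifact paths onto the HF repo layout described above."""
    if local_path.startswith("parquet-files/"):
        return "data/" + local_path[len("parquet-files/"):]
    if local_path.startswith("processors-files/"):
        return local_path[len("processors-files/"):]
    return local_path

assert remap_path("processors-files/conversations/batch_00000.parquet") == "conversations/batch_00000.parquet"
assert remap_path("parquet-files/batch_00000.parquet") == "data/batch_00000.parquet"
```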

If there are no processors, all of this is silently skipped — no empty
directories, no phantom configs.

---
## The Auto-Generated Dataset Card

This is the fun part. The upload generates a full HuggingFace dataset card from
your run metadata. It pulls from `metadata.json` and `builder_config.json` to
build:

- A **Quick Start** section with `load_dataset` code (including processor subsets)
- A **Dataset Summary** with record count, column count, completion %
- A **Schema & Statistics** table — per-column type, uniqueness, null rate, token stats
- **Generation Details** — how many columns of each config type
- A **Citation** block so people can cite your dataset

Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed.
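
Those buckets are the Hub's standard `size_categories` values. One way the auto-computation
could work, as a hypothetical helper (the exact boundary handling here is an assumption):

```python
def size_category(num_records: int) -> str:
    # Standard Hugging Face Hub size_categories buckets (smallest few shown).
    buckets = [
        (1_000, "n<1K"),
        (10_000, "1K<n<10K"),
        (100_000, "10K<n<100K"),
        (1_000_000, "100K<n<1M"),
    ]
    for upper_bound, label in buckets:
        if num_records < upper_bound:
            return label
    return "1M<n<10M"

print(size_category(10_000))  # 10K<n<100K
```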

The template lives at
`packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md`
if you want to see the Jinja2 source.

---
## Auth

Token resolution follows the standard `huggingface_hub` chain:

1. Explicit `token=` parameter
2. `HF_TOKEN` env var
3. Cached creds from `hf auth login`

If none of those work, you get a clear error telling you what to do.
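
A minimal sketch of that chain (assuming a recent `huggingface_hub`; note that `get_token()`
already checks `HF_TOKEN` itself, so step 2 is spelled out here only for clarity):

```python
import os
from typing import Optional

from huggingface_hub import get_token

def resolve_token(explicit_token: Optional[str] = None) -> str:
    # 1. explicit argument, 2. HF_TOKEN env var, 3. cached creds from `hf auth login`
    token = explicit_token or os.environ.get("HF_TOKEN") or get_token()
    if token is None:
        raise ValueError(
            "No Hugging Face token found. Pass token=..., set HF_TOKEN, or run `hf auth login`."
        )
    return token
```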

---
## Reproducible Pipelines — The Round-Trip

{ width="800" }

Here's the payoff: every dataset you push includes `builder_config.json` — the
full SDG pipeline definition. Anyone (including future-you) can recreate the
exact same pipeline from the HuggingFace URL:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)
```

That's it. One line. `from_config` accepts a raw URL, a local file path, a dict,
or a YAML string. When you hand it a HuggingFace Hub URL, it auto-rewrites the
blob URL to a raw URL behind the scenes so the fetch just works (same trick for
GitHub blob URLs).
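
The URL rewrite is a plain string substitution. An illustrative sketch (not the actual
implementation):

```python
def to_raw_url(url: str) -> str:
    """Rewrite browser 'blob' URLs into directly fetchable raw URLs."""
    if url.startswith("https://huggingface.co/") and "/blob/" in url:
        return url.replace("/blob/", "/resolve/", 1)
    if url.startswith("https://github.com/") and "/blob/" in url:
        raw = url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
        return raw.replace("/blob/", "/", 1)
    return url

print(to_raw_url(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
))
# https://huggingface.co/datasets/my-org/multilingual-greetings/resolve/main/builder_config.json
```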

The loaded config builder comes back fully hydrated — columns, model configs,
constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

```python
from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)
```

So the full loop is: **design → generate → push → share URL → recreate → iterate**.
The `builder_config.json` on HuggingFace *is* the reproducibility artifact.

---
## Gotchas

- **`repo_id` must be `username/dataset-name`** — exactly one slash. The client
  validates this before hitting the network (a quick pre-flight check is sketched below).
- **`description` is required** — it's the prose that appears right under the
  title on the dataset card. Make it good.
- **`private=True`** if you don't want the world to see your dataset yet.
- **Metadata paths get rewritten** — local paths like `parquet-files/batch_00000.parquet`
  become `data/batch_00000.parquet` in the uploaded `metadata.json` so references
  stay valid on HF.
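
A pre-flight check for the first two gotchas might look like this (hypothetical sketch; the
client does its own validation before any network call):

```python
def check_push_args(repo_id: str, description: str) -> None:
    owner, _, name = repo_id.partition("/")
    if not owner or not name or "/" in name:
        raise ValueError(f"repo_id must look like 'username/dataset-name', got {repo_id!r}")
    if not description.strip():
        raise ValueError("description is required; it becomes the intro text on the dataset card")

check_push_args("my-org/multilingual-greetings", "10k synthetic agent/user conversations.")
```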

**Review comment: Variable name shadows package name**

The local variable `data_designer` (assigned a `DataDesigner` instance) shares its name with the `data_designer` package that was just imported on lines 20–21. While Python won't error here (the imports use `dd` and `DataDesigner` as their local bindings), readers skimming the snippet may confuse the instance with the module. A less ambiguous name like `designer` or `dd_client` would make the example clearer, especially since the second Round-Trip code block (line 272) already uses the inline `DataDesigner().create(...)` style without assigning to a named variable at all. Then update line 65 accordingly:
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!