4 changes: 4 additions & 0 deletions docs/devnotes/.authors.yml
@@ -19,3 +19,7 @@ authors:
    name: Dhruv Nathawani
    description: Researcher at NVIDIA
    avatar: https://avatars.githubusercontent.com/u/128275431?v=4
  nmulepati:
    name: Nabin Mulepati
    description: Researcher at NVIDIA
    avatar: https://avatars.githubusercontent.com/u/5551931?v=4
294 changes: 294 additions & 0 deletions docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
@@ -0,0 +1,294 @@
---
date: 2026-02-26
authors:
- nmulepati
---

# **Push Datasets to Hugging Face Hub**

![Push to Hub Hero](images/push-to-hub-hero.png)

You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file?
Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚢.

<!-- more -->

Here's the full flow — build a multilingual greeting dataset with a conversation
training processor, generate it, and push it to the Hub in one go:

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings
```

---
## Two Ways In - Same Outcome

**From results** (the happy path) — you just ran `.create()`, you have the
results object, call `.push_to_hub()` on it.

**From a folder** (the "I closed my notebook" path) — you saved artifacts to
disk earlier and want to push them later:

```python
from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)
```

---
## What Gets Uploaded

![Push to Hub Pipeline](images/push-to-hub-pipeline.png)

Everything. The upload pipeline runs in this order:

```
1. README.md             ← auto-generated dataset card
2. data/*.parquet        ← your main dataset (remapped from parquet-files/)
3. images/*              ← if you have image columns (skipped otherwise)
4. {processor}/*         ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json         ← paths rewritten to match HF repo layout
```

Each step is its own commit on the HF repo, so you get a clean history.

This is especially nice for large datasets. Data Designer writes output in
batched parquet partitions — generate 100k records and you'll have dozens of
parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`.
Manually uploading all of that, organizing it into the right HF repo structure,
writing the dataset card YAML configs, and rewriting metadata paths would be
tedious and error-prone. `push_to_hub` handles the whole thing in one call —
folder uploads, path remapping, config registration, dataset card generation,
all of it.

Re-pushing to the same `repo_id` updates the existing repo — no need to delete
and recreate.

---
## Processors Get First-Class Treatment

![Schema Transform for Conversation Training](images/push-to-hub-schema-transform.png)

Notice the `SchemaTransformProcessorConfig` in the example above. That's doing
the heavy lifting — it takes the raw `greeting` and `response` columns and
reshapes each row into an OpenAI-style `messages` array:

```python
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)
```

The template is Jinja2 all the way down. Keys become columns in the output,
values get rendered per-row with the actual column data. The template dict must
be JSON-serializable — strings, lists, nested objects, all fair game. So you can
build arbitrarily complex conversation schemas (multi-turn, system prompts,
tool calls) just by adding more entries to the `messages` list.
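Adding a system turn, for example, is just one more entry in the list. A minimal sketch (the system-prompt text here is invented for illustration):

```python
import json

# Hypothetical richer template: one extra entry in "messages" yields a
# system turn. "greeting" and "response" are the columns from the pipeline
# above; the system-prompt wording is made up for this example.
template = {
    "messages": [
        {"role": "system", "content": "You are a friendly multilingual agent."},
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}

# The processor requires the template to be JSON-serializable; a round-trip
# through json is a cheap sanity check before running the pipeline.
assert template == json.loads(json.dumps(template))
```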

The processor runs after each batch and writes its output to a separate parquet
file alongside the main dataset. The main dataset (`data/`) still has the raw
columns — the processor output is an *additional* view, not a replacement.

**When you push to hub, each processor gets its own top-level directory and its
own HF dataset config.** So the `conversations` processor from our example ends
up like this on HF:

```
my-org/multilingual-greetings/
├── README.md
├── data/
│   ├── batch_00000.parquet     ← raw columns (greeting, response)
│   └── batch_00001.parquet
├── conversations/
│   ├── batch_00000.parquet     ← transformed (messages array)
│   └── batch_00001.parquet
├── builder_config.json
└── metadata.json
```

The dataset card YAML frontmatter registers each processor as its own named
config:

```yaml
configs:
- config_name: data
  data_files: "data/*.parquet"
  default: true
- config_name: conversations
  data_files: "conversations/*.parquet"
```

So consumers grab exactly the format they need:

```python
from datasets import load_dataset

# Raw columns — good for analysis
df = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format — ready for fine-tuning
df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(df_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
# {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}
```

The Quick Start section in the generated README includes these snippets
automatically — one `load_dataset` call per processor.

**Metadata paths are rewritten too.** Local paths like
`processors-files/conversations/batch_00000.parquet` become
`conversations/batch_00000.parquet` so file references in the metadata match
the actual HF repo structure.
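The rewrite amounts to stripping the local prefixes, roughly like this (a sketch of the described behavior using plain string handling, not the library's actual code):

```python
# Processor outputs lose their "processors-files/" prefix so references
# match the HF repo layout (assumed behavior, illustrated by hand).
local_path = "processors-files/conversations/batch_00000.parquet"
hub_path = local_path.replace("processors-files/", "", 1)
assert hub_path == "conversations/batch_00000.parquet"

# Main-dataset paths move from parquet-files/ to data/ the same way.
main_local = "parquet-files/batch_00000.parquet"
main_hub = main_local.replace("parquet-files/", "data/", 1)
assert main_hub == "data/batch_00000.parquet"
```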

If there are no processors, all of this is silently skipped — no empty
directories, no phantom configs.

---
## The Auto-Generated Dataset Card

This is the fun part. The upload generates a full Hugging Face dataset card
from your run metadata, pulling from `metadata.json` and `builder_config.json`
to build:

- A **Quick Start** section with `load_dataset` code (including processor subsets)
- A **Dataset Summary** with record count, column count, completion %
- A **Schema & Statistics** table — per-column type, uniqueness, null rate, token stats
- **Generation Details** — how many columns of each config type
- A **Citation** block so people can cite your dataset

Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed.
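The size bucketing can be sketched like this (these are the Hub's standard size categories; the library's exact boundary handling is assumed):

```python
def size_category(n: int) -> str:
    # Sketch of the auto-computed size category described above, using the
    # Hub's standard buckets (boundary behavior is assumed).
    if n < 1_000:
        return "n<1K"
    if n < 10_000:
        return "1K<n<10K"
    if n < 100_000:
        return "10K<n<100K"
    return "100K<n<1M"

print(size_category(10_000))  # → "10K<n<100K"
```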

The template lives at `packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md` if you
want to see the Jinja2 source.

---
## Auth

Token resolution follows the standard `huggingface_hub` chain:

1. Explicit `token=` parameter
2. `HF_TOKEN` env var
3. Cached creds from `hf auth login`

If none of those work, you get a clear error telling you what to do.
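The chain reads like this (an illustrative sketch only; the real resolution logic lives inside `huggingface_hub`):

```python
import os

def resolve_token(explicit=None):
    """Sketch of the standard huggingface_hub resolution order described
    above (illustrative; not the library's actual code)."""
    if explicit:
        return explicit                    # 1. explicit token= parameter wins
    env = os.environ.get("HF_TOKEN")
    if env:
        return env                         # 2. HF_TOKEN environment variable
    return None                            # 3. fall back to `hf auth login` creds

assert resolve_token("hf_explicit") == "hf_explicit"
```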

---
## Reproducible Pipelines — The Round-Trip

![Round-Trip Reproducibility](images/push-to-hub-round-trip.png){ width="800" }

Here's the payoff: every dataset you push includes `builder_config.json` — the
full SDG pipeline definition. Anyone (including future-you) can recreate the
exact same pipeline from the Hugging Face URL:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)
```

That's it. One line. `from_config` accepts a raw URL, a local file path, a dict,
or a YAML string. When you hand it a Hugging Face Hub URL, it auto-rewrites the
blob URL to a raw URL behind the scenes so the fetch just works (the same trick
works for GitHub blob URLs).
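The rewrite itself is simple — Hugging Face serves raw file contents under `/resolve/` instead of `/blob/` (a sketch of the idea; the exact mechanism inside `from_config` is assumed):

```python
# Browser URLs use /blob/; raw file contents live under /resolve/.
# Swapping the path segment turns one into the other (assumed to mirror
# what from_config does internally).
blob_url = (
    "https://huggingface.co/datasets/my-org/multilingual-greetings"
    "/blob/main/builder_config.json"
)
raw_url = blob_url.replace("/blob/", "/resolve/", 1)
assert raw_url.endswith("/resolve/main/builder_config.json")
```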

The loaded config builder comes back fully hydrated — columns, model configs,
constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

```python
from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)
```

So the full loop is: **design → generate → push → share URL → recreate → iterate**.
The `builder_config.json` on Hugging Face *is* the reproducibility artifact.

---
## Gotchas

- **`repo_id` must be `username/dataset-name`** — exactly one slash. The client
  validates this before hitting the network.
- **`description` is required** — it's the prose that appears right under the
  title on the dataset card. Make it good.
- **`private=True`** if you don't want the world to see your dataset yet.
- **Metadata paths get rewritten** — local paths like `parquet-files/batch_00000.parquet`
  become `data/batch_00000.parquet` in the uploaded `metadata.json` so references
  stay valid on HF.
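The slash rule amounts to a tiny pre-flight check, something like this (illustrative only; the client ships its own validation):

```python
def validate_repo_id(repo_id):
    """Sketch of the exactly-one-slash check described above (the client's
    actual validation code is assumed, not copied)."""
    if repo_id.count("/") != 1:
        raise ValueError(
            f"repo_id must look like 'username/dataset-name', got {repo_id!r}"
        )

validate_repo_id("my-org/multilingual-greetings")  # passes silently
```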
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -66,6 +66,7 @@ nav:
    - analysis: code_reference/analysis.md
  - Dev Notes:
    - devnotes/index.md
    - Push Datasets to Hugging Face Hub: devnotes/posts/push-datasets-to-hugging-face-hub.md
    - Structured Outputs from Nemotron: devnotes/posts/structured-outputs-from-nemotron.md
    - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
    - RQA Dataset: devnotes/posts/rqa.md