diff --git a/docs/colab_notebooks/7-nemotron-personas.ipynb b/docs/colab_notebooks/7-nemotron-personas.ipynb new file mode 100644 index 000000000..32cc898ba --- /dev/null +++ b/docs/colab_notebooks/7-nemotron-personas.ipynb @@ -0,0 +1,1071 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "40d84353", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "id": "092d9fb9", + "metadata": {}, + "source": [ + "# πŸ‘₯ Data Designer Tutorial: Reproducing & Customizing Nemotron-Personas\n", + "\n", + "This notebook reproduces the [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) generation pipeline end to end with [🎨 NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner), and then shows how to customize that pipeline to generate personas for a specific use case. A similar approach was used to build every dataset in the [Nemotron-Personas HF collection](https://huggingface.co/collections/nvidia/nemotron-personas).\n", + "\n", + "We seed the pipeline with the **extended Nemotron-Personas-USA dataset on NGC**, which is a superset of the publicly released HuggingFace version β€” it includes additional demographic and persona fields used internally to ground synthetic generation. From those grounded seeds, two stages of LLM structured-output columns produce the persona attributes (cultural background, skills, career goals, hobbies) and the persona descriptions across professional, financial, healthcare, sports, arts, travel, and culinary dimensions.\n", + "\n", + "> ⚠️ **Note**: To run this notebook, follow the setup instructions in the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/), make sure you have generated an API key for accessing models on [build.nvidia.com](https://build.nvidia.com), and that you've set the `NVIDIA_API_KEY` environment variable. The next section also walks through downloading the NGC-hosted Nemotron-Personas dataset.\n", + "\n", + "
\n", + " \"Nemotron\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "a8d81c50", + "metadata": {}, + "source": [ + "# 1. πŸ“¦ Install and import python packages\n", + "\n", + "**IMPORTANT** πŸ‘‰ If you haven't already, follow the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/) to install Data Designer. Note that you may need to restart/select your kernel after setting up the environment.\n", + "\n", + "If the installation is successful, you should be able to run the imports below without any errors." + ] + }, + { + "cell_type": "markdown", + "id": "b4386c99", + "metadata": {}, + "source": [ + "### ⚑ Colab Setup\n", + "\n", + "Run the cells below to install the dependencies and set up the API key. If you don't have an API key, you can generate one from [build.nvidia.com](https://build.nvidia.com).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83c32958", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install -U data-designer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6afdb18", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "from google.colab import userdata\n", + "\n", + "try:\n", + " os.environ[\"NVIDIA_API_KEY\"] = userdata.get(\"NVIDIA_API_KEY\")\n", + "except userdata.SecretNotFoundError:\n", + " os.environ[\"NVIDIA_API_KEY\"] = getpass.getpass(\"Enter your NVIDIA API key: \")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "422b5826", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "import os\n", + "\n", + "# Install the NGC CLI (used by `data-designer download personas` to fetch the\n", + "# managed Nemotron-Personas dataset). 
Pinned to a known-good version; bump as\n", + "# needed when NGC publishes new releases.\n", + "!wget -q --no-cache \"https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.164.0/files/ngccli_linux.zip\" -O /tmp/ngccli_linux.zip\n", + "!unzip -q -o /tmp/ngccli_linux.zip -d /tmp\n", + "!chmod u+x /tmp/ngc-cli/ngc\n", + "os.environ[\"PATH\"] = f\"/tmp/ngc-cli:{os.environ['PATH']}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25a122da", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "from google.colab import userdata\n", + "\n", + "try:\n", + " os.environ[\"NGC_API_KEY\"] = userdata.get(\"NGC_API_KEY\")\n", + "except userdata.SecretNotFoundError:\n", + " os.environ[\"NGC_API_KEY\"] = getpass.getpass(\"Enter your NGC API key: \")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2abfe950", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import shlex\n", + "import subprocess\n", + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from pydantic import BaseModel, Field\n", + "\n", + "import data_designer.config as dd\n", + "from data_designer.interface import DataDesigner" + ] + }, + { + "cell_type": "markdown", + "id": "11e8a35c", + "metadata": {}, + "source": [ + "## πŸ“₯ Download the Nemotron-Personas dataset from NGC\n", + "\n", + "Before configuring Data Designer, make sure the NGC-hosted Nemotron-Personas dataset is on disk. This is the **extended** version, a superset of the public HF release. 
To use it you need an [NGC API key](https://ngc.nvidia.com/setup/api-key), the [NGC CLI](https://ngc.nvidia.com/setup/installers/cli) installed, and `NGC_API_KEY` exported in your environment.\n", + "\n", + "The cell below idempotently invokes the Data Designer CLI and only downloads when the locale's parquet isn't already in `~/.data-designer/managed-assets/datasets/`. Change `personas_locale` to any other [supported locale](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) (`en_IN`, `en_SG`, `fr_FR`, `hi_Deva_IN`, `hi_Latn_IN`, `ja_JP`, `ko_KR`, `pt_BR`) to seed a regional pipeline instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33eedcbc", + "metadata": {}, + "outputs": [], + "source": [ + "personas_locale = \"en_US\"\n", + "\n", + "assets_dir = Path.home() / \".data-designer\" / \"managed-assets\" / \"datasets\"\n", + "existing = list(assets_dir.glob(f\"{personas_locale}*.parquet\")) if assets_dir.exists() else []\n", + "\n", + "if existing:\n", + " print(f\"Nemotron-Personas-{personas_locale} already present at {assets_dir}:\")\n", + " for p in existing:\n", + " print(f\" - {p.name}\")\n", + "else:\n", + " print(f\"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...\")\n", + " subprocess.run(\n", + " shlex.split(f\"data-designer download personas --locale {personas_locale}\"),\n", + " check=True,\n", + " )\n", + " print(f\"Done. Dataset placed under {assets_dir}.\")" + ] + }, + { + "cell_type": "markdown", + "id": "6d326ad5", + "metadata": { + "lines_to_next_cell": 2 + }, + "source": [ + "# 2. πŸ› οΈ Define helpers\n", + "\n", + "These OCEAN Big-Five helpers come from the original Nemotron-Personas pipeline. They are **not invoked** in the default flow below, where OCEAN traits come directly from the NGC-hosted Nemotron-Personas-USA dataset via `with_synthetic_personas=True`. 
They are kept here for the `SAMPLE_FROM_SDG_PGM = True` reproduction path (see Section 4.2), since [NeMo SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) handles demographic distributions but not Big Five personality scoring." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76f0de62", + "metadata": {}, + "outputs": [], + "source": [ + "def get_trait_label(score: int) -> str:\n", + " \"\"\"Convert a Big Five T-score into a coarse label.\"\"\"\n", + " if score < 35:\n", + " return \"very low\"\n", + " if score < 45:\n", + " return \"low\"\n", + " if score < 55:\n", + " return \"average\"\n", + " if score < 65:\n", + " return \"high\"\n", + " return \"very high\"\n", + "\n", + "\n", + "def get_trait_description(trait: str, label: str) -> str:\n", + " \"\"\"Return a prose description for a (trait, label) pair.\"\"\"\n", + " descriptions: dict[str, dict[str, str]] = {\n", + " \"openness\": {\n", + " \"very low\": \"Strongly prefers routine and the familiar. Traditional in thinking and values practicality over abstract ideas.\",\n", + " \"low\": \"Generally prefers structure and predictability. Tends to be practical and focused on immediate realities.\",\n", + " \"average\": \"Balances curiosity with practicality. Appreciates both new ideas and established methods.\",\n", + " \"high\": \"Curious and appreciative of art, new ideas, and varied experiences. Open to unconventional thinking.\",\n", + " \"very high\": \"Highly imaginative and intellectually curious. Strongly drawn to novelty, art, and abstract concepts.\",\n", + " },\n", + " \"conscientiousness\": {\n", + " \"very low\": \"Spontaneous and flexible, often resisting structure. May struggle with organization and deadlines.\",\n", + " \"low\": \"Often relaxed about obligations and somewhat disorganized. Values flexibility over strict planning.\",\n", + " \"average\": \"Maintains balance between organization and flexibility. 
Reasonably reliable and attentive to responsibilities.\",\n", + " \"high\": \"Organized, reliable, and methodical. Plans ahead and follows through on commitments.\",\n", + " \"very high\": \"Exceptionally organized and disciplined. Strongly focused on achievement and meeting high standards.\",\n", + " },\n", + " \"extraversion\": {\n", + " \"very low\": \"Strongly prefers solitude and quiet environments. May find social interaction draining.\",\n", + " \"low\": \"Generally reserved and comfortable with solitude. Prefers small groups to large gatherings.\",\n", + " \"average\": \"Balances social interaction with need for alone time. Moderately talkative in social situations.\",\n", + " \"high\": \"Sociable, outgoing, and energetic. Enjoys group activities and being around others.\",\n", + " \"very high\": \"Highly sociable and draws energy from others. Very talkative and comfortable being center of attention.\",\n", + " },\n", + " \"agreeableness\": {\n", + " \"very low\": \"Critical, skeptical, and competitive. Prioritizes personal interests over group harmony.\",\n", + " \"low\": \"Sometimes skeptical of others' intentions. More competitive than cooperative in approach.\",\n", + " \"average\": \"Generally cooperative but can be assertive. Balances compassion with self-interest.\",\n", + " \"high\": \"Kind, cooperative, and considerate. Prioritizes harmony and others' needs.\",\n", + " \"very high\": \"Exceptionally compassionate and cooperative. Strongly motivated to help others and maintain harmony.\",\n", + " },\n", + " \"neuroticism\": {\n", + " \"very low\": \"Exceptionally calm and resilient. Rarely experiences negative emotions like anxiety or sadness.\",\n", + " \"low\": \"Emotionally stable and handles stress well. Not easily upset by challenging situations.\",\n", + " \"average\": \"Experiences normal range of emotions. Moderately resilient but affected by significant challenges.\",\n", + " \"high\": \"Experiences more negative emotions than average. 
Prone to worry and sensitive to stress.\",\n", + " \"very high\": \"Highly emotionally reactive and prone to distress. Often experiences intense anxiety or sadness.\",\n", + " },\n", + " }\n", + " return descriptions[trait][label]\n", + "\n", + "\n", + "def generate_ocean_traits(num_records: int, base_seed: int | None = None) -> pd.DataFrame:\n", + " \"\"\"Generate synthetic OCEAN traits as a DataFrame with one JSON-encoded object per trait per row.\"\"\"\n", + " if num_records <= 0:\n", + " return pd.DataFrame()\n", + "\n", + " traits = [\"openness\", \"conscientiousness\", \"extraversion\", \"agreeableness\", \"neuroticism\"]\n", + " rng = np.random.RandomState(base_seed) if base_seed is not None else np.random.RandomState()\n", + " data: dict[str, list[str]] = {}\n", + "\n", + " for trait in traits:\n", + " scores = rng.normal(50.0, 10.0, num_records) + rng.normal(0.0, 2.0, num_records)\n", + " scores = np.clip(scores, 20.0, 80.0)\n", + " t_scores = np.round(scores).astype(int)\n", + " labels = [get_trait_label(int(score)) for score in t_scores]\n", + " descriptions = [get_trait_description(trait, label) for label in labels]\n", + " data[trait] = [\n", + " json.dumps({\"t_score\": int(t), \"label\": l, \"description\": d})\n", + " for t, l, d in zip(t_scores, labels, descriptions, strict=True)\n", + " ]\n", + "\n", + " return pd.DataFrame(data)" + ] + }, + { + "cell_type": "markdown", + "id": "2670f722", + "metadata": {}, + "source": [ + "# 3. 
🎨 Set Up NeMo Data Designer (NDD)" + ] + }, + { + "cell_type": "markdown", + "id": "72d494f0", + "metadata": {}, + "source": [ + "## πŸͺͺ Specify Model ID and Alias\n", + "\n", + "- Use a [build.nvidia.com](https://build.nvidia.com/) model endpoint and model ID\n", + "- Make sure your `NVIDIA_API_KEY` environment variable is set" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ddbfd69", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL_PROVIDER = \"nvidia\"\n", + "MODEL_ID = \"openai/gpt-oss-20b\"\n", + "MODEL_ALIAS = \"gpt-oss-20b\"" + ] + }, + { + "cell_type": "markdown", + "id": "0b00577a", + "metadata": {}, + "source": [ + "## πŸŽ›οΈ Adjust the model config\n", + "\n", + "> ⚠️ **Note**: You may need to adjust temperature and top_p settings depending on the model you use. Consult the model card on [build.nvidia.com](https://build.nvidia.com) for recommended settings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "549075e8", + "metadata": {}, + "outputs": [], + "source": [ + "model_configs = [\n", + " dd.ModelConfig(\n", + " alias=MODEL_ALIAS,\n", + " model=MODEL_ID,\n", + " provider=MODEL_PROVIDER,\n", + " inference_parameters=dd.ChatCompletionInferenceParams(\n", + " max_tokens=16384,\n", + " temperature=dd.UniformDistribution(params=dd.UniformDistributionParams(low=0.9, high=1.1)),\n", + " top_p=1.0,\n", + " extra_body={\"reasoning_effort\": \"high\"},\n", + " timeout=1200,\n", + " max_parallel_requests=32,\n", + " ),\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "e966cdc0", + "metadata": {}, + "source": [ + "## πŸš€ Initialize Data Designer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9d2e015", + "metadata": {}, + "outputs": [], + "source": [ + "data_designer = DataDesigner()" + ] + }, + { + "cell_type": "markdown", + "id": "2ddb021e", + "metadata": {}, + "source": [ + "# 4. 
✍️ Design the dataset\n", + "\n", + "#### Once the SDG-PGMs reproduction path is wired up, there are three main steps to Nemotron-Personas:\n", + "#### 1️⃣ Generate OCEAN Personality Traits\n", + "#### 2️⃣ Generate Persona Attributes by grounding in (PGM + OCEAN) details\n", + "#### 3️⃣ Generate Personas by grounding in (2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73b040f5", + "metadata": {}, + "outputs": [], + "source": [ + "NUM_RECORDS = 5" + ] + }, + { + "cell_type": "markdown", + "id": "b0061838", + "metadata": {}, + "source": [ + "## 4.1 🌊 Generate OCEAN (Big Five) personality traits\n", + "OCEAN is the most common scientific model for measuring and describing human personality traits.
\n", + "See [Big Five personality traits Wikipedia article](https://en.wikipedia.org/wiki/Big_Five_personality_traits) for more context.\n", + "\n", + "In this notebook the OCEAN traits come straight from the **NGC-hosted Nemotron-Personas-USA dataset** in the next section (`with_synthetic_personas=True` exposes `person.openness`, `person.conscientiousness`, etc. as `struct`). The helper functions in Section 2 are kept ready for the `SAMPLE_FROM_SDG_PGM = True` reproduction path." + ] + }, + { + "cell_type": "markdown", + "id": "84775cd8", + "metadata": {}, + "source": [ + "## 4.2 πŸ‘©β€πŸŽ¨πŸ‘¨β€πŸŽ¨ Generate Persona Attributes\n", + "\n", + "We are focusing just on the part in the diagram below and seeding persona attributes with PGM + OCEAN details:\n", + "\n", + "
\n", + " \"Stage\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "0abef9b5", + "metadata": {}, + "source": [ + "> ⚠️ **Note**:\n", + "> Below, we show two different ways of seeding persona generation:\n", + ">\n", + "> When `SAMPLE_FROM_SDG_PGM = False` (default), we sample personal details and OCEAN traits from Data Designer's `PersonSampler` against the NGC-hosted Nemotron-Personas dataset (`PersonSamplerParams(locale=personas_locale, with_synthetic_personas=True)`).\n", + ">\n", + "> When `SAMPLE_FROM_SDG_PGM = True`, persons are generated from a custom Probabilistic Graphical Model via [NeMo SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs), and the OCEAN helpers from Section 2 layer the personality traits on top. **This branch is currently a TODO** β€” see the cell below for the eventual integration shape.\n", + ">\n", + "> To switch locales, just update `personas_locale` in the cell above (and re-run the download cell). All downstream prompts work unchanged across locales." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5502ca81", + "metadata": {}, + "outputs": [], + "source": [ + "# Toggle the source of the base \"person\" record.\n", + "# False (default) -- sample from the NGC-hosted Nemotron-Personas-USA artifact.\n", + "# True -- generate persons from a custom PGM via SDG-PGMs (TODO; see below).\n", + "SAMPLE_FROM_SDG_PGM = False" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38af8e15", + "metadata": {}, + "outputs": [], + "source": [ + "config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)\n", + "\n", + "if SAMPLE_FROM_SDG_PGM:\n", + " # TODO: Generate the base person record from a custom Probabilistic Graphical Model\n", + " # using NeMo SDG-PGMs (https://github.com/NVIDIA-NeMo/SDG-PGMs), then layer the OCEAN\n", + " # Big-Five helpers above on top. 
This matches the original four-stage Nemotron-Personas\n", + " # pipeline (Stage 1 = OCEAN helpers, Stage 2 = PGM demographics).\n", + " #\n", + " # The integration is approximately:\n", + " #\n", + " # from data_designer_plugins.pgm_generator_plugin import PGMGeneratorPluginConfig\n", + " # ocean_df = generate_ocean_traits(NUM_RECORDS) # Stage 1 (OCEAN)\n", + " # config_builder.with_seed_dataset(\n", + " # dd.DataFrameSeedSource(df=ocean_df),\n", + " # sampling_strategy=dd.SamplingStrategy.ORDERED,\n", + " # )\n", + " # config_builder.add_column( # Stage 2 (demographics)\n", + " # PGMGeneratorPluginConfig(\n", + " # name=\"person\",\n", + " # generator_class=\"my_generators.UsPersonGenerator\",\n", + " # )\n", + " # )\n", + " raise NotImplementedError(\n", + " \"SDG-PGMs path is not implemented in this notebook yet. \"\n", + " \"See https://github.com/NVIDIA-NeMo/SDG-PGMs for the open-sourced library.\"\n", + " )\n", + "\n", + "# Default path: sample synthetic personal details + OCEAN traits from the NGC-hosted asset.\n", + "# `with_synthetic_personas=True` exposes Big Five t-scores + labels + descriptions, plus\n", + "# `person.cultural_background`, hobbies, career goals, and context-specific personas (those\n", + "# extra fields stay nested in `person` and don't conflict with the columns we regenerate\n", + "# downstream). 
`drop=True` keeps `person` from leaking into the final dataset.\n", + "config_builder.add_column(\n", + " dd.SamplerColumnConfig(\n", + " name=\"person\",\n", + " sampler_type=dd.SamplerType.PERSON,\n", + " params=dd.PersonSamplerParams(\n", + " locale=personas_locale,\n", + " age_range=[18, 114],\n", + " with_synthetic_personas=True,\n", + " # sex=\"Male\" # Optional: filter by sex\n", + " # city=[\"New York\", \"Los Angeles\"] # Optional: filter by cities\n", + " ),\n", + " drop=True,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "490d9149", + "metadata": {}, + "outputs": [], + "source": [ + "# Add a unique identifier for each record\n", + "config_builder.add_column(name=\"uuid\", column_type=\"sampler\", sampler_type=\"uuid\")\n", + "\n", + "# Lift OCEAN traits to top-level so the original prompts can reference {{ openness.description }} etc.\n", + "for trait in [\"openness\", \"conscientiousness\", \"extraversion\", \"agreeableness\", \"neuroticism\"]:\n", + " config_builder.add_column(dd.ExpressionColumnConfig(name=trait, expr=f\"{{{{ person.{trait} }}}}\"))\n", + "\n", + "# Add specific personal detail columns -- NOT included in the public release, but used for seeding Personas\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"ethnic_background\",\n", + " expr=\"{{ person.ethnic_background if person.ethnic_background else ' ' }}\",\n", + " )\n", + ")\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"first_name\", expr=\"{{ person.first_name }}\"))\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"middle_name\",\n", + " expr=\"{{ person.middle_name if person.middle_name else ' ' }}\",\n", + " )\n", + ")\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"last_name\", expr=\"{{ person.last_name }}\"))\n", + "# Note: the underlying field is `district`; the original Nemotron-Personas-USA dataset surfaces it as 
`county`.\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"county\", expr=\"{{ person.district }}\"))\n", + "\n", + "# Add specific personal detail columns -- included in the public release\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"sex\", expr=\"{{ person.sex }}\"))\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"age\", expr=\"{{ person.age }}\"))\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"marital_status\", expr=\"{{ person.marital_status }}\"))\n", + "# These can legitimately be null in the source dataset; coerce to a single space so downstream\n", + "# Jinja templates stay safe (DD's validator rejects expression columns that render to \"\").\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"education_level\",\n", + " expr=\"{{ person.education_level if person.education_level else ' ' }}\",\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"bachelors_field\",\n", + " expr=\"{{ person.bachelors_field if person.bachelors_field else ' ' }}\",\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"occupation\",\n", + " expr=\"{{ person.occupation if person.occupation else ' ' }}\",\n", + " )\n", + ")\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"city\", expr=\"{{ person.city }}\"))\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"state\", expr=\"{{ person.state }}\"))\n", + "# Note: the underlying field is `postcode`; the original dataset surfaces it as `zipcode`.\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"zipcode\", expr=\"{{ person.postcode }}\"))\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"country\", expr=\"{{ person.country }}\"))" + ] + }, + { + "cell_type": "markdown", + "id": "5ba21894", + "metadata": {}, + "source": [ + "### πŸ‘€ Generate a preview to see what we have 
so far (OCEAN + PGM columns only for now)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d10ef036", + "metadata": {}, + "outputs": [], + "source": [ + "preview = data_designer.preview(config_builder, num_records=10)\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "68c58b71", + "metadata": {}, + "source": [ + "### ➑️ Next, generate persona attributes grounded in OCEAN + PGM" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca534dd7", + "metadata": {}, + "outputs": [], + "source": [ + "PERSONA_ATTRIBUTES_SYSTEM_PROMPT = \"\"\"\\\n", + "You are a detailed persona generator specializing in creating realistic, nuanced, and diverse personal attributes. You should:\n", + "1. Generate attributes that are internally consistent and logically connected to the base persona details\n", + "2. Ensure cultural sensitivity and avoid stereotypes while acknowledging cultural influences\n", + "3. Create specific, detailed responses rather than generic ones\n", + "4. Base your responses on realistic correlations between personal attributes like ethnic background, age, sex, marital status, education, occupation, etc.\n", + "5. Always return your response in a valid JSON format\n", + "6. DO NOT include any explanations or reasoning for your choices\n", + "\n", + "Your responses should be creative yet plausible, diverse yet consistent with the provided demographic information.\n", + "\"\"\"\n", + "\n", + "\n", + "# We define a PersonaAttributes schema so that all attributes are generated in one go,\n", + "# with the types and constraints as specified below. 
Pydantic is used to automatically validate the output.\n", + "class PersonaAttributes(BaseModel):\n", + " cultural_background: str = Field(description=\"Description of the person's cultural background\")\n", + " skills_and_expertise: str = Field(description=\"Description of the person's skills and expertise\")\n", + " skills_and_expertise_list: list[str] = Field(description=\"List of the person's skills and expertise\")\n", + " career_goals_and_ambitions: str = Field(description=\"Description of the person's career goals and ambitions\")\n", + " hobbies_and_interests: str = Field(description=\"Description of the person's hobbies and interests\")\n", + " hobbies_and_interests_list: list[str] = Field(description=\"List of the person's hobbies and interests\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ac30f35", + "metadata": {}, + "outputs": [], + "source": [ + "# Here we use a structured output column trick to generate all persona attributes\n", + "# in one shot, minimizing the number of API calls.\n", + "#\n", + "# Note how easy it is to access other fields in the dataset via Jinja templating.\n", + "# Doing so automatically infuses every record with row-specific details.\n", + "config_builder.add_column(\n", + " dd.LLMStructuredColumnConfig(\n", + " name=\"persona_attributes\",\n", + " system_prompt=PERSONA_ATTRIBUTES_SYSTEM_PROMPT,\n", + " prompt=\"\"\"\\\n", + "Based on a person with the following profile:\n", + "\n", + "Name: {{ first_name }} {{ middle_name if middle_name else '' }} {{ last_name }}\n", + "Sex: {{ sex }}\n", + "Age: {{ age }}\n", + "{{ 'Ethnic background: ' + ethnic_background if ethnic_background else ''}}\n", + "Marital status: {{ marital_status }}\n", + "Education: {{ education_level }}{{ ' in ' + bachelors_field if bachelors_field != 'no degree' else '' }}\n", + "Occupation: {{ occupation }}\n", + "Location: {{ city }}, {{ state }}, {{ county }}\n", + "\n", + "Personality profile:\n", + "- {{ 
openness.description }}\n", + "- {{ conscientiousness.description }}\n", + "- {{ extraversion.description }}\n", + "- {{ agreeableness.description }}\n", + "- {{ neuroticism.description }}\n", + "\n", + "Generate the following detailed persona attributes:\n", + "- cultural_background\n", + "- skills_and_expertise\n", + "- skills_and_expertise_list\n", + "- career_goals_and_ambitions\n", + "- hobbies_and_interests\n", + "- hobbies_and_interests_list\n", + "\n", + "When generating attributes, make sure to incorporate the influences suggested by the personality profile description.\n", + "\"\"\",\n", + " output_format=PersonaAttributes,\n", + " model_alias=MODEL_ALIAS,\n", + " drop=True,\n", + " )\n", + ")\n", + "\n", + "# Now we break up into multiple columns\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(name=\"cultural_background\", expr=\"{{ persona_attributes.cultural_background }}\")\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(name=\"skills_and_expertise\", expr=\"{{ persona_attributes.skills_and_expertise }}\")\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"skills_and_expertise_list\", expr=\"{{ persona_attributes.skills_and_expertise_list }}\"\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"career_goals_and_ambitions\", expr=\"{{ persona_attributes.career_goals_and_ambitions }}\"\n", + " )\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(name=\"hobbies_and_interests\", expr=\"{{ persona_attributes.hobbies_and_interests }}\")\n", + ")\n", + "config_builder.add_column(\n", + " dd.ExpressionColumnConfig(\n", + " name=\"hobbies_and_interests_list\", expr=\"{{ persona_attributes.hobbies_and_interests_list }}\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0d6c96a8", + "metadata": {}, + "source": [ + "### πŸ” Generate a preview and examine a sample record" + ] + 
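+  },
+  {
+   "cell_type": "markdown",
+   "id": "1f2e3d4c",
+   "metadata": {},
+   "source": [
+    "Optionally, before generating the preview, you can sanity-check the structured-output contract. The next cell is an optional sketch (not part of the original Nemotron-Personas pipeline): it prints the JSON schema that Pydantic derives from `PersonaAttributes`, which is roughly the shape the `persona_attributes` structured column asks the model to fill."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b6a7988",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sanity check: inspect the JSON schema Pydantic derives from the\n",
+    "# PersonaAttributes model (`json` is imported in the setup cell above).\n",
+    "print(json.dumps(PersonaAttributes.model_json_schema(), indent=2))"
+   ]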
}, + { + "cell_type": "code", + "execution_count": null, + "id": "f45d8aa1", + "metadata": {}, + "outputs": [], + "source": [ + "preview = data_designer.preview(config_builder, num_records=10)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76a733a2", + "metadata": {}, + "outputs": [], + "source": [ + "preview.dataset[0:3]" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "162fb6a7", + "metadata": {}, + "outputs": [], + "source": [ + "preview.display_sample_record()" ] + }, + { + "cell_type": "markdown", + "id": "7f178df2", + "metadata": {}, + "source": [ + "## 4.3 πŸ¦Έβ€β™€οΈ πŸ‘©β€πŸŽ€ πŸ‘©β€πŸ³ πŸ‘©β€πŸ”¬ Generate Personas\n", + "\n", + "Now, let's focus on the second part shown in the diagram below:\n", + "\n", + "
\n", + " \"Stage\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f0b882b", + "metadata": {}, + "outputs": [], + "source": [ + "PERSONA_SYSTEM_PROMPT = \"\"\"\\\n", + "You are a specialized persona generator that creates fine-grained, creative and meaningful persona descriptions based on an individual's cultural background, skills, career goals, and interests. You should:\n", + "1. Synthesize a coherent persona that naturally emerges from these characteristics\n", + "2. Focus on how these attributes combine to create a unique perspective and approach to life\n", + "3. Ensure the persona description reflects the intersection of professional expertise, cultural values, and personal interests\n", + "4. Create a narrative that explains how these characteristics influence their worldview and decision-making\n", + "5. Always return your response in a valid JSON format\n", + "6. INCLUDE NAME IN EVERY PERSONA DESCRIPTION.\n", + "7. ALWAYS TAKE AGE INTO ACCOUNT TO INFORM INTERESTS, HABITS AND AFFINITY TO VARIOUS ASPECTS OF LIFE.\n", + "8. NEVER DIRECTLY MENTION THE CULTURAL HERITAGE. INSTEAD, INFUSE IT INTO PERSONA DESCRIPTIONS BY REFERRING TO CULTURAL PRACTICES, TRADITIONS, AND VALUES.\n", + "\n", + "Each persona should be very specific, not a generic/bland description. Do not shy away from mentioning bad habits or quirks.\n", + "\n", + "Here are examples of how each persona description may begin:\n", + "\"An aspiring musician...\"\n", + "\"A renowned machine learning researcher...\"\n", + "\"A neonatal nurse with decades of experince...\"\n", + "\"An urban planner with a passion...\"\n", + "\"\"\"\n", + "\n", + "\n", + "# We define a Personas schema so that all attributes are generated in one go,\n", + "# with the types and constraints specified. 
Again, Pydantic is used to automatically validate the output.\n", + "class Personas(BaseModel):\n", + " professional_persona: str = Field(\n", + " description=\"A one-sentence persona description including primary field of work, key professional skills, and how their unique personality traits manifest in their career\"\n", + " )\n", + " finance_persona: str = Field(\n", + " description=\"A one-sentence persona characterization of spending habits, relationship with money, saving and investment habits, and approach to financial decision-making, mentioning specific financial instruments and investment strategies they use.\"\n", + " )\n", + " healthcare_persona: str = Field(\n", + " description=\"A one-sentence persona description of very specific health conditions they have as well as their approach to medical care, and their typical behavior as a patient. Include condition names and describe how the person proactively addresses/ completely neglects/ periodically manages/ actively monitors/ struggles with/ effectively controls/ inconsistently treats these conditions\"\n", + " )\n", + " sports_persona: str = Field(\n", + " description=\"A one-sentence persona description of athletic interests, seasonal sports, and their approach to fitness and exercise. Provide specific names of professional sports teams and club affiliations, based on the persona location\"\n", + " )\n", + " arts_persona: str = Field(\n", + " description=\"A one-sentence persona characterization of engagement with creative expression, artistic appreciation, cultural activities, and how the arts shape their identity and leisure time, if at all. Always provide specific artist/musician/actor/performer names\"\n", + " )\n", + " travel_persona: str = Field(\n", + " description=\"A one-sentence persona capturing travel interests and style, including planning preferences, adventure versus relaxation focus, and financial or family constraints. 
Always provide specific local and/or international destinations they have visited or wish to visit\"\n",
+    "    )\n",
+    "    culinary_persona: str = Field(\n",
+    "        description=\"A one-sentence persona description of food/cuisine preferences, cooking skill level, and approach to dining experiences. Always provide specific names of dishes and names of ingredients they enjoy.\"\n",
+    "    )\n",
+    "    concise_persona: str = Field(\n",
+    "        description=\"A one-sentence description capturing the essence of this person's unique perspective and approach to life, highlighting unique quirks, facts, and/or bad habits\"\n",
+    "    )\n",
+    "    detailed_persona: str = Field(\n",
+    "        description=\"A paragraph describing how the persona's cultural background, skills, goals, and interests shape their worldview and decision-making. Don't shy away from talking about bad habits or quirks\"\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da5e208b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Here we use a structured output column trick to generate all personas\n",
+    "# in one call, minimizing the number of API calls.\n",
+    "#\n",
+    "# As before, we can easily access other fields in the dataset via Jinja templating.\n",
+    "# Doing so automatically infuses every record with row-specific details.\n",
+    "config_builder.add_column(\n",
+    "    dd.LLMStructuredColumnConfig(\n",
+    "        name=\"personas\",\n",
+    "        system_prompt=PERSONA_SYSTEM_PROMPT,\n",
+    "        prompt=\"\"\"\\\n",
+    "Based on a person with the following persona attributes and profile:\n",
+    "\n",
+    "Age: {{ age }}\n",
+    "Cultural background: {{ cultural_background }}\n",
+    "{{ 'Hobbies and interests: ' + hobbies_and_interests if age >= 6 else '' }}\n",
+    "{{ 'Skills and expertise: ' + skills_and_expertise if age >= 16 else '' }}\n",
+    "{{ 'Career goals and ambitions: ' + career_goals_and_ambitions if age >= 16 else '' }}\n",
+    "\n",
+    "Personality profile:\n",
+    "- {{ openness.description }}\n",
+    "- {{ 
conscientiousness.description }}\n", + "- {{ extraversion.description }}\n", + "- {{ agreeableness.description }}\n", + "- {{ neuroticism.description }}\n", + "\n", + "Generate the following self-contained persona descriptions that capture how persona attributes and profile combine to create a unique individual's perspective and approach to various facets of life.\n", + "\n", + "- professional_persona\n", + "- finance_persona\n", + "- healthcare_persona\n", + "- sports_persona\n", + "- arts_persona\n", + "- travel_persona\n", + "- culinary_persona\n", + "- concise_persona\n", + "- detailed_persona\n", + "\n", + "Each requested persona description should be self-contained, meaning it can't begin with they/their as the reference wouldn't be clear.\n", + "When generating personas, make sure to incorporate the influences suggested by the personality profile description.\n", + "\n", + "DO NOT USE THE RACE OF THE PERSONA IN YOUR RESPONSE.\n", + "NEVER DIRECTLY MENTION THE CULTURAL HERITAGE. INSTEAD, INFUSE IT INTO PERSONA DESCRIPTIONS BY REFERRING TO CULTURAL PRACTICES, TRADITIONS, AND VALUES.\n", + "INCLUDE NAME IN EVERY PERSONA DESCRIPTION.\n", + "ALWAYS TAKE AGE INTO ACCOUNT TO INFORM INTERESTS, HABITS AND AFFINITY TO VARIOUS ASPECTS OF LIFE.\n", + "\n", + "Each persona description should be creative yet plausible and consistent with the provided demographic information and persona attributes.\n", + "Each persona should be very specific, not a generic/bland description. 
Do not shy away from mentioning bad habits or quirks.\n",
+    "\n",
+    "Here are examples of how each description may begin:\n",
+    "\"An aspiring musician...\"\n",
+    "\"A renowned machine learning researcher...\"\n",
+    "\"A neonatal nurse with decades of experience...\"\n",
+    "\"An urban planner with a passion...\"\n",
+    "\"\"\",\n",
+    "        output_format=Personas,\n",
+    "        model_alias=MODEL_ALIAS,\n",
+    "        drop=True,\n",
+    "    )\n",
+    ")\n",
+    "\n",
+    "# Now we break up into multiple columns\n",
+    "config_builder.add_column(\n",
+    "    dd.ExpressionColumnConfig(name=\"professional_persona\", expr=\"{{ personas.professional_persona }}\")\n",
+    ")\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"finance_persona\", expr=\"{{ personas.finance_persona }}\"))\n",
+    "config_builder.add_column(\n",
+    "    dd.ExpressionColumnConfig(name=\"healthcare_persona\", expr=\"{{ personas.healthcare_persona }}\")\n",
+    ")\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"sports_persona\", expr=\"{{ personas.sports_persona }}\"))\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"arts_persona\", expr=\"{{ personas.arts_persona }}\"))\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"travel_persona\", expr=\"{{ personas.travel_persona }}\"))\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"culinary_persona\", expr=\"{{ personas.culinary_persona }}\"))\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"concise_persona\", expr=\"{{ personas.concise_persona }}\"))\n",
+    "config_builder.add_column(dd.ExpressionColumnConfig(name=\"detailed_persona\", expr=\"{{ personas.detailed_persona }}\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "65498355",
+   "metadata": {},
+   "source": [
+    "### πŸ” Generate a preview and examine a sample record"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2ffbbcf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preview = 
data_designer.preview(config_builder, num_records=10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b525dc5d", + "metadata": {}, + "outputs": [], + "source": [ + "preview.dataset[0:3]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "182ba19f", + "metadata": {}, + "outputs": [], + "source": [ + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "7a56cbfc", + "metadata": {}, + "source": [ + "### ↗️ Scale Up Persona Generation\n", + "Scale up to the specified `NUM_RECORDS`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "935dee6f", + "metadata": {}, + "outputs": [], + "source": [ + "scaled_persona_results = data_designer.create(config_builder, num_records=NUM_RECORDS, dataset_name=\"personas\")\n", + "\n", + "# Load the dataset into a pandas DataFrame\n", + "all_personas = scaled_persona_results.load_dataset()\n", + "all_personas.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "3ee205af", + "metadata": {}, + "source": [ + "### πŸ“„ View the evaluation report" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0cf4d65", + "metadata": {}, + "outputs": [], + "source": [ + "analysis = scaled_persona_results.load_analysis()\n", + "analysis.to_report()" + ] + }, + { + "cell_type": "markdown", + "id": "2e9b0dab", + "metadata": { + "lines_to_next_cell": 2 + }, + "source": [ + "# 5. 🎯 Customize for your use case\n", + "\n", + "Everything above reproduces the **general-purpose** Nemotron-Personas-USA pipeline. In practice, enterprises will want personas grounded in their own domain β€” a healthcare provider needs persona dimensions a media company doesn't, and vice versa. 
With NeMo Data Designer, layering a custom attribute or persona on top of the released artifact is a few lines of config.\n", + "\n", + "To make the customization story concrete, the cell below adds a **`tech_persona`** dimension (with a specific list of `tech_tools` they use) that wasn't in the original Nemotron-Personas schema. The same pattern (one Pydantic schema + one `LLMStructuredColumnConfig` + one expression column per output field) generalizes to any domain-specific dimension you need." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0533fed", + "metadata": {}, + "outputs": [], + "source": [ + "class TechPersona(BaseModel):\n", + " tech_persona: str = Field(\n", + " description=(\n", + " \"A 2-3 sentence description of this person's relationship with technology: \"\n", + " \"comfort with AI/digital tools, level of tech adoption (early-adopter / mainstream / late / \"\n", + " \"skeptic), preferred devices, and one specific way technology shapes their daily routine. \"\n", + " \"Be specific and consistent with the rest of the persona profile.\"\n", + " )\n", + " )\n", + " tech_tools: list[str] = Field(\n", + " description=(\n", + " \"List of 4-6 specific tech tools, apps, services, or devices this person uses regularly. \"\n", + " \"Each entry should be a concrete named product, not a generic category.\"\n", + " )\n", + " )\n", + "\n", + "\n", + "config_builder.add_column(\n", + " dd.LLMStructuredColumnConfig(\n", + " name=\"custom_persona\",\n", + " system_prompt=(\n", + " \"You write nuanced, specific tech-relationship personas grounded in demographic \"\n", + " \"and psychometric attributes. 
Avoid generic platitudes; ground every claim in the \"\n", + " \"person's age, occupation, personality, and lifestyle.\"\n", + " ),\n", + " prompt=\"\"\"\\\n", + "Based on a person with the following persona profile:\n", + "\n", + "Name: {{ first_name }} {{ last_name }}, Age: {{ age }}, Occupation: {{ occupation }}\n", + "Cultural background: {{ cultural_background }}\n", + "Career goals: {{ career_goals_and_ambitions }}\n", + "Hobbies: {{ hobbies_and_interests }}\n", + "Concise persona: {{ concise_persona }}\n", + "\n", + "Personality profile:\n", + "- {{ openness.description }}\n", + "- {{ conscientiousness.description }}\n", + "- {{ extraversion.description }}\n", + "- {{ agreeableness.description }}\n", + "- {{ neuroticism.description }}\n", + "\n", + "Generate the `tech_persona` and `tech_tools` fields as described in the schema. Be specific and consistent with the profile above.\n", + "\"\"\",\n", + " output_format=TechPersona,\n", + " model_alias=MODEL_ALIAS,\n", + " drop=True,\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"tech_persona\", expr=\"{{ custom_persona.tech_persona }}\"))\n", + "config_builder.add_column(dd.ExpressionColumnConfig(name=\"tech_tools\", expr=\"{{ custom_persona.tech_tools }}\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66d6813a", + "metadata": {}, + "outputs": [], + "source": [ + "preview = data_designer.preview(config_builder, num_records=5)\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "id": "08701311", + "metadata": {}, + "source": [ + "# ⏭️ Next Steps\n", + "\n", + "1. Everything above is just an example of personas that can be generated. These personas are not set in stone and can be easily adjusted. For example, if you need a different type of persona for *-Nemotron, tweak or extend the pipeline (Section 5 demonstrates the pattern).\n", + "\n", + "2. 
You should be able to use this notebook as is to generate Nemotron-Personas for any of the [supported locales](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) by changing `personas_locale` and re-running the download cell. For a brand-new region without an NGC dataset, flip `SAMPLE_FROM_SDG_PGM = True` and provide a custom [SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) generator (the OCEAN helpers in Section 2 are the Stage 1 scaffolding for that path).\n", + " - You may need to adjust and/or translate prompts to your region's language(s)\n", + " - You may need to work with a different LLM that is better suited for your region" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/devnotes/posts/assets/nemotron-personas/nemotron-personas-world-map.png b/docs/devnotes/posts/assets/nemotron-personas/nemotron-personas-world-map.png new file mode 100644 index 000000000..da745e3e3 Binary files /dev/null and b/docs/devnotes/posts/assets/nemotron-personas/nemotron-personas-world-map.png differ diff --git a/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png new file mode 100644 index 000000000..003e4be20 Binary files /dev/null and b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png differ diff --git a/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_2.png b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_2.png new file mode 100644 index 000000000..79a870dbb Binary files /dev/null and b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_2.png differ diff --git a/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_3.png 
b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_3.png new file mode 100644 index 000000000..ee3f1959b Binary files /dev/null and b/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd_step_3.png differ diff --git a/docs/devnotes/posts/nemotron-personas.md b/docs/devnotes/posts/nemotron-personas.md new file mode 100644 index 000000000..1691235d9 --- /dev/null +++ b/docs/devnotes/posts/nemotron-personas.md @@ -0,0 +1,406 @@ +--- +date: 2026-05-07 +authors: + - ymeyer + - dcorneil +--- + + + +# **Inside Nemotron-Personas: Multi-Locale Synthetic Personas Powering Nemotron Training** + +The [Nemotron-Personas HF collection](https://huggingface.co/collections/nvidia/nemotron-personas) is a growing family of multilingual, region-specific synthetic persona datasets (currently covering seven countries and nine language variants with roughly **53 million personas** in total), each grounded in real-world demographic and geographic distributions. Behind every dataset is the same NeMo Data Designer compound-AI pipeline, adapted per region. And while the public release is a useful artifact in its own right, what's less visible is just how much these personas show up in **Nemotron model training itself** β€” seeding long-context samples, tool-use rollouts, formal-logic data, safety refusals, and general chat. This post pulls back the curtain on both halves of that story: how the collection is built, and how it is used. + + + +

+ Nemotron-Personas collection +

+ +--- + +## **Why grounded synthetic personas matter** + +It's easy to underestimate what a really good persona seed buys you. Three angles worth keeping in mind: + +1. **Distributional faithfulness for sovereign AI.** Models trained on synthetic data that doesn't reflect the actual demographics of a region inherit subtle biases β€” over-representing some groups, under-representing others, getting cultural context wrong. For sovereign-AI work, that's not a rounding error; it's the whole problem. Grounding personas in census + administrative data closes that gap before the LLM ever sees the data. + +2. **Diversity that random sampling can't produce.** "Generate 10,000 customer queries" with no seed and an LLM will give you 10,000 variations on the same handful of latent personas. Conditioning each query on a distinct, demographically-grounded persona forces the model to span the actual population it'll be deployed against β€” the conscientious 62-year-old retired electrician in Pittsburgh, the 24-year-old graduate student in Bengaluru, the elementary-school teacher in Lille. Each yields a meaningfully different prompt. + +3. **Reusable seed material.** Once a persona has a name, a demographic profile, an OCEAN vector, and a coherent backstory, *any* downstream pipeline can attach to it: a tool-use environment, a long-context construction, a safety-refusal template, a roleplay scenario. The collection acts as a library β€” generate the personas once, reuse them across training stages. + +That last point is the bridge to the rest of this post. + +--- + +## **Nemotron-Personas inside Nemotron training** + +The [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) shows just how foundational these personas have become. They're not a side-quest dataset; they're a *seeding primitive* used across many post-training stages. 
+ +### Long-context samples + +Long-context training data is hard to source β€” you need genuinely long, coherent sequences that aren't just concatenations of unrelated documents. Persona records, by virtue of being self-contained narratives with rich attributes, concatenate cleanly: + +> *"We also construct long-context samples by concatenating records from [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) to reach the required sequence length."* +> +> β€” [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) + +Each persona is internally coherent (the OCEAN traits inform the cultural background, which informs the career goals, which informs the professional persona, etc.), and across personas the records are independent β€” exactly the right shape to pack into long sequences. + +### General-purpose tool-use rollouts + +Tool-use trajectories require a *user* with a goal, not just a tool set. The Super pipeline uses a dual-LLM setup where one LLM plays the user and another plays the agent: + +> *"The User-LLM is seeded with the selected tool set, a persona sampled from [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA)..."* +> +> β€” [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) + +Seeding the user side with a real persona is what makes the rollouts feel like authentic conversations β€” the user's goals, communication style, and frustration patterns all flow from their underlying attributes. The agent has to handle the variance that real users actually produce, not the narrow band of "well-behaved benchmark user" prompts. + +A closely related approach was used to build **Nemotron-Nano-9B-v2-Japanese**, NVIDIA's Japanese small language model that ranks **#1 on the Nejumi LLM Leaderboard**. 
The Japanese instruction-following + general-chat data was seeded by [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan), with prompts and assistant responses anchored to Japanese-grounded personas. That's the multi-locale story turning into a multi-locale model story: a Japanese persona collection, generated by a localized Data Designer pipeline, becomes the seeding layer for a Japanese model that tops the leaderboard.
+
+The same template is being used across the family β€” instruction-following and general-chat data going into Nemotron Nano v3 (and from there into Super v3) follows the same persona-seeded recipe.
+
+### Synthetic formal-logic data
+
+Even abstract reasoning data benefits from persona conditioning:
+
+> *"We introduced variability into the generated scenarios, premises, and formulas by incorporating random personas, letters, and/or logic connective (i.e., ∧, ∨, βŠƒ, ≑, ∼) into the prompt."*
+>
+> β€” [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
+
+Formal-logic problems become more diverse β€” and more transferable β€” when the surface scenario shifts. A propositional-logic puzzle about an elementary teacher planning a field trip exercises the same underlying inference as one about a credit-counselor evaluating a loan, but the lexical surface looks completely different. Persona-driven scenario variation breaks the model out of the canonical "Alice and Bob" rut that plagues most synthetic formal-logic datasets.
+
+### Sensitive-safety-category-refusals (SSCR)
+
+The SSCR dataset β€” used in Nemotron's safety blend β€” leverages Nemotron-Personas as seed data when constructing prompts that require refusal across sensitive categories. 
Personas matter here because real-world adversarial / sensitive prompts come from all kinds of users; grounding the synthetic prompts in demographically diverse personas ensures the trained refusal behavior generalizes across user populations rather than overfitting to a narrow band of "obviously suspicious" phrasings. SSCR is included in the broader **nemotron-safety-blend**. + +### General chat and instruction following + +The same persona-seeding pattern that powers tool-use rollouts also powers the broader general-chat and instruction-following data that flows into Nemotron Nano v3 and from there into Super v3. A chat or instruction sample is a function of *who* is asking β€” their goals, their constraints, their communication style β€” and personas are how the pipeline encodes "who." + +--- + +## **How they're built: a four-stage compound-AI pipeline** + +Across all locales, the construction pipeline is the same four-stage shape (the regional adaptations live in the seed distributions, the language of the prompts, and which locale-specific fields get added). NeMo Data Designer orchestrates the pipeline as a column DAG: + +

+ Pipeline overview: PGM demographics + OCEAN traits seed two stages of structured-output LLM generation +

+ +### Stage 1 β€” OCEAN Big-Five sampling + +OCEAN ([Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits)) is the most empirically grounded model of human personality. For each persona we sample five trait T-scores (\(\mu = 50\), \(\sigma = 10\), clipped to \([20, 80]\)), bucket each into a coarse label, and attach a prose description grounded in the personality literature. Working at the description level (rather than raw scores) is what makes the downstream LLM stages produce nuanced, internally-consistent narratives β€” "highly conscientious" vs "highly extraverted" reads very differently to an LLM than `t_score=72`. + +The score-to-label mapping is shared across all five traits: + +| T-score | Label | +| :---: | :--- | +| 20 – 34 | very low | +| 35 – 44 | low | +| 45 – 54 | average | +| 55 – 64 | high | +| 65 – 80 | very high | + +Each (trait, label) pair maps to a curated description that captures how that level of the trait actually manifests. A representative slice of the **openness** mapping: + +| Label | Description | +| :--- | :--- | +| very low | *"Strongly prefers routine and the familiar. Traditional in thinking and values practicality over abstract ideas."* | +| low | *"Generally prefers structure and predictability. Tends to be practical and focused on immediate realities."* | +| average | *"Balances curiosity with practicality. Appreciates both new ideas and established methods."* | +| high | *"Curious and appreciative of art, new ideas, and varied experiences. Open to unconventional thinking."* | +| very high | *"Highly imaginative and intellectually curious. Strongly drawn to novelty, art, and abstract concepts."* | + +The other four traits each have their own 5-row description table tuned to their domain (conscientiousness around organization vs spontaneity, extraversion around social energy, agreeableness around cooperation, neuroticism around emotional reactivity). 
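The sampling-and-bucketing step is small enough to sketch in full. This is an illustrative reimplementation of the distributions and table above, not the pipeline's actual code β€” the helper names are ours, and the curated per-trait description lookup is elided:

```python
import random

# Shared score-to-label mapping from the table above.
def t_score_to_label(t: int) -> str:
    if t <= 34:
        return "very low"
    if t <= 44:
        return "low"
    if t <= 54:
        return "average"
    if t <= 64:
        return "high"
    return "very high"


def sample_trait() -> dict:
    # T-score ~ N(mu=50, sigma=10), clipped to [20, 80].
    t = round(random.gauss(50, 10))
    t = max(20, min(80, t))
    return {"t_score": t, "label": t_score_to_label(t)}


TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
persona_ocean = {trait: sample_trait() for trait in TRAITS}
```

In the real pipeline, each `(trait, label)` pair is then joined against the curated description tables before the record moves on to Stage 3.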
The result is that one sampled persona arrives at Stage 3 with a structured personality block: + +```python +{ + "openness": {"t_score": 67, "label": "high", "description": "Curious and appreciative of art..."}, + "conscientiousness": {"t_score": 72, "label": "very high", "description": "Exceptionally organized..."}, + "extraversion": {"t_score": 41, "label": "low", "description": "Generally reserved..."}, + "agreeableness": {"t_score": 55, "label": "average", "description": "Generally cooperative..."}, + "neuroticism": {"t_score": 38, "label": "low", "description": "Emotionally stable..."}, +} +``` + +…which the downstream LLM prompts reference directly via Jinja templates: + +```jinja +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} +``` + +### Stage 2 β€” Demographically-grounded sampling + +This is the engine of regional fidelity. For each locale, the goal is to produce a demographic record whose attributes correlate with each other the way real populations do β€” age Γ— education Γ— occupation Γ— marital status Γ— geography, with locale-specific extensions. Naive independent sampling produces nonsensical records (3-year-old surgeon married for 30 years living alone in Singapore); the released artifact pulls from [Probabilistic Graphical Models](https://en.wikipedia.org/wiki/Graphical_model) trained on real statistical distributions (census tables, administrative records, public surveys) so the correlations are statistically faithful. + +**The simplest path to seed your own pipeline today** is to consume the released NGC-hosted Nemotron-Personas dataset directly via Data Designer's built-in `PersonSampler`. This gives you the full demographic + OCEAN block from a verified PGM-grounded source without rebuilding anything yourself. 
One `SamplerColumnConfig` is enough: + +```python +import data_designer.config as dd + +config_builder.add_column( + dd.SamplerColumnConfig( + name="person", + sampler_type=dd.SamplerType.PERSON, + params=dd.PersonSamplerParams( + locale="en_US", # or ja_JP, en_IN, fr_FR, ko_KR, pt_BR, en_SG, hi_Deva_IN, hi_Latn_IN + age_range=[18, 114], + with_synthetic_personas=True, # exposes Big Five + cultural background + hobbies + career_goals + ... + ), + drop=True, + ) +) +``` + +`{{ person.openness.description }}`, `{{ person.occupation }}`, `{{ person.county }}` all become available to downstream Jinja templates immediately. See the [Person Sampling docs](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) for the full setup walkthrough (NGC API key + `data-designer download personas --locale en_US`). + +**For new locales without a released artifact β€” or for teams that need full control over the demographic distributions** β€” the underlying engine, **SDG-PGMs**, was just open-sourced as [NVIDIA-NeMo/SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs). Its README states the connection plainly: + +> *"Together with Data Designer, SDG-PGMs helps power the Nemotron-Personas HF collection β€” multilingual, region-specific synthetic persona datasets for sovereign AI development. The USA dataset alone contains 6M personas grounded in US Census data, with realistic demographic correlations across age, sex, geography, education, marital status, and 560+ occupations."* + +A first-class Data Designer plugin (`PGMGeneratorPluginConfig`) is **coming soon**. 
The eventual integration shape: + +```python +# Coming soon β€” full Data Designer integration for custom PGMs: +from data_designer_plugins.pgm_generator_plugin import PGMGeneratorPluginConfig + +config_builder.add_column( + PGMGeneratorPluginConfig( + name="person", + generator_class="my_generators.UsPersonGenerator", + ) +) +``` + +Until that lands, SDG-PGMs can be run standalone (output β†’ seed parquet β†’ `dd.LocalFileSeedSource`) to feed any Data Designer pipeline. Either way, Stage 2 produces a consistent demographic record per persona; the locale-specific fields (France's `name_heritage`, Korea's military/health indicators, India's multi-language stack, etc.) are layered in here, sourced from the relevant regional statistical bodies. + +### Stage 3 β€” Persona attributes via structured outputs + +With OCEAN traits and demographic grounding in hand, the pipeline calls a reasoning LLM with a single `LLMStructuredColumnConfig` that materializes six rich attribute fields in one shot via a Pydantic schema: + +

+ Stage 3: Persona attributes via structured outputs +

+ +```python +from pydantic import BaseModel, Field + + +class PersonaAttributes(BaseModel): + cultural_background: str = Field(description="Description of the person's cultural background") + skills_and_expertise: str = Field(description="Description of the person's skills and expertise") + skills_and_expertise_list: list[str] = Field(description="List of the person's skills and expertise") + career_goals_and_ambitions: str = Field(description="Description of the person's career goals and ambitions") + hobbies_and_interests: str = Field(description="Description of the person's hobbies and interests") + hobbies_and_interests_list: list[str] = Field(description="List of the person's hobbies and interests") + + +config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="persona_attributes", + system_prompt=PERSONA_ATTRIBUTES_SYSTEM_PROMPT, + prompt="""\ +Based on a person with the following profile: + +Name: {{ first_name }} {{ middle_name if middle_name else '' }} {{ last_name }} +Age: {{ age }}, Sex: {{ sex }}, Occupation: {{ occupation }} +Location: {{ city }}, {{ state }}, {{ county }} + +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} + +Generate the cultural_background, skills_and_expertise, career_goals_and_ambitions, and hobbies_and_interests fields. +""", + output_format=PersonaAttributes, + model_alias=MODEL_ALIAS, + drop=True, + ) +) +``` + +The system prompt forces internal consistency ("attributes that are internally consistent and logically connected to the base persona details"), cultural sensitivity ("avoid stereotypes while acknowledging cultural influences"), and specificity ("create specific, detailed responses rather than generic ones"). Pydantic schema enforcement means every record's attributes parse cleanly downstream. 
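One practical detail: because `drop=True` removes the raw structured column from the final dataset, each field is re-exposed as its own column via expression columns. A sketch of that fan-out, assuming the `dd` import and `config_builder` from the snippets above (the released pipeline follows this same pattern, which is why later prompts can reference `{{ cultural_background }}` directly):

```python
# Re-expose each field of the dropped structured column as a standalone dataset column.
for field_name in [
    "cultural_background",
    "skills_and_expertise",
    "career_goals_and_ambitions",
    "hobbies_and_interests",
]:
    config_builder.add_column(
        dd.ExpressionColumnConfig(
            name=field_name,
            expr="{{ persona_attributes." + field_name + " }}",
        )
    )
```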
+ +### Stage 4 β€” Persona descriptions + +The final stage is a second structured-output LLM call that synthesizes everything above into nine cohesive persona descriptions: `professional_persona`, `finance_persona`, `healthcare_persona`, `sports_persona`, `arts_persona`, `travel_persona`, `culinary_persona`, `concise_persona`, and a paragraph-length `detailed_persona`. + +

+ Stage 4: Persona prose synthesis +

+ +```python +class Personas(BaseModel): + professional_persona: str = Field(description="...primary field of work, key professional skills...") + finance_persona: str = Field(description="...spending habits, relationship with money...") + healthcare_persona: str = Field(description="...specific health conditions, patient behavior...") + sports_persona: str = Field(description="...athletic interests, fitness approach, specific teams...") + arts_persona: str = Field(description="...engagement with creative expression, specific artists...") + travel_persona: str = Field(description="...travel interests, planning style, specific destinations...") + culinary_persona: str = Field(description="...food preferences, specific dishes and ingredients...") + concise_persona: str = Field(description="One-sentence essence of the person, including quirks.") + detailed_persona: str = Field(description="Paragraph-length descriptive narrative.") + + +config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="personas", + system_prompt=PERSONA_SYSTEM_PROMPT, + prompt="""\ +Based on a person with the following persona attributes and profile: + +Age: {{ age }} +Cultural background: {{ cultural_background }} +Hobbies and interests: {{ hobbies_and_interests }} +Skills and expertise: {{ skills_and_expertise }} +Career goals and ambitions: {{ career_goals_and_ambitions }} + +Personality profile: +- {{ openness.description }} +- ... + +Generate self-contained persona descriptions per the schema. +""", + output_format=Personas, + model_alias=MODEL_ALIAS, + drop=True, + ) +) +``` + +The system prompt contains explicit guardrails: include the name in every description, never directly mention cultural heritage (infuse it implicitly through practices and traditions), and always take age into account. The LLM does the synthesis; Pydantic does the validation; Data Designer's DAG executes the whole thing in parallel across millions of records. 
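With all four stages configured, executing the DAG follows the standard Data Designer run pattern, mirroring the accompanying tutorial notebook (`data_designer`, `config_builder`, and `NUM_RECORDS` are assumed from its setup cells):

```python
# Smoke-test the full column DAG on a handful of records first...
preview = data_designer.preview(config_builder, num_records=10)
preview.display_sample_record()

# ...then scale up and load the generated dataset as a pandas DataFrame.
results = data_designer.create(config_builder, num_records=NUM_RECORDS, dataset_name="personas")
all_personas = results.load_dataset()
```

The notebook also pulls the run's evaluation report afterwards via `results.load_analysis().to_report()`.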
+ +--- + +## **Building your own β€” the customization story** + +The released artifact is the *general-purpose* collection. In practice, every team that uses these personas downstream extends them in some way. NeMo Data Designer makes that trivial: the same `LLMStructuredColumnConfig` + `ExpressionColumnConfig` pattern that builds the released schema can be used to layer on any custom dimension you need. + +The accompanying [Data Designer Tutorial: Reproducing & Customizing Nemotron-Personas](#try-it-yourself) walks through a concrete example. After reproducing the released schema with a `PersonSampler` against the NGC-hosted dataset, the tutorial adds a custom `tech_persona` dimension with two new fields β€” a prose description of the persona's relationship with technology, plus a list of specific tech tools they use: + +```python +import data_designer.config as dd +from pydantic import BaseModel, Field + + +class TechPersona(BaseModel): + tech_persona: str = Field( + description=( + "A 2-3 sentence description of this person's relationship with technology: " + "comfort with AI/digital tools, level of tech adoption, preferred devices, " + "and one specific way technology shapes their daily routine." + ) + ) + tech_tools: list[str] = Field( + description=( + "List of 4-6 specific tech tools, apps, services, or devices this person uses regularly. " + "Each entry should be a concrete named product, not a generic category." + ) + ) + + +config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="custom_persona", + system_prompt=( + "You write nuanced, specific tech-relationship personas grounded in demographic " + "and psychometric attributes. Avoid generic platitudes; ground every claim in the " + "person's age, occupation, personality, and lifestyle." 
+ ), + prompt="""\ +Based on a person with the following persona profile: + +Name: {{ first_name }} {{ last_name }}, Age: {{ age }}, Occupation: {{ occupation }} +Cultural background: {{ cultural_background }} +Career goals: {{ career_goals_and_ambitions }} +Hobbies: {{ hobbies_and_interests }} + +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} + +Generate the `tech_persona` and `tech_tools` fields per the schema. +""", + output_format=TechPersona, + model_alias=MODEL_ALIAS, + drop=True, + ) +) + +config_builder.add_column(dd.ExpressionColumnConfig(name="tech_persona", expr="{{ custom_persona.tech_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="tech_tools", expr="{{ custom_persona.tech_tools }}")) +``` + +A representative output from the tutorial run: + +```text +tech_persona Megan pragmatically adopts mainstream tech, seamlessly weaving AI assistants + into her lesson planning while preferring her well-worn iPad over flashier + gadgets; technology shapes her workflow most when she's grading assignments + on Sunday evenings. +tech_tools ['MacBook Air', 'iPad Pro 12.9', 'iPhone 14', 'Google Classroom', + 'Microsoft OneNote', 'ChatGPT'] +``` + +That's it β€” a few lines of Pydantic + one LLM column + a couple of expression columns and the released schema picks up two brand-new domain-specific fields. The same pattern scales: a healthcare provider extends with `medical_history_persona` and `insurance_persona`; a media company extends with `media_consumption_persona` and `subscription_stack`; a financial-services team extends with `investment_persona` and `risk_tolerance_persona`. The PGM-grounded base record stays the seed; everything else is one schema away. + +### Going deeper: build a brand-new locale + +For locales without an NGC-hosted Nemotron-Personas dataset, the build path is open. 
The OCEAN Big-Five helpers ship in the tutorial repo (Stage 1 of the original pipeline), and [NeMo SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) provides the framework for building your own demographic PGM (Stage 2) β€” collect aggregate statistical distributions, declare a `PGMGenerator` subclass, and drop it into Data Designer via the bundled `PGMGeneratorPluginConfig`. The downstream LLM stages (3 and 4) are locale-agnostic; they just need the right language in the prompts. The tutorial leaves a `SAMPLE_FROM_SDG_PGM = True` toggle in place as the integration point. + +--- + +## **Try it yourself** + +The full reproduction-and-customization tutorial covers every detail in this post end-to-end, from the NGC dataset bootstrap through the toy custom-persona example. + +Open In Colab + +- **Tutorial notebook:** [Reproducing & Customizing Nemotron-Personas](../../notebooks/7-nemotron-personas.ipynb) β€” runs locally end-to-end; takes ~5 min on `gpt-oss-20b` for a 5-record smoke run. +- **Colab:** click the badge above to launch the same notebook on Colab. The injected setup cells handle the `NVIDIA_API_KEY` / `NGC_API_KEY` ceremony from Colab Secrets and install the NGC CLI before the persona dataset download. +- **NGC dataset setup (local):** see the [Person Sampling docs](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) for the full walkthrough (NGC API key + NGC CLI + `data-designer download personas --locale en_US`). + +Switching locales is a one-liner: change `personas_locale = "en_US"` to any of `en_IN`, `en_SG`, `fr_FR`, `hi_Deva_IN`, `hi_Latn_IN`, `ja_JP`, `ko_KR`, `pt_BR` and re-run the download cell. Everything downstream stays the same. 
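If you'd rather script the cache check than eyeball it, the download guard reduces to a few lines of stdlib Python (this sketch assumes the default `~/.data-designer/managed-assets/datasets/` location the tutorial notebook uses):

```python
from pathlib import Path

personas_locale = "fr_FR"  # swap in any supported locale code

# Default location where the Data Designer CLI places downloaded persona datasets.
assets_dir = Path.home() / ".data-designer" / "managed-assets" / "datasets"
cached = assets_dir.exists() and any(assets_dir.glob(f"{personas_locale}*.parquet"))

if not cached:
    print(f"run: data-designer download personas --locale {personas_locale}")
```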
+ +--- + +## **Closing thoughts** + +The headline number on the [Nemotron-Personas HF collection](https://huggingface.co/collections/nvidia/nemotron-personas) is the persona count, but the real story is that **a single, modular, locale-adaptable pipeline produces seed material that recurs throughout Nemotron's training stack**. Long-context construction, tool-use rollouts, formal-logic variability, safety refusals, instruction-following data β€” all of them lean on the same underlying primitive. That's the compound-AI bet paying off: build the right primitive once, and many downstream pipelines stop being one-off projects. + +If you're building region-specific synthetic data for your own model, the path is clear: take a locale's released artifact, layer your domain-specific dimensions on top with a few lines of Data Designer config, and you have a custom dataset that inherits all the demographic grounding the original artifact carries. + +--- + +**Key Resources:** + +- **Nemotron-Personas HF collection:** [huggingface.co/collections/nvidia/nemotron-personas](https://huggingface.co/collections/nvidia/nemotron-personas) +- **NeMo Data Designer:** [github.com/NVIDIA-NeMo/DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) +- **NeMo SDG-PGMs:** [github.com/NVIDIA-NeMo/SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) +- **Nemotron 3 Super Technical Report:** [research.nvidia.com/labs/nemotron/.../NVIDIA-Nemotron-3-Super-Technical-Report.pdf](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) +- **Person Sampling in Data Designer:** [nvidia-nemo.github.io/DataDesigner/.../person_sampling](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) +- **Related dev notes:** [Designing Data Designer: Why SDG Is a Systems Problem](design-principles.md), [Engineering an Enterprise-Grade Text-to-SQL Dataset](text-to-sql.md), [Push Datasets to Hugging Face Hub](push-datasets-to-hugging-face-hub.md) + +--- + 
+*Want to learn more about NeMo Data Designer? Check out our [documentation](https://nvidia-nemo.github.io/DataDesigner/) and start building your own region-specific synthetic persona datasets today.* diff --git a/docs/notebook_source/7-nemotron-personas.py b/docs/notebook_source/7-nemotron-personas.py new file mode 100644 index 000000000..ce9ce175c --- /dev/null +++ b/docs/notebook_source/7-nemotron-personas.py @@ -0,0 +1,708 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.18.1 +# kernelspec: +# display_name: .venv +# language: python +# name: python3 +# --- + +# %% [markdown] +# # πŸ‘₯ Data Designer Tutorial: Reproducing & Customizing Nemotron-Personas +# +# This notebook reproduces the [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) generation pipeline end to end with [🎨 NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner), and then shows how to customize that pipeline to generate personas for a specific use case. A similar approach was used to build every dataset in the [Nemotron-Personas HF collection](https://huggingface.co/collections/nvidia/nemotron-personas). +# +# We seed the pipeline with the **extended Nemotron-Personas-USA dataset on NGC**, which is a superset of the publicly released HuggingFace version β€” it includes additional demographic and persona fields used internally to ground synthetic generation. From those grounded seeds, two stages of LLM structured-output columns produce the persona attributes (cultural background, skills, career goals, hobbies) and the persona descriptions across professional, financial, healthcare, sports, arts, travel, and culinary dimensions. 
+# +# > ⚠️ **Note**: To run this notebook, follow the setup instructions in the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/), make sure you have generated an API key for accessing models on [build.nvidia.com](https://build.nvidia.com), and that you've set the `NVIDIA_API_KEY` environment variable. The next section also walks through downloading the NGC-hosted Nemotron-Personas dataset. +# +#
+# Nemotron Personas pipeline overview +#
+ +# %% [markdown] +# # 1. πŸ“¦ Install and import python packages +# +# **IMPORTANT** πŸ‘‰ If you haven't already, follow the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/) to install Data Designer. Note that you may need to restart/select your kernel after setting up the environment. +# +# If the installation is successful, you should be able to run the imports below without any errors. + +# %% +from __future__ import annotations + +import json +import shlex +import subprocess +from pathlib import Path + +import numpy as np +import pandas as pd +from pydantic import BaseModel, Field + +import data_designer.config as dd +from data_designer.interface import DataDesigner + +# %% [markdown] +# ## πŸ“₯ Download the Nemotron-Personas dataset from NGC +# +# Before configuring Data Designer, make sure the NGC-hosted Nemotron-Personas dataset is on disk. This is the **extended** version, a superset of the public HF release. To use it you need an [NGC API key](https://ngc.nvidia.com/setup/api-key), the [NGC CLI](https://ngc.nvidia.com/setup/installers/cli) installed, and `NGC_API_KEY` exported in your environment. +# +# The cell below idempotently invokes the Data Designer CLI and only downloads when the locale's parquet isn't already in `~/.data-designer/managed-assets/datasets/`. Change `personas_locale` to any other [supported locale](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) (`en_IN`, `en_SG`, `fr_FR`, `hi_Deva_IN`, `hi_Latn_IN`, `ja_JP`, `ko_KR`, `pt_BR`) to seed a regional pipeline instead. + +# %% +personas_locale = "en_US" + +assets_dir = Path.home() / ".data-designer" / "managed-assets" / "datasets" +existing = list(assets_dir.glob(f"{personas_locale}*.parquet")) if assets_dir.exists() else [] + +if existing: + print(f"Nemotron-Personas-{personas_locale} already present at {assets_dir}:") + for p in existing: + print(f" - {p.name}") +else: + print(f"Nemotron-Personas-{personas_locale} not found. 
Downloading via the Data Designer CLI...") + subprocess.run( + shlex.split(f"data-designer download personas --locale {personas_locale}"), + check=True, + ) + print(f"Done. Dataset placed under {assets_dir}.") + +# %% [markdown] +# # 2. πŸ› οΈ Define helpers +# +# These OCEAN Big-Five helpers come from the original Nemotron-Personas pipeline. They are **not invoked** in the default flow below, where OCEAN traits come directly from the NGC-hosted Nemotron-Personas-USA dataset via `with_synthetic_personas=True`. They are kept here for the `SAMPLE_FROM_SDG_PGM = True` reproduction path (see Section 4.2), since [NeMo SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) handles demographic distributions but not Big Five personality scoring. + + +# %% +def get_trait_label(score: int) -> str: + """Convert a Big Five T-score into a coarse label.""" + if score < 35: + return "very low" + if score < 45: + return "low" + if score < 55: + return "average" + if score < 65: + return "high" + return "very high" + + +def get_trait_description(trait: str, label: str) -> str: + """Return a prose description for a (trait, label) pair.""" + descriptions: dict[str, dict[str, str]] = { + "openness": { + "very low": "Strongly prefers routine and the familiar. Traditional in thinking and values practicality over abstract ideas.", + "low": "Generally prefers structure and predictability. Tends to be practical and focused on immediate realities.", + "average": "Balances curiosity with practicality. Appreciates both new ideas and established methods.", + "high": "Curious and appreciative of art, new ideas, and varied experiences. Open to unconventional thinking.", + "very high": "Highly imaginative and intellectually curious. Strongly drawn to novelty, art, and abstract concepts.", + }, + "conscientiousness": { + "very low": "Spontaneous and flexible, often resisting structure. May struggle with organization and deadlines.", + "low": "Often relaxed about obligations and somewhat disorganized. 
Values flexibility over strict planning.", + "average": "Maintains balance between organization and flexibility. Reasonably reliable and attentive to responsibilities.", + "high": "Organized, reliable, and methodical. Plans ahead and follows through on commitments.", + "very high": "Exceptionally organized and disciplined. Strongly focused on achievement and meeting high standards.", + }, + "extraversion": { + "very low": "Strongly prefers solitude and quiet environments. May find social interaction draining.", + "low": "Generally reserved and comfortable with solitude. Prefers small groups to large gatherings.", + "average": "Balances social interaction with need for alone time. Moderately talkative in social situations.", + "high": "Sociable, outgoing, and energetic. Enjoys group activities and being around others.", + "very high": "Highly sociable and draws energy from others. Very talkative and comfortable being center of attention.", + }, + "agreeableness": { + "very low": "Critical, skeptical, and competitive. Prioritizes personal interests over group harmony.", + "low": "Sometimes skeptical of others' intentions. More competitive than cooperative in approach.", + "average": "Generally cooperative but can be assertive. Balances compassion with self-interest.", + "high": "Kind, cooperative, and considerate. Prioritizes harmony and others' needs.", + "very high": "Exceptionally compassionate and cooperative. Strongly motivated to help others and maintain harmony.", + }, + "neuroticism": { + "very low": "Exceptionally calm and resilient. Rarely experiences negative emotions like anxiety or sadness.", + "low": "Emotionally stable and handles stress well. Not easily upset by challenging situations.", + "average": "Experiences normal range of emotions. Moderately resilient but affected by significant challenges.", + "high": "Experiences more negative emotions than average. 
Prone to worry and sensitive to stress.", + "very high": "Highly emotionally reactive and prone to distress. Often experiences intense anxiety or sadness.", + }, + } + return descriptions[trait][label] + + +def generate_ocean_traits(num_records: int, base_seed: int | None = None) -> pd.DataFrame: + """Generate synthetic OCEAN traits as a DataFrame with one JSON-encoded object per trait per row.""" + if num_records <= 0: + return pd.DataFrame() + + traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"] + rng = np.random.RandomState(base_seed) if base_seed is not None else np.random.RandomState() + data: dict[str, list[str]] = {} + + for trait in traits: + scores = rng.normal(50.0, 10.0, num_records) + rng.normal(0.0, 2.0, num_records) + scores = np.clip(scores, 20.0, 80.0) + t_scores = np.round(scores).astype(int) + labels = [get_trait_label(int(score)) for score in t_scores] + descriptions = [get_trait_description(trait, label) for label in labels] + data[trait] = [ + json.dumps({"t_score": int(t), "label": l, "description": d}) + for t, l, d in zip(t_scores, labels, descriptions, strict=True) + ] + + return pd.DataFrame(data) + + +# %% [markdown] +# # 3. 🎨 Set Up NeMo Data Designer (NDD) + +# %% [markdown] +# ## πŸͺͺ Specify Model ID and Alias +# +# - Use a [build.nvidia.com](https://build.nvidia.com/) model endpoint and model ID +# - Make sure your `NVIDIA_API_KEY` environment variable is set + +# %% +MODEL_PROVIDER = "nvidia" +MODEL_ID = "openai/gpt-oss-20b" +MODEL_ALIAS = "gpt-oss-20b" + +# %% [markdown] +# ## πŸŽ›οΈ Adjust the model config +# +# > ⚠️ **Note**: You may need to adjust temperature and top_p settings depending on the model you use. Consult the model card on [build.nvidia.com](https://build.nvidia.com) for recommended settings. 
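# %% [markdown]
# The temperature in the config below is not a fixed scalar: it is specified as a
# uniform distribution over [0.9, 1.1], so individual requests can run at slightly
# different sampled temperatures, adding diversity across records. The sampling idea
# itself is just this (a plain-Python illustration, not Data Designer internals):

# %%
import random

_rng = random.Random(0)  # seeded only to make this illustration reproducible
example_temperatures = [round(_rng.uniform(0.9, 1.1), 3) for _ in range(5)]
assert all(0.9 <= t <= 1.1 for t in example_temperatures)
print(example_temperatures)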
+ +# %% +model_configs = [ + dd.ModelConfig( + alias=MODEL_ALIAS, + model=MODEL_ID, + provider=MODEL_PROVIDER, + inference_parameters=dd.ChatCompletionInferenceParams( + max_tokens=16384, + temperature=dd.UniformDistribution(params=dd.UniformDistributionParams(low=0.9, high=1.1)), + top_p=1.0, + extra_body={"reasoning_effort": "high"}, + timeout=1200, + max_parallel_requests=32, + ), + ) +] + +# %% [markdown] +# ## πŸš€ Initialize Data Designer + +# %% +data_designer = DataDesigner() + +# %% [markdown] +# # 4. ✍️ Design the dataset +# +# #### Once the SDG-PGMs reproduction path is wired up, there are three main steps to Nemotron-Personas: +# #### 1️⃣ Generate OCEAN Personality Traits +# #### 2️⃣ Generate Persona Attributes by grounding in (PGM + OCEAN) details +# #### 3️⃣ Generate Personas by grounding in (2) + +# %% +NUM_RECORDS = 5 + +# %% [markdown] +# ## 4.1 🌊 Generate OCEAN (Big Five) personality traits +# OCEAN is the most common scientific model for measuring and describing human personality traits.
+# See [Big Five personality traits Wikipedia article](https://en.wikipedia.org/wiki/Big_Five_personality_traits) for more context. +# +# In this notebook the OCEAN traits come straight from the **NGC-hosted Nemotron-Personas-USA dataset** in the next section (`with_synthetic_personas=True` exposes `person.openness`, `person.conscientiousness`, etc. as `struct`). The helper functions in Section 2 are kept ready for the `SAMPLE_FROM_SDG_PGM = True` reproduction path. + +# %% [markdown] +# ## 4.2 πŸ‘©β€πŸŽ¨πŸ‘¨β€πŸŽ¨ Generate Persona Attributes +# +# We are focusing just on the part in the diagram below and seeding persona attributes with PGM + OCEAN details: +# +#
+# Stage 3: Persona attributes via structured outputs +#
+ +# %% [markdown] +# > ⚠️ **Note**: +# > Below, we show two different ways of seeding persona generation: +# > +# > When `SAMPLE_FROM_SDG_PGM = False` (default), we sample personal details and OCEAN traits from Data Designer's `PersonSampler` against the NGC-hosted Nemotron-Personas dataset (`PersonSamplerParams(locale=personas_locale, with_synthetic_personas=True)`). +# > +# > When `SAMPLE_FROM_SDG_PGM = True`, persons are generated from a custom Probabilistic Graphical Model via [NeMo SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs), and the OCEAN helpers from Section 2 layer the personality traits on top. **This branch is currently a TODO** β€” see the cell below for the eventual integration shape. +# > +# > To switch locales, just update `personas_locale` in the cell above (and re-run the download cell). All downstream prompts work unchanged across locales. + +# %% +# Toggle the source of the base "person" record. +# False (default) -- sample from the NGC-hosted Nemotron-Personas-USA artifact. +# True -- generate persons from a custom PGM via SDG-PGMs (TODO; see below). +SAMPLE_FROM_SDG_PGM = False + +# %% +config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs) + +if SAMPLE_FROM_SDG_PGM: + # TODO: Generate the base person record from a custom Probabilistic Graphical Model + # using NeMo SDG-PGMs (https://github.com/NVIDIA-NeMo/SDG-PGMs), then layer the OCEAN + # Big-Five helpers above on top. This matches the original four-stage Nemotron-Personas + # pipeline (Stage 1 = OCEAN helpers, Stage 2 = PGM demographics). 
+ # + # The integration is approximately: + # + # from data_designer_plugins.pgm_generator_plugin import PGMGeneratorPluginConfig + # ocean_df = generate_ocean_traits(NUM_RECORDS) # Stage 1 (OCEAN) + # config_builder.with_seed_dataset( + # dd.DataFrameSeedSource(df=ocean_df), + # sampling_strategy=dd.SamplingStrategy.ORDERED, + # ) + # config_builder.add_column( # Stage 2 (demographics) + # PGMGeneratorPluginConfig( + # name="person", + # generator_class="my_generators.UsPersonGenerator", + # ) + # ) + raise NotImplementedError( + "SDG-PGMs path is not implemented in this notebook yet. " + "See https://github.com/NVIDIA-NeMo/SDG-PGMs for the open-sourced library." + ) + +# Default path: sample synthetic personal details + OCEAN traits from the NGC-hosted asset. +# `with_synthetic_personas=True` exposes Big Five t-scores + labels + descriptions, plus +# `person.cultural_background`, hobbies, career goals, and context-specific personas (those +# extra fields stay nested in `person` and don't conflict with the columns we regenerate +# downstream). `drop=True` keeps `person` from leaking into the final dataset. +config_builder.add_column( + dd.SamplerColumnConfig( + name="person", + sampler_type=dd.SamplerType.PERSON, + params=dd.PersonSamplerParams( + locale=personas_locale, + age_range=[18, 114], + with_synthetic_personas=True, + # sex="Male" # Optional: filter by sex + # city=["New York", "Los Angeles"] # Optional: filter by cities + ), + drop=True, + ) +) + +# %% +# Add a unique identifier for each record +config_builder.add_column(name="uuid", column_type="sampler", sampler_type="uuid") + +# Lift OCEAN traits to top-level so the original prompts can reference {{ openness.description }} etc. 
+for trait in ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]: + config_builder.add_column(dd.ExpressionColumnConfig(name=trait, expr=f"{{{{ person.{trait} }}}}")) + +# Add specific personal detail columns -- NOT included in the public release, but used for seeding Personas +config_builder.add_column( + dd.ExpressionColumnConfig( + name="ethnic_background", + expr="{{ person.ethnic_background if person.ethnic_background else ' ' }}", + ) +) +config_builder.add_column(dd.ExpressionColumnConfig(name="first_name", expr="{{ person.first_name }}")) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="middle_name", + expr="{{ person.middle_name if person.middle_name else ' ' }}", + ) +) +config_builder.add_column(dd.ExpressionColumnConfig(name="last_name", expr="{{ person.last_name }}")) +# Note: the underlying field is `district`; the original Nemotron-Personas-USA dataset surfaces it as `county`. +config_builder.add_column(dd.ExpressionColumnConfig(name="county", expr="{{ person.district }}")) + +# Add specific personal detail columns -- included in the public release +config_builder.add_column(dd.ExpressionColumnConfig(name="sex", expr="{{ person.sex }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="marital_status", expr="{{ person.marital_status }}")) +# These can legitimately be null in the source dataset; coerce to a single space so downstream +# Jinja templates stay safe (DD's validator rejects expression columns that render to ""). 
+config_builder.add_column( + dd.ExpressionColumnConfig( + name="education_level", + expr="{{ person.education_level if person.education_level else ' ' }}", + ) +) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="bachelors_field", + expr="{{ person.bachelors_field if person.bachelors_field else ' ' }}", + ) +) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="occupation", + expr="{{ person.occupation if person.occupation else ' ' }}", + ) +) +config_builder.add_column(dd.ExpressionColumnConfig(name="city", expr="{{ person.city }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="state", expr="{{ person.state }}")) +# Note: the underlying field is `postcode`; the original dataset surfaces it as `zipcode`. +config_builder.add_column(dd.ExpressionColumnConfig(name="zipcode", expr="{{ person.postcode }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="country", expr="{{ person.country }}")) + +# %% [markdown] +# ### πŸ‘€ Generate a preview to see what we have so far (OCEAN + PGM columns only for now) + +# %% +preview = data_designer.preview(config_builder, num_records=10) +preview.display_sample_record() + +# %% [markdown] +# ### ➑️ Next, generate persona attributes grounded in OCEAN + PGM + +# %% +PERSONA_ATTRIBUTES_SYSTEM_PROMPT = """\ +You are a detailed persona generator specializing in creating realistic, nuanced, and diverse personal attributes. You should: +1. Generate attributes that are internally consistent and logically connected to the base persona details +2. Ensure cultural sensitivity and avoid stereotypes while acknowledging cultural influences +3. Create specific, detailed responses rather than generic ones +4. Base your responses on realistic correlations between personal attributes like ethnic background, age, sex, marital status, education, occupation, etc. +5. Always return your response in a valid JSON format +6. 
DO NOT include any explanations or reasoning for your choices + +Your responses should be creative yet plausible, diverse yet consistent with the provided demographic information. +""" + + +# We define a PersonaAttributes schema so that all attributes are generated in one go, +# with the types and constraints as specified below. Pydantic is used to automatically validate the output. +class PersonaAttributes(BaseModel): + cultural_background: str = Field(description="Description of the person's cultural background") + skills_and_expertise: str = Field(description="Description of the person's skills and expertise") + skills_and_expertise_list: list[str] = Field(description="List of the person's skills and expertise") + career_goals_and_ambitions: str = Field(description="Description of the person's career goals and ambitions") + hobbies_and_interests: str = Field(description="Description of the person's hobbies and interests") + hobbies_and_interests_list: list[str] = Field(description="List of the person's hobbies and interests") + + +# %% +# Here we use a structured output column trick to generate all persona attributes +# in one shot, minimizing the number of API calls. +# +# Note how easy it is to access other fields in the dataset via Jinja templating. +# Doing so automatically infuses every record with row-specific details. 
+config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="persona_attributes", + system_prompt=PERSONA_ATTRIBUTES_SYSTEM_PROMPT, + prompt="""\ +Based on a person with the following profile: + +Name: {{ first_name }} {{ middle_name if middle_name else '' }} {{ last_name }} +Sex: {{ sex }} +Age: {{ age }} +{{ 'Ethnic background: ' + ethnic_background if ethnic_background else ''}} +Marital status: {{ marital_status }} +Education: {{ education_level }}{{ ' in ' + bachelors_field if bachelors_field != 'no degree' else '' }} +Occupation: {{ occupation }} +Location: {{ city }}, {{ state }}, {{ county }} + +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} + +Generate the following detailed persona attributes: +- cultural_background +- skills_and_expertise +- skills_and_expertise_list +- career_goals_and_ambitions +- hobbies_and_interests +- hobbies_and_interests_list + +When generating attributes, make sure to incorporate the influences suggested by the personality profile description. 
+""", + output_format=PersonaAttributes, + model_alias=MODEL_ALIAS, + drop=True, + ) +) + +# Now we break up into multiple columns +config_builder.add_column( + dd.ExpressionColumnConfig(name="cultural_background", expr="{{ persona_attributes.cultural_background }}") +) +config_builder.add_column( + dd.ExpressionColumnConfig(name="skills_and_expertise", expr="{{ persona_attributes.skills_and_expertise }}") +) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="skills_and_expertise_list", expr="{{ persona_attributes.skills_and_expertise_list }}" + ) +) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="career_goals_and_ambitions", expr="{{ persona_attributes.career_goals_and_ambitions }}" + ) +) +config_builder.add_column( + dd.ExpressionColumnConfig(name="hobbies_and_interests", expr="{{ persona_attributes.hobbies_and_interests }}") +) +config_builder.add_column( + dd.ExpressionColumnConfig( + name="hobbies_and_interests_list", expr="{{ persona_attributes.hobbies_and_interests_list }}" + ) +) + +# %% [markdown] +# ### πŸ” Generate a preview and examine a sample record + +# %% +preview = data_designer.preview(config_builder, num_records=10) + +# %% +preview.dataset[0:3] + +# %% +preview.display_sample_record() + +# %% [markdown] +# ### 4.3 πŸ¦Έβ€β™€οΈ πŸ‘©β€πŸŽ€ πŸ‘©β€πŸ³ πŸ‘©β€πŸ”¬ Generate Personas +# +# Now, let's focus on the second part shown in the diagram below: +# +#
+# Stage 4: Persona prose synthesis +#
+ +# %% +PERSONA_SYSTEM_PROMPT = """\ +You are a specialized persona generator that creates fine-grained, creative and meaningful persona descriptions based on an individual's cultural background, skills, career goals, and interests. You should: +1. Synthesize a coherent persona that naturally emerges from these characteristics +2. Focus on how these attributes combine to create a unique perspective and approach to life +3. Ensure the persona description reflects the intersection of professional expertise, cultural values, and personal interests +4. Create a narrative that explains how these characteristics influence their worldview and decision-making +5. Always return your response in a valid JSON format +6. INCLUDE NAME IN EVERY PERSONA DESCRIPTION. +7. ALWAYS TAKE AGE INTO ACCOUNT TO INFORM INTERESTS, HABITS AND AFFINITY TO VARIOUS ASPECTS OF LIFE. +8. NEVER DIRECTLY MENTION THE CULTURAL HERITAGE. INSTEAD, INFUSE IT INTO PERSONA DESCRIPTIONS BY REFERRING TO CULTURAL PRACTICES, TRADITIONS, AND VALUES. + +Each persona should be very specific, not a generic/bland description. Do not shy away from mentioning bad habits or quirks. + +Here are examples of how each persona description may begin: +"An aspiring musician..." +"A renowned machine learning researcher..." +"A neonatal nurse with decades of experince..." +"An urban planner with a passion..." +""" + + +# We define a Personas schema so that all attributes are generated in one go, +# with the types and constraints specified. Again, Pydantic is used to automatically validate the output. 
+class Personas(BaseModel): + professional_persona: str = Field( + description="A one-sentence persona description including primary field of work, key professional skills, and how their unique personality traits manifest in their career" + ) + finance_persona: str = Field( + description="A one-sentence persona characterization of spending habits, relationship with money, saving and investment habits, and approach to financial decision-making, mentioning specific financial instruments and investment strategies they use." + ) + healthcare_persona: str = Field( + description="A one-sentence persona description of very specific health conditions they have as well as their approach to medical care, and their typical behavior as a patient. Include condition names and describe how the person proactively addresses/ completely neglects/ periodically manages/ actively monitors/ struggles with/ effectively controls/ inconsistently treats these conditions" + ) + sports_persona: str = Field( + description="A one-sentence persona description of athletic interests, seasonal sports, and their approach to fitness and exercise. Provide specific names of professional sports teams and club affiliations, based on the persona location" + ) + arts_persona: str = Field( + description="A one-sentence persona characterization of engagement with creative expression, artistic appreciation, cultural activities, and how the arts shape their identity and leisure time, if at all. Always provide specific artist/musician/actor/performer names" + ) + travel_persona: str = Field( + description="A one-sentence persona capturing travel interests and style, including planning preferences, adventure versus relaxation focus, and financial or family constraints. 
Always provide specific local and/or international destinations they have visited or wish to visit" + ) + culinary_persona: str = Field( + description="A one-sentence persona description of food/cuisine preferences, cooking skill level, and approach to dining experiences. Always provide specific names of dishes and names of ingredients they enjoy." + ) + concise_persona: str = Field( + description="A one-sentence description capturing the essence of this person's unique perspective and approach to life, highlighting unique quirks, facts, and/or bad habits" + ) + detailed_persona: str = Field( + description="A paragraph describing persona's cultural background, skills, goals, and interests shape their worldview and decision-making. Don't shy away from talking about bad habits or quirks" + ) + + +# %% +# Here we use a structured output column trick to generate all personas +# in one call, minimizing the number of API calls. +# +# As before, we can easily access other fields in the dataset via Jinja templating. +# Doing so automatically infuses every record with row-specific details. 
+config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="personas", + system_prompt=PERSONA_SYSTEM_PROMPT, + prompt="""\ +Based on a person with the following persona attributes and profile: + +Age: {{ age }} +Cultural background: {{ cultural_background }} +{{ 'Hobbies and interests: ' + hobbies_and_interests if age >= 6 else '' }} +{{ 'Skills and expertise: ' + skills_and_expertise if age >= 16 else '' }} +{{ 'Career goals and ambitions: ' + career_goals_and_ambitions if age >= 16 else '' }} + +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} + +Generate the following self-contained persona descriptions that capture how persona attributes and profile combine to create a unique individual's perspective and approach to various facets of life. + +- professional_persona +- finance_persona +- healthcare_persona +- sports_persona +- arts_persona +- travel_persona +- culinary_persona +- concise_persona +- detailed_persona + +Each requested persona description should be self-contained, meaning it can't begin with they/their as the reference wouldn't be clear. +When generating personas, make sure to incorporate the influences suggested by the personality profile description. + +DO NOT USE THE RACE OF THE PERSONA IN YOUR RESPONSE. +NEVER DIRECTLY MENTION THE CULTURAL HERITAGE. INSTEAD, INFUSE IT INTO PERSONA DESCRIPTIONS BY REFERRING TO CULTURAL PRACTICES, TRADITIONS, AND VALUES. +INCLUDE NAME IN EVERY PERSONA DESCRIPTION. +ALWAYS TAKE AGE INTO ACCOUNT TO INFORM INTERESTS, HABITS AND AFFINITY TO VARIOUS ASPECTS OF LIFE. + +Each persona description should be creative yet plausible and consistent with the provided demographic information and persona attributes. +Each persona should be very specific, not a generic/bland description. Do not shy away from mentioning bad habits or quirks. 
+ +Here are examples of how each description may begin: +"An aspiring musician..." +"A renowned machine learning researcher..." +"A neonatal nurse with decades of experience..." +"An urban planner with a passion..." +""", + output_format=Personas, + model_alias=MODEL_ALIAS, + drop=True, + ) +) + +# Now we break up into multiple columns +config_builder.add_column( + dd.ExpressionColumnConfig(name="professional_persona", expr="{{ personas.professional_persona }}") +) +config_builder.add_column(dd.ExpressionColumnConfig(name="finance_persona", expr="{{ personas.finance_persona }}")) +config_builder.add_column( + dd.ExpressionColumnConfig(name="healthcare_persona", expr="{{ personas.healthcare_persona }}") +) +config_builder.add_column(dd.ExpressionColumnConfig(name="sports_persona", expr="{{ personas.sports_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="arts_persona", expr="{{ personas.arts_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="travel_persona", expr="{{ personas.travel_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="culinary_persona", expr="{{ personas.culinary_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="concise_persona", expr="{{ personas.concise_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="detailed_persona", expr="{{ personas.detailed_persona }}")) + +# %% [markdown] +# ### πŸ” Generate a preview and examine a sample record + +# %% +preview = data_designer.preview(config_builder, num_records=10) + +# %% +preview.dataset[0:3] + +# %% +preview.display_sample_record() + +# %% [markdown] +# ### ↗️ Scale Up Persona Generation +# Scale up to the specified `NUM_RECORDS` + +# %% +scaled_persona_results = data_designer.create(config_builder, num_records=NUM_RECORDS, dataset_name="personas") + +# Load the dataset into a pandas DataFrame +all_personas = scaled_persona_results.load_dataset() +all_personas.head(3) + +# %% [markdown] +# 
### πŸ“„ View the evaluation report + +# %% +analysis = scaled_persona_results.load_analysis() +analysis.to_report() + +# %% [markdown] +# # 5. 🎯 Customize for your use case +# +# Everything above reproduces the **general-purpose** Nemotron-Personas-USA pipeline. In practice, enterprises will want personas grounded in their own domain β€” a healthcare provider needs persona dimensions a media company doesn't, and vice versa. With NeMo Data Designer, layering a custom attribute or persona on top of the released artifact is a few lines of config. +# +# To make the customization story concrete, the cell below adds a **`tech_persona`** dimension (with a specific list of `tech_tools` they use) that wasn't in the original Nemotron-Personas schema. The same pattern (one Pydantic schema + one `LLMStructuredColumnConfig` + one expression column per output field) generalizes to any domain-specific dimension you need. + + +# %% +class TechPersona(BaseModel): + tech_persona: str = Field( + description=( + "A 2-3 sentence description of this person's relationship with technology: " + "comfort with AI/digital tools, level of tech adoption (early-adopter / mainstream / late / " + "skeptic), preferred devices, and one specific way technology shapes their daily routine. " + "Be specific and consistent with the rest of the persona profile." + ) + ) + tech_tools: list[str] = Field( + description=( + "List of 4-6 specific tech tools, apps, services, or devices this person uses regularly. " + "Each entry should be a concrete named product, not a generic category." + ) + ) + + +config_builder.add_column( + dd.LLMStructuredColumnConfig( + name="custom_persona", + system_prompt=( + "You write nuanced, specific tech-relationship personas grounded in demographic " + "and psychometric attributes. Avoid generic platitudes; ground every claim in the " + "person's age, occupation, personality, and lifestyle." 
+ ), + prompt="""\ +Based on a person with the following persona profile: + +Name: {{ first_name }} {{ last_name }}, Age: {{ age }}, Occupation: {{ occupation }} +Cultural background: {{ cultural_background }} +Career goals: {{ career_goals_and_ambitions }} +Hobbies: {{ hobbies_and_interests }} +Concise persona: {{ concise_persona }} + +Personality profile: +- {{ openness.description }} +- {{ conscientiousness.description }} +- {{ extraversion.description }} +- {{ agreeableness.description }} +- {{ neuroticism.description }} + +Generate the `tech_persona` and `tech_tools` fields as described in the schema. Be specific and consistent with the profile above. +""", + output_format=TechPersona, + model_alias=MODEL_ALIAS, + drop=True, + ) +) + +config_builder.add_column(dd.ExpressionColumnConfig(name="tech_persona", expr="{{ custom_persona.tech_persona }}")) +config_builder.add_column(dd.ExpressionColumnConfig(name="tech_tools", expr="{{ custom_persona.tech_tools }}")) + +# %% +preview = data_designer.preview(config_builder, num_records=5) +preview.display_sample_record() + +# %% [markdown] +# # ⏭️ Next Steps +# +# 1. Everything above is just an example of personas that can be generated. These personas are not set in stone and can be easily adjusted. For example, if you need a different type of persona for *-Nemotron, tweak or extend the pipeline (Section 5 demonstrates the pattern). +# +# 2. You should be able to use this notebook as is to generate Nemotron-Personas for any of the [supported locales](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/) by changing `personas_locale` and re-running the download cell. For a brand-new region without an NGC dataset, flip `SAMPLE_FROM_SDG_PGM = True` and provide a custom [SDG-PGMs](https://github.com/NVIDIA-NeMo/SDG-PGMs) generator (the OCEAN helpers in Section 2 are the Stage 1 scaffolding for that path). 
+# - You may need to adjust and/or translate prompts to your region's language(s) +# - You may need to work with a different LLM that is better suited for your region diff --git a/docs/scripts/generate_colab_notebooks.py b/docs/scripts/generate_colab_notebooks.py index 13aa14063..86d42b760 100644 --- a/docs/scripts/generate_colab_notebooks.py +++ b/docs/scripts/generate_colab_notebooks.py @@ -48,8 +48,42 @@ except userdata.SecretNotFoundError: os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")""" +# Optional per-file Colab setup cells, injected immediately after the standard +# install + NVIDIA_API_KEY cells. Used by tutorials that need additional Colab +# bootstrapping (e.g. NGC CLI install + NGC_API_KEY for the Nemotron-Personas +# tutorial). +NGC_CLI_INSTALL_CELL = """\ +%%capture +import os + +# Install the NGC CLI (used by `data-designer download personas` to fetch the +# managed Nemotron-Personas dataset). Pinned to a known-good version; bump as +# needed when NGC publishes new releases. 
+!wget -q --no-cache "https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.164.0/files/ngccli_linux.zip" -O /tmp/ngccli_linux.zip +!unzip -q -o /tmp/ngccli_linux.zip -d /tmp +!chmod u+x /tmp/ngc-cli/ngc +os.environ["PATH"] = f"/tmp/ngc-cli:{os.environ['PATH']}\"""" + +NGC_API_KEY_CELL = """\ +import getpass +import os + +from google.colab import userdata -def create_colab_setup_cells(additional_dependencies: str) -> list[NotebookNode]: +try: + os.environ["NGC_API_KEY"] = userdata.get("NGC_API_KEY") +except userdata.SecretNotFoundError: + os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")""" + +ADDITIONAL_SETUP_CELLS: dict[str, list[str]] = { + "7-nemotron-personas.py": [NGC_CLI_INSTALL_CELL, NGC_API_KEY_CELL], +} + + +def create_colab_setup_cells( + additional_dependencies: str, + additional_setup_cell_sources: list[str] | None = None, +) -> list[NotebookNode]: """Create the Colab-specific setup cells to inject before imports.""" cells = [] cells += [new_markdown_cell(source=COLAB_SETUP_MARKDOWN)] @@ -60,6 +94,10 @@ def create_colab_setup_cells(additional_dependencies: str) -> list[NotebookNode] cells += [new_code_cell(source=install_cell)] cells += [new_code_cell(source=COLAB_API_KEY_CELL)] + + if additional_setup_cell_sources: + cells += [new_code_cell(source=src) for src in additional_setup_cell_sources] + return cells @@ -89,6 +127,7 @@ def process_notebook(notebook: NotebookNode, source_path: Path) -> NotebookNode: cells = notebook.cells additional_dependencies = ADDITIONAL_DEPENDENCIES.get(source_path.name, "") + additional_setup_cells = ADDITIONAL_SETUP_CELLS.get(source_path.name) # Find where to insert Colab setup (before "Import the essentials") import_idx = find_import_section_index(cells) @@ -98,7 +137,7 @@ def process_notebook(notebook: NotebookNode, source_path: Path) -> NotebookNode: import_idx = 1 # Insert Colab setup cells before the import section - colab_cells = 
create_colab_setup_cells(additional_dependencies) + colab_cells = create_colab_setup_cells(additional_dependencies, additional_setup_cells) processed_cells = cells[:import_idx] + colab_cells + cells[import_idx:] badge_source = COLAB_BADGE_TEMPLATE.format(filename=f"{source_path.stem}.ipynb") diff --git a/mkdocs.yml b/mkdocs.yml index 50f49b26d..def95e6e0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -40,6 +40,7 @@ nav: - Providing Images as Context: notebooks/4-providing-images-as-context.ipynb - Generating Images: notebooks/5-generating-images.ipynb - Image-to-Image Editing: notebooks/6-editing-images-with-image-context.ipynb + - "Reproducing & Customizing Nemotron-Personas": notebooks/7-nemotron-personas.ipynb - Recipes: - Recipe Cards: recipes/cards.md - Code Generation:
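The `generate_colab_notebooks.py` changes above key optional per-tutorial setup cells off the source filename and splice them in after the standard install and API-key cells. A stdlib-only sketch of that lookup-and-injection logic β€” plain strings stand in for nbformat's `NotebookNode` cells, and `build_setup_cells`/`inject` are simplified hypothetical stand-ins for `create_colab_setup_cells`/`process_notebook`:

```python
# Sketch of the per-file setup-cell injection. Strings stand in for
# NotebookNode cells; only the list manipulation mirrors the diff.
ADDITIONAL_SETUP_CELLS = {
    "7-nemotron-personas.py": ["<ngc cli install cell>", "<ngc api key cell>"],
}


def build_setup_cells(install_cell, api_key_cell, extra=None):
    # Standard setup cells first, then any tutorial-specific extras.
    cells = ["<setup markdown>", install_cell, api_key_cell]
    if extra:
        cells += list(extra)
    return cells


def inject(cells, source_name, import_idx):
    # Setup cells are inserted immediately before the import section.
    extra = ADDITIONAL_SETUP_CELLS.get(source_name)
    setup = build_setup_cells("<install>", "<api key>", extra)
    return cells[:import_idx] + setup + cells[import_idx:]


with_extras = inject(["<badge>", "<imports>"], "7-nemotron-personas.py", 1)
without_extras = inject(["<badge>", "<imports>"], "1-other-tutorial.py", 1)
```

Keying on the filename keeps the extra NGC bootstrapping scoped to this one tutorial; every other notebook falls through `dict.get` to `None` and renders unchanged.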