This guide covers Direct Preference Optimization (DPO) training using the NeMo Agent Toolkit finetuning harness integrated with NVIDIA NeMo Customizer. This integration enables preference-based finetuning of large language models using NVIDIA's enterprise-grade training infrastructure.
Direct Preference Optimization (DPO) is an alignment technique that trains language models to prefer certain responses over others without requiring a separate reward model. Unlike traditional RLHF (Reinforcement Learning from Human Feedback), which trains a reward model and then uses PPO to optimize against it, DPO optimizes the policy directly from preference pairs.
DPO works by optimizing the following objective:
L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l)} [ log σ( β · (log π_θ(y_w|x) - log π_ref(y_w|x)) - β · (log π_θ(y_l|x) - log π_ref(y_l|x)) ) ]

Where:

- π_θ is the policy being trained
- π_ref is the reference policy (a frozen copy of the initial model)
- x is the prompt
- y_w is the "chosen" (preferred) response
- y_l is the "rejected" (non-preferred) response
- β is a temperature parameter controlling deviation from the reference policy
- σ is the sigmoid function
In simpler terms: DPO increases the probability of chosen responses while decreasing the probability of rejected responses, with a KL penalty to prevent the model from deviating too far from its original behavior.
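To make the objective concrete, here is a minimal sketch of the per-pair loss in plain Python. This is an illustration of the math above, not the NeMo Customizer implementation; the log-probabilities are assumed to be summed over each response's tokens.

```python
import math

def dpo_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x)
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Example with illustrative log-probabilities: the loss shrinks as the policy
# assigns relatively more probability to the chosen response than the reference does.
print(dpo_pair_loss(-12.0, -15.0, -13.0, -14.0))
```

The loss is minimized when the policy's implicit reward margin for the chosen response over the rejected one is large, which is exactly the behavior described above.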
Advantages over traditional RLHF:
- Simpler Pipeline: No need to train a separate reward model
- More Stable Training: Avoids the instabilities of PPO optimization
- Computationally Efficient: Single-stage training process
- Direct Optimization: Directly optimizes preference likelihood
When to use DPO:
- You have paired preference data (chosen vs rejected responses)
- You want to align model outputs with specific quality criteria
- You're training agents where you can score different action choices
- You want to improve response quality without explicit reward modeling
The NeMo Agent Toolkit DPO integration uses Test-Time Compute (TTC) to generate preference pairs automatically. During workflow execution:
- Multiple Candidates Generated: For each decision point, the workflow generates multiple candidate responses
- Candidates Scored: Each candidate is evaluated using a scoring function
- Pairs Created: Higher-scored candidates become "chosen", lower-scored become "rejected"
This approach enables automated preference data collection without manual labeling.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DPO Training Pipeline │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Data Collection Phase ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ ││
│ │ │ Dataset │───►│ Workflow │───►│ TTC Move Selector │ ││
│ │ │ (inputs) │ │ Execution │ │ (generates candidates) │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────────────────┘ ││
│ │ │ ││
│ │ ▼ ││
│ │ ┌──────────────────────────┐ ││
│ │ │ Score Candidates │ ││
│ │ │ (reward function) │ ││
│ │ └──────────────────────────┘ ││
│ │ │ ││
│ │ ▼ ││
│ │ ┌──────────────────────────────────────────────────────────────────┐ ││
│ │ │ DPO Trajectory Builder │ ││
│ │ │ │ ││
│ │ │ • Collects TTC_END intermediate steps with TTCEventData │ ││
│ │ │ • Groups candidates by turn_id │ ││
│ │ │ • Generates preference pairs (chosen vs rejected) │ ││
│ │ │ • Builds Trajectory objects with DPOItem episodes │ ││
│ │ └──────────────────────────────────────────────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Training Submission Phase ││
│ │ ││
│ │ ┌──────────────────────────────────────────────────────────────────┐ ││
│ │ │ NeMo Customizer Trainer Adapter │ ││
│ │ │ │ ││
│ │ │ 1. Convert trajectories to JSONL format │ ││
│ │ │ 2. Upload dataset to NeMo Datastore (via HuggingFace Hub API) │ ││
│ │ │ 3. Submit customization job to NeMo Customizer │ ││
│ │ │ 4. Monitor job progress until completion │ ││
│ │ │ 5. Optionally deploy trained model │ ││
│ │ └──────────────────────────────────────────────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ NeMo Customizer Backend ││
│ │ ││
│ │ ┌─────────────────────┐ ┌─────────────────────┐ ││
│ │ │ Entity Store │ │ Datastore │ ││
│ │ │ (job management) │ │ (dataset storage) │ ││
│ │ └─────────────────────┘ └─────────────────────┘ ││
│ │ ││
│ │ ┌─────────────────────────────────────────────────────────────────┐ ││
│ │ │ Training Infrastructure │ ││
│ │ │ │ ││
│ │ │ • DPO loss computation with reference model │ ││
│ │ │ • LoRA or full-weight finetuning │ ││
│ │ │ • Multi-GPU distributed training │ ││
│ │ └─────────────────────────────────────────────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Trained Model │ │
│ │ (optional NIM │ │
│ │ deployment) │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Install the NeMo Customizer plugin package:
```bash
pip install nvidia-nat-nemo-customizer
```

This provides:

- `dpo_traj_builder`: DPO trajectory builder for collecting preference pairs
- `nemo_customizer_trainer_adapter`: Adapter for submitting jobs to NeMo Customizer
- `nemo_customizer_trainer`: Trainer orchestrator for the DPO workflow
- NeMo Microservices Platform (NMP): Access to a deployed NeMo Customizer instance
- Entity Store: For managing datasets, models, and jobs
- Datastore: For storing training datasets (accessed via HuggingFace Hub API)
```yaml
# LLM Configuration
llms:
  inference_llm:
    _type: openai
    model_name: meta/llama-3.1-8b-instruct
    base_url: https://integrate.api.nvidia.com/v1
    api_key: ${NVIDIA_API_KEY}
    temperature: 0.7

# Workflow that uses TTC for candidate generation
workflow:
  _type: my_dpo_workflow
  llm: inference_llm

# Evaluation configuration
eval:
  general:
    max_concurrency: 8
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json
  evaluators:
    game_evaluator:
      _type: my_game_evaluator

# DPO Trajectory Builder
trajectory_builders:
  dpo_builder:
    _type: dpo_traj_builder
    ttc_step_name: dpo_candidate_move
    exhaustive_pairs: true
    min_score_diff: 0.05
    max_pairs_per_turn: 10
    reward_from_score_diff: true
    require_multiple_candidates: true

# NeMo Customizer Trainer Adapter
trainer_adapters:
  nemo_adapter:
    _type: nemo_customizer_trainer_adapter
    entity_host: https://nmp.example.com
    datastore_host: https://datastore.example.com
    namespace: my-dpo-project
    dataset_name: dpo-training-data
    customization_config: meta/llama-3.1-8b-instruct@v1.0.0+A100
    create_namespace_if_missing: true
    use_full_message_history: true
    hyperparameters:
      training_type: dpo
      finetuning_type: all_weights
      epochs: 3
      batch_size: 4
      learning_rate: 5e-6
      dpo:
        ref_policy_kl_penalty: 0.1
        preference_loss_weight: 1.0
        preference_average_log_probs: false
        sft_loss_weight: 0.0
    deploy_on_completion: false
    poll_interval_seconds: 30.0
    deployment_timeout_seconds: 1800.0

# NeMo Customizer Trainer
trainers:
  nemo_trainer:
    _type: nemo_customizer_trainer
    num_runs: 3
    wait_for_completion: true
    deduplicate_pairs: true
    max_pairs: 5000

# Finetuning configuration
finetuning:
  enabled: true
  trainer: nemo_trainer
  trajectory_builder: dpo_builder
  trainer_adapter: nemo_adapter
  reward_function:
    name: game_evaluator
  num_epochs: 1  # Not used for NeMo Customizer (uses num_runs instead)
  output_dir: .tmp/nat/finetuning/output
```

The DPO trajectory builder collects preference pairs from TTC intermediate steps.
```yaml
trajectory_builders:
  dpo_builder:
    _type: dpo_traj_builder
    ttc_step_name: dpo_candidate_move
    exhaustive_pairs: true
    min_score_diff: 0.0
    max_pairs_per_turn: null
    reward_from_score_diff: true
    require_multiple_candidates: true
```

| Field | Type | Default | Description |
|---|---|---|---|
| `ttc_step_name` | `str` | `"dpo_candidate_move"` | Name of the TTC intermediate step to collect. Must match the name used in your workflow's `push_intermediate_step()` call. |
| `exhaustive_pairs` | `bool` | `true` | If true, generate all pairwise comparisons where score(A) > score(B). If false, only generate the best vs worst pair per turn. |
| `min_score_diff` | `float` | `0.0` | Minimum score difference required to create a preference pair. Pairs with smaller differences are filtered out. Useful for ensuring a meaningful preference signal. |
| `max_pairs_per_turn` | `int \| null` | `null` | Maximum preference pairs per turn. If set, pairs are sorted by score difference (highest first) and truncated. `null` means no limit. |
| `reward_from_score_diff` | `bool` | `true` | If true, trajectory reward = score difference (chosen - rejected). If false, reward = chosen candidate's score. |
| `require_multiple_candidates` | `bool` | `true` | If true, skip turns with only one candidate (no preference signal possible). If false, include single-candidate turns. |
Exhaustive Pairs (exhaustive_pairs: true)
For candidates with scores [A=0.9, B=0.7, C=0.5], generates:
- (A chosen, B rejected) - score diff: 0.2
- (A chosen, C rejected) - score diff: 0.4
- (B chosen, C rejected) - score diff: 0.2
This provides more training signal but may include weak preference pairs.
Best vs Worst (exhaustive_pairs: false)
For the same candidates, generates only:
- (A chosen, C rejected) - score diff: 0.4
This provides stronger preference signal but fewer training examples.
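The following sketch shows how the two modes enumerate pairs from scored candidates, including the `min_score_diff` and `max_pairs_per_turn` filters described above. It is illustrative only, not the toolkit's internal implementation; names and the `(response, score)` tuple structure are assumptions.

```python
from itertools import combinations

def build_pairs(candidates: list[tuple[str, float]],
                exhaustive_pairs: bool = True,
                min_score_diff: float = 0.0,
                max_pairs_per_turn: int | None = None) -> list[tuple[str, str, float]]:
    """Return (chosen, rejected, score_diff) tuples from (response, score) candidates."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if exhaustive_pairs:
        # All pairs where the first candidate strictly outscores the second
        pairs = [(a[0], b[0], a[1] - b[1])
                 for a, b in combinations(ranked, 2) if a[1] > b[1]]
    else:
        # Best vs worst only
        best, worst = ranked[0], ranked[-1]
        pairs = [(best[0], worst[0], best[1] - worst[1])] if best[1] > worst[1] else []
    # Drop weak preferences, keep the strongest pairs first
    pairs = [p for p in pairs if p[2] >= min_score_diff]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:max_pairs_per_turn] if max_pairs_per_turn else pairs

# Example from above: A=0.9, B=0.7, C=0.5
print(build_pairs([("A", 0.9), ("B", 0.7), ("C", 0.5)]))                          # exhaustive
print(build_pairs([("A", 0.9), ("B", 0.7), ("C", 0.5)], exhaustive_pairs=False))  # best vs worst
```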
The trainer orchestrates data collection runs.
```yaml
trainers:
  nemo_trainer:
    _type: nemo_customizer_trainer
    num_runs: 3
    continue_on_collection_error: false
    deduplicate_pairs: true
    max_pairs: null
    wait_for_completion: true
```

| Field | Type | Default | Description |
|---|---|---|---|
| `num_runs` | `int` | `1` | Number of times to run the trajectory builder to collect data. Multiple runs increase dataset diversity by generating different trajectories for the same inputs. |
| `continue_on_collection_error` | `bool` | `false` | If true, continue with remaining runs if one fails. If false, stop immediately on the first error. |
| `deduplicate_pairs` | `bool` | `true` | If true, remove duplicate DPO pairs based on prompt+chosen+rejected content. Useful when multiple runs may generate identical pairs. |
| `max_pairs` | `int \| null` | `null` | Maximum DPO pairs to include in training. If set, randomly samples from collected pairs. `null` means use all pairs. |
| `wait_for_completion` | `bool` | `true` | If true, wait for the NeMo Customizer job to complete. If false, submit and return immediately. |
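As a rough illustration of the `deduplicate_pairs` and `max_pairs` behavior described above, here is a short sketch (not the trainer's actual code; the pair structure is simplified to a dict using the documented `prompt` / `chosen_response` / `rejected_response` keys):

```python
import random

def postprocess_pairs(pairs: list[dict], deduplicate: bool = True,
                      max_pairs: int | None = None) -> list[dict]:
    """Deduplicate by (prompt, chosen, rejected) content, then optionally subsample."""
    if deduplicate:
        seen, unique = set(), []
        for pair in pairs:
            key = (str(pair["prompt"]), pair["chosen_response"], pair["rejected_response"])
            if key not in seen:
                seen.add(key)
                unique.append(pair)
        pairs = unique
    if max_pairs is not None and len(pairs) > max_pairs:
        pairs = random.sample(pairs, max_pairs)
    return pairs
```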
The adapter handles communication with NeMo Customizer services.
```yaml
trainer_adapters:
  nemo_adapter:
    _type: nemo_customizer_trainer_adapter

    # Endpoint Configuration
    entity_host: https://nmp.example.com
    datastore_host: https://datastore.example.com
    hf_token: ""

    # Namespace and Dataset
    namespace: my-project
    dataset_name: nat-dpo
    dataset_output_dir: null
    create_namespace_if_missing: true

    # Customization Job
    customization_config: meta/llama-3.1-8b-instruct@v1.0.0+A100
    hyperparameters:
      training_type: dpo
      finetuning_type: all_weights
      epochs: 3
      batch_size: 4
      learning_rate: 5e-5
      dpo:
        ref_policy_kl_penalty: 0.1
        preference_loss_weight: 1.0
        preference_average_log_probs: false
        sft_loss_weight: 0.0

    # Prompt Formatting
    use_full_message_history: false

    # Deployment
    deploy_on_completion: false
    deployment_config:
      image_name: nvcr.io/nim/meta/llama-3.1-8b-instruct
      image_tag: latest
      gpu: 1
      deployment_name: null
      description: Fine-tuned model deployment

    # Polling
    poll_interval_seconds: 30.0
    deployment_timeout_seconds: 1800.0
```

| Field | Type | Default | Description |
|---|---|---|---|
| `entity_host` | `str` | required | Base URL for NeMo Entity Store (e.g., `https://nmp.example.com`). |
| `datastore_host` | `str` | required | Base URL for NeMo Datastore (e.g., `https://datastore.example.com`). |
| `hf_token` | `str` | `""` | HuggingFace token for datastore authentication. Can be empty if not required. |
| Field | Type | Default | Description |
|---|---|---|---|
| `namespace` | `str` | required | Namespace for organizing resources (datasets, models, deployments). |
| `dataset_name` | `str` | `"nat-dpo"` | Name for the training dataset. Must be unique within the namespace. |
| `dataset_output_dir` | `str \| null` | `null` | Directory to save dataset JSONL files locally. If `null`, uses a temporary directory. If specified, files are preserved for debugging. |
| `create_namespace_if_missing` | `bool` | `true` | If true, create the namespace in the entity store and datastore if it doesn't exist. |
| Field | Type | Default | Description |
|---|---|---|---|
| `customization_config` | `str` | required | Model configuration string (e.g., `meta/llama-3.1-8b-instruct@v1.0.0+A100`). Available configs can be listed via the NeMo Customizer API. |
| Field | Type | Default | Description |
|---|---|---|---|
| `training_type` | `"sft" \| "dpo"` | `"dpo"` | Training type. Use `"dpo"` for preference optimization. |
| `finetuning_type` | `"lora" \| "all_weights"` | `"all_weights"` | `"lora"` for parameter-efficient finetuning, `"all_weights"` for the full model. |
| `epochs` | `int` | `3` | Number of training epochs over the dataset. |
| `batch_size` | `int` | `4` | Training batch size. |
| `learning_rate` | `float` | `5e-5` | Learning rate for the optimizer. |
| Field | Type | Default | Description |
|---|---|---|---|
| `ref_policy_kl_penalty` | `float` | `0.1` | KL penalty coefficient (β in the DPO objective). Controls how much the model can deviate from the reference policy. Higher values = more conservative updates. |
| `preference_loss_weight` | `float` | `1.0` | Weight for the preference (DPO) loss term. |
| `preference_average_log_probs` | `bool` | `false` | If true, average log probabilities over sequence length. If false, sum log probabilities. |
| `sft_loss_weight` | `float` | `0.0` | Weight for an optional SFT loss on chosen responses. Can help maintain response quality. |
| Field | Type | Default | Description |
|---|---|---|---|
| `use_full_message_history` | `bool` | `false` | If true, include the full conversation history as a list of messages: `[{"role": "system", "content": "..."}, ...]`. If false, use only the last message content as a string. |
| Field | Type | Default | Description |
|---|---|---|---|
| `deploy_on_completion` | `bool` | `false` | If true, automatically deploy the trained model after job completion. |
| `deployment_config.image_name` | `str` | `"nvcr.io/nim/meta/llama-3.1-8b-instruct"` | NIM container image name. |
| `deployment_config.image_tag` | `str` | `"latest"` | NIM container image tag. |
| `deployment_config.gpu` | `int` | `1` | Number of GPUs for deployment. |
| `deployment_config.deployment_name` | `str \| null` | `null` | Name for the deployment. If `null`, auto-generated. |
| `deployment_config.description` | `str` | `"Fine-tuned model deployment"` | Description for the deployment. |
| Field | Type | Default | Description |
|---|---|---|---|
| `poll_interval_seconds` | `float` | `30.0` | Interval between job status checks. |
| `deployment_timeout_seconds` | `float` | `1800.0` | Maximum time to wait for the deployment to be ready (30-minute default). |
To generate DPO training data, your workflow must emit TTC (Test-Time Compute) intermediate steps with TTCEventData. Here's how to implement this:
```python
from nat.data_models.intermediate_step import (
    IntermediateStepPayload,
    IntermediateStepType,
    TTCEventData,
)

# Create TTCEventData for each candidate
ttc_data = TTCEventData(
    turn_id="turn_0",         # Groups candidates competing for the same prompt
    turn_index=0,             # Index of this turn in the episode
    candidate_index=idx,      # Index of this candidate within the turn
    input=messages,           # Prompt (string or list of OpenAI messages)
    output=response,          # Model's response
    score=candidate_score,    # Score for this candidate (higher = better)
)
```

```python
from nat.builder.context import Context

# Get the step manager from context
context = Context.get()
step_manager = context.intermediate_step_manager

# Emit TTC_END step for each candidate
step_manager.push_intermediate_step(
    IntermediateStepPayload(
        event_type=IntermediateStepType.TTC_END,
        name="dpo_candidate_move",  # Must match ttc_step_name in config
        data=ttc_data,
        metadata={"is_selected": is_best_candidate},
    )
)
```

```python
from nat.builder.context import Context
from nat.data_models.intermediate_step import (
    IntermediateStepPayload,
    IntermediateStepType,
    TTCEventData,
)


async def ttc_move_selector(
    prompt: str,
    candidates: list[str],
    scores: list[float],
    turn_id: str,
    turn_index: int,
) -> str:
    """
    Select best candidate and emit TTC steps for DPO training.

    Args:
        prompt: The input prompt
        candidates: List of candidate responses
        scores: Scores for each candidate (higher = better)
        turn_id: Unique identifier for this decision point
        turn_index: Index of this turn in the episode

    Returns:
        The best candidate response
    """
    context = Context.get()
    step_manager = context.intermediate_step_manager

    # Find best candidate
    best_idx = scores.index(max(scores))

    # Emit TTC_END step for each candidate
    for idx, (candidate, score) in enumerate(zip(candidates, scores)):
        ttc_data = TTCEventData(
            turn_id=turn_id,
            turn_index=turn_index,
            candidate_index=idx,
            input=prompt,
            output=candidate,
            score=score,
        )
        step_manager.push_intermediate_step(
            IntermediateStepPayload(
                event_type=IntermediateStepType.TTC_END,
                name="dpo_candidate_move",
                data=ttc_data,
                metadata={"is_selected": idx == best_idx},
            )
        )

    return candidates[best_idx]
```

The DPO trajectory builder collects preference data through the NeMo Agent Toolkit evaluation system:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DPO Trajectory Builder Flow │
│ │
│ start_run(run_id) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Launch evaluation run │ │
│ │ │ │
│ │ For each dataset example: │ │
│ │ 1. Execute workflow │ │
│ │ 2. Workflow emits TTC_END steps with TTCEventData │ │
│ │ 3. Compute reward using configured evaluator │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ finalize(run_id) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Process collected intermediate steps: │ │
│ │ │ │
│ │ 1. Filter for TTC_END steps with configured name │ │
│ │ 2. Extract TTCEventData (turn_id, candidate_index, score, etc.) │ │
│ │ 3. Group candidates by (example_id, turn_id) │ │
│ │ 4. Generate preference pairs based on score differences │ │
│ │ 5. Build Trajectory objects with DPOItem episodes │ │
│ │ 6. Group trajectories by example_id │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Return TrajectoryCollection │
└─────────────────────────────────────────────────────────────────────────────┘
The trainer adapter converts trajectories and submits to NeMo Customizer:
┌─────────────────────────────────────────────────────────────────────────────┐
│ NeMo Customizer Trainer Adapter Flow │
│ │
│ submit(trajectories) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Convert to JSONL format: │ │
│ │ │ │
│ │ { │ │
│ │ "prompt": "What move should I make?", │ │
│ │ "chosen_response": "I'll play X in the center...", │ │
│ │ "rejected_response": "I'll play X in the corner..." │ │
│ │ } │ │
│ │ │ │
│ │ Split: 80% training, 20% validation │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Upload to NeMo Datastore: │ │
│ │ │ │
│ │ 1. Create dataset repo via HuggingFace Hub API │ │
│ │ 2. Register dataset in Entity Store │ │
│ │ 3. Upload training_file.jsonl and validation_file.jsonl │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Submit customization job: │ │
│ │ │ │
│ │ client.customization.jobs.create( │ │
│ │ config=customization_config, │ │
│ │ dataset={name, namespace}, │ │
│ │ hyperparameters={training_type: dpo, ...} │ │
│ │ ) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Return TrainingJobRef │
└─────────────────────────────────────────────────────────────────────────────┘
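A rough sketch of the JSONL conversion and 80/20 split described above (illustrative only; the adapter's actual file handling may differ, and the pair dictionaries are assumed to already use the documented `prompt` / `chosen_response` / `rejected_response` keys):

```python
import json
import random
from pathlib import Path

def write_dpo_jsonl(pairs: list[dict], output_dir: str, train_fraction: float = 0.8) -> None:
    """Shuffle pairs and write training_file.jsonl / validation_file.jsonl."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    shuffled = pairs[:]
    random.shuffle(shuffled)
    split = int(len(shuffled) * train_fraction)
    for name, subset in (("training_file.jsonl", shuffled[:split]),
                         ("validation_file.jsonl", shuffled[split:])):
        with open(out / name, "w") as f:
            for pair in subset:
                f.write(json.dumps(pair) + "\n")
```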
┌─────────────────────────────────────────────────────────────────────────────┐
│ Training Monitoring Flow │
│ │
│ wait_until_complete(job_ref) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Poll job status: │ │
│ │ │ │
│ │ while not done: │ │
│ │ status = client.customization.jobs.status(job_id) │ │
│ │ log status changes and progress │ │
│ │ if status in [completed, failed, cancelled]: break │ │
│ │ sleep(poll_interval_seconds) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ (if deploy_on_completion and status == completed) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Deploy trained model: │ │
│ │ │ │
│ │ 1. Create deployment config │ │
│ │ 2. Create model deployment │ │
│ │ 3. Wait for deployment to be ready │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Return TrainingJobStatus │
└─────────────────────────────────────────────────────────────────────────────┘
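The monitoring loop amounts to the simple polling shown below. This is schematic only: `get_status` is a hypothetical stand-in for the NeMo Customizer job-status call from the diagram above, not a real function name.

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(job_id: str, get_status, poll_interval_seconds: float = 30.0) -> str:
    """Poll a job-status callable until the job reaches a terminal state."""
    last_status = None
    while True:
        status = get_status(job_id)  # e.g. wraps the customization jobs status call
        if status != last_status:
            print(f"Job {job_id}: Status -> {status!r}")
            last_status = status
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval_seconds)
```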
```bash
# Run DPO training with your configuration
nat finetune --config_file=configs/dpo_finetune.yml

# Override number of data collection runs
nat finetune --config_file=configs/dpo_finetune.yml \
  -o trainers.nemo_trainer.num_runs 5

# Override training epochs
nat finetune --config_file=configs/dpo_finetune.yml \
  -o trainer_adapters.nemo_adapter.hyperparameters.epochs 5

# Override learning rate
nat finetune --config_file=configs/dpo_finetune.yml \
  -o trainer_adapters.nemo_adapter.hyperparameters.learning_rate 1e-5
```

During training, check:
- Console Output: Shows data collection progress, pair counts, job status

  ```
  INFO - Starting NeMo Customizer DPO workflow with 3 data collection runs
  INFO - Starting data collection run 1/3
  INFO - Run 1: Collected 50 trajectories, 120 DPO pairs, avg reward: 0.4523
  INFO - Starting data collection run 2/3
  INFO - Run 2: Collected 50 trajectories, 115 DPO pairs, avg reward: 0.4812
  INFO - Starting data collection run 3/3
  INFO - Run 3: Collected 50 trajectories, 118 DPO pairs, avg reward: 0.4701
  INFO - Data collection complete: 150 trajectory groups, ~353 total DPO pairs from 3 runs
  INFO - Deduplication: 353 -> 312 trajectories
  INFO - Submitted training job: job_abc123
  INFO - Job nemo_dpo_a1b2c3d4: Status -> 'running'
  INFO - Job nemo_dpo_a1b2c3d4: Progress 25.0%
  INFO - Job nemo_dpo_a1b2c3d4: Progress 50.0%
  INFO - Job nemo_dpo_a1b2c3d4: Progress 75.0%
  INFO - Job nemo_dpo_a1b2c3d4: Status -> 'completed'
  INFO - Training completed with status: completed
  ```
- Output Files (in `finetuning.output_dir`):
  - `data_collection_progress.jsonl`: Per-run metrics
  - `collection_history.json`: Complete collection history
  - `final_metrics.json`: Final training metrics
- NeMo Customizer UI: Monitor job progress via the NeMo platform
The trainer adapter converts DPO pairs to JSONL format. With `use_full_message_history: false` (the default), the prompt is a plain string:

```
{"prompt": "What's the best move in this position?", "chosen_response": "I'll play X in the center because...", "rejected_response": "I'll play X in the corner because..."}
{"prompt": "How should I respond to this attack?", "chosen_response": "I should defend by...", "rejected_response": "I should attack by..."}
```

With `use_full_message_history: true`, the prompt is the full list of messages:

```
{"prompt": [{"role": "system", "content": "You are a chess expert."}, {"role": "user", "content": "What's the best move?"}], "chosen_response": "I recommend Nf3 because...", "rejected_response": "I recommend a4 because..."}
```

KL Penalty (ref_policy_kl_penalty)
The KL penalty (β) controls how much the model can deviate from the reference policy:
```yaml
hyperparameters:
  dpo:
    ref_policy_kl_penalty: 0.1    # Default: balanced exploration
    # ref_policy_kl_penalty: 0.01 # Lower: more aggressive updates
    # ref_policy_kl_penalty: 0.5  # Higher: more conservative updates
```

- Lower values (0.01-0.05): Allow larger policy updates; faster learning but risk of instability
- Higher values (0.2-0.5): More conservative updates; slower but more stable training
SFT Loss Weight
Adding SFT loss on chosen responses can help maintain response quality:
```yaml
hyperparameters:
  dpo:
    sft_loss_weight: 0.1  # Add 10% SFT loss
```

Multiple Runs for Diversity
Running multiple data collection passes generates diverse preference pairs:
```yaml
trainers:
  nemo_trainer:
    num_runs: 5  # More runs = more diverse data
```

Filtering Weak Preferences
Filter out pairs with small score differences:
```yaml
trajectory_builders:
  dpo_builder:
    min_score_diff: 0.1  # Only keep pairs with >0.1 score difference
```

Limiting Pairs Per Turn
For turns with many candidates, limit pairs to strongest preferences:
```yaml
trajectory_builders:
  dpo_builder:
    exhaustive_pairs: true
    max_pairs_per_turn: 5  # Keep top 5 pairs by score difference
```

Enable automatic deployment of trained models:
```yaml
trainer_adapters:
  nemo_adapter:
    deploy_on_completion: true
    deployment_config:
      image_name: nvcr.io/nim/meta/llama-3.1-8b-instruct
      image_tag: latest
      gpu: 2
      deployment_name: my-dpo-model
      description: DPO-finetuned agent model
    deployment_timeout_seconds: 3600  # 1 hour timeout
```

"Failed to connect to NeMo Customizer"
- Verify endpoints are correct:

  ```bash
  curl https://nmp.example.com/health
  curl https://datastore.example.com/health
  ```

- Check authentication (HuggingFace token if required)
- Verify network connectivity and firewall rules
"No trajectories collected from any run"
- Check TTC step name: Ensure `ttc_step_name` matches your workflow:

  ```yaml
  trajectory_builders:
    dpo_builder:
      ttc_step_name: dpo_candidate_move  # Must match workflow
  ```

- Verify TTCEventData is emitted: Add logging to confirm steps are being pushed
- Check candidate scores: If all candidates have the same score, no pairs can be created
- Review `min_score_diff`: Lower the threshold if filtering too aggressively:

  ```yaml
  trajectory_builders:
    dpo_builder:
      min_score_diff: 0.0  # Accept all score differences
  ```
"Customization job failed"
- Check NeMo Customizer logs for detailed error messages
- Verify the dataset format is correct:

  ```bash
  # Check generated JSONL files
  cat .tmp/nat/finetuning/output/*/training_file.jsonl | head -5
  ```

- Ensure the model configuration is valid and available
- Check that GPU resources are available
"Deployment did not become ready within timeout"
- Increase the timeout:

  ```yaml
  trainer_adapters:
    nemo_adapter:
      deployment_timeout_seconds: 3600  # 1 hour
  ```

- Check NeMo deployment logs for errors
- Verify GPU resources are available for deployment
- Check that the deployment configuration matches model requirements
"CUDA out of memory" during training
- Reduce the batch size:

  ```yaml
  hyperparameters:
    batch_size: 2  # Reduce from the default of 4
  ```

- Use LoRA instead of full-weight finetuning:

  ```yaml
  hyperparameters:
    finetuning_type: lora
  ```

- Contact your NeMo Customizer admin to allocate more GPU resources
The `examples/finetuning/dpo_tic_tac_toe` directory contains a complete working example demonstrating:
- Tic-tac-toe game workflow with TTC move selection
- Custom scoring function for move quality
- Full DPO training configuration
- Training and evaluation datasets
See the example's README for detailed instructions.
- Meaningful Score Differences: Ensure your scoring function produces meaningful distinctions between candidates
- Diverse Training Data: Use multiple data collection runs and diverse input examples
- Balance Difficulty: Include examples of varying difficulty levels
- Start Conservative: Begin with default KL penalty (0.1) and adjust based on results
- Monitor Validation: Track validation metrics to detect overfitting
- Iterate: DPO often benefits from multiple rounds of training with fresh data
- Test Before Deploy: Evaluate model quality before enabling automatic deployment
- Version Models: Use descriptive deployment names for tracking
- Monitor Performance: Track model performance in production and retrain as needed
- Finetuning Concepts - Core concepts and RL fundamentals
- Extending the Finetuning Harness - Creating custom components
- OpenPipe ART Integration - Alternative RL training with ART
- Custom Evaluators - Creating reward functions
- NeMo Customizer Documentation - Official NeMo Customizer documentation