DPO with NeMo Customizer

This guide covers Direct Preference Optimization (DPO) training using the NeMo Agent Toolkit finetuning harness integrated with NVIDIA NeMo Customizer. This integration enables preference-based finetuning of large language models using NVIDIA's enterprise-grade training infrastructure.

Understanding DPO

What is Direct Preference Optimization?

Direct Preference Optimization (DPO) is a preference-based finetuning technique that trains language models to prefer certain responses over others, without requiring a separate reward model. Traditional RLHF (Reinforcement Learning from Human Feedback) first trains a reward model and then uses PPO to optimize against it; DPO instead optimizes the policy directly on preference pairs.

The DPO Objective

DPO works by optimizing the following objective:

L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l)} [log σ(β · (log π_θ(y_w|x) - log π_ref(y_w|x)) - β · (log π_θ(y_l|x) - log π_ref(y_l|x)))]

Where:

  • π_θ is the policy being trained
  • π_ref is the reference policy (frozen copy of the initial model)
  • x is the prompt
  • y_w is the "chosen" (preferred) response
  • y_l is the "rejected" (non-preferred) response
  • β is a temperature parameter controlling deviation from the reference policy
  • σ is the sigmoid function

In simpler terms: DPO increases the probability of chosen responses while decreasing the probability of rejected responses, with a KL penalty to prevent the model from deviating too far from its original behavior.
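
To make the objective concrete, here is a minimal sketch of the loss in PyTorch. It assumes you already have per-sequence summed log-probabilities for the chosen and rejected responses under both the policy and the frozen reference model; all names are illustrative and not part of the toolkit or NeMo Customizer API.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log π_θ(y_w|x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log π_θ(y_l|x)
    ref_chosen_logps: torch.Tensor,       # log π_ref(y_w|x)
    ref_rejected_logps: torch.Tensor,     # log π_ref(y_l|x)
    beta: float = 0.1,                    # β, the KL penalty coefficient
) -> torch.Tensor:
    # Log-ratios of policy vs. reference for chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log σ(β · (chosen_logratio - rejected_logratio)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()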

Why DPO?

Advantages over traditional RLHF:

  1. Simpler Pipeline: No need to train a separate reward model
  2. More Stable Training: Avoids the instabilities of PPO optimization
  3. Computationally Efficient: Single-stage training process
  4. Direct Optimization: Directly optimizes preference likelihood

When to use DPO:

  • You have paired preference data (chosen vs rejected responses)
  • You want to align model outputs with specific quality criteria
  • You're training agents where you can score different action choices
  • You want to improve response quality without explicit reward modeling

Preference Pairs from Test-Time Compute

The NeMo Agent Toolkit DPO integration uses Test-Time Compute (TTC) to generate preference pairs automatically. During workflow execution:

  1. Multiple Candidates Generated: For each decision point, the workflow generates multiple candidate responses
  2. Candidates Scored: Each candidate is evaluated using a scoring function
  3. Pairs Created: Higher-scored candidates become "chosen", lower-scored become "rejected"

This approach enables automated preference data collection without manual labeling.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         DPO Training Pipeline                               │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                      Data Collection Phase                              ││
│  │                                                                         ││
│  │  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────────┐   ││
│  │  │   Dataset    │───►│   Workflow   │───►│  TTC Move Selector       │   ││
│  │  │   (inputs)   │    │  Execution   │    │  (generates candidates)  │   ││
│  │  └──────────────┘    └──────────────┘    └──────────────────────────┘   ││
│  │                                                   │                     ││
│  │                                                   ▼                     ││
│  │                                          ┌──────────────────────────┐   ││
│  │                                          │   Score Candidates       │   ││
│  │                                          │   (reward function)      │   ││
│  │                                          └──────────────────────────┘   ││
│  │                                                   │                     ││
│  │                                                   ▼                     ││
│  │  ┌──────────────────────────────────────────────────────────────────┐   ││
│  │  │                  DPO Trajectory Builder                          │   ││
│  │  │                                                                  │   ││
│  │  │  • Collects TTC_END intermediate steps with TTCEventData         │   ││
│  │  │  • Groups candidates by turn_id                                  │   ││
│  │  │  • Generates preference pairs (chosen vs rejected)               │   ││
│  │  │  • Builds Trajectory objects with DPOItem episodes               │   ││
│  │  └──────────────────────────────────────────────────────────────────┘   ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                      Training Submission Phase                          ││
│  │                                                                         ││
│  │  ┌──────────────────────────────────────────────────────────────────┐   ││
│  │  │                  NeMo Customizer Trainer Adapter                 │   ││
│  │  │                                                                  │   ││
│  │  │  1. Convert trajectories to JSONL format                         │   ││
│  │  │  2. Upload dataset to NeMo Datastore (via HuggingFace Hub API)   │   ││
│  │  │  3. Submit customization job to NeMo Customizer                  │   ││
│  │  │  4. Monitor job progress until completion                        │   ││
│  │  │  5. Optionally deploy trained model                              │   ││
│  │  └──────────────────────────────────────────────────────────────────┘   ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                    NeMo Customizer Backend                              ││
│  │                                                                         ││
│  │  ┌─────────────────────┐  ┌─────────────────────┐                       ││
│  │  │   Entity Store      │  │   Datastore         │                       ││
│  │  │   (job management)  │  │   (dataset storage) │                       ││
│  │  └─────────────────────┘  └─────────────────────┘                       ││
│  │                                                                         ││
│  │  ┌─────────────────────────────────────────────────────────────────┐    ││
│  │  │                    Training Infrastructure                      │    ││
│  │  │                                                                 │    ││
│  │  │  • DPO loss computation with reference model                    │    ││
│  │  │  • LoRA or full-weight finetuning                               │    ││
│  │  │  • Multi-GPU distributed training                               │    ││
│  │  └─────────────────────────────────────────────────────────────────┘    ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                    │                                        │
│                                    ▼                                        │
│                         ┌──────────────────┐                                │
│                         │  Trained Model   │                                │
│                         │  (optional NIM   │                                │
│                         │   deployment)    │                                │
│                         └──────────────────┘                                │
└─────────────────────────────────────────────────────────────────────────────┘

Installation

Install the NeMo Customizer plugin package:

pip install nvidia-nat-nemo-customizer

This provides:

  • dpo_traj_builder: DPO trajectory builder for collecting preference pairs
  • nemo_customizer_trainer_adapter: Adapter for submitting jobs to NeMo Customizer
  • nemo_customizer_trainer: Trainer orchestrator for the DPO workflow

Prerequisites

  1. NeMo Microservices Platform (NMP): Access to a deployed NeMo Customizer instance
  2. Entity Store: For managing datasets, models, and jobs
  3. Datastore: For storing training datasets (accessed via HuggingFace Hub API)

Configuration

Complete Configuration Example

# LLM Configuration
llms:
  inference_llm:
    _type: openai
    model_name: meta/llama-3.1-8b-instruct
    base_url: https://integrate.api.nvidia.com/v1
    api_key: ${NVIDIA_API_KEY}
    temperature: 0.7

# Workflow that uses TTC for candidate generation
workflow:
  _type: my_dpo_workflow
  llm: inference_llm

# Evaluation configuration
eval:
  general:
    max_concurrency: 8
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json

  evaluators:
    game_evaluator:
      _type: my_game_evaluator

# DPO Trajectory Builder
trajectory_builders:
  dpo_builder:
    _type: dpo_traj_builder
    ttc_step_name: dpo_candidate_move
    exhaustive_pairs: true
    min_score_diff: 0.05
    max_pairs_per_turn: 10
    reward_from_score_diff: true
    require_multiple_candidates: true

# NeMo Customizer Trainer Adapter
trainer_adapters:
  nemo_adapter:
    _type: nemo_customizer_trainer_adapter
    entity_host: https://nmp.example.com
    datastore_host: https://datastore.example.com
    namespace: my-dpo-project
    dataset_name: dpo-training-data
    customization_config: meta/llama-3.1-8b-instruct@v1.0.0+A100
    create_namespace_if_missing: true
    use_full_message_history: true
    hyperparameters:
      training_type: dpo
      finetuning_type: all_weights
      epochs: 3
      batch_size: 4
      learning_rate: 5e-6
      dpo:
        ref_policy_kl_penalty: 0.1
        preference_loss_weight: 1.0
        preference_average_log_probs: false
        sft_loss_weight: 0.0
    deploy_on_completion: false
    poll_interval_seconds: 30.0
    deployment_timeout_seconds: 1800.0

# NeMo Customizer Trainer
trainers:
  nemo_trainer:
    _type: nemo_customizer_trainer
    num_runs: 3
    wait_for_completion: true
    deduplicate_pairs: true
    max_pairs: 5000

# Finetuning configuration
finetuning:
  enabled: true
  trainer: nemo_trainer
  trajectory_builder: dpo_builder
  trainer_adapter: nemo_adapter
  reward_function:
    name: game_evaluator
  num_epochs: 1  # Not used for NeMo Customizer (uses num_runs instead)
  output_dir: .tmp/nat/finetuning/output

Configuration Reference

DPO Trajectory Builder Configuration

The DPO trajectory builder collects preference pairs from TTC intermediate steps.

trajectory_builders:
  dpo_builder:
    _type: dpo_traj_builder
    ttc_step_name: dpo_candidate_move
    exhaustive_pairs: true
    min_score_diff: 0.0
    max_pairs_per_turn: null
    reward_from_score_diff: true
    require_multiple_candidates: true

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| ttc_step_name | str | "dpo_candidate_move" | Name of the TTC intermediate step to collect. Must match the name used in your workflow's push_intermediate_step() call. |
| exhaustive_pairs | bool | true | If true, generate all pairwise comparisons where score(A) > score(B). If false, only generate the best vs worst pair per turn. |
| min_score_diff | float | 0.0 | Minimum score difference required to create a preference pair. Pairs with smaller differences are filtered out. Useful for ensuring a meaningful preference signal. |
| max_pairs_per_turn | int \| null | null | Maximum preference pairs per turn. If set, pairs are sorted by score difference (highest first) and truncated. null means no limit. |
| reward_from_score_diff | bool | true | If true, trajectory reward = score difference (chosen - rejected). If false, reward = chosen candidate's score. |
| require_multiple_candidates | bool | true | If true, skip turns with only one candidate (no preference signal possible). If false, include single-candidate turns. |

Pair Generation Modes

Exhaustive Pairs (exhaustive_pairs: true)

For candidates with scores [A=0.9, B=0.7, C=0.5], generates:

  • (A chosen, B rejected) - score diff: 0.2
  • (A chosen, C rejected) - score diff: 0.4
  • (B chosen, C rejected) - score diff: 0.2

This provides more training signal but may include weak preference pairs.

Best vs Worst (exhaustive_pairs: false)

For the same candidates, generates only:

  • (A chosen, C rejected) - score diff: 0.4

This provides stronger preference signal but fewer training examples.
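
The two modes can be illustrated with a short sketch. This is a hypothetical helper, not the trajectory builder's actual code; it also applies the min_score_diff and max_pairs_per_turn filters described above.

from itertools import combinations

def generate_pairs(
    candidates: list[str],
    scores: list[float],
    exhaustive_pairs: bool = True,
    min_score_diff: float = 0.0,
    max_pairs_per_turn: int | None = None,
) -> list[tuple[str, str, float]]:
    """Return (chosen, rejected, score_diff) tuples for one turn."""
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    if exhaustive_pairs:
        # All pairs where score(chosen) > score(rejected); ties are skipped
        pairs = [(a, b, sa - sb)
                 for (a, sa), (b, sb) in combinations(ranked, 2)
                 if sa > sb]
    else:
        # Only the best-vs-worst pair
        (best, best_s), (worst, worst_s) = ranked[0], ranked[-1]
        pairs = [(best, worst, best_s - worst_s)] if best_s > worst_s else []
    # Drop weak preferences, then keep the strongest pairs first
    pairs = [p for p in pairs if p[2] >= min_score_diff]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs if max_pairs_per_turn is None else pairs[:max_pairs_per_turn]

With candidates ["A", "B", "C"] and scores [0.9, 0.7, 0.5], exhaustive_pairs=True yields the three pairs listed above, while exhaustive_pairs=False yields only (A chosen, C rejected).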

NeMo Customizer Trainer Configuration

The trainer orchestrates data collection runs.

trainers:
  nemo_trainer:
    _type: nemo_customizer_trainer
    num_runs: 3
    continue_on_collection_error: false
    deduplicate_pairs: true
    max_pairs: null
    wait_for_completion: true

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| num_runs | int | 1 | Number of times to run the trajectory builder to collect data. Multiple runs increase dataset diversity by generating different trajectories for the same inputs. |
| continue_on_collection_error | bool | false | If true, continue with remaining runs if one fails. If false, stop immediately on the first error. |
| deduplicate_pairs | bool | true | If true, remove duplicate DPO pairs based on prompt+chosen+rejected content. Useful when multiple runs may generate identical pairs. |
| max_pairs | int \| null | null | Maximum DPO pairs to include in training. If set, randomly samples from collected pairs. null means use all pairs. |
| wait_for_completion | bool | true | If true, wait for the NeMo Customizer job to complete. If false, submit and return immediately. |
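
Deduplication is content-based. Conceptually it works like the following sketch (field names follow the JSONL format shown later in this guide; this is not the trainer's actual implementation):

def deduplicate_pairs(pairs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (prompt, chosen, rejected) triple."""
    seen: set[tuple[str, str, str]] = set()
    unique = []
    for pair in pairs:
        key = (str(pair["prompt"]),
               pair["chosen_response"],
               pair["rejected_response"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique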

NeMo Customizer Trainer Adapter Configuration

The adapter handles communication with NeMo Customizer services.

trainer_adapters:
  nemo_adapter:
    _type: nemo_customizer_trainer_adapter

    # Endpoint Configuration
    entity_host: https://nmp.example.com
    datastore_host: https://datastore.example.com
    hf_token: ""

    # Namespace and Dataset
    namespace: my-project
    dataset_name: nat-dpo
    dataset_output_dir: null
    create_namespace_if_missing: true

    # Customization Job
    customization_config: meta/llama-3.1-8b-instruct@v1.0.0+A100
    hyperparameters:
      training_type: dpo
      finetuning_type: all_weights
      epochs: 3
      batch_size: 4
      learning_rate: 5e-5
      dpo:
        ref_policy_kl_penalty: 0.1
        preference_loss_weight: 1.0
        preference_average_log_probs: false
        sft_loss_weight: 0.0

    # Prompt Formatting
    use_full_message_history: false

    # Deployment
    deploy_on_completion: false
    deployment_config:
      image_name: nvcr.io/nim/meta/llama-3.1-8b-instruct
      image_tag: latest
      gpu: 1
      deployment_name: null
      description: Fine-tuned model deployment

    # Polling
    poll_interval_seconds: 30.0
    deployment_timeout_seconds: 1800.0

Endpoint Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| entity_host | str | required | Base URL for NeMo Entity Store (e.g., https://nmp.example.com). |
| datastore_host | str | required | Base URL for NeMo Datastore (e.g., https://datastore.example.com). |
| hf_token | str | "" | HuggingFace token for datastore authentication. Can be empty if not required. |

Namespace and Dataset

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| namespace | str | required | Namespace for organizing resources (datasets, models, deployments). |
| dataset_name | str | "nat-dpo" | Name for the training dataset. Must be unique within the namespace. |
| dataset_output_dir | str \| null | null | Directory to save dataset JSONL files locally. If null, uses a temporary directory. If specified, files are preserved for debugging. |
| create_namespace_if_missing | bool | true | If true, create the namespace in the entity store and datastore if it doesn't exist. |

Customization Job

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| customization_config | str | required | Model configuration string (e.g., meta/llama-3.1-8b-instruct@v1.0.0+A100). Available configs can be listed via the NeMo Customizer API. |

Hyperparameters

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| training_type | "sft" \| "dpo" | "dpo" | Training type. Use "dpo" for preference optimization. |
| finetuning_type | "lora" \| "all_weights" | "all_weights" | "lora" for parameter-efficient finetuning, "all_weights" for the full model. |
| epochs | int | 3 | Number of training epochs over the dataset. |
| batch_size | int | 4 | Training batch size. |
| learning_rate | float | 5e-5 | Learning rate for the optimizer. |

DPO-Specific Hyperparameters

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| ref_policy_kl_penalty | float | 0.1 | KL penalty coefficient (β in the DPO objective). Controls how much the model can deviate from the reference policy. Higher values = more conservative updates. |
| preference_loss_weight | float | 1.0 | Weight for the preference (DPO) loss term. |
| preference_average_log_probs | bool | false | If true, average log probabilities over sequence length. If false, sum log probabilities. |
| sft_loss_weight | float | 0.0 | Weight for an optional SFT loss on chosen responses. Can help maintain response quality. |

Prompt Formatting

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| use_full_message_history | bool | false | If true, include the full conversation history as a list of messages: [{"role": "system", "content": "..."}, ...]. If false, use only the last message content as a string. |

Deployment Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| deploy_on_completion | bool | false | If true, automatically deploy the trained model after job completion. |
| deployment_config.image_name | str | "nvcr.io/nim/meta/llama-3.1-8b-instruct" | NIM container image name. |
| deployment_config.image_tag | str | "latest" | NIM container image tag. |
| deployment_config.gpu | int | 1 | Number of GPUs for the deployment. |
| deployment_config.deployment_name | str \| null | null | Name for the deployment. If null, auto-generated. |
| deployment_config.description | str | "Fine-tuned model deployment" | Description for the deployment. |

Polling Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| poll_interval_seconds | float | 30.0 | Interval between job status checks. |
| deployment_timeout_seconds | float | 1800.0 | Maximum time to wait for the deployment to be ready (default 30 minutes). |

Implementing TTC in Your Workflow

To generate DPO training data, your workflow must emit TTC (Test-Time Compute) intermediate steps with TTCEventData. Here's how to implement this:

TTCEventData Structure

from nat.data_models.intermediate_step import (
    IntermediateStepPayload,
    IntermediateStepType,
    TTCEventData,
)

# Create TTCEventData for each candidate
ttc_data = TTCEventData(
    turn_id="turn_0",           # Groups candidates competing for same prompt
    turn_index=0,               # Index of this turn in the episode
    candidate_index=idx,        # Index of this candidate within the turn
    input=messages,             # Prompt (string or list of OpenAI messages)
    output=response,            # Model's response
    score=candidate_score,      # Score for this candidate (higher = better)
)

Emitting TTC Steps

from nat.builder.context import Context

# Get the step manager from context
context = Context.get()
step_manager = context.intermediate_step_manager

# Emit TTC_END step for each candidate
step_manager.push_intermediate_step(
    IntermediateStepPayload(
        event_type=IntermediateStepType.TTC_END,
        name="dpo_candidate_move",  # Must match ttc_step_name in config
        data=ttc_data,
        metadata={"is_selected": is_best_candidate},
    )
)

Complete Example: TTC Move Selector

from nat.builder.context import Context
from nat.data_models.intermediate_step import (
    IntermediateStepPayload,
    IntermediateStepType,
    TTCEventData,
)

async def ttc_move_selector(
    prompt: str,
    candidates: list[str],
    scores: list[float],
    turn_id: str,
    turn_index: int,
) -> str:
    """
    Select best candidate and emit TTC steps for DPO training.

    Args:
        prompt: The input prompt
        candidates: List of candidate responses
        scores: Scores for each candidate (higher = better)
        turn_id: Unique identifier for this decision point
        turn_index: Index of this turn in the episode

    Returns:
        The best candidate response
    """
    context = Context.get()
    step_manager = context.intermediate_step_manager

    # Find best candidate
    best_idx = scores.index(max(scores))

    # Emit TTC_END step for each candidate
    for idx, (candidate, score) in enumerate(zip(candidates, scores)):
        ttc_data = TTCEventData(
            turn_id=turn_id,
            turn_index=turn_index,
            candidate_index=idx,
            input=prompt,
            output=candidate,
            score=score,
        )

        step_manager.push_intermediate_step(
            IntermediateStepPayload(
                event_type=IntermediateStepType.TTC_END,
                name="dpo_candidate_move",
                data=ttc_data,
                metadata={"is_selected": idx == best_idx},
            )
        )

    return candidates[best_idx]
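
For illustration, a hypothetical call site might look like the following; generating and scoring candidates is workflow-specific and the values shown here are made up.

# Inside an async workflow step, after generating and scoring candidates
best_move = await ttc_move_selector(
    prompt="What move should I make?",
    candidates=["I'll play X in the center...", "I'll play X in the corner..."],
    scores=[0.9, 0.5],
    turn_id="turn_0",
    turn_index=0,
)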

How It Works

Phase 1: Data Collection

The DPO trajectory builder collects preference data through the NeMo Agent Toolkit evaluation system:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DPO Trajectory Builder Flow                              │
│                                                                             │
│  start_run(run_id)                                                          │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Launch evaluation run                                                │  │
│  │                                                                       │  │
│  │  For each dataset example:                                            │  │
│  │    1. Execute workflow                                                │  │
│  │    2. Workflow emits TTC_END steps with TTCEventData                  │  │
│  │    3. Compute reward using configured evaluator                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  finalize(run_id)                                                           │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Process collected intermediate steps:                                │  │
│  │                                                                       │  │
│  │  1. Filter for TTC_END steps with configured name                     │  │
│  │  2. Extract TTCEventData (turn_id, candidate_index, score, etc.)      │  │
│  │  3. Group candidates by (example_id, turn_id)                         │  │
│  │  4. Generate preference pairs based on score differences              │  │
│  │  5. Build Trajectory objects with DPOItem episodes                    │  │
│  │  6. Group trajectories by example_id                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  Return TrajectoryCollection                                                │
└─────────────────────────────────────────────────────────────────────────────┘

Phase 2: Training Submission

The trainer adapter converts trajectories and submits to NeMo Customizer:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    NeMo Customizer Trainer Adapter Flow                     │
│                                                                             │
│  submit(trajectories)                                                       │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Convert to JSONL format:                                             │  │
│  │                                                                       │  │
│  │  {                                                                    │  │
│  │    "prompt": "What move should I make?",                              │  │
│  │    "chosen_response": "I'll play X in the center...",                 │  │
│  │    "rejected_response": "I'll play X in the corner..."                │  │
│  │  }                                                                    │  │
│  │                                                                       │  │
│  │  Split: 80% training, 20% validation                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Upload to NeMo Datastore:                                            │  │
│  │                                                                       │  │
│  │  1. Create dataset repo via HuggingFace Hub API                       │  │
│  │  2. Register dataset in Entity Store                                  │  │
│  │  3. Upload training_file.jsonl and validation_file.jsonl              │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Submit customization job:                                            │  │
│  │                                                                       │  │
│  │  client.customization.jobs.create(                                    │  │
│  │    config=customization_config,                                       │  │
│  │    dataset={name, namespace},                                         │  │
│  │    hyperparameters={training_type: dpo, ...}                          │  │
│  │  )                                                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  Return TrainingJobRef                                                      │
└─────────────────────────────────────────────────────────────────────────────┘
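
The conversion and split step can be approximated with a short sketch. Record keys follow the Dataset Format section below; the file paths and shuffling seed are illustrative, and this is not the adapter's actual code.

import json
import random

def write_jsonl_splits(
    pairs: list[dict],
    train_path: str = "training_file.jsonl",
    val_path: str = "validation_file.jsonl",
    train_frac: float = 0.8,
    seed: int = 0,
) -> None:
    """Shuffle DPO pairs and write an 80/20 train/validation split as JSONL."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    split = int(len(shuffled) * train_frac)
    for path, subset in ((train_path, shuffled[:split]),
                         (val_path, shuffled[split:])):
        with open(path, "w") as f:
            for record in subset:
                f.write(json.dumps(record) + "\n")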

Phase 3: Monitoring and Completion

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Training Monitoring Flow                                 │
│                                                                             │
│  wait_until_complete(job_ref)                                               │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Poll job status:                                                     │  │
│  │                                                                       │  │
│  │  while not done:                                                      │  │
│  │    status = client.customization.jobs.status(job_id)                  │  │
│  │    log status changes and progress                                    │  │
│  │    if status in [completed, failed, cancelled]: break                 │  │
│  │    sleep(poll_interval_seconds)                                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼ (if deploy_on_completion and status == completed)                    │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Deploy trained model:                                                │  │
│  │                                                                       │  │
│  │  1. Create deployment config                                          │  │
│  │  2. Create model deployment                                           │  │
│  │  3. Wait for deployment to be ready                                   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│      │                                                                      │
│      ▼                                                                      │
│  Return TrainingJobStatus                                                   │
└─────────────────────────────────────────────────────────────────────────────┘
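
The monitoring loop has roughly the following shape. This is a sketch: get_job_status stands in for the NeMo Customizer status call and is not a real toolkit function.

import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_until_complete(get_job_status, job_id: str,
                        poll_interval_seconds: float = 30.0) -> str:
    """Poll a job until it reaches a terminal state, logging status changes."""
    last_status = None
    while True:
        status = get_job_status(job_id)  # stand-in for the real status query
        if status != last_status:
            print(f"Job {job_id}: Status -> {status!r}")
            last_status = status
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval_seconds)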

Running DPO Training

Basic Training

# Run DPO training with your configuration
nat finetune --config_file=configs/dpo_finetune.yml

With Configuration Overrides

# Override number of data collection runs
nat finetune --config_file=configs/dpo_finetune.yml \
    -o trainers.nemo_trainer.num_runs 5

# Override training epochs
nat finetune --config_file=configs/dpo_finetune.yml \
    -o trainer_adapters.nemo_adapter.hyperparameters.epochs 5

# Override learning rate
nat finetune --config_file=configs/dpo_finetune.yml \
    -o trainer_adapters.nemo_adapter.hyperparameters.learning_rate 1e-5

Monitoring Progress

During training, check:

  1. Console Output: Shows data collection progress, pair counts, job status
INFO - Starting NeMo Customizer DPO workflow with 3 data collection runs
INFO - Starting data collection run 1/3
INFO - Run 1: Collected 50 trajectories, 120 DPO pairs, avg reward: 0.4523
INFO - Starting data collection run 2/3
INFO - Run 2: Collected 50 trajectories, 115 DPO pairs, avg reward: 0.4812
INFO - Starting data collection run 3/3
INFO - Run 3: Collected 50 trajectories, 118 DPO pairs, avg reward: 0.4701
INFO - Data collection complete: 150 trajectory groups, ~353 total DPO pairs from 3 runs
INFO - Deduplication: 353 -> 312 trajectories
INFO - Submitted training job: job_abc123
INFO - Job nemo_dpo_a1b2c3d4: Status -> 'running'
INFO - Job nemo_dpo_a1b2c3d4: Progress 25.0%
INFO - Job nemo_dpo_a1b2c3d4: Progress 50.0%
INFO - Job nemo_dpo_a1b2c3d4: Progress 75.0%
INFO - Job nemo_dpo_a1b2c3d4: Status -> 'completed'
INFO - Training completed with status: completed
  2. Output Files (in finetuning.output_dir):

    • data_collection_progress.jsonl: Per-run metrics
    • collection_history.json: Complete collection history
    • final_metrics.json: Final training metrics
  3. NeMo Customizer UI: Monitor job progress via the NeMo platform

Dataset Format

The trainer adapter converts DPO pairs to JSONL format:

Standard Format (use_full_message_history: false)

{"prompt": "What's the best move in this position?", "chosen_response": "I'll play X in the center because...", "rejected_response": "I'll play X in the corner because..."}
{"prompt": "How should I respond to this attack?", "chosen_response": "I should defend by...", "rejected_response": "I should attack by..."}

Full Message History Format (use_full_message_history: true)

{"prompt": [{"role": "system", "content": "You are a chess expert."}, {"role": "user", "content": "What's the best move?"}], "chosen_response": "I recommend Nf3 because...", "rejected_response": "I recommend a4 because..."}

Advanced Configuration

Tuning DPO Hyperparameters

KL Penalty (ref_policy_kl_penalty)

The KL penalty (β) controls how much the model can deviate from the reference policy:

hyperparameters:
  dpo:
    ref_policy_kl_penalty: 0.1  # Default: balanced exploration
    # ref_policy_kl_penalty: 0.01  # Lower: more aggressive updates
    # ref_policy_kl_penalty: 0.5   # Higher: more conservative updates
  • Lower values (0.01-0.05): Allow larger policy updates, faster learning but risk of instability
  • Higher values (0.2-0.5): More conservative updates, slower but more stable training

SFT Loss Weight

Adding SFT loss on chosen responses can help maintain response quality:

hyperparameters:
  dpo:
    sft_loss_weight: 0.1  # Add 10% SFT loss
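
Conceptually, based on the hyperparameter names above, the combined objective is a weighted sum (a sketch, not the exact Customizer implementation):

L_total = preference_loss_weight · L_DPO + sft_loss_weight · L_SFT(chosen)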

Optimizing Data Collection

Multiple Runs for Diversity

Running multiple data collection passes generates diverse preference pairs:

trainers:
  nemo_trainer:
    num_runs: 5  # More runs = more diverse data

Filtering Weak Preferences

Filter out pairs with small score differences:

trajectory_builders:
  dpo_builder:
    min_score_diff: 0.1  # Only keep pairs with >0.1 score difference

Limiting Pairs Per Turn

For turns with many candidates, limit pairs to strongest preferences:

trajectory_builders:
  dpo_builder:
    exhaustive_pairs: true
    max_pairs_per_turn: 5  # Keep top 5 pairs by score difference

Automatic Model Deployment

Enable automatic deployment of trained models:

trainer_adapters:
  nemo_adapter:
    deploy_on_completion: true
    deployment_config:
      image_name: nvcr.io/nim/meta/llama-3.1-8b-instruct
      image_tag: latest
      gpu: 2
      deployment_name: my-dpo-model
      description: DPO-finetuned agent model
    deployment_timeout_seconds: 3600  # 1 hour timeout

Troubleshooting

Connection Issues

"Failed to connect to NeMo Customizer"

  1. Verify endpoints are correct:

    curl https://nmp.example.com/health
    curl https://datastore.example.com/health
  2. Check authentication (HuggingFace token if required)

  3. Verify network connectivity and firewall rules

No Preference Pairs Generated

"No trajectories collected from any run"

  1. Check TTC step name: Ensure ttc_step_name matches your workflow:

    trajectory_builders:
      dpo_builder:
        ttc_step_name: dpo_candidate_move  # Must match workflow
  2. Verify TTCEventData is emitted: Add logging to confirm steps are being pushed

  3. Check candidate scores: If all candidates have same score, no pairs can be created

  4. Review min_score_diff: Lower threshold if filtering too aggressively:

    trajectory_builders:
      dpo_builder:
        min_score_diff: 0.0  # Accept all score differences

Training Job Failures

"Customization job failed"

  1. Check NeMo Customizer logs for detailed error messages

  2. Verify dataset format is correct:

    # Check generated JSONL files
    cat .tmp/nat/finetuning/output/*/training_file.jsonl | head -5
  3. Ensure model configuration is valid and available

  4. Check GPU resources are available

Deployment Issues

"Deployment did not become ready within timeout"

  1. Increase timeout:

    trainer_adapters:
      nemo_adapter:
        deployment_timeout_seconds: 3600  # 1 hour
  2. Check NeMo deployment logs for errors

  3. Verify GPU resources are available for deployment

  4. Check deployment configuration matches model requirements

Memory Issues

"CUDA out of memory" during training

  1. Reduce batch size:

    hyperparameters:
      batch_size: 2  # Reduce from default 4
  2. Use LoRA instead of full-weight:

    hyperparameters:
      finetuning_type: lora
  3. Contact NeMo Customizer admin to allocate more GPU resources

Examples

The examples/finetuning/dpo_tic_tac_toe directory contains a complete working example demonstrating:

  • Tic-tac-toe game workflow with TTC move selection
  • Custom scoring function for move quality
  • Full DPO training configuration
  • Training and evaluation datasets

See the example's README for detailed instructions.

Best Practices

Data Quality

  1. Meaningful Score Differences: Ensure your scoring function produces meaningful distinctions between candidates
  2. Diverse Training Data: Use multiple data collection runs and diverse input examples
  3. Balance Difficulty: Include examples of varying difficulty levels

Hyperparameter Selection

  1. Start Conservative: Begin with default KL penalty (0.1) and adjust based on results
  2. Monitor Validation: Track validation metrics to detect overfitting
  3. Iterate: DPO often benefits from multiple rounds of training with fresh data

Production Deployment

  1. Test Before Deploy: Evaluate model quality before enabling automatic deployment
  2. Version Models: Use descriptive deployment names for tracking
  3. Monitor Performance: Track model performance in production and retrain as needed

See Also