Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/about/ecosystem.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ We have hands-on tutorials with supported training frameworks to help you train
- **{doc}`NeMo RL <../training-tutorials/nemo-rl-grpo/index>`** - GRPO training to improve multi-step tool calling on the Workplace Assistant environment
- **[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/python/agent_func_nemogym_executor.py)** - example agent executor for RL training
- **{doc}`TRL <../training-tutorials/trl>`** - GRPO training on Workplace Assistant and Reasoning Gym environments
- **{doc}`Unsloth <../training-tutorials/unsloth-training>`** - GRPO training on instruction following and reasoning environments
- **{doc}`Unsloth <../training-tutorials/unsloth>`** - GRPO training on instruction following and reasoning environments
- **NeMo Customizer** - *(In progress)*
- **VeRL** - *(In progress)*

Expand Down
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -309,7 +309,7 @@ Multi-Environment Training <environment-tutorials/multi-environment-training>
Overview <training-tutorials/index>
NeMo RL <training-tutorials/nemo-rl-grpo/index.md>
TRL <training-tutorials/trl>
Unsloth <training-tutorials/unsloth-training>
Unsloth <training-tutorials/unsloth>
Offline Training (SFT/DPO) <training-tutorials/offline-training-w-rollouts>
```

Expand Down
2 changes: 1 addition & 1 deletion docs/training-tutorials/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ GRPO training on Workplace Assistant and Reasoning Gym environments
:::

:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Unsloth
:link: unsloth-training
:link: unsloth
:link-type: doc
GRPO training on instruction following and reasoning environments.
+++
Expand Down
239 changes: 236 additions & 3 deletions docs/training-tutorials/trl.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,240 @@
(training-trl)=

# TRL Training
# RL Training with TRL

```{warning}
**Status: In Development** — TRL integration is planned but not yet implemented. Track progress at [GitHub Issue #548](https://github.com/NVIDIA-NeMo/Gym/issues/548).
[TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl) is Hugging Face's library for post-training foundation models. This integration enables training models in NeMo Gym environments using TRL's GRPOTrainer with vLLM server mode.

### Install TRL and NeMo Gym

1. **Install TRL venv with vLLM and some extras**

```bash
cd trl/
uv venv
source .venv/bin/activate
uv sync --extra vllm
uv pip install fastapi uvicorn accelerate deepspeed wandb omegaconf
```

1. **Install NeMo Gym in a separate venv**

```bash
git clone https://github.com/NVIDIA-NeMo/Gym.git
cd Gym
uv venv --python 3.12
source .venv/bin/activate
uv sync
```

### Prepare a Dataset

In this example we use the reasoning gym resources server in NeMo Gym to train a model in sudoku:

```bash
cd Gym
source .venv/bin/activate
uv pip install reasoning-gym
cd resources_servers/reasoning_gym
python scripts/create_dataset.py \
--task mini_sudoku \
--size 2000 \
--seed 42 \
--output data/reasoning_gym/train_mini_sudoku.jsonl

python scripts/create_dataset.py \
--task mini_sudoku \
--size 50 \
--seed 24 \
--output data/reasoning_gym/val_mini_sudoku.jsonl
```

## Interactive Training

Training requires 2+ GPUs, one for the vLLM server, and one for training. The NeMo Gym TRL integration currently depends on vLLM server mode.

To run training on a single node, launch the NeMo Gym servers, vLLM server, then run training:

### Setup

1. **Update Environment Config**

Update `env.yaml` in `Gym/` to include model information:

```yaml
policy_base_url: http://127.0.0.1:8000/v1
policy_api_key: EMPTY
policy_model_name: Qwen/Qwen2.5-1.5B-Instruct
```

2. **Update Training Config**

Update `examples/scripts/nemo_gym/config.yaml` to point to the mini sudoku dataset:

```yaml
model_name: "Qwen/Qwen2.5-1.5B-Instruct"

dataset_path: "/path/to/Gym/resources_servers/reasoning_gym/data/reasoning_gym/train_mini_sudoku.jsonl"
eval_dataset_path: "/path/to/Gym/resources_servers/reasoning_gym/data/reasoning_gym/val_mini_sudoku.jsonl"

task: "mini-sudoku"
output_dir: "outputs/nemo_gym_sudoku"

learning_rate: 1.0e-5
num_generations: 16
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
max_completion_length: 10000
vllm_importance_sampling_correction: true

temperature: 1.0
top_p: 0.999
```

### Run Training


1. **Start NeMo Gym Servers**

```bash
cd Gym/
source .venv/bin/activate

config_paths="resources_servers/reasoning_gym/configs/reasoning_gym.yaml,\
responses_api_models/vllm_model/configs/vllm_model_for_training.yaml"

ng_run "+config_paths=[${config_paths}]"
```

1. **Start TRL vLLM Server on GPU 0**

```bash
cd trl/
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
--model Qwen/Qwen2.5-1.5B-Instruct \
--max-model-len 16384 \
--host 0.0.0.0 \
--port 8000
```

1. **Run Training on GPU 1**

```bash
cd trl/
source .venv/bin/activate
cd examples/scripts/nemo_gym

CUDA_VISIBLE_DEVICES=1 python train_multi_environment.py --config config.yaml
```

## Multi-Node Training with Slurm

An example five-node training script is provided in `submit.sh`. Nodes one through four run the training backend, while node five runs vLLM inference for NeMo Gym agent rollouts.

Before running the Slurm script, ensure you have completed the TRL and NeMo Gym installation steps above. The script assumes `.venv` directories exist for both TRL and Gym. If you use a container in the Slurm script, you should also create the virtual environments from the container in an interactive session or with a separate sbatch script.

1. **Configure the Script**

Update `submit.sh` with your Slurm account, partition, paths to your project directory, and updated training configs.

1. **Submit the Job**

```bash
sbatch submit.sh
```

1. **Monitor Training**

```bash
tail -f logs/<job_id>/*
```

## Multi-Environment Training

NeMo Gym is designed to enable training on many environments simultaneously and at scale. This allows learning diverse capabilities, such as tool calling and reasoning, in a single training run. In this example, we add the workplace assistant environment to the mini sudoku setup above, which is a multi-step tool use environment for office tasks.

1. **Prepare Workplace Assistant Dataset**

Many NeMo Gym datasets used to train Nemotron models are available on Hugging Face. Use `ng_prepare_data` to download and prepare datasets. This command:

- Downloads the dataset from Hugging Face
- Validates the format and computes metrics
- Adds an `agent_ref` field to each example that tells NeMo Gym which agent server should handle that example

First, create `env.yaml` in `Gym/` with your HF token:

```yaml
hf_token: <your_hf_token>
```

Then prepare the dataset:

```bash
cd Gym
source .venv/bin/activate

config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

ng_prepare_data "+config_paths=[${config_paths}]" \
+output_dirpath=data/workplace_assistant \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface
```

This creates `train.jsonl` and `validation.jsonl` files in `data/workplace_assistant/`.

1. **Create Combined Dataset**

Combine datasets into a single file with tasks from both environments:

```bash
cat data/workplace_assistant/train_workplace.jsonl data/reasoning_gym/train_mini_sudoku.jsonl | shuf > train_multi_env.jsonl
```

> **Tip**: Ensure datasets are the same size before shuffling for an even blend of tasks. Repeat for the validation dataset.

1. **Update Training Config**

Update the config to point to the combined dataset:

```yaml
model_name: "Qwen/Qwen3-4B-Instruct-2507"

dataset_path: "/path/to/data/train_multi_env.jsonl"
eval_dataset_path: "/path/to/data/val_multi_env.jsonl"

task: "workplace-sudoku" # used in wandb run name
output_dir: "outputs/nemo_gym_multi_env"

# ... rest of config same
```

1. **Update ng_run**

Whether training interactively or via Slurm, update the `ng_run` command to include config files from each resources server:

```bash
cd Gym
source .venv/bin/activate

config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
resources_servers/reasoning_gym/configs/reasoning_gym.yaml"

ng_run "+config_paths=[${config_paths}]"
```

This starts servers for both environments. The training script automatically routes each example to the correct agent server based on its `agent_ref` field.

1. **Run Training**

Update the Slurm submission script to use the new training config and both `ng_run` resources server configs, then submit the job as before.

The training script reads `agent_ref` from each example's metadata, routes requests to the correct NeMo Gym agent server, and handles different agents and environments in the same batch.

## Resources

- [TRL GitHub](https://github.com/huggingface/trl)
- [TRL Documentation](https://huggingface.co/docs/trl/en/index)
Original file line number Diff line number Diff line change
Expand Up @@ -6,24 +6,7 @@ This tutorial demonstrates how to use [Unsloth](https://github.com/unslothai/uns

**Unsloth** is a fast, memory-efficient library for fine-tuning large language models. It provides optimized implementations that significantly reduce memory usage and training time, making it possible to fine-tune larger models on consumer hardware.

Unsloth can be used with NeMo Gym single-step verifiers including math tasks, structured outputs, instruction following, reasoning gym, and more.

:::{card}

**Goal**: Fine-tune a model for single-step tasks using Unsloth with NeMo Gym verifiers.

**Time**: ~30 minutes (Colab)

^^^

**In this tutorial, you will**:

1. Set up Unsloth for efficient fine-tuning
2. Use NeMo Gym for tasks and verification
3. Train a model using GRPO on a single GPU
4. Evaluate trained model performance

:::
Unsloth can be used with NeMo Gym single-step verifiers including math tasks, structured outputs, instruction following, reasoning gym, and more.

## Prerequisites

Expand All @@ -34,18 +17,25 @@ Unsloth can be used with NeMo Gym single-step verifiers including math tasks, st

## Getting Started

Follow this interactive notebook to train your first model with Unsloth and NeMo Gym:
Follow these interactive notebooks to train models with Unsloth and NeMo Gym:

:::{button-link} https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/nemo_gym_sudoku.ipynb
:::{button-link} https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/NeMo-Gym-Sudoku.ipynb
:color: primary
:class: sd-rounded-pill

Unsloth GRPO notebook
Sudoku
:::

:::{button-link} https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/NeMo-Gym-Multi-Environment.ipynb
:color: secondary
:class: sd-rounded-pill

Multi-Environment Training
:::

Check out [Unsloth's documentation](https://docs.unsloth.ai/models/nemotron-3#reinforcement-learning--nemo-gym) for more details.

> **Note:** This notebook supports **single-step tasks** including math, structured outputs, instruction following, reasoning gym, and more. For multi-step tool calling scenarios, see the {doc}`GRPO with NeMo RL <nemo-rl-grpo/index>` tutorial.
> **Note:** These notebooks support **single-step tasks** including math, structured outputs, instruction following, reasoning gym, and more. For multi-step tool calling scenarios, see the {doc}`GRPO with NeMo RL <nemo-rl-grpo/index>` tutorial.

---

Expand Down