11 changes: 9 additions & 2 deletions Makefile
@@ -15,7 +15,7 @@ PUBLISH_WORKFLOW := Publish Fern Docs
.DEFAULT_GOAL := help

.PHONY: help \
docs docs-check docs-preview docs-publish docs-login docs-generate-library
docs docs-check docs-preview docs-publish docs-login docs-login-remote docs-generate-library

help:
@echo ""
@@ -80,7 +80,14 @@ docs-login:
*) echo ""; echo "Bailing. Open the dashboard URL above, sign in, then re-run 'make docs-login'."; exit 1 ;; \
esac
@echo ""
npx -y fern-api@latest login
npx -y fern-api@latest login $(LOGIN_FLAGS)

# Same as docs-login, but uses Fern's device-code flow instead of opening a
# browser locally — needed on headless remote machines (SSH dev boxes, etc.)
# where the OAuth callback can't reach a browser. Prints a URL + short code
# you complete on your laptop.
docs-login-remote: LOGIN_FLAGS := --device-code
docs-login-remote: docs-login

# Local-only preview. `fern docs md generate` populates fern/product-docs/ from
# the nemo_gym package source (declared under `libraries:` in fern/docs.yml);
4 changes: 2 additions & 2 deletions fern/README.md
@@ -25,7 +25,7 @@ npm install -g fern-api

# 2. Provision your Fern account + CLI auth (one-time per machine).
# Walks you through the dashboard sign-in step before running `fern login`.
make docs-login
make docs-login # or `make docs-login-remote` when working on a headless remote machine

# 3. Build the API library reference and start the local dev server
make docs # http://localhost:3000
@@ -88,7 +88,7 @@ make docs-publish # trigger the `Publish Fern Docs` workflow on origin
make docs-generate-library # standalone library regeneration (rarely needed; `make docs` runs it)
```

For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above — `make docs-login` walks through dashboard provisioning + `fern login` together.
For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above: `make docs-login` (or `make docs-login-remote` on a headless remote machine) walks through dashboard provisioning and `fern login` together.

`make docs` first runs `fern docs md generate`, which populates `fern/product-docs/` from the `nemo_gym` package source declared in the `libraries:` block of `docs.yml`. Without it, a cold `fern docs dev` will fail with `Folder not found: ./product-docs/...`. Re-run only when the upstream Python source changes — for prose-only iteration after the first generation, `cd fern && npm run dev` is enough.

6 changes: 6 additions & 0 deletions fern/versions/latest/pages/model-server/index.mdx
@@ -25,6 +25,12 @@ Self-hosted inference with vLLM for maximum control.
<Badge minimal outlined>self-hosted</Badge> <Badge minimal outlined>open-source</Badge>
</Card>

<Card title="Local vLLM" href="/model-server/local-vllm">
Let NeMo Gym launch and manage the vLLM server for you.

<Badge minimal outlined>gym-managed</Badge> <Badge minimal outlined>open-source</Badge>
</Card>

</Cards>

## Server Configuration
57 changes: 57 additions & 0 deletions fern/versions/latest/pages/model-server/local-vllm-proxy.mdx
@@ -0,0 +1,57 @@
---
title: "Local vLLM Proxy"
description: "Expose one Local vLLM deployment as multiple model servers"
position: 4
---

LocalVLLMModelProxy (in `responses_api_models/local_vllm_model_proxy`) is a lightweight model server that forwards requests to an existing [LocalVLLMModel](/model-server/local-vllm) instead of launching its own vLLM engine.
It is a subclass of VLLMModel, so it accepts the same configuration fields, but it owns no GPUs.

## When to use it

Use a proxy when you need several model servers that share **one** vLLM deployment but differ in their request-time configuration.
For example, one server with reasoning enabled and one with reasoning disabled through the request params, or servers with different sampling parameters.
Without the proxy you would have to launch a separate vLLM engine (and duplicate GPUs) for each variation.

At startup the proxy waits for its referenced LocalVLLMModel to come up, reads that server's inner vLLM endpoint (`base_url`, `api_key`, `model`), and routes all of its own requests there.

If you are working with an existing vLLM endpoint that you manage outside of Gym, use [VLLMModel](/model-server/vllm) instead.
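
The startup behaviour described above (wait for the backing server, then copy its endpoint) can be pictured with a short Python sketch. This is not the proxy's actual implementation: `BackingEndpoint`, `wait_for_backing_endpoint`, and the `lookup` callable are hypothetical names introduced only for illustration, and the polling interval and timeout are assumed values.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BackingEndpoint:
    """Connection details the proxy copies from the backing server (hypothetical type)."""

    base_url: str
    api_key: str
    model: str


def wait_for_backing_endpoint(
    lookup: Callable[[], Optional[BackingEndpoint]],
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> BackingEndpoint:
    """Poll `lookup` until it returns an endpoint or the timeout expires.

    `lookup` is a hypothetical stand-in for however the proxy queries the
    referenced LocalVLLMModel; it returns None while that server is starting.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        endpoint = lookup()
        if endpoint is not None:
            return endpoint
        if time.monotonic() >= deadline:
            raise TimeoutError("backing LocalVLLMModel did not come up in time")
        time.sleep(poll_interval_s)
```

Once an endpoint is returned, every request the proxy receives would be sent to that `base_url` with the proxy's own request-time settings applied, which is why the proxy needs no GPUs of its own.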

## Configuration

A proxy is a normal model server config that adds a `model_server` reference pointing at the LocalVLLMModel it should forward to:

```yaml
policy_model_reasoning_off:
responses_api_models:
local_vllm_model_proxy:
entrypoint: app.py

# Request-time settings that differ from the backing server
chat_template_kwargs:
enable_thinking: false

# Standard VLLMModel fields
return_token_id_information: false
uses_reasoning_parser: true

model_server:
type: responses_api_models
name: policy_model # name of the LocalVLLMModel server to proxy to
```

Run it alongside the backing LocalVLLMModel by chaining both configs in `config_paths`:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy.yaml"
ng_run "+config_paths=[${config_paths}]" \
++policy_model_proxy.responses_api_models.local_vllm_model_proxy.model_server.name=gpt-oss-20b-reasoning-high
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_server` | `ModelServerRef` | — | **Required.** The LocalVLLMModel server to forward requests to, by `type` and `name`. |

`base_url`, `api_key`, and `model` are populated automatically from the backing server and should **not** be set in your config.
All other VLLMModel fields (`chat_template_kwargs`, `extra_body`, `return_token_id_information`, and so on) behave as documented in the [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference).
160 changes: 160 additions & 0 deletions fern/versions/latest/pages/model-server/local-vllm.mdx
@@ -0,0 +1,160 @@
---
title: "Local vLLM"
description: "Gym-managed vLLM server deployment"
position: 3
---

NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in `responses_api_models/local_vllm_model`).
LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it.
The Chat Completions to Responses API conversion is inherited from VLLMModel. See [VLLMModel](/model-server/vllm) for details.

A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off).
See [Local vLLM Proxy](/model-server/local-vllm-proxy) for this configuration.

<Note>
If you want to connect NeMo Gym to a vLLM server you start and manage yourself, use VLLMModel directly. See [VLLMModel](/model-server/vllm) for more details.
</Note>

## Use LocalVLLMModel

Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of `responses_api_models/local_vllm_model`, and model weights are fetched from the Hugging Face Hub on first run (using `HF_TOKEN` from your environment if present). A single `ng_run` brings up both vLLM and NeMo Gym.

Several model configs ship with the server under `responses_api_models/local_vllm_model/configs/` — see the `Qwen/`, `openai/`, and `nvidia/` subdirectories for ready-to-use examples. To launch a single-node 8-GPU run with `Qwen3-30B-A3B-Instruct-2507`:

```bash
config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml"
ng_run "+config_paths=[${config_paths}]"
```

Override the parallelism dimensions on the command line to match your node:

```bash
ng_run "+config_paths=[${config_paths}]" \
++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2
```

Once the servers are up, call the agent to verify everything works end-to-end:

```bash
python responses_api_agents/simple_agent/client.py
```

## LocalVLLMModel configuration reference

LocalVLLMModel inherits all fields from VLLMModel (see [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference)). It adds the following:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `vllm_serve_kwargs` | `dict` | — | **Required.** Arguments passed through to `vllm serve`. See `vllm_serve_kwargs` below. |
| `vllm_serve_env_vars` | `dict` | — | **Required.** Environment variables for the vLLM process. Must include `VLLM_RAY_DP_PACK_STRATEGY`. |
| `hf_home` | `str` | `<cwd>/.cache/huggingface` | Hugging Face cache directory. Set this if you have already downloaded weights elsewhere. |
| `debug` | `bool` | `false` | Print vLLM server logs to stderr. |
| `show_vllm_engine_stats` | `bool` | `false` | Periodically log vLLM engine throughput stats. |
| `ray_worker_py_executable` | `str` | `sys.executable` | Python interpreter Ray uses for worker processes. |

Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should **not** be set in your config:

- `base_url`: assigned to the URL of the vLLM process once it binds a port. Defaults to `[]`.
- `api_key`: defaults to `"dummy"`. vLLM does not authenticate local connections.

### `vllm_serve_kwargs`

Required keys (asserted at startup):

- `data_parallel_size`
- `tensor_parallel_size`
- `pipeline_parallel_size`

LocalVLLMModel injects the following keys automatically. Do not set them in your config:

- `distributed_executor_backend: ray`
- `data_parallel_backend: ray`
- `host: 0.0.0.0`
- `port` (chosen from the free-port pool)
- `download_dir` (derived from `hf_home`)
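
The two lists above amount to a simple rule: three keys must be present, and five keys belong to LocalVLLMModel. The Python sketch below shows one way to check a config against that rule before launching. It is an illustration only: `build_serve_kwargs` is a hypothetical helper, not part of NeMo Gym, and rejecting user-supplied managed keys is an assumption (the real server may handle them differently).

```python
REQUIRED_KEYS = ("data_parallel_size", "tensor_parallel_size", "pipeline_parallel_size")
MANAGED_KEYS = (
    "distributed_executor_backend",
    "data_parallel_backend",
    "host",
    "port",
    "download_dir",
)


def build_serve_kwargs(user_kwargs: dict, port: int, download_dir: str) -> dict:
    """Validate user-supplied kwargs and add the managed keys (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in user_kwargs]
    if missing:
        raise ValueError(f"vllm_serve_kwargs is missing required keys: {missing}")

    # Assumption: treat user-set managed keys as a config error.
    clashing = [key for key in MANAGED_KEYS if key in user_kwargs]
    if clashing:
        raise ValueError(f"vllm_serve_kwargs must not set managed keys: {clashing}")

    return {
        **user_kwargs,
        "distributed_executor_backend": "ray",
        "data_parallel_backend": "ray",
        "host": "0.0.0.0",
        "port": port,
        "download_dir": download_dir,
    }
```

Here `port` and `download_dir` are passed in by the caller because the text above says they come from the free-port pool and from `hf_home` respectively; how those values are actually derived is not shown here.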

Commonly tuned keys (see the shipped configs for full examples):

```yaml
vllm_serve_kwargs:
data_parallel_size: 1
tensor_parallel_size: 8
pipeline_parallel_size: 1
gpu_memory_utilization: 0.9
trust_remote_code: true
enable_auto_tool_choice: true
tool_call_parser: hermes
model_loader_extra_config:
enable_multithread_load: true
num_threads: 16
```

Any flag accepted by `vllm serve` can be set under `vllm_serve_kwargs`. See the [official vLLM serve reference](https://docs.vllm.ai/en/latest/cli/serve/) for the full list.

### `vllm_serve_env_vars`

Environment variables set in the vLLM process. `VLLM_RAY_DP_PACK_STRATEGY` is mandatory:

```yaml
vllm_serve_env_vars:
VLLM_RAY_DP_PACK_STRATEGY: strict
```

See [Multi-node deployments](#multi-node-deployments) for what `strict` and `span` mean.

## Multi-node deployments

LocalVLLMModel uses Ray to place vLLM workers across nodes. The `VLLM_RAY_DP_PACK_STRATEGY` environment variable controls how worker groups are packed:

- **`strict`**: each data-parallel replica must fit on a single node (`tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node`). Use for single-node setups or when running multiple replicas, each constrained to one node.
- **`span`**: a single model instance may span multiple nodes. Use when `tensor_parallel_size * pipeline_parallel_size` exceeds the GPU count of one node. When `span` is set, `data_parallel_size_local` is automatically unset.
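
The choice between the two strategies follows from the arithmetic in the bullets above. As a rough sketch in Python (the function names are hypothetical, and the total-GPU formula assumes each data-parallel replica uses `tensor_parallel_size * pipeline_parallel_size` GPUs):

```python
def suggest_pack_strategy(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Return "strict" if one replica fits on a node, otherwise "span"."""
    if min(tensor_parallel_size, pipeline_parallel_size, gpus_per_node) < 1:
        raise ValueError("all sizes must be positive integers")
    gpus_per_replica = tensor_parallel_size * pipeline_parallel_size
    return "strict" if gpus_per_replica <= gpus_per_node else "span"


def total_gpus(
    data_parallel_size: int,
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
) -> int:
    """GPUs needed by one deployment, assuming DP * TP * PP workers in total."""
    return data_parallel_size * tensor_parallel_size * pipeline_parallel_size
```

For example, on 8-GPU nodes a TP=8, PP=1 replica fits on one node, so `strict` applies, while TP=8, PP=2 needs 16 GPUs per replica and calls for `span`.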

### Sample topologies

**1 node, 1 instance (TP=8).** The default for the shipped configs:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
```

**1 node, 1 instance (DP=2, TP=4).** Split one node into two data-parallel replicas:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
```

**1 node, 2 instances (TP=4 each).** Chain two model configs into one run; each gets half the node:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]" \
++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
```

**2 nodes, 2 instances (TP=8 each).** With `strict` packing, each replica stays on its own node:

```bash
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
```

## Inherited features

The following capabilities work the same as in VLLMModel. See [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference) for details.

- **`chat_template_kwargs`**: override chat template behavior per model.
- **`extra_body`**: pass vLLM-specific request parameters (for example, `guided_json`, `reasoning.effort`).
- **`return_token_id_information`**: enable for training workflows that need `prompt_token_ids`, `generation_token_ids`, and `generation_log_probs`.

<Note>
Multi-endpoint replicas (the `base_url: list[str]` pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in `config_paths`, or define multiple top-level keys in a single YAML).
</Note>
8 changes: 7 additions & 1 deletion fern/versions/latest/pages/model-server/vllm.mdx
@@ -1,6 +1,6 @@
---
title: "vLLM"
description: ""
description: "Wrapper for an existing, external vLLM server"
position: 2
---
[vLLM](https://docs.vllm.ai/) is a popular LLM inference engine. The NeMo Gym VLLMModel server wraps vLLM's Chat Completions endpoint and converts requests and responses to NeMo Gym's native format, the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses) schema.
@@ -16,8 +16,14 @@ responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"
```

<Note>
VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. If you would prefer that NeMo Gym launch and manage vLLM for you, use LocalVLLMModel instead. See [LocalVLLMModel](/model-server/local-vllm) to learn more.
</Note>

## Use VLLMModel
Below is an e2e example of how to spin up a NeMo Gym compatible vLLM Chat Completions OpenAI server and run rollout collection with it.
This section walks through starting a vLLM server manually and connecting NeMo Gym to it through `responses_api_models/vllm_model`.
If you want NeMo Gym to manage the vLLM server lifecycle for you instead, see [LocalVLLMModel](/model-server/local-vllm).

### Install vLLM
Run the steps below in a terminal separate from your NeMo Gym terminal. The installation takes a few minutes.
22 changes: 22 additions & 0 deletions responses_api_models/vllm_model/README.md
@@ -0,0 +1,22 @@
# Example run config

VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. Spin up a vLLM server in a separate terminal (see the [vLLM docs](https://docs.vllm.ai/)), then point NeMo Gym at it.

```bash
config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[${config_paths}]" \
++policy_base_url=http://0.0.0.0:10240/v1 \
++policy_model_name=<your-model> \
++policy_api_key=dummy_key &> temp.log &
```

View the logs:
```bash
tail -f temp.log
```

Once you see that server instances are up, call the server. If you see a model response here, then everything is working as intended.
```bash
python responses_api_agents/simple_agent/client.py
```