NVIDIA-NeMo · marta-sd · May 27, 2026 · May 27, 2026 · May 28, 2026 · May 29, 2026
diff --git a/Makefile b/Makefile
@@ -15,7 +15,7 @@ PUBLISH_WORKFLOW := Publish Fern Docs
 .DEFAULT_GOAL := help
 
 .PHONY: help \
-        docs docs-check docs-preview docs-publish docs-login docs-generate-library
+        docs docs-check docs-preview docs-publish docs-login docs-login-remote docs-generate-library
 
 help:
 	@echo ""
@@ -80,7 +80,14 @@ docs-login:
 		*) echo ""; echo "Bailing. Open the dashboard URL above, sign in, then re-run 'make docs-login'."; exit 1 ;; \
 	esac
 	@echo ""
-	npx -y fern-api@latest login
+	npx -y fern-api@latest login $(LOGIN_FLAGS)
+
+# Same as docs-login, but uses Fern's device-code flow instead of opening a
+# browser locally — needed on headless remote machines (SSH dev boxes, etc.)
+# where the OAuth callback can't reach a browser. Prints a URL + short code
+# you complete on your laptop.
+docs-login-remote: LOGIN_FLAGS := --device-code
+docs-login-remote: docs-login
 
 # Local-only preview. `fern docs md generate` populates fern/product-docs/ from
 # the nemo_gym package source (declared under `libraries:` in fern/docs.yml);

diff --git a/fern/README.md b/fern/README.md
@@ -25,7 +25,7 @@ npm install -g fern-api
 
 # 2. Provision your Fern account + CLI auth (one-time per machine).
 #    Walks you through the dashboard sign-in step before running `fern login`.
-make docs-login
+make docs-login   # or `make docs-login-remote` when working on a headless remote machine
 
 # 3. Build the API library reference and start the local dev server
 make docs           # http://localhost:3000
@@ -88,7 +88,7 @@ make docs-publish           # trigger the `Publish Fern Docs` workflow on origin
 make docs-generate-library  # standalone library regeneration (rarely needed; `make docs` runs it)
 ```
 
-For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above — `make docs-login` walks through dashboard provisioning + `fern login` together.
+For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above — `make docs-login` / `make docs-login-remote` walks through dashboard provisioning + `fern login` together.
 
 `make docs` first runs `fern docs md generate`, which populates `fern/product-docs/` from the `nemo_gym` package source declared in the `libraries:` block of `docs.yml`. Without it, a cold `fern docs dev` will fail with `Folder not found: ./product-docs/...`. Re-run only when the upstream Python source changes — for prose-only iteration after the first generation, `cd fern && npm run dev` is enough.
 

diff --git a/fern/versions/latest/pages/model-server/index.mdx b/fern/versions/latest/pages/model-server/index.mdx
@@ -25,6 +25,12 @@ Self-hosted inference with vLLM for maximum control.
 <Badge minimal outlined>self-hosted</Badge> <Badge minimal outlined>open-source</Badge>
 </Card>
 
+<Card title="Local vLLM" href="/model-server/local-vllm">
+Let NeMo Gym launch and manage the vLLM server for you.
+
+<Badge minimal outlined>gym-managed</Badge> <Badge minimal outlined>open-source</Badge>
+</Card>
+
 </Cards>
 
 ## Server Configuration

diff --git a/fern/versions/latest/pages/model-server/local-vllm-proxy.mdx b/fern/versions/latest/pages/model-server/local-vllm-proxy.mdx
@@ -0,0 +1,57 @@
+---
+title: "Local vLLM Proxy"
+description: "Expose one Local vLLM deployment as multiple model servers"
+position: 4
+---
+
+LocalVLLMModelProxy (in `responses_api_models/local_vllm_model_proxy`) is a lightweight model server that forwards requests to an existing [LocalVLLMModel](/model-server/local-vllm) instead of launching its own vLLM engine.
+It is a subclass of VLLMModel, so it accepts the same configuration fields, but it owns no GPUs.
+
+## When to use it
+
+Use a proxy when you need several model servers that share **one** vLLM deployment but differ in their request-time configuration.
+For example, one server with reasoning enabled and one with reasoning disabled through the request params, or servers with different sampling parameters.
+Without the proxy you would have to launch a separate vLLM engine (and duplicate GPUs) for each variation.
+
+At startup the proxy waits for its referenced LocalVLLMModel to come up, reads that server's inner vLLM endpoint (`base_url`, `api_key`, `model`), and routes all of its own requests there.
+
+If you are working with an existing vLLM endpoint that you manage outside of Gym, use [VLLMModel](/model-server/vllm) instead.
+
+## Configuration
+
+A proxy is a normal model server config that adds a `model_server` reference pointing at the LocalVLLMModel it should forward to:
+
+```yaml
+policy_model_reasoning_off:
+  responses_api_models:
+    local_vllm_model_proxy:
+      entrypoint: app.py
+
+      # Request-time settings that differ from the backing server
+      chat_template_kwargs:
+        enable_thinking: false
+
+      # Standard VLLMModel fields
+      return_token_id_information: false
+      uses_reasoning_parser: true
+
+      model_server:
+        type: responses_api_models
+        name: policy_model   # name of the LocalVLLMModel server to proxy to
+```
+
+Run it alongside the backing LocalVLLMModel by chaining both configs in `config_paths`:
+
+```bash
+config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
+responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy.yaml"
+ng_run "+config_paths=[${config_paths}]" \
+    ++policy_model_proxy.responses_api_models.local_vllm_model_proxy.model_server.name=gpt-oss-20b-reasoning-high
+```
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model_server` | `ModelServerRef` | — | **Required.** The LocalVLLMModel server to forward requests to, by `type` and `name`. |
+
+`base_url`, `api_key`, and `model` are populated automatically from the backing server and should **not** be set in your config.
+All other VLLMModel fields (`chat_template_kwargs`, `extra_body`, `return_token_id_information`, and so on) behave as documented in the [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference).
diff --git a/fern/versions/latest/pages/model-server/local-vllm.mdx b/fern/versions/latest/pages/model-server/local-vllm.mdx
@@ -0,0 +1,160 @@
+---
+title: "Local vLLM"
+description: "Gym-managed vLLM server deployment"
+position: 3
+---
+
+NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in `responses_api_models/local_vllm_model`).
+LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it.
+The Chat Completions to Responses API conversion is inherited from VLLMModel. See [VLLMModel](/model-server/vllm) for details.
+
+A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off).
+See [Local vLLM Proxy](/model-server/local-vllm-proxy) for this configuration.
+
+<Note>
+If you want to connect NeMo Gym to a vLLM server you start and manage yourself, use VLLMModel directly. See [VLLMModel](/model-server/vllm) for more details.
+</Note>
+
+## Use LocalVLLMModel
+
+Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of `responses_api_models/local_vllm_model`, and model weights are fetched from the Hugging Face Hub on first run (using `HF_TOKEN` from your environment if present). A single `ng_run` brings up both vLLM and NeMo Gym.
+
+Several model configs ship with the server under `responses_api_models/local_vllm_model/configs/` — see the `Qwen/`, `openai/`, and `nvidia/` subdirectories for ready-to-use examples. To launch a single-node 8-GPU run with `Qwen3-30B-A3B-Instruct-2507`:
+
+```bash
+config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
+responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml"
+ng_run "+config_paths=[${config_paths}]"
+```
+
+Override the parallelism dimensions on the command line to match your node:
+
+```bash
+ng_run "+config_paths=[${config_paths}]" \
+    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
+    ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2
+```
+
+Once the servers are up, call the agent to verify everything works end-to-end:
+
+```bash
+python responses_api_agents/simple_agent/client.py
+```
+
+## LocalVLLMModel configuration reference
+
+LocalVLLMModel inherits all fields from VLLMModel (see [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference)). It adds the following:
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `vllm_serve_kwargs` | `dict` | — | **Required.** Arguments passed through to `vllm serve`. See `vllm_serve_kwargs` below. |
+| `vllm_serve_env_vars` | `dict` | — | **Required.** Environment variables for the vLLM process. Must include `VLLM_RAY_DP_PACK_STRATEGY`. |
+| `hf_home` | `str` | `<cwd>/.cache/huggingface` | Hugging Face cache directory. Set this if you have already downloaded weights elsewhere. |
+| `debug` | `bool` | `false` | Print vLLM server logs to stderr. |
+| `show_vllm_engine_stats` | `bool` | `false` | Periodically log vLLM engine throughput stats. |
+| `ray_worker_py_executable` | `str` | `sys.executable` | Python interpreter Ray uses for worker processes. |
+
+Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should **not** be set in your config:
+
+- `base_url`: assigned to the URL of the vLLM process once it binds a port. Defaults to `[]`.
+- `api_key`: defaults to `"dummy"`. vLLM does not authenticate local connections.
+
+### `vllm_serve_kwargs`
+
+Required keys (asserted at startup):
+
+- `data_parallel_size`
+- `tensor_parallel_size`
+- `pipeline_parallel_size`
+
+LocalVLLMModel injects the following keys automatically. Do not set them in your config:
+
+- `distributed_executor_backend: ray`
+- `data_parallel_backend: ray`
+- `host: 0.0.0.0`
+- `port` (chosen from the free-port pool)
+- `download_dir` (derived from `hf_home`)
+
+Commonly tuned keys (see the shipped configs for full examples):
+
+```yaml
+vllm_serve_kwargs:
+  data_parallel_size: 1
+  tensor_parallel_size: 8
+  pipeline_parallel_size: 1
+  gpu_memory_utilization: 0.9
+  trust_remote_code: true
+  enable_auto_tool_choice: true
+  tool_call_parser: hermes
+  model_loader_extra_config:
+    enable_multithread_load: true
+    num_threads: 16
+```
+
+Any flag accepted by `vllm serve` can be set under `vllm_serve_kwargs`. See the [official vLLM serve reference](https://docs.vllm.ai/en/latest/cli/serve/) for the full list.
+
+### `vllm_serve_env_vars`
+
+Environment variables set in the vLLM process. `VLLM_RAY_DP_PACK_STRATEGY` is mandatory:
+
+```yaml
+vllm_serve_env_vars:
+  VLLM_RAY_DP_PACK_STRATEGY: strict
+```
+
+See [Multi-node deployments](#multi-node-deployments) for what `strict` and `span` mean.
+
+## Multi-node deployments
+
+LocalVLLMModel uses Ray to place vLLM workers across nodes. The `VLLM_RAY_DP_PACK_STRATEGY` environment variable controls how worker groups are packed:
+
+- **`strict`**: each data-parallel replica must fit on a single node (`tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node`). Use for single-node setups or when running multiple replicas, each constrained to one node.
+- **`span`**: a single model instance may span multiple nodes. Use when `tensor_parallel_size * pipeline_parallel_size` exceeds the GPU count of one node. When `span` is set, `data_parallel_size_local` is automatically unset.
+
+### Sample topologies
+
+**1 node, 1 instance (TP=8).** The default for the shipped configs:
+
+```bash
+config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
+ng_run "+config_paths=[${config_paths}]"
+```
+
+**1 node, 1 instance (DP=2, TP=4).** Split one node into two data-parallel replicas:
+
+```bash
+config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml"
+ng_run "+config_paths=[${config_paths}]" \
+    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
+    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
+```
+
+**1 node, 2 instances (TP=4 each).** Chain two model configs into one run; each gets half the node:
+
+```bash
+config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
+responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
+ng_run "+config_paths=[${config_paths}]" \
+    ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \
+    ++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4
+```
+
+**2 nodes, 2 instances (TP=8 each).** With `strict` packing, each replica stays on its own node:
+
+```bash
+config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
+responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
+ng_run "+config_paths=[${config_paths}]"
+```
+
+## Inherited features
+
+The following capabilities work the same as in VLLMModel. See [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference) for details.
+
+- **`chat_template_kwargs`**: override chat template behavior per model.
+- **`extra_body`**: pass vLLM-specific request parameters (for example, `guided_json`, `reasoning.effort`).
+- **`return_token_id_information`**: enable for training workflows that need `prompt_token_ids`, `generation_token_ids`, and `generation_log_probs`.
+
+<Note>
+Multi-endpoint replicas (the `base_url: list[str]` pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in `config_paths`, or define multiple top-level keys in a single YAML).
+</Note>
diff --git a/fern/versions/latest/pages/model-server/vllm.mdx b/fern/versions/latest/pages/model-server/vllm.mdx
@@ -1,6 +1,6 @@
 ---
 title: "vLLM"
-description: ""
+description: "Wrapper for an existing, external vLLM server"
 position: 2
 ---
 [vLLM](https://docs.vllm.ai/) is a popular LLM inference engine. The NeMo Gym VLLMModel server wraps vLLM's Chat Completions endpoint and converts requests and responses to NeMo Gym's native format, the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses) schema.
@@ -16,8 +16,14 @@ responses_api_models/vllm_model/configs/vllm_model.yaml"
 ng_run "+config_paths=[$config_paths]"
 ```
 
+<Note>
+VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. If you would prefer NeMo Gym to launch and manage vLLM itself, use LocalVLLMModel instead. See [LocalVLLMModel](/model-server/local-vllm) to learn more.
+</Note>
+
 ## Use VLLMModel
 Below is an e2e example of how to spin up a NeMo Gym compatible vLLM Chat Completions OpenAI server and run rollout collection with it.
+This section walks through starting a vLLM server manually and connecting NeMo Gym to it through `responses_api_models/vllm_model`.
+If you want NeMo Gym to manage the vLLM server lifecycle for you instead, see [LocalVLLMModel](/model-server/local-vllm).
 
 ### Install vLLM
 Please run the steps below in a separate terminal than your NeMo Gym terminal! The installation will take a few minutes.

diff --git a/responses_api_models/vllm_model/README.md b/responses_api_models/vllm_model/README.md
@@ -0,0 +1,22 @@
+# Example run config
+
+VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. Spin up a vLLM server in a separate terminal (see the [vLLM docs](https://docs.vllm.ai/)), then point NeMo Gym at it.
+
+```bash
+config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\
+responses_api_models/vllm_model/configs/vllm_model.yaml"
+ng_run "+config_paths=[${config_paths}]" \
+    ++policy_base_url=http://0.0.0.0:10240/v1 \
+    ++policy_model_name=<your-model> \
+    ++policy_api_key=dummy_key &> temp.log &
+```
+
+View the logs
+```bash
+tail -f temp.log
+```
+
+Once you see that server instances are up, call the server. If you see a model response here, then everything is working as intended.
+```bash
+python responses_api_agents/simple_agent/client.py
+```