diff --git a/Makefile b/Makefile index 4e16b93d6..9d2ffa324 100644 --- a/Makefile +++ b/Makefile @@ -15,7 +15,7 @@ PUBLISH_WORKFLOW := Publish Fern Docs .DEFAULT_GOAL := help .PHONY: help \ - docs docs-check docs-preview docs-publish docs-login docs-generate-library + docs docs-check docs-preview docs-publish docs-login docs-login-remote docs-generate-library help: @echo "" @@ -80,7 +80,14 @@ docs-login: *) echo ""; echo "Bailing. Open the dashboard URL above, sign in, then re-run 'make docs-login'."; exit 1 ;; \ esac @echo "" - npx -y fern-api@latest login + npx -y fern-api@latest login $(LOGIN_FLAGS) + +# Same as docs-login, but uses Fern's device-code flow instead of opening a +# browser locally — needed on headless remote machines (SSH dev boxes, etc.) +# where the OAuth callback can't reach a browser. Prints a URL + short code +# you complete on your laptop. +docs-login-remote: LOGIN_FLAGS := --device-code +docs-login-remote: docs-login # Local-only preview. `fern docs md generate` populates fern/product-docs/ from # the nemo_gym package source (declared under `libraries:` in fern/docs.yml); diff --git a/fern/README.md b/fern/README.md index ed296b9ae..e7ffe69f9 100644 --- a/fern/README.md +++ b/fern/README.md @@ -25,7 +25,7 @@ npm install -g fern-api # 2. Provision your Fern account + CLI auth (one-time per machine). # Walks you through the dashboard sign-in step before running `fern login`. -make docs-login +make docs-login # or `make docs-login-remote` when working on a headless remote machine # 3. Build the API library reference and start the local dev server make docs # http://localhost:3000 @@ -88,7 +88,7 @@ make docs-publish # trigger the `Publish Fern Docs` workflow on origin make docs-generate-library # standalone library regeneration (rarely needed; `make docs` runs it) ``` -For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above — `make docs-login` walks through dashboard provisioning + `fern login` together. 
+For first-time-on-this-machine setup, see the [Quickstart](#quickstart) above — `make docs-login` / `make docs-login-remote` walks through dashboard provisioning + `fern login` together. `make docs` first runs `fern docs md generate`, which populates `fern/product-docs/` from the `nemo_gym` package source declared in the `libraries:` block of `docs.yml`. Without it, a cold `fern docs dev` will fail with `Folder not found: ./product-docs/...`. Re-run only when the upstream Python source changes — for prose-only iteration after the first generation, `cd fern && npm run dev` is enough. diff --git a/fern/versions/latest/pages/model-server/index.mdx b/fern/versions/latest/pages/model-server/index.mdx index b1c54832d..68c7dd202 100644 --- a/fern/versions/latest/pages/model-server/index.mdx +++ b/fern/versions/latest/pages/model-server/index.mdx @@ -25,6 +25,12 @@ Self-hosted inference with vLLM for maximum control. self-hosted open-source + +Let NeMo Gym launch and manage the vLLM server for you. + +gym-managed open-source + + ## Server Configuration diff --git a/fern/versions/latest/pages/model-server/local-vllm-proxy.mdx b/fern/versions/latest/pages/model-server/local-vllm-proxy.mdx new file mode 100644 index 000000000..e02b83e97 --- /dev/null +++ b/fern/versions/latest/pages/model-server/local-vllm-proxy.mdx @@ -0,0 +1,57 @@ +--- +title: "Local vLLM Proxy" +description: "Expose one Local vLLM deployment as multiple model servers" +position: 4 +--- + +LocalVLLMModelProxy (in `responses_api_models/local_vllm_model_proxy`) is a lightweight model server that forwards requests to an existing [LocalVLLMModel](/model-server/local-vllm) instead of launching its own vLLM engine. +It is a subclass of VLLMModel, so it accepts the same configuration fields, but it owns no GPUs. + +## When to use it + +Use a proxy when you need several model servers that share **one** vLLM deployment but differ in their request-time configuration. 
+For example, one server with reasoning enabled and one with reasoning disabled through the request params, or servers with different sampling parameters. +Without the proxy you would have to launch a separate vLLM engine (and duplicate GPUs) for each variation. + +At startup the proxy waits for its referenced LocalVLLMModel to come up, reads that server's inner vLLM endpoint (`base_url`, `api_key`, `model`), and routes all of its own requests there. + +If you are working with an existing vLLM endpoint that you manage outside of Gym, use [VLLMModel](/model-server/vllm) instead. + +## Configuration + +A proxy is a normal model server config that adds a `model_server` reference pointing at the LocalVLLMModel it should forward to: + +```yaml +policy_model_reasoning_off: + responses_api_models: + local_vllm_model_proxy: + entrypoint: app.py + + # Request-time settings that differ from the backing server + chat_template_kwargs: + enable_thinking: false + + # Standard VLLMModel fields + return_token_id_information: false + uses_reasoning_parser: true + + model_server: + type: responses_api_models + name: policy_model # name of the LocalVLLMModel server to proxy to +``` + +Run it alongside the backing LocalVLLMModel by chaining both configs in `config_paths`: + +```bash +config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\ +responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy.yaml" +ng_run "+config_paths=[${config_paths}]" \ + ++policy_model_proxy.responses_api_models.local_vllm_model_proxy.model_server.name=gpt-oss-20b-reasoning-high +``` + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `model_server` | `ModelServerRef` | — | **Required.** The LocalVLLMModel server to forward requests to, by `type` and `name`. | + +`base_url`, `api_key`, and `model` are populated automatically from the backing server and should **not** be set in your config. 
+All other VLLMModel fields (`chat_template_kwargs`, `extra_body`, `return_token_id_information`, and so on) behave as documented in the [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference). diff --git a/fern/versions/latest/pages/model-server/local-vllm.mdx b/fern/versions/latest/pages/model-server/local-vllm.mdx new file mode 100644 index 000000000..afcbccdac --- /dev/null +++ b/fern/versions/latest/pages/model-server/local-vllm.mdx @@ -0,0 +1,160 @@ +--- +title: "Local vLLM" +description: "Gym-managed vLLM server deployment" +position: 3 +--- + +NeMo Gym can launch and manage the vLLM server for you using LocalVLLMModel (in `responses_api_models/local_vllm_model`). +LocalVLLMModel is a subclass of VLLMModel that spawns the vLLM engine and auto-configures the model server to use it. +The Chat Completions to Responses API conversion is inherited from VLLMModel. See [VLLMModel](/model-server/vllm) for details. + +A single LocalVLLMModel deployment can back multiple model servers, even when they need different request-time settings (for example, sampling parameters or reasoning on or off). +See [Local vLLM Proxy](/model-server/local-vllm-proxy) for this configuration. + + +If you want to connect NeMo Gym to a vLLM server you start and manage yourself, use VLLMModel directly. See [VLLMModel](/model-server/vllm) for more details. + + +## Use LocalVLLMModel + +Unlike VLLMModel, LocalVLLMModel does not require a separate vLLM install or a manual model download. vLLM is a transitive dependency of `responses_api_models/local_vllm_model`, and model weights are fetched from the Hugging Face Hub on first run (using `HF_TOKEN` from your environment if present). A single `ng_run` brings up both vLLM and NeMo Gym. + +Several model configs ship with the server under `responses_api_models/local_vllm_model/configs/` — see the `Qwen/`, `openai/`, and `nvidia/` subdirectories for ready-to-use examples. 
To launch a single-node 8-GPU run with `Qwen3-30B-A3B-Instruct-2507`: + +```bash +config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\ +responses_api_models/local_vllm_model/configs/Qwen/Qwen3-30B-A3B-Instruct-2507.yaml" +ng_run "+config_paths=[${config_paths}]" +``` + +Override the parallelism dimensions on the command line to match your node: + +```bash +ng_run "+config_paths=[${config_paths}]" \ + ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \ + ++Qwen3-30B-A3B-Instruct-2507.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 +``` + +Once the servers are up, call the agent to verify everything works end-to-end: + +```bash +python responses_api_agents/simple_agent/client.py +``` + +## LocalVLLMModel configuration reference + +LocalVLLMModel inherits all fields from VLLMModel (see [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference)). It adds the following: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `vllm_serve_kwargs` | `dict` | — | **Required.** Arguments passed through to `vllm serve`. See `vllm_serve_kwargs` below. | +| `vllm_serve_env_vars` | `dict` | — | **Required.** Environment variables for the vLLM process. Must include `VLLM_RAY_DP_PACK_STRATEGY`. | +| `hf_home` | `str` | `/.cache/huggingface` | Hugging Face cache directory. Set this if you have already downloaded weights elsewhere. | +| `debug` | `bool` | `false` | Print vLLM server logs to stderr. | +| `show_vllm_engine_stats` | `bool` | `false` | Periodically log vLLM engine throughput stats. | +| `ray_worker_py_executable` | `str` | `sys.executable` | Python interpreter Ray uses for worker processes. 
| + +Two inherited fields are auto-populated by LocalVLLMModel after vLLM spawns and should **not** be set in your config: + +- `base_url`: assigned to the URL of the vLLM process once it binds a port. Defaults to `[]`. +- `api_key`: defaults to `"dummy"`. vLLM does not authenticate local connections. + +### `vllm_serve_kwargs` + +Required keys (asserted at startup): + +- `data_parallel_size` +- `tensor_parallel_size` +- `pipeline_parallel_size` + +LocalVLLMModel injects the following keys automatically. Do not set them in your config: + +- `distributed_executor_backend: ray` +- `data_parallel_backend: ray` +- `host: 0.0.0.0` +- `port` (chosen from the free-port pool) +- `download_dir` (derived from `hf_home`) + +Commonly tuned keys (see the shipped configs for full examples): + +```yaml +vllm_serve_kwargs: + data_parallel_size: 1 + tensor_parallel_size: 8 + pipeline_parallel_size: 1 + gpu_memory_utilization: 0.9 + trust_remote_code: true + enable_auto_tool_choice: true + tool_call_parser: hermes + model_loader_extra_config: + enable_multithread_load: true + num_threads: 16 +``` + +Any flag accepted by `vllm serve` can be set under `vllm_serve_kwargs`. See the [official vLLM serve reference](https://docs.vllm.ai/en/latest/cli/serve/) for the full list. + +### `vllm_serve_env_vars` + +Environment variables set in the vLLM process. `VLLM_RAY_DP_PACK_STRATEGY` is mandatory: + +```yaml +vllm_serve_env_vars: + VLLM_RAY_DP_PACK_STRATEGY: strict +``` + +See [Multi-node deployments](#multi-node-deployments) for what `strict` and `span` mean. + +## Multi-node deployments + +LocalVLLMModel uses Ray to place vLLM workers across nodes. The `VLLM_RAY_DP_PACK_STRATEGY` environment variable controls how worker groups are packed: + +- **`strict`**: each data-parallel replica must fit on a single node (`tensor_parallel_size * pipeline_parallel_size ≤ GPUs per node`). Use for single-node setups or when running multiple replicas, each constrained to one node. 
+- **`span`**: a single model instance may span multiple nodes. Use when `tensor_parallel_size * pipeline_parallel_size` exceeds the GPU count of one node. When `span` is set, `data_parallel_size_local` is automatically unset. + +### Sample topologies + +**1 node, 1 instance (TP=8).** The default for the shipped configs: + +```bash +config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml" +ng_run "+config_paths=[${config_paths}]" +``` + +**1 node, 1 instance (DP=2, TP=4).** Split one node into two data-parallel replicas: + +```bash +config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml" +ng_run "+config_paths=[${config_paths}]" \ + ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \ + ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 +``` + +**1 node, 2 instances (TP=4 each).** Chain two model configs into one run; each gets half the node: + +```bash +config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\ +responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml" +ng_run "+config_paths=[${config_paths}]" \ + ++gpt-oss-20b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 \ + ++gpt-oss-120b-reasoning-high.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=4 +``` + +**2 nodes, 2 instances (TP=8 each).** With `strict` packing, each replica stays on its own node: + +```bash +config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\ +responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml" +ng_run "+config_paths=[${config_paths}]" +``` + +## Inherited features + +The following capabilities work the same as in VLLMModel. 
See [VLLMModel configuration reference](/model-server/vllm#vllmmodel-configuration-reference) for details. + +- **`chat_template_kwargs`**: override chat template behavior per model. +- **`extra_body`**: pass vLLM-specific request parameters (for example, `guided_json`, `reasoning.effort`). +- **`return_token_id_information`**: enable for training workflows that need `prompt_token_ids`, `generation_token_ids`, and `generation_log_probs`. + + +Multi-endpoint replicas (the `base_url: list[str]` pattern used with VLLMModel) do not apply to LocalVLLMModel: each LocalVLLMModel instance manages exactly one vLLM engine. To run multiple replicas, define each as a separate server instance in your config (chain configs in `config_paths`, or define multiple top-level keys in a single YAML). + diff --git a/fern/versions/latest/pages/model-server/vllm.mdx b/fern/versions/latest/pages/model-server/vllm.mdx index 6391e1d10..45ec76e5e 100644 --- a/fern/versions/latest/pages/model-server/vllm.mdx +++ b/fern/versions/latest/pages/model-server/vllm.mdx @@ -1,6 +1,6 @@ --- title: "vLLM" -description: "" +description: "Wrapper for an existing, external vLLM server" position: 2 --- [vLLM](https://docs.vllm.ai/) is a popular LLM inference engine. The NeMo Gym VLLMModel server wraps vLLM's Chat Completions endpoint and converts requests and responses to NeMo Gym's native format, the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses) schema. @@ -16,8 +16,14 @@ responses_api_models/vllm_model/configs/vllm_model.yaml" ng_run "+config_paths=[$config_paths]" ``` + +VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. If you would prefer NeMo Gym to launch and manage vLLM itself, use LocalVLLMModel instead. See [LocalVLLMModel](/model-server/local-vllm) to learn more. + + ## Use VLLMModel Below is an e2e example of how to spin up a NeMo Gym compatible vLLM Chat Completions OpenAI server and run rollout collection with it. 
+This section walks through starting a vLLM server manually and connecting NeMo Gym to it through `responses_api_models/vllm_model`. +If you want NeMo Gym to manage the vLLM server lifecycle for you instead, see [LocalVLLMModel](/model-server/local-vllm). ### Install vLLM Please run the steps below in a separate terminal than your NeMo Gym terminal! The installation will take a few minutes. diff --git a/responses_api_models/vllm_model/README.md b/responses_api_models/vllm_model/README.md index e69de29bb..5578d29bf 100644 --- a/responses_api_models/vllm_model/README.md +++ b/responses_api_models/vllm_model/README.md @@ -0,0 +1,22 @@ +# Example run config + +VLLMModel connects NeMo Gym to a vLLM server that you start and manage yourself. Spin up a vLLM server in a separate terminal (see the [vLLM docs](https://docs.vllm.ai/)), then point NeMo Gym at it. + +```bash +config_paths="resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,\ +responses_api_models/vllm_model/configs/vllm_model.yaml" +ng_run "+config_paths=[${config_paths}]" \ + ++policy_base_url=http://0.0.0.0:10240/v1 \ + ++policy_model_name= \ + ++policy_api_key=dummy_key &> temp.log & +``` + +View the logs +```bash +tail -f temp.log +``` + +Once you see that server instances are up, call the server. If you see a model response here, then everything is working as intended. +```bash +python responses_api_agents/simple_agent/client.py +```
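Before calling the agent client, it can help to confirm that the externally managed vLLM server is actually answering. The sketch below is not part of NeMo Gym; `models_url` and `list_served_models` are hypothetical helper names. It assumes the server exposes the OpenAI-compatible model listing route under the same base URL passed as `policy_base_url`, and that the placeholder key `dummy_key` from the command above is accepted.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL ending in /v1."""
    return base_url.rstrip("/") + "/models"


def list_served_models(
    base_url: str, api_key: str = "dummy_key", timeout: float = 5.0
) -> list[str]:
    """Return the model IDs the server reports, or [] if it cannot be reached.

    Hypothetical helper: the route and response shape follow the OpenAI
    "list models" convention, which is an assumption about the target server.
    """
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return []
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_served_models("http://0.0.0.0:10240/v1")` returns an empty list while the server is still starting. Once it returns a non-empty list, one of the reported IDs is a reasonable candidate for `policy_model_name`.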