90 commits
d33420b
basic launcher
rafapi Dec 12, 2025
acad1e7
single channel streams
rafapi Dec 12, 2025
9a7eab0
update
rafapi Dec 12, 2025
b526d8d
Merge branch 'main' into fast-llm
rafapi Dec 16, 2025
9543265
enable fast-llm for basic streaming
rafapi Dec 19, 2025
1d9ac8d
enable fast-llm for basic streaming
rafapi Dec 19, 2025
6d42ade
set cwd
rafapi Dec 19, 2025
e8674e1
convert samples for fast-llm format
rafapi Dec 19, 2025
e78844e
listen to trainer events from redis
rafapi Dec 19, 2025
d16a829
put all the samples in a single stream
rafapi Dec 19, 2025
67b3f2b
fix accessing non complete dists
bigximik Feb 9, 2026
dc04911
change to calling fast-llm without conda
bigximik Feb 9, 2026
30a5346
fix imports for vllm .14.1rc1
bigximik Feb 9, 2026
9df2feb
tmp changes to not install dependencies which are in base image already
bigximik Feb 9, 2026
49707aa
2 gpu and 1 gpu integration tests
bigximik Feb 13, 2026
a893686
changes for tests
bigximik Feb 13, 2026
641e457
added instruction to install PipelineRL+FastLLM
bigximik Feb 13, 2026
7d14e80
added fast-llm bradcast functionality and test
bigximik Feb 18, 2026
dad6242
refactoring of traning helper
bigximik Feb 20, 2026
402f29a
added stop traning support and changed engine clean up logic to warn …
bigximik Feb 20, 2026
3848fbc
added 3 and 4 gpu tests (not run yet), added traning end event, added…
bigximik Feb 20, 2026
e170185
fix setting current device for tp and pp cases
bigximik Feb 23, 2026
3541dfe
fix multi actor generation abab patter consistency check
bigximik Feb 23, 2026
5560ada
fast-llm weight update bug fix some other changes
bigximik Mar 2, 2026
f4cc7a4
note update
bigximik Mar 2, 2026
f78be64
fix no weight bradcast case with fast-llm
bigximik Mar 2, 2026
807f47c
added data for grpo loss to send to fast-llm, added base fast-llm con…
bigximik Mar 2, 2026
407be1d
fast-llm weights broadcast integration
bigximik Mar 3, 2026
b7e2109
removed duplicate option
bigximik Mar 4, 2026
a3f6ed2
Merge pull request #128 from ServiceNow/fast-llm
rafapi Mar 4, 2026
ceed0bf
added pass through of wandb params to fast-llm, bigger seq len in dem…
bigximik Mar 5, 2026
6ecfa40
fix loss masking
bigximik Mar 6, 2026
152d489
integrated fast-llm config and changed to start fast-llm with pytorch
bigximik Mar 6, 2026
c1aeaac
migrate fast-llm weight broadcast to ProcessGroupPool API
bigximik Mar 19, 2026
29964cc
fix fast-llm config schema and subprocess launcher
bigximik Mar 19, 2026
f48386b
fix missed broadcast API in timed_broadcast_fast_llm inner function
bigximik Mar 20, 2026
b3a0f64
fix fast-llm lag check interval: reduce from 5s to 0s
bigximik Mar 24, 2026
58a8798
update base and counting configs for fast-llm: max_ready_samples, vll…
bigximik Mar 24, 2026
979bdf1
split fast-llm lag polling into dedicated thread separate from event …
bigximik Mar 24, 2026
38783a5
fix seq_length validation to use micro_batch_size for fast-llm path
bigximik Mar 24, 2026
b8b9885
set use_fast_llm=true as default, add max_model_len and vllm v1 to co…
bigximik Mar 24, 2026
1f3ca5e
increase grpo epsilon from 0.1 to 0.2 in counting config
bigximik Mar 24, 2026
bd18f46
add data pipeline diagnostic logging (log_data_pipeline flag)
bigximik Mar 25, 2026
31820d2
add Redis verbose logging to file for crash diagnosis
bigximik Mar 25, 2026
c51d6bd
handle vLLM 4xx errors gracefully instead of crashing the actor
bigximik Apr 16, 2026
00ddfc3
fix vLLM port race condition when starting multiple actor servers
bigximik Apr 16, 2026
e10e8ce
add 8-GPU math 7B submit script and actor error handling tests
bigximik Apr 16, 2026
326199a
merge: origin/vllm_v1 into fast-llm (main catch-up + PR #137 fixes)
bigximik Apr 23, 2026
562214a
vllm1: default weight_update_mode to 'http' when args lacks the attr
bigximik Apr 23, 2026
0cde577
launch: dispatch run_finetune to DeepSpeed when use_fast_llm=false
bigximik Apr 23, 2026
e0f01a4
vllm1 HTTP path: use StatelessProcessGroup to match the trainer
bigximik Apr 23, 2026
8893d2c
preprocess: guard unpacked-mode popleft against partial queue
bigximik Apr 23, 2026
842736c
launch: use absolute paths for redis --dir and --logfile
bigximik Apr 23, 2026
b753ece
vllm1: drop pause/resume wrap on fast-llm path
bigximik Apr 23, 2026
54d4eeb
vllm1: re-add pause/resume wrap on fast-llm path with startup gate
bigximik Apr 23, 2026
612b9bf
vllm1: use _pause_generation helper that drains in-flight requests
bigximik Apr 23, 2026
6829195
multinode: fix DeepSpeed path, log capture, Redis host, and add tests
bigximik Apr 27, 2026
56f4a8f
launch: write per-node output files in multinode finetune to avoid NF…
bigximik Apr 27, 2026
4dce398
vllm v1: enable processed_logprobs mode by default
bigximik Apr 27, 2026
eaa2a9a
vllm1: use mode=keep in pause_generation to match PR #137
bigximik Apr 27, 2026
eeac65e
actor: retry rollout on vLLM abort instead of crashing
bigximik Apr 27, 2026
7f1b87a
world: read GPUS_PER_NODE from env instead of hardcoding 8; add resum…
bigximik Apr 27, 2026
8b19f1f
launch: clear stale pod IPs on resume; unique resume job names in sub…
bigximik Apr 27, 2026
223687f
launch: fix pod IP exchange on resume with session-token barrier
bigximik Apr 27, 2026
b6ea563
launch: use job-specific MASTER_ADDR as pod IP session token
bigximik Apr 27, 2026
9fcfb20
launch: simplify pod IP exchange to per-job subdirectory
bigximik Apr 27, 2026
cc419e5
launch: make pod IP exchange run_id configurable via world.run_id
bigximik Apr 27, 2026
1d25b99
launch: require world.run_id; raise on missing or already-used dir
bigximik Apr 27, 2026
c3cf807
docs: document world.run_id requirement and resume workflow in multin…
bigximik Apr 27, 2026
7594bc1
docs: rewrite multinode run/resume section to be launcher-agnostic
bigximik Apr 27, 2026
13a42bf
Merge origin/vllm_v1 into fast-llm
bigximik Apr 27, 2026
84b95be
actor: use run_in_executor for result_queue.put to avoid blocking eve…
bigximik Apr 27, 2026
92a32db
actor: retry on ServerDisconnectedError; drop dead use_v1 from math.yaml
bigximik Apr 28, 2026
3a5671c
utils: guard against None metadata in wandb python_env collection
bigximik Apr 28, 2026
89ceb1b
fix multi-node job submission scripts for fast-llm branch
bigximik Apr 28, 2026
5d75c7e
Merge remote-tracking branch 'origin/main' into fast-llm
bigximik May 6, 2026
40afa2d
launch scripts: drop top-level fp32_lm_head knob (removed in main)
bigximik May 6, 2026
45ff301
docs(fast-llm): handover docs and interactive smoke examples
bigximik May 6, 2026
abf2b02
docs(fast-llm): trim speculative TODO items from handover doc
bigximik May 6, 2026
613116f
docs(fast-llm): add 400-step comparison charts and outstanding TODOs
bigximik May 6, 2026
b1b8823
docs(fast-llm): simplify metric-gap open question
bigximik May 6, 2026
8c02c87
docs(fast-llm): add image/vLLM version-bump open question
bigximik May 6, 2026
399acf7
docs(fast-llm): document base-image source repo
bigximik May 6, 2026
8ca2b5f
docs(fast-llm): clarify use-vs-build of the toolkit image
bigximik May 6, 2026
6090900
examples(interactive): align ds_4node with the GSPO chart baseline
bigximik May 7, 2026
50c7ff7
docs(fast-llm): track DS GSPO source script and link reproduction rec…
bigximik May 7, 2026
e4bd9b8
docs(fast-llm): parameterize Denis-specific values; drop DS PPO script
bigximik May 7, 2026
902cb7a
docs(fast-llm): add explicit "How to launch" walkthrough
bigximik May 7, 2026
850bcad
docs(fast-llm): drop misleading examples/interactive scripts
bigximik May 7, 2026
0c2f99e
docs(fast-llm): split §9 Testing into 3 subsections
bigximik May 7, 2026
116 changes: 116 additions & 0 deletions README.md
@@ -350,6 +350,17 @@ PipelineRL is organized as a modular, Hydra-driven pipeline with 6 core componen
- Pull a batch → call `rl_step(...)` (in `pipelinerl/finetune/rl/utils.py`) to compute policy-gradient (+ KL penalty if configured) → `optimizer.step()` → `lr_scheduler.step()`.
- On rank 0, use `WeightUpdateManager.send_weight_update(version)` to gather model parameters, send `WeightUpdateRequest` to Actor LLMs (HTTP), broadcast tensors via NCCL, and write a `WeightUpdateSuccess` message to the update stream.
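The per-step flow above can be sketched as follows (a minimal sketch: the batch fields and the `train_step` decomposition are illustrative assumptions, not the exact `pipelinerl` API):

```python
# Minimal sketch of one trainer iteration (illustrative names; the real code
# lives in pipelinerl/run_finetune.py and pipelinerl/finetune/rl/utils.py).
import torch

def train_step(model, optimizer, lr_scheduler, batch, kl_coef=0.0):
    optimizer.zero_grad()
    # Token log-probs of the actions actually taken in the rollout.
    logprobs = model(batch["input_ids"]).log_softmax(dim=-1)
    taken = logprobs.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
    # Policy-gradient surrogate: advantage-weighted negative log-likelihood,
    # restricted to completion tokens via the loss mask.
    loss = -(batch["advantages"] * taken)[batch["mask"]].mean()
    if kl_coef > 0:
        # Optional KL penalty against reference-policy log-probs.
        loss = loss + kl_coef * (taken - batch["ref_logprobs"])[batch["mask"]].mean()
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    return loss.item()
```

The real `rl_step` supports several loss variants (GRPO, GSPO, PPO-style clipping); the sketch shows only the shared skeleton.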

#### Fast-LLM trainer path (preview)

When `use_fast_llm: true` (default in `conf/math.yaml`), the DeepSpeed ZeRO-3 trainer above is replaced with [Fast-LLM](https://github.com/ServiceNow/Fast-LLM) (FSDP + sequence-data-parallel) and the per-step weight update over HTTP is replaced with a persistent NCCL broadcast group:

- Trainer: `fast_llm train gpt` launched via torchrun (`pipelinerl/launch.py:run_finetune`); rank 0 also serves the broadcast `TCPStore`.
- Fast-LLM's `StreamingTrainerCallback` gathers full-precision weights after each optimizer step and broadcasts them on a persistent NCCL group whose name is `WEIGHTS_BROADCAST_PG_NAME`.
- vLLM workers join the same group via `vllm1.init_actor_update_group(...)` and copy parameters into the model in place.
- Coordinated NCCL teardown (`pipelinerl/vllm1.py:484-547`) listens for a `training_finished` event (published by the trainer via Redis `XADD`) and destroys the process group on the vLLM side so `dist.destroy_process_group()` doesn't hang.

This path is **WIP** — see [`docs/FAST_LLM_INTEGRATION.md`](docs/FAST_LLM_INTEGRATION.md) for known issues, configuration knobs, and example interactive-job scripts.
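The broadcast half of this path follows a standard pattern: rank 0 (the trainer) broadcasts every parameter tensor over a long-lived process group, and vLLM workers receive into their existing tensors. A minimal sketch (the real group setup, rank layout, and `WEIGHTS_BROADCAST_PG_NAME` wiring live in `pipelinerl/vllm1.py` and Fast-LLM's `StreamingTrainerCallback`; everything below is illustrative):

```python
# Sketch of the persistent weight-broadcast pattern. Both sides must iterate
# parameters in the same fixed order for the broadcasts to line up.
import torch
import torch.distributed as dist

def broadcast_weights(model, group, src=0):
    # Receivers copy in place, so vLLM keeps serving from the same tensors
    # after each update; no re-registration of weights is needed.
    for param in model.parameters():
        dist.broadcast(param.data, src=src, group=group)
```

On the trainer this runs after each optimizer step; vLLM workers join the same group once at startup and run the matching receive path for every update.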

### 6. Verifier
- Entrypoint: `pipelinerl/entrypoints/verifier.py`
- Serves a FastAPI app with:
@@ -360,6 +371,7 @@ PipelineRL is organized as a modular, Hydra-driven pipeline with 6 core componen
- Defined in `pipelinerl/streams.py`.
- Implements `SingleStreamSpec` and `StreamRangeSpec` for file-system or Redis-based queues.
- `write_to_streams(...)` and `read_stream(...)` provide a JSON-line protocol for inter-process messaging.
- Pass `shared=True` to these helpers when multiple actors must fan in to a single Redis stream (e.g., the ServiceNow/Fast-LLM trainer). Shared mode encodes payloads via `orjson`, tags them with a global index, and lets the trainer perform downstream sharding safely.
- Available backends:
- File system: default.
- Redis: requires Redis server.
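The shared mode can be illustrated with an in-memory stand-in (a sketch only: the real helpers in `pipelinerl/streams.py` use Redis and `orjson`; `SharedStream`, `write`, and `read_shard` are hypothetical names):

```python
# In-memory stand-in for the shared stream: payloads are JSON-encoded and
# tagged with a globally increasing index so a downstream consumer can
# shard deterministically, regardless of which actor wrote each entry.
import itertools
import json

class SharedStream:
    def __init__(self):
        self._entries = []
        self._counter = itertools.count()  # global index (Redis INCR in practice)

    def write(self, payload: dict) -> int:
        index = next(self._counter)
        self._entries.append({"index": index, "data": json.dumps(payload)})
        return index

    def read_shard(self, shard: int, num_shards: int):
        # Deterministic sharding by global index, not by producer.
        for entry in self._entries:
            if entry["index"] % num_shards == shard:
                yield json.loads(entry["data"])
```

With four writes from any mix of actors, shard 0 of 2 sees indices 0 and 2 and shard 1 sees 1 and 3, independent of which actor produced them.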
@@ -371,3 +383,107 @@ PipelineRL is organized as a modular, Hydra-driven pipeline with 6 core componen
- `training_data` stream (StreamRangeSpec(topic="training_data")): File- or Redis-backed stream used to transfer processed training micro-batches from the Preprocessor to the Trainer. Configured via `cfg.preprocess.output` and `cfg.finetune.input` (defaulting to "training_data") in `conf/base.yaml`. Written in `pipelinerl/run_preprocess.py` and consumed in `pipelinerl/run_finetune.py`.
- `actor_test` and `stats_test` streams: analogous streams used for evaluation loops (test samples and test metrics).
- `stats` stream (SingleStreamSpec(topic="stats")): produced by `ActorLoop.publish_stats` with sliding-window metrics; consumed by external monitoring (e.g. WANDB, logging viewers).




## Multi-Node Requirements

PipelineRL can span multiple nodes, with actor (vLLM) and trainer roles on separate machines. Each role opens outbound TCP connections to other roles; every target port must be reachable from the source node.

### Ports and config params

| Port (default) | Config param | Direction | Purpose |
|---|---|---|---|
| `streams.port` (11000) | `conf/streams/redis.yaml` | all nodes → rank-0 node | Redis data streams (actor → preprocessor → trainer) |
| `world.actor_group_port` (9000) | `conf/base.yaml` | actor node → trainer node | Weight-broadcast process group (NCCL TCPStore rendezvous) |
| `world.environment_start_port` (7777) | `conf/base.yaml` | actor node → environment node | Remote environment HTTP server |
| `8080 + gpu_local_idx` | derived from GPU placement | trainer node → actor node | vLLM HTTP endpoints for weight updates, one per GPU |
| `MASTER_PORT` env var | set by your cluster launcher | trainer nodes ↔ each other | torchrun / accelerate rendezvous between finetune ranks |

### What each node connects to

**Trainer node** opens connections to:
- `{actor_node_ip}:{8080 + i}` for each vLLM GPU `i` — to POST updated weights after each optimizer step.
- `{rank_0_ip}:{streams.port}` — to read training batches from Redis (when `streams=redis`).

**Actor node** opens connections to:
- `{rank_0_ip}:{streams.port}` — to publish rollout data to Redis.
- `{rank_0_ip}:{world.actor_group_port}` — to join the NCCL weight-broadcast process group (vLLM workers connect as clients; the trainer creates the TCPStore server on this port).
- `{env_node_ip}:{world.environment_start_port + i}` — to call remote environment servers (if `environments[*].mode=remote`).

**All finetune nodes** connect to each other on `MASTER_PORT` for the distributed training rendezvous (rank-0 finetune node is the server).
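Before launching, it can help to verify that each required port is reachable from the node that needs it. A small probe (hostnames and ports below are placeholders for your cluster):

```python
# Reachability probe for the ports in the table above.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder targets; substitute your cluster's node addresses.
CHECKS = [
    ("rank0-node", 11000),   # streams.port (Redis)
    ("trainer-node", 9000),  # world.actor_group_port (NCCL TCPStore)
    ("actor-node", 8080),    # vLLM HTTP endpoint for GPU 0
]
```

Run `can_connect(host, port)` for each entry from the source node listed in the table's "Direction" column; any `False` result points at a firewall or placement problem.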

### Topology assumptions

- With fast-llm (`use_fast_llm=true`), each component must occupy whole nodes — torchrun requires every finetune rank to see a complete, identical GPU set.
- With `world.preprocessor_fraction=0`, every node is either a pure actor node or a pure trainer node (no mixing).
- The DeepSpeed hostfile and `--deepspeed_inclusion_filter` use DNS/hostname names (not IPs), so the cluster rendezvous port (`MASTER_PORT`) must be reachable via those names. All other cross-node connections use IP addresses and are independent of DNS.
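For reference, a DeepSpeed hostfile on this setup might look like the following (hostnames and slot counts are placeholders; the format is DeepSpeed's standard `hostname slots=N`, one line per node, using DNS names as noted above):

```text
trainer-node-0 slots=8
trainer-node-1 slots=8
```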

### Running and resuming multi-node jobs

**`world.run_id` is required for multi-node jobs.** It must be a string that is unique per job run. It namespaces the pod IP exchange directory on the shared NFS mount so that stale files from a previous run are never picked up by a new one. Any value that your cluster scheduler guarantees to be unique per job works — a job UUID, a replica-group ID, or the job's `MASTER_ADDR` (which is unique per torchrun launch):

```bash
python -m pipelinerl.launch ... 'world.run_id=${MASTER_ADDR}'
```

**To resume a preempted run**, reuse the same `output_dir` as the original job. Fast-LLM automatically finds the latest checkpoint in `output_dir/finetune/checkpoint/` and resumes from it. WandB also resumes the same run because Fast-LLM persists the run ID in `output_dir/finetune/wandb_config.yaml` on the first launch and reloads it on every subsequent launch.

Each resumed job must still use a fresh `world.run_id` (the new job's ID, not the original one), so the pod IP exchange directory is always clean.
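The exchange-directory protocol can be sketched as follows (illustrative only: the real implementation lives in `pipelinerl/launch.py`; `exchange_pod_ips` and the file layout are assumptions):

```python
# Sketch of the run_id-namespaced pod IP exchange over shared NFS: each node
# writes its own IP under a per-run directory, then polls until all peers
# have appeared. A fresh run_id guarantees no stale files are ever read.
import time
from pathlib import Path

def exchange_pod_ips(nfs_root: str, run_id: str, my_rank: int, my_ip: str,
                     world_size: int, timeout: float = 300.0) -> dict:
    run_dir = Path(nfs_root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / f"{my_rank}.ip").write_text(my_ip)
    deadline = time.time() + timeout
    files = []
    while time.time() < deadline:
        files = sorted(run_dir.glob("*.ip"))
        if len(files) == world_size:
            return {int(f.stem): f.read_text() for f in files}
        time.sleep(1.0)
    raise TimeoutError(f"only {len(files)}/{world_size} pod IPs in {run_dir}")
```

This also shows why reusing a `run_id` fails fast: the launcher raises on an already-used directory rather than risk mixing IPs from two jobs.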

# Install Fast-LLM+PipelineRL

> **Status (2026-05-06):** This integration is WIP — see [`docs/FAST_LLM_INTEGRATION.md`](docs/FAST_LLM_INTEGRATION.md) for the full handover (architecture, known issues, TODO).

### 1. Container image

To **use**: reference the prebuilt image
```
registry.toolkit-sp.yul201.service-now.com/snow.research.afm/interactive-toolkit:25.12-py3-vllm014rc1redis
```
It bundles the Redis server.

To **build** (from the [`ServiceNow/research-interactive-toolkit`](https://github.com/ServiceNow/research-interactive-toolkit/tree/fml/pytorch_vllm014rc1) repo, branch `fml/pytorch_vllm014rc1` — SN-internal, link is gated): set the following variables in `~/.research-interactive-env` and run the toolkit's build target.

```make
USE_ACCOUNT_REPO := 1
BASE_IMAGE := nvcr.io/nvidia/pytorch:25.12-py3
IMAGE_REVISION := 25.12-py3-vllm014rc1redis
EAI_PROFILE := yul201
```

Base layer is `nvcr.io/nvidia/pytorch:25.12-py3`; the toolkit branch layers on vLLM 0.14.0rc1, redis, and the EAI helpers.

### 2. Clone + venv + editable installs

Inside a running interactive instance, install both Fast-LLM and PipelineRL into a single venv at `PipelineRL/.venv`:

```shell
git clone git@github.com:ServiceNow/Fast-LLM.git
git clone git@github.com:ServiceNow/PipelineRL.git

cd PipelineRL
/usr/bin/python3.12 -m venv --system-site-packages .venv
source .venv/bin/activate
export PIP_CONSTRAINT=""

# Fast-LLM: GSPO branch is the one paired with the PipelineRL fast-llm branch
cd ../Fast-LLM
git submodule update --init --recursive
git checkout gspo
pip install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,GENERATION,STREAMING,DEV]" triton==3.5.1

# PipelineRL: fast-llm branch
cd ../PipelineRL
git checkout fast-llm
pip install --no-cache-dir -e ".[lora]"
```
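A quick smoke check that both editable installs are visible from the venv (the module names `fast_llm` and `pipelinerl` are assumed from the repo names; adjust if they differ):

```python
# Report which expected modules resolve in the active environment.
import importlib.util

def check_installed(modules):
    # find_spec returns None for missing top-level modules instead of raising.
    return {m: importlib.util.find_spec(m) is not None for m in modules}

print(check_installed(["fast_llm", "pipelinerl", "vllm", "torch"]))
```

All four should report `True` before moving on to launching jobs.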

### 3. Known caveats

- **`pyproject.toml:81-87`** — `[tool.uv]` overrides `transformers>=4.51.0` and `accelerate>=1.7.0` because `tapeagents==0.1.16` pins them lower; the `[tapeagents]` extra is **broken at runtime** until tapeagents bumps support. Track this as a TODO; do not enable `[tapeagents]` on the fast-llm path.
- **`PIP_CONSTRAINT=""`** is required — the toolkit image sets a constraint file that conflicts with our pinned versions.
- **Triton must be `==3.5.1`** — newer triton breaks the fast-llm GSPO kernels.


96 changes: 94 additions & 2 deletions conf/base.yaml
@@ -74,7 +74,7 @@ vllm_config:

world:
replicas: 1

actor_fraction: 4
preprocessor_fraction: 0
finetune_fraction: 4
@@ -83,6 +83,11 @@ world:

actor_group_port: 9000
environment_start_port: 7777
# Unique identifier for this job run, used to namespace the pod IP exchange
# directory so stale files from previous runs are never seen.
# Defaults to $MASTER_ADDR when null (suitable for EAI and torchrun jobs).
run_id: null

# this will be autocreated based on the config
jobs: []

@@ -100,7 +105,7 @@ fsdp:
reduce_dtype: fp32
buffer_dtype: fp32

output_dir: ???
output_dir: null
force_restart: false
pop_old_data: true
max_lag: null
@@ -111,6 +116,93 @@ debug:
streams_from: null
place_inference_workers: true
use_existing_llms: false
log_data_pipeline: false

# Fast-LLM integration: when true, fast-llm is used as the trainer.
# Data flows actors -> Redis (fast_llm_streaming) -> fast-llm training loop.
# Weight updates are broadcast via NCCL using fast-llm's streaming callback.
use_fast_llm: true
# Whether the trainer broadcasts updated weights to vLLM after each training step.
weight_broadcast: true

# Pure fast-llm config written as-is to a YAML file at launch time.
# Fields set to null are populated by the launcher at runtime (source noted in the comment) — do not modify them here.
# This section is only used when use_fast_llm: true.
fast_llm:
training:
num_workers: 0
train_iters: 100000 # Total number of optimizer steps (provided by pipelinerl)
wandb:
entity_name: null # cfg.wandb.wandb_entity_name (null disables wandb)
project_name: null # cfg.wandb.wandb_project_name
group_name: null # cfg.wandb.wandb_group
logs:
interval: 1 # Logging frequency in optimizer steps
checkpoint:
interval: 1000
export:
interval: 1000
format: ${fast_llm_finetune.model_format}

schedule:
depth_first_micro_batches: 16 # Gradient accumulation steps (sequential, one sample at a time)

data:
micro_batch_size: 18000 # Tokens per sample; also the max rollout length accepted
truncate_documents: false # Do not truncate RL rollouts
shuffle: disabled # Streaming dataset ignores shuffling
datasets:
training:
type: streaming # Redis-backed streaming dataset
host: null # cfg.streams.host
port: null # cfg.streams.port

pretrained:
format: ${fast_llm_finetune.model_format}
path: null # cfg.model_path
model_weights: true

model:
base_model:
head:
losses:
grpo:
type: grpo
epsilon_low: 0.2
epsilon_high: 0.2
multi_stage:
zero_stage: 2
distributed:
compute_dtype: bf16
tensor_parallel: 1
pipeline_parallel: 1
sequence_data_parallel: 1

run:
experiment_dir: null # exp_dir/finetune
experiment_name: null # derived from exp_dir relative to cfg.wandb.wandb_workspace_root

# callbacks section is written only when weight_broadcast: true (removed by launcher otherwise)
callbacks:
streaming:
type: streaming
host: null # cfg.streams.host
port: null # cfg.streams.port
broadcast:
backend: nccl
external_world_size: null # world_map.weight_update_group_size - 1
host: null # world_map.master_addr
port: null # cfg.world.actor_group_port
export:
format: ${fast_llm_finetune.model_format}
model_weights: true
optimizer_state: false

# Launcher-specific fast-llm settings (not passed to fast-llm itself).
fast_llm_finetune:
model_type: gpt # fast-llm model type argument: fast-llm train <model_type>
model_format: qwen2 # pretrained/export format; interpolated into fast_llm config
torchrun_port: 29500 # master port for torchrun rendezvous

me:
# Which job is this one? This will be autopopulated
18 changes: 18 additions & 0 deletions conf/counting.yaml
@@ -3,6 +3,24 @@ defaults:
finetune:
seq_length: 4000
gradient_accumulation_passes: 1024
vllm_config:
vllm_kwargs:
max_model_len: 4000
fast_llm:
training:
num_workers: 1
schedule:
depth_first_micro_batches: 256
model:
base_model:
head:
losses:
grpo:
epsilon_low: 0.2
epsilon_high: 0.2
optimizer:
learning_rate:
base: 1e-5
llm:
parameters:
max_tokens: 1000
7 changes: 7 additions & 0 deletions conf/math.yaml
@@ -2,6 +2,13 @@ defaults:
- base
- _self_

use_fast_llm: true
weight_broadcast: true

fast_llm:
data:
micro_batch_size: 18000

actor:
rollout_policy: pipelinerl.domains.math.generate_math_rollout
system_prompt: Please reason step by step, and put your final answer within \boxed{}.