[WIP] Fast-LLM trainer integration with vLLM v1 weight broadcast — handover#140

Draft
bigximik wants to merge 90 commits into main from fast-llm

Conversation

@bigximik
Collaborator

@bigximik bigximik commented May 6, 2026

Status: WIP — handover from Denis (2026-05-06)

This branch is not ready to merge. It's the in-progress integration of Fast-LLM as an alternative trainer to DeepSpeed, with weight broadcast to vLLM v1 over a persistent NCCL group instead of HTTP. I'm leaving the integration project — this PR captures everything needed to pick it up.

Read this first: docs/FAST_LLM_INTEGRATION.md — canonical handover (architecture, per-file changes, glossary, all known issues with file:line citations, testing guide, operations notes, open questions).

Stats: 79 commits ahead of main, ~8,400 insertions / 195 deletions across 35 files (mostly new tests + integration plumbing + handover docs).

What works today

  • Fast-LLM (gspo branch) trainer launches under torchrun, joins a persistent NCCL broadcast group, and pushes weights to vLLM v1 workers in place. No HTTP weight upload.
  • Coordinated NCCL teardown (training_finished event over redis → vLLM destroys process group → both sides hit the collective barrier together) — dist.destroy_process_group() no longer hangs.
  • 4-node multi-node smoke verified end-to-end on both fast-llm GSPO and DeepSpeed PPO (see "Smoke result" below).
  • GSPO loss math matches DeepSpeed exactly: grad_norm parity, grpo_new_logprobs matches step-by-step over a 400-step run (see chart below).
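The teardown handshake in the second bullet can be sketched in miniature as follows — a single-process simulation where a `threading.Event` stands in for the redis `training_finished` message and a `threading.Barrier` for the final NCCL collective. All names here are illustrative, not the PR's actual API:

```python
import threading

# Stand-ins: the Event plays the redis `training_finished` message,
# the Barrier plays the final NCCL collective both sides must reach.
training_finished = threading.Event()
collective_barrier = threading.Barrier(2)
log = []

def trainer_side():
    # Trainer finishes its last step, then announces shutdown.
    log.append("trainer: publish training_finished")
    training_finished.set()
    collective_barrier.wait()          # final collective with vLLM
    log.append("trainer: destroy_process_group")

def vllm_side():
    # vLLM waits for the announcement instead of polling or timing out.
    training_finished.wait()
    log.append("vllm: got training_finished")
    collective_barrier.wait()          # meet the trainer in the collective
    log.append("vllm: destroy_process_group")

t1 = threading.Thread(target=trainer_side)
t2 = threading.Thread(target=vllm_side)
t1.start(); t2.start(); t1.join(); t2.join()

# Both sides reached the barrier, so neither destroy call can hang.
assert log[0] == "trainer: publish training_finished"
assert log.count("trainer: destroy_process_group") == 1
assert log.count("vllm: destroy_process_group") == 1
```

The point of the ordering is that neither side calls `destroy_process_group()` until both have entered the barrier, which is what removed the hang.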

Companion Fast-LLM PR

This PipelineRL branch pins to the gspo branch in Fast-LLM (PR #502). The Fast-LLM PR contains:

  • GSPO loss kernel (sequence-level geometric-mean IS-ratio clipping)
  • Decoupled loss/gradient divisors (loss / num_documents, grad / num_documents²) + SDP loss correction — exact match to DeepSpeed's 1/batch_size dual-factor math
  • fp32_lm_head flag matching vLLM's bf16_last_layer_fp32 precision (otherwise IS ratios drift)
  • metrics: GRPOMetricsLevel enum (none/basic/with_entropy) — merged from PR #494 (Joel's metrics refactor)
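For orientation, the sequence-level geometric-mean IS-ratio clipping named in the first bullet works roughly like this. A toy sketch of the math only — not the Fast-LLM kernel; the epsilon defaults are the values used in the 400-step comparison runs:

```python
import math

def gspo_seq_ratio(new_logprobs, old_logprobs):
    """Sequence-level IS ratio = geometric mean of per-token ratios,
    i.e. exp(mean(logp_new - logp_old)) over the sequence."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_objective(ratio, advantage, eps_low=3e-3, eps_high=4e-3):
    """PPO-style clipped surrogate applied at the sequence level (GSPO)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# Identical policies -> ratio 1, objective = advantage.
r = gspo_seq_ratio([-0.5, -1.0], [-0.5, -1.0])
assert abs(r - 1.0) < 1e-12
assert clipped_objective(r, 2.0) == 2.0

# A drifted policy gets its ratio clipped into [1 - 3e-3, 1 + 4e-3].
r = gspo_seq_ratio([-0.4, -0.9], [-0.5, -1.0])   # ratio = exp(0.1) > 1
assert abs(r - math.exp(0.1)) < 1e-12
assert clipped_objective(r, 1.0) == 1.0 + 4e-3
```

Because the ratio is a geometric mean over the whole sequence rather than per-token, a last-layer precision mismatch shifts every token's logprob the same way and the drift compounds — which is why the fp32_lm_head flag matters.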

Once that PR merges to Fast-LLM main, the README install step here should be revved from `git checkout gspo` to `git checkout main` and this PR rebased onto a fresh main.

What's NOT done yet

  • Fix actor _prefetch_to_doc_target overshoot (pipelinerl/actor.py:613). Causes premature run end on long runs (50+ steps). Workaround: bump max_train_steps ~20%. Real fix: trainer signals "done" instead of actor inferring.
  • Address rollout retry exhaustion under bursts (pipelinerl/async_llm.py:137-146). Two consecutive aborts can drop a rollout permanently. Allow more retries or evict stuck rollouts.
  • Investigate reward lag vs DS (~2-point gap at step 400 in actor/reward_mean — see chart below). Root cause unknown; newlp parity is confirmed so the gap is upstream of the trainer.
  • Resolve commented-out pyproject.toml overrides (pyproject.toml:81-87). The [tool.uv] block force-overrides transformers>=4.51.0 / accelerate>=1.7.0 because tapeagents==0.1.16 pins them lower; [tapeagents] extra is broken at runtime. Either bump tapeagents or drop the extra on this branch.
  • Close fast-llm finetune metric gaps, e.g. rl/ess (effective sample size — diagnostic for data/policy drift).
  • Bump base image + vLLM version. Currently pinned to interactive-toolkit:25.12-py3-vllm014rc1redis (PyTorch 25.12, vLLM 0.14.0rc1). Move to the latest base PyTorch + vLLM that both Fast-LLM and PipelineRL support; re-run smoke after.
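The proposed fix for the retry-exhaustion item above (allow more retries, or evict rollouts that keep aborting) could look like this — a toy sketch with hypothetical names, not the actual pipelinerl/async_llm.py code:

```python
def run_rollout_with_retries(attempt_rollout, max_attempts=4):
    """Retry a rollout that vLLM aborts (e.g. during a weight-update
    pause). After max_attempts, raise loudly so the actor can evict or
    resubmit the rollout instead of dropping it silently."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_rollout()
        except TimeoutError:           # abort surfaced as timeout (see commit log)
            last_error = f"abort on attempt {attempt}/{max_attempts}"
    raise RuntimeError(f"evicting stuck rollout: {last_error}")

# Two consecutive aborts then success — survives the burst that the
# current attempt=2/2 limit would have dropped permanently.
calls = {"n": 0}
def flaky_rollout():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TimeoutError
    return "rollout-ok"

assert run_rollout_with_retries(flaky_rollout) == "rollout-ok"
assert calls["n"] == 3
```

The key behavioral change versus the current code is the explicit terminal `RuntimeError`: a rollout that exhausts retries becomes visible to the actor rather than sitting in `in_progress` forever.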

Known issues (with code references)

| Issue | Symptom | Site | Memory ref |
| --- | --- | --- | --- |
| Actor overshoot ends runs early | `TimeoutError: No document received after 600s` near final step | pipelinerl/actor.py:158, 613-614 | project_actor_samples_target_overshoot_bug.md |
| Rollout retry exhaustion | Rollout stuck in actor's in_progress after `attempt=2/2` abort | pipelinerl/async_llm.py:137-146 | project_stall_investigation.md |
| Reward lag vs DS | actor/reward_mean ~2 points below DS at step 400 | unknown (upstream of trainer) | project_fastllm_reward_lag_after_gspo_fix.md |

Current limitation (not a bug): streams=files is not implemented for use_fast_llm=true — Fast-LLM only ships RedisStreamingDataset. Use streams=redis. See project_streams_files_not_supported_fast_llm.md.
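Concretely, a fast-llm run today has to select the redis stream backend; a minimal override sketch (key names taken from the flags mentioned in this PR, not verified against the full config schema):

```yaml
# fast-llm trainer path: only RedisStreamingDataset exists today
use_fast_llm: true
streams: redis   # streams: files is NOT implemented for use_fast_llm=true
```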

Training curves (400-step run): fast-llm GSPO vs DeepSpeed GSPO

Compared runs:

  • fast-llm: math_7b_4node_fastllm_gspo_20260505_122944 (divisor² + SDP fix)
  • DS: math_7b_ds_fastllm_4node_20260428_135427 (matching GSPO config: policy_loss=gspo, epsilon_low=3e-3, 400 steps)

new_logprobs — fast-llm matches DS step-by-step (the GSPO loss math fix is correct):

[chart: new_logprobs, fast-llm GSPO vs DS GSPO, 400 steps]

actor/reward_mean — fast-llm lags DS by ~2 points at step 400 (open issue):

[chart: actor/reward_mean, fast-llm GSPO vs DS GSPO, 400 steps]

How to verify locally

See examples/interactive/fast_llm_4node.sh and examples/interactive/ds_4node.sh — both follow the README install.

# inside an interactive 4-node EAI session, after the README install:
bash examples/interactive/fast_llm_4node.sh   # fast-llm + vLLM v1 + GSPO
bash examples/interactive/ds_4node.sh         # DeepSpeed + vLLM v1 + PPO (reference)

Both run a 2-step smoke and finish in ~10 minutes. Override MAX_TRAIN_STEPS=N for longer runs.

Smoke result (last verified 2026-05-06)

| Smoke | EAI Job | Step 1 grad_norm | Step 2 grad_norm | Step 1 newlp | Step 2 newlp | NaN |
| --- | --- | --- | --- | --- | --- | --- |
| fast-llm GSPO | 59f3b62f | 0.166 | 0.173 | -0.171 | -0.162 | 0 |
| DeepSpeed PPO | 084ef7d8 | 0.201 | 0.247 | -0.162 | -0.146 | 0 |

Per-step wall time ~80–120 s for both — fast-llm and DS run at comparable speed at this scale.

Code change summary

See docs/FAST_LLM_INTEGRATION.md §5 "Per-file changes" for the file-by-file table. Highlights:

  • pipelinerl/launch.py: TCPStore pre-creation for broadcast rendezvous (workaround for torchrun client-only TORCHELASTIC_USE_AGENT_STORE=True); fast_llm.callbacks.streaming.broadcast.* injection.
  • pipelinerl/state.py: fast-llm event-stream listener thread; samples_processed=0 initialization to avoid startup deadlock.
  • pipelinerl/vllm1.py: init_actor_update_group/destroy_actor_update_group with WEIGHTS_BROADCAST_PG_NAME; training_finished handler for coordinated NCCL teardown.
  • pipelinerl/async_llm.py: rollout retry on vLLM aborted request (weight-update collision).
  • tests/: weight-broadcast tests (test_vllm1_fast_llm_broadcast.py), full vLLM v1 integration (test_vllm1_integration.py), multi-node topology (test_world_multinode.py), actor error handling.

Reviewer checklist

This is a draft PR for handover, not for merge. Reviewer should:

  1. Read docs/FAST_LLM_INTEGRATION.md end-to-end.
  2. Skim README §"Install FastLLM+PipelineRL" — it should reproduce on a fresh interactive job.
  3. Run bash examples/interactive/fast_llm_4node.sh and confirm step 1-2 metrics in finetune/stdout_node0.log.
  4. Pick up the TODO list above; create separate issues/PRs for each item.

rafapi and others added 30 commits December 12, 2025 14:09
… better abab pattern detection in generation results to test weight broadcast correctness, some refactoring
[WIP] Adding tests to vllm actor for Fast-LLM integration
bigximik added 30 commits April 27, 2026 15:18
When vLLM aborts an in-flight request during a weight update pause it
returns finish_reason='abort' with empty logprobs.  Previously this
propagated to make_training_text which raised ValueError and crashed the
entire actor.  Raise asyncio.TimeoutError instead so the actor's
existing retry logic replays the rollout cleanly.
…mit script

Stale .pod_ips files from a previous run caused the pod IP exchange to
return immediately with old IPs — slow-starting ranks were never waited
for, breaking torchrun rendezvous on resume.  clean_up() now removes the
directory so every run waits for all live ranks to write fresh IPs.

Submit script appends a timestamp to resume job names so EAI does not
reject them as duplicates.
Stale .pod_ips files from a previous job caused rank 0 to complete the
exchange with wrong IPs before other ranks had even started. Then
clean_up() deleted rank_0.txt, leaving ranks 1-N waiting forever.

Rank 0 now atomically wipes the old directory and writes a UUID session
token before any rank writes its IP. Non-zero ranks block on the session
token, so they only write after rank 0 has cleared stale data.

Remove the incorrect pod_ips deletion from clean_up() (it was too late:
exchange already complete, and it wiped rank_0.txt other ranks needed).
The UUID approach was broken: a non-zero rank arriving before rank 0
would see the stale session UUID from the previous job, skip waiting,
write its IP — then rank 0 would wipe the dir (deleting the fresh IP)
and write a new UUID. Rank 0 then waits forever for that rank's file.

Use the rank-0 DNS name from MASTER_ADDR as the token instead. It is
unique per EAI job (contains a job UUID), so non-zero ranks reject a
stale session by comparing token content to their own MASTER_ADDR.
Replace session-token logic with a per-job subdirectory under .pod_ips/.
The subdir name is MASTER_ADDR (a standard distributed-launcher env var,
unique per job), so stale files from previous runs are simply never seen
— no wiping, no barriers, no coordination needed.

This removes the EAI-specific dependency on the dns_address_map naming
convention and works with any launcher that sets MASTER_ADDR.
Add world.run_id config field (default null). The call site resolves it
as: cfg.world.run_id if set, else $MASTER_ADDR, else "default".
On EAI/torchrun MASTER_ADDR is unique per job so the default works
out-of-the-box; other systems can set world.run_id explicitly.

Remove the MASTER_ADDR hardcoding from _exchange_pod_ips itself.
world.run_id must now be set explicitly for multi-node jobs — no silent
fallback to MASTER_ADDR. Raises ValueError if unset, RuntimeError if
the run_id dir already exists (duplicate or stale run detected early).

Rank 0 exclusively creates the dir; non-zero ranks wait for it, so the
existence check is unambiguous: if the dir is there when rank 0 arrives,
it is from a previous job.

Submit script passes world.run_id=${MASTER_ADDR} so EAI jobs are unique
per replica-group without any manual intervention.
- Drop use_v1 toggle: vLLM V1 is now always used (remove use_v1 config
  field, V0 legacy flags, and conditional entrypoint selection)
- launch.py: _get_vllm_kwargs no longer takes use_v1 param; always drops
  V0 legacy flags; num-scheduler-steps dropped unconditionally for V1
- vllm1.py HTTP path: add timing/version logging for pause/update/resume
  (from vllm_v1); keep _pause_generation helper (drains in-flight requests)
  and self.engine.engine_core (not engine_client which isn't in HEAD init)
- vllm1.py fast-llm path: propagate same timing/version logging to
  receive_weight_update_fast_llm for parity with HTTP path
…nt loop

Blocking put on a full queue stalled the asyncio event loop (test_actor_stall_fixed).
Delete from group_rollouts before the await to prevent double-processing.
ServerDisconnectedError is a transient failure (vLLM event loop briefly
blocked during synchronized post-weight-update response burst) — add it
to retryable_rollout_exceptions so the actor backs off and retries instead
of crashing the whole job.

conf/math.yaml: remove use_v1: true left over from before the always-v1
switch; was missed in the 13a42bf merge cleanup.
- Remove single quotes around world.run_id=\${MASTER_ADDR} so bash expands
  MASTER_ADDR in the container (pod IP exchange was hanging because OmegaConf
  tried to resolve the literal string '${MASTER_ADDR}' as a config key)
- Add + prefix to fast_llm.schedule.docs_per_step (new field not in base.yaml
  struct, requires append syntax)
- Add DS submit script for fast-llm branch (submit_eai_math_7b_multinode_ds_vllm_v1.sh)
- Set max_ready_samples_per_lead: 64 (was 512) to match reference branch
- Add monitor_jobs.sh for polling EAI job status
Top-level `fp32_lm_head=true` is rejected after main merge (launch.py warns and exits).
Fast-LLM-side override `+fast_llm.model.base_model.head.fp32_lm_head=true` still works
and is kept. Also replaces removed `compute_extra_metrics=true` with new PR #494 enum
`metrics=with_entropy`.
Adds canonical handover documentation for the fast-llm trainer integration,
since this branch is WIP and being handed off:

- docs/FAST_LLM_INTEGRATION.md: architecture, per-file changes, configuration
  knobs, glossary, known issues with file:line citations, testing guide,
  operations notes, and open questions for the successor.
- examples/interactive/fast_llm_4node.sh, ds_4node.sh: 2-step smoke runs that
  mirror the EAI submit scripts but execute in the current shell. Default to
  MAX_TRAIN_STEPS=2 for verification; bump for real runs.
- README.md: refresh stale install steps (gspo branch in Fast-LLM, not
  jlp_pipeline_rl), call out pyproject.toml tapeagents caveat, add a
  "Fast-LLM trainer path (preview)" subsection under §5 Trainer pointing to
  the canonical doc.

No code changes. Functional behavior unchanged.
Drop sections that were nice-to-have ideas, not real code TODOs:
- streams=files / +finetune.max_lag (speculation about reward-lag fix)
- Step progress heartbeat (no actual TODO in Fast-LLM runner.py)
- xreadgroup count=1 perf (perf speculation, no measurement)
- Data logging stash (debug tool, not handover-critical)

Tighten reward-lag entry: drop the unverified streams-staleness theory and
"investigations to try" list. Reframe streams=files as a current limitation,
not a fix-needed item.

Real measured issues (actor overshoot, rollout retry exhaustion, reward lag
investigation needed) stay.
- Embed reward_mean and new_logprobs charts (fast-llm GSPO vs DeepSpeed GSPO,
  400-step run, eps=3e-3): newlp matches step-by-step; reward lags ~2 points
  at step 400.
- Compared runs: fast-llm math_7b_4node_fastllm_gspo_20260505_122944
  (divisor² + SDP fix) vs DS math_7b_ds_fastllm_4node_20260428_135427.
- Add open questions for the successor:
  * Resolve commented-out pyproject.toml [tool.uv] tapeagents overrides
    (transformers/accelerate pins; [tapeagents] extra broken at runtime).
  * Close metric coverage gap on fast-llm finetune side (start with rl/ess).
Note that the interactive-toolkit:25.12-py3-vllm014rc1redis image is built
from the fml/pytorch_vllm014rc1 branch of ServiceNow/research-interactive-
toolkit (SN-internal). Base layer nvcr.io/nvidia/pytorch:25.12-py3, branch
adds vLLM 0.14.0rc1, redis, and EAI helpers.
.research-interactive-env values are for *building* the image (in the
research-interactive-toolkit repo on branch fml/pytorch_vllm014rc1), not
for using the prebuilt one. Reword both README and handover doc so that
"use" just means referencing the image URI, and "build" is a separate
flow with the env config.
The DS example script was using PPO config, which doesn't reproduce the
DeepSpeed curve in docs/FAST_LLM_INTEGRATION.md (those charts compare
fast-llm GSPO vs DS GSPO at 400 steps with epsilon_low=3e-3,
epsilon_high=4e-3).

Switch ds_4node.sh defaults to policy_loss=gspo + epsilon=3e-3/4e-3
so 'MAX_TRAIN_STEPS=400 bash examples/interactive/ds_4node.sh'
reproduces math_7b_ds_fastllm_4node_20260428_135427 byte-for-byte.

Update both script header comments to call out that they're the chart
reproduction recipes.
…ipes

- Track submit_eai_math_7b_multinode_ds_fastllm_branch.sh — the production
  EAI launcher that produced math_7b_ds_fastllm_4node_20260428_135427 (the
  DS curve in the comparison charts). Drop the now-removed top-level
  fp32_lm_head=true knob from it.

- docs/FAST_LLM_INTEGRATION.md:
  * Add §"Launching an interactive EAI job" — the prereq for the
    examples/interactive/ scripts (ServiceNow/research-interactive-toolkit
    `make launch` flow).
  * Add §"Reproduction recipes" — table mapping the chart-baseline runs to
    both the interactive examples and the production submit_eai_*.sh
    launchers, so readers can pick the right script for their context.

- examples/interactive/{fast_llm,ds}_4node.sh: rewrite the prereq comment
  block so it points to the new "Launching an interactive EAI job" section
  before sending the user to the README install.
submit_eai_math_7b_multinode.sh and
submit_eai_math_7b_multinode_ds_fastllm_branch.sh both hardcoded Denis-
specific values (RESULTS_DIR=/mnt/shared/denis/..., wandb_entity_name=
denisko-se, --data snow.home.denis_kocetkov:..., --data snow.research.afm
.shared_fml:...). Add a "PERSONALIZE THESE BEFORE RUNNING" block at the
top of each script with env-var-overridable defaults so a new user can
set RESULTS_DIR / WANDB_ENTITY / WANDB_PROJECT / EAI_HOME_DATA /
EAI_SHARED_DATA before launching, instead of editing inline.

Add a "Personalize before running" subsection in
docs/FAST_LLM_INTEGRATION.md explaining what each knob is and which
scripts each applies to.

Delete submit_eai_math_7b_multinode_ds_vllm_v1.sh (the DS PPO variant) —
the GSPO version (submit_eai_math_7b_multinode_ds_fastllm_branch.sh,
which reproduces the chart baseline) is now the canonical DS launcher.
Also fix that script's stale path: PipelineRL-fastllm worktree no longer
exists; cd into /home/toolkit/code/PipelineRL (already on the fast-llm
branch).
The pieces (prereqs, env vars, script paths) were scattered across §3
"End-to-end install", §"Personalize before running", and §"Reproduction
recipes". A reader had to assemble a launch command themselves.

Add §"How to launch (prereqs + commands)" with two concrete paths:
- Path 1: production EAI batch job (eai CLI, wandb creds, env vars, then
  bash submit_eai_*.sh; how to monitor and stop)
- Path 2: interactive session (launch interactive, install, then bash
  examples/interactive/*.sh; smoke vs MAX_TRAIN_STEPS=400)

Both paths show actual bash commands the reader can copy.
The examples/interactive/{fast_llm,ds}_4node.sh scripts assumed running
from an interactive session that has 4 nodes attached. That's not how
EAI interactive jobs work - interactive sessions are 1-2-GPU dev
environments, and 4-node training jobs are submitted *from* them via
'eai job new' (which is what submit_eai_*.sh does).

The two scripts wrapped 'python -m pipelinerl.launch' directly (no
'eai job new'), so they could never run in a typical EAI interactive
session. Delete them; submit_eai_*.sh are the canonical reproduction
recipes for both smoke and full-length runs.

Update docs/FAST_LLM_INTEGRATION.md:
- Rewrite the interactive-session subsection to clarify it is a
  dev/console environment, not a 4-node training setup.
- Drop the Path-2 (interactive) flow from How-to-launch; only Path 1
  (production submit_eai_*.sh) remains.
- Add a prereq linking back to End-to-end install -> Steps.
- Multi-node smoke section: explain 2-step verification via inline edit
  of the submit script.
Restructure §9 from a flat list of seven subsections into a three-bucket
hierarchy:

  9. Testing
    Unit tests (single host)
    4-node test results
      2-step smoke (last verified ...)
      400-step training curves: fast-llm GSPO vs DS GSPO
    How to run 4-node tests
      Personalize
      Reproduction scripts
      Launch

The old flat layout interleaved "what we observed" content (smoke
results, curves) with "how to do it" content (personalize, recipes,
launch). The new layout puts results in one bucket and the launch
recipe in the other, so readers can jump to the half they need.

Pure reorganization; no content changes beyond moving paragraphs and
adjusting heading levels (### → ### / ####).