feat: Add dedicated merged mode to Megatron backend#636

Open
vivekkalyan wants to merge 8 commits into feat/dedicated-mode-megatron from feat/merged-mode-megatron

Conversation


@vivekkalyan vivekkalyan commented Apr 1, 2026

Summary

This adds dedicated merged mode to the Megatron backend.

In dedicated merged mode, ART keeps vLLM running on the inference GPU, trains Megatron on the trainer GPU, and updates inference weights in place through vLLM's native weight-transfer APIs. This allows training models that lack LoRA support in vLLM, and enables faster inference when used with LocalBackend.

What this enables

  • Use Megatron with dedicated trainer and inference GPUs in rollout_weights_mode="merged"
  • Keep inference and training decoupled while only pausing generation during the actual merged weight swap
  • Advance the served model alias step by step without restarting the dedicated vLLM server
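As a rough illustration of the first bullet, selecting the mode might look like the sketch below. The import path and parameter names are assumptions for illustration, not the confirmed ART API:

```python
# Hypothetical sketch -- names are illustrative, not the confirmed ART API.
from art import TrainableModel  # assumed import path

model = TrainableModel(
    name="agent-001",
    base_model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    # "merged" keeps vLLM serving on the dedicated inference GPU while
    # Megatron trains on the trainer GPU; merged weights are pushed in
    # place after each training step, advancing the served alias.
    rollout_weights_mode="merged",
)
```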

Implementation

  • Add a Megatron service-to-trainer job protocol for:
    • initial merged sync
    • LoRA training jobs
    • merged training jobs
  • Start dedicated vLLM in merged mode with native weight transfer enabled
  • Initialize NCCL weight transfer between the Megatron trainer and vLLM
  • Convert live Megatron weights through Megatron Bridge into HF/vLLM checkpoint names, merge ART LoRA deltas into those tensors, and send them directly to vLLM
  • Update the served model name after each successful merged sync
  • Reuse a shared TCP port helper instead of depending on a backend-specific implementation
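The "merge ART LoRA deltas into those tensors" step boils down to folding each low-rank delta into the corresponding dense weight before it is shipped to vLLM. A minimal numeric sketch in plain NumPy, independent of Megatron Bridge's actual tensor naming:

```python
import numpy as np

def merge_lora(base: np.ndarray, lora_a: np.ndarray, lora_b: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a LoRA delta into a dense weight: W' = W + (alpha / r) * B @ A.

    base:   (out, in) dense weight
    lora_a: (r, in)   LoRA down-projection
    lora_b: (out, r)  LoRA up-projection
    """
    scaling = alpha / rank
    return base + scaling * (lora_b @ lora_a)

# Toy check: a rank-1 delta applied to a 4x4 weight keeps the shape.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
merged = merge_lora(w, rng.standard_normal((1, 4)),
                    rng.standard_normal((4, 1)), alpha=2.0, rank=1)
assert merged.shape == w.shape
```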

Validation

  • Unit coverage:
    • tests/unit/test_megatron_dedicated.py
  • Fresh-cluster 2-GPU smoke:
    • trainer on GPU 0
    • inference on GPU 1
    • base model Qwen/Qwen3-30B-A3B-Instruct-2507
    • dedicated merged mode completed two real train steps and advanced the served model from @0 to @2

@vivekkalyan vivekkalyan force-pushed the feat/merged-mode-megatron branch from dd3c18d to 90aa5cb Compare April 14, 2026 06:14
Collaborator Author

Rebased this PR onto the latest main and force-pushed the updated branch.

Since the earlier review/validation, I made a few follow-up fixes while chasing the dedicated merged regression:

  • fixed shared step-0 LoRA bootstrap
  • cleared stale dedicated merged startup jobs before the initial sync
  • fixed dedicated merged weight sync to vLLM in Megatron by:
    • unwrapping ._orig_mod. names from compiled modules
    • matching both decoder.layers... and language_model.decoder.layers... handler prefixes
    • fixing expert fc1 merge handling for per-expert gate_proj / up_proj
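The first two name-mapping fixes amount to string normalization so that parameters from compiled modules and from the `language_model.`-nested layout both reach the same `decoder.layers...` handler. A sketch of that logic (the prefixes come from this PR description; the helper itself is illustrative):

```python
def normalize_param_name(name: str) -> str:
    """Normalize a Megatron parameter name for the vLLM sync handlers.

    - torch.compile wraps modules, yielding names containing "_orig_mod."
      (e.g. "decoder._orig_mod.layers.0.mlp.weight"); strip the wrapper.
    - Some layouts nest the decoder under "language_model."; strip that
      prefix so both spellings match the "decoder.layers..." handlers.
    """
    name = name.replace("._orig_mod.", ".").removeprefix("_orig_mod.")
    if name.startswith("language_model."):
        name = name.removeprefix("language_model.")
    return name

assert normalize_param_name(
    "language_model.decoder.layers.0.self_attention.weight"
) == "decoder.layers.0.self_attention.weight"
assert normalize_param_name(
    "decoder._orig_mod.layers.1.mlp.weight"
) == "decoder.layers.1.mlp.weight"
```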

Validation summary on SkyPilot (uv run sky):

  • origin/main yes-no-maybe control converges to 0.9583333333333334
  • shared mode on this stack also converges to 0.9583333333333334
  • dedicated merged now converges as well

Dedicated merged 20-step run (pr636-dedicated-merged-20260413h):

  • step 1: val 0.6197916666666666, train 0.625
  • step 2: val 0.9427083333333334, train 0.9176432291666666
  • step 3: val 0.9583333333333334, train 0.9560546875
  • steps 4-20: val/train stay at 0.9583333333333334 with 0 trainable groups / 0 gradient steps because the task is solved

One small operational note from the rerun: dedicated mode needed TENSOR_PARALLEL_SIZE=1 when serving on a single inference GPU.

So the previously observed dedicated merged non-convergence was caused by the merged sync path, and that path is now fixed and end-to-end validated.

@vivekkalyan vivekkalyan marked this pull request as ready for review April 14, 2026 06:15