feat: Add dedicated merged mode to Megatron backend#636

Open
vivekkalyan wants to merge 8 commits into feat/dedicated-mode-megatron from feat/merged-mode-megatron

Conversation


@vivekkalyan vivekkalyan commented Apr 1, 2026

Summary

This adds dedicated merged mode to the Megatron backend.

In dedicated merged mode, ART keeps vLLM running on the inference GPU, trains Megatron on the trainer GPU, and updates inference weights in place through vLLM's native weight-transfer APIs. This allows training models that lack LoRA support in vLLM, and enables faster inference when used with LocalBackend.

What this enables

  • Use Megatron with dedicated trainer and inference GPUs in rollout_weights_mode="merged"
  • Keep inference and training decoupled while only pausing generation during the actual merged weight swap
  • Advance the served model alias step by step without restarting the dedicated vLLM server
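As a rough illustration of the first bullet, selecting the mode might look like the sketch below. The import path and parameter names are assumptions for illustration, not the confirmed ART API:

```python
# Hypothetical sketch -- names are illustrative, not the confirmed ART API.
from art import TrainableModel  # assumed import path

model = TrainableModel(
    name="agent-001",
    base_model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    # "merged" keeps vLLM serving on the dedicated inference GPU while
    # Megatron trains on the trainer GPU; merged weights are pushed in
    # place after each training step, advancing the served alias.
    rollout_weights_mode="merged",
)
```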

Implementation

  • Add a Megatron service-to-trainer job protocol for:
    • initial merged sync
    • LoRA training jobs
    • merged training jobs
  • Start dedicated vLLM in merged mode with native weight transfer enabled
  • Initialize NCCL weight transfer between the Megatron trainer and vLLM
  • Convert live Megatron weights through Megatron Bridge into HF/vLLM checkpoint names, merge ART LoRA deltas into those tensors, and send them directly to vLLM
  • Update the served model name after each successful merged sync
  • Reuse a shared TCP port helper instead of depending on a backend-specific implementation
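The "merge ART LoRA deltas into those tensors" step boils down to folding each low-rank delta into the corresponding dense weight before it is shipped to vLLM. A minimal numeric sketch in plain NumPy, independent of Megatron Bridge's actual tensor naming:

```python
import numpy as np

def merge_lora(base: np.ndarray, lora_a: np.ndarray, lora_b: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a LoRA delta into a dense weight: W' = W + (alpha / r) * B @ A.

    base:   (out, in) dense weight
    lora_a: (r, in)   LoRA down-projection
    lora_b: (out, r)  LoRA up-projection
    """
    scaling = alpha / rank
    return base + scaling * (lora_b @ lora_a)

# Toy check: a rank-1 delta applied to a 4x4 weight keeps the shape.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
merged = merge_lora(w, rng.standard_normal((1, 4)),
                    rng.standard_normal((4, 1)), alpha=2.0, rank=1)
assert merged.shape == w.shape
```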

Validation

  • Unit coverage:
    • tests/unit/test_megatron_dedicated.py
  • Fresh-cluster 2-GPU smoke:
    • trainer on GPU 0
    • inference on GPU 1
    • base model Qwen/Qwen3-30B-A3B-Instruct-2507
    • dedicated merged mode completed two real train steps and advanced the served model from @0 to @2

@vivekkalyan vivekkalyan force-pushed the feat/merged-mode-megatron branch from dd3c18d to 90aa5cb Compare April 14, 2026 06:14
Collaborator Author

Rebased this PR onto the latest main and force-pushed the updated branch.

Since the earlier review/validation, I made a few follow-up fixes while chasing the dedicated merged regression:

  • fixed shared step-0 LoRA bootstrap
  • cleared stale dedicated merged startup jobs before the initial sync
  • fixed dedicated merged weight sync to vLLM in Megatron by:
    • unwrapping ._orig_mod. names from compiled modules
    • matching both decoder.layers... and language_model.decoder.layers... handler prefixes
    • fixing expert fc1 merge handling for per-expert gate_proj / up_proj
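The first two name-mapping fixes amount to string normalization so that parameters from compiled modules and from the `language_model.`-nested layout both reach the same `decoder.layers...` handler. A sketch of that logic (the prefixes come from this PR description; the helper itself is illustrative):

```python
def normalize_param_name(name: str) -> str:
    """Normalize a Megatron parameter name for the vLLM sync handlers.

    - torch.compile wraps modules, yielding names containing "_orig_mod."
      (e.g. "decoder._orig_mod.layers.0.mlp.weight"); strip the wrapper.
    - Some layouts nest the decoder under "language_model."; strip that
      prefix so both spellings match the "decoder.layers..." handlers.
    """
    name = name.replace("._orig_mod.", ".").removeprefix("_orig_mod.")
    if name.startswith("language_model."):
        name = name.removeprefix("language_model.")
    return name

assert normalize_param_name(
    "language_model.decoder.layers.0.self_attention.weight"
) == "decoder.layers.0.self_attention.weight"
assert normalize_param_name(
    "decoder._orig_mod.layers.1.mlp.weight"
) == "decoder.layers.1.mlp.weight"
```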

Validation summary on SkyPilot (uv run sky):

  • origin/main yes-no-maybe control converges to 0.9583333333333334
  • shared mode on this stack also converges to 0.9583333333333334
  • dedicated merged now converges as well

Dedicated merged 20-step run (pr636-dedicated-merged-20260413h):

  • step 1: val 0.6197916666666666, train 0.625
  • step 2: val 0.9427083333333334, train 0.9176432291666666
  • step 3: val 0.9583333333333334, train 0.9560546875
  • steps 4-20: val/train stay at 0.9583333333333334 with 0 trainable groups / 0 gradient steps because the task is solved

One small operational note from the rerun: dedicated mode needed TENSOR_PARALLEL_SIZE=1 when serving on a single inference GPU.

So the previously observed dedicated merged non-convergence was caused by the merged sync path, and that path is now fixed and end-to-end validated.

@vivekkalyan vivekkalyan marked this pull request as ready for review April 14, 2026 06:15