Conversation

@zyksir (Collaborator) commented Nov 7, 2025

Motivation

(See #117.) Training Llama-3.1 models (8B and 70B) in offline mode with long context lengths (e.g., 8K, 16K, or 32K) currently fails with Out-of-Memory (OOM) errors, even on multi-GPU setups.

Modifications

Thanks to @yd-oom for the previous PR.

Related Issues

#112

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @zyksir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for Tensor Parallelism (TP) in the training of long-context draft models, particularly addressing Out-of-Memory (OOM) issues encountered with large Llama-3.1 models. It achieves this through a major refactoring of the training infrastructure, including a new class-based trainer, decoupled distributed training groups for target and draft models, and a reorganized module structure for improved scalability and maintainability.

Highlights

  • Distributed Training Enhancement: Introduced distinct Tensor Parallelism (TP) and Data Parallelism (DP) groups for both draft and target models, allowing for more flexible and efficient distributed training configurations tailored for long-context models (a minimal sketch follows this list).
  • Unified Training Script: Consolidated the train_eagle3_online.py and train_eagle3_sgl_online.py scripts into a single, refactored train_eagle3_online.py with a class-based trainer (Eagle3Trainer and SglOnlineEagle3Trainer), significantly improving code organization and maintainability.
  • Improved Checkpointing and Resume: Enhanced the training process with more robust checkpointing and resume capabilities, including support for Zero2 optimization for improved memory efficiency during large-scale training.
  • Module Restructuring: Performed a significant reorganization of the specforge package, moving core and modeling modules into a new model subpackage, which improves logical grouping and clarity of the codebase.
  • SGLang Integration Updates: Modified the SGLang backend integration to better handle hidden states and logits processing for EAGLE3 training, enabling the target model to provide necessary data for draft model training more effectively.
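
As a rough illustration of the decoupled groups in the first highlight, here is a minimal sketch using PyTorch device meshes; the layout, sizes, and group names are illustrative assumptions, not the code in this PR:

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical layout on 8 GPUs (launch with torchrun): the target model runs
# TP=8, while the draft model runs TP=2 with the remaining factor acting as DP=4.
dist.init_process_group(backend="nccl")

target_mesh = init_device_mesh("cuda", (1, 8), mesh_dim_names=("dp", "tp"))
draft_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

draft_tp_group = draft_mesh.get_group("tp")  # shards draft-model weights/activations
draft_dp_group = draft_mesh.get_group("dp")  # all-reduces draft-model gradients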

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant enhancements to support Tensor Parallelism (TP) for draft model training, which is crucial for handling long contexts and mitigating Out-of-Memory (OOM) errors. The core changes involve decoupling the draft model's TP from the target model's Data Parallelism (DP), extensive refactoring of the training scripts into a more object-oriented structure with Eagle3Trainer classes, and updating the distributed communication setup. While the overall direction and refactoring are excellent for maintainability and scalability, I have identified a few issues, including a critical script path error, some redundant code, and opportunities for better code organization.

  --standalone \
  --nproc_per_node $NUM_GPUS \
- $ROOT_DIR/scripts/train_eagle3_online.py \
+ $ROOT_DIR/scripts/train_eagle3.py \
Contributor

critical

The script path has been changed to $ROOT_DIR/scripts/train_eagle3.py, but the file being modified in this pull request is scripts/train_eagle3_online.py. The file train_eagle3.py does not appear to exist, which will cause this script to fail.

Suggested change
- $ROOT_DIR/scripts/train_eagle3.py \
+ $ROOT_DIR/scripts/train_eagle3_online.py \

Comment on lines 35 to 50
python scripts/build_eagle3_dataset_cache.py \
    --target-model-path $MODEL_PATH \
    --draft-model-config ./configs/llama3-8B-eagle3.json \
-   --train-data-path $DATASET_PATH/sharegpt_ultrachat.jsonl \
+   --train-data-path $GENERATED_DATASET_PATH/train_data.jsonl \
+   --eval-data-path $GENERATED_DATASET_PATH/eval_data.jsonl \
    --cache-dir $CACHE_DIR \
    --chat-template $CHAT_TEMPLATE \
    --max-length $MAX_LENGTH \
-   --view-train-data 1
+   --view-train-data 1 --debug
Contributor

medium

The build_eagle3_dataset_cache.py script is called here with the --debug flag, and then called again on lines 52-59 without it. This first call seems redundant for a final script. If it's intended for debugging during development, it should be removed to avoid unnecessary processing and potential confusion.

Comment on lines 1003 to 1038
class SglOnlineEagle3Trainer(Eagle3Trainer):
    def __init__(self, args):
        super().__init__(args)
        assert (
            args.draft_micro_batch_size == 1
        ), "SglOnlineEagle3Trainer only supports draft_micro_batch_size = 1"
Contributor

medium

The SglOnlineEagle3Trainer has a hardcoded assertion args.draft_micro_batch_size == 1. This is also noted in the file's docstring as a TODO. While this is a known limitation, it's a significant one that restricts the training configurations. It would be beneficial to prioritize removing this limitation to allow for more flexible batching strategies.
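
As context for lifting this limitation, the usual approach is to loop over micro-batches inside one draft step and accumulate gradients. Below is a minimal sketch under that assumption; draft_model, optimizer, and the batch layout are hypothetical, not the trainer's actual API:

def draft_train_step(draft_model, optimizer, batch, micro_batch_size):
    # Split one per-rank batch into micro-batches and accumulate gradients so the
    # resulting update matches a single large-batch step.
    total = batch["input_ids"].shape[0]
    num_micro = (total + micro_batch_size - 1) // micro_batch_size
    optimizer.zero_grad()
    for i in range(num_micro):
        sl = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        micro_batch = {k: v[sl] for k, v in batch.items()}
        loss = draft_model(**micro_batch).loss / num_micro  # scale for accumulation
        loss.backward()
    optimizer.step()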

Comment on lines 47 to 89
def get_draft_dp_device_mesh():
    global _DRAFT_DP_DEVICE_MESH
    return _DRAFT_DP_DEVICE_MESH
Contributor

medium

The function get_draft_dp_device_mesh is defined and exported but is not used anywhere in the codebase. This appears to be dead code and should be removed to improve maintainability.

Comment on lines 27 to 38
def get_dp_data_shard_from_tp(
    tensor: Union[torch.Tensor, List[torch.Tensor]]
) -> torch.Tensor:
    """
    Get the data shard from the tensor.
    """
    tp_size = dist.get_world_size(get_target_tp_group())
    tp_rank = dist.get_rank(get_target_tp_group())
    tensor_length = len(tensor) if isinstance(tensor, List) else tensor.shape[0]
    assert tensor_length % tp_size == 0, "Tensor length must be divisible by tp_size"
    chunk_size = tensor_length // tp_size
    return tensor[tp_rank * chunk_size : (tp_rank + 1) * chunk_size]
Contributor

medium

The function get_dp_data_shard_from_tp is a generic distributed utility. Placing it within eagle3_target_model.py makes the code less modular. It would be better placed in a more general utility module like specforge/distributed.py to centralize distributed helper functions.
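
To make the slicing behavior concrete, here is a standalone re-implementation of the same rule with the process-group lookups replaced by explicit arguments; this is illustrative only, and shard_for_rank is not a function in this PR:

import torch

def shard_for_rank(tensor, tp_rank, tp_size):
    # Same contiguous-chunk rule as get_dp_data_shard_from_tp above.
    assert tensor.shape[0] % tp_size == 0, "batch must divide evenly across TP ranks"
    chunk = tensor.shape[0] // tp_size
    return tensor[tp_rank * chunk : (tp_rank + 1) * chunk]

batch = torch.arange(8).unsqueeze(-1)  # stand-in for a batch of hidden states
print([shard_for_rank(batch, r, 4).flatten().tolist() for r in range(4)])
# -> [[0, 1], [2, 3], [4, 5], [6, 7]]: each target TP rank keeps one DP shard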

Comment on lines 29 to 137
# This is a modified forward function for the SGLang's logits processor, adapted from https://github.com/sgl-project/sglang/blob/v0.5.4/python/sglang/srt/layers/logits_processor.py.
# The modification is to return the logits and aux hidden states instead of the last hidden states.
# """

# if isinstance(logits_metadata, ForwardBatch):
#     logits_metadata = LogitsMetadata.from_forward_batch(logits_metadata)

# # Check if multi-item scoring is enabled via server args (only for prefill-only requests)
# multi_item_delimiter = get_global_server_args().multi_item_scoring_delimiter
# if multi_item_delimiter is not None and logits_metadata.is_prefill_only:
#     return self.compute_logprobs_for_multi_item_scoring(
#         input_ids, hidden_states, lm_head, logits_metadata, multi_item_delimiter
#     )

# # Get the last hidden states and last logits for the next token prediction
# if (
#     logits_metadata.forward_mode.is_decode_or_idle()
#     or logits_metadata.forward_mode.is_target_verify()
#     or logits_metadata.forward_mode.is_draft_extend_v2()
# ):
#     pruned_states = hidden_states
#     if aux_hidden_states is not None:
#         aux_pruned_states = [hidden for hidden in aux_hidden_states]
#     sample_indices = None
#     input_logprob_indices = None
# else:
#     raise RuntimeError(
#         f"The modified logits processor is not supported for this forward mode: {logits_metadata.forward_mode}"
#     )

# # Compute logits for both input and sampled tokens.
# logits = self._get_logits(pruned_states, lm_head, logits_metadata)

# hidden_states_to_store: Optional[torch.Tensor] = None
# if logits_metadata.capture_hidden_mode.need_capture():
#     if logits_metadata.capture_hidden_mode.is_full():
#         if aux_hidden_states is not None:
#             aux_hidden_states = torch.cat(aux_hidden_states, dim=-1)
#             hidden_states_to_store = aux_hidden_states
#         else:
#             hidden_states_to_store = hidden_states
#     elif logits_metadata.capture_hidden_mode.is_last():
#         # Get the last token hidden states. If sample_indices is None,
#         # pruned states only contain the last tokens already.
#         if aux_hidden_states is not None:
#             aux_pruned_states = torch.cat(aux_pruned_states, dim=-1)
#             hidden_states_to_store = (
#                 aux_pruned_states[sample_indices]
#                 if sample_indices is not None
#                 else aux_pruned_states
#             )
#         else:
#             hidden_states_to_store = (
#                 pruned_states[sample_indices]
#                 if sample_indices is not None
#                 else pruned_states
#             )
#     else:
#         assert False, "Should never reach"

# if not logits_metadata.extend_return_logprob:
#     # Decode mode or extend mode without return_logprob.
#     return ReplacedLogitsProcessorEagle3Output(
#         logits=logits,
#         aux_hidden_states=hidden_states_to_store,
#     )


class LogitsProcessorForEAGLE3(torch.nn.Module):
    def __init__(
        self, logits_processor: LogitsProcessor, return_full_logits: bool = False
    ):
        super().__init__()
        self.logits_processor = logits_processor
        self.return_full_logits = return_full_logits

    def forward(
        self,
        input_ids,
        hidden_states,
        lm_head,
        logits_metadata,
        aux_hidden_states: Optional[List[torch.Tensor]] = None,
    ) -> LogitsProcessorOutput:
        logits_metadata.forward_mode = ForwardMode.DECODE
        # ret = replaced_logits_processor_forward_for_eagle3(
        #     self.logits_processor,
        #     input_ids,
        #     hidden_states,
        #     lm_head,
        #     logits_metadata,
        #     aux_hidden_states,
        # )
        # ret = self.logits_processor.forward(
        #     input_ids, hidden_states, lm_head, logits_metadata, aux_hidden_states
        # )
        return ReplacedLogitsProcessorEagle3Output(
            hidden_states=hidden_states,
            aux_hidden_states=torch.cat(aux_hidden_states, dim=-1),
        )
Contributor

medium

This file contains a significant amount of commented-out code. This should be removed to improve code clarity and maintainability.

zyksir changed the title from "Feat: Support TP for long-context draft model training" to "Feat: Refactor & Support TP for long-context draft model training" on Nov 7, 2025
zyksir force-pushed the feature/refactor branch 2 times, most recently from cc13007 to 3530625, on November 8, 2025 at 21:58
Comment on lines +309 to 310
seq_lengths.extend([16384])

Collaborator

Why remove 32k?

Collaborator Author

32K leads to OOM when I test it on my H100.
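
For a rough sense of why 32K hits OOM, here is a back-of-the-envelope estimate assuming Llama-3.1-8B dimensions (hidden size 4096, vocabulary 128,256, bf16) and EAGLE3-style concatenation of three auxiliary hidden states; the numbers are per sequence, before any training activations or optimizer state:

seq_len, hidden, vocab, bytes_bf16 = 32_768, 4096, 128_256, 2

logits_gib = seq_len * vocab * bytes_bf16 / 2**30           # ~7.8 GiB of full-sequence logits
aux_hidden_gib = seq_len * 3 * hidden * bytes_bf16 / 2**30  # ~0.75 GiB of concatenated aux hidden states
print(f"logits: {logits_gib:.1f} GiB, aux hidden states: {aux_hidden_gib:.2f} GiB")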

from specforge.utils import padding

-from .utils import norm_tensor
+from tests.utils import norm_tensor
Collaborator

This import is not used.

Comment on lines +1 to +4
from .optimizer import BF16Optimizer
from .tracker import Tracker, build_tracker

__all__ = ["BF16Optimizer", "Tracker", "build_tracker"]
Collaborator

I think this module name is a bit odd: "helper" usually refers to utility functions, but this module contains components necessary for training. I guess these could be independent modules, i.e. specforge.tracker and specforge.optimizer.

Comment on lines +52 to +53
    def model_specific_adjustment(self):
        pass
Collaborator

What is this for?

Collaborator Author

This function will call something like log_on_rank0; at this point, dist is initialized but SGLang's parallel_state._WORLD is not set yet, which will lead to an error.

    )
    target_micro_batch_size = None
else:
    server_args = ServerArgs.from_cli_args(args)
Collaborator

I recommend that we don't do this, because it might cause some confusion for users.

  1. Some arguments in SGLang are for optimizations other than prefill; these options won't take effect even if the user specifies them.
  2. This will make the --help of our training script extremely long.

I recommend that we only keep the options important to prefill.
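
One possible shape for that suggestion, sketched under the assumption that ServerArgs is a dataclass whose unspecified fields fall back to defaults; the flag set and field names shown are illustrative, not a proposal from this thread:

import argparse
from sglang.srt.server_args import ServerArgs

def add_prefill_server_args(parser: argparse.ArgumentParser) -> None:
    # Expose only the knobs relevant to the prefill-only target forward pass,
    # instead of adding every SGLang CLI flag to the training script.
    group = parser.add_argument_group("sglang target model")
    group.add_argument("--target-tp-size", type=int, default=1)
    group.add_argument("--mem-fraction-static", type=float, default=0.8)
    group.add_argument("--context-length", type=int, default=None)

def build_server_args(args) -> ServerArgs:
    return ServerArgs(
        model_path=args.target_model_path,
        tp_size=args.target_tp_size,
        mem_fraction_static=args.mem_fraction_static,
        context_length=args.context_length,
    )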

Comment on lines +72 to +74
@dataclasses.dataclass
class Eagle3TrainerArgs:
    target_model_path: str
Collaborator

We can migrate this to specforge.arguments.

Comment on lines +88 to +93
    target_micro_batch_size: int = 8
    draft_tp_size: int = 1
    draft_dp_size: int = 1
    draft_global_batch_size: int = 16
    draft_micro_batch_size: int = 1
    draft_accumulation_steps: int = 1
Collaborator

Why do we need both micro-batch and global-batch sizes?
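
For reference, the conventional relationship between these knobs under a standard data-parallel gradient-accumulation setup is sketched below; whether this PR follows exactly this convention is an assumption:

def expected_global_batch_size(micro_batch_size, dp_size, accumulation_steps):
    # One optimizer step consumes dp_size ranks * accumulation_steps micro-steps
    # * micro_batch_size samples per micro-step.
    return micro_batch_size * dp_size * accumulation_steps

# e.g. draft_global_batch_size = 16 with micro_batch_size = 1 and dp_size = 1
# would imply 16 accumulation steps per draft optimizer step.
print(expected_global_batch_size(1, 1, 16))  # -> 16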

zyksir closed this on Nov 10, 2025
zyksir deleted the feature/refactor branch on November 10, 2025 at 19:41