
Conversation

@jiapingW (Collaborator) commented Jan 21, 2026

Motivation

This PR reduces VRAM usage for 64k-context training with sp=8, decreasing it from 94 GB to 76 GB per GPU. It also contains PR #429.

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @jiapingW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the efficiency and stability of long-context training for the eagle3 model, particularly for large models like Qwen3-30B-A3B. It introduces memory optimizations that reduce GPU memory (VRAM) usage during forward passes, and it makes data loading more robust by handling diverse conversation data structures gracefully. Together, these changes enable more reliable and scalable training with extended sequence lengths.

Highlights

  • VRAM Optimization: Implemented several memory optimizations, including explicit GPU cache clearing (torch.cuda.empty_cache()) after intermediate tensors are consumed and deferring the GPU transfer of large tensors such as data['target'], to significantly reduce peak GPU memory consumption during long-context training.
  • Robust Dataset Loading: Introduced a new safe_conversations_generator utility to make dataset loading more robust. The generator converts complex data types (lists and dictionaries) within conversation messages to JSON strings, preventing mixed-type conflicts and ensuring compatibility with data processing frameworks like Arrow (a minimal sketch of this idea follows this list).
  • Long Context Tokenization: Updated the tokenization logic to pass truncation=True and max_length, ensuring very long input sequences are handled correctly during processing.
  • New Training Example and Chat Template: Added a new example script (examples/repo-wiki.sh) demonstrating offline training of the eagle3 model for Qwen3-30B-A3B with long contexts (65536 tokens). A corresponding repo-wiki chat template has also been registered to support this new training configuration.
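
As a concrete illustration of the dataset-loading change referenced above, a generator along these lines would do the job. This is only a minimal sketch based on the description in this summary; the function name safe_conversations_generator comes from the PR, but the field names and exact behaviour are assumptions:

import json

def safe_conversations_generator(dataset):
    # Yield conversations with list/dict message values serialized to JSON strings,
    # so Arrow never sees mixed-type columns. Field names here are assumptions.
    for row in dataset:
        safe_messages = []
        for message in row["conversations"]:
            safe_message = {}
            for key, value in message.items():
                if isinstance(value, (list, dict)):
                    # Complex payloads (e.g. tool calls) become JSON strings.
                    safe_message[key] = json.dumps(value, ensure_ascii=False)
                else:
                    safe_message[key] = value
            safe_messages.append(safe_message)
        yield {"conversations": safe_messages}

A generator like this can be streamed (e.g. via datasets.Dataset.from_generator) instead of materializing the whole dataset in memory at once.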


@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request introduces several optimizations for long-context training, primarily focused on reducing GPU memory (VRAM) usage. The key changes include switching to a generator-based dataset-loading approach to avoid loading entire datasets into memory, and reordering operations so that preprocessing happens on the CPU before large tensors are moved to the GPU. These are solid strategies for memory optimization. My review includes suggestions to improve the clarity of a new example script, remove leftover debugging code, and reconsider the use of torch.cuda.empty_cache(), which can impact performance.

Comment on lines 14 to 15
TARGET_MODEL_PATH=/disk3/wjp/pretrained_models/Qwen3-Coder-30B-A3B-Instruct
TRAIN_DATA_PATH=/disk3/wjp/datasets/repowiki/data_for_SpecForge_test.jsonl

Severity: medium

The paths for TARGET_MODEL_PATH and TRAIN_DATA_PATH are hardcoded. This makes the script difficult for others to use without modification. Consider using environment variables with default placeholder values to make the script more portable.

Suggested change
- TARGET_MODEL_PATH=/disk3/wjp/pretrained_models/Qwen3-Coder-30B-A3B-Instruct
- TRAIN_DATA_PATH=/disk3/wjp/datasets/repowiki/data_for_SpecForge_test.jsonl
+ TARGET_MODEL_PATH=${TARGET_MODEL_PATH:-"/path/to/your/Qwen3-Coder-30B-A3B-Instruct"}
+ TRAIN_DATA_PATH=${TRAIN_DATA_PATH:-"/path/to/your/data_for_SpecForge_test.jsonl"}

Comment on lines 40 to 41
LOR_INTERNAL=200
SAVE_INTERNAL=10

Severity: medium

The variable names LOR_INTERNAL and SAVE_INTERNAL are confusing as they don't clearly represent their purpose. LOR_INTERNAL is used for --save-interval and SAVE_INTERNAL for --log-interval.

To improve readability and maintainability, I suggest renaming them to SAVE_INTERVAL and LOG_INTERVAL respectively. You'll also need to update their usage on lines 62 and 63.

Suggested change
- LOR_INTERNAL=200
- SAVE_INTERNAL=10
+ SAVE_INTERVAL=200
+ LOG_INTERVAL=10

Comment on lines 62 to 63
--save-interval $LOR_INTERNAL \
--log-interval $SAVE_INTERNAL \

Severity: medium

Following the renaming of variables for clarity, please update the usage here.

Suggested change
- --save-interval $LOR_INTERNAL \
- --log-interval $SAVE_INTERNAL \
+ --save-interval $SAVE_INTERVAL \
+ --log-interval $LOG_INTERVAL \

target_model: Optional[Eagle3TargetModel] = None,
is_online: bool = True,
) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
print(data["input_ids"].shape)

Severity: medium

This print statement appears to be a leftover from debugging. It should be removed to avoid cluttering the logs.

length=self.length,
)
del target
torch.cuda.empty_cache()

Severity: medium

Explicitly calling torch.cuda.empty_cache() can introduce significant performance overhead due to CPU-GPU synchronization. The preceding del target should be sufficient to free the tensor's memory if there are no other references. Is this call strictly necessary for memory optimization in this case? If so, a comment explaining why would be helpful for future maintenance.
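
One hedged way to act on this, sketched below, is to keep the del unconditionally but gate the synchronizing empty_cache() call behind an opt-in switch and document why it exists. The environment-variable name is invented for illustration and is not part of the PR; target comes from the surrounding training step shown in the diff above:

import os
import torch

# Hypothetical opt-in: torch.cuda.empty_cache() returns cached blocks to the driver
# and synchronizes, so only pay that cost when long-context runs would otherwise OOM.
AGGRESSIVE_CACHE_CLEAR = os.environ.get("SPECFORGE_EMPTY_CACHE", "0") == "1"

del target  # drop the last reference so the caching allocator can reuse the block
if AGGRESSIVE_CACHE_CLEAR:
    torch.cuda.empty_cache()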

Comment on lines 211 to 212
# from .forkedpdb import ForkedPdb
# ForkedPdb().set_trace()

Severity: medium

These commented-out lines appear to be leftover debugging code and should be removed.

# ForkedPdb().set_trace()
logits = gather_outputs_and_unpad(logits_, gather_dim=1)
del logits_
torch.cuda.empty_cache()

Severity: medium

Similar to my other comment, this empty_cache() call can impact performance and may be redundant after del logits_. Is this explicit cache clearing essential here to prevent out-of-memory errors? If so, please add a comment explaining why.

A collaborator commented:

Same question as Gemini: how much impact will this have on training performance?

@jiapingW (Author) replied:

I've deleted it, and the impact on training time is minimal based on the current results.

@jiapingW changed the title from "[Draft] Long Context Training DRAM Optimization" to "[Draft] Long Context Training VRAM Optimization" on Jan 21, 2026
# ForkedPdb().set_trace()
logits = gather_outputs_and_unpad(logits_, gather_dim=1)
del logits_
torch.cuda.empty_cache()
A collaborator commented:

Same question as Gemini: how much impact will this have on training performance?

Comment on lines +604 to +608
input_ids = input_ids.cuda()
target = target_model(
target.cuda()
) # The `data['target']` value occupies a large amount of GPU memory, with a shape of [seqlen, vocab_size]. It needs to be processed before being loaded into the GPU.
loss_mask = loss_mask.cuda()
A collaborator commented:

What is the impact of this on performance? If it is large, maybe we can add a flag to control whether this is done on the GPU or the CPU.

@jiapingW (Author) replied:

We can compute this simply: vocab_size (150,000) × seq_length (64k) costs about 10 GB more.
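
For reference, a back-of-the-envelope check of that figure (my own arithmetic, not from the PR; the per-element byte count depends on which dtype is actually held in memory):

# Rough size of the [seqlen, vocab_size] target tensor discussed above.
vocab_size, seq_len = 150_000, 64 * 1024
elements = vocab_size * seq_len        # ≈ 9.83e9 logits
print(elements / 1e9)                  # ≈ 9.8  -> GB at 1 byte per element
print(elements * 2 / 1e9)              # ≈ 19.7 -> GB in bf16/fp16 (2 bytes per element)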

@jiapingW (Author) added:

The reason is that target_head's preprocess function applies padding, which generates an extra copy of the target tensor in memory.

A collaborator commented:

We can split the hidden state in the dataset's __getitem__ for USP to reduce memory use.

@jiapingW (Author) replied:

> We can split the hidden state in the dataset's __getitem__ for USP to reduce memory use.

Yes, I think this is a better optimization method. Can you help add this optimization?
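
For readers following the thread, the proposed optimization could look roughly like the sketch below. Everything here is assumed for illustration (the class name, the hidden_state field, and the sp_rank/sp_size attributes are not taken from the repository): slice each sample's cached hidden state to this sequence-parallel rank's shard inside __getitem__, so the full [seqlen, hidden_size] tensor never has to be materialized on every GPU.

import torch
from torch.utils.data import Dataset

class ShardedHiddenStateDataset(Dataset):
    # Sketch only: per-rank sequence slicing of offline hidden states for USP.
    def __init__(self, samples, sp_rank, sp_size):
        self.samples = samples      # each sample holds a [seqlen, hidden_size] tensor
        self.sp_rank = sp_rank
        self.sp_size = sp_size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = dict(self.samples[idx])
        hidden = item["hidden_state"]             # [seqlen, hidden_size]
        shard = hidden.shape[0] // self.sp_size   # assumes seqlen % sp_size == 0
        start = self.sp_rank * shard
        item["hidden_state"] = hidden[start:start + shard]
        return item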
