
Conversation

@Ximingwang-09
Contributor

Motivation

Compared to EAGLE3, DFlash needs more epochs to converge, so offline training can greatly improve training efficiency. This PR adds offline training support for DFlash, enabling the DFlash draft model to be trained from pre-computed hidden states. This eliminates the need to load the full target model during training, significantly reducing GPU memory requirements and training costs.

Key benefits:

  • Memory Efficiency: No need to load the target model during training (only draft model + embeddings/lm_head needed)

  • Decoupled Pipeline: Hidden states generation can be done separately from training, enabling better resource utilization

  • Consistent with Online Mode: The offline mode uses the same data preprocessing logic as online training

Modifications

1. Hidden States Generation (scripts/prepare_hidden_states.py)

  • Added DFlashHiddenStatesGenerator class for generating DFlash-specific hidden states

    • Captures hidden states from target layers based on draft model configuration

    • Supports filtering out samples with insufficient loss tokens (fewer than 2 * block_size; see the sketch after this list)

  • Added build_dflash_target_model() function to build DFlash target model with layer capture configuration

  • Added DFlash-specific CLI arguments: --model-type dflash, --num-draft-layers, --target-layers, --block-size
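
The filtering rule itself is small enough to sketch. The snippet below is illustrative rather than the actual code in scripts/prepare_hidden_states.py:

```python
import torch

# Illustrative sketch of the filtering rule above; the real implementation
# in scripts/prepare_hidden_states.py may differ in detail.
def has_enough_loss_tokens(loss_mask: torch.Tensor, block_size: int) -> bool:
    # loss_mask holds 1 for tokens that contribute to the loss, 0 otherwise.
    return int(loss_mask.sum().item()) >= 2 * block_size

# Example: with block_size=4, a sample needs at least 8 loss tokens to be kept.
```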

2. Offline Dataset Support (specforge/data/preprocessing.py)

  • Added OfflineDFlashDataset class for loading pre-computed hidden states (a simplified sketch follows this list)
    • Minimal preprocessing to maintain consistency with online training (block-size truncation is handled in the forward pass)
  • Added build_offline_dflash_dataset() factory function
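
For illustration, a minimal offline dataset could look like the sketch below. The on-disk layout (one .pt file per sample containing input_ids, hidden_state, and loss_mask) is an assumption made for this example, not necessarily the format OfflineDFlashDataset uses:

```python
import os
import torch
from torch.utils.data import Dataset

class OfflineHiddenStatesDataset(Dataset):
    """Hypothetical stand-in for OfflineDFlashDataset: loads pre-computed
    hidden states with minimal preprocessing, leaving block-size truncation
    to the forward pass as in online mode."""

    def __init__(self, data_dir: str):
        self.files = sorted(
            os.path.join(data_dir, name)
            for name in os.listdir(data_dir)
            if name.endswith(".pt")
        )

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> dict:
        sample = torch.load(self.files[idx])
        return {
            "input_ids": sample["input_ids"],
            "hidden_state": sample["hidden_state"],
            "loss_mask": sample["loss_mask"],
        }
```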

3. Training Script Updates (scripts/train_dflash.py)

  • Added automatic mode detection: online (from conversation data) vs. offline (from pre-computed hidden states); see the sketch after this list
  • Added --train-hidden-states-path and --eval-hidden-states-path arguments for offline mode
  • Added loss mask filtering for online mode (filters samples with loss_mask.sum() < 2 * block_size)
  • Refactored build_target_model() to skip loading the target model in offline mode
  • Refactored build_dataloader() to handle both online and offline datasets
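
As a rough sketch, the mode detection reduces to checking whether a hidden-states path was supplied. The argument name mirrors the new --train-hidden-states-path flag, but the helper itself is illustrative:

```python
import argparse

def detect_training_mode(args: argparse.Namespace) -> str:
    # Offline mode: train from pre-computed hidden states, so the full target
    # model never has to be loaded. Online mode: compute hidden states from
    # conversation data during training.
    if getattr(args, "train_hidden_states_path", None):
        return "offline"
    return "online"
```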

4. Data Collator Improvements (specforge/data/utils.py)

  • Added requires_target parameter to DataCollatorWithPadding for flexible field handling (simplified sketch after this list)
    • DFlash: requires_target=False (only needs hidden_state)
    • Eagle3: requires_target=True (needs both hidden_state and target)
  • Updated prepare_dp_dataloaders() to pass requires_target parameter
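
A simplified view of what the requires_target flag gates is shown below; padding and tensor conversion are omitted, so this is not the actual DataCollatorWithPadding implementation:

```python
from dataclasses import dataclass

@dataclass
class SimpleCollator:  # illustrative stand-in for DataCollatorWithPadding
    requires_target: bool = True  # Eagle3: True, DFlash: False

    def __call__(self, features: list[dict]) -> dict:
        batch = {
            "input_ids": [f["input_ids"] for f in features],
            "hidden_state": [f["hidden_state"] for f in features],
            "loss_mask": [f["loss_mask"] for f in features],
        }
        if self.requires_target:
            # Eagle3 also carries the target field; DFlash skips it.
            batch["target"] = [f["target"] for f in features]
        return batch
```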

Related Issues

Accuracy Test


Benchmark & Profiling

For Qwen3-8B:

  • Offline training: 1.72 s/step
  • Online training: 3.85 s/step

Checklist


@sleepcoo
Collaborator

The implementation of DFlashHiddenStatesGenerator seems overly complex. My understanding is that the only difference from how Eagle3 retrieves hidden states is that we need to filter out unnecessary ones based on block size. Could we basically reuse the original hidden state logic? As for the layer differences, couldn't those be handled via configuration?

@Ximingwang-09
Contributor Author

> The implementation of DFlashHiddenStatesGenerator seems overly complex. My understanding is that the only difference from how Eagle3 retrieves hidden states is that we need to filter out unnecessary ones based on block size. Could we basically reuse the original hidden state logic? As for the layer differences, couldn't those be handled via configuration?

Thanks for the suggestion. I’ve refactored the code accordingly:

  • Removed the DFlashHiddenStatesGenerator class entirely.
  • Unified the logic in HiddenStatesGenerator, which now supports both Eagle3 and DFlash via configuration (rough sketch below).
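
Conceptually, the unification amounts to configuration-driven branching along the lines of the sketch below; the attribute names are illustrative and not the actual specforge config fields:

```python
def select_capture_layers(model_type: str, config) -> list[int]:
    # Illustrative only: the unified HiddenStatesGenerator picks which target
    # layers to capture based on configuration instead of subclassing.
    if model_type == "dflash":
        return list(config.target_layers)  # e.g. from --target-layers
    return list(config.eagle3_layers)      # hypothetical Eagle3 layer config
```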
