Multi-layer Decoder for Llama3 Draft Model #401
Conversation
Summary of Changes: Hello @xiaoxi-s, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the flexibility of the Llama3 draft decoder by enabling the configuration of multiple Multi-Layer Perceptrons (MLPs) within each decoder layer. Previously limited to a single MLP, the model can now be instantiated with a user-defined number of MLPs, allowing for more complex and potentially more powerful draft model architectures. The changes include adding a new command-line option for specifying the number of MLPs, updating the core model classes to handle this configuration, and ensuring comprehensive test coverage for the new functionality.
Code Review
This pull request successfully adds support for a multi-layer MLP in the draft decoder, with corresponding changes to the training script and tests. The implementation is consistent and the new functionality is well-tested. I have provided a couple of suggestions to improve the architecture of the multi-layer MLP for better training stability by incorporating residual connections for each MLP block. Additionally, I've pointed out a minor indentation issue in one of the test files.
specforge/modeling/draft/llama3_eagle.py (1252)
Using nn.Sequential here creates a deep MLP stack without any intermediate normalization or residual connections. This can lead to training instability, especially if num_draft_hidden_layers is large. A more standard and stable approach is to use nn.ModuleList to treat each MLP as a separate block. I'll add a related suggestion on the forward method to complete this change.
self.mlps = nn.ModuleList([LlamaMLP(config) for _ in range(num_draft_hidden_layers)])
specforge/modeling/draft/llama3_eagle.py (1308-1311)
To accompany the change to nn.ModuleList for self.mlps, the forward pass should be updated to loop through the MLPs. This applies a residual connection around each MLP block, which is a more standard and stable architecture than a single deep MLP. This implementation reuses the same post_attention_layernorm for each block.
for mlp in self.mlps:
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = mlp(hidden_states)
    hidden_states = residual + hidden_states
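For readers outside the diff context, here is a minimal, self-contained sketch of how the two suggestions above fit together. The class name MultiMLPBlock and the default num_draft_hidden_layers value are illustrative assumptions, not part of the SpecForge source.

import torch
from torch import nn
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm

class MultiMLPBlock(nn.Module):
    """Illustrative block: an nn.ModuleList of MLPs, each wrapped in
    pre-normalization and a residual connection, reusing one layernorm."""

    def __init__(self, config: LlamaConfig, num_draft_hidden_layers: int = 2):
        super().__init__()
        self.mlps = nn.ModuleList(
            [LlamaMLP(config) for _ in range(num_draft_hidden_layers)]
        )
        # The suggestion reuses a single post-attention norm for every block.
        self.post_attention_layernorm = LlamaRMSNorm(
            config.hidden_size, eps=config.rms_norm_eps
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for mlp in self.mlps:
            residual = hidden_states
            hidden_states = self.post_attention_layernorm(hidden_states)
            hidden_states = mlp(hidden_states)
            hidden_states = residual + hidden_states
        return hidden_states

# Quick shape check with a toy config.
config = LlamaConfig(hidden_size=256, intermediate_size=512)
block = MultiMLPBlock(config, num_draft_hidden_layers=3)
print(block(torch.randn(1, 8, 256)).shape)  # torch.Size([1, 8, 256])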
tests/test_modeling/test_draft/test_llama3.py (63)
This line has incorrect indentation. According to the PEP 8 style guide, indentation should be in multiples of 4 spaces. The line inside the for loop should be indented by 4 more spaces relative to the for statement.
self.assertIsInstance(model.midlayer.mlps[i], LlamaMLP)
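For clarity, a hedged sketch of the corrected indentation; the enclosing for statement is assumed here, since the review excerpt only shows the inner line.

for i in range(config.num_hidden_layers):  # assumed loop variable and bound
    self.assertIsInstance(model.midlayer.mlps[i], LlamaMLP)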
Dogacel left a comment
I guess for a true multi-layer architecture, we should duplicate the decoder, not the MLP. Why did you want to duplicate the MLP instead?
scripts/train_eagle3.py
Outdated
choices=["sglang", "hf", "custom"],
help="The backend of the target model",
)
model_group.add_argument(
This variable can already be set using the model config:
SpecForge/configs/llama3-8B-eagle3.json
Line 15 in 886ab9c
"num_hidden_layers": 1,
which allows us to save & load models directly with the right number of hidden layers.
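For illustration, a small sketch of reading that value with the Hugging Face config API; the file path is an assumption for the example, not a verified repo path.

from transformers import LlamaConfig

config = LlamaConfig.from_json_file("configs/llama3-8B-eagle3.json")  # illustrative path
print(config.num_hidden_layers)  # 1 in the current config; set >1 for a multi-layer draft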
@sleepcoo LGTM
Your implementation does input_emb & hidden_states fusion at every layer. May I ask why you don't just fuse them in the first layer and then let the other layers handle hidden_states only?
I believe the pattern of iterating only on hidden_states in later layers mostly exists in traditional transformer architectures. For the decoder layers used in the draft model, this implementation keeps the initial token info available across all decoder layers for better speculative decoding, instead of only next-token generation.
I agree; I think the input embedding should only be injected in the first layer. The original EAGLE has only one decoder layer, so it didn't matter where the injection happened. Injecting the same data into each layer would result in excessive computation. However, changing this means you have to update some code to make sure your input/output shapes match: for example, attention takes 2 * hidden_size and outputs hidden_size. I think it would be nice to be able to configure whether you map from 2 * hidden_size -> hidden_size and save compute on deeper layers, or keep the full 2 * hidden_size shape and be more expressive.
SpecForge/specforge/modeling/draft/llama3_eagle.py
Lines 527 to 529 in e1e9695
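As a rough illustration of the shape trade-off described above (all sizes and module names are toy assumptions, not the SpecForge implementation):

import torch
from torch import nn

hidden_size = 512  # toy value

# Fuse-every-layer: each layer projects [input_emb ; hidden_states] from
# 2 * hidden_size back to hidden_size, repeating the wider matmul per layer.
fuse_proj = nn.Linear(2 * hidden_size, hidden_size)

# Fuse-first-layer-only: only layer 0 pays for the 2 * hidden_size projection;
# deeper layers operate on hidden_size alone.
deep_proj = nn.Linear(hidden_size, hidden_size)

input_emb = torch.randn(1, 8, hidden_size)
hidden_states = torch.randn(1, 8, hidden_size)

fused = fuse_proj(torch.cat([input_emb, hidden_states], dim=-1))  # (1, 8, hidden_size)
deeper = deep_proj(fused)                                          # later layers skip re-fusion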
@xiaoxi-s Hi, xiaoxi. It seems fusing them only in the first layer could be more efficient and less redundant. Have you experimented with or compared both approaches? If not, could we test the alternative to verify performance? Appreciate your input.
Working on the experiments; will share once they are done. #449 contains the implementation of the more traditional decoder layer as the additional layers. I will share their benchmarks on ShareGPT after the experiment finishes.
Motivation
Support multiple Llama3DecoderLayers in LlamaForCausalLMEagle3.
Modifications
A LlamaForCausalLMEagle3 instance can now hold an arbitrary number (>= 1) of Llama3DecoderLayers, configured via the num_hidden_layers option of LlamaConfig. The test for the new config is also updated in tests/test_modeling/test_draft/test_llama3.py.
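A hypothetical usage sketch of the new behavior; the exact constructor signature of LlamaForCausalLMEagle3 and the config path are assumptions here, not taken from the repository.

from transformers import LlamaConfig
from specforge.modeling.draft.llama3_eagle import LlamaForCausalLMEagle3  # import path per this PR's diff

# Assumption: the draft model is constructed directly from a LlamaConfig.
config = LlamaConfig.from_json_file("configs/llama3-8B-eagle3.json")  # illustrative path
config.num_hidden_layers = 2  # draft model now holds two Llama3DecoderLayers
model = LlamaForCausalLMEagle3(config)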
Related Issues
Sub-issue of #374
Accuracy Test
Tested under tests/test_modeling/test_draft/test_llama3.py.
Benchmark & Profiling
N/A
Checklist