vrathi101/deepseek-v2-architecture

A rough implementation of the DeepSeek-V2 paper, including Multi-Head Latent Attention (MLA) with decoupled RoPE, and the DeepSeekMoE FFN replacement with shared + routed experts (using top-k routing).
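
To make the MLA piece concrete, here is a minimal sketch of MLA with decoupled RoPE, assuming PyTorch. The dimension names (`d_latent`, `d_rope`, `d_head`) and the exact layer layout are my assumptions, not necessarily this repo's or the paper's hyperparameters. The key idea from the paper: keys and values are up-projected from a small shared latent `c_kv` (which, along with the shared RoPE key, is all that needs caching at inference), while position information travels through a separate RoPE'd query/key part that gets concatenated onto the content part.

```python
# Minimal MLA-with-decoupled-RoPE sketch (assumed shapes/names, not the repo's exact code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary position embedding over the last dim of x: (B, H, T, D), D even."""
    B, H, T, D = x.shape
    half = D // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(T, device=x.device)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=128, d_rope=32, d_head=64):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_head, d_rope
        # Down-project hidden states into a small shared KV latent (the "latent" in MLA).
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent into per-head keys (content part) and values.
        self.w_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)
        # Decoupled RoPE key: one extra small head carrying position info, shared across heads.
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)
        # Queries: per-head content part plus a per-head RoPE part.
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_qr = nn.Linear(d_model, n_heads * d_rope, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, T, _ = h.shape
        H, Dh, Dr = self.n_heads, self.d_head, self.d_rope
        c_kv = self.w_dkv(h)                                      # (B, T, d_latent): the cacheable latent
        k_c = self.w_uk(c_kv).view(B, T, H, Dh).transpose(1, 2)   # (B, H, T, Dh)
        v   = self.w_uv(c_kv).view(B, T, H, Dh).transpose(1, 2)
        k_r = apply_rope(self.w_kr(h).view(B, T, 1, Dr).transpose(1, 2)).expand(B, H, T, Dr)
        q_c = self.w_q(h).view(B, T, H, Dh).transpose(1, 2)
        q_r = apply_rope(self.w_qr(h).view(B, T, H, Dr).transpose(1, 2))
        # Scores combine the content (latent-derived) part with the decoupled RoPE part.
        q = torch.cat([q_c, q_r], dim=-1)
        k = torch.cat([k_c, k_r], dim=-1)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(Dh + Dr)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        att = att.masked_fill(mask, float("-inf"))                # causal mask
        out = F.softmax(att, dim=-1) @ v                          # (B, H, T, Dh)
        return self.w_o(out.transpose(1, 2).reshape(B, T, H * Dh))
```

The memory win comes from caching only `c_kv` (and the single shared `k_r`) instead of full per-head keys and values; decoupling RoPE into its own small dimension is what makes that low-rank caching compatible with rotary positions.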

One thing I didn't do (saved for later): the paper adds an auxiliary loss, on top of the usual cross-entropy loss, to stop the MoE router from sending most tokens to the same few experts. It essentially acts as a load balancer; a sketch follows below.
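
For reference, here is a hedged sketch of what such a balance loss typically looks like, assuming PyTorch and the common f_i · p_i formulation (fraction of tokens routed to expert i times the mean router probability for expert i, as in DeepSeekMoE-style expert-level balancing). The coefficient `alpha`, the function name, and the exact normalization are my assumptions, not the paper's precise coefficients; shared experts would bypass this router entirely.

```python
# Hypothetical top-k routing with an expert-level balance loss (not implemented in this repo).
import torch
import torch.nn.functional as F

def route_with_balance_loss(h: torch.Tensor, router: torch.nn.Linear,
                            top_k: int = 2, alpha: float = 0.01):
    """h: (num_tokens, d_model); router maps d_model -> n_experts."""
    logits = router(h)                                   # (T, E)
    probs = F.softmax(logits, dim=-1)                    # per-token router probabilities
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)     # chosen experts per token
    n_tokens, n_experts = probs.shape
    # f_i: fraction of (token, slot) assignments that landed on expert i.
    counts = torch.zeros(n_experts, device=h.device)
    counts.scatter_add_(0, topk_idx.flatten(),
                        torch.ones(n_tokens * top_k, device=h.device))
    f = counts / (n_tokens * top_k)
    # p_i: mean router probability mass on expert i.
    p = probs.mean(dim=0)
    # Minimized when load is uniform; collapsing onto a few experts inflates both f_i and p_i.
    balance_loss = alpha * n_experts * torch.sum(f * p)
    return topk_probs, topk_idx, balance_loss
```

This loss would simply be added to the cross-entropy objective during training; because routing counts themselves aren't differentiable, the gradient flows through the `p_i` term, nudging the router's probabilities toward a uniform load.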
