Implementation of the DeepSeek V3 architecture from scratch, covering modern transformer components: Multi-Head Latent Attention (MLA) with decoupled rotary positional embeddings, Mixture of Experts (MoE), and Multi-Token Prediction (MTP). It also explains the earlier attention-architecture innovations that preceded MLA: MHA, MQA, and GQA.
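
As a rough illustration of one of these components, below is a minimal sketch of top-k Mixture-of-Experts routing in PyTorch. The names (`MoELayer`, `n_experts`, `top_k`) are illustrative assumptions rather than this repo's actual API, and the gating shown is generic softmax top-k routing, not DeepSeek V3's exact gating scheme (which adds refinements such as shared experts and auxiliary-loss-free load balancing).

```python
# Minimal top-k MoE routing sketch (illustrative; not this repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        scores = self.router(tokens)                       # (n_tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1) # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize over the selected experts
        out = torch.zeros_like(tokens)
        # Dispatch each token only to its selected experts, weighted by the gate.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (indices == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)

# Usage: y = MoELayer(d_model=512, d_hidden=2048)(torch.randn(2, 16, 512))
```

The key design point this sketch captures is sparsity: every token activates only `top_k` of the `n_experts` feed-forward networks, so parameter count grows with the number of experts while per-token compute stays roughly constant.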