PyTorch implementation of a relaxed recursive transformer architecture. Reduces GPT-2's parameter count by 40% (from 1.5 billion to 84 million), uptrained on 20 billion tokens of OpenWebText2. Achieves perplexity on par with GPT-2 and a distilled GPT-2 on the WikiText-103 dataset.
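The model itself lives in this repository, but as a rough illustration of the core idea, here is a minimal sketch of a relaxed recursive block in PyTorch: one transformer block whose weights are shared across several loop iterations, with small per-iteration LoRA adapters "relaxing" the strict weight tying. All names, hyperparameters (`d_model`, `n_loops`, `rank`), and the choice to place the LoRA adapters on the MLP projections are illustrative assumptions, not this repo's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A linear layer with shared (tied) weights plus a small per-copy
    low-rank update -- the 'relaxation' in a relaxed recursive model."""
    def __init__(self, shared: nn.Linear, rank: int = 8):
        super().__init__()
        self.shared = shared  # weights tied across all recursion steps
        self.A = nn.Parameter(torch.empty(shared.in_features, rank))
        self.B = nn.Parameter(torch.zeros(rank, shared.out_features))
        # B starts at zero, so each relaxed copy initially matches the
        # fully tied block exactly.
        nn.init.normal_(self.A, std=0.02)

    def forward(self, x):
        return self.shared(x) + (x @ self.A) @ self.B


class RelaxedRecursiveBlock(nn.Module):
    """One shared transformer block applied `n_loops` times, with a
    distinct LoRA adapter pair per loop. Hyperparameters are placeholders."""
    def __init__(self, d_model=768, n_heads=12, n_loops=3, rank=8):
        super().__init__()
        self.n_loops = n_loops
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Shared MLP weights, reused on every recursion step.
        self.fc_in = nn.Linear(d_model, 4 * d_model)
        self.fc_out = nn.Linear(4 * d_model, d_model)
        # One LoRA-relaxed wrapper per recursion step.
        self.mlp_in = nn.ModuleList(LoRALinear(self.fc_in, rank) for _ in range(n_loops))
        self.mlp_out = nn.ModuleList(LoRALinear(self.fc_out, rank) for _ in range(n_loops))

    def forward(self, x):
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        for i in range(self.n_loops):
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=causal)
            x = x + attn_out
            h = self.ln2(x)
            x = x + self.mlp_out[i](F.gelu(self.mlp_in[i](h)))
        return x
```

Because the shared weights dominate the parameter count and the rank-`r` adapters add little, looping a tied block in place of distinct layers is what shrinks the model; uptraining on the 20B-token corpus then recovers quality, with the zero-initialized `B` matrices ensuring the relaxed model starts from the tied one's behavior.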