Awesome papers on Large Language Models (LLMs), covering state-of-the-art methods in algorithms, systems, SFT, RL, multimodal LLMs, MoE, quantization, and applications (RAG, agents, coding).
- 2017 (OpenAI) (NIPS) [RLHF] Deep Reinforcement Learning from Human Preferences
- 2018 (OpenAI) (Arxiv) [GPT-1] Improving Language Understanding by Generative Pre-Training
- 2019 (OpenAI) (Arxiv) [GPT-2] Language Models are Unsupervised Multitask Learners
- 2019 (OpenAI) (Arxiv) [Sparse Transformers] Generating Long Sequences with Sparse Transformers
- 2020 (OpenAI) (Arxiv) [GPT-3] Language Models are Few-Shot Learners
- 2020 (OpenAI) (Arxiv) [Scaling laws] Scaling Laws for Neural Language Models
- 2021 (OpenAI) (Arxiv) [Code] Evaluating Large Language Models Trained on Code
- 2021 (OpenAI) (Arxiv) [DALL-E] Zero-Shot Text-to-Image Generation
- 2021 (OpenAI) (ICML) [CLIP] Learning Transferable Visual Models From Natural Language Supervision
- 2020 (OpenAI) (NIPS) Learning to summarize from human feedback
- 2022 (OpenAI) (Arxiv) [DALL-E-2] Hierarchical Text-Conditional Image Generation with CLIP Latents
- 2022 (OpenAI) (Arxiv) [InstructGPT] [RLHF] Training language models to follow instructions with human feedback
- 2022 (OpenAI) (Arxiv) [WebGPT] WebGPT - Browser-assisted question-answering with human feedback
- 2022 (OpenAI) (ICML) [GLIDE] GLIDE - Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- 2023 (Microsoft) (Arxiv) Sparks of Artificial General Intelligence - Early experiments with GPT-4
- 2023 (OpenAI) (Arxiv) [DALL-E-3] Improving Image Generation with Better Captions
- 2023 (OpenAI) (Arxiv) [GPT4] GPT-4 Technical Report
- 2023 (OpenAI) GPT-4V(ision) System Card
- 2023 (OpenAI) [Math] Let’s Verify Step by Step
- 2024 (OpenAI) GPT-4o System Card
- 2024 (OpenAI) OpenAI o1 System Card
- 2025 (OpenAI) Competitive Programming with Large Reasoning Models
- 2025 (OpenAI) Deep Research System Card
- 2025 (OpenAI) GPT-4.5 System Card
- 2025 (OpenAI) GPT-5 System Card
- 2025 (OpenAI) OpenAI o3 and o4-mini System Card
- 2025 (OpenAI) OpenAI o3-mini System Card
- 2013 (Google) (NIPS) [Word2vec] Distributed Representations of Words and Phrases and their Compositionality
- 2014 (Google) (NIPS) [Seq2Seq] Sequence to Sequence Learning with Neural Networks
- 2017 (Google) (NIPS) [Transformer] Attention Is All You Need
- 2019 (Google) (NAACL) [BERT] BERT - Pre-training of Deep Bidirectional Transformers for Language Understanding
- 2020 (Google) (ICLR) [ALBERT] ALBERT - A Lite BERT for Self-supervised Learning of Language Representations
- 2020 (Google) (JMLR) [T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- 2021 (Google) (ICLR) [ViT] An Image is Worth 16x16 Words - Transformers for Image Recognition at Scale
- 2022 (Google) (Arxiv) [PaLM] PaLM - Scaling Language Modeling with Pathways
- 2022 (Google) (Arxiv) [Retro] Improving language models by retrieving from trillions of tokens
- 2022 (Google) (JMLR) [SwitchTransformers] Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- 2022 (Google) (NIPS) [CoT] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- 2022 (Google) (TMLR) [Emergent] Emergent Abilities of Large Language Models
- 2022 (Google) (NIPS) Large Language Models are Zero-Shot Reasoners
- 2023 (DeepMind) (NIPS) [ToT] Tree of Thoughts - Deliberate Problem Solving with Large Language Models
- 2023 (Google) (Arxiv) [SIGLIP] Sigmoid Loss for Language Image Pre-Training
- 2023 (Google) (Arxiv) PaLM 2 Technical Report
- 2023 (Google) (ICLR) Self-Consistency Improves Chain of Thought Reasoning in Language Models
- 2023 (Google) (ICLR) [ReAct] ReAct - Synergizing Reasoning and Acting in Language Models
- 2023 (Google) (ICML) [PaLM-E] PaLM-E - An Embodied Multimodal Language Model
- 2024 (Google) (Arxiv) Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- 2024 (Google) (Arxiv) [Gemini 1.5] Gemini 1.5 - Unlocking multimodal understanding across millions of tokens of context
- 2024 (Google) (Arxiv) [Gemini] Gemini - A Family of Highly Capable Multimodal Models
- 2024 (Google) (Arxiv) [Gemma] Gemma - Open Models Based on Gemini Research and Technology
- 2024 (Google) (ICLR) [OPRO] Large Language Models as Optimizers
- 2025 (Google) (Arxiv) [SIGLIP2] SigLIP 2 - Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
- 2025 (Google) (Arxiv) [Gemini 2.5] Gemini 2.5 - Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- 2025 (Google) (Arxiv) [Gemma3] Gemma 3 Technical Report
- 2024.01 (DeepSeek) DeepSeek LLM - Scaling Open-Source Language Models with Longtermism
- 2024.01 (DeepSeek) [DeepSeekMoE] DeepSeekMoE - Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- 2024.03 (DeepSeek) [DeepSeek-VL] DeepSeek-VL - Towards Real-World Vision-Language Understanding
- 2024.04 (DeepSeek) [DeepSeekMath] [GRPO] DeepSeekMath - Pushing the Limits of Mathematical Reasoning in Open Language Models
- 2024.06 (DeepSeek) [DeepSeek-Coder-V2] DeepSeek-Coder-V2 - Breaking the Barrier of Closed-Source Models in Code Intelligence
- 2024.06 (DeepSeek) [DeepSeek-Coder] DeepSeek-Coder - When the Large Language Model Meets Programming - The Rise of Code Intelligence
- 2024.06 (DeepSeek) [DeepSeek-V2] DeepSeek-V2 - A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2025.01 (DeepSeek) [DeepSeek-R1] DeepSeek-R1 - Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- 2025.02 (DeepSeek) [DeepSeek-V3] DeepSeek-V3 Technical Report
- 2025.09 (DeepSeek) (Nature) [DeepSeek-R1] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
- 2025.10 (DeepSeek) [DeepSeek-OCR] DeepSeek-OCR - Contexts Optical Compression
- 2025.11 (DeepSeek) [DeepSeekMath-V2] DeepSeekMath-V2 - Towards Self-Verifiable Mathematical Reasoning
- 2025.12 (DeepSeek) Insights into DeepSeek-V3 - Scaling Challenges and Reflections on Hardware for AI Architectures
- 2025.12 (DeepSeek) [DeepSeek-V3.2] DeepSeek-V3.2 - Pushing the Frontier of Open Large Language Models
- 2026.01 (DeepSeek) DeepSeek-R1 Thoughtology - Let’s think about LLM reasoning
- 2026.01 (DeepSeek) [mHC] mHC - Manifold-Constrained Hyper-Connections
- 2026.04 (DeepSeek) [DeepSeek-V4] DeepSeek-V4 - Towards Highly Efficient Million-Token Context Intelligence
- 2023 (Alibaba) (Arxiv) [Qwen] Qwen Technical Report
- 2023 (Alibaba) (Arxiv) [Qwen-VL] Qwen-VL - A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- 2024 (Alibaba) (Arxiv) [Qwen2] Qwen2 Technical Report
- 2025 (Alibaba) (Arxiv) [Qwen2.5] Qwen2.5 Technical Report
- 2025 (Alibaba) (Arxiv) [Qwen2.5-VL] Qwen2.5-VL Technical Report
- 2025 (Alibaba) (Arxiv) [Qwen3 Embedding] Qwen3 Embedding - Advancing Text Embedding and Reranking Through Foundation Models
- 2025 (Alibaba) (Arxiv) [Qwen3-VL] Qwen3-VL Technical Report
- 2025 (Alibaba) (Arxiv) [Qwen3] Qwen3 Technical Report
- 2025 (Alibaba) (NIPS) [Gated Attention] Gated Attention for Large Language Models - Non-linearity, Sparsity, and Attention-Sink-Free
- 2020 (Meta) (NIPS) [RAG] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- 2023 (Meta) (Arxiv) [LLaMA-2] Llama 2 - Open Foundation and Fine-Tuned Chat Models
- 2023 (Meta) (Arxiv) [LLaMA] LLaMA - Open and Efficient Foundation Language Models
- 2023 (Meta) (Arxiv) [Toolformer] Toolformer - Language Models Can Teach Themselves to Use Tools
- 2024 (Arxiv) [TinyLlama] TinyLlama - An Open-Source Small Language Model
- 2024 (Meta) (Arxiv) [Code Llama] Code Llama - Open Foundation Models for Code
- 2024 (Meta) (Arxiv) [LLaMA3] The Llama 3 Herd of Models
- 2022 (Zhipu) (ACL) GLM - General Language Model Pretraining with Autoregressive Blank Infilling
- 2023 (Zhipu) (ICLR) GLM-130B - An Open Bilingual Pre-trained Model
- 2024 (Zhipu) ChatGLM - A Family of Large Language Models from GLM-130B to GLM-4 All Tools
- 2025 (Zhipu) GLM-4.5 - Agentic, Reasoning, and Coding (ARC) Foundation Models
- 2026 (Zhipu) GLM-4.5V and GLM-4.1V-Thinking - Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- 2026 (Zhipu) GLM-5 - From Vibe Coding to Agentic Engineering
- 2023 (Microsoft) (NIPS) [LLaVA] Visual Instruction Tuning
- 2023 (Mistral) (Arxiv) Mistral 7B
- 2025 (Moonshot AI) (Arxiv) Kimi Linear - An Expressive, Efficient Attention Architecture
- 2019 (EMNLP) [Sentence-BERT] Sentence-BERT - Sentence Embeddings using Siamese BERT-Networks
- 2019 (Google) (Arxiv) [MQA] Fast Transformer Decoding - One Write-Head is All You Need
- 2019 (NIPS) [RMSNorm] Root Mean Square Layer Normalization
- 2019 (OpenAI) (Arxiv) [Sparse Transformers] Generating Long Sequences with Sparse Transformers
- 2020 (Google) (Arxiv) GLU Variants Improve Transformer
- 2020 (Microsoft) (ICML) On Layer Normalization in the Transformer Architecture
- 2022 (Microsoft) [DeepNorm] DeepNet - Scaling Transformers to 1,000 Layers
- 2023 (Google) (EMNLP) [GQA] GQA - Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- 2024 (Meta) [MTP] Better & Faster Large Language Models via Multi-token Prediction
- 2025 (ICLR) [Gated DeltaNet] Gated Delta Networks - Improving Mamba2 with Delta Rule
- 2020 (Google) (ICLR) [Reformer] Reformer - The Efficient Transformer
- 2020 (Arxiv) [Longformer] Longformer - The Long-Document Transformer
- 2023 (Meta) (Arxiv) Effective Long-Context Scaling of Foundation Models
- 2023 (Arxiv) [YaRN] YaRN - Efficient Context Window Extension of Large Language Models
- 2024 (Alibaba) (ICML) [DCA] Training-Free Long-Context Scaling of Large Language Models
- 2021 (Arxiv) [RoPE] RoFormer - Enhanced Transformer with Rotary Position Embedding
- 2022 (ICLR) [ALiBi] Train Short, Test Long - Attention with Linear Biases Enables Input Length Extrapolation
- 2016 (ACL) [BPE] Neural Machine Translation of Rare Words with Subword Units
- 2018 (Google) (Arxiv) SentencePiece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- 2019 (Nvidia) (Arxiv) Megatron-LM - Training Multi-Billion Parameter Language Models Using Model Parallelism
- 2020 (Microsoft) [ZeRO] ZeRO - Memory Optimizations Toward Training Trillion Parameter Models
- 2021 (Nvidia) (Arxiv) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- 2022 (Google) (Arxiv) [KVcache] Efficiently Scaling Transformer Inference
- 2022 (Nvidia) (Arxiv) Reducing Activation Recomputation in Large Transformer Models
- 2022 (Stanford) (Arxiv) [FlashAttention] FlashAttention - Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2023 (Princeton) (Arxiv) [FlashAttention2] FlashAttention-2 - Faster Attention with Better Parallelism and Work Partitioning
- 2023 (SOSP) [vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention
- 2024 (Princeton) (Arxiv) [FlashAttention3] FlashAttention-3 - Fast and Accurate Attention with Asynchrony and Low-precision
- 2019 (Google) (ICML) [Adapter] Parameter-Efficient Transfer Learning for NLP
- 2021 (Microsoft) (Arxiv) [LoRA] LoRA - Low-Rank Adaptation of Large Language Models
- 2023 (ACL) [UltraChat] Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
- 2017 (OpenAI) (Arxiv) [PPO] Proximal Policy Optimization Algorithms
- 2022 (Anthropic) (Arxiv) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- 2023 (Stanford) (NIPS) [DPO] Direct Preference Optimization - Your Language Model is Secretly a Reward Model
- 2024 (ACL) [ORPO] ORPO - Monolithic Preference Optimization without Reference Model
- 2024 (DeepSeek) (Arxiv) [GRPO] DeepSeekMath - Pushing the Limits of Mathematical Reasoning in Open Language Models
- 2014 (ICLR) [VAE] Auto-Encoding Variational Bayes
- 2014 (NIPS) [GAN] Generative Adversarial Nets
- 2017 (NIPS) [VQ-VAE] Neural Discrete Representation Learning
- 2020 (Google) (ICLR) [ALBERT] ALBERT - A Lite BERT for Self-supervised Learning of Language Representations
- 2020 (NIPS) [Diffusion] Denoising Diffusion Probabilistic Models
- 2021 (Google) (ICLR) [ViT] An Image is Worth 16x16 Words - Transformers for Image Recognition at Scale
- 2021 (Google) (ICML) [ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- 2021 (OpenAI) (Arxiv) [DALL-E] Zero-Shot Text-to-Image Generation
- 2021 (OpenAI) (ICML) [CLIP] Learning Transferable Visual Models From Natural Language Supervision
- 2022 (CVPR) [Stable Diffusion] High-Resolution Image Synthesis with Latent Diffusion Models
- 2022 (OpenAI) (Arxiv) [DALL-E-2] Hierarchical Text-Conditional Image Generation with CLIP Latents
- 2022 (Salesforce) (ICML) [BLIP] BLIP - Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- 2023 (Alibaba) (Arxiv) [Qwen-VL] Qwen-VL - A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- 2023 (Google) (Arxiv) [SIGLIP] Sigmoid Loss for Language Image Pre-Training
- 2023 (Microsoft) (NIPS) [LLaVA] Visual Instruction Tuning
- 2023 (OpenAI) (Arxiv) [DALL-E-3] Improving Image Generation with Better Captions
- 2023 (OpenAI) GPT-4V(ision) System Card
- 2023 (Salesforce) (ICML) [BLIP-2] BLIP-2 - Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- 2024 (Google) (Arxiv) [Gemini] Gemini - A Family of Highly Capable Multimodal Models
- 2024 (Google) (Arxiv) [Gemma] Gemma - Open Models Based on Gemini Research and Technology
- 2025 (Alibaba) (Arxiv) [Qwen2.5-VL] Qwen2.5-VL Technical Report
- 2025 (Alibaba) (Arxiv) [Qwen3-VL] Qwen3-VL Technical Report
- 2025 (Google) (Arxiv) [SIGLIP2] SigLIP 2 - Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
- 2017 (Google) (ICLR) [Sparsely-Gated MoE] Outrageously Large Neural Networks - The Sparsely-Gated Mixture-of-Experts Layer
- 2018 (Google) (KDD) [MMoE] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
- 2022 (Arxiv) MegaBlocks - Efficient Sparse Training with Mixture-of-Experts
- 2022 (Google) (JMLR) [SwitchTransformers] Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- 2022 (Meta) (EMNLP) Efficient Large Scale Language Modeling with Mixtures of Experts
- 2023 (Google) (ICLR) Sparse Upcycling - Training Mixture-of-Experts from Dense Checkpoints
- 2024 (Mistral) (Arxiv) [Mixtral] Mixtral of Experts
- 2024 (DeepSeek) (ACL) [DeepSeekMoE] DeepSeekMoE - Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- 2024 (Google) (ICLR) [SoftMoE] From Sparse to Soft Mixtures of Experts
- 2025 (ICLR) [ReMoE] ReMoE - Fully Differentiable Mixture-of-Experts with ReLU Routing
- 2020 (Meta) (NIPS) [RAG] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- 2023 (Google) (TMLR) [PoT] Program of Thoughts Prompting - Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- 2023 (Meta) (Arxiv) [Toolformer] Toolformer - Language Models Can Teach Themselves to Use Tools
- 2023 (ICML) [PAL] PAL - Program-aided Language Models
- 2024 (Arxiv) [TinyLlama] TinyLlama - An Open-Source Small Language Model
- 2024 (Meta) (Arxiv) [Code Llama] Code Llama - Open Foundation Models for Code
- 2024 (Microsoft) (ICLR) [ToRA] ToRA - A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
- 2024 (Princeton) (ICLR) [Llemma] Llemma - An Open Language Model For Mathematics
- 2002 (ACL) BLEU - a Method for Automatic Evaluation of Machine Translation
- 2004 (ACL) ROUGE - A Package for Automatic Evaluation of Summaries
- 2019 (DeepMind) (ICLR) [GLUE] GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- 2021 (OpenAI) (Arxiv) Evaluating Large Language Models Trained on Code
- 2022 (OpenAI) (ACL) [TruthfulQA] TruthfulQA - Measuring How Models Mimic Human Falsehoods
- 2023 (NIPS) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- 2024 (Arxiv) Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference