AgentsMeetRL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- 🤖 The criterion for including an agent project is that it must feature at least one of the following: multi-turn interaction or tool use (so Tool-Integrated Reasoning, TIR, projects are in scope for this repo).
- ⚠️ This project is based on code analysis of open-source repositories using LLM coding agents, which may produce unfaithful entries. Although manually reviewed, omissions may remain. If you find any errors, please don't hesitate to let us know through issues or PRs - we warmly welcome them!
- 🚀 We particularly focus on the reinforcement learning frameworks, RL algorithms, rewards, and environments that each project depends on, as a reference for how these excellent open-source projects make their technical choices. See [Click to view technical details] under each table.
- 📅 Last updated: 2026-04-18
- 🤗 Feel free to submit your own projects anytime - we welcome contributions!
Taxonomy:
- Base Framework: General-purpose RL training frameworks for LLM agents (e.g., veRL, OpenRLHF, trl)
- General/MultiTask: Agent systems trained/evaluated across multiple tasks or environments
- Search & RAG: Search-augmented reasoning agents that use retrieval tools to enhance LLM reasoning
- Web & GUI: Agents that interact with web browsers, mobile/desktop GUIs, or operating systems
- Tool-Use: Agents trained to invoke external tools (APIs, code executors, MCP, etc.)
- Code & SWE: Software engineering and code generation agents
- Reasoning: Reasoning agents with tool-integrated or multi-turn reasoning (math, QA, visual)
- Multi-Agent RL: Multi-agent collaboration, negotiation, or credit assignment via RL
- Memory: Agents that learn to manage, retrieve, or evolve memory
- Embodied: Agents operating in embodied/physical simulation environments
- Domain-Specific: RL agents for specialized domains (medical, OS tuning, etc.)
- Reward & Training: Process/outcome reward models and training methodologies for agents
- Safety: RL for agent safety alignment, adversarial red-teaming, and jailbreak defense/attack
- VLM Agent: Vision-language model agents trained with RL for multimodal interaction
- Self-Evolution: Agents that self-evolve via RL feedback loops (⚠️ definition still evolving in the community)
- Environment: Benchmarks, gyms, and sandbox environments for agent training/evaluation
Some Enumerations:
- Enumeration for Reward Type:
- External Verifier: e.g., a compiler or math solver
- Rule-Based: e.g., a LaTeX parser with exact match scoring
- Model-Based: e.g., a trained verifier LLM or reward LLM
- Custom
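The reward types above can be illustrated with minimal callable sketches (hypothetical helper names, purely for illustration - no project in the tables uses exactly this code):

```python
from typing import Callable

# Rule-based reward: e.g., exact match on a normalized \boxed{...} answer.
def rule_based_reward(completion: str, gold: str) -> float:
    answer = completion.split("\\boxed{")[-1].rstrip("}").strip()
    return 1.0 if answer == gold.strip() else 0.0

# External verifier reward: delegate to a tool such as a compiler or math solver.
def external_verifier_reward(completion: str, verify: Callable[[str], bool]) -> float:
    return 1.0 if verify(completion) else 0.0

# Model-based reward: score the completion with a trained reward/verifier LLM (stubbed).
def model_based_reward(completion: str, reward_model: Callable[[str], float]) -> float:
    return reward_model(completion)
```

"Custom" rewards typically compose several of these, e.g. a rule-based correctness term plus a model-based quality term.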
- 📢 2026-04 Update: Added 67 new repositories covering Apr 2025 – Apr 2026 across nearly every category (notably VLM Agent +9, Search & RAG +10, Web & GUI +7, Tool-Use +7). Also reclassified SkyRL (→ General) and SPIRAL (→ Multi-Agent), and updated the VAGEN entry to its NeurIPS'25 upstream repo.
- 📢 2026-03 Update: Restructured taxonomy from 12 to 16 categories (added Multi-Agent RL, Reward & Training, Safety, VLM Agent, Self-Evolution, Domain-Specific; merged GUI into Web & GUI; retired TextGame/Biomedical). Added ~70 new repositories covering Sep 2025 – Mar 2026, growing the total from ~134 to 205.
| Github Repo | Date | Org | Paper Link |
|---|---|---|---|
| Open-AgentRL | 2026.2 | Gen-Verse | Paper | |
| OpenClaw-RL | 2026.3 | Gen-Verse | Paper | |
| Claw-R1 | 2026.3 | USTC | -- | |
| prime-rl | 2025.2 | Prime Intellect | -- | |
| NeMo-RL | 2026.1 | NVIDIA | -- | |
| RLinf | 2025.8 | Tsinghua/Infinigence AI/PKU | Paper | |
| siiRL | 2025.7 | Shanghai Innovation Institute | Paper | |
| slime | 2025.6 | Tsinghua University (THUDM) | blog | |
| agent-lightning | 2025.6 | Microsoft Research | Paper | |
| AReaL | 2025.6 | AntGroup/Tsinghua | Paper | |
| ROLL | 2025.6 | Alibaba | Paper | |
| MARTI | 2025.5 | Tsinghua | -- | |
| Tunix | 2025.4 | Google | -- | |
| RL2 | 2025.4 | Accio | -- | |
| verifiers | 2025.3 | Individual | -- | |
| oat | 2024.11 | NUS/Sea AI | Paper | |
| veRL | 2024.10 | ByteDance | Paper | |
| OpenRLHF | 2023.7 | OpenRLHF | Paper | |
| trl | 2019.11 | HuggingFace | -- |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Open-AgentRL | GRPO-TCR | Single | Both | Multi | Reasoning/GUI/Coding | Model (PRM) | Yes (SandboxFusion) |
| OpenClaw-RL | GRPO/OPD | Both | Both | Multi | Terminal/GUI/SWE/Tool-call | Model/External | Yes |
| Claw-R1 | Generic RL Framework | Multi | Both | Multi | General Agent | All | Yes (Framework-agnostic) |
| prime-rl | GRPO/PPO | Multi | Outcome | Multi | Math/Code/Search | Model/External | Yes |
| NeMo-RL | GRPO/DAPO/GDPO/DPO | Single | Outcome | Multi | Math/Reasoning/Code | Rule/External | No |
| RLinf | PPO/GRPO/DAPO/SAC/REINFORCE++/CrossQ/RLPD | Both | Both | Multi | Robotics/Math/Code/QA/VQA | All (Rule/Model/External) | Yes |
| siiRL | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
| slime | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| agent-lightning | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| AReaL | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| ROLL | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| MARTI | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| Tunix | PPO/GRPO/GSPO-Token/DAPO/Dr.GRPO | Single | Outcome | Multi | Math/Code/Game | Rule/External | Yes |
| RL2 | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| verifiers | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| oat | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
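GRPO, the most common algorithm in the table above, replaces PPO's learned value baseline with a group-relative one: sample a group of completions per prompt, then normalize each completion's reward against the group mean and standard deviation. A minimal sketch of that advantage computation (illustrative only, not any listed framework's exact implementation):

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - mean) / (std + eps) over one prompt's samples."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Variants in the table (DAPO, Dr. GRPO, GSPO, GiGPO, ...) mostly change this normalization, the clipping, or the token/sequence level at which the advantage is applied.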
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MetaClaw | 2026.3 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| SkillRL | 2026.2 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| youtu-agent | 2025.12 | Tencent Youtu Lab | Paper | Custom | |
| DEPO | 2025.11 | HKUST/SJTU | Paper | LLaMA-Factory | |
| SPEAR | 2025.10 | Tencent Youtu Lab | Paper | veRL/verl-agent | |
| DeepAgent | 2025.10 | RUC/Xiaohongshu | Paper | Custom | |
| AgentRL | 2025.9 | Tsinghua | Paper | veRL | |
| AgentGym-RL | 2025.9 | Fudan University | Paper | veRL | |
| Agent_Foundation_Models | 2025.8 | OPPO Personal AI Lab | Paper | veRL | |
| Trinity-RFT | 2025.5 | Alibaba | Paper | veRL | |
| SPA-RL-Agent | 2025.5 | PolyU | Paper | TRL | |
| verl-agent | 2025.5 | NTU/Skywork | Paper | veRL | |
| SkyRL | 2025.4 | UC Berkeley / NovaSky-AI | Paper | Self (skyrl-train) | |
| VAGEN | 2025.3 | Northwestern University (mll-lab-nu) | Paper | veRL | |
| ART | 2025.3 | OpenPipe | Paper | TRL | |
| OpenManus-RL | 2025.3 | UIUC/MetaGPT | -- | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MetaClaw | GRPO (LoRA) | Single | Process | Multi | General Agentic | Model (PRM) | Yes (Skill-augmented) |
| SkillRL | GRPO | Single | Outcome | Multi | ALFWorld/WebShop/Search | Rule | Yes (Web search, actions) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Math/Physics/Chemistry/Biomedicine/Long-context/IF/SWE | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| youtu-agent | Training-Free GRPO | Single | Outcome | Multi | Deep Research/Data Analysis/Tool-use | Model/External | Yes (Web search, code, file) |
| DEPO | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| SPEAR | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| DeepAgent | ToolPO | Single | Outcome | Multi | ToolBench/ALFWorld/WebShop/GAIA/HLE | Model | Yes (16,000+ RapidAPIs) |
| AgentRL | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| AgentGym-RL | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| Agent_Foundation_Models | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| Trinity-RFT | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| SPA-RL-Agent | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| verl-agent | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
| SkyRL | GRPO/PPO | Single | Both | Multi | Long-horizon Agents (SWE-Bench/Search/Math/SQL) | Rule/External/Custom | Yes |
| VAGEN | PPO/GRPO (World Modeling RL) | Single | Both | Multi | Navigation/TextGame/Multimodal | All | Yes |
| ART | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| OpenManus-RL | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| ProRAG | 2026.1 | RUC | Paper | Custom | |
| MemSearcher | 2025.11 | CAS | Paper | Custom | |
| ReSeek | 2025.10 | Tencent PCG BAC/Tsinghua University | Paper | veRL | |
| AutoGraph-R1 | 2025.10 | HKUST KnowComp | Paper | Custom | |
| Tree-GRPO | 2025.9 | AMAP | Paper | veRL | |
| ASearcher | 2025.8 | Ant Research RL Lab/Tsinghua University/UW | Paper | RealHF/AReaL | |
| Graph-R1 | 2025.7 | BUPT/NTU/NUS | Paper | veRL | |
| Kimi-Researcher | 2025.6 | Moonshot AI | blog | Custom | |
| R-Search | 2025.6 | Individual | -- | veRL | |
| R1-Searcher-plus | 2025.5 | RUC | Paper | Custom | |
| StepSearch | 2025.5 | SenseTime | Paper | veRL | |
| AutoRefine | 2025.5 | USTC | Paper | veRL | |
| ZeroSearch | 2025.5 | Alibaba | Paper | veRL | |
| ReasonRAG | 2025.5 | CityU HK / Huawei | Paper | Custom | |
| Agentic-RAG-R1 | 2025.12 | PKU | -- | Custom | |
| WebThinker | 2025.4 | RUC | Paper | Custom | |
| DeepResearcher | 2025.4 | SJTU | Paper | veRL | |
| Search-R1 | 2025.3 | UIUC/Google | paper1, paper2 | veRL | |
| R1-Searcher | 2025.3 | RUC | Paper | OpenRLHF | |
| C-3PO | 2025.2 | Alibaba | Paper | OpenRLHF | |
| DeepRetrieval | 2025.2 | UIUC | Paper | veRL | |
| SSRL | 2025.8 | Tsinghua | Paper | Custom | |
| Research-Venus | 2025.8 | Ant Group | Paper | Custom | |
| DeepResearch | 2025.9 | Alibaba/Tongyi Lab | Paper | Custom | |
| DeepDive | 2025.9 | Tsinghua/THUDM | Paper | Custom | |
| O-Researcher | 2026.1 | OPPO PersonalAI Lab | Paper | Custom | |
| DR Tulu | 2025.11 | AI2 / UW / CMU / MIT | Paper | Open-Instruct | |
| WebSeer | 2025.10 | Individual | Paper | veRL | |
| HiPRAG | 2025.10 | Individual | Paper | veRL | |
| VRAG | 2025.5 | USTC / Tongyi Lab, Alibaba | Paper | veRL | |
| MaskSearch | 2025.5 | Tongyi Lab, Alibaba | Paper | DAPO / veRL | |
| R3-RAG | 2025.5 | Fudan NLP | Paper | OpenRLHF | |
| O2-Searcher | 2025.5 | KnowledgeXLab | Paper | veRL | |
| s3 | 2025.5 | UIUC | Paper | veRL | |
| knowledge-r1 | 2025.5 | CAS / UCAS | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ProRAG | GRPO + DGA (dual-granularity advantage) | Single | Both | Multi | Multi-hop RAG | Model (PRM via MCTS) | Yes (Retrieval) |
| MemSearcher | Multi-context GRPO | Single | Outcome | Multi | Search/QA + Memory | Rule/Model | Yes (Web search + Memory) |
| ReSeek | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| AutoGraph-R1 | GRPO (via VeRL) | Single | Outcome | Multi | KG Construction for QA | Rule | Yes (Graph retrieval) |
| Tree-GRPO | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| ASearcher | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| Graph-R1 | GRPO/REINFORCE++/PPO | Single | Outcome | Multi | KGQA | Rule (EM/F1) | Yes (Graph retrieval) |
| Kimi-Researcher | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| R-Search | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| R1-Searcher-plus | Custom | Single | Outcome | Multi | Search | Model | Search |
| StepSearch | PPO | Single | Process | Multi | QA | Model | Search |
| AutoRefine | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| ZeroSearch | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| ReasonRAG | DPO + MCTS-based PRM | Single | Process | Multi | Multi-hop QA | Model (PRM) | Yes (Wikipedia search) |
| Agentic-RAG-R1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA | Rule/Model | Yes (Wiki/Doc search) |
| WebThinker | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| DeepResearcher | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| Search-R1 | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| R1-Searcher | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| C-3PO | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| DeepRetrieval | GRPO | Single | Outcome | Multi | Query Generation/IR | Rule | Yes (Search) |
| SSRL | GRPO | Single | Outcome | Multi | Self-Search | Rule | Yes (Self-search) |
| Research-Venus | GRPO | Single | Both | Multi | Deep Research | Model (atomic thought) | Yes (Search) |
| DeepResearch | RL-based | Single | Outcome | Multi | Deep Research | Model | Yes (Search, Browse) |
| DeepDive | GRPO | Single | Outcome | Multi | KG-augmented Search | Rule | Yes (KG + Search) |
| O-Researcher | GRPO + RLAIF | Multi | Process | Multi | Deep Research (Zhihu-KOL/WideSearch/ELI5) | Model (LLM-as-Judge) | Yes (Search/Crawl) |
| DR Tulu | GRPO + evolving rubrics | Single | Outcome | Multi | Long-form Deep Research | Model (rubrics) | Yes (Search/MCP) |
| WebSeer | GRPO-style | Single | Outcome | Multi | Web Search QA (w/ self-reflection) | Rule/Model | Yes (Search) |
| HiPRAG | PPO | Single | Process | Multi | Efficient Agentic RAG | Model/Rule | Yes (Retrieval) |
| VRAG | GRPO | Single | Both | Multi | Visually-rich RAG | Rule/Model | Yes (Visual retrieval) |
| MaskSearch | DAPO | Single | Outcome | Multi | RAMP Pretraining + QA | Rule/Model | Yes (Search) |
| R3-RAG | PPO | Single | Both | Multi | Multi-hop QA | Rule | Yes (Retrieval) |
| O2-Searcher | GRPO | Single | Outcome | Multi | Open-ended QA | Rule/Model | Yes (Search) |
| s3 | GRPO | Single | Outcome | Multi | RAG / Medical QA | Model (Gain-Beyond-RAG) | Yes (Retrieval) |
| knowledge-r1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA (KB-aware) | Rule | Yes (Retrieval) |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MobileAgent | 2025.9 | X-PLUG (TongyiQwen) | paper | veRL | |
| InfiGUI-G1 | 2025.8 | InfiX AI | Paper | veRL | |
| UI-AGILE | 2025.7 | Xiamen University | Paper | Custom | |
| gui-rcpo | 2025.8 | Zhejiang University | Paper | Custom | |
| Grounding-R1 | 2025.6 | Salesforce | blog | trl | |
| AgentCPM-GUI | 2025.6 | OpenBMB/Tsinghua/RUC | Paper | Huggingface | |
| TTI | 2025.6 | CMU | Paper | Custom | |
| SE-GUI | 2025.5 | Nankai University/vivo | Paper | trl | |
| ARPO | 2025.5 | CUHK/HKUST | Paper | veRL | |
| GUI-G1 | 2025.5 | RUC | Paper | TRL | |
| WebAgent-R1 | 2025.5 | Amazon/UVA | Paper | Custom | |
| GUI-R1 | 2025.4 | CAS/NUS | Paper | veRL | |
| UI-R1 | 2025.3 | vivo/CUHK | Paper | TRL | |
| CollabUIAgents | 2025.2 | Tsinghua/Alibaba/HKUST | Paper | Custom | |
| WebAgent | 2025.1 | Alibaba | paper1, paper2 | LLaMA-Factory | |
| UI-TARS | 2025.9 | ByteDance Seed | Paper | Custom | |
| DigiQ | 2025.2 | UC Berkeley/CMU/Amazon | Paper | Custom | |
| ZeroGUI | 2025.5 | Shanghai AI Lab | Paper | Custom | |
| InfiGUI-R1 | 2025.4 | Zhejiang University | Paper | Custom | |
| GUI-Agent-RL | 2025.2 | Microsoft | Paper | Custom | |
| GUI-Libra | 2026.2 | GUI-Libra (MS-affiliated) | Paper | Custom | |
| MobileRL | 2025.9 | Tsinghua / Zhipu AI (THUDM) | Paper | Custom | |
| DART-GUI | 2025.9 | Computer-use-agents | Paper | veRL | |
| Mano-P | 2025.9 | Mininglamp AI | Paper | Mano-SDK | |
| GUI-G2 | 2025.7 | Zhejiang University (ZJU-REAL) | Paper | Custom (VLM-R1) | |
| MagicGUI | 2025.7 | Honor (MagicAgent-GUI) | Paper | Custom | |
| GTA1 | 2025.6 | Salesforce / ANU | Paper | Custom (DeepSpeed) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MobileAgent | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
| InfiGUI-G1 | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| UI-AGILE | GRPO | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| gui-rcpo | RCPO | Single | Outcome | Single | GUI Grounding | Rule (self-supervised) | No |
| Grounding-R1 | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| AgentCPM-GUI | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| TTI | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| SE-GUI | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| ARPO | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| GUI-G1 | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| WebAgent-R1 | M-GRPO | Single | Outcome | Multi | Web Navigation (WebArena-Lite) | Rule (task success) | Yes (Web browsing) |
| GUI-R1 | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| UI-R1 | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
| CollabUIAgents | DPO (credit re-assignment) | Multi | Process | Multi | GUI (Mobile + Web) | Model (LLM) | Yes (GUI interaction) |
| WebAgent | DAPO | Multi | Process | Multi | Web | Model | Yes |
| UI-TARS | Multi-turn RL | Single | Both | Multi | GUI (Cross-platform) | Model | Yes (GUI actions) |
| DigiQ | Value-based offline RL | Single | Outcome | Multi | Android Device Control | Model (Q-function) | Yes |
| ZeroGUI | Online RL | Single | Outcome | Multi | GUI Agent | Rule | Yes (GUI actions) |
| InfiGUI-R1 | RL + sub-goal guidance | Single | Both | Multi | GUI Reasoning | Rule | Yes |
| GUI-Agent-RL | Value-based RL (VEM) | Single | Outcome | Multi | GUI (Web Shopping) | Model | Yes |
| GUI-Libra | KL-regularized GRPO (Partially Verifiable RL) | Single | Outcome | Multi | GUI (AndroidWorld/WebArena/Online-Mind2Web) | Rule | Yes |
| MobileRL | AdaGRPO (Difficulty-Adaptive) | Single | Outcome | Multi | Mobile GUI (AndroidWorld/AndroidLab) | Rule | Yes (Android) |
| DART-GUI | Decoupled GRPO | Single | Outcome | Multi | GUI (OSWorld) | Rule | Yes |
| Mano-P | Three-stage SFT→Offline RL→Online RL | Single | Both | Multi | GUI (OSWorld) | Rule | Yes |
| GUI-G2 | GRPO (Gaussian Reward) | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| MagicGUI | Reinforcement Fine-Tuning (RFT) | Single | Outcome | Multi | Mobile GUI | Model/Rule | Yes |
| GTA1 | GRPO-style (click-success reward) | Single | Outcome | Multi | GUI Grounding (OSWorld/ScreenSpot-Pro) | Rule | Yes |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MATPO | 2025.10 | MiroMind AI | Paper | Custom | |
| MiroRL | 2025.8 | MiroMindAI | HF Repo | veRL | |
| verl-tool | 2025.6 | TIGER-Lab | X | veRL | |
| Multi-Turn-RL-Agent | 2025.5 | University of Minnesota | Paper | Custom | |
| Tool-N1 | 2025.5 | NVIDIA | Paper | veRL | |
| Tool-Star | 2025.5 | RUC | Paper | LLaMA-Factory | |
| RL-Factory | 2025.5 | Simple-Efficient | model | veRL | |
| ReTool | 2025.4 | ByteDance | Paper | veRL | |
| AWorld | 2025.3 | Ant Group (inclusionAI) | Paper | veRL | |
| Agent-R1 | 2025.3 | USTC | Paper | veRL | |
| ReCall | 2025.3 | BaiChuan | Paper | veRL | |
| ToolRL | 2025.4 | UIUC | Paper | veRL | |
| ToolOrchestra | 2025.11 | NVIDIA / HKU | Paper | Custom (veRL-based) | |
| ToolMaster | 2025.11 | Northeastern University (NEUIR) | Paper | Custom | |
| CodeGym | 2025.9 | Academic | Paper | Custom | |
| UserRL | 2025.9 | Salesforce AI Research | Paper | veRL | |
| ToolBrain | 2025.9 | ToolBrain (AAMAS 2026) | Paper | Custom | |
| Tool-R1 | 2025.9 | Individual (YBYBZhang) | Paper | Custom | |
| calculator_agent_rl | 2025.5 | Individual (Danau5tin) | -- | Verifiers |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MATPO | GRPO (multi-agent) | Multi | Outcome | Multi | Tool-use/Search | Rule | Yes (MCP: Serper, Web scraping) |
| MiroRL | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
| verl-tool | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| Multi-Turn-RL-Agent | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
| Tool-N1 | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| Tool-Star | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| RL-Factory | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
| ReTool | PPO | Single | Outcome | Multi | Math | External | Code |
| AWorld | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| Agent-R1 | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
| ReCall | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
| ToolRL | GRPO/PPO | Single | Outcome | Multi | Tool Learning | Rule/External | Yes |
| ToolOrchestra | End-to-end RL (outcome+efficiency+preference) | Single | Both | Multi | Tool orchestration / agentic workflows | All | Yes (Search/Code/LLMs) |
| ToolMaster | SFT + GRPO (trial-then-execute) | Single | Outcome | Multi | Tool trialing + execution (ToolHop/TMDB/StableToolBench) | Rule/External | Yes (Simulated tools) |
| CodeGym | GRPO-family | Single | Outcome | Multi | Synthetic Multi-turn Tool-Use | Rule (verifiable) | Yes (Synthesized tools) |
| UserRL | GRPO (multi-turn credit) | Single | Both | Multi | User-centric (Function/Persuade/Search/Tau Gyms) | Model/External | Yes |
| ToolBrain | GRPO/DPO | Single | Outcome | Multi | Agentic tool training | Rule/Model | Yes (User-defined tools) |
| Tool-R1 | Policy optimization (PPO-style) | Single | Outcome | Multi | Agentic Tool Use (GAIA) | Model + External | Yes (Python exec) |
| calculator_agent_rl | GRPO | Single | Outcome | Multi | Calculator Tool Use | Model (Claude-judge) | Yes |
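Most Tool-Use entries above share the same rollout shape at training time: the policy emits text, the trainer detects a tool call, executes it, appends the observation, and continues until a final answer, with the (usually outcome) reward assigned at the end. A schematic loop, with hypothetical tag names and stubbed `generate`/`run_tool` helpers:

```python
import re

def rollout(generate, run_tool, prompt: str, max_turns: int = 4) -> str:
    """Multi-turn tool-use rollout: alternate policy generation and tool execution."""
    context = prompt
    for _ in range(max_turns):
        completion = generate(context)            # policy LLM step (stubbed)
        context += completion
        call = re.search(r"<tool>(.*?)</tool>", completion, re.S)
        if call is None:                          # no tool call -> final answer reached
            break
        observation = run_tool(call.group(1))     # e.g., code executor or search API
        context += f"<result>{observation}</result>"
    return context
```

The resulting trajectory is then scored (rule, model, or external verifier) and fed to the RL algorithm; frameworks differ mainly in how tool outputs are masked out of the loss.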
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| CUDA-Agent | 2026.2 | ByteDance/Tsinghua | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| PPP-Agent | 2025.11 | CMU/OpenHands | Paper | veRL | |
| RepoDeepSearch | 2025.8 | PKU, Bytedance, BIT | Paper | veRL | |
| CUDA-L1 | 2025.7 | DeepReinforce AI | Paper | Custom | |
| MedAgentGym | 2025.6 | Emory/Georgia Tech | Paper | Huggingface | |
| CURE | 2025.6 | University of Chicago/Princeton/ByteDance | Paper | Huggingface | |
| Time-R1 | 2025.5 | UIUC | Paper | veRL | |
| ML-Agent | 2025.5 | MASWorks | Paper | Custom | |
| digitalhuman | 2025.4 | Tencent | Paper | veRL | |
| sweet_rl | 2025.3 | Meta/UCB | Paper | OpenRLHF | |
| swe-rl | 2025.2 | Meta/UIUC/CMU | Paper | Custom | |
| rllm | 2025.1 | Berkeley Sky Computing Lab (BAIR) / Together AI | Notion Blog | veRL | |
| open-r1 | 2025.1 | HuggingFace | -- | TRL | |
| R1-Code-Interpreter | 2025.5 | MIT | Paper | Custom | |
| CTRL | 2025.2 | HKU/ByteDance | Paper | Custom | |
| DeepAnalyze | 2025.10 | RUC/Tsinghua | Paper | Custom | |
| AceCoder | 2025.2 | Waterloo (TIGER-Lab) | Paper | Custom | |
| SWE-World | 2026.2 | RUC (RUCAIBox) | Paper | OpenRLHF + veRL | |
| CUDA-L2 | 2026.1 | DeepReinforce AI | Paper | Custom | |
| SWE-Swiss | 2025.7 | Tsinghua / ByteDance | -- | veRL | |
| Skywork-OR1 | 2025.4 | Skywork AI | Paper | Custom (veRL fork) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| CUDA-Agent | Agentic RL (staged) | Single | Outcome | Multi | CUDA Kernel Generation | Rule (correctness + performance) | Yes (compile/verify/profile) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Code/SWE + General (Math/Sci/Bio) | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| PPP-Agent | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
| RepoDeepSearch | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| CUDA-L1 | Contrastive RL | Single | Outcome | Single | CUDA Optimization | Rule (performance) | No |
| MedAgentGym | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| CURE | PPO | Single | Outcome | Single | Code | External | No |
| Time-R1 | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| ML-Agent | Custom | Single | Process | Multi | Code | All | Yes |
| digitalhuman | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
| sweet_rl | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| swe-rl | RL-based | Single | Outcome | Single | SWE (SWE-bench) | Rule (similarity) | No |
| rllm | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| open-r1 | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
| R1-Code-Interpreter | GRPO | Single | Outcome | Multi | Code Interpretation | Rule/External | Yes (Code exec) |
| CTRL | RL (critique-revision) | Single | Process | Multi | Code Refinement | Model | Yes (Code exec) |
| DeepAnalyze | Curriculum RL | Single | Outcome | Multi | Data Science | Rule/External | Yes (Code exec) |
| AceCoder | GRPO | Single | Outcome | Single | Code Generation | External (test cases) | Yes |
| SWE-World | RL with learned world model (SWT + SWR) | Single | Both | Multi | Docker-free SWE (SWE-Bench Verified) | Model (surrogate) + Rule | Yes |
| CUDA-L2 | Contrastive RL | Single | Outcome | Single | HGEMM / CUDA Matmul | Rule (TFLOPs) | Yes (compile/benchmark) |
| SWE-Swiss | Two-stage RL curriculum | Single | Outcome | Multi | SWE (Localization/Repair/Unit-Test) | Rule (test-based) | Yes |
| Skywork-OR1 | Large-scale rule-based RL (GRPO variant) | Single | Outcome | Single | Math + Code (AIME/LiveCodeBench) | Rule (verifiable) | No |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Agent0 | 2025.10 | UNC‑Chapel Hill / Salesforce Research / Stanford University | Paper | veRL | |
| KG-R1 | 2025.9 | UIUC/Google | Paper1, Paper2 | veRL | |
| AgentFlow | 2025.09 | Stanford University | arXiv | veRL | |
| ARPO | 2025.7 | RUC, Kuaishou | Paper | veRL | |
| terminal-bench-rl | 2025.7 | Individual (Danau5tin) | N/A | rLLM | |
| MOTIF | 2025.6 | University of Maryland | Paper | trl | |
| cmriat/l0 | 2025.6 | CMRIAT | Paper | veRL | |
| agent-distillation | 2025.5 | KAIST | Paper | Custom | |
| EasyR1 | 2025.4 | Individual | repo1/paper2 | veRL | |
| AutoCoA | 2025.3 | BJTU | Paper | veRL | |
| ToRL | 2025.3 | SJTU | Paper | veRL | |
| ReMA | 2025.3 | SJTU, UCL | Paper | veRL | |
| Agentic-Reasoning | 2025.2 | Oxford | Paper | Custom | |
| SimpleTIR | 2025.2 | NTU, Bytedance | Notion Blog | veRL | |
| openrlhf_async_pipline | 2024.5 | OpenRLHF | Paper | OpenRLHF | |
| THOR | 2025.9 | USTC / iFLYTEK | Paper | veRL | |
| Tool-Light | 2025.9 | RUC (RUC-NLPIR) | Paper | LLaMA-Factory | |
| AutoTIR | 2025.7 | Beihang University / BAAI | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Agent0 | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| KG-R1 | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| AgentFlow | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| ARPO | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| terminal-bench-rl | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
| MOTIF | GRPO | Single | Outcome | Multi | QA | Rule | No |
| cmriat/l0 | PPO | Multi | Process | Multi | QA | All | Yes |
| agent-distillation | PPO | Single | Process | Multi | QA/Math | External | Yes |
| EasyR1 | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| AutoCoA | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| ToRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| ReMA | PPO | Multi | Outcome | Multi | Math | Rule | No |
| Agentic-Reasoning | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| SimpleTIR | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
| openrlhf_async_pipline | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
| THOR | Hierarchical GRPO (trajectory+step) | Single | Both | Multi | Math (MATH500/AIME/Olympiad) | External (SandboxFusion) | Yes (Python) |
| Tool-Light | Self-Evolved DPO | Single | Outcome | Multi | Tool-Integrated Reasoning | Model (preference) | Yes (FlashRAG/Python) |
| AutoTIR | PPO | Single | Outcome | Multi | Autonomous Tool Selection (QA/Math/IF) | Rule | Yes (Search/Python) |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| PettingLLMs | 2025.10 | Intel / UCSD | Paper | Custom | |
| MASPRM | 2025.10 | UBC / Huawei | Paper | Custom | |
| ARIA | 2025.6 | Fudan University | Paper | Custom | |
| AMPO | 2025.5 | Tongyi Lab, Alibaba | Paper | veRL | |
| MAPoRL | 2025.8 | Academic | -- | Custom | |
| FlowReasoner | 2025.4 | Sea AI Lab / NUS | Paper | Custom | |
| DrMAS | 2026.2 | NTU | Paper | Custom | |
| MarsRL | 2025.11 | Academic | Paper | veRL | |
| MrlX | 2025.10 | Ant Group (AQ-MedAI) | Paper | Custom (SGLang + Megatron) | |
| CoMAS | 2025.10 | Shanghai AI Lab / CUHK / Oxford / NUS | Paper | Custom | |
| CoMLRL | 2025.8 | OpenMLRL | Paper | TRL | |
| SPIRAL | 2025.6 | NUS / A*STAR / Sea AI Lab | Paper | Oat | |
| MARFT | 2025.4 | SII / SJTU | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| PettingLLMs | AT-GRPO | Multi | Both | Multi | Game/Code/Math/Planning | Rule (verifiable) | No |
| MASPRM | PRM (trained from MCTS rollouts) | Multi | Process | Multi | Reasoning (GSM8K/MATH/MMLU) | Learned PRM | No |
| ARIA | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| AMPO | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
| MAPoRL | PPO | Multi | Outcome | Multi | Collaborative LLM Tasks | Rule | No |
| FlowReasoner | GRPO | Multi | Outcome | Multi | Multi-agent Workflow Design | Rule | Yes |
| DrMAS | GRPO (agent-wise) | Multi | Outcome | Multi | Multi-agent LLM Systems | Rule | No |
| MarsRL | RLVR (agent-specific rewards) | Multi | Both | Multi | Math Reasoning (AIME/BeyondAIME) | Rule (verifiable) | No |
| MrlX | M-GRPO (hierarchical) | Multi | Outcome | Multi | Deep Research (GAIA/XBench) | Rule + Model | Yes (Search) |
| CoMAS | RL w/ LLM-Judge intrinsic reward | Multi | Process | Multi | Co-evolving Reasoning | Model | No |
| CoMLRL | MAGRPO / MAREINFORCE / MARLOO | Multi | Outcome | Multi | Writing / Code / Minecraft | Custom | Minimal |
| SPIRAL | Role-conditioned Advantage Estimation (RAE) | Multi | Outcome | Multi | Zero-sum Games (TicTacToe/Kuhn/Negotiation) | Rule | No |
| MARFT | MARFT paradigm (action+token level) | Multi | Both | Multi | Research / Math | Rule | Yes |
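Many entries in the table above use GRPO-style algorithms with rule-based, verifiable outcome rewards. As a rough illustration of the shared core idea (a sketch only, with hypothetical function names — not any listed repo's implementation), the group-relative advantage simply normalizes each rollout's reward against its sampling group, replacing a learned critic:

```python
# Illustrative sketch: group-relative advantages as used by GRPO-style
# methods. A group of rollouts is sampled per prompt; the group mean and
# std act as the baseline instead of a value network.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its group statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one prompt, 0/1 correctness rewards from a
# rule-based verifier. Correct rollouts get positive advantage.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The same normalization applies per agent in agent-wise variants, where each agent's group is formed from its own rollouts.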
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| MEM1 | | 2025.7 | MIT | Paper | veRL (based on Search-R1) |
| Memento | | 2025.6 | UCL, Huawei | Paper | Custom |
| MemAgent | | 2025.6 | Bytedance, Tsinghua-SIA | Paper | veRL |
| Mem-alpha | | 2025.9 | UCSD / USTC | Paper | veRL |
| M3-Agent | | 2025.7 | ByteDance Seed / Zhejiang University | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MEM1 | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| Memento | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| MemAgent | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
| Mem-alpha | GRPO | Single | Outcome | Multi | Long-context QA + Memory Construction | Rule (downstream QA) | Yes (memory tools) |
| M3-Agent | RL-based | Single | Outcome | Multi | Long-video QA (M3-Bench) | Rule/Model | Yes (multimodal memory graph) |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| Embodied-R1 | | 2025.6 | Tianjin University | Paper | veRL |
| STeCa | | 2025.2 | The Hong Kong Polytechnic University | Paper | FastChat/TRL |
| VIKI-R | | 2025.6 | MARS-EAI (NeurIPS 2025 D&B) | Paper | veRL + LLaMA-Factory |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Embodied-R1 | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| STeCa | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
| VIKI-R | GRPO (RFT after SFT) | Multi | Outcome | Multi | Embodied Multi-Robot Cooperation (VIKI-Bench) | Rule + Model | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework | Domain |
|---|---|---|---|---|---|---|
| MedSAM-Agent | | 2026.2 | CUHK/Tencent | Paper | Custom | Medical |
| OS-R1 | | 2025.8 | ISCAS | Paper | Custom | OS/Systems |
| MMedAgent-RL | | 2025.8 | Unknown | Paper | Unknown | Medical |
| DoctorAgent-RL | | 2025.5 | UCAS/CAS/USTC | Paper | RAGEN | Medical |
| Biomni | | 2025.3 | Stanford University (SNAP) | Paper | Custom | Biomedical |
| Doctor-R1 | | 2025.12 | Tsinghua (thu-unicorn) | Paper | veRL | Medical |
| Alpha-R1 | | 2025.12 | SJTU / FinStep.AI / StepFun | Paper | Custom | Financial |
| MedResearcher-R1 | | 2025.8 | Ant Group (AQ-MedAI) | Paper | Custom | Medical |
| LegalDelta | | 2025.8 | Northeastern University (NEUIR) | Paper | Custom | Legal |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MedSAM-Agent | GRPO (via veRL) | Single | Both | Multi | Medical Image Segmentation | Model (clinical fidelity) | Yes (SAM/MedSAM2) |
| OS-R1 | GRPO (via veRL) | Single | Outcome | Multi | Linux Kernel Tuning | Rule | Yes (LightRAG, kernel config) |
| MMedAgent-RL | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| DoctorAgent-RL | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| Biomni | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
| Doctor-R1 | Experiential Agentic RL | Multi | Both | Multi | Clinical inquiry & diagnosis | Model + Rule + safety veto | No |
| Alpha-R1 | GRPO | Single | Outcome | Multi | Alpha factor screening (with real-time news) | External (portfolio returns) + Model | Yes |
| MedResearcher-R1 | GRPO-based (SFT + Online RL) | Single | Outcome | Multi | Medical Deep Research (MedBrowseComp) | Rule + Model | Yes (Search/KG) |
| LegalDelta | GRPO (CoT-guided info-gain) | Single | Process | Multi | Legal Reasoning | Model + Rule | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | Focus |
|---|---|---|---|---|---|
| ToolPRMBench | | 2026.1 | Arizona State University | Paper | PRM Benchmark for Tool-Use |
| RLVR-World | | 2025.5 | THU ML Group | Paper | RLVR for World Models |
| AgentPRM | | 2025.2 | Cornell | Paper | Process Reward for Agents |
| Agentic-Reward-Modeling | | 2025.2 | THU-KEG | Paper | Agentic Reward Agent |
| AgentRM | | 2025.2 | THUNLP/Tsinghua | Paper | Generalizable Agent RM |
| AgentProg | | 2025.5 | MobileLLM | Paper | Progress Reward Model (ProgRM) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ToolPRMBench | N/A (Benchmark) | Single | Process | Multi | Tool-Use | Rule/Model | Yes |
| RLVR-World | RLVR | Single | Outcome | Multi | World Modeling (Language/Video) | Model (verifiable) | No |
| AgentPRM | PPO/DPO + PRM | Single | Process | Multi | ALFWorld/General | Model (PRM) | Yes |
| Agentic-Reward-Modeling | DPO/Best-of-N | Single | Outcome | Single | General Instruction | Model (Reward Agent) | Yes (Verification) |
| AgentRM | MCTS/RM-guided | Single | Outcome | Multi | 9 Agent Tasks | Model (regression PRM) | Yes |
| AgentProg | Online RL w/ progress reward | Single | Process | Multi | GUI Agent Training | Model (ProgRM) | Yes |
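The Outcome/Process distinction running through these tables boils down to where the reward signal attaches: an outcome reward scores only the final result, while a process reward model (PRM) scores each intermediate step. A minimal sketch of that difference, with hypothetical function names (not any listed project's API):

```python
# Illustrative sketch: outcome vs. process rewards for one trajectory.

def outcome_reward(final_answer, gold_answer):
    # One scalar for the whole trajectory, e.g. from a rule-based verifier.
    return 1.0 if final_answer == gold_answer else 0.0

def process_return(step_scores, gamma=1.0):
    # Aggregate per-step PRM scores (each in [0, 1]) into a return.
    return sum((gamma ** t) * s for t, s in enumerate(step_scores))

# A 3-step trajectory whose PRM scores degrade at the last step: the
# process signal localizes the failure even if the outcome is binary.
steps = [0.9, 0.8, 0.2]
total = process_return(steps)
```

Projects marked "Both" combine the two, typically using the outcome signal for correctness and the process signal for credit assignment across turns.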
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| SafeSearch | | 2025.11 | Amazon Science | Paper | veRL |
| curiosity_redteam | | 2024.2 | MIT | Paper | Custom |
| RLbreaker | | 2024.6 | Purdue | Paper | Custom |
| xJailbreak | | 2025.1 | Academic | Paper | Custom |
| Auto-RT | | 2025.1 | ICIP-CAS | Paper | Custom |
| ToolSafe | | 2026.1 | Academic (MurrayTom) | Paper | veRL |
| TROJail | | 2025.12 | Academic (ACL 2026) | Paper | RAGEN + vLLM |
| Jailbreak-R1 | | 2025.6 | Academic (yuki-younai) | Paper | Custom |
| GuardReasoner-VL | | 2025.5 | NUS (yueliu1999) | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| SafeSearch | PPO (GAE/GRPO) | Single | Both | Multi | Safe QA/Search | Rule + Model | Yes (Search) |
| curiosity_redteam | RL + Curiosity | Single | Outcome | Multi | Red Teaming | Model | Yes (iterative query) |
| RLbreaker | Custom PPO | Single | Outcome | Multi | Jailbreaking | Model | Yes (mutator selection) |
| xJailbreak | RL | Single | Outcome | Multi | Jailbreaking | Model (embedding) | Yes (iterative) |
| Auto-RT | PPO | Single | Outcome | Multi | Red Teaming | Model | Yes (strategy exploration) |
| ToolSafe | Multi-task GRPO | Single | Process | Multi | Tool-Invocation Safety Guardrail | Rule + Model | Yes (tool monitoring) |
| TROJail | Multi-turn GRPO variant | Single | Both | Multi | Multi-turn Jailbreak Attack | Model (harmfulness judge) + Rule | Yes (target LLM) |
| Jailbreak-R1 | GRPO (3-stage: imitation→warm-up→progressive) | Single | Both | Multi | Red-teaming Prompt Generation | Model (judge) | Yes (target LLM) |
| GuardReasoner-VL | Online RL w/ rejection sampling | Single | Both | Multi | VLM Safety Guard (multimodal) | Rule + Model | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| multimodal-search-r1 | | 2025.6 | ByteDance/NTU | Paper | Custom |
| DeepEyesV2 | | 2025.11 | Xiaohongshu | Paper | Custom |
| VDeepEyes | | 2025.5 | Xiaohongshu/XJTU | Paper | veRL |
| CoSo | | 2025.5 | NTU/Alibaba | Paper | Custom |
| RL4VLM | | 2024.5 | UC Berkeley | Paper | Custom |
| VSC-RL | | 2025.2 | Liverpool/Huawei/Tianjin/UCL | Paper | Custom |
| AlphaDrive | | 2025.3 | HUST/Horizon Robotics | Paper | Custom |
| Mini-o3 | | 2025.9 | Mini-o3 team | Paper | veRL |
| VisionThink | | 2025.7 | CUHK (dvlab-research) | Paper | veRL + EasyR1 |
| AutoVLA | | 2025.6 | UCLA Mobility Lab | Paper | Custom |
| Pixel-Reasoner | | 2025.5 | University of Waterloo (TIGER-AI-Lab) | Paper | OpenRLHF |
| Visual-ARFT | | 2025.5 | Shanghai AI Lab / SJTU | Paper | Custom |
| VTool-R1 | | 2025.5 | UIUC | Paper | veRL + EasyR1 |
| OpenThinkIMG | | 2025.5 | Academic (zhaochen0110) | Paper | OpenR1 |
| Chain-of-Focus | | 2025.5 | Multi-institution | Paper | veRL |
| GRIT | | 2025.5 | UC Santa Cruz (eric-ai-lab) | Paper | trl |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| multimodal-search-r1 | GRPO | Single | Outcome | Multi | Multimodal Search | Rule | Yes (Search) |
| DeepEyesV2 | Outcome RL | Single | Outcome | Multi | Multimodal Reasoning | Rule | Yes (Code exec, Web search) |
| VDeepEyes | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| CoSo | Soft RL (counterfactual) | Single | Outcome | Multi | Android/Card/Embodied | Rule | Yes |
| RL4VLM | PPO | Single | Outcome | Multi | GymCards/ALFWorld | Rule | Yes |
| VSC-RL | Variational RL | Single | Outcome | Multi | Mobile Device Control | Rule | Yes |
| AlphaDrive | GRPO | Single | Outcome | Multi | Autonomous Driving | Rule (4 planning rewards) | No |
| Mini-o3 | GRPO | Single | Outcome | Multi | Visual Search (V*/HR-Bench) | Rule | Yes (image crop) |
| VisionThink | GRPO w/ LLM-as-Judge | Single | Outcome | Multi | Efficient VQA | Model (LLM-Judge) | Yes (hi-res request) |
| AutoVLA | GRPO (RFT after SFT) | Single | Outcome | Multi | Autonomous Driving (nuScenes/nuPlan/Waymo) | Rule (PDMS) | No |
| Pixel-Reasoner | Curiosity-driven GRPO | Single | Both | Multi | Visual Reasoning (V*/TallyQA/Info-VQA) | Rule + Model | Yes (zoom/select-frame) |
| Visual-ARFT | GRPO (agentic RFT) | Single | Outcome | Multi | Multimodal Agentic Tool Use (MAT-Search/Coding) | Rule | Yes (Search/Python) |
| VTool-R1 | RFT (GRPO-based) | Single | Outcome | Multi | Chart/Table VQA | Rule | Yes (Python visual tools) |
| OpenThinkIMG | V-ToolRL (GRPO) | Single | Outcome | Multi | Chart Reasoning | Rule | Yes (GroundingDINO/SAM/OCR/crop) |
| Chain-of-Focus | AGAR (GRPO) | Single | Outcome | Multi | Visual Reasoning (V*) | Rule (outcome+format) | Yes (zoom-in) |
| GRIT | GRPO-GR (Grounded Reasoning) | Single | Outcome | Single | Visual Reasoning (bbox) | Rule | Yes (bbox) |
⚠️ Note: The definition of "Self-Evolution" in the context of RL for LLM agents is still evolving and not yet well-established. This category currently collects works whose paper titles explicitly contain "self-evolving" or "self-evolution", where the agent improves itself through RL-driven feedback loops.
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| AgentEvolver | | 2025.11 | Alibaba/Tongyi Lab | Paper | Custom |
| SEAgent | | 2025.8 | Shanghai AI Lab / CUHK | Paper | Custom |
| MemSkill | | 2026.2 | NTU/UIUC/UIC/Tsinghua | Paper | Custom |
| MemRL | | 2026.1 | SJTU/Xidian/NUS/USTC/MemTensor | Paper | Custom |
| RAGEN | | 2025.1 | RAGEN-AI | Paper | veRL |
| WebRL | | 2024.11 | Tsinghua/Zhipu AI | Paper | Custom |
| EvolveR | | 2025.10 | KnowledgeXLab / Shanghai AI Lab | Paper | veRL |
| R-Zero | | 2025.8 | Tencent AI Seattle Lab / WashU / UMD | Paper | EasyR1 |
| Absolute-Zero-Reasoner | | 2025.5 | Tsinghua (LeapLabTHU) / BIGAI / PSU | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| AgentEvolver | ADCA-GRPO | Single | Outcome | Multi | Social Game/Tool-use | Rule | Yes |
| SEAgent | GRPO | Single | Outcome | Multi | Computer Use (OSWorld) | Model | Yes (Screenshot-based) |
| MemSkill | PPO | Single | Process | Multi | QA/ALFWorld | Model (learned skills) | Yes |
| MemRL | RL-based (Q-value) | Single | Process | Multi | HLE/BigCodeBench/ALFWorld | Model (retrieval) | Yes |
| RAGEN | PPO/GRPO (StarPO) | Single | Both | Multi | TextGame | All | Yes |
| WebRL | Actor-Critic RL + ORM | Single | Outcome | Multi | Web Navigation (WebArena) | Model (ORM) | Yes (Web browsing) |
| EvolveR | GRPO (closed-loop online+offline) | Single | Outcome | Multi | Multi-hop QA (NQ/HotpotQA) | Rule | Yes (experience retrieval) |
| R-Zero | GRPO (Challenger + Solver co-evolution) | Multi | Outcome | Multi | Math/SuperGPQA/MMLU-Pro/BBEH | Rule (majority voting) | No |
| Absolute-Zero-Reasoner | TRR++ (Task-Relative REINFORCE++) | Single | Outcome | Single | Code/Math Reasoning (HumanEval/MBPP/LiveCodeBench) | Rule + learnability | Yes (Python exec) |
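R-Zero's entry above lists rule-based majority voting as its reward: without gold labels, a self-evolving setup can reward each sampled answer by its agreement with the group's most frequent answer, which acts as a pseudo-label. A minimal sketch of that idea (hypothetical names, not R-Zero's actual code):

```python
# Illustrative sketch: self-labeled rewards via majority voting over a
# group of sampled answers, usable when no gold answer is available.
from collections import Counter

def majority_vote_rewards(answers):
    """Reward 1.0 for answers matching the group majority, else 0.0."""
    majority, _count = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Four samples for one question; "42" wins the vote 3-1.
rewards = majority_vote_rewards(["42", "42", "41", "42"])
```

The obvious caveat, which such works must address, is that a confidently wrong majority still gets rewarded; R-Zero pairs this with a challenger that generates questions at the solver's frontier.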
| Github Repo | 🌟 Stars | Date | Org | Task |
|---|---|---|---|---|
| OpenSandbox | | 2026.3 | Alibaba | Code/GUI/Agent Eval |
| OpenEnv | | 2026.3 | Meta (PyTorch) | Chess/Arcade/Finance |
| NeMo-Gym | | 2026.1 | NVIDIA | Multi-step/Multi-turn |
| open-trajectory-gym | | 2026.3 | Individual | CTF/Security |
| R2E-Gym | | 2025.4 | UC Berkeley/ANU | SWE |
| LoCoBench-Agent | | 2025.11 | Salesforce AI Research | SWE |
| Simia-Agent-Training | | 2025.10 | Microsoft | ToolUse/API |
| PaperArena | | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
| enterprise-deep-research | | 2025.9 | Salesforce AI Research | DeepResearch |
| meta-agents-research-environments | | 2025.9 | Meta (FAIR) | Gaia2 / Multi-universe |
| BrowseComp-Plus | | 2025.8 | University of Waterloo | Deep Research Eval |
| MCP-Bench | | 2025.8 | Accenture | MCP Tool-use (28 servers) |
| MCPVerse | | 2025.8 | Individual | MCP Tools (550+) |
| CompassVerifier | | 2025.7 | Shanghai AI Lab | Reasoning |
| tau2-bench | | 2025.6 | Sierra Research | Tool-Agent-User |
| MCP-Universe | | 2025.5 | Salesforce AI Research | MCP Tool-use |
| SWE-smith | | 2025.4 | Princeton/Stanford/SWE-bench | SWE |
| SWE-Gym | | 2024.12 | UC Berkeley/UIUC/CMU/Apple | SWE |
| Mind2Web-2 | | 2025.6 | Ohio State University | Web |
| gem | | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
| MLE-Dojo | | 2025.5 | GIT, Stanford | MLE |
| atropos | | 2025.4 | Nous Research | Game/Code/Tool |
| InternBootcamp | | 2025.4 | InternBootcamp | Coding/QA/Game |
| loong | | 2025.3 | CAMEL-AI.org | RLVR |
| DataSciBench | | 2025.2 | Tsinghua | Data Analysis |
| reasoning-gym | | 2025.1 | open-thought | Math/Game |
| llmgym | | 2025.1 | tensorzero | TextGame/Tool |
| debug-gym | | 2024.11 | Microsoft Research | Debugging/Game/Code |
| gym-llm | | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
| AgentGym | | 2024.6 | Fudan | Web/Game |
| tau-bench | | 2024.6 | Sierra | Tool |
| appworld | | 2024.6 | Stony Brook University | Phone Use |
| android_world | | 2024.5 | Google Research | Phone Use |
| TheAgentCompany | | 2024.3 | CMU, Duke | Coding |
| LlamaGym | | 2024.3 | Rohan Pandey | Game |
| visualwebarena | | 2024.1 | CMU | Web |
| LMRL-Gym | | 2023.12 | UC Berkeley | Game |
| OSWorld | | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
| webarena | | 2023.7 | CMU | Web |
| AgentBench | | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
| WebShop | | 2022.7 | Princeton-NLP | Web |
| ScienceWorld | | 2022.3 | AllenAI | TextGame/ScienceQA |
| alfworld | | 2020.10 | Microsoft, CMU, UW | Embodied |
| factorio-learning-environment | | 2021.6 | JackHopkins | Game |
| jericho | | 2018.10 | Microsoft, GIT | TextGame |
| TextWorld | | 2018.6 | Microsoft Research | TextWorld |
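Most environments in this table expose a gym-style reset/step interface that the RL frameworks above drive: the environment returns an observation, the policy emits an action, and the environment returns the next observation, a reward, and a done flag. A toy sketch of that loop (purely illustrative; not any listed environment's actual API):

```python
# Illustrative sketch of the gym-style interface most listed environments
# follow. CountdownEnv is a hypothetical toy text environment.

class CountdownEnv:
    """Toy multi-turn environment: the agent must say 'stop' within 3 turns."""

    def reset(self):
        self.turns = 0
        return "Say 'stop' within 3 turns."

    def step(self, action):
        self.turns += 1
        done = (action == "stop") or self.turns >= 3
        reward = 1.0 if action == "stop" else 0.0
        return f"turn {self.turns}", reward, done

env = CountdownEnv()
obs = env.reset()
# In training, an LLM policy would map obs to an action; hard-coded here.
obs, reward, done = env.step("stop")
```

In agent RL training, each such loop produces a multi-turn trajectory whose rewards feed the algorithms (PPO, GRPO, etc.) listed in the tables above.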
- JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- Acting Less is Reasoning More! Teaching Model to Act Efficiently
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use
- Understanding Tool-Integrated Reasoning
- Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
- Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- EnvX: Agentize Everything with Agentic AI
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Agent2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
- Adversarial Reinforcement Learning for Large Language Model Agent Safety
- Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction
- InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
If you find this repository useful, please consider citing it:
```bibtex
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}
```

Made with ❤️ by the AgentsMeetRL community
