AgentsMeetRL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- 🤖 The criterion for including an agent project is that it must feature at least one of the following: multi-turn interaction or tool use (so Tool-Integrated Reasoning, TIR, projects are in scope for this repo).
- ⚠️ This project is based on code analysis of open-source repositories using LLM coding agents, which may produce unfaithful entries. Although manually reviewed, omissions may remain. If you find any errors, please don't hesitate to let us know through issues or PRs - we warmly welcome them!
- 🚀 We particularly focus on the reinforcement learning frameworks, RL algorithms, rewards, and environments that each project depends on, as a reference for how these excellent open-source projects make their technical choices. See [Click to view technical details] under each table.
- 📅 Last updated: 2026-04-18
- 🤗 Feel free to submit your own projects anytime - we welcome contributions!
Taxonomy:
- Base Framework: General-purpose RL training frameworks for LLM agents (e.g., veRL, OpenRLHF, trl)
- General/MultiTask: Agent systems trained/evaluated across multiple tasks or environments
- Search & RAG: Search-augmented reasoning agents that use retrieval tools to enhance LLM reasoning
- Web & GUI: Agents that interact with web browsers, mobile/desktop GUIs, or operating systems
- Tool-Use: Agents trained to invoke external tools (APIs, code executors, MCP, etc.)
- Code & SWE: Software engineering and code generation agents
- Reasoning: Reasoning agents with tool-integrated or multi-turn reasoning (math, QA, visual)
- Multi-Agent RL: Multi-agent collaboration, negotiation, or credit assignment via RL
- Memory: Agents that learn to manage, retrieve, or evolve memory
- Embodied: Agents operating in embodied/physical simulation environments
- Domain-Specific: RL agents for specialized domains (medical, OS tuning, etc.)
- Reward & Training: Process/outcome reward models and training methodologies for agents
- Safety: RL for agent safety alignment, adversarial red-teaming, and jailbreak defense/attack
- VLM Agent: Vision-language model agents trained with RL for multimodal interaction
- Self-Evolution: Agents that self-evolve via RL feedback loops (⚠️ definition still evolving in the community)
- Environment: Benchmarks, gyms, and sandbox environments for agent training/evaluation
Some Enumerations:
- Enumeration for Reward Type:
- External Verifier: e.g., a compiler or math solver
- Rule-Based: e.g., a LaTeX parser with exact match scoring
- Model-Based: e.g., a trained verifier LLM or reward LLM
- Custom
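The reward types above can be illustrated with minimal callable sketches (hypothetical helper names, purely for illustration - no project in the tables uses exactly this code):

```python
from typing import Callable

# Rule-based reward: e.g., exact match on a normalized \boxed{...} answer.
def rule_based_reward(completion: str, gold: str) -> float:
    answer = completion.split("\\boxed{")[-1].rstrip("}").strip()
    return 1.0 if answer == gold.strip() else 0.0

# External verifier reward: delegate to a tool such as a compiler or math solver.
def external_verifier_reward(completion: str, verify: Callable[[str], bool]) -> float:
    return 1.0 if verify(completion) else 0.0

# Model-based reward: score the completion with a trained reward/verifier LLM (stubbed).
def model_based_reward(completion: str, reward_model: Callable[[str], float]) -> float:
    return reward_model(completion)
```

"Custom" rewards typically compose several of these, e.g. a rule-based correctness term plus a model-based quality term.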
- 📢 2026-04 Update: Added 67 new repositories covering Apr 2025 – Apr 2026 across nearly every category (notably VLM Agent +9, Search & RAG +10, Web & GUI +7, Tool-Use +7). Also reclassified SkyRL (→ General) and SPIRAL (→ Multi-Agent), and updated the VAGEN entry to its NeurIPS'25 upstream repo.
- 📢 2026-03 Update: Restructured taxonomy from 12 to 16 categories (added Multi-Agent RL, Reward & Training, Safety, VLM Agent, Self-Evolution, Domain-Specific; merged GUI into Web & GUI; retired TextGame/Biomedical). Added ~70 new repositories covering Sep 2025 – Mar 2026, growing the total from ~134 to 205.
| Github Repo | Date | Org | Paper Link |
|---|---|---|---|
| Open-AgentRL | 2026.2 | Gen-Verse | Paper | |
| OpenClaw-RL | 2026.3 | Gen-Verse | Paper | |
| Claw-R1 | 2026.3 | USTC | -- | |
| prime-rl | 2025.2 | Prime Intellect | -- | |
| NeMo-RL | 2026.1 | NVIDIA | -- | |
| RLinf | 2025.8 | Tsinghua/Infinigence AI/PKU | Paper | |
| siiRL | 2025.7 | Shanghai Innovation Institute | Paper | |
| slime | 2025.6 | Tsinghua University (THUDM) | blog | |
| agent-lightning | 2025.6 | Microsoft Research | Paper | |
| AReaL | 2025.6 | AntGroup/Tsinghua | Paper | |
| ROLL | 2025.6 | Alibaba | Paper | |
| MARTI | 2025.5 | Tsinghua | -- | |
| Tunix | 2025.4 | Google | -- | |
| RL2 | 2025.4 | Accio | -- | |
| verifiers | 2025.3 | Individual | -- | |
| oat | 2024.11 | NUS/Sea AI | Paper | |
| veRL | 2024.10 | ByteDance | Paper | |
| OpenRLHF | 2023.7 | OpenRLHF | Paper | |
| trl | 2019.11 | HuggingFace | -- |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Open-AgentRL | GRPO-TCR | Single | Both | Multi | Reasoning/GUI/Coding | Model (PRM) | Yes (SandboxFusion) |
| OpenClaw-RL | GRPO/OPD | Both | Both | Multi | Terminal/GUI/SWE/Tool-call | Model/External | Yes |
| Claw-R1 | Generic RL Framework | Multi | Both | Multi | General Agent | All | Yes (Framework-agnostic) |
| prime-rl | GRPO/PPO | Multi | Outcome | Multi | Math/Code/Search | Model/External | Yes |
| NeMo-RL | GRPO/DAPO/GDPO/DPO | Single | Outcome | Multi | Math/Reasoning/Code | Rule/External | No |
| RLinf | PPO/GRPO/DAPO/SAC/REINFORCE++/CrossQ/RLPD | Both | Both | Multi | Robotics/Math/Code/QA/VQA | All (Rule/Model/External) | Yes |
| siiRL | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
| slime | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| agent-lightning | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| AReaL | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| ROLL | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| MARTI | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| Tunix | PPO/GRPO/GSPO-Token/DAPO/Dr.GRPO | Single | Outcome | Multi | Math/Code/Game | Rule/External | Yes |
| RL2 | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| verifiers | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| oat | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
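GRPO, the most common algorithm in the table above, replaces PPO's learned value baseline with a group-relative one: sample a group of completions per prompt, then normalize each completion's reward against the group mean and standard deviation. A minimal sketch of that advantage computation (illustrative only, not any listed framework's exact implementation):

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - mean) / (std + eps) over one prompt's samples."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Variants in the table (DAPO, Dr. GRPO, GSPO, GiGPO, ...) mostly change this normalization, the clipping, or the token/sequence level at which the advantage is applied.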
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MetaClaw | 2026.3 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| SkillRL | 2026.2 | UNC-Chapel Hill (AIMING Lab) | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| youtu-agent | 2025.12 | Tencent Youtu Lab | Paper | Custom | |
| DEPO | 2025.11 | HKUST/SJTU | Paper | LLaMA-Factory | |
| SPEAR | 2025.10 | Tencent Youtu Lab | Paper | veRL/verl-agent | |
| DeepAgent | 2025.10 | RUC/Xiaohongshu | Paper | Custom | |
| AgentRL | 2025.9 | Tsinghua | Paper | veRL | |
| AgentGym-RL | 2025.9 | Fudan University | Paper | veRL | |
| Agent_Foundation_Models | 2025.8 | OPPO Personal AI Lab | Paper | veRL | |
| Trinity-RFT | 2025.5 | Alibaba | Paper | veRL | |
| SPA-RL-Agent | 2025.5 | PolyU | Paper | TRL | |
| verl-agent | 2025.5 | NTU/Skywork | Paper | veRL | |
| SkyRL | 2025.4 | UC Berkeley / NovaSky-AI | Paper | Self (skyrl-train) | |
| VAGEN | 2025.3 | Northwestern University (mll-lab-nu) | Paper | veRL | |
| ART | 2025.3 | OpenPipe | Paper | TRL | |
| OpenManus-RL | 2025.3 | UIUC/MetaGPT | -- | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MetaClaw | GRPO (LoRA) | Single | Process | Multi | General Agentic | Model (PRM) | Yes (Skill-augmented) |
| SkillRL | GRPO | Single | Outcome | Multi | ALFWorld/WebShop/Search | Rule | Yes (Web search, actions) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Math/Physics/Chemistry/Biomedicine/Long-context/IF/SWE | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| youtu-agent | Training-Free GRPO | Single | Outcome | Multi | Deep Research/Data Analysis/Tool-use | Model/External | Yes (Web search, code, file) |
| DEPO | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| SPEAR | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| DeepAgent | ToolPO | Single | Outcome | Multi | ToolBench/ALFWorld/WebShop/GAIA/HLE | Model | Yes (16,000+ RapidAPIs) |
| AgentRL | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| AgentGym-RL | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| Agent_Foundation_Models | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| Trinity-RFT | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| SPA-RL-Agent | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| verl-agent | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
| SkyRL | GRPO/PPO | Single | Both | Multi | Long-horizon Agents (SWE-Bench/Search/Math/SQL) | Rule/External/Custom | Yes |
| VAGEN | PPO/GRPO (World Modeling RL) | Single | Both | Multi | Navigation/TextGame/Multimodal | All | Yes |
| ART | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| OpenManus-RL | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| ProRAG | 2026.1 | RUC | Paper | Custom | |
| MemSearcher | 2025.11 | CAS | Paper | Custom | |
| ReSeek | 2025.10 | Tencent PCG BAC/Tsinghua University | Paper | veRL | |
| AutoGraph-R1 | 2025.10 | HKUST KnowComp | Paper | Custom | |
| Tree-GRPO | 2025.9 | AMAP | Paper | veRL | |
| ASearcher | 2025.8 | Ant Research RL Lab/Tsinghua University/UW | Paper | RealHF/AReaL | |
| Graph-R1 | 2025.7 | BUPT/NTU/NUS | Paper | veRL | |
| Kimi-Researcher | 2025.6 | Moonshot AI | blog | Custom | |
| R-Search | 2025.6 | Individual | -- | veRL | |
| R1-Searcher-plus | 2025.5 | RUC | Paper | Custom | |
| StepSearch | 2025.5 | SenseTime | Paper | veRL | |
| AutoRefine | 2025.5 | USTC | Paper | veRL | |
| ZeroSearch | 2025.5 | Alibaba | Paper | veRL | |
| ReasonRAG | 2025.5 | CityU HK / Huawei | Paper | Custom | |
| Agentic-RAG-R1 | 2025.12 | PKU | -- | Custom | |
| WebThinker | 2025.4 | RUC | Paper | Custom | |
| DeepResearcher | 2025.4 | SJTU | Paper | veRL | |
| Search-R1 | 2025.3 | UIUC/Google | paper1, paper2 | veRL | |
| R1-Searcher | 2025.3 | RUC | Paper | OpenRLHF | |
| C-3PO | 2025.2 | Alibaba | Paper | OpenRLHF | |
| DeepRetrieval | 2025.2 | UIUC | Paper | veRL | |
| SSRL | 2025.8 | Tsinghua | Paper | Custom | |
| Research-Venus | 2025.8 | Ant Group | Paper | Custom | |
| DeepResearch | 2025.9 | Alibaba/Tongyi Lab | Paper | Custom | |
| DeepDive | 2025.9 | Tsinghua/THUDM | Paper | Custom | |
| O-Researcher | 2026.1 | OPPO PersonalAI Lab | Paper | Custom | |
| DR Tulu | 2025.11 | AI2 / UW / CMU / MIT | Paper | Open-Instruct | |
| WebSeer | 2025.10 | Individual | Paper | veRL | |
| HiPRAG | 2025.10 | Individual | Paper | veRL | |
| VRAG | 2025.5 | USTC / Tongyi Lab, Alibaba | Paper | veRL | |
| MaskSearch | 2025.5 | Tongyi Lab, Alibaba | Paper | DAPO / veRL | |
| R3-RAG | 2025.5 | Fudan NLP | Paper | OpenRLHF | |
| O2-Searcher | 2025.5 | KnowledgeXLab | Paper | veRL | |
| s3 | 2025.5 | UIUC | Paper | veRL | |
| knowledge-r1 | 2025.5 | CAS / UCAS | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ProRAG | GRPO + DGA (dual-granularity advantage) | Single | Both | Multi | Multi-hop RAG | Model (PRM via MCTS) | Yes (Retrieval) |
| MemSearcher | Multi-context GRPO | Single | Outcome | Multi | Search/QA + Memory | Rule/Model | Yes (Web search + Memory) |
| ReSeek | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| AutoGraph-R1 | GRPO (via VeRL) | Single | Outcome | Multi | KG Construction for QA | Rule | Yes (Graph retrieval) |
| Tree-GRPO | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| ASearcher | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| Graph-R1 | GRPO/REINFORCE++/PPO | Single | Outcome | Multi | KGQA | Rule (EM/F1) | Yes (Graph retrieval) |
| Kimi-Researcher | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| R-Search | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| R1-Searcher-plus | Custom | Single | Outcome | Multi | Search | Model | Search |
| StepSearch | PPO | Single | Process | Multi | QA | Model | Search |
| AutoRefine | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| ZeroSearch | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| ReasonRAG | DPO + MCTS-based PRM | Single | Process | Multi | Multi-hop QA | Model (PRM) | Yes (Wikipedia search) |
| Agentic-RAG-R1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA | Rule/Model | Yes (Wiki/Doc search) |
| WebThinker | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| DeepResearcher | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| Search-R1 | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| R1-Searcher | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| C-3PO | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| DeepRetrieval | GRPO | Single | Outcome | Multi | Query Generation/IR | Rule | Yes (Search) |
| SSRL | GRPO | Single | Outcome | Multi | Self-Search | Rule | Yes (Self-search) |
| Research-Venus | GRPO | Single | Both | Multi | Deep Research | Model (atomic thought) | Yes (Search) |
| DeepResearch | RL-based | Single | Outcome | Multi | Deep Research | Model | Yes (Search, Browse) |
| DeepDive | GRPO | Single | Outcome | Multi | KG-augmented Search | Rule | Yes (KG + Search) |
| O-Researcher | GRPO + RLAIF | Multi | Process | Multi | Deep Research (Zhihu-KOL/WideSearch/ELI5) | Model (LLM-as-Judge) | Yes (Search/Crawl) |
| DR Tulu | GRPO + evolving rubrics | Single | Outcome | Multi | Long-form Deep Research | Model (rubrics) | Yes (Search/MCP) |
| WebSeer | GRPO-style | Single | Outcome | Multi | Web Search QA (w/ self-reflection) | Rule/Model | Yes (Search) |
| HiPRAG | PPO | Single | Process | Multi | Efficient Agentic RAG | Model/Rule | Yes (Retrieval) |
| VRAG | GRPO | Single | Both | Multi | Visually-rich RAG | Rule/Model | Yes (Visual retrieval) |
| MaskSearch | DAPO | Single | Outcome | Multi | RAMP Pretraining + QA | Rule/Model | Yes (Search) |
| R3-RAG | PPO | Single | Both | Multi | Multi-hop QA | Rule | Yes (Retrieval) |
| O2-Searcher | GRPO | Single | Outcome | Multi | Open-ended QA | Rule/Model | Yes (Search) |
| s3 | GRPO | Single | Outcome | Multi | RAG / Medical QA | Model (Gain-Beyond-RAG) | Yes (Retrieval) |
| knowledge-r1 | GRPO | Single | Outcome | Multi | Knowledge-intensive QA (KB-aware) | Rule | Yes (Retrieval) |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MobileAgent | 2025.9 | X-PLUG (TongyiQwen) | paper | veRL | |
| InfiGUI-G1 | 2025.8 | InfiX AI | Paper | veRL | |
| UI-AGILE | 2025.7 | Xiamen University | Paper | Custom | |
| gui-rcpo | 2025.8 | Zhejiang University | Paper | Custom | |
| Grounding-R1 | 2025.6 | Salesforce | blog | trl | |
| AgentCPM-GUI | 2025.6 | OpenBMB/Tsinghua/RUC | Paper | Huggingface | |
| TTI | 2025.6 | CMU | Paper | Custom | |
| SE-GUI | 2025.5 | Nankai University/vivo | Paper | trl | |
| ARPO | 2025.5 | CUHK/HKUST | Paper | veRL | |
| GUI-G1 | 2025.5 | RUC | Paper | TRL | |
| WebAgent-R1 | 2025.5 | Amazon/UVA | Paper | Custom | |
| GUI-R1 | 2025.4 | CAS/NUS | Paper | veRL | |
| UI-R1 | 2025.3 | vivo/CUHK | Paper | TRL | |
| CollabUIAgents | 2025.2 | Tsinghua/Alibaba/HKUST | Paper | Custom | |
| WebAgent | 2025.1 | Alibaba | paper1, paper2 | LLaMA-Factory | |
| UI-TARS | 2025.9 | ByteDance Seed | Paper | Custom | |
| DigiQ | 2025.2 | UC Berkeley/CMU/Amazon | Paper | Custom | |
| ZeroGUI | 2025.5 | Shanghai AI Lab | Paper | Custom | |
| InfiGUI-R1 | 2025.4 | Zhejiang University | Paper | Custom | |
| GUI-Agent-RL | 2025.2 | Microsoft | Paper | Custom | |
| GUI-Libra | 2026.2 | GUI-Libra (MS-affiliated) | Paper | Custom | |
| MobileRL | 2025.9 | Tsinghua / Zhipu AI (THUDM) | Paper | Custom | |
| DART-GUI | 2025.9 | Computer-use-agents | Paper | veRL | |
| Mano-P | 2025.9 | Mininglamp AI | Paper | Mano-SDK | |
| GUI-G2 | 2025.7 | Zhejiang University (ZJU-REAL) | Paper | Custom (VLM-R1) | |
| MagicGUI | 2025.7 | Honor (MagicAgent-GUI) | Paper | Custom | |
| GTA1 | 2025.6 | Salesforce / ANU | Paper | Custom (DeepSpeed) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MobileAgent | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
| InfiGUI-G1 | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| UI-AGILE | GRPO | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| gui-rcpo | RCPO | Single | Outcome | Single | GUI Grounding | Rule (self-supervised) | No |
| Grounding-R1 | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| AgentCPM-GUI | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| TTI | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| SE-GUI | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| ARPO | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| GUI-G1 | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| WebAgent-R1 | M-GRPO | Single | Outcome | Multi | Web Navigation (WebArena-Lite) | Rule (task success) | Yes (Web browsing) |
| GUI-R1 | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| UI-R1 | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
| CollabUIAgents | DPO (credit re-assignment) | Multi | Process | Multi | GUI (Mobile + Web) | Model (LLM) | Yes (GUI interaction) |
| WebAgent | DAPO | Multi | Process | Multi | Web | Model | Yes |
| UI-TARS | Multi-turn RL | Single | Both | Multi | GUI (Cross-platform) | Model | Yes (GUI actions) |
| DigiQ | Value-based offline RL | Single | Outcome | Multi | Android Device Control | Model (Q-function) | Yes |
| ZeroGUI | Online RL | Single | Outcome | Multi | GUI Agent | Rule | Yes (GUI actions) |
| InfiGUI-R1 | RL + sub-goal guidance | Single | Both | Multi | GUI Reasoning | Rule | Yes |
| GUI-Agent-RL | Value-based RL (VEM) | Single | Outcome | Multi | GUI (Web Shopping) | Model | Yes |
| GUI-Libra | KL-regularized GRPO (Partially Verifiable RL) | Single | Outcome | Multi | GUI (AndroidWorld/WebArena/Online-Mind2Web) | Rule | Yes |
| MobileRL | AdaGRPO (Difficulty-Adaptive) | Single | Outcome | Multi | Mobile GUI (AndroidWorld/AndroidLab) | Rule | Yes (Android) |
| DART-GUI | Decoupled GRPO | Single | Outcome | Multi | GUI (OSWorld) | Rule | Yes |
| Mano-P | Three-stage SFT→Offline RL→Online RL | Single | Both | Multi | GUI (OSWorld) | Rule | Yes |
| GUI-G2 | GRPO (Gaussian Reward) | Single | Outcome | Single | GUI Grounding | Rule (continuous) | No |
| MagicGUI | Reinforcement Fine-Tuning (RFT) | Single | Outcome | Multi | Mobile GUI | Model/Rule | Yes |
| GTA1 | GRPO-style (click-success reward) | Single | Outcome | Multi | GUI Grounding (OSWorld/ScreenSpot-Pro) | Rule | Yes |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| MATPO | 2025.10 | MiroMind AI | Paper | Custom | |
| MiroRL | 2025.8 | MiroMindAI | HF Repo | veRL | |
| verl-tool | 2025.6 | TIGER-Lab | X | veRL | |
| Multi-Turn-RL-Agent | 2025.5 | University of Minnesota | Paper | Custom | |
| Tool-N1 | 2025.5 | NVIDIA | Paper | veRL | |
| Tool-Star | 2025.5 | RUC | Paper | LLaMA-Factory | |
| RL-Factory | 2025.5 | Simple-Efficient | model | veRL | |
| ReTool | 2025.4 | ByteDance | Paper | veRL | |
| AWorld | 2025.3 | Ant Group (inclusionAI) | Paper | veRL | |
| Agent-R1 | 2025.3 | USTC | Paper | veRL | |
| ReCall | 2025.3 | BaiChuan | Paper | veRL | |
| ToolRL | 2025.4 | UIUC | Paper | veRL | |
| ToolOrchestra | 2025.11 | NVIDIA / HKU | Paper | Custom (veRL-based) | |
| ToolMaster | 2025.11 | Northeastern University (NEUIR) | Paper | Custom | |
| CodeGym | 2025.9 | Academic | Paper | Custom | |
| UserRL | 2025.9 | Salesforce AI Research | Paper | veRL | |
| ToolBrain | 2025.9 | ToolBrain (AAMAS 2026) | Paper | Custom | |
| Tool-R1 | 2025.9 | Individual (YBYBZhang) | Paper | Custom | |
| calculator_agent_rl | 2025.5 | Individual (Danau5tin) | -- | Verifiers |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MATPO | GRPO (multi-agent) | Multi | Outcome | Multi | Tool-use/Search | Rule | Yes (MCP: Serper, Web scraping) |
| MiroRL | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
| verl-tool | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| Multi-Turn-RL-Agent | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
| Tool-N1 | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| Tool-Star | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| RL-Factory | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
| ReTool | PPO | Single | Outcome | Multi | Math | External | Code |
| AWorld | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| Agent-R1 | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
| ReCall | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
| ToolRL | GRPO/PPO | Single | Outcome | Multi | Tool Learning | Rule/External | Yes |
| ToolOrchestra | End-to-end RL (outcome+efficiency+preference) | Single | Both | Multi | Tool orchestration / agentic workflows | All | Yes (Search/Code/LLMs) |
| ToolMaster | SFT + GRPO (trial-then-execute) | Single | Outcome | Multi | Tool trialing + execution (ToolHop/TMDB/StableToolBench) | Rule/External | Yes (Simulated tools) |
| CodeGym | GRPO-family | Single | Outcome | Multi | Synthetic Multi-turn Tool-Use | Rule (verifiable) | Yes (Synthesized tools) |
| UserRL | GRPO (multi-turn credit) | Single | Both | Multi | User-centric (Function/Persuade/Search/Tau Gyms) | Model/External | Yes |
| ToolBrain | GRPO/DPO | Single | Outcome | Multi | Agentic tool training | Rule/Model | Yes (User-defined tools) |
| Tool-R1 | Policy optimization (PPO-style) | Single | Outcome | Multi | Agentic Tool Use (GAIA) | Model + External | Yes (Python exec) |
| calculator_agent_rl | GRPO | Single | Outcome | Multi | Calculator Tool Use | Model (Claude-judge) | Yes |
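Most Tool-Use entries above share the same rollout shape at training time: the policy emits text, the trainer detects a tool call, executes it, appends the observation, and continues until a final answer, with the (usually outcome) reward assigned at the end. A schematic loop, with hypothetical tag names and stubbed `generate`/`run_tool` helpers:

```python
import re

def rollout(generate, run_tool, prompt: str, max_turns: int = 4) -> str:
    """Multi-turn tool-use rollout: alternate policy generation and tool execution."""
    context = prompt
    for _ in range(max_turns):
        completion = generate(context)            # policy LLM step (stubbed)
        context += completion
        call = re.search(r"<tool>(.*?)</tool>", completion, re.S)
        if call is None:                          # no tool call -> final answer reached
            break
        observation = run_tool(call.group(1))     # e.g., code executor or search API
        context += f"<result>{observation}</result>"
    return context
```

The resulting trajectory is then scored (rule, model, or external verifier) and fed to the RL algorithm; frameworks differ mainly in how tool outputs are masked out of the loss.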
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| CUDA-Agent | 2026.2 | ByteDance/Tsinghua | Paper | Custom | |
| LLM-in-Sandbox | 2026.1 | RUC/MSRA/THU | Paper | rllm (w/ veRL) | |
| PPP-Agent | 2025.11 | CMU/OpenHands | Paper | veRL | |
| RepoDeepSearch | 2025.8 | PKU, Bytedance, BIT | Paper | veRL | |
| CUDA-L1 | 2025.7 | DeepReinforce AI | Paper | Custom | |
| MedAgentGym | 2025.6 | Emory/Georgia Tech | Paper | Huggingface | |
| CURE | 2025.6 | University of Chicago/Princeton/ByteDance | Paper | Huggingface | |
| Time-R1 | 2025.5 | UIUC | Paper | veRL | |
| ML-Agent | 2025.5 | MASWorks | Paper | Custom | |
| digitalhuman | 2025.4 | Tencent | Paper | veRL | |
| sweet_rl | 2025.3 | Meta/UCB | Paper | OpenRLHF | |
| swe-rl | 2025.2 | Meta/UIUC/CMU | Paper | Custom | |
| rllm | 2025.1 | Berkeley Sky Computing Lab (BAIR) / Together AI | Notion Blog | veRL | |
| open-r1 | 2025.1 | HuggingFace | -- | TRL | |
| R1-Code-Interpreter | 2025.5 | MIT | Paper | Custom | |
| CTRL | 2025.2 | HKU/ByteDance | Paper | Custom | |
| DeepAnalyze | 2025.10 | RUC/Tsinghua | Paper | Custom | |
| AceCoder | 2025.2 | Waterloo (TIGER-Lab) | Paper | Custom | |
| SWE-World | 2026.2 | RUC (RUCAIBox) | Paper | OpenRLHF + veRL | |
| CUDA-L2 | 2026.1 | DeepReinforce AI | Paper | Custom | |
| SWE-Swiss | 2025.7 | Tsinghua / ByteDance | -- | veRL | |
| Skywork-OR1 | 2025.4 | Skywork AI | Paper | Custom (veRL fork) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| CUDA-Agent | Agentic RL (staged) | Single | Outcome | Multi | CUDA Kernel Generation | Rule (correctness + performance) | Yes (compile/verify/profile) |
| LLM-in-Sandbox | GRPO++ | Single | Outcome | Multi | Code/SWE + General (Math/Sci/Bio) | Rule | Yes (Code Sandbox w/ Terminal, File, Internet) |
| PPP-Agent | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
| RepoDeepSearch | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| CUDA-L1 | Contrastive RL | Single | Outcome | Single | CUDA Optimization | Rule (performance) | No |
| MedAgentGym | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| CURE | PPO | Single | Outcome | Single | Code | External | No |
| Time-R1 | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| ML-Agent | Custom | Single | Process | Multi | Code | All | Yes |
| digitalhuman | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
| sweet_rl | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| swe-rl | RL-based | Single | Outcome | Single | SWE (SWE-bench) | Rule (similarity) | No |
| rllm | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| open-r1 | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
| R1-Code-Interpreter | GRPO | Single | Outcome | Multi | Code Interpretation | Rule/External | Yes (Code exec) |
| CTRL | RL (critique-revision) | Single | Process | Multi | Code Refinement | Model | Yes (Code exec) |
| DeepAnalyze | Curriculum RL | Single | Outcome | Multi | Data Science | Rule/External | Yes (Code exec) |
| AceCoder | GRPO | Single | Outcome | Single | Code Generation | External (test cases) | Yes |
| SWE-World | RL with learned world model (SWT + SWR) | Single | Both | Multi | Docker-free SWE (SWE-Bench Verified) | Model (surrogate) + Rule | Yes |
| CUDA-L2 | Contrastive RL | Single | Outcome | Single | HGEMM / CUDA Matmul | Rule (TFLOPs) | Yes (compile/benchmark) |
| SWE-Swiss | Two-stage RL curriculum | Single | Outcome | Multi | SWE (Localization/Repair/Unit-Test) | Rule (test-based) | Yes |
| Skywork-OR1 | Large-scale rule-based RL (GRPO variant) | Single | Outcome | Single | Math + Code (AIME/LiveCodeBench) | Rule (verifiable) | No |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| Agent0 | 2025.10 | UNC‑Chapel Hill / Salesforce Research / Stanford University | Paper | veRL | |
| KG-R1 | 2025.9 | UIUC/Google | Paper1, Paper2 | veRL | |
| AgentFlow | 2025.09 | Stanford University | arXiv | veRL | |
| ARPO | 2025.7 | RUC, Kuaishou | Paper | veRL | |
| terminal-bench-rl | 2025.7 | Individual (Danau5tin) | N/A | rLLM | |
| MOTIF | 2025.6 | University of Maryland | Paper | trl | |
| cmriat/l0 | 2025.6 | CMRIAT | Paper | veRL | |
| agent-distillation | 2025.5 | KAIST | Paper | Custom | |
| EasyR1 | 2025.4 | Individual | repo1/paper2 | veRL | |
| AutoCoA | 2025.3 | BJTU | Paper | veRL | |
| ToRL | 2025.3 | SJTU | Paper | veRL | |
| ReMA | 2025.3 | SJTU, UCL | Paper | veRL | |
| Agentic-Reasoning | 2025.2 | Oxford | Paper | Custom | |
| SimpleTIR | 2025.2 | NTU, Bytedance | Notion Blog | veRL | |
| openrlhf_async_pipline | 2024.5 | OpenRLHF | Paper | OpenRLHF | |
| THOR | 2025.9 | USTC / iFLYTEK | Paper | veRL | |
| Tool-Light | 2025.9 | RUC (RUC-NLPIR) | Paper | LLaMA-Factory | |
| AutoTIR | 2025.7 | Beihang University / BAAI | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Agent0 | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| KG-R1 | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| AgentFlow | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| ARPO | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| terminal-bench-rl | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
| MOTIF | GRPO | Single | Outcome | Multi | QA | Rule | No |
| cmriat/l0 | PPO | Multi | Process | Multi | QA | All | Yes |
| agent-distillation | PPO | Single | Process | Multi | QA/Math | External | Yes |
| EasyR1 | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| AutoCoA | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| ToRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| ReMA | PPO | Multi | Outcome | Multi | Math | Rule | No |
| Agentic-Reasoning | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| SimpleTIR | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
| openrlhf_async_pipline | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
| THOR | Hierarchical GRPO (trajectory+step) | Single | Both | Multi | Math (MATH500/AIME/Olympiad) | External (SandboxFusion) | Yes (Python) |
| Tool-Light | Self-Evolved DPO | Single | Outcome | Multi | Tool-Integrated Reasoning | Model (preference) | Yes (FlashRAG/Python) |
| AutoTIR | PPO | Single | Outcome | Multi | Autonomous Tool Selection (QA/Math/IF) | Rule | Yes (Search/Python) |
| Github Repo | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|
| PettingLLMs | 2025.10 | Intel / UCSD | Paper | Custom | |
| MASPRM | 2025.10 | UBC / Huawei | Paper | Custom | |
| ARIA | 2025.6 | Fudan University | Paper | Custom | |
| AMPO | 2025.5 | Tongyi Lab, Alibaba | Paper | veRL | |
| MAPoRL | 2025.8 | Academic | -- | Custom | |
| FlowReasoner | 2025.4 | Sea AI Lab / NUS | Paper | Custom | |
| DrMAS | 2026.2 | NTU | Paper | Custom | |
| MarsRL | 2025.11 | Academic | Paper | veRL | |
| MrlX | 2025.10 | Ant Group (AQ-MedAI) | Paper | Custom (SGLang + Megatron) | |
| CoMAS | 2025.10 | Shanghai AI Lab / CUHK / Oxford / NUS | Paper | Custom | |
| CoMLRL | 2025.8 | OpenMLRL | Paper | TRL | |
| SPIRAL | 2025.6 | NUS / A*STAR / Sea AI Lab | Paper | Oat | |
| MARFT | 2025.4 | SII / SJTU | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| PettingLLMs | AT-GRPO | Multi | Both | Multi | Game/Code/Math/Planning | Rule (verifiable) | No |
| MASPRM | PRM (trained from MCTS rollouts) | Multi | Process | Multi | Reasoning (GSM8K/MATH/MMLU) | Learned PRM | No |
| ARIA | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| AMPO | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
| MAPoRL | PPO | Multi | Outcome | Multi | Collaborative LLM Tasks | Rule | No |
| FlowReasoner | GRPO | Multi | Outcome | Multi | Multi-agent Workflow Design | Rule | Yes |
| DrMAS | GRPO (agent-wise) | Multi | Outcome | Multi | Multi-agent LLM Systems | Rule | No |
| MarsRL | RLVR (agent-specific rewards) | Multi | Both | Multi | Math Reasoning (AIME/BeyondAIME) | Rule (verifiable) | No |
| MrlX | M-GRPO (hierarchical) | Multi | Outcome | Multi | Deep Research (GAIA/XBench) | Rule + Model | Yes (Search) |
| CoMAS | RL w/ LLM-Judge intrinsic reward | Multi | Process | Multi | Co-evolving Reasoning | Model | No |
| CoMLRL | MAGRPO / MAREINFORCE / MARLOO | Multi | Outcome | Multi | Writing / Code / Minecraft | Custom | Minimal |
| SPIRAL | Role-conditioned Advantage Estimation (RAE) | Multi | Outcome | Multi | Zero-sum Games (TicTacToe/Kuhn/Negotiation) | Rule | No |
| MARFT | MARFT paradigm (action+token level) | Multi | Both | Multi | Research / Math | Rule | Yes |
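Many entries in the table above use GRPO-style algorithms with rule-based, verifiable outcome rewards. As a rough illustration of the shared core idea (a sketch only, with hypothetical function names — not any listed repo's implementation), the group-relative advantage simply normalizes each rollout's reward against its sampling group, replacing a learned critic:

```python
# Illustrative sketch: group-relative advantages as used by GRPO-style
# methods. A group of rollouts is sampled per prompt; the group mean and
# std act as the baseline instead of a value network.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its group statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one prompt, 0/1 correctness rewards from a
# rule-based verifier. Correct rollouts get positive advantage.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The same normalization applies per agent in agent-wise variants, where each agent's group is formed from its own rollouts.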
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| MEM1 | | 2025.7 | MIT | Paper | veRL (based on Search-R1) |
| Memento | | 2025.6 | UCL, Huawei | Paper | Custom |
| MemAgent | | 2025.6 | Bytedance, Tsinghua-SIA | Paper | veRL |
| Mem-alpha | | 2025.9 | UCSD / USTC | Paper | veRL |
| M3-Agent | | 2025.7 | ByteDance Seed / Zhejiang University | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MEM1 | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| Memento | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| MemAgent | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
| Mem-alpha | GRPO | Single | Outcome | Multi | Long-context QA + Memory Construction | Rule (downstream QA) | Yes (memory tools) |
| M3-Agent | RL-based | Single | Outcome | Multi | Long-video QA (M3-Bench) | Rule/Model | Yes (multimodal memory graph) |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| Embodied-R1 | | 2025.6 | Tianjin University | Paper | veRL |
| STeCa | | 2025.2 | The Hong Kong Polytechnic University | Paper | FastChat/TRL |
| VIKI-R | | 2025.6 | MARS-EAI (NeurIPS 2025 D&B) | Paper | veRL + LLaMA-Factory |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| Embodied-R1 | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| STeCa | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
| VIKI-R | GRPO (RFT after SFT) | Multi | Outcome | Multi | Embodied Multi-Robot Cooperation (VIKI-Bench) | Rule + Model | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework | Domain |
|---|---|---|---|---|---|---|
| MedSAM-Agent | | 2026.2 | CUHK/Tencent | Paper | Custom | Medical |
| OS-R1 | | 2025.8 | ISCAS | Paper | Custom | OS/Systems |
| MMedAgent-RL | | 2025.8 | Unknown | Paper | Unknown | Medical |
| DoctorAgent-RL | | 2025.5 | UCAS/CAS/USTC | Paper | RAGEN | Medical |
| Biomni | | 2025.3 | Stanford University (SNAP) | Paper | Custom | Biomedical |
| Doctor-R1 | | 2025.12 | Tsinghua (thu-unicorn) | Paper | veRL | Medical |
| Alpha-R1 | | 2025.12 | SJTU / FinStep.AI / StepFun | Paper | Custom | Financial |
| MedResearcher-R1 | | 2025.8 | Ant Group (AQ-MedAI) | Paper | Custom | Medical |
| LegalDelta | | 2025.8 | Northeastern University (NEUIR) | Paper | Custom | Legal |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| MedSAM-Agent | GRPO (via veRL) | Single | Both | Multi | Medical Image Segmentation | Model (clinical fidelity) | Yes (SAM/MedSAM2) |
| OS-R1 | GRPO (via veRL) | Single | Outcome | Multi | Linux Kernel Tuning | Rule | Yes (LightRAG, kernel config) |
| MMedAgent-RL | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| DoctorAgent-RL | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| Biomni | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
| Doctor-R1 | Experiential Agentic RL | Multi | Both | Multi | Clinical inquiry & diagnosis | Model + Rule + safety veto | No |
| Alpha-R1 | GRPO | Single | Outcome | Multi | Alpha factor screening (with real-time news) | External (portfolio returns) + Model | Yes |
| MedResearcher-R1 | GRPO-based (SFT + Online RL) | Single | Outcome | Multi | Medical Deep Research (MedBrowseComp) | Rule + Model | Yes (Search/KG) |
| LegalDelta | GRPO (CoT-guided info-gain) | Single | Process | Multi | Legal Reasoning | Model + Rule | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | Focus |
|---|---|---|---|---|---|
| ToolPRMBench | | 2026.1 | Arizona State University | Paper | PRM Benchmark for Tool-Use |
| RLVR-World | | 2025.5 | THU ML Group | Paper | RLVR for World Models |
| AgentPRM | | 2025.2 | Cornell | Paper | Process Reward for Agents |
| Agentic-Reward-Modeling | | 2025.2 | THU-KEG | Paper | Agentic Reward Agent |
| AgentRM | | 2025.2 | THUNLP/Tsinghua | Paper | Generalizable Agent RM |
| AgentProg | | 2025.5 | MobileLLM | Paper | Progress Reward Model (ProgRM) |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| ToolPRMBench | N/A (Benchmark) | Single | Process | Multi | Tool-Use | Rule/Model | Yes |
| RLVR-World | RLVR | Single | Outcome | Multi | World Modeling (Language/Video) | Model (verifiable) | No |
| AgentPRM | PPO/DPO + PRM | Single | Process | Multi | ALFWorld/General | Model (PRM) | Yes |
| Agentic-Reward-Modeling | DPO/Best-of-N | Single | Outcome | Single | General Instruction | Model (Reward Agent) | Yes (Verification) |
| AgentRM | MCTS/RM-guided | Single | Outcome | Multi | 9 Agent Tasks | Model (regression PRM) | Yes |
| AgentProg | Online RL w/ progress reward | Single | Process | Multi | GUI Agent Training | Model (ProgRM) | Yes |
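The Outcome/Process distinction running through these tables boils down to where the reward signal attaches: an outcome reward scores only the final result, while a process reward model (PRM) scores each intermediate step. A minimal sketch of that difference, with hypothetical function names (not any listed project's API):

```python
# Illustrative sketch: outcome vs. process rewards for one trajectory.

def outcome_reward(final_answer, gold_answer):
    # One scalar for the whole trajectory, e.g. from a rule-based verifier.
    return 1.0 if final_answer == gold_answer else 0.0

def process_return(step_scores, gamma=1.0):
    # Aggregate per-step PRM scores (each in [0, 1]) into a return.
    return sum((gamma ** t) * s for t, s in enumerate(step_scores))

# A 3-step trajectory whose PRM scores degrade at the last step: the
# process signal localizes the failure even if the outcome is binary.
steps = [0.9, 0.8, 0.2]
total = process_return(steps)
```

Projects marked "Both" combine the two, typically using the outcome signal for correctness and the process signal for credit assignment across turns.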
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| SafeSearch | | 2025.11 | Amazon Science | Paper | veRL |
| curiosity_redteam | | 2024.2 | MIT | Paper | Custom |
| RLbreaker | | 2024.6 | Purdue | Paper | Custom |
| xJailbreak | | 2025.1 | Academic | Paper | Custom |
| Auto-RT | | 2025.1 | ICIP-CAS | Paper | Custom |
| ToolSafe | | 2026.1 | Academic (MurrayTom) | Paper | veRL |
| TROJail | | 2025.12 | Academic (ACL 2026) | Paper | RAGEN + vLLM |
| Jailbreak-R1 | | 2025.6 | Academic (yuki-younai) | Paper | Custom |
| GuardReasoner-VL | | 2025.5 | NUS (yueliu1999) | Paper | Custom |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| SafeSearch | PPO (GAE/GRPO) | Single | Both | Multi | Safe QA/Search | Rule + Model | Yes (Search) |
| curiosity_redteam | RL + Curiosity | Single | Outcome | Multi | Red Teaming | Model | Yes (iterative query) |
| RLbreaker | Custom PPO | Single | Outcome | Multi | Jailbreaking | Model | Yes (mutator selection) |
| xJailbreak | RL | Single | Outcome | Multi | Jailbreaking | Model (embedding) | Yes (iterative) |
| Auto-RT | PPO | Single | Outcome | Multi | Red Teaming | Model | Yes (strategy exploration) |
| ToolSafe | Multi-task GRPO | Single | Process | Multi | Tool-Invocation Safety Guardrail | Rule + Model | Yes (tool monitoring) |
| TROJail | Multi-turn GRPO variant | Single | Both | Multi | Multi-turn Jailbreak Attack | Model (harmfulness judge) + Rule | Yes (target LLM) |
| Jailbreak-R1 | GRPO (3-stage: imitation→warm-up→progressive) | Single | Both | Multi | Red-teaming Prompt Generation | Model (judge) | Yes (target LLM) |
| GuardReasoner-VL | Online RL w/ rejection sampling | Single | Both | Multi | VLM Safety Guard (multimodal) | Rule + Model | No |
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| multimodal-search-r1 | | 2025.6 | ByteDance/NTU | Paper | Custom |
| DeepEyesV2 | | 2025.11 | Xiaohongshu | Paper | Custom |
| VDeepEyes | | 2025.5 | Xiaohongshu/XJTU | Paper | veRL |
| CoSo | | 2025.5 | NTU/Alibaba | Paper | Custom |
| RL4VLM | | 2024.5 | UC Berkeley | Paper | Custom |
| VSC-RL | | 2025.2 | Liverpool/Huawei/Tianjin/UCL | Paper | Custom |
| AlphaDrive | | 2025.3 | HUST/Horizon Robotics | Paper | Custom |
| Mini-o3 | | 2025.9 | Mini-o3 team | Paper | veRL |
| VisionThink | | 2025.7 | CUHK (dvlab-research) | Paper | veRL + EasyR1 |
| AutoVLA | | 2025.6 | UCLA Mobility Lab | Paper | Custom |
| Pixel-Reasoner | | 2025.5 | University of Waterloo (TIGER-AI-Lab) | Paper | OpenRLHF |
| Visual-ARFT | | 2025.5 | Shanghai AI Lab / SJTU | Paper | Custom |
| VTool-R1 | | 2025.5 | UIUC | Paper | veRL + EasyR1 |
| OpenThinkIMG | | 2025.5 | Academic (zhaochen0110) | Paper | OpenR1 |
| Chain-of-Focus | | 2025.5 | Multi-institution | Paper | veRL |
| GRIT | | 2025.5 | UC Santa Cruz (eric-ai-lab) | Paper | trl |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| multimodal-search-r1 | GRPO | Single | Outcome | Multi | Multimodal Search | Rule | Yes (Search) |
| DeepEyesV2 | Outcome RL | Single | Outcome | Multi | Multimodal Reasoning | Rule | Yes (Code exec, Web search) |
| VDeepEyes | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| CoSo | Soft RL (counterfactual) | Single | Outcome | Multi | Android/Card/Embodied | Rule | Yes |
| RL4VLM | PPO | Single | Outcome | Multi | GymCards/ALFWorld | Rule | Yes |
| VSC-RL | Variational RL | Single | Outcome | Multi | Mobile Device Control | Rule | Yes |
| AlphaDrive | GRPO | Single | Outcome | Multi | Autonomous Driving | Rule (4 planning rewards) | No |
| Mini-o3 | GRPO | Single | Outcome | Multi | Visual Search (V*/HR-Bench) | Rule | Yes (image crop) |
| VisionThink | GRPO w/ LLM-as-Judge | Single | Outcome | Multi | Efficient VQA | Model (LLM-Judge) | Yes (hi-res request) |
| AutoVLA | GRPO (RFT after SFT) | Single | Outcome | Multi | Autonomous Driving (nuScenes/nuPlan/Waymo) | Rule (PDMS) | No |
| Pixel-Reasoner | Curiosity-driven GRPO | Single | Both | Multi | Visual Reasoning (V*/TallyQA/Info-VQA) | Rule + Model | Yes (zoom/select-frame) |
| Visual-ARFT | GRPO (agentic RFT) | Single | Outcome | Multi | Multimodal Agentic Tool Use (MAT-Search/Coding) | Rule | Yes (Search/Python) |
| VTool-R1 | RFT (GRPO-based) | Single | Outcome | Multi | Chart/Table VQA | Rule | Yes (Python visual tools) |
| OpenThinkIMG | V-ToolRL (GRPO) | Single | Outcome | Multi | Chart Reasoning | Rule | Yes (GroundingDINO/SAM/OCR/crop) |
| Chain-of-Focus | AGAR (GRPO) | Single | Outcome | Multi | Visual Reasoning (V*) | Rule (outcome+format) | Yes (zoom-in) |
| GRIT | GRPO-GR (Grounded Reasoning) | Single | Outcome | Single | Visual Reasoning (bbox) | Rule | Yes (bbox) |
⚠️ Note: The definition of "Self-Evolution" in the context of RL for LLM agents is still evolving and not yet well-established. This category currently collects works whose paper titles explicitly contain "self-evolving" or "self-evolution", where the agent improves itself through RL-driven feedback loops.
| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
|---|---|---|---|---|---|
| AgentEvolver | | 2025.11 | Alibaba/Tongyi Lab | Paper | Custom |
| SEAgent | | 2025.8 | Shanghai AI Lab / CUHK | Paper | Custom |
| MemSkill | | 2026.2 | NTU/UIUC/UIC/Tsinghua | Paper | Custom |
| MemRL | | 2026.1 | SJTU/Xidian/NUS/USTC/MemTensor | Paper | Custom |
| RAGEN | | 2025.1 | RAGEN-AI | Paper | veRL |
| WebRL | | 2024.11 | Tsinghua/Zhipu AI | Paper | Custom |
| EvolveR | | 2025.10 | KnowledgeXLab / Shanghai AI Lab | Paper | veRL |
| R-Zero | | 2025.8 | Tencent AI Seattle Lab / WashU / UMD | Paper | EasyR1 |
| Absolute-Zero-Reasoner | | 2025.5 | Tsinghua (LeapLabTHU) / BIGAI / PSU | Paper | veRL |
📋 Click to view technical details
| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
|---|---|---|---|---|---|---|---|
| AgentEvolver | ADCA-GRPO | Single | Outcome | Multi | Social Game/Tool-use | Rule | Yes |
| SEAgent | GRPO | Single | Outcome | Multi | Computer Use (OSWorld) | Model | Yes (Screenshot-based) |
| MemSkill | PPO | Single | Process | Multi | QA/ALFWorld | Model (learned skills) | Yes |
| MemRL | RL-based (Q-value) | Single | Process | Multi | HLE/BigCodeBench/ALFWorld | Model (retrieval) | Yes |
| RAGEN | PPO/GRPO (StarPO) | Single | Both | Multi | TextGame | All | Yes |
| WebRL | Actor-Critic RL + ORM | Single | Outcome | Multi | Web Navigation (WebArena) | Model (ORM) | Yes (Web browsing) |
| EvolveR | GRPO (closed-loop online+offline) | Single | Outcome | Multi | Multi-hop QA (NQ/HotpotQA) | Rule | Yes (experience retrieval) |
| R-Zero | GRPO (Challenger + Solver co-evolution) | Multi | Outcome | Multi | Math/SuperGPQA/MMLU-Pro/BBEH | Rule (majority voting) | No |
| Absolute-Zero-Reasoner | TRR++ (Task-Relative REINFORCE++) | Single | Outcome | Single | Code/Math Reasoning (HumanEval/MBPP/LiveCodeBench) | Rule + learnability | Yes (Python exec) |
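R-Zero's entry above lists rule-based majority voting as its reward: without gold labels, a self-evolving setup can reward each sampled answer by its agreement with the group's most frequent answer, which acts as a pseudo-label. A minimal sketch of that idea (hypothetical names, not R-Zero's actual code):

```python
# Illustrative sketch: self-labeled rewards via majority voting over a
# group of sampled answers, usable when no gold answer is available.
from collections import Counter

def majority_vote_rewards(answers):
    """Reward 1.0 for answers matching the group majority, else 0.0."""
    majority, _count = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Four samples for one question; "42" wins the vote 3-1.
rewards = majority_vote_rewards(["42", "42", "41", "42"])
```

The obvious caveat, which such works must address, is that a confidently wrong majority still gets rewarded; R-Zero pairs this with a challenger that generates questions at the solver's frontier.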
| Github Repo | 🌟 Stars | Date | Org | Task |
|---|---|---|---|---|
| OpenSandbox | | 2026.3 | Alibaba | Code/GUI/Agent Eval |
| OpenEnv | | 2026.3 | Meta (PyTorch) | Chess/Arcade/Finance |
| NeMo-Gym | | 2026.1 | NVIDIA | Multi-step/Multi-turn |
| open-trajectory-gym | | 2026.3 | Individual | CTF/Security |
| R2E-Gym | | 2025.4 | UC Berkeley/ANU | SWE |
| LoCoBench-Agent | | 2025.11 | Salesforce AI Research | SWE |
| Simia-Agent-Training | | 2025.10 | Microsoft | ToolUse/API |
| PaperArena | | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
| enterprise-deep-research | | 2025.9 | Salesforce AI Research | DeepResearch |
| meta-agents-research-environments | | 2025.9 | Meta (FAIR) | Gaia2 / Multi-universe |
| BrowseComp-Plus | | 2025.8 | University of Waterloo | Deep Research Eval |
| MCP-Bench | | 2025.8 | Accenture | MCP Tool-use (28 servers) |
| MCPVerse | | 2025.8 | Individual | MCP Tools (550+) |
| CompassVerifier | | 2025.7 | Shanghai AI Lab | Reasoning |
| tau2-bench | | 2025.6 | Sierra Research | Tool-Agent-User |
| MCP-Universe | | 2025.5 | Salesforce AI Research | MCP Tool-use |
| SWE-smith | | 2025.4 | Princeton/Stanford/SWE-bench | SWE |
| SWE-Gym | | 2024.12 | UC Berkeley/UIUC/CMU/Apple | SWE |
| Mind2Web-2 | | 2025.6 | Ohio State University | Web |
| gem | | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
| MLE-Dojo | | 2025.5 | GIT, Stanford | MLE |
| atropos | | 2025.4 | Nous Research | Game/Code/Tool |
| InternBootcamp | | 2025.4 | InternBootcamp | Coding/QA/Game |
| loong | | 2025.3 | CAMEL-AI.org | RLVR |
| DataSciBench | | 2025.2 | Tsinghua | Data Analysis |
| reasoning-gym | | 2025.1 | open-thought | Math/Game |
| llmgym | | 2025.1 | tensorzero | TextGame/Tool |
| debug-gym | | 2024.11 | Microsoft Research | Debugging/Game/Code |
| gym-llm | | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
| AgentGym | | 2024.6 | Fudan | Web/Game |
| tau-bench | | 2024.6 | Sierra | Tool |
| appworld | | 2024.6 | Stony Brook University | Phone Use |
| android_world | | 2024.5 | Google Research | Phone Use |
| TheAgentCompany | | 2024.3 | CMU, Duke | Coding |
| LlamaGym | | 2024.3 | Rohan Pandey | Game |
| visualwebarena | | 2024.1 | CMU | Web |
| LMRL-Gym | | 2023.12 | UC Berkeley | Game |
| OSWorld | | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
| webarena | | 2023.7 | CMU | Web |
| AgentBench | | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
| WebShop | | 2022.7 | Princeton-NLP | Web |
| ScienceWorld | | 2022.3 | AllenAI | TextGame/ScienceQA |
| alfworld | | 2020.10 | Microsoft, CMU, UW | Embodied |
| factorio-learning-environment | | 2021.6 | JackHopkins | Game |
| jericho | | 2018.10 | Microsoft, GIT | TextGame |
| TextWorld | | 2018.6 | Microsoft Research | TextWorld |
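Most environments in this table expose a gym-style reset/step interface that the RL frameworks above drive: the environment returns an observation, the policy emits an action, and the environment returns the next observation, a reward, and a done flag. A toy sketch of that loop (purely illustrative; not any listed environment's actual API):

```python
# Illustrative sketch of the gym-style interface most listed environments
# follow. CountdownEnv is a hypothetical toy text environment.

class CountdownEnv:
    """Toy multi-turn environment: the agent must say 'stop' within 3 turns."""

    def reset(self):
        self.turns = 0
        return "Say 'stop' within 3 turns."

    def step(self, action):
        self.turns += 1
        done = (action == "stop") or self.turns >= 3
        reward = 1.0 if action == "stop" else 0.0
        return f"turn {self.turns}", reward, done

env = CountdownEnv()
obs = env.reset()
# In training, an LLM policy would map obs to an action; hard-coded here.
obs, reward, done = env.step("stop")
```

In agent RL training, each such loop produces a multi-turn trajectory whose rewards feed the algorithms (PPO, GRPO, etc.) listed in the tables above.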
- JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- Acting Less is Reasoning More! Teaching Model to Act Efficiently
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use
- Understanding Tool-Integrated Reasoning
- Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
- Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
- EnvX: Agentize Everything with Agentic AI
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Agent2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
- Adversarial Reinforcement Learning for Large Language Model Agent Safety
- Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction
- InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
If you find this repository useful, please consider citing it:
```bibtex
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}
```

Made with ❤️ by the AgentsMeetRL community
