We’re entering the era of experience, where large language models (LLMs) learn not just from static datasets, but from interactive experience gathered in complex, expressive environments.
As a step toward this, we introduce GEM — a General Experience Maker for LLMs — an open-source environment suite designed for training agentic LLMs via online reinforcement learning.
Like OpenAI Gym for traditional RL, GEM provides a standardized API and a growing collection of diverse environments. It is training framework-agnostic and supports seamless integration with six popular RL training frameworks including Oat and Tinker, offering:
- 🧩 Clean, composable environment APIs
- ⚙️ Async vectorized execution for high-throughput simulation
- 🔧 Tool integration & custom wrappers
- 🧠 Multi-environment training (see the sketch after this list)
- 🎈 Ready-to-use benchmark environments and algorithms
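Once GEM is installed (see the next section), several environments can be stepped concurrently with nothing more than the core API. The thread-pool sketch below is a plain-Python approximation for illustration only, not GEM's built-in async vectorized executor; it uses only the calls shown in the full example further down (`gem.make`, `reset`, `step`, `sample_random_action`).

```python
from concurrent.futures import ThreadPoolExecutor

import gem

# Create a handful of independent environment instances.
envs = [gem.make("game:GuessTheNumber-v0") for _ in range(4)]
observations = [env.reset()[0] for env in envs]

# Step all environments concurrently; swap the random actions for a batched
# policy call (e.g., one LLM generation per observation).
with ThreadPoolExecutor(max_workers=len(envs)) as pool:
    actions = [env.sample_random_action() for env in envs]
    results = list(pool.map(lambda pair: pair[0].step(pair[1]), zip(envs, actions)))

for next_obs, reward, terminated, truncated, info in results:
    print(reward, terminated or truncated)
```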
```bash
pip install -U gem-llm
```

Or install from source for the latest version:
```bash
git clone https://github.com/axon-rl/gem.git
cd gem
pip install -e .
```

Please check Getting Started for more setup details.
🔥 You can jump straight into the examples to quickly start your agentic RL training with GEM & your favorite training framework.
GEM's interface closely follows OpenAI Gym's API. Here's an example using the `game:GuessTheNumber-v0` environment:
```python
import gem

# List all supported environments
gem.print_envs()

# Initialize the environment
env = gem.make("game:GuessTheNumber-v0")

# Reset the environment to generate the first observation
observation, info = env.reset()

# Start the agent-environment loop
while True:
    action = env.sample_random_action()  # insert policy here, e.g.,
    # (pseudocode) action = llm.generate(observation)

    # Apply the action and receive the next observation, the reward,
    # and whether the episode has ended
    next_observation, reward, terminated, truncated, info = env.step(action)
    print("OBS", observation)
    print("ACT", action)

    # Update the policy (online) here, e.g.,
    # (pseudocode) policy = learn(policy, observation, action, reward, info)
    observation = next_observation

    # Exit when the episode terminates
    if terminated or truncated:
        break
```

- Environments consist of tasks and (optional) tools. Tool-calling is achieved via an environment wrapper, as demonstrated here.
- GEM is training framework-agnostic, and we demonstrate its integration with six popular RL training frameworks.
- We provide implementations and benchmarking results for different algorithms across a diverse set of environments.
| Category | Example Environments | Description |
|---|---|---|
| Games | `game:GuessTheNumber-v0-hard`, `game:Sudoku-v0-easy` | Classic language games |
| Math | `math:Math12K`, `math:DeepScaleR40K` | Mathematical reasoning |
| Code | `code:CodeContest`, `code:Taco8k` | Competitive coding |
| QA | `qa:NaturalQuestions`, `qa:HotpotQA` | Knowledge-intensive question answering |
| ReasoningGym | `rg:arc_1d`, `rg:letter_counting` | Diverse synthetic reasoning tasks |
| Tool | Description |
|---|---|
| Python | Python code executor that parses code blocks, executes them, and returns outputs |
| Search | Calls a search engine to retrieve documents for any query |
| MCP | Calls the general MCP API to train tool-use agents |
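As a rough illustration of the parse-execute-return pattern described for the Python tool above, here is a standalone sketch. It is not GEM's actual tool or wrapper code; the function name and structure are made up for the example.

```python
import re
import subprocess

FENCE = "`" * 3  # fenced-code delimiter
PATTERN = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_python_blocks(action_text: str, timeout: float = 5.0) -> str:
    """Extract fenced Python blocks from a model response, execute each one in a
    subprocess, and return the captured stdout/stderr as the tool observation."""
    outputs = []
    for code in PATTERN.findall(action_text):
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        outputs.append(proc.stdout + proc.stderr)
    return "\n".join(outputs)
```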
| Framework | Description |
|---|---|
| Oat | vLLM + DeepSpeed, modular, no ray |
| Tinker | SDK provided by Thinking Machines, frees you from infra issues |
| Verl | Supports diverse backends, models, and algorithms |
| RL2 | SGLang + FSDP, no ray, easy to hack |
| ROLL | Supports diverse backends, models, and algorithms |
| OpenRLHF | Supports diverse backends, models, and algorithms |
Examples of training agents on GEM environments with all of the above frameworks can be found here!
| Algorithm | Description |
|---|---|
| REINFORCE | A general policy gradient algorithm that can be applied to single- and multi-turn environments |
| GRPO | Mainly for bandits (single-turn), using group advantage normalization |
| PPO | Learns a turn-level critic to compute generalized advantage estimation (GAE) |
| REINFORCE + ReBN | REINFORCE with return batch normalization as introduced in our paper |
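To make the group normalization used by GRPO and the return batch normalization used by REINFORCE + ReBN concrete, here is a small NumPy sketch of the two schemes: GRPO normalizes returns within each group of responses to the same prompt, while ReBN normalizes returns across the whole batch. This is a simplified illustration under those assumptions, not the exact estimators from the paper.

```python
import numpy as np

def grpo_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group advantage normalization: `returns` has shape (num_prompts, group_size),
    one row per prompt, one column per sampled response."""
    mean = returns.mean(axis=1, keepdims=True)
    std = returns.std(axis=1, keepdims=True)
    return (returns - mean) / (std + eps)

def rebn_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return batch normalization: normalize returns across the entire batch."""
    return (returns - returns.mean()) / (returns.std() + eps)
```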
Please check out our paper for a more detailed description of each algorithm and empirical results showing their tradeoffs.
We welcome all forms of contribution, from adding new environments to integrating additional training frameworks. We're planning to write a community-driven technical report, and major contributors will be recognized with authorship. Join our Discord to discuss more!
- This work is supported by Sea AI Lab with computing resources.
- Our code learns from and builds on several awesome projects such as gym, rllm, TextArena, Search-R1, ReasoningGym.
- The training example code is built on Oat, Tinker, Verl, RL2, ROLL, OpenRLHF.
If you find our work useful for your research, please consider citing:
- GEM paper (please prioritize citing the paper unless you believe the blog is a better fit):

  ```bibtex
  @article{liu2025gem,
    title={GEM: A Gym for Agentic LLMs},
    author={Liu, Zichen and Sims, Anya and Duan, Keyu and Chen, Changyu and Yu, Simon and Zhou, Xiangxin and Xu, Haotian and Xiong, Shaopan and Liu, Bo and Tan, Chenmien and others},
    journal={arXiv preprint arXiv:2510.01051},
    year={2025}
  }
  ```

- GEM blog:

  ```bibtex
  @misc{liu2025gemblog,
    title={GEM: A Gym for Generalist LLMs},
    author={Liu, Zichen and Sims, Anya and Duan, Keyu and Chen, Changyu and Yang, Diyi and Lee, Wee Sun and Lin, Min},
    year={2025},
    howpublished={\url{https://axon-rl.notion.site/gem}},
    note={Notion Blog},
  }
  ```