By Zakaria Mhammedi and James Cohan, Google Research.
GowU decouples exploration from policy optimization in reinforcement learning. Instead of training agents with intrinsic motivation, GowU uses an uncertainty-guided tree search, inspired by the Go-With-The-Winner algorithm, to systematically drive exploration without the overhead of policy optimization. This enables an order of magnitude more efficient exploration than standard intrinsic motivation baselines on hard exploration benchmarks, achieving state-of-the-art performance on Montezuma's Revenge, Pitfall!, and Venture, and solving sparse-reward MuJoCo Adroit and AntMaze tasks directly from pixels without expert demonstrations.
GowU requires Python 3.11 and careful version alignment between JAX and DeepMind's legacy RL libraries (Acme, Reverb, Launchpad). Follow the steps below in order.
python3.11 -m venv gowu_env
source gowu_env/bin/activate
# Install base package in editable mode (--no-deps to avoid resolving
# dm-launchpad, which must be built from source in Step 3)
pip install --no-deps -e .
pip install \
'numpy<2.0.0' \
'scipy==1.11.4' \
'jax==0.4.18' \
'jaxlib==0.4.18' \
'dm-haiku==0.0.10' \
'optax==0.1.7' \
'chex==0.1.85' \
'tensorflow==2.15.1' \
'tensorflow-probability==0.23.0' \
'dm-env==1.6' \
'dm-tree==0.1.8' \
'gym>=0.26.0' \
'ale-py==0.8.1' \
'gym[accept-rom-license]' \
'fasteners' \
'tqdm' \
'frozendict>=2.3.0' \
'ml-collections>=0.1.1' \
'absl-py>=1.0.0' \
'Pillow' \
'portpicker' \
'mock' \
'xmanager'
# Install Reverb, Acme, and Sonnet (--no-deps to prevent
# pip from pulling newer incompatible transitive dependencies)
pip install --no-deps \
'dm-acme==0.4.0' \
'dm-reverb==0.13.0' \
'dm-sonnet==2.0.2'
# Pin protobuf to match TensorFlow
pip install 'protobuf==4.25.9'
dm-launchpad is no longer distributed on PyPI. Build it from source using Bazel:
git clone https://github.com/google-deepmind/launchpad.git
cd launchpad
./oss_build.sh --python 3.11
pip install /tmp/launchpad/dist/dm_launchpad-*.whl
dm-acme 0.4.0 references deprecated JAX and NumPy symbols that were removed
in the versions required by this stack. Apply these three patches inside your
virtual environment's site-packages:
Patch acme/jax/utils.py:
SITE=$(python -c "import site; print(site.getsitepackages()[0])")
sed -i 's/isinstance(x, jax\.xla\.DeviceArray)/isinstance(x, jax.Array)/g' \
"$SITE/acme/jax/utils.py"
sed -i 's/jax\.xla\.Device/jax.Device/g' \
"$SITE/acme/jax/utils.py"
sed -i 's/jnp\.DeviceArray/jax.Array/g' \
"$SITE/acme/jax/utils.py"
Patch acme/jax/variable_utils.py:
sed -i 's/jax\.xla\.Device/jax.Device/g' \
"$SITE/acme/jax/variable_utils.py"
Patch acme/wrappers/atari_wrapper.py:
sed -i 's/np\.float\b/np.float64/g' \
"$SITE/acme/wrappers/atari_wrapper.py"
python -c "
import jax; print(f'JAX: {jax.__version__}')
import acme; print('Acme: OK')
import reverb; print('Reverb: OK')
import launchpad; print('Launchpad: OK')
import gowu; print('GowU: OK')
print()
print('All imports successful!')
"
These additional steps are only needed for continuous control tasks (not for Atari games like MontezumaRevenge or Pitfall).
Note on Gym compatibility: Gym 0.26.2 (used in gowu_env for Atari) introduced breaking API changes that are incompatible with D4RL environments. All MuJoCo-based tasks, both Adroit (hammer-v0, relocate-v0, door-v0) and AntMaze, require a separate gowu_mujoco_env with Gym 0.23.1.
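To make the API break concrete: Gym 0.26 changed `env.reset()` to return `(obs, info)` and `env.step()` to return five values `(obs, reward, terminated, truncated, info)`, where older releases returned `obs` alone and the four values `(obs, reward, done, info)`. The sketch below is not part of GowU or D4RL. `normalize_step` and `normalize_reset` are hypothetical helpers that only illustrate the shape difference that code written for the older API does not expect.

```python
from typing import Any, Tuple


def normalize_step(result: tuple) -> Tuple[Any, float, bool, dict]:
    """Collapse either step() return shape to the old (obs, reward, done, info)."""
    if len(result) == 5:
        # Gym >= 0.26: episode end is split into terminated and truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated or truncated), info
    if len(result) == 4:
        # Older Gym (e.g. 0.23.1): a single done flag.
        obs, reward, done, info = result
        return obs, reward, bool(done), info
    raise ValueError(f"unexpected step() result of length {len(result)}")


def normalize_reset(result: Any) -> Any:
    """Return only the observation from either reset() return shape."""
    # Gym >= 0.26 returns (obs, info); older Gym returns obs alone.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result
```

An adapter like this hides the difference in simple loops, but it does not make D4RL's own wrappers compatible, which is why the separate environment is still needed.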
sudo apt update
sudo apt install -y libosmesa6-dev patchelf python3-dev
mkdir -p ~/.mujoco
cd ~/.mujoco
curl -sSL https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -o mujoco.tar.gz
tar -xzf mujoco.tar.gz
rm mujoco.tar.gz
All D4RL/MuJoCo environments require Gym 0.23.1 because Gym 0.26+ introduced breaking API changes that are incompatible with D4RL's environment wrappers.
python3.11 -m venv gowu_mujoco_env
source gowu_mujoco_env/bin/activate
pip install --no-deps -e .
# Install the same dependency stack but with gym==0.23.1 instead of gym>=0.26.0
pip install \
'numpy<2.0.0' \
'scipy==1.11.4' \
'jax==0.4.18' \
'jaxlib==0.4.18' \
'dm-haiku==0.0.10' \
'optax==0.1.7' \
'chex==0.1.85' \
'tensorflow==2.15.1' \
'tensorflow-probability==0.23.0' \
'dm-env==1.6' \
'dm-tree==0.1.8' \
'gym==0.23.1' \
'dm-control==1.0.41' \
'mujoco==3.8.1' \
'fasteners' \
'tqdm' \
'frozendict>=2.3.0' \
'ml-collections>=0.1.1' \
'absl-py>=1.0.0' \
'Pillow' \
'portpicker' \
'mock' \
'xmanager'
pip install --no-deps \
'dm-acme==0.4.0' \
'dm-reverb==0.13.0' \
'dm-sonnet==2.0.2'
pip install 'protobuf==4.25.9'
pip install 'Cython<3' 'setuptools==69.5.1' imageio
pip install --no-deps 'mujoco-py<2.2,>=2.1'
pip install --no-deps --ignore-requires-python git+https://github.com/Farama-Foundation/d4rl@master
pip install --no-deps git+https://github.com/aravindr93/mjrl.git
# dm-launchpad must be built from source (see Step 3 in the main guide)
Apply the same dm-acme patches to this venv as well.
Environment summary:
- gowu_env (Gym 0.26.2): Atari tasks (MontezumaRevenge, Pitfall)
- gowu_mujoco_env (Gym 0.23.1): Adroit (hammer-v0, relocate-v0, door-v0) and AntMaze tasks
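If launches are scripted, the summary above can be encoded as a small lookup. This is a sketch only: `venv_for_task` is a hypothetical helper, not something GowU provides, and matching AntMaze tasks by the `antmaze` name prefix is an assumption based on the example task name antmaze-large-diverse-v0.

```python
# Adroit task names as listed in the environment summary.
ADROIT_TASKS = {"hammer-v0", "relocate-v0", "door-v0"}


def venv_for_task(env_name: str) -> str:
    """Return the name of the virtual environment a task should run in."""
    # Assumption: AntMaze task names start with "antmaze".
    if env_name in ADROIT_TASKS or env_name.startswith("antmaze"):
        return "gowu_mujoco_env"
    # Everything else is treated as an Atari task.
    return "gowu_env"
```

For example, `venv_for_task("hammer-v0")` selects gowu_mujoco_env, while `venv_for_task("MontezumaRevenge")` selects gowu_env.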
These must be set before every GowU launch:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco210/bin
export MUJOCO_GL=osmesa
GowU spawns many parallel processes (actors, learners, replay buffer, coordinator). The default configuration (64 particles, 32 actors, 2 groups) has the following requirements:
| Resource | Recommended |
|---|---|
| CPU cores | 48+ |
| RAM | 64+ GB |
| Disk | 64 GB (for checkpoints) |
| GPU | Not required (CPU-only) |
Typical memory breakdown (64 particles, 32 actors, 2 groups):
- Coordinator (gowu): ~5–10 GB RSS (includes exploration tree)
- Actors: ~1 GB each × 32 = ~32 GB
- Reverb + RND + other workers: ~1–2 GB
- Exploration tree: ~2–10 GB (grows over time; included in coordinator RSS)
- Total: 40–60 GB in steady state
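For a rough sense of how the breakdown scales with the number of actors, the per-component figures above can be summed directly. This is back-of-the-envelope arithmetic, not a measurement: `estimate_memory_gb` is a hypothetical helper, and it assumes the coordinator and "other workers" figures stay fixed as the actor count changes, which may not hold when num_particles changes too.

```python
def estimate_memory_gb(num_actors: int = 32) -> tuple[float, float]:
    """Return a (low, high) RAM estimate in GB from the itemized figures."""
    coordinator = (5.0, 10.0)  # coordinator RSS, exploration tree included
    per_actor = 1.0            # roughly 1 GB per actor
    other = (1.0, 2.0)         # Reverb + RND + other workers
    low = coordinator[0] + per_actor * num_actors + other[0]
    high = coordinator[1] + per_actor * num_actors + other[1]
    return low, high
```

With 32 actors the itemized components sum to roughly 38 to 44 GB, so the 40–60 GB steady-state total quoted above presumably includes headroom beyond the listed items.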
For smaller machines, reduce num_particles, num_actors, and
agent.num_ensemble proportionally. For example, on a 32 GB machine:
--config.num_particles=16 --config.agent.num_ensemble=16 \
--config.n_groups=1 --config.num_actors=8
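The same scaling can be generated rather than typed by hand. The sketch below starts from the default values quoted in this README and multiplies them by a factor. `scaled_flags` and `DEFAULTS` are hypothetical names, and whether GowU places constraints between these values (for example divisibility of particles by groups) is not stated here, so treat the output as a starting point.

```python
# Default values as quoted in this README.
DEFAULTS = {
    "num_particles": 64,
    "agent.num_ensemble": 64,
    "n_groups": 2,
    "num_actors": 32,
}


def scaled_flags(factor: float) -> list[str]:
    """Scale each default by `factor` (never below 1) and format as CLI flags."""
    flags = []
    for name, default in DEFAULTS.items():
        value = max(1, int(default * factor))  # truncate, but keep at least 1
        flags.append(f"--config.{name}={value}")
    return flags
```

A factor of 0.25 reproduces the 32 GB example above (16 particles, 16 ensemble members, 1 group, 8 actors).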
GowU uses Launchpad for distributed execution.
The simplest way to run GowU is with local_mp, which spawns all nodes
(actors, learners, curriculum server) as separate processes on a single machine.
By default, Launchpad opens a tmux session with one pane per node. For long
runs, set LAUNCHPAD_LAUNCH_LOCAL_TERMINAL=output_to_files to write all node
logs to files instead.
cd /path/to/gowu-opensource
source gowu_env/bin/activate
# Write logs to files instead of tmux (optional)
export LAUNCHPAD_LAUNCH_LOCAL_TERMINAL=output_to_files
# Launch training
python -m gowu.run_gowu \
--lp_launch_type=local_mp \
--config=gowu/configs/base_config.py \
--config.env_name=MontezumaRevenge \
--config.num_particles=64 \
--config.agent.num_ensemble=64 \
--config.n_groups=2 \
--config.num_actors=32
For Adroit or AntMaze tasks, use gowu_mujoco_env instead:
cd /path/to/gowu-opensource
source gowu_mujoco_env/bin/activate
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco210/bin
export MUJOCO_GL=osmesa
export D4RL_SUPPRESS_IMPORT_ERROR=1
export LAUNCHPAD_LAUNCH_LOCAL_TERMINAL=output_to_files
# Adroit example (hammer-v0):
python -m gowu.run_gowu \
--lp_launch_type=local_mp \
--config=gowu/configs/base_config.py \
--config.env_name=hammer-v0 \
--config.num_particles=64 \
--config.agent.num_ensemble=64 \
--config.n_groups=4 \
--config.num_actors=32
# AntMaze example (antmaze-large-diverse-v0):
python -m gowu.run_gowu \
--lp_launch_type=local_mp \
--config=gowu/configs/base_config.py \
--config.env_name=antmaze-large-diverse-v0 \
--config.num_particles=64 \
--config.agent.num_ensemble=64 \
--config.n_groups=4 \
--config.num_actors=32
When using output_to_files, logs are written to /tmp/launchpad_out/:
/tmp/launchpad_out/
├── gowu/0 # Curriculum server (coordinator) log
├── reverb/0 # Replay buffer log
├── rnd_learner/0 # RND learner log
└── actor/
├── 0 # Actor 0 log
├── 1 # Actor 1 log
└── ... # One file per actor
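The coordinator log can also be read programmatically, for instance to collect rewards for plotting. The sketch below assumes that reward lines in /tmp/launchpad_out/gowu/0 contain the text `Reward =` followed by a number. The exact log format is not documented here, so the regular expression is a guess that may need adjusting. `extract_rewards` and `rewards_from_log` are hypothetical helpers.

```python
import re
from pathlib import Path

# Assumed format: "... Reward = <number> ..." somewhere on the line.
REWARD_RE = re.compile(r"Reward =\s*(-?\d+(?:\.\d+)?)")


def extract_rewards(lines):
    """Return the numeric value after 'Reward =' for every matching line."""
    rewards = []
    for line in lines:
        match = REWARD_RE.search(line)
        if match:
            rewards.append(float(match.group(1)))
    return rewards


def rewards_from_log(path="/tmp/launchpad_out/gowu/0"):
    """Read the coordinator log and return all rewards found in it."""
    with Path(path).open(errors="replace") as log_file:
        return extract_rewards(log_file)
```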
You can monitor the run in real time with:
tail -f /tmp/launchpad_out/gowu/0
grep "Reward =" /tmp/launchpad_out/gowu/0
GowU logs training metrics (rewards, losses, active tree size, steps) to TensorBoard. By default, logs are written to the tb_logs subdirectory under the checkpoint directory (e.g. /tmp/gowu_checkpoints/tb_logs).
To start TensorBoard and view the metrics, run:
tensorboard --logdir=/tmp/gowu_checkpoints/tb_logs
Navigate to http://localhost:6006 in your browser.
For true distributed runs across multiple machines, Launchpad supports several
launch types. Set --lp_launch_type to one of the following:
- local_mp: All nodes on one machine (default, shown above).
- vertex_ai: Launch on Vertex AI with each node running in its own container.
- ssh: Launch across multiple machines via SSH. Requires passwordless SSH access to all worker machines.
For example, to launch on Vertex AI:
python -m gowu.run_gowu \
--lp_launch_type=vertex_ai \
--config=gowu/configs/base_config.py \
--config.env_name=hammer-v0 \
--config.num_actors=32
See the Launchpad documentation for details on configuring each launch type.
- gowu.py: The core curriculum server.
- actor.py: Distributed actor implementation.
- run_gowu.py: Launchpad program entry point.
- learners/: Contains different learners (policy, rnd, contrastive).
- models/: Network architectures.
- env_utils/: Environment utilities and factories.
- configs/: Configuration files.
- optimizers/: Custom optimizers (Muon).
- utils/: Utility functions.
If you use GowU in your research, please cite:
@misc{mhammedi2026decouplingexplorationpolicyoptimization,
title={Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration},
author={Zakaria Mhammedi and James Cohan},
year={2026},
eprint={2603.22273},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.22273},
}
See CONTRIBUTING.md for guidelines. For bugs and feature requests, please open an issue on GitHub.
Apache 2.0. See LICENSE for details.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.