A repository to train and improve a small Latent World Model to simulate DOOM (1993). You can play the game inside the model in real-time.
```
uv sync
```

or

```
pip install -r requirements.txt
```

Dependencies:
- pytorch
- numpy
- vizdoom (to generate the training data)
- raylib (to display the generated frames and listen for user inputs)
- datasets (to store and use our training data)
- pillow (to handle images and videos)
- safetensors (to store our weights)
- transformers (to easily train our model)
- kernels (to speed up our model)
You can train the current best model on our data directly with `uv run train.py` or `python3 train.py` (this will download the training data from Hugging Face).
After that, run `uv run play.py ./weights` to load your trained world model and play DOOM!
Alternatively, run `uv run play.py` or `python3 play.py` to download a pre-trained version of the world model and play DOOM inside it.
You can contribute by submitting a PR with a new loss record in Model Architecture or Training Data. Other PRs may be closed.
Here are some relevant academic papers to learn more about latent world models:
- Next Embedding Prediction Makes World Models Stronger [Bredis et al., 2026]: we train our world model entirely in latent representations, which is faster than predicting in pixel space
- LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics [Balestriero & LeCun, 2025]: a theoretical paper, but this is the framework we use to make self-supervised learning much more efficient and easier to do
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [Assran et al., 2025]: we use their method to train the world model with rollouts (see Part 3, "V-JEPA 2-AC: Learning an Action-Conditioned World Model")
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [Baker et al., 2022]: we use the same idea of an inverse dynamics model to recover actions from gameplay videos
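To give a feel for the core idea behind these papers, here is a minimal, hypothetical PyTorch sketch of next-embedding prediction: a small predictor takes the current latent and an action and regresses the next latent directly, with no pixel decoding. The module names, dimensions, and loss choice are illustrative assumptions, not the repo's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the real model's (assumption).
LATENT_DIM, ACTION_DIM = 32, 8

class LatentPredictor(nn.Module):
    """Predicts the next latent state from (current latent, action)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 64),
            nn.GELU(),
            nn.Linear(64, LATENT_DIM),
        )

    def forward(self, z_t, a_t):
        # Condition the prediction on the action by concatenation.
        return self.net(torch.cat([z_t, a_t], dim=-1))

predictor = LatentPredictor()
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

# Fake batch standing in for encoder outputs over consecutive
# gameplay frames: (latent, action, next latent) triples.
z_t = torch.randn(16, LATENT_DIM)
a_t = torch.randn(16, ACTION_DIM)
z_next = torch.randn(16, LATENT_DIM)

# One training step: regression loss computed entirely in latent space.
pred = predictor(z_t, a_t)
loss = nn.functional.mse_loss(pred, z_next)
loss.backward()
opt.step()
```

At play time, the same predictor can be rolled out autoregressively: feed its own output back in as `z_t` together with the player's next action.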
Here are all the papers and scientific projects that we are using for miniworld:
- ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning [Kempka et al., 2016]
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [Baker et al., 2022]
- Next Embedding Prediction Makes World Models Stronger [Bredis et al., 2026]
- LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics [Balestriero & LeCun, 2025]
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [Assran et al., 2025]
- Attention Is All You Need [Vaswani et al., 2017]