O1 Embedder is a reasoning-enhanced dense retriever that mimics the step-by-step thinking behavior of Large Language Models (LLMs) to solve complex and zero-shot retrieval tasks.
It is the first retrieval model that integrates long-form thought generation and discriminative embedding in a unified framework — enabling high performance on both in-domain and out-of-distribution (OOD) information retrieval benchmarks.
The training of O1 Embedder involves two types of data. One supports the embedding capability and consists of queries paired with their relevant documents, i.e., q-doc tuples. The other supports the thinking capability and consists of queries paired with their thoughts, i.e., q-thought tuples. Unlike q-doc tuples, which are widely available, q-thought tuples do not exist in practice. To resolve this problem, we propose a data synthesis pipeline that leverages LLMs' built-in reasoning capacity to generate such datasets.
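The synthesis step can be sketched as follows. The prompt wording and helper names here are illustrative assumptions, not the repo's actual code (which lives under `./data_preparation`); the LLM call is stubbed out so the sketch stays self-contained.

```python
# Hypothetical sketch of q-thought tuple synthesis: prompt an LLM to write
# a reasoning passage ("thought") for each query, then pair them up.
import json

# Assumed prompt template -- the real pipeline's prompt may differ.
PROMPT_TEMPLATE = (
    "Given the query below, write a short step-by-step reasoning passage "
    "that clarifies what information a relevant document must contain.\n"
    "Query: {query}\nThought:"
)

def build_prompts(queries):
    """Turn raw queries into LLM prompts for thought generation."""
    return [PROMPT_TEMPLATE.format(query=q) for q in queries]

def to_q_thought_tuples(queries, thoughts):
    """Pair each query with its generated thought as a JSONL-ready record."""
    return [{"query": q, "thought": t} for q, t in zip(queries, thoughts)]

queries = ["what causes tides"]
prompts = build_prompts(queries)
# In practice the prompts are sent to the vLLM server; here we stub the reply.
thoughts = ["Tides are driven by the gravitational pull of the moon..."]
records = to_q_thought_tuples(queries, thoughts)
print(json.dumps(records[0]))
```

Each record then feeds the voting stage described below, where multiple retrievers pick the best of several candidate thoughts per query.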
- 🧠 Thought-Augmented Retrieval: Generates LLM-style "thoughts" before embedding the query to uncover hidden semantic intents.
- 🔁 Joint Multi-task Training: Simultaneous optimization for generation and retrieval via behavior cloning & contrastive learning.
- 📊 Strong Generalization: Achieves SoTA or near-SoTA results on 12 retrieval benchmarks, including MS MARCO, HotpotQA, SciFact, and CosQA.
- 🧪 Backbone-Agnostic: Compatible with LLaMA, Mistral, Qwen, and other major open-source LLMs.
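To make the thought-augmented retrieval idea concrete, here is a minimal sketch of the inference flow: generate a thought for the query, embed query and thought together, and rank documents by similarity. A toy bag-of-words encoder stands in for the actual LLM embedder, so the scores are purely illustrative.

```python
# Toy sketch of thought-augmented retrieval (bag-of-words embeddings
# stand in for the model's learned dense embeddings).
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, thought, docs):
    """Embed the query together with its generated thought, then rank docs."""
    q_vec = embed(query + " " + thought)
    return sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)

docs = [
    "gravitational pull of the moon raises tides",
    "stock tides in the market",
]
ranked = retrieve("what causes tides", "gravitational pull of the moon", docs)
print(ranked[0])
```

The thought expands the query with terms it never mentions explicitly, which is what lets the combined embedding match the relevant document more strongly than the bare query would.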
```bash
git clone https://github.com/RuiranYan/o1embedder.git
cd O1-Embedder
pip install -r requirements.txt
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

Download the prepared thought-augmented MS MARCO dataset:

```bash
huggingface-cli download --repo-type dataset --resume-download Ruiran/msmarco_thought final.jsonl --local-dir dataset --local-dir-use-symlinks False
```

You can build your own thought-augmented dataset via:
1. Start the vLLM server:

   ```bash
   bash ./scripts/vllm.sh
   ```

2. Open another terminal and access the vLLM server:

   ```bash
   bash ./scripts/gen_data.sh
   ```

3. Vote and get the best thought:

   ```bash
   python ./data_preparation/vote.py --input_file "./dataset/toy_thought.jsonl" --output_file "./dataset/toy_vote_res.jsonl" --model_zoo '["BAAI/bge-large-en-v1.5", "dunzhang/stella_en_1.5B_v5", "Alibaba-NLP/gte-large-en-v1.5"]'
   ```
Train the model:

```bash
bash scripts/train.sh
```

Evaluate the model:

```bash
bash scripts/eval.sh
```

Use a strong LLM (e.g., LLaMA-3) to generate long-form "thoughts" before retrieval.
Evaluate thought quality via a diverse set of retrievers and select the best thought via majority voting.
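The voting step can be sketched as follows. This is a hypothetical simplification of `./data_preparation/vote.py`: each retriever scores every candidate thought (e.g., by how well query+thought retrieves the gold document) and casts one vote for its top-ranked candidate; the most-voted thought wins.

```python
# Sketch of majority voting over candidate thoughts. Each inner score list
# comes from one retriever and is aligned with candidate_thoughts.
from collections import Counter

def majority_vote(candidate_thoughts, retriever_scores):
    """Return the thought that the most retrievers rank first."""
    votes = Counter()
    for scores in retriever_scores:
        best = max(range(len(candidate_thoughts)), key=lambda i: scores[i])
        votes[best] += 1
    winner, _ = votes.most_common(1)[0]
    return candidate_thoughts[winner]

thoughts = ["thought A", "thought B", "thought C"]
scores = [
    [0.2, 0.7, 0.1],  # retriever 1 prefers B
    [0.5, 0.6, 0.3],  # retriever 2 prefers B
    [0.9, 0.4, 0.2],  # retriever 3 prefers A
]
print(majority_vote(thoughts, scores))  # two of three votes go to B
```

Using several heterogeneous retrievers (as in the `--model_zoo` list above) reduces the chance that one retriever's idiosyncratic preferences pick a misleading thought.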
- Behavior Cloning: teaches the model to generate thoughts.
- Contrastive Learning: aligns query-thought pairs with relevant documents.
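The two objectives above can be combined into a single joint loss. Below is a toy NumPy sketch under stated assumptions: the equal loss weighting, the 0.05 temperature, and the random inputs are illustrative, not the paper's exact hyperparameters or model outputs.

```python
# Toy sketch of the joint multi-task objective: behavior-cloning
# cross-entropy on thought tokens plus in-batch InfoNCE contrastive loss
# on (query+thought, document) embedding pairs.
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy of the target class under softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def info_nce(q_emb, d_emb, temperature=0.05):
    """In-batch contrastive loss: the i-th query matches the i-th document."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sims = q @ d.T / temperature  # cosine similarities, scaled
    return cross_entropy(sims, np.arange(len(q)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))        # toy next-token logits
targets = rng.integers(0, 10, size=4)    # toy gold thought tokens
q, doc = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
# Equal weighting of the two terms is an assumption for this sketch.
loss = cross_entropy(logits, targets) + info_nce(q, doc)
print(f"joint loss: {loss:.3f}")
```

In the real model both terms share the same LLM backbone, so one backward pass updates the generation and embedding behaviors together.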
| Backbone | Model Sizes | Supported |
|---|---|---|
| LLaMA | 7B, 8B | ✅ |
| Mistral | 7B | ✅ |
| Qwen2.5 | 0.5B–7B | ✅ |
We thank FlagEmbedding for providing the open-source framework, and XiongWenXww for her key contributions to the data preparation process.
If you find our work helpful, please cite our paper:
```bibtex
@misc{yan2025o1embedderletretrievers,
  title={O1 Embedder: Let Retrievers Think Before Action},
  author={Ruiran Yan and Zheng Liu and Defu Lian},
  year={2025},
  eprint={2502.07555},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.07555},
}
```


