🧠 O1 Embedder: Enhanced Retrieval Model with Thinking Capabilities

[Paper Page](https://arxiv.org/abs/2502.07555)

🔍 Overview

O1 Embedder is a reasoning-enhanced dense retriever that mimics the step-by-step thinking behavior of Large Language Models (LLMs) to solve complex and zero-shot retrieval tasks.

It is the first retrieval model that integrates long-form thought generation and discriminative embedding in a unified framework — enabling high performance on both in-domain and out-of-distribution (OOD) information retrieval benchmarks.

(Figure: O1 Embedder training and inference pipeline)

🏭 Data Production

Training O1 Embedder involves two types of data. The first supports the embedding capability and consists of queries paired with their relevant documents, i.e., q-doc tuples. The second supports the thinking capability and consists of queries paired with their thoughts, i.e., q-thought tuples. Unlike q-doc tuples, which are widely available, no q-thought tuples exist in practice. To resolve this problem, we propose a data synthesis pipeline that leverages the reasoning capacity LLMs are readily equipped with to generate such datasets.
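To make the two data types concrete, here is a minimal sketch of what a q-doc tuple and a q-thought tuple might look like as JSONL records. The field names below are illustrative assumptions, not the repo's exact schema:

```python
import json

# Hypothetical record layouts for the two training data types.
# Field names ("query", "pos", "neg", "thought") are assumptions for illustration.
q_doc = {
    "query": "what causes tides",
    "pos": ["Tides are caused by the gravitational pull of the moon ..."],
    "neg": ["A tide table lists daily high and low water times ..."],
}
q_thought = {
    "query": "what causes tides",
    "thought": "The question asks for a physical mechanism, so a relevant "
               "document should discuss gravity, the moon, and the sun ...",
}

# Both types are typically stored one JSON object per line (JSONL).
lines = "\n".join(json.dumps(r) for r in (q_doc, q_thought))
records = [json.loads(line) for line in lines.splitlines()]
```

The q-doc tuples train the contrastive (embedding) objective, while the q-thought tuples train the generative (thinking) objective.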

✨ Key Features

  • 🧠 Thought-Augmented Retrieval: Generates LLM-style "thoughts" before embedding the query to uncover hidden semantic intents.
  • 🔁 Joint Multi-task Training: Simultaneous optimization for generation and retrieval via behavior cloning & contrastive learning.
  • 📊 Strong Generalization: Achieves SoTA or near-SoTA results on 12 retrieval benchmarks including MS MARCO, HotpotQA, SciFact, and CosQA.
  • 🧪 Backbone-Agnostic: Compatible with LLaMA, Mistral, Qwen, and other major open-source LLMs.

🏁 Quick Start

1. Clone this repo

git clone https://github.com/RuiranYan/o1embedder.git
cd o1embedder

2. Install Dependencies

pip install -r requirements.txt
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

3. Download Datasets

huggingface-cli download --repo-type dataset --resume-download Ruiran/msmarco_thought final.jsonl --local-dir dataset --local-dir-use-symlinks False

(Optional) Prepare Your Own Training Data

You can build your own thought-augmented dataset via:

  1. Start the vLLM server:

    bash ./scripts/vllm.sh
  2. Open another terminal and access the vLLM server:

    bash ./scripts/gen_data.sh
  3. Vote and get the best thought:

    python ./data_preparation/vote.py --input_file "./dataset/toy_thought.jsonl" --output_file "./dataset/toy_vote_res.jsonl" --model_zoo '["BAAI/bge-large-en-v1.5", "dunzhang/stella_en_1.5B_v5", "Alibaba-NLP/gte-large-en-v1.5"]'

4. Train O1 Embedder

bash scripts/train.sh

5. Evaluation

bash scripts/eval.sh

🧠 Core Ideas

🧪 1. Thought Generation via LLM

Use a strong LLM (e.g., LLaMA-3) to generate long-form "thoughts" before retrieval.
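A minimal sketch of the kind of prompt that could elicit such a pre-retrieval thought. The template wording below is an illustrative assumption, not the paper's exact prompt:

```python
# Hypothetical prompt template for eliciting a pre-retrieval "thought";
# the resulting string would be sent to an LLM such as LLaMA-3.
PROMPT_TEMPLATE = (
    "Given the search query below, write a short reasoning passage that "
    "explains what information a relevant document should contain.\n"
    "Query: {query}\n"
    "Thought:"
)

def build_thought_prompt(query: str) -> str:
    """Fill the template with a concrete query."""
    return PROMPT_TEMPLATE.format(query=query)

prompt = build_thought_prompt("what causes tides")
```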

🧪 2. Retrieval Committee

Evaluate thought quality via a diverse set of retrievers and select via majority voting.
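The voting step can be sketched as follows: each committee member scores the gold document under every candidate thought, votes for its top candidate, and the candidate with the most votes wins. The scores below are toy numbers standing in for real retriever rankings, not output from the actual retrievers:

```python
# Toy sketch of the retrieval committee. Each retriever reports, per candidate
# thought, how highly it ranks the gold document when that thought augments the
# query. All numbers here are made up for illustration.
committee_scores = {
    "bge":    [0.61, 0.74, 0.58],
    "stella": [0.55, 0.71, 0.69],
    "gte":    [0.70, 0.66, 0.52],
}

def majority_vote(scores_by_retriever):
    """Each retriever votes for its top-scoring candidate thought;
    the candidate with the most votes is selected."""
    votes = [max(range(len(s)), key=s.__getitem__)
             for s in scores_by_retriever.values()]
    return max(set(votes), key=votes.count)

best = majority_vote(committee_scores)  # candidate 1 gets 2 of 3 votes
```

Using a diverse committee rather than a single retriever reduces the risk of selecting thoughts that merely overfit one embedding model's quirks.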

🧪 3. Joint Learning

  • Behavior Cloning: teaches the model to generate thoughts.
  • Contrastive Learning: aligns query-thought pairs with relevant documents.
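The two objectives above are optimized jointly. A toy numeric sketch of how they could combine into one loss, using scalar stand-ins for token probabilities and embedding similarities (real training operates on token logits and dense vectors):

```python
import math

def behavior_cloning_loss(token_probs):
    """Average negative log-likelihood of the reference thought's tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def contrastive_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE: pull the query toward its relevant document,
    push it away from negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    log_denom = math.log(sum(math.exp(x) for x in logits))
    return -(sim_pos / temperature - log_denom)

# Toy values: three thought-token probabilities, one positive similarity,
# two negative similarities. All numbers are illustrative.
loss = behavior_cloning_loss([0.9, 0.8, 0.95]) + contrastive_loss(0.82, [0.31, 0.27])
```

The generation term teaches thought-writing by imitation, while the contrastive term shapes the embedding space; sharing one backbone lets the generated thoughts directly improve the query representation.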

🤖 Supported Backbones

Backbone   Model Sizes Supported
LLaMA      7B, 8B
Mistral    7B
Qwen2.5    0.5B–7B

🏆 Acknowledgements

We thank FlagEmbedding for providing the open-source framework, and XiongWenXww for her key contributions to the data preparation process.

📝 Citation

If you find our work helpful, please cite our paper:

@misc{yan2025o1embedderletretrievers,
      title={O1 Embedder: Let Retrievers Think Before Action}, 
      author={Ruiran Yan and Zheng Liu and Defu Lian},
      year={2025},
      eprint={2502.07555},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07555}, 
}
