🧠 O1 Embedder: Enhanced Retrieval Model with Thinking Capabilities

[Paper Page](https://arxiv.org/abs/2502.07555)

🔍 Overview

O1 Embedder is a reasoning-enhanced dense retriever that mimics the step-by-step thinking behavior of Large Language Models (LLMs) to solve complex and zero-shot retrieval tasks.

It is the first retrieval model that integrates long-form thought generation and discriminative embedding in a unified framework — enabling high performance on both in-domain and out-of-distribution (OOD) information retrieval benchmarks.

(Figure: O1 Embedder training and inference pipeline)

🏭 Data Production

Training O1 Embedder involves two types of data. The first supports the embedding capability and consists of queries paired with their relevant documents, i.e., q-doc tuples. The second supports the thinking capability and consists of queries paired with their thoughts, i.e., q-thought tuples. Unlike q-doc tuples, which are widely available, no q-thought tuples exist in practice. To resolve this problem, we propose a data synthesis pipeline that leverages the reasoning capacity LLMs are readily equipped with to generate such datasets.
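To make the two data types concrete, here is a minimal sketch of what a q-doc tuple and a q-thought tuple might look like as JSONL records. The field names below are illustrative assumptions, not the repo's exact schema:

```python
import json

# Hypothetical record layouts for the two training data types.
# Field names ("query", "pos", "neg", "thought") are assumptions for illustration.
q_doc = {
    "query": "what causes tides",
    "pos": ["Tides are caused by the gravitational pull of the moon ..."],
    "neg": ["A tide table lists daily high and low water times ..."],
}
q_thought = {
    "query": "what causes tides",
    "thought": "The question asks for a physical mechanism, so a relevant "
               "document should discuss gravity, the moon, and the sun ...",
}

# Both types are typically stored one JSON object per line (JSONL).
lines = "\n".join(json.dumps(r) for r in (q_doc, q_thought))
records = [json.loads(line) for line in lines.splitlines()]
```

The q-doc tuples train the contrastive (embedding) objective, while the q-thought tuples train the generative (thinking) objective.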

✨ Key Features

  • 🧠 Thought-Augmented Retrieval: Generates LLM-style "thoughts" before embedding the query to uncover hidden semantic intents.
  • 🔁 Joint Multi-task Training: Simultaneous optimization for generation and retrieval via behavior cloning & contrastive learning.
  • 📊 Strong Generalization: Achieves SoTA or near-SoTA results on 12 retrieval benchmarks including MS MARCO, HotpotQA, SciFact, and CosQA.
  • 🧪 Backbone-Agnostic: Compatible with LLaMA, Mistral, Qwen, and other major open-source LLMs.

🏁 Quick Start

1. Clone this repo

git clone https://github.com/RuiranYan/o1embedder.git
cd o1embedder

2. Install Dependencies

pip install -r requirements.txt
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

3. Download Datasets

huggingface-cli download --repo-type dataset --resume-download Ruiran/msmarco_thought final.jsonl --local-dir dataset --local-dir-use-symlinks False

(Optional) Prepare Your Own Training Data

You can build your own thought-augmented dataset via:

  1. Start the vLLM server:

    bash ./scripts/vllm.sh
  2. Open another terminal and access the vLLM server:

    bash ./scripts/gen_data.sh
  3. Vote and get the best thought:

    python ./data_preparation/vote.py --input_file "./dataset/toy_thought.jsonl" --output_file "./dataset/toy_vote_res.jsonl" --model_zoo '["BAAI/bge-large-en-v1.5", "dunzhang/stella_en_1.5B_v5", "Alibaba-NLP/gte-large-en-v1.5"]'

4. Train O1 Embedder

bash scripts/train.sh

5. Evaluation

bash scripts/eval.sh

🧠 Core Ideas

🧪 1. Thought Generation via LLM

Use a strong LLM (e.g., LLaMA-3) to generate long-form "thoughts" before retrieval.
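A minimal sketch of the kind of prompt that could elicit such a pre-retrieval thought. The template wording below is an illustrative assumption, not the paper's exact prompt:

```python
# Hypothetical prompt template for eliciting a pre-retrieval "thought";
# the resulting string would be sent to an LLM such as LLaMA-3.
PROMPT_TEMPLATE = (
    "Given the search query below, write a short reasoning passage that "
    "explains what information a relevant document should contain.\n"
    "Query: {query}\n"
    "Thought:"
)

def build_thought_prompt(query: str) -> str:
    """Fill the template with a concrete query."""
    return PROMPT_TEMPLATE.format(query=query)

prompt = build_thought_prompt("what causes tides")
```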

🧪 2. Retrieval Committee

Evaluate thought quality via a diverse set of retrievers and select via majority voting.
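The voting step can be sketched as follows: each committee member scores the gold document under every candidate thought, votes for its top candidate, and the candidate with the most votes wins. The scores below are toy numbers standing in for real retriever rankings, not output from the actual retrievers:

```python
# Toy sketch of the retrieval committee. Each retriever reports, per candidate
# thought, how highly it ranks the gold document when that thought augments the
# query. All numbers here are made up for illustration.
committee_scores = {
    "bge":    [0.61, 0.74, 0.58],
    "stella": [0.55, 0.71, 0.69],
    "gte":    [0.70, 0.66, 0.52],
}

def majority_vote(scores_by_retriever):
    """Each retriever votes for its top-scoring candidate thought;
    the candidate with the most votes is selected."""
    votes = [max(range(len(s)), key=s.__getitem__)
             for s in scores_by_retriever.values()]
    return max(set(votes), key=votes.count)

best = majority_vote(committee_scores)  # candidate 1 gets 2 of 3 votes
```

Using a diverse committee rather than a single retriever reduces the risk of selecting thoughts that merely overfit one embedding model's quirks.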

🧪 3. Joint Learning

  • Behavior Cloning: teaches the model to generate thoughts.
  • Contrastive Learning: aligns query-thought pairs with relevant documents.
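The two objectives above are optimized jointly. A toy numeric sketch of how they could combine into one loss, using scalar stand-ins for token probabilities and embedding similarities (real training operates on token logits and dense vectors):

```python
import math

def behavior_cloning_loss(token_probs):
    """Average negative log-likelihood of the reference thought's tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def contrastive_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE: pull the query toward its relevant document,
    push it away from negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    log_denom = math.log(sum(math.exp(x) for x in logits))
    return -(sim_pos / temperature - log_denom)

# Toy values: three thought-token probabilities, one positive similarity,
# two negative similarities. All numbers are illustrative.
loss = behavior_cloning_loss([0.9, 0.8, 0.95]) + contrastive_loss(0.82, [0.31, 0.27])
```

The generation term teaches thought-writing by imitation, while the contrastive term shapes the embedding space; sharing one backbone lets the generated thoughts directly improve the query representation.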

🤖 Supported Backbones

Backbone   Model Sizes Supported
LLaMA      7B, 8B
Mistral    7B
Qwen2.5    0.5B–7B

🏆 Acknowledgements

We thank FlagEmbedding for providing the open-source framework, and XiongWenXww for her key contributions to the data preparation process.

📝 Citation

If you find our work helpful, please cite our paper:

@misc{yan2025o1embedderletretrievers,
      title={O1 Embedder: Let Retrievers Think Before Action}, 
      author={Ruiran Yan and Zheng Liu and Defu Lian},
      year={2025},
      eprint={2502.07555},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07555}, 
}
