Accepted at ICLR 2026 (Oral)
Q-RAG is a resource-efficient method for multi-step retrieval trained with reinforcement learning directly in the latent space of text-chunk embeddings. Instead of expensive LLM fine-tuning, Q-RAG trains only a lightweight embedder agent using value-based RL (temporal difference learning), keeping the LLM frozen. This repository provides the full training and evaluation code to reproduce the results from the paper.
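To make the value-based idea concrete, here is a toy sketch of a temporal difference target computed in embedding space. It is not the repository's implementation. It assumes, purely for illustration, that the Q-value of selecting a chunk is the dot product between a state embedding and a chunk embedding; the function names, the discount factor, and all numbers are hypothetical.

```python
# Illustrative sketch only; not the code in rl/agents/.
# ASSUMPTION: Q(state, chunk) = dot(state_embedding, chunk_embedding).

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def q_values(state_emb, chunk_embs):
    """Score every candidate chunk against the current state embedding."""
    return [dot(state_emb, c) for c in chunk_embs]

def td_target(reward, next_state_emb, remaining_chunk_embs, gamma=0.99, done=False):
    """One-step TD target: r + gamma * max_a Q(s', a), or just r at episode end."""
    if done or not remaining_chunk_embs:
        return reward
    return reward + gamma * max(q_values(next_state_emb, remaining_chunk_embs))

# Hypothetical 2-d embeddings for one state and three chunks.
state = [1.0, 0.0]
chunks = [[0.5, 0.5], [0.9, 0.1], [-1.0, 0.0]]

scores = q_values(state, chunks)
best = max(range(len(chunks)), key=scores.__getitem__)  # greedy retrieval step
print("selected chunk index:", best)
```

In this picture the frozen LLM never appears: only the embeddings that produce the scores would be trained, which is what keeps the method lightweight.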
Q-RAG achieves state-of-the-art results on long-context benchmarks (BabiLong, RULER) for contexts up to 10M tokens and competitive performance on open-domain multi-hop QA (HotpotQA, Musique), all trained on a single A100 GPU.
You can download the necessary datasets and pretrained models from Hugging Face or Google Drive. Default paths are set in configs/envs/. Dataset paths are relative (e.g., ../datasets/...), so the datasets/ folder must be placed next to the Q-RAG repository directory:
parent_dir/
├── Q-RAG/       ← this repository
└── datasets/    ← downloaded datasets go here
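A quick way to confirm the layout before training is a small check like the one below. This is a sketch, not a script shipped with the repository; `check_layout` is a hypothetical helper, and it only assumes that the folder is named datasets and sits beside the repository directory, as described above.

```python
from pathlib import Path

def check_layout(repo_dir):
    """Return the expected datasets/ path and whether it exists next to the repo."""
    repo = Path(repo_dir).resolve()
    datasets = repo.parent / "datasets"  # sibling of the repository directory
    return datasets, datasets.is_dir()

if __name__ == "__main__":
    # Run from inside the Q-RAG directory.
    path, ok = check_layout(".")
    print(f"{path}: {'found' if ok else 'missing'}")
```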
Q-RAG achieves near-perfect retrieval on all NIAH subtasks, trained on 4K-length documents and generalizing up to 1M tokens:
Results on HotpotQA (in-domain) and Musique (out-of-distribution). QwQ-32B was used as the reader LLM for Q-RAG and Beam Retriever:
Q-RAG achieves the highest average performance across 5 tasks (QA1–QA5) at context lengths from 1M to 10M tokens, outperforming Titans, Atlas, ARMT, RMT, and proprietary LLMs. On the hardest subtask QA3 (3-hop temporal reasoning), Q-RAG shows virtually no degradation as context grows to 10M tokens, while all baselines degrade significantly.
Reproducibility note: Results may vary slightly across seeds. All training was performed on a single A100-80GB GPU within 12 hours per model.
- Python 3.12
- CUDA-compatible GPU (80 GB A100 recommended for full reproduction)
- Linux recommended (tested on Ubuntu with CUDA)
# Create conda environment
conda create -n qrag python=3.12 -y
conda activate qrag
# Install dependencies
python -m pip install pip==26.0.1 wheel==0.46.3
pip install vllm==0.18.0 # pulls compatible PyTorch, Transformers, Triton, etc.
pip install hydra-core==1.3.2 tensorboard==2.20.0 rotary-embedding-torch==0.8.9 pandas==3.0.1 nltk==3.9.4 sortedcontainers==2.4.0 accelerate==1.13.0 datasets==4.8.4
# Verify the installation
python -c "from rl.agents.pqn import PQNActor; print('Q-RAG installed successfully')"
General notes:
- Training is launched via train_q_rag.py. All hyperparameters are managed by Hydra configs in configs/.
- Config priority: CLI args > configs/testing.yaml > pretrained_path/config.yaml
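The priority order can be pictured as a layered merge in which later layers win. The sketch below uses plain dictionaries rather than Hydra itself, so it illustrates the ordering only and is not how the repository resolves configs internally; `deep_merge`, `resolve_config`, and every key and value in the example are hypothetical.

```python
def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def resolve_config(checkpoint_cfg, testing_cfg, cli_cfg):
    """Lowest priority first: checkpoint config, then testing.yaml, then CLI args."""
    return deep_merge(deep_merge(checkpoint_cfg, testing_cfg), cli_cfg)

# Hypothetical values, for illustration only.
checkpoint_cfg = {"seed": 0, "envs": {"max_steps": 6, "task": "qa3"}}
testing_cfg = {"seed": 42, "num_samples": 1000}
cli_cfg = {"envs": {"max_steps": 2}}

resolved = resolve_config(checkpoint_cfg, testing_cfg, cli_cfg)
print(resolved)
```

Here the CLI value for envs.max_steps wins, seed comes from the testing layer, and envs.task survives from the checkpoint config because nothing above it overrides it.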
Models per benchmark:
- BabiLong & RULER: facebook/contriever
- HotpotQA & Musique: intfloat/multilingual-e5-large and Alibaba-NLP/gte-multilingual-base
Download HotpotQA and Musique datasets and place them so that the environment configs can find them. Default paths are set in configs/envs/hotpotqa.yaml, configs/envs/musique.yaml, and configs/envs/hotpotqa+musique.yaml.
E5 HotpotQA only:
python train_q_rag.py \
envs=hotpotqa \
algo=pqn_e5_hotpotqa \
envs.data_path="your/path/to/datasets/hotpotqa" \
steps_count=10000 \
batch_size=12 \
accumulate_grads=8 \
eval_interval=100 \
envs_parallel=1 \
max_action_length=220 \
max_action_length_in_memory=220
E5 MuSiQue only:
python train_q_rag.py \
envs=musique \
algo=pqn_e5_musique \
envs.data_path="your/path/to/datasets/musique" \
steps_count=10000 \
batch_size=12 \
accumulate_grads=8 \
eval_interval=100 \
envs_parallel=1 \
max_action_length=110 \
max_action_length_in_memory=110
HotpotQA + Musique (combined, GTE embedder):
python train_q_rag.py \
algo=pqn_gte \
envs=hotpotqa+musique \
eval_interval=100 \
eval_episodes=200 \
max_action_length=512 \
max_action_length_in_memory=256 \
batch_size=16 \
accumulate_grads=2 \
feedback.ground_truth.penalize_extra_steps=True \
feedback.never_terminate=True \
envs_parallel=1 \
envs.max_steps=6
Note: max_action_length and max_action_length_in_memory may need adjustment depending on the dataset, GPU memory, and the model's context window.
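When adjusting these settings for a smaller GPU, it can help to keep the effective batch size constant. The arithmetic below assumes that accumulate_grads has the usual gradient-accumulation meaning (number of micro-batches summed before each optimizer step); that reading is an assumption, not something this README states. The batch_size and accumulate_grads values are taken from the commands above.

```python
def effective_batch_size(batch_size, accumulate_grads):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * accumulate_grads

# (batch_size, accumulate_grads) pairs from the training commands above.
runs = {
    "E5 HotpotQA": (12, 8),
    "E5 MuSiQue": (12, 8),
    "GTE HotpotQA + Musique": (16, 2),
}

for name, (batch_size, accumulate_grads) in runs.items():
    total = effective_batch_size(batch_size, accumulate_grads)
    print(f"{name}: {batch_size} x {accumulate_grads} = {total}")
```

Under that assumption, halving batch_size and doubling accumulate_grads leaves the effective batch unchanged while reducing peak memory.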
Retriever evaluation:
eval_retriever.py evaluates a pretrained retriever checkpoint and writes logs to the model's folder as eval_seed{seed}.jsonl.
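The resulting log is a JSON Lines file, so it can be inspected with a few lines of Python. The sketch below only assumes one JSON object per line; the fields inside each record are not documented in this README, so the code does not rely on any of them. `load_jsonl` and `log_path` are hypothetical helpers.

```python
import json
from pathlib import Path

def log_path(model_dir, seed):
    """Build the log file name described above: eval_seed{seed}.jsonl."""
    return Path(model_dir) / f"eval_seed{seed}.jsonl"

def load_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

if __name__ == "__main__":
    path = log_path("your/path/to/qrag-ft-e5-on-hotpotqa", 42)
    if path.exists():
        records = load_jsonl(path)
        print(f"{len(records)} records in {path}")
    else:
        print(f"{path} not found; run eval_retriever.py first")
```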
E5 HotpotQA only
python eval_retriever.py \
pretrained_path=your/path/to/qrag-ft-e5-on-hotpotqa \
num_samples=-1 \
+envs.max_steps=2 \
+envs.data_path=your/path/to/datasets/hotpotqa
E5 MuSiQue only
python eval_retriever.py \
pretrained_path=your/path/to/qrag-ft-e5-on-musique \
num_samples=-1 \
+envs.max_steps=4 \
+envs.data_path=your/path/to/datasets/musique
LLM evaluation:
python eval_llm_openqa.py \
--file_path your/path/to/qrag-ft-e5-on-hotpotqa/eval_seed42.jsonl \
--model_name Qwen/QwQ-32B \
--output_file_path your/path/to/qrag-ft-e5-on-hotpotqa/llm-answering_eval.json
Download the BabiLong data. Default paths are set in configs/envs/babilong.yaml.
We use the standard BabiLong pipeline with pre-prepared samples from PG19 books as noise. You can download this dataset from Hugging Face or Google Drive.
The total number of chunks is controlled by the num_chunks / num_sentences parameter. The agent's task is to find the supporting facts among all chunks.
CUDA_VISIBLE_DEVICES=0 python train_q_rag.py \
eval_interval=500 \
eval_episodes=1000 \
batch_size=64 \
accumulate_grads=1 \
max_action_length=64 \
max_action_length_in_memory=64 \
feedback.ground_truth.penalize_extra_steps=True \
feedback.never_terminate=True \
envs_parallel=1 \
logger.log_dir=runs/ \
envs.task="qa3_three-supporting-facts"
Retriever evaluation with single-length BabiLong:
CUDA_VISIBLE_DEVICES=0 python eval_retriever.py \
pretrained_path="your/path/to/model" \
envs.num_sentences=1200 \
+envs.test_env.feedback_model.never_terminate=True \
num_samples=-1 \
seed=42
Retriever evaluation with multi-length BabiLong sweep (1K to 1M tokens):
./scripts/eval_retriever_babilong.sh runs/<run_name> 0 42
Sentence-to-token mapping:
50 → 1k, 160 → 4k, 1200 → 32k, 4600 → 128k, 40000 → 1M
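The mapping can be used to build one single-length evaluation command per context size. The sketch below reuses the arguments from the single-length command above and prints the commands rather than running them. It is not a transcription of scripts/eval_retriever_babilong.sh, which may pass different options; `eval_command` is a hypothetical helper.

```python
import shlex

# Sentence counts and their approximate context lengths, from the mapping above.
SENTENCES_TO_TOKENS = {50: "1k", 160: "4k", 1200: "32k", 4600: "128k", 40000: "1M"}

def eval_command(pretrained_path, num_sentences, seed=42):
    """Build a single-length eval_retriever.py command as a shell-safe string."""
    args = [
        "python", "eval_retriever.py",
        f"pretrained_path={pretrained_path}",
        f"envs.num_sentences={num_sentences}",
        "+envs.test_env.feedback_model.never_terminate=True",
        "num_samples=-1",
        f"seed={seed}",
    ]
    return shlex.join(args)

if __name__ == "__main__":
    for num_sentences, label in SENTENCES_TO_TOKENS.items():
        print(f"# about {label} tokens")
        print(eval_command("runs/<run_name>", num_sentences))
```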
LLM evaluation with single-length BabiLong:
CUDA_VISIBLE_DEVICES=0 python eval_llm_synthetics.py \
retriever_logdir/retriever_logs.jsonl \
--llm_name "Qwen/Qwen3-4B" \
--babi_task qa3 \
--chunk_filter qvalue \
--stopping_threshold 0.5
LLM evaluation with multi-length BabiLong sweep:
# Multi-length sweep
./scripts/eval_llm_babilong.sh path/to/retriever_logdir "Qwen/Qwen3-4B" "qa3" 0
Download the RULER data. Default paths are set in configs/envs/niah.yaml or configs/envs/hotpotqa+musique.yaml.
CUDA_VISIBLE_DEVICES=0 python train_q_rag.py envs=niah
Retriever evaluation:
CUDA_VISIBLE_DEVICES=0 python eval_retriever.py \
pretrained_path="your/path/to/model" \
num_samples=1000 \
seed=42 \
use_last=True
LLM evaluation:
CUDA_VISIBLE_DEVICES=0 python eval_llm_synthetics.py \
your/path/to/retriever_log.jsonl \
--llm_name "Qwen/Qwen3-4B" \
--babi_task "niahmv" \
--max_tokens 512
Retriever evaluation:
For single-hop QA
python eval_retriever.py \
pretrained_path=your/path/to/qrag-ft-gte-on-hotpotqa_musique \
num_samples=-1 \
+envs.max_steps=1 \
+envs.data_path=your/path/to/datasets/data_sources/RULER/QA-SQuAD
For multi-hop QA
python eval_retriever.py \
pretrained_path=your/path/to/qrag-ft-gte-on-hotpotqa_musique \
num_samples=-1 \
+envs.max_steps=3 \
+envs.data_path=your/path/to/datasets/data_sources/RULER/QA-HotpotQA
LLM evaluation:
For both single-hop QA and multi-hop QA
python eval_llm_openqa.py \
--file_path your/path/to/qrag-ft-gte-on-hotpotqa_musique/eval_seed42.jsonl \
--model_name Qwen/QwQ-32B \
--output_file_path your/path/to/qrag-ft-gte-on-hotpotqa_musique/llm-answering_eval.json
Q-RAG/
├── train_q_rag.py              # Main training script
├── eval_retriever.py           # Retriever evaluation
├── eval_llm_synthetics.py      # LLM evaluation on BabiLong/RULER
├── eval_llm_longbench.py       # LLM evaluation on LongBench
├── eval_sbor_q.py              # Q-value evaluation
├── eval_feedback.py            # Feedback model evaluation
├── eval_llm_openqa.py          # LLM evaluation via vLLM on HotpotQA/MuSiQue
│
├── configs/                    # Hydra configs
│   ├── training.yaml           # Main training config
│   ├── testing.yaml            # Evaluation config overrides
│   ├── algo/                   # Algorithm configs (pqn, pqn_gte, ...)
│   ├── envs/                   # Environment configs (babilong, hotpotqa, combined, ...)
│   ├── feedback/               # Feedback model configs
│   └── logger/                 # Logging configs
│
├── rl/                         # Core RL module
│   ├── agents/                 # Agent implementations (PQN, DQN, SAC-D, SARSA)
│   ├── feedback/               # Feedback / reward models
│   ├── q_module.py             # Q-function neural network
│   ├── optim.py                # Optimizer utilities
│   └── langchain_utils.py      # LangChain integration utilities
│
├── envs/                       # Environments
│   ├── text_env.py             # Text retrieval environment
│   ├── parallel_env.py         # Parallelized environment wrapper
│   ├── qa_env.py               # QA environment
│   ├── dataloaders/            # Dataset loaders (BabiLong, HotpotQA, Musique, ...)
│   └── utils.py                # Environment utilities
│
├── prompts_and_metrics/        # Prompts and evaluation metrics
│   ├── babilong.py             # BabiLong prompts & metrics
│   ├── general_qa.py           # General QA metrics
│   ├── chunk_filtering.py      # Chunk filtering logic
│   └── answer_metric.py        # Answer quality metric
│
└── scripts/                    # Shell scripts for batch evaluation
    ├── eval_retriever_babilong.sh
    ├── eval_llm_babilong.sh
    └── train_niah.sh
If you find Q-RAG useful, please cite our paper:
@inproceedings{sorokin2026qrag,
title = {{Q-RAG}: Long Context Multi-Step Retrieval via Value-Based Embedder Training},
author = {Sorokin, Artyom and Buzun, Nazar and Anokhin, Alexander and Inozemcev, Oleg and Vedernikov, Egor and Anokhin, Petr and Burtsev, Mikhail and Trushkov, Alexey and Yin, Wenshuai and Burnaev, Evgeny},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year = {2026}
}
@article{sorokin2025qrag,
title = {{Q-RAG}: Long Context Multi-Step Retrieval via Value-Based Embedder Training},
author = {Sorokin, Artyom and Buzun, Nazar and Anokhin, Alexander and Inozemcev, Oleg and Vedernikov, Egor and Anokhin, Petr and Burtsev, Mikhail and Trushkov, Alexey and Yin, Wenshuai and Burnaev, Evgeny},
journal = {arXiv preprint arXiv:2511.07328},
year = {2025}
}
This work is licensed under CC BY 4.0.
We thank the developers of the open-source tools and frameworks that made this work possible, including Hydra, vLLM, PyTorch, Contriever, and Multilingual E5. We also thank the creators of the BabiLong, RULER, HotpotQA, and Musique benchmarks.
For bug reports and questions, please open a GitHub Issue.


