Cross-Encoder-Reranker-Signals-for-Robust-RAG-Defense-Against-Poisoning-Attacks

Cross-Encoder Reranker Signals for Robust RAG Defense Against Poisoning Attacks introduces CEG-RAG, a practical defense framework for corpus poisoning in Retrieval-Augmented Generation (RAG). RAG systems improve factuality by grounding LLM outputs in retrieved evidence, but they also expand the attack surface: adversaries can inject crafted passages into the knowledge base so that poisoned chunks are retrieved, ranked highly, and ultimately steer generation toward attacker-chosen targets. CEG-RAG leverages a security signal that already exists in many strong RAG pipelines: the internal representations of a cross-encoder reranker. Instead of treating the reranker as a black box that outputs relevance scores, we extract the reranker’s final-layer pooled [CLS] embeddings for query–passage pairs and use them as features for a multiple-instance learning (MIL) detector. The retrieved set is modeled as a “bag” (context) and each passage as an “instance” (chunk), enabling weak supervision with context-level labels only. The model jointly supports (i) context-level poisoning detection and (ii) chunk-level localization of suspicious passages. When poisoning is detected, CEG-RAG performs context repair before generation: it filters high-risk chunks using learned poison scores and replaces them with lower-ranked candidates while maintaining a fixed context budget. This design keeps overhead low—no extra LLM calls are required—and prevents malicious evidence from reaching the generator. We evaluate CEG-RAG on MS-MARCO, Natural Questions (NQ), and HotpotQA under the strong PoisonedRAG attack. Results show consistently strong detection and localization (high TPR at low FPR) and substantial mitigation effectiveness, reducing attack success rate (ASR) while recovering correct answers. 
We further demonstrate robustness across rerankers (BGE, mGTE, and MS-MARCO MiniLM) and compare against recent defenses including GMTP, RobustRAG, TrustRAG, and RAGuard, where CEG-RAG delivers the most consistent ASR reduction across datasets. The methodology overview is presented in the figure below.

In our evaluation setup, we construct a controlled poisoned-RAG benchmark over three open-domain QA datasets: Natural Questions (NQ), HotpotQA, and MS-MARCO. For each dataset, we randomly sample 4,000 queries together with their associated evidence passages to form the underlying knowledge source used by the RAG system. From these, we designate 2,000 queries as attack targets and poison the corpus using the PoisonedRAG attack. For every targeted query, we inject five adversarial passages that are all crafted to support the same attacker-chosen answer, resulting in 10,000 poisoned passages per dataset. The remaining sampled content is kept clean and serves as benign background evidence, allowing us to evaluate whether the defense can distinguish malicious from non-malicious retrieval content under realistic mixed-corpus conditions.

To support weakly supervised training, we represent each retrieved context as a bag and each retrieved passage within that context as an instance, following the multiple-instance learning (MIL) formulation. During training and evaluation, the RAG pipeline first retrieves candidate passages using Contriever, reranks them using a cross-encoder, and then extracts the reranker’s final-layer pooled [CLS] embeddings for each query-passage pair. These embeddings are used as the input features for the MIL detector. Context-level labels indicate whether the retrieved set contains poisoning, while chunk-level suspiciousness is learned implicitly through the MIL mechanism, enabling localization without requiring explicit chunk annotations during training.
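The bag/instance formulation above can be illustrated with a minimal sketch. It assumes a linear scorer over each chunk embedding and max pooling to aggregate chunk scores into a context score; the source does not state which MIL aggregation CEG-RAG uses, so both choices, along with the names `score_bag`, `weights`, and `bias`, are hypothetical. The toy vectors stand in for the reranker's pooled [CLS] embeddings.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score_bag(
    instance_embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> Tuple[float, List[float]]:
    """Return (context-level poison score, per-chunk poison scores).

    Assumption: a linear scorer plus max pooling. This is one common MIL
    variant, not necessarily the detector used by CEG-RAG.
    """
    if not instance_embeddings:
        raise ValueError("a bag must contain at least one instance")
    chunk_scores = [
        sigmoid(sum(w * x for w, x in zip(weights, emb)) + bias)
        for emb in instance_embeddings
    ]
    # Max pooling: the bag is as suspicious as its most suspicious chunk,
    # so a context-level label alone can supervise the chunk scorer.
    return max(chunk_scores), chunk_scores


# Toy 3-dimensional stand-ins for reranker [CLS] embeddings.
bag = [[-2.0, 0.1, 0.0], [3.0, -0.2, 0.5], [-1.0, 0.0, 0.3]]
bag_score, chunk_scores = score_bag(bag, weights=[1.0, 0.0, 0.0])
```

Because the context score is derived from the chunk scores, the same forward pass yields both the detection decision and the localization signal.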

When poisoning is detected, the system applies a context repair step before generation. Specifically, chunks assigned high poison scores are filtered out and replaced with lower-ranked candidates drawn from the same retrieval pool, while preserving the fixed context budget used by the generator. The repaired context is then passed to Meta-Llama-3-8B-Instruct, using the top three reranked chunks ($k=3$) for final answer generation. This allows the entire defense to operate within the existing RAG pipeline, without requiring extra LLM calls or expensive post hoc verification stages.
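The repair step described above can be sketched as a single pass over the reranked candidate pool. This is an illustration under stated assumptions, not the repository's implementation: the function name `repair_context`, the `threshold` value of 0.5, and the choice to screen backfilled candidates with the same poison scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def repair_context(
    ranked_chunks: Sequence[str],
    poison_scores: Sequence[float],
    k: int = 3,
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Build a k-chunk context from a reranked pool, skipping risky chunks.

    ranked_chunks is ordered best-first by the reranker; poison_scores holds
    the detector's per-chunk score in the same order. The threshold is an
    assumed hyperparameter.
    """
    if len(ranked_chunks) != len(poison_scores):
        raise ValueError("each chunk needs exactly one poison score")
    kept: List[str] = []
    removed: List[str] = []
    for chunk, score in zip(ranked_chunks, poison_scores):
        if len(kept) == k:
            break  # context budget is full
        if score >= threshold:
            removed.append(chunk)  # filtered before it can reach the generator
        else:
            kept.append(chunk)  # lower-ranked clean chunks backfill the gaps
    return kept, removed


pool = ["a", "b", "c", "d", "e"]
scores = [0.9, 0.1, 0.8, 0.2, 0.3]
context, dropped = repair_context(pool, scores, k=3)
```

Walking the pool in rank order and skipping flagged chunks is equivalent to filtering the top-$k$ and backfilling from lower ranks. If the pool holds fewer than $k$ clean chunks, this sketch returns a shorter context rather than admitting a flagged chunk.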


Across datasets, CEG-RAG achieves strong and consistent poisoning detection. At the context level, the method reaches a TPR of 0.9833 on MS-MARCO, 0.9507 on NQ, and 0.8500 on HotpotQA, while maintaining relatively low FPR values. At the finer-grained chunk level, localization performance remains strong, with TPR values of 0.9764, 0.9404, and 0.8840 on MS-MARCO, NQ, and HotpotQA, respectively. These results show that the method is not only able to detect that a retrieved context is compromised, but can also identify the suspicious passages responsible for the attack.
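For reference, TPR and FPR follow the standard confusion-matrix definitions, and they apply unchanged whether each item is a whole context or a single chunk. The sketch below uses made-up labels and predictions, not the reported results; the helper name `tpr_fpr` is hypothetical.

```python
from typing import Sequence, Tuple


def tpr_fpr(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[float, float]:
    """Compute (TPR, FPR) for binary labels where 1 = poisoned, 0 = clean."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of poisoned items flagged
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of clean items flagged
    return tpr, fpr


# Illustrative data only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 1, 0, 0, 1, 0, 0, 1]
tpr, fpr = tpr_fpr(labels, predictions)
```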

The repair stage substantially reduces attacker control over final generation. After suspicious passages are filtered and replaced, the Attack Success Rate (ASR) drops to 11.20% on MS-MARCO, 6.30% on NQ, and 16.28% on HotpotQA, corresponding to reductions of 88.80%, 93.70%, and 83.72%, respectively. At the same time, the repaired context often restores useful evidence for correct generation, yielding post-repair accuracy of 50.00% on MS-MARCO, 41.06% on NQ, and 13.95% on HotpotQA. This combination of lower ASR and meaningful accuracy recovery indicates that the method does more than disrupt the attack: it often reconstructs a context informative enough to support the correct answer.
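As a rough illustration of how ASR and answer accuracy can be scored, the sketch below counts a response as an attack success when it contains the attacker-chosen answer and as correct when it contains the gold answer. Case-insensitive substring matching is an assumption made here for illustration; the source does not describe its scoring rule. The function names and example strings are hypothetical.

```python
from typing import Sequence, Tuple


def contains_answer(response: str, answer: str) -> bool:
    """Case-insensitive substring match (an assumed scoring rule)."""
    needle = answer.strip().lower()
    return bool(needle) and needle in response.lower()


def asr_and_accuracy(
    responses: Sequence[str],
    target_answers: Sequence[str],
    gold_answers: Sequence[str],
) -> Tuple[float, float]:
    """Return (ASR %, accuracy %) over a set of generated responses."""
    n = len(responses)
    if n == 0 or n != len(target_answers) or n != len(gold_answers):
        raise ValueError("inputs must be non-empty and equally long")
    attacked = sum(contains_answer(r, t) for r, t in zip(responses, target_answers))
    correct = sum(contains_answer(r, g) for r, g in zip(responses, gold_answers))
    return 100.0 * attacked / n, 100.0 * correct / n


# Illustrative data only.
responses = ["The capital is Paris.", "It is Lyon.", "I don't know.", "Paris, France"]
asr, acc = asr_and_accuracy(responses, ["Lyon"] * 4, ["Paris"] * 4)
```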

We also evaluate robustness across different rerankers, including BGE-reranker, mGTE-reranker, and MS-MARCO MiniLM. Detection performance remains strong across reranker choices, suggesting that the core signal exploited by CEG-RAG is not tied to a single model family. While repair outcomes vary somewhat depending on reranker behavior, the method consistently maintains low ASR and competitive answer recovery. In addition, sensitivity analysis on the number of forwarded chunks shows that performance remains stable across a broad range of top-$k$ values, indicating that the defense does not rely on fragile tuning of the final context size.

Finally, when compared against recent RAG-specific defenses including GMTP, RobustRAG, TrustRAG, and RAGuard, CEG-RAG delivers the most consistent ASR reduction across all three datasets. It achieves the lowest ASR on MS-MARCO, NQ, and HotpotQA, while remaining competitive in gold-answer recovery, particularly on MS-MARCO and NQ. These findings suggest that reranker-representation-based detection, combined with lightweight context repair, provides a strong and practical defense against corpus poisoning in modern RAG systems.

We evaluate CEG-RAG on Natural Questions, HotpotQA, and MS-MARCO under the PoisonedRAG attack. For each dataset, we sample 4,000 queries to build the knowledge source, then target 2,000 queries for poisoning and inject five malicious passages per target, resulting in 10,000 poisoned passages per dataset. Retrieved query-passage pairs are encoded using a cross-encoder reranker, and the final-layer pooled [CLS] embeddings are fed into a multiple-instance learning (MIL) detector that models each retrieved context as a bag of passages. This allows the system to perform both context-level poisoning detection and chunk-level localization without requiring explicit chunk labels during training.

When poisoning is detected, CEG-RAG repairs the context before generation by removing high-risk passages and replacing them with lower-ranked candidates while preserving the fixed context budget. Across datasets, the method achieves strong detection and localization performance, with high TPR and low FPR, and substantially reduces Attack Success Rate (ASR) after repair. For example, ASR drops to 11.20% on MS-MARCO, 6.30% on NQ, and 16.28% on HotpotQA, while also recovering correct answers in many cases. The approach remains robust across multiple rerankers and outperforms recent defenses such as GMTP, RobustRAG, TrustRAG, and RAGuard in terms of consistent ASR reduction.

For detector training, we use the attack-successful poisoned retrieval files successful_poisoned_hotpot.json, successful_poisoned_msmarco (1).json, and successful_poisoned_nq (1).json. To evaluate the detector inside the full RAG pipeline, we use the held-out test files Final_test_Msmarco_poisoned_rag.json, Final_test_poisonedRAG_nq2.json, and finaltest_poisoned_hotpot.json, which measure detection performance and the effectiveness of the defense at inference time. To construct the poisoned knowledge base, we use hotpot_contriever_chunks_poisoned_gpt.json, msmarco_contriever_chunks_poisoned_gpt (1).json, and nq_contriever_chunks_poisoned_gpt2.json. For the clean (non-poisoned) knowledge base, we use hotpot_contriever_chunks.json.gz, msmarco_contriever_chunks.json, and nq_contriever_chunks.json.
