This repository is a research prototype that combines dense retrieval with a Grover-inspired top-k selector and compares LLM answers with and without context. It includes a Streamlit demo and evaluation scripts using the SQuAD 1.1 dataset.
- Dense retrieval with `mixedbread-ai/mxbai-embed-large-v1` embeddings and FAISS cosine search.
- `GroverTopK` selection implemented with Qiskit Aer.
- Multi-model comparisons via Hugging Face Inference (`llama-3-8b`, `mixtral-8x7b`, `phi-3.5` model keys).
- Benchmark scripts that export CSV/JSON results; plots are generated from the notebooks in `tests/evaluation/`.
- A user question is prefixed with the retrieval prompt and embedded by the transformer model.
- `ContextRetriever` searches a FAISS `IndexFlatIP` index over SQuAD contexts.
- `GroverTopK` selects the top-k contexts from the top-10 candidates using a dynamic threshold and Grover iterations.
- `AgentHandler` queries Hugging Face Inference for each model and returns answers with and without context.
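The retrieval step above can be sketched in a few lines. This is a toy illustration, not the repository's `ContextRetriever`: it unit-normalizes embeddings and ranks by inner product, which is exactly the cosine score a FAISS `IndexFlatIP` computes over normalized vectors. The 2-dimensional vectors below are made-up placeholders.

```python
# Toy cosine-similarity retrieval: normalize, take inner products, rank.
# Placeholder 2-d embeddings stand in for real model outputs.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query_vec, doc_vecs, k):
    q = normalize(query_vec)
    # Inner product of unit vectors == cosine similarity.
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

docs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]   # placeholder document embeddings
print(top_k([1.0, 0.0], docs, 2))             # -> [0, 2]
```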
- `app.py`: Streamlit entrypoint for the interactive demo.
- `src/components/`: core logic (retrieval, Grover selection, LLM client).
- `src/utils/`: embedding and dataset helpers.
- `tests/evaluation/`: benchmark runners, artifacts, and plots.
- `squad_dataset/`: SQuAD 1.1 data used by the evaluation scripts.
- `saved_embeddings/`: cached embeddings and document lists for retrieval.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_API_KEY=your_hf_token
streamlit run app.py
```

The demo loads embeddings from `saved_embeddings/` and builds a FAISS index on startup. If you change the dataset or embedding model, regenerate embeddings using the `ContextRetriever` class and update the file paths in `app.py`.
- LLM models are defined in `src/components/AgentHandler.py` (model keys map to Hugging Face model IDs).
- Retrieval defaults are in `app.py`: `MODEL_NAME` and `QUERY_PREFIX` for embeddings; `TOP_K_FIRST` (default 10) and `TOP_K_FINAL` (default 3).
- `GroverTopK` parameters (threshold, shots, top_k) are configurable in `src/components/GroverTopK.py`.
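For orientation, the defaults above might be laid out roughly like this in `app.py`. This is an illustrative shape only; the actual variable layout and the real `QUERY_PREFIX` string may differ.

```python
# Illustrative retrieval defaults (actual values/layout live in app.py).
MODEL_NAME = "mixedbread-ai/mxbai-embed-large-v1"  # embedding model
QUERY_PREFIX = "..."   # prefix prepended to questions before embedding (placeholder)
TOP_K_FIRST = 10       # candidates returned by the FAISS search
TOP_K_FINAL = 3        # contexts kept after Grover selection
```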
Run the quality and timing experiments (requires `HF_API_KEY` and network access):

```bash
python tests/evaluation/tests_runner.py
python tests/evaluation/time_test_runner.py
```

Outputs:

- Quality: `tests/test_results.csv`, `tests/test_results.json`
- Timing: `tests/time_test_result.csv`, `tests/time_test_results.json`
The repository also includes prior results and plots under tests/evaluation/ and a narrative summary in tests/tests_summary.md.
The plots and numbers below are from the included evaluation summary (tests/tests_summary.md) and the saved artifacts in tests/evaluation/.
- Objective: compare Grover vs. classic context selection across LLMs and context variants (`no_context`, `top1`, `top3`).
- Metrics: word overlap (with vs. without context) and cosine similarity vs. the ideal answer.
- Key insights from the recorded summary:
  - Top-3 contexts outperform top-1; `llama-3-8b` reaches the highest cosine similarity (0.80).
  - Grover and classic selection are nearly identical (quality differences <0.5%).
  - No-context variants show a significant quality drop across models.
Reported averages (end-to-end timing, including context selection):

| Component | Time |
|---|---|
| Context retrieval (top-10) | 0.297 s |
| Grover selection (top-3) | 0.030 s |
| Answer generation (mixtral-8x7b) | 1.18 s |
| Answer generation (llama-3-8b) | 2.56 s |
| Answer generation (phi-3.5) | 2.65 s |
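Summing the averages above gives the implied end-to-end latency per model (retrieval + Grover selection + generation); this is simple arithmetic on the reported figures, not a separately measured result.

```python
# Implied end-to-end latency from the reported component averages.
retrieval, grover = 0.297, 0.030
generation = {"mixtral-8x7b": 1.18, "llama-3-8b": 2.56, "phi-3.5": 2.65}
for model, t in generation.items():
    print(f"{model}: {retrieval + grover + t:.2f} s total")
# mixtral-8x7b: 1.51 s total
# llama-3-8b: 2.89 s total
# phi-3.5: 2.98 s total
```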
Additional findings from the recorded summary:
- Context consistency: 99.11% match between Grover and classic selection; a single discrepancy occurred when Grover’s dynamic threshold excluded all contexts.
- Quality validation: `top3` > `top1` > `no_context`; `llama-3-8b` again achieved the highest cosine similarity (0.80).
- Grover adds ~30 ms latency vs. classic selection while maintaining selection quality.
- Context is critical: three contexts substantially improve accuracy vs. no context.
- Model tradeoff: `mixtral-8x7b` is fastest, `llama-3-8b` is most accurate on this benchmark.
The Streamlit interface supports model comparison, context inspection, and collapsible answers. Screenshots of the Home, Answers, and Top Contexts screens are included under `tests/evaluation/GUI_images/`.
- `squad_dataset/train-v1.1.json` is the SQuAD 1.1 dataset used by the evaluation scripts.
- `saved_embeddings/` contains embeddings and document lists for the default model and dataset.
- Plots and UI screenshots live in `tests/evaluation/results_images/` and `tests/evaluation/GUI_images/`.
- LLM calls use the Hugging Face Inference API and are subject to rate limits and model availability.
- GroverTopK runs on the Qiskit Aer simulator rather than quantum hardware.
- Results are research-oriented and not intended as production benchmarks.
Grover's algorithm is designed for unstructured search with a marked-item oracle. In this project it is used as a quantum-inspired top-k selector on a small, already ranked candidate set (top-10 from FAISS), mainly to explore the integration of quantum techniques in a RAG pipeline. Because the selection runs on a simulator and the search space is small, it does not offer a practical speedup over classical selection. For production retrieval quality or efficiency, classical approaches (e.g., improved embeddings, larger ANN indices, or reranking) are typically more impactful.
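The "no practical speedup" point can be made concrete: classical selection over N candidates costs O(N) comparisons, while Grover needs roughly (π/4)·√(N/k) oracle calls. For the top-10 candidate set with k = 3 marked items, that works out to about one iteration, so there is nothing left to amplify asymptotically:

```python
# Grover iteration count for the pipeline's actual search space:
# N = 10 FAISS candidates, k = 3 contexts to select.
import math

N, k = 10, 3
iterations = round(math.pi / 4 * math.sqrt(N / k))
print(iterations)  # -> 1 (a single iteration: no meaningful speedup at this scale)
```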
The Grover selector in `src/components/GroverTopK.py` works as follows:

- Thresholding: a dynamic threshold is adjusted to target roughly `k` items above the similarity cutoff.
- Marked indices: all candidates exceeding the threshold are marked.
- Oracle: a Grover oracle is built by applying multi-controlled X gates (MCX) to mark the target bitstrings.
- Diffusion: a standard diffusion operator is applied to amplify marked states.
- Sampling: the circuit runs on the Qiskit Aer simulator for a fixed number of shots.
- Top-k extraction: the most frequent bitstrings are mapped back to candidate indices to return the top contexts.
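The oracle/diffusion/sampling steps above can be sketched without Qiskit by evolving a plain statevector (the real implementation builds a circuit and runs it on Aer). The qubit count (4, covering the 10 candidate slots) and the marked indices below are illustrative assumptions, not values from the repository.

```python
# Statevector sketch of the Grover amplification used for top-k selection.
import math

def grover_probs(n_qubits, marked, iterations):
    """Probability of each basis state after the given Grover iterations."""
    n = 2 ** n_qubits
    amp = [1 / math.sqrt(n)] * n              # uniform superposition (Hadamards)
    for _ in range(iterations):
        for i in marked:                      # oracle: sign-flip marked states
            amp[i] = -amp[i]
        mean = sum(amp) / n                   # diffusion: reflect about the mean
        amp = [2 * mean - a for a in amp]
    return [a * a for a in amp]

# 4 qubits -> 16 basis states; suppose candidates 1, 4, 7 passed the threshold.
marked = {1, 4, 7}
iters = round(math.pi / 4 * math.sqrt(16 / len(marked)))   # ~optimal: 2
probs = grover_probs(4, marked, iters)

# Top-k extraction: the most probable states map back to the marked candidates.
top3 = sorted(range(16), key=lambda i: probs[i], reverse=True)[:3]
print(sorted(top3))  # -> [1, 4, 7]
```

In the real pipeline, sampling a fixed number of shots on Aer approximates these probabilities, and the most frequent bitstrings are decoded back to context indices.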
This project builds on SQuAD, Qiskit, FAISS, and Hugging Face tooling.
MIT License. See LICENSE.




