This repository contains the implementation for the paper "UnWeaving the knots of GraphRAG - turns out VectorRAG is almost enough". The project presents UnWeaver, a novel approach to Retrieval-Augmented Generation (RAG) that challenges the conventional wisdom of using graph-based knowledge representations.
The diagram above illustrates the flow of data through the UnWeaver system, showing how documents are processed, indexed, and retrieved for question answering.
The project consists of two main components:
- UnWeaver: A RAG system that implements the novel approach described in the paper
- Evaluation: A comprehensive evaluation framework for assessing RAG system performance
```
.
├── unweaver/            # UnWeaver RAG system implementation
├── evaluation/          # Evaluation framework
├── data_preprocessing/  # Data preprocessing tools
└── README.md            # This file
```
The project uses Poetry for dependency management. Make sure you have Poetry installed on your system.
- Python 3.9 or higher
- Poetry
- MongoDB (for LLM/embedding response caching, if caching is enabled)
- Clone the repository:

```shell
git clone <repository-url>
cd unweaver_arxiv
```

- Install dependencies for UnWeaver:

```shell
cd unweaver
poetry install
```

- Install dependencies for Evaluation:

```shell
cd ../evaluation
poetry install
```

To obtain the datasets used in the paper and preprocess them into a format digestible by the UnWeaver pipeline, run the `data_preprocessing/run.sh` script.
The UnWeaver system can be run using the provided shell script or by executing the Python modules directly.
The `unweaver/run.sh` script automates the indexing and querying process for all datasets:

```shell
cd unweaver
./run.sh
```

This script will:
- Index the COVID-QA, E-Manual, and TechQA datasets
- Query each dataset using the configured retrieval methods
- Store results in the `index_<dataset_name>` directories
You can also run the indexing and querying steps manually:
Indexing:

```shell
cd unweaver
poetry run python -m unweaver.index \
    ../data/<dataset_name>/files_preprocessed/ \
    ./index_<dataset_name> \
    --config configs/custom.json
```

Querying:

```shell
cd unweaver
poetry run python -m unweaver.query \
    ../data/<dataset_name>/questions.json \
    ./index_<dataset_name> \
    --run_name <run_name> \
    --config configs/custom.json
```

To evaluate the results generated by UnWeaver:
```shell
cd evaluation
poetry run python -m evaluation \
    ../unweaver/index_<dataset_name> \
    --config configs/custom.json
```

The evaluation framework will:
- Load query results from the specified working directory
- Calculate metrics using RAGAS
- Generate timing and token usage statistics
- Log results to MLflow (if configured)
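As a loose illustration of the aggregation step, the sketch below averages per-question metric scores into a summary. The field names and score values are invented for the example; they are not the evaluation framework's actual output schema.

```python
import json
from statistics import mean

# Hypothetical per-question results; the metric names mirror common RAGAS
# metrics, but this schema is an assumption, not the project's real format.
results = [
    {"question_id": 1, "faithfulness": 0.91, "answer_relevancy": 0.88},
    {"question_id": 2, "faithfulness": 0.76, "answer_relevancy": 0.93},
]

def aggregate(rows, metrics=("faithfulness", "answer_relevancy")):
    """Return the mean of each metric across all questions."""
    return {m: mean(r[m] for r in rows) for m in metrics}

summary = aggregate(results)
print(json.dumps(summary, indent=2))
```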
Both UnWeaver and Evaluation use JSON configuration files to control their behavior:
- UnWeaver config: `unweaver/configs/custom.json`
- Evaluation config: `evaluation/configs/custom.json`
Key configuration parameters include:
- LLM settings (model, API endpoints, timeouts)
- Embedder settings (model, dimensions, batch size)
- Retrieval parameters (top-k values, chunk sizes)
- Evaluation metrics and MLflow tracking
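As a rough illustration only, a config covering these parameters might look like the sketch below. Every key name and value here is an assumption for the sake of the example, not the project's actual schema.

```json
{
  "llm": {
    "model": "gpt-4o-mini",
    "api_base": "https://api.example.com/v1",
    "timeout_s": 60
  },
  "embedder": {
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "batch_size": 32
  },
  "retrieval": {
    "top_k": 10,
    "chunk_size": 512
  }
}
```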
See the individual READMEs in the unweaver/ and evaluation/ directories for detailed configuration options.
The project includes three datasets for evaluation:
- COVID-QA: Biomedical question answering dataset
- E-Manual: Technical manual dataset
- TechQA: Technical question answering dataset
Each dataset should be placed in the data/ directory with the following structure:
```
data/<dataset_name>/
├── questions.json        # Questions for evaluation
├── files/                # Original documents
└── files_preprocessed/   # Preprocessed documents for indexing
```
If you use this code in your research, please cite our paper:
```bibtex
@article{unweaver2026,
  title={UnWeaving the knots of GraphRAG - turns out VectorRAG is almost enough},
  author={Tuora, Ryszard and Galiński, Mateusz and Godziszewski, Michał and Karpowicz, Michał and Czyżnikiewicz, Mateusz and Kozakiewicz, Adam and Ziętkiewicz, Tomasz},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

For questions or issues, please open an issue on the repository.
