Novel RAG system that extends the capabilities of VectorRAG by simplifying GraphRAG

SamsungLabs/UnWeaver


UnWeaving the knots of GraphRAG - turns out VectorRAG is almost enough

This repository contains the implementation for the paper "UnWeaving the knots of GraphRAG - turns out VectorRAG is almost enough". The project presents UnWeaver, a novel approach to Retrieval-Augmented Generation (RAG) that challenges the conventional wisdom of using graph-based knowledge representations.

Architecture of UnWeaver

[Diagram: UnWeaver Flow]

The diagram above illustrates the flow of data through the UnWeaver system, showing how documents are processed, indexed, and retrieved for question answering.

Overview

The project consists of two main components:

  • UnWeaver: A RAG system that implements the novel approach described in the paper
  • Evaluation: A comprehensive evaluation framework for assessing RAG system performance

Project Structure

.
├── unweaver/               # UnWeaver RAG system implementation
├── evaluation/             # Evaluation framework
├── data_preprocessing/     # Data preprocessing tools
└── README.md               # This file

Installation

The project uses Poetry for dependency management. Make sure you have Poetry installed on your system.

Prerequisites

  • Python 3.9 or higher
  • Poetry
  • MongoDB (for LLM/Embedding caching if using cache)

Setup

  1. Clone the repository:

     git clone <repository-url>
     cd unweaver_arxiv

  2. Install dependencies for UnWeaver:

     cd unweaver
     poetry install

  3. Install dependencies for Evaluation:

     cd ../evaluation
     poetry install

Usage

Getting data

To obtain the datasets used in the paper and preprocess them into a format digestible by the UnWeaver pipeline, run the data_preprocessing/run.sh script.

Running UnWeaver

The UnWeaver system can be run using the provided shell script or by executing the Python modules directly.

Using the run script (recommended)

The unweaver/run.sh script automates the indexing and querying process for all datasets:

cd unweaver
./run.sh

This script will:

  1. Index the COVID-QA, E-Manual, and TechQA datasets
  2. Query each dataset using the configured retrieval methods
  3. Store results in the index_<dataset_name> directories
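The loop below is a hypothetical sketch of what such a driver script might do; the dataset directory names and the loop shape are assumptions inferred from the manual commands later in this README, not the contents of run.sh itself.

```shell
# Hypothetical sketch of a run.sh-style driver loop; the dataset
# directory names below are assumptions, not taken from the script.
datasets="covid_qa e_manual techqa"
for d in $datasets; do
  # Each dataset is first indexed, then queried against that index:
  echo "index: ../data/$d/files_preprocessed/ -> ./index_$d"
  echo "query: ../data/$d/questions.json against ./index_$d"
  # poetry run python -m unweaver.index ../data/$d/files_preprocessed/ ./index_$d --config configs/custom.json
  # poetry run python -m unweaver.query ../data/$d/questions.json ./index_$d --run_name <run_name> --config configs/custom.json
done
```

The actual commands are commented out above so the sketch stays illustrative; see the Manual execution section for their real form.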

Manual execution

You can also run the indexing and querying steps manually:

Indexing:

cd unweaver
poetry run python -m unweaver.index \
  ../data/<dataset_name>/files_preprocessed/ \
  ./index_<dataset_name> \
  --config configs/custom.json

Querying:

cd unweaver
poetry run python -m unweaver.query \
  ../data/<dataset_name>/questions.json \
  ./index_<dataset_name> \
  --run_name <run_name> \
  --config configs/custom.json

Running Evaluation

To evaluate the results generated by UnWeaver:

cd evaluation
poetry run python -m evaluation \
  ../unweaver/index_<dataset_name> \
  --config configs/custom.json

The evaluation framework will:

  1. Load query results from the specified working directory
  2. Calculate metrics using RAGAS
  3. Generate timing and token usage statistics
  4. Log results to MLflow (if configured)
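RAGAS-style metrics are computed over per-question records that pair the generated answer with its retrieved contexts and a reference answer. A minimal sketch of such a record follows; the field names mirror the common RAGAS input convention and are assumptions, not this repository's exact result schema.

```python
# Illustrative per-question evaluation record; field names follow the
# common RAGAS input convention and are assumptions, not this repo's schema.
record = {
    "question": "What is covered in chapter 3 of the manual?",
    "answer": "The generated answer produced by UnWeaver.",
    "contexts": [
        "Retrieved chunk 1 ...",
        "Retrieved chunk 2 ...",
    ],
    "ground_truth": "The reference answer from questions.json.",
}

# Faithfulness-style metrics compare `answer` against `contexts`,
# while correctness-style metrics compare `answer` against `ground_truth`.
required = {"question", "answer", "contexts", "ground_truth"}
assert required <= set(record)
```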

Configuration

Both UnWeaver and Evaluation use JSON configuration files to control their behavior:

  • UnWeaver config: unweaver/configs/custom.json
  • Evaluation config: evaluation/configs/custom.json

Key configuration parameters include:

  • LLM settings (model, API endpoints, timeouts)
  • Embedder settings (model, dimensions, batch size)
  • Retrieval parameters (top-k values, chunk sizes)
  • Evaluation metrics and MLflow tracking

See the individual READMEs in the unweaver/ and evaluation/ directories for detailed configuration options.
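Purely as an illustration, a custom.json might look roughly like the sketch below; every key name and value here is an assumption meant to mirror the parameter groups listed above, not the repository's actual schema.

```json
{
  "llm": {
    "model": "example-llm",
    "api_endpoint": "http://localhost:8000/v1",
    "timeout_s": 60
  },
  "embedder": {
    "model": "example-embedder",
    "dimensions": 768,
    "batch_size": 32
  },
  "retrieval": {
    "top_k": 10,
    "chunk_size": 512
  }
}
```

Consult the per-component READMEs for the real key names before editing a config.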

Datasets

The project includes three datasets for evaluation:

  1. COVID-QA: Biomedical question answering dataset
  2. E-Manual: Technical manual dataset
  3. TechQA: Technical question answering dataset

Each dataset should be placed in the data/ directory with the following structure:

data/<dataset_name>/
├── questions.json          # Questions for evaluation
├── files/                  # Original documents
└── files_preprocessed/     # Preprocessed documents for indexing
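The exact schema of questions.json is defined by the preprocessing scripts; purely as an illustration, a question file in this style often looks like the sketch below (all field names assumed).

```json
[
  {
    "question": "What safety precautions does the manual list for installation?",
    "answer": "The reference (ground-truth) answer used for evaluation."
  }
]
```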

Citation

If you use this code in your research, please cite our paper:

@article{unweaver2026,
  title={UnWeaving the knots of GraphRAG - turns out VectorRAG is almost enough},
  author={Ryszard Tuora and Mateusz Galiński and Michał Godziszewski and Michał Karpowicz and Mateusz Czyżnikiewicz and Adam Kozakiewicz and Tomasz Ziętkiewicz},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

Contact

For questions or issues, please open an issue on the repository.
