Semantic Document Retrieval using Single-Link Clustering and Vector Space Models

A document retrieval system that leverages the Vector Space Model (VSM) and efficient clustering to extract, store, rank, and retrieve documents. Designed to serve research and enterprise needs for smarter information access.

IEEE Publication

This project is based on our IEEE-published research paper:

📝 Title: Document Storage and Retrieval with Efficient Clustering Using the Vector Space Model
🔗 IEEE Xplore: https://ieeexplore.ieee.org/abstract/document/10940682

The model introduces a hybrid approach combining TF-IDF-based retrieval with Single-Link Clustering, improving accuracy, relevance, and ranking.

Watch the Demo

WhatsApp.Video.2025-05-06.at.13.57.online-video-cutter.com.mp4

🌟 Features

📥 Text Extraction (using PyMuPDF)
✨ Preprocessing: Tokenization, Stopword Removal, Normalization
📊 TF-IDF Vectorization: Classic Vector Space Model
🔗 Single-Link Clustering for smarter group-based ranking
🧠 Hybrid Retrieval combining VSM + clustering similarity
🔎 Streamlit UI for fast search and document preview
🧾 MongoDB backend for storing documents and search logs
📈 Built-in Evaluation Metrics (Precision, Recall, MAP, NDCG, F2)
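
As an illustration of the preprocessing feature listed above, here is a minimal sketch of tokenization, stopword removal, and normalization using only the standard library. The stopword list and the `preprocess` function name are assumptions made for this example; the logic in src/preprocess_text.py may differ.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Normalize, tokenize, and drop stopwords from raw text."""
    # Normalization: fold Unicode compatibility forms and lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Tokenization: keep runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stopword removal.
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("The Retrieval of Documents is FAST, with TF-IDF!"))
```

The returned token list is what a TF-IDF vectorizer would then consume.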

📁 Project Structure

Efficient-Doc-Retrieval-Storage/
├── app.py                         # Streamlit app interface
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE

├── ground_truth/
│   └── ground_truth.csv          # CSV for ground-truth relevance

├── mongodb_data/                 # MongoDB dump (not included in GitHub)

├── src/
│   ├── __init__.py
│   ├── BM25.py                   # BM25 scorer (optional)
│   ├── clustering.py             # Single-link clustering logic
│   ├── compute_tfidf.py          # TF-IDF vector computation
│   ├── config.py                 # Global configuration variables
│   ├── evaluation.py             # Precision, Recall, MAP, NDCG etc.
│   ├── extract_text.py           # Extract text from documents
│   ├── fetch_gdrive.py           # (Optional) fetch PDFs from Google Drive
│   ├── log_search.py             # Log user queries
│   ├── preprocess_text.py        # Clean and normalize document text
│   ├── query_log.py              # Access and analyze search history
│   ├── query_mongo.py            # Search MongoDB for past queries
│   ├── retrieve_documents.py     # Main hybrid retrieval logic
│   ├── run_evaluation.py         # CLI script for evaluation
│   ├── store_mongo.py            # Insert documents into MongoDB
│   ├── update_links.py           # Fix or update download links
│   ├── update_mongo.py           # Modify or refresh DB content
│   ├── vsm.py                    # Vector Space Model ranking
│   ├── token.pickle              # OAuth token for GDrive
│   ├── downloaded_files.txt
│   └── data/                     # Folder for raw/extracted files

└── venv/                         # Virtual environment

⚙️ Installation & Setup

To set up this project locally, follow the steps below:

1. Clone the repository:

    git clone https://github.com/rawoolsiddhi/Efficient-Doc-Retrieval-Storage.git

Navigate into the directory:

    cd Efficient-Doc-Retrieval-Storage

2. Create and activate a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # For Linux/Mac
    venv\Scripts\activate     # For Windows

3. Install required Python packages:

    pip install -r requirements.txt

4. Start MongoDB:

Make sure MongoDB is installed and running locally (default port: 27017), or update your code with a cloud MongoDB URI.
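
One possible way to switch between a local and a cloud instance without editing code is to read the URI from an environment variable. The sketch below is an assumption, not the project's actual configuration: the `MONGO_URI` variable and the database and collection names are hypothetical placeholders.

```python
import os

DEFAULT_URI = "mongodb://localhost:27017"  # local instance on the default port


def resolve_mongo_uri(env=None):
    """Return the URI from the (hypothetical) MONGO_URI variable, else the local default."""
    env = os.environ if env is None else env
    return env.get("MONGO_URI") or DEFAULT_URI


def get_collection(db_name="doc_retrieval", coll_name="documents"):
    """Open a collection; the database and collection names are placeholders."""
    # Imported lazily so resolve_mongo_uri works even without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(resolve_mongo_uri(), serverSelectionTimeoutMS=3000)
    client.admin.command("ping")  # fails fast if the server is unreachable
    return client[db_name][coll_name]
```

With this approach, setting `MONGO_URI` before launching the app would point it at a cloud cluster, and leaving it unset would fall back to the local server.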

5. Run the Streamlit app:

    streamlit run app.py

This will launch the document retrieval interface in your browser.

UI

Dependencies: Python 3.8 or above, and MongoDB (local or cloud instance).

Architecture Diagram

Evaluation Metrics

Metric            Hybrid Model (VSM + Clustering)    VSM Only       BM25 (Optional)
Precision         0.91                               0.83           0.77
Recall            0.94                               0.87           0.81
F2 Score          0.93                               0.85           0.79
NDCG              0.92                               0.82           0.78
MAP               0.89                               0.80           0.75
Accuracy Range    0.88 - 0.96                        0.85 - 0.95    0.82 - 0.91
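
For readers unfamiliar with the metrics in the table, the sketch below shows how they are commonly computed for a single query under binary relevance. It is an illustration only and may not match src/evaluation.py; the function names and the sample document IDs are assumptions. MAP is the mean of `average_precision` over all evaluated queries.

```python
import math


def precision_recall(retrieved, relevant):
    """Fraction of retrieved docs that are relevant, and of relevant docs retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 (the F2 score) weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, relevant):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(1 / math.log2(i + 1) for i, d in enumerate(ranked, start=1) if d in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    ideal = sum(1 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking and ground truth for one query.
ranked = ["d1", "d4", "d2", "d5"]
relevant = {"d1", "d2", "d3"}
p, r = precision_recall(ranked, relevant)
print(p, r, f_beta(p, r), average_precision(ranked, relevant), ndcg(ranked, relevant))
```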

How It Works

Text Extraction: Extract content from PDFs using PyMuPDF.
Preprocessing: Clean, normalize, and tokenize text.
TF-IDF Generation: Compute vectors using scikit-learn.
Clustering: Apply Single-Link Clustering based on cosine distances.
Hybrid Retrieval: Combine VSM ranking and cluster similarity.
Search Logging: Store user queries and retrieval metadata.
Ranking: Return documents with highest combined relevance.
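
The steps above can be sketched end to end with scikit-learn and SciPy. This is a minimal illustration under stated assumptions, not the code in src/retrieve_documents.py: the sample documents, the 0.8 distance cut, the `alpha` weight, and the cluster score (the mean VSM score of each document's cluster) are all choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus standing in for the extracted, preprocessed documents.
docs = [
    "vector space model for document retrieval",
    "tf idf weighting in the vector space model",
    "single link clustering of text documents",
    "mongodb stores documents and search logs",
]

# TF-IDF generation.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vecs = vectorizer.fit_transform(docs)

# Single-link agglomerative clustering on cosine distances.
Z = linkage(doc_vecs.toarray(), method="single", metric="cosine")
labels = fcluster(Z, t=0.8, criterion="distance")  # cut threshold is an assumption


def hybrid_search(query, alpha=0.7):
    """Rank docs by alpha * VSM score + (1 - alpha) * cluster score (assumed formula)."""
    q = vectorizer.transform([query])
    vsm = cosine_similarity(q, doc_vecs).ravel()
    # Cluster score: mean VSM score over the cluster each document belongs to.
    cluster_score = np.array([vsm[labels == lab].mean() for lab in labels])
    combined = alpha * vsm + (1 - alpha) * cluster_score
    order = np.argsort(-combined)
    return [(docs[i], round(float(combined[i]), 3)) for i in order]


for doc, score in hybrid_search("vector space model"):
    print(score, doc)
```

The cluster term lets a document benefit when its neighbours also match the query, which is one plausible reading of "cluster similarity"; other combinations, such as similarity to a cluster centroid, would fit the same description.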

Contributing

Contributions are welcome! Please fork the repo and submit a pull request for review.

License

This project is released under the MIT License.
