
Semantic Document Retrieval using Single-Link Clustering and Vector Space Models

A document retrieval system that leverages the Vector Space Model (VSM) and efficient clustering to extract, store, rank, and retrieve documents. It is designed for research and enterprise settings that need smarter information access.

IEEE Publication

This project is based on our IEEE-published research paper:

📝 Title: Document Storage and Retrieval with Efficient Clustering Using the Vector Space Model
🔗 IEEE Xplore: https://ieeexplore.ieee.org/abstract/document/10940682

The model introduces a hybrid approach combining TF-IDF-based retrieval with Single-Link Clustering, improving accuracy, relevance, and ranking.
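To make the hybrid idea concrete, here is a minimal sketch of one way to blend a per-document VSM score with a cluster-level score. The blending rule (a weighted sum of the document's own score and the mean score of its cluster), the function name `hybrid_scores`, and the weight `alpha` are assumptions for illustration. The README does not specify how the project combines the two signals, so the actual logic in `retrieve_documents.py` may differ.

```python
from collections import defaultdict
from typing import List


def hybrid_scores(vsm_scores: List[float],
                  cluster_labels: List[int],
                  alpha: float = 0.7) -> List[float]:
    """Blend each document's VSM score with the mean VSM score of its cluster.

    alpha = 1.0 reduces to plain VSM ranking; lower values give more
    weight to the cluster the document belongs to (hypothetical rule).
    """
    members = defaultdict(list)
    for score, label in zip(vsm_scores, cluster_labels):
        members[label].append(score)
    cluster_mean = {label: sum(s) / len(s) for label, s in members.items()}
    return [alpha * score + (1.0 - alpha) * cluster_mean[label]
            for score, label in zip(vsm_scores, cluster_labels)]


# Four documents: query similarities and (made-up) cluster assignments.
vsm_scores = [0.60, 0.10, 0.55, 0.50]
cluster_labels = [0, 0, 1, 1]

scores = hybrid_scores(vsm_scores, cluster_labels, alpha=0.5)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 3, 0, 1]
```

In this toy input, plain VSM would rank document 0 first. The blended score instead promotes documents 2 and 3, because their whole cluster matches the query well.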

Watch the Demo

WhatsApp.Video.2025-05-06.at.13.57.online-video-cutter.com.mp4

🌟 Features

  • 📥 Text Extraction (using PyMuPDF)
  • Preprocessing: Tokenization, Stopword Removal, Normalization
  • 📊 TF-IDF Vectorization: Classic Vector Space Model
  • 🔗 Single-Link Clustering for smarter group-based ranking
  • 🧠 Hybrid Retrieval combining VSM + clustering similarity
  • 🔎 Streamlit UI for fast search and document preview
  • 🧾 MongoDB backend for storing documents and search logs
  • 📈 Built-in Evaluation Metrics (Precision, Recall, MAP, NDCG, F2)
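The preprocessing feature can be sketched roughly as follows. This is a dependency-free illustration, not the contents of `preprocess_text.py`: the stopword list is a tiny placeholder, and the function name `preprocess` is hypothetical.

```python
import re
from typing import List

# Tiny illustrative stopword list (an assumption; a real pipeline
# would likely use a fuller list, e.g. from NLTK or scikit-learn).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def preprocess(text: str) -> List[str]:
    """Normalize to lowercase, tokenize on alphanumeric runs, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("The Vector Space Model is used for Document Retrieval."))
# ['vector', 'space', 'model', 'used', 'document', 'retrieval']
```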

📁 Project Structure

Efficient-Doc-Retrieval-Storage/
├── app.py                         # Streamlit app interface
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE

├── ground_truth/
│   └── ground_truth.csv          # CSV for ground-truth relevance

├── mongodb_data/                 # MongoDB dump (not included in GitHub)

├── src/
│   ├── __init__.py
│   ├── BM25.py                   # BM25 scorer (optional)
│   ├── clustering.py             # Single-link clustering logic
│   ├── compute_tfidf.py          # TF-IDF vector computation
│   ├── config.py                 # Global configuration variables
│   ├── evaluation.py             # Precision, Recall, MAP, NDCG etc.
│   ├── extract_text.py           # Extract text from documents
│   ├── fetch_gdrive.py           # (Optional) fetch PDFs from Google Drive
│   ├── log_search.py             # Log user queries
│   ├── preprocess_text.py        # Clean and normalize document text
│   ├── query_log.py              # Access and analyze search history
│   ├── query_mongo.py            # Search MongoDB for past queries
│   ├── retrieve_documents.py     # Main hybrid retrieval logic
│   ├── run_evaluation.py         # CLI script for evaluation
│   ├── store_mongo.py            # Insert documents into MongoDB
│   ├── update_links.py           # Fix or update download links
│   ├── update_mongo.py           # Modify or refresh DB content
│   ├── vsm.py                    # Vector Space Model ranking
│   ├── token.pickle              # OAuth token for GDrive
│   ├── downloaded_files.txt
│   └── data/                     # Folder for raw/extracted files

└── venv/                         # Virtual environment

⚙️ Installation & Setup

To set up this project locally, follow the steps below:

1. Clone the repository and navigate into the directory:

    git clone https://github.com/rawoolsiddhi/Efficient-Doc-Retrieval-Storage.git
    cd Efficient-Doc-Retrieval-Storage

2. Create and activate a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # For Linux/Mac
    venv\Scripts\activate     # For Windows

3. Install the required Python packages:

    pip install -r requirements.txt

4. Start MongoDB:

Make sure MongoDB is installed and running locally (default port: 27017), or update your code with a cloud MongoDB URI.

5. Run the Streamlit app:

    streamlit run app.py

This will launch the document retrieval interface in your browser.
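For the MongoDB step, one common way to switch between a local instance and a cloud URI without editing code is to read the URI from an environment variable. The sketch below assumes a variable named `MONGO_URI` and a helper called `get_mongo_uri`. Both are hypothetical, and the project's `config.py` may handle this differently.

```python
import os
from typing import Mapping

# Local default matching the port mentioned in the setup steps.
DEFAULT_MONGO_URI = "mongodb://localhost:27017"


def get_mongo_uri(env: Mapping[str, str] = os.environ) -> str:
    """Return the MongoDB URI from MONGO_URI, falling back to localhost."""
    return env.get("MONGO_URI", DEFAULT_MONGO_URI)


print(get_mongo_uri({}))  # mongodb://localhost:27017

# With pymongo installed, the URI could then be passed to the client:
#   from pymongo import MongoClient
#   client = MongoClient(get_mongo_uri())
```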

UI

image

Dependencies: Python 3.8 or above, and MongoDB (local or cloud instance).

Architecture Diagram

image

Evaluation Metrics


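| Metric         | Hybrid Model (VSM + Clustering) | VSM Only    | BM25 (Optional) |
|----------------|---------------------------------|-------------|-----------------|
| Precision      | 0.91                            | 0.83        | 0.77            |
| Recall         | 0.94                            | 0.87        | 0.81            |
| F2 Score       | 0.93                            | 0.85        | 0.79            |
| NDCG           | 0.92                            | 0.82        | 0.78            |
| MAP            | 0.89                            | 0.80        | 0.75            |
| Accuracy Range | 0.88 - 0.96                     | 0.85 - 0.95 | 0.82 - 0.91     |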
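For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed for a single query with binary relevance. It is not the project's `evaluation.py`. The function names and the toy document IDs are assumptions, and it assumes each retrieved ID appears only once. MAP is the mean of `average_precision` over all evaluated queries.

```python
import math
from typing import List, Set, Tuple


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Fraction of retrieved docs that are relevant, and of relevant docs retrieved."""
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta=2 (F2) weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(retrieved: List[str], relevant: Set[str]) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved, start=1) if doc in relevant)
    ideal_hits = min(len(relevant), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example: ranked results for one query and its ground-truth set.
retrieved = ["d1", "d3", "d2", "d5"]
relevant = {"d1", "d2", "d4"}

p, r = precision_recall(retrieved, relevant)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))  # 0.5 0.667 0.625
print(round(average_precision(retrieved, relevant), 3))  # 0.556
print(round(ndcg(retrieved, relevant), 3))               # 0.704
```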

How It Works

  1. Text Extraction: Extract content from PDFs using PyMuPDF.

  2. Preprocessing: Clean, normalize, and tokenize text.

  3. TF-IDF Generation: Compute vectors using scikit-learn.

  4. Clustering: Apply Single-Link Clustering based on cosine distances.

  5. Hybrid Retrieval: Combine VSM ranking and cluster similarity.

  6. Search Logging: Store user queries and retrieval metadata.

  7. Ranking: Return documents with highest combined relevance.
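Steps 3, 4, and 7 above can be sketched in a few dozen lines. The project computes TF-IDF with scikit-learn. The version below is a dependency-free stand-in that uses the same smoothed IDF formula as scikit-learn's default, so it is an illustration rather than the project's code. Single-link clustering is expressed here as a cut at a distance threshold: two documents share a cluster when a chain of pairs, each within `max_distance` in cosine distance, connects them. All function names and the threshold value are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: term -> TF-IDF weight


def vectorize(tokens: List[str], idf: Dict[str, float]) -> Vector:
    """TF-IDF vector for one token list; terms unseen in the corpus are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}


def build_tfidf(docs: List[List[str]]) -> Tuple[List[Vector], Dict[str, float]]:
    """Compute smoothed IDF over the corpus and one TF-IDF vector per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [vectorize(doc, idf) for doc in docs], idf


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_link_labels(vectors: List[Vector], max_distance: float) -> List[int]:
    """Single-link clusters at a distance cut, via union-find over close pairs."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if 1.0 - cosine(vectors[i], vectors[j]) <= max_distance:
                parent[find(i)] = find(j)

    labels: Dict[int, int] = {}  # root -> cluster id in order of first appearance
    return [labels.setdefault(find(i), len(labels)) for i in range(len(vectors))]


def rank(query_tokens: List[str], vectors: List[Vector],
         idf: Dict[str, float]) -> List[int]:
    """Document indices sorted by cosine similarity to the query, best first."""
    q = vectorize(query_tokens, idf)
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Toy corpus of already-preprocessed documents.
docs = [
    ["vector", "space", "model", "retrieval"],
    ["vector", "space", "ranking"],
    ["mongodb", "storage", "backend"],
    ["mongodb", "storage", "logs"],
]
vectors, idf = build_tfidf(docs)
print(single_link_labels(vectors, max_distance=0.7))  # [0, 0, 1, 1]
print(rank(["mongodb", "logs"], vectors, idf))        # [3, 2, 0, 1]
```

The pairwise loop is quadratic in the number of documents, which is fine for a sketch. For a larger corpus, a library routine such as SciPy's hierarchical clustering with `method="single"` would be the usual choice.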

Contributing

Contributions are welcome! Please fork the repo and submit a pull request for review.

License

This project is released under the MIT License.
