A document retrieval system that leverages the Vector Space Model (VSM) and Efficient Clustering to extract, store, rank, and retrieve documents. Designed to serve research and enterprise needs for smarter information access.
This project is based on our IEEE-published research paper:
📝 Title: Document Storage and Retrieval with Efficient Clustering Using the Vector Space Model
🔗 IEEE Xplore: https://ieeexplore.ieee.org/abstract/document/10940682
The model introduces a hybrid approach combining TF-IDF-based retrieval with Single-Link Clustering, improving accuracy, relevance, and ranking.
- 📥 Text Extraction (using PyMuPDF)
- ✨ Preprocessing: Tokenization, Stopword Removal, Normalization
- 📊 TF-IDF Vectorization: Classic Vector Space Model
- 🔗 Single-Link Clustering for smarter group-based ranking
- 🧠 Hybrid Retrieval combining VSM + clustering similarity
- 🔎 Streamlit UI for fast search and document preview
- 🧾 MongoDB backend for storing documents and search logs
- 📈 Built-in Evaluation Metrics (Precision, Recall, MAP, NDCG, F2)
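To make the preprocessing and TF-IDF features above more concrete, here is a minimal pure-Python sketch of tokenization, stopword removal, TF-IDF weighting, and cosine ranking. It is an illustration only: the function names, the tiny stopword list, and the smoothed IDF formula are assumptions, not the code in `src/preprocess_text.py`, `src/compute_tfidf.py`, or `src/vsm.py`.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def preprocess(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    tokenized = [preprocess(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vecs = [tfidf_vector(tokens, idf) for tokens in tokenized]
    query_vec = tfidf_vector(preprocess(query), idf)
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank("vector space clustering", docs)` on a small list of strings returns the documents that share query terms first. The sketch does no stemming, so `document` and `documents` count as different terms.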
Efficient-Doc-Retrieval-Storage/
├── app.py # Streamlit app interface
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE
├── ground_truth/
│ └── ground_truth.csv # CSV for ground-truth relevance
├── mongodb_data/ # MongoDB dump (not included in GitHub)
├── src/
│ ├── __init__.py
│ ├── BM25.py # BM25 scorer (optional)
│ ├── clustering.py # Single-link clustering logic
│ ├── compute_tfidf.py # TF-IDF vector computation
│ ├── config.py # Global configuration variables
│ ├── evaluation.py # Precision, Recall, MAP, NDCG etc.
│ ├── extract_text.py # Extract text from documents
│ ├── fetch_gdrive.py # (Optional) fetch PDFs from Google Drive
│ ├── log_search.py # Log user queries
│ ├── preprocess_text.py # Clean and normalize document text
│ ├── query_log.py # Access and analyze search history
│ ├── query_mongo.py # Search MongoDB for past queries
│ ├── retrieve_documents.py # Main hybrid retrieval logic
│ ├── run_evaluation.py # CLI script for evaluation
│ ├── store_mongo.py # Insert documents into MongoDB
│ ├── update_links.py # Fix or update download links
│ ├── update_mongo.py # Modify or refresh DB content
│ ├── vsm.py # Vector Space Model ranking
│ ├── token.pickle # OAuth token for GDrive
│ ├── downloaded_files.txt
│ └── data/ # Folder for raw/extracted files
└── venv/ # Virtual environment
To set up this project locally, follow the steps below:
git clone https://github.com/rawoolsiddhi/Efficient-Doc-Retrieval-Storage.git
Navigate into the directory:
cd Efficient-Doc-Retrieval-Storage
python3 -m venv venv
source venv/bin/activate # For Linux/Mac
venv\Scripts\activate # For Windows
pip install -r requirements.txt
Make sure MongoDB is installed and running locally (default port: 27017), or update the connection string in your code with a cloud MongoDB URI.
streamlit run app.py
This will launch the document retrieval interface in your browser.
Dependencies: Python 3.8 or above, and MongoDB (local or cloud instance).
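One way to switch between a local and a cloud MongoDB instance without editing code is to read the connection string from an environment variable. The sketch below is a suggestion, not the project's actual configuration: the variable name `MONGO_URI` and the helper `get_mongo_uri` are hypothetical, and `src/config.py` may handle this differently.

```python
import os

# Default matches the local setup described above (port 27017).
DEFAULT_MONGO_URI = "mongodb://localhost:27017"


def get_mongo_uri(env=None):
    """Return the MongoDB URI from MONGO_URI (hypothetical name), else the local default."""
    env = os.environ if env is None else env
    return env.get("MONGO_URI", DEFAULT_MONGO_URI)


# Example wiring, assuming the pymongo package is installed:
#   from pymongo import MongoClient
#   client = MongoClient(get_mongo_uri())
```

With this in place, exporting `MONGO_URI` before `streamlit run app.py` would point the app at a cloud instance, while leaving it unset falls back to the local server.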
| Metric | Hybrid Model (VSM + Clustering) | VSM Only | BM25 (Optional) |
|---|---|---|---|
| Precision | 0.91 | 0.83 | 0.77 |
| Recall | 0.94 | 0.87 | 0.81 |
| F2 Score | 0.93 | 0.85 | 0.79 |
| NDCG | 0.92 | 0.82 | 0.78 |
| MAP | 0.89 | 0.80 | 0.75 |
| Accuracy Range | 0.88 - 0.96 | 0.85 - 0.95 | 0.82 - 0.91 |
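For readers unfamiliar with the metrics in the table, the following sketch shows their standard definitions for a single query with binary relevance. It is not the code in `src/evaluation.py`; the function names are illustrative, and whether the project uses binary relevance or a rank cutoff is an assumption here. MAP is the mean of `average_precision` over all evaluated queries.

```python
import math


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    retrieved_set, relevant = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision (F2)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(retrieved, relevant):
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg(retrieved, relevant, k=None):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant)
    ranked = retrieved[:k] if k else retrieved
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked, start=1) if doc in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

These functions take a ranked list of document ids and a collection of relevant ids, such as the rows in `ground_truth/ground_truth.csv` once parsed per query.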
- Text Extraction: Extract content from PDFs using PyMuPDF.
- Preprocessing: Clean, normalize, and tokenize text.
- TF-IDF Generation: Compute vectors using scikit-learn.
- Clustering: Apply Single-Link Clustering based on cosine distances.
- Hybrid Retrieval: Combine VSM ranking and cluster similarity.
- Search Logging: Store user queries and retrieval metadata.
- Ranking: Return documents with highest combined relevance.
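The clustering and hybrid retrieval steps above can be sketched as follows. This is a hedged illustration, not the implementation in `src/clustering.py` or `src/retrieve_documents.py`: it cuts single-link clustering at a fixed cosine-distance `threshold` (equivalent to taking connected components of the "close enough" graph), and it blends scores with a hypothetical weight `alpha`. The actual combination rule used in the paper may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def single_link_clusters(vectors, threshold):
    """Single-link clustering cut at `threshold`, implemented with union-find.

    Two documents share a cluster if a chain of pairwise distances, each at
    most `threshold`, connects them. Returns one cluster label per document.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(vectors))]


def hybrid_rank(query_vec, doc_vecs, threshold=0.3, alpha=0.7):
    """Blend per-document similarity with the mean similarity of its cluster.

    `alpha` and the mean-based cluster score are assumptions for illustration.
    Returns (document indices in ranked order, cluster labels).
    """
    labels = single_link_clusters(doc_vecs, threshold)
    sims = [1.0 - cosine_distance(query_vec, vec) for vec in doc_vecs]
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(sims[idx])
    cluster_sim = {label: sum(vals) / len(vals) for label, vals in members.items()}
    scores = [alpha * sims[i] + (1 - alpha) * cluster_sim[labels[i]]
              for i in range(len(doc_vecs))]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, labels
```

The pairwise loop is quadratic in the number of documents, which is fine for a sketch; for larger collections a library routine such as SciPy's hierarchical clustering would be a more practical choice.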
Contributions are welcome! Please fork the repo and submit a pull request for review.
This project is released under the MIT License.

