Information-Retrieval-aka-Uniwa/Search-Engine

UNIVERSITY OF WEST ATTICA
SCHOOL OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS


Information Retrieval

Building a Search Engine for Academic Papers

Vasileios Evangelos Athanasiou
Student ID: 19390005

GitHub · LinkedIn

Pantelis Tatsis
Student ID: 20390226

GitHub · LinkedIn


Supervision

Supervisor: Panagiota Tselenti, Laboratory Teaching Staff

UNIWA Profile · LinkedIn


Athens, January 2024



Building a Search Engine for Academic Papers

This project was developed for the Information Retrieval course at the University of West Attica.

It implements a complete Information Retrieval (IR) system that:

  • Crawls academic papers from arXiv
  • Preprocesses textual data
  • Builds an inverted index
  • Supports multiple retrieval models
  • Provides ranked search results through a graphical user interface (GUI)

Table of Contents

| Section | Path / File | Description |
|---|---|---|
| 1 | assign/ | Assignment specifications and project instructions |
| 1.1 | assign/IR LabProject 2023-2024new.pdf | Official laboratory project description |
| 1.2 | assign/ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf | Greek version of the assignment |
| 2 | docs/ | Project documentation and reports |
| 2.1 | docs/Academic-Paper-Search-Engine.pdf | Technical documentation of the search engine |
| 2.2 | docs/Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf | Greek documentation |
| 3 | src/ | Source code directory |
| 3.1 | src/main.py | Application entry point |
| 3.2 | src/search_engine.py | Core search engine controller |
| 3.3 | src/inverted_index.py | Inverted index construction and lookup |
| 3.4 | src/query_processing.py | Query parsing and preprocessing |
| 3.5 | src/ranking.py | Document ranking algorithms |
| 3.6 | src/text_preprocessing.py | Text cleaning, normalization, and tokenization |
| 3.7 | src/web_crawler.py | Web crawling and data acquisition |
| 4 | README.md | Project documentation |
| 5 | INSTALL.md | Usage instructions |

1. Project Objective

The objective of this project is to design and implement a functional academic search engine demonstrating:

  • Web Crawling
  • Text Preprocessing
  • Inverted Index Construction
  • Boolean Retrieval
  • Vector Space Model (TF-IDF + Cosine Similarity)
  • Probabilistic Retrieval Model (Okapi BM25)
  • Query Processing with operator precedence
  • Metadata filtering (Author, Date)

2. System Architecture

Web Crawler
↓
dataset.json
↓
Text Preprocessing
↓
Inverted Index
↓
Query Processing
↓
Retrieval Model
↓
Ranking
↓
Filtering
↓
Top-20 Results (GUI Output)

3. Project Structure

.
├── assign/
│ ├── IR LabProject 2023-2024new.pdf
│ └── ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf
│
├── docs/
│ ├── Academic-Paper-Search-Engine.pdf
│ └── Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf
│
├── src/
│ ├── main.py
│ ├── web_crawler.py
│ ├── text_preprocessing.py
│ ├── inverted_index.py
│ ├── query_processing.py
│ ├── ranking.py
│ └── search_engine.py
│
├── dataset.json (generated at runtime)
├── README.md
└── INSTALL.md

4. System Modules

4.1 Web Crawler (web_crawler.py)

  • Retrieves up to 100 papers per subject.
  • Randomly selects between 2 and 8 of the following subject categories:
    • Physics
    • Mathematics
    • Computer
    • Biology
    • Finance
    • Statistics
    • Electronics
    • Economics
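
The random subject selection described above can be sketched as follows. This is a minimal illustration, not the code in web_crawler.py: the names SUBJECTS, MAX_PAPERS_PER_SUBJECT and pick_subjects are hypothetical, and the sketch assumes subjects are drawn without repetition.

```python
import random

# The eight subject names listed above.
SUBJECTS = [
    "Physics", "Mathematics", "Computer", "Biology",
    "Finance", "Statistics", "Electronics", "Economics",
]
MAX_PAPERS_PER_SUBJECT = 100  # crawl limit per subject, as stated above


def pick_subjects(rng=random):
    """Return between 2 and 8 distinct subjects, chosen at random."""
    count = rng.randint(2, 8)            # randint is inclusive on both ends
    return rng.sample(SUBJECTS, count)   # sampling without replacement (assumption)


if __name__ == "__main__":
    chosen = pick_subjects()
    print(chosen, "-> at most", len(chosen) * MAX_PAPERS_PER_SUBJECT, "papers")
```

Passing a seeded random.Random instance as rng makes a run reproducible, which can help when comparing retrieval models on the same subject set.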

4.2 Extracted Metadata

  • Title
  • Authors
  • Subjects
  • Abstract
  • Comments
  • Submission Date
  • PDF URL
  • Unique doc_id

Data is stored in dataset.json.
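
For orientation, one record in dataset.json might look like the sketch below. The key names, the date format and the helper functions save_dataset and load_dataset are assumptions made for illustration; the actual layout is defined by web_crawler.py.

```python
import json

# Hypothetical record: key names and values are illustrative only.
record = {
    "doc_id": 0,
    "title": "An Example Title",
    "authors": ["A. Author", "B. Author"],
    "subjects": ["Computer"],
    "abstract": "An example abstract.",
    "comments": "",
    "date": "2024-01-15",                      # ISO date format is an assumption
    "pdf_url": "https://example.org/paper.pdf",
}


def save_dataset(records, path):
    """Write a list of records to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


def load_dataset(path):
    """Read the list of records back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```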


5. Text Preprocessing (text_preprocessing.py)

Pipeline:

  • Tokenization (NLTK)
  • Punctuation removal
  • Special character cleaning
  • Lowercasing
  • Stopword removal (English)
  • Porter Stemming

Applied to:

  • Document abstracts
  • User queries
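
A simplified sketch of this pipeline is shown below. To stay self-contained it makes several substitutions that are assumptions, not the project's code: a regular expression stands in for the NLTK tokenizer, a small hand-written STOPWORDS set stands in for the NLTK English stopword list, and stemming is skipped if NLTK is not installed. Lowercasing is applied first here, which gives the same result for this simplified version.

```python
import re
import string

try:
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    def _stem(token):
        # Fallback when NLTK is unavailable: leave the token unstemmed.
        return token

# Tiny illustrative stopword set (the project uses NLTK's English list).
STOPWORDS = {"a", "an", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Translation table that turns every punctuation character into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = re.findall(r"[a-z0-9]+", text)   # also drops other special characters
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [_stem(t) for t in tokens]
```

Running the same function over abstracts and queries keeps both sides in the same vocabulary, which is why the pipeline is applied to both.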

6. Inverted Index (inverted_index.py)

Creates:

term → [doc_id1, doc_id2, ...]

  • Alphabetically sorted terms
  • Sorted posting lists
  • Stored in memory
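
The structure above can be sketched in a few lines. The function name build_inverted_index and the input shape (a dict from doc_id to preprocessed tokens) are assumptions for illustration and may differ from inverted_index.py.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of doc_ids from doc_id -> token list."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)   # a set avoids duplicate doc_ids
    # Terms in alphabetical order, each posting list sorted by doc_id.
    return {term: sorted(postings[term]) for term in sorted(postings)}


if __name__ == "__main__":
    docs = {2: ["search", "engin"], 1: ["search", "rank"], 3: ["rank"]}
    for term, doc_ids in build_inverted_index(docs).items():
        print(term, "->", doc_ids)
```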

7. Query Processing (query_processing.py)

Supports the following (in queries, the operators are typed in lowercase: and, or, not):

  • AND
  • OR
  • NOT
  • Parentheses
  • Operator precedence

Boolean evaluation is implemented using set operations.
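
One way to combine operator precedence, parentheses and set operations is a small recursive-descent evaluator, sketched below. This is not necessarily how query_processing.py does it. The function name evaluate_boolean is hypothetical, the precedence order not > and > or is the conventional one and is assumed here, and query terms are assumed to be already preprocessed so that they match index terms.

```python
import re


def evaluate_boolean(query, index, all_docs):
    """Evaluate a query with lowercase and/or/not and parentheses.

    index maps term -> list of doc_ids; all_docs is the set of every doc_id.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():                      # lowest precedence: set union
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():                     # middle precedence: set intersection
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not():                     # highest precedence: set complement
        if peek() == "not":
            advance()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():                    # a term or a parenthesized sub-query
        token = peek()
        if token is None or token in {")", "and", "or"}:
            raise ValueError(f"unexpected token: {token!r}")
        advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        return set(index.get(token, []))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result
```

With this precedence, a query such as `engin or rank and search` is read as `engin or (rank and search)`; parentheses override that grouping.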


8. Retrieval Models

Implemented in:

  • search_engine.py
  • ranking.py

8.1 Boolean Retrieval

  • Logical matching
  • Parentheses support
  • Set-based operations

8.2 Vector Space Model (VSM)

  • TF-IDF weighting
  • Cosine Similarity
  • Ranked results
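
A minimal sketch of TF-IDF ranking with cosine similarity follows. The weighting variant used here (raw term frequency multiplied by log(N / df)) is one common choice and is an assumption; the project may use a different variant. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps doc_id -> token list. Returns (doc vectors, idf table)."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_vsm(query_tokens, docs, top_k=20):
    """Return up to top_k (doc_id, score) pairs, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    query_vec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in vectors.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > 0][:top_k]
```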

8.3 Probabilistic Retrieval Model

  • Okapi BM25
  • Parameters:
    • k = 1.2
    • b = 0.75
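
The scoring can be sketched as below, using the parameter values listed above (k here plays the role that is often written k1 in the BM25 literature). The IDF form log(1 + (N - n + 0.5) / (n + 0.5)) is one widely used variant that stays non-negative; whether ranking.py uses this exact form is an assumption, as is the function name rank_bm25.

```python
import math
from collections import Counter


def rank_bm25(query_tokens, docs, k=1.2, b=0.75, top_k=20):
    """docs maps doc_id -> token list. Returns up to top_k (doc_id, score) pairs."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Document-length normalization controlled by b.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation controlled by k.
            score += idf * tf[term] * (k + 1) / (tf[term] + k * length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
```

Raising k lets repeated occurrences of a term keep adding to the score for longer, while b = 0 would switch off length normalization entirely.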

9. Graphical User Interface

Built using:

  • tkinter
  • ttk

Features:

  • Query input
  • Retrieval model selection
  • Top-20 ranked results
  • Filtering by:
    • Author
    • Date
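
The filtering step behind the GUI can be illustrated without any tkinter code. The sketch below assumes hypothetical record keys ("authors", "date"), ISO-formatted dates, and case-insensitive substring matching on author names; none of these details are confirmed by the project. Following the architecture in Section 2, filters are applied to the ranked list first and the result is then cut to the top 20.

```python
from datetime import date


def filter_results(ranked_ids, records, author=None, start=None, end=None, top_k=20):
    """Keep ranked doc_ids whose record passes the optional author/date filters."""
    kept = []
    for doc_id in ranked_ids:
        record = records[doc_id]
        if author and not any(author.lower() in name.lower() for name in record["authors"]):
            continue
        submitted = date.fromisoformat(record["date"])   # assumes YYYY-MM-DD strings
        if start and submitted < start:
            continue
        if end and submitted > end:
            continue
        kept.append(doc_id)
    return kept[:top_k]   # ranking order is preserved
```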

10. Technologies Used

  • Python 3.11
  • requests
  • beautifulsoup4
  • nltk
  • tkinter
  • math
  • collections
  • json
  • re
  • string

11. System Evaluation

The system was evaluated using datasets of 200–800 documents, which matches the crawler's range of 2–8 subjects at up to 100 papers each.

Evaluation metrics included:

  • Precision
  • Recall
  • Comparative ranking analysis between:
    • Boolean Retrieval
    • Vector Space Model
    • BM25
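
For reference, precision and recall for a single query can be computed as in the sketch below. The function name and the idea of a hand-labelled set of relevant doc_ids are assumptions; the README does not describe how relevance judgments were obtained.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: doc_ids returned by the system; relevant: doc_ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant ones were found.
    print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```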

12. Notes

  • Internet connection is required for crawling.
  • Each execution generates a new dataset (random subject selection).
  • Boolean operators must be lowercase: and, or, not.
