Building a Search Engine for Academic Papers

UNIVERSITY OF WEST ATTICA
SCHOOL OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS

Information Retrieval

Building a Search Engine for Academic Papers

Vasileios Evangelos Athanasiou
Student ID: 19390005

Pantelis Tatsis
Student ID: 20390226

Supervision

Supervisor: Panagiota Tselenti, Laboratory Teaching Staff

Athens, January 2024

This project was developed for the Information Retrieval course at the University of West Attica.

It implements a complete Information Retrieval (IR) system that:

Crawls academic papers from arXiv
Preprocesses textual data
Builds an inverted index
Supports multiple retrieval models
Provides ranked search results through a graphical user interface (GUI)

| Section | Path / File | Description |
|---|---|---|
| 1 | `assign/` | Assignment specifications and project instructions |
| 1.1 | `assign/IR LabProject 2023-2024new.pdf` | Official laboratory project description |
| 1.2 | `assign/ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf` | Greek version of the assignment |
| 2 | `docs/` | Project documentation and reports |
| 2.1 | `docs/Academic-Paper-Search-Engine.pdf` | Technical documentation of the search engine |
| 2.2 | `docs/Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf` | Greek documentation |
| 3 | `src/` | Source code directory |
| 3.1 | `src/main.py` | Application entry point |
| 3.2 | `src/search_engine.py` | Core search engine controller |
| 3.3 | `src/inverted_index.py` | Inverted index construction and lookup |
| 3.4 | `src/query_processing.py` | Query parsing and preprocessing |
| 3.5 | `src/ranking.py` | Document ranking algorithms |
| 3.6 | `src/text_preprocessing.py` | Text cleaning, normalization, and tokenization |
| 3.7 | `src/web_crawler.py` | Web crawling and data acquisition |
| 4 | `README.md` | Project documentation |
| 5 | `INSTALL.md` | Usage instructions |

1. Project Objective

The objective of this project is to design and implement a functional academic search engine demonstrating:

Web Crawling
Text Preprocessing
Inverted Index Construction
Boolean Retrieval
Vector Space Model (TF-IDF + Cosine Similarity)
Probabilistic Retrieval Model (Okapi BM25)
Query Processing with operator precedence
Metadata filtering (Author, Date)

2. System Architecture

Web Crawler
↓
dataset.json
↓
Text Preprocessing
↓
Inverted Index
↓
Query Processing
↓
Retrieval Model
↓
Ranking
↓
Filtering
↓
Top-20 Results (GUI Output)

3. Project Structure

.
├── assign/
│ ├── IR LabProject 2023-2024new.pdf
│ └── ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf
│
├── docs/
│ ├── Academic-Paper-Search-Engine.pdf
│ └── Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf
│
├── src/
│ ├── main.py
│ ├── web_crawler.py
│ ├── text_preprocessing.py
│ ├── inverted_index.py
│ ├── query_processing.py
│ ├── ranking.py
│ └── search_engine.py
│
├── dataset.json (generated at runtime)
├── README.md
└── INSTALL.md

4. System Modules

4.1 Web Crawler (`web_crawler.py`)

Retrieves up to 100 papers per subject.
On each run, randomly selects between 2 and 8 of the following subject categories:
- Physics
- Mathematics
- Computer
- Biology
- Finance
- Statistics
- Electronics
- Economics
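The selection step can be sketched as follows. This is a minimal illustration, not the code in `web_crawler.py`: the names `SUBJECTS`, `pick_subjects`, and `crawl_plan` are hypothetical, and the sketch stops short of any network request because the way subjects map to arXiv URLs is not described here.

```python
import random

# Subject names exactly as listed above; how they map to arXiv URLs is not shown.
SUBJECTS = [
    "Physics", "Mathematics", "Computer", "Biology",
    "Finance", "Statistics", "Electronics", "Economics",
]
MAX_PAPERS_PER_SUBJECT = 100  # upper bound stated in this README


def pick_subjects(rng=None, low=2, high=8):
    """Pick between `low` and `high` distinct subjects at random."""
    if rng is None:
        rng = random.Random()
    count = rng.randint(low, high)      # randint is inclusive on both ends
    return rng.sample(SUBJECTS, count)  # sample never repeats an element


def crawl_plan(subjects, limit=MAX_PAPERS_PER_SUBJECT):
    """Map each chosen subject to the maximum number of papers to fetch."""
    return {subject: limit for subject in subjects}


chosen = pick_subjects(random.Random(0))
plan = crawl_plan(chosen)
```

With 2 to 8 subjects and at most 100 papers each, a run yields at most 200 to 800 documents, which matches the dataset sizes reported in the evaluation section.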

4.2 Extracted Metadata

Title
Authors
Subjects
Abstract
Comments
Submission Date
PDF URL
Unique doc_id

All extracted records are stored in `dataset.json`.
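One way to picture a record and its round trip through `dataset.json` is the sketch below. The field names in `Paper` and the helpers `save_dataset` and `load_dataset` are assumptions chosen to mirror the list above; the actual JSON keys used by the project may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Paper:
    # Hypothetical field names mirroring the metadata list above.
    doc_id: int
    title: str
    authors: list[str]
    subjects: list[str]
    abstract: str
    comments: str
    submission_date: str  # kept as text; the stored date format is an assumption
    pdf_url: str


def save_dataset(papers, path="dataset.json"):
    """Write all records as a JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(p) for p in papers], fh, ensure_ascii=False, indent=2)


def load_dataset(path="dataset.json"):
    """Read the JSON array back into Paper objects."""
    with open(path, encoding="utf-8") as fh:
        return [Paper(**record) for record in json.load(fh)]
```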

5. Text Preprocessing (`text_preprocessing.py`)

Pipeline:

Tokenization (NLTK)
Punctuation removal
Special character cleaning
Lowercasing
Stopword removal (English)
Porter Stemming

Applied to:

Document abstracts
User queries
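The order of the pipeline stages can be sketched with the standard library alone. Note the substitutions: whitespace splitting stands in for the NLTK tokenizer, `STOPWORDS` is a tiny illustrative set rather than NLTK's English list, and `crude_stem` is a rough suffix stripper that is not equivalent to Porter stemming. All three names are hypothetical.

```python
import re
import string

# Tiny illustrative stopword set; the project uses NLTK's English list instead.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def crude_stem(token):
    """Rough suffix stripper, a stand-in for the Porter stemmer (not equivalent)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=crude_stem):
    """Apply the pipeline stages in the order listed above."""
    tokens = text.split()                                       # tokenization (stand-in)
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]               # punctuation removal
    tokens = [re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens]   # special character cleaning
    tokens = [t.lower() for t in tokens if t]                   # lowercasing, drop empties
    tokens = [t for t in tokens if t not in STOPWORDS]          # stopword removal
    return [stem(t) for t in tokens]                            # stemming


terms = preprocess("The ranking of indexed documents!")
```

Because `stem` is a parameter, a real stemmer can be plugged in, for example `preprocess(text, stem=PorterStemmer().stem)` with `PorterStemmer` imported from `nltk.stem`. Running the same function over abstracts and queries keeps both in the same term space.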

6. Inverted Index (`inverted_index.py`)

Creates:

term → [doc_id1, doc_id2, ...]

Alphabetically sorted terms
Sorted posting lists
Stored in memory
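A minimal sketch of such an index, assuming each document has already been preprocessed into a token list, is shown below. The function name `build_inverted_index` and the sample tokens are illustrative and not taken from `inverted_index.py`.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of doc_ids from {doc_id: [tokens]}."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)  # a set avoids duplicate doc_ids
    # Terms in alphabetical order, each posting list sorted by doc_id.
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Hypothetical preprocessed documents.
docs = {2: ["index", "rank"], 1: ["rank", "queri"], 3: ["index"]}
index = build_inverted_index(docs)
```

Since Python dictionaries preserve insertion order, iterating over `index` visits the terms alphabetically.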

7. Query Processing (`query_processing.py`)

Supports:

AND
OR
NOT
Parentheses
Operator precedence

Boolean evaluation is implemented using set operations.
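One common way to combine operator precedence with set operations is a small recursive-descent evaluator, sketched below. It assumes the usual precedence (`not` binds tightest, then `and`, then `or`), lowercase operators as required in the Notes section, and query terms that are already stemmed. The function names are hypothetical, and the parsing strategy in `query_processing.py` may differ.

```python
import re


def tokenize(query):
    """Split a query into parentheses and whitespace-separated words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def evaluate(query, index, all_docs):
    """Evaluate a Boolean query against an inverted index using set operations."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():                      # lowest precedence
        result = parse_and()
        while peek() == "or":
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            take()
            result = result & parse_not()
        return result

    def parse_not():                     # highest precedence
        if peek() == "not":
            take()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok is None or tok in (")", "and", "or"):
            raise ValueError("expected a term or '('")
        if tok == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("missing ')'")
            return result
        return set(index.get(tok, []))   # unknown terms match nothing

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token")
    return result
```

For example, with an index `{"index": [2, 3], "queri": [1], "rank": [1, 2]}`, the query `queri or rank and index` evaluates `rank and index` first, while `(queri or rank) and index` forces the union to be computed first.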

8. Retrieval Models

Implemented in:

search_engine.py
ranking.py

8.1 Boolean Retrieval

Logical matching
Parentheses support
Set-based operations

8.2 Vector Space Model (VSM)

TF-IDF weighting
Cosine Similarity
Ranked results
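The ranking idea can be sketched as follows. The weighting variant is an assumption: raw term frequency multiplied by `log(N / df)`, one of several common TF-IDF formulations, and not necessarily the one used in `ranking.py`. The function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc_id: {term: weight}}, idf) for {doc_id: [tokens]}."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_vsm(query_tokens, docs, top_k=20):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in vectors.items())
    ranked = sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))
    return ranked[:top_k]
```

Cosine similarity normalizes by vector length, so a short document that matches the query ranks above a longer one that merely contains the same term among others.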

8.3 Probabilistic Retrieval Model

Okapi BM25
Parameters:
- k = 1.2 (term-frequency saturation; commonly written k1)
- b = 0.75 (document-length normalization)
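A compact sketch of BM25 scoring with these parameter values follows. The IDF form `log((N - df + 0.5) / (df + 0.5) + 1)` is an assumption: it is a widely used variant that never goes negative, and the project may use the classic form without the `+ 1`. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k=1.2, b=0.75):
    """Score {doc_id: [tokens]} against the query; returns (doc_id, score) best first."""
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(tokens) for tokens in docs.values()) / n  # average document length
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Assumed IDF variant; stays positive even for very common terms.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k + 1) / denom
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))
```

Raising `k` lets repeated occurrences of a term keep adding to the score for longer, while `b = 0` would switch length normalization off entirely.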

9. Graphical User Interface

Built using:

tkinter
ttk

Features:

Query input
Retrieval model selection
Top-20 ranked results
Filtering by:
- Author
- Date
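The filtering step that runs on ranked results before they reach the GUI might look like the sketch below. It leaves out the tkinter widgets and shows only the filter logic. The function `filter_results`, the dictionary keys, the substring match on author names, and the ISO date format are all assumptions, not details confirmed by the source.

```python
from datetime import date


def filter_results(results, author=None, date_from=None, date_to=None, top_k=20):
    """Keep ranked results matching the author and date filters, then cut to top_k."""
    kept = []
    for paper in results:  # results are assumed to arrive already ranked
        if author and not any(author.lower() in a.lower() for a in paper["authors"]):
            continue
        # Assumes dates are stored as ISO strings such as "2024-01-05".
        submitted = date.fromisoformat(paper["submission_date"])
        if date_from and submitted < date_from:
            continue
        if date_to and submitted > date_to:
            continue
        kept.append(paper)
    return kept[:top_k]
```

Filtering before truncating means the user still sees up to 20 results even when the filters discard many of the highest-ranked papers.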

10. Technologies Used

Python 3.11
requests
beautifulsoup4
nltk
tkinter
math
collections
json
re
string

11. System Evaluation

The system was evaluated using datasets of 200–800 documents.

Evaluation metrics included:

Precision
Recall
Comparative ranking analysis between:
- Boolean Retrieval
- Vector Space Model
- BM25
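For reference, the two metrics can be computed per query as in the sketch below, given the retrieved doc_ids and a set of doc_ids judged relevant. The function name and the sample ids are illustrative; how relevance judgments were obtained for this project is not described here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

Here precision is 2/4 and recall is 2/3. Comparing these values for the same query across the three models is one way to carry out the comparative analysis listed above.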

12. Notes

Internet connection is required for crawling.
Each execution generates a new dataset (random subject selection).
Boolean operators must be lowercase: and, or, not.
