UNIVERSITY OF WEST ATTICA
SCHOOL OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS
Information Retrieval
Vasileios Evangelos Athanasiou
Student ID: 19390005
Pantelis Tatsis
Student ID: 20390226
Supervision
Supervisor: Panagiota Tselenti, Laboratory Teaching Staff
Athens, January 2024
This project was developed for the Information Retrieval course at the University of West Attica.
It implements a complete Information Retrieval (IR) system that:
- Crawls academic papers from arXiv
- Preprocesses textual data
- Builds an inverted index
- Supports multiple retrieval models
- Provides ranked search results through a graphical user interface (GUI)
| Section | Path / File | Description |
|---|---|---|
| 1 | assign/ | Assignment specifications and project instructions |
| 1.1 | assign/IR LabProject 2023-2024new.pdf | Official laboratory project description |
| 1.2 | assign/ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf | Greek version of the assignment |
| 2 | docs/ | Project documentation and reports |
| 2.1 | docs/Academic-Paper-Search-Engine.pdf | Technical documentation of the search engine |
| 2.2 | docs/Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf | Greek documentation |
| 3 | src/ | Source code directory |
| 3.1 | src/main.py | Application entry point |
| 3.2 | src/search_engine.py | Core search engine controller |
| 3.3 | src/inverted_index.py | Inverted index construction and lookup |
| 3.4 | src/query_processing.py | Query parsing and preprocessing |
| 3.5 | src/ranking.py | Document ranking algorithms |
| 3.6 | src/text_preprocessing.py | Text cleaning, normalization, and tokenization |
| 3.7 | src/web_crawler.py | Web crawling and data acquisition |
| 4 | README.md | Project documentation |
| 5 | INSTALL.md | Usage instructions |
The objective of this project is to design and implement a functional academic search engine demonstrating:
- Web Crawling
- Text Preprocessing
- Inverted Index Construction
- Boolean Retrieval
- Vector Space Model (TF-IDF + Cosine Similarity)
- Probabilistic Retrieval Model (Okapi BM25)
- Query Processing with operator precedence
- Metadata filtering (Author, Date)
Web Crawler
↓
dataset.json
↓
Text Preprocessing
↓
Inverted Index
↓
Query Processing
↓
Retrieval Model
↓
Ranking
↓
Filtering
↓
Top-20 Results (GUI Output)
├── assign/
│ ├── IR LabProject 2023-2024new.pdf
│ └── ΑΠ ΕργασίαΕργαστηρίου 2023-2024νέο.pdf
│
├── docs/
│ ├── Academic-Paper-Search-Engine.pdf
│ └── Μηχανή-Αναζήτησης-Ακαδημαϊκών-Εργασιών.pdf
│
├── src/
│ ├── main.py
│ ├── web_crawler.py
│ ├── text_preprocessing.py
│ ├── inverted_index.py
│ ├── query_processing.py
│ ├── ranking.py
│ ├── search_engine.py
│
├── dataset.json (generated at runtime)
├── README.md
└── INSTALL.md
The web crawler (src/web_crawler.py):
- Retrieves up to 100 papers per subject.
- Randomly selects between 2 and 8 of the following subject categories:
- Physics
- Mathematics
- Computer
- Biology
- Finance
- Statistics
- Electronics
- Economics
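The random subject selection described above can be sketched as follows. This is a minimal illustration, not the code in `web_crawler.py`: the names `SUBJECTS`, `MAX_PAPERS_PER_SUBJECT`, and `pick_subjects` are assumptions, and the mapping from these short subject names to arXiv listing pages is not shown.

```python
import random

# Short subject names as listed above (assumed constant name).
SUBJECTS = ["Physics", "Mathematics", "Computer", "Biology",
            "Finance", "Statistics", "Electronics", "Economics"]
MAX_PAPERS_PER_SUBJECT = 100  # upper bound per subject, from the list above


def pick_subjects(rng=None):
    """Pick between 2 and 8 distinct subjects at random (hypothetical helper)."""
    rng = rng or random.Random()
    count = rng.randint(2, len(SUBJECTS))  # randint is inclusive on both ends
    return rng.sample(SUBJECTS, count)     # sample never repeats a subject


# A seeded generator makes the choice reproducible for this example only.
chosen = pick_subjects(random.Random(0))
max_documents = len(chosen) * MAX_PAPERS_PER_SUBJECT
```

With 2 to 8 subjects and at most 100 papers each, a single run collects at most 200 to 800 papers.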
Each paper record contains:
- Title
- Authors
- Subjects
- Abstract
- Comments
- Submission Date
- PDF URL
- Unique doc_id
Data is stored in dataset.json.
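As an illustration of what one entry in `dataset.json` might look like, the sketch below builds a single record and writes it with the standard `json` module. The key names, the date format, and the helper names `save_dataset` and `load_dataset` are assumptions chosen to mirror the field list above. The values, including the PDF URL, are placeholders.

```python
import json

# Hypothetical record layout: key names are assumptions, values are placeholders.
paper = {
    "doc_id": 0,
    "title": "An Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "subjects": ["Computer"],
    "abstract": "We study an example problem.",
    "comments": "10 pages, 2 figures",
    "submission_date": "2024-01-15",                  # assumed ISO format
    "pdf_url": "https://arxiv.org/pdf/0000.00000",   # placeholder, not a real paper
}


def save_dataset(papers, path="dataset.json"):
    """Write the list of paper records as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(papers, fh, ensure_ascii=False, indent=2)


def load_dataset(path="dataset.json"):
    """Read the list of paper records back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```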
Pipeline:
- Tokenization (NLTK)
- Punctuation removal
- Special character cleaning
- Lowercasing
- Stopword removal (English)
- Porter Stemming
Applied to:
- Document abstracts
- User queries
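The pipeline above can be approximated with the standard library alone. The sketch below is a simplification, not the project's code: it uses a regular expression instead of the NLTK tokenizer, a tiny hand-written stopword set instead of NLTK's English list, and a crude suffix stripper (`naive_stem`) that is only a stand-in for Porter stemming and does not produce the same stems. It also lowercases first, which does not change the result here.

```python
import re

# Tiny illustrative stopword set; the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of",
             "on", "the", "to", "we", "with"}


def naive_stem(token):
    """Crude suffix stripper: a stand-in for Porter stemming, NOT equivalent."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, strip punctuation and special characters, drop stopwords, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and specials become spaces
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


abstract_terms = preprocess("Indexing the Documents, quickly!")
query_terms = preprocess("indexed documents")
```

Running the same function on abstracts and queries is what lets a query for "indexed documents" match an abstract that says "Indexing the Documents": both reduce to the terms `index` and `document`.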
Creates:
term → [doc_id1, doc_id2, ...]
- Alphabetically sorted terms
- Sorted posting lists
- Stored in memory
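A minimal sketch of such an index, assuming the documents are already preprocessed into token lists keyed by `doc_id` (the function name and input shape are assumptions, not the interface of `inverted_index.py`):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of doc_ids from doc_id -> token list."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)  # a set ignores repeated terms in one doc
    # Sort the terms alphabetically and each posting list by doc_id.
    return {term: sorted(postings[term]) for term in sorted(postings)}


docs = {
    2: ["index", "search"],
    0: ["search", "engine"],
    1: ["engine", "index", "index"],
}
index = build_inverted_index(docs)
# index == {"engine": [0, 1], "index": [1, 2], "search": [0, 2]}
```

Python dictionaries preserve insertion order, so inserting the terms in sorted order keeps the in-memory index alphabetically ordered.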
Supports:
- AND
- OR
- NOT
- Parentheses
- Operator precedence
Boolean evaluation is implemented using set operations.
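One way to combine operator precedence, parentheses, and set operations is a small recursive-descent evaluator, sketched below. It is an illustration under assumptions, not the parser in `query_processing.py`: the function name `evaluate`, the precedence order (`not` binds tightest, then `and`, then `or`), and the error handling are all choices made for this example. Query terms are assumed to be preprocessed already.

```python
import re


def evaluate(query, index, all_docs):
    """Evaluate a Boolean query; index maps term -> list of doc_ids."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():                      # lowest precedence: set union
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():                     # middle precedence: set intersection
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not():                     # highest precedence: set complement
        if peek() == "not":
            advance()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        tok = peek()
        if tok is None or tok in (")", "and", "or"):
            raise ValueError(f"unexpected token: {tok!r}")
        advance()
        if tok == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        return set(index.get(tok, []))   # unknown terms match nothing

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return sorted(result)


index = {"engine": [0, 1], "index": [1, 2], "search": [0, 2]}
all_docs = {0, 1, 2}
print(evaluate("search and not index", index, all_docs))          # [0]
print(evaluate("engine or index and search", index, all_docs))    # [0, 1, 2]
print(evaluate("(engine or index) and search", index, all_docs))  # [0, 2]
```

The last two queries differ only in their parentheses, which shows the effect of precedence: without them, `and` is applied before `or`.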
Implemented in:
- search_engine.py
- ranking.py
Boolean Retrieval:
- Logical matching
- Parentheses support
- Set-based operations
Vector Space Model:
- TF-IDF weighting
- Cosine Similarity
- Ranked results
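TF-IDF weighting with cosine similarity can be sketched as below. The exact weighting variant used in `ranking.py` is not stated in this README, so the sketch assumes raw term frequency and `idf = log(N / df)`. The function names and the `doc_id -> token list` input shape are also assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (doc_id -> {term: weight}, idf table) for doc_id -> token list."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_vsm(query_tokens, docs, top_k=20):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection carry no weight.
    q = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, s) for d, s in ranked if s > 0][:top_k]


docs = {
    0: ["search", "engine"],
    1: ["engine", "index", "index"],
    2: ["index", "search", "crawler"],
}
results = rank_vsm(["index"], docs)
```

For the query `index`, document 1 ranks above document 2: it uses the term twice, and its vector carries less weight on unrelated terms.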
Probabilistic Retrieval Model:
- Okapi BM25
- Parameters:
  - k = 1.2
  - b = 0.75
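The role of the two parameters can be seen in a compact BM25 scoring sketch. It uses the parameter values listed above. The IDF form `log((N - df + 0.5) / (df + 0.5) + 1)` is one common non-negative variant and is an assumption here, as are the function name and the input shape. The project's `ranking.py` may differ.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k=1.2, b=0.75):
    """Score every document in a non-empty doc_id -> token list collection."""
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Assumed IDF variant; the +1 inside the log keeps it non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # b controls document length normalization, k controls tf saturation.
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k + 1) / (tf[term] + k * length_norm)
        scores[doc_id] = score
    return scores


docs = {
    0: ["search", "engine"],
    1: ["engine", "index", "index"],
    2: ["index", "search", "crawler"],
}
scores = bm25_scores(["index"], docs)
```

Documents 1 and 2 have the same length, so document 1 scores higher only because it contains the query term twice. The gain from the second occurrence is smaller than the gain from the first, which is the saturation effect governed by `k`.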
Built using:
- tkinter
- ttk
Features:
- Query input
- Retrieval model selection
- Top-20 ranked results
- Filtering by:
  - Author
  - Date
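Filtering a ranked list by author and date before cutting it to 20 results might look like the sketch below. The README does not say how the filters match, so this example assumes a case-insensitive substring match on author names and an "on or after" comparison on ISO-formatted dates. The function name and record keys are hypothetical.

```python
from datetime import date


def filter_results(ranked_ids, papers, author=None, on_or_after=None, top_k=20):
    """Keep ranked doc_ids (best first) that pass the optional filters."""
    kept = []
    for doc_id in ranked_ids:
        paper = papers[doc_id]
        # Assumed author semantics: case-insensitive substring match.
        if author and not any(author.lower() in a.lower() for a in paper["authors"]):
            continue
        # Assumed date semantics: keep papers submitted on or after the given day.
        if on_or_after and date.fromisoformat(paper["submission_date"]) < on_or_after:
            continue
        kept.append(doc_id)
        if len(kept) == top_k:  # stop once the result page is full
            break
    return kept


papers = {
    0: {"authors": ["Ada Lovelace"], "submission_date": "2023-05-01"},
    1: {"authors": ["Alan Turing", "Ada Lovelace"], "submission_date": "2024-01-10"},
    2: {"authors": ["Grace Hopper"], "submission_date": "2024-01-12"},
}
print(filter_results([2, 1, 0], papers, author="lovelace"))             # [1, 0]
print(filter_results([2, 1, 0], papers, on_or_after=date(2024, 1, 1)))  # [2, 1]
```

Filtering before truncating means a filtered search can still show up to 20 results when enough documents match.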
Technologies used:
- Python 3.11
- requests
- beautifulsoup4
- nltk
- tkinter
- math
- collections
- json
- re
- string
The system was evaluated using datasets of 200–800 documents.
Evaluation metrics included:
- Precision
- Recall
- Comparative ranking analysis between:
  - Boolean Retrieval
  - Vector Space Model
  - BM25
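For reference, precision and recall for a single query can be computed from the set of retrieved documents and the set of documents judged relevant. The helper below is a generic sketch with made-up `doc_id` values. It does not reproduce the project's evaluation data or results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative doc_ids only: 4 retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
# precision == 0.5, recall == 2 / 3
```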
Notes:
- An internet connection is required for crawling.
- Each execution generates a new dataset (random subject selection).
- Boolean operators must be lowercase: and, or, not.
