All the dependencies you need to install are listed in the file requirements.txt. The most important libraries used by the algorithm are listed below:
- Algorithm:
- numpy
- scipy
- sklearn
- wikipedia
- threading
- nltk
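After installing the requirements, the nltk data used for tokenization and stopword filtering has to be downloaded once. A minimal sketch (the exact data packages are my assumption, based on the use of word_tokenize and the stopword list described in section 3):

```python
# run once after `pip install -r requirements.txt`
import nltk

nltk.download("punkt")       # tokenizer models used by word_tokenize
nltk.download("stopwords")   # stopword list used during preprocessing
```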
The dataset used for the search engine contains about 50 000 articles crawled from Wikipedia. For crawling I used the ready-made Python library wikipedia with a custom crawler running on multiple threads. The crawler is located in the file crawler.ipynb. All the articles were crawled from the Simple English version of Wikipedia. Due to computational limitations (more details in section 4), after filtering out some tokens, this resulted in a final dictionary of about 50 000 tokens. The crawling process took approximately 13 hours. Initially, the dataset was meant to focus on broadly understood culture and entertainment; in the end, however, it covers a full range of topics.
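A minimal sketch of this multi-threaded crawling approach (the actual crawler is in crawler.ipynb; the worker function, thread count and article count below are illustrative):

```python
import threading
import wikipedia

wikipedia.set_lang("simple")            # crawl the Simple English Wikipedia

articles = {}
lock = threading.Lock()

def crawl_worker(n_articles):
    # each worker fetches random article titles and stores their plain text
    for _ in range(n_articles):
        try:
            title = wikipedia.random()
            page = wikipedia.page(title)
            with lock:
                articles[title] = page.content
        except wikipedia.exceptions.WikipediaException:
            continue                    # skip disambiguation and missing pages

threads = [threading.Thread(target=crawl_worker, args=(100,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```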
Code related to data preprocessing is located in file preprocessor.py.
Data preprocessing consists of three stages:
- removing punctuation from the text
- stemming the text (the Porter Stemmer from the nltk module was used here). During this step all words are simplified, and stopwords as well as words shorter than 3 letters are filtered out.
- tokenization of the text (word_tokenize from the nltk module was used here)
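A sketch of these three stages (the actual implementation is in preprocessor.py; its exact ordering of steps may differ):

```python
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenize the text
    tokens = word_tokenize(text.lower())
    # stem each word, dropping stopwords and words shorter than 3 letters
    return [stemmer.stem(token) for token in tokens
            if token not in STOPWORDS and len(token) >= 3]
```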
Creating a dictionary of all words goes hand in hand with creating a bag of words for each document (more details in section 5). The class responsible for creating the dictionary of tokens is located in the file dictionaryCreator.py.
Creating the dictionary also uses the class in the file bagOfWordsCreator.py, which returns a Counter of tokens (for each token, we count the number of occurrences in the text/file passed as an argument). Text passed to the BagOfWordsCreator is preprocessed with the preprocessor described in section 3.
While creating the dictionary, we create a bag of words for each article (and save it to a dictionary of counters keyed by document - it is later used for creating the term-by-document matrix, more details in section 5) and add it to a global counter representing the dictionary.
Due to computational limitations, I decided to filter out all tokens that occur fewer than 8 times across all articles (the minimum number of occurrences is a parameter). Thanks to this, approximately 75% of the tokens were filtered out and we got rid of uncommon words. The final dictionary contains about 50 000 tokens.
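A hedged sketch of how the per-document counters and the filtered dictionary could be built (the real classes live in bagOfWordsCreator.py and dictionaryCreator.py; the function below is illustrative):

```python
from collections import Counter

MIN_OCCURRENCES = 8   # tokens occurring fewer than 8 times overall are dropped

def build_dictionary(documents, preprocess):
    bags = {}                    # bag of words (Counter) per document
    global_counter = Counter()   # token counts over the whole corpus
    for name, text in documents.items():
        bag = Counter(preprocess(text))
        bags[name] = bag
        global_counter.update(bag)
    # keep only tokens that are common enough
    dictionary = {token for token, count in global_counter.items()
                  if count >= MIN_OCCURRENCES}
    return dictionary, bags
```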
The class responsible for creating the term-by-document matrix is located in the file termByDocumentMatrixCreator.py.
During the process of creating the matrix, we use a dedicated library for sparse matrices; to be more precise, the scipy.sparse module with dok_matrix and csr_matrix was used. As initializing a csr_matrix and filling it with so many values turned out to be really time consuming, I decided to initialize a dok_matrix and then convert it to a csr_matrix. This resulted in approximately 50 times faster matrix initialization.
As we want the matrix to have the following data representation:
- rows denote tokens
- columns denote documents,
during the initialization of the TermByDocumentMatrixCreator we index all the tokens and documents, so that we know which rows and columns represent which tokens and documents, respectively.
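A minimal sketch of this construction, before IDF weighting and normalization are applied (the real code is in termByDocumentMatrixCreator.py; the function name is illustrative):

```python
import numpy as np
from scipy.sparse import dok_matrix

def build_term_by_document_matrix(dictionary, bags):
    # index tokens (rows) and documents (columns)
    token_index = {token: i for i, token in enumerate(sorted(dictionary))}
    doc_index = {doc: j for j, doc in enumerate(bags)}
    # dok_matrix supports cheap incremental element assignment ...
    matrix = dok_matrix((len(token_index), len(doc_index)), dtype=np.float64)
    for doc, bag in bags.items():
        j = doc_index[doc]
        for token, count in bag.items():
            if token in token_index:
                matrix[token_index[token], j] = count
    # ... and is converted to csr_matrix for fast arithmetic afterwards
    return matrix.tocsr(), token_index, doc_index
```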
One of the parameters of the TermByDocumentMatrixCreator is idf. The user can decide whether to use IDF - Inverse Document Frequency (more details on IDF in section 5.1 below). However, as using IDF is recommended for better performance, it is set to True by default.
The term-by-document matrix is then normalized along the columns.
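For example, the column-wise normalization can be done with sklearn (a sketch, assuming the matrix is already a csr_matrix):

```python
from sklearn.preprocessing import normalize

# scale every column (document vector) to unit Euclidean length, so that
# cosine similarity with a normalized query later reduces to a dot product
matrix = normalize(matrix, norm="l2", axis=0)
```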
The class responsible for calculating IDF is located in the file idfCalculator.py. To avoid calculating IDF multiple times, during initialization we calculate it for each token in the dictionary and only read the precalculated value later. During the process, for each token $w$ we calculate the value of:

$$\mathrm{IDF}_w = \log \frac{N}{n_w}$$

where:
- $N$ - total number of documents
- $n_w$ - number of documents in which token $w$ is present.
Using IDF significantly improves results as it reduces significance of common words.
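A sketch of the precomputation (idfCalculator.py holds the actual class; the function name is illustrative):

```python
import numpy as np

def precompute_idf(dictionary, bags):
    # bags: the per-document Counters built during dictionary creation
    n_documents = len(bags)
    idf = {}
    for token in dictionary:
        # n_w: number of documents in which the token appears at least once
        # (every dictionary token occurs in the corpus, so n_w >= 1)
        n_w = sum(1 for bag in bags.values() if token in bag)
        idf[token] = np.log(n_documents / n_w)
    return idf
```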
The class responsible for noise cancelling with low-rank approximation is located in the file svdNoiseCanceler.py. As a parameter, it takes the rank (the default value is rank = 3000, as it gave the best results).
The function used for the SVD decomposition is randomized_svd from the sklearn.utils.extmath module. The function returns the decomposition into the matrices $U_k$, $\Sigma_k$ and $V_k^T$.
Theoretically, we should multiply the matrices $U_k \Sigma_k V_k^T$ to obtain the rank-$k$ (denoised) approximation of the term-by-document matrix.
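A minimal sketch of the low-rank approximation step (the actual class is in svdNoiseCanceler.py):

```python
from sklearn.utils.extmath import randomized_svd

RANK = 3000   # default rank, as it gave the best results

# U_k: tokens x rank, sigma_k: rank singular values, VT_k: rank x documents
U_k, sigma_k, VT_k = randomized_svd(matrix, n_components=RANK)

# rank-k (denoised) approximation of the term-by-document matrix;
# for large matrices the three factors can also be kept separately
low_rank_matrix = (U_k * sigma_k) @ VT_k
```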
The class responsible for handling requests (located in the file searchHandler.py) takes three parameters:
- an instance of TermByDocumentMatrixCreator used for creating the term-by-document matrix
- the expected number of responses
- a flag indicating whether the algorithm should use SVD
When the search method is called on the SearchHandler, it calculates the cosine similarity between the query vector $q$ and every column (document vector) of the term-by-document matrix.
We normalize the vector $q$; since the matrix columns are already normalized, the cosine similarity reduces to a simple dot product.
The class is also responsible for preparing the response in a format desired by the backend and frontend.
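A hedged sketch of the query path without the SVD variant (the real class is in searchHandler.py; the signature and helper names below are illustrative):

```python
from collections import Counter

import numpy as np

def search(query_text, matrix, token_index, doc_names,
           preprocess, idf, n_results=10):
    # build the query vector q in the same token space as the matrix rows
    q = np.zeros(matrix.shape[0])
    for token, count in Counter(preprocess(query_text)).items():
        if token in token_index:
            q[token_index[token]] = count * idf.get(token, 1.0)
    # normalize q; the matrix columns are already unit length,
    # so cosine similarity becomes a plain dot product
    norm = np.linalg.norm(q)
    if norm > 0:
        q = q / norm
    similarities = matrix.T @ q          # one similarity score per document
    best = np.argsort(similarities)[::-1][:n_results]
    return [(doc_names[j], float(similarities[j])) for j in best]
```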
The class responsible for initializing all the helper classes and for communication with the backend is implemented in the file engine.py.
As the initialization of all the classes and structures (the structures weigh approximately 3 GB in total) takes a lot of time, every significant structure is saved as a .pickle file so that the search engine can be initialized faster. If the user wants to reinitialize the structures, each class has a dedicated method for that.
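A sketch of the caching pattern (engine.py implements the actual logic; the helper and file names below are illustrative):

```python
import os
import pickle

def load_or_build(path, build):
    # reuse a previously pickled structure if it exists,
    # otherwise build it and cache it for the next start-up
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# e.g. matrix = load_or_build("term_by_document.pickle", create_matrix)
```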
The algorithm gives quite good results; we notice a significant rise in the quality of responses when using SVD and IDF.
I would consider using articles from the full English Wikipedia (rather than Simple English), as some of the Simple English articles are really short and do not contain many keywords. The decision to use Simple English was made due to time constraints and computational limitations.
- Algorithm:
- Python
- Backend:
- Flask
- Frontend:
- React