Skip to content

Latest commit

 

History

History
42 lines (32 loc) · 1.36 KB

File metadata and controls

42 lines (32 loc) · 1.36 KB

Requirements

  • Python 3.x and pip
  • Gemsim, Numpy, NLTK, NLTK Trainer, Spacy, Sklearn, Pandas, Pyphen, Pyspellchecker

Configuration

  • It's highly recommended creating a virtualenv before installing the dependencies

  • Dependencies

pip3 install virtualenv
virtualenv <YOU_NAME_IT>
source <THE_NAME_ABOVE>/bin/activate
pip install -r requirements.txt
sh setup.sh
  • NLTK setup (Within a python terminal)
import nltk
nltk.download('punkt')
nltk.download('mac_morpho')
nltk.download('stopwords')

The step above should install the dependencies in your nltk_data folder (~/nltk_data)

#Usage

  • TBD

ML text-extractor

  • Extract textual document content from different sources (PDF, Docs and text files)
  • Convert textual document into stylometric features
  • Contains Random Forest and Simple Neural Network classifiers over the data described in the next section

The data

  • There are two main types of data set inside the data/parsed-data folder: -- Regular data files, with textual content and masked author name -- Stylometric data files, that represent the conversion of the raw text into stylometric features (~50)

PS: Each data set has two versions of it, 'selected' means that samples with less than 3 per author were removed, 'data' is the complete data set with no exclusions