
SENG474 Project

Data Processing

  • To run the data processing script, you need newspaper3k and nltk installed. The script takes about an hour to run, but it writes the processed data to a CSV file, so we won't need to rerun it each time we train our models (a rough sketch of the processing step follows this list).
  • The cleaned and processed data can be found in data/processed-dataset.csv (see the loading sketch below).
  • Each row of the CSV file contains the number of occurrences of each English word in an article, excluding stop words.
  • The last column is the label: 1 if the article is real, 0 if it is fake.
  • I was having problems with stemming the words, but we can take another crack at it if we need to reduce our number of features (see the stemming sketch below).
  • We could also try sklearn's TfidfVectorizer instead of CountVectorizer later to see whether that improves our accuracy (see the last sketch below).
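
Roughly what the processing script does, as a minimal sketch. The input file articles.csv and its url and label columns are hypothetical, and the real script may collect articles differently:

```python
import pandas as pd
import nltk
from newspaper import Article
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")

# Hypothetical input: one article URL per row, plus its 1/0 label
articles = pd.read_csv("articles.csv")

texts = []
for url in articles["url"]:
    article = Article(url)
    article.download()  # this loop is why the script takes so long
    article.parse()
    texts.append(article.text)

# Count occurrences of each English word, excluding stop words
vectorizer = CountVectorizer(stop_words=stopwords.words("english"))
counts = vectorizer.fit_transform(texts)

out = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
out["label"] = articles["label"].values  # last column: 1 = real, 0 = fake
out.to_csv("data/processed-dataset.csv", index=False)
```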
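Loading the processed dataset for training, assuming the label is the last column as noted above:

```python
import pandas as pd

data = pd.read_csv("data/processed-dataset.csv")
X = data.iloc[:, :-1]  # word-count features
y = data.iloc[:, -1]   # label: 1 = real, 0 = fake
```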
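If we retry stemming, one option (a sketch, not what the current script does) is to wrap CountVectorizer's default analyzer with nltk's PorterStemmer so inflected forms collapse into a single feature:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize and drop stop words as usual, then stem each token
    return [stemmer.stem(token) for token in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
```

This should map e.g. "running" and "runs" both to "run", which is what reduces the feature count.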
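Swapping in TfidfVectorizer would be a drop-in change, since it takes the same arguments as CountVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same interface as CountVectorizer, but features are tf-idf weights
# instead of raw counts
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(texts)  # `texts` as in the first sketch
```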

Attribution