Project Setup
On this page we explain how we set up this project.
We used a dataset from Kaggle consisting of two files: one with Reddit comments that received a high score and one with comments that received many downvotes. Both files contain the following columns:
- id: unique ID of the comment
- parent_id: ID of the parent comment
- subreddit_id: ID of the subreddit
- link_id: ID of the link
- text: the comment text
- score: score of the comment
- ups: the number of upvotes
- author: author of the comment
- controversiality: 0 = not flagged, 1 = flagged controversial
- parent_link_id: link ID of the parent comment
- parent_text: the text of the parent comment
- parent_score: score of the parent comment
- parent_ups: number of upvotes of the parent comment
- parent_author: author of the parent comment
- parent_controversiality: 0 = not flagged, 1 = flagged controversial
We used two Jupyter notebooks for this project, both running Python 3.7.3.
In the first notebook we load the huge dataset of 4 million entries, take a small fraction of it, and clean that fraction using regular expressions (RegEx). We then keep only the columns we want to preserve: text, score, ups, controversiality, parent_text, parent_score, parent_ups, parent_controversiality.
Finally, we write the cleaned fraction to two new files.
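A minimal sketch of this first notebook, assuming pandas is used and with hypothetical file names (`positive.csv` and `negative.csv` stand in for the actual Kaggle files):

```python
import re
import pandas as pd

KEEP = ["text", "score", "ups", "controversiality",
        "parent_text", "parent_score", "parent_ups", "parent_controversiality"]

def clean_text(s):
    # Drop URLs and any character that is not a letter, digit, or basic punctuation.
    s = re.sub(r"https?://\S+", " ", str(s))
    s = re.sub(r"[^A-Za-z0-9.,!?' ]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

for name in ["positive", "negative"]:           # hypothetical file names
    df = pd.read_csv(f"{name}.csv")
    df = df.sample(frac=0.01, random_state=42)  # take a small fraction of the 4M rows
    df = df[KEEP]                               # keep only the columns listed above
    for col in ["text", "parent_text"]:
        df[col] = df[col].apply(clean_text)
    df.to_csv(f"{name}_clean.csv", index=False)
```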
In the second notebook we load the files written in the first notebook, combine the positive and negative datasets, and shuffle the result into one big random dataset.
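In pandas this amounts to a concat followed by a full-sample shuffle; the label column here is our assumption (1 for the high-score file, 0 for the downvoted one), matching the sigmoid output described below:

```python
import pandas as pd

pos = pd.read_csv("positive_clean.csv")
neg = pd.read_csv("negative_clean.csv")

# Assumed labeling: 1 = high-score comments, 0 = heavily downvoted comments.
pos["label"] = 1
neg["label"] = 0

# Combine both datasets and shuffle them into one big random dataset.
data = pd.concat([pos, neg], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
```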
After that we use the tokenizer from the Keras library. The tokenizer first learns from all the text we have and assigns an integer to each word (tokenizer.fit_on_texts(text_data)). This lets us turn every comment into a sequence of integers, which is easier for our neural network to learn from.
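Roughly, the tokenization step looks like this; the vocabulary size and sequence length are illustrative assumptions, not values from the project:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Combine each comment with its parent comment into one text per example.
texts = (data["text"].astype(str) + " " + data["parent_text"].astype(str)).tolist()

tokenizer = Tokenizer(num_words=10000)  # vocabulary size is an assumption
tokenizer.fit_on_texts(texts)           # learn the word -> integer mapping

sequences = tokenizer.texts_to_sequences(texts)  # comments -> integer sequences
padded = pad_sequences(sequences, maxlen=200)    # pad/truncate to a fixed length
```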
After creating a sequence for the combination of text and parent_text, we build a neural network using TensorFlow.
First comes an Embedding layer, which turns the positive integer tokens into dense vectors of fixed size. Next, a GlobalAveragePooling1D layer averages those vectors over the sequence, producing one fixed-length vector per comment. Then follow three Dense layers, which are ordinary fully connected layers. The last layer is our output layer: it uses a sigmoid activation, so it outputs a value between 0 and 1 that we round to predict either 0 (negative comment score) or 1 (positive comment score).
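A sketch of such a model in tf.keras; the layer widths and training settings are assumptions for illustration, since the text does not give the exact hyperparameters:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Turns positive integer tokens into dense vectors of fixed size.
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
    # Averages the embedding vectors over the sequence dimension.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Three ordinary fully connected layers (widths are assumptions).
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    # Sigmoid output: values near 0 mean a negative score, near 1 a positive one.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```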