Project Setup
On this page we explain how we set up this project.
We used a dataset from Kaggle consisting of two files: one with Reddit comments that received a high score and one with comments that received many downvotes. Both files contain the following columns:
- id: unique ID of the comment
- parent_id: ID of the parent comment
- subreddit_id: ID of the subreddit
- link_id: ID of the link
- text: the comment text
- score: score of the comment
- ups: the number of upvotes
- author: author of the comment
- controversiality: 0 = not flagged, 1 = flagged controversial
- parent_link_id: link ID of the parent comment
- parent_text: the text of the parent comment
- parent_score: score of the parent comment
- parent_ups: number of upvotes of the parent comment
- parent_author: author of the parent comment
- parent_controversiality: 0 = not flagged, 1 = flagged controversial
We used two Jupyter notebooks for this project, both running Python 3.7.3.
In the first notebook we load the huge dataset of 4 million entries, take a small fraction of it, and clean that fraction using regular expressions (RegEx). We then keep only the columns we want to preserve: text, score, ups, controversiality, parent_text, parent_score, parent_ups, parent_controversiality.
Finally, we write the cleaned fraction to two new files.
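A minimal sketch of this first notebook, assuming pandas is used and with hypothetical file names (`positive.csv` and `negative.csv` stand in for the actual Kaggle files):

```python
import re
import pandas as pd

KEEP = ["text", "score", "ups", "controversiality",
        "parent_text", "parent_score", "parent_ups", "parent_controversiality"]

def clean_text(s):
    # Drop URLs and any character that is not a letter, digit, or basic punctuation.
    s = re.sub(r"https?://\S+", " ", str(s))
    s = re.sub(r"[^A-Za-z0-9.,!?' ]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

for name in ["positive", "negative"]:           # hypothetical file names
    df = pd.read_csv(f"{name}.csv")
    df = df.sample(frac=0.01, random_state=42)  # take a small fraction of the 4M rows
    df = df[KEEP]                               # keep only the columns listed above
    for col in ["text", "parent_text"]:
        df[col] = df[col].apply(clean_text)
    df.to_csv(f"{name}_clean.csv", index=False)
```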
In the second notebook we load the files written in the first notebook, combine the positive and negative datasets, and shuffle the result into one big random dataset.
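In pandas this amounts to a concat followed by a full-sample shuffle; the label column here is our assumption (1 for the high-score file, 0 for the downvoted one), matching the sigmoid output described below:

```python
import pandas as pd

pos = pd.read_csv("positive_clean.csv")
neg = pd.read_csv("negative_clean.csv")

# Assumed labeling: 1 = high-score comments, 0 = heavily downvoted comments.
pos["label"] = 1
neg["label"] = 0

# Combine both datasets and shuffle them into one big random dataset.
data = pd.concat([pos, neg], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
```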
After that we use the tokenizer from the Keras library. The tokenizer first learns from all the text we have and assigns an integer to each word (tokenizer.fit_on_texts(text_data)). This lets us turn every comment into a sequence of integers, which is easier for our neural network to learn from.
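Roughly, the tokenization step looks like this; the vocabulary size and sequence length are illustrative assumptions, not values from the project:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Combine each comment with its parent comment into one text per example.
texts = (data["text"].astype(str) + " " + data["parent_text"].astype(str)).tolist()

tokenizer = Tokenizer(num_words=10000)  # vocabulary size is an assumption
tokenizer.fit_on_texts(texts)           # learn the word -> integer mapping

sequences = tokenizer.texts_to_sequences(texts)  # comments -> integer sequences
padded = pad_sequences(sequences, maxlen=200)    # pad/truncate to a fixed length
```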
After creating a sequence for the combination of text and parent_text, we build a neural network using TensorFlow.
First comes an Embedding layer, which turns the positive integer tokens into dense vectors of fixed size. Next, a GlobalAveragePooling1D layer averages those vectors over the sequence, producing one fixed-length vector per comment. Then follow three Dense layers, which are ordinary fully connected layers. The last layer is our output layer: it uses a sigmoid activation, so it outputs a value between 0 and 1 that we round to predict either 0 (negative comment score) or 1 (positive comment score).
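A sketch of such a model in tf.keras; the layer widths and training settings are assumptions for illustration, since the text does not give the exact hyperparameters:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Turns positive integer tokens into dense vectors of fixed size.
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
    # Averages the embedding vectors over the sequence dimension.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Three ordinary fully connected layers (widths are assumptions).
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    # Sigmoid output: values near 0 mean a negative score, near 1 a positive one.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```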