The reddit data is in a PostgreSQL database on m2. The whole dataset is over 1 TB, and the database lets you get portions of the data.
The reddit data is not filtered. You may see offensive information, even in subreddits that seem safe. Use discretion in what you use and share.
Before you can use this database, you must turn on an SSH tunnel to m2.
You need to do this once, only for Windows:
- Open PuTTY. (Download and install if you don't already have it.)
- On the left side, select Connection > SSH > Tunnels.
- On the right side, fill in these fields:
- Source port:
5432 - Destination:
login02.m2.smu.edu:5432
- Source port:
- Click Add.
- On the left side, click on Session (the first selection).
- For Host Name (or IP address), enter
login02.m2.smu.edu. - For Saved Sessions, type
login02.m2.smu.edu SSH tunnel. - Click Save.
- Open a new terminal window.
- Type this into your terminal and press enter, replacing username with your m2 username:
ssh -C -Y -L 5432:localhost:5432 username@login02.m2.smu.edu
- Open PuTTY.
- Under Saved Sessions, double-click on login02.m2.smu.edu SSH tunnel.
- If prompted, accept the certificate for login02.m2.smu.edu.
- When prompted, enter your m2 username and password.
Keep the PuTTY or terminal window open the whole time you need to connect to the PostgreSQL database. Once you've downloaded all the data you need, it's fine to close the connection.
You need the RPostgreSQL package installed first. If it's not installed, run install.packages("RPostgreSQL") in R. You only need to run this once.
Load the library and driver:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")Create a connection to the database:
con <- dbConnect(drv, dbname = "reddit",
host = "localhost",
port = 5432,
user = "tph_read",
password = "think-play-hack")Now you'll execute your query. More info on how to build a proper query is below. (This code will not work without constructing a query!) Change the object name to something that describes the data.
reddit_data <- dbGetQuery(con, "replace this with a proper SQL query")When you are done getting data from the database, disconnect using this command:
dbDisconnect(con)The SQL queries you'll use start with this:
SELECT id, ups, downs, parent_id, body, subreddit_id, subreddit, link_id, created_utc, created_utc_timestamp, author
FROM reddit_new
WHERE ...What you need to build is the WHERE clause. This filters things down so that you don't overwhelm yourself with 1.6 billion rows of data!
Determine which subreddits you want to edit. The dataset has almost 240,000 subreddits. The full list of subreddits is available here. Get the list: CSV (4.9 MB) Google Sheet (may be slow).
Once you know your subreddits, build a WHERE clause to filter by them. First, create a comma-separated list of subreddits you want to get, like this:
subreddit IN ('BoyScoutsofAmerica','cubscouts','EagleScout','EagleScouts','Eagle_Scouts','EagleScoutsOnReddit')Important notes:
- Keep capitalization, punctuation, all other details the same. Differences will prevent that subreddit from appearing.
- Sometimes, there will be multiple subreddits for one topic.
Now replace the ellipsis (...) in the sample SQL query above (starts with SELECT) with the clause you just created. Viola, you have a SQL query! Take that query and substitute it in the R code where I wrote replace this with a proper SQL query.
It is possible to filter by other criteria, such as date and more. These criteria can be in addition to or instead of filtering by subreddit. Aren Cambre can help you create the proper SQL query for this.
If you would like to use something other than R to analyze the data you downloaded, you can export it. All the below commands come from Tidyverse libraries, so let's load the Tidyverse first:
install.packages(tidyverse) # only run this line if you have not done this since installing R
library(tidyverse)Assuming you used the same reddit_data object name as above, you can use this command to export your data to a CSV:
write_csv(reddit_data, "reddit_data.csv")Want to write to a TSV instead?
write_tsv(reddit_data, "reddit_data.csv")