Analyzing Reddit communities with Python — Part 1: Introduction

Getting Reddit data using the Timesearch package

Tom van Nuenen
6 min read · Sep 21, 2021

I’m starting this new series on data science and Natural Language Processing (NLP) in Python for people who want to do critical analyses of online discourse communities. In these posts, I will explain how to explore Reddit communities using Python.

In this introductory post, I will explain how to retrieve a Reddit dataset using the excellent Timesearch package. In future posts, I will introduce text analysis and NLP tools in Python you can use to explore this data, such as tf-idf, topic modeling, and word embeddings.

Discourse communities on Reddit

For the uninitiated, Reddit is a web platform for social news aggregation, web content rating, and discussion. It serves as a platform for multiple, linked topical discussion forums, as well as a network for shared identity-making.

Reddit members can submit content such as text posts, pictures, or direct links, organized in message boards. These so-called ‘subreddits’ are curated around particular topics, such as /r/pics for sharing pictures or /r/funny for posting jokes.

Reddit has been a popular site for Natural Language Processing studies: researchers have used it to classify mental health discourses and domestic abuse stories, and to explore social norms and language biases (something we will do in the posts to come as well).

What these studies show is that social media platforms such as Reddit not only reflect a distinct offline world but increasingly serve as spaces for contemporary ideological groups and processes. Due to their topical organization, we can think of subreddits as ‘discourse communities’, in that they have a broadly agreed set of common public goals and functioning mechanisms of intercommunication among their members.

From a linguistic perspective, we can then ask: what binds these discourse communities together? What kinds of themes and topics can we find in this data? Or, more broadly: how does this particular community make sense of the world? These are questions we will explore in the following posts.

Accessing the Reddit API

The Reddit API allows you to do lots of things, such as automatically posting as a user. It also allows you to retrieve data from Reddit, such as subreddit posts and comments.

There are restrictions in place: Reddit’s API typically only lets you retrieve around 1,000 posts (and their associated comments) per listing. Since we probably want more data than that, we will use the Timesearch package (see below). If you want to see the API and its cap in action first, there is a short PRAW sketch after the setup steps below.

  1. Sign up
    Go to http://www.reddit.com and sign up for an account.
  2. Create an OAuth app
    Go to https://ssl.reddit.com/prefs/apps/ and click on ‘create app’. Make sure to make it a script-type app!
  3. Write down the details
    Note the client ID, the client secret, and your Reddit username and password, as you’ll need them when configuring Timesearch below. Set the redirect URI to http://localhost:8080. The name and description can be anything you want, and the about URL is not required.
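
To check that your credentials work, and to see the listing cap mentioned earlier in action, you can run a short PRAW script before moving on. This is an optional sketch, not part of Timesearch itself: all the credential values are placeholders, and the subreddit name is just an example.

import praw

# Fill in the values from the OAuth app you just created (all placeholders).
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    username="YOUR_REDDIT_USERNAME",
    password="YOUR_REDDIT_PASSWORD",
    user_agent="/u/YOUR_REDDIT_USERNAME's praw client",
)

# If authentication worked, this prints your username.
print(reddit.user.me())

# Listings are capped server-side: even with limit=None you will get back
# at most roughly 1,000 submissions.
posts = list(reddit.subreddit("politics").new(limit=None))
print(f"Retrieved {len(posts)} submissions")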

Installing the Timesearch package

The Timesearch package is a collection of utilities for archiving subreddits. Here are instructions on how to install it (I am using macOS).

  • First, make sure you have Python installed on your machine! You can check this by typing python --version or python -V in your terminal and pressing enter. The Python version will appear on the line below your command.
  • Next, download the Timesearch package from GitHub using the green “Clone or Download” button in the upper right of the GitHub page.
  • Install PRAW v4 or higher, as well as the other required modules. You can do this through the terminal: navigate to the folder you just downloaded and type pip install -r requirements.txt to get them all.
  • Use this PRAW script to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal (or command line if you are on Windows).
  • Next, download a copy of this file and save it as bot.py. Fill out the variables using your OAuth information (see the previous section), and read the instructions to see where to put it. Save it in the same folder you downloaded the Timesearch package to. Note that the USERAGENT is a description of your API usage; typically something like ‘/u/username's praw client’ is sufficient. The CONTACT_INFO is sent when downloading from Pushshift, and can be your email address or Reddit username. A sketch of what this file might look like follows below.
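
For reference, here is roughly what a filled-out bot.py might look like. The exact variable names come from the template you downloaded; apart from USERAGENT and CONTACT_INFO, which the instructions mention, the names shown here are illustrative, and all values are placeholders.

# bot.py -- Timesearch credentials (all values are placeholders).
# Variable names other than USERAGENT and CONTACT_INFO may differ in your template.
USERAGENT = "/u/yourusername's praw client"
APP_ID = "YOUR_CLIENT_ID"
APP_SECRET = "YOUR_CLIENT_SECRET"
REFRESH_TOKEN = "YOUR_REFRESH_TOKEN"
CONTACT_INFO = "you@example.com"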

Using the Timesearch package

After installation, we can use Timesearch to get ourselves some Reddit data. We will use two of its functionalities, get_submissions and get_comments, to get the posts and comments of one particular subreddit. Open up the terminal again and navigate to your Timesearch folder. Then, type the following, replacing [subredditname] with the name of the subreddit you are interested in (for instance, “politics” or “historymemes”).

python timesearch.py get_submissions -r [subredditname]

Timesearch will now go through the subreddit in chronological order. This might take a while, and its stability obviously depends on your internet connection. Make sure to check in once in a while and restart the crawl if it has stopped (it will pick up where it left off). Your crawl will be saved as a .db file in the ‘subreddits’ folder, which is inside your Timesearch folder.

After the posts have been downloaded, we can get the associated comments in a separate (related) database by running the following:

python timesearch.py get_comments -r [subredditname]

This will probably take (a lot) longer than the previous query, as there typically are a lot more comments than posts on a subreddit!

Accessing the database file

Once you are done, you will find a .db file with the name of your subreddit in the ‘subreddits’ folder within the Timesearch folder. A .db file is a generic database file that stores data in a structured format; you need dedicated software to open it. I suggest the free DB Browser for SQLite.

Once you open the file, you will see different tables in this database for your submissions, your comments, and some other data we won’t concern ourselves with for now. Before moving on to Python in the next post, we need to export this database to two CSV files. We can do this by going to File -> Export -> Table(s) as CSV file. Select the ‘submissions’ and ‘comments’ tables, leave everything else as it is, and click ‘Save’. If you prefer to stay in the terminal, the sketch below does the same export in Python.
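
If you would rather not use the GUI at all, the same export can be done with a few lines of Python. This is a sketch under two assumptions: that the database file is at the path shown (adjust it to wherever Timesearch saved yours), and that the tables are named submissions and comments, as described above.

import sqlite3
import pandas as pd

# Path to the database Timesearch created; adjust to your subreddit's filename.
conn = sqlite3.connect("subreddits/mysubreddit.db")

# Read the two tables we care about and write each one to its own CSV file.
for table in ("submissions", "comments"):
    df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
    df.to_csv(f"{table}.csv", index=False)
    print(f"{table}: {len(df)} rows exported")

conn.close()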

That’s it! You now have two CSV files with your Reddit posts and comments.

In the next post, I will discuss some basic operations in Python (using Pandas) to access this data. See you there!

Sources

If you want to know more, check out these articles I have written about Reddit.

Inge van de Ven, Tom van Nuenen: Digital Hermeneutics and Media Literacy: Scaled Readings of The Red Pill. Tilburg Papers in Culture Studies 241 (2020).

Xavier Ferrer Aran, Tom van Nuenen, Jose M. Such, Mark Coté, Natalia Criado: Bias and Discrimination in AI: A Cross-Disciplinary Perspective. IEEE Technol. Soc. Mag. 40(2): 72–80 (2021).

Xavier Ferrer, Tom van Nuenen, Jose M. Such, Natalia Criado: Discovering and Categorising Language Biases in Reddit. ICWSM 2021: 140–151.

Tom van Nuenen, Xavier Ferrer Aran, Jose M. Such, Mark Coté: Transparency for Whom? Assessing Discriminatory Artificial Intelligence. Computer 53(11): 36–44 (2020).

Xavier Ferrer Aran, Tom van Nuenen, Natalia Criado, Jose M. Such: Discovering and Interpreting Conceptual Biases in Online Communities. CoRR abs/2010.14448 (2020).


Tom van Nuenen

Tom is a scholar in data-based culture and media studies. He holds a PhD in culture studies and mostly writes on data science, technology, and tourism.