Analyzing Reddit communities with Python — Part 2: Selecting a community

How to find interesting subreddits for text analysis

Tom van Nuenen
5 min read · Oct 9, 2021

In this series, I explain data science and Natural Language Processing (NLP) approaches in Python for people who want to engage in critical analyses of online discourse communities.

In this post, I will focus on the first question we should ask: how do we pick a community? This post is light on code; consider it a brief introduction to critical discourse analysis. Before we start analyzing, we should ask ourselves why we want to analyze a particular community in the first place.

What is a discourse community?

The communities you find on Reddit tend to center around particular topics. Due to their topical organization, some of them could be thought of as ‘discourse communities’ (Kehus, Walters, and Shaw 2010).

For instance, take r/letsnotmeet. It’s a subreddit that describes itself as follows:

About-section of r/letsnotmeet

Like most subreddits, it also has a “Guidelines” page, indicating what kind of user content its moderators do and do not consider appropriate. For instance, here’s a type of post that the moderators think does not belong on r/letsnotmeet:

You passed somebody on the street, and they gave you a creepy look. We know, it was really, really scary, and you don’t want to ever meet them again. But your post should probably go to /r/CreepyEncounters instead — LNM is set up to focus more on creepy encounters that are out of the ordinary.

As we can see, subreddits are often communities that have a broadly agreed set of common public goals and mechanisms of communication among their members. They also share expectations about the kinds of things that can and cannot be said, as well as specific words and phrases (Swales 2011). A related sociolinguistic concept is that of voice: the way in which people manage to make themselves understood or fail to do so (Blommaert 2005).

In this series, we will look at these ideological issues: what kind of language is and is not permissible for a particular community? What kind of voice do people in a community need to deploy in order to be understood?

Doing “discourse analysis”, in this context, means thinking about questions of power and knowledge in relation to language. We are interested in how people use language to make particular statements: for instance, to construct facts, or to refute a statement they do not agree with.

Discourse analysis is motivated by the simple observation that “any form of writing is considered to be a selection, an interpretation, and a dramatization of events” (Riggins 1997: 2). Far from being “transparent,” language does a great deal of “social and ideological ‘work’ […] in producing, reproducing, or transforming social structures, relations and identities” (Fairclough 1992: 211).

What makes online communities such as those on Reddit especially interesting in this light is that they self-organize the content they produce, for instance by allocating points or badges to particular types of content. Such “platform signals” can help us focus on comments that the community has deemed high-quality, and more in line with the ideas informing the community (LaViolette and Hogan 2019).

With this said, how do we go about finding an interesting community? We have a few options.

Follow your interests

For those who are unfamiliar with Reddit, the website offers more than 2.8 million subreddits (i.e., communities), wildly varying in size and popularity. Finding interesting ones is the challenge.

One approach is to explore content you are interested in yourself. In this case, it would be a good idea to create a Reddit profile and look for subreddits that deal with topics that interest you (politics, finance, video games, whatever!). Reddit’s recommendation algorithm will then offer you related content on the home page, which could point you to new subreddits of interest.

Find controversies

You might be interested in controversies, the study of which is an entire academic field in and of itself. Reddit has a long history of hosting controversial communities, which you could look through in order to find interesting ones. Reddit occasionally places subreddits in ‘quarantine’, which is a method intended for ‘communities that, while not prohibited by the Content Policy, average redditors may nevertheless find highly offensive or upsetting’. Reddit also bans communities that have violated its terms of use.

Look for platform signals

Finally, we can look at communities that are interesting because of the earlier mentioned “platform signals” they incorporate. Notably, the Reddit data we will download includes several metadata fields such as “flair” and “score” that can yield interesting results.
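To make the idea of a “score” signal a little more concrete, here is a minimal sketch in plain Python. The records are made-up examples, and the field names “body” and “score” are assumptions about what the downloaded data might look like; the helper high_scoring is hypothetical and not part of any Reddit library.

```python
# Made-up comment records for illustration only; real data would come from Reddit.
# The field names "body" and "score" are assumptions about the data layout.
comments = [
    {"body": "This happened to me too, stay safe.", "score": 250},
    {"body": "Did you report it?", "score": 42},
    {"body": "Fake.", "score": -7},
]


def high_scoring(items, min_score):
    """Keep items whose score is at least min_score, highest score first."""
    kept = [item for item in items if item["score"] >= min_score]
    return sorted(kept, key=lambda item: item["score"], reverse=True)


# Keep only comments the community has rated at 10 points or more.
for comment in high_scoring(comments, min_score=10):
    print(comment["score"], comment["body"])
```

The threshold of 10 is arbitrary; what counts as a “high” score depends heavily on the size of the community.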

It is worth looking for communities that distribute these scores or flairs in a structured way. For instance, the subreddit r/amitheasshole allows members to post about “any non-violent conflict you have experienced; give us both sides of the story, and find out if you’re right, or you’re the asshole.” This judgment, which is determined by the community, is captured by a “flair” (see the text highlighted in blue in the screenshot below).

Screenshot from r/amitheasshole

The community has decided that the author of the above story is “Not the A-hole”. Crucially, this is data that will show up as metadata in our database when we use the Reddit API to get our data.

Screenshot from r/amitheasshole data

Notice the “flair_text” and “flair_css_class” columns above: they contain the flair that shows the community’s judgment.

We can use this classification, for instance by getting all the posts that are classified as being written by an “asshole-ish” person and asking ourselves what they have in common, linguistically speaking. Can we find themes, topics and patterns that distinguish this class of posts from others?
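As a rough illustration of that idea, the sketch below selects posts by their flair and counts the words they use. It uses only the Python standard library. The sample posts are invented, the “selftext” field name and the “Asshole” flair label are assumptions, and the helpers tokenize, posts_with_flair and word_counts are hypothetical names chosen for this example.

```python
import re
from collections import Counter

# Invented sample posts; "flair_text" follows the column name mentioned above,
# while "selftext" and the "Asshole" label are assumptions about the real data.
posts = [
    {"selftext": "I refused to lend my brother money again.", "flair_text": "Not the A-hole"},
    {"selftext": "I yelled at my roommate over dishes.", "flair_text": "Asshole"},
    {"selftext": "I told my sister her wedding was boring.", "flair_text": "Asshole"},
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def posts_with_flair(rows, flair):
    """Return only the rows whose flair matches the given label exactly."""
    return [row for row in rows if row.get("flair_text") == flair]


def word_counts(rows):
    """Count how often each token appears across the text of the given rows."""
    counts = Counter()
    for row in rows:
        counts.update(tokenize(row["selftext"]))
    return counts


judged = posts_with_flair(posts, "Asshole")
counts = word_counts(judged)
print(len(judged), "posts with this flair")
print(counts.most_common(3))
```

Raw word counts are only a starting point: common words such as “I” and “my” will dominate, so a real analysis would compare the counts of one class against the others.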

This is just one example, of course. The point is: content that is self-moderated allows us better insights into a discourse community’s ideas, goals, or beliefs.

In the next post, we will start to explore a subreddit using Python’s oft-used Pandas package.

Sources

Blommaert, J. 2005. Discourse. Cambridge: Cambridge University Press.

Fairclough, N. 1992. Discourse and text: Linguistic and intertextual analysis within discourse analysis. Discourse & Society 3(2):193–217.

Kehus, M.; Walters, K.; and Shaw, M. 2010. Definition and genesis of an online discourse community. International Journal of Learning 17(4):67–86.

LaViolette, J., and Hogan, B. 2019. Using platform signals for distinguishing discourses: The case of men’s rights and men’s liberation on Reddit. Proceedings of the 13th International Conference on Web and Social Media, ICWSM 2019, 323–334.

Riggins, S. H. 1997. The rhetoric of othering. In Riggins, S. H., ed., The language and politics of exclusion. London: Sage.

Swales, J. 2011. The Concept of Discourse Community. Writing About Writing 466–473.


Tom van Nuenen

Tom is a scholar in data-based culture and media studies. He holds a PhD in culture studies and mostly writes on data science, technology, and tourism.