Analyzing Reddit communities with Python — Part 5: topic modeling

How to find topics in Reddit data

Tom van Nuenen
18 min read · Feb 10, 2022

In this series, I explain data science and Natural Language Processing (NLP) approaches in Python for people with little to no programming knowledge, who want to engage in critical analyses of online discourse communities.

In previous posts, I discussed how to use the Reddit API to retrieve data, how to select a subreddit, how to use Pandas and NLTK to explore your data, and how to use tf-idf to compare subreddits.

This week, we look at a popular approach to exploring a community: topic modeling. I will explain topic modeling in detail for newcomers, and then use gensim to run the algorithm over one of our Reddit datasets.

As topic modeling is a heavily iterative method, we’ll also look at ways to improve the model.

An applied introduction to topic modeling

Topic modeling is a type of statistical modeling allowing us to discover abstract “topics” that occur in a collection of documents. It is used frequently as a text-mining tool to find “hidden” semantic structures in textual data.

Crucially, topic modeling does not require us to know anything about our texts in order to do its job. It is, in other words, an unsupervised machine learning technique that allows us to scan a set of documents, detect word and phrase patterns within them, and automatically cluster word groups and similar expressions that best characterize a set of documents.

There are many ways of doing topic modeling; one of them is LDA. Instead of implementing it without knowing what it does, let’s spend some time thinking about how this algorithm works.

What is Latent Dirichlet Allocation?

Latent Dirichlet Allocation (LDA) is a generative statistical model and one popular method for fitting a topic model. It approaches the occurrence of words in documents as a probability problem, using what’s called Bayesian statistics. We can summarize the method as follows: it is a form of (1) unsupervised, (2) probabilistic modeling (3) using hidden variables.

Let’s explain these terms.

  1. Unsupervised: it doesn’t require us to provide classifications upfront to train and fit a model.
  2. Probabilistic: it treats data — so say, a collection of texts — as observations that arise from some generative probabilistic process.
  3. Hidden layer: it assumes that there is a hidden (latent) layer in this process, which explains why documents look the way they do. This hidden layer consists of the topics that topic modeling is named after. LDA assumes that these topics reflect the thematic structure of our data.

The only thing this unsupervised model cannot do for us is decide how many of these hidden variables (topics) there are. We have to choose that ourselves, upfront. How to be smart about that choice is something we will discuss later. In the example of using LDA on Reddit below, we’ll run our imaginary model with 3 topics.

LDA: An example

Let’s say we are interested in a Reddit community in the manosphere, which we have been focusing on in this series.

The assumption that an LDA algorithm makes is that each submission in our data is a random mixture of corpus-wide topics, and that each word/token in each document is drawn, with some probability, from one of these topics.

In a sense, LDA assumes that when an author in this community is writing some document, they can choose from a fixed set of topics. For our purposes, we could call this imaginary author the prototypical community member of the Reddit community we’re investigating. In the following image, for each word that our imaginary community member writes down in a submission/post/document, they draw from a so-called topic distribution: a sort of wheel of fortune with 3 choices.

LDA assumes an author draws from a topic for each word they write

Let’s say they draw topic 2. They now have to pick a word from that topic with some probability, and that’s the next word they write down.

The author then picks a word from that topic

A fancy way of saying what I just said is that with LDA we infer a hidden structure using “posterior inference”: we compute the distribution of hidden variables given the observations we do have (i.e., the documents themselves). So we’re kind of backtracking.

This is happening across all our submissions: each of them is assumed to be a mixture of our 3 topics. And these 3 topics are a mixture of the word types (distinct words) in the total corpus. The below image shows a different way of representing the process. The scores are the probabilities with which topics appear in documents, and word types in topics.

Documents are a mixture of topics, topics are a mixture of tokens
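To make this generative story concrete, here is a toy sketch in Python of how LDA imagines a document being written. The vocabulary, hyperparameters, and random seed are made up for illustration; this is the assumption LDA makes about writing, not how gensim actually fits a model.

import numpy as np

rng = np.random.default_rng(42)

vocab = ["gym", "lift", "approach", "text", "frame", "boss"]  # hypothetical toy vocabulary
n_topics = 3

# Each topic is a distribution over the vocabulary...
topic_word = rng.dirichlet(np.ones(len(vocab)) * 0.5, size=n_topics)
# ...and each document is a distribution over the topics.
doc_topic = rng.dirichlet(np.ones(n_topics) * 0.5)

document = []
for _ in range(10):                                # "write" a 10-word document
    topic = rng.choice(n_topics, p=doc_topic)      # spin the topic wheel of fortune
    word = rng.choice(vocab, p=topic_word[topic])  # then draw a word from that topic
    document.append(word)

print(document)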

Of course, this is not how writing works. However, if we look at this from a sociological perspective, it does make some sense: after all, the language we use is often not determined by us, but by hidden structures such as norms and values that are dominant in our social context. So it’s not a completely ridiculous intuition. In a previous post, I discussed such intuitions behind “discourse communities”.

To reiterate, LDA assumes the 3 topics we have chosen live “outside” of our data collection. The colored word types in the image below have a particularly high chance of originating in a particular topic. If our topic model works well, those topics will be coherent, as they are here: each of them expresses a recognizable theme that occurs across the posts in our dataset.

This is also why domain knowledge of your data is very important when doing topic modeling! On Reddit in particular, communities make use of jargon and specialized language — The Red Pill is full of abbreviations such as SMV (‘Sexual Market Value’), LMR (‘Last Minute Resistance’), and LTR (‘Long-Term Relationship’) (see Van de Ven & Van Nuenen 2020).

Here, we have given our topics names.

Note that the topics we get from LDA are not named by the algorithm. The naming is something you need to do as a researcher after inspecting them! This is why the coherence and interpretability of topic models really are what matter most.

(Note that things are a bit more complicated than this image. Each topic can be defined as a “distribution over terms in the vocabulary”. That means we assume a fixed vocabulary in our entire corpus, and each topic contains all words in the vocabulary with some probability).

In mathematical terms, each topic in LDA is a distribution over the V terms in the vocabulary, drawn from a Dirichlet distribution; similarly, each document is a distribution over the K topics we have chosen, also drawn from a Dirichlet distribution. The Dirichlet distribution is often used in Bayesian statistics as a “prior probability distribution”: a distribution that expresses our expectations about an uncertain quantity before we see the data.

Note that with this, we’re implicitly assuming that word order doesn’t matter. That’s why we say LDA is a “bag of words” approach. The intuition behind this is that, if I gave you a document with the words all scrambled up, you’d still be able to tell what the document was about, more or less.
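A quick way to see the bag-of-words intuition in action: count the tokens of a sentence and of a scrambled version of it, and the two “bags” come out identical.

from collections import Counter

original  = "the game is played with a ball".split()
scrambled = "ball a with played is game the".split()

# A bag-of-words view only keeps token counts, so both versions look identical:
print(Counter(original) == Counter(scrambled))  # True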

What is Gibbs sampling?

How does LDA calculate which word types belong in which topics?

Essentially, it begins by assigning words to random topics. Of course, that first topic model sucks, as it doesn’t capture thematically coherent themes. All we have at this point is a set of documents and some random probability distributions.

This is where Gibbs sampling comes in. We basically assume we have almost solved the problem. So, for each document, we go through each word, and we assume that all words except this one are placed in their topics correctly. Now, how do we decide whether this word, call it word type W, belongs in the topic T it is currently assigned to? Basically, we take a calculated guess by considering two things: (a) the chance of word W occurring in topic T, and (b) the chance of topic T occurring in this document D.

How do you do that? For each topic, we multiply the relative frequency of this word type W in topic T by the number of other words in document D that already exist in T. So what matters is 1) how often this word already exists in this topic, and 2) how often other words in this document exist in this topic. This is why it’s all about co-occurring words.

As we iterate over all of the words in our corpus like this, two things happen: 1) word types will gradually become more common in topics where they are already common, and 2) topics will become more common in documents where they are already common.

This iteration part is why topic modeling is a form of machine learning. We go over our data until the documents begin to “converge”, i.e., the probability distributions of topics don’t change anymore.
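To make that sampling step a bit more tangible, here is a heavily simplified sketch of one Gibbs-sampling sweep. All names and the smoothing constants are made up for illustration, and note that gensim’s LdaModel, which we use below, actually estimates the model with online variational Bayes rather than Gibbs sampling.

import numpy as np

def gibbs_pass(docs, z, n_wt, n_dt, rng, alpha=0.1, beta=0.1):
    """One simplified sweep of collapsed Gibbs sampling.
    docs: list of documents, each a list of word ids
    z:    current topic assignment for every word (same shape as docs)
    n_wt: count matrix [vocab_size, n_topics] of word-topic assignments
    n_dt: count matrix [n_docs, n_topics] of document-topic assignments
    """
    n_topics = n_wt.shape[1]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            n_wt[w, t_old] -= 1          # pretend this one word is unassigned
            n_dt[d, t_old] -= 1
            # (a) how common is word w in each topic, times
            # (b) how common is each topic in document d
            p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + beta * n_wt.shape[0]) * (n_dt[d] + alpha)
            p = p / p.sum()
            t_new = rng.choice(n_topics, p=p)   # resample a topic for this word
            z[d][i] = t_new
            n_wt[w, t_new] += 1
            n_dt[d, t_new] += 1
    return z

Repeating such sweeps until the assignments barely change anymore is the “convergence” mentioned above.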

Topic modeling in Python

I hope you now have a better intuition of what topic modeling does. Time to implement it!

We’ll use the gensim package in Python to create our topic models, as it allows us to run tests to optimize the number of topics. By the end of this post, you’ll be able to:

  1. Use gensim to create topic models;
  2. Explore the topic models using PyLDAvis;
  3. Evaluate the coherence of topic models.

Again, I am assuming you have your own data in CSV format (see this post on how to do this).

The data I am using consists of submissions from a subreddit called The Red Pill (/r/TheRedPill). This community defines itself as a forum for the “discussion of sexual strategy in a culture increasingly lacking a positive identity for men” (Watson 2016). It belongs to the online Manosphere, a loose collection of movements and communities such as pickup artists, involuntary celibates (‘incels’), and Men Going Their Own Way (MGTOW). Within the ‘masculinist’ belief system of this community, society is ruled by feminine ideas and values, yet this fact is repressed by feminists and politically correct ‘social justice warriors’ (Marwick and Lewis 2017; LaViolette and Hogan 2019; Van de Ven and Van Nuenen 2020). Previous studies have used topic modeling to explore this community (Mountford 2018).

import pandas as pd

trp_sub = pd.read_csv("TRP_submissions.csv", lineterminator='\n')
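The tokenizer below expects a plain list of strings rather than a DataFrame. A minimal bridge, assuming the submission text sits in a column called selftext (adjust to whatever your CSV uses):

# Pull the raw submission texts out of the DataFrame.
# "selftext" is an assumption -- use your own text column name here.
data = trp_sub["selftext"].fillna("").astype(str).tolist()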

Time to tokenize. Let’s use Gensim’s simple_preprocess() method this time. If you haven’t seen yield before, it is used in what’s called a generator function: a function that produces values one at a time as you iterate over it, instead of returning everything at once.

return sends a specified value back to its caller once, whereas yield can produce a sequence of values. We use yield when we want to iterate over a sequence, but don’t want to store the entire sequence in memory.
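A tiny, hypothetical example of the difference:

def return_all(n):
    return [i * i for i in range(n)]  # builds the whole list in memory at once

def yield_each(n):
    for i in range(n):
        yield i * i                   # produces one value at a time, on demand

print(return_all(3))        # [0, 1, 4]
print(list(yield_each(3)))  # [0, 1, 4] -- same values, computed lazily

With that in mind, here is the tokenizer: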

from gensim.utils import simple_preprocess

def tokenizer(texts):
    for text in texts:
        yield simple_preprocess(text, deacc=True)

tokens_list = list(tokenizer(data))

Creating N-grams with Gensim

Topic modeling — as well as many other kinds of NLP methods — works better when using N-grams, as this allows words that frequently appear together to be concatenated (e.g. the bigram “red pill” means something different than “red” and “pill” separately).

Gensim’s Phrases model implements bigrams, trigrams, quadgrams, etc. Phrases detects phrases based on collocation counts. It builds a model of input text that you then can use on other data.

Gensim detects a bigram if a scoring function for two words exceeds a threshold. The two important arguments to Phrases are min_count and threshold. The higher the values of these parameters, the harder it is for words to be combined into bigrams.

(Note that we’re also running Phraser which must be built from an initial Phrases instance. It then works faster while using much less memory. See here for more info.)

import gensim

bigram = gensim.models.Phrases(tokens_list, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[tokens_list], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
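A quick sanity check is to feed the phraser a tokenized sentence and see what it glues together (the exact phrases it finds depend entirely on your data and the threshold you chose):

print(bigram_mod[["the", "red", "pill", "changed", "my", "outlook"]])
# e.g. ['the', 'red_pill', 'changed', 'my', 'outlook'] -- if "red pill" passed the threshold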

Stopwords removal, bigrams, lemmatization

Let’s define some functions for stopword removal, making bigrams and trigrams, and lemmatization. We import a stopwords list from NLTK, to which we add a list of some recurring characters we want to get rid of in our data. We then use spaCy for lemmatization and POS tagging to keep only nouns and adjectives (see here for more info).

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import spacy
!spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

# NLTK stopwords, extended with any recurring tokens you want to drop from your data
stop = stopwords.words('english')

def remove_stopwords(texts):
    return [[word for word in doc if word not in stop] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for doc in texts:
        joined = nlp(" ".join(doc))
        texts_out.append([token.lemma_ for token in joined if token.pos_ in allowed_postags])
    return texts_out
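To tie these functions together and produce the lemmas list we will use from here on, a minimal pipeline (using the variable names defined above) looks like this:

# Chain the preprocessing steps: stopword removal -> bigrams -> lemmatization
tokens_nostop = remove_stopwords(tokens_list)
tokens_bigrams = make_bigrams(tokens_nostop)
lemmas = lemmatization(tokens_bigrams, allowed_postags=['NOUN', 'ADJ'])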

Creating a `Dictionary` with Gensim

Now, let’s create our gensim dictionary — a mapping of each word to a unique id. It will be used to create a Corpus object, which is gensim’s equivalent of a Document-Term matrix.

from gensim import corpora

# Create Dictionary
dictionary = corpora.Dictionary(lemmas)

# Create Corpus, i.e. Document-Term Matrix
corpus = [dictionary.doc2bow(text) for text in lemmas]

Let’s view a small part of the corpus we have now:

print(corpus[0][:10])
>>> [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1)]

Observe the first 10 tuples above. This is a mapping of (word_id, word_frequency). For example, (0, 1) above means that the word with ID=0 occurs once in the first document. Word ID=1 occurs 2 times, and so on. This is used as the input by the LDA model.
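To make those tuples readable, you can map the ids back to words:

# Readable (word, count) pairs for the first document
print([(dictionary[word_id], count) for word_id, count in corpus[0][:10]])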

If you want to see what word a given ID corresponds to, pass the ID as a key to the dictionary.

dictionary[5]
>>> 'advance'

And if you want to see the associated id for some word:

dictionary.token2id['advance']
>>> 5

Running an LDA model

It’s finally time to run a Gensim LDA model. For more information on the parameters, see here.

# Build LDA model. Make sure to play around with chunksize and passes
# and check if the coherence score changes a lot.
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=dictionary,
                                            num_topics=10,
                                            random_state=100,
                                            eval_every=20,
                                            update_every=1,
                                            chunksize=500,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

That was it. Long buildup to a short operation!

Visualizing the model

Let’s try to evaluate our topics. First, we can visualize our topics using pyLDAvis. A “good” topic model produces fairly large, non-overlapping bubbles scattered throughout the chart, rather than clustered in one quadrant. A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart. This is the first way in which you can evaluate your topic models — basically, through eyeballing.

!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)
lda_viz

PyLDAvis graph

Note: if you display the above graph in your Python notebook, it will be interactive, which is what makes it great.
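If you are working outside a notebook, you can also write the same interactive visualization to a standalone HTML file and open it in a browser:

pyLDAvis.save_html(lda_viz, "lda_topics.html")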

On the left, there is a 2D plot of the “distance” between all of the topics (labeled as the Intertopic Distance Map). This plot uses a multidimensional scaling (MDS) algorithm. This means that similar topics should appear close together on the plot, while dissimilar topics should appear far apart. Further, the relative size of a topic’s circle in the plot corresponds to the relative frequency of the topic in the corpus.

Exploring topics and words

You can scrutinize a topic more closely by clicking on its circle, or entering its number in the “selected topic” box in the upper-left (Note that, though the data used by gensim and pyLDAvis are the same, they don’t use the same ID numbers for topics).

If you roll your mouse over a term in the bar chart on the right, the topic circles will resize in the plot on the left. This shows the strength of the relationship between the topics and the selected term.

Salience

On the right, there is a bar chart with the top terms. When no topic is selected in the plot on the left, the bar chart shows the top-30 most salient terms in the corpus. A term’s saliency is a measure of both how frequent the term is in the corpus and how “distinctive” it is in distinguishing between different topics.

Probability vs. Exclusivity

When you select a particular topic, this bar chart changes to show the top-30 most “relevant” terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart:

  • Setting λ close to 1.0 (the default) will rank the terms according to their probability within the topic.
  • Setting λ close to 0.0 will rank the terms according to their “distinctiveness” or “exclusivity” within the topic. This favors terms that occur almost exclusively in this topic and rarely in other topics.

You can move the slider between 0.0 and 1.0 to weigh term probability and exclusivity. Note that if you see the same keywords being repeated at the top of multiple topics, it’s probably a sign that you have too many topics.
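For reference, the relevance score behind that slider comes from Sievert and Shirley’s (2014) LDAvis paper; roughly, it blends the two rankings like this (a sketch of the formula, not pyLDAvis’s actual code):

import numpy as np

def relevance(p_w_given_t, p_w, lam=0.6):
    # lambda * log p(w|t) + (1 - lambda) * log( p(w|t) / p(w) )
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)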

Usefulness

The interactive visualization pyLDAvis produces is helpful to explore individual topics: you can manually select each topic to view its top most-frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a name or “meaning” to each topic.

It also helps you to see the relationships between topics: exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

Note that in the image above, our topic model has many overlapping bubbles, meaning that those topics largely consist of the same words and are therefore not capturing distinct themes. Such clusters of overlapping topics tend to point to larger, more general themes in the data. As you can see in the image above, the words in topic 1 are not very coherent.

Calculating Topic Coherence

We can apply some statistical measures to help us determine the optimal number of topics in our topic model.

Topic Coherence is a measure applied to the top N words of each topic. There are multiple ways to calculate this metric; here, we will define it as the average of the pairwise word-similarity scores of the words in the topic. This helps to distinguish between topics that are semantically interpretable and topics that are mere artifacts of statistical inference.

A set of statements or facts is said to be coherent if the statements support each other. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.

A good model will generate topics with high topic coherence scores. Good topics are topics that can be described by a short label, therefore this is what the topic coherence measure should capture.

from gensim.models import CoherenceModel

coherence_model = CoherenceModel(model=lda_model, corpus=corpus, texts=lemmas,
                                 dictionary=dictionary, coherence='c_v')

# The higher the better. A coherence score of .4 is bad: it means you're probably
# not using the right number of topics; .6 is great. Anything more is suspiciously great.
coherence = coherence_model.get_coherence()
print('\nCoherence Score: ', coherence)
>>> Coherence Score: 0.3641326014061597

Optimizing coherence scores

The most obvious thing we can do to find optimal scores is to preprocess our data differently. For instance, we might want to use only nouns, or all of the lemmas. We could also remove tokens that are very rare, or very common, from our data. Topic modeling is all about iteration — you never get it right the first time. So play around with your data and see if it improves the model!
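For instance, gensim’s Dictionary has a filter_extremes() method for pruning very rare and very common tokens before rebuilding the corpus; the cut-offs below are only starting points to experiment with:

# Drop tokens that appear in fewer than 5 documents or in more than half of them,
# then rebuild the corpus with the trimmed vocabulary.
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in lemmas]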

The next logical step is to change the number of topics our model creates. One way to do this is to build many LDA models with different numbers of topics (k), and then pick the one that gives the highest coherence value. The compute_coherence_values() function below trains multiple LDA models and returns them along with their corresponding coherence scores.

def compute_coherence_values(dictionary, corpus, texts, limit=50, start=10, step=10):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics
    start : Starting num of topics
    step : Step size

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA
    model with respective number of topics
    """
    coherence_values = []
    model_list = []
    total_amount = len(range(start, limit, step))
    current_amount = 0
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=dictionary,
                                                num_topics=num_topics,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=500,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=False)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts,
                                        dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        current_amount += 1
        print("Built " + str(current_amount) + " of " + str(total_amount) + " models")
    return model_list, coherence_values

Using our new function, let’s run a bunch of topic models with different amounts of topics.

model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=lemmas, start=10, limit=50, step=10)

>>> Built 1 of 4 models
>>> Built 2 of 4 models
>>> Built 3 of 4 models
>>> Built 4 of 4 models

Now, from all those models, let’s visualize the output of the coherence scores.

import matplotlib.pyplot as plt
%matplotlib inline

limit = 50; start = 10; step = 10
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

If the coherence score keeps increasing, it generally makes sense to pick the model that gave the highest coherence score before the curve flattens out or drops again (the so-called elbow method — another heuristic often used by data scientists).
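You can also pick the best candidate programmatically rather than by eyeballing the plot:

# Index of the highest coherence score, and the number of topics it corresponds to
best_idx = coherence_values.index(max(coherence_values))
best_k = list(range(start, limit, step))[best_idx]
print(best_k, coherence_values[best_idx])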

It may seem that the most fitting number of topics is 20 here. However, this goes against our earlier indication that we had too many topics! If you look more closely at the graph above, you see the coherence score is hardly going up at all. This is just one way in which visualizations can be misleading!

For now, we’ll just try 20 topics and see what comes up.

from pprint import pprint

optimal_model = model_list[1]  # the model with 20 topics (start=10, step=10)
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=20))

Let’s pick out one topic that gets printed with the above code.

(0,
 '0.054*"body" + 0.039*"muscle" + 0.039*"weight" + 0.034*"gym" + 0.031*"fat" '
 '+ 0.022*"workout" + 0.018*"pound" + 0.017*"diet" + 0.017*"lift" + '
 '0.016*"strength" + 0.015*"food" + 0.013*"discipline" + 0.013*"health" + '
 '0.012*"lifting" + 0.011*"sleep" + 0.011*"training" + 0.011*"day" + '
 '0.011*"fitness" + 0.011*"com" + 0.010*"calorie"'),

Our topic model has picked up on a topic (topic 0) that looks coherent. It seems to be about working out and physical health — a typical topic in the manosphere. We could thus call this topic “working out”.

Finding most distinctive threads per topic

With our topic model, we can now find the submissions that contain the highest proportion of words from a certain topic. This is useful if you have found a really interesting topic and want to know the top-n submissions it typically appears in.

We’ll create a new DataFrame that includes the top-n submissions per topic.
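The grouping code below assumes a DataFrame called df_topic_sub_keywords with one row per submission, listing its dominant topic, that topic’s contribution, the topic’s keywords, and the submission text. A minimal sketch for building it from the model we trained (the column names are my own choices, made to match the code that follows):

# For each submission, record its dominant topic, that topic's probability,
# the topic's top keywords, and the original submission text.
rows = []
for doc_id, bow in enumerate(corpus):
    topic_probs = sorted(optimal_model.get_document_topics(bow),
                         key=lambda x: x[1], reverse=True)
    dominant_topic, prob = topic_probs[0]
    keywords = ", ".join(word for word, _ in optimal_model.show_topic(dominant_topic))
    rows.append([dominant_topic, round(prob, 4), keywords, data[doc_id]])

df_topic_sub_keywords = pd.DataFrame(
    rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Text'])

With that in place, we can group and sort: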

# Group the top 5 submissions under each topic
sub_topics_sorteddf = pd.DataFrame()
sub_topics_outdf_grpd = df_topic_sub_keywords.groupby('Dominant_Topic')

for i, grp in sub_topics_outdf_grpd:
    sub_topics_sorteddf = pd.concat([sub_topics_sorteddf,
                                     grp.sort_values(['Perc_Contribution'], ascending=[0]).head(5)],
                                    axis=0)

# Reset Index
sub_topics_sorteddf.reset_index(drop=True, inplace=True)

# Format
sub_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sub_topics_sorteddf

Here’s an excerpt from the DataFrame we’ve created.

Finally, let’s have a look at one of the posts to see if it indeed is about working out.

sub_topics_sorteddf['Text'][4]

>>> **Background:** This first section can be skipped but is included so you know that I **actually know** what I am talking about. This is not some guy who lost a couple of pounds or read some fitness blogs about weight loss. Everyone knows the BMI scale: * <18.5 = Underweight * 18.5 - 25 = Normal weight * >25 = Overweight What most people don\'t know is that there are further categories for being overweight * 25 - 30 = Overweight * 30 - 35 = Obese Class I (Moderately obese) * 35 - 40 = Obese Class II (Severely obese) * >40 = Obese Class III (Very severely obese) I was at a BMI of 47. Following the logic of the system above, they would have to invent an additional 45-50 category to file me under. I was fucked far worse than 99 % of you people reading this. Ironically, I have a degree in food technology and should have put my knowledge to better use at the time. Hindsight is always 20:20. It took me almost 2 years to lose approx. 200 lbs. I lost all that weight through discipline.

Looks like it works!

Conclusion

That was quite a lot. Hopefully, you now have a better conceptual and practical understanding of what topic modeling could do for your data analyses on social media such as Reddit.

One thought to end with: for most topic models you will create, it will be hard to apply a meaningful interpretation to each topic. Not every topic will have some meaningful insight fall out of it upon first inspection. This is a typical issue in machine learning, which can pick up on granular patterns in a way people do not.

It is an open question to what extent you should let yourself be surprised by particular combinations of words in a topic, or whether topic models should primarily follow the intuitions you have as a researcher. What makes for a “good” topic model probably straddles the boundary between surprise and convention.

In the next post, we will look at another unsupervised method, one that helps us to find relations between words in our data: word embeddings. See you then!

Sources

LaViolette, J., & Hogan, B. (2019). Using platform signals for distinguishing discourses: The case of men’s rights and men’s liberation on Reddit. Proceedings of the 13th International Conference on Web and Social Media (ICWSM 2019), 323–334.

Marwick, A., & Lewis, R. (2017). Media Manipulation and Disinformation Online. Data & Society Research Institute, 1–104.

Mountford, J. (2018). Topic Modeling The Red Pill. Social Sciences, 7(3), 42. https://doi.org/10.3390/socsci7030042

Van de Ven, I., & Van Nuenen, T. (2020). Digital Hermeneutics and Media Literacy: Scaled Readings of The Red Pill (Tilburg Papers in Culture Studies №241). Tilburg.

Watson, Z. (2016). Red Pill Men and Women, Reddit, And The Cult of Gender | Inverse. Retrieved August 3, 2019, from https://www.inverse.com/article/15832-red-pill-men-and-women-reddit-and-the-cult-of-gender


Tom van Nuenen

Tom is a scholar in data-based culture and media studies. He holds a PhD in culture studies and mostly writes on data science, technology, and tourism.