Analyzing Reddit communities with Python — Part 4: TF-IDF
Comparing subreddits
In this series, I explain data science and Natural Language Processing (NLP) approaches in Python for people with little to no programming knowledge, who want to engage in critical analyses of online discourse communities.
In previous posts, I discussed how to use the Reddit API to retrieve data, how to select a subreddit, and using Pandas and NLTK to explore your data.
In this post, I’ll introduce TF-IDF (term frequency–inverse document frequency), a technique for comparing related datasets. We will preprocess our Reddit data with Pandas and NLTK, the packages introduced in the previous posts, and then compute TF-IDF values using Scikit-learn.
TF-IDF allows us to compare different related subreddits, in order to find the most distinctive words in a particular subreddit. It can also help us to find similar posts to ones we’re interested in.
After reading this post, you will be able to:
- Preprocess Reddit data, including removing punctuation, tokenizing, and lemmatizing;
- Understand how TF-IDF can be used to compare datasets;
- Find most-distinctive words in a subreddit using TF-IDF;
- Find similar posts using TF-IDF.
What is TF-IDF?
TF-IDF is a basic but intuitive way to find words that are typical of a particular document, when compared to other comparable documents.
Terms with high TF-IDF values for a given document are generally the most descriptive of that document. If a word occurs many times in one document but rarely in the rest of the corpus, it is probably useful for characterizing that document; conversely, if a word occurs frequently in a document but also occurs frequently in the corpus, it is probably less characteristic of that document.
Put succinctly, the term frequency is the relative frequency of term t within document d, while the inverse document frequency is a measure of how much information the word provides, i.e., if it is common or rare across all documents.
TF-IDF is based on the Bag of words (BoW) model, which contains information about the less and more relevant words in a document. However, it builds on this BoW model by considering the importance of the word in a document as compared to the rest of the corpus.
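To make this concrete, here is a minimal sketch of the idea in plain Python, using a tiny made-up corpus and the common tf × log(N/df) formulation. Note that Scikit-learn, which we use below, adds smoothing and normalization on top of this, so its exact values will differ.
import math

# toy corpus: each "document" is one post, already lowercased and tokenized
docs = [
    "the date went well we talked about feminism".split(),
    "the government and feminism are discussed a lot here".split(),
    "my date got cancelled so i went to the gym".split(),
]

def tf_idf(term, doc, all_docs):
    # term frequency: relative frequency of the term within this document
    tf = doc.count(term) / len(doc)
    # document frequency: number of documents that contain the term
    df = sum(1 for d in all_docs if term in d)
    # inverse document frequency: terms that are rare across the corpus score higher
    idf = math.log(len(all_docs) / df)
    return tf * idf

# "the" occurs in every document, so its idf (and hence its tf-idf) is zero;
# "talked" occurs only in the first document, so it scores highest there
for term in ["the", "feminism", "talked"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))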
While this is valuable information, TF-IDF does have limitations: for instance, it does not encode the semantic meaning of words. This limitation of TF-IDF can be overcome by more advanced techniques such as Word2Vec.
That said, TF-IDF can be helpful for studying Reddit: research indicates that lexical differences between subreddits are more pronounced in higher-scoring comments (LaViolette and Hogan 2019).
Implementation in Python
Typically, we will need to do some more preprocessing to clean up our data. In the last post, we looked at r/seduction, a discourse community about “pick-up artistry” that describes itself as a space for “Help with dating, with a focus on how to get something started up, whether the goal is casual sex or a relationship”.
The other two datasets we will use are taken from subreddits that are, to different degrees, comparable to it: r/theredpill and r/mgtow (Men Going Their Own Way). Deciding what makes a subreddit “comparable” is a critical task, and it requires a good understanding of the kinds of concepts, theories, or critical frameworks you want to engage with. In this example, the discourse communities I compare share a belief system that has been called ‘masculinism’, which holds that society is ruled by feminine ideas and values, that this fact is repressed by feminists and politically correct ‘social justice warriors,’ and that men must protect themselves against a ‘misandrist’ culture (Marwick and Lewis, 2017; Gotell and Dutton 2016).
import pandas as pd
sed = pd.read_csv('seduction-submissions.csv', lineterminator="\n")
trp = pd.read_csv("TRP-submissions.csv", lineterminator="\n")
mgtow = pd.read_csv("mgtow-submissions.csv", lineterminator="\n")
Removing empty rows
Missing values (`NaN`) in a DataFrame can cause a lot of errors. In general, it’s a good idea to get rid of rows whose “selftext” is missing, so we’ll write a function to do this. The function takes one argument (which we call “df” here, as we expect it to be a DataFrame).
The `.dropna()` method allows us to remove rows that are empty. However, our data also includes posts that have been removed or deleted by the poster or the moderators of the subreddit. Social media data is often “dirty”: in this case, we first have to look at our data to see how these removed entries are expressed. It turns out that these rows consist of either the string “[removed]” or “[deleted]”.
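If you want to see this for yourself, one quick optional check (assuming your DataFrame has the same “selftext” column) is to count the most frequent values in the column and the number of missing entries:
# how are removed posts represented, and how many rows are missing?
print(sed['selftext'].value_counts().head())
print(sed['selftext'].isna().sum())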
We set a new variable `data_clean` consisting of the DataFrame we passed in, but excluding (using “~”) all rows whose “selftext” column consists of “[removed]” or “[deleted]”. We then also call `.dropna()` to get rid of all rows that are actually empty. Using this function, we can overwrite our three DataFrames with their cleaned versions:
def mr_clean(df):
    # keep only rows whose selftext is not "[removed]" or "[deleted]"
    data_clean = df[~df['selftext'].isin(['[removed]', '[deleted]'])]
    # drop rows with a missing selftext
    data_clean = data_clean.dropna(subset=['selftext'])
    return data_clean

sed = mr_clean(sed)
trp = mr_clean(trp)
mgtow = mr_clean(mgtow)
Preprocessing
Let’s begin by preprocessing our data. We will create a function in Python called `preprocessing` to preprocess our three DataFrames. The function takes in a DataFrame. It then turns the text in our “selftext” column into lowercase, removes punctuation, and tokenizes it using the NLTK package we used last time. We also lemmatize our text using NLTK’s WordNet Lemmatizer. Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item (e.g. “good”, “better” and “best” all become “good”; “swim”, “swam”, “swimming” all become “swim”). We save our output in one long string, to be processed again in the next step.
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def preprocessing(df):
    total = ""
    for text in df['selftext']:
        # lowercase
        text = text.lower()
        # remove punctuation
        text = ''.join(ch for ch in text if ch not in string.punctuation)
        # tokenize
        tokens = word_tokenize(text)
        # lemmatize
        lemmas = ' '.join([wordnet_lemmatizer.lemmatize(token) for token in tokens])
        # save
        total += lemmas + ' '
    return total

sed_pp = preprocessing(sed)
trp_pp = preprocessing(trp)
mgtow_pp = preprocessing(mgtow)
Using TF-IDF
We will be using Scikit-learn’s `TfidfVectorizer`. It is a class that allows us to create a matrix of simple word counts (a Bag of Words model) and immediately transform them into TF-IDF values. See the documentation if you want to learn more.
Below, we first put our three preprocessed strings in a list, then instantiate an object of the vectorizer, and finally run it by applying the `fit_transform` method to our `reddit_list`.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

reddit_list = [sed_pp, trp_pp, mgtow_pp]

tfidf_vectorizer = TfidfVectorizer(max_df=0.85, decode_error='ignore', stop_words='english', smooth_idf=True, use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(reddit_list)
Let’s have a peek at our matrix by running the `.toarray()` method. This shows us one value per word in the total vocabulary. We’re printing the vector at index 1: due to zero-based indexing, this shows us the TF-IDF values of the words as they appear in `trp_pp`. We have a vocabulary of 32,145 words, so the printed output only shows a small slice of those values!
tfidf_vectorizer_vectors.toarray()[1]
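If you want to verify the size of the matrix yourself, a quick check (using the variables defined above) is to print its shape and the length of the fitted vocabulary:
# rows = documents (one per subreddit), columns = vocabulary terms
print(tfidf_vectorizer_vectors.shape)
print(len(tfidf_vectorizer.vocabulary_))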
Putting distinctive words in a DataFrame
We can now take out one vector (i.e., the TF-IDF values of one text) that `.fit_transform()` yielded. We can put them in a DataFrame, and print out that DataFrame after sorting it based on the highest score.
# get the second vector out (for the second document, trp_pp)
# note that index 1 refers to the second document, due to zero-based indexing
vector_tfidfvectorizer = tfidf_vectorizer_vectors[1]

# place the tf-idf values in a DataFrame and show the ten highest-scoring terms
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)[:10]
The output shows the most distinguishing words for The Red Pill. As we can see, these are jargon terms such as smv (“sexual market value”) as well as terms such as “feminism” and “government”. These terms indicate that this subreddit might distinguish itself from the others through a focus on political discussion, a hypothesis which does indeed bear out once investigated further.
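We imported `linear_kernel` above but have not used it yet. As a quick illustration of the other use case mentioned at the start (finding posts similar to one we’re interested in), here is a minimal sketch: it assumes we fit a new vectorizer on the individual posts of r/seduction, rather than on whole subreddits, and uses cosine similarity to rank all posts against the first one. The variable names are hypothetical.
# fit a vectorizer on individual posts rather than on whole subreddits
post_vectorizer = TfidfVectorizer(stop_words='english')
post_vectors = post_vectorizer.fit_transform(sed['selftext'])

# cosine similarities between the first post and every post in the subreddit
similarities = linear_kernel(post_vectors[0], post_vectors).flatten()

# positions of the five most similar posts (skipping the post itself)
most_similar = similarities.argsort()[::-1][1:6]
print(sed['selftext'].iloc[most_similar])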
Well done! We have compared three Reddit datasets using TF-IDF in order to help us with the exploration of differences. This can be helpful for highlighting what makes a particular discourse community unique among its “peers”.
Of course, the words that we will get out of our TF-IDF algorithm depend greatly on which kinds of subreddits/data you are comparing. So make sure to collect data from subreddits that make comparative sense!
In the next post, we will look for a way to automatically cluster word groups and similar expressions that best characterize a subreddit. We will do so using topic modeling.
Sources
Gotell, L., & Dutton, E. (2016). Sexual Violence in the ‘Manosphere’: Antifeminist Men’s Rights Discourses on Rape. International Journal for Crime, Justice and Social Democracy, 5(2), 65. https://doi.org/10.5204/ijcjsd.v5i2.310
LaViolette, J., & Hogan, B. (2019). Using platform signals for distinguishing discourses: The case of men’s rights and men’s liberation on Reddit. Proceedings of the 13th International Conference on Web and Social Media (ICWSM 2019), 323–334.
Marwick, A., & Lewis, R. (2017). Media Manipulation and Disinformation Online. Data & Society Research Institute, 1–104.
Van de Ven, I., & Van Nuenen, T. (2020). Digital Hermeneutics and Media Literacy: Scaled Readings of The Red Pill (Tilburg Papers in Culture Studies №241). Tilburg.