Analyzing Reddit communities with Python — Part 3: Pandas and NLTK
Distant Reading using Pandas and NLTK
In this series, I explain data science and Natural Language Processing (NLP) approaches in Python for people with little to no programming knowledge, who want to engage in critical analyses of online discourse communities.
In the previous posts, I discussed how to use the Reddit API to retrieve data, and how to think about selecting a community for analysis.
In this post, I’ll introduce some methods for simple distant reading using NLTK, a well-known Python package for Natural Language Processing. We will learn how to tokenize text from a Pandas DataFrame and run some basic functions to explore our text.
After reading this post, you will be able to:
- Open and perform simple operations on a Pandas DataFrame;
- Use NLTK’s `Text()` object to perform some basic “distant reading” operations on a subreddit.
We’ll be using data from the subreddit r/seduction, a discourse community that describes itself as a space for “Help with dating, with a focus on how to get something started up, whether the goal is casual sex or a relationship. Learn how to connect with the ones you’re trying to get with!” The dataset I’m using only includes posts, called “submissions” (so not the comments).
If you want to try out the code on your own data, try Google Colab, a web IDE for Python with cloud storage. Colab allows you to use Python without having to install it on your machine!
Pandas basics
We’ll first read in our CSV using Pandas. We then have a look at the first three rows of our new DataFrame.
import pandas as pd
sed = pd.read_csv('seduction_submissions.csv')
sed.head(3)
Besides our submission text data (the column “selftext”), we have important metadata such as the number of upvotes minus the number of downvotes the submission has received (“score”), the length of the submission (“textlen”), and the number of comments following the submission (“num_comments”).
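For a quick overview of all the columns and their data types, we can use the DataFrame’s .info() method (a minimal check; your exact columns may differ depending on how you collected the data):

# Print each column's name, non-null count, and data type
sed.info()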
Sorting
Using the `.sort_values()` method we can sort the DataFrame by particular columns. We use two parameters: the `by` parameter indicates which column we want to sort by, and the `ascending` parameter indicates whether to sort in ascending or descending order. Here, I’m assigning my sorted DataFrame to the same variable `sed`, effectively overwriting the old version.
sed = sed.sort_values(by=['score'], ascending=False)
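To verify that the sort worked, we can peek at the top of the re-ordered DataFrame:

# The highest-scoring submissions should now come first
sed[['score', 'num_comments']].head(3)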
Converting to datetime
Have you ever wondered what format the “created” column is in? It is a Unix timestamp: the number of seconds (minus leap seconds) that have elapsed since the Unix epoch, 00:00:00 UTC on 1 January 1970.
Pandas allows us to convert Unix timestamps into more readable datetimes using the `pd.to_datetime()` function.
pd.to_datetime(1207632114,unit='s')
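This returns a Timestamp object, Timestamp('2008-04-08 05:21:54'), which tells us that this particular value corresponds to the early morning (UTC) of 8 April 2008.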
Creating a new column in Pandas is as easy as using the bracket notation to write a new column name and assigning it a value. In this case, we can use the `pd.to_datetime()` function again, this time pointing it at the entire “created” column. Our new column “created_datetime” will contain a more legible version of the datetime for each submission.
sed['created_datetime'] = pd.to_datetime(sed['created'],unit='s')
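We can check the result by putting the raw timestamps and the new, readable column side by side:

# Compare the original Unix timestamps with the converted datetimes
sed[['created', 'created_datetime']].head(3)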
Selecting a column
To select a single column of data, simply put the name of the column in between brackets. Let’s select the “selftext” column. We can print out the first three entries in this column as follows:
sed['selftext'][:3]
As you see, using the `[]` operator selects a set of rows and/or columns from a DataFrame.
One thing we often do when exploring a dataset is filtering the data based on a given condition. For example, we might need to find all the rows in our dataset where the score is 500 or higher. We can use the `.loc[]` indexer to do so.
sed = sed.loc[sed.score >= 500]
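It is worth checking how many submissions survive this filter; calling `len()` on a DataFrame returns its number of rows:

# How many submissions scored 500 or more?
len(sed)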
This kind of “top sampling” is a strategy often used by researchers, especially those interested in the behavior of users who have a disproportionate influence on online conversations based on popularity metrics such as likes, retweets, favorites, and so on. This procedure is particularly suitable for highly public and visible conversations, which tend to have a strong power-law distribution and few communication centers (Gerbaudo 2016).
Distant reading with NLTK
“Distant reading” is a method of literary criticism that uses computational and data analysis techniques to identify meaningful patterns within large collections of texts. Unlike close reading, the object of analysis is often a collection of hundreds or thousands of texts that no individual could read within the span of a lifetime.
NLTK (which stands for Natural Language ToolKit) is a leading platform for building Python programs to work with human language data. It allows us to engage in some very basic distant reading to discover themes in our discourse community.
We will need to tokenize our data in order to make use of it. Tokenization is a way of separating text into smaller units called tokens. These can be words, characters, subwords, and so on. For now, let’s focus on words, which NLTK allows us to do with the `word_tokenize()` function. It works like this:
import nltk
nltk.download('punkt')  # download the tokenizer models that word_tokenize() relies on (only needed once)
from nltk.tokenize import word_tokenize

word_tokenize("He is a lumberjack and he is okay. He sleeps all night and he works all day.")
This yields the following list of tokens:
['He', 'is', 'a', 'lumberjack', 'and', 'he', 'is', 'okay', '.', 'He', 'sleeps', 'all', 'night', 'and', 'he', 'works', 'all', 'day', '.']
Now let’s tokenize our entire “selftext” column. Here’s what we need to do:
- Create a new list called `sed_tokens`;
- Begin a for-loop that iterates over the “selftext” column of our `sed` DataFrame;
- For each text in that column, tokenize it using `word_tokenize()`;
- Add these tokenized words to our new `sed_tokens` list using the `list.extend()` method.
Note that we use `.extend()` instead of `.append()`. This is because we want one long list, instead of a list of lists. While `.append()` adds its argument as a single element to the end of a list, meaning the length of the list increases by one, `.extend()` adds each element of its argument to the list, extending it.
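Here is a tiny illustration of the difference:

nums = [1, 2]
nums.append([3, 4])   # nums is now [1, 2, [3, 4]]: a nested list
nums = [1, 2]
nums.extend([3, 4])   # nums is now [1, 2, 3, 4]: one flat list

Applying this to our “selftext” column: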
sed_tokens = []
for text in sed.selftext.dropna():  # .dropna() skips empty submissions, which are stored as NaN
    sed_tokens.extend(word_tokenize(text))
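We can quickly check how large our corpus of tokens is:

# Total number of tokens across all submissions
len(sed_tokens)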
The NLTK `Text()` class
Now, let’s have a look at our data. NLTK provides a `Text()` class, which is a “wrapper” that allows for initial exploration of texts. It supports counting, concordancing, collocation discovery, and so on.
from nltk.text import Text

sed_t = Text(sed_tokens)
(Using `help(Text)` we can print out the “docstring” of NLTK’s `Text()` class, as well as all the things you can do with this object. Have a read through this to see what it allows you to do!)
Concordances
One of the most basic, but quite helpful, ways to quickly get an overview of the contexts in which a word appears is through a concordance view.
sed_t.concordance('game', width=115)
This allows us to see all the occurrences of the word “game” in context. In this kind of view, words appear in their rawest form, as mere rows in a table: chunks of interactions stripped from the surrounding context of a live conversation. A number of elements can nevertheless already be identified at this stage. Researchers can explore the topics discussed in each post, as well as the form they are expressed in, such as the use of a certain type of language, imagery, tone, or specific rhetorical figures. For instance, we see that “game” is a term typically used to describe someone’s confidence or self-esteem in overcoming obstacles such as fear, self-doubt, lapses in focus, and so on.
Collocations
A collocation is a sequence of words that often appear together. The `.collocations()` method can find these in our data.
sed_t.collocations()
This yields the following terms, which again point to some themes in our data. Besides “inner game”, bigrams such as “eye contact”, “dance floor” and “body language” tell us something about the kinds of situations and contexts described in this data. “Shit test” is another form of jargon typical of dating coaching discourse (this is what the Urban Dictionary says about it).
eye contact; n't know; feel like; dance floor; body language; n't want; last night; social proof; high school; n't really; blah blah; little bit; even though; pretty much; first time; inner game; could n't; next day; shit test; months ago
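If you want the collocations as a Python list rather than as printed output, for instance to store or filter them, recent versions of NLTK (3.4.5 and later) also offer a `collocation_list()` method:

# The strongest collocations as a list rather than printed output
sed_t.collocation_list()[:5]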
Word plotting
Using the `dispersion_plot()` method we can easily visualize where particular words appear throughout the text. We have to feed it a list of words.
Sorting our DataFrame by date allows us to look “through time” to see whether particular words start (dis)appearing in our data (see the sketch after the example below).
sed_t.dispersion_plot(["dating", "friendzone"])
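Note that we sorted `sed` by score earlier, so token positions currently reflect popularity rather than time. Here is a minimal sketch of a chronological version, re-sorting by our “created_datetime” column and re-tokenizing before plotting:

# Re-sort the DataFrame chronologically and rebuild the token list
sed_by_date = sed.sort_values(by=['created_datetime'])
sed_tokens_by_date = []
for text in sed_by_date.selftext.dropna():
    sed_tokens_by_date.extend(word_tokenize(text))

# The x-axis now roughly corresponds to time, oldest posts first
Text(sed_tokens_by_date).dispersion_plot(["dating", "friendzone"])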
Similar words
Finally, using the `.similar()` method we can look at “distributional similarity”: finding other words which appear in the same contexts as the specified word.
sed_t.similar('game')
This yields:
and time girl place friends but it girls her night conversation that me way friend so kino life this bar
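Since “conversation” shows up in this list, it evidently shares contexts with “game”; the `common_contexts()` method shows us what those shared contexts actually are (the word pair here is just an illustrative choice):

# In which contexts do "game" and "conversation" both occur?
sed_t.common_contexts(['game', 'conversation'])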
These methods are not very granular; rather, they allow you to begin exploring a textual corpus. Note down the patterns you see and want to pursue further. In the next post, we will use a simple algorithm, tf-idf (term frequency-inverse document frequency), to compare language use across comparable subreddits.
Sources
Gerbaudo, P. (2016). From Data Analytics to Data Hermeneutics: the Continuing Relevance of Interpretive Approaches. Digital Culture & Society, 2(2). https://doi.org/10.14361/dcs-2016-0207