Analyzing Reddit communities with Python — Part 6: Word Embeddings
Tracing semantic relations in Reddit data
In this series, I explain data science and Natural Language Processing (NLP) approaches in Python for people with little to no programming knowledge, who want to engage in critical analyses of online discourse communities.
In previous posts, I discussed how to use the Reddit API to retrieve data, how to select a subreddit, using Pandas and NLTK to explore your data, how to use TF-IDF to compare subreddits, and doing topic modeling to find latent topics in the data.
This time, we look at another popular approach in NLP: Word Embeddings. I will explain what word embeddings are for newcomers, and then use gensim to create word embeddings of the Reddit dataset we’ve been using.
After working through this post, you’ll be able to:
- Understand what Word Embeddings are, and how they make use of Neural Networks to encode semantic relations;
- Use Gensim’s word2vec method to create word vectors for a corpus;
- Use these word vectors to reflect on implicit norms and values in your data;
- Visualize word embeddings in two dimensions using t-SNE.
An introduction to word embeddings
V(King) - V(man) + V(Woman) ~ V(Queen)
This is the classic example of what word embeddings allow you to do with textual data: word algebra. In essence, doing “calculation” with words. How does this work?
You should first know something about how words typically are represented in NLP. Remember the count vector representations we saw in our topic models? This is called a “bag of words” model, and it is basically a table with documents as rows and words as columns, showing you how often these words appear in each document.
Each column in this table (one per word) is essentially a “vector”: a list of numbers that describes that word in terms of how often it appears in each document.
This kind of representation becomes a problem as our document collection grows: the vectors get very sparse (i.e. lots of zeroes). Worse, they don’t tell us anything about the relationships between words!
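As a quick illustration (not part of our Reddit workflow), here is what such a count table looks like for three toy sentences, using CountVectorizer from recent versions of scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "my guitar gently weeps"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # documents as rows, words as columns

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # plenty of zeroes, even with three tiny documents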
We did encode some of these word relationships when we looked at TF-IDF: essentially, this allowed us not just to encode how often a word appears in each document, but also how important the word was to that document as compared to the entire corpus.
Word embeddings take this idea further: they represent each word with a “dense vector” (short, and with few zeroes) that can capture semantic relations between words. Once we have this kind of representation, we can start computing similarity measures between words, e.g. using cosine similarity.
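As a tiny sketch of what such a measure looks like (the two vectors below are made up, purely to show the computation):
import numpy as np

def cosine_similarity(vec_a, vec_b):
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

vec_a = np.array([0.2, 0.8, -0.4])
vec_b = np.array([0.1, 0.9, -0.3])
print(round(cosine_similarity(vec_a, vec_b), 3))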
In word embeddings, we treat semantic relations as a machine learning problem. Essentially, we teach a neural network how to predict the similarity of words by throwing in all of the words in our collection — as well as their local contexts (that is, the words directly surrounding our target word). The learning that goes on within that network is about nudging all these vectors so that they get better and better at being predicted.
Word2vec is Google’s version of this method, and it has a couple of implementations. One is the skip-grams algorithm, which tries to predict context words given a target word. The continuous bag of words model does the opposite: it predicts a target word given the context.
What is the skip-grams algorithm?
When using skip-grams we try to predict the context according to a center word. Here’s what we tell our neural network: given a specific word, look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.
Note that “nearby” actually refers to a “window size” parameter: e.g. 3 words to the left and 3 words to the right. The output probabilities are going to relate to how likely it is to find each vocabulary word near our input word. So if I feed in “cat”, it’s going to be more probable to find words like “litterbox” and “paw” than “guitar” or “tank”.
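To make the window concrete, here is a small sketch of how (center word, context word) training pairs come out of a sentence. Gensim does all of this internally, so this is for illustration only; the window of 2 is arbitrary.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]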
Prediction and loss functions
For each estimation our neural network makes, we input one word, and the model tries to predict its surrounding words based on some window size, or radius M. So we calculate the probability (p) of some target word (e.g. wₜ₋₃) given the center word (wₜ), and we try to maximize that probability.
To do that, we also need to create a loss (or cost) function that tells us if we’re doing well. This is what you do in Machine Learning: you try to make an algorithm better at predicting something. To do that you have to tell it when it’s screwing up.
Look at the loss function in the image above. The “-t” part means all the words that aren’t the word at index t, i.e. the words surrounding t. If we could predict those context words perfectly from t, we’d have no loss. So our goal is to adjust the vector representations of all words so as to minimize the loss. Doing this, we end up with one big probability distribution per word.
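To make the loss concrete, here is a toy NumPy sketch of the probability and loss for a single (center, context) pair. The random matrices U and V stand in for the vectors the network actually learns, and this is the plain softmax formulation; in practice word2vec uses shortcuts such as negative sampling.
import numpy as np

vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
V = rng.normal(size=(vocab_size, dim))  # center-word vectors
U = rng.normal(size=(vocab_size, dim))  # context-word vectors

center, context = 3, 7                  # arbitrary word indices

scores = U @ V[center]                          # dot product of every context vector with the center vector
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the whole vocabulary

loss = -np.log(probs[context])          # the quantity we want to minimize for this pair
print(round(loss, 3))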
Above you see the hyperparameters we set when running word embeddings in Python. As you see, in this example we’re just using a context of 4. Our neural network is then going to learn the statistics from the number of times each pairing shows up. So in our example, the network is going to get training samples of “fox” and “brown”, but not of “fox” and “dog”.
Then, when the training is finished, if you give it the word “fox” as input, then it will output a much higher probability for “brown” than it will for “dog”.
How do Word Embeddings use neural networks?
So what’s going on behind the scenes here?
As noted, we can’t feed a word to a neural network as a plain text string; we need some kind of vector. To do this, we build a vocabulary of words from our training documents. Let’s say we have a vocabulary of 10,000 unique words.
We’re going to represent an input word (like the word “fox” used in the image above) as a one-hot vector. This vector will have 10,000 components, one for every word in our vocabulary. We place a “1” in the position corresponding to the word “fox”, and 0s in all of the other positions.
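A one-hot vector over a toy five-word vocabulary, as a quick sketch (a real vocabulary would of course have thousands of entries):
import numpy as np

vocabulary = ["brown", "dog", "fox", "jumps", "quick"]

def one_hot(word, vocabulary):
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("fox", vocabulary))  # [0. 0. 1. 0. 0.]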
When training this network on word pairs (this is not represented here), the input is a one-hot vector representing the input word, and the training output is also a one-hot vector, representing a word actually found in its vicinity. In this iterative learning process, the hidden layer of the neural network changes slightly with every learning cycle, based on the feedback we give it. After all, we want it to get better at predicting.
But when you evaluate the trained network on an input word (what you see in the image here), the output is a probability distribution: a single vector, also with 10,000 components, containing for every word in our vocabulary the probability that it appears at a random position within the window around our input word.
By the way: for our example with 300 features and a vocab of 10,000 words, that’s 3 million weights in the hidden layer and output layer!
Back to our Python example. Here, we’re learning word vectors with 300 features (300 is what Google used in their original model). These ‘features’ refer to the neurons in the hidden layer of our network. They act as 300 ‘dimensions’ of the word we’re throwing in. So in our hidden layer, each word will have a vector with 300 numbers that incorporate information about its neighboring words.
In the end, it is this hidden layer of our neural network that we end up using as our word embeddings model (not the output layer).
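In gensim terms, that hidden layer is exactly what you get back when you ask a trained model for a word’s vector. A quick check, assuming model is a trained Word2Vec model like the one we build below, and assuming “fox” made it into its vocabulary:
vector = model.wv['fox']  # the hidden-layer weights for 'fox', i.e. its embedding
print(vector.shape)       # e.g. (300,) if the model was trained with 300 features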
Doing math with words
So, in word embeddings, you can think of each vector as incorporating a set of semantic characteristics. For instance, let’s take the classical example again:
V(King) - V(man) + V(Woman) ~ V(Queen)
Following our previous workflow, each word would be represented by a 300-dimensional vector. The point is that each of these numbers, after training, encodes some semantic information about that word. We can’t inspect these meanings easily, but for instance, `V(King)` might have semantic characteristics of royalty, kingdom, and masculinity in its vector. `V(man)` will have masculinity, but not the other characteristics.
So when we subtract the vector for “man”, the characteristics of masculinity get cancelled out of the vector. When we then add the vector for “Woman”, which carries “feminine” characteristics, the resulting vector will be most similar to that of V(Queen).
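With a trained gensim model, the same analogy can be expressed in a single call (assuming all three words actually occur in the model’s vocabulary, which is not guaranteed for a smallish Reddit corpus). We’ll build a more general helper for this further below.
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# ideally something like [('queen', 0.71)] -- illustrative output, not from our data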
Word Embeddings and social biases
The example we’ve been using about kings and queens is a sort of “neutral” one. However, it is not difficult to imagine that, if word embeddings encode the meaning of their input data, the biases in that input data will become part of the model!
In “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings” Bolukbasi et al. discussed how these biases are encoded in word embeddings. There are a lot more sources about this problem and how to tackle it (see the source list below).
Instead of debiasing word embeddings, however, we can also use them to expose the biases of a particular community, which is what we’ll be doing here.
Word embeddings in Python
After this extensive explanation, let’s dive into Python.
TRIGGER WARNING: The language in the dataset I am using comes from The Red Pill, a community known for its misogynist language.
To create our word embeddings model, we need a text corpus (split up into sentences or other chunks). The output will be a set of “vectors” in N dimensions. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as two-dimensional space), or perform linear algebra on the vectors to find out how words are related.
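The model will expect this corpus as a nested list: a list of sentences, each itself a list of tokens. A minimal sketch of getting raw Reddit text into that shape with NLTK; raw_texts is a placeholder for your own list of posts or comments, and you will probably want a more careful preprocessing pipeline:
# requires nltk.download('punkt') the first time you run it
from nltk.tokenize import sent_tokenize, word_tokenize

raw_texts = ["This is one post. It has two sentences.",
             "Another post goes here."]

sentences = []
for text in raw_texts:
    for sent in sent_tokenize(text):
        tokens = [tok.lower() for tok in word_tokenize(sent) if tok.isalpha()]
        if tokens:
            sentences.append(tokens)

print(sentences)
# [['this', 'is', 'one', 'post'], ['it', 'has', 'two', 'sentences'], ['another', 'post', 'goes', 'here']]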
Assuming you have your data in a nested list (a list of sentences, each composed of a list of tokens), we can create the skip-gram Word2Vec model by setting a few hyperparameters and running it like so:
import gensim
from gensim.models import Word2Vec

# Word vector dimensionality (how many features for each word)
num_features = 500
# Minimum word count to be taken into account
min_word_count = 2
# Number of threads to run in parallel (set this equal to your number of cores)
num_workers = 2
# Context window size
context = 10
# Downsample setting for frequent words
downsampling = 0  # 1e-2
# Seed for the random number generator (to create reproducible results)
seed_n = 1
# Skip-gram = 1, CBOW = 0
sg_n = 1

model = Word2Vec(YOUR_DATA,
                 workers=num_workers,
                 size=num_features,
                 min_count=min_word_count,
                 window=context,
                 sample=downsampling,
                 seed=seed_n,
                 sg=sg_n)
That was it! We can save the model to disk as a gensim object so we can reuse it later.
model.save("word2vec.vec")
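Later on (in a new session, for instance), the saved model can be loaded back like this:
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec.vec")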
We can easily print the number of terms we have in our vocabulary:
print('{:,} terms in the vocabulary.'.format(len(model.wv.vocab)))
Getting related terms
With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let’s create a function that retrieves related terms to some input.
def get_related_terms(token, topn=20):
    """
    Look up the topn most similar terms to token and print them as a
    formatted list.
    """
    for word, similarity in model.most_similar(positive=[token], topn=topn):
        print(word, round(similarity, 3))

get_related_terms(u'woman')
Word algebra
Let’s engage in some word algebra (like the famous example “king - man + woman = queen”). The mathematical procedure works as follows:
- Provide a set of words or phrases you want to add or subtract.
- Look up the vectors that represent those terms in the word vector model.
- Add and subtract those vectors to produce a new, combined vector.
- Look up the most similar vector(s) to this new, combined vector via cosine similarity.
- Return the word(s) associated with the similar vector(s).
We’ll create a function that does this for us, and run it using the words “women” and “relationships”. This is where word embeddings can be useful not as a “neutral” representation of meaning, but as a method to explore social biases in a community.
def word_algebra(add=[], subtract=[], topn=1):
    """
    Combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s).
    """
    answers = model.most_similar(positive=add, negative=subtract, topn=topn)
    for term, similarity in answers:
        print(term)

word_algebra(add=['women', 'relationship'])
As we can see here, there are some subtle differences when adding “dating” and “men” versus “dating” and “women” and looking at the top-10 words whose vectors come closest — such as the word “physically” for men, or “emotionally” and “excitement” for women.
The research you can do here should be based on your domain knowledge of a dataset, as well as the design of interesting research questions to pursue and explore using methods like these.
Word Vector Visualization with t-SNE
What can we do to visualize our model? One option is t-Distributed Stochastic Neighbor Embedding, or t-SNE: a dimensionality reduction technique that helps with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation, keeping the relative distances between points as similar as possible in the high-dimensional and the low-dimensional space.
Scikit-learn provides a convenient implementation of the t-SNE algorithm with its `TSNE` class. First, we need to create a DataFrame with the terms as the row labels, and the 500 dimensions of the word vector model as the columns.
import pandas as pd

# build a list of the terms, integer indices, and term counts from the word2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count) for term, voc in model.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab.sort(key=lambda x: x[2], reverse=True)

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the vectors as data, and the terms as row labels
word_vectors = pd.DataFrame(model.wv.syn0norm[term_indices, :], index=ordered_terms)

word_vectors.head()
We then fit and transform t-SNE on this DataFrame. The output will be a set of coordinates in two-dimensional space, which we can then plot.
from sklearn.manifold import TSNE

tsne = TSNE()
tsne_vectors = tsne.fit_transform(word_vectors.values)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(word_vectors.index),
                            columns=['x_coord', 'y_coord'])
tsne_vectors['word'] = tsne_vectors.index
Finally, we plot using the Bokeh library:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()

# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width=800,
                   plot_height=800,
                   tools='pan, wheel_zoom, box_zoom, box_select, reset',
                   active_scroll='wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@word'))

# draw the words as circles on the plot
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')

# configure visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# show the plot
show(tsne_plot)
The output shows us an interactive “cloud” of related words we can search through. You will find clusters of words pertaining to particular topics in the data. For instance, the one I am hovering over in the screenshot below seems to be about politics and social justice.
Zooming in further, we can see more granular clusters of related words. In the image below, I am zooming in on the words around politics.
Reflection
king:queen::man:[woman, Attempted abduction, teenager]
This is one example I found in a paper, where the authors note that they found a “strange word embedding”. Instead of being surprised at such findings, we should recognize that word embeddings always encode commonsensical knowledge. And anytime we are in the realm of the commonsensical, we are in the realm of ideology. This is particularly interesting for social media communities like those on Reddit, as it can give us insight into shared world-building through language.
Sources
Bolukbasi, Tolga, et al. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” Advances in Neural Information Processing Systems 29, edited by D D Lee et al., Curran Associates, Inc., 2016, pp. 4349–57, http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf.
Caliskan, Aylin, et al. “Supplementary Materials for: Semantics Derived Automatically from Language Corpora Contain Human-like Biases.” Science, vol. 356, no. 6334, 2017, pp. 183–86, doi:10.1126/science.aal4230.
Brunet, Marc-Etienne, et al. Understanding the Origins of Bias in Word Embeddings. 2019.
Flowers, Natasha, and Sam Temlock. Reducing Gender Bias During Fine-Tuning of a Pre-Trained Language Model. 2021, pp. 1–10.
Garg, Nikhil, et al. Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. 2017, pp. 1–33.
Gordon, Joshua, et al. “Studying Political Bias via Word Embeddings.” The Web Conference 2020 — Companion of the World Wide Web Conference, WWW 2020, 2020, pp. 760–64, doi:10.1145/3366424.3383560.
Kaneko, Masahiro. Gender-Preserving Debiasing for Pre-Trained Word Embeddings. 2019, pp. 1641–50.
Swinger, Nathaniel, et al. What Are the Biases in My Word Embedding? 2019, pp. 305–11.
Zhao, Jieyu, et al. “Gender Bias in Contextualized Word Embeddings.” Proceedings of NAACL-HLT 2019, 2019, pp. 629–34.