Word Vectors Visualization

This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.

Word Vectors Plotting and Exploration

This walkthrough offers options for exploring word2vec models, using several methods for plotting and visualization.

This code uses the “wwo-regularized.bin” model as an example, with query terms chosen to work with that model. You can read in any model you like, and update the terms to match your own interests. Depending on your model and the terms you select, your results will not necessarily be revelatory, but they should give you different ways to think about and engage with your models.

This walkthrough assumes you are familiar with the basics of setting up new RStudio sessions, running code, and reading in models. If you would like more information on these, see the introductory walkthroughs in this folder. If you are working on your own computer, you will need to first install the necessary packages outlined in the “Word Vectors Installation, Training, Querying, and Validation” file.

These walkthroughs are drawn from code published by Ben Schmidt in two vignettes: an “Introduction” and an “Exploration” accompanying the wordVectors package, as well as code samples in Schmidt’s post “Vector Space Models for the Digital Humanities”. The code has been generalized and lightly modified, and additional explanations have been added.

Getting started

Checking your working directory

getwd()
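
If getwd() does not show the folder that contains your data and walkthrough files, you can point R at the right place with setwd(); the path below is only a placeholder to replace with your own.

# Only needed if your working directory is not already where you want it; replace the placeholder path with the location of your walkthrough files.
# setwd("path/to/your/walkthrough-folder")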

Loading packages

library(tidyverse)
library(tokenizers)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)
library(lsa)
library(ggplot2)
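
If any of these packages fail to load, they have probably not been installed yet; installation is covered in the “Word Vectors Installation, Training, Querying, and Validation” walkthrough. As a quick reminder, most of these packages come from CRAN, while wordVectors is typically installed from GitHub with devtools. The lines below are commented out so that they only run if you need them.

# Run these only if a package is missing; installation is ordinarily handled in the installation walkthrough.
# install.packages(c("tidyverse", "tokenizers", "tidytext", "magrittr", "devtools", "tsne", "lsa", "ggplot2"))
# devtools::install_github("bmschmidt/wordVectors")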

Reading in an existing model file

w2vModel <- read.vectors("data/wwo-regularized.bin")

Plotting the terms closest to a query set

This is a way to visualize the words closest to a set of terms that you choose, giving you a view of your results that’s a bit different from the lists of terms and their cosine similarities that we’ve been working with. You lose the information on specific cosine similarities, but you gain the ability to review the whole set of results simultaneously, without the linearity of a list. You can also start to explore whether there might be patterns in the relationships among the closest terms in your results.

This visualization relies on principal component analysis, a statistical procedure that makes it possible to plot a set of items—in this case, the terms closest to your query terms—as a way to identify patterns in a dataset. The details are fairly complex (and we encourage you to read more about them if you plan to go further with this kind of analysis), but, essentially, principal component analysis reduces a large set of variables into a smaller set that still contains much of the information from the original. Each of the principal components is calculated to account for as much of the variation in the initial dataset as possible: the first principal component covers the most variation, the second the next most, and so on. Organizing data by these principal components makes it possible to reduce the dimensionality of your analysis while losing relatively little information, since you’re keeping the components that represent the greatest amount of variance in the original variables.

Looking at a set of animal words in the regularized Women Writers Online model, you can see that the terms closely related to foods cluster together and that the second principal component seems to be capturing something about domesticated vs. wild animals.

You can include as many query terms as you like, following the pattern in the example by adding new terms in quotation marks, separated by commas. You can also specify how many words to plot by changing the number at the end of the second line of code.

# Here is where you define the terms that you want to investigate. 
query_words <- c("dog","pig","horse","cat","fish","goose","lion","tiger")

# You can determine how many words you want to plot by changing the number at the end of this line of code (set to 100 in the example).
query_set <- closest_to(w2vModel, w2vModel[[query_words]], 100)

# This produces a vector space model object with the words and vectors defined above.
model_object <- w2vModel[[query_set$word, average=FALSE]]

# This will plot the model object using principal component analysis. 
plot(model_object, method="pca")
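
If you would like more control over the plot than the built-in method offers (for example, to label the points with ggplot2), you can run the principal component analysis yourself with R’s prcomp function and plot the first two components. This is a minimal sketch using the model_object created above; the signs and scaling of the components may differ slightly from the built-in plot.

# A minimal sketch: run PCA directly with prcomp and plot the first two principal components, labeling each point with its word.
pca_results <- prcomp(model_object)

# Build a data frame of the first two principal components for plotting.
pca_df <- data.frame(word = rownames(pca_results$x),
                     PC1 = pca_results$x[, 1],
                     PC2 = pca_results$x[, 2])

# Plot with ggplot2, using the words themselves as labels.
ggplot(pca_df, aes(x = PC1, y = PC2, label = word)) +
  geom_text(size = 3)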

Defining and exploring clusters with dendrograms

This is a way of looking for clusters among the words closest to a set of terms that you choose. The result is a dendrogram, or tree diagram. This is a bit like the clustering you’ve seen already, but instead of operating on the level of the whole model, it instead clusters the results that are closest to a specified set of terms in vector space.

The clustering algorithm here is also different, using the hclust function to perform hierarchical clustering analysis. By default, hclust uses the complete linkage agglomeration method: at each stage in the clustering process, the two nearest clusters are merged (with the distance between two clusters measured by the distance between their farthest members), until only a single cluster remains. Clusters that merge earlier in this process are closer to each other. An important difference between hierarchical clustering and the k-means clustering we’ve done is that with hierarchical clustering you don’t specify the number of clusters to group your results into.

To read a dendrogram, look for the lines showing the height at which any two words are joined together; the shorter the height, the closer the two words are clustered. You can also look for broader clusters of terms among the branches (called “clades”) of the tree diagram. The greater the height of the branch points for each clade, the greater the difference between them.

You can include as many query terms as you like, following the pattern in the example by adding new terms in quotation marks, separated by commas. You can also specify how many of the words nearest to each of your query terms to include in the resulting dendrogram.

# Here is where you define the terms that you want to investigate.
query_words <- c("dress","lace","frock","silk","hat")

# This establishes the set of terms to be used in your dendrogram. For each term you've defined in `query_words`, the code below selects a specified number of its closest words. You can determine the maximum number of words to plot by changing the number below. In the sample code, the number of selected words is set at 10. With 5 words in `query_words`, the `subset` will contain at most 50 words, since some may be closest to more than one of your query terms.
query_set <- lapply(query_words, 
       function(query_term) {
          nearest_words <- w2vModel %>% closest_to(w2vModel[[query_term]], 10) 
          nearest_words$word
        }) %>% unlist
subset <- w2vModel[[query_set, average=FALSE]]

# This line uses the `hclust` function for hierarchical clustering analysis to produce the dendrogram.
subset %>%
  cosineDist(subset) %>% 
  as.dist %>%
  hclust %>%
  plot
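
If you want to work with the clusters as data rather than only reading them off the tree, you can cut the dendrogram into a chosen number of groups with R’s cutree function. This is a minimal sketch using the same subset built above; the number of groups (5 here) is an arbitrary value for you to adjust.

# A minimal sketch: store the hierarchical clustering rather than plotting it immediately.
clustering <- subset %>%
  cosineDist(subset) %>% 
  as.dist %>%
  hclust

# `cutree` returns a named vector assigning each word to a numbered group; the number of groups (5 here) is an arbitrary choice.
cluster_assignments <- cutree(clustering, k = 5)

# Sorting by group number makes the groupings easier to read.
sort(cluster_assignments)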

Plotting terms associated with multiple keywords

This option makes it possible to plot the words closest to multiple input terms with principal component analysis (see above for a quick explanation of how PCA works). This method allows you to explore more complex word relationships for cases in which the concepts you wish to examine are better expressed with multiple terms.

As with the example above, the code below first restricts the calculation to the most common words in the model (3,000 in the example); it then filters the results down to the n words closest to any of the input terms (where “n” is a number that you can define; the example sets it to 50), and then plots those using principal component analysis.

You can include as many query terms as you like, following the pattern in the example by adding new terms in quotation marks, separated by commas. You can also show more or fewer words in the resulting plot and adjust the frequency threshold for which terms are part of the calculations.

# Here is where you define the terms that you want to investigate.
query_words <- c("sight","taste","touch","smell","hearing")
query_set <- w2vModel[[query_words, average=FALSE]]

# Here, `model[1:3000,]` restricts to the 3000 most common words in the set; you can adjust this number to include more or fewer words.
common_similarities <- w2vModel[1:3000,] %>% cosineSimilarity(query_set)

# Here is where you set the number of terms that you want to plot; to adjust this, change the number after the less-than sign (set to 50 in the example).
high_similarities_to_query_set <- common_similarities[rank(-apply(common_similarities, 1, max)) < 50,]

# This line plots your terms with principal component analysis.
high_similarities_to_query_set %>% 
  prcomp %>% 
  biplot
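
If you would also like a simple tally of which query term each of the plotted words is most similar to, you can look across the columns of the similarity matrix. This is a minimal sketch using the high_similarities_to_query_set object created above.

# A minimal sketch: for each plotted word, find the query term (column) with the highest cosine similarity to that word.
closest_query_term <- apply(high_similarities_to_query_set, 1, 
                            function(similarities) {
                              colnames(high_similarities_to_query_set)[which.max(similarities)]
                            })

# A quick count of how the plotted words divide among the query terms.
table(closest_query_term)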

Credits and thanks

This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. 

These walkthroughs are drawn from code published by Ben Schmidt in two vignettes: an “Introduction” and an “Exploration” accompanying the wordVectors package, as well as code samples in Schmidt’s post “Vector Space Models for the Digital Humanities”. The code has been generalized and lightly modified, and additional explanations have been added for this walkthrough.

This walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.