Word Vectors Starter Queries

This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.

Getting started

Using this file

This file is a first introduction to word vectors; we’ll use it to practice querying models before we move on to training models of our own.

Reminder on running code

To run a single line of code from an R Markdown file, put your cursor anywhere in that line of code and then hit command-enter or control-enter. If you want to run all of the code in a code snippet, you can hit the green triangle button on the right. If you want to run a particular section of code, highlight the section you want to run and hit command-enter or control-enter.

Much of our code will run almost instantly, but some things will take a few seconds or minutes, or even longer. You can tell code is still running if you see a red stop sign in the top-right corner of the console. If you’d like to stop a process, you can hit this stop sign. You will know that the code has been run or successfully stopped when you see a new > prompt in the bottom of the console.

Opening a new session: checking project and working directory

As a reminder, at the start of any new session, you should make sure that you have the right project open and you should check your working directory.

If you opened the “WordVectors” project file first, then you should already be working in the “WordVectors” project space. To confirm that you have the correct project open, check the top-right corner of the RStudio screen and make sure it says “WordVectors” in the project selection dropdown. If the project is not open, you can open it by going to File then Open Project... in the menu bar at the top of RStudio, or by clicking on the project file.

At the start of each new session, you should check your working directory with the code below. As long as you opened this file from the WordVectors project, your working directory should be in the right place: the “WordVectors” folder. If you do need to change your working directory, you can do so with the setwd() function.

getwd()
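If the working directory is not where you expect, you can point it at the right folder with setwd(). The path below is only a placeholder, since the location of the "WordVectors" folder will vary from machine to machine:

```r
# Replace the placeholder path with the actual location of the
# "WordVectors" folder on your system before running this line
setwd("~/WordVectors")
```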

Opening a new session: loading packages

All the packages you will need for this exercise have been installed ahead of time on our RStudio Server instance, but you’ll want to load them using the library() function at the start of each new session.

When you run this code for the first time after you start a session, you’ll see a lot of text go through the console, possibly with some warning messages. Even if the text looks alarming, it probably won’t cause any issues. To confirm that the packages have loaded correctly, you can run this code a second time—if you see the code pop into the console with no additional text, that means the packages loaded properly and you are all set.

library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)

Working with word2vec models

Reading in existing models

Before you train models of your own, you’ll start by reading in existing models. We’ve provided a set of sample models (all the files that end in ‘.bin’) in the data folder; to read one of them in, edit and run the code below. You can also use this code to read in models that you’ve trained when you start a new session. And, you can even read in models trained by others, if you upload their .bin file to your data folder using the Upload button near the top of the Files menu (make sure you are already inside the data folder when you do this!).

We’ve filled in a sample model to get you started. If you want to use a different model, change the file path in the code below to select a different ‘.bin’ file. Navigating file paths can be a bit confusing, but fortunately there is a trick you can use. Delete the text inside of the quotation marks below (but don’t delete the quotation marks!) and then hit tab. You should get a pop-up with a view of the folder system you’re more used to navigating; you can use this to fill in file paths whenever you need to.

w2vModel <- read.vectors("data/wwo-regularized.bin")
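Once the model has been read in, you can do a quick sanity check before querying it. A model loaded with the wordVectors package behaves much like a matrix, with one row per word and one column per dimension, so standard matrix functions should work; for example:

```r
# How many words and how many dimensions does the model have?
dim(w2vModel)

# Peek at the first ten words in the model's vocabulary
rownames(w2vModel)[1:10]
```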

Visualizing models

We can get a glimpse of what the model looks like by plotting it in two dimensions. Keep in mind that the model actually has many more dimensions, so we are, in effect, flattening it. Though the visualization may be somewhat difficult to read, you should be able to see that similar words—words that are near each other in vector space—tend to clump together. The code below will likely take a minute or two to run, and your results will appear in the “Plots” window to the right (you can hit the “Zoom” button to get a better view).

As the code is running, you’ll see a set of lines in the console that will say something like “Epoch: Iteration #100 error is: 20.3048394873336”; note that this is not an error message! As the code runs, the values for “error” should decrease—this reflects increasing confidence about how to plot the vector representation.

w2vModel %>% plot(perplexity=10)

Clustering

The following script provides a way to cluster words that are near each other in vector space, using the “k-means” clustering algorithm. Below, we choose 150 centers, or 150 points around which to cluster words. Then we select 10 random clusters and 15 words from each cluster to view. This code will also take a minute or two to run. You can change the number of centers, the number of clusters to view, or the number of words to see—you can also increase the number of iterations (the number of times the algorithm should adjust where the centers are and where terms are positioned in relation to those centers).

centers <- 150
clustering <- kmeans(w2vModel,centers=centers,iter.max = 40)

sapply(sample(1:centers,10),function(n) {
  names(clustering$cluster[clustering$cluster==n][1:15])
})
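The sample above shows only 15 words from each of 10 random clusters. If one cluster looks especially interesting, you can list every word assigned to it; the cluster number below is just an example, and your numbering will differ from run to run because k-means starts from random centers:

```r
# List all the words in a single cluster; "5" is an arbitrary example,
# and the contents of cluster 5 will change each time kmeans() is re-run
names(clustering$cluster[clustering$cluster == 5])
```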

Closest to

Now that you’ve had a chance to think about the corpus as a whole, it’s time to start investigating individual words. To find the words closest to a particular word in vector space, enter a term between the quotation marks and then run the code below. You’ll notice that the output shows up in the console. If you want to see more words, increase the number in the second argument. Make sure not to delete the quotation marks, and enter your word in lowercase.

w2vModel %>% closest_to('grace', 10) 
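You can also save the results of a query to reuse later. In the wordVectors package, closest_to() returns a data frame, so (assuming that return format) you can store it in a variable and pull out just the words without their similarity scores:

```r
# Store the ten closest words to "grace" for later use
grace_words <- w2vModel %>% closest_to('grace', 10)

# The result is a data frame; the "word" column holds just the terms
grace_words$word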

Closest to two terms

You might also want to see the words closest to a combination of two words. Notice that the code below will open a new window with the results because of the view() function. If you prefer to see your results in this format, you can paste “%>% view()” at the end of the code above; or, if you prefer to see your results in the console, you can delete “%>% view()” from the code below. Note that the code below also shows 20 results, instead of 10.

w2vModel %>% closest_to(~'grace'+'young', 20) %>% view()

Closest to the space between two terms

Or, you might want to look at the space between two terms, to see which words are similar to one term but not another:

w2vModel %>% closest_to(~'grace'-'beauty', 20) 
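You can also combine addition and subtraction in a single formula to pose analogy-style queries, asking which words relate to one term the way a second term relates to a third. The terms below are only an illustration borrowed from classic word2vec examples, and may behave quite differently (or be absent) in this corpus:

```r
# An analogy-style query: which words are to 'woman' as 'king' is to 'man'?
# These example terms may not produce clean results in every corpus
w2vModel %>% closest_to(~'king' - 'man' + 'woman', 10)
```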

In our next walkthrough, we’ll cover training models, as well as a few other things you can do with word2vec.

Credit and thanks

This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the “Humanities Data Analysis” course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell’s “Literature and Digital Diversity” class, both at Northeastern University.

This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.