Word Vectors: Installation, Training, Querying, and Validation
This walkthrough is a static version of an R notebook published by the Women Writers Project. In an environment such as RStudio, the code blocks below would be editable and interactive. For the full set of notebooks and more context on their usage, visit our GitHub repository.
Getting started
Using this File
This file is an introduction to training and querying a model using word2vec with R and RStudio on your own computer.
This is an R Markdown file, which means that it contains both text (what you’re reading now) that can be formatted for display on the web or as a PDF file, and snippets of code, which you can run right from the file.
Before running any code, you should do a quick check on your preferences for how RStudio will handle R Markdown files. Go to the Tools menu above, select “Global Options”, then select “R Markdown,” and make sure that “Show output inline for all R Markdown documents” is not selected.
To run a single line of code from an R Markdown file, put your cursor anywhere in that line of code and then hit command-enter or control-enter. If you want to run all of the code in a code snippet, you can hit the green triangle button on the right. If you want to run a particular section of code, highlight the section you want to run and hit command-enter or control-enter.
Much of our code will run almost instantly, but some things will take a few seconds or minutes, or even longer. You can tell code is still running if you see a red stop sign in the top-right corner of the console. If you’d like to stop a process, you can hit this stop sign. You will know that the code has been run or successfully stopped when you see a new > prompt at the bottom of the console.
If you don’t see the stop sign but want to cancel a process, you can also hit control-C.
If you are running code in a block line by line, your cursor will automatically go down to the next line after each is run, so you can move through the block by repeatedly hitting command-enter or control-enter.
Below is a code block; try running the code below.
print("Put your cursor on this line and try running it!")
print("Now run this line.")
print("Now run all three lines at once by selecting them and running them together or by hitting the green play triangle")You can also run code directly in the console by typing or
pasting it in and hitting enter. You will get the
same results, but if you want to save code that you have written,
it is better to keep it in the R Markdown file, since edits there
will be saved. On the other hand, if you prefer to run some code
but not make changes to your file, you can just run that
in the console.
Projects
Projects are a way to organize your work in RStudio. If you opened the “WordVectors” project file first, then you should already be working in the “WordVectors” project space—and, as long as you have this project open, your files should be where you expect them to be. It will usually be easiest to start any session by opening the “WordVectors.Rproj” file, at least while you are getting used to working in RStudio.
To confirm that you have the correct project open, check the top-right corner of the RStudio screen. If the project is not open, you can open it by going to File then Open Project... in the menu bar at the top of RStudio, or by clicking on the project file. Always check at the beginning of a session to make sure you have the project open; if you don’t, it will likely cause errors. If you do hit an error, one of the first things you should check is whether your project is open.
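If you prefer to check from the console, you can also ask RStudio directly. This is an optional sketch, assuming you have the rstudioapi package installed (it is not part of base R):
# Prints the path of the currently open project, or NULL if no project is open
# (assumes the rstudioapi package is installed)
rstudioapi::getActiveProject()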
This introduction provides some basic instructions to get you started, but it is not a substitute for learning the fundamentals of R and RStudio. For a very basic intro to some key principles, see the “Introduction to R and RStudio” in the WordVectors folder. For more detailed information, there are many helpful resources online, including tutorials by the Programming Historian and Lincoln Mullen’s Digital History Methods in R.
Downloading and installing R and RStudio
You can download R from the CRAN (Comprehensive R Archive Network) repository: https://cloud.r-project.org/. There are specific instructions for downloading to Linux, Mac OS X, and Windows machines.
To download RStudio see: https://rstudio.com/products/rstudio/download/.
Setting and checking your working directory
If this is a new RStudio session, you should check your working directory with the code below. As long as you opened this file from the WordVectors project, your working directory should be in the right place: the “WordVectors” folder.
You should check your working directory because if the working directory is not where you are expecting, then not much else in your files will work. Any time you see an error message that says a file does not exist in the current working directory, that’s a good sign your working directory isn’t where you think it is.
There are two lines of code in the block below; the first will allow you to check your working directory and the second will allow you to set your working directory with the setwd() function, if you ever need to change it. We’ve provided you with some template text that you can replace with a file path specific to your computer.
Navigating file paths can be a bit confusing but, fortunately, there is a trick you can use. If you try deleting the text inside of the quotation marks below and then hitting tab, you should get a pop-up with a view of the folder system you’re used to navigating; you can use this to fill in file paths whenever you need to.
# How to check your working directory (this is also an example of how you add a comment to your code—by typing "#")
getwd()
# How to set your working directory (do not run this unless you actually want to change your working directory—as long as you opened the project file first, you should not need to change your working directory!). Delete the hashtag to take the line below out of the comment, then fill your file path in.
# setwd("path/to/your/directory")Installing packages
R packages are collections of functions, variables, data, and documentation that extend what’s available in base R. Like extensions for Google Chrome or plugins on WordPress, these are additions that you can install directly from the internet, in this case through R’s repository. Packages make it possible to do a wider range of tasks in R and are essential for working with word vectors.
To install an R package, you can navigate through the Tools menu (Tools –> Install Packages) or install packages directly by running code with the install.packages() function, as below. Whenever you are installing packages, make sure that you have an internet connection because the code pulls directly from online repositories.
Once you have installed a package on your computer you do not need to do so again.
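If you are ever unsure whether a package is already on your computer, you can check from the console before reinstalling. Here is a minimal sketch, using “tidyverse” as the example:
# Returns TRUE if the package is already installed, FALSE if you still need to install it
"tidyverse" %in% rownames(installed.packages())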
Run each line of code below to install the packages you will need.
install.packages("tidyverse")
install.packages("tidytext")
install.packages("magrittr")
install.packages("devtools")
install.packages("tsne")
install.packages("lsa")
install.packages("ggplot2")Loading packages
Once the packages have been installed, you should load them using the library() function. You only need to install packages once, but you’ll have to load them every time you start a new session in RStudio. You should make a habit of checking your working directory and loading your packages at the start of each session.
When you run this code for the first time after you start a session, you’ll see a lot of text go through the console, possibly with some warning messages. Even if the text looks alarming, it probably won’t cause any issues. To confirm that the packages have loaded correctly, you can run this code a second time—if you see the code pop into the console with no additional text, that means the packages loaded properly and you are all set.
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(lsa)
library(ggplot2)
# Hold off on running the line below until after you get to the next section
library(wordVectors)
Installing word2vec
Because the wordVectors package lives on GitHub, you will need to install it in a different way, using the install_github() function from the devtools package. That’s why we had you pause before loading wordVectors above. After you’ve installed the wordVectors package (by running the code snippet below), you can ignore or even delete the comment above and just load all the packages at once.
Make sure to load wordVectors after you install it; you can either scroll up and load it from the code block above, or you could try writing the code to load it yourself (either in the code block below or right in the console).
# This is the code to install the `wordVectors` package from GitHub
devtools::install_github('bmschmidt/wordVectors', force=TRUE)
Training a model
Reading in text files
The code we will be using in this session is set up to require minimal editing, but that does mean that you will need to have your input files in a very specific format. You need to have a set of .txt files all saved in the same folder (without any files in subfolders). The folder with your texts should be saved in the data folder.
This tutorial also comes with a small sample folder, called “WomensNovelsDemo”; it is not large enough to produce a useful model, but will run more quickly and so is useful for initial experimentation.
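Before reading anything in, you may want to confirm that your corpus folder contains only .txt files. The optional sketch below uses the sample folder; swap in your own folder name as needed:
# Optional sanity check: list the files in your corpus folder and flag anything
# that is not a .txt file (an empty result, character(0), means you are all set)
corpusFiles <- list.files("data/WomensNovelsDemo")
corpusFiles[!grepl("\\.txt$", corpusFiles)]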
The following script allows you to “read in” multiple text files and combine them into a “tibble,” which is a type of data table. Think of it as being like a spreadsheet, with rows and columns organizing information.
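If you would like a concrete picture of that structure before running the full script, here is a purely illustrative example with made-up filenames and text (it uses the tibble() function from the tidyverse packages you loaded above):
# A toy tibble with the same two-column shape that the script below will produce
tibble(filename = c("novel1.txt", "novel2.txt"),
       text = c("Full text of the first novel...", "Full text of the second novel..."))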
First, we get a list of the files to read in (fileList), then we create a function (readTextFiles) to create a tibble with two columns, filename and text, for each text file in the folder. Then, we run the function to combine everything into one tibble called combinedTexts.
Code that defines a function comes with a special requirement: unlike most code, which you can run by putting your cursor anywhere in the line, you need to have your cursor either at the beginning or the end of the code defining your function when you run it (or just select the whole thing and run it). There are comments both before and after the code that defines the function, so you can see what its boundaries are.
The only thing you’ll need to change in the code below is the file path in the first line. As long as you have the folder with your text files inside the data folder, you should only need to change the part after the slash (the part that reads “name_of_your_folder”). Remember that you can use tab to navigate to the folder you want. Make sure to change that one line before you run any of the code below.
# Change this line to match the name of the folder with your corpus
path2file <- "data/WomensNovelsDemo/"
# This will create a list of files in the folder
fileList <- list.files(path2file, full.names=TRUE)
# This is where you define a function to read in multiple text files and paste them into a tibble (remember that the code that defines functions must be run by putting your cursor at the beginning or end, or by selecting the whole section of code). You are only defining the function here; the next section of code is when you actually run the function.
readTextFiles <- function(file) {
message(file)
rawText = paste(scan(file, sep="\n", what="raw", strip.white=TRUE))
output = tibble(filename=gsub(path2file, "", file), text=rawText) %>%
group_by(filename) %>%
summarise(text = paste(rawText, collapse=" "))
return(output)
}
# This is where you run the function to create a tibble of combined files called "combinedTexts"
combinedTexts <- tibble(filename=fileList) %>%
group_by(filename) %>%
do(readTextFiles(.$filename))
Preparing text for word2vec
The section below defines several variables so that they can be called on in training your model. Working with general names (such as “w2vInput”) for these variables lets you use them in the code that follows without having to change each instance; the first line is where you set up the specifics you need to distinguish one model from another.
You can pick any name you want in the first line of code below; make sure there are no spaces in the name you select and that it is descriptive enough that you will remember which corpus you were working from when you want to read in a trained model.
The only line in the block of code below that you will need to change is the first one, but make sure to do this, or you will end up with a file called “your_file_name.bin”!
The last line of this code section creates a single text file, with a name based on the one that you chose, combining all of the texts in your corpus.
# This section is where you define the variables you will be using to train your model; don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
# This line creates a single text file with all the texts in your corpus
combinedTexts$text %>% write_lines(w2vInput)Creating a vector space model
The code below is how you actually train your model. There are some parameters you might want to modify, or, if this is your first time training a model, you can also keep the defaults to start.
You can adjust the number of processors to use on your computer in training the model with the threads parameter; this will impact how quickly the model is trained.
The vectors parameter allows you to change the dimensionality of your model to include more or fewer dimensions. Higher numbers of dimensions can make your model more precise, but will also increase both training time and the possibility of random errors. A value between 100 and 500 will work for most projects.
The window parameter allows you to control the number of words on either side of the target word that the model treats as relevant context; the smaller the window, the closer the context words will be.
The iter parameter allows you to control how many times your corpus is read through during model training. If your corpus is on the smaller side, then increasing the number of iterations can improve the reliability of your results.
The negative_samples parameter allows you to control the number of “negative samples” used in training. During the training process, each iteration updates the information about the position of each word in the model (making it progressively more accurate). Because there are many thousands of words in the model, doing that update with every iteration is time-consuming and computationally costly. With negative sampling, instead of updating every word, the training process updates only the words directly observed within the window, plus a random sampling of the other words in the model. For smaller datasets, a value between 5 and 20 is recommended; for larger ones, you can use smaller values, between 2 and 5.
For more on these parameters, and other options that you have in training a model, see the code documentation.
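You can pull up that documentation from within RStudio once the wordVectors package is loaded; for example:
# Opens the help page for the training function in the "Help" pane
?train_word2vec
# Lists everything documented in the wordVectors package
help(package = "wordVectors")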
This code will check if there is already a .bin file with the same name in the current directory—if there isn’t, it will train a new model. If there is, it will read in the existing one. If you ever want to overwrite a model you’ve already trained, make sure to delete or rename that model’s .bin file first.
# This controls how much of your computer's processing power the code is allowed to use.
THREADS <- 3
# prep_word2vec will prepare your corpus by creating a single text file and cleaning and lowercasing your text with the `tokenizers` package. If you set the value of `bundle_ngrams` to be greater than 1, it will automatically join common bigrams into a single word.
prep_word2vec(origin=w2vInput, destination=w2vCleaned, lowercase=T, bundle_ngrams=1)
# The code below will train or read in a model
if (!file.exists(w2vBin)) {
w2vModel <- train_word2vec(
w2vCleaned,
output_file=w2vBin,
vectors=100,
threads=THREADS,
window=6, iter=10, negative_samples=15
)
} else {
w2vModel <- read.vectors(w2vBin)
}
Querying the model
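Before querying, it can be reassuring to confirm that the model is what you expect. A trained model is stored as a matrix with one row per word and one column per dimension, so a quick check like the sketch below should report the vocabulary size along with the number of dimensions you chose during training (100, if you kept the default above):
# The first number is the vocabulary size; the second should match the `vectors`
# parameter used in training
dim(w2vModel)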
Visualizing
We can get a glimpse of what the model looks like by plotting it in two dimensions. Keep in mind that the model actually has many more dimensions, so we are, in effect, flattening it. Though the visualization may be somewhat difficult to read, you should be able to see that similar words—words that are near each other in vector space—tend to clump together. The code below will likely take a minute or two to run, and your results will appear in the “Plots” window to the right (you can hit the “Zoom” button to get a better view).
As the code is running, you’ll see a set of lines in the console that will read something like “Epoch: Iteration #100 error is: 20.3048394873336”; note that this is not an error message! As the code runs, the values for “error” should decrease—this reflects increasing confidence about how to plot the vector representation.
w2vModel %>% plot(perplexity=10)Clustering
The following script provides a way to cluster words that are near each other in vector space, using the “k-means” clustering algorithm. Below, we choose 150 centers, or 150 points around which to cluster words. Then we select 10 random clusters and 15 words from each cluster to view. This code will also take a minute or two to run. You can change the number of centers, the number of clusters to view, or the number of words to see—you can also increase the number of iterations (the number of times the algorithm should adjust where the centers are and where terms are positioned in relation to those centers).
centers <- 150
clustering <- kmeans(w2vModel, centers=centers, iter.max=40)
sapply(sample(1:centers, 10), function(n) {
names(clustering$cluster[clustering$cluster==n][1:15])
})
Closest to
To find the words closest to a particular word in vector space, fill in that term and then run the code below. If you want to see more words, just increase the number in the argument. Make sure not to delete the quotation marks, and enter your word in lowercase.
w2vModel %>% closest_to("girl", 30) Closest to two terms
You might also want to see the words closest to a combination of two (or more) words. Notice that this will open a new window with the results because of the view() function. If you prefer to see your results in this format, you can paste “%>% view()” at the end of the code above; or, if you prefer to see your results in the console, you can delete “%>% view()” from the code below. Note that the code below also shows 20 results, instead of 30. If you want to continue adding terms, just follow the format as in the example by putting a + between each pair and putting each word in quotation marks.
# Closest to two terms
w2vModel %>% closest_to(~"girl"+"woman", 20) %>% view()
# Closest to more than two terms
w2vModel %>% closest_to(~"girl"+"woman"+"daughter"+"aunt"+"sister"+"lady", 20) %>% view()Closest to the difference between two terms
Or, you might want to look at the difference between two terms, to see which words are similar to one term but not another:
w2vModel %>% closest_to(~'woman'-'man',20) Analogies
You can even construct analogies, such as in the example below. These use vector math to subtract the contexts associated with one word from those of another, and then add a third term; this brings you to a new region of vector space, where you will find terms associated with the distinction between the first two terms plus the contexts of the third.
In the classic example, you might start with the vector for “woman” and subtract the vector for “man”, thus producing a vector that represents the contexts for “woman” as distinct from those for “man”. You might then add a third term, such as “king”, to add its own contexts to the query. This lets you take a vector associated with something like femininity and add to it a vector associated with royalty; you might expect to get a result like “queen”.
Or, to frame this as an analogy: this lets you ask questions like “man” is to “king” as “woman” is to what?
w2vModel %>% closest_to(~"woman"-"man"+"king", 20)It is not always helpful to think strictly within the analogy framework; in many cases, it can be more productive to think about constructing a vector that represents the difference between two terms, and then adding the contexts of a third term. In the line of code below, for instance, we are constructing a vector that might be described as a “wealth” vector (by looking at the contexts for “rich” as distinct from “poor”) and adding to that vector the semantic space of clothing (by adding the contexts for “dress”). We might expect to get results associated with expensive clothing or the dress habits of the wealthy.
To experiment with this, try adding different third terms (perhaps “food” or “house”) or reverse the first two terms, to look at the contexts for poverty instead of wealth.
w2vModel %>% closest_to(~"rich"-"poor"+"dress", 20)Working with other models and exporting results
Reading in existing model files
If you want to read in an existing model, you can do so with the code below (just replace “name_of_your_file” with the name of your file, and make sure you don’t delete the .bin extension or the quotation marks). If you follow the instructions above, all of your trained models will be saved as binary files (with a .bin extension) in your data folder. You only need to train each model once, and then you can use this code to read it in at the start of each new session.
You can also read in models trained by others if you save them to your data folder and then read them in with the code below.
After you’ve restarted RStudio (in addition to checking your working directory and loading your packages), you’ll also need to use the code below to read in your model again.
# Replace this with the path to the model you want to read in
w2vModel <- read.vectors("data/wwo-regularized.bin")Exporting queries
The code below will enable you to export the results from a particular query. To export query results, change the part after “w2vModel %>%” to match the query that you want to export. An example is filled in so that you can see what this looks like. You can also adjust the number of words in the results set, if you want to see more or fewer. If you’d like to export results from a different query, such as addition or subtraction, paste over the example query with the one that you want to export.
The first line of code defines the variable “w2vExport” as whatever query you set. The second line exports a CSV file (which you can open in any program on your computer that works with tabular data, including Excel and Numbers). You can call the file whatever you like by replacing the template text inside of the quotation marks. The CSV file will be exported to the “output” folder in your current working directory, and it will overwrite existing files with the same name, so make sure to rename the export file if you want to keep earlier versions. Make sure not to use any spaces in the file names you choose.
w2vExport <- w2vModel %>% closest_to("girl", 30)
#Change "name_of_your_query" to a descriptive name that you want to give to your export file. Don't put any spaces in the file name.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)Exporting clusters
You can use a similar method to export your clusters; the code below will first generate a set of clusters and then export a specified (by you) number of terms from those clusters. As above, you can change the number of centers and iterations when you are generating the clusters; you can also change how many sets of clusters and words from each cluster to export. The exporting mechanism is the same as with exporting queries above; you change the language in the quotation marks to match the name that you want to give your file. The export file can be fairly large, so this code might take a bit of time to run.
#Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
centers <- 150
clustering <- kmeans(w2vModel,centers=centers,iter.max = 40)
w2vExport <-sapply(sample(1:centers,150),function(n) {
names(clustering$cluster[clustering$cluster==n][1:15])
})
write.csv(file="output/name_of_your_cluster.csv", x=w2vExport)Evaluating the Model
Below is a very simple test that will calculate the cosine similarities for a small set of word pairs that are likely to be related in many models. You can customize this list for your own corpus by editing the pairs below, or adding new ones (add as many as you like, but make sure to follow the same format as in the examples below). This code will produce a “model-test-results.csv” file with cosine similarity scores on these word pairs for every model in your folder. The results file will be in the “output” folder of your current working directory. This is meant to be an example of the kinds of testing that are used in model evaluation, and is not a substitute for more rigorous testing processes.
files_list = list.files(pattern = "*.bin$", recursive = TRUE)
rownames <- c()
data_frame <- data.frame()
data = list(c("away", "off"),
c("before", "after"),
c("cause", "effects"),
c("children", "parents"),
c("come", "go"),
c("day", "night"),
c("first", "second"),
c("good", "bad"),
c("last", "first"),
c("kind", "sort"),
c("leave", "quit"),
c("life", "death"),
c("girl", "boy"),
c("little", "small"))
data_list = list()
for(fn in files_list) {
wwp_model = read.vectors(fn)
sims <- c()
for(pairs in data)
{
vector1 <- c()
for(x in wwp_model[[pairs[1]]]) {
vector1 <- c(vector1, x)
}
vector2 <- c()
for(x in wwp_model[[pairs[2]]]) {
vector2 <- c(vector2, x)
}
sims <- c(sims, cosine(vector1,vector2))
#f_name <- strsplit(fn, "/")[[1]][[2]]
data_list[[fn]] <- sims
}
}
for(pairs in data){
rownames <- c(rownames, paste(pairs[1], pairs[2], sep = "-"))
}
results <- structure(data_list,
class = "data.frame",
row.names = rownames
)
# If you want to give your results document a more specific name, you can edit "model-test-results" below.
write.csv(file="output/model-test-results.csv", x=results)
Credits and Thanks
This tutorial uses the wordVectors package developed by Ben Schmidt and Jian Li, itself based on the original word2vec code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the “Humanities Data Analysis” course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell’s “Literature and Digital Diversity” class, both at Northeastern University.
This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.