Word Vectors: Walkthroughs and Hands-on Practice

Sarah Connell and Julia Flanders

2021-06-29

R, RStudio, RMarkdown

Let’s start with some brief orientation in the language, software, and files we’ll be using:

Tour of RStudio Server

Here’s a quick tour of the RStudio interface. By default, RStudio is divided into four panes, each of which has a different function; this layout lets us keep track of all our essential information at a glance.

Orienting yourself—projects and working directories

Now, we’ll cover two other important aspects of orienting ourselves in RStudio Server. First: projects are a way to organize your workspace. They establish which files you’re working with by default, and can also control some settings.

When you’re first getting started, it can be helpful to begin with a tested environment where everything is already set up, so that you can focus on the new things you’re learning and not have too much unnecessary and potentially confusing variation. As you get farther along, you’ll be comfortable with more experimentation and flexibility; at the start, though, it helps to have a strong framework that lets you focus on essential tasks.

We’ve set up a wordVectors project that will serve as our navigational framework. One of the things that the project controls is what’s called our working directory. The working directory is essentially the starting point for all of our interactions with the files we’ll be using.

To think metaphorically about this: imagine that I’m giving you directions from my house to the playground. I might tell you to go south for a block, then east for another block, and then south for one more block. And then you’re at the playground! But, if you were starting at someone else’s house, the path you’d have to take would be different. The starting place is really essential for thinking about directions. In the same way, if you were working with these files and you needed to get into the novels folder, you’d first need to go down into the data folder, and from there into novels. You have to go down two levels because of where your starting point is. If your working directory were the data folder instead, you’d only go down one level. If you were in output you’d need to go up a level and then down again.
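
To make this concrete, here’s a minimal sketch of how you might check your starting point and build a relative path from it; the folder names here are just illustrative.

    getwd()                    # prints the current working directory (your starting point)
    list.files("data")         # lists files one level down from the working directory
    list.files("data/novels")  # goes down two levels from the working directory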

There are some pragmatic implications of these points:

Running code in RStudio

Functions

So far, we’ve talked about running code, projects, and our working directories. Now let’s talk about a couple more key concepts: functions and variables.

Functions are pieces of code in R that are designed to perform some specific task.

For example, at the top of this slide we have the sum function being used to add a set of numbers separated by commas. The second example shows the print function printing out a string of text. Because this is a string, you’ll see that it’s enclosed in quotation marks.
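
In code, those two examples look something like this (the specific numbers and text here are illustrative, not necessarily the values on the slide):

    sum(1, 5, 10)            # adds the numbers separated by commas, returning 16
    print("Hello, world!")   # prints a string of text, enclosed in quotation marks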

Variables

Our last key concept is variables.
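
A variable stores a value under a name you choose, so you can reuse it later by name. As a small illustration (the variable name and value are just examples):

    favorite_number <- 42      # assign a value to a variable with <-
    print(favorite_number)     # the variable now stands in for that value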

Roadmap: querying an existing model

Here’s a roadmap for querying an existing model:

Packages are bundles of code, data, and documentation that let you access additional functionalities in R. They essentially allow you to do more advanced or specialized things than the ones you get for free from R.

This is kind of like if you have a KitchenAid mixer. When you get the mixer, it comes with about three basic attachments for things like dough or eggs. But, if you want the grain mill, or the pasta sheet roller, or the fruit and vegetable spiralizer, or the food processor with commercial-style dicing kit, you’re going to need to acquire those separately.

We didn’t need to do this earlier because we were doing some core tasks with essential functions, basically the equivalent of beating dough for chocolate chip cookies.

But now that we’ll be working with word2vec models, we’ll need to bring in a few packages with more specialized capabilities.

Normally, you have to install packages before you use them, but we’ll be working in RStudio Server, so the packages are pre-installed.

However, we still need to load them in with the library() function at the start of each session.
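
As a sketch, that loading step might look like this; the exact list of packages depends on what your RMD file calls for, so treat these package names as likely candidates rather than a definitive list.

    library(wordVectors)   # functions for training and querying word2vec models
    library(magrittr)      # the pipe operator
    library(tidyverse)     # general-purpose data manipulation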

Once we’ve done that prep work, we can load in a model file, with a new function, called read.vectors()—we get this function from one of the new packages we’ll be loading in.
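
For example, a minimal sketch of loading a model (the file path and model name here are illustrative, not the actual workshop files):

    wwo_model <- read.vectors("data/wwo_regularized.bin")   # load a trained model from a .bin file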

And finally, we’re ready to query a model!

Before we show you how to run that code, let’s talk about some of the query operations you can try.

Clusters

We’re using a specific clustering algorithm called k-means; this method asks you to set a number of centers for locating the dense semantic spaces in your model. You’re essentially saying: find me n places where larger numbers of terms cluster together, where you get to decide what n is. Then, the algorithm iterates over your model, doing a better job with each pass of finding where those densely populated centers are and what terms are located around them.

Then, you can ask to see the terms closest to some of those centers, to get a sense of some meaningful domains for your corpus. In the Toolkit, you can choose how many words to see, and also request different random selections of clusters from the model.

Working directly on the code, there’s more you can control: the number of centers, how many clusters to select and view, and how many words to see from each of those clusters.
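
Here’s a rough sketch of one way to run k-means over a loaded model, assuming a model object like the wwo_model loaded above; the numbers of centers, clusters, and words are illustrative settings you can adjust.

    set.seed(42)                                      # for reproducible clusters
    centers <- 150                                    # how many dense regions to look for
    clustering <- kmeans(wwo_model, centers = centers, iter.max = 40)
    # view ten words from each of ten randomly selected clusters
    sapply(sample(1:centers, 10), function(n) {
      names(clustering$cluster[clustering$cluster == n][1:10])
    })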

Closest to

As we’ve seen, the words closest to a word in vector space are those with the highest cosine similarities. They’re not necessarily synonyms but instead words that the model predicts are likely to appear in the same contexts. And, indeed, the closest words are often antonyms: think how often words like good and evil or happy and sad appear together.

Nobody would ever expect that tuesday is a synonym for friday, but those two terms and the other days of the week are especially likely to be used in similar ways, and in the same kinds of places within your texts, so they have fairly high cosine similarities.
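
In code, this kind of query is a single function call from the wordVectors package; here’s a sketch, assuming a loaded model object like the wwo_model above.

    closest_to(wwo_model, "friday", 10)   # the ten words with the highest cosine similarity to "friday"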

As a rough rule, any cosine similarity over 0.5 is probably worth paying attention to.

All the examples here are coming from the regularized Women Writers Online corpus. It’s likely that many of you will get similar results when you query your own models if you look at the days of the week. If you do get different results, that is probably telling you something interesting about your corpus.

Addition

What’s great about vectors is that you can also do math with them, as we explored a bit earlier. For example, you can add two or more terms together, to amplify their semantic connections.

In this example, I’m adding oranges to grapes, to focus in on the fruit meaning of oranges.

If I switched to adding oranges and blue, I would still get some fruits, but I would also get more colors: for example, the Women Writers Online model gives red, yellow, and lilac, which is itself both a plant and a color.

This example only shows two terms, but it is often useful to add a much larger number of terms together. If you’re looking at a complicated concept like the natural world, as distinct from human industry, you can’t just stick in bird + tree and really get at the complexity there, but a larger number of terms would get you closer.
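
A sketch of the addition query in code, using the wordVectors formula interface; the wwo_model object and the count of ten results are assumptions, and the terms come from the example above.

    closest_to(wwo_model, ~ "oranges" + "grapes", 10)   # words closest to the combined vector
    # an equivalent approach: add the two word vectors explicitly
    closest_to(wwo_model, wwo_model[["oranges"]] + wwo_model[["grapes"]], 10)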

Subtraction

Finally, you can also subtract vectors from each other. This can be really helpful if you want to get at specific associations for words that have multiple meanings, or if you’re looking for one concept as distinct from another: wealth as distinct from poverty, for example.

In this example, we’re subtracting books from library to get at the library as a physical building, which has associations with other kinds of buildings, especially public ones like hotels, hospitals, palaces, and even spunging houses, which were sort of like debtors’ prisons.

As you’ll see here, the cosine similarities are overall lower; that’s often the case with subtraction.
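
A sketch of the corresponding subtraction query, again using the formula interface and the assumed wwo_model object:

    closest_to(wwo_model, ~ "library" - "books", 10)   # the "building" sense of library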

Outline of Model Training in RStudio Server

Here’s a roadmap for training a model.

So, that’s a bird’s-eye view of model training; now let’s run through the full process, starting with a pre-loaded sample folder with a small number of texts.

Loading your texts: process overview

Now that we have a sense of the overall steps involved in training a model, let’s focus in on the processes for reading in your own corpora.

These instructions are designed for RStudio Server; if you’re working in RStudio Desktop, you don’t need to compress your texts or upload them, just move the folder with your texts into your data folder.

Here’s our roadmap for reading in texts of our own:

Let’s look at these steps in more detail before we get to the hands-on.

Step one: Collect your texts

The first step is to collect your texts, thinking about all the considerations we discussed earlier on what to include, how much to regularize, what to correct, and so on.

The particular code we’re using requires that all the text be saved as .txt files in a single folder. Saving everything as a single large text file will work, but you don’t want any subfolders.

As a good general practice, you should make sure not to include any spaces in the folder name.

Step two: Compress your folder

The next step is to compress the folder, so it can be read in as a single file. This process is different for Apple and Windows machines.

Step three: Upload the folder

Next, we’ll go into RStudio Server and upload the folder.

The main thing here is to make sure that you’re in the data folder before you do the upload. You can move the folder after you upload it, so it’s not a major concern, but if you put the folder in a different location and don’t notice, you’ll get an error message that could potentially be confusing.

When you upload the compressed folder, RStudio Server will automatically expand it.

Step four: Read your files into R

In the RMD file, we’ll define a variable that points to the folder we just uploaded, and then create the specialized function that will read our texts in. When we run the function, we’ll get a tibble with two columns, as in the screenshot here: one listing the name of each file and another containing the full text of each file in the folder.
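
The exact function is defined in the RMD file, but as a rough sketch of the idea (the folder name is illustrative, and the details in the workshop code may differ), reading a folder of .txt files into a two-column tibble could look something like this:

    library(tidyverse)
    input_folder <- "data/sample_texts"                       # illustrative folder name
    files <- list.files(input_folder, pattern = "\\.txt$", full.names = TRUE)
    texts <- tibble(
      filename = basename(files),                             # the name of each file
      text = map_chr(files, ~ paste(read_lines(.x), collapse = " "))  # the full text of each file
    )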

Step five: Name your model and combine your texts

The next block of code is where you supply a name for your model file and the associated text files that are created during preparation and training.

You define a variable called baseFile and then use that to derive the names and contents for:

The final line in this code block is what creates that input file of combined texts.
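
As a sketch of the naming pattern: apart from baseFile itself, the variable names, folder locations, and base name below are hypothetical placeholders, and the details in your RMD file may differ.

    baseFile <- "sample_corpus"                              # the name you choose for your model
    w2vInput <- paste0("data/", baseFile, ".txt")            # the combined input file of all your texts
    w2vCleaned <- paste0("data/", baseFile, "_cleaned.txt")  # the prepared (cleaned) version of that file
    w2vBin <- paste0("data/", baseFile, ".bin")              # the trained model file
    # combine the texts (from the tibble sketched above) into the single input file
    write_lines(texts$text, w2vInput)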

Step six: Prepare your texts

prep_word2vec is a function that makes it possible to do some light text cleaning within the R code itself. It’s far from the only way to prepare texts, and you might find that you want to have more control over this part of the process, or that you’d prefer to do the final text processing in a program you’re used to working in. But, this is useful for our purposes because it’s built right into the package we’re already using, and because the default text preparation it does is likely to be applicable in most contexts.

That is, this command will lowercase all the text so that Rabbit and rabbit are treated as the same token. Or, to put this another way, so that it doesn’t matter whether words might happen to appear at the beginning of sentences, in titles, or in other contexts that lead to capitalization.

This step also removes most punctuation. Notably, it does not remove apostrophes, so contractions and possessives will remain unchanged, and contractions will be treated as single tokens. It also leaves underscores, which can be useful if you want to do any preprocessing on your texts to combine particular phrases by replacing the spaces between words with underscores.

Related to that last point, the command can also do some automatic recognition of multiword phrases that represent a single concept—such as united states or dunkin donuts—by looking for two- or three-word strings that often appear together. In fact, there’s no limit on the length of the word strings you can instruct the prep_word2vec function to try to recognize, but even two- or three-word strings will get fairly mixed results in terms of accurately detecting words that should be understood as a single token.

After that, your text is ready and you can adjust your parameters as needed and train the model.
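
As a sketch of those two steps with the wordVectors package, continuing with the placeholder file names from the naming sketch above; the parameter values are illustrative, and your RMD file sets its own defaults.

    # light cleaning: lowercase, strip most punctuation, optionally bundle common two-word phrases
    prep_word2vec(origin = w2vInput, destination = w2vCleaned,
                  lowercase = TRUE, bundle_ngrams = 2)
    # train the model on the cleaned file
    model <- train_word2vec(w2vCleaned, w2vBin,
                            vectors = 100, window = 6, iter = 10,
                            negative_samples = 15, threads = 2)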

Exporting results from clusters and queries

Before we turn to the hands-on practice, there’s one more skill to cover: exporting the results of your queries.

When we get to the RMD file, we’ll be using some code that’s set up to allow you to export both queries and clusters. After you run the code, you’ll get a CSV file with the results in your output folder. As a quick reminder, CSV (or comma separated value) files are a plain text format for storing tabular data.
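
As a rough sketch of exporting a single query (the query term and file name are illustrative, and the actual export code in the RMD file is more elaborate):

    nature_query <- closest_to(wwo_model, "nature", 50)           # the fifty closest words to "nature"
    write.csv(nature_query, "output/nature_closest.csv", row.names = FALSE)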

If you’re working in RStudio Server, there’s one additional step after you run the code: exporting the CSV files from RStudio Server onto your own computer. You can select the file, then use the Export option to download it. After that, you can open the file with any software you prefer for working with CSV files.

Validation

As we’ve just discussed, there are many different ways that researchers have tried to answer the very important questions of:

What we’ll be showing now is a fairly minimalist testing process, but one that should enable you to do an initial sniff test of your models’ performance and consistency, while also illustrating some of the principles at stake behind more robust testing methods.

The code we’ll be showing will enable you to define a set of word pairs that you expect to have fairly high cosine similarities within your model.

When you run the code, your results will look a bit like this (depending on which program you use to open the resulting file). You’ll see each of the word pairs, along with their cosine similarity scores. These particular word pairs are ones that are likely to be close in many different models, but you can improve on these substantially! The code template provides a test set of word pairs that is both very short and very generalized. You’ll get much better results with a longer set of test terms, and ones that are specific to your model.
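
Here’s a minimal sketch of the idea for a single loaded model; the word pairs are illustrative, and the workshop code goes further by looping over all the models in your testing folder.

    # define word pairs you expect to have high cosine similarities in your model
    pairs <- data.frame(word1 = c("sea", "day", "king"),
                        word2 = c("ocean", "night", "queen"))
    # compute the cosine similarity for each pair
    pairs$similarity <- mapply(function(a, b) {
      as.numeric(cosineSimilarity(wwo_model[[a]], wwo_model[[b]]))
    }, pairs$word1, pairs$word2)
    pairs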

Setting up a testing folder

Before we get to running the validation code, there’s an additional step we’ll want to take: setting up a test folder that will contain only the models we want to validate, and changing the working directory to that folder.

This is necessary because the validation code is designed to run on all the files contained within the working directory, even those in subfolders.

If you forget or skip this step, that’s fine—you’ll just end up with a lot of extra models in your test results and the code will take longer to run.

This slide shows the first two steps for setting up the test folder.

Changing your working directory

Now, we have a folder with our specific files, and our next step is to change the working directory temporarily to that folder.

The screenshots on the left show the code you can use to change your working directory without losing track of where you are.

The two images on the right show a before-and-after for what we’ll be doing in this step. On the left, the working directory is WordVectors and the model validation code will run on any of the model files in any of these subfolders, including the data folder and anything it might contain. On the right, we see the outcome of what happens when we use setwd() to change to the testing folder. That becomes the working directory and the model validation code will run only on the files in that folder.
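
The pattern looks something like this; the folder name is illustrative.

    original_wd <- getwd()        # remember where we started
    setwd("test_models")          # move into the testing folder
    # ... run the validation code ...
    setwd(original_wd)            # return to the original working directory when finished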

Exporting results

The next-to-last line of code in this section will generate a table called results, which you can view right in RStudio by using the view() function or by clicking on results in your Global Environment pane.

You may also want to export this file and view it in another program; if so, you can use the final line of code. You will likely want to update the file name (the part before the .csv extension), and you should remember that the code as it stands assumes that you have an output folder in your current working directory.

Visualizing semantic spaces

The goal of this session is to briefly share a few options for generating plots with word vector models. One of the most useful capabilities that visualizations provide is that they can offer new ways to see and think about our data. As we’ve already observed, it can be hard to get a sense of the wholeness of a model by reviewing individual lists of cosine similarities for query terms.

The same is true for the conceptual spaces we are using models to explore: you can add a lot of gendered terms together to try to get at the domain of nineteenth-century conceptions of femininity, but if the only tool you have to examine those connections is a static list of terms and cosine similarities, it can be hard to feel as if you’re really getting the full picture.

Visualizations can also move us away from the tempting positivism engendered through lists: we can get a sense of how words are in proximity to each other, without getting caught up in whether they are seventh or eighth on a list.

But, it’s also worth remembering that visualizations have a rhetoric of their own, and that the choices you make in generating these plots can have a dramatic impact on what they seem to be showing. For that reason, it’s essential to vary your settings and query terms, and then evaluate how your results change.

Principal component analysis

Several of the visualizations we’ll be demonstrating use principal component analysis, usually abbreviated PCA. This is a powerful and flexible method that is often used to simplify complex datasets and make it possible to discover patterns and relationships in the data.

PCA is one of a set of methods designed for dimensionality reduction. We’ve already seen another one in the t-SNE plot that produced the flattened word cloud showing the most common words in a model. PCA is another approach for taking something large and complex with many dimensions and reducing its complexity in order to enable us as humans to understand and analyze it.

The core idea here is that PCA reduces the dimensionality of complex datasets by transforming a large number of variables into a smaller set that still contains most of the information in the larger dataset. That is, the main goal of PCA is to reduce the number of variables while still preserving as much information as possible.

This is an unsupervised machine learning method that identifies a set of principal components: the first principal component accounts for the largest amount of variation, the second principal component accounts for the next largest amount, and so on.
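
To make this slightly more concrete, here is a minimal sketch of running PCA directly on a subset of the word vectors, assuming a loaded model object like the wwo_model from earlier; the subset of 1,000 words just keeps the example quick.

    vecs <- wwo_model@.Data[1:1000, ]        # the raw matrix behind 1,000 of the word vectors
    pca <- prcomp(vecs)                      # compute the principal components
    summary(pca)$importance[, 1:3]           # variance captured by the first three components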

Here’s a very simplified way to think about how principal components operate. Looking at this imaginary dataset, what might be the primary difference between the items on the top and the ones on the bottom? What about the ones on the left and the right? What are some other characteristics of these items that might also be used to categorize them?

Of course, if we treat this dataset as one comprising elements that can be primarily categorized as plants and animals, or as land or sea life, there’s a lot of complexity we’ll lose. For example, some of these items are food for humans, and others aren’t. Some have teeth and others don’t. They also vary significantly in their sizes, their colors, and their geographic distributions. All of that information is important! But, if we’re focused on organizing these items by just two kinds of variables, those other types of information aren’t going to cover as much of the variety in our data.

We really need to emphasize that this is a simplification, and that the principal components won’t map neatly onto human concepts like sea and land life. But, hopefully, this example can help to illuminate the key goal of PCA: reducing the number of variables while losing as little information as possible.

Visualization example 1: PCA plotting of closest terms

Now, we’ll go through three example plots, before demonstrating the code to create them.

This first example shows the closest words to a set of animal-based query terms plotted according to principal component analysis.

This example and all the ones that follow are for the regularized Women Writers Online model.
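
Here’s a sketch of one way to build this kind of plot; the animal terms and the number of words are illustrative, and the workshop code may use different helpers.

    library(ggplot2)
    animal_vector <- wwo_model[[c("horse", "dog", "bird")]]                  # averaged vector for the query terms
    nearby <- closest_to(wwo_model, animal_vector, 75)                       # the 75 closest words
    vecs <- wwo_model[[nearby$word, average = FALSE]]                        # their individual vectors
    pca <- prcomp(vecs@.Data)                                                # reduce to principal components
    plot_data <- data.frame(word = nearby$word, pca$x[, 1:2])
    ggplot(plot_data, aes(x = PC1, y = PC2, label = word)) + geom_text()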

Visualization example 2: Exploring clusters with dendrograms

This next example demonstrates a dendrogram, or tree diagram. This is showing levels of connectedness or closeness among a set of terms, organized hierarchically.

To read the dendrogram, look at where different terms branch off from each other. The more steps up the tree you need to go to find their branching point, the farther away from each other they are. For example, lace and ribband on the left only branch up one level, so they are quite close.
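
One way to build a dendrogram like this, sketched with base R functions; the query term and the number of words are illustrative, and the distance measure here (Euclidean, by default) may differ from the workshop code.

    terms <- closest_to(wwo_model, "dress", 30)$word                 # thirty words near an illustrative query term
    distances <- dist(wwo_model[[terms, average = FALSE]]@.Data)     # pairwise distances between their vectors
    plot(hclust(distances))                                          # hierarchical clustering drawn as a tree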

Visualization example 3: Comparing cosine similarities for word pairs

This next example might look complicated, but it’s actually relying much more directly on methods you’re already familiar with. This plots the terms closest to a pair of query terms—in this case, happy and sad—according to their cosine similarities.

For example, the word accomplished has a cosine similarity of around 0.5 relative to the word happy but a cosine similarity of almost zero relative to sad. By contrast, the word mournful has a cosine similarity of around 0.8 relative to sad and a cosine similarity of around 0.3 relative to happy.

What’s particularly interesting are the words in the middle, those that are fairly close to both terms: wretched, unhappy, pleasing, and miserable. This helps to reinforce our awareness that close terms aren’t necessarily synonyms, and also shows which words from this corpus’s discourse around happiness are also used in contexts related to sadness, and which terms aren’t.
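
A sketch of the underlying code with ggplot2; the query words happy and sad come from the example above, while the number of words and the wwo_model object are assumptions.

    library(ggplot2)
    pair <- wwo_model[[c("happy", "sad"), average = FALSE]]                   # the two query vectors
    words <- closest_to(wwo_model, wwo_model[[c("happy", "sad")]], 50)$word   # words near their average
    sims <- as.data.frame(cosineSimilarity(wwo_model[[words, average = FALSE]], pair))
    sims$word <- rownames(sims)
    ggplot(sims, aes(x = happy, y = sad, label = word)) + geom_text()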

Visualization example 4: Plotting terms associated with multiple keywords

You might also find that it’s useful to look at relationships among more than two terms. This example will let you plot the terms closest to multiple input terms using principal component analysis.

As we’ve seen before, we focus on a subset of words that are closest in terms of cosine similarity to a group of terms that we can choose. Then, we use principal component analysis to plot these. This allows us to discover both how our individual query terms relate to each other, and also patterns in the relationships among their closely related terms.

Visualization example 5: Plotting a conceptual domain across two binaries

This last example is more complex. Here, we’re pulling out a set of terms related to a conceptual domain, and then plotting those terms across two binaries.

We’ve selected a group of terms related to clothing—words like dress, frock, and coat—and then plotted these out in a space where the vertical dimension relates to gender and the horizontal one to wealth.

We use multiple terms to establish each of the poles for our paired binaries, just as we do for the clothing words. In this example, the terms related to the concept of poverty are: poor, poorer, poverty, and indigence.

We can see that the terms overall lean slightly toward maleness and wealth; that’s a result of the input terms that we chose, and we could try adding to or changing the query terms to see how the plot might vary. There are also some interesting patterns that emerge for individual words. We get a sense of the materials used to make expensive clothing for both women and men, such as ermine, satin, and silks. We can also see that ribbons are more associated with wealthy women but trimmings are more associated with wealthy men. Handkerchiefs for women and shirts for men are interestingly wealth-neutral.
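
As a rough sketch of how such a plot can be assembled: the poverty terms come from the example above, but the other pole terms and the clothing list are simplified stand-ins for the longer lists used in the actual example.

    library(ggplot2)
    # each axis is the difference between two averaged pole vectors
    gender_axis <- wwo_model[[c("she", "her", "woman")]] - wwo_model[[c("he", "his", "man")]]
    wealth_axis <- wwo_model[[c("wealth", "riches")]] - wwo_model[[c("poor", "poorer", "poverty", "indigence")]]
    clothing <- c("dress", "frock", "coat", "ribbons", "satin")       # a few illustrative clothing terms
    coords <- data.frame(
      word   = clothing,
      gender = as.numeric(cosineSimilarity(wwo_model[[clothing, average = FALSE]], gender_axis)),
      wealth = as.numeric(cosineSimilarity(wwo_model[[clothing, average = FALSE]], wealth_axis))
    )
    ggplot(coords, aes(x = wealth, y = gender, label = word)) + geom_text()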

As these examples have hopefully shown, visualizations can open up very productive opportunities for analyzing and exploring word vector models.