Introduction to Word Vectors

Julia Flanders

2019-04-01

A Road Map

As we’ve already seen, word vectors are complicated...

The next few sessions are intended to offer an overview, from several different angles:

Hopefully by the end, we’ll have gone over the same material from enough different perspectives that it will all make perfect sense!

And at various points, we’ll take a step back and think about the explanatory process itself: what kinds of explanation might work best for different audiences (especially our students, but also readers of our scholarship, or project collaborators)?

Corpus and model

We’re going to hear the terms corpus and model a lot this week: let’s look more closely at those terms

Corpus:

Model:

The data preparation process is how you get from the research collection to the corpus

The training process is how you get from the corpus to the model.

Parameters

You can think of the training process (where we take a corpus and create a model of it) as being sort of like an industrial operation:

We’ll take a quick look now at two of these parameters, so that you can get a sense of how they affect the training process; they also have an important impact on how we interpret the results of the model. Later in the week, we’ll look at these parameters in more detail.
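To make those dials concrete, here is a minimal sketch of what the training step can look like in code. It uses the gensim library rather than our toolkit, and the tiny corpus shown is purely illustrative, but the parameters are exactly the kind of settings we’ll be adjusting: the window, the number of iterations, the number of dimensions, and negative sampling all appear as controls on the machinery.

```python
# A minimal sketch using the gensim library (not our toolkit); the tiny
# "corpus" here is purely illustrative.
from gensim.models import Word2Vec

# The corpus: a list of tokenized sentences (lists of words).
corpus = [
    ["the", "sacred", "grove", "was", "quiet"],
    ["a", "holy", "shrine", "in", "a", "consecrated", "place"],
    # ... the rest of the prepared corpus ...
]

model = Word2Vec(
    corpus,
    vector_size=100,  # how many dimensions each word vector gets
    window=5,         # how far the context window reaches on either side
    epochs=5,         # how many passes (iterations) over the corpus
    negative=5,       # how many negative samples per training example
    min_count=1,      # ignore words rarer than this (1 only because the corpus is tiny)
)
```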

Window

The first parameter for us to consider is the concept of the window

And here we come to a fundamental assumption for a lot of text analysis: that words that are used together have something to do with one another

What does it mean for words to be used together?

In Word2Vec, instead of these, we use a window:
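As a small illustration (plain Python, not the toolkit), here are the target-and-context pairings that a window of two words on either side would produce for a single toy sentence:

```python
sentence = ["the", "sacred", "grove", "was", "quiet"]
window = 2  # two words on either side of the target

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(target, "->", context)

# "sacred" ends up paired with "the", "grove", and "was": in this scheme,
# "used together" simply means falling inside the same window.
```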

Iterations

We’ve talked about the creation of a model as a training process, and we’ve just imagined it as a caterpillar eating its way through the text. We can extend this metaphor and think about the caterpillar learning about the text as it goes:

Vectors: a first look

Let’s look next at some terms that may seem most distant from our humanistic expertise: the ones that refer to the mathematical aspects of word embedding models. We call our toolkit the vector toolkit: what is a vector and how is it relevant in this case? We’ll start with a simple explanation, and then circle back a bit later for more detail.

A vector is basically a line that has both a specific length and a specific direction or orientation in space:

In a word-embedding model, the model represents a text corpus almost like a dandelion: as if each word were at the end of one of the little dandelion threads:
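In code terms (a sketch, assuming a trained gensim model like the one above), each of those threads is just an array of numbers with a length and a direction:

```python
import numpy as np

vec = model.wv["sacred"]     # the vector for one word (assuming it is in the vocabulary)
print(vec.shape)             # e.g. (100,): one number per dimension
print(np.linalg.norm(vec))   # the vector's length (its magnitude)
```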

Cosine Similarity

So what does it mean to be near something in vector space? Here’s another mathematical term:

So in this example (a real-world example from the WWP corpus), if we take the word sacred as our starting point, the words holy and consecrated are fairly close in meaning (and have high cosine similarity); the word shrine is more distant but still related enough to be interesting
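Cosine similarity is just the cosine of the angle between two vectors: 1 means pointing the same way, 0 means unrelated (at right angles). Here is a sketch of the measurement, both by hand and via gensim; the exact scores depend on the corpus and the training run, so the WWP numbers aren’t reproduced here.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Assuming these words are in the model's vocabulary:
print(cosine_similarity(model.wv["sacred"], model.wv["holy"]))

# gensim provides the same measurement directly:
print(model.wv.similarity("sacred", "holy"))
print(model.wv.similarity("sacred", "shrine"))
```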

So far so good? Questions?

Querying

So what can we do with this information? We’ve created a model of our corpus (a representation that helps us see some aspect of that information more clearly and easily): how do we use it?

The first thing we might try is just querying the model about the neighborhood of a word we’re interested in: essentially, asking it questions about where specific words are located and what is around them:
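Here is a gensim sketch of that kind of query (not the toolkit’s own interface): we ask for the words whose vectors lie closest, by cosine similarity, to our starting word.

```python
# The ten nearest neighbors of "sacred" in the model's vector space.
for word, score in model.wv.most_similar("sacred", topn=10):
    print(f"{score:.3f}  {word}")
```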

Clustering

Another way we can interact with the model is to ask it, more generally: where are your semantically dense zones? Or: please show me some clusters of related words!

This process is somewhat similar to topic modeling:

Again, if we were writing code ourselves we could exert some fine control over this process, but in the toolkit we have a simple version:

To generate this list:

Clusters: an example

So what we get at the end of the process is clusters of words that are like neighborhoods within the vector space: densely populated areas where words are grouped together around a concept or a textual phenomenon.
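One common way to produce this kind of cluster list outside the toolkit is k-means clustering over all of the word vectors. This is only a sketch: the number and contents of the clusters will vary with the corpus and the settings.

```python
from sklearn.cluster import KMeans

n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(model.wv.vectors)   # one cluster label per word in the vocabulary

# Group the vocabulary by cluster and show a few words from each neighborhood.
clusters = {i: [] for i in range(n_clusters)}
for word, label in zip(model.wv.index_to_key, labels):
    clusters[label].append(word)
for i, words in clusters.items():
    print(i, words[:10])
```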

Vector Math 1

One more thing we can do to explore the word information in our vector space model: we can examine the relationships between words, taking advantage of the fact that each word is represented as a vector, which is at bottom just a list of numbers we can do arithmetic with

To understand how this works, we need to envision a little more clearly how words are positioned in this vector space model:
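Concretely (another gensim sketch): because each word is stored as an array of numbers, ordinary arithmetic on word positions is possible, and subtracting one vector from another gives the offset that carries you from one word’s position to the other’s.

```python
import numpy as np

offset = model.wv["holy"] - model.wv["sacred"]   # a direction through the space
print(offset[:5])                                # the first few coordinates
print(np.linalg.norm(offset))                    # how far apart the two words sit
```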

Vector Math 2

We can use this information to tease out more specific semantic spaces for individual words:

Note that words here are just proxies or symptoms (imperfect ones) for the concepts we might be interested in:
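The textbook illustration of this kind of vector math, from the word2vec literature, is the analogy query: start at one word, subtract a second, add a third, and ask what lies nearest the resulting point. Whether a given corpus actually returns the "expected" answer (queen, in the classic example) depends entirely on the corpus and the training; the words below are illustrative, not WWP results.

```python
# king - man + woman ~= ?  (the classic example; results vary by corpus)
for word, score in model.wv.most_similar(positive=["king", "woman"],
                                         negative=["man"], topn=5):
    print(f"{score:.3f}  {word}")
```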

Validation

As we use our model in these various ways, we’re going to get some results (hopefully) that look very predictable, and some others that look provocative and fascinating, and maybe some others that look bizarre and unexpected. How can we tell the difference between an interpretive breakthrough and a glitch resulting from some terrible flaw in our training process?

Once we’ve generated a model, there are ways we can and should test it to see whether it is actually a useful representation that will give research results we can use. That testing process is called validation. To validate a model, we can ask questions like these:

Are your results consistent across models?

Do you get plausible word groupings?

Does vector math work as you would expect?
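As one small sketch of the first check (consistency across models): train two models on the same corpus with different random seeds and see whether a test word’s neighborhood stays broadly stable. This uses gensim again; the query word and thresholds are illustrative.

```python
from gensim.models import Word2Vec

# workers=1 keeps the runs more directly comparable.
m1 = Word2Vec(corpus, vector_size=100, window=5, epochs=5, seed=1, workers=1, min_count=1)
m2 = Word2Vec(corpus, vector_size=100, window=5, epochs=5, seed=2, workers=1, min_count=1)

neighbors1 = {w for w, _ in m1.wv.most_similar("sacred", topn=20)}
neighbors2 = {w for w, _ in m2.wv.most_similar("sacred", topn=20)}
print("overlap:", len(neighbors1 & neighbors2), "out of 20")
```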

Circling back: another look at vectors

Now that we’ve worked through the basic concepts, let’s circle back and consider the whole picture of word vectors or word embedding models, and introduce a few additional complexities.

A quick review: we’ve already noted that a vector is basically a line that has both a specific length and a specific direction or orientation in space:

Words and dimensions

Each of the pieces of information that contribute to the precise direction of the vector comes from a word in the corpus. Putting this a different way: when we train a model based on a corpus of words, each word contributes a dimension (an informational axis) to the location of all the other words in the corpus.

Let’s pause and let that sink in:
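One way to let it sink in is to see the idea in miniature: count, for every word, how often every other word appears inside its window. Each word then has one entry per vocabulary word, which means one dimension per vocabulary word. This is a plain-Python sketch, reusing the toy corpus from the earlier training sketch; the trained model doesn’t store this giant table directly, which is where the idea of embedding, below, comes in.

```python
from collections import Counter, defaultdict

window = 2
counts = defaultdict(Counter)
for sentence in corpus:    # `corpus`: the tokenized toy sentences from the earlier sketch
    for i, target in enumerate(sentence):
        neighbors = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        for neighbor in neighbors:
            counts[target][neighbor] += 1

vocab = sorted({w for s in corpus for w in s})
row = [counts["sacred"][w] for w in vocab]   # "sacred" as one number per vocabulary word
print(len(vocab), "dimensions")
print(row)
```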

Way more dimensions...

With all that in mind, let’s try picturing a higher-dimensional reality....

With a real-world corpus, our word vectors are defined by way more than two or three dimensions, which makes them very difficult to draw and to visualize in our minds

But let’s try to imagine it:

In word embedding models, we might have hundreds of dimensions:

At this stage we can also come back to look with more expert eyes at the parameters that we talked about earlier: the settings we can control as part of the model training process.

Two parameters we’re already familiar with:

We can now add another: we can control the number of dimensions in our model. What does this really mean?
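In gensim terms (a sketch; the toolkit may name this setting differently), the number of dimensions is the vector_size parameter:

```python
from gensim.models import Word2Vec

small = Word2Vec(corpus, vector_size=50,  window=5, epochs=5, min_count=1)
large = Word2Vec(corpus, vector_size=300, window=5, epochs=5, min_count=1)

print(small.wv["sacred"].shape)   # (50,): a coarser representation
print(large.wv["sacred"].shape)   # (300,): more room for nuance, needs more data to train well
```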

Negative sampling

Finally we come to the most abstruse parameter of all: negative sampling. To understand what this is, we need first to remind ourselves about the model training process:

So negative sampling is a way to reduce that work:

Any ideas about what effect a large negative sampling value would have on our model and on the training process?
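However that discussion comes out, here is where the dial sits in practice (a gensim sketch, not the toolkit): the negative parameter sets how many "noise" words are sampled for each training example, with fairly small values (roughly 5 to 20) being typical.

```python
from gensim.models import Word2Vec

# More negative samples means more corrective work per training example,
# so larger values generally make training slower.
model_ns5  = Word2Vec(corpus, vector_size=100, window=5, negative=5,  min_count=1)
model_ns20 = Word2Vec(corpus, vector_size=100, window=5, negative=20, min_count=1)
```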

Embedding (semi-technical)

We’re now ready to explore one term we haven’t defined yet: embedding, which is a curious term, especially in the context of the phrase word embedding models. To explain embedding I can offer a semi-technical view, and then a metaphorical view, and you can decide which works better.

For the semi-technical view, we need to remember a few things:

So embedding is a way of reducing the number of dimensions we’re working with: embedding some of the dimensions in each other and eliminating the empty space.

Embedding (metaphorical)

Metaphorically, we can imagine embedding as being sort of like flattening, or like a projection:
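One way to make the flattening metaphor concrete is to literally project the high-dimensional word vectors down onto a flat page, for instance with PCA. This sketch assumes the gensim model from the earlier examples and a handful of illustrative words.

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

words = ["sacred", "holy", "consecrated", "shrine"]   # illustrative; use words in your vocabulary
vectors = [model.wv[w] for w in words]

points = PCA(n_components=2).fit_transform(vectors)   # hundreds of dimensions -> 2

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```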

Questions at this stage?

The word vector process: Data preparation

So another way to put this all together is to walk through the entire process in order, step by step. There are basically three major acts in this drama, very much like a classic comedy

In the first act, we set up the problem and introduce the main characters:

As part of this process, we might discover things that cause us to reassess or expand our research question: so it’s helpful to keep an open mind and be prepared to treat this as an iterative process.
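Here is a minimal data-preparation sketch (with a hypothetical file name): turning raw text into the tokenized, lower-cased "sentences" that the training step expects. A real project makes many more decisions here, about punctuation, spelling regularization, what counts as a word, and so on, and every one of them shapes the resulting model.

```python
import re
from pathlib import Path

def prepare(path):
    """Read a plain-text file and return a list of tokenized, lower-cased lines."""
    text = Path(path).read_text(encoding="utf-8")
    # One very simple choice: lower-case everything and keep only alphabetic tokens.
    return [re.findall(r"[a-z]+", line.lower())
            for line in text.splitlines() if line.strip()]

# corpus = prepare("my_collection.txt")   # hypothetical file name
```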

The word vector process: Training the model

In the second act, we get the real meat of the plot: in this case, the process where we train our model and create a vector space representation of our corpus:

The word vector process: Iteration and refinement

As before, this is an iterative process!

The word vector process: Querying and research

In the final act, as with a proper comedy, we reach resolution and answers: this is where we can start querying our model and doing our research (although as we’ve seen, the corpus-building and model-training processes are also definitely integral to the research process)

Tools for word embedding models

To wrap up this session, let’s take a quick look at the tools we use for working with word embedding models

We can arrange them in order of abstraction:

However, those layers are all sitting underneath us and they each have effects on the outcomes of our work:

Discussion and questions

So now let’s take a step back, with this more detailed perspective: