Introduction to Word Vectors

Julia Flanders

2019-04-01

A Road Map

As we’ve already seen, word vectors are complicated...

The next few sessions are intended to offer an overview, from several different angles:

Hopefully by the end, we’ll have gone over the same material from enough different perspectives that it will all make perfect sense!

And at various points, we’ll take a step back and think about the explanatory process itself: what kinds of explanation might work best for different audiences (especially readers of our scholarship, project collaborators, colleagues, and grant reviewers, and potentially our students).

Corpus and model

We’re going to hear the terms corpus and model a lot this week, so let’s look more closely at them.

Corpus:

Model:

The data preparation process is how you get from the research collection to the corpus

The training process is how you get from the corpus to the model.

Parameters

Remember that we said different researchers might want to use the model for different things, which would result in training/generating the model somewhat differently. The way we control that training process is by adjusting a set of parameters.

You can think of the training process (where we take a corpus and create a model of it) as being sort of like an industrial operation:

We’ll take a quick look now at two of these parameters, so that you can get a sense of how they affect the training process; they also have an important impact on how we interpret the results of the model. Later in the week, we’ll look at these parameters in more detail and think about the effect these specific settings have on our models.

Window

The first parameter for us to consider is the concept of the window.

And here we come to a fundamental assumption for a lot of text analysis: that words that are used together have something to do with one another.

What does it mean for words to be used together?

In Word2Vec, instead of these, we use a window:

Remember that this is a machine learning process and moreover it is an unsupervised machine learning process: one that starts from a state of complete ignorance and has to bootstrap itself.
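To make the window idea a bit more concrete, here is a minimal Python sketch (not Word2Vec itself) of how a window of a given size decides which words count as being "used together". The function name `context_pairs` and the sample sentence are made up for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word that falls
    within `window` positions on either side of each target word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical sample text, already split into words.
tokens = "the sacred shrine was holy".split()

narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=2)

print(len(narrow), "pairs with window=1")
print(len(wide), "pairs with window=2")
```

A wider window produces more pairs, so each word is associated with more of its surroundings; a narrower window keeps only the immediate neighbors.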

Iterations

We’ve talked about the creation of a model as a training process, and we’ve just imagined it as a bookworm eating its way through the text, repeatedly. The trained model is the representation of the probability that words appear within the same window.

Vectors: a first look

Let’s look next at some terms that may seem most distant from our humanistic expertise: the ones that refer to the mathematical aspects of word embedding models. The word vector has come up already: what is a vector and how is it relevant in this case? We’ll start with a simple explanation first, and then circle back a bit later for more detail.

A vector is basically a line that has both a specific length and a specific direction or orientation in space:

In a word-embedding model, the model represents a text corpus almost like a dandelion: as if each word were at the end of one of the little dandelion threads:

Cosine Similarity: What is a cosine anyway?

So what does it mean to be near something in vector space? How do we measure this kind of proximity or association? If we understand these vectors as lines whose direction and length reflect word associations in the corpus, then the more closely aligned two vectors are (the more nearly they point in the same direction), the nearer they are for our purposes.

We can measure that alignment by using a mathematical expression called a cosine. What is a cosine?

Cosine Similarity

So now we can come back to our question of how to measure nearness. In word embedding models the measure of nearness that we use is something called cosine similarity.

So in this example (a real-world example from the WWP corpus), if we take the word sacred as our starting point, the words holy and consecrated are fairly close in meaning (and have high cosine similarity); the word shrine is more distant but still related enough to be interesting.
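For those who like to see the arithmetic, here is a small Python sketch of cosine similarity. The three-dimensional vectors are invented for illustration (they are not values from the WWP model), and real models use many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    the dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# Hypothetical toy vectors, chosen only to illustrate the idea.
sacred = [0.9, 0.8, 0.1]
holy = [0.8, 0.9, 0.2]
shrine = [0.5, 0.2, 0.7]

print("sacred ~ holy:  ", round(cosine_similarity(sacred, holy), 3))
print("sacred ~ shrine:", round(cosine_similarity(sacred, shrine), 3))
```

Because each vector is divided by its own length, cosine similarity depends only on the direction of the vectors: a value near 1 means they point the same way, 0 means they are at right angles, and -1 means they point in opposite directions.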

So far so good? Questions?

Querying

So what can we do with this information? We’ve created a model of our corpus (a representation that helps us see some aspect of that information more clearly and easily): how do we use it?

The first thing we might try is just querying the model about the neighborhood of a word we’re interested in: essentially, asking it questions about where specific words are located and what is around them:
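As a rough illustration of what such a query does behind the scenes, here is a Python sketch that ranks every other word in a tiny model by its cosine similarity to the word we ask about. The model, its vectors, and the function name `nearest_words` are all hypothetical; the tools we actually use provide their own query functions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


def nearest_words(model, word, n=3):
    """Return the n words whose vectors are most similar to `word`."""
    target = model[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in model.items()
        if other != word  # skip the query word itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical toy model: word -> vector (invented values).
toy_model = {
    "sacred": [0.9, 0.8, 0.1],
    "holy": [0.8, 0.9, 0.2],
    "consecrated": [0.7, 0.9, 0.1],
    "shrine": [0.5, 0.2, 0.7],
    "turnip": [-0.6, 0.1, -0.4],
}

neighbors = nearest_words(toy_model, "sacred", n=3)
for word, score in neighbors:
    print(word, round(score, 3))
```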

Clustering

Another way we can interact with the model is to ask it more generally, where are your semantically dense zones? Or please show me some clusters of related words!

This process is somewhat similar to topic modeling:

To generate these clusters (as part of the initial model training process):

Clusters: an example

So what we get at the end of the process is clusters of words that are like neighborhoods within the vector space: densely populated areas where words are grouped together around a concept or a textual phenomenon.
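Here is one way such neighborhoods could be found, sketched in Python. It assumes k-means clustering, which is a common choice but is an assumption on our part rather than a description of any particular tool. The toy vectors, the starting centers, and the function names are invented for illustration.

```python
import math


def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Average position of a group of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(model, centers, rounds=10):
    """Group words around the given starting centers.
    Each round: assign every word to its nearest center,
    then move each center to the middle of its group."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for word, vector in model.items():
            nearest = min(range(len(centers)),
                          key=lambda i: distance(vector, centers[i]))
            clusters[nearest].append(word)
        centers = [
            mean([model[w] for w in members]) if members else center
            for members, center in zip(clusters, centers)
        ]
    return clusters


# Hypothetical two-dimensional toy model (invented values).
toy_model = {
    "sacred": [0.9, 0.8], "holy": [0.8, 0.9], "consecrated": [0.7, 0.9],
    "ship": [-0.8, 0.1], "sail": [-0.9, 0.2], "voyage": [-0.7, 0.3],
}

clusters = kmeans(toy_model, centers=[toy_model["sacred"], toy_model["ship"]])
for members in clusters:
    print(members)
```

Note that this sketch groups words by straight-line distance rather than cosine similarity, and that in practice the starting centers are usually chosen at random, which is one reason the clusters can differ from run to run.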

Check the time and consider stopping here!

Vector Math 1

One more thing we can do to explore the word information in our vector space model: we can examine the relationships between words, taking advantage of the fact that each word is represented as a vector, which is essentially a list of numbers.

To understand how this works, we need to envision a little more clearly how words are positioned in this vector space model:

Vector Math 2

We can use this information to tease out more specific semantic spaces for individual words:

Note that words here are just proxies or symptoms (imperfect ones) for the concepts we might be interested in:
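To show what this kind of vector arithmetic looks like in practice, here is a Python sketch using the familiar "king minus man plus woman" pattern. The two-dimensional vectors are deliberately artificial (one dimension standing loosely for royalty, the other for masculinity) so that the arithmetic is easy to follow; a trained model would never be this tidy. The function name `analogy` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# Hypothetical toy vectors: [royalty, masculinity] (invented values).
toy_model = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "turnip": [-0.5, 0.5],
}


def analogy(model, start, minus, plus):
    """Compute start - minus + plus, then return the word
    (other than the three inputs) closest to the result."""
    target = [s - m + p
              for s, m, p in zip(model[start], model[minus], model[plus])]
    candidates = {w: v for w, v in model.items()
                  if w not in (start, minus, plus)}
    return max(candidates,
               key=lambda w: cosine_similarity(target, candidates[w]))


result = analogy(toy_model, "king", "man", "woman")
print("king - man + woman is closest to:", result)
```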

Validation

As we use our model in these various ways, we’re going to get some results (hopefully) that look very predictable, and some others that look provocative and fascinating, and maybe some others that look bizarre and unexpected. How can we tell the difference between an interpretive breakthrough and a glitch resulting from some terrible flaw in our training process?

Once we’ve generated a model, there are ways we can and should test it to see whether it is actually a useful representation that will give research results we can use. That testing process is called validation. To validate a model, we can ask questions like these:

Are your results consistent across models?
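One simple way to put a number on this question, offered here as a sketch rather than a standard procedure: train the model more than once, ask each version for the nearest neighbors of the same word, and measure how much the two lists overlap. The neighbor lists below are invented, and `neighbor_overlap` is a hypothetical helper.

```python
def neighbor_overlap(neighbors_a, neighbors_b):
    """Share of words the two neighbor lists have in common
    (size of the intersection divided by size of the union)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not (a | b):
        return 1.0  # two empty lists agree trivially
    return len(a & b) / len(a | b)


# Hypothetical neighbors of "sacred" from two separate training runs.
run_1 = ["holy", "consecrated", "shrine", "divine"]
run_2 = ["holy", "divine", "consecrated", "altar"]

overlap = neighbor_overlap(run_1, run_2)
print("overlap:", overlap)
```

A value near 1 suggests the neighborhood is stable across models; a value near 0 suggests the result may be an accident of one particular training run.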

Do you get plausible word groupings?

Does vector math work as you would expect?

If we didn’t stop before, consider stopping now!

Circling back: another look at vectors

Now that we’ve worked through the basic concepts, let’s circle back and consider the whole picture of word vectors or word embedding models, and introduce a few additional complexities.

[if starting the day here, check in and see if people want to recap anything]

A quick review: we’ve already noted that a vector is basically a line that has both a specific length and a specific direction or orientation in space:

Words as vectors

The example we were just looking at shows a vector defined by three dimensions: three different numbers representing three different axes of meaning. However, when we’re working with word embedding models, we are working with vectors that are defined by many more dimensions. So in order to understand that scenario, we need to get a little more comfortable with two ideas:

So let’s take the first example on this slide (the idea may look familiar if you read the Jay Alammar Illustrated Introduction to Word Embeddings):

Pause for questions and reflection!

So, taking this a step further, let’s look at the chart on the right:

So what do we see when we look at the righthand chart?

A few interesting things to note:

Negative Sampling

So let’s now add another concept. Cast our minds back to the little bookworm eating through the corpus, making observations about the words that are near the target word, and adjusting the position of the words within the model. The information about those words that it observes is being fed into our little chart here. But how about the words that aren’t being observed?

We mentioned earlier that these are also significant. When the bookworm takes a bite, there are a huge number of words that are not in that sample, and the model training process could (in principle) use that information to adjust all of the words in the corpus, moving them away from the target word. In practice, it doesn’t adjust all of the words (since that would be too much work) but it adjusts some of the words: a random sample. This is called negative sampling, and it is one of the parameters we can adjust: we can say how many of these non-appearing words should have their positions updated with each observation. If we have a large negative sampling value, the model training will be more precise, but the training process will take a lot longer.
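Here is a small Python sketch of the sampling step only: given the words the bookworm did observe, pick a few of the words it did not observe. Drawing them uniformly at random is a simplification assumed here (actual implementations typically favor more frequent words), and the vocabulary and function name are made up for illustration.

```python
import random


def draw_negative_samples(vocabulary, observed, k, rng):
    """Pick up to k words that were NOT seen in the current window."""
    candidates = [word for word in vocabulary if word not in observed]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical vocabulary and one observation (target word plus window).
vocabulary = ["sacred", "holy", "shrine", "ship", "sail",
              "voyage", "turnip", "garden"]
observed = {"sacred", "holy", "shrine"}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = draw_negative_samples(vocabulary, observed, k=2, rng=rng)
print("words to move away from the target:", negatives)
```

Raising `k` means more words have their positions adjusted on every observation, which is where the extra training time comes from.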

Looking again at our chart: If time and computing power were no object, we could imagine the chart extending off to the right so that every word in the corpus is listed, and we could imagine the position of every word in the model being adjusted with each observation, so that both the positive and negative sampling information would be fully reflected in the model. We could think of this situation as a kind of perfect model:

Let’s test this idea a little further:

One more look at our perfect model:

So what do we do about that?

Embedding!

To make the model more compact, and hence easier to process while you wait, clever people developed a technique called embedding which flattens the model: reducing it from a very large number of dimensions (like, thousands) to a somewhat smaller number of dimensions (like, hundreds).

For those of you who may have read Edwin Abbott’s Flatland, you might remember how when a sphere visits Flatland, the two-dimensional creatures there see it as a circle: a three-dimensional entity flattened or projected onto two dimensions. Something similar sometimes happens to Wile E. Coyote.

We are not going to cover the mathematics of it, but we will look at a few effects/results.

In simple terms:

So by specifying the number of dimensions, we are in effect specifying how many other words each word’s position takes into account:

I’m afraid there’s a little “magic happens here” at this stage: the mathematical details are a little out of scope for this institute, but there are some good sources in the readings for those who want to understand this more fully.

The word vector process: Data preparation

So another way to put this all together is to walk through the entire process in order, step by step. There are basically three major acts in this drama, very much like a classic comedy.

In the first act, we set up the problem and introduce the main characters:

As part of this process, we might discover things that cause us to reassess or expand our research question: so it’s helpful to keep an open mind and be prepared to treat this as an iterative process.

The word vector process: Training the model

In the second act, we get the real meat of the plot: in this case, the process where we train our model and create a vector space representation of our corpus:

The word vector process: Iteration and refinement

As before, this is an iterative process!

The word vector process: Querying and research

In the final act, as with a proper comedy, we reach resolution and answers: this is where we can start querying our model and doing our research (although as we’ve seen, the corpus-building and model-training processes are also definitely integral to the research process).

Tools for word embedding models

To wrap up this session, let’s take a quick look at the tools we use for working with word embedding models.

We can arrange them in order of abstraction:

Those layers are all sitting underneath us and they each have effects on the outcomes of our work:

Discussion and questions

So now let’s take a step back, with this more detailed perspective: