Introduction to Word Vectors

Julia Flanders

2019-04-01

A Road Map

As we’ve already seen, word vectors are complicated...

This session is intended to offer an overview, from several different angles:

Hopefully by the end, we’ll have gone over the same material from enough different perspectives that it will all make perfect sense!

And at the end of the session, we’ll take a step back and think about the explanatory process itself: what kinds of explanation might work best for different audiences (our students? readers of our scholarship? project collaborators?)

Corpus and model

Corpus:

Model:

The "training" process is how you get from the corpus to the model.

Parameters

You can think of the training process (where we take a corpus and create a model of it) as being sort of like an industrial operation:
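To make that concrete, here is roughly what the "industrial operation" looks like in code, using Python's gensim library as one common implementation of Word2Vec (the workshop toolkit wraps up similar settings behind its own interface; the parameter names below are gensim's):

```python
from gensim.models import Word2Vec

# A toy corpus: in practice, thousands of tokenized sentences from our texts.
sentences = [
    ["the", "sacred", "grove", "was", "quiet"],
    ["a", "holy", "place", "set", "apart"],
    ["the", "shrine", "stood", "within", "the", "grove"],
]

# Each keyword argument below is one of the dials on the machine.
model = Word2Vec(
    sentences,
    vector_size=100,  # number of dimensions in the vector space
    window=5,         # how many words on either side count as "context"
    epochs=5,         # how many passes (iterations) over the corpus
    negative=5,       # negative sampling (discussed later in this session)
    min_count=1,      # ignore words rarer than this (1 only because the corpus is tiny)
    sg=1,             # use the skip-gram training algorithm
)
```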

Window

A fundamental assumption for a lot of text analysis: that words that are used together have something to do with one another

What does it mean for words to be used together?

In Word2Vec, instead of these, we use a window:

How does the window actually get used?

So just to get our minds around this for a moment, let’s think through how this might actually impact our research: first, imagine a very small window...then a very large one
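A small sketch in plain Python (not tied to any particular toolkit) can make the window concrete: for each position in a sentence, the window size determines which neighboring words count as that word's context.

```python
sentence = ["she", "entered", "the", "sacred", "grove", "at", "dusk"]
window = 2  # try 1 for a very small window, 10 for a very large one

# For each word, collect the context words that fall inside the window.
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(center, "->", context)
```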

Iterations

We’ve talked about the creation of a model as a training process, and we’ve just imagined it as a caterpillar eating its way through the text. We can extend this metaphor and think about the caterpillar learning about the text as it goes:
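In gensim's implementation (used here purely for illustration), the number of passes the caterpillar makes over the corpus is the epochs setting, and we can ask for those passes explicitly:

```python
from gensim.models import Word2Vec

sentences = [["the", "sacred", "grove"], ["a", "holy", "place"]]  # toy corpus again

model = Word2Vec(vector_size=50, window=5, min_count=1, sg=1)
model.build_vocab(sentences)                      # a first pass just to count the vocabulary
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=10)                            # ten learning passes over the corpus
```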

Vectors: a first look

Let’s look next at the terms that may seem most distant from our humanistic expertise: the ones that refer to the mathematical aspects of word embedding models. We call our toolkit the "vector toolkit": what is a vector, and how is it relevant in this case? We’ll start with a simple explanation first, and then circle back a bit later for more detail.

A vector is basically a line that has both a specific length and a specific direction or orientation in space:

In a word-embedding model, the model represents a text corpus almost like a dandelion: as if each word were at the end of one of the little dandelion threads:
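In practical terms, "each word at the end of a thread" just means that each word is stored as a list of numbers, one number per dimension of the space. A toy sketch with made-up values (a real model learns these numbers during training):

```python
import numpy as np

# A word vector is simply an array of numbers, one per dimension.
# These values are invented; a trained model supplies the real ones.
sacred = np.array([0.21, -0.47, 0.05, 0.88, -0.13])

print(sacred.shape)            # (5,): a five-dimensional toy example
print(np.linalg.norm(sacred))  # the vector's length
```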

Cosine Similarity

So what does it mean to be "near" something in vector space?

So in this example (a real-world example from the WWP corpus), if we take the word sacred as our starting point, the words holy and consecrated are fairly close in meaning (and have high cosine similarity); the word shrine is more distant but still related enough to be interesting
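Cosine similarity is just the cosine of the angle between two word vectors: values near 1 mean the words point in nearly the same direction, values near 0 mean they are essentially unrelated (at right angles). A minimal sketch with made-up three-dimensional vectors (the real numbers for sacred, holy, and shrine would come from the trained model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy vectors, standing in for what a trained model would give us.
sacred = np.array([0.9, 0.8, 0.1])
holy   = np.array([0.8, 0.9, 0.2])
shrine = np.array([0.4, 0.3, 0.9])

print(cosine_similarity(sacred, holy))    # high: close in meaning (about 0.99 here)
print(cosine_similarity(sacred, shrine))  # lower: more distant but still related (about 0.55 here)
```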

Querying

So what can we do with this information? We’ve created a model of our corpus (a representation that helps us see some aspect of that information more clearly and easily): how do we use it?

The first thing we might try is just querying the model: essentially, asking it questions about where specific words are located and what is around them:
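With a trained gensim model (assumed here; the workshop toolkit exposes the same kind of query through its own interface), the basic "what is near this word?" question looks like this:

```python
# Ask the model for the ten words closest to "sacred" in the vector space.
# Assumes a trained gensim model, as sketched earlier.
neighbors = model.wv.most_similar("sacred", topn=10)
for word, similarity in neighbors:
    print(word, round(similarity, 3))

# Or ask directly how similar two specific words are.
print(model.wv.similarity("sacred", "holy"))
```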

Clustering

Another way we can interact with the model is to ask it to tell us about itself; one question we can ask is "where are your semantically dense zones?" or "please show me some clusters of related words!"

This process is somewhat similar to topic modeling:

Again, if we were writing code ourselves we could exert some fine control over this process, but in the toolkit we have a simple version:

Clustering

To generate this list:

Clusters

So what we get at the end of the process is 150 clusters of words that are like neighborhoods within the vector space: densely populated areas where words are grouped together around a concept or a textual phenomenon.
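The toolkit handles the clustering for us, but as a rough sketch of what can happen under the hood, one common approach (not necessarily the exact one the toolkit uses) is k-means clustering over all the word vectors, for example with scikit-learn:

```python
from sklearn.cluster import KMeans

# Group every word vector in the model into 150 clusters.
# Assumes a trained gensim model over a real corpus (more than 150 distinct words).
kmeans = KMeans(n_clusters=150, random_state=0)
labels = kmeans.fit_predict(model.wv.vectors)

# Collect the words that belong to each cluster ("neighborhood").
clusters = {}
for word, label in zip(model.wv.index_to_key, labels):
    clusters.setdefault(label, []).append(word)

print(clusters[0][:20])  # a peek at one neighborhood
```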

Vector Math 1

One more thing we can do to explore the word information in our vector space model: we can examine the relationships between words, taking advantage of the fact that each word is represented as a vector, which is essentially a list of numbers

To understand how this works, we need to envision a little more clearly how words are positioned in this vector space model:

Vector Math 2

We can use this information to tease out more specific semantic spaces for individual words:

Note that words here are just proxies or symptoms (imperfect ones) for the concepts we might be interested in; in strengthening or weakening the pull of any given word, we’re pulling words in specific directions, so it’s good to try multiple different words
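That "pulling" has a direct counterpart in gensim's interface: words listed as positive pull the result toward themselves, and words listed as negative push it away. A hedged sketch (the specific words are purely illustrative; which combinations are revealing depends entirely on your corpus and your question):

```python
# Pull toward "sacred" and "grief", push away from "joy".
# Assumes a trained gensim model; the word choices are illustrative only.
results = model.wv.most_similar(positive=["sacred", "grief"],
                                negative=["joy"],
                                topn=10)
for word, score in results:
    print(word, round(score, 3))
```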

Also note the corpus-specificity of these semantic associations: they reflect usage in this particular corpus, and the same query against a different corpus may return quite different results

Validation

As we use our model in these various ways, we’re going to get some results (hopefully) that look very predictable, and some others that look provocative and fascinating, and maybe some others that look bizarre and unexpected. How can we tell the difference between an interpretive breakthrough and a glitch resulting from some terrible flaw in our training process?

Once we’ve generated a model (that big vector space representation of the words in our corpus) there are ways we can and should test it to see whether it is actually a useful representation that will give research results we can use. That testing process is called validation. To validate a model, we can ask questions like these:

Are your results consistent across models?

Do you get plausible word groupings?

Does vector math work as you would expect?
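A very simple sketch of what such spot-checks might look like in code (assuming a trained gensim model; the word pairs are illustrative and should be chosen from, and occur in, your own corpus):

```python
# Spot-check: do pairs we expect to be related actually score higher
# than pairs we expect to be unrelated?
expected_close = [("sacred", "holy"), ("sacred", "consecrated")]
expected_far = [("sacred", "umbrella")]

for a, b in expected_close:
    print(a, b, model.wv.similarity(a, b))  # hope for relatively high values

for a, b in expected_far:
    print(a, b, model.wv.similarity(a, b))  # hope for relatively low values
```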

Circling back: another look at vectors

Now that we’ve worked through the basic concepts, let’s circle back and consider the whole picture of word vectors or word embedding models, and introduce a few additional complexities.

A quick review: we’ve already noted that a vector is basically a line that has both a specific length and a specific direction or orientation in space:

Vectors, more...

We could also imagine having a vector with less information: for instance, if we were working in two-dimensional space, we would have only two axes to deal with, only two pieces of information determining the vector’s directionality and extent.

Vectors, way more...

We could also imagine having a vector with more information, although this gets very difficult to draw and to visualize in our minds

But let’s try to imagine it:

At this stage we can also come back to look with more expert eyes at the parameters that we talked about earlier: the settings we can control as part of the model training process.

Two parameters we’re already familiar with:

We can now add another: we can control the number of dimensions in our model. What does this really mean?
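In gensim's terms (one concrete implementation), the number of dimensions is the vector_size parameter, and every word vector in the resulting model has exactly that many coordinates:

```python
from gensim.models import Word2Vec

sentences = [["the", "sacred", "grove"], ["a", "holy", "place"]]  # toy corpus

small = Word2Vec(sentences, vector_size=50, min_count=1)   # a 50-dimensional space
large = Word2Vec(sentences, vector_size=300, min_count=1)  # a 300-dimensional space

print(small.wv["sacred"].shape)  # (50,)
print(large.wv["sacred"].shape)  # (300,)
```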

Negative sampling

Finally we come to the most abstruse parameter of all: negative sampling. To understand what this is, we need first to remind ourselves about the model training process:

So negative sampling is a way to reduce that work:
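In gensim this trade-off appears as the negative parameter: how many randomly chosen "wrong" words the model checks each training example against, instead of updating against the entire vocabulary.

```python
from gensim.models import Word2Vec

sentences = [["the", "sacred", "grove"], ["a", "holy", "place"]]  # toy corpus

# negative=5: for each (word, context) pair, sample five random non-context
# words and update against only those, not the whole vocabulary.
with_sampling = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, negative=5)

# negative=0 with hs=1 switches to a different shortcut (hierarchical softmax).
without_sampling = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, negative=0, hs=1)
```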

Embedding (semi-technical)

We’re now ready to explore one term we haven’t defined yet: embedding, which is a curious term, especially in the context of the phrase word embedding models. To explain embedding I can offer a semi-technical view, and then a metaphorical view, and you can decide which works better.

For the semi-technical view, we need to remember a few things:

So embedding is a way of reducing the number of dimensions we’re working with: embedding some of the dimensions in each other and eliminating the empty space.

Embedding (metaphorical)

Metaphorically, we can imagine embedding as being sort of like flattening:
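Visualization does something loosely analogous whenever it puts words on a flat page: squeezing the model's many dimensions down to two. As an illustration of that kind of flattening (this is not the embedding step itself, just a familiar cousin of it), we can project the trained vectors down to two coordinates with PCA:

```python
from sklearn.decomposition import PCA

# Flatten the high-dimensional word vectors to 2 coordinates each, for plotting.
# Assumes a trained gensim model, as sketched earlier.
pca = PCA(n_components=2)
flat = pca.fit_transform(model.wv.vectors)

for word, (x, y) in zip(model.wv.index_to_key[:10], flat[:10]):
    print(word, round(x, 2), round(y, 2))
```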

The word vector process: Data preparation

So another way to put this all together is to walk through the entire process in order, step by step. There are basically three major acts in this drama, very much like a classic comedy

In the first act, we set up the problem and introduce the main characters:

As part of this process, we might discover things that cause us to reassess or expand our research question: so it’s helpful to keep an open mind and be prepared to treat this as an iterative process.
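As a hedged sketch, this first act often amounts to reading text files and turning them into lists of word tokens; even the simple cleaning decisions here (lowercasing, dropping punctuation) are research decisions. The "corpus" folder below is hypothetical:

```python
import re
from pathlib import Path

def tokenize(text):
    # A deliberately simple tokenizer: lowercase words only, punctuation dropped.
    return re.findall(r"[a-z]+", text.lower())

# Read every .txt file in a (hypothetical) corpus folder into tokenized sentences.
sentences = []
for path in Path("corpus").glob("*.txt"):
    for line in path.read_text(encoding="utf-8").splitlines():
        tokens = tokenize(line)
        if tokens:
            sentences.append(tokens)
```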

The word vector process: Training the model

In the second act, we get the real meat of the plot: in this case, the process where we train our model and create a vector space representation of our corpus:

The word vector process: Iteration and refinement

As before, this is an iterative process!

The word vector process: Querying and research

In the final act, as with a proper comedy, we reach resolution and answers: this is where we can start querying our model and doing our research (although as we’ve seen, the corpus-building and model-training processes are also definitely integral to the research process)

Tools for word embedding models

To wrap up this session, let’s take a quick look at the tools we use for working with word embedding models

We can arrange them in order of abstraction:

However, those layers are all sitting underneath us and they each have effects on the outcomes of our work:

These layers mean that what we’re seeing is the result of a whole stack of decisions that could have been made differently: different tools or decisions might be more suitable for different projects, and you may want to revisit these choices over time, with more expertise. The things we observe about word vectors in this workshop are not universal but local and situational.

Discussion and questions

So now let’s take a step back, with this more detailed perspective: