Word Vectors Institute: Introductions and Overview

Julia Flanders



To situate this event a bit:

We’re not expecting any prior knowledge of text analysis and certainly none of word embedding models (that’s why you’re here!) but we hope everyone will come away feeling comfortable with several things:

What we will not be covering:

Finding the right level

This is also a sort of meta-workshop:

A quick look at the schedule...


Making notes

We’ve provided a fair amount of time for individual and small-group experimentation, and time for you to think about your own research projects.

However, this workshop will really just be a start: a chance to get comfortable with fundamental concepts.

I want to talk for a moment about some suggestions for how to take this work with you and continue it in your own time after you get home:


So, with those preliminaries out of the way, let’s get into our first explanation of word embedding models. For this first explanatory pass, we won’t dwell in detail on the terminology or the mathematics: we’ll keep to a metaphorical level of explanation to get a feel for things.

And the first term I want to talk about is the word “model”

But models can also operate in more precise ways:

In this workshop, the word “model” works in both of these senses. When we talk about word embedding models or vector space models, the model conveys several things:

A first look at word vectors

At the simplest level, a word embedding model is a model of a text corpus that represents word usage in the corpus by locating each word in space.

Metaphorically, we can imagine that those spatial locations show us neighborhoods of words that tend to occur in the same contexts.
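As a tiny sketch of that neighborhood idea (the words and two-dimensional vectors below are invented purely for illustration; a real model learns hundreds of dimensions from a corpus), we can rank words by cosine similarity, the standard measure of how closely two word vectors point in the same direction:

```python
import math

# Toy 2-D "embeddings", invented for this sketch -- a real model
# would learn these positions from a large corpus.
vectors = {
    "river":  (0.9, 0.1),
    "stream": (0.85, 0.2),
    "money":  (0.1, 0.9),
    "loan":   (0.15, 0.85),
}

def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction (same neighborhood)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def neighbors(word):
    """Rank all other words by similarity to `word` -- its 'neighborhood'."""
    return sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
        reverse=True,
    )

print(neighbors("river"))  # "stream" comes first: it points the same way
```

With these made-up positions, “stream” is the nearest neighbor of “river” while “money” and “loan” form a separate neighborhood, which is just the spatial metaphor made computable.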

For instance, in this example:

Thinking with vectors

That’s interesting in itself:

It’s also interesting because we can do further analysis:

If you had a chance to read Ben Schmidt’s analysis of the Rate My Professor data, where he considers breaking down the gender binary, he is taking advantage of this same idea:
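The arithmetic behind this kind of further analysis can be sketched with toy vectors. These are hand-chosen so that the classic king − man + woman example works; this illustrates the general idea of vector offsets, not Schmidt’s actual method or data:

```python
# Hand-picked 2-D vectors: first dimension ~ "royalty", second ~ "masculine".
# A trained model discovers comparable regularities on its own.
V = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "child": [0.0, 0.5],
}

def sub(a, b): return [x - y for x, y in zip(a, b)]
def add(a, b): return [x + y for x, y in zip(a, b)]

def closest(vec, exclude=()):
    """Find the stored word nearest (squared Euclidean distance) to `vec`."""
    candidates = [w for w in V if w not in exclude]
    return min(candidates,
               key=lambda w: sum((x - y) ** 2 for x, y in zip(V[w], vec)))

# king - man + woman: remove the "masculine" offset, add the "feminine" one.
target = add(sub(V["king"], V["man"]), V["woman"])
print(closest(target, exclude={"king", "man", "woman"}))  # "queen"
```

The point is that a direction in the space (here, the offset between “man” and “woman”) can itself carry meaning, which is exactly the property that makes analyses along a gendered axis possible.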

Locating words in vector space

So how do those words get located in this space? What does the spatial metaphor really mean?

We will go into the details much more fully after lunch. For this initial orientation:
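To preview the intuition (a simplified sketch, not the actual training procedure): the raw signal a model works from is which words occur near which other words. Counting co-occurrences in a sliding window looks like this:

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """Count, for each word, which words appear within `window` tokens of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "the bank of the river near the bank".split()
counts = cooccurrence(tokens, window=2)
print(counts["bank"].most_common())
```

Training a word embedding model (with word2vec, for example) compresses this kind of context information into a dense vector for each word, rather than storing the raw counts; words with similar context profiles end up near each other in the space.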

You may be thinking, as I did: words have many different associations. If location in space represents the semantic affiliations of each word, how can a word be in multiple places at one time?


In this diagram, on the left, the word bank has two associations:

On the right, we have a more complicated situation: the word set has many more associations. We can’t draw an equivalent diagram, but we can still imagine:
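To make that blending concrete (with invented sense directions, purely for illustration): a word gets only one vector, so if a corpus uses “bank” in its financial sense three-quarters of the time, the learned vector tends toward a usage-weighted compromise between the senses:

```python
# Invented 2-D sense directions for "bank": one financial, one riverside.
finance_sense = (1.0, 0.0)
river_sense   = (0.0, 1.0)

# Suppose 75% of the corpus occurrences are financial: the single vector
# ends up nearer the financial neighborhood, but pulled toward the river one.
w = 0.75
bank = tuple(w * f + (1 - w) * r for f, r in zip(finance_sense, river_sense))
print(bank)  # (0.75, 0.25)
```

So the word is in one place, but that place encodes the mixture of its contexts; this is also why a word’s position can shift noticeably between corpora with different usage patterns.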

Factors that matter

I mentioned earlier that we need to be attentive and critical about how this model is created; there are a number of things that affect how a word embedding model will perform for us.

The size of the corpus matters a lot (and you’ll remember that we specified that you had to have at least a million words):

The content of the corpus also matters a lot:

The data preparation also matters a lot (and we’re going to spend a whole session on this later on):

And finally, the training process matters:

Comparison with other forms of text analysis

As part of our orientation, it may also be helpful to situate word embedding in relation to some other kinds of digital analysis we may already be familiar with; all of these are ways to get an understanding of texts at scale.

Has anyone here already experimented with word frequency, for instance with Voyant Tools?
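For comparison, the kind of raw word-frequency tally such tools surface can be sketched in a few lines (a minimal illustration with a crude tokenizer, not Voyant’s implementation):

```python
from collections import Counter
import re

text = "The bank raised its rates, and the river rose over its bank."

# Lowercase and pull out alphabetic runs -- a crude tokenizer for illustration.
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)
print(freq.most_common(3))
```

Note what this discards: every count is context-free, so the two very different uses of “bank” collapse into a single tally. Context is exactly the information a word embedding model keeps.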

How about topic models: has anyone used those? For instance, tools like Mallet?

What’s distinctive about word embedding models:

The larger question of what word embedding models are distinctively good for is one that we will explore as a group in the rest of the institute!

Disclaimers! Questions?

I should note here: we have been working hard to understand word embedding models and develop this curriculum; however, the underlying math is undeniably challenging. At some points in the next few days, I anticipate that you’re going to have questions that we actually can’t answer, because we haven’t yet fully mastered that deeper layer. We’re going to treat these as learning and teaching moments! After all, these are also questions that our students and colleagues will be asking us. So part of what we’re exploring here is how to understand the boundaries of what we know, and how to respond effectively based on that knowledge, whatever level we may be at.

Questions at this stage?