Word Vectors Institute: Introductions and Overview
Julia Flanders
2019-04-01
Overview
To situate this event a bit:
- this is the third of a series of four institutes, in which we’re trying to
approach the general topic of word embedding models from both a teaching and a
research perspective, and also for audiences with different levels of comfort with
programming
- so this third event takes an intensive, hands-on approach, focusing on research applications
We’re not expecting any prior knowledge of text analysis and certainly none of word
embedding models (that’s why you’re here!) but we hope everyone will come away feeling
comfortable with several things:
- What word embedding models are and how they differ from other text
analysis/machine learning approaches
- The vocabulary and specialized terminology used to talk about word embedding
models
- How word embedding models work: what is actually happening under the hood and
how that affects the kinds of research and interpretive work we can do with this
technique
- How to explain and contextualize these approaches, particularly in the context of
our research and scholarship
- How to read and modify the R code used to train and query the models (but not write new R code from scratch)
What we will not be covering:
- We have developed an easy-to-use web interface that lets you query existing trained models; we’ll use it a bit (and you may find it very useful for teaching)
- But for this intensive workshop, we will be getting into the actual process of training and querying models on the command line
- We will be using RStudio, an environment for writing and running R code; its console is where we’ll type commands directly, which is what we mean by working at the command line
Finding the right level
This is also a sort of meta-workshop:
- Part of the goal of this grant is to explore ways of making word embedding models approachable, useful, and persuasive to many different audiences without dumbing them down
- we’re trying to develop appropriate explanatory narratives that are somewhere in
between “word vectors are a fun tool! See the clusters!” and technical language that
assumes deep expertise
- So we are going to be interested in thinking with you about that boundary: about what
parts of this topic are especially challenging, and how we can best understand them
and explain them to others: for instance, colleagues, students, and readers of articles
where you draw on these techniques
- Your current unfamiliarity with the topic is a brief and precious resource for you
as teachers: this is your moment to reflect on what is hardest to understand, so that
you can anticipate the things others may find confusing or worth unpacking, and explain
them in terms that are legible and appropriately pitched
A quick look at the schedule...
- Our basic strategy here is to examine and explain word embedding models several
times, at increasing levels of detail, so that you have a chance to internalize one
level of knowledge before we dive into the next deeper level.
- We’ll be working intensively with commented code walkthroughs: R programs with detailed comments and specific places where you can make modifications and specify parameters. They are designed so that you don’t have to write any R code from scratch, but can become familiar with how the code works and how to adapt it
- We’ll also spend time doing hands-on work in small groups so that you have a chance
to practice and explore on your own
- During the workshop, we will be using a version of RStudio that is installed on a shared server, so that you (and we) don’t have to deal with the complexities of getting RStudio running on everyone’s individual computers. However, before the workshop sessions on Wednesday and Thursday, for those who are interested, we will also do a walkthrough of how to download and install RStudio on your own computer; there’s no obligation, but all are welcome.
- On the final day, we’ll do a bit of experimentation with code to visualize word embedding
models, and then we’ll wrap up with a discussion of next steps (including what
would be involved in tackling RStudio and the command line if you’re so
inclined).
Making notes
We’ve provided a fair amount of time for individual and small-group experimentation, and time for you to think about your own research projects. However, this workshop will really just be a start, a chance to get comfortable with fundamental concepts. I want to talk for a moment about some suggestions for how to take this work with you and continue it in your own time after you get home:
- For all of the sessions, we have a shared notes document [share link] for anything
you want to write down that might be useful to the group, and we’ll also ask you to
make notes there during some of the small-group hands-on work
- we’d also like to suggest that you keep something like a lab notebook: an informal,
personal (but somewhat detailed) record of what you tried, what worked, what
questions you have, what you want to follow up on later
- More specifically, it’s helpful to remember details like what words you queried, what
corpora you were
comparing, what settings you used
- We have created some samples and templates as inspiration, which are in our shared
Google space
- Later on, these kinds of notes can also be useful in documenting your results,
for purposes of writing about them in your research; very similar to documenting
your bibliographic sources for a research article
- Screen shots can also be a convenient way to keep a record of a notable result.
- Questions?
Model?
So, with those preliminaries out of the way, let’s get into our first explanation of word embedding models. For this first pass, we won’t dwell on the terminology or the mathematics: we’ll keep to a metaphorical level of explanation, to get a feel for things.
And the first term I want to talk about is the word “model”
- model is a potent concept in digital humanities, because so much of what we do depends
on models of one kind or another: creating digital representations of real-world objects
and ideas, and using them to study those things
- in some of the earlier domains of DH, we’re used to thinking of models representationally: as static proxies for research objects (like texts or artifacts) that capture what is salient to us about those objects: for instance, a TEI-encoded text as a model of its source
- in more recent domains such as machine learning, a model is more of a predictive or generative tool: something we can use to model the behavior of a system and not only learn more about it, but also produce new things that follow the rules and probabilities of the system: in the markup world, the analogous kind of model is the one represented by a schema, which defines rules from which new conforming documents can be generated
Word-embedding models have properties of both, but in important respects are more
like this latter type:
- they model the language of a corpus in a way that focuses on questions like “if I’m reading or writing this sentence, what’s the most likely next word?” or “based on the words I’m seeing in this little region, what is the most likely word at the center of that region?”
- in other words, word-embedding models are interested in a probabilistic model of language
that represents the interconnections between words as likelihoods based on proximity
The practical applications of this kind of modeling are familiar: predictive text
on your phone! But in digital humanities, models of this kind are also valuable because
they let us understand language better and help us do research on specific topics
and historical formations. So where the machine-learning research in industry is focused
on getting the most accurate predictions of what word I’m trying to type, through
a somewhat abstract, de-historicized understanding of language, in digital humanities
we need to pay close attention to language as represented in our specific corpora
(representing a time period, a genre, a set of authors, etc.) and also to the assumptions
we’re making about language when we train our models.
A first look at word vectors
At the simplest level, a word embedding model is a model of a text corpus that represents word usage in the corpus by locating each word in space.
Metaphorically, we can imagine that those spatial locations show us neighborhoods of words that tend to occur in the same contexts.
Another way to think about these neighborhoods is that they are answers to the questions: what words are most likely to appear near word X? and which word is most likely to appear in a given context?
So the clusters we see are groups of words that might be predicted by the same kinds of contexts. What can we imagine those contexts to be, based on the clusters we’re seeing here? (There’s a code sketch of this kind of neighborhood query just after this list.)
- Start with cluster 5 (accompanying mad lib): words relating to expressions of risk and despair, unhappy futurity
- we can see how these words could plausibly fit into very similar contexts
- How about clusters 6 and 7? (righteous war; early modern female virtue?)
- Cluster 8 is a little different: not really a thematic cluster: what is the predictive context here?
- How about clusters 9 and 4?
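These neighborhood questions are also exactly what we can pose to a trained model in code. As a preview, here is a minimal sketch: it assumes the wordVectors R package (one common choice for this kind of work) and a hypothetical, already-trained model file, so the file name and the query word are purely illustrative.

    # A minimal sketch of a neighborhood query, assuming the wordVectors
    # package and a hypothetical pre-trained model file
    library(wordVectors)

    # Load a trained model from disk (the file name is illustrative)
    model <- read.vectors("wwo_model.bin")

    # Which words occupy the same neighborhood as "danger"?
    # Returns words ranked by cosine similarity to "danger".
    closest_to(model, "danger", 10)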
Thinking with vectors
So this is interesting in itself:
- these clusters of words tell us something about how our corpus uses language
- they show us semantic connections between words
It’s also interesting because we can do further analysis:
- These neighborhoods aren’t just clusters of words that are impressionistically near one another: they are positioned in a spatial relationship to the rest of the model
- that spatial relationship can be described mathematically
- the position of each word is represented by a vector (essentially, an arrow pointing out from a common origin at a particular angle and for a particular distance)
- this means that we can actually compare the position of one word mathematically with the position of another word, and we can represent the difference in their positions as: another vector!
- We don’t want to examine that math just yet, but we can take
advantage of it.
If you had a chance to read Ryan Heuser’s analysis of riches and virtue, or Ben Schmidt’s analysis of the Rate My Professor data, where he considers breaking down the gender binary, you’ll have noticed that both are taking advantage of this same idea:
- that we can use these vectors, these spatialized relationships between words, as an
analytical tool
- and that although in a sense space is a metaphor here (or at least a purely mathematical kind of reality), nonetheless
it has a level of internal consistency and truth-value that means we can do meaningful
analyses based on it.
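To give a flavor of what that looks like in practice, here is a hedged sketch of vector arithmetic, again assuming the wordVectors package and a model object loaded as in the earlier sketch; the specific words are illustrations, not claims about any particular corpus.

    # The difference between two word positions is itself a vector,
    # and we can ask which words lie closest to that new vector
    gender_axis <- model[["woman"]] - model[["man"]]
    closest_to(model, gender_axis, 10)

    # The classic analogy form: "man" is to "king" as "woman" is to ...?
    closest_to(model, model[["king"]] - model[["man"]] + model[["woman"]], 5)

Heuser’s and Schmidt’s analyses are much more careful than these one-liners, but the underlying move is the same: treating the relationships between word positions as something we can compute with.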
Locating words in vector space
So how do those words get located in this space? What does the spatial metaphor really
mean?
We will go into the details much more fully, very soon. But for this initial orientation:
- This model of our corpus, in which each word is represented by a vector, is created through a training process, in which a software program works its way through the text, over and over, making observations about what words appear near one another
- essentially, building a model of the corpus that addresses the question “If I have word X, what words are most likely to appear nearby?”
- at each observation, it adjusts the position of the words
- by the end of the training process, the model contains very detailed information about
where each word is positioned relative to all or most of the others: this information
is more detailed the more thoroughly we do the training
- this training process can be varied depending on what actual task or insight or research we’re trying to support: if we are Google and we’re trying to develop text prediction systems, the most interesting word will be the single word right after word X. On the other hand, if we’re digital humanists trying to understand discourse more generally, the words surrounding word X might all be equally interesting. And in fact different researchers might be interested in the words very close to word X (words that suggest how syntax behaves) or in the words more loosely associated (which might suggest conceptual connections); the sketch just after this list shows the parameter behind this choice
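Here is that sketch: a minimal training call, again assuming the wordVectors package; the file names and parameter values are illustrative, not recommendations.

    library(wordVectors)

    model <- train_word2vec(
      "corpus.txt",           # the corpus as one plain-text file
      "corpus_model.bin",     # where the trained model will be saved
      vectors   = 100,        # number of dimensions in the vector space
      window    = 6,          # how many words on either side count as "nearby"
      iter      = 10,         # how many passes the training makes over the text
      min_count = 5           # ignore words occurring fewer than 5 times
    )

The window parameter is the knob behind that last bullet: a small window emphasizes a word’s immediate, more syntactic company, while a larger window picks up looser conceptual associations.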
This slide shows some actual quotations from WWO where the word danger occurs:
- if we imagine the training process working its way through the text and making observations, we can see that when it encounters the word danger it repeatedly sees nearby words like approaching, imminent, apprehend: terms that convey futurity, threat, warning, causality, states of knowledge: these establish a semantic context
- there are also function words nearby that don’t carry semantic associations, but do establish that danger is a noun: one that can be the object of prepositions like to, and one that is frequently followed by of. That kind of information would assist in the Google-word-prediction kinds of tasks.
Multidimensionality
You may be thinking, as I did: words have many different associations, so if location in space represents the semantic affiliations of each word, how can a word be in multiple places at one time?
- In three-dimensional space, this would indeed be very difficult
- but in our word vector model, there are enormous numbers of dimensions; very difficult
to picture
In this diagram, on the left, the word bank has two associations:
- with the semantic space of money, and with the semantic space of rivers
- in this very simple view, each of those relationships is expressed as a single dimension (the river association is on the y axis and the money association is on the x axis)
- each association only has dimensionality/distance along its own axis, and the location of bank is thus defined by two dimensions (easy to draw on a slide)
On the right, we have a more complicated situation: the word set has many more associations. We can’t draw an equivalent diagram, but we can still
imagine:
- each relationship is on a single, distinct dimension
- there are just way more than two or three of these dimensions (we have to imagine
them all sprouting off in five-dimensional space)
- and the position of set is defined by five dimensions
- so it’s not that the word is in five different places at a time, but rather that its
unique location within this cloud of vectors is based on information about those five
relationships
If this feels baffling right now, don’t worry--in my experience this idea takes a
little time to sink in. Let it sit in your mind as a metaphor for now: a big cloud
of words, with neighborhoods of related words; closer words are more closely related.
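If it helps to demystify the metaphor a little: under the hood, each word’s “location” is nothing more than a list of numbers, one per dimension. Here is a toy illustration in plain R; the numbers are invented, and real models use many more dimensions (often 100 or more).

    # Two made-up five-dimensional word vectors
    bank  <- c(0.81, 0.12, 0.40, -0.23, 0.05)
    river <- c(0.78, 0.15, 0.35, -0.30, 0.02)

    # Cosine similarity measures how closely two vectors point in the
    # same direction: values near 1 mean the words are close neighbors
    cosine_similarity <- function(a, b) {
      sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
    }
    cosine_similarity(bank, river)   # roughly 0.99: near neighbors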
Questions at this stage?
Factors that affect the behavior of the model
I mentioned earlier that we need to be attentive and critical about how this model
is created; there are a number of things that affect how a word embedding model will
perform for us.
The size of the corpus matters a lot (and you’ll remember that we specified that you had to
have at least a million words):
- this is because the training process, where we actually create the model, starts from
zero information: everything the model knows about where words are located, it learns
from that training process, which goes through the text and observes what words are
near what other words
- for common words, the training process gets a lot of data very quickly, but for uncommon
words there’s less information available
- it takes a certain minimum size corpus to provide enough information about each word
(from repeated usage) to make the model reasonably accurate in its representation
of less common words
- what other factors might be in play here? When might we be able to get away with a
smaller corpus?
The content of the corpus also matters a lot:
- what if you have a corpus where there are no common words? (what would be an example
of such a corpus?)
- what about a corpus in multiple languages?
- some genres are much more vocabulary-dense than others: for instance, poetry has more
uncommon words, less filler; novels use more commonplace words; a corpus of technical
documents might have a very large proportion of uncommon words (how might that affect
our model?)
The data preparation also matters a lot (and we’re going to spend two whole sessions on this later on):
- remember that a word here is just a token: any string of characters with space around it. So if the text has lots of typographical errors, each misspelled form will still count as a separate unique word; how might that affect our model?
- similarly, our corpus might contain things like page numbers, stage directions, running headers: would those be useful? inconvenient? (there’s a small illustration of both problems just after this list)
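To make that concrete, here is a tiny, contrived example in plain R; the sample sentence, the typo, and the stray header are all invented.

    # A deliberately messy toy text: a typo ("dangcr") and a stray
    # running header ("PAGE 42")
    text <- "The danger was great; the dangcr was imminent. PAGE 42"

    # Basic tokenization: strip punctuation, lowercase, split on whitespace
    tokens <- strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+")[[1]]
    table(tokens)
    # "danger" and "dangcr" are counted as two unrelated words, and
    # "page" and "42" enter the vocabulary alongside the words we care about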
And finally, the training process matters:
- during the training process, we can control various settings that affect what observations
are made about the texts, and how that information is used
- we will also explore this at greater length over the next few days
Comparison with other forms of text analysis
As part of our orientation, it may also be helpful to situate word embedding in relation
to some other kinds of digital analysis we may already be familiar with; all of these
are ways to get an understanding of texts at scale
Has anyone here already experimented with word frequency, for instance with Voyant Tools?
- Just what it sounds like: computing the frequency of different words in the corpus,
possibly comparing frequency of words between different texts
- including their relative frequency (that is, frequency that has been normalized, such as frequency per thousand words; there’s a quick sketch of this after the list)
- useful as a way to get a sense of the vocabulary of a text
- can be used even on small collections and individual texts
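The normalization behind relative frequency is simple arithmetic; a quick sketch in R, with invented counts:

    # Relative frequency: raw counts scaled to a common denominator so
    # that corpora of different sizes can be compared (counts invented)
    count_danger <- 120      # occurrences of "danger" in our corpus
    total_words  <- 850000   # total word count of the corpus
    count_danger / total_words * 1000   # about 0.14 uses per thousand words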
How about topic models: has anyone used those? For instance, tools like Mallet?
- Topic models are closer to word embedding models
- They are trained models: that is, we go through a training process that examines a
text corpus and generates a model based on it
- A topic model assigns words to topics based on their occurrence within the same document:
it gives you a view of the document collection that represents the topics or patterns of word collocation that appear in them
- but it doesn’t pay attention to where they occur within that document: it treats the
whole document as a single bag of words
- A topic model can be generated from a small text collection
What’s distinctive about word embedding models:
- they give you a view of semantic relationships and spaces within the model (i.e. the
corpus) as a whole
- they pay much closer attention to word proximity than topic models do: they use information
about the immediate context of a word
- they don’t pay attention to individual documents during the training process (and
there’s no way to get back to the individual documents once the model is trained)
- they require a much larger corpus to get meaningful results
- they give us much more information about the semantics of individual words, whereas
topic models mostly give us a view of the topics rather than the individual words in the topic
The larger question of what word embedding models are distinctively good for is one
that we will explore as a group in the rest of the institute!
Disclaimers! Questions?
I should note here: we have been working hard to understand word embedding models
and develop this curriculum; however, the underlying math is undeniably challenging.
At some points in the next few days, I anticipate that you’re going to have questions
that we actually can’t answer, because we haven’t yet fully mastered that deeper layer.
We’re going to treat these as learning and teaching moments! After all, these are
also questions that our students and colleagues will be asking us. So part of what
we’re exploring here is how to understand the boundaries of what we know, and how
to respond effectively based on that knowledge, whatever level we may be at.
Questions at this stage?