
Word Embedding Models Are the New Topic Models

By Jonathan Fitzgerald, Ph.D. Candidate in English, Northeastern University

I remember the first time I trained a topic model. It was in a course called Humanities Data Analysis, taught by Ben Schmidt. He provided us with a corpus of the Federalist Papers and some code that he adapted from David Mimno, a contributor to the original MALLET package and the author of its R implementation. After the initial confusion (“topics” aren’t topics in the traditional sense, after all), it felt like magic. The computer read the corpus and derived some topics!

Enthusiasm for topic modelling remains strong in DH, in part because it is a useful way to reduce a corpus to manageable chunks and to get a sense of what individual documents are “about.” That said, in the years since my first experiments with the Federalist Papers, I’ve come to believe that topic modelling is perhaps most useful as part of a workflow aimed at some goal beyond identifying topics in a corpus. To that end, I’ve been using topics to seed a genre classifier in my work with the Viral Texts Project.

While I’ve been working with topic models these past few years, a new modelling craze has taken DH by storm…relatively speaking. That is, thanks in large part to this post by Ben Schmidt, DH’ers have enthusiastically adopted word embedding models (WEMs), particularly the most popular manifestation, word2vec, created by Tomas Mikolov and his colleagues at Google.

In a post in which we announced our current project “Word Vector Analysis for TEI/XML: A User-Friendly Toolkit,” my colleague Liz Polcha offers a helpful introduction to word embedding models and the way we plan to use them with data from Women Writers Online. As part of that effort, I’ve been tasked with experimenting with word2vec and our corpus, as well as with imagining a way to provide an interactive online user interface. All this work is in progress, but as a part of those efforts, I’ve been thinking about what we can actually do with word embedding models. Just as my experience with topic models led me to the conclusion that they are most helpful on the way to some other interpretive exercise, my sense is that the same may be true for word embedding models.

This sense is born out of my own experiments with word2vec, but also from the existing literature on word embedding models. That is, you probably won’t read a write-up on word2vec that doesn’t provide the classic analogy example about kings and queens. And, to be sure, it’s cool. That word vectors seem to reflect expected relationships between words is solid evidence for their potential usefulness. But beyond seeing words related to other words and parsing out analogous relationships between terms, what can we do with word embedding models?

I have a few ideas. For one, to stay on the topic of this post (pun!), though word embedding models are unsupervised algorithms, we can work with the models in a way that is akin to a kind of supervised topic modelling. Topic modelling derives “topics” from a corpus by starting from a random initialization and grouping words that frequently co-occur within a document; working with the output of a word embedding model, on the other hand, lets us set that seed, so to speak. As Ben Schmidt writes, “You could think of this as a supervised form of topic modeling: it lets you assemble a list of words that typically appear in similar contexts.” A significant difference between topic models and word embedding models is that topics are generated with an awareness of the documents in a corpus. Word embedding models, on the other hand, look across an entire corpus for word collocations within a narrowly defined context, sometimes referred to as a “window.” Or, as Schmidt defines the difference, “A topic model aims to reduce words down [to] some core meaning so you can see what each individual document in a library is really about. Effectively, this is about getting rid of words so we can understand documents more clearly. WEMs do nearly the opposite: they try to ignore information about individual documents so that you can better understand the relationships between words.”

Schmidt’s food example from the Chronicling America corpus illustrates the way we can use the relationship between words to build a kind of supervised topic. He starts with one food word, “oysters,” and finds other food words that appear in similar semantic contexts: “ham,” “bread,” “chicken.” Then, he adds these terms to the original vector and runs the search again, adding even more food words. In a way, he is building a “food” topic. The process is, of course, different from what topic modelling does, but the end result is a list of words that have a meaningful relationship within a corpus of texts—except the topics are created rather than derived.
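Schmidt’s procedure (seed with one word, retrieve its nearest neighbors, fold them back into the query, and repeat) is easy to sketch in code. The toy three-dimensional vectors and tiny vocabulary below are invented purely for illustration; in practice the vectors would come from a trained word2vec model with hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-made so that the "food" words cluster.
vectors = {
    "oysters": [0.90, 0.10, 0.00],
    "ham":     [0.80, 0.20, 0.10],
    "bread":   [0.70, 0.30, 0.00],
    "chicken": [0.85, 0.15, 0.05],
    "liberty": [0.00, 0.10, 0.90],
    "rights":  [0.10, 0.00, 0.80],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(seed_words, topn=3):
    """Average the seed vectors, then rank the rest of the vocabulary."""
    dims = len(next(iter(vectors.values())))
    centroid = [sum(vectors[w][i] for w in seed_words) / len(seed_words)
                for i in range(dims)]
    candidates = [w for w in vectors if w not in seed_words]
    return sorted(candidates, key=lambda w: cosine(centroid, vectors[w]),
                  reverse=True)[:topn]

# Round 1: start from a single seed word.
topic = ["oysters"]
neighbors = most_similar(topic)   # other "food" words surface first

# Round 2: fold the neighbors back into the seed and query again,
# growing the supervised "food" topic.
topic += neighbors
```

The averaging-and-requerying step is the whole trick: each round widens the query vector, pulling in words that are similar to the topic as a whole rather than to any single seed.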

In unsupervised machine learning, the algorithm does not explicitly “know” what we are searching for; in the above example, it does not know that “oysters,” “ham,” and “bread” are foods. It just knows they co-occur. But when a human researcher intentionally selects and combines these words into a vector that she describes as food words, the unsupervised algorithm is being used in a supervised way.

This notion of creating topics through a supervised process of selecting and combining also works to allay an anxiety I have about the opaqueness of unsupervised modelling (particularly for a humanist who is new to computational modelling!). Ted Underwood frames the issue in a recent blog post in which he argues, among other things, that we should “make quantitative methods more explicit about their grounding in interpretive communities.” That is, we shouldn’t let computational models stand in for humanistic interpretation. Underwood writes: “Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as [Stanley] Fish explained, you can’t escape making choices.”

Underwood emphasizes the importance of grounding these choices (or, interpretations) in “interpretive communities.” And, for his part, Underwood notes that one way to do this is to favor supervised modelling methods; he writes:

I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones—but because they explicitly acknowledge that interpretation has to be anchored in human history.

I like this notion of relying on human judgment and generating interpretations “anchored in human history.” That is, after all, what makes us humanists, right? To that end—and again acknowledging that word2vec is an unsupervised algorithm—I’ve been thinking about the ways we can ground our work with unsupervised models in human interpretation.

Particularly, I’m interested in the potential for examining if and how word usage might reflect broad cultural trends over time. In the WWO corpus, which covers the 16th through 19th centuries, it might be interesting to consider how the use of the word “freedom” reflects the culture of the authors who use the word. I approached this question by dividing our corpus up by century and creating “supervised topics” based on the word “freedom.”1

In the 16th and 17th centuries,2 freedom seems to be a more personal affair; words that show up in the same vector space as “freedom” include “happiness,” “friendship,” and “wedlock.”

In the eighteenth century, freedom takes on a particularly revolutionary bent: “patriots,” “rights,” “triumph,” “country’s,” and “tyrants,” for example.

Finally, in the nineteenth century, the revolutionary language remains (“rights,” “tyranny”)—though diminishes some—and a new kind of freedom (or lack thereof) appears in the form of words like “slave,” “slaves,” “enslave,” “degrading,” “minority.”

I considered that these shifts might be geographically contingent; after all, only one work in our corpus from the 17th century was published in the colonies. However, in the 18th century, when we see a lot of what I’ve described as “revolutionary” language, Europe still dominates as place of publication. It’s not until the 19th century that the number of works published in the United States even begins to come close to those published in Europe (34 in the U.S. and 54 in Europe). So it seems that geography does not have a significant impact on the way the word “freedom” is used in our corpus across three centuries, if only because of the collection’s own unevenness in its inclusion of works published outside of the United Kingdom.

Treating these vectors like supervised topics allows us to consider interesting cultural questions through the lens of the WWO corpus. But, typically word embedding models are deployed to tell us something about the words themselves. In this sense we can use a similar process to consider the way a word’s usage shifted over time. My colleague Sarah Connell suggested that “grace” might make for an interesting test case. And indeed, when compared across centuries, “grace” moves from a word typically used in religious contexts in relation to God’s grace (e.g., “mercie,” “wisdome,” “fatherlie,” “redeemer,” “almightie”; note the archaic spellings) to secular contexts in relation to women’s beauty (e.g., “beauty,” “charms,” “fair,” “virtue,” “sweetness,” “smile,” “lovely”). Of course, the meaning of the word hasn’t changed, but its application has. And, as with “freedom,” this can tell us something about shifting cultural contexts—in this case secularization.

In fact, this shift toward secularization is confirmed in our corpus using another method. The WWP’s previous experimentation with markup-based exploration has already shown how growing literary secularization manifests in the collection after the seventeenth century. For one example, of the 3,689 bibliographic citations in the WWP’s sixteenth- and seventeenth-century materials, 3,228 (86%) contain biblical references; by contrast, 256 (15%) of the 1,742 eighteenth-century bibliographic citations are biblical references.3

In both examples, word2vec gives us words that appear in similar contexts to our seed words, but it is up to us to recognize these similarities as significant, intentionally group them into ever more expansive topics, and then to theorize what (if anything) these collocation patterns might mean. Thus, we can work with an unsupervised model in a way that acknowledges, to quote Underwood once more, “we’re no longer just modeling language; we can model interpretive communities at the same time.”

We’re still early on in our explorations of word embedding models and literary corpora, both at the WWP and in DH in general. But that, in part, is what makes this project so interesting to us: it’s an opportunity to explore word embedding models on a corpus that is focused on gender, relatively free of digitization errors, and easily transformed into markup-based subcorpora using XSLT and XQuery.

As we continue to work with this corpus, it’d be helpful to know what potential users might be interested in. If you have any suggestions for possible use-cases, please let us know in the comments.

Notes.

The WWP Begins Research into Word Vector Analysis

We are thrilled to announce that the Women Writers Project has begun work on a new project, “Word Vector Analysis for TEI/XML: A User-Friendly Toolkit,” funded under a Tier 1 grant from Northeastern University awarded to project co-PIs Professors Julia Flanders, Elizabeth Dillon, and Cody Dunne.

“Word Vector Analysis for TEI/XML” brings together two major digital humanities methodologies: text encoding and text analysis. We aim to develop an exploratory web interface as part of the WWO Lab that will allow users to visualize vector-space models with data from the Women Writers Online (WWO) TEI corpus and its sister corpus from the Victorian Women Writers Project (VWWP). For this collaborative project, a team of faculty, graduate students, and WWP staff will develop mechanisms to transform TEI documents for analysis with word embedding models, using the WWO/VWWP corpus as a test case. We will also publish a prototype web interface for exploring these models. The interface will enable users to input terms of interest and discover similar terms, locate analogies, and explore thematic clusters of terms. The WWP is especially suited to this type of text analysis because our collection of TEI-encoded documents is information-rich and relatively free from digitization errors. We will thus be able to create text-analysis-friendly data from TEI documents without losing the significant informational content of the markup.

Our prototype will integrate the word2vec text analysis tool using skip-gram negative sampling, as well as other experimental word vector model training methods, into an information pipeline that takes in TEI/XML data, performs a variety of preliminary processing steps taking advantage of information in the markup, passes the resulting text file on for analysis, and then sends the results to an appropriate visualization tool. The user interface will enable users to make selections based on information in the markup (e.g., to analyze texts within a certain time period or of a certain genre, or to focus on specific textual components such as background narration or direct speech), and to choose the visualization options for the output. The tool set will take advantage of TEI markup to improve tokenization in text analysis (for example, encoding with <orgName> removes any ambiguity about whether the string “Massachusetts Historical Society” refers to a single organization) and to enable comparison within corpora based on the semantic features represented by the markup. The prototype will initially focus on word vector analysis, but other text analysis routines such as topic modeling could be included as options in the pipeline as well.
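As a small illustration of the markup-aware tokenization described above, the sketch below collapses the contents of an <orgName> element into a single token before the text is handed off for analysis. The TEI snippet and the underscore-joining convention are assumptions for demonstration, not the project’s actual pipeline.

```python
import xml.etree.ElementTree as ET

# A hypothetical TEI fragment; a real document would carry the TEI
# namespace and far richer markup.
tei = """<p>She wrote to the <orgName>Massachusetts Historical Society</orgName>
about the manuscript.</p>"""

root = ET.fromstring(tei)
for org in root.iter("orgName"):
    # Join the multi-word name into one token so the model treats
    # "Massachusetts Historical Society" as a single vocabulary item
    # rather than three unrelated words.
    org.text = "_".join(org.text.split())

# Flatten the element tree back to plain text and tokenize naively.
plain = "".join(root.itertext())
tokens = plain.split()
```

After this step, the organization name survives as one token in the training data, which is exactly the ambiguity-removal the markup makes possible.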

As we move forward in our research, we will be publishing use cases and sample research projects using the word2vec text analysis tool. For example, a scholar of early American literature might be interested in the ways in which “America” as a concept is represented in early women’s writing. Using our tool on the WWO/VWWP corpus, she may start by finding words that are nearest to words like “America,” “England,” “country,” and “nation.” Or, she might test out the analogy function (such as in the often cited [king] – [man] + [woman] = [queen] word2vec example), experimenting with relationships between words like “England” and “New England” or “Virginia” and “Massachusetts”.  
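The analogy function mentioned above rests on simple vector arithmetic: subtracting vec(“man”) from vec(“king”) and adding vec(“woman”) should land near vec(“queen”). A minimal sketch with hand-made two-dimensional vectors (invented here to make the arithmetic visible; real embeddings learn such offsets from corpus co-occurrence statistics):

```python
import math

# Toy vectors: first dimension roughly "royalty," second roughly "male."
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "bread": [0.0, 0.5],  # a distractor word
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# Rank every word except the three inputs by similarity to the target.
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
```

With these toy vectors the offset arithmetic resolves to “queen” exactly; in a trained model the result is only approximate, which is why tools like word2vec return a ranked list of nearest neighbors rather than a single answer.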

For further reading on word embedding models and word2vec, we’ve provided some suggested blog posts below:

“Vector Space Models for the Digital Humanities” by Ben Schmidt

“Pride and Prejudice and Word Embedding Distance” by Lynn Cherny

“Word Vectors in the Eighteenth Century” by Ryan Heuser

“‘Numberless Degrees of Similitude’: A Response to Ryan Heuser’s ‘Word Vectors in the Eighteenth Century, Part 1’” by Gabriel Recchia