Browsed by
Month: July 2017

Word Embedding Models Are the New Topic Models


By Jonathan Fitzgerald, Ph.D. Candidate in English, Northeastern University

I remember the first time I trained a topic model. It was in a course called Humanities Data Analysis, taught by Ben Schmidt. He provided us a corpus of the Federalist Papers and some code that he adapted from David Mimno, contributor to the original MALLET package and author of the R implementation of MALLET. After the initial confusion–“topics” aren’t topics in the traditional sense, after all–it felt like magic. The computer read the corpus and derived some topics!

Enthusiasm for topic modelling remains strong in DH in part because it is a useful way to reduce a corpus down to manageable chunks and to get a sense as to what individual documents are “about.” That said, in the years since my first experiments with the Federalist Papers, I’ve come to believe that topic modelling is perhaps most useful as part of a workflow aimed at some goal beyond identifying topics in a corpus. To that end, I’ve been using topics to seed a genre classifier in my work with the Viral Texts Project.

While I’ve been working with topic models these past few years, a new modelling craze has taken DH by storm…relatively speaking. That is, thanks in large part to this post by Ben Schmidt, DH’ers have enthusiastically adopted word embedding models (WEMs), particularly the most popular manifestation, word2vec, created by Tomas Mikolov and his colleagues at Google.

In a post in which we announced our current project “Word Vector Analysis for TEI/XML: A User-Friendly Toolkit,” my colleague Liz Polcha offers a helpful introduction to word embedding models and the way we plan to use them with data from Women Writers Online. As part of that effort, I’ve been tasked with experimenting with word2vec and our corpus, as well as with imagining a way to provide an interactive online user interface. All this work is in progress, but as a part of those efforts, I’ve been thinking about what we can actually do with word embedding models. Just as my experience with topic models led me to the conclusion that they are most helpful on the way to some other interpretive exercise, my sense is that the same may be true for word embedding models.

This sense is born out of my own experiments with word2vec, but also from the existing literature on word embedding models. That is, you probably won’t read a write-up on word2vec that doesn’t provide the classic analogy example about kings and queens. And, to be sure, it’s cool. That word vectors seem to reflect expected relationships between words is solid evidence for their potential usefulness. But beyond seeing words related to other words and parsing out analogous relationships between terms, what can we do with word embedding models?

I have a few ideas. For one, to stay on the topic of this post (pun!), though word embedding models are unsupervised algorithms, we can work with the models in a way that is akin to a kind of supervised topic modelling. Topic modelling derives “topics” from a corpus by starting from a random seed (a random initial assignment of words to topics) and grouping words that frequently co-occur within a document; working with the output of a word embedding model, on the other hand, lets us set that seed, so to speak. As Ben Schmidt writes, “You could think of this as a supervised form of topic modeling: it lets you assemble a list of words that typically appear in similar contexts.” A significant difference between topic models and word embedding models is that topics are generated with an awareness of the documents in a corpus. Word embedding models, on the other hand, look across an entire corpus for word collocations within a narrowly defined context, sometimes referred to as a “window.” Or, as Schmidt defines the difference, “A topic model aims to reduce words down [to] some core meaning so you can see what each individual document in a library is really about. Effectively, this is about getting rid of words so we can understand documents more clearly. WEMs do nearly the opposite: they try to ignore information about individual documents so that you can better understand the relationships between words.”

Schmidt’s food example from the Chronicling America corpus illustrates the way we can use the relationship between words to build a kind of supervised topic. He starts with one food word, “oysters,” and finds other food words that appear in similar semantic contexts: “ham,” “bread,” “chicken.” Then, he adds these terms to the original vector and runs the search again, adding even more food words. In a way, he is building a “food” topic. The process is, of course, different from what topic modelling does, but the end result is a list of words that have a meaningful relationship within a corpus of texts—except the topics are created rather than derived.
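To make the seed-and-expand process a little more concrete, here is a minimal sketch in Python. It is not Schmidt's code: the three-dimensional vectors are invented for illustration (a trained word2vec model would supply vectors with many more dimensions), and the function names `cosine`, `centroid`, and `expand_topic` are my own assumptions.

```python
import math

# Hypothetical toy "embeddings", invented for illustration only.
TOY_VECTORS = {
    "oysters": [0.90, 0.10, 0.00],
    "ham":     [0.80, 0.20, 0.10],
    "bread":   [0.70, 0.30, 0.00],
    "chicken": [0.85, 0.15, 0.05],
    "senate":  [0.00, 0.90, 0.30],
    "railway": [0.10, 0.20, 0.90],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(words, vectors):
    """Average the vectors of the words collected so far."""
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]


def expand_topic(seed_words, vectors, rounds=2, per_round=1):
    """Grow a word list by repeatedly adding the nearest unseen words."""
    topic = list(seed_words)
    for _ in range(rounds):
        query = centroid(topic, vectors)
        candidates = [w for w in vectors if w not in topic]
        # Rank the remaining vocabulary by similarity to the combined vector.
        candidates.sort(key=lambda w: cosine(query, vectors[w]), reverse=True)
        topic.extend(candidates[:per_round])
    return topic


food_topic = expand_topic(["oysters"], TOY_VECTORS, rounds=3, per_round=1)
print(food_topic)
```

With these toy vectors, three rounds starting from "oysters" collect the other three food words and leave "senate" and "railway" out. In practice the human step matters more than the loop: a researcher would inspect each round's candidates and decide which ones belong in the topic.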

In unsupervised machine learning, the algorithm does not explicitly “know” what we are searching for; in the above example, it does not know that “oysters”, “ham”, and “bread” are foods. It just knows they co-occur. But when a human researcher intentionally selects and combines these words into a vector that she describes as food words, the unsupervised algorithm is being used in a supervised way.

This notion of creating topics through a supervised process of selecting and combining also works to allay an anxiety I have about the opaqueness of unsupervised modelling (particularly for a humanist who is new to computational modelling!). Ted Underwood frames the issue in a recent blog post in which he argues, among other things, that we should “make quantitative methods more explicit about their grounding in interpretive communities.” That is, we shouldn’t let computational models stand in for humanistic interpretation. Underwood writes: “Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as [Stanley] Fish explained, you can’t escape making choices.”

Underwood emphasizes the importance of grounding these choices (or, interpretations) in “interpretive communities.” And, for his part, Underwood notes that one way to do this is to favor supervised modelling methods; he writes:

I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones—but because they explicitly acknowledge that interpretation has to be anchored in human history.

I like this notion of relying on human judgment and generating interpretations “anchored in human history.” That is, after all, what makes us humanists, right? To that end—and again acknowledging that word2vec is an unsupervised algorithm—I’ve been thinking about the ways we can ground our work with unsupervised models in human interpretation.

Particularly, I’m interested in the potential for examining if and how word usage might reflect broad cultural trends over time. In the WWO corpus, which covers the 16th through 19th centuries, it might be interesting to consider how the use of the word “freedom” reflects the culture of the authors who use the word. I approached this question by dividing our corpus up by century and creating “supervised topics” based on the word “freedom.”1

In the 16th and 17th centuries,2 freedom seems to be a more personal affair; words that show up in the same vector space as “freedom” include “happiness,” “friendship,” and “wedlock.”

In the eighteenth century, freedom takes on a particularly revolutionary bent: “patriots,” “rights,” “triumph,” “country’s,” and “tyrants,” for example.

Finally, in the nineteenth century, the revolutionary language remains (“rights,” “tyranny”), though it diminishes somewhat, and a new kind of freedom (or lack thereof) appears in the form of words like “slave,” “slaves,” “enslave,” “degrading,” “minority.”
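One rough way to put a number on this kind of shift is to compare the neighbour lists from one period to the next. The sketch below uses only the example words quoted above, not the full model output, so the figures it produces are illustrative at best; the names `NEIGHBOURS`, `jaccard`, and `compare_periods` are my own.

```python
# Example neighbours of "freedom" as quoted in this post (partial lists only).
NEIGHBOURS = {
    "16th-17th": {"happiness", "friendship", "wedlock"},
    "18th": {"patriots", "rights", "triumph", "country's", "tyrants"},
    "19th": {"rights", "tyranny", "slave", "slaves", "enslave",
             "degrading", "minority"},
}


def jaccard(a, b):
    """Share of words two sets have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_periods(neighbours):
    """Compare each period with the next one, in the order given."""
    periods = list(neighbours)
    report = {}
    for earlier, later in zip(periods, periods[1:]):
        shared = sorted(neighbours[earlier] & neighbours[later])
        report[(earlier, later)] = (shared, jaccard(neighbours[earlier],
                                                    neighbours[later]))
    return report


for (earlier, later), (shared, score) in compare_periods(NEIGHBOURS).items():
    print(f"{earlier} -> {later}: shared={shared}, overlap={score:.2f}")
```

On these partial lists, the earliest period shares no words with the eighteenth century, while the eighteenth and nineteenth centuries share only "rights". A low overlap score is a prompt for closer reading rather than a finding in itself.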

I considered that these shifts might be geographically contingent; after all, only one work in our corpus from the 17th century was published in the colonies. However, in the 18th century, when we see a lot of what I’ve described as “revolutionary” language, Europe still dominates as the place of publication. It’s not until the 19th century that the number of works published in the United States even begins to come close to those published in Europe (34 in the U.S. and 54 in Europe). So it seems that geography does not have a significant impact on the way the word “freedom” is used in our corpus across three centuries, if only because of the collection’s own unevenness in its inclusion of works published outside of the United Kingdom.

Treating these vectors like supervised topics allows us to consider interesting cultural questions through the lens of the WWO corpus. Typically, though, word embedding models are deployed to tell us something about the words themselves. In this sense we can use a similar process to consider the way a word’s usage shifted over time. My colleague Sarah Connell suggested that “grace” might make for an interesting test case. And indeed, when compared across centuries, “grace” moves from a word typically used in religious contexts in relation to God’s grace (e.g., “mercie,” “wisdome,” “fatherlie,” “redeemer,” “almightie”; note the archaic spellings) to secular contexts in relation to women’s beauty (e.g., “beauty,” “charms,” “fair,” “virtue,” “sweetness,” “smile,” “lovely”). Of course, the meaning of the word hasn’t changed, but its application has. And, as with “freedom,” this can tell us something about shifting cultural contexts: in this case, secularization.

In fact, this shift toward secularization is confirmed in our corpus using another method. The WWP’s previous experimentation with markup-based exploration has already shown how growing literary secularization manifests in the collection after the seventeenth century. For one example, of the 4,288 bibliographic citations in the WWP’s sixteenth- and seventeenth-century materials, 3,689 (86%) contain biblical references; by contrast, 256 (15%) of the 1,742 eighteenth-century bibliographic citations are biblical references.3

In both examples, word2vec gives us words that appear in similar contexts to our seed words, but it is up to us to recognize these similarities as significant, intentionally group them into ever more expansive topics, and then to theorize what (if anything) these collocation patterns might mean. Thus, we can work with an unsupervised model in a way that acknowledges, to quote Underwood once more, “we’re no longer just modeling language; we can model interpretive communities at the same time.”

We’re still early on in our explorations of word embedding models and literary corpora, both at the WWP and in DH in general. But that, in part, is what makes this project so interesting to us: it’s an opportunity to explore word embedding models on a corpus that is focused on gender, relatively free of digitization errors, and easily transformed into markup-based subcorpora using XSLT and XQuery.

As we continue to work with this corpus, it’d be helpful to know what potential users might be interested in. If you have any suggestions for possible use-cases, please let us know in the comments.

Notes.

Intertextuality in Mary Astell’s A Serious Proposal to the Ladies (1694) and in Reflections upon Marriage (1706)


This post is part of a series authored by our collaborators on the Intertextual Networks project. For more information, see here. 

By Ioanna Kyvernitou, National University of Ireland, Galway

For Intertextual Networks, I am evaluating the markup in two works of Mary Astell (1666–1731) as found in Women Writers Online–A Serious Proposal to the Ladies, for the Advancement of Their True and Greatest Interest (1694) and the third edition of Reflections upon Marriage (1706)–in order to consider practices for encoding intertextuality. Astell, a philosopher and theologian who supported women’s right to education, is considered one of the earliest English feminist writers. She is also known for her critiques of the philosophers John Norris and John Locke. Current scholarship analyses her writings within the context of her political (Toryism), philosophical (Cartesianism-Platonism), and religious (Anglicanism) beliefs. Within this framework, this study aims to identify Astell’s intertextual practices by exploring the WWP’s XML markup–specifically the elements designed to encode bibliographic features (e.g., <quote>, <said>, <bibl>). These two works are treated here as case studies in order to discuss the ways in which XML representation can provide a formal framework for representing complex intertextual practices in literary works.

In my preliminary work, I have identified relevant markup in relation to intertextuality (from the WWP’s Internal Encoding Documentation) in order to query and retrieve the occurrences of those tags from the two XML files as provided by the WWP. Specifically, with the help of Sarah Connell and Syd Bauman, I used XQuery–a language for querying XML data–to search for Astell’s references to proper names (i.e. <persName>, <name>, <placeName>, <orgName>). Further, I investigated the personal names’ structural contexts (<p>), aiming to identify the function of onomastic intertextuality (person and place names). Finally, I searched for biblical, classical and bibliographic references (i.e. <quote>, <said>, <bibl>, and <regMe>) in these works.
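The queries themselves were written in XQuery against the WWP's files, which are not reproduced here. As a rough equivalent, the following Python sketch counts the same name elements and lists them by paragraph. It rests on several assumptions: the sample XML is an invented, heavily simplified snippet with no namespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented, simplified sample; real WWP files are far richer than this.
SAMPLE = """<text>
  <p>That Ingenious Theorist <persName>Mr. Whiston</persName> asserts ...</p>
  <p><persName>St. Paul</persName> writes to <placeName>Corinth</placeName> ...</p>
</text>"""

NAME_TAGS = ("persName", "name", "placeName", "orgName")


def count_name_elements(xml_text, tags=NAME_TAGS):
    """Count how often each name element occurs in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag in tags:
            counts[element.tag] += 1
    return counts


def names_by_paragraph(xml_text, tags=NAME_TAGS):
    """List the text of the name elements found in each <p>, in order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter("p"):
        names = ["".join(el.itertext()).strip()
                 for el in p.iter() if el.tag in tags]
        paragraphs.append(names)
    return paragraphs


print(count_name_elements(SAMPLE))
print(names_by_paragraph(SAMPLE))
```

Grouping names by their enclosing `<p>` is one simple way to approximate the "structural context" mentioned above. A real TEI file would also need its namespace handled in every tag lookup.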

In the case of indirect references, which go beyond the straightforward markup of direct quotations, it is necessary to consult secondary literature to help us identify the source(s) of reference and the identity of implicitly noted authors. The challenge is that, on many occasions, there are different interpretations among scholars regarding the source of influence or person quoted (as discussed below). Thus, in incorporating multiple interpretations within the markup, the encoding process becomes more complex and expensive—but also more enriched. While the existing markup does not annotate implicit references to an author or indirect quotes, the Intertextual Networks project will be piloting such encoding in an initial set of texts; the project will also be linking quotations to their sources and authors, which will make retrieval and analysis of quoted passages easier.

Quotations

 According to the WWP’s internal documentation:

The <quote> element is used to encode material which is identified as originating outside of the passage where it appears, regardless of where the material actually originates. For our purposes, <quote> can include proverbs, mottoes, common sayings, passages from other texts (including fictional passages from imagined texts), or quotations from other parts of the same text in which the quotation appears.

Following this definition, I searched within the XML files for occurrences of the <quote> element in order to identify its use in Astell’s works. In Proposal, <quote> is used only four times and in Reflections eighteen. Currently, the WWP uses a pilot encoding in order to implement more detailed markup for cases where quoted material is paraphrased or parodied from its source. For these cases, the @type attribute is used with values of “parody” and “paraphrase”. Some of these conceptual challenges are addressed in the ‘Methods’ section of the proposal for Intertextual Networks: Reading and Citation in Women’s Writing 1450-1850, where it is recommended, similarly to parody and paraphrase, to handle allusions by treating them as special types of quotation and using the TEI @type attribute to characterize quotes as “direct,” “paraphrase,” “allusion” (and other terms as needed).

Along these lines, updating Astell’s XML files with an expanded and more detailed markup–for example, tagging paraphrases, proverbs and echoes–would be useful, especially for retrieving these instances. This post uses a passage from Reflections to explore how a more in-depth encoding could include information concerning quoted person(s), whether explicitly or implicitly mentioned; paraphrased passages; and ways to connect quote(s) with quoted person(s).

In the passage below (presented first without markup), Astell argues about the role of custom in perpetuating the subordination of women (emphasis added):

That the Cuſtom of the World has put Women, generally ſpeaking, into a State of Subjection, is not deny’d; but the Right can no more be prov’d from the Fact, than the Predominancy of Vice can juſtifie it. A certain great Man has endeavour’d to prove by Reaſons not contemptible, that in the Original State of things the Woman was the Superior, and that her Subjection to the Man is an Effect of the Fall, and the Puniſhment of her Sin. And that Ingenious Theoriſt Mr. Whiſton aſſerts, That before the Fall there was a greater equallity between the two Sexes. However this be, ’tis certainly no Arrogance in a Woman to conclude, that ſhe was made for the Service of God, and that this is her End. Becauſe God made all Things for Himſelf, and a Rational Mind is too noble a Being to be Made for the Sake and Service of any Creature. The Service ſhe at any time becomes oblig’d to pay to a Man, is only a Buſineſs by the Bye. Juſt as it may be any Man’s Buſineſs and Duty to keep Hogs; he was not Made for this, but if he hires himſelf out to ſuch an Employment, he ought conſcientiouſly to perform it. Nor can any thing be concluded to the contrary from St. Paul’s Argument, 1 Cor. II. For he argues only for Decency and Order, according to the preſent Cuſtom and State of things. Taking his Words ſtrictly and literally, they prove too much, in that Praying and Prophecying in the Church are allow’d the Women, provided they do it with their Head Cover’d, as well as the Men; and no inequality can be inferr’d from hence, neither from the Gradation the Apoſtle there uſes, that “the Head of every Man is Chriſt, and that the Head of the Woman is the Man, and the Head of Chriſt is God” (A2r–A2v)

Astell uses three sources to support her argument. She first notes ‘A certain great Man’ who argued about women’s superiority before the Fall; she then paraphrases William Whiston, a Cambridge theologian; and she concludes with a biblical reference (1 Corinthians 11:3) to support women’s equality. In the current markup, only the biblical reference (i.e. <bibl><regMe>1 Cor. II.</regMe></bibl>) and the direct quote are encoded, whereas the two cases of indirect references are not tagged.

‘A certain great Man’ & ‘Mr. Whiſton’

For a more complete encoding, the <quote> element and @type attribute with a value of “paraphrase” could be added to highlight instances of these indirect references, bearing in mind that, as noted in the ‘Methods’ section of the proposal for Intertextual Networks, “the boundaries of paraphrases and allusions are less determinate than those of direct quotations.”

Regarding the authors quoted, in the first case, Astell refers indirectly to ‘A certain great Man’, whereas ‘Mr. Whiſton’ is explicitly named (i.e. <persName ref='p:wwhiston.ycp'>Mr. <hi rend='slant(upright)'>Whiston</hi></persName>). For the latter case, we can also use @role on <persName> to indicate that Mr. Whiston is being referenced as an author; we can use @source on <quote> to point to a bibliography entry, with more detailed information on the source.

 For the “certain great Man,” we could add <rs> with a @type of “author” to mark this as a reference to an author, however indirect; we can also use @ref to point to more information on the identity of this author. In this case, there are different interpretations among scholars regarding the author’s identity. Specifically, Apetrei suggests that it is possible that the “great Man” was Agrippa von Nettesheim, a German polymath, who argued for the superiority of the female sex (131). Springborg, on the other hand, proposes that this could be a reference to the English philosopher Thomas Hobbes (11). Based on these authorship claims, one approach would be to use @ref to point to an <alt> element, whose @targets would themselves point to personographic entries for the two potential authors. Even where there is no agreement on the quoted person, it would be helpful to incorporate current scholarship in the encoding of the primary text to reflect the different interpretations. This can be achieved, for example, by adding a <note> element in the XML file, discussing the different scholarly interpretations and identities of probable sources.
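To show how such an encoding might be read back out, here is a small hedged sketch. The XML is my own guess at the shape described above, not the WWP's or the project's actual markup: the identifiers `greatMan`, `agrippa`, and `hobbes` are hypothetical, the snippet has no namespace, and the attribute name `targets` simply follows the wording of this post.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical encoding of the indirect reference and its two candidates.
SAMPLE = """<text>
  <p><rs type="author" ref="#greatMan">A certain great Man</rs>
     has endeavour'd to prove ...</p>
  <alt xml:id="greatMan" targets="#agrippa #hobbes"/>
</text>"""


def resolve_candidates(xml_text):
    """Map each author reference to the candidate ids its <alt> points to."""
    root = ET.fromstring(xml_text)
    # Index every <alt> by its xml:id, splitting the pointer list on spaces.
    alternatives = {alt.get(XML_ID): alt.get("targets", "").split()
                    for alt in root.iter("alt")}
    resolved = {}
    for rs in root.iter("rs"):
        if rs.get("type") != "author":
            continue
        key = rs.get("ref", "").lstrip("#")
        phrase = "".join(rs.itertext()).strip()
        resolved[phrase] = [t.lstrip("#") for t in alternatives.get(key, [])]
    return resolved


print(resolve_candidates(SAMPLE))
```

Keeping both candidates behind a single pointer means a query can retrieve every passage attributed, by some scholar, to a given author without the encoding having to settle the disagreement.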

Biblical and Bibliographic References: ‘St. Paul’s Argument’

The third case is an example of encoding bibliographic references and citations by using the <bibl> element. Within <bibl>, the tag <author> is used to encode the author’s name, if present, along with a nested <persName>. The <regMe> element is used to encode bibliographic references or citations of the Bible or other texts for which a standard or canonical reference system exists.  The WWP internal documentation suggests that <regMe> should be placed within the <bibl> element that encloses the complete reference. Following these definitions, I have counted eight occurrences where <regMe> is nested within <bibl> in Reflections and found none in Proposal.

A closer look at these occurrences, with the XML markup of this passage from Reflections, shows two distinct usages of personal names (the markup below has been simplified for the purposes of this example):

Often a personal name can be a quoted author, as in the case of Saint Paul in the above example. But there are also occasions where personal names are nested within a <quote>, as in the case of ‘Christ’. This is another case where we can use @source as described above to make authorship and other bibliographic information more explicit and queryable. Lastly, before introducing Saint Paul’s quote, as seen above, Astell refers to him as ‘Apostle’. This is one of many examples of coreference–when two or more expressions in a text refer to the same person. Thus, this is another example of where <rs> with @role of “author” and @ref pointing to a personography entry could make the markup more detailed and useful for future research.
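The two usages of personal names can be separated mechanically once the markup is in place. The sketch below is only an illustration: the XML is a modernized, abbreviated stand-in that I have assembled from the passage and the elements discussed in this post, not the WWP's encoding, and `split_names_by_quote` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the passage, with modernized spelling.
SAMPLE = """<p>Nor can any thing be concluded to the contrary from
  <persName>St. Paul</persName>'s Argument,
  <bibl><regMe>1 Cor. II.</regMe></bibl> ...
  <quote>the Head of every Man is <persName>Christ</persName> ...</quote>
</p>"""


def split_names_by_quote(xml_text):
    """Separate personal names nested in a <quote> from those outside one."""
    root = ET.fromstring(xml_text)
    # Remember which <persName> elements sit somewhere inside a <quote>.
    quoted = {id(el) for q in root.iter("quote") for el in q.iter("persName")}
    inside, outside = [], []
    for el in root.iter("persName"):
        name = "".join(el.itertext()).strip()
        if id(el) in quoted:
            inside.append(name)
        else:
            outside.append(name)
    return {"inside_quote": inside, "outside_quote": outside}


print(split_names_by_quote(SAMPLE))
```

Here 'St. Paul' falls outside the quotation, as the quoted author, and 'Christ' falls inside it. A coreference such as 'the Apostle' would be missed by this query unless it were tagged too, which is part of the case for the additional <rs> markup.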

The challenges of formally representing the various types of intertextuality mean that the boundaries of structural and interpretive markup become more fluid. The more detailed the markup becomes, the more in-depth understanding of the primary text and its secondary literature is required. This is a process that can be time-consuming, especially for large-scale projects. Nevertheless, investigation of the use of personal names within their surrounding contexts can enrich the representation of intertextuality. As a next step for this study, I will explore further how linguistic and rhetorical emphasis tags (i.e. <emp>, <term>, <distinct>) can be connected to indirect quotation practices in order to identify other implicit references, currently not present in the markup. I will base this on Astell’s practices in her correspondence with John Norris, Letters Concerning the Love of God (1695), aiming to compare references in her three works, and open the way to reconstructing a more complete picture of her intertextual practices.

Works Cited

Apetrei, Sarah Louise Trethewey. Women, Feminism and Religion in Early Enlightenment England. Cambridge University Press, 2010. Print.

Springborg, Patricia. Mary Astell, Political Writings. 1st ed. New York: Cambridge University Press, 1996. Print.