R, Voyant, and the Search for Computational Delicacy in an Early Modern Corpus

February 22, 2017 Joanne DeCaro

This post is part of a series authored by our collaborators on the Intertextual Networks project. For more information, see here.

By Amanda Henrichs, Institute for Digital Arts and Humanities, Department of English, Indiana University

My contribution to the Intertextual Networks project takes up the literary and historical relationships between Lady Mary Wroth (1587–1651) and her aunt Mary Sidney-Herbert (1561–1621). These two women are members of the Sidney family, one of the most influential families in English literature and politics for over 200 years (the 2015 Ashgate Research Companion is invaluable here). Both women were active in Queen Elizabeth’s court, and both provided literary and artistic patronage to writers, artists, and musicians. Further, both were known to their contemporaries as prolific and respected authors. Wroth in particular has enjoyed a resurgence in popularity (and scholarly praise for her literary skill) over the past few decades.

These women lived together and, scholars tell us, wrote together. Yet the primary evidence for their relationship is historical. That is, when scholars assert that Sidney-Herbert was a formative literary influence for Wroth, they do not cite stylistic similarities. Rather, they mention the time the two spent together at Penshurst, the Sidney family’s home in Kent, and the loving relationship between the two women. It seems nearly necessary, however, that there would be stylistic evidence of Wroth’s literary homage to her aunt: Wroth is a highly allusive and intertextual writer, with clear allusions to, and borrowings or translations of, Petrarch, Philip Sidney, Fulke Greville, Edmund Spenser, and others. But Sidney-Herbert seems to be entirely absent from Wroth’s works.

There is thus an absence of intertextual connection where there should be a presence. And this is what my current project takes up. I am writing an R script to mine Wroth’s long prose romance Urania and Sidney-Herbert’s translations The Tragedie of Antonie and A Discourse of Life and Death for similarities in word choice, sentence structure, turns of phrase, and other stylistic features. Then I will use another coding language to visualize the results. In effect, I want to visualize literary absence.

I want to pause here, though, and mention some of the problems I’ve run into. The biggest one is R itself. For those who aren’t familiar, R was originally used to run statistical analyses on very large datasets, and is now quite popular with humanists who want to do things like text mining and topic modeling. R is a very powerful tool, but it is also idiosyncratic, complex, and difficult to master. Even working through Matthew Jockers’ incredible book Text Analysis with R for Students of Literature, I keep getting bogged down in cleaning and parsing the text files I’m examining; I also have to continually remind myself of R prompts and commands, since even a single wrong keystroke creates an error I need to go back and dig out. That kind of debugging is second nature to trained programmers, but less familiar to traditional researchers in the humanities. From what I can tell, this is a common experience for scholars who, for whatever reason, want to employ computational approaches in their research.

Other problems include asking the right questions; or rather, asking questions in a way that R can understand. I am at the point where I can tell R to pull a .txt file from the internet (or my computer), clean out the extraneous metadata from the beginning and end of the text, split the text according to its internal divisions (be they chapters or stanzas), find the relative frequency of a word or words across the text, and plot those frequencies in a graph of my choice. In Shakespeare’s Sonnets, for example, I found that there are 4,612 unique words in the collection. The word “I” accounts for 1.8% of the total words; “my” for 2.6%. But a patient and dedicated reader could do this work without a line of code. At this point, I’m saving enormous amounts of time, which is of course incredibly valuable in itself, but I am gaining old insights more quickly, rather than coming to new conclusions. And what does this data actually mean? It isn’t enough simply to spout statistics, as interesting as it may be to have these numbers handy.
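For readers who want to see the shape of that pipeline, here is a minimal sketch of the clean, split, and count steps. It is written in Python rather than R, and it is not the script described in this post. The sample text, the START/END marker lines, and the "CHAPTER" headings are all invented for illustration; a real file would need its own markers and division pattern.

```python
import re
from collections import Counter

# Invented sample standing in for a downloaded .txt file. The marker lines
# and chapter headings are assumptions, not the format of any real edition.
RAW = """Header metadata
*** START ***
CHAPTER 1
I loue my loue and my loue loues me
CHAPTER 2
She walked and she spoke
*** END ***
Footer metadata
"""

def strip_metadata(raw, start="*** START ***", end="*** END ***"):
    """Keep only the text between the start and end marker lines."""
    body = raw.split(start, 1)[1]
    return body.split(end, 1)[0]

def split_divisions(text, heading=r"^CHAPTER \d+$"):
    """Split the text on heading lines and drop empty pieces."""
    parts = re.split(heading, text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

def tokenize(text):
    """Lowercase the text and pull out runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency(words, target):
    """Share of all tokens in `words` that are exactly `target`."""
    return Counter(words)[target] / len(words) if words else 0.0

chapters = [tokenize(c) for c in split_divisions(strip_metadata(RAW))]
for number, words in enumerate(chapters, start=1):
    print(number, round(relative_frequency(words, "loue"), 3))
```

The per-division numbers are what would then be handed to a plotting routine. Note that "loues" is not counted as "loue" here; whether variant forms should be merged is an interpretive decision the code has to be told about.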

In the case of Wroth’s Urania, for example, I know that the word “she” declines dramatically toward the end of the romance, precisely at the point when the words “lo”, “loue”, “louing”, “loued”, etc., spike dramatically. In the interest of quick results, I uploaded the romance to Voyant, an online visualization tool that remediates a text of your choice. Here, the blue line is the “loue” variations and the purple is “she.”¹

Voyant visualization of “she” and “loue” variations in Wroth’s Urania.

Towards the end of the romance is where the heroine Pamphilia finds happiness in love; and “she” simultaneously disappears, both literally and figuratively. Does this chart also open up a feminist critique of the loss of selfhood of an otherwise proactive and literarily productive female protagonist? Or does it simply reflect that Wroth appended the sonnet collection Pamphilia to Amphilanthus to the romance? In this collection, she details her constancy in her “loue” for Amphilanthus, but writes in the first person instead of the third. Thus the decline of “she.” I’m inclined to the latter interpretation; but, given the immense difference in length between the prose romance and the sonnet collection, there is still an interesting shift that might need further investigation. If you’re reading this blog, I don’t need to convince you of the value of digital or computational approaches. What these kinds of results remind me is that approaching old texts in new ways might let us see things we simply haven’t noticed yet. Computational approaches, once we learn them, are not only incredibly fast; they can also help us make remarkably subtle observations.

Though the multi-text capabilities of Voyant are not as subtle as I would like, they still gesture towards the simultaneous reach and delicacy of computational tools that I hope to achieve with R. When I uploaded all three texts to Voyant, I started to find some interesting things. For example, Antonie has the highest vocabulary density, while Urania has the lowest. (Urania is also the longest text; however, Discourse is the shortest, which lends credence to the density result. That is, Antonie seems to have a proportionally higher vocabulary density than the other texts, regardless of length.) More suggestive still are the words which are distinctive to each text; in Antonie, “hir” is most prevalent (56 instances), followed by “cl,” the speaker tag for Cleopatra (43), and “Antony” (40). In Discourse, we have “wee” (51), “worlde” (20), and “porte” (6); in the Urania, “shee” (1,386), “Amphilanthus” (392), and “Pamphilia” (269).
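The length question can be made concrete. The sketch below, in Python rather than R, takes vocabulary density to mean unique word forms divided by total words, which is an assumption about the measure and not a description of Voyant's internals. It also adds a windowed variant, averaging the density over equal-sized chunks, as one common way to compare texts of very different lengths. The function names and the toy word lists are hypothetical.

```python
def vocabulary_density(words):
    """Unique word forms divided by total words (assumed definition)."""
    return len(set(words)) / len(words) if words else 0.0

def windowed_density(words, size):
    """Mean density over consecutive full windows of `size` words.

    Falls back to the whole-text density if the text is shorter than
    one window. Leftover words after the last full window are ignored.
    """
    windows = [words[i:i + size]
               for i in range(0, len(words) - size + 1, size)]
    if not windows:
        return vocabulary_density(words)
    return sum(vocabulary_density(w) for w in windows) / len(windows)

# Two toy texts with the same vocabulary; the second just repeats it.
short_text = "a b c d".split()
long_text = "a b c d a b c d".split()

print(vocabulary_density(short_text), vocabulary_density(long_text))
print(windowed_density(short_text, 4), windowed_density(long_text, 4))
```

In this toy case the whole-text density of the longer text is half that of the shorter one even though the vocabulary is identical, while the windowed figures agree. A result like "Antonie is densest" would be more persuasive if it held up under some such length control.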

Again, the question is, what do we do with this data? I might conclude that Antonie is an extended blazon of Cleopatra’s qualities: her estates, her person, her speeches, her beauty. I might also say that it appears that the Urania doesn’t pass the Bechdel test; even though “shee” (1,386) is roughly three and a half times as frequent as “Amphilanthus” (392), we still have more mentions of Amphilanthus’ name than of Pamphilia’s (269), suggesting that characters (or the author) talk about him more than they talk about Pamphilia.

Yet I am not tied to any of these interpretations; they could be completely wrong. Instead, I am more inspired by the possibilities that are suggested by these lists of numbers. While I will eventually need to come to conclusions about the specifics of my data, for now I am content with what tools like Voyant and R certainly provide me: a different view. In other words, numbers are not enough; but more satisfying are the subtle characteristics that computational tools let me visualize, even when the sheer amount of text seems anything but subtle.

One short postscript: I spent hours (three, I think) trying to create a comparative scatterplot in Voyant of the distinctive words I mentioned above. The closest I came was this:

Attempt at a comparative scatterplot in Voyant.

And this is clearly not very legible. In order even to get to this point, I had to use the raw frequency of each word, and manually strip out partial words like “lo-”, “-ed”, “-ing”, “ha-”, and “bra-”. I also had to use a proximity tool; I asked Voyant to show me the words closest to “she,” and limit the results to about 35 words. One thing we can see is that “he” is the most common word closest to “she”; we also see verbs like “doe” and “make.” This suggests that “he” and “she” are both very active in the texts, and because “she” is more common than “he,” that the female protagonists are most active. However, I’m still not committed to these results, partly because I didn’t tell Voyant how to determine proximity, and partly because I still have a very hard time understanding what this plot is telling me. I present this plot for two reasons: one, because the prevalence of verbs is suggestive; and two, because I want to emphasize how important it is for humanist researchers to know at least a little bit about the back-end of the tool they might use. Since I don’t know exactly how Voyant determines proximity, and I also can’t tell it to consider the “u” character as part of a full word (as in loue, haue, or braue), I’m not willing to draw interpretations from this data. In other words, with Voyant I’m left with interesting directions for future inquiry; with R, because I will have written the code myself, I will feel confident in my results.
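To show what "telling the tool how to determine proximity" could look like, here is a minimal Python sketch (not R, and not Voyant's method, which is unknown here). It assumes one explicit definition: a word is close to the target if it falls within a fixed number of tokens on either side. The function name, the window sizes, and the sample sentence are hypothetical.

```python
from collections import Counter

def collocates(words, target, window=2):
    """Count words within `window` tokens on either side of `target`."""
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            # Tokens before the target, then tokens after it.
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample, already tokenized and lowercased.
words = "he said that she would doe as he and she make".split()

print(collocates(words, "she", window=2).most_common(3))
print(collocates(words, "she", window=3).most_common(3))
```

Widening the window from two tokens to three doubles the count for "he" in this sample, so any claim about which words sit closest to "she" depends on a parameter that needs to be stated alongside the result.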

Voyant is very fun to play with; you can upload a text from your computer, or paste a URL (say, to a Project Gutenberg text). There are several issues with Voyant, however. Primarily, Voyant found that the character string “lo” was the most-used string in the romance. This includes the exclamation “lo,” as well as all of the variations on “loue.” (As far as I understand, this is because of how Voyant parses XML; that is, when a text with Early Modern spelling conventions is uploaded, Voyant can’t recognize that “lo” and “loue” are two different words, because the introduction of the “u” character looks like the start of a new word. I can teach R this convention.) Voyant also segments the text according to its own logic (apparently into 10 segments); in R, I can analyze by chapter, or by poem, or theoretically even by page.
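As an illustration of teaching a script that convention, the following Python sketch (a stand-in for R) tokenizes on unbroken runs of letters, so that “loue” stays one word and the exclamation “lo” is counted separately. It assumes plain text in which the “u” is an ordinary character; if the source file encodes it as markup, that markup would have to be resolved first. The list of “loue” forms is hand-picked for the example, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hand-picked variant forms to group together; an assumption, not a
# complete list of spellings found in the Urania.
LOUE_FORMS = {"loue", "loues", "loued", "louing"}

def tokenize(text):
    """Lowercase the text and treat each unbroken run of letters as a word."""
    return re.findall(r"[a-z]+", text.lower())

def count_lo_and_loue(text):
    """Count the exclamation 'lo' separately from the grouped 'loue' forms."""
    counts = Counter(tokenize(text))
    return {
        "lo": counts["lo"],
        "loue": sum(counts[form] for form in LOUE_FORMS),
    }

sample = "Lo, she loued him; and louing, loue braue hearts haue."
print(count_lo_and_loue(sample))
```

Because the grouping is an explicit list, it can be inspected, argued with, and extended, which is the kind of control over the back end that the comparison with Voyant is about.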

Women Writers Project

The Blog
