Browsed by
Author: Sarah Connell

Spring Practicum Series

We’re delighted to share the spring dates for the WWP’s practicum series of two-hour workshops focused on particular skills and tools. Each session will be held from 12 to 2pm in the Digital Scholarship Commons in Snell Library. In the spring, we will be offering:

  • February 5: An XPath Excursion. This session will cover the basics of exploring and querying XML documents using XPath; we will navigate the document tree, limit our results to meet specific criteria, and discuss ways to take advantage of oXygen’s tools for XPath.
  • March 19: Regular Expressions for the Rest of Us. This session will cover search and replace with regular expressions, a very powerful way to search for patterns in text documents.
  • April 9: XPath XTreme. This session will cover more detailed use of XPath, including functions that make use of regular expressions, building on the topics discussed in the first two sessions.

These sessions are free and open to the public, but guests from outside the NU community should email Sarah Connell (sa.connell[at]northeastern[dot]edu) to arrange library access.

We hope to see you there!

To the Right Honourable, Virtuous, Heroical Reader

This post was authored by Anna Kroon, University of New Haven class of 2019, who held an internship at the WWP during the summer of 2017. 

I came to the Women Writers Project really excited to work on such a large project with a wide variety of texts in their files. My experience was limited to Victorian shipboard newspapers, so anything not related to the ocean or intellectual boat humor was thrilling to me.

Since I had experience with XML and the TEI, but not with WWP encoding, I wanted to learn how to encode a short text that was vastly different from what I had transcribed and encoded before. To begin, I worked on Poems on Various Subjects by Elizabeth Sarah Gooch. This is a pessimistic set of poems that mostly deal with the author’s loss and sadness.

What interested me most about this text was at the very end and not even written by Gooch at all. The last poem in the collection was written to Gooch by a Mr. Anthony Pasquin, Esq. Being so new to the WWP encoding guidelines I had to ask “is there anything special I do with a poem not written by the author?” The answer was not what I had expected from such a detail-oriented project: no.

To the Author, with Love

“Non-authorial paratexts” (NAPTs) became my specific interest (and the bane of my existence). My dad would joke “para what? Pair-a-socks?” as I stumbled through explaining the encoding jargon and an oversimplified definition of my research.

Essentially these NAPTs are texts (poems, letters, and other short texts) that are published with a larger text and are written by a person who is not the author of that text. As a general rule NAPTs are written to or about the author of the main text praising her abilities, virtuousness, or life. I had many hypotheses about the purpose, authors, and significance of these texts, but I had to pare them down to match the scale of a summer internship.

I started off looking to see if Pasquin’s poem was not in fact a singular phenomenon. I took to the digital stacks of already published and in-progress texts in Women Writers Online to see what I could find. Using XPath, I tried and erred my way through many possibilities of where and what these non-authorial paratexts could be. As of writing this post, I have found over 120 unique non-authorial paratexts from 30 different larger texts.

The Process of Finding NAPTs with XPath

Full disclosure: I had never used XPath before this summer. That, combined with my beginner’s knowledge of the WWP encoding, made my XPath searching very slow to start. I used some logic and knowledge of general publication formatting to create my first few queries.

Texts are very broadly split into three main sections: front matter, main body, and back matter. Based on my initial assessment, I determined that the author generally writes everything in the main body without much contribution by others. So it wouldn’t be very fruitful to look there. Front matter comes first and often holds dedications and other prefatory materials. Even though the case that sent me to this research was in the back matter I thought it prudent to start with what was in front of me.

Within the front matter, a <div> (textual division) can have several values for the @type attribute, including “advert,” “contents,” “ded” (dedication), “frontispiece,” “prefatory,” and “prologue.” The two values for @type that seemed the most useful were “ded” and “prefatory.” For my first search, I used “prefatory” because it is the broader category and would give me more results, which I could then refine.

I didn’t want to cast my net too wide on my first search, so I also decided to specify that I was looking for poetry. The XPath I used was: //front//div[@type="prefatory"]//div[@type="poem"]
This essentially says: “Look in the front matter, then look in <div>s with a @type value of ‘prefatory,’ and then look for <div>s with a @type value of ‘poem.’” This search gave me 12 poems, 4 of which were NAPTs. One text I found in this fashion was Katherine Philips’ Poems (1664), which had 7 poems, written mostly by men, praising Philips.
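The query above can be tried outside of oXygen as well. What follows is a minimal sketch, assuming Python's standard-library ElementTree (which supports only a subset of XPath) and a made-up, un-namespaced sample document; real WWP files are namespaced TEI and far richer, so the sample and its headings are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample; not an actual WWP file.
SAMPLE = """
<text>
  <front>
    <div type="prefatory">
      <div type="poem"><head>To the Author</head></div>
      <div type="poem"><head>Upon her Poems</head></div>
    </div>
    <div type="ded"><head>To the Queen</head></div>
  </front>
  <body>
    <div type="poem"><head>A Song</head></div>
  </body>
</text>
"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to an element, so the query starts with "."
# instead of the bare "//" used in the post.
query = ".//front//div[@type='prefatory']//div[@type='poem']"

# Collect the heading of every poem nested in a prefatory division of the front matter.
heads = [div.findtext("head") for div in root.findall(query)]
print(heads)
```

The poem in the body and the dedication are skipped because the path requires both the front matter and a "prefatory" ancestor.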

While I went through my first set of results I noticed some of the line groups had a @type of “para.” In the hazy fog of researching NAPTs I assumed that this was an irregularly used value for paratexts. I added this on to my original search creating: //front//div[@type="prefatory"]//div[@type="poem"]//lg[@type="para"]
I was dismayed to find that I was looking at my previous search results except broken down by line group rather than any sort of new material. Still determined that “para” stood for paratext I tried deleting the type of ‘prefatory’ from my search. This gave me more fruitful results. The Poetical Works of the late Mrs. Mary Robinson had an astounding 19 paratexts written by various important men, including the Duke of Leeds.

This success made me sweat a little: maybe what I was researching had already been done and didn’t need me endlessly XPath searching. A quick look through the WWP encoding guidelines assuaged my fears; they state that the value “para” actually stands for verse paragraphs. After this I went back to the model of my first query. Since “ded” was the other front matter @type value that seemed worthwhile, I simply exchanged “poem” for “ded” and searched: //front//div[@type="prefatory"]//div[@type="ded"]. I had not yet realized that these were two values that were essentially describing the same level of textual divisions and that dedications would not necessarily be nested within more general prefatory textual divisions. My results helped me to see the issue with my search: there was only one resulting dedication, written by the author to a princess.

From these missteps I decided to go with a simple query: //front//div[@type="ded"]. This action was driven by frustration, but also observation. Most of the non-authorial paratexts I found were poems or letters written to or about the author. In other words, dedicated to the author. There were 143 results, many of which were actually authorial paratexts (that is, paratexts written by the author herself). Even so, from this single search I found 77 unique non-authorial paratexts. However, many of these were not the highlighted passages found by XPath. I scrolled up and down around the search result to see if there were potentially other NAPTs in that specific document. Of the 77 I found with the [@type="ded"] search, 52 were found through scrolling—or, 67% were not a direct result of the XPath query.

As the simple “ded” search was very successful, I thought another simple search would be a good course of action. Searching //front//div[@type="prefatory"] yielded 432 results. While I knew that there would be many non-authorial paratexts hidden under a lot of other odd prefatory material, this was such a large grouping that it would take a single person with a time constraint far too long to parse through. To trim this list down, I looked at prefatory material that contained <signed> elements inside of <closer>s.

When scrolling through texts I quickly learned that a closing byline or signature was more common than an opening one beneath the heading. There were some rare cases where the authorial attribution was baked into the title, but as a general rule NAPTs in the WWP textbase have signatures as authorial attributions, so looking for dedications that contained <signed> elements gave me another way to refine my results. I performed several other searches, including reviews of the back matter, until I had a pool of nearly 130 unique non-authorial paratexts, which seemed like a comfortable amount to make a case for tagging these texts and a significant enough sample size to analyze.
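The post does not give the exact query used for this refinement. In full XPath it might look something like //front//div[@type="prefatory"][.//closer/signed], but that is an assumption. The sketch below does the same filtering in two steps with Python's ElementTree, whose XPath subset cannot express the nested predicate; the sample document and signature are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; not an actual WWP file.
SAMPLE = """
<text>
  <front>
    <div type="prefatory">
      <head>To the Reader</head>
      <p>An unsigned address.</p>
    </div>
    <div type="prefatory">
      <head>To Mrs. A. B. on her Poems</head>
      <lg><l>A line of praise.</l></lg>
      <closer><signed>J. K.</signed></closer>
    </div>
  </front>
</text>
"""

root = ET.fromstring(SAMPLE)

signed_divs = []
for div in root.findall(".//front//div[@type='prefatory']"):
    # Keep only divisions that contain a <signed> directly inside a <closer>.
    signature = div.find(".//closer/signed")
    if signature is not None:
        signed_divs.append((div.findtext("head"), signature.text))

print(signed_divs)
```

Each hit still needs a human check that the signature does not belong to the author herself, as described in the next section.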

Features of a Non-Authorial Paratext

My lengthy process of trial and error was at times frustrating, but it did make me acutely aware of what was and what wasn’t a non-authorial paratext. The first thing I noticed was the signature or byline. When XPath pulled up a potential file, I scrolled to the very end and looked for a line that began with <signed>. If there was a signature, I would check to make sure that the author attributed was not the author herself. At times, this meant looking to the publication information or the personographic data to make sure the initials were not hers. If there were any doubts about authorship (especially for texts written by various or unknown authors) I would not catalog it.

If there was no signature line I would check the <head>, or a <head> with a @type of “sub,” for a byline. This was the less common occurrence, which is why I would look at it second even though it comes before the ending signature. I observed this most often when the author of the non-authorial paratext was someone of importance such as a duke or a lord.

Another feature I touched upon in my searching was the titles of the paratexts. I focused on the subset of non-authorial paratexts that were written to the author praising her abilities. This is shown in the titles with prepositions like “to,” “on,” or “upon.” The most common first word of the title was, unsurprisingly, “to” with 71 instances or 57%. I looked at the other first words and considered a word significant if there were more than 5 occurrences. I decided to put “Impromptu” and “Sonnet” together since they are both declarations of the poem’s type and neither was significant on its own. The significant identifying words broke down as follows:

“To”: 57%
Less than 5 occurrences, various: 22%
“On/Upon”: 11%
Untitled: 6%
“Impromptu/Sonnet”: 4%
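A tally like the one above can be produced with a few lines of code. This is a minimal sketch, assuming a plain list of titles; the titles below are invented for illustration and are not the cataloged data, and the step that lumps rare first words into a "various" group is left out for brevity.

```python
from collections import Counter

# Hypothetical titles for illustration only; these are not the cataloged data.
titles = [
    "To the Author",
    "To Mrs. Robinson",
    "Upon her Verses",
    "On the Same",
    "Sonnet",
    "To a Friend",
    "",  # an untitled paratext
]

def first_word_class(title):
    """Classify a title by its first word, merging the groups used in the post."""
    words = title.split()
    if not words:
        return "Untitled"
    first = words[0].strip(".,;:").lower()
    if first in ("on", "upon"):
        return "On/Upon"
    if first in ("impromptu", "sonnet"):
        return "Impromptu/Sonnet"
    return first.capitalize()

counts = Counter(first_word_class(t) for t in titles)
total = sum(counts.values())

# Share of each class as a whole-number percentage of all titles.
shares = {label: round(100 * n / total) for label, n in counts.items()}
print(counts.most_common())
print(shares)
```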

The keyword contents of the title and the author signature were the two most important features for identifying an NAPT. The only other useful tool was context. There were one or two occurrences where the paratext was not written to the author, but the content of the poem or letter made it clear that it wasn’t written by the author herself.

Marking Up Non-Authorial Paratexts

Since I clearly proved the existence and abundance of NAPTs in the WWP textbase, the next step was to discuss markup to make the non-authorial content distinct from the authorial content. I discussed this topic with Sarah Connell and Ashley Clark as well as some encoders working on the project. We came to the conclusion that this was not going to be decided soon, but created several potential methods of encoding.

The first was simple and logical, but would be more difficult to implement. This meant creating an attribute that would be used on a <div> with a set of values that indicated the paratext author’s assumed gender. For example, an @author attribute with values of “male,” “female,” “collective,” “unknown,” and “nonbinary.” Therefore a poem written by the author’s brother would be marked as:
<div author="male"><head>To my beautiful sister by <persName> Joseph Kroon </persName></head> [insert poem here]</div>.

The second method followed the logic of the first, but was more practical. Rather than create a new attribute we would use @resp which indicates “the nature of a person’s intellectual responsibility, or an organization’s role in the production or distribution of a work” according to the Text Encoding Initiative’s P5 Guidelines. Instead of values that indicated the NAPT author’s gender we would create general personography entries for @resp to point to, expanding on the existing entry for “unknown author” to include unknown male, female, collective, and nonbinary authors. For example, that same poem by the author’s brother would now be encoded as:
<div resp="p:umale.agv"><head>To my beautiful sister by <persName> Joseph Kroon </persName></head> [insert poem here] </div>.

This method would also provide more specificity for paratexts where the NAPT author already has an entry in the personography:
<div resp="p:jkroon.doe"><head>To my beautiful sister by <persName> Joseph Kroon </persName></head> [insert poem here] </div>.

As of this writing, no decision has been made on the method of markup. Thinking theoretically, if we had decided upon the markup, my next step would be to implement it in a special test folder on the non-authorial paratexts I cataloged. From there we could see how the new markup interacts with the existing markup to ensure that it doesn’t cause any issues. We could also use this marked-up data to perform simpler searches for NAPTs. We would even have the ability to easily create reports with bibliographic data for analysis.
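To illustrate what a "simpler search" might look like, here is a purely hypothetical sketch that assumes the second (@resp) method had been adopted. No markup decision has been made, so the attribute usage, the sample document, and the pointer values (borrowed from the examples above) are assumptions, not WWP practice.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample using the proposed @resp markup; not an actual WWP file.
SAMPLE = """
<text>
  <front>
    <div type="ded"><head>To the Princess</head></div>
    <div type="prefatory" resp="p:umale.agv"><head>To my beautiful sister</head></div>
    <div type="prefatory" resp="p:jkroon.doe"><head>To the Author</head></div>
  </front>
</text>
"""

root = ET.fromstring(SAMPLE)

# With the experimental markup in place, one short path finds every tagged paratext.
napts = root.findall(".//div[@resp]")

# Group the hits by the personography entry they point to, as a starting point for reports.
by_resp = Counter(div.get("resp") for div in napts)
print(len(napts), dict(by_resp))
```

Compared with the trial-and-error queries described earlier, a single attribute test would replace the combination of @type values, signatures, and manual scrolling.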

Analysis of the Non-Authorial Paratexts

After cataloging and marking up the paratexts, my final goal was to research a few hypotheses about the authorship of NAPTs. I brainstormed many different hypotheses that I wanted to look into, but settled on three.
1. The gender composition of paratext authorship is mostly male.
2. Most paratexts are written by people of distinction.
3. Texts with a significant number (more than 5) of NAPTs are published posthumously.

The authorship genders broke down as: 5% female, 36% male, and 59% gender unknown, thus showing my hypothesis was plausible. The considerable number of gender unknown authors comes from the fact that authors would sign their paratext with only their initials or a gender non-specific abbreviation. I did not have the resources or the time to search for mystery authors with only initials and the main author’s bibliographic data.

I defined distinction very broadly as anyone who had a title of any sort. This could be anyone from a king to a judge considered “honourable.” Of the 126 cataloged non-authorial paratexts, 24 (or 19%) were written by someone whom I defined as distinguished. The distinguished titles broke down as follows:
Duke: 4%
Marquis: 4%
Earl: 8%
Lord: 4%
Reverend/Doctor/Reverend Doctor: 13%
Sir: 4%
Honourable: 4%
Esquire: 55%
All this to say that my hypothesis was proven false. However, any of the unknown authors who signed with initials could be a titled person of distinction.

My final hypothesis was going to be researched with the use of the WWP’s counting XQuery and the experimental paratext markup. Since we were not able to come to a decision on this topic, I was able neither to mark up the texts nor to have an automatic routine parse through lots of data. Curating author death dates and text publication dates for nearly 100 different texts would have been far too time consuming. However, my text with the most non-authorial paratexts was The Poetical Works of the Late Mrs. Mary Robinson with a stunning 19 non-authorial paratexts. From the title and the contents of the individual non-authorial paratexts it is apparent that this volume was published posthumously and Mrs. Robinson was a well-known and well-loved author.

This project is still ongoing. Once a decision is made on how to tag the non-authorial paratexts and they have been tagged, there are many other topics to research. We could easily identify how many NAPTs there are per text or the genre that has the most NAPTs. With more connections to the WWP’s bibliographic information we could see if any of the NAPT authors are other female authors in the database. We could also do a full analysis of publication versus death date to see if texts published posthumously do in fact have more non-authorial paratexts than texts published during the author’s lifetime.

WWP Practicum Series

We’re delighted to announce that the WWP will be offering a new practicum series during the 2017–2018 academic year. In this series, we’ll be holding two-hour workshops focused on particular skills and tools. Each session will be held from 10am to 12pm in the Digital Scholarship Commons in Snell Library. In the fall, we will be offering:

  • October 4: File Management For Digital Humanities Researchers. This session will cover essential strategies and design considerations for organizing files and research data for the long term, including basics of using the command line to see under the hood of your hard drive.
  • November 8: Using Oxygen Like An Expert: projects, frameworks, and scenarios. This session will cover advanced topics in the use of the Oxygen XML editor, including setting up projects, frameworks, transformation and validation scenarios, and version control client plugins. Everything you’ve always wanted to know about Oxygen (but were afraid to ask…).
  • December 6: Efficient Code-Writing in RStudio. This session will share a “cheat sheet” developed to cover the most common usage scenarios for typical digital humanities data. We’ll also discuss how to write R code that can be efficiently embedded in workflows to handle repetitive tasks.

In the spring semester, we’ll look at further topics including basic and advanced XPath, regular expressions, Schematron, and XQuery.

These sessions are free and open to the public, but guests from outside the NU community should email Sarah Connell (sa.connell[at]northeastern[dot]edu) to arrange library access.

We hope to see you there!

Intertextual Networks: Theorizing and Encoding Textual Connections in Early Women’s Writing

Below are lecture notes from a paper by Sarah Connell and Julia Flanders, part of a panel on intertextuality in early women’s texts at DH2017.

I want to begin by thanking our co-panelists for their really thoughtful and exciting presentations, as well as my co-author Julia Flanders, the conference organizers, and, of course, all of you for joining us today. I also have to thank the NEH for their support of this project as well as the rest of the WWP team—Lara Rose, Param Ajmera, Ashley Clark, and Syd Bauman—without whom none of this would be possible.

I’ll begin my talk with a quote from a letter written by Ann Candler to two women who had helped sponsor the publication of Candler’s work. This letter is included within the “short narrative” of Candler’s life by the editor of her Poetical Attempts, and, in fact, it comprises much of that narrative in terms of length and biographic detail. In this letter, Candler writes that these ladies may be “surprized” at her “prolixity” which exceeded her own intentions, but, she says: “one circumstance was so connected with another, and one word naturally introduced others” so that her prolixity was unavoidable. Candler also notes that she wanted to be as “explicit as possible” for the satisfaction of her audience. I’d like to take three things from this quotation. First, as a letter initially written by an author to her patrons for distribution among those patrons’ friends, and then published paratextually with that author’s collection of poems, this is a good example of the kinds of layered intertextuality we’ve been exploring at the WWP. Second, Candler’s discussion of how one word leads naturally to others is very neatly descriptive of what we’ve found in our research thus far, where we’ve been trying to map the webs of interconnected texts that appear in our Women Writers Online corpus. And, finally, in this presentation I hope to take inspiration from Candler and even do her one better, dodging the pitfall of surprising prolixity while nonetheless being as clear and explicit as possible, hopefully to the satisfaction of you friends who have joined us here today. To that end, I’ll start with a bit of background.

The Intertextual Networks project builds off of the Women Writers Online collection, which has about four hundred texts by women. These are largely print texts, although we do have one manuscript collection. We have a relatively broad chronological framing, 1526 to 1850, and the texts themselves are quite generically diverse. These texts are published in a web interface called Women Writers Online, and they’re encoded in TEI. 

While the Women Writers Project has been around since the late eighties and Women Writers Online was first released in 1999, the Intertextual Networks project is a much more recent endeavor; this is a three-year, NEH-funded Collaborative Research grant project that started in the fall of 2016. The project is aimed at fostering research into the rhetorics of intertextuality used by early women writers. We’re interested in both straightforward cases of direct quotation and citation as well as subtler forms of intertextual engagement, such as verbal echoes, stylistic similarities, imitation, and parody. Our work is focused on exploring both early women’s intertextual practices and the representation of those practices through encoding—that is, we’re expecting the project’s literary and historical research to help us study what our markup can do, just as we’re using the markup to help answer literary and historical questions.

For this project, we have three major research activities and outcomes. First, we’ll be linking all of the textual references in Women Writers Online to a bibliography we’re creating. I’ll be talking more about this work soon but I’ve given a quick example of what we’re doing at the bottom of the slide—in this case, we’re using the @source attribute on the <quote> element to point to an entry we’ve created in a bibliography file. We’re also investigating how we might mark up intertextual phenomena that are less straightforward than direct reference or quotation and I’ve chosen this particular example because it illustrates one such phenomenon: parody—and, in fact, is a particularly clear instance, since Charlotte Smith is kind enough to tell us that she’s parodying a “sublime sentence” from Edmund Burke’s Reflections on the Revolution in France. Most of our authors, I’m afraid, are not quite this kind. But that’s what makes this interesting!

Which leads to our second project outcome: we’re publishing our research into encoding complex intertextual features on our blog and we’ll be sharing a final report at the end of the project. And finally, we have assembled a team of collaborators who are completing individual research projects, to be published on our open-access Women Writers in Context platform, with more incremental reports on our blog. These collaborators are doing some really promising work—for example, we have a project using network and braid analysis to look for similar dialogue patterns in the works of Margaret Cavendish and Molière, one collaborator who’s been tracing references to Elizabeth I throughout the whole WWO collection, and another project using markup to study marginal biblical references as a counternarrative to the “main” poetic text in Lucy Hutchinson’s Order and Disorder. I’m not going to focus on these today, but I really can’t stress enough how exciting our collaborators’ projects are—and I’ve given some links where you can learn more about everything I just mentioned. 

Today, I’d like to focus on that first strand of our work, the encoding enhancements, and share some lessons that might be useful to other projects engaging in large-scale markup expansions. I’d also like to discuss some of our initial discoveries about early women’s intertextual networks—and the representation of such networks in a collaborative online environment.

On a corpus-wide level, what we’re doing is extracting all of the contents of the <title> and <quote> elements in our collection and creating bibliography entries for each text. To give you a sense of the scope of this work: we have around 5,500 separate <title> elements in Women Writers Online and around 11,500 distinct <quote> elements. Thankfully, some of those <quote> and <title> elements are pointing to the same texts, so we’re not looking to create a bibliography with 17,000 entries, but it’s still a fair amount of work!

The “before” example here shows you some of the markup we’re relying on to pull out the titles and quotations in our collection so we can create entries for them; you can see that we’ve already marked that Midsummer Night’s Dream is a title and that this group of poetic lines is a quote. The sharp-eyed among you may have noted that this is actually a misquotation, and one that changes the meaning of the quoted text to the much more optimistic claim that “the course of true love ever did run smooth.” Given that this is corrected in the second edition, it’s likely a typesetting error, but we’re still investigating whether this might be deliberate. As an epigraph with an associated bibliographic reference, this is one of the more detailed examples in WWO, so we can even see that “Shakesp.” is not just an abbreviated person’s name but also the name of an author nested in the same <bibl> element as our title. In the “after” version on the right, we’ve created a bibliography entry for Midsummer Night’s Dream and, in our encoded file, we’ve used the @ref and @source attributes on <title> and <quote> to point to that entry. We’re also using @ref to point to our personography entry for Shakespeare. In this example, and most of the ones I’ll show, I’ve simplified the encoding to focus on the features I’m discussing here.
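The pointing mechanism in the "after" version can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual data model: the identifier "b:example-msnd", the dictionary standing in for the bibliography file, and the simplified markup are all made up, and the personography pointer is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography keyed by a made-up identifier.
BIBLIOGRAPHY = {
    "b:example-msnd": {
        "title": "A Midsummer Night's Dream",
        "author": "William Shakespeare",
    },
}

# Simplified "after" encoding: @source on <quote> and @ref on <title>
# both point at the same bibliography entry.
SAMPLE = """
<epigraph>
  <quote source="b:example-msnd"><l>The course of true love ever did run smooth.</l></quote>
  <bibl><persName>Shakesp.</persName> <title ref="b:example-msnd">Midsummer Night's Dream</title></bibl>
</epigraph>
"""

root = ET.fromstring(SAMPLE)

def resolve(element):
    """Look up the bibliography entry an element points to, or None if it has no pointer."""
    pointer = element.get("source") or element.get("ref")
    return BIBLIOGRAPHY.get(pointer)

# Map each pointing element to the regularized title of the work it references.
resolved = {
    el.tag: resolve(el)["title"]
    for el in root.iter()
    if el.tag in ("quote", "title")
}
print(resolved)
```

Because both elements resolve to one entry, the abbreviated title in the text and the unattributed lines of verse can be counted as references to the same work.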

Okay, so how are we getting from before to after? Well….

To begin with, we’ve taken an aggressive lowest-hanging-fruit stance. In fact, if I can abuse the metaphor a bit, we actually started by scooping up all the apples that had already fallen to the orchard floor: that is, those <title> elements that were inside of <bibl> elements and, thus, likelier to have such helpful information as who their authors were. We wanted to start with the references that would be easiest to track down, so that we could fill in our bibliography file and then apply everything we learned from that process to our work on the increasingly obscure references: that is, <title>s that are not inside of <bibl>, <quote> elements with associated title or author information, and then unattributed quotations.

Other key aspects of our approach have been: combining human and programmatic interventions, using the lightest possible tools for each task, and continually testing our encoding decisions against our actual corpus. For example, here’s the spreadsheet we used to fill in the majority of the <title>s in the collection; we decided a shared spreadsheet would be the best way to make input and versioning very lightweight for the bulk of our title references: that is, all of the ones inside of <bibl> as well as those with multiple attestations. We used XQuery to pull out lightly regularized versions of all the contents of those <title> elements, aggregating, for example, variations where one text might use a long s and another might not. The columns with the titles and authors that we extracted from WWO are locked in the spreadsheet so that encoders can’t accidentally edit them, since we’ll be using that information to automatically add the unique identifiers back into WWO when we’re done. Our encoders are then filling in display titles and full titles, along with standardized versions of authors’ names, publication locations, and publication dates. This is all operating at a high level of abstraction; we usually don’t know which edition or printing is at stake, so we’re instead filling in the earliest-known information. We took advantage of the capabilities the spreadsheet format offered by, for example, using simple color coding to assign texts to our encoders, mark texts that we needed to discuss as a group, and so on.
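The light regularization step was done with XQuery; the sketch below uses Python as a stand-in to show the general idea. The specific rules (folding the long s, ignoring case and extra whitespace) and the sample titles are assumptions for illustration, since the talk names only the long s as an example.

```python
from collections import defaultdict

def regularize(title):
    """Lightly regularize a title so that spelling variants group together."""
    folded = title.replace("\u017f", "s")  # long s (ſ) becomes a regular s
    # Lowercase aggressively and collapse runs of whitespace.
    return " ".join(folded.casefold().split())

# Hypothetical extracted <title> contents.
titles = ["Paradi\u017fe Lo\u017ft", "Paradise Lost", "paradise  lost", "The Task"]

# Aggregate the original forms under their regularized key.
groups = defaultdict(list)
for title in titles:
    groups[regularize(title)].append(title)

for key, variants in groups.items():
    print(key, "<-", variants)
```

Keeping the original forms alongside the regularized key matters here, since the extracted strings are what allow identifiers to be written back into the source files later.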

Our most pressing concerns are ensuring that our unique identifiers are unique and making sure our data is consistent, so we also set up a separate sheet to enter preferred formats for each publication location and then locked the input column in this sheet so encoders can only input publication places established in our “canonical” list. When there are duplicates (for example, you can see we have references to Benjamin Church’s History of Philips War under two title variations), we give both the same unique identifier and mark repeats with “[dup]” in the “Display title” column. In the parts of the spreadsheet I couldn’t fit on this slide, there are the publication location and date columns, as well as columns for notes and source documents, and for the XPath that located each title, which lets encoders find usages in our WWO corpus whenever more context is needed.
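The two consistency rules described here (places drawn from a canonical list, and shared identifiers only for rows marked as duplicates) could also be checked after the fact. The following is a hedged sketch, assuming the spreadsheet has been exported as a list of rows; the column names, identifiers, and sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical canonical list of publication places.
CANONICAL_PLACES = {"London", "Boston"}

# Hypothetical rows exported from the spreadsheet.
rows = [
    {"id": "church.history", "display": "History of Philip's War", "place": "Boston"},
    {"id": "church.history", "display": "[dup] History of Philip's War", "place": "Boston"},
    {"id": "milton.paradise", "display": "Paradise Lost", "place": "London"},
    {"id": "milton.paradise", "display": "Paradise Regained", "place": "Londn"},
]

def check(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)
        # Every place must come from the canonical list.
        if row["place"] not in CANONICAL_PLACES:
            problems.append(f"unknown place {row['place']!r} for {row['id']}")
    for ident, group in by_id.items():
        # An identifier may repeat only when the extra rows carry the [dup] marker.
        unmarked = [r for r in group if "[dup]" not in r["display"]]
        if len(unmarked) > 1:
            problems.append(f"identifier {ident} reused without a [dup] marker")
    return problems

problems = check(rows)
print(problems)
```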

So, this spreadsheet worked really very well for the most straightforward cases. But as anyone who’s worked with bibliographic data knows, you never have as many straightforward cases as you’d hope. We’d known from the outset the spreadsheet would be inadequate for some kinds of texts, but, I have to confess, we didn’t think we’d hit those in the very first week.

But, of course we did. So we also set up an XML input file for the cases where our spreadsheet was inadequate. And we found that these balanced nicely; the spreadsheet made it possible to have two people working on fifteen hundred records without any version-control issues and the XML was there for the smaller number of cases where we needed more structure. Such as: 

Periodicals, analytic-level references to works published within a larger monograph, references to particular editions or printings, and so on. For one example, here is a quotation from the preface to Finch’s Sonnets and Other Poems, citing some advice from Tales of the Castle that a “woman ought never to suffer a man to add a single word to her writings” lest he “pass for the original inventor” and we have a fairly full bibliographic citation with a title, volume number, and page number. Tales of the Castle is a work in translation, so to create a record for this text, we needed to turn to the XML bibliography, which is designed to handle such complexities. 

As you can see here, we are using the different levels of bibliographic abstraction represented in the Functional Requirements for Bibliographic Records (FRBR), and mapping them onto specific TEI elements for recording bibliographic information. Although it’s more detailed than the example I showed in the “before” and “after” slides, this still is not the final version of how our XML bibliography will be set up. Instead, it’s been sufficient for our encoders to record the information we knew we needed to have, such as Thomas Holcroft’s role as the translator of Tales of the Castle.

After we’d reviewed all of the contents of the <title> elements in <bibl>, which amounted to about 1,100 texts named in 2,100 individual <title> elements, we had enough data to feel we’d sufficiently tested both our work processes and our handling of bibliographic complexities. Before we turned our attention to the rest of the collection, we held the first of several review phases, in which we checked for duplicates and inconsistencies. We’ve found that building in periodic review processes has been a really important aspect of our work. Unless your input processes are extremely restrictive, you probably will get some duplicates, so it’s a matter of finding a balance where you actually get some work done, check for duplicates, work some more, check for duplicates, and so on.

We’ve adopted a multipronged approach to deduplication: we use alphabetization as a first-pass method, beginning with our input phase in which our alphabetized extracted titles made it clear that, for example, “Collections of the Massachusetts Historical Society” and “Collections of the Historical Society” should be checked as potential duplicates. Of course, variant titles aren’t always alphabetically proximate, so if one author refers to the “New and General Biographical Dictionary” as the “General Biographical Dictionary” and another calls it just the “Biographical Dictionary,” the fact that these are duplicates is less immediately clear. But, since our mechanisms for creating display titles are well-documented and our encoders have been very consistent, we can review the display titles, as the report here shows, to see if there are any apparent exact duplicates as with “A Geographical and Historical View of the World” or likely cases of duplication as with the “Adventures of Telemachus.” In resolving potential duplicates, we’ve been able to use authorship and other publication details to determine which actually are duplicated, which has been particularly necessary for the great many works named in WWO that are titled “Poems.” In addition to alphabetization, we’ve also found that ngram counting is a helpful mechanism for locating potential duplicates and we now have a pool of tests we can draw on to check our data.
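To make the ngram idea a little more concrete, here is a minimal sketch of one way such a check could work. It is not the WWP's actual test: the function names, the choice of word bigrams, and the 0.4 threshold are all assumptions made for illustration, and the titles are the examples mentioned above.

```python
from itertools import combinations


def word_ngrams(title, n=2):
    """Return the set of lowercase word n-grams in a title."""
    words = title.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_duplicates(titles, n=2, threshold=0.4):
    """Yield pairs of titles whose n-gram overlap meets the threshold.

    The threshold is an arbitrary illustrative value, not a tested one.
    """
    grams = {t: word_ngrams(t, n) for t in titles}
    for left, right in combinations(titles, 2):
        score = jaccard(grams[left], grams[right])
        if score >= threshold:
            yield left, right, round(score, 2)


titles = [
    "New and General Biographical Dictionary",
    "General Biographical Dictionary",
    "Biographical Dictionary",
    "Adventures of Telemachus",
    "Poems",
]
pairs = list(candidate_duplicates(titles))
for left, right, score in pairs:
    print(f"{score}: {left!r} ~ {right!r}")
```

A one-word title such as “Poems” produces no bigrams at all, so a check like this never flags it; those cases still need the authorship and publication details described above. Any pair that is flagged is only a candidate for a human reviewer, not a confirmed duplicate.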

We followed an iterative process of data gathering and review to collect information on all of the titles in our collection that were in <bibl>s or named more than once in ways that could be identified programmatically. We then set up a canonical bibliography, which our encoders now check before they create any new textual records, to make sure we haven’t already encountered a text, since we’ve now encountered so many.

Before we took on the remaining 1,800 “singleton” titles, that is the ones outside of <bibl> named only once in ways that our routines could identify, we made another adjustment to our processes and switched our primary input mechanism from the spreadsheet to…

The inspectre, which Ashley Clark created as a web-based platform for inspecting and editing XML documents using XPath and XQuery, a platform that enables us to combine programmatic and human interventions. If you’re curious about the inspectre, Ashley and I have a paper on it and I’ve provided the link on this slide. For today, I’ll just say that this interface was ideal for handling the singletons because it allows our encoders to easily see the context for each title, without having to open an XML file. We took all of the process hacks we’d developed in our first phase and applied them here. So, you can see that this particular entry for “The Arraignment of Lewd, Idle, Froward, and Inconstant Women” was assigned to Lara and you can see the XPath Lara would use if she needed to access the original text. In the box, the encoder can see each title in context, with the ability to switch to the XML as needed. We’re now using webform mechanisms like checkboxes to track the information our first phase taught us we would need, including the fact that this is a duplicate, since we already had an entry for the more commonly used version of the title “The Arraignment of Women.” The rest of the inspectre page, not shown here, gives the encoder spaces to fill in all of the publication details and to add some more flags, such as “I have a question about this” and “I’m done with this.” Those flags really come in handy for reviewing the remaining titles as a whole to locate the entries that have been completed, those that need group discussion, and so on.
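For readers who want a rough sense of what “singleton” means in practice, the sketch below finds titles that sit outside of <bibl> and are named only once. It is a Python stand-in rather than the XPath and XQuery that the inspectre actually uses; the sample markup is invented, and real WWO files use the TEI namespace, which is left out here for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; element names follow the talk, the content does not
# come from any WWO file.
SAMPLE = """
<text>
  <p>She praised <title>The Arraignment of Women</title> at length.</p>
  <p>See <bibl><title>Tales of the Castle</title>, vol. 2</bibl>.</p>
  <p>He read <title>Adventures of Telemachus</title> twice.</p>
  <p>Again <title>Adventures of Telemachus</title> is named.</p>
</text>
"""


def titles_outside_bibl(root):
    """Collect the text of <title> elements that have no <bibl> ancestor."""
    inside = {id(t) for b in root.iter("bibl") for t in b.iter("title")}
    return [
        "".join(t.itertext()).strip()
        for t in root.iter("title")
        if id(t) not in inside
    ]


root = ET.fromstring(SAMPLE)
counts = Counter(titles_outside_bibl(root))
# A "singleton" here is a title string that occurs exactly once.
singletons = sorted(t for t, n in counts.items() if n == 1)
print(singletons)
```

Exact string matching like this is the reason the human review still matters: a variant title would be counted as a separate singleton even when it names a work that is already in the bibliography.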

Currently, we’re wrapping up the last few hundred singleton titles and completing another review of our data, after which we’ll be able to automatically add the unique identifiers for each title back into the WWO corpus and finalize our XML bibliography with all of our entries, regardless of their input mechanism. Then, we’ll turn our attention to the quotes and we’re very optimistic that the processes we’ve established so far will continue to ensure that we’re working efficiently while taking care to produce data we can rely on. We’ve also begun some exploratory investigation of the data that we’ve created thus far, which means that I can now share one example of the research into early women’s intertextualities our encoding enhancements have enabled.

I’ve been looking at citation patterns for a subset of the texts referenced by our authors, narrowing the scope here from all the texts named in WWO to just the histories, two hundred titles out of several thousand.

I took inspiration from Margaret Cavendish, who asserts that there are many sorts of histories, the primary of which are world histories, national histories, and biographies—and, using a slightly broader taxonomy, I categorized the historical works named in Women Writers Online, based on the regularized and deduplicated list of titles that we created.

These numbers are based on the individual titles named, rather than the number of times each title might be referenced—I wanted to examine the historical lexicon of our authors, as it were. From here, I can look at actual numbers of citations, as well as where these different sorts of histories are being named, using our metadata to track references by publication location, genre, or time period. I can also look at the markup around these references to see whether these histories are appearing in notes, in running prose, in advertisements, and so on. I’m planning to add more layers to this general taxonomy of historical texts and I’ve found it very helpful to consult the types of biographies and collections already shared by the CBW project.

Even given the preliminary nature of these results, I’m finding some threads that merit further investigation, particularly in the higher numbers of biographies referenced, which I think may be reflecting a larger historiographic trend. Biographies were a historical genre that became accessible to women writers earlier than the prestige genres such as the universal or national history, so the fact that our authors are also citing more biographies is quite interesting.

As we continue with our encoding enhancements, we’re very excited about the potential that WWO will have to answer research questions much like this one. 

I want to close with some thoughts on how our three projects connect with each other, and with others working on women’s writing. While the scopes of our respective projects are quite distinct, there are some shared threads that I think are worth noting: we are concerned with balancing the large scale and the finely-grained in both the taxonomies and the metadata we are creating. We call attention to the labor-intensiveness of the work that we do and we frame that work as being in conversation with, and—at least ideally—interoperable with other projects’. We have many of the same research questions as we investigate what women were reading, as well as how women were being read and written about—and as we explore digital methods for representing women’s textual networks.

The fact that the texts we are studying, modeling, and publishing are so distinct, while our methods and our research goals intersect in so many ways actually opens up some really exciting opportunities for collaboration and discussion—whether that might be consulting other projects’ taxonomies in developing our own or examining where the texts and persons we are studying appear in multiple contexts.  At the WWP, we’ve worked to support such inquiries by linking our own data to catalogues such as ESTC and WorldCat, and we’re now in the planning stages for creating Linked Open Data datasets for the places, persons, and texts in our collections.

I started with a letter from Ann Candler, describing how one word led so naturally to others that she could not contain them. I’d like to close with a different depiction of interconnectedness: Charlotte Smith’s description of a patchwork quilt produced by an “industrious landlady.” In this description, Smith highlights the disparateness of the quilt’s varied components: “here a little bit of chintz, surrounded by pieces of coarse and tawdry cotton”…”in one place a remnant of the fine gown of the Lady of the manor; in the next, a relict of the bed-gown of her house-maid.” Nevertheless, these scraps do create a whole, motley though it may be. And, just as the quilt stitches together the lady of the manor with her housemaid, despite their differences in status, I think that our work has the potential to uncover connections at many levels: between the women that we study, their texts and those that are written about them, how these writers and texts circulated in their own times and the mechanisms that we are using to circulate information about them today.

Thank you!


Intertextual Networks has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.

Word Embedding Models Are the New Topic Models

By Jonathan Fitzgerald, Ph.D. Candidate in English, Northeastern University

I remember the first time I trained a topic model. It was in a course called Humanities Data Analysis, taught by Ben Schmidt. He provided us a corpus of the Federalist Papers and some code that he adapted from David Mimno, contributor to the original MALLET package and author of the R implementation of MALLET. After the initial confusion–“topics” aren’t topics in the traditional sense, after all–it felt like magic. The computer read the corpus and derived some topics!

Enthusiasm for topic modelling remains strong in DH in part because it is a useful way to reduce a corpus down to manageable chunks and to get a sense as to what individual documents are “about.” That said, in the years since my first experiments with the Federalist Papers, I’ve come to believe that topic modelling is perhaps most useful as part of a workflow aimed at some goal beyond identifying topics in a corpus. To that end, I’ve been using topics to seed a genre classifier in my work with the Viral Texts Project.

While I’ve been working with topic models these past few years, a new modelling craze has taken DH by storm…relatively speaking. That is, thanks in large part to this post by Ben Schmidt, DH’ers have enthusiastically adopted word embedding models (WEMs), particularly the most popular manifestation, word2vec, created by Tomas Mikolov and his colleagues at Google.

In a post in which we announced our current project “Word Vector Analysis for TEI/XML: A User-Friendly Toolkit,” my colleague Liz Polcha offers a helpful introduction to word embedding models and the way we plan to use them with data from Women Writers Online. As part of that effort, I’ve been tasked with experimenting with word2vec and our corpus, as well as with imagining a way to provide an interactive online user interface. All this work is in progress, but as a part of those efforts, I’ve been thinking about what we can actually do with word embedding models. Just as my experience with topic models led me to the conclusion that they are most helpful on the way to some other interpretive exercise, my sense is that the same may be true for word embedding models.

This sense is born out of my own experiments with word2vec, but also from the existing literature on word embedding models. That is, you probably won’t read a write-up on word2vec that doesn’t provide the classic analogy example about kings and queens. And, to be sure, it’s cool. That word vectors seem to reflect expected relationships between words is solid evidence for their potential usefulness. But beyond seeing words related to other words and parsing out analogous relationships between terms, what can we do with word embedding models?

I have a few ideas. For one, to stay on the topic of this post (pun!), though word embedding models are unsupervised algorithms, we can work with the models in a way that is akin to a kind of supervised topic modelling. Topic modelling derives “topics” from a corpus starting with a random seed word and grouping words that frequently co-occur within a document; working with the output of a word embedding model, on the other hand, lets us set that seed, so to speak. As Ben Schmidt writes, “You could think of this as a supervised form of topic modeling: it lets you assemble a list of words that typically appear in similar contexts.” A significant difference between topic models and word embedding models is that topics are generated with an awareness of the documents in a corpus. Word embedding models, on the other hand, look across an entire corpus for word collocations within a narrowly defined context, sometimes referred to as a “window.” Or, as Schmidt defines the difference, “A topic model aims to reduce words down [to] some core meaning so you can see what each individual document in a library is really about. Effectively, this is about getting rid of words so we can understand documents more clearly. WEMs do nearly the opposite: they try to ignore information about individual documents so that you can better understand the relationships between words.”

Schmidt’s food example from the Chronicling America corpus illustrates the way we can use the relationship between words to build a kind of supervised topic. He starts with one food word, “oysters,” and finds other food words that appear in similar semantic contexts: “ham,” “bread,” “chicken.” Then, he adds these terms to the original vector and runs the search again, adding even more food words. In a way, he is building a “food” topic. The process is, of course, different from what topic modelling does, but the end result is a list of words that have a meaningful relationship within a corpus of texts—except the topics are created rather than derived.

In unsupervised machine learning, the algorithm does not explicitly “know” what we are searching for; in the above example, it does not know that “oysters”, “ham”, and “bread” are foods. It just knows they co-occur. But when a human researcher intentionally selects and combines these words into a vector that she describes as food words, the unsupervised algorithm is being used in a supervised way.
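The seed-and-expand process described above can be sketched in a few lines. This is only an illustration under loud assumptions: the three-dimensional vectors are invented toy values rather than output from word2vec, the function names are hypothetical, and the 0.9 similarity cutoff is arbitrary. It is not Schmidt's code or any library's API.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "oysters": (0.9, 0.1, 0.0),
    "ham": (0.8, 0.2, 0.1),
    "bread": (0.7, 0.3, 0.0),
    "chicken": (0.85, 0.15, 0.05),
    "senate": (0.0, 0.1, 0.9),
    "treaty": (0.1, 0.0, 0.8),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(words):
    """Average the vectors of the words gathered so far."""
    vecs = [VECTORS[w] for w in words]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def expand(seeds, rounds=2, per_round=2, min_similarity=0.9):
    """Grow a word list by repeatedly adding the nearest neighbours."""
    topic = list(seeds)
    for _ in range(rounds):
        center = centroid(topic)
        scored = sorted(
            ((cosine(VECTORS[w], center), w) for w in VECTORS if w not in topic),
            reverse=True,
        )
        topic.extend(w for score, w in scored[:per_round] if score >= min_similarity)
    return topic


food_topic = expand(["oysters"])
print(food_topic)
```

In this sketch the cutoff stands in for the researcher's judgment. In the workflow described above, it is the human who decides which neighbours belong in the “food” list, which is exactly what makes the process supervised.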

This notion of creating topics through a supervised process of selecting and combining also works to allay an anxiety I have about the opaqueness of unsupervised modelling (particularly for a humanist who is new to computational modelling!). Ted Underwood frames the issue in a recent blog post in which he argues, among other things, that we should “make quantitative methods more explicit about their grounding in interpretive communities.” That is, we shouldn’t let computational models stand in for humanistic interpretation. Underwood writes: “Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as [Stanley] Fish explained, you can’t escape making choices.”

Underwood emphasizes the importance of grounding these choices (or, interpretations) in “interpretive communities.” And, for his part, Underwood notes that one way to do this is to favor supervised modelling methods; he writes:

I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones—but because they explicitly acknowledge that interpretation has to be anchored in human history.

I like this notion of relying on human judgment and generating interpretations “anchored in human history.” That is, after all, what makes us humanists, right? To that end—and again acknowledging that word2vec is an unsupervised algorithm—I’ve been thinking about the ways we can ground our work with unsupervised models in human interpretation.

Particularly, I’m interested in the potential for examining if and how word usage might reflect broad cultural trends over time. In the WWO corpus, which covers the 16th through 19th centuries, it might be interesting to consider how the use of the word “freedom” reflects the culture of the authors who use the word. I approached this question by dividing our corpus up by century and creating “supervised topics” based on the word “freedom.”1

In the 16th and 17th centuries,2 freedom seems to be a more personal affair; words that show up in the same vector space as “freedom” include “happiness,” “friendship,” and “wedlock.”

In the eighteenth century, freedom takes on a particularly revolutionary bent: “patriots,” “rights,” “triumph,” “country’s,” and “tyrants,” for example.

Finally, in the nineteenth century, the revolutionary language remains (“rights,” “tyranny”)—though diminishes some—and a new kind of freedom (or lack thereof) appears in the form of words like “slave,” “slaves,” “enslave,” “degrading,” “minority.”
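Dividing a corpus by century is simple to sketch. The example below assumes each document has a publication year in its metadata; the titles and years are placeholders, not actual WWO records, and the function names are hypothetical.

```python
from collections import defaultdict

# Placeholder (title, year) pairs; not actual WWO metadata.
DOCS = [
    ("Text A", 1592),
    ("Text B", 1664),
    ("Text C", 1694),
    ("Text D", 1706),
    ("Text E", 1811),
]


def century_of(year):
    """Return the ordinal century for a year (1601-1700 is the 17th)."""
    return (year - 1) // 100 + 1


def split_by_century(docs):
    """Group document titles into buckets keyed by century."""
    buckets = defaultdict(list)
    for title, year in docs:
        buckets[century_of(year)].append(title)
    return dict(buckets)


groups = split_by_century(DOCS)
for century in sorted(groups):
    print(century, groups[century])
```

A separate model would then be trained on each bucket, and sparse buckets could be merged before training, as with the combined 16th and 17th centuries above.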

I considered that these shifts might be geographically-contingent; after all, only one work in our corpus from the 17th century was published in the colonies. However, in the 18th century, when we see a lot of what I’ve described as “revolutionary” language, Europe still dominates as place of publication. It’s not until the 19th century that the number of works published in the United States even begins to come close to those published in Europe (34 in the U.S. and 54 in Europe). So it seems that geography does not have a significant impact on the way the word “freedom” is used in our corpus across three centuries, if only because of the collection’s own unevenness in its inclusion of works published outside of the United Kingdom.

Treating these vectors like supervised topics allows us to consider interesting cultural questions through the lens of the WWO corpus. But, typically word embedding models are deployed to tell us something about the words themselves. In this sense we can use a similar process to consider the way a word’s usage shifted over time. My colleague Sarah Connell suggested that “grace” might make for an interesting test case. And indeed, when compared across centuries, “grace” moves from a word typically used in religious contexts in relation to God’s grace (e.g., “mercie,” “wisdome,” “fatherlie,” “redeemer,” “almightie”; note the archaic spellings) to secular contexts in relation to women’s beauty (e.g., “beauty,” “charms,” “fair,” “virtue,” “sweetness,” “smile,” “lovely”). Of course, the meaning of the word hasn’t changed, but its application has. And, as with “freedom,” this can tell us something about shifting cultural contexts—in this case secularization.

In fact, this shift toward secularization is confirmed in our corpus using another method. The WWP’s previous experimentation with markup-based exploration has already shown how growing literary secularization manifests in the collection after the seventeenth century. For one example, of the 4,288 bibliographic citations in the WWP’s sixteenth- and seventeenth-century materials, 3,689 (86%) contain biblical references; by contrast, 256 (15%) of the 1,742 eighteenth-century bibliographic citations are biblical references.3

In both examples, word2vec gives us words that appear in similar contexts to our seed words, but it is up to us to recognize these similarities as significant, intentionally group them into ever more expansive topics, and then to theorize what (if anything) these collocation patterns might mean. Thus, we can work with an unsupervised model in a way that acknowledges, to quote Underwood once more, “we’re no longer just modeling language; we can model interpretive communities at the same time.”

We’re still early on in our explorations of word embedding models and literary corpora, both at the WWP and in DH in general. But that, in part, is what makes this project so interesting to us: it’s an opportunity to explore word embedding models on a corpus that is focused on gender, relatively free of digitization errors, and easily transformed into markup-based subcorpora using XSLT and XQuery.

As we continue to work with this corpus, it’d be helpful to know what potential users might be interested in. If you have any suggestions for possible use-cases, please let us know in the comments.


Intertextuality in Mary Astell’s A Serious Proposal to the Ladies (1694) and in Reflections upon Marriage (1706)

This post is part of a series authored by our collaborators on the Intertextual Networks project. For more information, see here. 

By Ioanna Kyvernitou, National University of Ireland, Galway

For Intertextual Networks, I am evaluating the markup in two works of Mary Astell (1666–1731) as found in Women Writers Online–A Serious Proposal to the Ladies, for the Advancement of Their True and Greatest Interest (1694) and the third edition of Reflections upon Marriage (1706)–in order to consider practices for encoding intertextuality. Astell, a philosopher and theologian who supported women’s right to education, is considered one of the earliest English feminist writers. She is also known for her critiques of the philosophers John Norris and John Locke. Current scholarship analyses her writings within the context of her political (Toryism), philosophical (Cartesianism-Platonism), and religious (Anglicanism) beliefs. Within this framework, this study aims to identify Astell’s intertextual practices by exploring the WWP’s XML markup–specifically the elements designed to encode bibliographic features (e.g., <quote>, <said>, <bibl>). These two works are treated here as case studies in order to discuss the ways in which XML representation can provide a formal framework for representing complex intertextual practices in literary works.

In my preliminary work, I have identified relevant markup in relation to intertextuality (from the WWP’s Internal Encoding Documentation) in order to query and retrieve the occurrences of those tags from the two XML files as provided by the WWP. Specifically, with the help of Sarah Connell and Syd Bauman, I used XQuery–a language for querying XML data–to search for Astell’s references to proper names (i.e. <persName>, <name>, <placeName>, <orgName>). Further, I investigated the personal names’ structural contexts (<p>), aiming to identify the function of onomastic intertextuality (person and place names). Finally, I searched for biblical, classical and bibliographic references (i.e. <quote>, <said>, <bibl>, and <regMe>) in these works.

In the case of indirect references, which go beyond the straightforward markup of direct quotations, it is necessary to consult secondary literature to help us identify the source(s) of reference and the identity of implicitly noted authors. The challenge is that, on many occasions, there are different interpretations among scholars regarding the source of influence or person quoted (as discussed below). Thus, in incorporating multiple interpretations within the markup, the encoding process becomes more complex and expensive—but also more enriched. While the existing markup does not annotate implicit references to an author or indirect quotes, the Intertextual Networks project will be piloting such encoding in an initial set of texts; the project will also be linking quotations to their sources and authors, which will make retrieval and analysis of quoted passages easier.


According to the WWP’s internal documentation:

The <quote> element is used to encode material which is identified as originating outside of the passage where it appears, regardless of where the material actually originates. For our purposes, <quote> can include proverbs, mottoes, common sayings, passages from other texts (including fictional passages from imagined texts), or quotations from other parts of the same text in which the quotation appears.

Following this definition, I searched within the XML files for occurrences of the <quote> element in order to identify its use in Astell’s works. In Proposal, <quote> is used only four times, and in Reflections eighteen. Currently, the WWP uses a pilot encoding in order to implement more detailed markup for cases where quoted material is paraphrased or parodied from its source. For these cases, the @type attribute is used with values of “parody” and “paraphrase”. Some of these conceptual challenges are addressed in the ‘Methods’ section of the proposal for Intertextual Networks: Reading and Citation in Women’s Writing 1450-1850, where it is recommended, similarly to parody and paraphrase, to handle allusions by treating them as special types of quotation and using the TEI @type attribute to characterize quotes as “direct,” “paraphrase,” “allusion” (and other terms as needed).
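As a rough illustration of this kind of count, the sketch below tallies <quote> elements by their @type value. The searches for this study were done in XQuery; this is a Python stand-in written for this example, and the sample passage is invented rather than taken from Astell's files. The TEI namespace is omitted for brevity, and the “untyped” label for quotes without @type is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample markup; the @type values follow those named in the
# Intertextual Networks proposal ("direct", "paraphrase").
SAMPLE = """
<div>
  <p>As the saying goes, <quote type="direct">all is vanity</quote>.</p>
  <p>He held, in effect, <quote type="paraphrase">that custom rules all</quote>.</p>
  <p>Another line: <quote>the Head of the Woman is the Man</quote>.</p>
</div>
"""


def count_quotes_by_type(xml_text, default="untyped"):
    """Count <quote> elements, grouped by the value of @type."""
    root = ET.fromstring(xml_text)
    return Counter(q.get("type", default) for q in root.iter("quote"))


counts = count_quotes_by_type(SAMPLE)
print(sum(counts.values()), dict(counts))
```

Once paraphrases and allusions carry a @type value, a tally like this can separate them from direct quotations, which is the retrieval benefit discussed here.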

Along these lines, updating Astell’s XML files with an expanded and more detailed markup–for example, tagging paraphrases, proverbs and echoes–would be useful, especially for retrieving these instances. This post uses a passage from Reflections to explore how a more in-depth encoding could include information concerning quoted person(s), whether explicitly or implicitly mentioned; paraphrased passages; and ways to connect quote(s) with quoted person(s).

In the passage below (presented first without markup), Astell argues about the role of custom in perpetuating the subordination of women (emphasis added),

That the Cuſtom of the World has put Women, generally ſpeaking, into a State of Subjection, is not deny’d; but the Right can no more be prov’d from the Fact, than the Predominancy of Vice can juſtifie it. A certain great Man has endeavour’d to prove by Reaſons not contemptible, that in the Original State of things the Woman was the Superior, and that her Subjection to the Man is an Effect of the Fall, and the Puniſhment of her Sin. And that Ingenious Theoriſt Mr. Whiſton aſſerts, That before the Fall there was a greater equallity between the two Sexes. However this be, ’tis certainly no Arrogance in a Woman to conclude, that ſhe was made for the Service of God, and that this is her End. Becauſe God made all Things for Himſelf, and a Rational Mind is too noble a Being to be Made for the Sake and Service of any Creature. The Service ſhe at any time becomes oblig’d to pay to a Man, is only a Buſineſs by the Bye. Juſt as it may be any Man’s Buſineſs and Duty to keep Hogs; he was not Made for this, but if he hires himſelf out to ſuch an Employment, he ought conſcientiouſly to perform it. Nor can any thing be concluded to the contrary from St. Paul’s Argument, 1 Cor. II. For he argues only for Decency and Order, according to the preſent Cuſtom and State of things. Taking his Words ſtrictly and literally, they prove too much, in that Praying and Prophecying in the Church are allow’d the Women, provided they do it with their Head Cover’d, as well as the Men; and no inequality can be inferr’d from hence, neither from the Gradation the Apoſtle there uſes, that “the Head of every Man is Chriſt, and that the Head of the Woman is the Man, and the Head of Chriſt is God” (A2r–A2v)

Astell uses three sources to support her argument. She first notes ‘A certain great Man’ who argued about women’s superiority before the Fall; she then paraphrases William Whiston, a Cambridge theologian; and she concludes with a biblical reference (1 Corinthians 11:3) to support women’s equality. In the current markup, only the biblical reference (i.e. <bibl><regMe>1 Cor. II.</regMe></bibl>) and the direct quote are encoded, whereas the two cases of indirect references are not tagged.

‘A certain great Man’ & ‘Mr. Whiſton’

For a more complete encoding, the <quote> element and @type attribute with a value of “paraphrase” could be added to highlight instances of these indirect references, bearing in mind that, as noted in the ‘Methods’ section of the proposal for Intertextual Networks, “the boundaries of paraphrases and allusions are less determinate than those of direct quotations.”

Regarding the authors quoted, in the first case, Astell refers indirectly to ‘A certain great Man’, whereas ‘Mr. Whiſton’ is explicitly named (i.e. <persName ref='p:wwhiston.ycp'>Mr. <hi rend='slant(upright)'>Whiston</hi></persName>). For the latter case, we can also use @role on <persName> to indicate that Mr. Whiston is being referenced as an author; we can use @source on <quote> to point to a bibliography entry, with more detailed information on the source.

 For the “certain great Man,” we could add <rs> with a @type of “author” to mark this as a reference to an author, however indirect; we can also use @ref to point to more information on the identity of this author. In this case, there are different interpretations among scholars regarding the author’s identity. Specifically, Apetrei suggests that it is possible that the “great Man” was Agrippa von Nettesheim, a German polymath, who argued for the superiority of the female sex (131). Springborg, on the other hand, proposes that this could be a reference to the English philosopher Thomas Hobbes (11). Based on these authorship claims, one approach would be to use @ref to point to an <alt> element, whose @targets would themselves point to personographic entries for the two potential authors. Even where there is no agreement on the quoted person, it would be helpful to incorporate current scholarship in the encoding of the primary text to reflect the different interpretations. This can be achieved, for example, by adding a <note> element in the XML file, discussing the different scholarly interpretations and identities of probable sources.
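The pointer chain proposed here, from <rs> through <alt> to the personography entries, can be followed programmatically. The sketch below is one hypothetical way to do so: the markup is a simplified invention for this example (plain `id` attributes and local `#` pointers instead of xml:id and the WWP's own reference scheme, no TEI namespace), and the function name is not from any existing tool.

```python
import xml.etree.ElementTree as ET

# Simplified, invented markup illustrating the proposal: an <rs> whose
# @ref points to an <alt>, whose @targets point to two person entries.
SAMPLE = """
<TEI>
  <listPerson>
    <person id="agrippa"><persName>Agrippa von Nettesheim</persName></person>
    <person id="hobbes"><persName>Thomas Hobbes</persName></person>
  </listPerson>
  <alt id="greatMan" targets="#agrippa #hobbes"/>
  <p><rs type="author" ref="#greatMan">A certain great Man</rs> has endeavour'd to prove</p>
</TEI>
"""


def candidate_authors(xml_text):
    """Map each author-type <rs> to the names its @ref resolves to."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    results = {}
    for rs in root.iter("rs"):
        if rs.get("type") != "author":
            continue
        target = by_id.get(rs.get("ref", "").lstrip("#"))
        if target is None:
            continue
        if target.tag == "alt":
            # An <alt> lists several competing identifications.
            pointers = target.get("targets", "").split()
        else:
            pointers = ["#" + target.get("id")]
        names = [
            by_id[p.lstrip("#")].findtext("persName")
            for p in pointers
            if p.lstrip("#") in by_id
        ]
        results["".join(rs.itertext())] = names
    return results


authors = candidate_authors(SAMPLE)
print(authors)
```

Encoding both candidates this way keeps the scholarly disagreement visible in the data: a query returns Agrippa and Hobbes together rather than silently choosing one.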

Biblical and Bibliographic References: ‘St. Paul’s Argument’

The third case is an example of encoding bibliographic references and citations by using the <bibl> element. Within <bibl>, the tag <author> is used to encode the author’s name, if present, along with a nested <persName>. The <regMe> element is used to encode bibliographic references or citations of the Bible or other texts for which a standard or canonical reference system exists.  The WWP internal documentation suggests that <regMe> should be placed within the <bibl> element that encloses the complete reference. Following these definitions, I have counted eight occurrences where <regMe> is nested within <bibl> in Reflections and found none in Proposal.

A closer look at these occurrences, with the XML markup of this passage from Reflections, shows two distinct usages of personal names (the markup below has been simplified for the purposes of this example):

Often a personal name can be a quoted author, as in the case of Saint Paul in the above example. But there are also occasions where personal names are nested within a <quote>, as in the case of ‘Christ’. This is another case where we can use @source as described above to make authorship and other bibliographic information more explicit and queryable. Lastly, before introducing Saint Paul’s quote, as seen above, Astell refers to him as ‘Apostle’. This is one of many examples of coreference–when two or more expressions in a text refer to the same person. Thus, this is another example of where <rs> with @role of “author” and @ref pointing to a personography entry could make the markup more detailed and useful for future research.

The challenges of formally representing the various types of intertextuality mean that the boundaries of structural and interpretive markup become more fluid. The more detailed the markup becomes, the more in-depth understanding of the primary text and its secondary literature is required. This is a process that can be time-consuming, especially for large-scale projects. Nevertheless, investigation of the use of personal names within their surrounding contexts can enrich the representation of intertextuality. As a next step for this study, I will explore further how linguistic and rhetorical emphasis tags (i.e. <emp>, <term>, <distinct>) can be connected to indirect quotation practices in order to identify other implicit references, currently not present in the markup. I will base this on Astell’s practices in her correspondence with John Norris, Letters Concerning the Love of God (1695), aiming to compare references in her three works, and open the way to reconstructing a more complete picture of her intertextual practices.

Works Cited

Apetrei, Sarah Louise Trethewey. Women, Feminism and Religion in Early Enlightenment England. Cambridge University Press, 2010. Print.

Springborg, Patricia. Mary Astell, Political Writings. 1st ed. New York: Cambridge University Press, 1996. Print.



“‘The Text is Variety’: Contextualizing and Analyzing the Works of Margaret Cavendish with Text Encoding”

Below are lecture notes from Sarah Connell’s presentation at the 2017 International Margaret Cavendish Society Conference. The slides are available as a separate file here.

Okay, so, since one of the themes of this conference is how Cavendish was received, I want to begin with a quote about her from a text in Women Writers Online.

So, here we have Elizabeth Benger on Cavendish, speaking of her fertile fancy, her uncommon genius, her wildness and inaccuracy, and her voluminous works. And, as much as this feels like a textbook example of damning with faint praise, I have to say I find myself sympathizing with Benger when she speaks of Cavendish’s wildness—you see, I didn’t come to this project expecting to work on Cavendish at all. I was trying to do research with Women Writers Online as a collection but I found that Cavendish just kept popping up. Her works started to feel wild precisely because they are so voluminous; they represent a very significant percentage of our corpus, so it’s not really surprising that they were so prominent in all of my searches through the collection. But, I’ve found that with Cavendish, it’s not just about sheer numbers; she also was showing up in my research because her texts have a high number of unusual phenomena. It seemed as if, whenever I found some textual feature that was unique to a particular author, that author would be Cavendish. Well, or Eleanor Davies. But, it was Cavendish a lot of the time. So, clearly, Cavendish called for a research project of her own, which is what I’m going to share with you today. But first, I’ll give you a bit of background.

So, as I said, I was working from the Women Writers Online collection, which has about four hundred texts by women. These are largely print texts, although we do have one manuscript collection with the Almanacks of Mary Moody Emerson. We have a relatively broad chronological framing, 1526 to 1850, and the texts themselves are quite generically diverse. These texts are published in a web interface called Women Writers Online, but they’re encoded in TEI, which is much more detailed and information-rich than we’re able to show on the web.

And here’s what I mean by information rich; in fact, I’ve simplified this and all other examples of our encoding to make it more readable. TEI markup is a very complicated and diverse topic, so I’ll focus on the basics here. We use elements, such as this <head> element, which marks that “Scene 8” is a heading. Here is a <div> element, for a textual division. The TEI is very, very good at labeling things—saying, for example: this is a stage direction, this is a division, this is a paragraph, this is a speaker label—and it’s very good at marking their boundaries; this stage direction starts here and ends here. The TEI is also good at showing hierarchical relationships, the nesting of textual features; so, here we have a <sp> element, used to mark a dramatic speech—and, inside of that, we have a speaker label and a paragraph. There’s no ambiguity that this speaker label and this paragraph belong together, because they’re both in the same <sp>. In addition to elements, the TEI also has attributes, which are kind of like adjectives. They give more information about their elements. For instance, we have three examples of the @type attribute, one on <div>, asserting that the type of division we have is a scene and two on <stage>, describing which types of stage directions we have. This @who attribute points to a cast list elsewhere in the document, where we’ve defined “ign” as referring to Lady Ignorant. That way, every time she speaks, we’ve marked those speeches as belonging to her in a way that’s easily readable by a computer. There’s no ambiguity, even if the speaker label is missing or incorrect. Okay, so, like I said, this is a big topic, but that covers the basics. Elements both name and mark the boundaries of features within a textual hierarchy and attributes provide more information about elements. 
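The basics just described (elements that name and bound features, nesting that ties a speaker label to its speech, and attributes such as @type and @who) can be sketched in a few lines of Python. This is only an illustration: the sample markup is invented and much simpler than real WWP encoding, and the use of `xml:id` on `<role>` and a `#`-prefixed pointer in @who are assumptions rather than a description of the WWP's files.

```python
import xml.etree.ElementTree as ET

# Invented, simplified TEI-like sample; not actual WWP encoding.
SAMPLE = """
<text>
  <castList>
    <castItem><role xml:id="ign">Lady Ignorant</role></castItem>
  </castList>
  <div type="scene">
    <head>Scene 8</head>
    <stage type="entrance">Enter the Lady Ignorant.</stage>
    <sp who="#ign">
      <speaker>Lady Ignorant.</speaker>
      <p>An invented line of dialogue.</p>
    </sp>
  </div>
</text>
"""

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(SAMPLE)

# Map each cast-list identifier to the character it names.
roles = {role.get(XML_ID): role.text for role in root.iter("role")}

# Resolve every speech to its speaker through @who, not the printed label.
speeches = []
for sp in root.iter("sp"):
    speaker_id = sp.get("who", "").lstrip("#")
    speeches.append((roles.get(speaker_id), sp.findtext("p")))

# The @type attribute says what kind of division this is.
div_type = root.find("div").get("type")

print(div_type, speeches)
```

Because the speech is resolved through @who, the attribution would still hold if the `<speaker>` label were missing or misprinted.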
My work has been on how we can use this markup in literary research; I’ve been developing methodologies for asking questions about our collection, taking advantage of the really enormous amount of information that’s available in encoded texts. So, turning to Cavendish now.

Here’s what we have of hers. Depending on how you count things, we have at least nine and as many as twenty-seven works by Cavendish, if you count each play separately. When you’re only talking about 400 texts, that really is quite a high percentage. And, if you use the markup to get into the details of those texts, you can get an even better sense of just how much Cavendish there is.

We have over a million words, more than 15 thousand paragraphs, 13 thousand lines of verse, and 11 thousand dramatic speeches. There are almost 3,500 page breaks, which I had to double-check, because it didn’t seem believable to me. But, that’s correct. In addition to those basic structural elements, we also have markup for quotations and for phrase level features like names of persons and places, as well as the proper names of works, encoded with <title>. So, that’s one way that the markup can give you a sense of what’s in the collection of Cavendish works in WWO. And, here’s another.

As I mentioned, we use the @type attribute to categorize our textual divisions, so you can count those and see how our Cavendish materials fall into the WWP’s categorizations. Essentially, you can use text encoding to get a profile of a particular text or set of texts; there are this many poems, that many scenes, and so on. Even in these basic counts, we’re already starting to see potentially interesting patterns, particularly around paratexts. Cavendish’s works have quite a lot of general prefatory materials, for example, but much less general concluding material. Epilogues and prologues, on the other hand, are nearly evenly balanced. There’s just one advertisement and one table of contents. And so on.
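A profile of this kind amounts to two tallies: one of element names and one of @type values on `<div>`. The Python sketch below shows the idea on an invented sample; the @type values used here are placeholders and may not match the WWP's actual categories.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented sample; the @type values are illustrative placeholders.
SAMPLE = """<text>
  <div type="prefatory"><p>To the reader.</p></div>
  <div type="poem"><lg><l>One line</l><l>Another line</l></lg></div>
  <div type="poem"><lg><l>A third line</l></lg></div>
  <div type="scene"><sp><speaker>A.</speaker><p>Speech.</p></sp></div>
</text>"""


def profile(root):
    """Return (element counts, counts of @type values on <div>)."""
    element_counts = Counter(el.tag for el in root.iter())
    div_types = Counter(
        div.get("type", "(untyped)") for div in root.iter("div")
    )
    return element_counts, div_types


element_counts, div_types = profile(ET.fromstring(SAMPLE))
print(element_counts.most_common())
print(div_types.most_common())
```

Running the same function over each file in a set of texts and summing the counters would give the collection-level totals.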

For basic element counts and types of textual divisions, there’s really just too much Cavendish to compare with anyone else in the collection. But, looking at language usage, we can compare different authors. I’ve given you an example of the markup that makes this kind of query possible; the @xml:lang attribute has values from a controlled vocabulary for describing languages. This attribute can go on any element to indicate its language and, if there’s no more appropriate element, you can use <foreign>, as I’ve shown here. So, for all of these authors, French and Latin dominate across the board, with Italian coming in third. But, the relative percentages are different in Cavendish; she has about twice as much Latin as she has French, which does set her out among this group, but puts her in line with Women Writers Online as a whole. In total, we have about 2,000 instances of Latin, 1,600 of French, and 200 of Italian. Relative percentages of Latin and French are very much a distinction of period. If you look in the seventeenth century, there is about four times as much Latin as there is French; in the eighteenth century, there’s twice as much French as there is Latin. Which, I suppose, doesn’t really surprise anyone who’s worked in those periods, but it is I think reassuring to know that markup-based results can be verified by what we already know. Okay so, getting a bit more complex than simple counts, we can also ask questions about where elements of interest are appearing. In my research, I’ve discovered that it’s useful to look at both general patterns, where particular elements most often appear, and at outliers: where there are unusual cases. So, here’s just one such unusual case:
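The language counts described above come down to tallying @xml:lang values wherever they occur. Here is a minimal Python sketch on invented markup; the decision to skip English as the default language is an assumption made for the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; not actual WWP encoding.
SAMPLE = """<text xml:lang="en">
  <p>She said <foreign xml:lang="fr">adieu</foreign> and then
     <foreign xml:lang="la">vale</foreign>.</p>
  <quote xml:lang="la">Veni, vidi, vici.</quote>
  <p>No foreign words here.</p>
</text>"""


def language_counts(root, skip=("en",)):
    """Count elements carrying @xml:lang, ignoring the default language."""
    counts = Counter()
    for el in root.iter():
        lang = el.get(XML_LANG)
        if lang is not None and lang not in skip:
            counts[lang] += 1
    return counts


counts = language_counts(ET.fromstring(SAMPLE))
print(counts)
```

Since the attribute can sit on any element, the function checks every element, not only `<foreign>`.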

I’ve been doing a fair amount of work on intertextuality, for a current project at the WWP, so I wanted to look at where <title> elements for proper names of works were appearing. For a bit of context, there are more than 5,000 <title> elements in Women Writers Online, and these generally show up in bibliographic citations, notes, advertisements, and, quite often, just in prose paragraphs. By contrast, only about sixty appear in drama as I’ve identified it here, using a fairly conservative definition. As you can see, Cavendish comes in just after Cowley for number of titles named in drama. Now, remember that encoding is really good at making layered textual hierarchies explicit, so once you’ve narrowed to this definition of drama, you can then go look at the elements inside of drama to get more specific about where titles appear. Most of them are in prose rather than verse. About forty of these titles are in the <sp> element, that is they’re named by the characters in the play, about fifteen are in stage directions. In the whole of Women Writers Online, there are just three titles in cast lists; all in the works of, you guessed it, Margaret Cavendish.
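Asking where `<title>` elements appear means looking upward from each title to the nearest enclosing element of interest. The Python sketch below does this on an invented sample, with speeches, stage directions, and cast lists as the contexts. The standard library's ElementTree has no parent pointers, so the sketch builds its own child-to-parent map. This is an illustration of the idea, not the query used for the talk.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented sample; the titles and structure are made up for illustration.
SAMPLE = """<text>
  <div type="drama">
    <castList><head>Names for a farce following <title>An Invented Play</title></head></castList>
    <sp><speaker>A.</speaker><p>Have you read <title>A Made-Up Romance</title>?</p></sp>
    <stage>Enter B, reading <title>A Made-Up Romance</title>.</stage>
  </div>
  <div type="essay"><p>See <title>Some Treatise</title>.</p></div>
</text>"""


def titles_by_context(root, contexts=("sp", "stage", "castList")):
    """Count <title> elements by their nearest ancestor among `contexts`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    counts = Counter()
    for title in root.iter("title"):
        node = parent_of.get(title)
        # Climb until we reach a context element or run out of ancestors.
        while node is not None and node.tag not in contexts:
            node = parent_of.get(node)
        counts[node.tag if node is not None else "(other)"] += 1
    return counts


context_counts = titles_by_context(ET.fromstring(SAMPLE))
print(context_counts)
```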

Here’s one of those. The paragraph above gives a bit of context from elsewhere in the text and the encoding below shows you the markup I found in my search: essentially, Plays Never Before Printed contains a fragmentary play that was meant to be published with the Blazing World; as Cavendish explains, she found her “genius did not tend that way” so she left the project behind, but did “suffer” the piece to be published in the 1688 Plays collection. Then as the heading in the encoded cast list explains, Cavendish also authored characters’ names for a farce that would have followed the play in the Blazing World. But, the first play being unfinished “the farse was not so much as begun.” Nevertheless, Cavendish did include the farce’s cast list in her collection and that’s what you’re seeing here. To my mind, this is a particularly clear example of how unusual instances in the encoding—title elements within cast lists—are effective at pinpointing noteworthy textual phenomena. You might also have noted that this <title> element references one of Cavendish’s own works, which is something else that can be examined with markup. So, here are Cavendish’s most-named titles.

Our current work on intertextuality will make this search much more precise, but for now we’re still relying on a degree of human intervention, and there’s a chance I’ve missed some titles if the spelling variations were significant enough. But, even with that in mind, you can still see some overall patterns. I think the immediately obvious aspect of these results is that the titles Cavendish is naming are, often, Cavendish titles. This isn’t really unusual, though I haven’t seen any other author in WWO reference her own work quite this extensively. In fact, if you look at all of the titles named in all of our seventeenth-century texts, Philosophical and Physical Opinions still comes out in the top three. So, what were other seventeenth-century writers naming? That’s something else that can be queried with the markup.

I ran the same search in the non-Cavendish texts that had publication dates in the seventeenth century and the results were…rather different. First of all, I should note that the search for <title>s is actually underreporting biblical references because the WWP uses a different element in cases where writers cite biblical texts by chapter and verse; these are just references to the titles of entire biblical books. With that in mind, I wanted to look at biblical citations as well and I found that, for the seventeenth century, there are another 1869 chapter-and-verse biblical citations. Two of those are in works by Cavendish. So, I think it’s fair to say that her citation practices are measurably different from other seventeenth-century women writers, in ways you can track with text encoding.

Finally, I’d like to close with an example of some research I’ve really just begun. I’m at the stage now of gathering results and I’m not yet sure precisely what all of this means, but that’s actually something I’d hoped that you all might be able to help with. So, I’ve been looking at a particular element, <mcr>, which is an element that was actually created by the WWP. <mcr> stands for “meaningful change in rendition.” “Rendition” means the appearance of the text, for example, is it italicized, underlined, in all caps. We consider text “renditionally distinct” when its appearance shifts to be different from the text around it, for example words that are italicized when surrounding text isn’t. Often, words will be renditionally distinct if they’re names, or if they’re foreign-language words, or if they’re being emphasized. But sometimes they’ll be renditionally distinct in ways that we can’t attribute to naming or linguistic features and that’s when we use <mcr>, to say: there is a change in rendition here, and we think it’s meaningful, not just decorative, but we’re not able to be more precise about why the rendition has changed.

So, I wanted to examine the words in Cavendish’s texts encoded with <mcr>. Here’s what I’ve found; this is a listing of the most frequent words in <mcr> by unique occurrences, so, for example, the word Atomes also shows up many other times with adjectives like sharp atomes, flat atomes, round atomes, fiery atoms and so on. Here, Cavendish follows a usual pattern for WWO, in which words in <mcr> are generally nouns and usually capitalized. Now, as satisfying as it is to survey entire corpora with a few keystrokes, one thing I’ve learned in my research is that it’s very important to be moving back and forth between collection-wide results and individual texts. And, in fact, one of the things I find really valuable about the methods I’ve been establishing is that they make it possible to move seamlessly between these bird’s-eye views and the texts on the ground, so to speak. <mcr> usage in Cavendish is a really good example of why it is important to keep individual texts in focus, because, in fact, most of the words in this slide are from a single text.

In fact, of those almost 15,000 <mcr> elements in Cavendish, 13,710 are in Poems and Fancies, marking italicization shifts. And when you see this line group, you can start to see how those numbers got so high. It’s worth noting that there is *nothing* in Women Writers Online that comes remotely close to this proliferation of meaningful changes in rendition. The next highest text is Jane Barker’s Poetical Recreations (1688), with about 5,000 <mcr> elements. Judith Murray’s The Gleaner (1798) has about 3,000 and Elizabeth Rowe’s Poems on Several Occasions (1696) has 1800. Only eight texts in the whole collection have more than a thousand <mcr> elements. And, certainly, there are quite a few verses in Poems and Fancies like this one where nearly every noun is italicized.

What I’ve actually discovered, though, is that there are still plenty of nouns that are not distinct; and, in fact, when you look word-by-word, you can see some interesting patterns in where words are or aren’t distinct. I’ve begun looking at individual words from Poems and Fancies, particularly those that are well represented in both the renditionally distinct and the non-distinct columns. So, by contrast, Atomes is almost always renditionally distinct, to the point where I’d wonder whether the two non-distinct instances are actually errors. With terms like “love” and “reason” that have a more even split, there are pretty clear patterns about which are distinct. When love is used as a verb (as in “love to play”) it tends not to be distinct. When it’s a noun (“love and hate”, for example) it’s likelier to be distinct. When reason is a verb, or used in constructions like “the reason why” it tends not to be distinct. Capital-R Reason as in “The Rule of Reason” tends to be distinct. These aren’t hard and fast differences, but they’re recognizable tendencies. You see the same thing with “feare” and with “care”; noun forms, particularly those referring to abstract concepts, tend to be distinct where verb forms aren’t.
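The distinct versus non-distinct comparison can be sketched as two word tallies, one for text inside `<mcr>` and one for everything else. The Python below is only an illustration of the idea: the verse lines are invented, the tokenizer is deliberately crude, and case is folded, so a capitalization pattern like the one noted for “Reason” would need a case-sensitive variant.

```python
import re
from collections import Counter
import xml.etree.ElementTree as ET

# Invented lines of verse; not a transcription from Poems and Fancies.
SAMPLE = """<lg>
  <l>Small <mcr>Atomes</mcr> dance, and I love to play,</l>
  <l>But <mcr>Love</mcr> and <mcr>Hate</mcr> divide the day.</l>
</lg>"""

WORD = re.compile(r"[A-Za-z]+")  # crude tokenizer, for the sketch only


def tally(root):
    """Return (words inside <mcr>, words outside <mcr>), lowercased."""
    distinct, plain = Counter(), Counter()

    def walk(el, inside):
        inside = inside or el.tag == "mcr"
        target = distinct if inside else plain
        target.update(w.lower() for w in WORD.findall(el.text or ""))
        for child in el:
            walk(child, inside)
            # Text after a child's end tag belongs to this element's context.
            target.update(w.lower() for w in WORD.findall(child.tail or ""))

    walk(root, False)
    return distinct, plain


distinct, plain = tally(ET.fromstring(SAMPLE))
for word in ("atomes", "love"):
    print(word, "distinct:", distinct[word], "non-distinct:", plain[word])
```

In the sample, “love” appears once in each column, as a noun inside `<mcr>` and as a verb outside it, which is the kind of split that invites a closer look at individual lines.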

Cases where words are usually distinct, with some exceptions, are also interesting. With Death, the non-distinct cases are all but one lowercase and all but one (a different one) clustered at the end of Poems and Fancies. You see the same sort of thing with “gods”; all but one of the non-distinct instances are lowercased and they’re fairly tightly clustered. I’ve only just started working with this material and I’m still figuring out how to make sense of it all, but I do think there’s something interesting here and, as I said, I’d be grateful for your thoughts.

Finally, I’ve also found that some words, like delight, tend to be non-distinct, so I’ve been looking at the cases where they are distinct to see whether that might have a particular significance. I’ve given you an example of one such usage here, partly because I think it highlights a pattern I want to investigate next—is there a correlation between verses that have very high instances of italicized terms and distinction in words that otherwise tend not to be distinct? This is a fairly large question, but it is one that the encoding makes it possible to answer. So, in the example I’ve included here, not only is delight italicized, but also horses, carts, cows, butter, and milk, among quite a few others. I’ve chosen to end with this verse not just because it does show that high rate of italicization but also because it is an example of the real pleasure I’ve had in making new discoveries in our collections through the research I’ve been doing, since it contains what is very possibly my favorite example of any term inside of <mcr>: I’m speaking, of course, about the unforgettably-named “friendship cheese.” Thank you!


Announcing New Publications to Women Writers Online and Women Writers in Context

The WWP is delighted to report that we have added six new texts to Women Writers Online. These are: Hester Chapone’s 1777 A Letter to a New-Married Lady, Emily Clark’s 1819 The Esquimaux (vol. 3), Anne Conway’s 1692 Principles of the Most Ancient and Modern Philosophy, Caroline Cushing’s 1832 Letters, Descriptive of Public Monuments, Scenery, and Manners in France and Spain (vol. 2), Sarah Osborn and Susanna Anthony’s 1807 Familiar Letters, and Mary Pix’s 1699 The False Friend.

In addition to spanning three centuries, these texts highlight the diversity of genres in Women Writers Online, representing travel writing, drama, philosophy, epistolary writing, religious meditation, and the novel. For more information on these texts, and the WWP’s other recent publications, please see this list of new additions to WWO.

Accompanying the publication of these early texts by women, we have added nine new exhibits to Women Writers in Context, an experimental publication series designed to engage readers in exploration and discovery of topics related to early women’s writing.

Some highlights from the new texts in WWO include:

An advertisement for an elixir that the publisher of Conway’s Principles shares “for the good of the public”:

The Elixir Proprietatis (ſo highly commended by the Renowned Paracelſus and Helmont) it reſiſteth all Putrefaction of the Blood, ſtrengtheneth the Digeſtive Faculty. Its Excellent Virtues are prevalent in the Curing of continual Fevers, Quotidian and Tertian Agues, Small Pox, and Meaſles, or Swine Pox, with other Peſtilential Diſtempers; as alſo the Palſy, Apoplexy, Falling-Sickneſs, Aſthma’s, Tabes, or Conſumption of the Lungs. Its Doſe is from 10 to 20, 30, or 40 drops in a Glaſs of Sack. This Noble Elixir is Philoſophically prepared, by John Spire, Chymico Medicus, at four Shillings the Ounce. Who hath, by his Labour and Study in the Chymical Art, attained unto ſeveral ſecret Arcanums, (not vulgarly known) particularly a Soveraign Remedy for the Gout. If any one is deſirous therefore, or the aforeſaid Elixir Proprietatis, Let them apply themſelves to my Friend, Mr. Dorman Newman, at the King’s Arms in the Poultry, and the Author at his Houſe in Horſly-down-Fair-ſtreet, Southwark; or at his Country Houſe, at the upper end of Twitnam, near the Sign of the White-Hart, in Middleſex.

Some advice on handling a difficult mother-in-law in Chapone’s Letter:

I am told that he is an excellent ſon to a mother, who, with many good qualities, has defects of temper which determined him to decline her continuing to live with him after his marriage. In this he is equally kind and prudent; for though he could himſelf meritoriouſly bear with failings to which he had been accuſtomed from his infancy, in a parent who doats upon him, yet this would have been too hard a taſk upon you, who have not an equal affection to support your duty, and to whom her ways would have been new and unuſual. But though I thus far highly approve his conſideration for you, yet you muſt remember how great a part of her happineſs ſhe is thus deprived of on your account, and make her all the amends in your power by your own attentions, as well as by promoting opportunities of indulging her in the company of her ſon….Be armed againſt the ſallies of her temper, and predetermined never to quarrel with her, whatever ſhe may ſay or do. In ſuch a relationſhip, this conduct would not be meanneſs but merit; nor would it imply any unworthy compliance or falſe aſſent; ſince ſilence and good-humoured ſteadineſs may always preſerve ſincerity in your converſation, and proper freedom in your conduct. If ſhe ſhould deſire to controul your actions, or to intermeddle in the affairs of your family, more than you think is reaſonable, hear her advice with patience, and anſwer with reſpect, but in a manner that may let her ſee you mean to judge of your own duties for yourſelf.

And some very dramatic pauses, indicated by extra whitespace in both the original text and WWO, from dying characters in Pix’s False Friend:

We hope that these new publications will complement each other, inviting readers to explore works by women from multiple angles and perspectives—and we hope our readers have as much fun exploring the texts as we have had preparing them for publication!

The Queen’s Two Corpora: Finding Elizabeth and Creating Corpora using the WWO Database

This post is part of a series authored by our collaborators on the Intertextual Networks project. For more information, see here. 

By Kristen Abbott Bennett, Stonehill College

At Tilbury, Elizabeth I gave a rousing speech to motivate her subjects, exclaiming: “I know I have the bodie, but of a weak and feeble woman, but I have the heart and Stomach of a King, and of a King of England” (Cabala). Elizabeth’s recognition of her female princely bodies as simultaneously separate and the same reflects awareness of her politically constructed dual corpora. Historically, the “King’s two bodies” theory was adapted from ideas surrounding the divine right of kings. During Elizabeth’s reign, it was legislated to preserve her interests in lands acquired by Edward IV in his minority:

For the King has in him two Bodies, viz., a Body natural, and a Body politic…. [The latter] is a Body that cannot be seen or handled, consisting of Policy and Government, and constituted for the Direction of the people, and the Management of the public weal, and this Body is utterly void of Infancy, and old Age, and other natural Defects and Imbecilities, which the Body natural is subject to, and for this Cause, what the King does in his Body politic, cannot be invalidated or frustrated by any Disability in his natural Body. (Kantorowicz 7)

The “King’s Two Bodies” construction offers an apt metaphor for thinking about approaches to corpus-based linguistic analyses. These approaches allow one to consider a single body of work in and of itself, as well as realize its rhetorical relationship to a larger corpus. In the context of sub-corpora created from the Women Writers Online database, the “intertexts” corpus I discuss here analogizes Elizabeth’s “body politic” that both embodies, yet remains distinctive from “the body natural”–here another sub-corpus containing Elizabeth’s speeches.

What follows is a brief account of the methods I have used to create corpora from the WWO database, ranging from basic keyword searches to more complex computationally assisted searches, along with a short discussion about the choices I made along the way. With an eye toward next steps, I close with an overview of how one may convert XML documents into different kinds of file types that lend themselves well to computational and visual analysis.

Finding Elizabeth

Initially, I used keyword searches to find the works that mention Elizabeth; works authored by her are listed, with WWO links, here. I quickly learned that my attempt to search for “Elizabeth I” in a database featuring works produced between 1526 and 1850 was not the best move; Elizabeth II was yet to exist. This initial foray revealed 120 works of the 390 in the WWO corpus (as of spring 2017) that mention an Elizabeth, plus 276 discrete references to women named “Elizabeth.” I persevered, using Ctrl + F and skimming, ultimately locating suitable intertexts (that is, intertextual references to Elizabeth I) dating between the early seventeenth and early nineteenth centuries that discuss her in both historical and fictional contexts.

For example, both Esther Sowernam’s 1617 pamphlet, Esther Hath Hang’d Haman and Bathsua Makin’s 1673 Essay to revive the ancient education of gentlewomen laud the historical Elizabeth’s virtues and learning. Yet in Mary Deverell’s 1792 play, Mary Queen of Scots, the fictionalized Scottish queen suggests that Elizabeth’s learnedness is undesirable and unfeminine: “my sister’s mind is masculine” (O2v).

Although Deverell’s work ultimately presents an even-handed assessment of two Queens surrounded by male advisors and doing the best they can, American writer Judith Sargent Murray’s 1798 fictional account of Elizabeth and Mary’s history portrays the English queen as manipulative, dissembling, and self-serving. I had high hopes for Margaret Cavendish saying something excessive, but she mentions Elizabeth’s reign only to mark time in Nature’s Pictures. This early research generated enough information and questions for me to propose, and commit to creating, a multimedia intertextual exhibit that networks transcontinental representations of Elizabeth by six other WWO authors in the context of common discourses associated with the queen: her dual-gender, her “cult of love,” renowned learning, relationship with Mary, Queen of Scots, and her refusal to marry.

At this point in the process, I was introduced to Ashley Clark’s (Northeastern) brilliant Counting Robot (an XQuery for performing basic counts on WWP files) and saw an opportunity to test human-brain approaches to “finding” related texts in a large database against basic computational methods.

Creating the Corpora

A <persName> search for “Elizabeth” produced 103 files including Elizabeth’s speeches, but it still threw out false positives. Eventually, I adapted Ashley’s code to create multiple search strings using early modern spellings and alternate names (Eliz, Elizabeth, Princess, Bess, etc.) and then checked contexts manually—this method resulted in finding 33 files, including Elizabeth’s works.
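The variant-spelling search can be sketched as a single pattern run against the text of each <persName>, with the surrounding paragraph kept for the manual check. The Python below is a rough illustration and not Ashley Clark's XQuery: the pattern covers only the four forms named above, and the sample sentences are invented.

```python
import re
import xml.etree.ElementTree as ET

# Only the variants named in the post; a real search would need more.
VARIANTS = re.compile(r"\b(Eliz|Elizabeth|Princess|Bess)\b", re.IGNORECASE)

# Invented sample sentences; not drawn from WWO.
SAMPLE = """<text>
  <p>In the reign of <persName>Queen Elizabeth</persName> much was written.</p>
  <p>My cousin <persName>Bess</persName> wrote to <persName>Mary</persName>.</p>
  <p>The <persName>Princess</persName> was then at Hatfield.</p>
</text>"""


def candidate_names(root):
    """Return (name, paragraph context) pairs for manual checking."""
    hits = []
    for p in root.iter("p"):
        # Flatten the paragraph and normalize whitespace for readability.
        context = " ".join("".join(p.itertext()).split())
        for name in p.iter("persName"):
            text = "".join(name.itertext())
            if VARIANTS.search(text):
                hits.append((text, context))
    return hits


hits = candidate_names(ET.fromstring(SAMPLE))
for name, context in hits:
    print(f"{name!r}: {context}")
```

The “cousin Bess” hit is the kind of false positive that makes the manual check of contexts necessary.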

The results were similar when my colleague Mary Erica Zimmer suggested the labor-saving method of searching for cases where @ref on <persName> pointed to the unique identifier established for Elizabeth in the WWP’s personography; this method helped us extract Elizabeth I from her many (likely) namesakes and locate 21 valid intertexts.
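Filtering on @ref replaces guesswork about spellings with an exact match on an identifier. The Python sketch below illustrates this with a made-up identifier, `p:eliz1`; the real value and pointer syntax in the WWP personography are not given here and are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical identifier, standing in for the real personography key.
ELIZABETH_REF = "p:eliz1"

# Invented sample; not actual WWP encoding.
SAMPLE = """<text>
  <p><persName ref="p:eliz1">Queen Elizabeth</persName> and
     <persName ref="p:someone-else">Elizabeth</persName>, her namesake.</p>
  <p>The late <persName ref="p:eliz1">Queen</persName>.</p>
  <p><persName>Elizabeth</persName> with no pointer.</p>
</text>"""


def references_to(root, ref):
    """Return the text of every <persName> whose @ref equals `ref`."""
    return [
        "".join(name.itertext())
        for name in root.iter("persName")
        if name.get("ref") == ref
    ]


matches = references_to(ET.fromstring(SAMPLE), ELIZABETH_REF)
print(matches)
```

The match catches “Queen” even though the name Elizabeth never appears, and skips the namesake. A name with no @ref at all is missed, so the method is only as complete as the encoding.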

During the first pass, it made sense to create one corpus containing Elizabeth’s works, another of her intertexts, and a third including all the files. Although this seems relatively straightforward, the concept of “Elizabeth’s works” is problematic. The WWO database includes her speeches, one translation, and one “true copie of a letter.” Although Elizabeth’s speeches were transcribed and printed by men, they offer a record of the way she presented herself to her subjects. It made sense to limit the “Elizabeth” corpus to her speeches, and excise the “true copy of a letter” and the translation to focus on a single genre. Once “Elizabeth” was defined, the intertexts were easy to manage; the sole criterion for inclusion was at least one clear mention of Elizabeth I. In the context of the “two bodies” metaphor, these corpora situate Elizabeth’s “natural” body in the context of her “body politic.”

Now What?

The first corpora were encoded in XML and lent themselves well to computational inquiry using the Counting Robot, XPath searches in oXygen, and AntConc. For example, these initial forays revealed that elements with @rend (indicating typographic changes) often point to a given work’s proper nouns and linguistic shifts, in addition to elements such as <persName> and <emph> that mark such features more explicitly. For the purposes of this project, I put that query aside for the time being and thought about the possibilities for these specific corpora.

It quickly became apparent that any computational analysis of these works called for creating additional corpora. Any text mining, visualization, or mapping approaches required removing the tags from the texts. Following Sarah Connell’s suggestion of a quick, if relatively low-tech, method for transforming the XML files, we opened the texts in oXygen, switched to “Author” mode, and then copied and pasted each text into a Word document. The last step was to make a plain text corpus.
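For readers who would rather script the tag removal than copy and paste, here is a minimal Python sketch of one alternative. It is not the method described above, and the sample text is invented. It keeps all text content, so on a full TEI file it would also pull in the header unless the body is selected first.

```python
import xml.etree.ElementTree as ET

# Invented sample; not a transcription of any WWO text.
SAMPLE = (
    "<text><head>A Speech</head>\n"
    "<p>We are <emph>resolved</emph> to stand.</p></text>"
)


def plain_text(xml_string):
    """Drop all tags and collapse runs of whitespace to single spaces."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(root.itertext()).split())


stripped = plain_text(SAMPLE)
print(stripped)
```

The result could be written to a `.txt` file for use in tools such as AntConc or Voyant.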

Why so many corpora? The first set, in XML, lends itself well to computational queries about tagged elements. Reformatting the corpora into Word docs made the works more easily searchable, plus these documents lend themselves well to visualization using tools like Voyant. Similarly, conversion into text files permits users to work with visualization and analytical tools such as AntConc and Recogito. Although clearly exceeding the “two corpora” promised by this title, I hope to have offered people who may be new to working with literary databases helpful approaches toward getting up and running.


Anon. Cabala: sive Scrinia Sacra. London, Printed for G. Bedel, and T. Collins, and are to be ſold at their Shop at the Middle-Temple-gate in Fleetſtreet, 1654. Women Writers Online, Accessed 5 May 2017.

Cavendish, Margaret (Lucas), Duchess of Newcastle. Natures Pictures Drawn by Fancies Pencil to the Live, J. Martin and J. Allstrye, 1656. Women Writers Online, Accessed 5 May 2017.

Deverell, Mary. Mary Queen of Scots; an Historical Tragedy, or, Dramatic Poem. Deverell, 1792. Women Writers Online, Accessed 5 May 2017.

Kantorowicz, Ernst A. The King’s Two Bodies: A Study in Mediaeval Political Theology. Princeton UP, 1957.

Makin, Bathsua. An Essay to Revive the Antient Education of Gentlewomen. J.D., 1673. Women Writers Online, Accessed 5 May 2017.

Murray, Judith (Sargent). The Gleaner, I. Thomas and E.T. Andrews, 1798. Women Writers Online, Accessed 5 May 2017.

Sowernam, Esther. Esther Hath Hanged Haman. Nicholas Bourne, 1617. Women Writers Online, Accessed 5 May 2017.



Teaching Tags and Metadata in Women Writers in Review

By Jason M. Payton, Sam Houston State University

Note: Jason M. Payton is a pedagogical development consultant for the WWP.


My course is a junior-level survey of American literature to 1865, and my students are primarily English majors and minors (course syllabus here). Most of my students have never had a class in women’s and gender studies, so I wanted to use the survey course as an opportunity to engage students with some of the critical issues raised in these fields. In addition to wanting my survey classes to engage students on women’s and gender studies issues, I also wanted to shift my teaching toward project-based models that would introduce students to important developments in the field of early American studies. The chance to partner with the Women Writers in Review team provided an opportunity to pursue both of these pedagogical aims.

I framed the WWiR project as an exercise in discovery. During the first week of class, I introduced the critical issues that would be explored at length during the course. These included nationalism and imperialism, colonialism, religion, race, class, and gender. I explained that these were intersectional issues and that we would return to them persistently to learn how the material and discursive practices associated with these concepts took shape at specific moments, and how they changed over time. While I framed this course as one that would reflect on the issue of gender, I purposely did not bookend the WWiR project with any specific theoretical readings, nor did I attempt to fully articulate how the WWiR project would fit into the course as a whole. I wanted students to do that discovery work on their own.

My students were tasked with a multi-phase project constructed in collaboration with Sarah Connell (assignment sequence here). They were given an introduction to the archive by Sarah Connell in the second week of the course and were also given a scavenger hunt exercise designed to familiarize them with the site’s organization and content. On completion of this initial site overview, I broke students into small groups and asked each group to identify one or two thematic tags for review. They were to read all the reviews published in America under their tags and compile two documents in response to their research. The first document was a group report. This report distilled important research findings and offered a series of specific illustrations of general trends; it also made recommendations regarding WWiR metadata, user interface, and user experience. The second document was a personal report on the most significant reviews and insights for each individual group member. The group reports allowed students to begin identifying trends in the archive through a collation of their individual notes on particular reviews under their chosen thematic tags. The individual reports allowed students to engage in more extensive close reading of particular archival documents than the group report permitted, and they also allowed me the opportunity to assess individual students’ levels of engagement with the project.

This two-part assignment structure allowed me to teach students about the WWiR archive itself, while also allowing me to teach them about the research process in early American studies. I believe that students benefit immensely from immersing themselves in an archive, but access to physical archives and well-managed special collections in early American studies is limited for most students at most U.S. colleges and universities. The digital archive can give students this experience of immersion without requiring physical proximity to an archive. I also believe that students benefit immensely from learning to balance the desire to find meaning in individual texts (recorded in students’ personal reports) and to find patterns of meaning across the wider archive (recorded in the group reports). The WWiR archive allows students to perform both kinds of analytical work, and the assignment structure I used ensures that students articulate both types of findings and make conscious decisions about how to weigh individual textual utterances in their assessment of the tenor of the archive as a whole.

Each group chose its own tag(s) for review. The tags they chose included:

  • Class or socioeconomic status
  • Education
  • Gender identities
  • Moral impacts of literature
  • Nation or empire
  • Slavery and abolition
  • Racial identities
  • Religious identities
  • Women as writers and readers

While not exhaustive, my students’ reviews covered over two-thirds of the thematic tags in the archive and covered the full range of sources published in America under those tags.


The individual and group reports showed that students were indeed making the synchronic and diachronic judgments about the discourses on women and women’s writing in early America that I had hoped they would make. The individual reports also showed that students made important connections between this project and the work of the course.

Indya F. writes, “The WWiR project is a concentrated version of what I really believe this course is about. To give students the most unedited version of history that we can take. There is some sanitization that we can’t rid ourselves of. But the WWiR takes these ideas of what misogyny used to be, and without verbalizing it, compares to the misogyny of now and allows us to take a hard look at what prejudice looked like and how far we haven’t come. Students involved in this process really have to bare [sic] witness to it.” As Indya notes, the archive’s choice to present reviews to readers with a minimal interpretive apparatus allows students to encounter historical discourse about women and women’s writing directly. The force of this encounter registers for Indya, who recognizes the ethical dilemma the encounter poses to the reader: With the knowledge provided by WWiR about misogyny in early America, how shall contemporary students of this material respond to misogyny in the present?

Hannah T. writes, “Before this project, I never realized the weight of gender inequality among literature during the eighteenth and nineteenth centuries. Through a careful analysis of 3 reviews, I noticed a running theme of discrimination against women writers…Two of my articles concerned moral impacts, and one concerned religious identities. However, although they were reviews of different authors, pieces, and themes, I still noticed the recurring bias against the literature simply because they [sic] were written by women.” Hannah’s observation focuses particularly on how WWiR helps students understand the nature and scope of the discrimination women writers experienced in our period of study.

These general reflections demonstrated that the inclusion of WWiR in a traditional early American survey course can do important critical work. Students are encouraged to think about specific historical moments and about change over time. They are also encouraged to let the encounter with the archive reshape their understanding of early American history and of their own cultural moment.

Such broad reflections were also accompanied by several specific insights about the reviewers and their subjects that are worth sharing for the way they illustrate the pedagogical potential of using WWiR in undergraduate courses.

Elizabeth C. recognizes “just how much context matters” when considering the reception of women’s writing. Having read reviews of Benger’s Memoirs of the Late Mrs. Elizabeth Hamilton and Hand’s The Death of Amnon, Elizabeth observes that in both cases, the authors’ personal lives are invoked as relevant interpretive contexts. For Benger’s reviewer, the choice of Elizabeth Hamilton is especially worthy because of Hamilton’s ability to “follow traditional domestic gender roles,” as Elizabeth surmises. If Benger’s review is relatively positive in its rhetoric, the review of Hand is less so—it labels her a “poetess” and reminds readers that she is “the wife of a blacksmith.” Elizabeth found in these reviews telling evidence that “When a man produces literature, no one mentions what his wife does for a living. When a woman produces literature, they can only be as successful as their husband.” Elizabeth’s observations highlight two running themes in my classes’ reports. The first relates to Benger specifically, and the second relates to the relation of women’s writing to men’s writing more broadly.

Several students found the reviews of Benger’s works curious. Josh A. was particularly drawn to the footnote at the end of The Christian Observer’s review of Poems on the Abolition of the Slave Trade, which lists “E. Benger” as an author. The note at the end of this review explains the reviewer’s choice to refer to the author as “Mr. Benger” as an act of benevolence: “[I]n such a case of doubt, the ambiguous person ought to be supposed of the nobler sex.” Rebecca R. focused on The Christian Disciple’s review of Benger’s Memoirs of the Late Mrs. Elizabeth Hamilton, which, Rebecca notes, mentions Benger only once in the review, while mentioning Hamilton fifteen times. Donald M. further observed that reviews of Benger’s work are ambivalent in their assessment of the author. The Christian Disciple’s review praises Benger’s choice of Elizabeth Hamilton as a subject but states that it “cannot bestow a similar commendation upon what she has written on the Revelation of St. John.” In that work, Benger “seems to have taken up an hypothesis with relation to this very obscure and difficult book, without sufficient consideration, and apparently without a proper acquaintance with the labors of others upon it.” Whereas The Christian Disciple calls Benger’s fitness for exegetical work into question, The Literary Gazette, or, Journal of Criticism, Science, and the Arts says in its review of her Memoirs of the Life of Anne Boleyn, Queen of Henry VIII, that “The works of Miss Benger have so fully established her in the literary world, among the female ornaments of the times, that the praise which these interesting volumes would otherwise extort, may well be spared.” Donald’s analysis of reviews of Benger led him to conclude that the assessment of a writer’s worth could have much more to do with the reviewer’s sense of what women should or should not be writing about than with the writer’s actual merit.

The repeated return in both group and individual reports to reviews of Benger’s works highlights several critical issues that can be discussed as a whole class and related to broad course themes. First, the purportedly benevolent assumption of “E. Benger’s” masculinity raises questions about historical assumptions about the construction of gender and gender difference that can be pursued in my course from Sor Juana Inés de la Cruz and Anne Bradstreet in the seventeenth century to Judith Sargent Murray and Hannah Webster Foster in the eighteenth century. Second, the elision of Benger’s name in the review of her memoir of Elizabeth Hamilton raises questions about recognition and attribution that can be pursued along similar lines. Third, the ambivalence of reviewers toward Benger depending on the subject matter of a given work raises questions about the notion of separate spheres. These questions intersect with gender and race, and as such, can be pursued in the context of the eighteenth- and nineteenth-century epistolary and sentimental novel, but also in the context of Jefferson’s “reviews” of Phillis Wheatley and Ignatius Sancho in Notes on the State of Virginia. In this way, the WWiR archive can be placed in conversation with more easily recognized and more frequently anthologized texts in the survey course. Such an approach can help students see how archival research can enhance and challenge dominant field narratives, such as those about gender in early America, about public and private spheres, and about print culture.

In addition to helping students see the more canonical texts in early American literature differently, working with the WWiR archive also stimulated students’ curiosity to read across wider swaths of the archive than their original group assignments demanded. Bailey A. reviewed texts under the “education” and “racial identities” tags. In order to have a confident sense of how reviews under these tags were representing women and women’s writing, Bailey read across other tag collections to get a sense of general rhetorical trends, and she observes “women [being] ridiculed for being smart” at various points in the archive. Marlisa E. reviewed texts under the “slavery and abolition” tag. Her work on reviews of Poems on the Abolition of the Slave Trade (Benger, et al.; this review, as noted above, presumes “E. Benger” to be a man) led her to read extra reviews, including reviews of Hannah Adams’s The Truth and Excellence of the Christian Religion Exhibited and Maria Edgeworth’s Works of Maria Edgeworth. Her analysis highlights the difference in length between the review of a work presumed to be written by a man and the reviews of works by women. The review of Poems is fifteen pages in its original format, while the other two ran just nine pages in total in their original format. It also highlights key differences in the language used to talk about women’s writing. The review of Poems speaks with praise about the presumptively male “Christian poet,” while the reviews of women’s writing struck Marlisa as “extremely misogynistic.”

Bailey’s and Marlisa’s reports show that the structure of the archive can encourage students to explore more of the collection than they had originally planned to do. They also suggest some ways that the curiosity sparked by the archival encounter can be used to drive important thematic conversations in the broad survey course. Marlisa, for example, wanted to know after reading all of the American reviews under the “slavery and abolition” tag what the role of women’s writing in the abolitionist cause was. Because I introduced this project very early in the course, Marlisa’s question came up before we had begun our unit on the Atlantic slave trade and American slavery. I answered this query via a whole-class discussion about the poems of Phillis Wheatley and Sarah Wentworth Morton, as well as the novels of Harriet Beecher Stowe and Harriet Jacobs. Similarly, Briana P. raised questions about what it means for a reviewer to describe Maria Edgeworth as the “Franklin of novelists.” These questions can be answered in part by a reading of anthologized selections from Poor Richard’s Almanac and The Autobiography, as well as by reading texts such as Susanna Rowson’s Charlotte Temple and Hannah Webster Foster’s The Coquette.


Using the WWiR archive in my survey course has inspired several ideas about undergraduate pedagogy and research that I wish to share with those who are considering using the archive in their classes.

My survey courses have traditionally been anthology-based, largely out of consideration for the financial constraints faced by many of my students. While a single course text that can be used in multiple sections has the benefit of being more affordable than a litany of individual texts, the drawbacks of the anthology-based course are many. One set of problems involves questions about selection and representation in the production of the anthology. Another set of problems involves questions about breadth of coverage and depth of understanding in anthology-based survey courses. These problems are particularly vexing for early Americanists, given the capaciousness of the field. My undergraduate pedagogy is moving away from the anthology-based course and away from breadth-of-coverage models in an effort to include voices normally excluded from popular anthologies and to privilege a depth of understanding of a few key issues in early American studies over a comprehensive coverage of major authors and movements. I am also moving toward project-based courses that engage students with cutting-edge developments in literary studies and related fields. The WWiR archive is an ideal teaching tool for teachers who are similarly inclined. The project I devised gets students out of the anthology and into the archive, it helps students rethink the narrative of literary history implicit in major anthologies, and it engages students with critical issues in the digital humanities.

Ideally, undergraduate survey courses that use the WWiR archive will leave space in the course to explore critical issues raised during the research process. Students in my course raised a host of excellent questions on the basis of their work with WWiR that we simply did not have time to answer. A period of one to two weeks at the end of the term would have allowed my class the opportunity to do additional research on the question about women and abolitionism raised by Marlisa E., for example. My class would have benefitted from additional research on the writings and speeches of Maria Stewart, Sojourner Truth, Angelina and Sarah Grimke, and Lydia Maria Child. Reserving time for additional research allows students to shape the direction of the course. It also allows instructors to teach the research process by showing how research questions often develop in ways that demand that we consult new archives and develop new competencies. Additional research on women and abolitionism would lead students from the anthology to the WWiR archive to other print and digital archives; it would also encourage students to practice intersectional thinking as they untangle imbricated historical discourses about gender and race.

An additional research period at the end of the course can also include a specific focus on critical issues in the digital humanities raised by students’ work in WWiR. Sarah Connell’s introductory presentation to my class included a demonstration of the XML markup behind the front-end WWiR display. This demonstration raised students’ awareness of the many interpretive decisions that have to be made to render a document digitally in WWiR. It sparked a course-long conversation about information architecture that could have been explored more fully through additional research on eighteenth- and nineteenth-century print culture or on the relationship between book history and the digital humanities.

Queries such as these provide excellent material for undergraduate research. The incorporation of WWiR into the undergraduate curriculum is thus an excellent way to foster students’ professional development via presentations at undergraduate research conferences. It is also an excellent tool for generating undergraduate thesis and capstone projects.