I want to begin by thanking our co-panelists for their really thoughtful and exciting presentations, as well as my co-author Julia Flanders, the conference organizers, and, of course, all of you for joining us today. I also have to thank the NEH for their support of this project as well as the rest of the WWP team—Lara Rose, Param Ajmera, Ashley Clark, and Syd Bauman—without whom none of this would be possible.
I’ll begin my talk with a quote from a letter written by Ann Candler to two women who had helped sponsor the publication of Candler’s work. This letter is included within the “short narrative” of Candler’s life by the editor of her Poetical Attempts, and, in fact, it comprises much of that narrative in terms of length and biographic detail. In this letter, Candler writes that these ladies may be “surprized” at her “prolixity” which exceeded her own intentions, but, she says: “one circumstance was so connected with another, and one word naturally introduced others” so that her prolixity was unavoidable. Candler also notes that she wanted to be as “explicit as possible” for the satisfaction of her audience. I’d like to take three things from this quotation. First, as a letter initially written by an author to her patrons for distribution among those patrons’ friends, and then published paratextually with that author’s collection of poems, this is a good example of the kinds of layered intertextuality we’ve been exploring at the WWP. Second, Candler’s discussion of how one word leads naturally to others is very neatly descriptive of what we’ve found in our research thus far, where we’ve been trying to map the webs of interconnected texts that appear in our Women Writers Online corpus. And, finally, in this presentation I hope to take inspiration from Candler and even do her one better, dodging the pitfall of surprising prolixity while nonetheless being as clear and explicit as possible, hopefully to the satisfaction of you friends who have joined us here today. To that end, I’ll start with a bit of background.
The Intertextual Networks project builds off of the Women Writers Online collection, which has about four hundred texts by women. These are largely print texts, although we do have one manuscript collection. We have a relatively broad chronological framing, 1526 to 1850, and the texts themselves are quite generically diverse. These texts are published in a web interface called Women Writers Online, and they’re encoded in TEI.
While the Women Writers Project has been around since the late eighties and Women Writers Online was first released in 1999, the Intertextual Networks project is a much more recent endeavor; this is a three-year, NEH-funded Collaborative Research grant project that started in the fall of 2016. The project is aimed at fostering research into the rhetorics of intertextuality used by early women writers. We’re interested in both straightforward cases of direct quotation and citation as well as subtler forms of intertextual engagement, such as verbal echoes, stylistic similarities, imitation, and parody. Our work is focused on exploring both early women’s intertextual practices and the representation of those practices through encoding—that is, we’re expecting the project’s literary and historical research to help us study what our markup can do, just as we’re using the markup to help answer literary and historical questions.
For this project, we have three major research activities and outcomes. First, we’ll be linking all of the textual references in Women Writers Online to a bibliography we’re creating. I’ll be talking more about this work soon but I’ve given a quick example of what we’re doing at the bottom of the slide—in this case, we’re using the @source attribute on the <quote> element to point to an entry we’ve created in a bibliography file. We’re also investigating how we might mark up intertextual phenomena that are less straightforward than direct reference or quotation and I’ve chosen this particular example because it illustrates one such phenomenon: parody—and, in fact, is a particularly clear instance, since Charlotte Smith is kind enough to tell us that she’s parodying a “sublime sentence” from Edmund Burke’s Reflections on the Revolution in France. Most of our authors, I’m afraid, are not quite this kind. But that’s what makes this interesting!
Which leads to our second project outcome: we’re publishing our research into encoding complex intertextual features on our blog and we’ll be sharing a final report at the end of the project. And finally, we have assembled a team of collaborators who are completing individual research projects, to be published on our open-access Women Writers in Context platform, with more incremental reports on our blog. These collaborators are doing some really promising work—for example, we have a project using network and braid analysis to look for similar dialogue patterns in the works of Margaret Cavendish and Molière, one collaborator who’s been tracing references to Elizabeth I throughout the whole WWO collection, and another project using markup to study marginal biblical references as a counternarrative to the “main” poetic text in Lucy Hutchinson’s Order and Disorder. I’m not going to focus on these today, but I really can’t stress enough how exciting our collaborators’ projects are—and I’ve given some links where you can learn more about everything I just mentioned.
Today, I’d like to focus on that first strand of our work, the encoding enhancements, and share some lessons that might be useful to other projects engaging in large-scale markup expansions. I’d also like to discuss some of our initial discoveries about early women’s intertextual networks—and the representation of such networks in a collaborative online environment.
On a corpus-wide level, what we’re doing is extracting all of the contents of the <title> and <quote> elements in our collection and creating bibliography entries for each text. To give you a sense of the scope of this work: we have around 5,500 separate <title> elements in Women Writers Online and around 11,500 distinct <quote> elements. Thankfully, some of those <quote> and <title> elements are pointing to the same texts, so we’re not looking to create a bibliography with 17,000 entries, but it’s still a fair amount of work!
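To make that extraction step concrete, here is a minimal Python sketch of pulling the contents of every <title> and <quote> out of an encoded file. This is only an illustration under stated assumptions, not the project's actual tooling: the function name extract_references and the tiny SAMPLE document are invented for the example, and the markup is heavily simplified.

```python
import xml.etree.ElementTree as ET

def extract_references(xml_text):
    """Return (element name, normalized text) pairs for each <title> and <quote>.

    Matching is done on the local element name, so the sketch does not
    depend on which XML namespace the files happen to use.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        local_name = elem.tag.split("}")[-1]  # drop any "{namespace}" prefix
        if local_name in ("title", "quote"):
            # Join all descendant text and collapse runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            found.append((local_name, text))
    return found

# Hypothetical, simplified sample (not a real WWO file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<epigraph><quote><l>The course of true love</l> <l>ever did run smooth.</l></quote>
<bibl><author>Shakesp.</author> <title>Midsummer Night's Dream</title></bibl></epigraph>
</body></text></TEI>"""

for name, text in extract_references(SAMPLE):
    print(name, "|", text)
```

Running something like this across a whole corpus would give the raw list of title and quotation strings that the bibliography entries are then built from.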
The “before” example here shows you some of the markup we’re relying on to pull out the titles and quotations in our collection so we can create entries for them; you can see that we’ve already marked that Midsummer Night’s Dream is a title and that this group of poetic lines is a quote. The sharp-eyed among you may have noted that this is actually a misquotation, and one that changes the meaning of the quoted text to the much more optimistic claim that “the course of true love ever did run smooth.” Given that this is corrected in the second edition, it’s likely a typesetting error, but we’re still investigating whether this might be deliberate. As an epigraph with an associated bibliographic reference, this is one of the more detailed examples in WWO, so we can even see that “Shakesp.” is not just an abbreviated person’s name but also the name of an author nested in the same <bibl> element as our title. In the “after” version on the right, we’ve created a bibliography entry for Midsummer Night’s Dream and, in our encoded file, we’ve used the @ref and @source attributes on <title> and <quote> to point to that entry. We’re also using @ref to point to our personography entry for Shakespeare. In this example, and most of the ones I’ll show, I’ve simplified the encoding to focus on the features I’m discussing here.
Okay, so how are we getting from before to after? Well….
To begin with, we’ve taken an aggressive lowest-hanging-fruit stance. In fact, if I can abuse the metaphor a bit, we actually started by scooping up all the apples that had already fallen to the orchard floor: that is, those <title> elements that were inside of <bibl> elements and, thus, likelier to have such helpful information as who their authors were. We wanted to start with the references that would be easiest to track down, so that we could fill in our bibliography file and then apply everything we learned from that process to our work on the increasingly obscure references: that is, <title>s that are not inside of <bibl>, <quote> elements with associated title or author information, and then unattributed quotations.
Other key aspects of our approach have been: combining human and programmatic interventions, using the lightest possible tools for each task, and continually testing our encoding decisions against our actual corpus. For example, here’s the spreadsheet we used to fill in the majority of the <title>s in the collection; we decided a shared spreadsheet would be the best way to make input and versioning very lightweight for the bulk of our title references: that is, all of the ones inside of <bibl> as well as those with multiple attestations. We used XQuery to pull out lightly regularized versions of all the contents of those <title> elements, aggregating, for example, variations where one text might use a long s and another might not. The columns with the titles and authors that we extracted from WWO are locked in the spreadsheet so that encoders can’t accidentally edit them, since we’ll be using that information to automatically add the unique identifiers back into WWO when we’re done. Our encoders are then filling in display titles and full titles, along with standardized versions of authors’ names, publication locations, and publication dates. This is all operating at a high level of abstraction; we usually don’t know which edition or printing is at stake, so we’re instead filling in the earliest-known information. We took advantage of the capabilities the spreadsheet format offered by, for example, using simple color coding to assign texts to our encoders, mark texts that we needed to discuss as a group, and so on.
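The light regularization described here can be sketched in a few lines of Python. The talk's version was written in XQuery, so this is a stand-in rather than the real routine, and the specific rules (long s replacement, whitespace collapsing, trailing punctuation, lowercasing) and the sample strings are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def regularize(title):
    """Lightly regularize a title so that spelling variants group together.

    The rules here are assumed for the example, not the project's own.
    """
    text = title.replace("\u017f", "s")        # long s -> round s
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    text = text.rstrip(".,;:")                 # drop trailing punctuation
    return text.lower()

def group_variants(titles):
    """Map each regularized form to the original strings that produced it."""
    groups = defaultdict(list)
    for title in titles:
        groups[regularize(title)].append(title)
    return dict(groups)

# Made-up variants of a single title, for demonstration only.
variants = ["Paradi\u017fe Lo\u017ft", "Paradise Lost.", "Paradise  Lost"]
print(group_variants(variants))
```

Keeping the original strings alongside the regularized key matters for a workflow like this one, since the untouched originals are what allow identifiers to be matched back to the source files later.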
Our most pressing concerns are ensuring that our unique identifiers are unique and making sure our data is consistent, so we also set up a separate sheet to enter preferred formats for each publication location and then locked the input column in this sheet so encoders can only input publication places established in our “canonical” list. When there are duplicates (for example, you can see we have references to Benjamin Church’s History of Philips War under two title variations), we give both the same unique identifier and mark repeats with “[dup]” in the “Display title” column. In the parts of the spreadsheet I couldn’t fit on this slide, there are the publication location and date columns, as well as columns for notes and source documents, and for the XPath that located each title, which lets encoders find usages in our WWO corpus whenever more context is needed.
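For projects that want the same two guarantees outside of a spreadsheet's built-in validation, the checks can be sketched in Python. This is a hedged illustration only: the column names (id, display_title, place), the sample rows, and the check_rows function are all hypothetical, and the only detail taken from our practice is that repeats carry “[dup]” in the display title.

```python
def check_rows(rows, canonical_places):
    """Return a list of problems found in exported spreadsheet rows.

    Two checks: every place must come from the canonical list, and an
    identifier may be shared only by rows explicitly marked "[dup]".
    """
    problems = []
    first_row_for_id = {}
    for number, row in enumerate(rows, start=1):
        if row["place"] not in canonical_places:
            problems.append(f"row {number}: place {row['place']!r} is not canonical")
        if "[dup]" in row["display_title"]:
            continue  # marked repeats are allowed to reuse an identifier
        if row["id"] in first_row_for_id:
            earlier = first_row_for_id[row["id"]]
            problems.append(f"row {number}: id {row['id']!r} already used in row {earlier}")
        else:
            first_row_for_id[row["id"]] = number
    return problems

# Invented rows for demonstration; not real bibliography data.
ROWS = [
    {"id": "t001", "display_title": "Title A", "place": "London"},
    {"id": "t001", "display_title": "Title A [dup]", "place": "London"},
    {"id": "t001", "display_title": "Title B", "place": "Londn"},
]

for problem in check_rows(ROWS, canonical_places={"London", "Boston"}):
    print(problem)
```

In this sample the second row passes because it is a marked repeat, while the third row is flagged twice: once for a place outside the canonical list and once for reusing an identifier without the marker.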
So, this spreadsheet worked really well for the most straightforward cases. But as anyone who’s worked with bibliographic data knows, you never have as many straightforward cases as you’d hope. We’d known from the outset the spreadsheet would be inadequate for some kinds of texts, but, I have to confess, we didn’t think we’d hit those in the very first week.
But, of course we did. So we also set up an XML input file for the cases where our spreadsheet was inadequate. And we found that these balanced nicely; the spreadsheet made it possible to have two people working on fifteen hundred records without any version-control issues and the XML was there for the smaller number of cases where we needed more structure. Such as:
Periodicals, analytic-level references to works published within a larger monograph, references to particular editions or printings, and so on. For one example, here is a quotation from the preface to Finch’s Sonnets and Other Poems, citing some advice from Tales of the Castle that a “woman ought never to suffer a man to add a single word to her writings” lest he “pass for the original inventor” and we have a fairly full bibliographic citation with a title, volume number, and page number. Tales of the Castle is a work in translation, so to create a record for this text, we needed to turn to the XML bibliography, which is designed to handle such complexities.
As you can see here, we are using the different levels of bibliographic abstraction represented in the Functional Requirements for Bibliographic Records (FRBR), and mapping them onto specific TEI elements for recording bibliographic information. Although it’s more detailed than the example I showed in the “before” and “after” slides, this is still not the final version of how our XML bibliography will be set up. Instead, it’s been sufficient for our encoders to record the information we knew we needed to have, such as adding Thomas Holcroft’s role as the translator of Tales of the Castle.
After we’d reviewed all of the contents of the <title> elements in <bibl>, which amounted to about 1,100 texts named in 2,100 individual <title> elements, we had enough data to feel we’d sufficiently tested both our work processes and our handling of bibliographic complexities. Before we turned our attention to the rest of the collection, we held the first of several review phases, in which we checked for duplicates and inconsistencies. We’ve found that building in periodic review processes has been a really important aspect of our work. Unless your input processes are extremely restrictive, you probably will get some duplicates, so it’s a matter of finding a balance where you actually get some work done, check for duplicates, work some more, check for duplicates, and so on.
We’ve adopted a multipronged approach to deduplication: we use alphabetization as a first-pass method, beginning with our input phase in which our alphabetized extracted titles made it clear that, for example, “Collections of the Massachusetts Historical Society” and “Collections of the Historical Society” should be checked as potential duplicates. Of course, variant titles aren’t always alphabetically proximate, so if one author refers to the “New and General Biographical Dictionary” as the “General Biographical Dictionary” and another calls it just the “Biographical Dictionary,” the fact that these are duplicates is less immediately clear. But, since our mechanisms for creating display titles are well-documented and our encoders have been very consistent, we can review the display titles, as the report here shows, to see if there are any apparent exact duplicates as with “A Geographical and Historical View of the World” or likely cases of duplication as with the “Adventures of Telemachus.” In resolving potential duplicates, we’ve been able to use authorship and other publication details to determine which actually are duplicated, which has been particularly necessary for the great many works named in WWO that are titled “Poems.” In addition to alphabetization, we’ve also found that ngram counting is a helpful mechanism for locating potential duplicates and we now have a pool of tests we can draw on to check our data.
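One way to picture the ngram approach is the Python sketch below, which compares titles by the word pairs they share. To be clear about what is assumed: the talk does not specify how our ngram tests are built, so the use of word bigrams, the Jaccard overlap score, and the 0.25 threshold are all choices made for this example, as are the function names.

```python
from itertools import combinations

def word_ngrams(title, n=2):
    """Return the set of lowercase word n-grams in a title."""
    words = title.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def likely_duplicates(titles, n=2, threshold=0.25):
    """Return (title, title, score) for pairs whose n-gram overlap is high.

    The score is Jaccard overlap: shared n-grams divided by all n-grams.
    """
    pairs = []
    for a, b in combinations(titles, 2):
        grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
        union = grams_a | grams_b
        if not union:
            continue  # both titles are too short to compare at this n
        score = len(grams_a & grams_b) / len(union)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

titles = [
    "New and General Biographical Dictionary",
    "General Biographical Dictionary",
    "Biographical Dictionary",
    "Poems",
]
for a, b, score in likely_duplicates(titles):
    print(f"{score:.2f}  {a}  <->  {b}")
```

A check like this surfaces the three Biographical Dictionary variants as candidates even though they would not sit together in an alphabetized list. It says nothing useful about one-word titles such as “Poems,” which is exactly where authorship and publication details have to do the work.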
We followed an iterative process of data gathering and review to collect information on all of the titles in our collection that were in <bibl>s or named more than once in ways that could be identified programmatically. We then set up a canonical bibliography, which our encoders now check before they create any new textual records, to make sure we haven’t already encountered a text, since we’ve now encountered so many.
Before we took on the remaining 1,800 “singleton” titles, that is, the ones outside of <bibl> that are named only once in ways that our routines could identify, we made another adjustment to our processes and switched our primary input mechanism from the spreadsheet to…
The inspectre, which Ashley Clark created as a web-based platform for inspecting and editing XML documents using XPath and XQuery, a platform that enables us to combine programmatic and human interventions. If you’re curious about the inspectre, Ashley and I have a paper on it and I’ve provided the link on this slide. For today, I’ll just say that this interface was ideal for handling the singletons because it allows our encoders to easily see the context for each title, without having to open an XML file. We took all of the process hacks we’d developed in our first phase and applied them here. So, you can see that this particular entry for “The Arraignment of Lewd, Idle, Froward, and Inconstant Women” was assigned to Lara and you can see the XPath Lara would use if she needed to access the original text. In the box, the encoder can see each title in context, with the ability to switch to the XML as needed. We’re now using webform mechanisms like checkboxes to track the information our first phase taught us we would need, including the fact that this is a duplicate, since we already had an entry for the more commonly used version of the title “The Arraignment of Women.” The rest of the inspectre page, not shown here, gives the encoder spaces to fill in all of the publication details and to add some more flags, such as “I have a question about this” and “I’m done with this.” Those flags really come in handy for reviewing the remaining titles as a whole to locate the entries that have been completed, those that need group discussion, and so on.
Currently, we’re wrapping up the last few hundred singleton titles and completing another review of our data, after which we’ll be able to automatically add the unique identifiers for each title back into the WWO corpus and finalize our XML bibliography with all of our entries, regardless of their input mechanism. Then, we’ll turn our attention to the quotes and we’re very optimistic that the processes we’ve established so far will continue to ensure that we’re working efficiently while taking care to produce data we can rely on. We’ve also begun some exploratory investigation of the data that we’ve created thus far, which means that I can now share one example of the research into early women’s intertextualities our encoding enhancements have enabled.
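The step of adding identifiers back into the corpus might look something like the following Python sketch. It is an assumption-laden simplification: it matches each <title> by its normalized text, whereas our own process relies on the extracted title information and XPaths recorded during input, and the identifier “b:example-id,” the lookup table, and the add_refs function are invented for the example.

```python
import xml.etree.ElementTree as ET

def add_refs(xml_text, ids_by_title):
    """Set @ref on each <title> whose normalized text has a known identifier.

    Titles that already carry @ref are left alone, so the function can be
    re-run safely after later rounds of data entry.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag.split("}")[-1] != "title":
            continue
        key = " ".join("".join(elem.itertext()).split()).lower()
        if key in ids_by_title and "ref" not in elem.attrib:
            elem.set("ref", ids_by_title[key])
    return ET.tostring(root, encoding="unicode")

# Hypothetical fragment and lookup table, for illustration only.
FRAGMENT = "<bibl><author>Shakesp.</author> <title>Midsummer Night's Dream</title></bibl>"
LOOKUP = {"midsummer night's dream": "b:example-id"}

print(add_refs(FRAGMENT, LOOKUP))
```

Matching on text alone would be risky for a corpus with many works titled “Poems,” which is one reason a location-based match such as a stored XPath is the safer design for real data.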
I’ve been looking at citation patterns for a subset of the texts referenced by our authors, narrowing the scope here from all the texts named in WWO to just the histories, two hundred titles out of several thousand.
I took inspiration from Margaret Cavendish, who asserts that there are many sorts of histories, the primary of which are world histories, national histories, and biographies—and, using a slightly broader taxonomy, I categorized the historical works named in Women Writers Online, based on the regularized and deduplicated list of titles that we created.
These numbers are based on the individual titles named, rather than the number of times each title might be referenced—I wanted to examine the historical lexicon of our authors, as it were. From here, I can look at actual numbers of citations, as well as where these different sorts of histories are being named, using our metadata to track references by publication location, genre, or time period. I can also look at the markup around these references to see whether these histories are appearing in notes, in running prose, in advertisements, and so on. I’m planning to add more layers to this general taxonomy of historical texts and I’ve found it very helpful to consult the types of biographies and collections already shared by the CBW project.
Even given the preliminary nature of these results, I’m finding some threads that merit further investigation, particularly in the higher numbers of biographies referenced, which I think may be reflecting a larger historiographic trend. Biographies were a historical genre that became accessible to women writers earlier than the prestige genres such as the universal or national history, so the fact that our authors are also citing more biographies is quite interesting.
As we continue with our encoding enhancements, we’re very excited about the potential that WWO will have to answer research questions much like this one.
I want to close with some thoughts on how our three projects connect with each other, and with others working on women’s writing. While the scopes of our respective projects are quite distinct, there are some shared threads that I think are worth noting: we are concerned with balancing the large scale and the finely-grained in both the taxonomies and the metadata we are creating. We call attention to the labor-intensiveness of the work that we do and we frame that work as being in conversation with, and—at least ideally—interoperable with other projects’. We have many of the same research questions as we investigate what women were reading, as well as how women were being read and written about—and as we explore digital methods for representing women’s textual networks.
The fact that the texts we are studying, modeling, and publishing are so distinct, while our methods and our research goals intersect in so many ways actually opens up some really exciting opportunities for collaboration and discussion—whether that might be consulting other projects’ taxonomies in developing our own or examining where the texts and persons we are studying appear in multiple contexts. At the WWP, we’ve worked to support such inquiries by linking our own data to catalogues such as ESTC and WorldCat, and we’re now in the planning stages for creating Linked Open Data datasets for the places, persons, and texts in our collections.
I started with a letter from Ann Candler, describing how one word led so naturally to others that she could not contain them. I’d like to close with a different depiction of interconnectedness: Charlotte Smith’s description of a patchwork quilt produced by an “industrious landlady.” In this description, Smith highlights the disparateness of the quilt’s varied components: “here a little bit of chintz, surrounded by pieces of coarse and tawdry cotton” … “in one place a remnant of the fine gown of the Lady of the manor; in the next, a relict of the bed-gown of her house-maid.” Nevertheless, these scraps do create a whole, motley though it may be. And, just as the quilt stitches together the lady of the manor with her housemaid, despite their differences in status, I think that our work has the potential to uncover connections at many levels: between the women that we study, their texts and those that are written about them, how these writers and texts circulated in their own times and the mechanisms that we are using to circulate information about them today.
Intertextual Networks has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project, do not necessarily represent those of the National Endowment for the Humanities.