To the Right Honourable, Virtuous, Heroical Reader

This post was authored by Anna Kroon, University of New Haven class of 2019, who held an internship at the WWP during the summer of 2017. 

I came to the Women Writers Project really excited to work on such a large project with a wide variety of texts in their files. My experience was limited to Victorian shipboard newspapers, so anything not related to the ocean or intellectual boat humor was thrilling to me.

Since I had experience with XML and the TEI, but not with WWP encoding, I wanted to learn how to encode a short text that was vastly different from what I had transcribed and encoded before. To begin, I worked on Poems on Various Subjects by Elizabeth Sarah Gooch. This is a pessimistic set of poems that mostly deal with the author’s loss and sadness.

What interested me most about this text was at the very end and not even written by Gooch at all. The last poem in the collection was written to Gooch by a Mr. Anthony Pasquin, Esq. Being so new to the WWP encoding guidelines I had to ask “is there anything special I do with a poem not written by the author?” The answer was not what I had expected from such a detail-oriented project: no.

To the Author, with Love

“Non-authorial paratexts” (NAPTs) became my specific interest (and the bane of my existence). My dad would joke “para what? Pair-a-socks?” as I stumbled through explaining the encoding jargon and an oversimplified definition of my research.

Essentially these NAPTs are texts (poems, letters, and other short texts) that are published with a larger text and are written by a person who is not the author of that text. As a general rule NAPTs are written to or about the author of the main text praising her abilities, virtuousness, or life. I had many hypotheses about the purpose, authors, and significance of these texts, but I had to pare them down to match the scale of a summer internship.

I started off looking to see if Pasquin’s poem was not in fact a singular phenomenon. I took to the digital stacks of already published and in-progress texts in Women Writers Online to see what I could find. Using XPath, I tried and erred my way through many possibilities of where and what these non-authorial paratexts could be. As of writing this post, I have found over 120 unique non-authorial paratexts from 30 different larger texts.

The Process of Finding NAPTs with XPath

Full disclosure: I had never used XPath before this summer. That, combined with my beginner’s knowledge of the WWP encoding, made my XPath searching very slow to start. I used some logic and knowledge of general publication formatting to create my first few queries.

Texts are very broadly split into three main sections: front matter, main body, and back matter. Based on my initial assessment, I determined that the author generally writes everything in the main body without much contribution by others. So it wouldn’t be very fruitful to look there. Front matter comes first and often holds dedications and other prefatory materials. Even though the case that sent me to this research was in the back matter I thought it prudent to start with what was in front of me.

Within the front matter, a <div> (textual division) can have several values for the @type attribute, including “advert,” “contents,” “ded” (dedication), “frontispiece,” “prefatory,” and “prologue.” The two values for @type that seemed the most useful were “ded” and “prefatory.” For my first search, I used “prefatory” because it is the broader category that would give me more results so that I could go on to refine my search.

I didn’t want to cast my net too wide on my first search, so I also decided to specify that I was looking for poetry. The XPath I used was: //front//div[@type="prefatory"]//div[@type="poem"]
essentially saying, “Look in the front matter, then look in <div>s with a @type value of ‘prefatory,’ and then look for <div>s with a @type value of ‘poem.’” This search gave me 12 poems, 4 of which were NAPTs. One text I found in this fashion was Katherine Philips’ Poems (1664), which had 7 poems, written mostly by men, praising Philips.
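
For readers who want to try a query like this outside an XML editor, here is a minimal sketch using Python's standard library. The sample document is invented for illustration, it leaves out the TEI namespace that real WWP files would carry, and `xml.etree.ElementTree` supports only a subset of XPath, so the path is written relative to the root element with a leading dot.

```python
import xml.etree.ElementTree as ET

# Invented, namespace-free sample; real WWP files are far richer than this.
SAMPLE = """
<TEI>
  <text>
    <front>
      <div type="prefatory">
        <div type="poem"><head>To the Author</head><lg type="para"><l>...</l></lg></div>
        <div type="poem"><head>Upon her Verses</head><lg><l>...</l></lg></div>
      </div>
      <div type="ded"><head>To the Queen</head><p>...</p></div>
    </front>
    <body>
      <div type="poem"><head>A Song</head><lg><l>...</l></lg></div>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of //front//div[@type="prefatory"]//div[@type="poem"]
poems = root.findall(".//front//div[@type='prefatory']//div[@type='poem']")

# Collect the headings so each hit can be checked by eye.
heads = [poem.findtext("head") for poem in poems]
print(heads)
```

The poem in the main body and the dedication are both skipped, because neither sits inside a prefatory division in the front matter.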

While I went through my first set of results I noticed some of the line groups had a @type of “para.” In the hazy fog of researching NAPTs I assumed that this was an irregularly used value for paratexts. I added this on to my original search creating: //front//div[@type="prefatory"]//div[@type="poem"]//lg[@type="para"]
I was dismayed to find that I was looking at my previous search results except broken down by line group rather than any sort of new material. Still determined that “para” stood for paratext I tried deleting the type of ‘prefatory’ from my search. This gave me more fruitful results. The Poetical Works of the late Mrs. Mary Robinson had an astounding 19 paratexts written by various important men, including the Duke of Leeds.

This success made me sweat a little—maybe what I was researching had already been done and didn’t need me endlessly XPath searching. A quick look through the WWP encoding guidelines assuaged my fears stating that the value “para” actually stood for verse paragraphs. After this I went back to the model of my first query. Since “ded” was the other front matter @type value that seemed worthwhile, I simply exchanged “poem” for “ded” and searched: //front//div[@type="prefatory"]//div[@type="ded"]. I had not yet realized that these were two values that were essentially describing the same level of textual divisions and that dedications would not necessarily be nested within more general prefatory textual divisions. My results helped me to see the issue with my search: there was only one resulting dedication, written by the author to a princess.

From these missteps I decided to go with a simple query: //front//div[@type="ded"]. This action was driven by frustration, but also observation. Most of the non-authorial paratexts I found were poems or letters written to or about the author. In other words, dedicated to the author. There were 143 results, many of which were actually authorial paratexts (that is, paratexts written by the author herself). Even so, from this single search I found 77 unique non-authorial paratexts. However, many of these were not the highlighted passages found by XPath. I scrolled up and down around the search result to see if there were potentially other NAPTs in that specific document. Of the 77 I found with the [@type="ded"] search, 52 were found through scrolling—or, 67% were not a direct result of the XPath query.

As the simple “ded” search was very successful, I thought another simple search would be a good course of action. Searching //front//div[@type="prefatory"] yielded 432 results. While I knew that there would be many non-authorial paratexts hidden under a lot of other odd prefatory material, this was such a large grouping that it would take a single person with a time constraint far too long to parse through. To trim this list down, I looked at prefatory material that contained <signed> elements inside of <closer>s.

When scrolling through texts I quickly learned that a closing byline or signature was more common than an opening one beneath the heading. There were some rare cases where the authorial attribution was baked into the title, but as a general rule NAPTs in the WWP textbase have signatures as authorial attributions, so looking for dedications that contained <signed> elements gave me another way to refine my results. I performed several other searches, including reviews of the back matter, until I had a pool of nearly 130 unique non-authorial paratexts, which seemed like a comfortable amount to make a case for tagging these texts and a significant enough sample size to analyze.
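
The signature filter can be sketched in Python as well. This is an assumption-laden illustration, not the query I actually ran: the sample is invented and namespace-free, and the filtering is done in Python because the standard library's XPath subset cannot express a nested-path predicate. In a full XPath engine, a predicate along the lines of //front//div[@type="prefatory"][.//closer/signed] should express the same idea.

```python
import xml.etree.ElementTree as ET

# Invented sample: one unsigned preface and one signed commendatory piece.
SAMPLE = """
<TEI>
  <text>
    <front>
      <div type="prefatory">
        <head>To the Reader</head>
        <p>...</p>
      </div>
      <div type="prefatory">
        <head>To the Author</head>
        <lg><l>...</l></lg>
        <closer><signed>J. K.</signed></closer>
      </div>
    </front>
  </text>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# Step 1: every prefatory division in the front matter.
prefatory = root.findall(".//front//div[@type='prefatory']")

# Step 2: keep only divisions that contain a <signed> inside a <closer>.
signed_divs = [d for d in prefatory if d.find(".//closer/signed") is not None]

# The signature text still has to be compared by hand against the author's name.
signatures = [d.findtext(".//closer/signed") for d in signed_divs]
print(signatures)
```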

Features of a Non-Authorial Paratext

My lengthy process of trial and error was at times frustrating, but it did make me acutely aware of what was and what wasn’t a non-authorial paratext. The first thing I noticed was the signature or byline. When XPath pulled up a potential file, I scrolled to the very end and looked for a line that began with <signed>. If there was a signature, I would check to make sure that the author attributed was not the author herself. At times, this meant looking to the publication information or the personographic data to make sure the initials were not hers. If there were any doubts about authorship (especially for texts written by various or unknown authors) I would not catalog it.

If there was no signature line I would check the <head> or <head> @type “sub” for a byline. This was the less common occurrence, which is why I would look at it second even though it comes before the ending signature. I observed this phenomenon happening most often when the author of the non-authorial paratext was someone of importance such as a duke or a lord.

Another feature I touched upon in my searching was the titles of the paratexts. I focused on the subset of non-authorial paratexts that were written to the author praising her abilities. This is shown in the titles with prepositions like “to,” “on,” or “upon.” The most common first word of the title was, unsurprisingly, “to” with 71 instances or 57%. I looked at the other first words and considered a word significant if there were more than 5 occurrences. I decided to put “Impromptu” and “Sonnet” together because they are both declarations of the poem’s type and neither was significant on its own. The significant identifying words broke down thusly:

“To”: 57%
Less than 5 occurrences, various: 22%
“On/Upon”: 11%
Untitled: 6%
“Impromptu/Sonnet”: 4%
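
A tally like the one above could be produced with a few lines of Python. This is only a sketch: the titles below are hypothetical stand-ins rather than the cataloged data, and the grouping table is an assumption about how the categories might be defined.

```python
from collections import Counter

# Hypothetical titles for illustration only; not the cataloged NAPT titles.
titles = [
    "To the Author",
    "To Mrs. Robinson",
    "Upon her Poems",
    "On the Same",
    "Sonnet",
    "",  # an untitled paratext
]

# Assumed grouping of first words into the categories used in the breakdown.
GROUPS = {
    "to": "To",
    "on": "On/Upon",
    "upon": "On/Upon",
    "impromptu": "Impromptu/Sonnet",
    "sonnet": "Impromptu/Sonnet",
}

def first_word_group(title):
    """Return the category for a title based on its first word."""
    words = title.split()
    if not words:
        return "Untitled"
    first = words[0].lower().strip(",.:;")
    return GROUPS.get(first, "Other")

counts = Counter(first_word_group(t) for t in titles)
total = sum(counts.values())

# Whole-number percentages, as in the breakdown above.
shares = {group: round(100 * n / total) for group, n in counts.items()}
print(shares)
```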

The keyword contents of the title and the author signature were the two most important features for identifying an NAPT. The only other useful tool was context. There were one or two occurrences where the paratext was not written to the author, but the content of the poem or letter made it clear that it wasn’t written by the author herself.

Marking Up Non-Authorial Paratexts

Since I clearly proved the existence and abundance of NAPTs in the WWP textbase, the next step was to discuss markup to make the non-authorial content distinct from the authorial content. I discussed this topic with Sarah Connell and Ashley Clark as well as some encoders working on the project. We came to the conclusion that this was not going to be decided soon, but created several potential methods of encoding.

The first was simple and logical, but would be more difficult to implement. This meant creating an attribute that would be used on a <div> with a set of values that indicated the paratext author’s assumed gender. For example, an @author attribute with values of “male,” “female,” “collective,” “unknown,” and “nonbinary.” Therefore a poem written by the author’s brother would be marked as:
<div author="male"><head>To my beautiful sister by <persName>Joseph Kroon</persName></head> [insert poem here]</div>.

The second method followed the logic of the first, but was more practical. Rather than create a new attribute we would use @resp which indicates “the nature of a person’s intellectual responsibility, or an organization’s role in the production or distribution of a work” according to the Text Encoding Initiative’s P5 Guidelines. Instead of values that indicated the NAPT author’s gender we would create general personography entries for @resp to point to, expanding on the existing entry for “unknown author” to include unknown male, female, collective, and nonbinary authors. For example, that same poem by the author’s brother would now be encoded as:
<div resp="p:umale.agv"><head>To my beautiful sister by <persName> Joseph Kroon </persName></head> [insert poem here] </div>.

This method would also provide more specificity for paratexts where the NAPT author already has an entry in the personography:
<div resp="p:jkroon.doe"><head>To my beautiful sister by <persName> Joseph Kroon </persName></head> [insert poem here] </div>.

As of this writing, no decision has been made on the method of markup. Thinking theoretically, if we had decided upon the markup, my next step would be to implement it in a special test folder on the non-authorial paratexts I cataloged. From there we could see how the new markup interacts with the existing markup to ensure that it doesn’t cause any issues. We could also use this marked-up data to perform simpler searches for NAPTs. We would even have the ability to easily create reports with bibliographic data for analysis.
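
To give a sense of how much simpler those searches could become, here is a hypothetical sketch in Python. It assumes the second method were adopted and, importantly, that @resp appeared on a <div> only to flag a non-authorial paratext; neither assumption reflects a decision the WWP has made. The sample document is invented and omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample using the proposed (undecided) @resp markup.
SAMPLE = """
<text>
  <front>
    <div type="ded"><head>To the Queen</head><p>...</p></div>
  </front>
  <body>
    <div type="poem"><head>A Song</head><lg><l>...</l></lg></div>
  </body>
  <back>
    <div type="poem" resp="p:jkroon.doe">
      <head>To my beautiful sister by <persName>Joseph Kroon</persName></head>
      <lg><l>...</l></lg>
    </div>
    <div type="poem" resp="p:umale.agv">
      <head>To the Author</head>
      <lg><l>...</l></lg>
    </div>
  </back>
</text>
"""

root = ET.fromstring(SAMPLE)

# One query, regardless of front or back matter: any <div> carrying @resp.
napts = root.findall(".//div[@resp]")

# Map each personography pointer to the heading of its paratext.
by_resp = {
    div.get("resp"): "".join(div.find("head").itertext()).strip()
    for div in napts
}
print(by_resp)
```

A report built this way would no longer depend on scrolling around each search result, since the attribute itself identifies the paratext wherever it sits in the volume.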

Analysis of the Non-Authorial Paratexts

After cataloging and marking up the paratexts, my final goal was to research a few hypotheses about the authorship of NAPTs. I brainstormed many different hypotheses that I wanted to look into, but settled on three.
1. The gender composition of paratext authorship is mostly male.
2. Most paratexts are written by people of distinction.
3. Texts with a significant number (more than 5) of NAPTs are published posthumously.

The authorship genders broke down as: 5% female, 36% male, and 59% gender unknown, thus showing my hypothesis was plausible. The considerable number of gender unknown authors comes from the fact that authors would sign their paratext with only their initials or a gender non-specific abbreviation. I did not have the resources or the time to search for mystery authors with only initials and the main author’s bibliographic data.

I defined distinction very broadly as anyone who had a title of any sort. This could be anyone from a king to a judge considered “honourable.” Of the 126 cataloged non-authorial paratexts, 24 (or 19%) were written by someone whom I defined as distinguished. The distinguished titles broke down as follows:
Duke: 4%
Marquis: 4%
Earl: 8%
Lord: 4%
Reverend/Doctor/Reverend Doctor: 13%
Sir: 4%
Honourable: 4%
Esquire: 55%
All this to say that my hypothesis was proven false. However, any of the unknown authors who signed with initials could be a titled person of distinction.

My final hypothesis was going to be researched with the use of the WWP’s counting XQuery and the experimental paratext markup. Since we were not able to come to a decision on this topic I was not able to mark up nor have an automatic routine parse through lots of data. Curating author death dates and text publication dates for nearly 100 different texts would have been far too time consuming. However, my text with the most non-authorial paratexts was The Poetical Works of the Late Mrs. Mary Robinson with a stunning 19 non-authorial paratexts. From the title and the contents of the individual non-authorial paratexts it is apparent that this volume was published posthumously and Mrs. Robinson was a well-known and well-loved author.

This project is still ongoing. Once a decision is made on how to tag the non-authorial paratexts and they have been tagged, there are many other topics to research. We could easily identify how many NAPTs there are per text or the genre that has the most NAPTs. With more connections to the WWP’s bibliographic information we could see if any of the NAPT authors are other female authors in the database. We could also do a full analysis of publication versus death date to see if texts published posthumously do in fact have more non-authorial paratexts than texts published during the author’s lifetime.

“The Text is Variety”: Contextualizing and Analyzing the Works of Margaret Cavendish with Text Encoding

Below are lecture notes from Sarah Connell’s presentation at the 2017 International Margaret Cavendish Society Conference. The slides are available as a separate file here.

Okay, so, since one of the themes of this conference is how Cavendish was received, I want to begin with a quote about her from a text in Women Writers Online.

So, here we have Elizabeth Benger on Cavendish, speaking of her fertile fancy, her uncommon genius, her wildness and inaccuracy, and her voluminous works. And, as much as this feels like a textbook example of damning with faint praise, I have to say I find myself sympathizing with Benger when she speaks of Cavendish’s wildness—you see, I didn’t come to this project expecting to work on Cavendish at all. I was trying to do research with Women Writers Online as a collection but I found that Cavendish just kept popping up. Her works started to feel wild precisely because they are so voluminous; they represent a very significant percentage of our corpus, so it’s not really surprising that they were so prominent in all of my searches through the collection. But, I’ve found that with Cavendish, it’s not just about sheer numbers; she also was showing up in my research because her texts have a high number of unusual phenomena. It seemed as if, whenever I found some textual feature that was unique to a particular author, that author would be Cavendish. Well, or Eleanor Davies. But, it was Cavendish a lot of the time. So, clearly, Cavendish called for a research project of her own, which is what I’m going to share with you today. But first, I’ll give you a bit of background.

So, as I said, I was working from the Women Writers Online collection, which has about four hundred texts by women. These are largely print texts, although we do have one manuscript collection with the Almanacks of Mary Moody Emerson. We have a relatively broad chronological framing, 1526 to 1850, and the texts themselves are quite generically diverse. These texts are published in a web interface called Women Writers Online, but they’re encoded in TEI, which is much more detailed and information-rich than we’re able to show on the web.

And here’s what I mean by information rich; in fact, I’ve simplified this and all other examples of our encoding to make it more readable. TEI markup is a very complicated and diverse topic, so I’ll focus on the basics here. We use elements, such as this <head> element, which marks that “Scene 8” is a heading. Here is a <div> element, for a textual division. The TEI is very, very good at labeling things—saying, for example: this is a stage direction, this is a division, this is a paragraph, this is a speaker label—and it’s very good at marking their boundaries; this stage direction starts here and ends here. The TEI is also good at showing hierarchical relationships, the nesting of textual features; so, here we have a <sp> element, used to mark a dramatic speech—and, inside of that, we have a speaker label and a paragraph. There’s no ambiguity that this speaker label and this paragraph belong together, because they’re both in the same <sp>. In addition to elements, the TEI also has attributes, which are kind of like adjectives. They give more information about their elements. For instance, we have three examples of the @type attribute, one on <div>, asserting that the type of division we have is a scene and two on <stage>, describing which types of stage directions we have. This @who attribute points to a cast list elsewhere in the document, where we’ve defined “ign” as referring to Lady Ignorant. That way, every time she speaks, we’ve marked those speeches as belonging to her in a way that’s easily readable by a computer. There’s no ambiguity, even if the speaker label is missing or incorrect. Okay, so, like I said, this is a big topic, but that covers the basics. Elements both name and mark the boundaries of features within a textual hierarchy and attributes provide more information about elements. 
My work has been on how we can use this markup in literary research; I’ve been developing methodologies for asking questions about our collection, taking advantage of the really enormous amount of information that’s available in encoded texts. So, turning to Cavendish now.

Here’s what we have of hers. Depending on how you count things, we have at least nine and as many as twenty-seven works by Cavendish, if you count each play separately. When you’re only talking about 400 texts, that really is quite a high percentage. And, if you use the markup to get into the details of those texts, you can get an even better sense of just how much Cavendish there is.

We have over a million words, more than 15 thousand paragraphs, 13 thousand lines of verse, and 11 thousand dramatic speeches. There are almost 3,500 page breaks, which I had to double-check, because it didn’t seem believable to me. But, that’s correct. In addition to those basic structural elements, we also have markup for quotations and for phrase level features like names of persons and places, as well as the proper names of works, encoded with <title>. So, that’s one way that the markup can give you a sense of what’s in the collection of Cavendish works in WWO. And, here’s another.

As I mentioned, we use the @type attribute to categorize our textual divisions, so you can count those and see how our Cavendish materials fall into the WWP’s categorizations. Essentially, you can use text encoding to get a profile of a particular text or set of texts; there are this many poems, that many scenes, and so on. Even in these basic counts, we’re already starting to see potentially interesting patterns, particularly around paratexts. Cavendish’s works have quite a lot of general prefatory materials, for example, but much less general concluding material. Epilogues and prologues, on the other hand, are nearly evenly balanced. There’s just one advertisement and one table of contents. And so on.
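
As a rough illustration of this kind of profile, the sketch below counts <div> elements by their @type value using Python's standard library. The sample text is invented and heavily simplified, it omits the TEI namespace, and it is not the routine used to produce the counts in this talk.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented, simplified sample; the @type values are illustrative.
SAMPLE = """<text>
  <front>
    <div type="prefatory"><p>...</p></div>
    <div type="prefatory"><p>...</p></div>
    <div type="prologue"><lg><l>...</l></lg></div>
  </front>
  <body>
    <div type="scene"><p>...</p></div>
    <div type="scene"><p>...</p></div>
  </body>
  <back>
    <div type="epilogue"><lg><l>...</l></lg></div>
  </back>
</text>"""

root = ET.fromstring(SAMPLE)

# Tally every textual division by its @type; untyped divisions get a placeholder.
type_counts = Counter(div.get("type", "(none)") for div in root.iter("div"))

for div_type, n in type_counts.most_common():
    print(div_type, n)
```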

For basic element counts and types of textual divisions, there’s really just too much Cavendish to compare with anyone else in the collection. But, looking at language usage, we can compare different authors. I’ve given you an example of the markup that makes this kind of query possible; the @xml:lang attribute has values from a controlled vocabulary for describing languages. This attribute can go on any element to indicate its language and, if there’s no more appropriate element, you can use <foreign>, as I’ve shown here. So, for all of these authors, French and Latin dominate across the board, with Italian coming in third. But, the relative percentages are different in Cavendish; she has about twice as much Latin as she has French, which does set her out among this group, but puts her in line with Women Writers Online as a whole. In total, we have about 2,000 instances of Latin, 1,600 of French, and 200 of Italian. Relative percentages of Latin and French are very much a distinction of period. If you look in the seventeenth century, there is about four times as much Latin as there is French; in the eighteenth century, there’s twice as much French as there is Latin. Which, I suppose, doesn’t really surprise anyone who’s worked in those periods, but it is I think reassuring to know that markup-based results can be verified by what we already know. Okay so, getting a bit more complex than simple counts, we can also ask questions about where elements of interest are appearing. In my research, I’ve discovered that it’s useful to look at both general patterns, where particular elements most often appear, and at outliers: where there are unusual cases. So, here’s just one such unusual case:

I’ve been doing a fair amount of work on intertextuality, for a current project at the WWP, so I wanted to look at where <title> elements for proper names of works were appearing. For a bit of context, there are more than 5,000 <title> elements in Women Writers Online, and these generally show up in bibliographic citations, notes, advertisements, and, quite often, just in prose paragraphs. By contrast, only about sixty appear in drama as I’ve identified it here, using a fairly conservative definition. As you can see, Cavendish comes in just after Cowley for number of titles named in drama. Now, remember that encoding is really good at making layered textual hierarchies explicit, so once you’ve narrowed to this definition of drama, you can then go look at the elements inside of drama to get more specific about where titles appear. Most of them are in prose rather than verse. About forty of these titles are in the <sp> element, that is they’re named by the characters in the play, about fifteen are in stage directions. In the whole of Women Writers Online, there are just three titles in cast lists; all in the works of, you guessed it, Margaret Cavendish.

Here’s one of those. The paragraph above gives a bit of context from elsewhere in the text and the encoding below shows you the markup I found in my search: essentially, Plays Never Before Printed contains a fragmentary play that was meant to be published with the Blazing World; as Cavendish explains, she found her “genius did not tend that way” so she left the project behind, but did “suffer” the piece to be published in the 1688 Plays collection. Then as the heading in the encoded cast list explains, Cavendish also authored characters’ names for a farce that would have followed the play in the Blazing World. But, the first play being unfinished “the farse was not so much as begun.” Nevertheless, Cavendish did include the farce’s cast list in her collection and that’s what you’re seeing here. To my mind, this is a particularly clear example of how unusual instances in the encoding—title elements within cast lists—are effective at pinpointing noteworthy textual phenomena. You might also have noted that this <title> element references one of Cavendish’s own works, which is something else that can be examined with markup. So, here are Cavendish’s most-named titles.

Our current work on intertextuality will make this search much more precise, but for now we’re still relying on a degree of human intervention, and there’s a chance I’ve missed some titles if the spelling variations were significant enough. But, even with that in mind, you can still see some overall patterns. I think the immediately obvious aspect of these results is that the titles Cavendish is naming are, often, Cavendish titles. This isn’t really unusual, though I haven’t seen any other author in WWO reference her own work quite this extensively. In fact, if you look at all of the titles named in all of our seventeenth-century texts, Philosophical and Physical Opinions still comes out in the top three. So, what were other seventeenth-century writers naming? That’s something else that can be queried with the markup.

I ran the same search in the non-Cavendish texts that had publication dates in the seventeenth century and the results were…rather different. First of all, I should note that the search for <title>s is actually underreporting biblical references because the WWP uses a different element in cases where writers cite biblical texts by chapter and verse; these are just references to the titles of entire biblical books. With that in mind, I wanted to look at biblical citations as well and I found that, for the seventeenth century, there are another 1869 chapter-and-verse biblical citations. Two of those are in works by Cavendish. So, I think it’s fair to say that her citation practices are measurably different from other seventeenth-century women writers, in ways you can track with text encoding.

Finally, I’d like to close with an example of some research I’ve really just begun. I’m at the stage now of gathering results and I’m not yet sure precisely what all of this means, but that’s actually something I’d hoped that you all might be able to help with. So, I’ve been looking at a particular element, <mcr>, which is an element that was actually created by the WWP. <mcr> stands for “meaningful change in rendition.” “Rendition” means the appearance of the text, for example, is it italicized, underlined, in all caps. We consider text “renditionally distinct” when its appearance shifts to be different from the text around it, for example words that are italicized when surrounding text isn’t. Often, words will be renditionally distinct if they’re names, or if they’re foreign-language words, or if they’re being emphasized. But sometimes they’ll be renditionally distinct in ways that we can’t attribute to naming or linguistic features and that’s when we use <mcr>, to say: there is a change in rendition here, and we think it’s meaningful, not just decorative, but we’re not able to be more precise about why the rendition has changed.

So, I wanted to examine the words in Cavendish’s texts encoded with <mcr>. Here’s what I’ve found; this is a listing of the most frequent words in <mcr> by unique occurrences, so, for example, the word Atomes also shows up many other times with adjectives like sharp atomes, flat atomes, round atomes, fiery atoms and so on. Here, Cavendish follows a usual pattern for WWO, in which words in <mcr> are generally nouns and usually capitalized. Now, as satisfying as it is to survey entire corpora with a few keystrokes, one thing I’ve learned in my research is that it’s very important to be moving back and forth between collection-wide results and individual texts. And, in fact, one of the things I find really valuable about the methods I’ve been establishing is that they make it possible to move seamlessly between these birds-eye views and the texts on the ground, so to speak. <mcr> usage in Cavendish is a really good example of why it is important to keep individual texts in focus, because, in fact, most of the words in this slide are from a single text.

In fact, of those almost 15,000 <mcr> elements in Cavendish, 13,710 are in Poems and Fancies, marking italicization shifts. And when you see this line group, you can start to see how those numbers got so high. It’s worth noting that there is *nothing* in Women Writers Online that comes remotely close to this proliferation of meaningful changes in rendition. The next highest text is Jane Barker’s Poetical Recreations (1688), with about 5,000 <mcr> elements. Judith Murray’s The Gleaner (1798) has about 3,000 and Elizabeth Rowe’s Poems on Several Occasions (1696) has 1800. Only eight texts in the whole collection have more than a thousand <mcr> elements. And, certainly, there are quite a few verses in Poems and Fancies like this one where nearly every noun is italicized.

What I’ve actually discovered, though, is that there are still plenty of nouns that are not distinct; and, in fact, when you look word-by-word, you can see some interesting patterns in where words are or aren’t distinct. I’ve begun looking at individual words from Poems and Fancies, particularly those that are well represented in both the renditionally distinct and the non-distinct columns. So, by contrast, Atomes is almost always renditionally distinct, to the point where I’d wonder whether the two non-distinct instances are actually errors. With terms like “love” and “reason” that have a more even split, there are pretty clear patterns about which are distinct. When love is used as a verb (as in “love to play”) it tends not to be distinct. When it’s a noun (“love and hate”, for example) it’s likelier to be distinct. When reason is a verb, or used in constructions like “the reason why” it tends not to be distinct. Capital-R Reason as in “The Rule of Reason” tends to be distinct. These aren’t hard and fast differences, but they’re recognizable tendencies. You see the same thing with “feare” and with “care”; noun forms, particularly those referring to abstract concepts, tend to be distinct where verb forms aren’t.
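
To make the word-by-word comparison concrete, here is a minimal sketch of how distinct and non-distinct occurrences of a word might be tallied. It is written in Python purely for illustration and is not the routine behind the figures in this talk; the two verse lines are invented rather than quoted from Poems and Fancies, and the markup is reduced to bare <mcr> elements with no namespace or rendition attributes.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Invented lines, not Cavendish's; only the <mcr> wrapping matters here.
SAMPLE = """<lg>
  <l>I <mcr>Love</mcr> and <mcr>Hate</mcr> alike, yet love to play;</l>
  <l>For <mcr>Love</mcr> will have the reason why, they say.</l>
</lg>"""

def words(text):
    """Lowercased alphabetic tokens from a possibly missing text node."""
    return re.findall(r"[a-z]+", (text or "").lower())

def walk(elem, inside, distinct, plain):
    """Sort every word into 'distinct' (inside <mcr>) or 'plain' (outside)."""
    inside = inside or elem.tag == "mcr"
    bucket = distinct if inside else plain
    bucket.update(words(elem.text))
    for child in elem:
        walk(child, inside, distinct, plain)
        # A child's tail follows its end tag, so it shares this element's status.
        bucket.update(words(child.tail))

distinct, plain = Counter(), Counter()
walk(ET.fromstring(SAMPLE), False, distinct, plain)

for word in ("love", "hate", "reason"):
    print(word, "distinct:", distinct[word], "non-distinct:", plain[word])
```

Lowercasing folds “Love” and “love” together, which suits a distinct versus non-distinct comparison but would hide the capitalization patterns discussed next, so a fuller version would probably keep the original casing alongside the normalized form.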

Cases where words are usually distinct, with some exceptions, are also interesting. With Death, the non-distinct cases are all but one lowercase and all but one (a different one) clustered at the end of Poems and Fancies. You see the same sort of thing with “gods”; all but one of the non-distinct instances are lowercased and they’re fairly tightly clustered. I’ve only just started working with this material and I’m still figuring out how to make sense of it all, but I do think there’s something interesting here and, as I said, I’d be grateful for your thoughts.

Finally, I’ve also found that some words, like delight, tend to be non-distinct, so I’ve been looking at the cases where they are distinct to see whether that might have a particular significance. I’ve given you an example of one such usage here, partly because I think it highlights a pattern I want to investigate next—is there a correlation between verses that have very high instances of italicized terms and distinction in words that otherwise tend not to be distinct? This is a fairly large question, but it is one that the encoding makes it possible to answer. So, in the example I’ve included here, not only is delight italicized, but also horses, carts, cows, butter, and milk, among quite a few others. I’ve chosen to end with this verse not just because it does show that high rate of italicization but also because it is an example of the real pleasure I’ve had in making new discoveries in our collections through the research I’ve been doing, since it contains what is very possibly my favorite example of any term inside of <mcr>: I’m speaking, of course, about the unforgettably-named “friendship cheese.” Thank you!

 

“Day of DH” Snapshots of Our Daily Lives


The Women Writers Project is proud to host our local Digital Scholarship Group “Day of DH” post this year. “Day of DH” provides an opportunity for members of the DH community to share “day in the life” vignettes with each other. For more information about “Day of DH,” please view the official site; you can also follow the Twitter hashtag #DayofDH. I hope these snapshots offer a fun array of some of the people, activities, and work that make up the DH community at Northeastern.

Julia Flanders, Director of the Digital Scholarship Group and the Women Writers Project

This year for “Day of DH” I had an unusually substantive day–in the past I’ve sometimes found myself trying to create an inspiring narrative about the relevance of administrative work, but today I did some genuinely digital-humanities things. My first activity was a meeting of the research group for a seedling grant that is focused on using the Women Writers Project corpus with Word2Vec. In the coming year we’ll be expanding some tools Ashley Clark developed that produce a modified version of the WWP’s TEI/XML markup from which we can then extract plain-text data to feed into the word vector analysis. The modifications handle things like hyphenated words broken across a line break (representing these as a single word for analysis purposes), or selecting the regularized-spelling option for words which the WWP has marked for regularization. The resulting output produces more meaningful results in the word vector analysis (since it doesn’t include word fragments and typographical variants). We sat down together as a group and installed the current version of Ashley’s XSLT and XQuery routines, so that as the grant work gets going we can all experiment together.

After that, the Digital Scholarship Group had its weekly staff meeting at which we discussed the recently announced NHPRC/Mellon “Digital Edition Publishing Cooperatives” funding program, and the potential it might hold for DSG. Then in the afternoon, Syd Bauman and I taught the second session of a short and intensive workshop on schema-writing with RELAX NG, for graduate students in Northeastern’s Digital Humanities Certificate program.

A good and enjoyable day with wonderful colleagues–I feel really lucky for these moments of routine productivity, amid more uncertain and threatening circumstances.

Sarah Connell, Assistant Director of the Women Writers Project and the NULab for Texts, Maps, and Networks

You can get a reasonable picture of my day by looking at “before” and “after” versions of my to-do list, combined with my calendar. Today was a fairly standard Thursday in that it was mostly meetings, with other work happening in the gaps between. On my train ride in and for the first half-hour of the day, I was able to prepare for a training session I have tomorrow and send out a scheduling notice for an upcoming meeting that the NULab faculty will be having to plan for our programming next year, which will focus on the theme of fake news and disinformation. I also checked one of our WWO texts to see if my suspicions that a semicolon really needed to be a period were correct (they were). I replied to a few emails as well (there are always emails) and I got some incremental work done in reviewing the newest set of Women Writers in Context exhibits for publication.

Then, Ashley Clark and I met with the team who will be working on a new WWP project, funded by one of Northeastern’s TIER 1 grants, to set up a prototype vector space analysis web platform for Women Writers Online. This was a fun meeting because we were getting the whole team up and running with the XSLT and XQuery transformations necessary to take encoded texts and prepare them for analysis using Ben Schmidt’s word2vec package in R. It was a good chance for me to practice walking people through these processes and, as always, there were some new wrinkles that came up, which Ashley and I will now be able to anticipate the next time we teach this. That meeting ran late, so I ended up going right into the Digital Scholarship Group team meeting (which actually just meant moving to a different seat on the couch in our media lounge).

After the DSG meeting I grabbed a bit of lunch and sent a few more emails, including a scheduling message for a meeting on using the CERES Toolkit in a class on Literature and Digital Diversity that Elizabeth Dillon and I will be teaching in the fall. I was also able to take care of a few WWP admin tasks before the next meeting—in this case, actually a workshop on RELAX NG and schema planning, the second of two sessions led by Julia Flanders and Syd Bauman. After that workshop, Julia and I had our weekly meeting, which enabled me to check off a few items on my to-do list, particularly around our planning for the DH Certificate and for the work that the WWP and other DSG & NULab projects will be doing over the summer. As often happens, I added a few new items to my to-do list as well.

Finally, it was time for a Barrs Lecture, “Senecan Inwardness and the Staging of Race in Titus Andronicus and Othello” by Curtis Perry, followed by dinner with the speaker and then a train ride home (during which I’ll probably write more emails). I’m sending this for posting prior to the lecture and I’m really looking forward to it.

And now it’s time to check off one last item on my to-do list: “Write Day of DH post.”

Sarah’s “to-do” list at the beginning of the day.
Sarah’s “to-do” list at the end of the day. At the WWP we are all amazed at everything Sarah manages every single day.

Ashley Clark, XML Applications Programmer

This morning I assisted Sarah Connell in introducing the process we use to generate full-text versions of Women Writers Project TEI. The process consists of an XSL transformation I wrote to regularize things like <choice> elements and soft hyphens—phenomena that the WWP encoders have dutifully transcribed, but the implications of which can be lost when one strips out the markup, retaining only the text content. For example, a typo transcribed as:

will, when the encoding is stripped out, appear like this:

The XSLT creates a normalized version of the WWP TEI, moving non-useful text into an attribute I’ve called ‘read’ (as in, “for this element, read ‘This’”):

which translates into this plain text version:

But! Since the original text content is preserved in `@read`, you can reconstitute it and use XPath to find the matching phrase in its original context:

`//text//p[matches(normalize-space(.),'the Emrppre[sſ]s')]`

(Note that I haven’t yet made explicit the normalization of long-S to regular S. Ideally, the XSLT would use @read for the long-S as well, so you wouldn’t have to resort to regular expressions.)
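
The difference between stripping markup and regularizing it can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WWP's XSLT: the fragment is hypothetical and un-namespaced, it assumes the typo is encoded with TEI's <choice>, <sic>, and <corr>, and it ignores soft hyphens and the `@read` attribute entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical, un-namespaced fragment; not copied from a WWP file.
SAMPLE = "<p>the <choice><sic>Emrppreſs</sic><corr>Empreſs</corr></choice> of the World</p>"

def naive_text(elem):
    """Strip the markup and keep every text node, as a plain tag-stripper would."""
    return "".join(elem.itertext())

def regularized_text(elem):
    """Inside each <choice>, keep the <corr> reading and drop the <sic> reading."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            corr = child.find("corr")
            # Fall back to the raw contents if there is no <corr> to prefer.
            parts.append(regularized_text(corr) if corr is not None else naive_text(child))
        else:
            parts.append(regularized_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

p = ET.fromstring(SAMPLE)
print(naive_text(p))        # both readings run together
print(regularized_text(p))  # only the corrected reading survives
```

The naive version glues the error and its correction into a single non-word, which is the kind of noise a word-vector model would otherwise have to absorb.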

Lara Roberts, PhD Candidate in English

Lara’s Day of Digital *Human*ities

0930-1100 I was part of a group that transformed the WWP corpus with XSLT and XQuery to use later with the word2vec R package.
1130-1300 I went to our weekly meeting for the Early Caribbean Digital Archive. We were so excited working on prepping the website for launch that I forgot to take a picture. Instead, here’s a slide with pictures of the team members (past and present)!
1300-1600 I joined my cohort in our teeny office to have weekly work time trying to understand data analysis through RStudio.
1600 Usually, at some point, we have to go get snacks to keep our brains fueled, before…
1630-1900 I ended the day in the always challenging and entertaining Humanities Data Analysis class.

Joanne DeCaro Afornalli, Outreach Coordinator for the Women Writers Project

After a brisk morning walk with my exceedingly energetic little puppy Brooke, I settled in to some tea and emails. I was very excited to see a congratulatory email from David Lazer, Co-Director of NULab, on a recent presentation I gave for the NULab faculty on my Digital Humanities Certificate project. Afterwards, I spent some time looking over a new contribution for our Intertextual Networks series. I’m really looking forward to sharing Cassie Childs’ upcoming post on Delarivier Manley’s Letters Written by Mrs Manley and food history. It includes some fantastic analysis of archival images from eighteenth-century recipe books and botanical guides, and the post’s images immediately struck me with their beauty and nostalgia.

My big event of the day was attending Northeastern’s Academic Honors Convocation to receive the Outstanding Graduate Student Award for Experiential Learning. The award recognizes a graduate student who has “shown an extraordinary capacity to integrate academics and professional work, and establish themselves as an emerging leader in their field.” I was highly honored to receive it, and very glad I could share the experience with my advisor Elizabeth Maddock Dillon, my Co-op coordinator Lisa Cantwell Doherty, and Marina Leslie (who so kindly nominated me for the award).

Now that I’m home for the night, I plan on making the final minor formatting touches on my master’s thesis, and then submitting it to ProQuest! My thesis, “Angelenos Incarcerated: The LA County Jail Oral History Project” is a DH project that features the oral histories of ex-inmates told through videography, mapping, exhibits, and encoded texts (with a customized TEI schema). You can view the project’s website here.

Overall, it was a pretty big day. Not necessarily the heaviest DH day for me. But, I was so honored to have the multimedia and digital humanities work I do recognized in a big way today. And I was beyond grateful to have such an amazing group of women cheering me on.

Liz Polcha, PhD Candidate in English

Cara Messina, PhD Candidate in English

This morning I woke up feeling the familiar finals anxiety. Even so, I pushed myself to attend the RELAX NG workshop run by Julia Flanders. Thanks to learning the different approaches to schema building (and Julia’s excellent scaffolding and metaphors), I have begun creating a flexible XML schema that I plan to use as a pedagogical tool next semester. Learning new DH tools is the perfect form of productive procrastination!

After the workshop, I attended Ryan Cordell’s Humanities Data Analysis final class. Throughout the semester, we’ve used R to analyze our corpora; my corpus contains the metadata and actual texts of 3,000 Korra x Asami (Korrasami) fanfictions from Archive of Our Own.  We went over topic modeling and classification again; Ryan encouraged us to embrace topic modeling’s lack of stability. Although most of the class revolved around discussing challenges and asking/answering questions about our struggles with R, we had a few laughs reading Day of DH Tweets and reflecting on the semester.

Bill Quinn, PhD Candidate in English

Today for DH, I worked on writing my prospectus. I wrote about how computational text analysis will help me explore intertextuality in modernist magazines. It feels really weird writing about what computers do between inputting the data and rendering the visualizations, and I am trying to figure out how some people do it so well. Fortunately, Stanley the dog was there to help out.

Manicules, double daggers, and silcrows! Oh my!


The power of the corpus-wide query can often unearth a few surprise gems. While the team was researching the way notes are formatted in WWO, we became curious about which characters appear before notes in our texts. A quick XQuery script later, we had uncovered a few fun and interesting findings in the list of characters that are prefixed to the <note> elements in WWO.1 You can see the whole list at the bottom of this post.
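
A tally of this kind can be sketched in Python. This is not the XQuery the team ran: it assumes, purely for illustration, that the prefixed character is recorded on each <note> in a `rend` attribute using a `pre(...)` keyword, and the sample fragment is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; the rend="pre(...)" syntax is an assumption.
SAMPLE = """<text>
  <note rend="pre(*)">First note.</note>
  <note rend="pre(†)">Second note.</note>
  <note rend="pre(*)">Third note.</note>
  <note rend="pre((1))">Fourth note.</note>
  <note>No prefix recorded.</note>
</text>"""

def pre_value(rend):
    """Return the argument of pre(...) in a rendition string, or None."""
    start = rend.find("pre(")
    if start == -1:
        return None
    i = start + len("pre(")
    depth, chars = 1, []
    # Track nesting so that a prefix such as "(1)" is returned whole.
    while i < len(rend):
        ch = rend[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced parentheses: give up rather than guess

def tally_prefixes(root):
    """Count how often each prefix value occurs on <note> elements."""
    counts = Counter()
    for note in root.iter("note"):
        value = pre_value(note.get("rend", ""))
        if value is not None:
            counts[value] += 1
    return counts

counts = tally_prefixes(ET.fromstring(SAMPLE))
for symbol, n in counts.most_common():
    print(n, symbol)
```

The balanced-parenthesis scan is a design choice for prefixes that are themselves parenthesized; a prefix with an unmatched parenthesis would need whatever escaping convention the real encoding uses.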

It’s not really surprising that the asterisk (*) tops the list, with 2,431 instances across our published texts. Daggers (†) and double daggers (‡) are also fairly common, with hundreds of instances each. Looking further down the list, however, reveals some characters that might be less familiar to those who haven’t worked extensively with early texts. For example, the manicule appears six times. The term “manicule” comes from the Latin maniculum or “little hand.” The first known use of the manicule dates to the 1086 Domesday Book, a meticulous recording of landownership in England produced for William I; however, popular usage of the symbol really picked up steam in the Renaissance period.

A right-pointing manicule from the Specimen Book of the Cincinnati Type Foundry, 1882. Public Domain. Wikimedia Commons.
Note the manicule inserted before the first reward headline in this 1865 broadside concerning the capture of Lincoln assassination conspirators. This document was published in Eyes of the Nation: A Visual History of the United States by Vincent Virga. Public Domain. Wikimedia Commons.

Three authors in the WWO corpus use the manicule as a notational symbol: Charlotte Turner Smith, Mary Deverell, and Anne Francis. Smith and Deverell each use the symbol once, while Francis uses it four times in her 1790 collection, Miscellaneous Poems. While exploring the manicule’s use across the corpus, the team also unearthed an author, Katharine Chidley, who uses the manicule alone as a note; that is, the manicule appears by itself in the margins to literally point to places of emphasis. Chidley uses the manicule this way eleven times (!) in A New-Years Gift, or a Brief Exhortation to Mr. Thomas Edwards.

An example of the manicule encoded as the contents of a note in Chidley’s A New-Years Gift, or a Brief Exhortation to Mr. Thomas Edwards.

Returning to the list, the fourth most common item, “#rule,” deserves some glossing. The WWP uses “#rule” with the pre/post keywords on our rendition ladders to indicate cases where horizontal ruled lines appear in our texts; for the WWP’s purposes, ruled lines might include series of dashes or straight lines with minor detailing and they might be used either to indicate divisions in a text or for decorative purposes. For more detail on the Women Writers Project’s encoding of ruled lines and ornaments, see here.

Finally, the silcrow (§), or section sign, or “double S” (it goes by many names) appears as the fifth most common notational symbol in the corpus. The silcrow points to sections—much like the paragraph sign (¶) is used to point to paragraphs. In the WWO corpus, the silcrow is regularly used, like the dagger and double dagger, to mark the anchor points of notes. The silcrow’s modern usage has evolved to primarily encompass citation in legal texts. In fact, in some European countries its symbolism has become intertwined with law to the point where it serves as the sign for the justice system, much like the use of the scales.

The logo of the Bundesministerium für Justiz, or the Austrian Federal Ministry of Justice. Public Domain. Wikimedia Commons.

The full list of our results for characters prefixed to notes follows. We’ll continue sharing any potentially interesting results from our cross-corpus queries on this blog, so watch this space for more!

Symbols prefixed to notes in WWO

2431 *
821 †
170 ‡
82 #rule
46 §
23 a
20 b
19 c
14 d
14 ∥
12 e
11 1
10 ]
9 ‖
8 |
7 (1)
7 (2)
7 (3)
7 (4)
7 C
7 x
7 ¶
6 (6)
6 (7)
6 (a)
6 A
6 B
6 D
6 E
6 f
6 F
6 ☞
5 [
5 (10)
5 (5)
5 (8)
5 (9)
4 (a)
4 **
4 (11)
4 (b)
4 g
4 G
3 (b
3 +
3 (12)
3 (13)
3 h
3 H
3 I
3 K
2 (1)
2 (2)
2 (3)
2 (4)
2 (c)
2 (d)
2 (e)
2 (14)
2 (15)
2 k
2 l
2 m
2 n
2 q
2 X
1 (f)
1 (g)
1 (h)
1 (i)
1 (k)
1 (l)
1 (*)
1 16
1 2
1 2.
1 (16)
1 (17)
1 (18)
1 (19)
1 (c)
1 i
1 j
1 L
1 M
1 N
1 O
1 o
1 P
1 p.
1 r
1 ſ
1 ”
1 ⫲

 

A (semi-)Serious Proposal to the Linguists


God, Vertue, Ladies, and Souls

A few days ago, I came across this really interesting Language Log post, which talks about capitalization in one of our Women Writers Online texts—Mary Astell’s A Serious Proposal to the Ladies (1694). In the post, Mark Liberman asks the question: “Why did authors from Astell’s time distribute initial capital letters in the apparently erratic way that they did?” Liberman looks at sentences like this one, which describes the purpose of Astell’s proposal:

It’s aim is to fix that Beauty, to make it laſting and permanent, which Nature with all the helps of Art, cannot ſecure: And to place it out of the reach of Sickneſs and Old Age, by transferring it from a corruptible Body to an immortal Mind.

Since this is a WWO text, I decided to try a bit of experimentation and see what I might be able to uncover using not just the text itself, but also the markup. For just a bit of background, the texts in WWO are encoded according to the guidelines of the Text Encoding Initiative. You do need a subscription to access the collection, but we are always happy to offer free trials, so if you don’t have institutional access or an individual subscription and are interested in reading the texts in WWO, you can find instructions for how to set up a month-long trial here. If you’re curious about the details of our markup, those are covered in our internal documentation.

The first thing I did was enlist some help from Syd Bauman and Ashley Clark, our XML developers. Syd generated a list of all the capitalized words in Astell’s Proposal, along with their immediate ancestry (i.e., the local elements around each word). We found 2,491 capitalized words in total. Reviewing the elements in this list, I could see that it was likely many words were capitalized for reasons reflected in their markup. For example, there were proper nouns (tagged with <name>, <persName>, and <placeName>), titles of other texts (tagged with <title>), and the document’s own headings (tagged with <head>). There were also some words that were simply appearing at the starts of sentences.

So, I asked Ashley and Syd to help me come up with a new list of the capitalized words in Proposal, excluding those in proper nouns, titles, headings, and at the start of sentences. That list is here (original spellings preserved). The top results are: “God” with 31 instances; “Vertue” with 31; “Ladies” with 24; and “Souls” with 21 (in case you’re wondering, the WWP does not encode “God” with <persName>; see here for more details). The rest of the top fifteen—Women, World, Good, Nature, Piety, Religious, Religion, Soul, Beauty, Education, Glory—are all the sorts of word I’d expect to see capitalized in a seventeenth-century text.
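
The filtering described above can be sketched in Python. This is a rough illustration, not the query Syd and Ashley wrote: the fragment is hypothetical and un-namespaced, and "start of sentence" is approximated crudely as "first word after a full stop, question mark, or exclamation mark", so it would still count a word like the "And" that follows a colon in Astell's sentence.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Elements whose contents are capitalized for reasons the markup already explains.
SKIP = {"name", "persName", "placeName", "title", "head"}

# Hypothetical fragment, not Astell's text.
SAMPLE = """<div><head>A Serious Proposal</head>
<p>Ladies may seek Vertue and Piety. The World admires Beauty, as <persName>Astell</persName> says of Vertue.</p></div>"""

def visible_text(elem):
    """Concatenate text, omitting the contents of any element listed in SKIP."""
    if elem.tag in SKIP:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def capitalized_words(text):
    """Count capitalized words that do not open a sentence (rough heuristic)."""
    counts = Counter()
    sentence_start = True
    for token in re.findall(r"[A-Za-zſ]+|[.!?]", text):
        if token in {".", "!", "?"}:
            sentence_start = True
            continue
        if token[0].isupper() and not sentence_start:
            counts[token] += 1
        sentence_start = False
    return counts

counts = capitalized_words(visible_text(ET.fromstring(SAMPLE)))
for word, n in counts.most_common():
    print(n, word)
```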

Beauty and Death

Having looked at the capitalized words in an individual file, I thought it would be worth investigating all of the occurrences of those words across our corpus. So, since “Beauty” was a commonly capitalized word for Astell (in addition to being relatively short and without too many potential spelling variations), I started with that.

I first wanted to determine if I should be concerned with weeding out the capitalized cases of “Beauty” in sentence-initial positions. A bit of exploration showed me that there weren’t many such cases, and most of these came from texts that also had instances of “beauty” capitalized in the middle of sentences. I found only a handful of clear cases where “beauty” was being capitalized just because it was at the start of the sentence, so I decided not to worry about sentence position. I did find several texts that capitalized “beauty” only some of the time—in a few cases, this seemed to indicate a distinction between personified beauty and a more general usage (e.g., contrast “Soft Beauty’s timid smile serene” with “youth and the bloom of beauty,” both from the 1824 Poetical Works of the Late Mrs. Mary Robinson); in other cases the pattern was less clear. These instances, presumably, would be one place I might start if I were investigating this phenomenon in earnest.

So, armed with the power of XPath, I set out to investigate the beauties of WWO. Here’s what I found. There are:
1577 total instances of Capital-B “Beauty” and
1863 cases of lowercase-b “beauty”
Looking across the whole corpus, that’s about 46% capitalized instances.
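
For anyone checking the arithmetic, the percentage is simply the capitalized count over the combined count. A small Python helper (the function name is only illustrative) reproduces the 46% figure from the two counts above:

```python
def percent_capitalized(capitalized, lowercase):
    """Share of capitalized instances among all instances, as a whole percent."""
    total = capitalized + lowercase
    return round(100 * capitalized / total)

# Corpus-wide counts for "Beauty" and "beauty" reported above: 1577 of 3440.
print(percent_capitalized(1577, 1863))
```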

I repeated the search with “beautie” (to catch both “beauties” and the alternate spelling of “beautie”) and while there were fewer hits, the results were similar in terms of percent capitalized:
438 Beautie; 580 beautie (43% capitalized)

For “beautiful” I saw a different distribution:
71 Beautiful; 1619 beautiful (4% capitalized)

Since I suspected that this kind of capitalization would be more common in our earlier set of texts, I decided to narrow down the results. That just meant adding a bit of XPath before my search to look only in texts with publication dates before 1701 (198 out of 388 texts total).

Here’s what I found:
872 Beauty; 415 beauty (68% capitalized)
270 Beautie; 235 beautie (53% capitalized)
36 Beautiful; 212 beautiful (15% capitalized)

For this term at least (and with all appropriate acknowledgement of the highly rudimentary nature of this search), there does seem to be a bit more capitalization in the earlier half of the collection. Next, I wanted to see what else I could do with our markup. In my review of the tags we used for capitalized words in Astell’s Proposal, I had noticed that there were quite a few occurrences of <mcr>; this is a WWP-created element for a “meaningful change in rendition.” We use it where there are changes in rendition (such as between upright and italicized text) that are neither a printer’s error nor a merely decorative shift and that we can’t encode with more specific elements (such as <emph>, <name>, &c.). It’s essentially an element that says: “we think something semantically significant is happening with rendition here, but we’re not able to say exactly what.” Liberman alluded to this sort of thing when he wrote: “[And never mind, for now, Astell’s italicization choices…]”

Thinking that there might be interesting links between capitalization and these meaningful-but-unspecified changes in rendition, I tried my “beauty” search again, but restricted my results to text inside of <mcr>.

Here’s what I found, first looking across the corpus as a whole:
102 Beauty; 16 beauty (86% capitalized)

And then just the pre-1701 texts:
83 Beauty; 5 beauty (94% capitalized)
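
Restricting a search to the contents of <mcr> can be sketched as follows. This is a Python stand-in for the XPath, run on a hypothetical, un-namespaced fragment invented for the example; it assumes <mcr> elements are not nested, since nested ones would be counted twice.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lines, not quoted from any WWO text.
SAMPLE = """<lg>
  <l>Soft <mcr>Beauty</mcr> fades, yet beauty of the mind remains;</l>
  <l>And <mcr>Beauty</mcr> bows where <mcr>Death</mcr> and Beauty meet.</l>
</lg>"""

def word_pattern(word):
    """Case-sensitive, whole-word pattern for the given spelling."""
    return re.compile(r"\b" + re.escape(word) + r"\b")

def count_everywhere(root, word):
    """Count matches anywhere in the text of the fragment."""
    return len(word_pattern(word).findall("".join(root.itertext())))

def count_in_mcr(root, word):
    """Count matches only inside <mcr> elements."""
    pattern = word_pattern(word)
    return sum(len(pattern.findall("".join(m.itertext()))) for m in root.iter("mcr"))

root = ET.fromstring(SAMPLE)
for spelling in ("Beauty", "beauty"):
    print(spelling, count_everywhere(root, spelling), count_in_mcr(root, spelling))
```

Subtracting the second number from the first gives the instances that sit outside <mcr>.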

Admittedly, the corpus is small enough that narrowing down this far means you have fairly few results. (I also tried “beautie” and “beautiful,” but there really weren’t that many once I narrowed to the contents of <mcr>; for what it’s worth, 35 out of 37 instances of “beautie” in <mcr> are capitalized.) Still, there does seem to be something potentially interesting here. Most of the time, the rendition doesn’t change with capitalization (there are, after all, 1475 instances of “Beauty” in the collection that are not in <mcr>), but when the rendition does change, there is a higher percentage of capitalization. I decided to try another keyword and see what came up. I went with “death” this time, using the same criteria that it’s short, fairly common in the corpus, and without many spelling variations (there is “deathe,” which had 5 capitalized and 138 lowercase instances overall, none in <mcr>, all from texts published before 1701). Here’s what I found:

Corpus-wide
2578 Death; 4759 death (35% capitalized)
239 Dead; 2381 dead (9% capitalized)

Pre-1701
1226 Death; 2115 death (37% capitalized)
110 Dead; 1313 dead (8% capitalized)

Contents of <mcr>
Corpus-wide
251 Death; 54 death (82% capitalized)
Pre-1701
218 Death; 34 death (87% capitalized)

These are just two specific keywords, of course; if I were pursuing this seriously, I’d want to refine the search itself and try quite a few more terms as well as other XPath variations: looking at headings and titles, checking for items in lists, perhaps comparing verse and prose, and so on.

“Friendship Cheese”

Finally, I decided to take a look at the contents of <mcr> itself, using an XQuery that Ashley Clark wrote for the WWP (affectionately nicknamed “The Counting Robot” and available here). I normalized punctuation, long s (ſ) characters, and whitespace, but preserved capitalization. I got 21,741 different strings inside of <mcr>; of those, 16,832 were unique. Many of the unique cases are not single words or short phrases, but entire sentences or clauses where the renditional shifts cannot be attributed to emphasis or quotation. The top term on the list was “God,” with 1237 results; rounding out the top-five for the corpus are: Lord, I, Love, and Author.
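
The normalize-then-count step can be sketched in Python. This is not the Counting Robot itself: the input strings are hypothetical variants made up for the example, and "normalizing punctuation" is assumed here to mean trimming a few punctuation marks from the edges of each string, which may differ from what the XQuery does.

```python
from collections import Counter

def normalize(s):
    """Fold long s, collapse whitespace, trim edge punctuation; keep capitalization."""
    s = s.replace("ſ", "s")
    s = " ".join(s.split())
    return s.strip(".,;:")

# Hypothetical <mcr> contents, not drawn from the corpus query.
raw = ["Friendſhip  Cheeſe,", "Friendship Cheese", "God", "God.", "god"]

counts = Counter(normalize(s) for s in raw)
distinct = len(counts)                                    # different strings
singletons = sum(1 for n in counts.values() if n == 1)    # strings seen only once

for string, n in counts.most_common():
    print(n, string)
print(distinct, "different strings;", singletons, "occur only once")
```

Because capitalization is preserved, "God" and "god" stay separate entries, which is what makes the capitalized and lowercase tallies comparable afterwards.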

Of the 127 cases with 30 or more hits, all but ten are capitalized—the exceptions are: “life,” “death,” “lying,” “they,” “she,” “love,” “one,” “her,” “he,” and “royal paper.” (This last item serves as a small caveat regarding the size of our corpus: all 204 instances of “royal paper” appear in a single text, Mary Jones’s 1750 Miscellanies in Prose and Verse.) Nevertheless, I do think that these exploratory results show that there is a great deal of potential for more serious research into these features using the WWO corpus—and if anyone is interested in a project along these lines, I’d be delighted to help set that up. In fact, this is my semi-serious proposal to anyone in the research community (linguists or otherwise) who might want to take this kind of work up.

One of my favorite things about this sort of exploration is that it brings me into contact with our texts in unpredictable ways, usually emphasizing how interesting and genuinely fun our corpus is. This was no exception and I’ll end here with my personal Top Ten results from the contents of <mcr>:

  • Wretched productions! inspired by hunger and dictated by stupidity and a disposition to lying! &c &c
  • As Irish ladies pass in jaunting cars
  • Confounded Harlot!
  • Effemenate Cat
  • For Gad Madam I don’t love being baulk’d thus
  • Friendship Cheese
  • Great Cuttle’s gland
  • Hedges of the Eyebrows
  • His lisping children hail their sire’s return!
  • Julius Cesar when he was beheaded by Oliver Cromwell

 

Loanwords, Macrons, and Orientalism: Encoding an Eighteenth-Century Fictional Translation


By Elizabeth Polcha, WWP Encoder and Ph.D. Candidate in English

Since late last fall, I’ve been encoding a text that poses some interesting markup challenges because of its use of Orientalist language: Scottish author Eliza Hamilton’s 1796 epistolary novel, Translation of the Letters of a Hindoo Rajah. While I was excited to encode Translation because my own research considers eighteenth-century colonial literature, I focus on Caribbean and American literature. So, as an encoder, I approached Translation with an interest in how Hamilton is using distinct language to construct colonial notions of race and gender, but with only a limited familiarity with Orientalist print culture and history.

Before I lay out the details of how I’ve been encoding linguistically distinct language in Translation, it is necessary to explain just how Orientalist (and orientalist, to use Edward Said’s version of the term) this novel is. And no, Translation is not actually a collection of letters that Hamilton translated from Hindi.1 The “translated” letters of Hamilton’s text are fictional, mostly authored by the titular character and protagonist, Zāārmilla, the Rajah of Almora. Hamilton supplements the letters with a “preliminary dissertation,” lengthy footnotes, and a glossary of terms. She strategically includes these textual addendums as a way of demonstrating her expertise in the Orientalist scholarship of her time. Also, as you can see from the macrons included on “Zāārmilla” and on another major character’s name, “Māāndāāra,” Hamilton is a fan of using diacritical marks as a kind of typographic flourish. In writing Translation, Hamilton participated in a scholarly discourse rooted in a Western imperialist fascination with Eastern Asia, citing British colonial scholarship like Nathaniel Halhed’s A Code of Gentoo Laws Or, Ordinations of the Pundits and Orientalist groups like The Asiatic Society.2

Part of our encoding process at the Women Writers Project is to begin with a preliminary document analysis. This means that once we’ve acquired a text to encode, we look through the text carefully to take note of its structure and textual features before opening up an XML file and marking up our text in TEI. During my preliminary document analysis of Translation, aside from noticing the epistolary structure and Hamilton’s unusual diacritical marks I’ve described above, I also noticed quite a few Hindi and Sanskrit terms and phrases that seemed to be roughly transliterated into English (such as “Poojah” or Pūjā, पूजा, a Sanskrit-derived word for Hindu ritual prayer). From my document analysis I knew that it would be important to look up the etymology and meaning of Hamilton’s transliterated terms in order to decide how to most accurately describe them using the TEI. My encoding practice for Translation so far has involved occasionally switching between my XML file, the Oxford English Dictionary (OED), and Google Books in determining the best way to tag specific terms and phrases.

The WWP follows the TEI Guidelines for capturing specialized language with the element <distinct>, which means that we use <distinct> to tag language that is “archaic, technical, dialectal, non-preferred.” In addition to <distinct>, <foreign> and <term> were also particularly important in my encoding of Translation. The WWP also uses the @xml:lang attribute with a value from the IANA language registry to provide standardized identifications for non-English words and phrases.3 This means that my encoding process involves paying attention to the etymology of distinct words and phrases in order to assign each <distinct> or <foreign> element an IANA language code.

For example, in the first letter in Translation, Zāārmilla refers to a character’s “Ayammi Shadee,” which Hamilton defines in a footnote as “the present made to a young woman by her relations during the period of her betrothment” (58). In determining how to encode this term, I first searched for it in the OED—which returned no results. I then searched in Google Books, which brought me to Halhed’s A Code of Gentoo Laws, Hamilton’s original source. Eventually, I determined that “Shadee” must be Hamilton’s (and Halhed’s) version of the Hindi word, shadi, or, marriage.

Example encoding of “Ayammi Shadee.”

This term stood out to me in the text not only because it was capitalized and footnoted, but also because I did not recognize it. If Hamilton had simply used the word “Marriage” there would be no need to tag it with a more descriptive element, but because the WWP is interested in tagging non-English and linguistically distinct language, I needed to figure out the best way to encode the term. I ended up encoding “Ayammi Shadee” using the element <foreign>, which is used to tag non-English words in cases where there is not another more appropriate element, such as <name>, <persName>, or <placeName>. I also used the @xml:lang attribute with a value of “hi” for Hindi.
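
To make the shape of that encoding concrete, here is a small Python sketch that reads a hypothetical fragment and lists each <foreign> or <distinct> element with its language code. The surrounding sentence is invented, the fragment is un-namespaced, and it is not the WWP's actual encoding of the passage; only the element name and the "hi" value come from the description above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment; the wording around the term is invented.
SAMPLE = '<p>the <foreign xml:lang="hi">Ayammi Shadee</foreign> was sent to her</p>'

def tagged_terms(root):
    """List (element name, language code, text) for <foreign> and <distinct>."""
    return [
        (el.tag, el.get(XML_LANG), "".join(el.itertext()))
        for el in root.iter()
        if el.tag in ("foreign", "distinct")
    ]

for tag, lang, text in tagged_terms(ET.fromstring(SAMPLE)):
    print(tag, lang, text)
```

A listing like this is one way to review language codes across a file once the encoding has been proofed.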

As in the example above, one of the challenges of marking up non-English and linguistically distinct terminology in texts like Hamilton’s Translation is that it is sometimes difficult to know when a word is being referenced in the text as a foreign language term, or when the text is using a term that has been adapted into English as a loanword. For example, the English word “pundit” is a loanword from the Sanskrit term “pandit” meaning knowledge owner, or, according to the OED, “a person with knowledge of Sanskrit and Indian philosophy, religion, and law.” So, when Halhed includes “Ordinations of the Pundits” in the title of his text, he is referring to a “pundit” as an intermediary who could clarify Indian law for colonial authorities.

It is also difficult to distinguish when a term can accurately be tagged “foreign” or “distinct” (<distinct> is the element we use for linguistically or dialectally distinct terms that are not distinct enough to constitute a “foreign” language), since what is considered foreign or distinct to me may not have been foreign or distinct to an eighteenth-century reader. The WWP aims to best represent the documents we encode within the context in which they were written and published, which is part of the reason why the OED is so often a valuable resource for encoders: we wouldn’t want to mark an early modern spelling of a particular word as a typographical error using the elements <sic> and <corr>, for example. But it is also important to recognize that each encoder approaches the encoding process with her own understanding of the text. My choices in marking up the term “Ayammi Shadee” are based on my understanding of the WWP’s encoding practices and my analysis of the text, and these choices will be reviewed by other encoders and may change as Translation moves through our proofing process and into final publication on Women Writers Online.

What I love about working for the WWP is the endlessly evolving way we think about markup, and the collaborative nature of the encoding process. From the many discussions I’ve had in encoding meetings with my WWP colleagues about Hamilton’s Translation, we’ve shifted slightly in our thinking about elements like <distinct>. Ultimately, the complicated way Hamilton uses Hindi- and Sanskrit-derived terms has helped me to think more critically about the linguistic complexity of eighteenth-century colonial writing.

Prototype Visualizations for Cultures of Reception

We will soon be publishing an exploratory interface for the more than 600 reviews, advertisements, and other periodical items that we’ve encoded for our Cultures of Reception project—which explores how the authors in Women Writers Online were discussed in periodicals from 1770 to 1830. In preparation for that interface, we’re also working with Steven Braun, the Data Analytics and Visualization Specialist in the Northeastern University Library’s Digital Scholarship Group, to set up some visualizations that will help to highlight patterns across the texts in the collection.

Steven recently sent a few prototype visualizations to us and we wanted to share those here, since we’re really excited about them. Essentially, the reviews in Cultures of Reception are tagged by their evaluations, running from “very positive” to “very negative.” The visualizations Steven designed show variations in individual authors’ reception over time by mapping each possible evaluation to an integer value and plotting those evaluations over the course of each author’s lifetime. Positive evaluations are represented by dark green circles (on the upper y-axis), negative evaluations are represented by dark red circles (on the lower y-axis), and partial gradations are colored accordingly in between. Each circle represents a cluster of reception evaluations at that point in time and the size of each circle is proportional to the number of evaluations.
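
The mapping and clustering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Steven's implementation: the five-point scale is a guess (only the two endpoints, “very positive” and “very negative,” are named here), the integer values are arbitrary, and the sample reviews are made up rather than drawn from the collection.

```python
from collections import Counter

# Assumed five-point scale; only the two endpoints are named in the post.
SCORES = {
    "very positive": 2,
    "positive": 1,
    "mixed": 0,
    "negative": -1,
    "very negative": -2,
}


def cluster_evaluations(reviews):
    """Count reviews per (year, score) pair; each pair would be one circle."""
    return Counter((year, SCORES[label]) for year, label in reviews)


# Made-up sample data, not drawn from the collection.
sample = [
    (1800, "positive"),
    (1800, "positive"),
    (1800, "very negative"),
    (1801, "mixed"),
]

clusters = cluster_evaluations(sample)
for (year, score), count in sorted(clusters.items()):
    # year -> x position, score -> y position, count -> circle size
    print(year, score, count)
```

In a plot built from these counts, the year would give the horizontal position, the score the vertical position and color, and the count the circle's size.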

For example, here’s Maria Edgeworth, who was very widely reviewed over a fairly long period of time; her reviews are usually positive, but there are a few negative responses:

m_edgeworth-evals

And here’s Charlotte Smith, who received more positive responses overall:

c_smith-evals

Mary Darby Robinson, by contrast, has a narrower timeframe, with a particularly notable dip in review positivity around 1800 (in responses to The Natural Daughter):

m_robinson-evals

Finally, here’s the collection as a whole:

evals-all-authors

When we publish the exploratory interface (which should be very soon!), we’ll be including more evaluation visualizations like these, along with others that will show the geographic ranges of periodicals and reviewed texts, the topics covered in the collection, the circulation of reviews and editions, and quite a few more. So, if you’re interested in the reception of eighteenth- and nineteenth-century women’s texts (or in transatlantic periodical cultures, publication practices, literary circulation, &c.), watch for the publication announcement here and on our website. In the meantime, we hope you enjoy these visualizations!