WWP Newsletter Archive, Volume 2, Number 2: Scholarly Needs, Encoding Challenges

NOTE: Due to limitations inherent in HTML we are unable to provide a clear representation of the macron, which is a key feature of this article. Therefore, when you see either ó or é, they should be read as an o macron or an e macron (the letter with a perfectly horizontal line over it).

Scholarly Needs, Encoding Challenges: Correction, Regularization, and Expansion

Carole Mah and Julia Flanders

Note on Text Encoding

This article discusses some of the scholarly issues the WWP faces, in the context of the text encoding that helps us address them. For anyone interested in reading about text encoding in more detail, we recommend A Gentle Introduction to SGML, which can be found on the Web at the following site:


To encode the texts in our textbase, the WWP uses Standard Generalized Markup Language (SGML), which is a system for describing the important structural and analytical aspects of the text in a form that the computer can understand. The basic building blocks of the text, such as headings, chapters, paragraphs, poetic stanzas and lines, dramatic acts and scenes, speeches, quotations, notes, and the like, are all identified as interrelated parts of the document. Their relationships to one another are made explicit in the Document Type Definition (DTD), which specifies how these various elements of the text fit together: for instance, that drama is usually divided into acts and scenes, which typically contain speeches of prose or poetry, or that prose works typically contain systems of headings, sections, and paragraphs. Using this information, one can then create software which "understands" the structure of the document and can use that understanding to help the user navigate within the document, analyze it and compare it to others, and find words and concepts within particular contexts.
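As a concrete illustration of software that "understands" document structure, the following sketch parses a tiny drama-like document and locates every speech within its act and scene. This is illustrative code only, not WWP software; the element names are invented for the example, and XML syntax (a close relative of SGML) is used because Python's standard library can parse it directly.

```python
# Illustrative sketch only (not WWP software): parse a small drama-like
# document and walk its structure. XML stands in for SGML here.
import xml.etree.ElementTree as ET

doc = """<drama>
  <act n="1">
    <scene n="1">
      <speech who="Duke">If music be the food of love, play on.</speech>
    </scene>
    <scene n="2">
      <speech who="Viola">What country, friends, is this?</speech>
    </scene>
  </act>
</drama>"""

root = ET.fromstring(doc)
# Because the markup makes the act/scene/speech hierarchy explicit,
# locating every speech within its dramatic context is trivial.
speeches = [(act.get("n"), scene.get("n"), speech.get("who"))
            for act in root.iter("act")
            for scene in act.iter("scene")
            for speech in scene.iter("speech")]
print(speeches)  # [('1', '1', 'Duke'), ('1', '2', 'Viola')]
```

The same navigation could, of course, be driven by any element declared in the DTD: searching only within headings, only within notes, and so on.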

This kind of structural information is essential, and one of the better-known and more universally used aspects of SGML, but it is by no means the only area where the WWP relies on SGML to record and express important features of the text. Like many other projects which are dealing with rare archival sources, the WWP has specialized encoding needs determined by the nature of the material and the needs of its users. Our texts cover a 500-year span of history, and a wide range of printing practices; as original documentary sources they are immensely valuable in their details, but at the same time they constitute a large and disparate body of material which scholars may need to search and analyze as a whole. To address these multiple demands requires specialized encoding of particular textual features. To enable useful word searches, for instance, typographical errors must be encoded to provide a corrected form, but at the same time the original error may very well be important for other purposes and must also be retained. Similarly, the encoding should expand abbreviations to facilitate searching and to provide a less encumbered reading text, but should also store the original abbreviation for those scholars who want to check the expansion or study the abbreviation. In the encoded version of the text, both readings are preserved and remain accessible; no data is lost, but it can be viewed selectively. As we will see in what follows, though, the preservation of multiple readings can pose some encoding challenges which require careful consideration to solve.

Example: Expansion of abbreviations, correction of apparent errors, and regularization

The Text Encoding Initiative (TEI), which was established to provide a common encoding system for humanities projects, has defined a number of elements specifically for the purposes just described. These include <sic>, which is used to mark errors in the original text; <abbr>, which is used for abbreviations; and <orig>, which is used for spelling and typography that is correct by the standards of its time but does not conform to modern usage. For many encoding projects, the application of such tags is extremely simple; for instance:

<sic corr="humanities">humanoties</sic>
<orig reg="Jesus">Iesvs</orig>
<abbr expan="whom">wh&omacr;</abbr> (where "&omacr;" indicates an o with a macron, a character which cannot be typed directly into the file, but must be represented indirectly using an entity reference).

In both the TEI Guidelines and elsewhere, it is often assumed that all three of these tags will be applied predominantly to the entire word in question, as in the examples above, rather than only to the affected letter or letters:

human<sic corr="i">o</sic>ties.

Either method (applying <sic> to the whole word or just to the letter or letters in question) is perfectly legal, and may simply be a matter of context and taste -- tagging the entire word seems to make more common sense, and produces a file which is easier for a human to read (though of course it is chiefly the computer which needs to read the encoding). However, there may be consequences to the choice which need to be taken into account. For instance, software developers may depend on assumptions about the method of tagging examples of this sort. If, for instance, a piece of software assumes that the value of the corr attribute will be a whole word, then it might try to use that value to generate a list of all corrected words. In cases where only the letter was tagged, the result would be a useless list of one-letter words! Thus it is important to encode consistently and to document one's practice, so as to be sure of obtaining the results one expects.
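The pitfall just described can be made concrete with a small sketch (hypothetical code, not any actual project's software) that harvests corr values in order to build a list of corrected words:

```python
import re

def corrected_words(encoded):
    """Collect every value supplied on a corr attribute of <sic>."""
    return re.findall(r'<sic corr="([^"]*)">', encoded)

# Whole-word tagging yields a usable word list...
whole_word = 'the <sic corr="humanities">humanoties</sic> of computing'
print(corrected_words(whole_word))   # ['humanities']

# ...but letter-only tagging yields a useless list of one-letter "words".
letter_only = 'the human<sic corr="i">o</sic>ties of computing'
print(corrected_words(letter_only))  # ['i']
```

Either encoding practice would work with software written to expect it; mixing the two silently corrupts the output, which is why consistency and documentation matter.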

But in addition to these considerations, the WWP also faces more complex issues resulting from the nature of our texts. Their age and idiosyncrasies of printing mean that they are filled with features which require these tags, sometimes several tags applied to the same word or letter. In these cases tagging the whole word is out of the question; the issue becomes how to tag the letter in a way that produces a useful result. Foxe's Actes and Monuments (which contains a version of Anne Askew's Examinations) contains the following challenge to encoding ingenuity:

[Image: excerpt from original text, showing the macron (17K)]

A full encoding of this example using <abbr> and <orig> (and omitting for clarity the line break and the issue of the omitted line-end hyphen) would be

c<abbr expan="on">&omacr;</abbr><orig reg="v">u</orig>ersation

From this single piece of encoding, one can generate four possible readings, each of which might be of interest to a different group of users:

cóuersation (unexpanded, unregularized)
conuersation (expanded, unregularized)
cóversation (unexpanded, regularized)
conversation (expanded, regularized)
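One way to see how the four readings fall out of the single encoded form is a small resolver that, for each tag, chooses either the element's content or its attribute value. This is a sketch only (not the WWP's delivery software), written under the simplifying assumption that the tags are not nested; XML-style syntax stands in for SGML, and a real o-macron character replaces the entity reference on output.

```python
import re

ENCODED = 'c<abbr expan="on">&omacr;</abbr><orig reg="v">u</orig>ersation'

def render(encoded, expand=False, regularize=False):
    # Resolve <abbr>: attribute value if expanding, else original content.
    s = re.sub(r'<abbr expan="([^"]*)">([^<]*)</abbr>',
               lambda m: m.group(1) if expand else m.group(2), encoded)
    # Resolve <orig>: attribute value if regularizing, else content.
    s = re.sub(r'<orig reg="([^"]*)">([^<]*)</orig>',
               lambda m: m.group(1) if regularize else m.group(2), s)
    # Replace the entity reference with a real o-macron (U+014D).
    return s.replace("&omacr;", "\u014d")

print(render(ENCODED))                               # cōuersation
print(render(ENCODED, expand=True))                  # conuersation
print(render(ENCODED, regularize=True))              # cōversation
print(render(ENCODED, expand=True, regularize=True)) # conversation
```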

This set of readings represents the full range of possible permutations, and hence a full expression of the information contained in this brief passage. However, not all examples are this straightforward. In another example from Askew, two tags must be applied to a single letter, introducing an additional layer of complexity:

[Image: excerpt from original text, showing the macron (4K)]

If we regard the second macron as a printer's error, the four possible readings of this word are:

tépéted (uncorrected, unexpanded)
tempemted (uncorrected, expanded)
tépted (corrected, unexpanded)
tempted (corrected, expanded)

However, because there are two tags applied to one letter, there are several different ways to encode this word, depending on how the tags are nested within one another. To understand this problem fully there are two things to take into account. First, when software interprets the tagging for purposes of display, it is easiest for it to interpret all instances of a given element in a similar way -- for instance, always displaying the abbreviated version of material encoded with <abbr>, rather than sometimes taking the abbreviated version and sometimes the expanded version. Second, when two elements are nested, the processing software will process them from the inside out, so that the result of processing the inner element will then be modified by the processing of the outer element. Based on these two principles, the following example shows one way of encoding the sample text which will yield all of the desired readings:

t<abbr expan="em">é</abbr>p<sic corr=""><abbr expan="em">é</abbr></sic>ted

The readings derived from this encoding can be summarized as follows, with the processing that produces them:

tépéted (using the content of <sic> and the content of <abbr>)
tempemted (using the content of <sic> and the attribute of <abbr>)
tépted (using the attribute of <sic> and the content of <abbr>)
tempted (using the attribute of <sic> and the attribute of <abbr>)
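The inside-out processing just described can be simulated with a short sketch (again illustrative only, not the WWP's software). The pattern below matches only tags whose content contains no further markup, so repeatedly applying it resolves the innermost tag first, exactly as the processing principle requires; a real e-macron character, U+0113, stands in for the é used in this article.

```python
import re

E_MACRON = "\u0113"  # e with macron; printed as é in this article
ENCODED = ('t<abbr expan="em">' + E_MACRON + '</abbr>'
           'p<sic corr=""><abbr expan="em">' + E_MACRON + '</abbr></sic>ted')

TAG = re.compile(r'<(abbr|sic) (?:expan|corr)="([^"]*)">([^<]*)</\1>')

def resolve(encoded, expand, correct):
    def pick(m):
        # <abbr> consults the expand setting, <sic> the correct setting.
        use_attribute = expand if m.group(1) == "abbr" else correct
        return m.group(2) if use_attribute else m.group(3)
    # [^<] in the pattern means only markup-free tags match, so nested
    # tags are resolved from the inside out, pass by pass.
    while TAG.search(encoded):
        encoded = TAG.sub(pick, encoded)
    return encoded

print(resolve(ENCODED, expand=False, correct=False))  # tēpēted
print(resolve(ENCODED, expand=True,  correct=False))  # tempemted
print(resolve(ENCODED, expand=False, correct=True))   # tēpted
print(resolve(ENCODED, expand=True,  correct=True))   # tempted
```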

Just to show how crucial the nesting of the tags is, consider the following encoding, which is very similar to the above, but with the nesting of the <abbr> and <sic> elements reversed:

t<abbr expan="em">é</abbr>p<abbr expan="em"><sic corr="">é</sic></abbr>ted

In this case, unlike the one above, the fourth reading "tempted" cannot be produced, since when we take the attribute of <sic> and the attribute of <abbr>, the attribute of the outer tag (which in this case is <abbr>, yielding the reading "em") supersedes that of the inner. The reading "tempemted" is thus produced no matter what reading emerges from the <sic> element; there is no way of accessing the corrected reading as long as the user has chosen to view the document with abbreviations expanded.
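Running the same kind of inside-out resolution over the reversed nesting shows the problem directly: whenever expansion is requested, the outer <abbr>'s attribute replaces whatever the inner <sic> produced, so "tempted" can never emerge. As before, this is sketch code with a real e-macron character in place of é.

```python
import re

E_MACRON = "\u0113"  # e with macron
# The reversed nesting: <sic> now sits inside the second <abbr>.
REVERSED = ('t<abbr expan="em">' + E_MACRON + '</abbr>'
            'p<abbr expan="em"><sic corr="">' + E_MACRON + '</sic></abbr>ted')

TAG = re.compile(r'<(abbr|sic) (?:expan|corr)="([^"]*)">([^<]*)</\1>')

def resolve(encoded, expand, correct):
    def pick(m):
        use_attribute = expand if m.group(1) == "abbr" else correct
        return m.group(2) if use_attribute else m.group(3)
    # Only markup-free tags match, so resolution runs inside out.
    while TAG.search(encoded):
        encoded = TAG.sub(pick, encoded)
    return encoded

readings = {resolve(REVERSED, expand, correct)
            for expand in (False, True) for correct in (False, True)}
print(sorted(readings))
# Only three distinct readings survive; "tempted" is not among them.
assert "tempted" not in readings
```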

Since "tempted" is in fact one of the readings which is most likely to be of interest to users, this encoding is clearly undesirable compared to the previous example. The question to consider next, though, is whether all of the possible readings are so essential that they justify the difficult encoding necessary to generate them. Since complicated nesting poses challenges to processing software, it is worth considering what information the user will actually need, and whether there are some readings which can safely be eliminated.

In cases like this, the WWP must take into account a number of issues, among them the need to facilitate searching, to preserve the original witness, and to provide a reading text. In addition, we need to keep our encoding as clear and systematic as possible. In our current encoding practice, we have concluded that the expanded and uncorrected "tempemted" reading is the least desirable to our potential audiences, since it is likely that any reader who would want the abbreviations expanded would also want the errors corrected; in addition, expanding a brevigraph which is itself a typographical error produces a word which is a product of neither the author nor the printer, but rather an artifact of the encoding. Our encoding of this example therefore eliminates the expansion of the brevigraph, reducing the convolutions of the nested elements, and yielding the following solution:

<abbr expan="tempted">tép<sic corr="">é</sic>ted</abbr>
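A quick check with the same sketch resolver confirms that this simpler encoding yields exactly the three readings we decided to keep, and that "tempemted" can no longer be generated:

```python
import re

E_MACRON = "\u0113"  # e with macron
ENCODED = ('<abbr expan="tempted">t' + E_MACRON + 'p'
           '<sic corr="">' + E_MACRON + '</sic>ted</abbr>')

TAG = re.compile(r'<(abbr|sic) (?:expan|corr)="([^"]*)">([^<]*)</\1>')

def resolve(encoded, expand, correct):
    def pick(m):
        use_attribute = expand if m.group(1) == "abbr" else correct
        return m.group(2) if use_attribute else m.group(3)
    while TAG.search(encoded):
        encoded = TAG.sub(pick, encoded)
    return encoded

print(resolve(ENCODED, expand=False, correct=False))  # tēpēted
print(resolve(ENCODED, expand=False, correct=True))   # tēpted
# With expansion on, the whole-word attribute supplies "tempted"
# regardless of the correction setting:
print(resolve(ENCODED, expand=True,  correct=False))  # tempted
print(resolve(ENCODED, expand=True,  correct=True))   # tempted
```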

It is important to emphasize that the problem of concurrent correction and expansion is an intellectual and scholarly challenge that exists quite apart from SGML. In complex cases like this SGML can introduce some difficulties, but only because it also introduces the opportunity to provide multiple versions of a text from a single transcription -- something that would not otherwise be possible.

For a more thorough discussion of this problem, see "Some Problems of TEI Markup and Early Printed Books" by Carole Mah, Julia Flanders, and John Lavagnino, forthcoming in Revue Informatique et Statistique dans les Sciences Humaines (Université de Liège) 32.1-4 (1996): xx-xx.
