Tagging at the letter or word level

The TEI provides a number of elements whose function is to encode alternative readings:

sic for typographical errors in the source
orig for old-style spelling and typography
abbr for abbreviations and their expansions

All of these elements may be applied either at the word level or at the individual letter level, and there are several issues connected with this usage that are worth noting in detail.

As a general rule, there are advantages to applying these elements to whole words, chiefly having to do with practicality. Most XML publication software assumes that element boundaries match word boundaries, and as a result any encoding which falls within a word may produce strange results such as unwanted white space, inaccurate tokenization (i.e. identification of word boundaries) and display problems. There are ways of getting around these problems, but on the whole word-level encoding is simpler to handle than intra-word encoding.

However, for early modern texts, word-level encoding also poses significant problems, which arise when the three elements listed above are used together. Older texts are quite likely to contain words in which two or even three of these features (typographical error, abbreviation, old spelling) are present, and if all three kinds of encoding are applied at the word level, there are limits on the kinds of results one can achieve. For instance, in a case where both an abbreviation and an old spelling must be encoded, only three of the four possible results (abbreviated/old, abbreviated/modern, expanded/old, expanded/modern) can be expressed:

<abbr expan="conuocation"><orig reg="convocati&omacr;">conuocati&omacr;</orig></abbr>

Some of the possible combinations may turn out to be less useful than others (for instance, a modernized but uncorrected reading of beautle, where the l is a misprint for i, is meaningless because the modernization eliminates the error) but in principle, tagging at the word level increases the likelihood that conflicts will arise.

For this reason, we recommend encoding these features at the letter level, which will avoid most conflicts, since the number of instances where two or more elements apply to a single letter will be quite small.
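As a rough illustration of why letter-level tagging avoids the conflict, the following Python sketch encodes the same word with orig around the u only and abbr around the abbreviation mark only, and then renders all four readings. The w wrapper element, the render function, and the literal ō standing in for &omacr; are assumptions made to keep the example self-contained; they are not part of any recommended toolchain.

```python
import xml.etree.ElementTree as ET

# Letter-level encoding: <orig> wraps only the "u", <abbr> only the
# abbreviation mark. Assumptions: the <w> wrapper is hypothetical, and a
# literal "ō" replaces &omacr; so the fragment parses without a DTD.
WORD = '<w>con<orig reg="v">u</orig>ocati<abbr expan="on">ō</abbr></w>'

# Attribute that holds the alternative reading for each element.
ALTERNATIVE = {"orig": "reg", "abbr": "expan"}


def render(elem, use_attr):
    """Return the text of elem. use_attr is the set of tag names whose
    attribute value should replace their content."""
    attr = ALTERNATIVE.get(elem.tag)
    if attr is not None and elem.tag in use_attr:
        return elem.get(attr, "")
    # Otherwise keep the content, descending into any child elements.
    return (elem.text or "") + "".join(
        render(child, use_attr) + (child.tail or "") for child in elem
    )


root = ET.fromstring(WORD)
readings = {
    "abbreviated/old": render(root, set()),
    "abbreviated/modern": render(root, {"orig"}),
    "expanded/old": render(root, {"abbr"}),
    "expanded/modern": render(root, {"orig", "abbr"}),
}
for label, text in readings.items():
    print(f"{label}: {text}")
```

Because the two elements never enclose one another here, each choice is independent of the other, and all four readings are available from a single encoding.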

In cases where more than one encoding must be applied to a single letter or word, it’s helpful to have in mind a model for how the encoding will be processed: that is, in what order the information contained in each element will be applied. The most conventional (and intuitive) approach is to resolve the encoding from the inside out: that is, starting with the innermost element and working outward through successive layers of encoding. In the example given above, the orig element would be resolved first, either by taking its content (conuocati&omacr;) or the value of its reg attribute (convocati&omacr;). The result would then be taken as the content of the enclosing abbr element. When the abbr is resolved, its output is either its own content (i.e. the result of the orig element) or the value of its expan attribute (conuocation).
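The inside-out model just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resolve function and its use_attr parameter are hypothetical names, and a literal ō replaces &omacr; so that the fragment parses without a DTD.

```python
import xml.etree.ElementTree as ET

# The word-level example from above (literal "ō" instead of &omacr;).
ENCODED = ('<abbr expan="conuocation">'
           '<orig reg="convocatiō">conuocatiō</orig></abbr>')

# Attribute that holds the alternative reading for each element.
ALTERNATIVE = {"orig": "reg", "abbr": "expan"}


def resolve(elem, use_attr):
    """Resolve elem from the inside out (hypothetical helper)."""
    # Step 1: resolve the children first, so that this element's content
    # is the result of the inner encoding.
    content = (elem.text or "") + "".join(
        resolve(child, use_attr) + (child.tail or "") for child in elem
    )
    # Step 2: output either that content or this element's own attribute.
    attr = ALTERNATIVE.get(elem.tag)
    if attr is not None and elem.tag in use_attr:
        return elem.get(attr, content)
    return content


root = ET.fromstring(ENCODED)

# Try every combination of choices and collect the distinct results.
reachable = set()
for use_orig in (False, True):
    for use_abbr in (False, True):
        chosen = {tag for tag, on in (("orig", use_orig), ("abbr", use_abbr)) if on}
        reachable.add(resolve(root, chosen))

print(sorted(reachable))
```

Choosing the outer attribute discards whatever the inner element produced, which is why the four combinations of choices yield only three distinct readings for this encoding.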

It is important to remember that in cases where two elements are applied to the same word or letter, although any one of the four possible readings can be represented, there will always be two readings that are alternatives and cannot both be represented at the same time. In the example above, the result left unrepresented is the expanded and regularized reading convocation. For a project in which this result was important, a better encoding might be

<abbr expan="convocation"><orig reg="convocati&omacr;">conuocati&omacr;</orig></abbr>

in which the regularized value assumes an expanded reading as its basis. But a project that wanted to represent both the expanded/unregularized and the expanded/regularized versions (both conuocation and convocation) could not use the encoding above, since these two options are strictly alternatives. For such a project, the reverse nesting would be necessary:

<orig reg="convocation"><abbr expan="conuocation">conuocati&omacr;</abbr></orig>

Representing the possible outcomes as a matrix may help you conceptualize your best solution:

                 [outer] orig       reg
[inner] abbr     conuocati&omacr;   convocati&omacr;
        expan    conuocation        convocation

If the inner element is represented on the vertical axis of the matrix (with the content first and the attribute value second), and the outer element is represented on the horizontal axis, you will always have to choose between the two values in the right-hand column.
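The choice the matrix imposes can also be written out as a small Python sketch. The Matrix type and the encodable_sets function are hypothetical names introduced only for this illustration, and a literal ō stands in for &omacr;.

```python
from typing import NamedTuple


class Matrix(NamedTuple):
    """The four cells of the matrix, named by position."""
    inner_content: str       # top left: content of the inner element
    inner_attr: str          # bottom left: inner attribute value
    outer_over_content: str  # top right: outer attribute based on the content
    outer_over_attr: str     # bottom right: outer attribute based on the inner attribute


def encodable_sets(m):
    """Return the two sets of readings a single nested encoding can hold.

    The left-hand column is always available; the outer attribute can
    store only one of the two right-hand values.
    """
    left = {m.inner_content, m.inner_attr}
    return [left | {m.outer_over_content}, left | {m.outer_over_attr}]


# The matrix above: abbr is the inner element, orig the outer one.
example = Matrix(
    inner_content="conuocatiō",
    inner_attr="conuocation",
    outer_over_content="convocatiō",
    outer_over_attr="convocation",
)

for option in encodable_sets(example):
    print(sorted(option))
```

Each printed set corresponds to one possible value for the outer element's attribute; a project picks whichever set contains the readings it needs.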